RS-FDRA: A Register-Sensitive Software Pipelining Algorithm for Embedded VLIW Processors


Cagdas Akturan and Margarida F. Jacome, Senior Member, IEEE

Abstract: This paper proposes a novel software-pipelining algorithm, the Register-Sensitive Force-Directed Retiming Algorithm (RS-FDRA), suitable for optimizing compilers targeting embedded very long instruction word (VLIW) processors. The key difference between RS-FDRA and previous approaches is that this algorithm can handle code-size constraints along with latency and resource constraints. This capability enables the exploration of Pareto-optimal points with respect to code size and performance. RS-FDRA can also minimize the increase in register pressure typically incurred by software pipelining. This ability is critical, since the need to insert spill code may result in significant performance degradation. Extensive experimental results are presented demonstrating that the extended set of optimization goals and constraints supported by RS-FDRA enables a thorough compiler-assisted exploration of tradeoffs among performance, code size, and register requirements for time-critical segments of embedded software components.

Index Terms: Code generation, compilers, embedded software, embedded systems, ILP extraction, performance optimization, software pipelining.

I. INTRODUCTION

Software pipelining is an effective performance-enhancing loop transformation aimed at extracting the instruction-level parallelism (ILP) hidden in inner loop bodies. Software pipelining increases throughput by overlapping the execution of loop body iterations. Since the time-critical segments of embedded digital signal processing and multimedia applications are typically loops, software pipelining is particularly effective in improving the performance of such embedded applications. Very long instruction word (VLIW) processors are ideal platforms for executing the resulting high-ILP multimedia/DSP code; examples of such processors are Texas Instruments' TMS320C62x/C67x [1], Philips' TriMedia [2], and StarCore's SC1xx [3]. Although software pipelining can lead to dramatic increases in performance, it may also lead to a significant increase in memory size requirements. Particularly in the context of compilers for embedded VLIW processors, memory size is an important cost factor. The increase in memory size requirements may arise in two forms: an increase in code size (program memory) and an increase in the number of live data objects (registers/data memory). In this paper, we propose a novel software-pipelining algorithm, the Register-Sensitive Force-Directed Retiming Algorithm (RS-FDRA), suitable for optimizing compilers targeting embedded VLIW processors.
The key difference between RS-FDRA and previous approaches reported in the literature (see Section V) is that our algorithm can effectively handle code-size constraints along with latency and resource constraints. We argue that this novel ability to explicitly consider code-size constraints is critical, since it allows embedded system designers to perform compiler-assisted exploration of Pareto-optimal points with respect to code size and performance, both critical figures of merit for embedded software. Another important capability of RS-FDRA is that it can also minimize the increase in the number of simultaneously live data objects, i.e., register pressure, typically incurred by software pipelining. Note that an increase in register pressure tends to occur when ILP (i.e., concurrency of operations) increases. This ability is critical since the need to insert spill code may result in significant performance degradation. In other words, a maximum-throughput schedule may not be feasible, due to the limited number of registers available in a machine's datapath. Given its unique set of combined capabilities, the proposed RS-FDRA algorithm can be effectively used by embedded system designers to explore different code optimization tradeoffs/alternatives, so as to develop customized retiming solutions for desired memory and throughput requirements. RS-FDRA targets inner loop bodies comprised of a single basic block. A significant percentage of the time-critical code segments of signal processing and multimedia applications are indeed single basic block loops. We note, however, that a hierarchical reduction technique, such as the one described by Lam [4], [5], can be incorporated into our algorithm, as discussed in Section III, making it completely general. Finally, note that RS-FDRA poses no restrictions on the functional resources available in the target datapath, i.e., it can handle multicycle and pipelined functional units. As mentioned before, to the best of our knowledge, the set of combined capabilities of RS-FDRA is unique. In order to directly compare the quality of the results produced by RS-FDRA with those produced by state-of-the-art algorithms (which can only handle a subset of RS-FDRA's capabilities), we present four sets of experimental results. The first set of experiments considers the problem of finding a retiming solution with minimum code size under a constraint on latency (initiation interval). The second set of experiments considers the same basic

problem, yet incorporating register pressure minimization. The third and fourth sets of experiments consider the dual problem, that is, finding a retiming solution with minimum initiation interval under a constraint on code size, with and without register pressure minimization. The first set of experiments provides extensive experimental evidence that RS-FDRA compares favorably with a state-of-the-art algorithm that addresses the traditional software-pipelining problem, i.e., latency minimization under resource constraints. The second set of experiments, which adds register minimization, shows, again, that our algorithm compares favorably with one of the best state-of-the-art software pipelining algorithms aimed at register pressure minimization under latency and resource constraints. The third and fourth sets of experiments show that RS-FDRA can explore a variety of code-size retiming solutions, while minimizing latency (and register pressure) under resource constraints. The experiments described above empirically demonstrate two important points. First, they show that the extended set of optimization goals and constraints (functional resources, latency, code size, and register requirements) is supported by RS-FDRA without compromising the quality of the individual solutions generated by the algorithm. In other words, when a comparison is possible, individual solutions generated by our algorithm compare favorably with those produced by some of the best state-of-the-art algorithms. Second, they show that RS-FDRA can be used to explore a much larger space of tradeoff solutions than is possible with such previous algorithms.

The organization of the paper is as follows. Section II gives a brief background discussion on software pipelining and introduces our notation. Section III defines the optimization problems addressed in the paper. Section IV presents our proposed software-pipelining algorithm. Section V reviews previous work. Section VI discusses experimental results and Section VII presents concluding remarks and work in progress.

II. BACKGROUND

A. Retiming Graph

Fig. 1. Example 1: kernel, retiming graph, and schedule.

A retiming graph G = (V, E) is a data-flow graph representation of a basic block loop body, where V denotes the set of operations in the loop body and E denotes the data dependencies between those operations. Fig. 1(a) and (b) shows a simple loop body with three instructions (operations) and the corresponding retiming graph, respectively. This example will be used to illustrate the discussion. Operations may need to consume data objects produced in the current or in previous loop iterations. For example, instruction/operation 0 in Fig. 1(a) consumes the result generated by instruction 2 two iterations ago. (Edge (2,0) can thus be conceptualized as a queue of size 2.) Thus, for each data dependence (i.e., edge e in E), a delay function w(e) is defined as the difference between the loop index at which the data object (associated with that edge) is consumed and the loop index at which it is produced. Such iteration distances are modeled by annotating each edge of the retiming graph with its associated number of iteration delays w(e); see, e.g., edge (2,0) in Fig. 1(b). When a data object is produced and consumed at the same iteration, the iteration distance is zero, i.e., the associated edge represents a zero-delay (or direct) data dependence.
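For concreteness, the minimal sketch below encodes Example 1's retiming graph as a plain mapping from edges to iteration delays. The encoding is ours, not the paper's; zero-delay edges are the direct dependencies that retiming will later try to eliminate.

```python
# Example 1 (Fig. 1): operations 0-2; edge (2,0) carries two iteration delays,
# i.e., operation 0 consumes the value operation 2 produced two iterations earlier.
operations = {0, 1, 2}
delays = {(0, 2): 0, (1, 2): 0, (2, 0): 2}   # w(e) for each data dependence e = (u, v)

def direct_dependencies(delays):
    """Zero-delay edges: producer and consumer belong to the same iteration,
    so they constrain the schedule of a single loop body iteration."""
    return [e for e, w in delays.items() if w == 0]

print(direct_dependencies(delays))   # [(0, 2), (1, 2)]
```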
For simplicity, only edges carrying nonzero delays are annotated with delay information in our examples.

B. Machine Description

We model the datapath of a VLIW processor as follows. The set of functional unit types is denoted by T. A functional unit type function ft associates each operation with a functional unit type: specifically, if operation v executes on a functional unit of type t, then ft(v, t) evaluates to 1, else it evaluates to 0. The number of instances of each resource type t in the datapath is given by the function num(t). The execution time of an operation on a functional unit of type t is denoted by ex(t). For pipelined functional units, the corresponding data introduction interval is denoted by dii(t).

C. Schedule Function and Initiation Interval

The schedule function σ specifies a starting clock cycle (or step) σ(v) for each operation v in V, such that data dependences and resource constraints are not violated. In Fig. 1(c), we show a schedule for the retiming graph in Fig. 1(b), assuming a datapath configuration with two adders and one multiplier, where ex(t) = 1 for all t in T, i.e., all operations take one step to execute. The initiation interval of a loop (denoted by II) is defined as the total number of steps between the start of two consecutive iterations of the loop body. For example, the schedule function shown in Fig. 1(c) corresponds to an initiation interval of three steps.

Fig. 2. Example 1 (continued): after retiming, a schedule with two pipeline stages.

D. Retiming

Retiming is a transformation performed on the original retiming graph aimed at pipelining several loop body iterations within the same execution cycle. Formally, given a retiming function r that assigns an iteration displacement r(v) to each operation v, the set of delays w on the original retiming graph is transformed into a new set of delays w_r, where w_r is given by the following ([6], [7]):

w_r(e) = w(e) + r(v) - r(u) >= 0, for each edge e = (u, v) in E. (1)

Recall that the ultimate goal of software pipelining is to decrease the initiation interval of the loop body, i.e., to increase throughput. This is done by eliminating direct data dependencies between operations, using the retiming transformation defined above. Consider, for example, the loop body given in Fig. 1(a) and its corresponding retiming graph shown in Fig. 1(b) (and repeated in Fig. 2(a) for convenience). As illustrated in Fig. 2(c), the throughput of this loop can be increased by 33%, i.e., its initiation interval can be reduced from three to two execution steps, if, for example, operation 2 [marked in gray in Fig. 2(a)] is retimed (i.e., delayed) by one iteration with respect to the other operations. As can be seen, the zero-delay edges (1,2) and (0,2) were eliminated by the retiming transformation, thus allowing for a more compact schedule. In Fig. 2(c), a scheduling solution for the retimed loop body is shown. Note that, as indicated in (1), the retiming function must be such that the resulting edge delays are greater than or equal to zero, that is, data dependencies must not be violated. Before retiming is performed, all nodes/operations in the retiming graph belong to the same iteration, i.e., pipeline stage. After retiming, the total number of pipeline stages (that is, iterations executing concurrently), denoted by PS, is given by

PS = max over v in V of r(v) + 1. (2)

For example, in Fig. 2(c), since the maximum retiming applied to any operation is 1, the schedule contains operations from two different iterations executing concurrently; hence, the total number of pipeline stages of the software-pipelined loop body is equal to 2.
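The sketch below applies a retiming function to the edge-delay map introduced earlier, checks the nonnegativity condition of (1), and computes the resulting number of pipeline stages as in (2). The encoding and function names are illustrative only.

```python
# Edges of Example 1 as (producer, consumer) -> iteration delay w(e).
edges = {(0, 2): 0, (1, 2): 0, (2, 0): 2}

def apply_retiming(edges, r):
    """Retimed delays w_r(e) = w(e) + r(v) - r(u), which must stay nonnegative (eq. (1))."""
    retimed = {}
    for (u, v), w in edges.items():
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError(f"retiming violates data dependence ({u},{v})")
        retimed[(u, v)] = w_r
    return retimed

def pipeline_stages(r):
    """Eq. (2), with the minimum retiming normalized to zero: iterations executing concurrently."""
    return max(r.values()) - min(r.values()) + 1

r = {0: 0, 1: 0, 2: 1}              # delay operation 2 by one iteration (Fig. 2)
print(apply_retiming(edges, r))     # direct edges (0,2) and (1,2) now carry one delay each
print(pipeline_stages(r))           # -> 2
```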
E. Code Size

As emphasized above, code size is a critical figure of merit for embedded processors. Unfortunately, software pipelining may increase code size significantly and, thus, an optimum-performance retiming solution may be unfeasible due to the limited program memory available in the target embedded VLIW processor. Note that, after retiming, in order to fill and empty the pipeline, a prolog and an epilog need to be added to the code. Specifically, the operations on the prolog and epilog start and conclude the set of iterations that execute simultaneously in the retimed (steady state) version of the loop body. Fig. 3(a) and (c) shows the original and the retimed version of the loop body under discussion. Fig. 3(b) and (d) show the corresponding retiming graphs. As can be seen, after retiming, the total number of operations implementing the loop [Fig. 3(c)] is twice the number of operations on the original loop body. Indeed, after retiming, the prolog and epilog will increase the original number of operations by a factor of PS, because PS iterations need to be explicitly coded, i.e., need to be individually started and concluded, in the prolog and epilog, respectively. Two common approaches are used to define VLIW instructions. The first approach considers fixed-size instructions, that is, very long instruction words with a fixed number of operations (microinstructions). Such an approach, although simpler, may be prohibitively costly for embedded systems, since a significant percentage of such microinstructions on a given code segment may actually be NOPs. The second and more efficient approach considers variable-size VLIW instructions. For example, in the case of the Texas Instruments TMS320C62x/C67x [1], a status bit (p-bit) is included in each microinstruction i (i.e., operation i), indicating whether the next operation in the sequence, i.e., microinstruction i+1, is scheduled to be executed in parallel with microinstruction i, or is to be serialized. In our code-size calculation, we assume the latter approach. Accordingly, the code size of the software-pipelined loop, measured in number of operations (microinstructions), grows with the number of pipeline stages:

CodeSize = PS * |V|. (3)

A final observation is in order. In machines that support predicated execution [8]-[10], the need for a prolog and epilog can be eliminated. Specifically, in such machines the loop body operations (microinstructions) may be predicated (guarded) and the corresponding predicates defined so as to guarantee

Fig. 3. Example 1 (continued): prolog and epilog after retiming.

that only required loop body operations/instructions are committed at each execution cycle of the retimed loop body. Note, however, that many top-of-the-line state-of-the-art VLIW DSPs, e.g., the Texas Instruments TMS320C62x/C67x [1] and Philips TriMedia [2], do not support predicated execution. Indeed, for some classes of embedded applications, the increase in machine complexity incurred by the required microarchitectural support for predication may prove to be non-cost-effective in terms of power/energy. However, even when such machines are used, PS (i.e., the number of pipeline stages) influences the complexity of the corresponding predicate define operations, which may in turn affect the ability to compress code; see, e.g., Sias et al. [11]. Thus, the ability to perform software pipelining under constraints on the number of pipeline stages is clearly desirable when considering compilers for embedded VLIW processors.

F. Register Requirements

Fig. 4. Example 1 (continued): register requirements.

We assume that the lifetime of a data object associated with edge e = (u, v) begins after the producer operation u completes its execution and ends when the consumer operation v starts executing, as in

lifetime(e) = (σ(v) + II * w(e)) - (σ(u) + ex(u)), for e = (u, v), (4)

where ex(u) denotes the execution time of operation u. The lifetime of a data object produced by operation u is thus equal to the maximum-length edge among all edges originating from u, given by

lifetime(u) = max over edges e = (u, v) in E of lifetime(e). (5)

Accordingly, we can compute the number of live copies of a data object d_u (that is, the data object produced by operation u) at time step t, using (6): a copy of d_u produced k iterations earlier is counted as live at step t if the point t + k * II falls within that copy's lifetime, and is not counted otherwise. In Fig. 4(c), we show the layout of data object lifetimes for the original (nonretimed) loop body example. As can be seen, the lifetime of the data object produced at iteration i by operation 2 [associated with edge (2,0)] starts at cycle 0 of the next iteration (i.e., i+1) and ends at cycle 0, two iterations later (i.e., i+2). Consequently, the data object produced by node 2 stays live during two consecutive iterations. As can be seen in Fig. 4(c), two copies of this data object overlap at execution step 0, while only one copy is live at steps 1 and 2 of the schedule. Thus, using (6) to determine the number of data objects produced by node 2 that are live at step 0, one obtains two live copies.
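The sketch below makes the live-copy counting concrete. It assumes single-cycle operations, a three-step schedule that places operations 0 and 1 at step 0 and operation 2 at step 2 (the exact placement is our assumption, not taken from Fig. 1(c)), and lifetimes that run from the producer's completion to the consumer's start. It also accumulates the per-step totals and their maximum, which are formalized in (7) and (8) in the next paragraphs.

```python
def live_copies(t, t_prod, t_end, ii):
    """Copies of one data object live at kernel step t (0 <= t < II): a copy produced k
    iterations earlier is live at t when t + k*II falls inside its lifetime [t_prod, t_end]
    (this is the counting rule behind (6))."""
    count, k = 0, 0
    while t + k * ii <= t_end:
        if t + k * ii >= t_prod:
            count += 1
        k += 1
    return count

def register_requirements(schedule, exec_time, edges, ii):
    """Per-step register demand (the sum in (7)) and its maximum, a lower bound on the
    number of registers needed by the schedule (eq. (8))."""
    per_step = [0] * ii
    for u in {u for (u, v) in edges}:                      # every producer operation
        t_prod = schedule[u] + exec_time[u]                # value available after u completes
        t_end = max(schedule[v] + ii * w                   # held until its last consumer starts
                    for (x, v), w in edges.items() if x == u)
        for t in range(ii):
            per_step[t] += live_copies(t, t_prod, t_end, ii)
    return per_step, max(per_step)

# Example 1, original schedule (II = 3, single-cycle operations; operation placement assumed).
schedule, exec_time = {0: 0, 1: 0, 2: 2}, {0: 1, 1: 1, 2: 1}
edges = {(0, 2): 0, (1, 2): 0, (2, 0): 2}
print(register_requirements(schedule, exec_time, edges, ii=3))
```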

Fig. 5. Example 1 (continued): register requirements after retiming.
Fig. 6. Cyclic graphs: delay redistribution via retiming.

The minimum number of registers required at each execution step of the schedule can be computed by adding up the register requirements posed by each data object (i.e., each result produced by an operation), as shown in

RR(t) = sum over u in V of live(d_u, t). (7)

In the last column of Fig. 4(c), we show the register requirements of the example loop body at each execution step of the schedule. As indicated in (8), the maximum number of live data objects on any given step of a schedule gives a lower bound on the number of registers required by the schedule:

NumRegisters >= max over steps t of RR(t). (8)

In Fig. 5(c), we show the data object layout and register requirements of the retiming solution first given in Fig. 2 (repeated in Fig. 5(a) for convenience). As can be seen, this retiming solution (denoted by solution-1) compacts the schedule by 33%; yet, unfortunately, it also increases the register requirements by 50% (see Fig. 4). As mentioned before, ILP-increasing transformations enable the generation of more compact schedules and thus typically lead to increases in register requirements. Fortunately, different retiming functions may achieve similar performance gains with lesser impact on register pressure.

G. Strongly Connected Components

A graph is strongly connected if, for every pair of nodes u and v, there exist a path from u to v and a path from v to u. The strongly connected components (SCCs) of a graph impose restrictions on the number of retimings that can be applied to each node belonging to these components, i.e., they make the software pipelining problem harder [15]. Consider the strongly connected graph shown in Fig. 6(a). When the nodes of this SCC are scheduled on a datapath with two single-cycle adders and one single-cycle multiplier, the resulting schedule has a latency of three steps [see Fig. 6(b)]. This is so due to the zero-delay path (0,1,2). However, if node 2 is retimed by one iteration, as shown in Fig. 6(c), then the graph can be scheduled in two steps. A schedule for this retiming function is shown in Fig. 6(d). Indeed, two steps is the optimum initiation interval for this strongly connected component, due to cyclic dependence constraints. Specifically, the total delay on the edges of a cycle does not change with retiming, i.e., delays can only be circularly redistributed across the edges; see Fig. 6. Thus, it may not be possible to eliminate all direct data dependencies for a cyclic graph, as is the case for the example in Fig. 6. Now consider the feedforward graph (i.e., noncyclic graph) given in Fig. 7(a). Note that this graph is similar to that in Fig. 6(a), except that we removed edge (2,0). Differently from strongly connected graphs, in a feedforward graph direct data dependencies can always be fully eliminated. Using the retiming solution shown in Fig. 7(b), for example, we can eliminate all direct data dependencies and schedule this feedforward graph in only one step. Unfortunately, the initiation interval, number of pipeline stages, and/or register requirements posed by the individual retiming solutions chosen for the strongly connected components of a retiming graph may increase the corresponding requirements of the solution for the complete graph. Hence, for graphs containing both strongly connected components and feedforward code segments, such as the one shown in Fig. 7(d), a software-pipelining algorithm should carefully handle the strongly connected components, in order to achieve the desired performance.
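The self-contained sketch below illustrates the two facts just discussed: identifying the nontrivial strongly connected components of a retiming graph, and verifying that retiming merely redistributes the delays around a recurrence circuit. Kosaraju's algorithm is used here purely for illustration (the paper itself relies on the algorithm Strong [14] during preprocessing, see Section IV-A); the graph is a made-up instance in the spirit of Fig. 7(d).

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm (any SCC algorithm would do here)."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    order, seen = [], set()
    def dfs_order(u):
        seen.add(u)
        for v in fwd[u]:
            if v not in seen:
                dfs_order(v)
        order.append(u)
    for u in nodes:
        if u not in seen:
            dfs_order(u)
    comp = {}
    def dfs_assign(u, root):
        comp[u] = root
        for v in rev[u]:
            if v not in comp:
                dfs_assign(v, root)
    for u in reversed(order):
        if u not in comp:
            dfs_assign(u, u)
    groups = defaultdict(set)
    for u, root in comp.items():
        groups[root].add(u)
    return list(groups.values())

# A graph in the spirit of Fig. 7(d): a recurrence 0 -> 1 -> 2 -> 0 plus a feedforward node 3.
edges = {(0, 1): 0, (1, 2): 0, (2, 0): 2, (2, 3): 0}
nodes = {0, 1, 2, 3}
print([s for s in strongly_connected_components(nodes, edges) if len(s) > 1])   # [{0, 1, 2}]

# Retiming only redistributes delays around a recurrence circuit: the r(v) - r(u) terms
# telescope to zero around any cycle, so the total delay on the cycle is invariant.
r = {0: 0, 1: 0, 2: 1, 3: 1}
retimed = {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}
cycle = [(0, 1), (1, 2), (2, 0)]
assert sum(edges[e] for e in cycle) == sum(retimed[e] for e in cycle) == 2
```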

Fig. 7. Retiming graphs with and without SCCs.

III. PROBLEM DEFINITION

As discussed above, although software pipelining is a highly effective performance-enhancing loop transformation technique, it has two significant drawbacks: an increase in code size and an increase in register pressure. In the context of embedded VLIW machines, these penalties may be prohibitively costly or even unattainable. In order to properly handle these cost factors, we define two basic optimization problems.

Problem 1: Find a retiming function that minimizes the number of pipeline stages (i.e., code size, the primary cost function) and the register requirements (secondary cost function), subject to constraints on datapath resources (functional units) and on initiation interval (latency).

Problem 2: Find a retiming function that minimizes the initiation interval (i.e., latency, the primary cost function) and the register requirements (secondary cost function), subject to constraints on the number of pipeline stages (code size) and on datapath resources (functional units).

Problem 1 is the traditional software-pipelining problem. The objective is to derive a software-pipelining solution for a given VLIW datapath configuration that meets the latency constraint and leads to a minimum increase in code size, as well as a minimum increase in register pressure (for such code size). Note that, indeed, most software pipelining algorithms reported in the literature do attempt to minimize the number of pipeline stages, i.e., the depth of retiming, when trying to meet a latency/initiation-interval constraint. A subset of those also minimize register pressure. The objective in Problem 2 is to find a minimum-latency and minimum-register-pressure software pipelining solution (for such latency) that meets the maximum allowed increase in code size. The ability to handle this problem is one of the key differentiating characteristics of RS-FDRA with respect to previous work reported in the literature. In what follows, we refer to algorithms that can handle minimization of register pressure (i.e., the secondary cost function in Problems 1 and 2) as register sensitive and to algorithms that cannot handle it as register insensitive (or RI). Accordingly, the simpler (register-insensitive) versions of Problems 1 and 2 (that is, the versions with the secondary cost function eliminated) are denoted by Problem 1-RI and Problem 2-RI, respectively. A preliminary register-insensitive version of our framework (i.e., handling only Problems 1-RI and 2-RI) was presented in [12]. In the following sections, we describe our proposed optimization framework and a core algorithm that can effectively handle the two problems defined above.

A. Comments on the Generality of the Algorithm

Although RS-FDRA targets only inner basic block loop bodies, transformations such as hierarchical reduction [4] or if-conversion [13] can be incorporated in our framework, making it completely general. Hierarchical reduction [4] works by contracting each innermost loop or branching construct into a single node on the graph.
This is performed by aggregating the resource and timing constraints of a (local) schedule of the set of nodes belonging to those constructs. Then, such hierarchically reduced nodes are scheduled along with all other nodes of the graph. Note that, when the innermost segment being contracted is a conditional branching construct, the entire conditional statement is reduced to a single node with the aggregate resource usage of the two branches. Thus, the execution time of the resulting node is equal to the execution time of the branch with maximum execution time, and the resource utilization at each step is given by the maximum over the two branches. When the contracted node is an inner software-pipelined loop, then, in order to avoid overlapping the steady-state portion of the construct with the nodes outside the contracted node, all free resources are marked as occupied for the node representing the loop. Although hierarchical reduction is a very simple technique, it may lead to low resource utilization and latency suboptimality. Yet another technique, if-conversion, can be utilized to convert control dependencies into data dependencies in code segments with conditional branches; this technique requires machines with support for predication. Specifically, after if-conversion, operations are guarded by predicates capturing the condition(s) on their control paths. Accordingly, such predicated operations are committed when their predicate evaluates to true (that is, when their control path is taken), or nullified otherwise. Since after if-conversion the resulting code has no branching constructs, it can be handled by our software pipelining framework. Similar to the hierarchical reduction technique, if-conversion may lead to low resource utilization. However, both techniques enhance the ability to extract/explore additional ILP in an application.

Fig. 8. Execution flow of RS-FDRA.

IV. RS-FDRA

In this section, we start by describing the problem decomposition (phasing) implemented in our proposed software pipelining framework. Then, in the following subsections, we discuss the set of algorithms employed in each such phase and the set of heuristics used to orchestrate the global optimization process. The overall phasing and execution flow of RS-FDRA is shown in Fig. 8. Our algorithm starts with a preprocessing phase, where it computes lower bounds on latency and on the number of pipeline stages for the problem instance at hand. These are used to restrict the search space, as discussed in the sequel. When an instance of Problem 1 is being solved, that is, when RS-FDRA is searching for a software pipelining solution with a given constraint on initiation interval while minimizing the number of pipeline stages (and register pressure), then the current number of pipeline stages (PS) is initialized to the lower bound on the number of pipeline stages found during the preprocessing stage and the space of solutions is constrained by that value. Similarly, when an instance of Problem 2 is being solved, that is, when RS-FDRA is searching for a software pipelining solution with a given maximum number of pipeline stages that minimizes initiation interval (and register pressure), then the current initiation interval (II) is initialized to the lower bound on latency found during the preprocessing stage and only retiming solutions within that space are searched for. Details on the preprocessing phase are given in Section IV-A. Note that, after the preprocessing phase is completed, a first problem instance, defined by the II and PS values just determined, has been constructed. Specifically, when solving Problem 1, the value assigned to II is the actual initiation interval constraint and the value assigned to PS is the best or minimum possible number of pipeline stages, i.e., the lower bound. Similarly, when solving Problem 2, the value assigned to PS is the actual constraint on the maximum number of pipeline stages and the value assigned to II is the lower bound on latency. (Naturally, if a lower bound exceeds the corresponding constraint specified in the original problem, RS-FDRA immediately reports failure.) Given such a problem instance, RS-FDRA enters a complex iterative optimization phase (which is identical for both problem types).
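The skeleton below summarizes this control flow. It is a sketch only: the callables it receives (ranked_scc_solution_sets, retime_and_schedule), the dictionary-style solution objects, the initial value of the register-sensitivity parameter alpha, and the default of three outer attempts (mentioned in Section IV-E) are placeholders and assumptions, not the paper's actual interfaces.

```python
def rs_fdra(problem, ii_bound, ps_bound, constraint,
            ranked_scc_solution_sets, retime_and_schedule, max_attempts=3):
    """Skeleton of the RS-FDRA flow of Fig. 8 (helper names are illustrative).

    problem == 1: minimize pipeline stages (PS) under an initiation-interval (II) constraint.
    problem == 2: minimize II under a PS (code-size) constraint.
    ii_bound / ps_bound are the preprocessing lower bounds; constraint is the user constraint.
    """
    ii, ps = (constraint, ps_bound) if problem == 1 else (ii_bound, constraint)
    if (problem == 1 and ii < ii_bound) or (problem == 2 and ps < ps_bound):
        return None                                    # constraint tighter than the lower bound

    for _ in range(max_attempts):                      # outer loop: relax the quantity being minimized
        # inner loop: ranked combinations of retiming solutions for the SCCs
        for scc_solutions in ranked_scc_solution_sets(ii, ps):
            solution = retime_and_schedule(ii, ps, scc_solutions, alpha=0.0)
            if solution is None:
                continue
            best = solution                            # feasible: fix II, PS, and the SCC solutions,
            for alpha in [i / 10 for i in range(1, 11)]:   # then sweep register sensitivity upward
                candidate = retime_and_schedule(ii, ps, scc_solutions, alpha=alpha)
                if candidate is not None and candidate["registers"] < best["registers"]:
                    best = candidate
            return best
        if problem == 1:
            ps += 1                                    # Problem 1: allow one more pipeline stage
        else:
            ii += 1                                    # Problem 2: allow a larger initiation interval
    return None                                        # give up after max_attempts (user-settable)
```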

The basic steps at each inner iteration loop of the algorithm (see Fig. 8) are as follows. First, a promising retiming solution for each of the strongly connected components of the graph is selected. After that, global retiming and scheduling functions are derived for the complete retiming graph, preserving the original retiming functions for the SCCs (or, more specifically, the delays on their corresponding edges). Details on the core retiming algorithm are given in Section IV-C. Thus, RS-FDRA's inner iteration loop is driven by carefully generated and ranked alternative retiming solutions for each of the SCCs of the input loop body. In case of failure to find a feasible solution for the current problem instance, a new (unique) set of SCC retiming functions is considered (the combination with the next best rank) and the search is resumed within that region of the solution space. Indeed, one of the key strengths of our framework is its aggressive (joint) exploration of both the retiming space for the loop body's strongly connected components and the retiming space for the feedforward part of the graph. Specifically, when the region (in the search space) defined by one such unique combination of SCC retiming functions fails to contain an actual solution to the problem instance, the algorithm considers the next most promising region. A detailed description of the algorithms used to generate alternative retiming functions for each SCC, as well as of the ranking heuristics used to evaluate the potential of each such function in terms of leading to overall solutions with lower latency, number of pipeline stages, and/or register requirements, is given in Section IV-B. RS-FDRA's outer iteration loop (see Fig. 8) handles the case of failure to find a feasible solution for the current value of PS (Problem 1) or II (Problem 2). As shown in Fig. 8, this is effectively the only point where the iterative flow for the two optimization problems diverges. Specifically, when the framework is solving an instance of Problem 1, the value of PS (the quantity being minimized) is incremented, and when it is solving an instance of Problem 2, II is incremented. The search process is then resumed for the new problem instance. When a feasible solution is eventually found, the register sensitivity of the algorithm (captured by the parameter alpha) is increased and a search for retiming solutions with the same II and PS, but with less register pressure, starts. Details on this phase are given in Section IV-D.

A. Preprocessing Phase

The first step of the preprocessing phase is to compute lower bounds on the initiation interval and on the number of pipeline stages, as described in Sections IV-A1 and IV-A2. Then, the strongly connected components of the input loop body are identified and the set of constraints needed for exploring the space of retiming solutions for these components is computed, as described in Section IV-A3.

1) Lower Bound on Initiation Interval: Equation (9) gives the lower bound on the initiation interval (II_lb):

II_lb = max(II_res, II_rec). (9)

In the above formula, II_res is a lower bound on the initiation interval posed by resource constraints and II_rec is a lower bound on the initiation interval posed by recurrence circuits, that is, by cyclic dependence constraints, if any. The lower bound on the initiation interval posed by resource constraints (II_res) is computed using (10).
In simple terms, one divides the total load on any given resource type by the number of instances of that resource type available in the datapath [4]:

II_res = max over t in T of ceil( (sum over v in V of ft(v, t) * dii(t)) / num(t) ). (10)

Note that, for nonpipelined resources, the data introduction interval dii(t) in (10) should be replaced by the execution time ex(t). The lower bound on the initiation interval posed by recurrence circuit constraints (II_rec) is given by (11) and guarantees that the cyclic data dependence constraints of the loop body are not violated. In order to compute II_rec for a recurrence circuit, the total number of cycles required to execute all operations on the recurrence circuit is divided by the total number of delays on the recurrence circuit. Then, the maximum among all recurrence circuits is assigned to II_rec:

II_rec = max over c in SetOfCycles of ceil( TotalExecutionTime(c) / TotalDelay(c) ), (11)

where SetOfCycles denotes the set of recurrence circuits of the retiming graph.

2) Lower Bound on Number of Pipeline Stages: The lower bound on the number of pipeline stages (PS_lb) is computed as follows. The critical path of the original input graph is determined (ignoring nonzero-delay edges) and then divided by the constraint on the initiation interval. Indeed, it can be easily shown that, in order to achieve a given target initiation interval, the subgraph induced by the nodes on the graph's critical path must be pipelined into at least PS_lb stages, where PS_lb is given by

PS_lb = ceil( CriticalPathLength / II ). (12)
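The following is a compact rendering of the three bounds, under the assumption that the set of recurrence circuits is already enumerated and that execution times stand in for data-introduction intervals; the instance at the bottom is synthetic.

```python
import math

def res_mii(ops, fu_type, exec_time, num_fus):
    """Resource-constrained lower bound (eq. (10)): per resource type, total load / instances."""
    bound = 1
    for t in set(fu_type.values()):
        load = sum(exec_time[v] for v in ops if fu_type[v] == t)   # for pipelined FUs, use the
        bound = max(bound, math.ceil(load / num_fus[t]))           # data-introduction interval here
    return bound

def rec_mii(cycles, exec_time):
    """Recurrence-constrained lower bound (eq. (11)): per cycle, total execution time / total delay."""
    return max(math.ceil(sum(exec_time[v] for v, _ in c) / sum(w for _, w in c)) for c in cycles)

def ps_lower_bound(critical_path_length, ii):
    """Eq. (12): the zero-delay critical path must be cut into at least this many stages."""
    return math.ceil(critical_path_length / ii)

# Toy instance: four single-cycle ops, two adders ('a') and one multiplier ('m'),
# one recurrence circuit 0 -> 1 -> 2 -> 0 whose edge (2,0) carries two delays.
ops = [0, 1, 2, 3]
fu_type = {0: 'a', 1: 'a', 2: 'm', 3: 'a'}
exec_time = {v: 1 for v in ops}
num_fus = {'a': 2, 'm': 1}
cycles = [[(0, 0), (1, 0), (2, 2)]]     # (node, delay on its outgoing cycle edge)
ii_lb = max(res_mii(ops, fu_type, exec_time, num_fus), rec_mii(cycles, exec_time))   # eq. (9)
print(ii_lb, ps_lower_bound(critical_path_length=3, ii=ii_lb))
```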

Fig. 9. Example 2: retiming graph with two recurrence circuits, schedule, and register requirements.
Fig. 10. All possible retiming solutions for recurrence circuit Rec-1 in Fig. 9.

3) Preparing Retiming Solutions for the SCCs: As emphasized in Section II-G, finding good retiming solutions for graphs with strongly connected components is a harder problem than retiming feedforward graphs. When retiming graphs with SCCs, a common approach followed by current state-of-the-art algorithms is to first schedule the strongly connected components individually and then treat these components as any other node in the graph (i.e., considering their aggregate resource usage and total delay). Another commonly followed approach is to give higher priority to nodes belonging to these components during software pipelining, in order to address the harder cyclic dependence constraints first (see Section V). These approaches fail to aggressively explore the space of possible retiming solutions for these components. Consequently, the delays on the edges of the strongly connected components may not be used efficiently, leading to suboptimal results. Unlike those approaches, we schedule all nodes of the graph uniformly and iteratively consider a selected set of promising retiming solutions for these strongly connected components. Thus, in the preprocessing step of RS-FDRA, we start by identifying all nontrivial SCCs of the input graph (nontrivial SCCs are characterized by having more than one node), using the algorithm Strong [14]. Next, we prepare for the retiming solution generation process, utilizing the results presented in Denk et al. [15], [16] and Simon et al. [17]. Specifically, using Algorithm IE in Denk et al. [15], we generate (interval or equality) constraints for each edge. Using these constraints, alternative retiming solutions are generated, according to the type of the problem being solved.

B. Solution Selection for the SCCs

As shown in Fig. 8, at each inner loop iteration, our algorithm selects the next promising set of SCC solutions. After a set of SCC solutions is selected for consideration (one for each SCC of the graph), they are inserted in the retiming graph. The selected SCC solutions are preserved during the entire retiming phase by placing constraints on the relative retiming distances between the nodes of those components. Two heuristic ranking functions are used to decide on the next best set of solutions. The first ranking function evaluates the retiming solutions according to their initiation interval and number of pipeline stages (code size) and the second ranks them based on register requirements.

1) Solution Ranking for Minimum Code Size and Initiation Interval: We have empirically found that, in general, considering first the retiming solutions (for the strongly connected components) that have the least number of pipeline stages and the smallest initiation interval improves the efficiency (speed) of our algorithm. That is, we are likely to find a retiming solution satisfying the constraints of the specific problem instance sooner/faster. The simple ranking function given in (13), which combines the number of pipeline stages of the SCC solution with its critical path CP (i.e., its longest zero-delay path), was experimentally found to provide a good order/rank for the SCC solutions.

2) Solution Ranking for Minimum Register Requirements: An SCC can be alternatively defined as a group of recurrence circuits that have one or more nodes in common. Whenever a given node (operation) u on the SCC appears in more than one recurrence circuit, we say that the data object produced by u is shared among those recurrence circuits. Accordingly, a data object shared by more than one recurrence circuit is called a shared data object (SDO); we denote a shared data object produced by operation u by SDO_u. Note that, since a node cannot appear more than once in a recurrence circuit, each edge associated with a shared data object must belong to a different recurrence circuit of the strongly connected component. In other words, the number of edges associated with a shared data object is less than or equal to the number of recurrence circuits that intersect on the source node of the shared data object. For example, in Fig. 9(a) we show a strongly connected graph with one SDO, generated by operation 0 (the edges of SDO_0 are highlighted with bold arrows). As can be seen, two recurrence circuits, Rec-1 and Rec-2, intersect at operation 0. In Fig. 9(b), (c), and (d), we show a schedule for the retiming graph, the layout of data object lifetimes, and the number of live data objects at each step of the schedule, respectively.
In an attempt to reduce register pressure in an SCC's schedule, we select first (favor) retiming solutions that have larger delays on the edges associated with shared data objects. The key idea is to reduce the delays on the remaining edges of the SCC. This heuristic is based on the fact that each recurrence circuit has a fixed total delay. For example, in Fig. 10 we show all possible retiming solutions for recurrence circuit 1 (Rec-1) in Fig. 9(a). As can be seen, the total delay on the recurrence circuit is always equal to 2. Indeed, retiming does not alter the overall register requirements of a single recurrence circuit. Therefore, only by overlapping the delays of the several recurrence circuits of an SCC can one reduce the overall register requirements of a strongly connected graph. For example, in Fig. 11(a), we show an alternative retiming solution for the SCC originally given in Fig. 9(a). In Fig. 11(b), (c), and (d) we show a schedule for the retiming solution in Fig. 11(a), the layout of data object lifetimes, and the number of live data objects at each step of the schedule, respectively. When compared to the solution in Fig. 9(a), this last solution leads to a 25% decrease in register requirements. Note that, consistently with our previous discussion, the solution in Fig. 11(a) redistributes the edge delays on the recurrence circuits so as to maximize the number of delays on the shared edges. Our minimum register requirements ordering heuristic would thus give higher priority (rank) to the solution in Fig. 11(a).
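The sketch below orders candidate SCC retiming solutions by the two heuristics just described. Since the exact form of ranking function (13) is not reproduced here, the code uses a lexicographic stand-in (fewest pipeline stages, then shortest zero-delay critical path) and breaks ties by preferring solutions that concentrate delays on shared-data-object edges; the candidate encoding is our own.

```python
def critical_path(nodes, edges, exec_time):
    """Longest zero-delay path (in cycles) of an SCC retiming solution; the zero-delay
    subgraph of a valid retiming is acyclic, so this recursion terminates."""
    zero = {u: [] for u in nodes}
    for (u, v), w in edges.items():
        if w == 0:
            zero[u].append(v)
    memo = {}
    def longest_from(u):
        if u not in memo:
            memo[u] = exec_time[u] + max((longest_from(v) for v in zero[u]), default=0)
        return memo[u]
    return max(longest_from(u) for u in nodes)

def rank_for_code_size(solution, exec_time):
    """Stand-in for ranking function (13): favor few pipeline stages and a short critical path."""
    nodes, edges, ps = solution
    return (ps, critical_path(nodes, edges, exec_time))

def rank_for_registers(solution, shared_edges):
    """Favor solutions that place more of each recurrence circuit's fixed delay budget
    on edges carrying shared data objects, leaving fewer delays on the remaining edges."""
    _, edges, _ = solution
    return -sum(edges[e] for e in shared_edges)

def order_candidates(candidates, exec_time, shared_edges):
    """candidates: list of (nodes, edge->delay map, pipeline_stages) tuples for one SCC."""
    return sorted(candidates, key=lambda s: (rank_for_code_size(s, exec_time),
                                             rank_for_registers(s, shared_edges)))
```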

Fig. 11. Example 2 (continued): after retiming, a schedule with lesser register requirements.
Fig. 12. Scheduling spaces and reservation table.

C. Retiming Phase

During the retiming phase, a schedule for the current problem instance, i.e., a schedule under resource, code size (number of pipeline stages, PS), and latency (II) constraints, is derived for the complete graph. Moreover, the scheduling algorithm implicitly derives a retiming function as it constructs a scheduling function σ for the operations in the retiming graph. Specifically, similarly to the extended algorithm described in Paulin et al. [18], the time dimension is sliced into the target number of pipeline stages, each with a latency equal to the target initiation interval (using the corresponding values specified for the current problem instance). Assume, for example, a problem instance with constraints PS = 2 and II = 3. As shown in Fig. 12(a), the schedule for this instance can have a maximum of six steps (i.e., PS * II steps). Note that steps 0-2 are located in pipeline stage 1 (PS-1), while steps 3-5 are located in pipeline stage 0 (PS-0); see Fig. 12(b). Accordingly, an operation scheduled in steps 3-5 is being implicitly retimed by one (i.e., is being delayed by one iteration), while an operation scheduled in steps 0-2 is not being retimed (i.e., delayed). A corresponding modulo reservation table is used during scheduling to keep track of resource usage at each step of the schedule. In the example in Fig. 12(c), we assume that there are two instances of one resource type and one instance of another. Thus, for example, operations scheduled in steps 0 and 3 (marked in gray) directly compete for such resources, since 0 mod II = 3 mod II for II = 3. In summary, scheduling an operation v to a time step σ(v) in the scheduling space [illustrated in Fig. 12(a)] also specifies the retiming function for that operation, given by

r(v) = floor( σ(v) / II ). (14)

Fig. 13. Illustrative example of implicit retiming during scheduling.

Consider the more concrete example given in Fig. 13. Fig. 13(a) shows the original retiming graph, to be executed on a target datapath configuration with one adder and one multiplier, both single cycle. Assuming II = 2 and PS = 2, in Fig. 13(b) we show the scheduling space after operations 0 and 1 of the retiming graph have been scheduled. Note that, since operation 0 is using the only adder available in the datapath at step 0 [see the modulo reservation table in Fig. 13(c)], operation 2 can only be scheduled at step 3, i.e., step 1 of pipeline stage PS-0, as shown in Fig. 13(d) and (f). This corresponds to operation 2 being retimed by one (i.e., delayed by one iteration with respect to operations 0 and 1), as shown in Fig. 13(f).
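The sketch below shows the two mechanisms at work in this example: the implicit retiming of (14) and a modulo reservation table. The configuration (II = 3, two pipeline stages, one adder, one multiplier) is assumed for illustration and parallels, but does not exactly reproduce, Figs. 12 and 13.

```python
def implicit_retiming(step, ii):
    """Eq. (14): scheduling an operation at `step` in the PS*II scheduling space
    implicitly retimes it by floor(step / II) iterations."""
    return step // ii

class ModuloReservationTable:
    """Tracks resource usage per kernel slot: steps s and s' compete when s % II == s' % II."""
    def __init__(self, ii, num_fus):
        self.ii = ii
        self.used = {t: [0] * ii for t in num_fus}   # usage per resource type and kernel slot
        self.cap = dict(num_fus)

    def can_schedule(self, fu_type, step):
        return self.used[fu_type][step % self.ii] < self.cap[fu_type]

    def reserve(self, fu_type, step):
        assert self.can_schedule(fu_type, step)
        self.used[fu_type][step % self.ii] += 1

mrt = ModuloReservationTable(ii=3, num_fus={'add': 1, 'mul': 1})
mrt.reserve('add', 0)                      # an addition takes the only adder at step 0
print(mrt.can_schedule('add', 3))          # False: step 3 maps to the same kernel slot (3 % 3 == 0)
print(implicit_retiming(3, ii=3))          # 1: an operation placed at step 3 is delayed one iteration
```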

Fig. 14. Main execution steps of the Modified Force Directed Scheduling algorithm.

As mentioned above, during the global scheduling process, the individual retiming solutions for the SCCs are preserved, that is, the relative iteration differences between the nodes of a strongly connected component (which have been previously set) must be preserved. Thus, differently from conventional retiming, our retiming function depends on the node membership type. Specifically, we define two node membership types: Connected Node, for nodes belonging to strongly connected components, and Free Node, for the remaining nodes of the graph. The retiming function used for Connected Nodes is as follows: when a connected node needs to be retimed (moved to a different iteration), all nodes belonging to the same strongly connected component are retimed by the same amount. The retiming of Free Nodes is performed in the standard way.

RS-FDRA uses the fundamental ideas of Force Directed Scheduling [18] so as to optimize/maximize resource utilization and thus achieve highly compact schedules. In simple terms, the decisions of Force Directed Scheduling are geared toward reducing operation concurrency, i.e., competition for resources, across all scheduling steps. Specifically, when considering scheduling an operation at a given step, if the demand for resources at that step is above the average demand over the operation's range of alternative scheduling steps, then the force associated with that step is larger than the forces on these alternative steps. Thus, by selecting steps with minimum forces, the algorithm balances resource usage across the schedule. Extensive experimental results presented in Section VI show that, indeed, RS-FDRA obtains very compact schedules, frequently superior to those of the reference/comparison algorithms. At the core of this improvement is the ability of our algorithm to take full advantage of the functional resources available in the datapath. Pseudocode outlining the main execution steps of our Modified Force Directed Scheduling algorithm is shown in Fig. 14. In what follows, we describe each of these steps in detail, using the small illustrative example in Fig. 15. Fig. 15(a) gives the original retiming graph considered in the example; as can be seen, it contains one strongly connected component (an SCC with operations 1 and 2). In Fig. 15(b), we show the solution currently under consideration for the SCC, with operation 2 retimed by 1. As mentioned in the introduction to Section IV, throughout the retiming of the complete graph (i.e., during this particular iteration over the inner loop of RS-FDRA; see also Fig. 8), the iteration/retiming distance (between operations 1 and 2) defined by this SCC solution will be preserved. For the sake of illustration, we assume that the target datapath has two single-cycle adders and two single-cycle multipliers.

Fig. 15. ASAP and ALAP for datapath with two multipliers and two adders.
Fig. 16. Example of probability and pipelined distribution functions.

1) Defining Feasible Scheduling Steps: The first step of our Modified Force Directed Scheduling algorithm is to compute the time frame for each operation, that is, the set of alternative/feasible scheduling steps for each operation (see line 1 of Fig. 14). Specifically, for each (unscheduled) node we derive a modified ASAP schedule and a modified ALAP schedule of the retiming graph, so as to determine such ranges. There are three main differences between our modified and the conventional ASAP and ALAP algorithms. The first difference is that the modified ASAP and ALAP algorithms identify unfeasible execution steps for Connected Nodes by taking into consideration the precedence relationships defined in the retiming graph. For example, if there is a zero-delay edge from node u to node v, then node v must be placed at a later execution step than its predecessor u. If node v cannot be scheduled in the same pipeline stage as u, then v is pushed down to the next pipeline stage, in order to remove the direct data dependence between these two nodes. When pushing such a node down to the next pipeline stage, the modified retiming function mentioned before is updated for the connected nodes, if any. The second difference is that, when the input graph is feedforward (i.e., contains no SCCs), we use the TASAP algorithm proposed in Peixoto et al. [19], in order to obtain tighter bounds on the ASAP and ALAP scheduling times of the nodes. The use of TASAP improves the quality of the final solutions by allowing the generation of more precise load distribution (resource demand) profiles, to be used during scheduling. It also reduces the execution time of RS-FDRA by eliminating impossible execution steps from the operations' time frames and, thus, eliminating unnecessary force calculations. The third difference is that, after computing such time frames, our algorithm identifies scheduling steps with zero available resources and removes them from the relevant time frames. Thus, for each unscheduled node v, a set of feasible scheduling steps (denoted by F_v) is constructed. In Fig. 15(c) and (d), we show the corresponding modified ASAP and ALAP schedules, respectively. Note that such steps may not be contiguous, due to lack of resources.

2) Probability Function: Similar to the standard Force Directed Scheduling algorithm [18], after computing F_v, the set of feasible scheduling steps for each unscheduled operation, we then associate each such operation with a probability function (see line 2 of Fig. 14). Specifically, we assume that each operation has a uniform probability of being scheduled to any of its feasible scheduling steps, given by

p_v(t) = 1 / |F_v| if t in F_v, and 0 otherwise. (15)

The probability functions of operations 0-2 (see Fig. 15) are graphically depicted in Fig. 16(a)-(c), respectively. Note that operation 2, for example, has two feasible scheduling steps (2 and 3) and, thus, a probability of 0.5 at each of them.

3) Pipelined Distribution Function: After computing the probability function for each operation, a pipelined distribution function for each resource type is derived, using (16) (see line 3 of Fig. 14). Note that the pipelined distribution function of a resource type gives a profile of the demand for that resource at each execution step.
In simple terms, the pipelined distribution function is the sum of the probabilities of the operations that can execute on a specific resource type at step t, accumulated over the pipeline stages that execute concurrently with that step:

DG_k(t) = sum over pipeline stages s, and over operations v with ft(v, k) = 1, of p_v(t + s * II), (16)

where s ranges over the individual pipeline stages.
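The following is a small illustration of the probability and distribution functions on a synthetic instance (operation names, types, feasible steps, and the II/PS values are made up). The folding of probabilities across pipeline stages is one plausible reading of (16); a helper for the self-force of (17), defined in the next subsection, is included because it is computed directly from this distribution.

```python
def probability(feasible_steps):
    """Eq. (15): uniform probability over an unscheduled operation's feasible steps."""
    p = 1.0 / len(feasible_steps)
    return {t: p for t in feasible_steps}

def pipelined_distribution(op_probs, op_type, resource_type, ii, ps):
    """Demand profile for one resource type in the spirit of eq. (16): each operation's
    probabilities are replicated at the steps of every pipeline stage that overlaps them
    (steps congruent modulo II), since those stages execute concurrently."""
    dist = [0.0] * (ii * ps)
    for op, probs in op_probs.items():
        if op_type[op] != resource_type:
            continue
        for t, p in probs.items():
            for stage in range(ps):
                dist[(t + stage * ii) % (ii * ps)] += p
    return dist

def self_force(dist, feasible_steps, step):
    """Eq. (17): demand at the candidate step minus the average demand over the time frame."""
    avg = sum(dist[t] for t in feasible_steps) / len(feasible_steps)
    return dist[step] - avg

# Small synthetic instance (not the paper's example).
op_probs = {'a': probability([0]), 'b': probability([1]), 'c': probability([2, 3])}
op_type = {'a': 'mul', 'b': 'add', 'c': 'mul'}
dist = pipelined_distribution(op_probs, op_type, 'mul', ii=2, ps=2)
print(dist)                            # [1.5, 0.5, 1.5, 0.5]
print(self_force(dist, [2, 3], 2))     # 0.5: step 2 is more congested than operation c's average
```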

Consider again our illustrative example in Fig. 15. Fig. 16(d) shows the corresponding pipelined distribution function for the resource type multiplier (note that only operations 0 and 2 are of that type).

4) Force Computations: Using the pipelined distribution functions, we then compute the self-forces and the predecessor and successor forces for each operation at each feasible scheduling step [18] (see line 4 of Fig. 14). The total force of an operation is given by the sum of those partial forces, as discussed in the sequel. The self-force of operation v at a scheduling step t, given in (17) [18], measures the change in concurrency resulting from scheduling v at step t. Specifically, the second term in (17) averages the resource demand over the set of feasible scheduling steps for operation v and subtracts this amount from the resource demand at step t:

SelfForce(v, t) = DG_k(t) - (1 / |F_v|) * sum over j in F_v of DG_k(j). (17)

Thus, a self-force with a positive sign signifies an increase in concurrency, while a negative self-force indicates a decrease in concurrency, with respect to the average alluded to above. The predecessor and successor forces for operation v, given in (18) [18], on the other hand, account for the change in the predecessor and successor operations' concurrency resulting from scheduling v at step t. In order to compute the predecessor and successor forces for operation v at step t, we start by temporarily scheduling operation v to step t. If this tentative assignment reduces the time frame (i.e., the set of feasible scheduling steps) of any of its predecessor or successor operations p, then the variation in the self-force of the predecessor/successor operation p is computed by subtracting the average resource demand over the predecessor/successor operation's original time frame (F_p), given by the second term in (18), from the average resource demand over the predecessor/successor operation's reduced time frame (F'_p), given by the first term in (18):

PredSuccForce(p) = (1 / |F'_p|) * sum over j in F'_p of DG_k(j) - (1 / |F_p|) * sum over j in F_p of DG_k(j). (18)

Thus, similarly to self-forces, a positive predecessor/successor force for operation v at time step t indicates that there will be an increase in average resource demand after scheduling the operation at step t. Hence, that step should be avoided, if possible. In the case of our small illustrative example, the total force of operations 0 and 1 is zero, since both operations have a single feasible scheduling step. Consider now operation 2. Since the average demand for resources over its set of feasible scheduling steps is 1 (see Fig. 16(d)), its self-force at steps 2 and 3 is -0.5 (i.e., 0.5 - 1). Moreover, since the range of valid steps of its predecessor operation (specifically, operation 1) is not affected by any of operation 2's feasible scheduling steps, the predecessor force of operation 2 is zero. The relative meaning of these values is discussed in the following section.

5) Scheduling Decision: The next step of our Modified Force Directed Scheduling algorithm is to select the next operation to be scheduled (see line 5 of Fig. 14) and to schedule the selected operation at the most cost-efficient step (see line 6 of Fig. 14). We experimented with the standard criterion of Force Directed Scheduling, which simply selects the (operation, step) pair with minimum force, and found it to be ineffective, i.e., to yield poor results, in general. Thus, we developed our own criterion, which is as follows.
We start by computing a total sum force for each operation over all its feasible scheduling steps, using (19). The idea behind this is to identify the most urgent operation in terms of resource load distributions:

SumForce(v) = sum over t in F_v of Force(v, t). (19)

Given these sum forces, the selection criterion depends on the mode in which the algorithm is executing, that is, with or without sensitivity to register pressure. Specifically, when the algorithm is solving Problem 1-RI or Problem 2-RI (i.e., without sensitivity to register pressure), the operation with maximum sum force is selected and scheduled on the time step where it has minimum force. As mentioned before, this scheduling strategy allows the algorithm to schedule the most urgent operations first (that is, the operations competing most severely for resources), to the less congested time steps in their time frames. In the case of our ongoing example, the total sum force of operations 0 and 1 is zero and the total sum force of operation 2 is -1. Thus, operation 2 (the easiest operation to schedule) would be assigned a priority lower than that of operations 0 and 1. When the algorithm is running in the register-sensitive mode (i.e., when the secondary cost function is also being considered), the process is more elaborate. Specifically, the operation with the maximum sum force is still selected for scheduling, but the decision on the scheduling step is a two-phase process. In the first phase, all time steps that are within a minimum-force threshold are marked as high-priority scheduling steps. In order to do that, the algorithm starts by sorting the scheduling steps in the operation's time frame by increasing force. Then, it selects the set of time steps whose associated forces are within alpha percent of the maximum force in the operation's time frame. In all our experiments we used a single default value of alpha, yet this parameter can be set by the user. (Note that, if the forces happen to be identical over all scheduling steps, the algorithm selects the complete set for consideration.) Returning to our ongoing example, operation 0 has a total force of 0 in step 0, operation 1 has a total force of 0 in step 1, and operation 2 has a total force of -0.5 in steps 2 and 3. Since these operations have the same total force on all their feasible scheduling steps, this is the special case alluded to above (i.e., all feasible steps would be selected for consideration). A more interesting example is given in Fig. 17, where we show the forces for a hypothetical operation n at each execution cycle in its time frame. For the threshold value used in the figure, steps 1 and 2 would be included in the set of high-priority scheduling steps of operation n (as opposed to just step 2, the one with minimum force).
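The selection logic sketched below follows the description above: compute an operation's total sum force (19), form its set of high-priority steps using a threshold parameter alpha, and then pick the step that adds the fewest live registers. The precise threshold formula is one plausible reading of the rule, and all numeric inputs are made up for illustration.

```python
def total_sum_force(forces):
    """Eq. (19): the sum of an operation's forces over all its feasible steps."""
    return sum(forces.values())

def high_priority_steps(forces, alpha):
    """One plausible reading of the threshold rule: keep every step whose force is within
    alpha * (max force) of the minimum force (alpha = 0 keeps only minimum-force steps)."""
    lo, hi = min(forces.values()), max(forces.values())
    return [t for t, f in sorted(forces.items(), key=lambda kv: kv[1])
            if f <= lo + alpha * abs(hi)]

def choose_step(forces, register_cost, alpha):
    """Register-sensitive mode: among the high-priority steps, pick the one that adds the
    fewest live registers (computed with (5) against already scheduled successors)."""
    return min(high_priority_steps(forces, alpha), key=lambda t: register_cost[t])

# Hypothetical operation with four feasible steps (forces and register costs are made up).
forces = {0: 0.9, 1: 0.25, 2: 0.2, 3: 0.6}
register_cost = {0: 2, 1: 1, 2: 2, 3: 1}
print(total_sum_force(forces))                        # 1.95
print(high_priority_steps(forces, alpha=0.1))         # [2, 1]: the near-minimum-force steps
print(choose_step(forces, register_cost, alpha=0.1))  # 1: cheaper in registers than step 2
```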

Fig. 17. Forces and high-priority scheduling steps for operation n.

So, informally speaking, the parameter alpha relaxes the first heuristic (aimed at maximizing functional resource utilization) by augmenting the set of candidate scheduling steps. As discussed in the sequel, the resulting (larger) space of alternatives is used to attempt to minimize register pressure. Specifically, in the second phase of the selection process, for each of these high-priority scheduling steps, we compute the register requirements of the data object produced by the operation that is about to be scheduled, with respect to its already scheduled successor operations, using (5). We then select the time step with minimum register requirements and schedule the operation at that step. After updating the set of feasible time steps for the unscheduled operations, the calculation of forces is repeated, until no nodes can be further scheduled (failure) or all nodes are scheduled (success).

D. Final Register Pressure Improvement Phase

As indicated in Fig. 8, after a feasible solution is found (for a given II and PS), if the algorithm is executing in register-sensitive mode, it enters a final register pressure improvement phase. Specifically, the II and PS values of the feasible solution just found, as well as the particular SCC solutions embedded in it, are fixed and, as shown in Fig. 8, a new round of iterations over the retiming phase only (using our modified force directed scheduling algorithm) is performed, driven by a stepwise increase of alpha. Note that, by increasing the value of alpha at each iteration, the algorithm increases register sensitivity (i.e., gradually enlarges the set of high-priority steps to be considered for each operation), in an attempt to obtain retiming solutions with less register pressure. In the current implementation, we start with a small initial value of alpha and increase it up to 1 in increments of 0.1. The global retiming solution (and corresponding scheduling solution) with minimum register requirements (for the values of II and PS identified in the previous phase) is selected as the best solution for the original software pipelining problem.

E. Time Complexity

RS-FDRA has two main iteration loops around a core retiming algorithm with cubic asymptotic complexity [18]. The outer loop iterates over increasing values of PS or II, depending on the optimization problem being solved. For example, when the algorithm is solving Problem 1, i.e., when it is minimizing code size (PS) subject to a constraint on the initiation interval (II), PS is incremented by 1 at each iteration. However, after a certain maximum number of attempts, RS-FDRA gives up; in our implementation, the maximum number of attempts (for PS when solving Problem 1 and for II when solving Problem 2) was set to 3, yet this can be set by the user. The inner loop of RS-FDRA iterates over different sets of retiming solutions for the SCCs. As is well known, the number of different retiming solutions for the SCCs is exponential. Our algorithm attacks this problem first by eliminating the unfeasible retiming solutions, i.e., solutions with a critical path larger than the constraint on II or with more pipeline stages than the constraint on PS. Then, it ranks and orders all feasible solutions, as explained in Section IV-B.
Throughout our extensive experimentation, all SCCs in our representative benchmarks were consistently small (at most nine nodes). However, if a loop body with a very large SCC needs to be handled, the search may become prohibitively expensive. A simple way to address this problem (i.e., to restrict the exponential search) is to include only a maximum number of best-ranked solutions; this limit can be implemented as a user-defined parameter. Thus, by limiting the number of iterations on code size or initiation interval, as well as the maximum number of alternative sets of SCC solutions to be considered, the asymptotic complexity of RS-FDRA remains cubic. Although such time complexity is still quite significant, we find it compatible with the need to synthesize high-quality embedded software components. Indeed, in the context of embedded software synthesis, designers are likely to be willing to wait significantly longer for a better quality solution, that is, for a solution that aggressively minimizes the cost function while meeting the specified constraints. RS-FDRA is proposed for use in such a context.

V. PREVIOUS WORK

In Denk et al. [15], an algorithm to find all valid retiming solutions for a strongly connected graph is proposed. RS-FDRA uses this technique in its solution generation process for the strongly connected components of the input graph.

The software-pipelining algorithm proposed by Lam [4], [5] considers resource and latency constraints and is able to handle conditionals in the loop body. The graph's strongly connected components are first individually scheduled using list scheduling. Then, a reduced feedforward graph (i.e., a DAG) is constructed by contracting each strongly connected component in the original graph into a single node with the corresponding aggregate resource usage. A modified list-scheduling algorithm is then used to find a schedule for the reduced graph, considering a given target initiation interval. If a schedule cannot be found, the initiation interval is increased and the list-scheduling algorithm is reinvoked. Although this approach is attractive due to its simplicity, the a priori individual scheduling of strongly connected components may give suboptimal results. In this algorithm, conditional branching constructs are handled using a hierarchical reduction technique: the branches of the conditionals are first scheduled and then contracted into a single node. Software pipelining is applied only after the graph has been completely reduced, using the same technique employed for basic blocks.

In Wang et al. [20], an algorithm for latency-constrained scheduling (using implicit retiming) is proposed in the context of high-level synthesis of signal processing architectures. In this approach, if the target latency is fractional, the input graph is unfolded before retiming. Similar to Lam's algorithm [4], nodes belonging to the strongly connected parts of the graph are given higher priority. The algorithm starts by breaking cycles into parts; the breaking points are the edges with positive delays. Then, a set of cycle parts that covers all nodes of all strongly connected components is found. This set of cycle parts is scheduled according to the priority function mentioned before. Resource conflicts are resolved by shifting flexible cycles to later time steps. If shifting is not possible, the number of resources is increased. After scheduling all strongly connected components, the algorithm then schedules the feedforward parts of the graph.

The algorithm proposed in Lee et al. [21] schedules a loop body under latency and resource constraints, trying to minimize the number of pipeline stages. This algorithm has two phases. In the first phase, the input graph is scheduled assuming unlimited resources. Resource conflicts are then resolved by delaying selected operations. If a valid schedule cannot be generated, the number of pipeline stages is incremented and the algorithm is repeated.

The software pipelining algorithm proposed in Goossens et al. [22] also performs an initial scheduling assuming unlimited resources. The resulting retimed graph is then scheduled for minimum latency under resource constraints, using list scheduling. If the latency of the scheduled graph is larger than the target latency, the nodes at the tail of the graph are retimed repeatedly until the target latency is reached.

The resource-constrained software-pipelining algorithm proposed in Aiken et al. [23] handles conditionals in the loop body. A scheduler repeatedly unfolds the loop and schedules operations selected by a dependence analyzer, until a repeating pattern is detected. In order to guarantee the termination of the algorithm, a scheduling window (of operations) is used. The scheduling window can include operations from at most a bounded number of further iterations, where the bound is an input parameter.

The retiming algorithm proposed in Chao et al. [24] compacts a given valid schedule by applying phased iterative retiming and scheduling to the first scheduling steps of the schedule. The number of down rotations and the rotation size (i.e., the number of scheduling steps to be down-rotated by the algorithm) are input parameters to the algorithm. If adequate phase and rotation sizes are specified, this algorithm is likely to converge to an optimum solution.

The algorithm proposed in Potkonjak et al. [25] uses a probabilistic rejectionless algorithm, aiming at achieving high resource utilization. Each iteration of the algorithm randomly selects a candidate move of a delay in the graph. The probability of selecting a certain move is proportional to the increase in demand for all types of resources. This algorithm is similar to Chao et al. [24] in the sense that, when the running time is sufficiently large, the algorithm is likely to converge to an optimum solution.

None of the algorithms described so far is capable of handling code-size constraints. To the best of our knowledge, the only exception is our fast heuristic algorithm reported in Jacome et al.
[26], which, similarly to RS-FDRA, handles minimum code size under latency constraints or minimum latency under code-size constraints. This algorithm uses list scheduling and gives higher priority to the nodes in the strongly connected components of the input graph during the scheduling process. As mentioned before, giving higher priority to strongly connected components may eliminate optimal solutions from consideration. Since in the context of embedded systems the quality of the generated solutions is of major importance, we designed RS-FDRA to efficiently handle the high-quality requirements of such applications, as well as to enable explicit tradeoff exploration by embedded system designers.

In what follows, we overview previous software pipelining algorithms that can handle register pressure minimization.

Slack Scheduling [27] follows a bidirectional scheduling strategy, i.e., it schedules some operations early and some operations late, in order to reduce register pressure. The operations with smaller slack are given higher priority and are thus scheduled earlier. Also, a limited backtracking approach is used, that is, the scheduler may eject selected operations from the schedule to eliminate resource conflicts.

In Eichenberger et al. [28], [29], an LP-based approach is proposed to schedule loop operations for minimum register requirements, for a given modulo reservation table. An exact methodology to minimize register requirements for an optimum rate schedule is also presented in Govindarajan et al. [30]. Exact methods for solving the NP-complete problem under consideration may be prohibitively expensive to use in a compiler. In Eichenberger et al. [31], a set of low computational complexity stage-scheduling heuristics is presented, aimed at reducing the register requirements of a given modulo schedule solution. These heuristics shift operations (by multiples of the initiation interval) to earlier or later pipeline stages. Unfortunately, the quality of the final schedule, in terms of performance and register requirements, mainly depends on the input modulo schedule.

In Llosa et al. [32], [33], Hypernode Reduction Modulo Scheduling, a software pipelining algorithm aiming at register pressure minimization, is proposed. This algorithm first orders the nodes of the graph. When ordering the nodes, an initial node, called the hypernode, is randomly selected. Then, in an iterative process, all the nodes in the dependence graph are reduced to this hypernode, i.e., added to the ordered list of nodes conceptualized as a hypernode. Specifically, in order to include new nodes in the hypernode, all edges between the hypernode and its neighbors are deleted and all edges to/from those neighbors are redirected to/from the hypernode. If the input graph has strongly connected components, those components are reduced to hypernodes before the nodes in the feedforward parts of the graph. After reducing the complete graph to a single hypernode, the nodes of the graph are scheduled in the reduction order, using a modulo-scheduling-like technique. This algorithm thus aims at minimizing register pressure by scheduling operations as close as possible to their neighbors.

Swing Modulo Scheduling [34], [35] schedules the operations of the input graph using an elegant heuristic ordering. Before ordering the nodes, one back-edge of each recurrence circuit is eliminated. Then, the nodes are ordered for scheduling such that each node has only predecessors or only successors in the currently ordered list of operations. During the ordering phase, the nodes of the strongly connected components are given higher priority and are placed together. Then, operations are scheduled as close as possible to their already scheduled neighbors (if any), in order to reduce register pressure.
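Several of the approaches above are modulo schedulers, so it may help to recall the two pieces of bookkeeping they share: a modulo reservation table, used to detect resource conflicts at a given initiation interval, and a register-pressure measure such as MaxLive. The sketch below is generic and illustrative; it is not taken from any of the cited papers, and all names are placeholders.

```python
def mrt_conflict(mrt, resource, cycle, ii):
    """True if `resource` is already occupied in the modulo reservation table
    slot corresponding to `cycle` modulo the initiation interval `ii`."""
    return resource in mrt[cycle % ii]

def place(mrt, resource, cycle, ii):
    mrt[cycle % ii].add(resource)

def max_live(lifetimes, ii):
    """MaxLive of a modulo schedule: the largest number of values simultaneously
    live in any cycle of the steady-state kernel. `lifetimes` is a list of
    (def_cycle, last_use_cycle) pairs in absolute cycles."""
    live = [0] * ii
    for start, end in lifetimes:
        for c in range(start, end + 1):   # lifetimes longer than ii cycles
            live[c % ii] += 1             # contribute to a slot more than once
    return max(live) if live else 0

# Example: an initiation interval of 2 with one multiplier slot per modulo cycle.
ii = 2
mrt = [set() for _ in range(ii)]
assert not mrt_conflict(mrt, "mul", 4, ii)
place(mrt, "mul", 4, ii)
assert mrt_conflict(mrt, "mul", 6, ii)    # cycle 6 maps to the same modulo slot
```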

VI. EXPERIMENTAL RESULTS

This section presents experimental validation results. Experimental data was collected for a number of digital signal-processing benchmarks widely referenced in the retiming/software pipelining literature, considering various VLIW datapath configurations. The detailed experimental data is given in Appendix A.

We start by presenting the results obtained when RS-FDRA is solving Problem 1-RI. Recall that this is the traditional software pipelining problem in register-insensitive mode, that is, considering only the primary cost function (see Section III). In these experiments, we compare our results with those produced by our implementation of Rotation Scheduling [24]. Then, in Section VI-B, we present experimental results obtained when RS-FDRA is solving Problem 1, that is, minimizing code size and register pressure under constraints on the initiation interval. In these experiments, we compare our results with our implementation of Swing Modulo Scheduling [34], since, unlike Rotation Scheduling, this algorithm can handle register pressure minimization. Finally, we present results obtained when RS-FDRA is solving Problem 2 and Problem 2-RI. Recall that Problem 2 is the code-size constrained software pipelining problem, minimizing initiation interval (and register pressure) under code-size constraints. In these two final sets of experiments, we compare our set of tradeoff results with the single results produced by the selected reference algorithms, Rotation Scheduling and Swing Modulo Scheduling.

A final clarification before presenting our experimental results: the two optimization problems addressed in this paper are NP-complete, thus finding an actual optimal solution is virtually impossible for most reasonably sized problem instances. Thus, in what follows, when we refer to a minimum initiation interval, code size, or register requirement solution, we specifically mean a solution with the best (i.e., smallest) such value found by any of the algorithms under consideration.

A. Experiments: Register-Insensitive Latency Constrained Retiming Problem (Problem 1-RI)

Rotation Scheduling [24] was chosen as the comparison algorithm for this set of experiments because it is one of the best software pipelining algorithms proposed to date; indeed, our experiments show that it consistently finds minimum latency schedules under resource constraints, with a corresponding minimum depth retiming function (i.e., minimum number of pipeline stages). Although Rotation Scheduling does not handle code-size constraints, the results of this experiment are still informative, since they empirically demonstrate that RS-FDRA handles code-size minimization under latency and resource constraints (i.e., Problem 1) at least as effectively as previous state-of-the-art approaches.

Fig. 18. Experimental results: Problem 1-RI.

The bar chart in Fig. 18 summarizes the results of our experiments with RS-FDRA and Rotation Scheduling. As can be seen, our algorithm finds the best solution in most cases.
Specifically, it outperforms or gives identical results to Rotation Scheduling in 98% of the 126 conducted experiments. Moreover, for the cases in which the initiation interval is improved with respect to Rotation Scheduling, RS-FDRA was able to decrease it on average by 17% and by up to 33%. Considering the cases in which RS-FDRA obtained the minimum/best initiation interval solution, in 94% of such cases it also obtained the solution with minimum code size; Rotation Scheduling obtained the minimum code-size solution in only 72% of those cases. Again, for the cases in which RS-FDRA outperformed Rotation Scheduling, it decreased code size on average by 25% and by up to 40%. This empirical evidence strongly suggests that RS-FDRA, when handling latency minimization under resource constraints, compares favorably with previous state-of-the-art approaches.

B. Experiments: Register-Sensitive Latency Constrained Retiming Problem (Problem 1)

In Fig. 19, we present experimental results obtained when RS-FDRA is solving Problem 1, that is, executing with register minimization turned on. For this set of experiments, we selected Swing Modulo Scheduling [34] as the comparison algorithm because, according to the results presented in Llosa et al. [34], this algorithm performs almost as well as the exact method reported in Cortadella et al. [36]. As can be seen in Fig. 19, our algorithm found a minimum latency solution in 98% of the cases, while Swing Modulo Scheduling generated minimum latency solutions in 77% of the cases. For the cases in which RS-FDRA obtained the best initiation interval, a minimum code-size solution was also obtained in 95% of the cases, whereas Swing Modulo Scheduling obtained it in 25% of the cases. For the cases in which RS-FDRA obtained the best initiation interval and code size, it was able to generate a solution with minimum register requirements in 95% of the cases, whereas Swing Modulo Scheduling obtained it in 25% of the cases. For the cases in which we decreased register pressure with respect to Swing Modulo Scheduling, the improvement was on average 26% and up to 46%. This empirical evidence strongly suggests that, when handling latency and register pressure minimization under resource constraints, RS-FDRA again compares favorably with previous state-of-the-art approaches.

Fig. 19. Experimental results: Problem 1.

Fig. 20. Exploration of different code-size retiming solutions.

C. Experiments: Code-Size Constrained Retiming Problem (Problem 2-RI and Problem 2)

In Fig. 20, we plot the results of two of our experiments, obtained with RS-FDRA executing in Problem 2 mode for the four-cascaded FIR filter with eight adders and eight multipliers. (Recall that Problem 2 corresponds to minimization of latency and register requirements under resource and code-size constraints.) These experiments aim at demonstrating that RS-FDRA is capable of exploring a much larger set of tradeoffs than the reference algorithms. Note that, by varying the constraint on the number of pipeline stages, several Pareto optimal points, exhibiting different latency (and register requirement) versus code-size tradeoffs, were generated by RS-FDRA. For example, as shown in Fig. 20, for the four-cascaded FIR filter with eight adders and eight multipliers, Swing Modulo Scheduling can generate a single retiming solution, with seven pipeline stages and an initiation interval of three. Our algorithm is capable of generating solutions with nine pipeline stages and an initiation interval of two steps, six pipeline stages and an initiation interval of three steps, five pipeline stages and an initiation interval of four, and so on. Naturally, deciding which solution is best depends on the performance, register, and code-size requirements/budgets defined for each specific embedded application. Additional experimental data on Problem 2-RI and Problem 2 results can be found in Appendix A.

D. Execution Times

In all of our experiments, the execution time of RS-FDRA varied from a few seconds to a few hundred seconds, while the execution time of Rotation Scheduling was always a few seconds and the execution time of Swing Modulo Scheduling was always under one second. As mentioned in Section IV-E, given the need for high-quality synthesis of embedded software components, we do find RS-FDRA's execution times to be acceptable. Detailed execution times for all experiments are reported in Appendix A.

VII. CONCLUSION AND WORK IN PROGRESS

This paper proposes a novel software-pipelining algorithm, the Register-Sensitive Force-Directed Retiming Algorithm (RS-FDRA), suitable for optimizing compilers targeting embedded VLIW processors. The key difference between RS-FDRA and previous approaches is that our algorithm can effectively handle code-size constraints along with latency and resource constraints. This novel ability to explicitly consider code size is critical, since it allows embedded system designers to perform compiler-assisted exploration of Pareto optimal points with respect to code size and performance, both critical figures of merit for embedded software. Another important capability of RS-FDRA is that it can also minimize the increase in register pressure typically incurred by software pipelining. This ability is critical since, unfortunately, the need to insert spill code may sometimes result in significant performance degradation. Extensive experimental results are provided demonstrating that the extended set of optimization goals and constraints is supported by RS-FDRA without compromising the quality of the individual problem instance solutions. In other words, it has been shown that the individual solutions generated by our algorithm compare favorably with those produced by some of the best state-of-the-art algorithms.
In conclusion, the proposed RS-FDRA enables a thorough compiler-assisted exploration of tradeoffs among performance, code size, and register requirements for time-critical segments of embedded software components. We are currently working on extending the algorithm to handle clustered VLIW machines, i.e., machines with multiple register files [37].

TABLE I: EXPERIMENTAL RESULTS FOR PROBLEM 1-RI

APPENDIX
DETAILED EXPERIMENTAL DATA

A. Comparison to Rotation Scheduling

Table I presents experimental results obtained when RS-FDRA and Rotation Scheduling are solving the traditional software pipelining problem, that is, minimizing latency under resource constraints and simultaneously deriving the minimum number of pipeline stages required by the solution. RS-FDRA was thus executed in Problem 1-RI optimization mode for these experiments. Each row in Table I represents a different experiment. Columns 1 and 2 specify the DSP benchmark and the VLIW datapath configuration considered in the particular experiment, respectively. The following two columns show the initiation interval and the number of pipeline stages achieved by our algorithm and by the reference algorithm, considering three alternative multipliers. Specifically, the first multiplier has an execution delay of 1 cycle. The second multiplier has an execution delay of 2 cycles and is pipelined with an initiation interval of 1 cycle. Finally, the third multiplier is nonpipelined and has an execution delay of 2 cycles. A further column shows the execution time in seconds on an Intel Pentium II Xeon processor. Solutions known to be inferior are marked in gray.

In Table II, we present experimental results obtained when our algorithm was executed in Problem 2-RI mode, i.e., minimizing latency under resource constraints while also considering a constraint on the maximum number of pipeline stages. The constraint on the number of pipeline stages is shown in column 2.
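For concreteness, the three multiplier alternatives described above can be captured by a pair (execution delay, issue initiation interval), as in the hypothetical configuration record below; treating the nonpipelined multiplier as blocking its unit for two cycles is an assumption that follows from its 2-cycle delay.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FunctionalUnit:
    name: str
    delay: int           # execution delay in cycles
    init_interval: int   # cycles between successive issues (1 = fully pipelined)

# The three multiplier alternatives used in the experiments.
mul_1cycle       = FunctionalUnit("mul", delay=1, init_interval=1)
mul_pipelined    = FunctionalUnit("mul", delay=2, init_interval=1)
mul_nonpipelined = FunctionalUnit("mul", delay=2, init_interval=2)
```

A VLIW datapath configuration is then simply a multiset of such units, e.g., eight adders and eight multipliers in the four-cascaded FIR filter experiments.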

TABLE II: EXPERIMENTAL RESULTS FOR PROBLEM 2-RI

TABLE III: EXPERIMENTAL RESULTS FOR PROBLEM 1

By varying the constraint on the number of pipeline stages, several Pareto optimal points, exhibiting different latency versus code-size tradeoffs, were generated by RS-FDRA. We compare these multiple tradeoff solutions with the single solution obtained by Rotation Scheduling. As before, solutions known to be inferior are marked in gray.
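The tradeoff data summarized here can be produced by sweeping the maximum-pipeline-stage constraint and keeping only the non-dominated (initiation interval, pipeline stages) pairs. The sketch below shows one way to drive such a sweep; `solve_problem2` and the solution attributes are hypothetical placeholders for a code-size constrained run of RS-FDRA.

```python
def explore_code_size_tradeoffs(graph, stage_budgets):
    """Sweep the maximum-pipeline-stage constraint (Problem 2 / 2-RI style) and
    keep the Pareto front over (initiation interval, pipeline stages)."""
    points = []
    for max_stages in stage_budgets:
        sol = solve_problem2(graph, max_stages)   # hypothetical solver call
        if sol is not None:
            points.append((sol.initiation_interval, sol.stages))
    # Keep non-dominated points: smaller II and fewer stages are both better.
    pareto = [p for p in points
              if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)]
    return sorted(pareto)

# For example, Section VI-C reports (II, stages) points such as (2, 9), (3, 6),
# and (4, 5) for the four-cascaded FIR filter with eight adders and multipliers.
```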

TABLE IV: EXPERIMENTAL RESULTS FOR PROBLEM 2

B. Comparison to Swing Modulo Scheduling

In Table III, we present experimental results obtained when RS-FDRA is solving Problem 1, that is, finding a solution with minimum code size and register requirements under constraints on the initiation interval. Similar to Table I, two columns show the initiation interval (i.e., latency) and the number of pipeline stages. In addition, a column shows the minimum register requirements of the resulting solutions. Our results are contrasted with those achieved by Swing Modulo Scheduling.

In Table IV, we present the various tradeoff solutions obtained with RS-FDRA executing in Problem 2 mode, i.e., minimizing latency and register requirements under resource and maximum-number-of-pipeline-stages constraints, and contrast them with the single solutions generated by Swing Modulo Scheduling.

REFERENCES

[1] TMS320C62x/C67x CPU and Instruction Set Reference Guide, Texas Instruments.
[2] [Online]. Available:
[3] [Online]. Available:
[4] M. Lam, "A systolic array optimizing compiler," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA.
[5] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in Proc. SIGPLAN 88 Conf. Programming Language Design and Implementation, Atlanta, GA, June 1988.
[6] C. E. Leiserson and J. B. Saxe, "Optimizing synchronous circuitry and retiming," in Proc. 3rd Caltech Conf. VLSI, 1983.
[7] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, pp. 5-35, June.
[8] J. C. Dehnert, P. Y.-T. Hsu, and J. P. Bratt, "Overlapped loop support in the Cydra 5," in Proc. 3rd Int. Conf. Architectural Support for Programming Languages and Operating Systems, Apr. 1989.
[9] S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu, "A comparison of full and partial predicated execution support for ILP processors," in Proc. 22nd Int. Symp. Computer Architecture, June 1995.
[10]
[11] J. W. Sias, D. I. August, and W. W. Hwu, "Accurate and efficient predicate analysis with binary decision diagrams," in Proc. 33rd Int. Symp. Microarchitecture, Dec. 2000.
[12] C. Akturan and M. F. Jacome, "FDRA: A software pipelining algorithm for embedded VLIW processors," in Proc. Int. Symp. System Synthesis, Sept. 2000.
[13] M. S. Schlansker and B. R. Rau, "EPIC: Explicitly parallel instruction computing," IEEE Comput., vol. 33, Feb.
[14] E. M. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory and Practice. Englewood Cliffs, NJ: Prentice-Hall.
[15] T. C. Denk and K. K. Parhi, "Exhaustive scheduling and retiming of digital signal processing systems," IEEE Trans. Circuits Syst. II, vol. 45, July.
[16] T. C. Denk and K. K. Parhi, "A unified framework for characterizing retiming and scheduling solutions," in Proc. IEEE Int. Symp. Circuits and Systems, Atlanta, GA, May 1996.
[17] S. Simon, E. Bernard, M. Sauer, and J. Nossek, "A new retiming algorithm for circuit design," in Proc. IEEE ISCAS, London, U.K., May 1994.
[18] P. G. Paulin and J. P. Knight, "Force directed scheduling for the behavioral synthesis of ASICs," IEEE Trans. Computer-Aided Design, vol. 8, June.
[19] H. P. Peixoto and M. F. Jacome, "A new technique for estimating lower bounds on latency for high level synthesis," in Proc. IEEE Great Lakes Symp., Mar. 2000.
[20] C. Wang and K. K. Parhi, "High level DSP synthesis using MARS design system," in Proc. Int. Symp. Circuits Systems, 1992.
[21] T. Lee, A. C. Wu, D. D. Gajski, and Y. Lin, "An effective methodology for functional pipelining," in Proc. Int. Conf. Computer-Aided Design, Dec. 1992.
[22] G. Goossens, J. Vandewalle, and H. De Man, "Loop optimization in register-transfer scheduling for DSP-systems," in Proc. ACM/IEEE Design Automation Conf., 1989.
[23] A. Aiken, A. Nicolau, and S. Novack, "Resource-constrained software pipelining," IEEE Trans. Parallel Distributed Syst., vol. 6, Dec.
[24] L. Chao, A. LaPaugh, and E. H. Sha, "Rotation scheduling: A loop pipelining algorithm," IEEE Trans. Computer-Aided Design, vol. 16, Mar.
[25] M. Potkonjak and J. Rabaey, "Retiming for scheduling," in VLSI Signal Processing IV, Nov.
[26] M. F. Jacome, G. de Veciana, and C. Akturan, "Resource constrained dataflow retiming heuristics for VLIW ASIPs," in Proc. IEEE/ACM 7th Int. Workshop Hardware/Software Codesign, Apr. 1999.
[27] R. A. Huff, "Lifetime-sensitive modulo scheduling," in Proc. ACM SIGPLAN 93 Conf. Programming Language Design and Implementation, 1993.
[28] A. E. Eichenberger, E. S. Davidson, and S. G. Abraham, "Minimum register requirements for a modulo schedule," in Proc. MICRO-27, San Jose, CA, Nov. 1994.
[29] A. E. Eichenberger, E. S. Davidson, and S. G. Abraham, "Minimizing register requirements of a modulo schedule via optimum stage scheduling," Int. J. Parallel Programming, vol. 24, no. 2.
[30] R. Govindarajan, E. R. Altman, and G. R. Gao, "Minimizing register requirements under resource-constrained rate-optimal software pipelining," in Proc. MICRO-27, San Jose, CA, Nov. 1994.
[31] A. E. Eichenberger and E. S. Davidson, "Stage scheduling: A technique to reduce the register requirements of a modulo schedule," in Proc. MICRO-28, Nov. 1995.

[32] J. Llosa, M. Valero, E. Ayguade, and A. Gonzalez, "Hypernode reduction modulo scheduling," in Proc. 28th Annu. Int. Symp. Microarchitecture, Dec. 1995.
[33] J. Llosa, M. Valero, E. Ayguade, and A. Gonzalez, "Modulo scheduling with reduced register pressure," IEEE Trans. Comput., vol. 47, June.
[34] J. Llosa, A. Gonzalez, E. Ayguade, and M. Valero, "Swing modulo scheduling: A lifetime sensitive approach," in Proc. PACT 96, Oct. 1996.
[35] J. Llosa, E. Ayguade, A. Gonzalez, M. Valero, and J. Eckhardt, "Lifetime-sensitive modulo scheduling in a production environment," IEEE Trans. Comput., vol. 50, no. 3, Mar.
[36] J. Cortadella, R. M. Badia, and F. Sanchez, "A mathematical formulation of the loop pipelining problem," in Proc. 11th Design Integrated Circuits Systems Conf., Nov. 1996.
[37] E. Nystrom and A. E. Eichenberger, "Effective cluster assignment for modulo scheduling," in Proc. MICRO-31, Dec. 1998.

Cagdas Akturan received the B.S. degree in computer engineering from Ege University, Turkey, and the MSE and Ph.D. degrees from the University of Texas at Austin in 1998 and 2001, respectively. She currently works at Intel Corporation, Portland, OR.

Margarida F. Jacome (S'92-M'94-SM'01) received the B.S. and M.S. degrees from the Technical University of Lisbon, IST, in 1981 and 1988, respectively, and the Ph.D. degree in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, PA. She is an Associate Professor of Electrical and Computer Engineering at The University of Texas at Austin. Her research interests include embedded computing, hardware-software codesign, and high-level synthesis.

Dr. Jacome is a member of the Executive Committee of the Design Automation Technical Committee (DATC) of the IEEE Computer Society. She has served on the technical program committees of a number of conferences, including the International Conference on Computer-Aided Design (ICCAD), Design Automation and Test in Europe (DATE), the International Symposium on Hardware/Software Codesign (CODES), the International Conference on Computer Design (ICCD), and the International Symposium on System Synthesis (ISSS). She served as General Co-Chair of the Workshop on Electronic Design Processes (EDP) in 1998. She is a recipient of a 1996 National Science Foundation CAREER Award, a co-recipient of the IEEE/ACM Design Automation Conference (DAC) Best Paper Award for 1992, and a co-recipient of the IEEE/CAS William J. McCalla Best ICCAD Paper Award for 2000.


More information

Incompatibility Dimensions and Integration of Atomic Commit Protocols

Incompatibility Dimensions and Integration of Atomic Commit Protocols Preprint Incompatibility Dimensions and Integration of Atomic Protocols, Yousef J. Al-Houmaily, International Arab Journal of Information Technology, Vol. 5, No. 4, pp. 381-392, October 2008. Incompatibility

More information

5.4 Pure Minimal Cost Flow

5.4 Pure Minimal Cost Flow Pure Minimal Cost Flow Problem. Pure Minimal Cost Flow Networks are especially convenient for modeling because of their simple nonmathematical structure that can be easily portrayed with a graph. This

More information

Breaking Grain-128 with Dynamic Cube Attacks

Breaking Grain-128 with Dynamic Cube Attacks Breaking Grain-128 with Dynamic Cube Attacks Itai Dinur and Adi Shamir Computer Science department The Weizmann Institute Rehovot 76100, Israel Abstract. We present a new variant of cube attacks called

More information

Training Digital Circuits with Hamming Clustering

Training Digital Circuits with Hamming Clustering IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL. 47, NO. 4, APRIL 2000 513 Training Digital Circuits with Hamming Clustering Marco Muselli, Member, IEEE, and Diego

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

Divisibility Rules and Their Explanations

Divisibility Rules and Their Explanations Divisibility Rules and Their Explanations Increase Your Number Sense These divisibility rules apply to determining the divisibility of a positive integer (1, 2, 3, ) by another positive integer or 0 (although

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

SEARCH METHODS USING HEURISTIC STRATEGIES. Michael Georgeff. Department of Computer Science, Monash University, Clayton, Vic, Australia.

SEARCH METHODS USING HEURISTIC STRATEGIES. Michael Georgeff. Department of Computer Science, Monash University, Clayton, Vic, Australia. SEARCH METHODS USING HEURISTIC STRATEGIES Michael Georgeff Department of Computer Science, Monash University, Clayton, Vic, Australia. Abstract Real-valued heuristic functions have been extensively used

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information