ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS

A Thesis Presented to
The Graduate Faculty of The University of Akron

In Partial Fulfillment of the Requirements for the Degree
Master of Science

Rama Krishna Pullaguntla

August, 2008

ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS

Rama Krishna Pullaguntla

Thesis

Accepted:
Advisor: Dr. Timothy W. O'Neil
Faculty Reader: Dr. Kathy J. Liszka
Faculty Reader: Dr. Timothy S. Margush
Department Chair: Dr. Wolfgang Pelz

Approved:
Dean of the College: Dr. Ronald F. Levant
Dean of the Graduate School: Dr. George R. Newkome
Date

ABSTRACT

Scheduling loops optimally is an important step in parallel processing, since many applications are built around iterative processes. Some of these iterative processes are best described using synchronous data flow graphs (SDFGs), also called multi-rate graphs. A great deal of research has been done to optimize SDFGs using techniques such as retiming. In this research, we apply a technique called rotation scheduling to reduce the execution times of SDFGs. We also present an algorithm that schedules synchronous data flow graphs based on the types of functional units and the number of copies available for each functional unit. Finally, we demonstrate the contributions of our research using suitable examples.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

CHAPTER
I. INTRODUCTION
   1.1 Synchronous data flow graph
   1.2 Rotation scheduling
   1.3 Contributions and outline
II. BACKGROUND
   2.1 Directed graph
   2.2 Data flow graph
   2.3 Scheduling iterative processes
   2.4 Length of a schedule
   2.5 Clock period
      2.5.1 Computation delay ratio
      2.5.2 Critical cycle
   2.6 Synchronous data flow graphs
      2.6.1 Definition
      2.6.2 Topology matrix
      2.6.3 Basic repetition vector
      2.6.4 Equivalent homogeneous graph
      2.6.5 Consistency and liveness of SDFG
      2.6.6 Iteration bound for SDFG
   2.7 Summary
III. IMPLEMENTATION
   3.1 Retiming
   3.2 Scheduling operations in a SDFG
   3.3 Example to demonstrate scheduling algorithm
   3.4 Example to demonstrate the down rotate operation
   3.5 Rotation heuristics
   3.6 Limitations of the heuristic model
   3.7 Summary
IV. RESULTS
   4.1 First example
   4.2 Second example
   4.3 Summary
V. CONCLUSIONS AND FUTURE WORK
   5.1 Contributions
   5.2 Rotation scheduling
   5.3 Scheduling the synchronous data flow graph
   5.4 Future work
REFERENCES

LIST OF TABLES

3.1 Schedule for the SDFG in Figure 3.1
3.2 Initial schedule for the SDFG in Figure 3.3
3.3 Schedule for the SDFG after a single rotation
4.1 Initial schedule for the SDFG in the first example
4.2 Schedule for the SDFG after a single rotation
4.3 Schedule for the SDFG after a second rotation
4.4 Initial schedule for the SDFG in the second example
4.5 Schedule for the SDFG in example 2 after a single rotation
4.6 Schedule for the SDFG in example 2 after a second rotation

LIST OF FIGURES

2.1 Data flow graph example
2.2 Directed acyclic graph
2.3 Synchronous data flow graph example
2.4 Topology matrix
2.5 Inconsistent synchronous data flow graph
2.6 Topology matrix of the inconsistent synchronous data flow graph
2.7 Liveness of a synchronous data flow graph
2.8 Equivalent homogeneous graph to demonstrate liveness
2.9 Synchronous data flow graph to demonstrate the iteration bound
2.10 Equivalent homogeneous graph to demonstrate the iteration bound
3.1 SDFG to demonstrate the scheduling algorithm
3.2 SDAG for the SDFG in Figure 3.1
3.3 SDFG to demonstrate down rotation
3.4 Topology matrix for the SDFG in Figure 3.3
3.5 Down-rotated SDFG
4.1 Synchronous data flow graph for the first example
4.2 Topology matrix for the first example
4.3 Synchronous directed acyclic graph of example 1
4.4 SDFG after one down rotation
4.5 Synchronous directed acyclic graph after a single rotation
4.6 Synchronous data flow graph after a second rotation
4.7 Synchronous directed acyclic graph after a second rotation
4.8 Synchronous data flow graph for the second example
4.9 Topology matrix for the SDFG in Figure 4.8
4.10 SDAG of the SDFG in Figure 4.8
4.11 SDFG after a single rotation
4.12 SDAG after the first rotation
4.13 SDFG after a second rotation
4.14 SDAG after a second rotation

LIST OF ALGORITHMS

3.1 Schedule-SDFG (G)
3.2 Down_Rotate (G, S_G, l, q)
3.3 Rotation_Phase (G, S_G, l, η, S_opt, q)
3.4 Heuristic RH1 (G, η, κ)

CHAPTER I

INTRODUCTION

Parallelism is one of the most important aspects of any application being developed today. Many researchers are constantly working to improve the degree of parallelism that can be exploited in any type of application, whether hardware or software. One such technique, which extracts a good amount of parallelism from iterative processes, is rotation scheduling on synchronous data flow graphs.

1.1 Synchronous Data Flow Graphs (SDFG)

Synchronous data flow graphs, also called multi-rate graphs, are used to represent the data flow in multiprocessor applications. They are similar to data flow graphs, but the nodes within an SDFG are capable of producing and consuming multiple units of data. The nodes of an SDFG represent functional elements, and the edges between them denote the connections between those elements. Each node produces or consumes a fixed number of data tokens.

1.2 Rotation Scheduling

Rotation scheduling is a technique used to schedule iterative processes that can be represented using data flow graphs. The goal of this technique is to optimize the length of the schedule for a particular process. Its output is a static schedule that indicates the start time of each node in the data flow graph. The basic idea behind rotation scheduling is to reduce the idle time of the hardware by moving operations scheduled at the beginning of the schedule to its end.

Existing research has contributed a great deal toward scheduling data flow graphs using several techniques. We briefly discuss some of these contributions that relate to our research interests. The basis for rotation scheduling is the retiming technique, which is clearly demonstrated in [1]. In retiming, the delays on the edges of the circuit are redistributed without affecting the dependencies between the operations. The technique pulls some of the delays from the incoming edges of a node and pushes them onto its outgoing edges, thereby reducing the length of the longest zero-delay path. Scheduling data flow graphs using the retiming technique is explained in [2]. That article gives a detailed description of finding the optimal schedule for data flow graphs by combining two techniques, retiming and unfolding.

The authors in [3] propose a method that applies retiming to optimize the schedule of synchronous data flow graphs with the help of another type of graph called an equivalent homogeneous graph (EHG). This graph represents the SDFG in the form of a data flow graph in which each operation produces and consumes a single unit of data.

The application of the rotation scheduling technique to iterative processes is discussed in [4], where the iterative process is represented using a data flow graph. That article emphasizes the methodology used in the implementation of rotation scheduling. It proposes new heuristics such as rotation span, best span and half rotation, which help in finding all possible schedules obtainable from the initial setup.

The behavior and intricacies of synchronous data flow graphs are clearly explained in [7]. That paper describes the basic features of a synchronous data flow graph, such as consistency and liveness, and provides an algorithm to find the static schedule for a synchronous data flow graph.

The authors in [6] describe another facet of rotation scheduling: using it as a technique to reduce code size in pipelined applications. A pipelined loop generally requires a piece of code that must be executed before the loop starts, called the prologue, and similarly a piece of code after it finishes, called the epilogue. The technique uses rotation scheduling to

integrate the prologue and epilogue into the pipeline, which significantly reduces the amount of code that needs to be written.

The authors in [5] discuss important properties of an SDFG such as boundedness and liveness, using Petri nets to explain these characteristics. Liveness of an SDFG indicates whether the process can execute indefinitely, and boundedness describes whether the process can be implemented with a limited amount of memory. The article provides in-depth information on different facets of boundedness, such as strict boundedness and self-timed boundedness.

The authors in [8] propose an algorithm that gives the minimum achievable latency for concurrent real-time applications. The article uses synchronous data flow graphs to analyze streaming applications such as video conferencing, telephony and gaming. Latency is one of the important metrics for measuring the performance of a real-time application. The paper also proposes a heuristic that optimizes latency under a given throughput.

The authors in [9] discuss an algorithm called dynamic scheduling, which reduces the cost of data exchanges between processors once tasks have been assigned to them. Many scheduling algorithms tend to neglect the cost of data exchange once the tasks are scheduled. The main objective of this algorithm is to minimize that cost by effectively mapping synchronous data flow graphs onto processor networks.

It is always a difficult proposition to generate an optimal schedule for processes whose task computation times are unknown. The authors in [10] address this situation by introducing a new facet of rotation scheduling. They propose two new algorithms, probabilistic retiming and probabilistic rotation scheduling, that generate good schedules for these kinds of processes. These algorithms use data flow graphs in which each node denotes a task with a probabilistic computation time.

As we have observed, most existing research has emphasized optimizing homogeneous data flow graphs using methods such as retiming and rotation scheduling in combination with several other techniques. In this research, we focus mainly on applying the rotation scheduling technique to the synchronous data flow graph to find an optimal schedule for the iterative process represented by such a graph.

1.3 Contributions and Outline

In this research, we present the following contributions related to the application of the rotation scheduling technique to synchronous data flow graphs.

1. We develop a Down_Rotate() algorithm that applies the rotation scheduling technique to synchronous data flow graphs.

2. We describe a new algorithm to schedule synchronous data flow graphs without converting them into their EHGs (equivalent homogeneous graphs). This algorithm

assumes that there are different kinds of functional units corresponding to each type of operation, with a limited number of copies of each functional unit. The algorithm takes the SDFG as its input and gives the optimum schedule as its output.

The rest of this thesis is organized as follows:

1. Chapter 2 gives detailed information on the concepts of retiming, rotation scheduling, data flow graphs and synchronous data flow graphs. It also describes properties of data flow graphs such as schedule length, iteration bound, clock period, liveness and consistency.

2. Chapter 3 describes the implementation of the algorithms used to apply rotation scheduling to synchronous data flow graphs. It also describes some of the heuristics developed to find optimum schedules and discusses their limitations. This chapter also includes the algorithm that schedules the SDFG and finds the start time of each vertex.

3. Chapter 4 presents the results of our research. In it, we demonstrate the application of the algorithms to synchronous data flow graphs and record the results after each iteration.

4. Chapter 5 provides the conclusions drawn from this research and suggests enhancements that can be made to it.

CHAPTER II

BACKGROUND INFORMATION

In this chapter, we present a detailed description of the different forms of representation used in this research. As the focus of this research is on iterative processes, we use data flow graphs to represent the data flow between the operations in a process. We also describe several parameters of these graphs that measure the efficiency of the process in terms of the time taken.

2.1 Directed Graph

A directed graph is a graph whose nodes are connected by directed edges. It is represented using the notation G(V, E), where V is a set of nodes and E is a set of directed edges connecting them. The nodes represent processes, and the edges connecting them describe the dependency relationships between the processes. For instance, an edge between nodes u and v, denoted u → v, indicates that process v depends on process u and cannot be scheduled until process u has completed.

A set of vertices together with a set of edges forming a loop is called a simple cycle. A simple cycle c is a subgraph of G(V, E) in which each edge e_i represents the dependence between processes v_i and v_{i+1} for i = 1 to n−1, and process v_n depends on v_1.

2.2 Data Flow Graph

Data flow graphs (DFGs) are an extension of directed graphs. They are generally used to describe the flow of data between operations, where each operation produces and consumes a single unit of data. A data flow graph G(V, E, t, d) is described by the following parameters:

V is a set of nodes that represent the operations.

E is a set of edges, where each edge represents the dependence between two operations.

Each node v is associated with a function t(v) denoting the time required to complete the operation.

Each edge e between nodes u and v is associated with a function d(e) denoting the delay, measured in iterations, between the data being produced by the source node u and received by the sink node v.

All operations in a data flow graph take a non-zero amount of time to complete. In our research, we deal with graphs in which each process takes an integral

amount of time to complete. These graphs contain no cycles in which the delays on all edges are zero. An example of a data flow graph is presented in Figure 2.1. This graph can be treated as a pictorial representation of the following code snippet, where each line represents an operation.

for i = 1 to N {
    a[i] = c[i-1] + 1    /* Operation A */
    b[i] = 2 * a[i-2]    /* Operation B */
    c[i] = 2 * b[i] + 1  /* Operation C */
}

Figure 2.1 Data flow graph example

In Figure 2.1, {A, B, C} is the set of operations V that make up an iterative process. The set E consists of the edges {A→B, B→C, C→A}, representing the dependencies between the operations. The numbers above the nodes give the computation times of the operations: t(v) for A, B and C is 1, 2 and 3, respectively. The delay d(e) on each edge is represented by the bars that

cut across the edges. From Figure 2.1, it can be seen that the delay counts on the edges A→B, B→C and C→A are 2, 0 and 1, respectively.

A directed acyclic graph is related to a data flow graph: it is constructed by removing the edges which have delays on them. This type of graph is very useful when scheduling the operations within an iteration without violating the dependencies between them. A directed acyclic graph can also be used to find a set of processes that can be isolated from the remaining set, so that both sets can be scheduled independently.

2.3 Scheduling Iterative Processes

Scheduling an iterative process represented by a data flow graph is the process of assigning start and end times to each operation without violating the dependencies between them. The output of the scheduling process is called a static schedule. A static schedule specifies the exact time step at which each operation is to be executed, and it ensures that no operation starts before all the operations it depends on have finished. The start time of each operation v in a DFG is given by a function s : V → Z+, where V is the set of vertices in the data flow graph and Z+ is the set of positive integers. The start time of each node u must satisfy the relation

s(u) ≥ max { s(v) + t(v) : v → u ∈ E, d(v → u) = 0 }    (2.1)

Here, u and v are operations, with u dependent on v; s(u) and s(v) are the start times of u and v, and t(u) and t(v) are the numbers of time steps required to complete them. The scheduler then assigns the operations to the functional units available in the given hardware. This assignment is given by a function f : V → F, where V is the set of vertices and F is the set of available functional units. The function f can be used to track the idle periods of a given functional unit. Several scheduling algorithms have been introduced in order to exploit the parallelism in the hardware. These algorithms are helpful only when there are sufficient functional units to support the operations and their dependencies. For a simple graph without any delays, no scheduling algorithm is needed, since scheduling it is a trivial task. We now look at some of the parameters that measure the efficiency of a scheduling algorithm.

2.4 Schedule Length

The length of a schedule is the number of time steps taken to complete one iteration of a process on the given hardware. The efficiency of a scheduling algorithm

lies in finding the optimum schedule length for a given iterative process. The length of the schedule is given by equation 2.2, as shown in [5]:

Length(S_G) = max_{u ∈ V} { s(u) + t(u) } − min_{u ∈ V} { s(u) }    (2.2)

where S_G denotes the schedule generated for data flow graph G, while s(u) and t(u) are the start time and computation time of operation u.

2.5 Clock Period

The clock period is a metric used to determine the quality of a scheduling algorithm. It is the time taken to complete the longest sequence of operations whose connecting edges have no delays. It can also be called the iteration period. It can be computed using equation 2.3, as shown in [4]:

Clock period = max_{p ⊆ G} Σ_{v ∈ p} t(v)    (2.3)

where p is a subgraph of the data flow graph G constructed by removing all the edges with non-zero delay. By doing this, we group the operations which must be executed in the same order and which can be isolated from the remaining part of the graph.

For example, consider the data flow graph in Figure 2.1 when the edges with non-zero delays are removed from it. The longest sequence we obtain is shown in Figure 2.2.

Figure 2.2 Longest sequence after removing the edges with non-zero delays

The clock period can be computed by summing the computation times of the vertices in this graph: clock period = t(B) + t(C) = 2 + 3 = 5.

2.5.1 Computation Delay Ratio

The computation delay ratio is defined for a given cycle C in the graph. It is the ratio of the sum of the computation times of all the vertices in C to the total number of edge delays in C. This is a significant parameter in finding the time required to complete a subset of operations without resource constraints. It can be computed using the formula below, proposed in [4]:

r(C ⊆ G) = Σ_{v ∈ C} t(v) / Σ_{e ∈ C} d(e)    (2.4)

Here, r is the computation delay ratio for a simple cycle C in the data flow graph G, t(v) denotes the computation time of operation v, and d(e) is the delay on edge e.
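To make these two metrics concrete, the following sketch computes them for a small graph. It assumes a networkx DiGraph whose nodes carry the computation time t and whose edges carry the delay count d (the attribute names and graph values are our reading of Figure 2.1 from the text, since the figure itself is not reproduced here):

import networkx as nx

def clock_period(g):
    # Equation (2.3): longest chain of operations connected by
    # zero-delay edges, found by DP in topological order.
    dag = nx.DiGraph((u, v) for u, v, d in g.edges(data="d") if d == 0)
    dag.add_nodes_from(g.nodes(data=True))
    finish = {}
    for v in nx.topological_sort(dag):
        preds = [finish[u] for u in dag.predecessors(v)]
        finish[v] = g.nodes[v]["t"] + max(preds, default=0)
    return max(finish.values())

def delay_ratio(g, cycle):
    # Equation (2.4): total computation time of the cycle's vertices
    # over the total delay on its edges.
    edges = list(zip(cycle, cycle[1:] + cycle[:1]))
    return (sum(g.nodes[v]["t"] for v in cycle)
            / sum(g.edges[u, v]["d"] for u, v in edges))

# The graph of Figure 2.1: t = (1, 2, 3), delays (2, 0, 1).
g = nx.DiGraph()
g.add_node("A", t=1); g.add_node("B", t=2); g.add_node("C", t=3)
g.add_edge("A", "B", d=2); g.add_edge("B", "C", d=0); g.add_edge("C", "A", d=1)
print(clock_period(g))                  # -> 5 (the chain B -> C)
print(delay_ratio(g, ["A", "B", "C"]))  # -> 2.0 (= 6 / 3)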

2.5.2 Critical Cycle

The critical cycle is the cycle with the highest computation delay ratio among all the cycles in the graph. It signifies the part of the process which takes the greatest amount of time. The computation delay ratio of the critical cycle is called the iteration bound. For any scheduling algorithm, the clock period of the schedule cannot be less than the iteration bound of the data flow graph; in other words, the iteration bound is the best achievable clock period for the given graph. It can be computed using the equation shown below [4]:

r(c) = max_{D ⊆ G} { r(D) }    (2.5)

Here, r(c) is the iteration bound and D ranges over all the simple cycles of the graph G.

2.6 Synchronous Data Flow Graphs

There are processes in which each operation needs to produce and consume multiple units of data. Data flow graphs cannot represent these applications, since each node in a DFG produces and consumes a single unit of data. To address this situation, Lee in [12] introduced the concept of synchronous data flow graphs, in which each node can produce or consume multiple tokens of data on every iteration. The number of tokens produced or consumed by each node is predetermined.

2.6.1 Definition

Synchronous data flow graphs are an extension of data flow graphs in which each edge is assigned two additional parameters denoting the number of data tokens produced and consumed at either end. An SDFG is represented as G(V, E, d, t, p, c), where the parameters are as follows:

V is the set of operations, denoted by the nodes of the graph.

E is the set of edges that connect the nodes and determine the dependencies between the operations.

d(e) denotes the number of delays on edge e; for any edge, d(e) ≥ 0.

t(v) denotes the number of time steps required to compute operation v.

p(e) denotes the number of data tokens produced by the source node of edge e.

c(e) denotes the number of data tokens consumed by the sink node of edge e.

For example, consider the SDFG in Figure 2.3.

Figure 2.3 Synchronous data flow graph example

In the figure, {A, B, C} is the set of vertices, and the numbers above the nodes give the computation times of the operations. The number of delays d(e) on each edge e is denoted by the vertical bars cutting across the edges. On each edge we find two numbers giving the number of data tokens produced and consumed on that edge. In Figure 2.3, node A produces a single data token on its outgoing edge while node B consumes 2 data tokens per iteration.

2.6.2 Topology Matrix

A topology matrix stores the information about every edge in the synchronous data flow graph. It is an |E| × |V| matrix, where |E| is the number of edges in the graph and |V| is the number of vertices. The entry in position (i, j) indicates the number of tokens produced or consumed by vertex j on edge i: a positive value indicates tokens produced, a negative value indicates tokens consumed, and zero indicates that the vertex does not produce or consume data on that edge. For example, the topology matrix for the SDFG shown in Figure 2.3 is given in

Figure 2.4.

Figure 2.4 Topology matrix for the SDFG in Figure 2.3

2.6.3 Basic Repetition Vector

The authors in [7] prove that a sequential schedule can be constructed for an SDFG if the rank of its topology matrix is one less than the number of nodes in the graph. In that case we can find a vector q of integers in the null space of the topology matrix, called a repetition vector. The repetition vector with the smallest norm is called the basic repetition vector (BRV). The number of elements in the basic repetition vector equals the number of vertices of the graph, and each element q_i indicates the number of copies of node v_i that must be scheduled in each iteration. The basic repetition vector for the topology matrix in Figure 2.4 is [2 1 1]; hence we need to schedule two copies of node A, one copy of node B and one copy of node C in every iteration of the iterative process.
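As an illustration, the BRV can be computed as the smallest positive integer vector in the null space of the topology matrix. The sketch below uses sympy; the matrix entries are chosen to be consistent with Figure 2.3 as described in the text, and since the figure itself is not reproduced here they should be treated as an assumption:

from sympy import Matrix, ilcm, igcd

def basic_repetition_vector(topology):
    # topology: |E| x |V| matrix with +p(e) for the producing vertex,
    # -c(e) for the consuming vertex, and 0 elsewhere.
    null = Matrix(topology).nullspace()
    if len(null) != 1:            # consistency: rank must be |V| - 1
        return None
    scale = 1
    for x in null[0]:
        scale = ilcm(scale, x.q)  # x.q is the rational denominator
    q = [int(x * scale) for x in null[0]]
    g = 0
    for x in q:
        g = igcd(g, abs(x))
    q = [x // g for x in q]       # reduce to the smallest integer vector
    if q[0] < 0:                  # a valid BRV is strictly positive
        q = [-x for x in q]
    return q

# Hypothetical entries consistent with Figure 2.3 as described in the text;
# rows are the edges A->B, B->C, C->A, columns are A, B, C.
print(basic_repetition_vector([[1, -2, 0],
                               [0, 1, -1],
                               [-1, 0, 2]]))  # -> [2, 1, 1]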

2.6.4 Equivalent Homogeneous Graph

In an SDFG, each operation can produce and consume multiple tokens of data, which makes the operations difficult to schedule. To address this problem, equivalent homogeneous graphs were introduced in [7]; they convert the SDFG into a graph in which each node produces or consumes a single token of data. The topology matrix and the basic repetition vector play an important role in this conversion: the number of copies of each node to be scheduled is obtained from the BRV, and the delays are distributed among the connecting edges of the EHG.

2.6.5 Consistency and Liveness of an SDFG

An SDFG is said to be consistent if it has a basic repetition vector associated with it. The graph cannot be scheduled without this vector, since otherwise we do not know the number of copies of each node that must be scheduled. Consider the SDFG in Figure 2.5 [7] and its topology matrix representation in Figure 2.6.

Figure 2.5 Inconsistent SDFG

Figure 2.6 Topology matrix

From the topology matrix given above, we can observe that on each iteration, each node fires exactly once before node A blocks the system waiting for a second token, which leads to a deadlock. Hence the SDFG shown above is not consistent.

It was demonstrated in [7] that an SDFG is live if the EHG created from it has no zero-delay cycles. Any cycle with zero delay leads to a deadlocked state from which it is impossible to derive a static schedule. For example, consider the synchronous data flow graph in Figure 2.7. The basic repetition vector for this SDFG is [1 2], so we need to schedule 1 copy of A and 2 copies of B in each iteration. The EHG constructed from the SDFG is shown in Figure 2.8 [7]. The delays are distributed between the edges B1→A and B2→A. We observe that the loop between the nodes A and B2 contains no delays, leading to a deadlocked state.

Figure 2.7 SDFG

Figure 2.8 EHG of the SDFG in Figure 2.7
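Consistency can be checked mechanically with the basic_repetition_vector sketch above: an inconsistent graph has a topology matrix of full rank, so the null space is empty and the function returns None. The matrix below is a hypothetical full-rank example, since Figure 2.6 is not reproduced here:

# A hypothetical full-rank topology matrix: no nonzero q satisfies the
# balance equations, so the sketch above reports inconsistency.
print(basic_repetition_vector([[1, -2, 0],
                               [0, 1, -1],
                               [-2, 0, 1]]))  # -> None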

2.6.6 The Iteration Bound for an SDFG

As defined earlier, the clock period of a data flow graph is the length of its longest zero-delay sequence. This definition cannot be applied directly to an SDFG, since it applies only to single-rate graphs, so we define the clock period of an SDFG as the clock period of its EHG. We face a similar situation when minimizing the clock period of an SDFG, since an iteration of a synchronous data flow graph completes only when all the nodes of its EHG have executed. The average computation time of an iteration is called the iteration period of the SDFG. If an SDFG contains a loop, then the iteration period is bounded from below by the iteration bound, which is the maximum computation delay ratio over all the loops present in the EHG.

For example, consider the SDFG in Figure 2.9 and its EHG in Figure 2.10. The EHG consists of three loops: (A1, B, C) and (A2, B, C), each with total computation time 4 and delay count 1, and (B, C), with computation time 2 and delay count 1. The first two loops have the maximum time-to-delay ratio of 4; hence the iteration period of the SDFG is bounded below by 4.

Figure 2.9 Synchronous data flow graph

Figure 2.10 EHG of the SDFG in Figure 2.9

In [5], an equation was developed to find the iteration bound of a synchronous data flow graph without converting it to an EHG:

I(G) = max_{l ⊆ G} [ Σ_{v ∈ l} t(v) ] / [ Σ_{e: u→v ∈ l} d(e) / max(q_u, q_v) ]    (2.6)

In this equation, l ranges over the loops present in the graph G, e: u→v ranges over the edges present in the loop l, and q_u and q_v are the elements of the basic repetition vector corresponding to the vertices u and v.
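Equation 2.6 translates directly into code. The sketch below enumerates simple cycles with networkx; the example graph is hypothetical (Figure 2.9 is not reproduced here), with times, rates and delays chosen so that the loop ratio comes out to the value 4 discussed above:

import networkx as nx

def iteration_bound(g, q):
    # Equation (2.6): over every simple cycle, total computation time
    # divided by the delay count normalized by max(q_u, q_v).
    best = 0.0
    for cycle in nx.simple_cycles(g):
        t_sum = sum(g.nodes[v]["t"] for v in cycle)
        d_sum = sum(g.edges[u, v]["d"] / max(q[u], q[v])
                    for u, v in zip(cycle, cycle[1:] + cycle[:1]))
        if d_sum > 0:
            best = max(best, t_sum / d_sum)
    return best

# Hypothetical three-node loop: t(A)=2, t(B)=t(C)=1, two delays on C->A.
g = nx.DiGraph()
g.add_node("A", t=2); g.add_node("B", t=1); g.add_node("C", t=1)
g.add_edge("A", "B", d=0); g.add_edge("B", "C", d=0); g.add_edge("C", "A", d=2)
print(iteration_bound(g, {"A": 2, "B": 1, "C": 1}))  # -> 4.0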

2.7 Summary

In this chapter, we discussed several data structures that can be used to represent iterative processes, along with metrics such as schedule length, clock period and iteration bound that measure the efficiency of an iterative process. We also discussed features of a synchronous data flow graph such as liveness, consistency, the basic repetition vector and the topology matrix, and explained the importance of these features in scheduling an iterative process represented by an SDFG.

CHAPTER III

IMPLEMENTATION

In this chapter, we discuss the mechanisms and methodologies used in implementing the proposed techniques, demonstrated with the help of algorithms.

3.1 Retiming

The retiming technique was introduced by Leiserson and Saxe in [1] for designing VLSI circuits. The basic idea of retiming is to redistribute the delays in a circuit in order to minimize the clock period of the process. It is applied by pulling delays from the incoming edges of a node and pushing them onto its outgoing edges. The delays become more evenly distributed across the data flow graph, which shortens the longest zero-delay path in the circuit and thereby reduces the clock period. The iteration bound remains the same even after retiming.

3.2 Scheduling Operations in an SDFG

The scheduling of operations in a synchronous data flow graph is an important part of the rotation scheduling algorithm. A scheduling algorithm based on list scheduling from [11] is given as follows:

Algorithm 3.1 Schedule-SDFG (G)
Input: The SDFG G(V, E, t, d, p, c); the basic repetition vector (BRV) of the SDFG; the different types of functional units F1 ... Fn; the number of copies of each functional unit.
Output: The schedule of all the operations in the SDFG.

1. Construct the SDAG by removing all the edges with non-zero delays.

2. Schedule the vertices which have a dependency between them.
   t = 0                             /* initialize the time to zero */
   for each node a = 1 to n in the SDAG {
       for each copy of the node j = 1 to Q(Va) {
           S(Va) = t                 /* schedule operation Va with start time t */
           temp = t
           k = 1
           Fnk(m = temp to t) = Va   /* assign the operation to the functional unit */
           k = k + 1                 /* advance k after scheduling each copy */
           if (j = NFa) {
               t = t + T(Va)
               Q(Va) = Q(Va) − NFa
               k = 1, j = 1
           }
       }
   }

3. Schedule vertices in other dependency chains, if the SDAG has more than one.
   t = 0
   for each node a = 1 to n in the SDAG {
       for each copy of the node j = 1 to Q(Va) {
           for each copy of the functional unit {
               if Fnk(t, t + T(Va)) = NOP
                   Fnk(t, t + T(Va)) = Va, S(Va) = t
           }
       }
       if (j = Q(Va)) then t = t + T(Va) else t = t + 1
   }

4. Schedule the vertices which have no dependencies with other vertices.
   for each node x = 1 to m {        /* operations with no dependencies */
       t = 0                         /* reset the timer for each operation */
       for each copy y = 1 to Q(Vm) {
           search each functional unit k of the corresponding operation type
               for a continuous run of T(Vm) NOPs
           if found, assign operation Vm to functional unit k
           k = k + 1                 /* move to the next functional unit */
           if (k = NFUa)             /* if all the functional units are visited */
               k = 1
       }
   }

The notations used in the above scheduling algorithm are:

S(Va): the starting time of operation Va.
Q(Va): the number of copies that must be scheduled for operation Va.
T(Va): the time required to compute operation Va.
Fni: the i-th copy of the functional unit of type n.

This algorithm takes several crucial factors into consideration, including the total number of operations, the basic repetition vector and the different types of functional units. The basic idea of the algorithm is to create a synchronous directed acyclic graph (SDAG) for the given SDFG and schedule all the operations present in the acyclic graph. This step ensures that all the dependencies between the operations are preserved. The time complexity of this algorithm is O(|V|), where |V| is the number of operations to be scheduled [11].

3.3 Example to Demonstrate the Scheduling Algorithm

We now explain the above scheduling algorithm with an example to help convey the basic idea behind the scheduling. We attempt to schedule the synchronous data flow graph in Figure 3.1.

Figure 3.1 SDFG to be scheduled

The basic repetition vector of the graph in Figure 3.1 is [2 1 1]. In this example we assume the hardware has two kinds of functional units, U+ and U*, with 2 units of type U+ and 1 unit of type U*. The first step of the scheduling algorithm is to construct the SDAG for the given synchronous data flow graph; the SDAG of the SDFG in Figure 3.1 is shown in Figure 3.2.

Figure 3.2 SDAG for the SDFG in Figure 3.1

We now schedule the operations A and B without losing the dependencies between them. A1 and A2 are the 2 copies of operation A that need to be scheduled. The start time of A1 is S(A1) = 1, and it is allocated to the first copy of functional unit U+. A2 is allocated to the second copy of U+, since that unit is still unallocated; hence the start time of A2 is S(A2) = 1. Now all the copies of A have been scheduled, taking 1 time unit. We next schedule operation B on functional unit U+: it starts at time unit 2 and is allocated to the first copy of U+. This finishes scheduling the operations along the SDAG's dependency chain. The timer is reset to 1, and we continue by scheduling the remaining operations of the SDAG. The only remaining operation is C, which is allocated to functional unit U* with start time S(C) = 1. A sketch of this allocation loop appears below; the final schedule of the SDFG is then given in Table 3.1.
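The following Python sketch captures the allocation pattern just described: each copy of an operation is placed on the earliest free functional unit of its type, no earlier than the finish of its predecessors. It is a simplification of Algorithm 3.1 (it folds steps 2 to 4 into one greedy loop), and all names and data structures are our own assumptions:

from collections import defaultdict

def schedule_sdag(order, q, t, unit_of, num_units):
    # Visit operations in dependency (topological) order and place each
    # of the q[v] copies on the earliest free unit of the right type.
    free_at = defaultdict(lambda: defaultdict(int))  # type -> unit -> next free step
    done_at = {}                                     # op -> time all copies finish
    placed = {}
    for v, preds in order:                           # (operation, predecessors)
        ready = max((done_at[p] for p in preds), default=0)
        finish = ready
        for copy in range(q[v]):
            kind = unit_of[v]
            k = min(range(num_units[kind]),
                    key=lambda i: max(free_at[kind][i], ready))
            start = max(free_at[kind][k], ready)
            free_at[kind][k] = start + t[v]
            placed[(v, copy)] = (kind, k, start)
            finish = max(finish, start + t[v])
        done_at[v] = finish
    return placed

# Figure 3.1's example: BRV [2 1 1], two U+ units, one U* unit;
# A -> B is the dependency chain and C is independent (steps from 0).
order = [("A", []), ("B", ["A"]), ("C", [])]
print(schedule_sdag(order,
                    {"A": 2, "B": 1, "C": 1},        # copies (BRV)
                    {"A": 1, "B": 1, "C": 2},        # computation times
                    {"A": "U+", "B": "U+", "C": "U*"},
                    {"U+": 2, "U*": 1}))
# -> A copies at step 0 on both U+ units, B at step 1, C at step 0 on U*,
#    matching Table 3.1 with time steps counted from 0 instead of 1.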

Table 3.1 Schedule of the SDFG in Figure 3.1

Time Step   U+    U+    U*
1           A1    A2    C
2           B     NOP   C

Rotation scheduling is a resource-constrained retiming technique for producing an optimum clock period for a data flow graph. It is a heuristic technique that combines retiming and scheduling by performing cumulative retimings in a particular order. In this research, we demonstrate the application of this technique to achieve optimum schedules in a short amount of time. The primary operation in the rotation scheduling technique is called down rotation. When a down rotation of length N is applied to a synchronous data flow graph, it extracts all the nodes scheduled in the first N time steps of the schedule and applies the retiming technique to those nodes. The rotation scheduling algorithm below explains the procedure.

Algorithm 3.2 Down_Rotate (G, S_G, l, q)
Input: G(V, E, t, d, p, c) is an SDFG; S_G is the schedule of G; l is the number of time steps by which to rotate the graph; q is the basic repetition vector (BRV) of the SDFG.
Output: The rotated SDFG.

X ← { v ∈ V | S(v) − min_{u ∈ V} { S(u) } < l }   /* nodes scheduled within time step l */
I ← { u → v ∈ E | u ∈ V \ X, v ∈ X }              /* incoming edges of the rotated nodes */
O ← { u → v ∈ E | u ∈ X, v ∈ V \ X }              /* outgoing edges of the rotated nodes */
for all v ∈ X do
    extract (S_G, v)
end for
for all incoming edges e ∈ I do
    d(e) ← d(e) − q_x        /* decrement the delay by q_x, the repetition count of the rotated node x */
end for
for all outgoing edges e ∈ O do
    d(e) ← d(e) + q_x · p(x) /* increment the delay by q_x times the tokens produced by x */
end for
Schedule (S_G, G, X)         /* reschedule the retimed vertices of the graph */

The down rotation algorithm is applied in a series of cumulative transformations, and the lengths of the resulting schedules are recorded to check whether the schedule length improves.

3.4 Example to Demonstrate the Down Rotate Operation

We now explain the Down_Rotate algorithm with the aid of an example to familiarize the reader with the operation. Consider the synchronous data flow graph shown in Figure 3.3.

Figure 3.3 Example SDFG

The iterative process represented by the SDFG in Figure 3.3 consists of 3 operations, A, B and C. Operations A and B are of a similar kind, represented by square boxes, and each takes 1 time unit to complete. Operation C is a different kind of operation, represented by a circle, and takes 2 time units to complete. The topology matrix for the graph is shown in Figure 3.4.

Figure 3.4 Topology matrix for the SDFG in Figure 3.3

From this topology matrix, the basic repetition vector is computed as q = [2 1 1], giving q_A = 2, q_B = 1 and q_C = 1. So we need to schedule 2 copies of A and 1 copy each of B and C in every iteration.
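The delay-update rule of Algorithm 3.2 is small enough to sketch directly. The code below assumes a networkx DiGraph whose edges carry the delay d and production rate p, plus a map from each operation to its scheduled start time; the concrete rates and delays in the example are our own guesses, since Figure 3.3 is not reproduced here:

import networkx as nx

def down_rotate(g, start, l, q):
    # Rotate the operations whose start time falls within the first l
    # steps: their incoming edges lose q_x delays and their outgoing
    # edges gain q_x * p(x) delays (x being the rotated endpoint).
    first = min(start.values())
    rotated = {v for v in g.nodes if start[v] - first < l}
    for u, v, attrs in g.edges(data=True):
        if v in rotated and u not in rotated:    # incoming edge of v
            attrs["d"] -= q[v]
        elif u in rotated and v not in rotated:  # outgoing edge of u
            attrs["d"] += q[u] * attrs["p"]
    # The rotated nodes would then be rescheduled with Schedule-SDFG.

g = nx.DiGraph()
g.add_edge("A", "B", p=1, d=0)   # rates and delays assumed for illustration
g.add_edge("B", "C", p=1, d=0)
g.add_edge("C", "A", p=2, d=4)
down_rotate(g, {"A": 1, "B": 3, "C": 4}, 1, {"A": 2, "B": 1, "C": 1})
print(list(g.edges(data="d")))   # C->A drops to 2, A->B gains q_A * p(A) = 2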

Now let us rotate the graph with time step 1. In order to find the operations scheduled within time step 1, we need the initial schedule for the graph. The process has the dependency chain A → B → C. The initial schedule, with one functional unit for each kind of operation, is given in Table 3.2 below; U+ is the functional unit for the operations with computation time 1 and U* is the functional unit for the operation with computation time 2.

Table 3.2 Initial schedule for the SDFG in Figure 3.3

Time Step   U+    U*
1           A1    NOP
2           A2    NOP
3           B     NOP
4           NOP   C
5           NOP   C

The operation scheduled within time step 1 is A. Its sets of incoming and outgoing edges are I_A = {C→A} and O_A = {A→B}, where I_x and O_x denote the sets of incoming and outgoing edges of vertex x. The next step is to redistribute the delays along these edges: following the algorithm, we decrement the delays on edge C→A by q_A and increment the delays on edge A→B by q_A · p(A). The new SDFG after a single down rotation is shown in Figure 3.5.

Figure 3.5 Rotated SDFG

The dependency in the new graph is B → C. The schedule for the SDFG in Figure 3.5 is shown in Table 3.3 below.

Table 3.3 Schedule for the SDFG after a single rotation

Time Step   U+    U*
1           B     NOP
2           A1    C
3           A2    C

The above steps describe the basic idea behind the rotation scheduling operation. A series of applied down rotations is termed a rotation phase; the number of down rotations to apply, η, is an experimentally determined constant. The length of the schedule is reduced after each down rotation provided the hardware has enough idle time, so the number of time steps used for each down rotation must be adjusted after every step. The rotation phase algorithm described in [4] is shown below.

Algorithm 3.3 Rotation_Phase (G, S_G, l, η, S_opt, q)
Input: G(V, E, t, d, p, c) is a consistent and live SDFG; S_G is the schedule of G; l is the number of time steps by which to rotate the graph; η is the number of down rotations to apply; q is the basic repetition vector (BRV) of the SDFG.
Output: The rotated SDFG G after a series of down rotations has been applied to the initial SDFG, and the optimal schedule S_opt.

S_opt ← S_G              /* initialize S_opt with the initial schedule */
for i = 1 to η do
    while l ≥ length(S_G) do
        l ← l / 2
    end while
    Down_Rotate (G, S_G, l, q)
    if length(S_G) < length(S_opt) then
        S_opt ← S_G
    end if
end for
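In Python the phase loop is equally brief. The sketch below assumes a down_rotate that also reschedules the rotated nodes and returns the updated start-time map (the earlier sketch showed only the delay update), and it measures schedules with equation 2.2:

def schedule_length(start, t):
    # Equation (2.2): latest finish minus earliest start.
    return max(s + t[v] for v, s in start.items()) - min(start.values())

def rotation_phase(g, start, l, eta, q, t, down_rotate):
    # Algorithm 3.3: apply eta cumulative down rotations, shrinking the
    # rotation window whenever it reaches the current schedule length,
    # and keep the best schedule seen.
    best = dict(start)
    for _ in range(eta):
        while l >= schedule_length(start, t):
            l //= 2
        start = down_rotate(g, start, l, q)
        if schedule_length(start, t) < schedule_length(best, t):
            best = dict(start)
    return best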

Algorithm 3.4 Heuristic RH1 (G, η, κ) Input: G is a SDFG, c is the number of down rotations that need to be applied in each rotation phase. e is the range of Rotation phases to be tried. Output: The optimal solution found, is stored in Sopt. Sini Initial Schedule(SDFG) L Length (Sini) Sopt S ini for i = 1 to κ. L do S G S ini Rotation_Phase (G, SG, i, η, Sopt). end for In the above rotation heuristic, it can be observed that each time after the rotation phase is applied, the Schedule S G is re-initialized to the original Schedule S ini ; i.e.; the rotation phases are applied independently to the original schedule and any improvements are recorded as S opt. By applying the rotation phases cumulatively, we can avoid duplicate calculations that occur during the RH1 heuristic. 3.6 Limitations of the Heuristic Model The rotation heuristics discussed above have a few limitations when implemented on the synchronous data flow graphs which are described in [4]. The first one is the time required to implement the above algorithms, which depends on the values η and κ. The algorithm generally produces good results for higher values of η and κ. The time complexity of the heuristic RH1 can be give as O (η κ V E ) [6] where κ is the range of rotation phases that are implemented, η is the number of down rotations that are applied 33

in each rotation phase, and |V| and |E| are the numbers of nodes and edges, respectively. Hence, any increase in η or κ increases the running time of the algorithm. The values η and κ are experimentally determined constants, and finding correct values for them is a matter of trial and error. Another issue with this approach is that no attempt is made to stop early if the optimal value is reached before the heuristic completes. A further drawback is that the same schedule may be encountered over several iterations: no mechanism is implemented to reuse an already-computed schedule, so a schedule must be re-computed every time it occurs.

3.7 Summary

In this chapter, we developed an algorithm, Down_Rotate, which rotates a given SDFG by a time step l and produces a better schedule for the graph, provided there is enough idle time in the hardware to support the down rotation. We also developed an algorithm to generate a static schedule for an SDFG on limited hardware, considering factors such as the kinds of functional units available, the number of copies of each functional unit and the basic repetition vector of the SDFG. Finally, we discussed a rotation heuristic that helps produce all possible schedules for an SDFG and explained some of its drawbacks.

CHAPTER IV

RESULTS

In this chapter, we apply our findings to some examples of synchronous data flow graphs.

4.1 First Example

In this example, we apply the rotation scheduling algorithm to the SDFG given in Figure 4.1.

Figure 4.1 Synchronous data flow graph

In this synchronous data flow graph, there are 3 operations to be scheduled, denoted A, B and C. Operations A and B are drawn as squares and operation C as a circle, indicating different types of

functional units. The topology matrix for the SDFG in Figure 4.1 is shown in Figure 4.2 below.

Figure 4.2 Topology matrix of the SDFG in Figure 4.1

The basic repetition vector for this SDFG is [1 2 1]; that is, we need to schedule 1 copy of A, 2 copies of B and 1 copy of C. The time required to compute each node is t(A) = 1, t(B) = 1 and t(C) = 2. We assume there is one adder functional unit (for A and B) and one multiplier (for C). To schedule the graph, we first remove the edges with non-zero delays and form the synchronous directed acyclic graph (SDAG) shown in the figure below.

Figure 4.3 SDAG for the graph in Figure 4.1
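As a cross-check, the basic_repetition_vector sketch from Chapter II reproduces this BRV from a topology matrix whose entries balance with q = [1 2 1]; the entries are hypothetical, since Figure 4.2 is not reproduced here:

# Rows are hypothetical edges A->B (p=2, c=1), B->C (p=1, c=2) and
# C->A (p=1, c=1); columns are A, B, C.
print(basic_repetition_vector([[2, -1, 0],
                               [0, 1, -2],
                               [-1, 0, 1]]))  # -> [1, 2, 1]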

We first schedule the operations in the SDAG in their order of execution, making sure that the operations A, B and C are scheduled in exactly that order. The schedule for the graph is given in Table 4.1 below.

Table 4.1 Initial schedule of the SDFG

Time Step   U*    U+
1           NOP   A
2           NOP   B1
3           NOP   B2
4           C     NOP
5           C     NOP

The total time taken by the schedule is 5 time units; B1 and B2 are the 2 copies of B that must be scheduled. Now we down rotate the SDFG with time step 1. The only operation within time step 1 is A, so we rotate operation A by extracting q_A · 1 delays from its incoming edges and pushing q_A · p(A) delays onto its outgoing edges. We get the following SDFG after rotating node A.

Figure 4.4 SDFG after a single rotation

The synchronous directed acyclic graph for this SDFG is given in Figure 4.5 below.

Figure 4.5 SDAG after a single rotation

We therefore schedule B and C first and then schedule operation A.

Table 4.2 Schedule of the SDFG after a single rotation

Time Step   U*    U+
1           NOP   B1
2           NOP   B2
3           C     A
4           C     NOP

In the above schedule, U* is the functional unit for operation C and U+ is the functional unit for operations A and B. Now we rotate operation B, since it is the only node scheduled in time step 1. This pulls q_B delays from the edge A → B and pushes delays onto the outgoing edge B → C. The SDFG after the second rotation is shown in Figure 4.6.

Figure 4.6 SDFG after the second rotation

The scheduling algorithm first schedules the SDAG of this graph, shown in Figure 4.7.

Figure 4.7 SDAG after the second rotation

The updated schedule is shown in Table 4.3.

Table 4.3 Schedule of the SDFG after the second rotation

Time Step   U*    U+
1           C     A
2           C     B1
3           NOP   B2

Hence, from this example, we infer that the length of the schedule for an SDFG can be significantly improved using the rotation scheduling algorithm, provided there is enough idle time in the hardware to support the down rotate operation.

4.2 Second Example

We now look at a more complex synchronous data flow graph and apply the rotation scheduling algorithm to it. Consider the synchronous data flow graph in Figure 4.8.

Figure 4.8 Initial synchronous data flow graph

In the graph shown in Figure 4.8, {A, B, C, D, E, F} is the set of operations. The corresponding topology matrix is given in Figure 4.9.

Figure 4.9 Topology matrix for the SDFG in Figure 4.8

The basic repetition vector for this SDFG can be found from the topology matrix in Figure 4.9: q = [16 1 1 1 4 1]. Hence, in each iteration, 16 copies of A, 4 copies of E and one copy each of B, C, D and F must be scheduled. There are 2 types of operations in this iterative process: the operations in S = {A, B, D, E} each require 1 time unit to complete, and the operations in T = {C, F} require 2 time units. Since there are two types of operations, we need 2 kinds of functional units, U+ and U*, where U+ serves the operations in S and U* serves those in T. We assume the given hardware consists of 2 copies of functional unit U+ and one copy of functional unit U*. The initial schedule of the SDFG can be found using the algorithm Schedule-SDFG; first we generate the synchronous directed acyclic graph shown in Figure 4.10.

Figure 4.10 SDAG of the SDFG in Figure 4.8

It can be observed from Figure 4.10 that B cannot be scheduled until operations A and F have finished, and A cannot be scheduled until operation E has finished. The initial schedule for the SDFG is given in Table 4.4.

Table 4.4 Initial schedule for the SDFG

Time Step   U+    U+    U*
1           E1    E2    F
2           E3    E4    F
3           A1    A2    NOP
4           A3    A4    NOP
5           A5    A6    NOP
6           A7    A8    NOP
7           A9    A10   NOP
8           A11   A12   NOP
9           A13   A14   NOP
10          A15   A16   NOP
11          B     D     NOP
12          NOP   NOP   C
13          NOP   NOP   C

We now apply the down rotation algorithm to the SDFG with time step 1. The operations scheduled within time step 1 are E and F. We rotate the graph by pulling q_E delays from the incoming edge D → E and pushing (q_E · 4) delays onto the outgoing edge E → A. Similarly, we pull q_F delays from D → F and push (q_F · 1) delays onto the edge F → B. The SDFG then transforms into the graph in Figure 4.11.

Figure 4.11 SDFG after a single rotation

The SDAG for this graph is shown in Figure 4.12.

Figure 4.12 SDAG after the first rotation

Now the order of execution for the operations is A → B → C and D → F. The schedule obtained from the scheduling algorithm is given in Table 4.5.

Table 4.5 Schedule of the SDFG after a single rotation

Time Step   U+    U+    U*
1           D     A1    NOP
2           A2    A3    F
3           A4    A5    F
4           A6    A7    NOP
5           A8    A9    NOP
6           A10   A11   NOP
7           A12   A13   NOP
8           A14   A15   NOP
9           A16   E1    NOP
10          B     E2    NOP
11          E3    E4    C
12          NOP   NOP   C

We observe that the time taken to complete the iteration has decreased compared to the original clock period. We continue to rotate the SDFG with time step 1. The operations scheduled at time step 1 are D and A. We rotate the graph by pulling q_D delays from the incoming edge C → D and pushing (q_D · 4) delays onto the outgoing edge D → E and (q_D · 1) delays onto the other outgoing edge D → F. Similarly, we pull q_A delays from E → A and push (q_A · 1) delays onto the edge A → B. The SDFG then transforms into the graph in Figure 4.13.

Figure 4.13 SDFG after the second rotation

We generate the SDAG for this graph in order to schedule it using the Schedule-SDFG algorithm; the SDAG is shown in Figure 4.14.

Figure 4.14 SDAG after the second rotation

The dependencies we now have are E → A and B → C → D. The schedule can be found using the scheduling algorithm and is shown in Table 4.6.