Code generation for a coarse-grained reconfigurable architecture


Eindhoven University of Technology, MSc thesis
Code generation for a coarse-grained reconfigurable architecture
Adriaansen, M.
Award date: 2016

Code generation for a Coarse-Grained Reconfigurable Architecture

Michaël Adriaansen, Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands. m.adriaansen@student.tue.nl

Abstract: Good tool support is essential for computing platforms because it increases programmability. This is especially the case for reconfigurable architectures, because applications need to be mapped onto the architecture for each configuration individually. This paper introduces an LLVM-based compiler backend for Coarse-Grained Reconfigurable Arrays (CGRAs). The CGRA compiler must be retargetable to support all possible architecture configurations. Instruction scheduling is done using an operation-based list scheduling algorithm which uses a resource graph. The scheduler uses an improved version of Dijkstra's algorithm for finding operand paths. Energy efficiency is taken into account by using the explicit bypassing capabilities of the hardware. The number of register file operations is decreased by 23% on average.

I. INTRODUCTION

Mobile embedded devices are constrained by the energy available in the battery. For a typical mobile phone, the power budget available to applications is around 1 Watt [1] for a typical battery lifetime. The energy efficiency of such a platform therefore constrains the amount of computation that applications can perform. Although Application Specific Integrated Circuits (ASICs) can be used to provide high performance and high energy efficiency, they are often not a viable option because of the lack of flexibility they provide. Programmability becomes increasingly important as available development time shrinks. General purpose processors provide a high level of flexibility and programmability but lack compute performance, energy efficiency, or both. In general purpose processors a large percentage of the overhead can be attributed to instruction fetching and decoding, but even more so to data movement between memories, caches and register files [2]. Field Programmable Gate Arrays (FPGAs) can avoid much of this type of overhead by allowing spatial mapping of applications, but their fine-grained reconfigurability leads to a high configuration cost. For most signal processing applications, FPGAs have a finer granularity of reconfigurability than strictly necessary. Coarse-Grained Reconfigurable Arrays (CGRAs) require fewer configuration bits, due to their coarser-grained units, which results in a lower energy consumption while still allowing spatial application mapping [3].

In general, the programmability of a device depends on the tools that are provided for programming it. Most CGRAs can be seen as reconfigurable processors. For the programmer this is an advantage because of familiarity with such architectures. The benefit over general purpose processors is that the processor can be adapted to the application after manufacturing, which results in a higher energy efficiency for a wide range of applications. However, manually mapping an application onto a single platform instantiation is a difficult and time-consuming process. Thus, in order for the programmer to make efficient use of the configured architecture, we require a compiler which translates a high-level language (e.g. C) into the highly parallel CGRA instructions. The compiler for this architecture needs to be very flexible, as it can make few assumptions about the architecture of the configured processor.
Two architectural aspects of CGRAs, aimed at reducing energy consumption, have previously proved difficult to exploit for a high-level compiler: explicit register-file bypassing and distributed register files. Explicit register-file bypassing significantly decreases the energy consumption of a processor architecture by transferring intermediate results of computations directly from one function unit into the next, avoiding the energy cost of writing into (and reading from) the register file [4]. Distributed register files allow the processor architecture to use multiple smaller register files, which are less costly than one single large register file in both area and energy consumption [5]. Such register files prove challenging for compilers when not all function units have access to all register files [6].

This paper introduces the compiler for a CGRA architecture, which is being developed using the popular LLVM compiler framework [7]. The design of the backend is presented along with improvements that can be made. The paper is organized as follows. Section II discusses related work. Section III describes the architecture of the target CGRA. Section IV shows the features of this architecture with an example kernel and demonstrates the difficulties encountered during code generation. Section V then describes how these code generation aspects can be achieved using LLVM and explains the various target specific optimizations that were added. Section VI discusses the current state of the compiler. Section VII provides an overview of improvements that can be made in the future.

II. RELATED WORK

Several research projects have proposed architectures with explicit register-file bypassing in the past, some of which have already used (parts of) LLVM. Transport Triggered Architectures (TTAs) are closely related to the CGRA architecture described in this paper in the way code needs to be generated by the compiler. In particular, TTAs also have an explicit datapath, so the task of the compiler is to map operations to function units and find paths (which may or may not pass

through the register file) between function units to route the operands. TTAs are presented in [8]; a compiler is described in [9] and [6]. Newer versions of TTA architectures are also available as part of the TCE framework [10], [11]. The Synchronous Transfer Architecture (STA) is described in [12] and is similar to TTA, but execution triggers are provided by a Very Long Instruction Word (VLIW)-like instruction set architecture. The architecture consists of multiple function units which have input and output ports to transfer data. Inputs are selected by multiplexers. Instructions are split into segments. Each segment controls a function unit and contains an execution trigger, information for the multiplexers and, optionally, an immediate value. A compiler approach is described in [13]. The compiler uses integer linear programming to schedule instructions. Another processor with an explicit datapath is the wide Single Instruction Multiple Data (SIMD) processor described in [14]. The datapath of the processing elements in the SIMD is similar to that of a Reduced Instruction Set Computing (RISC) architecture, but the pipeline registers are exposed to the compiler through the Instruction Set Architecture (ISA) as operand sources. By explicitly bypassing short-lived variables from the pipeline registers, many accesses to the register file can be avoided, which results in significant energy savings. Unlike a CGRA, however, the datapaths are fixed, requiring a less flexible compiler.

III. ARCHITECTURE

The CGRA consists of a host processor and reconfigurable logic, as shown in figure 1. The host processor is responsible for configuring the CGRA. Application data can be transferred between the two processors via the shared global memory. Loading the configuration data and instructions of the CGRA is also performed by the host processor. The configuration data contains a bitfile, similar to bitfiles for FPGAs, that configures the control and data network and the function unit behavior. At the core of the CGRA are the function units:

- Accumulate Branch Units (ABUs), which can be used both for control flow and as accumulators.
- Arithmetic Logic Units (ALUs), capable of performing basic arithmetic and logic operations.
- Immediate Units (IMMs), used for producing constant values required by the program.
- Load Store Units (LSUs), capable of loading and storing values from/into the attached local memory bank or the global memory.
- Multiplier Units (MULs), used for computing multiplications.
- Register Files (RFs), used for storing intermediate results.

Function units are controlled by Instruction Fetch and Instruction Decode units (IF/IDs). Each IF/ID unit has a dedicated instruction memory from which it reads a single instruction each cycle. A connection between an IF/ID unit and one or more function units can be established at configuration time. Multiple IF/ID-Function Unit (FU) groups are combined to form a VLIW processor. By connecting a single IF/ID unit to multiple function units, an SIMD configuration can be created.

Figure 1: Architecture overview [3]

Each function unit's inputs and outputs can be connected via the reconfigurable data network. Doing so allows the architecture to adapt to the datapath of the application and enables spatial mapping of the application onto the hardware.
Results generated by the function units do not need to be stored in a register file but can be routed directly to one or more function units that consume them; this creates an explicit datapath between function units. Although this can be very energy efficient, compilation for such an architecture is non-trivial.

Figure 2: Minimal architecture configuration (ABU, IMM, ALU, RF, LSU)

Figure 2 shows an instantiation of a minimal processor. It contains an IMM for generating immediate values that can be used for computations performed by the ALU. The register file can be used to store live values that cannot be kept inside the explicit datapath's registers. The LSU can load and store data from/to the local and global memories, and the ABU is used to calculate the program counter and perform branches. It can be observed that data is allowed to bypass certain function units; e.g. data produced by the LSU can be directly consumed by the ALU without passing through the register file. For compilation this means that there may exist multiple paths between a producing and a consuming function unit.
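To make the configuration concrete, the following C++ sketch models how such an instance could be described inside the compiler. This is illustrative only: the actual compiler loads this information from an XML file (see section V-B), and every type and field name here is an assumption, not the real format.

    #include <string>
    #include <vector>

    enum class UnitKind { ABU, ALU, IMM, LSU, MUL, RF };

    struct FunctionUnit {
        std::string name;   // e.g. "alu0"
        UnitKind kind;
        int numInputs;      // input ports on the data network
        int numOutputs;     // output ports on the data network
    };

    struct Connection {     // directed link in the reconfigurable data network
        std::string fromUnit;
        int fromPort;
        std::string toUnit;
        int toPort;
    };

    struct Configuration {
        std::vector<FunctionUnit> units;
        std::vector<Connection> network;
    };

    // The minimal processor of figure 2: the IMM and the LSU feed the ALU
    // directly (bypassing the RF), the RF holds spilled live values, and
    // the ABU computes the program counter.
    Configuration minimalConfig() {
        Configuration c;
        c.units = {{"imm0", UnitKind::IMM, 0, 1}, {"alu0", UnitKind::ALU, 2, 1},
                   {"rf0", UnitKind::RF, 1, 1},   {"lsu0", UnitKind::LSU, 2, 1},
                   {"abu0", UnitKind::ABU, 1, 0}};
        c.network = {{"imm0", 0, "alu0", 0}, {"lsu0", 0, "alu0", 1},
                     {"alu0", 0, "rf0", 0},  {"rf0", 0, "alu0", 1},
                     {"alu0", 0, "lsu0", 0}, {"alu0", 0, "abu0", 0}};
        return c;
    }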

Multiple processors may be instantiated at the same time, allowing multiple programs to run in parallel. To support multiple independent programs, no additional compiler features are required. To compile programs that can communicate with each other, address spaces need to be supported by the compiler.

IV. CGRA PROGRAMMING EXAMPLE

A binarization algorithm will be used to illustrate how code is executed on the CGRA. The pseudocode for the example algorithm is shown in algorithm 1. Binarization can be used to produce a black and white version of an image by comparing the pixel data with a threshold and writing zero or one to the output image, depending on the outcome of the comparison. In algorithm 1, each pixel of array A is compared with the threshold TH. The result of this comparison is written to the corresponding entry in array B. When the algorithm is finished, array B will contain a black and white version of the image in array A.

Algorithm 1 Binarization algorithm
Input: Array A with image, threshold TH
Output: Array B with binary image
1: procedure BINARIZATION(A[ ], B[ ], TH, Count)
2:   for int I = Count; I != 0; I = I - 1 do
3:     B[I] ← (A[I] >= TH)
4:   end for
5: end procedure

The custom architecture configuration that was used is shown in figure 3a, and a manually created schedule for this particular CGRA configuration is shown in figure 3b. Although binarization could be computed with the minimal architecture configuration shown in figure 2, this configuration uses more function units to reduce the loop body cycle count. In the binarization configuration two ALUs are used: one for the compare operation required for the actual binarization and a second for managing the loop counter. The for loop has been software-pipelined to show the full potential of the architecture. The loop body is located between the red lines. Labels in the immediate unit and register file indicate which variable is produced by that unit. Labels in the LSU, ALU and ABU indicate which operation each function unit performs. The left three function units are used for processing the data. Two values of array A are compared with the threshold each loop iteration. The result of the comparison is written to array B. The address calculation is done implicitly in the LSUs. The right four function units are used for the control flow of the loop. The variable I is reduced by one each cycle. When I is equal to zero, the branch is not taken, which ends the for loop. Software pipelining can be observed by looking at the critical path (LSU-ALU-LSU) of the kernel code, which is longer than the distance between the loop boundaries.

Figure 3: Binarization kernel on the CGRA. (a) Custom architecture configuration (LSU, ALU, IMM 1, IMM 2, ALU loop, RF, ABU). (b) Scheduled program instructions for binarization; red lines indicate the loop body.

Bypassing of the register file can be observed from the fact that only the loop iterator I is ever stored in the register file. All other program values are passed directly between function units. By extending the configuration with more function units, one extra IMM and ALU, a single-cycle loop body can be achieved. Doing so has a large energy advantage, since no new instructions need to be fetched and decoded while the loop body is executed.
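For reference, the kernel of algorithm 1 corresponds to the following plain C function. This is a sketch; the thesis compiles such code through LLVM, and the parameter names are taken directly from the pseudocode.

    // Binarization kernel as it could be written in C for the CGRA compiler.
    // A and B are assumed to hold Count + 1 elements (indices 0..Count),
    // matching the pseudocode's indexing; element 0 is never touched.
    void binarization(const int *A, int *B, int TH, int Count) {
        for (int I = Count; I != 0; I = I - 1) {
            B[I] = (A[I] >= TH);  /* 1 if the pixel reaches the threshold */
        }
    }

The schedule of figure 3b processes two pixels per iteration; that is a scheduling decision and is not visible at this level.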
In order to generate schedules such as the one in figure 3b, a compiler needs to have a number of features:

- It must be retargetable, to generate schedules for all possible CGRA configurations.
- Explicit bypassing of the register file must be supported by the scheduler.
- The scheduler must be operation based.
- Software pipelining must be supported, to generate efficient schedules.
- Multiple register files need to be supported, to allow scheduling for all CGRA configurations.

V. COMPILER

Adding support for a new architecture within LLVM requires the addition of a target specific backend. This backend translates the target independent Intermediate Representation (IR)

of LLVM into target specific instructions, schedules instructions, allocates registers, and generates the final assembly or binary code. This process is split into several steps, as illustrated by figure 4. First the IR operations are translated into target specific operations during instruction selection. Then the selected instructions are scheduled and registers are allocated for the storage of intermediate results. The order in which scheduling and register allocation are performed can have a large impact on the resulting code quality [6]. The scheduler may bypass the register file in some cases, which reduces the register pressure. Allocating registers after scheduling is a logical choice for the CGRA compiler, because a lower register pressure makes it easier to allocate registers. Finally the prologue and epilogue are inserted to handle the entry and exit points of the function that is being compiled. Normally these instructions can be inserted directly at the beginning and end of the generated code; for the CGRA an extra scheduling step is required because LLVM's prologue and epilogue pass cannot schedule operations for the CGRA. After this the code generation is complete and the assembly text can be printed to the output file. The red borders in figure 4 indicate which steps required the most changes to support the CGRA.

Figure 4: Ordering of scheduling passes in the compiler (LLVM IR, instruction selection, scheduling and packetizing, register allocation, prologue/epilogue insertion, scheduling of inserted instructions, assembly printing, assembly code)

A. Compiler framework adaptation

LLVM uses instructions that operate on registers, memory locations and immediate values. In contrast, the instructions of the CGRA contain information about which source ports operands come from and which destination port the result needs to be written to. A dedicated register class can be used to model the input and output ports of a function unit in such a way that the architecture still adheres to the expected model. Instruction selection transforms standard LLVM instructions into pseudo instructions that still operate on registers and immediates as usual. A custom scheduler pass then transforms them into real instructions and sets the correct source and destination ports. Since the prologue/epilogue pass inserts new pseudo instructions into the existing schedule, another scheduling round needs to run afterwards. Both scheduling passes use the same code. The backend currently uses the existing register allocator of LLVM. However, to support multiple RFs, another register allocator needs to be implemented. More information on this topic can be found in section V-G. Only a few assumptions can be made to simplify the backend organization, since the structure of the CGRA processor instance is not fixed. The compiler for this architecture should be very generic, because it must be able to generate code for all sane processor configurations (i.e. configurations that have a complete instruction set and sufficient connectivity for the target program).

B. Explicit bypassing support

A list scheduler can be used to schedule for explicit bypassing architectures after some modifications. One approach is to use the list scheduler followed by an additional pass which checks for bypassing possibilities, as is done in [15]. Applying this method to the CGRA architecture template is difficult, because there is no guarantee that results can be written to the register file immediately.
This approach also does not provide a solution to handle the flexibility of the architecture. Another possibility is to allocate the bypass registers before instruction scheduling, as is done in [10]. This has the advantage that the normal scheduling algorithm can already take the bypassed connections into consideration and is more likely to find good schedules. Finally, a third possibility is to use a scheduler which automatically assigns the bypass registers during the instruction scheduling process. This produces a schedule for which the register file entries still need to be allocated, but which efficiently uses the bypassing opportunities of the program. Of all approaches, this one is the most flexible. A list scheduler which implements this third option was chosen because it is very flexible. This scheduler uses a resource graph to determine the availability of function units and bypass registers during the scheduling process. The resource graph is constructed based on the CGRA configuration, described in an XML file. The configuration description includes which function units are present and how they are connected. The loading of this resource graph is completely custom for the CGRA backend, as LLVM normally does not offer such fine-grained retargetability to backends. By design, a single backend will only support a fixed set of architecture variations, which provides much less flexibility than the CGRA template requires.

C. Operation based scheduling algorithm

The instruction scheduling passes of LLVM currently only provide cycle based instruction scheduling algorithms and are not able to generate schedules for explicit datapath architectures, for which an operation based scheduling algorithm is more suitable. Generating these schedules requires more bookkeeping than what is done in the current schedulers. An explicit datapath architecture scheduler needs to insert extra operations to keep values alive if they are needed later on. The destination of the operand is unknown at that moment, because the scheduler does not know when and where the consuming operation will be scheduled. An operation-based scheduler can postpone routing the produced operand, or reroute the operand when its destination is known. This results in better usage of resources. The adapted scheduling algorithm is shown in algorithm 2. The operand paths are found by Dijkstra's shortest path algorithm, similarly to what is done in [4]. Additionally, the shortest

path algorithm is used for function unit selection. For each existing operand, the shortest path from the producer of the operand to any other function unit is calculated. This way the shortest path algorithm for the first operand only needs to run once. The function unit with the lowest total cost is selected to execute the operation. The cost of a path is calculated using the weights of the edges in the resource graph. Immediate operands differ from register operands because they can be created anywhere in the schedule, as long as there is a path to the function unit that needs them. The location where an immediate value is produced is found by calculating the shortest path from the function unit that processes it to an immediate unit. This means that immediate operand paths are found when the processing function unit has already been selected. Another approach is to simply choose an immediate unit and try to find a path to the processing function unit. This may lead to extra resource usage, however.

Algorithm 2 Operation based scheduling algorithm
Input: DDG D of the basic block B and resource graph RG
Output: Scheduled instructions M of the basic block B
1: procedure SCHEDULE(D, M)
2:   R ← initialize_ready_list(D)
3:   while R ≠ ∅ do                      ▷ Schedule the basic block
4:     o ← select_next_operation(R, D)
5:     F ← find_available_FUs(o, RG)
6:     while true do                     ▷ Find a function unit
7:       f ← select_FU(F)
8:       if o can be scheduled on f then
9:         claim_resources(RG)
10:        R ← R \ {o}
11:        New_ops ← add_ready_ops(o, D) ▷ Add operations to ready list
12:        R ← R ∪ New_ops
13:        break
14:      else
15:        F ← F \ {f}
16:      end if
17:    end while
18:  end while
19: end procedure

When an operation produces a result that needs to be used by another operation, care must be taken that this result can still be routed when the consuming operation is scheduled. The scheduler should either be able to guarantee that the result can be routed, or be able to reschedule the operation that produced the result. If this is not the case, the scheduler may end up with an infeasible partial schedule. The most trivial way of guaranteeing that a result can later be consumed is to route all results to the register file, such that they can be read back later on. Although this guarantees that the operand can be routed later on, it also allocates more resources than strictly necessary. This may result in other operations being scheduled in a later cycle than if the resources had not been allocated. Another approach is to use max-flow graphs as described in [4]. This method guarantees that enough resources are available to route all live variables to a register file, without actually allocating the resources. This improves the schedule quality, but also increases the scheduling time, since for each operation a graph needs to be constructed and the max-flow of the graph needs to be calculated. In some situations the result cannot be routed to the register file, but it may still be possible to use the result, for example as input to an LSU. The max-flow graph approach will prevent the producing instruction from being scheduled, which makes this approach slightly pessimistic. A backtracking scheduler can be used to recover from infeasible partial schedules, since it always has the option to unschedule an operation whose result cannot be routed to its destination. Care must be taken that it still terminates when no solution can be found.
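A C++ rendering of the main loop of algorithm 2 is sketched below. All types and helper functions are assumptions standing in for the actual data structures of the backend; the routing guarantees discussed above would be enforced inside claim_resources.

    #include <algorithm>
    #include <set>
    #include <vector>

    struct Operation;     // node of the data dependence graph (DDG)
    struct FunctionUnit;  // candidate execution resource in the resource graph
    struct DDG;
    struct ResourceGraph;

    // Helpers corresponding one-to-one to the pseudocode; bodies omitted.
    std::set<Operation *> initialize_ready_list(const DDG &d);
    Operation *select_next_operation(const std::set<Operation *> &r, const DDG &d);
    std::vector<FunctionUnit *> find_available_fus(const Operation *o,
                                                   const ResourceGraph &rg);
    FunctionUnit *select_fu(const std::vector<FunctionUnit *> &f);  // lowest path cost
    bool can_schedule(const Operation *o, const FunctionUnit *f,
                      const ResourceGraph &rg);
    void claim_resources(const Operation *o, const FunctionUnit *f, ResourceGraph &rg);
    std::vector<Operation *> add_ready_ops(const Operation *o, const DDG &d);

    void schedule(const DDG &d, ResourceGraph &rg) {
        std::set<Operation *> ready = initialize_ready_list(d);
        while (!ready.empty()) {                     // schedule the basic block
            Operation *o = select_next_operation(ready, d);
            std::vector<FunctionUnit *> fus = find_available_fus(o, rg);
            while (true) {                           // find a function unit for o
                FunctionUnit *f = select_fu(fus);
                if (can_schedule(o, f, rg)) {
                    claim_resources(o, f, rg);       // also reserves operand paths
                    ready.erase(o);
                    for (Operation *n : add_ready_ops(o, d))
                        ready.insert(n);             // successors that became ready
                    break;
                }
                // As in the pseudocode, o is assumed to fit on some unit
                // eventually; a real implementation would handle an empty
                // candidate set here.
                fus.erase(std::find(fus.begin(), fus.end(), f));
            }
        }
    }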
Finally, an Integer Linear Programming or Constraint Programming approach can be used, but these have much higher compile times. For the CGRA compiler the first approach was chosen because it is the fastest one.

D. Shortest path algorithm

Dijkstra's algorithm can be used to find single-source shortest paths in graphs with positive weights. Figure 5 shows a simplified resource graph for which the shortest path from the source node (red) to all other nodes is calculated. The colors of the nodes indicate the distance from the source node to that node: the darker a node is, the further away it is from the source. White nodes cannot be reached from the source node. The numbers in the nodes indicate the order in which the nodes are visited. Note that only one unreachable node needs to be visited; as soon as this node is visited, the algorithm knows that all remaining nodes are unreachable. Min-heaps are often used to store the distances of the nodes for Dijkstra's algorithm. This results in a running time of O((V + E) log V) [16], where V is the number of vertices and E is the number of edges in the graph. Unfortunately the Standard Template Library (STL) does not support the decrease-key operation for heaps. One solution is to rebuild the heap each time a distance value is changed, but this results in a running time of O(V²). Another option is switching to a library that does support the decrease-key operation; instead, another approach was tried. The purpose of the min-heap in Dijkstra's algorithm is to make sure nodes are visited in a correct order. For Directed Acyclic Graphs (DAGs) this order can also be determined by topologically ordering the graph, which can be done in O(V + E) [16]. Since the heap is then no longer needed, the running time of Dijkstra's algorithm reduces to O(V + E). Note that the resource graph may not be a DAG if there are function units in the configuration that allow bypassing the pipeline registers. Routing operands in a cycle does not make sense, however, so such a cycle can be removed before running Dijkstra's algorithm.
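The following generic C++ sketch shows the resulting O(V + E) single-source shortest path computation: with the nodes numbered in topological order, every edge is relaxed exactly once and no priority queue is needed. In the resource graph the cycle index essentially provides such an order, since no edges go back in time. This is illustrative code, not the backend's implementation.

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct Edge { int to; int weight; };  // weight >= 0, as in the resource graph

    // Shortest distances from `source` in a DAG whose nodes are numbered in
    // topological order, i.e. adj[u] only contains edges u -> v with v > u.
    // Runs in O(V + E): each node and each edge is processed exactly once.
    std::vector<long long> dag_shortest_paths(
            const std::vector<std::vector<Edge>> &adj, int source) {
        const long long kInf = std::numeric_limits<long long>::max();
        std::vector<long long> dist(adj.size(), kInf);  // kInf = unreachable
        dist[source] = 0;
        for (std::size_t u = source; u < adj.size(); ++u) {  // index order
            if (dist[u] == kInf)
                continue;  // u is unreachable, nothing to relax
            for (const Edge &e : adj[u])
                dist[e.to] = std::min(dist[e.to], dist[u] + e.weight);
        }
        return dist;
    }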

Figure 5: Regular approach of Dijkstra's algorithm. The source node has a red border. Numbers in the nodes indicate the order in which the nodes are visited; nodes without a number are never visited. Colors indicate the distance of nodes to the source: the higher the distance, the darker the node. White nodes are unreachable from the source.

A topological ordering of a graph is typically not unique. One way of obtaining a topological ordering is to run Depth First Search (DFS) on the entire graph. Another topological ordering can be found by running the topological ordering algorithm on the nodes of a single cycle of the graph and repeating this order for every cycle. This is possible because there are no edges going back in time. The advantage of the latter approach is that the topological ordering algorithm only needs to run once on a small subgraph for an entire program.

Dijkstra's algorithm can be used to find the shortest path from the source to a single destination, or from the source to all other reachable nodes in the graph. Here the algorithm is used to find suitable function units to schedule an operation upon, so the destination is not known. Calculating the paths from the source to all other nodes is not necessary most of the time, however: if a suitable function unit is located two cycles after the source, all distances beyond that cycle are calculated in vain. The solution is to calculate distances incrementally, as shown in figure 6. The search starts at the source. Distances are calculated for search windows of a few cycles (between the dotted lines). After calculating the distances, the scheduler tries to schedule the selected operation onto function units in the search window. If no function unit can perform the operation, later windows are tried until a function unit is found that can execute the operation. While searching, the resource graph is expanded when necessary. Changing the search window size can affect the schedules. If the weights in the resource graph are configured in such a way that delaying an operation is preferred, a small search window size may prevent preferred nodes from being selected because they are not included in the search window.

Figure 6: Incremental approach of Dijkstra's algorithm. Red dotted lines indicate search windows. White nodes are unreachable from the source. Nodes without numbers have not been visited (yet).

When a path needs to be found for an operand, the location where it was first defined is used as the source node for the shortest path algorithm. During scheduling the graph expands and some parts of it become allocated. This causes the paths of operands that were defined early in the graph but still need to be scheduled to become longer, which results in a longer compile time. To reduce the compile time for large schedules, the operand source locations can be changed to reduce the path lengths. A simple heuristic is implemented which relocates all operands that are still needed to the latest cycle every time the graph is enlarged to a multiple of X cycles, where X is an option of the compiler. The heuristic is very effective in reducing the compile time, but it also reduces the schedule quality. Relocating the operands to cycles earlier than the last cycle is expected to result in better schedules.
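The incremental search could be organized along the following lines. This is a sketch; the helper functions, the window growth policy and the termination bound are all assumptions (a real implementation would bound the schedule length so the loop always terminates).

    #include <optional>

    struct Operation;
    struct FunctionUnit;
    struct ResourceGraph;

    // Assumed helpers: extend the shortest path distances over one window,
    // grow the resource graph lazily, and probe the window for a usable FU.
    void relax_window(ResourceGraph &rg, int from_cycle, int to_cycle);
    void expand_graph_to(ResourceGraph &rg, int cycle);
    std::optional<FunctionUnit *> try_schedule_in(const Operation *o,
                                                  ResourceGraph &rg,
                                                  int from_cycle, int to_cycle);

    // Incremental variant of the function unit search: distances are computed
    // one search window at a time, so cycles beyond the chosen unit are never
    // examined. max_cycle bounds the search so that it always terminates.
    FunctionUnit *find_fu(const Operation *o, ResourceGraph &rg,
                          int source_cycle, int window_size, int max_cycle) {
        for (int from = source_cycle; from < max_cycle; from += window_size) {
            int to = from + window_size;
            expand_graph_to(rg, to);   // make sure the window exists in the graph
            relax_window(rg, from, to);
            if (auto fu = try_schedule_in(o, rg, from, to))
                return *fu;
        }
        return nullptr;  // no unit found within the bound
    }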
Figure 7: Control Flow Graph with code that is hard to schedule for multiple register file configurations. Block Entry branches to blocks If then and If else, which both define vreg1 and join in block Tail, where vreg1 is used.

E. Scheduling order heuristics

The list scheduler uses a heuristic to determine the order in which operations are scheduled. Many heuristics are described in the literature [17]. Which heuristic is chosen depends on the goal that needs to be achieved; the goals of the CGRA compiler are optimizing for execution time and energy. Two heuristics were tried for the CGRA compiler: largest height first and least slack first. Figure 8 shows a dependence graph with the height and slack annotated in the nodes. The height is defined as the distance from a node to the root node. For the slack heuristic we first calculate the As Soon As Possible (ASAP) and As Late As Possible (ALAP) times of the nodes. ASAP times are calculated according to equation (1); ALAP times are calculated according to equation (2). The slack of a node is given by equation (3).

ASAP_i = 0 if i is a leaf node; otherwise ASAP_i = max over all parents p of i of (ASAP_p + 1).   (1)
ALAP_i = ASAP_i if i is the root node; otherwise ALAP_i = min over all children c of i of (ALAP_c - 1).   (2)
SLACK_i = ALAP_i - ASAP_i.   (3)

Slack is calculated before scheduling, so scheduling decisions do not affect the slack of nodes in the graph. This is cheaper than recalculating the slack after scheduling an operation, but may result in worse schedules.

Figure 8: Dependence graph annotated with the height and slack of its nodes.

Figure 9: Dependence graphs with different scheduling orders for the height and slack heuristics: (a) possible height order, (b) possible slack order.

Figure 10: Dependence graph with the same scheduling order for both the height and slack heuristics.

The CGRA compiler must guarantee that an operand can arrive at its destination. To accomplish this, paths may be reserved temporarily, which may cause other operations to be delayed because the required resources are occupied. When the last use of an operand is scheduled, the temporarily reserved paths can be released. The scheduling order heuristic decides which operation is scheduled next, and hence decides which temporary path can be freed, if any. In this way the scheduling order heuristic influences the schedule quality. Figure 9 shows two dependence graphs with scheduling orders for the height and slack heuristics. The slack heuristic schedules the operations in such an order that live operands are killed before new live operands are produced. The height heuristic can obtain the same ordering, but this is not guaranteed. Figure 10 shows a dependence graph for which the slack heuristic may not find the best ordering: because all operations have zero slack, it is possible that it will obtain the same ordering as the height heuristic. Adding a tiebreak to the slack heuristic may result in better schedules. A least-height-first tiebreak would force a node lower in the graph to be scheduled once its parents have been scheduled. The problem remains that the heuristic does not know which top nodes to schedule to free a node lower in the graph. This can be resolved by adding the number of uncovered children as a second tiebreaker: the number of uncovered children indicates how many nodes will become schedulable as a result of scheduling that node. Finally, the number of remaining uses may be used as a metric. Scheduling operations with few uses is preferred, because scheduling those operations results in resources being released.
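For concreteness, ASAP, ALAP and slack as defined in equations (1) to (3) can be computed in two linear passes over a topologically ordered dependence graph. The sketch below is a generic illustration, not the compiler's implementation:

    #include <algorithm>
    #include <vector>

    // Dependence graph in topological order: preds[i] lists the parents
    // (producers) of node i, succs[i] its children (consumers).
    struct DepGraph {
        std::vector<std::vector<int>> preds, succs;
    };

    // Equations (1)-(3): ASAP from the leaves downward, ALAP from the root
    // upward, slack as the difference. Node indices form a topological order.
    void compute_slack(const DepGraph &g, std::vector<int> &asap,
                       std::vector<int> &alap, std::vector<int> &slack) {
        const int n = static_cast<int>(g.preds.size());
        if (n == 0) return;
        asap.assign(n, 0);
        for (int i = 0; i < n; ++i)       // equation (1): leaves start at 0
            for (int p : g.preds[i])
                asap[i] = std::max(asap[i], asap[p] + 1);

        const int cp = *std::max_element(asap.begin(), asap.end());  // critical path
        alap.assign(n, cp);               // the root node gets ALAP == ASAP
        for (int i = n - 1; i >= 0; --i)  // equation (2), in reverse order
            for (int s : g.succs[i])
                alap[i] = std::min(alap[i], alap[s] - 1);

        slack.assign(n, 0);
        for (int i = 0; i < n; ++i)       // equation (3)
            slack[i] = alap[i] - asap[i];
    }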
F. Enabling software pipelining in LLVM

As demonstrated with the example application, software pipelining is a key optimization technique for architectures exploiting Instruction Level Parallelism (ILP). A target independent software pipelining pass can be used to redistribute the operations of loops [18], [19]. The advantage of this approach is that it is target independent and requires less work than building a custom software pipelining scheduler. The actual scheduling is left to the normal scheduling algorithm. The current compiler assumes that live-in virtual registers

reside in the register file, because no mechanism exists to indicate what the location of a virtual register is. With such a mechanism it would be possible to pass virtual registers in the pipeline registers of a function unit across basic block boundaries. To determine which virtual registers should be passed in pipeline registers, an additional pass can be run before the scheduler. If the Control Flow Graph (CFG) of figure 11 were processed by such a pass, it would need to mark vreg2 as live-out of block Entry and as live-in and live-out of block Loop.

Figure 11: Control Flow Graph of a loop with live-in and live-out registers. Block Entry defines vreg1 and vreg2; block Loop uses and redefines vreg2; block End uses vreg1.

G. Supporting multiple (distributed) register files

The architecture allows configurations with multiple register files to be built. Supporting multiple register files is not trivial when using LLVM. Figure 7 shows a CFG of an application in which the same virtual register is defined in multiple basic blocks and used in another basic block. Eventually vreg1 needs to be assigned to the same physical register. Since the scheduler allocates operand paths, it will also assign vreg1 to a register file. The scheduler needs to write vreg1 to the same register file in block If then as in block If else. Furthermore, it must read vreg1 in block Tail from that same register file. This technique can be referred to as integrated scheduling and register file allocation. After this pass a register allocator still needs to assign virtual registers to physical registers. LLVM does not provide data structures for storing the location of an operand, because it assumes there is only one register file. Besides adding this data structure, the scheduler and register allocator must be adapted to use it. Storage of virtual register locations can be done at function level. When a virtual register is defined that is live-out of a basic block, the scheduler must check whether it has already been assigned to a register file. If this is the case, the scheduler must route it to that location. If the virtual register has not been assigned to a location yet, the scheduler may greedily choose a register file to route the virtual register to. When a use of a virtual register is scheduled, the scheduler can look up the location of the virtual register. It is important that registers are divided equally between register files: a register file that is full can limit the scheduler, because no paths can be routed through that register file anymore. The weights on the edges between the nodes of a register file in the resource graph can be used to achieve an equal distribution. These weights may, for example, be made inversely proportional to the number of free registers left in that register file, so that the cost of routing into a register file grows as it fills up. This causes emptier register files to get preference over fuller register files when scheduling operands to the register file.
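One possible reading of this balancing rule in code (illustrative only; the weight function and the bookkeeping of free registers are assumptions):

    // Cost of routing an operand into a register file, used as the weight of
    // the edges towards that RF in the resource graph. Emptier register files
    // get a lower cost and are therefore preferred by the shortest path search.
    struct RegisterFileState {
        int capacity;   // total number of registers in this RF
        int allocated;  // registers currently holding live values
    };

    double rf_edge_weight(const RegisterFileState &rf) {
        const int free_regs = rf.capacity - rf.allocated;
        if (free_regs <= 0)
            return 1e9;            // effectively blocks routing through a full RF
        return 1.0 / free_regs;    // inversely proportional to free registers
    }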
H. Target specific energy efficiency optimizations

The weights on the edges between the nodes of the resource graph determine which decisions the scheduler makes. This can be used to steer the scheduler towards energy efficient schedules. Many operands are used only a few times in a schedule. Reading an operand twice from the register file may consume more energy than reading the value once and using an ALU to store the value until it is needed. When using ALUs to route an operand, energy may be saved by using a single ALU for a few cycles instead of sending the value to another ALU every cycle: fewer instruction bits will toggle if a single ALU is used. When making a choice between function units, preference may be given to the function unit with the lowest output connectivity, since transferring the result operand to the other function units will then charge fewer capacitances. Scheduling for low energy consumption is presented in [20]. In that paper, energy tables are used which indicate how much it costs to transition from one instruction to another. These energy tables could be used in the same way to calculate which operand path is the cheapest. In [20] a cycle-based scheduler is used; if an operation based scheduler is used, such as in the CGRA compiler, the energy cost calculations are less accurate. This is because the scheduler calculates how much it costs to schedule operation B after operation A, but it may replace operation A at a later point in the scheduling process.

The current scheduler assigns opcodes, operand sources and an operand destination to function units which need to execute some task. By default, function units that are not used execute a No Operation (NOP). Keeping instructions the same as in the previous cycle may be a better approach to reduce energy consumption: if an instruction does not change between cycles, fewer bits toggle in the instruction fetch and decode stage. Care must be taken that the saved energy is not lost during the execution phase. When executing NOPs, the output of a function unit remains the same. If instead an instruction is executed which changes the output, more energy is used than when executing the NOP. Figure 12 shows two schedules that provide the same functionality; the schedule in figure 12a can be converted to the schedule in figure 12b to save energy. In this case the entire add operation, including the production of an immediate and the loading of a register, can be copied to the next cycle. If R1 were written during cycle 2, the choice would be less trivial, as this would introduce extra toggling in the ALU. A heuristic needs to be found that can make this trade-off.

VI. EVALUATION

A. Supported features

The current compiler is able to compile four benchmarks. A calling convention is implemented which allows the compiler

to provide basic function call support. Code can be generated for architectures that contain multiple register files, although values cannot be kept in multiple register files across basic blocks due to limitations of the register allocator. Testing the compiler showed that the weights on the edges of the resource graph must be chosen wisely when implementing the operation based scheduler: in some cases, scheduling the first operand of an operation makes it impossible to schedule the second operand, because the path of the first operand blocks the path of the second. Currently the scheduler is included as a separate pass in the LLVM toolflow, which replaces LLVM's own scheduling algorithm. In order to better support multiple register files in the future, and to further improve the explicit bypassing utilization in the schedule, the register allocator should be included in the scheduler pass. This allows for rescheduling when register allocation fails, while the operation based scheduling information is still available. Further recent additions to LLVM, such as the upcoming support for software pipelining, are also promising. The proposed implementation for the Hexagon target [18] seems to match relatively well with the current organization of the CGRA backend, and it is our expectation that it can be integrated without too many complications. The main additions required to the CGRA backend will be several target hooks for supplying information about the available parallel function units in the configured architecture.

Figure 12: Optimizing a schedule for energy. (a) Initial schedule; (b) optimized schedule in which the add operation is kept in place for an extra cycle.

B. Benchmark set

The tested benchmarks are listed in table I. For matrix multiplication, matrices of size 2x2 are used due to register file constraints. The entire kernel is unrolled for this benchmark to provide more ILP to the scheduler; the other benchmarks exhibit little ILP. More benchmarks were tried but could not run, either due to the limited number of registers available or because they require operations that are not yet supported, such as variable amount shifts and select operations. Figure 13 shows the architecture configuration that was used to test all benchmarks. The arrows in the figure indicate data connections between the function units. An IF/ID unit is assigned to every function unit for control.

Figure 13: Architecture configuration used for the benchmarks (ABU, MUL, RF, ALU, LSU, IMM)

Compile time tests are performed for both the normal shortest path algorithm and the improved incremental version. The measured time indicates how long it takes to compile the entire program from LLVM IR to assembly text. The compiler is built in release mode for this test. The scheduling order heuristic used for the compile time test is largest-height-first. Schedule quality tests are performed to compare the largest-height-first and slack heuristics. The code size, the utilization of the function units and the register usage are measured only for the kernels of the benchmarks. Statistics are derived from the

schedule, not by running the code in simulation. If runtime statistics were used, a small difference in the schedule would be amplified if that part of the code were executed many times. Although this is a good way to see what effect a certain change in the schedule has, it makes it harder to see how much the scheduler actually contributed to this change.

TABLE I: Benchmarks tested on the CGRA, listing the number of operations, the number of basic blocks, and the operations per basic block for Bubble sort, Insertion sort, Fibonacci and Matrix mult.

C. Experimental results

Table II shows the results of the compile-time test. The shortest path v1 compiler uses the quadratic-time shortest path algorithm, which searches the entire resource graph; the shortest path v2 compiler uses the linear-time shortest path algorithm with the incremental approach. As expected, the second version performs much better. The search window size of the incremental algorithm has a very small impact on the schedule quality: only in one benchmark was a small difference measured between window sizes of 1 and 2 cycles, and larger window sizes did not affect the schedules. The compile times of the slow compiler are not an issue if some code needs to be compiled during the development of a program. When architecture configuration Design Space Exploration (DSE) is performed, however, the same program may be compiled numerous times to discover which configuration is best for that program. In that case the speedup of the compiler is very welcome. The X86 compiler is included in the table as a reference. The results of our compiler are reasonable considering the higher complexity of the scheduling problem.

TABLE II: Compile times of benchmarks
Benchmark        Shortest path v1   Shortest path v2   Speedup   X86 compiler
Bubble sort      614 ms             41 ms
Insertion sort   271 ms             27 ms
Fibonacci        121 ms             23 ms
Matrix mult      1601 ms            42 ms

Figure 14 shows the code size of the benchmarks for both the largest-height-first and slack scheduling order heuristics. The performance of the two heuristics is very similar. The quality of the schedules is hard to quantify, because no scheduler exists yet that can produce optimal schedules for this architecture.

Figure 14: Performance of the scheduling order heuristics in terms of code size (#instructions) for Bsort, Ins. sort, Fibonacci and Matrix mult.

Figure 15 shows the percentage of register operations that were eliminated using both heuristics. An approximation is made of how many register operations would be required if the register file were not bypassed. The approximation is done by counting the operations in the generated schedule and calculating how many read and write operations they would require. The approximated number of required register operations is then used to calculate what percentage of register operations is eliminated by bypassing the register file. The approximation does not take moves of immediate values to registers into account, because these cannot be extracted from the schedule. This explains why bubble sort uses more register operations than the approximated upper bound. Overall the slack heuristic scores better than the height heuristic. This is expected, because the slack heuristic has a higher chance of scheduling operations that consume operands before scheduling operations that produce new operands, which raises the chance that the register file can be bypassed.
Figure 15: Performance of the scheduling order heuristics in terms of eliminated register operations (%), showing the reads and writes eliminated by the height and slack heuristics for Bsort, Ins. sort, Fibonacci and Matrix mult.

The average utilization of the function units for both scheduling order heuristics is shown in figure 16.

Figure 16: Utilization of the function units (%) per benchmark, for the ABU, IMM, ALU, ALU (including pass), RF, LSU, MUL and the average (including pass).

The utilization is calculated by the following formula:

Utilization = (#Operations - (#NOPs + #Passes)) / #Operations

Pass operations are operations that are used for routing values and are performed only by the ALU. Since they do not contribute to the actual program, they are not counted towards the utilization. The ALU may also use NOPs in the routing process; these NOPs must stay in the schedule, because otherwise the output of the ALUs would be overwritten. If pass operations were counted when calculating the utilization, a schedule A that routes with many pass operations would seem to yield a higher ALU utilization than a schedule B that routes with many NOPs. Most of the results shown in figure 16 can be explained rather easily. The utilization of the multiply unit is zero for most benchmarks, because these benchmarks do not perform multiply operations at all. The utilization of the ABU is correlated with the number of instructions per basic block, shown in table I: programs with fewer operations per basic block contain relatively more branch and jump operations, which results in a higher utilization of the ABU. The average utilization is also correlated with the number of instructions per basic block. Although the scheduler can fill delay slots, this does not happen very often. Having fewer operations per basic block results in basic blocks which execute in fewer cycles. The delay slot is always a fixed number of cycles, so the number of cycles spent in the delay slot becomes relatively larger for smaller basic blocks. Since the part of the basic block in which little work is done becomes relatively larger, the average utilization of the hardware decreases. A low hardware utilization indicates that the hardware is capable of performing more work than is required by the compiled program. Achieving a high average utilization with the current CGRA is an unrealistic goal. Unlike VLIW processors, the CGRA does not cluster function units; instead, every function unit is connected to a dedicated IF/ID unit. This causes a low average utilization, because there is an abundance of function units that can perform work, but not enough operations that can be mapped onto those function units. The utilization of the register file is relatively high for all benchmarks. The register file has only one read port; if an operation requires two operands that are both in the register file, two reads are required to provide the operands of a single operation. The register file can be bypassed in some cases, but when a value is used multiple times, it is hard to keep it in pipeline registers. The following hardware modifications could help to create configurations that can bypass the register file more often:

- Add routing capabilities to other function units.
- Add extra input connections to function units.

Many of the function units already have pipeline registers, but only the ALU and the register file can be used for routing. Adding routing capabilities to other function units makes it possible to bypass the register file more often. The multiplier and the LSU are the most logical candidates to add this feature to, since they are often already connected to the register file and an ALU.
It should be investigated how much extra hardware is required to support this feature and how much the schedules would benefit from it. When generating simple configurations, the ALU and register file use almost all of their input connections, as can be seen in figure 13. Adding more ALUs will not improve the schedules much, because they cannot be connected in a proper way. Similarly, using multiple output ports of a single ALU is problematic because of the limited number of input ports.

VII. FUTURE WORK

In this project a working compiler is delivered. The following features can be added to enlarge the set of programs that can be compiled:

- Variable amount shifts
- Select operations
- More extensive support for distributed register files

The CGRA hardware only supports shifts by a fixed amount. Variable amount shifts can be supported by inserting loops which execute fixed amount shifts. This feature is present in the backend for another architecture (AVR) and should be ported to the CGRA backend. Select operations use the internal flag of the ALU as a third operand. This flag is not yet included in the hardware model, so the select operation is not yet supported. A method must be implemented to save and restore the flag, since the internal flag can be overwritten by other operations. This is more complex than temporarily saving regular values to the register file, because the flag does not appear on the output of the ALU. Saving the flag can be done by performing a conditional move on immediate


Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Lecture Compiler Backend

Lecture Compiler Backend Lecture 19-23 Compiler Backend Jianwen Zhu Electrical and Computer Engineering University of Toronto Jianwen Zhu 2009 - P. 1 Backend Tasks Instruction selection Map virtual instructions To machine instructions

More information

Module 5 - CPU Design

Module 5 - CPU Design Module 5 - CPU Design Lecture 1 - Introduction to CPU The operation or task that must perform by CPU is: Fetch Instruction: The CPU reads an instruction from memory. Interpret Instruction: The instruction

More information

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1 Design of Embedded DSP Processors Unit 2: Design basics 9/11/2017 Unit 2 of TSEA26-2017 H1 1 ASIP/ASIC design flow We need to have the flow in mind, so that we will know what we are talking about in later

More information

A Feasibility Study for Methods of Effective Memoization Optimization

A Feasibility Study for Methods of Effective Memoization Optimization A Feasibility Study for Methods of Effective Memoization Optimization Daniel Mock October 2018 Abstract Traditionally, memoization is a compiler optimization that is applied to regions of code with few

More information

Chapter 3 (Cont III): Exploiting ILP with Software Approaches. Copyright Josep Torrellas 1999, 2001, 2002,

Chapter 3 (Cont III): Exploiting ILP with Software Approaches. Copyright Josep Torrellas 1999, 2001, 2002, Chapter 3 (Cont III): Exploiting ILP with Software Approaches Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Exposing ILP (3.2) Want to find sequences of unrelated instructions that can be overlapped

More information

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

Vertex Shader Design I

Vertex Shader Design I The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

TKT-3526 Processor Design

TKT-3526 Processor Design TKT-3526 Processor Design Compiling for TTA Architectures Vladimir Guzma 1 Compilation for TTA Compilation is process of transferring algorithm written in High Level Language (HLL) to machine code Compiler

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Status of the Bound-T WCET Tool

Status of the Bound-T WCET Tool Status of the Bound-T WCET Tool Niklas Holsti and Sami Saarinen Space Systems Finland Ltd Niklas.Holsti@ssf.fi, Sami.Saarinen@ssf.fi Abstract Bound-T is a tool for static WCET analysis from binary executable

More information

Project - RTL Model. III. Design Decisions. Part 4 - May 17, Jason Nemeth. model is capable of modeling both provided benchmarks.

Project - RTL Model. III. Design Decisions. Part 4 - May 17, Jason Nemeth. model is capable of modeling both provided benchmarks. Project - RTL Model Part 4 - May 17, 2005 Jason Nemeth I. Introduction The purpose of this assignment was to refine the structural model of the microprocessor into a working RTL-level VHDL implementation.

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Evaluating Inter-cluster Communication in Clustered VLIW Architectures

Evaluating Inter-cluster Communication in Clustered VLIW Architectures Evaluating Inter-cluster Communication in Clustered VLIW Architectures Anup Gangwar Embedded Systems Group, Department of Computer Science and Engineering, Indian Institute of Technology Delhi September

More information

Computer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović Computer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović Name: ANSWER SOLUTIONS This is a closed book, closed notes exam. 80 Minutes 17 Pages Notes: Not all questions

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

register allocation saves energy register allocation reduces memory accesses.

register allocation saves energy register allocation reduces memory accesses. Lesson 10 Register Allocation Full Compiler Structure Embedded systems need highly optimized code. This part of the course will focus on Back end code generation. Back end: generation of assembly instructions

More information

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 Name: Solutions (please print) 1-3. 11 points 4. 7 points 5. 7 points 6. 20 points 7. 30 points 8. 25 points Total (105 pts):

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &

More information

MSc THESIS. Design-time and Run-time Reconfigurable Clustered ρ-vex VLIW Softcore Processor. Muez Berhane Reda. Abstract

MSc THESIS. Design-time and Run-time Reconfigurable Clustered ρ-vex VLIW Softcore Processor. Muez Berhane Reda. Abstract Computer Engineering Mekelweg 4, 2628 CD Delft The Netherlands http://ce.et.tudelft.nl/ 2014 MSc THESIS Design-time and Run-time Reconfigurable Clustered ρ-vex VLIW Softcore Processor Muez Berhane Reda

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Compilers and Code Optimization EDOARDO FUSELLA

Compilers and Code Optimization EDOARDO FUSELLA Compilers and Code Optimization EDOARDO FUSELLA Contents Data memory layout Instruction selection Register allocation Data memory layout Memory Hierarchy Capacity vs access speed Main memory Classes of

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization Mian-Muhammad Hamayun, Frédéric Pétrot and Nicolas Fournel System Level Synthesis

More information

Computer Architecture V Fall Practice Exam Questions

Computer Architecture V Fall Practice Exam Questions Computer Architecture V22.0436 Fall 2002 Practice Exam Questions These are practice exam questions for the material covered since the mid-term exam. Please note that the final exam is cumulative. See the

More information

A Phase-Coupled Compiler Backend for a New VLIW Processor Architecture Using Two-step Register Allocation

A Phase-Coupled Compiler Backend for a New VLIW Processor Architecture Using Two-step Register Allocation A Phase-Coupled Compiler Backend for a New VLIW Processor Architecture Using Two-step Register Allocation Jie Guo, Jun Liu, Björn Mennenga and Gerhard P. Fettweis Vodafone Chair Mobile Communications Systems

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

Compiler Architecture

Compiler Architecture Code Generation 1 Compiler Architecture Source language Scanner (lexical analysis) Tokens Parser (syntax analysis) Syntactic structure Semantic Analysis (IC generator) Intermediate Language Code Optimizer

More information

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as 372 Chapter 16 Code Improvement 16.10 Exercises 16.1 In Section 16.2 we suggested replacing the instruction r1 := r2 / 2 with the instruction r1 := r2 >> 1, and noted that the replacement may not be correct

More information

A Framework for Space and Time Efficient Scheduling of Parallelism

A Framework for Space and Time Efficient Scheduling of Parallelism A Framework for Space and Time Efficient Scheduling of Parallelism Girija J. Narlikar Guy E. Blelloch December 996 CMU-CS-96-97 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523

More information

Computer Science 246 Computer Architecture

Computer Science 246 Computer Architecture Computer Architecture Spring 2009 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Compiler ILP Static ILP Overview Have discussed methods to extract ILP from hardware Why can t some of these

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods 10-1 Dynamic Scheduling 10-1 This Set Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods Not yet complete. (Material below may

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are

More information

Last time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls

Last time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls CS 6354: Branch Prediction (con t) / Multiple Issue 14 September 2016 Last time: scheduling to avoid stalls 1 Last time: forwarding/stalls add $a0, $a2, $a3 ; zero or more instructions sub $t0, $a0, $a1

More information

Unit 2: High-Level Synthesis

Unit 2: High-Level Synthesis Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

High performance, power-efficient DSPs based on the TI C64x

High performance, power-efficient DSPs based on the TI C64x High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Question Bank Subject: Advanced Data Structures Class: SE Computer

Question Bank Subject: Advanced Data Structures Class: SE Computer Question Bank Subject: Advanced Data Structures Class: SE Computer Question1: Write a non recursive pseudo code for post order traversal of binary tree Answer: Pseudo Code: 1. Push root into Stack_One.

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Computer Organization MIPS Architecture. Department of Computer Science Missouri University of Science & Technology

Computer Organization MIPS Architecture. Department of Computer Science Missouri University of Science & Technology Computer Organization MIPS Architecture Department of Computer Science Missouri University of Science & Technology hurson@mst.edu Computer Organization Note, this unit will be covered in three lectures.

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining

More information

16.1. Unit 16. Computer Organization Design of a Simple Processor

16.1. Unit 16. Computer Organization Design of a Simple Processor 6. Unit 6 Computer Organization Design of a Simple Processor HW SW 6.2 You Can Do That Cloud & Distributed Computing (CyberPhysical, Databases, Data Mining,etc.) Applications (AI, Robotics, Graphics, Mobile)

More information

Appendix C: Pipelining: Basic and Intermediate Concepts

Appendix C: Pipelining: Basic and Intermediate Concepts Appendix C: Pipelining: Basic and Intermediate Concepts Key ideas and simple pipeline (Section C.1) Hazards (Sections C.2 and C.3) Structural hazards Data hazards Control hazards Exceptions (Section C.4)

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture Computer Science 324 Computer Architecture Mount Holyoke College Fall 2009 Topic Notes: MIPS Instruction Set Architecture vonneumann Architecture Modern computers use the vonneumann architecture. Idea:

More information

TRIPS: Extending the Range of Programmable Processors

TRIPS: Extending the Range of Programmable Processors TRIPS: Extending the Range of Programmable Processors Stephen W. Keckler Doug Burger and Chuck oore Computer Architecture and Technology Laboratory Department of Computer Sciences www.cs.utexas.edu/users/cart

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

NISC Application and Advantages

NISC Application and Advantages NISC Application and Advantages Daniel D. Gajski Mehrdad Reshadi Center for Embedded Computer Systems University of California, Irvine Irvine, CA 92697-3425, USA {gajski, reshadi}@cecs.uci.edu CECS Technical

More information

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

More information

/ / / Net Speedup. Percentage of Vectorization

/ / / Net Speedup. Percentage of Vectorization Question (Amdahl Law): In this exercise, we are considering enhancing a machine by adding vector hardware to it. When a computation is run in vector mode on the vector hardware, it is 2 times faster than

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic

More information

Group B Assignment 9. Code generation using DAG. Title of Assignment: Problem Definition: Code generation using DAG / labeled tree.

Group B Assignment 9. Code generation using DAG. Title of Assignment: Problem Definition: Code generation using DAG / labeled tree. Group B Assignment 9 Att (2) Perm(3) Oral(5) Total(10) Sign Title of Assignment: Code generation using DAG. 9.1.1 Problem Definition: Code generation using DAG / labeled tree. 9.1.2 Perquisite: Lex, Yacc,

More information

Module 4c: Pipelining

Module 4c: Pipelining Module 4c: Pipelining R E F E R E N C E S : S T A L L I N G S, C O M P U T E R O R G A N I Z A T I O N A N D A R C H I T E C T U R E M O R R I S M A N O, C O M P U T E R O R G A N I Z A T I O N A N D A

More information

CPU ARCHITECTURE. QUESTION 1 Explain how the width of the data bus and system clock speed affect the performance of a computer system.

CPU ARCHITECTURE. QUESTION 1 Explain how the width of the data bus and system clock speed affect the performance of a computer system. CPU ARCHITECTURE QUESTION 1 Explain how the width of the data bus and system clock speed affect the performance of a computer system. ANSWER 1 Data Bus Width the width of the data bus determines the number

More information

Power Estimation of UVA CS754 CMP Architecture

Power Estimation of UVA CS754 CMP Architecture Introduction Power Estimation of UVA CS754 CMP Architecture Mateja Putic mateja@virginia.edu Early power analysis has become an essential part of determining the feasibility of microprocessor design. As

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language. Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central

More information

Background: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation

Background: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation Instruction Scheduling Last week Register allocation Background: Pipelining Basics Idea Begin executing an instruction before completing the previous one Today Instruction scheduling The problem: Pipelined

More information

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as

More information

Topic Notes: MIPS Instruction Set Architecture

Topic Notes: MIPS Instruction Set Architecture Computer Science 220 Assembly Language & Comp. Architecture Siena College Fall 2011 Topic Notes: MIPS Instruction Set Architecture vonneumann Architecture Modern computers use the vonneumann architecture.

More information

Code Generation. CS 540 George Mason University

Code Generation. CS 540 George Mason University Code Generation CS 540 George Mason University Compiler Architecture Intermediate Language Intermediate Language Source language Scanner (lexical analysis) tokens Parser (syntax analysis) Syntactic structure

More information

Efficient JIT to 32-bit Arches

Efficient JIT to 32-bit Arches Efficient JIT to 32-bit Arches Jiong Wang Linux Plumbers Conference Vancouver, Nov, 2018 1 Background ISA specification and impact on JIT compiler Default code-gen use 64-bit register, ALU64, JMP64 test_l4lb_noinline.c

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Intermediate Code & Local Optimizations

Intermediate Code & Local Optimizations Lecture Outline Intermediate Code & Local Optimizations Intermediate code Local optimizations Compiler Design I (2011) 2 Code Generation Summary We have so far discussed Runtime organization Simple stack

More information

09-1 Multicycle Pipeline Operations

09-1 Multicycle Pipeline Operations 09-1 Multicycle Pipeline Operations 09-1 Material may be added to this set. Material Covered Section 3.7. Long-Latency Operations (Topics) Typical long-latency instructions: floating point Pipelined v.

More information