Code generation for a coarse-grained reconfigurable architecture


Eindhoven University of Technology, MSc thesis
Code generation for a coarse-grained reconfigurable architecture
Adriaansen, M.
Award date: 2016

Code generation for a Coarse-Grained Reconfigurable Architecture

Michaël Adriaansen, Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands. m.adriaansen@student.tue.nl

Abstract: Good tool support is essential for computing platforms because it increases programmability. This is especially the case for reconfigurable architectures, because applications need to be mapped onto the architecture for each configuration individually. This paper introduces an LLVM-based compiler backend for Coarse-Grained Reconfigurable Arrays (CGRAs). The CGRA compiler must be retargetable to support all possible architecture configurations. Instruction scheduling is done using an operation-based list scheduling algorithm which uses a resource graph. The scheduler uses an improved version of Dijkstra's algorithm for finding operand paths. Energy efficiency is taken into account by using the explicit bypassing capabilities of the hardware. The number of register file operations is decreased by 23% on average.

I. INTRODUCTION

Mobile embedded devices are constrained by the energy available in the battery. For a typical mobile phone, the power budget available to applications is around 1 Watt [1] for a typical battery lifetime. The energy efficiency of such a platform therefore constrains the amount of computation that applications can perform. Although Application Specific Integrated Circuits (ASICs) can be used to provide high performance and high energy efficiency, they are often not a viable option because of the lack of flexibility they provide. Programmability becomes increasingly important as available development time shrinks. General purpose processors provide a high level of flexibility and programmability but lack compute performance, energy efficiency, or both. In general purpose processors a large percentage of the overhead can be attributed to instruction fetching and decoding, but even more so to data movement between memories, caches and register files [2]. Field Programmable Gate Arrays (FPGAs) can avoid much of this type of overhead by allowing spatial mapping of applications, but their fine-grained reconfigurability leads to a high configuration cost. For most signal processing applications, FPGAs have a finer granularity of reconfigurability than strictly necessary. Coarse-Grained Reconfigurable Arrays (CGRAs) require fewer configuration bits, due to their coarser-grained units, which results in a lower energy consumption while still allowing spatial application mapping [3].

In general, the programmability of a device depends on the tools that are provided for programming it. Most CGRAs can be seen as reconfigurable processors. For the programmer this is an advantage because of familiarity with such architectures. The benefit over general purpose processors is that the processor can be adapted to the application after manufacturing, which results in a higher energy efficiency for a wide range of applications. However, manually mapping an application onto a single platform instantiation is a difficult and time-consuming process. Thus, in order for the programmer to make efficient use of the configured architecture, we require a compiler which translates a high-level language (e.g. C) into the highly parallel CGRA instructions. The compiler for this architecture needs to be very flexible, as it can make few assumptions about the architecture of the configured processor.
Two architectural aspects of CGRAs, aimed at reducing energy consumption, have previously proved difficult to exploit for a high-level compiler: explicit register-file bypassing and distributed register files. Explicit register-file bypassing significantly decreases the energy consumption of a processor architecture by transferring intermediate results of computations directly from one function unit into the next, avoiding the energy cost of writing into (and reading from) the register file [4]. Distributed register files allow the processor architecture to use multiple smaller register files, which are less costly than one single large register file in both area and energy consumption [5]. Such register files prove challenging for compilers when not all function units have access to all register files [6].

This paper introduces the compiler for a CGRA architecture, which is being developed using the popular LLVM compiler framework [7]. The design of the backend is presented along with improvements that can be made. The paper is organized as follows. Section II discusses related work. Section III describes the architecture of the target CGRA. Section IV shows the features of this architecture with an example kernel and demonstrates the difficulties encountered during code generation. Section V then describes how these code generation aspects can be achieved using LLVM and explains the various target specific optimizations that were added. Section VI discusses the current state of the compiler. Section VII provides an overview of improvements that can be made in the future.

II. RELATED WORK

Several research projects have proposed architectures with explicit register-file bypassing in the past, some of which have already used (parts of) LLVM. Transport Triggered Architectures (TTAs) are closely related to the CGRA architecture described in this paper in the way code needs to be generated by the compiler. In particular, TTAs also have an explicit datapath, so the task of the compiler is to map operations to function units and find paths (which may or may not pass

through the register file) between function units to route the operands. TTAs are presented in [8]; a compiler is described in [9] and [6]. Newer versions of TTA architectures are also available as part of the TCE framework [10], [11]. The Synchronous Transfer Architecture (STA) is described in [12] and is similar to TTA, but execution triggers are provided by a Very Long Instruction Word (VLIW)-like instruction set architecture. The architecture consists of multiple function units which have input and output ports to transfer data. Inputs are selected by multiplexers. Instructions are split into segments. Each segment controls a function unit and contains an execution trigger, information for the multiplexers and, optionally, an immediate value. A compiler approach is described in [13]. The compiler uses integer linear programming to schedule instructions. Another processor with an explicit datapath is the wide Single Instruction Multiple Data (SIMD) processor described in [14]. The datapath of the processing elements in the SIMD is similar to that of a Reduced Instruction Set Computing (RISC) architecture, but the pipeline registers are exposed to the compiler through the Instruction Set Architecture (ISA) as operand sources. By explicitly bypassing short-lived variables from the pipeline registers, many accesses to the register file can be avoided, which results in significant energy savings. Unlike a CGRA, however, the datapaths are fixed, requiring a less flexible compiler.

III. ARCHITECTURE

The CGRA consists of a host processor and reconfigurable logic, as shown in figure 1. The host processor is responsible for configuring the CGRA. Application data can be transferred between the two processors via the shared global memory. Loading the configuration data and instructions of the CGRA is also performed by the host processor. The configuration data contains a bitfile, similar to bitfiles for FPGAs, that configures the control and data network and the function unit behavior. At the core of the CGRA are the function units:

- Accumulate Branch Units (ABUs), which can be used both for control flow and as accumulators.
- Arithmetic Logic Units (ALUs), capable of performing basic arithmetic and logic operations.
- Immediate Units (IMMs), used for producing constant values required by the program.
- Load Store Units (LSUs), capable of loading and storing values from/into the attached local memory bank or the global memory.
- Multiplier Units (MULs), used for computing multiplications.
- Register Files (RFs), used for storing intermediate results.

Function units are controlled by Instruction Fetch and Instruction Decode units (IF/IDs). Each IF/ID unit has a dedicated instruction memory from which it reads a single instruction each cycle. A connection between an IF/ID unit and one or more function units can be established at configuration time. Multiple IF/ID-Function Unit (FU) groups are combined to form a VLIW processor. By connecting a single IF/ID unit to multiple function units, an SIMD configuration can be created.

Figure 1: Architecture overview [3]

Each function unit's inputs and outputs can be connected via the reconfigurable data network. Doing so allows the architecture to adapt to the datapath of the application and enables spatial mapping of the application onto the hardware.
Results generated by the function units do not need to be stored in a register file but can be routed directly to one or more function units that consume them; this creates an explicit datapath between function units. Although this can be very energy efficient, compilation for such an architecture is non-trivial.

Figure 2: Minimal architecture configuration (ABU, IMM, ALU, RF, LSU)

Figure 2 shows an instantiation of a minimal processor. It contains an IMM for generating immediate values that can be used for computations performed by the ALU. The register file can be used to store live values that cannot be kept inside the explicit datapath's registers. The LSU can load and store data from/to the local and global memories, and the ABU is used to calculate the program counter and perform branches. It can be observed that data is allowed to bypass certain function units; e.g. data produced by the LSU can be directly consumed by the ALU without passing through the register file. For compilation this means that there may exist multiple paths between a producing and a consuming function unit.
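To make the configuration concrete, the following C++ sketch models how such an instance could be described inside the compiler. This is illustrative only: the actual compiler loads this information from an XML file (see section V-B), and every type and field name here is an assumption, not the real format.

    #include <string>
    #include <vector>

    enum class UnitKind { ABU, ALU, IMM, LSU, MUL, RF };

    struct FunctionUnit {
        std::string name;   // e.g. "alu0"
        UnitKind kind;
        int numInputs;      // input ports on the data network
        int numOutputs;     // output ports on the data network
    };

    struct Connection {     // directed link in the reconfigurable data network
        std::string fromUnit;
        int fromPort;
        std::string toUnit;
        int toPort;
    };

    struct Configuration {
        std::vector<FunctionUnit> units;
        std::vector<Connection> network;
    };

    // The minimal processor of figure 2: the IMM and the LSU feed the ALU
    // directly (bypassing the RF), the RF holds spilled live values, and
    // the ABU computes the program counter.
    Configuration minimalConfig() {
        Configuration c;
        c.units = {{"imm0", UnitKind::IMM, 0, 1}, {"alu0", UnitKind::ALU, 2, 1},
                   {"rf0", UnitKind::RF, 1, 1},   {"lsu0", UnitKind::LSU, 2, 1},
                   {"abu0", UnitKind::ABU, 1, 0}};
        c.network = {{"imm0", 0, "alu0", 0}, {"lsu0", 0, "alu0", 1},
                     {"alu0", 0, "rf0", 0},  {"rf0", 0, "alu0", 1},
                     {"alu0", 0, "lsu0", 0}, {"alu0", 0, "abu0", 0}};
        return c;
    }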

Multiple processors may be instantiated at the same time, allowing multiple programs to run in parallel. To support multiple independent programs, no additional compiler features are required. To compile programs that can communicate with each other, address spaces need to be supported by the compiler.

IV. CGRA PROGRAMMING EXAMPLE

A binarization algorithm will be used to illustrate how code is executed on the CGRA. The pseudocode for the example algorithm is shown in algorithm 1. Binarization can be used to produce a black and white version of an image by comparing the pixel data with a threshold and writing zero or one to the output image, depending on the outcome of the comparison. In algorithm 1, each pixel of array A is compared with the threshold TH. The result of this comparison is written to the corresponding entry in array B. When the algorithm is finished, array B will contain a black and white version of the image in array A.

Algorithm 1 Binarization algorithm
Input: Array A with image, threshold TH
Output: Array B with binary image
1: procedure BINARIZATION(A[ ], B[ ], TH, Count)
2:   for int I = Count; I != 0; I = I - 1 do
3:     B[I] ← (A[I] >= TH)
4:   end for
5: end procedure

The custom architecture configuration that was used is shown in figure 3a, and a manually created schedule for this particular CGRA configuration is shown in figure 3b. Although binarization could be computed with the minimal architecture configuration shown in figure 2, this configuration uses more function units to reduce the loop body cycle count. In the binarization configuration two ALUs are used: one for the compare operation required for the actual binarization and a second for managing the loop counter. The for loop has been software-pipelined to show the full potential of the architecture. The loop body is located between the red lines. Labels in the immediate unit and register file indicate which variable is produced by that unit. Labels in the LSU, ALU and ABU indicate which operation each function unit performs. The left three function units are used for processing the data. Two values of array A are compared with the threshold each loop iteration. The result of the comparison is written to array B. The address calculation is done implicitly in the LSUs. The right four function units are used for the control flow of the loop. The variable I is reduced by one each cycle. When I is equal to zero, the branch is not taken, which ends the for loop. Software pipelining can be observed by looking at the critical path (LSU-ALU-LSU) of the kernel code, which is longer than the distance between the loop boundaries.

Figure 3: Binarization kernel on the CGRA. (a) Custom architecture configuration (LSU, ALU, IMM 1, IMM 2, ALU loop, RF, ABU). (b) Scheduled program instructions for binarization; red lines indicate the loop body.

Bypassing of the register file can be observed from the fact that only the loop iterator I is ever stored in the register file. All other program values are passed directly between function units. By extending the configuration with more function units, one extra IMM and ALU, a single-cycle loop body can be achieved. Doing so has a large energy advantage, since no new instructions need to be fetched and decoded while the loop body is executed.
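For reference, the kernel of algorithm 1 corresponds to the following plain C function. This is a sketch; the thesis compiles such code through LLVM, and the parameter names are taken directly from the pseudocode.

    // Binarization kernel as it could be written in C for the CGRA compiler.
    // A and B are assumed to hold Count + 1 elements (indices 0..Count),
    // matching the pseudocode's indexing; element 0 is never touched.
    void binarization(const int *A, int *B, int TH, int Count) {
        for (int I = Count; I != 0; I = I - 1) {
            B[I] = (A[I] >= TH);  /* 1 if the pixel reaches the threshold */
        }
    }

The schedule of figure 3b processes two pixels per iteration; that is a scheduling decision and is not visible at this level.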
In order to generate schedules such as the one in figure 3b, a compiler needs to have a number of features:

- It must be retargetable, to generate schedules for all possible CGRA configurations.
- Explicit bypassing of the register file must be supported by the scheduler.
- The scheduler must be operation based.
- Software pipelining must be supported, to generate efficient schedules.
- Multiple register files need to be supported, to allow scheduling for all CGRA configurations.

V. COMPILER

Adding support for a new architecture within LLVM requires the addition of a target specific backend. This backend translates the target independent Intermediate Representation (IR)

of LLVM into target specific instructions, schedules instructions, allocates registers, and generates the final assembly or binary code. This process is split into several steps, as illustrated by figure 4. First the IR operations are translated into target specific operations during instruction selection. Then the selected instructions are scheduled and registers are allocated for the storage of intermediate results. The order in which scheduling and register allocation are performed can have a large impact on the resulting code quality [6]. The scheduler may bypass the register file in some cases, which reduces the register pressure. Allocating registers after scheduling is a logical choice for the CGRA compiler, because a lower register pressure makes it easier to allocate registers. Finally the prologue and epilogue are inserted to handle the entry and exit points of the function that is being compiled. Normally these instructions can be inserted directly at the beginning and end of the generated code; for the CGRA an extra scheduling step is required because LLVM's prologue and epilogue pass cannot schedule operations for the CGRA. After this the code generation is complete and the assembly text can be printed to the output file. The red borders in figure 4 indicate which steps required the most changes to support the CGRA.

Figure 4: Ordering of scheduling passes in the compiler (LLVM IR, instruction selection, scheduling and packetizing, register allocation, prologue/epilogue insertion, scheduling of inserted instructions, assembly printing, assembly code)

A. Compiler framework adaptation

LLVM uses instructions that operate on registers, memory locations and immediate values. In contrast, the instructions of the CGRA contain information about which source ports operands come from and which destination port the result needs to be written to. A dedicated register class can be used to model the input and output ports of a function unit in such a way that the architecture still adheres to the expected model. Instruction selection transforms standard LLVM instructions into pseudo instructions that still operate on registers and immediates as usual. A custom scheduler pass then transforms them into real instructions and sets the correct source and destination ports. Since the prologue/epilogue pass inserts new pseudo instructions into the existing schedule, another scheduling round needs to run afterwards. Both scheduling passes use the same code. The backend currently uses the existing register allocator of LLVM. However, to support multiple RFs, another register allocator needs to be implemented. More information on this topic can be found in section V-G. Only a few assumptions can be made to simplify the backend organization, since the structure of the CGRA processor instance is not fixed. The compiler for this architecture should be very generic, because it must be able to generate code for all sane processor configurations (i.e. configurations that have a complete instruction set and sufficient connectivity for the target program).

B. Explicit bypassing support

A list scheduler can be used to schedule for explicit bypassing architectures after some modifications. One approach is to use the list scheduler followed by an additional pass which checks for bypassing possibilities, as is done in [15]. Applying this method to the CGRA architecture template is difficult, because there is no guarantee that results can be written to the register file immediately.
This approach also does not provide a solution to handle the flexibility of the architecture. Another possibility is to allocate the bypass registers before instruction scheduling, as is done in [10]. This has the advantage that the normal scheduling algorithm can already take the bypassed connections into consideration and is more likely to find good schedules. Finally, a third possibility is to use a scheduler which automatically assigns the bypass registers during the instruction scheduling process. This produces a schedule for which the register file entries still need to be allocated, but which efficiently uses the bypassing opportunities of the program. Of all approaches, this one is the most flexible. A list scheduler which implements this third option was chosen because it is very flexible. This scheduler uses a resource graph to determine the availability of function units and bypass registers during the scheduling process. The resource graph is constructed based on the CGRA configuration, described in an XML file. The configuration description includes which function units are present and how they are connected. The loading of this resource graph is completely custom for the CGRA backend, as LLVM normally does not offer such fine-grained retargetability to backends. By design, a single backend will only support a fixed set of architecture variations, which provides much less flexibility than the CGRA template requires.

C. Operation based scheduling algorithm

The instruction scheduling passes of LLVM currently only provide cycle based instruction scheduling algorithms and are not able to generate schedules for explicit datapath architectures, for which an operation based scheduling algorithm is more suitable. Generating these schedules requires more bookkeeping than what is done in the current schedulers. An explicit datapath architecture scheduler needs to insert extra operations to keep values alive if they are needed later on. The destination of the operand is unknown at that moment, because the scheduler does not know when and where the consuming operation will be scheduled. An operation-based scheduler can postpone routing the produced operand, or reroute the operand when its destination is known. This results in better usage of resources. The adapted scheduling algorithm is shown in algorithm 2. The operand paths are found by Dijkstra's shortest path algorithm, similarly to what is done in [4]. Additionally, the shortest

path algorithm is used for function unit selection. For each existing operand, the shortest path from the producer of the operand to any other function unit is calculated. This way the shortest path algorithm for the first operand only needs to run once. The function unit with the lowest total cost is selected to execute the operation. The cost of a path is calculated using the weights of the edges in the resource graph. Immediate operands differ from register operands because they can be created anywhere in the schedule, as long as there is a path to the function unit that needs them. The location where an immediate value is produced is found by calculating the shortest path from the function unit that processes it to an immediate unit. This means that immediate operand paths are found when the processing function unit has already been selected. Another approach is to simply choose an immediate unit and try to find a path to the processing function unit. This may lead to extra resource usage, however.

Algorithm 2 Operation based scheduling algorithm
Input: DDG D of the basic block B and resource graph RG
Output: Scheduled instructions M of the basic block B
1: procedure SCHEDULE(D, M)
2:   R ← initialize_ready_list(D)
3:   while R ≠ ∅ do                      ▷ Schedule the basic block
4:     o ← select_next_operation(R, D)
5:     F ← find_available_FUs(o, RG)
6:     while true do                     ▷ Find a function unit
7:       f ← select_FU(F)
8:       if o can be scheduled on f then
9:         claim_resources(RG)
10:        R ← R \ {o}
11:        New_ops ← add_ready_ops(o, D) ▷ Add operations to ready list
12:        R ← R ∪ New_ops
13:        break
14:      else
15:        F ← F \ {f}
16:      end if
17:    end while
18:  end while
19: end procedure

When an operation produces a result that needs to be used by another operation, care must be taken that this result can still be routed when the consuming operation is scheduled. The scheduler should either be able to guarantee that the result can be routed, or be able to reschedule the operation that produced the result. If this is not the case, the scheduler may end up with an infeasible partial schedule. The most trivial way of guaranteeing that a result can later be consumed is to route all results to the register file, such that they can be read back later on. Although this guarantees that the operand can be routed later on, it also allocates more resources than strictly necessary. This may result in other operations being scheduled in a later cycle than if the resources had not been allocated. Another approach is to use max-flow graphs as described in [4]. This method guarantees that enough resources are available to route all live variables to a register file, without actually allocating the resources. This improves the schedule quality, but also increases the scheduling time, since for each operation a graph needs to be constructed and the max-flow of the graph needs to be calculated. In some situations the result cannot be routed to the register file, but it may still be possible to use the result, for example as input to an LSU. The max-flow graph approach will prevent the producing instruction from being scheduled, which makes this approach slightly pessimistic. A backtracking scheduler can be used to recover from infeasible partial schedules, since it always has the option to unschedule an operation whose result cannot be routed to its destination. Care must be taken that it still terminates when no solution can be found.
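A C++ rendering of the main loop of algorithm 2 is sketched below. All types and helper functions are assumptions standing in for the actual data structures of the backend; the routing guarantees discussed above would be enforced inside claim_resources.

    #include <algorithm>
    #include <set>
    #include <vector>

    struct Operation;     // node of the data dependence graph (DDG)
    struct FunctionUnit;  // candidate execution resource in the resource graph
    struct DDG;
    struct ResourceGraph;

    // Helpers corresponding one-to-one to the pseudocode; bodies omitted.
    std::set<Operation *> initialize_ready_list(const DDG &d);
    Operation *select_next_operation(const std::set<Operation *> &r, const DDG &d);
    std::vector<FunctionUnit *> find_available_fus(const Operation *o,
                                                   const ResourceGraph &rg);
    FunctionUnit *select_fu(const std::vector<FunctionUnit *> &f);  // lowest path cost
    bool can_schedule(const Operation *o, const FunctionUnit *f,
                      const ResourceGraph &rg);
    void claim_resources(const Operation *o, const FunctionUnit *f, ResourceGraph &rg);
    std::vector<Operation *> add_ready_ops(const Operation *o, const DDG &d);

    void schedule(const DDG &d, ResourceGraph &rg) {
        std::set<Operation *> ready = initialize_ready_list(d);
        while (!ready.empty()) {                     // schedule the basic block
            Operation *o = select_next_operation(ready, d);
            std::vector<FunctionUnit *> fus = find_available_fus(o, rg);
            while (true) {                           // find a function unit for o
                FunctionUnit *f = select_fu(fus);
                if (can_schedule(o, f, rg)) {
                    claim_resources(o, f, rg);       // also reserves operand paths
                    ready.erase(o);
                    for (Operation *n : add_ready_ops(o, d))
                        ready.insert(n);             // successors that became ready
                    break;
                }
                // As in the pseudocode, o is assumed to fit on some unit
                // eventually; a real implementation would handle an empty
                // candidate set here.
                fus.erase(std::find(fus.begin(), fus.end(), f));
            }
        }
    }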
Finally, an Integer Linear Programming or Constraint Programming approach can be used, but these have much higher compile times. For the CGRA compiler the first approach was chosen because it is the fastest one.

D. Shortest path algorithm

Dijkstra's algorithm can be used to find single-source shortest paths in graphs with positive weights. Figure 5 shows a simplified resource graph for which the shortest path from the source node (red) to all other nodes is calculated. The colors of the nodes indicate the distance from the source node to that node: the darker a node is, the further away it is from the source. White nodes cannot be reached from the source node. The numbers in the nodes indicate the order in which the nodes are visited. Note that only one unreachable node needs to be visited; as soon as this node is visited, the algorithm knows that all remaining nodes are unreachable. Min-heaps are often used to store the distances of the nodes for Dijkstra's algorithm. This results in a running time of O((V + E) log V) [16], where V is the number of vertices and E is the number of edges in the graph. Unfortunately the Standard Template Library (STL) does not support the decrease-key operation for heaps. One solution is to rebuild the heap each time a distance value is changed, but this results in a running time of O(V²). Another option is switching to a library that does support the decrease-key operation; instead, another approach was tried. The purpose of the min-heap in Dijkstra's algorithm is to make sure nodes are visited in a correct order. For Directed Acyclic Graphs (DAGs) this order can also be determined by topologically ordering the graph, which can be done in O(V + E) [16]. Since the heap is then no longer needed, the running time of Dijkstra's algorithm reduces to O(V + E). Note that the resource graph may not be a DAG if there are function units in the configuration that allow bypassing the pipeline registers. Routing operands in a cycle does not make sense, however, so such a cycle can be removed before running Dijkstra's algorithm.
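The following generic C++ sketch shows the resulting O(V + E) single-source shortest path computation: with the nodes numbered in topological order, every edge is relaxed exactly once and no priority queue is needed. In the resource graph the cycle index essentially provides such an order, since no edges go back in time. This is illustrative code, not the backend's implementation.

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct Edge { int to; int weight; };  // weight >= 0, as in the resource graph

    // Shortest distances from `source` in a DAG whose nodes are numbered in
    // topological order, i.e. adj[u] only contains edges u -> v with v > u.
    // Runs in O(V + E): each node and each edge is processed exactly once.
    std::vector<long long> dag_shortest_paths(
            const std::vector<std::vector<Edge>> &adj, int source) {
        const long long kInf = std::numeric_limits<long long>::max();
        std::vector<long long> dist(adj.size(), kInf);  // kInf = unreachable
        dist[source] = 0;
        for (std::size_t u = source; u < adj.size(); ++u) {  // index order
            if (dist[u] == kInf)
                continue;  // u is unreachable, nothing to relax
            for (const Edge &e : adj[u])
                dist[e.to] = std::min(dist[e.to], dist[u] + e.weight);
        }
        return dist;
    }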

Figure 5: Regular approach of Dijkstra's algorithm. The source node has a red border. Numbers in the nodes indicate the order in which the nodes are visited; nodes without a number are never visited. Colors indicate the distance of nodes to the source: the higher the distance, the darker the node. White nodes are unreachable from the source.

A topological ordering of a graph is typically not unique. One way of obtaining a topological ordering is to run Depth First Search (DFS) on the entire graph. Another topological ordering can be found by running the topological ordering algorithm on the nodes of a single cycle of the graph and repeating this order for every cycle. This is possible because there are no edges going back in time. The advantage of the latter approach is that the topological ordering algorithm only needs to run once on a small subgraph for an entire program.

Dijkstra's algorithm can be used to find the shortest path from the source to a single destination, or from the source to all other reachable nodes in the graph. Here the algorithm is used to find suitable function units to schedule an operation upon, so the destination is not known. Calculating the paths from the source to all other nodes is not necessary most of the time, however: if a suitable function unit is located two cycles after the source, all distances beyond that cycle are calculated in vain. The solution is to calculate distances incrementally, as shown in figure 6. The search starts at the source. Distances are calculated for search windows of a few cycles (between the dotted lines). After calculating the distances, the scheduler tries to schedule the selected operation onto function units in the search window. If no function unit can perform the operation, later windows are tried until a function unit is found that can execute the operation. While searching, the resource graph is expanded when necessary. Changing the search window size can affect the schedules. If the weights in the resource graph are configured in such a way that delaying an operation is preferred, a small search window size may prevent preferred nodes from being selected because they are not included in the search window.

Figure 6: Incremental approach of Dijkstra's algorithm. Red dotted lines indicate search windows. White nodes are unreachable from the source. Nodes without numbers have not been visited (yet).

When a path needs to be found for an operand, the location where it was first defined is used as the source node for the shortest path algorithm. During scheduling the graph expands and some parts of it become allocated. This causes the paths of operands that were defined early in the graph but still need to be scheduled to become longer, which results in a longer compile time. To reduce the compile time for large schedules, the operand source locations can be changed to reduce the path lengths. A simple heuristic is implemented which relocates all operands that are still needed to the latest cycle every time the graph is enlarged to a multiple of X cycles, where X is an option of the compiler. The heuristic is very effective in reducing the compile time, but it also reduces the schedule quality. Relocating the operands to cycles earlier than the last cycle is expected to result in better schedules.
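The incremental search could be organized along the following lines. This is a sketch; the helper functions, the window growth policy and the termination bound are all assumptions (a real implementation would bound the schedule length so the loop always terminates).

    #include <optional>

    struct Operation;
    struct FunctionUnit;
    struct ResourceGraph;

    // Assumed helpers: extend the shortest path distances over one window,
    // grow the resource graph lazily, and probe the window for a usable FU.
    void relax_window(ResourceGraph &rg, int from_cycle, int to_cycle);
    void expand_graph_to(ResourceGraph &rg, int cycle);
    std::optional<FunctionUnit *> try_schedule_in(const Operation *o,
                                                  ResourceGraph &rg,
                                                  int from_cycle, int to_cycle);

    // Incremental variant of the function unit search: distances are computed
    // one search window at a time, so cycles beyond the chosen unit are never
    // examined. max_cycle bounds the search so that it always terminates.
    FunctionUnit *find_fu(const Operation *o, ResourceGraph &rg,
                          int source_cycle, int window_size, int max_cycle) {
        for (int from = source_cycle; from < max_cycle; from += window_size) {
            int to = from + window_size;
            expand_graph_to(rg, to);   // make sure the window exists in the graph
            relax_window(rg, from, to);
            if (auto fu = try_schedule_in(o, rg, from, to))
                return *fu;
        }
        return nullptr;  // no unit found within the bound
    }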
Figure 7: Control Flow Graph with code that is hard to schedule for multiple register file configurations. Block Entry branches to blocks If then and If else, which both define vreg1 and join in block Tail, where vreg1 is used.

E. Scheduling order heuristics

The list scheduler uses a heuristic to determine the order in which operations are scheduled. Many heuristics are described in the literature [17]. Which heuristic is chosen depends on the goal that needs to be achieved; the goals of the CGRA compiler are optimizing for execution time and energy. Two heuristics were tried for the CGRA compiler: largest height first and least slack first. Figure 8 shows a dependence graph with the height and slack annotated in the nodes. The height is defined as the distance from a node to the root node. For the slack heuristic we first calculate the As Soon As Possible (ASAP) and As Late As Possible (ALAP) times of the nodes. ASAP times are calculated according to equation (1); ALAP times are calculated according to equation (2). The slack of a node is given by equation (3).

ASAP_i = 0 if i is a leaf node; otherwise ASAP_i = max over all parents p of i of (ASAP_p + 1).   (1)
ALAP_i = ASAP_i if i is the root node; otherwise ALAP_i = min over all children c of i of (ALAP_c - 1).   (2)
SLACK_i = ALAP_i - ASAP_i.   (3)

Slack is calculated before scheduling, so scheduling decisions do not affect the slack of nodes in the graph. This is cheaper than recalculating the slack after scheduling an operation, but may result in worse schedules.

Figure 8: Dependence graph annotated with the height and slack of its nodes.

Figure 9: Dependence graphs with different scheduling orders for the height and slack heuristics: (a) possible height order, (b) possible slack order.

Figure 10: Dependence graph with the same scheduling order for both the height and slack heuristics.

The CGRA compiler must guarantee that an operand can arrive at its destination. To accomplish this, paths may be reserved temporarily, which may cause other operations to be delayed because the required resources are occupied. When the last use of an operand is scheduled, the temporarily reserved paths can be released. The scheduling order heuristic decides which operation is scheduled next, and hence decides which temporary path can be freed, if any. In this way the scheduling order heuristic influences the schedule quality. Figure 9 shows two dependence graphs with scheduling orders for the height and slack heuristics. The slack heuristic schedules the operations in such an order that live operands are killed before new live operands are produced. The height heuristic can obtain the same ordering, but this is not guaranteed. Figure 10 shows a dependence graph for which the slack heuristic may not find the best ordering: because all operations have zero slack, it is possible that it will obtain the same ordering as the height heuristic. Adding a tiebreak to the slack heuristic may result in better schedules. A least-height-first tiebreak would force a node lower in the graph to be scheduled once its parents have been scheduled. The problem remains that the heuristic does not know which top nodes to schedule to free a node lower in the graph. This can be resolved by adding the number of uncovered children as a second tiebreaker: the number of uncovered children indicates how many nodes will become schedulable as a result of scheduling that node. Finally, the number of remaining uses may be used as a metric. Scheduling operations with few uses is preferred, because scheduling those operations results in resources being released.
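For concreteness, ASAP, ALAP and slack as defined in equations (1) to (3) can be computed in two linear passes over a topologically ordered dependence graph. The sketch below is a generic illustration, not the compiler's implementation:

    #include <algorithm>
    #include <vector>

    // Dependence graph in topological order: preds[i] lists the parents
    // (producers) of node i, succs[i] its children (consumers).
    struct DepGraph {
        std::vector<std::vector<int>> preds, succs;
    };

    // Equations (1)-(3): ASAP from the leaves downward, ALAP from the root
    // upward, slack as the difference. Node indices form a topological order.
    void compute_slack(const DepGraph &g, std::vector<int> &asap,
                       std::vector<int> &alap, std::vector<int> &slack) {
        const int n = static_cast<int>(g.preds.size());
        if (n == 0) return;
        asap.assign(n, 0);
        for (int i = 0; i < n; ++i)       // equation (1): leaves start at 0
            for (int p : g.preds[i])
                asap[i] = std::max(asap[i], asap[p] + 1);

        const int cp = *std::max_element(asap.begin(), asap.end());  // critical path
        alap.assign(n, cp);               // the root node gets ALAP == ASAP
        for (int i = n - 1; i >= 0; --i)  // equation (2), in reverse order
            for (int s : g.succs[i])
                alap[i] = std::min(alap[i], alap[s] - 1);

        slack.assign(n, 0);
        for (int i = 0; i < n; ++i)       // equation (3)
            slack[i] = alap[i] - asap[i];
    }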
F. Enabling software pipelining in LLVM

As demonstrated with the example application, software pipelining is a key optimization technique for architectures exploiting Instruction Level Parallelism (ILP). A target independent software pipelining pass can be used to redistribute the operations of loops [18], [19]. The advantage of this approach is that it is target independent and requires less work than building a custom software pipelining scheduler. The actual scheduling is left to the normal scheduling algorithm. The current compiler assumes that live-in virtual registers

reside in the register file, because no mechanism exists to indicate what the location of a virtual register is. With such a mechanism it would be possible to pass virtual registers in the pipeline registers of a function unit across basic block boundaries. To determine which virtual registers should be passed in pipeline registers, an additional pass can be run before the scheduler. If the Control Flow Graph (CFG) of figure 11 were processed by such a pass, it would need to mark vreg2 as live-out of block Entry and as live-in and live-out of block Loop.

Figure 11: Control Flow Graph of a loop with live-in and live-out registers. Block Entry defines vreg1 and vreg2; block Loop uses and redefines vreg2; block End uses vreg1.

G. Supporting multiple (distributed) register files

The architecture allows configurations with multiple register files to be built. Supporting multiple register files is not trivial when using LLVM. Figure 7 shows a CFG of an application in which the same virtual register is defined in multiple basic blocks and used in another basic block. Eventually vreg1 needs to be assigned to the same physical register. Since the scheduler allocates operand paths, it will also assign vreg1 to a register file. The scheduler needs to write vreg1 to the same register file in block If then as in block If else. Furthermore, it must read vreg1 in block Tail from that same register file. This technique can be referred to as integrated scheduling and register file allocation. After this pass a register allocator still needs to assign virtual registers to physical registers. LLVM does not provide data structures for storing the location of an operand, because it assumes there is only one register file. Besides adding this data structure, the scheduler and register allocator must be adapted to use it. Storage of virtual register locations can be done at function level. When a virtual register is defined that is live-out of a basic block, the scheduler must check whether it has already been assigned to a register file. If this is the case, the scheduler must route it to that location. If the virtual register has not been assigned to a location yet, the scheduler may greedily choose a register file to route the virtual register to. When a use of a virtual register is scheduled, the scheduler can look up the location of the virtual register. It is important that registers are divided equally between register files: a register file that is full can limit the scheduler, because no paths can be routed through that register file anymore. The weights on the edges between the nodes of a register file in the resource graph can be used to achieve an equal distribution. These weights may, for example, be made inversely proportional to the number of free registers left in that register file, so that the cost of routing into a register file grows as it fills up. This causes emptier register files to get preference over fuller register files when scheduling operands to the register file.
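One possible reading of this balancing rule in code (illustrative only; the weight function and the bookkeeping of free registers are assumptions):

    // Cost of routing an operand into a register file, used as the weight of
    // the edges towards that RF in the resource graph. Emptier register files
    // get a lower cost and are therefore preferred by the shortest path search.
    struct RegisterFileState {
        int capacity;   // total number of registers in this RF
        int allocated;  // registers currently holding live values
    };

    double rf_edge_weight(const RegisterFileState &rf) {
        const int free_regs = rf.capacity - rf.allocated;
        if (free_regs <= 0)
            return 1e9;            // effectively blocks routing through a full RF
        return 1.0 / free_regs;    // inversely proportional to free registers
    }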
H. Target specific energy efficiency optimizations

The weights on the edges between the nodes of the resource graph determine which decisions the scheduler makes. This can be used to steer the scheduler towards energy efficient schedules. Many operands are used only a few times in a schedule. Reading an operand twice from the register file may consume more energy than reading the value once and using an ALU to store the value until it is needed. When using ALUs to route an operand, energy may be saved by using a single ALU for a few cycles instead of sending the value to another ALU every cycle: fewer instruction bits will toggle if a single ALU is used. When making a choice between function units, preference may be given to the function unit with the lowest output connectivity, since transferring the result operand to the other function units will then charge fewer capacitances. Scheduling for low energy consumption is presented in [20]. In that paper, energy tables are used which indicate how much it costs to transition from one instruction to another. These energy tables could be used in the same way to calculate which operand path is the cheapest. In [20] a cycle-based scheduler is used; if an operation based scheduler is used, such as in the CGRA compiler, the energy cost calculations are less accurate. This is because the scheduler calculates how much it costs to schedule operation B after operation A, but it may replace operation A at a later point in the scheduling process.

The current scheduler assigns opcodes, operand sources and an operand destination to function units which need to execute some task. By default, function units that are not used execute a No Operation (NOP). Keeping instructions the same as in the previous cycle may be a better approach to reduce energy consumption: if an instruction does not change between cycles, fewer bits toggle in the instruction fetch and decode stage. Care must be taken that the saved energy is not lost during the execution phase. When executing NOPs, the output of a function unit remains the same. If instead an instruction is executed which changes the output, more energy is used than when executing the NOP. Figure 12 shows two schedules that provide the same functionality; the schedule in figure 12a can be converted to the schedule in figure 12b to save energy. In this case the entire add operation, including the production of an immediate and the loading of a register, can be copied to the next cycle. If R1 were written during cycle 2, the choice would be less trivial, as this would introduce extra toggling in the ALU. A heuristic needs to be found that can make this trade-off.

VI. EVALUATION

A. Supported features

The current compiler is able to compile four benchmarks. A calling convention is implemented which allows the compiler

to provide basic function call support. Code can be generated for architectures that contain multiple register files, although values cannot be kept in multiple register files across basic blocks due to limitations of the register allocator. Testing the compiler showed that the weights on the edges of the resource graph must be chosen wisely when implementing the operation based scheduler: in some cases, scheduling the first operand of an operation makes it impossible to schedule the second operand, because the path of the first operand blocks the path of the second. Currently the scheduler is included as a separate pass in the LLVM toolflow, which replaces LLVM's own scheduling algorithm. In order to better support multiple register files in the future, and to further improve the explicit bypassing utilization in the schedule, the register allocator should be included in the scheduler pass. This allows for rescheduling when register allocation fails, while the operation based scheduling information is still available. Further recent additions to LLVM, such as the upcoming support for software pipelining, are also promising. The proposed implementation for the Hexagon target [18] seems to match relatively well with the current organization of the CGRA backend, and it is our expectation that it can be integrated without too many complications. The main additions required to the CGRA backend will be several target hooks for supplying information about the available parallel function units in the configured architecture.

Figure 12: Optimizing a schedule for energy. (a) Initial schedule; (b) optimized schedule in which the add operation is kept in place for an extra cycle.

B. Benchmark set

The tested benchmarks are listed in table I. For matrix multiplication, matrices of size 2x2 are used due to register file constraints. The entire kernel is unrolled for this benchmark to provide more ILP to the scheduler; the other benchmarks exhibit little ILP. More benchmarks were tried but could not run, either due to the limited number of registers available or because they require operations that are not yet supported, such as variable amount shifts and select operations. Figure 13 shows the architecture configuration that was used to test all benchmarks. The arrows in the figure indicate data connections between the function units. An IF/ID unit is assigned to every function unit for control.

Figure 13: Architecture configuration used for the benchmarks (ABU, MUL, RF, ALU, LSU, IMM)

Compile time tests are performed for both the normal shortest path algorithm and the improved incremental version. The measured time indicates how long it takes to compile the entire program from LLVM IR to assembly text. The compiler is built in release mode for this test. The scheduling order heuristic used for the compile time test is largest-height-first. Schedule quality tests are performed to compare the largest-height-first and slack heuristics. The code size, the utilization of the function units and the register usage are measured only for the kernels of the benchmarks. Statistics are derived from the

schedule, not by running the code in simulation. If runtime statistics were used, a small difference in the schedule would be amplified if that part of the code were executed many times. Although this is a good way to see what effect a certain change in the schedule has, it makes it harder to see how much the scheduler actually contributed to this change.

TABLE I: Benchmarks tested on the CGRA, listing the number of operations, the number of basic blocks, and the operations per basic block for Bubble sort, Insertion sort, Fibonacci and Matrix mult.

C. Experimental results

Table II shows the results of the compile-time test. The shortest path v1 compiler uses the quadratic-time shortest path algorithm, which searches the entire resource graph; the shortest path v2 compiler uses the linear-time shortest path algorithm with the incremental approach. As expected, the second version performs much better. The search window size of the incremental algorithm has a very small impact on the schedule quality: only in one benchmark was a small difference measured between window sizes of 1 and 2 cycles, and larger window sizes did not affect the schedules. The compile times of the slow compiler are not an issue if some code needs to be compiled during the development of a program. When architecture configuration Design Space Exploration (DSE) is performed, however, the same program may be compiled numerous times to discover which configuration is best for that program. In that case the speedup of the compiler is very welcome. The X86 compiler is included in the table as a reference. The results of our compiler are reasonable considering the higher complexity of the scheduling problem.

TABLE II: Compile times of benchmarks
Benchmark        Shortest path v1   Shortest path v2   Speedup   X86 compiler
Bubble sort      614 ms             41 ms
Insertion sort   271 ms             27 ms
Fibonacci        121 ms             23 ms
Matrix mult      1601 ms            42 ms

Figure 14 shows the code size of the benchmarks for both the largest-height-first and slack scheduling order heuristics. The performance of the two heuristics is very similar. The quality of the schedules is hard to quantify, because no scheduler exists yet that can produce optimal schedules for this architecture.

Figure 14: Performance of the scheduling order heuristics in terms of code size (#instructions) for Bsort, Ins. sort, Fibonacci and Matrix mult.

Figure 15 shows the percentage of register operations that were eliminated using both heuristics. An approximation is made of how many register operations would be required if the register file were not bypassed. The approximation is done by counting the operations in the generated schedule and calculating how many read and write operations they would require. The approximated number of required register operations is then used to calculate what percentage of register operations is eliminated by bypassing the register file. The approximation does not take moves of immediate values to registers into account, because these cannot be extracted from the schedule. This explains why bubble sort uses more register operations than the approximated upper bound. Overall the slack heuristic scores better than the height heuristic. This is expected, because the slack heuristic has a higher chance of scheduling operations that consume operands before scheduling operations that produce new operands, which raises the chance that the register file can be bypassed.
Figure 15: Performance of the scheduling order heuristics in terms of eliminated register operations (%), showing the reads and writes eliminated by the height and slack heuristics for Bsort, Ins. sort, Fibonacci and Matrix mult.

The average utilization of the function units for both scheduling order heuristics is shown in figure 16.

Figure 16: Utilization of the function units (%) per benchmark, for the ABU, IMM, ALU, ALU (including pass), RF, LSU, MUL and the average (including pass).

The utilization is calculated by the following formula:

Utilization = (#Operations - (#NOPs + #Passes)) / #Operations

Pass operations are operations that are used for routing values and are performed only by the ALU. Since they do not contribute to the actual program, they are not counted towards the utilization. The ALU may also use NOPs in the routing process; these NOPs must stay in the schedule, because otherwise the output of the ALUs would be overwritten. If pass operations were counted when calculating the utilization, a schedule A that routes with many pass operations would seem to yield a higher ALU utilization than a schedule B that routes with many NOPs. Most of the results shown in figure 16 can be explained rather easily. The utilization of the multiply unit is zero for most benchmarks, because these benchmarks do not perform multiply operations at all. The utilization of the ABU is correlated with the number of instructions per basic block, shown in table I: programs with fewer operations per basic block contain relatively more branch and jump operations, which results in a higher utilization of the ABU. The average utilization is also correlated with the number of instructions per basic block. Although the scheduler can fill delay slots, this does not happen very often. Having fewer operations per basic block results in basic blocks which execute in fewer cycles. The delay slot is always a fixed number of cycles, so the number of cycles spent in the delay slot becomes relatively larger for smaller basic blocks. Since the part of the basic block in which little work is done becomes relatively larger, the average utilization of the hardware decreases. A low hardware utilization indicates that the hardware is capable of performing more work than is required by the compiled program. Achieving a high average utilization with the current CGRA is an unrealistic goal. Unlike VLIW processors, the CGRA does not cluster function units; instead, every function unit is connected to a dedicated IF/ID unit. This causes a low average utilization, because there is an abundance of function units that can perform work, but not enough operations that can be mapped onto those function units. The utilization of the register file is relatively high for all benchmarks. The register file has only one read port; if an operation requires two operands that are both in the register file, two reads are required to provide the operands of a single operation. The register file can be bypassed in some cases, but when a value is used multiple times, it is hard to keep it in pipeline registers. The following hardware modifications could help to create configurations that can bypass the register file more often:

- Add routing capabilities to other function units.
- Add extra input connections to function units.

Many of the function units already have pipeline registers, but only the ALU and the register file can be used for routing. Adding routing capabilities to other function units makes it possible to bypass the register file more often. The multiplier and the LSU are the most logical candidates to add this feature to, since they are often already connected to the register file and an ALU.
It should be investigated how much extra hardware is required to support this feature and how much the schedules would benefit from it. When generating simple configurations, the ALU and register file use almost all of their input connections, as can be seen in figure 13. Adding more ALUs will not improve the schedules much, because they cannot be connected in a proper way. Similarly, using multiple output ports of a single ALU is problematic because of the limited number of input ports.

VII. FUTURE WORK

In this project a working compiler is delivered. The following features can be added to enlarge the set of programs that can be compiled:

- Variable amount shifts
- Select operations
- More extensive support for distributed register files

The CGRA hardware only supports shifts by a fixed amount. Variable amount shifts can be supported by inserting loops which execute fixed amount shifts. This feature is present in the backend for another architecture (AVR) and should be ported to the CGRA backend. Select operations use the internal flag of the ALU as a third operand. This flag is not yet included in the hardware model, so the select operation is not yet supported. A method must be implemented to save and restore the flag, since the internal flag can be overwritten by other operations. This is more complex than temporarily saving regular values to the register file, because the flag does not appear on the output of the ALU. Saving the flag can be done by performing a conditional move on immediate


Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Lecture Compiler Backend

Lecture Compiler Backend Lecture 19-23 Compiler Backend Jianwen Zhu Electrical and Computer Engineering University of Toronto Jianwen Zhu 2009 - P. 1 Backend Tasks Instruction selection Map virtual instructions To machine instructions

More information

Module 5 - CPU Design

Module 5 - CPU Design Module 5 - CPU Design Lecture 1 - Introduction to CPU The operation or task that must perform by CPU is: Fetch Instruction: The CPU reads an instruction from memory. Interpret Instruction: The instruction

More information

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1 Design of Embedded DSP Processors Unit 2: Design basics 9/11/2017 Unit 2 of TSEA26-2017 H1 1 ASIP/ASIC design flow We need to have the flow in mind, so that we will know what we are talking about in later

More information

A Feasibility Study for Methods of Effective Memoization Optimization

A Feasibility Study for Methods of Effective Memoization Optimization A Feasibility Study for Methods of Effective Memoization Optimization Daniel Mock October 2018 Abstract Traditionally, memoization is a compiler optimization that is applied to regions of code with few

More information

Chapter 3 (Cont III): Exploiting ILP with Software Approaches. Copyright Josep Torrellas 1999, 2001, 2002,

Chapter 3 (Cont III): Exploiting ILP with Software Approaches. Copyright Josep Torrellas 1999, 2001, 2002, Chapter 3 (Cont III): Exploiting ILP with Software Approaches Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Exposing ILP (3.2) Want to find sequences of unrelated instructions that can be overlapped

More information

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

Vertex Shader Design I

Vertex Shader Design I The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

TKT-3526 Processor Design

TKT-3526 Processor Design TKT-3526 Processor Design Compiling for TTA Architectures Vladimir Guzma 1 Compilation for TTA Compilation is process of transferring algorithm written in High Level Language (HLL) to machine code Compiler

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Status of the Bound-T WCET Tool

Status of the Bound-T WCET Tool Status of the Bound-T WCET Tool Niklas Holsti and Sami Saarinen Space Systems Finland Ltd Niklas.Holsti@ssf.fi, Sami.Saarinen@ssf.fi Abstract Bound-T is a tool for static WCET analysis from binary executable

More information

Project - RTL Model. III. Design Decisions. Part 4 - May 17, Jason Nemeth. model is capable of modeling both provided benchmarks.

Project - RTL Model. III. Design Decisions. Part 4 - May 17, Jason Nemeth. model is capable of modeling both provided benchmarks. Project - RTL Model Part 4 - May 17, 2005 Jason Nemeth I. Introduction The purpose of this assignment was to refine the structural model of the microprocessor into a working RTL-level VHDL implementation.

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Evaluating Inter-cluster Communication in Clustered VLIW Architectures

Evaluating Inter-cluster Communication in Clustered VLIW Architectures Evaluating Inter-cluster Communication in Clustered VLIW Architectures Anup Gangwar Embedded Systems Group, Department of Computer Science and Engineering, Indian Institute of Technology Delhi September

More information

Computer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović Computer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović Name: ANSWER SOLUTIONS This is a closed book, closed notes exam. 80 Minutes 17 Pages Notes: Not all questions

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

register allocation saves energy register allocation reduces memory accesses.

register allocation saves energy register allocation reduces memory accesses. Lesson 10 Register Allocation Full Compiler Structure Embedded systems need highly optimized code. This part of the course will focus on Back end code generation. Back end: generation of assembly instructions

More information

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 Name: Solutions (please print) 1-3. 11 points 4. 7 points 5. 7 points 6. 20 points 7. 30 points 8. 25 points Total (105 pts):

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &

More information

MSc THESIS. Design-time and Run-time Reconfigurable Clustered ρ-vex VLIW Softcore Processor. Muez Berhane Reda. Abstract

MSc THESIS. Design-time and Run-time Reconfigurable Clustered ρ-vex VLIW Softcore Processor. Muez Berhane Reda. Abstract Computer Engineering Mekelweg 4, 2628 CD Delft The Netherlands http://ce.et.tudelft.nl/ 2014 MSc THESIS Design-time and Run-time Reconfigurable Clustered ρ-vex VLIW Softcore Processor Muez Berhane Reda

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Compilers and Code Optimization EDOARDO FUSELLA

Compilers and Code Optimization EDOARDO FUSELLA Compilers and Code Optimization EDOARDO FUSELLA Contents Data memory layout Instruction selection Register allocation Data memory layout Memory Hierarchy Capacity vs access speed Main memory Classes of

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization Mian-Muhammad Hamayun, Frédéric Pétrot and Nicolas Fournel System Level Synthesis

More information

Computer Architecture V Fall Practice Exam Questions

Computer Architecture V Fall Practice Exam Questions Computer Architecture V22.0436 Fall 2002 Practice Exam Questions These are practice exam questions for the material covered since the mid-term exam. Please note that the final exam is cumulative. See the

More information

A Phase-Coupled Compiler Backend for a New VLIW Processor Architecture Using Two-step Register Allocation

A Phase-Coupled Compiler Backend for a New VLIW Processor Architecture Using Two-step Register Allocation A Phase-Coupled Compiler Backend for a New VLIW Processor Architecture Using Two-step Register Allocation Jie Guo, Jun Liu, Björn Mennenga and Gerhard P. Fettweis Vodafone Chair Mobile Communications Systems

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

Compiler Architecture

Compiler Architecture Code Generation 1 Compiler Architecture Source language Scanner (lexical analysis) Tokens Parser (syntax analysis) Syntactic structure Semantic Analysis (IC generator) Intermediate Language Code Optimizer

More information

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as 372 Chapter 16 Code Improvement 16.10 Exercises 16.1 In Section 16.2 we suggested replacing the instruction r1 := r2 / 2 with the instruction r1 := r2 >> 1, and noted that the replacement may not be correct

More information

A Framework for Space and Time Efficient Scheduling of Parallelism

A Framework for Space and Time Efficient Scheduling of Parallelism A Framework for Space and Time Efficient Scheduling of Parallelism Girija J. Narlikar Guy E. Blelloch December 996 CMU-CS-96-97 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523

More information

Computer Science 246 Computer Architecture

Computer Science 246 Computer Architecture Computer Architecture Spring 2009 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Compiler ILP Static ILP Overview Have discussed methods to extract ILP from hardware Why can t some of these

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods 10-1 Dynamic Scheduling 10-1 This Set Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods Not yet complete. (Material below may

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are

More information

Last time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls

Last time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls CS 6354: Branch Prediction (con t) / Multiple Issue 14 September 2016 Last time: scheduling to avoid stalls 1 Last time: forwarding/stalls add $a0, $a2, $a3 ; zero or more instructions sub $t0, $a0, $a1

More information

Unit 2: High-Level Synthesis

Unit 2: High-Level Synthesis Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

High performance, power-efficient DSPs based on the TI C64x

High performance, power-efficient DSPs based on the TI C64x High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Question Bank Subject: Advanced Data Structures Class: SE Computer

Question Bank Subject: Advanced Data Structures Class: SE Computer Question Bank Subject: Advanced Data Structures Class: SE Computer Question1: Write a non recursive pseudo code for post order traversal of binary tree Answer: Pseudo Code: 1. Push root into Stack_One.

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Computer Organization MIPS Architecture. Department of Computer Science Missouri University of Science & Technology

Computer Organization MIPS Architecture. Department of Computer Science Missouri University of Science & Technology Computer Organization MIPS Architecture Department of Computer Science Missouri University of Science & Technology hurson@mst.edu Computer Organization Note, this unit will be covered in three lectures.

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining

More information

16.1. Unit 16. Computer Organization Design of a Simple Processor

16.1. Unit 16. Computer Organization Design of a Simple Processor 6. Unit 6 Computer Organization Design of a Simple Processor HW SW 6.2 You Can Do That Cloud & Distributed Computing (CyberPhysical, Databases, Data Mining,etc.) Applications (AI, Robotics, Graphics, Mobile)

More information

Appendix C: Pipelining: Basic and Intermediate Concepts

Appendix C: Pipelining: Basic and Intermediate Concepts Appendix C: Pipelining: Basic and Intermediate Concepts Key ideas and simple pipeline (Section C.1) Hazards (Sections C.2 and C.3) Structural hazards Data hazards Control hazards Exceptions (Section C.4)

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture Computer Science 324 Computer Architecture Mount Holyoke College Fall 2009 Topic Notes: MIPS Instruction Set Architecture vonneumann Architecture Modern computers use the vonneumann architecture. Idea:

More information

TRIPS: Extending the Range of Programmable Processors

TRIPS: Extending the Range of Programmable Processors TRIPS: Extending the Range of Programmable Processors Stephen W. Keckler Doug Burger and Chuck oore Computer Architecture and Technology Laboratory Department of Computer Sciences www.cs.utexas.edu/users/cart

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

NISC Application and Advantages

NISC Application and Advantages NISC Application and Advantages Daniel D. Gajski Mehrdad Reshadi Center for Embedded Computer Systems University of California, Irvine Irvine, CA 92697-3425, USA {gajski, reshadi}@cecs.uci.edu CECS Technical

More information

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

More information

/ / / Net Speedup. Percentage of Vectorization

/ / / Net Speedup. Percentage of Vectorization Question (Amdahl Law): In this exercise, we are considering enhancing a machine by adding vector hardware to it. When a computation is run in vector mode on the vector hardware, it is 2 times faster than

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic

More information

Group B Assignment 9. Code generation using DAG. Title of Assignment: Problem Definition: Code generation using DAG / labeled tree.

Group B Assignment 9. Code generation using DAG. Title of Assignment: Problem Definition: Code generation using DAG / labeled tree. Group B Assignment 9 Att (2) Perm(3) Oral(5) Total(10) Sign Title of Assignment: Code generation using DAG. 9.1.1 Problem Definition: Code generation using DAG / labeled tree. 9.1.2 Perquisite: Lex, Yacc,

More information

Module 4c: Pipelining

Module 4c: Pipelining Module 4c: Pipelining R E F E R E N C E S : S T A L L I N G S, C O M P U T E R O R G A N I Z A T I O N A N D A R C H I T E C T U R E M O R R I S M A N O, C O M P U T E R O R G A N I Z A T I O N A N D A

More information

CPU ARCHITECTURE. QUESTION 1 Explain how the width of the data bus and system clock speed affect the performance of a computer system.

CPU ARCHITECTURE. QUESTION 1 Explain how the width of the data bus and system clock speed affect the performance of a computer system. CPU ARCHITECTURE QUESTION 1 Explain how the width of the data bus and system clock speed affect the performance of a computer system. ANSWER 1 Data Bus Width the width of the data bus determines the number

More information

Power Estimation of UVA CS754 CMP Architecture

Power Estimation of UVA CS754 CMP Architecture Introduction Power Estimation of UVA CS754 CMP Architecture Mateja Putic mateja@virginia.edu Early power analysis has become an essential part of determining the feasibility of microprocessor design. As

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language. Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central

More information

Background: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation

Background: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation Instruction Scheduling Last week Register allocation Background: Pipelining Basics Idea Begin executing an instruction before completing the previous one Today Instruction scheduling The problem: Pipelined

More information

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as

More information

Topic Notes: MIPS Instruction Set Architecture

Topic Notes: MIPS Instruction Set Architecture Computer Science 220 Assembly Language & Comp. Architecture Siena College Fall 2011 Topic Notes: MIPS Instruction Set Architecture vonneumann Architecture Modern computers use the vonneumann architecture.

More information

Code Generation. CS 540 George Mason University

Code Generation. CS 540 George Mason University Code Generation CS 540 George Mason University Compiler Architecture Intermediate Language Intermediate Language Source language Scanner (lexical analysis) tokens Parser (syntax analysis) Syntactic structure

More information

Efficient JIT to 32-bit Arches

Efficient JIT to 32-bit Arches Efficient JIT to 32-bit Arches Jiong Wang Linux Plumbers Conference Vancouver, Nov, 2018 1 Background ISA specification and impact on JIT compiler Default code-gen use 64-bit register, ALU64, JMP64 test_l4lb_noinline.c

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Intermediate Code & Local Optimizations

Intermediate Code & Local Optimizations Lecture Outline Intermediate Code & Local Optimizations Intermediate code Local optimizations Compiler Design I (2011) 2 Code Generation Summary We have so far discussed Runtime organization Simple stack

More information

09-1 Multicycle Pipeline Operations

09-1 Multicycle Pipeline Operations 09-1 Multicycle Pipeline Operations 09-1 Material may be added to this set. Material Covered Section 3.7. Long-Latency Operations (Topics) Typical long-latency instructions: floating point Pipelined v.

More information