
Enabling Hardware/Software Co-design in High-level Synthesis

by

Jongsok Choi

A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2012 by Jongsok Choi

Abstract

Enabling Hardware/Software Co-design in High-level Synthesis
Jongsok Choi
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2012

A hardware implementation can bring orders of magnitude improvements in performance and energy consumption over a software implementation. Hardware design, however, can be extremely difficult. High-level synthesis, the process of compiling software to hardware, promises to make hardware design easier. However, compiling an entire software program to hardware can be inefficient. This thesis proposes hardware/software co-design, where computationally intensive functions are accelerated by hardware, while the remaining program segments execute in software. The work in this thesis builds a framework in which user-designated software functions are automatically compiled to hardware accelerators, which can execute serially or in parallel to work in tandem with a processor. To support multiple parallel accelerators, new multi-ported cache designs are presented. These caches provide low-latency, high-bandwidth data to further improve the performance of accelerators. An extensive range of cache architectures is explored, and results show that certain cache architectures significantly outperform others in a processor/accelerator system.

Acknowledgements

First, I would like to thank my parents for raising me to be an independent individual and always supporting me in my life decisions. Without you guys, I would not be here. I would like to thank Professor Stephen Brown for giving me the opportunity to work in this research group and to be a part of such an intriguing research project. I would like to thank Professor Jason Anderson for the daily summer meetings, weekly status meetings, and for the many insightful ideas. Both of you have been amazing mentors and I look forward to working with both of you for the many years ahead. I would also like to thank Andrew Canis for the numerous discussions on and off the record. To much dismay, the debugging must go on. Lastly, I would like to thank my close friends, Dave, Ben, and Jin, for the many beers that we've had, which made many stressful nights bearable. I look forward to having many more with all of you.

Contents

1 Introduction
    Motivation
    Contributions
    Thesis Organization

2 Background
    LegUp
        Design Flow
    Hardware/Software Interface
        Altera Avalon Interface
        Xilinx Interconnects
            CoreConnect
            AXI Interconnect
        OpenCores Wishbone
    Related Work
    Summary

3 Sequential Execution
    System Architecture
        Data Cache Architecture
    Processor/Accelerator System Generation
        Software Flow
            Wrapper Function Generation
            Remaining Software Flow
        Hardware Flow
            Accelerator Architecture
            Processor/Accelerator Interface
            Accelerator/Cache Interface
        Controlling Altera SOPC Builder
    Experimental Methodology
        Benchmarks
    Results
    Summary

4 Parallel Execution
    Parallel Execution
        Memory Access Profiler
        Enabling Parallel Execution
        Parallel Wrapper Function
    Multi-ported Cache
        Live-Value Table Approach
            Original LVT approach
            Modified LVT approach
        Multi-Pumping
    Related Work
    Summary

5 Experiments for Parallel Execution
    Evaluated System Architectures
    Memory Configurations
    Parallel Benchmarks
    Experimental Methodology
    Results
    Heterogeneous Computing
        Data Partitioning
        Heterogeneous Results
    Summary

6 Conclusion
    Summary
    Future Work
        Synchronization Support for a Parallel Programming API
        XOR-based Multi-ported Memory
        Multiple Clock Domains
        Multiple Multi-ported Caches

A Cache Simulator
B Complete Benchmark Results for Sequential Execution
C Complete Benchmark Results for Parallel Execution
D Complete Benchmark Results for Heterogeneous Execution
Bibliography

List of Tables

2.1 Subset of C supported/unsupported by LegUp
Interface between Accelerator/Cache
Benchmarks used for Sequential Execution
Speed performance results
Area results
Cache configurations evaluated
Parallel benchmarks in LegUp
Baseline system results
Individual benchmark results
Input data partitions for Processor/Accelerator for Perfect Hash benchmark
Worst/Best/2nd Best Results for each Cache Architecture
A.1 Cycle count for CHstone benchmark for 8KB cache with 16/32/64B line sizes
A.2 Cycle count for CHstone benchmark for 16KB 2-way set-associative cache
B.1 Results for accelerating each function in Adpcm benchmark
B.2 Results for accelerating each function in Aes benchmark
B.3 Results for accelerating each function in Blowfish benchmark
B.4 Results for accelerating each function in Dfadd benchmark
B.5 Results for accelerating each function in Dfdiv benchmark
B.6 Results for accelerating each function in Dfmul benchmark
B.7 Results for accelerating each function in Dfsin benchmark
B.8 Results for accelerating each function in Dhrystone benchmark
B.9 Results for accelerating each function in Gsm benchmark
B.10 Results for accelerating each function in Jpeg benchmark
B.11 Results for accelerating each function in Motion benchmark
B.12 Results for accelerating each function in Sha benchmark
C.1 Results for sequential execution on Add benchmark
C.2 Results for parallel 2-port cache on Add benchmark
C.3 Results for parallel 4-port MP cache on Add benchmark
C.4 Results for parallel 4-port LVT cache on Add benchmark
C.5 Results for parallel 7-port LVT cache on Add benchmark
C.6 Results for sequential execution on Box filter benchmark
C.7 Results for parallel 2-port cache on Box filter benchmark
C.8 Results for parallel 4-port MP cache on Box filter benchmark
C.9 Results for parallel 4-port LVT cache on Box filter benchmark
C.10 Results for parallel 7-port LVT cache on Box filter benchmark
C.11 Results for sequential execution on Dot product benchmark
C.12 Results for parallel 2-port cache on Dot product benchmark
C.13 Results for parallel 4-port MP cache on Dot product benchmark
C.14 Results for parallel 4-port LVT cache on Dot product benchmark
C.15 Results for parallel 7-port LVT cache on Dot product benchmark
C.16 Results for sequential execution on GSMx6 benchmark
C.17 Results for parallel 2-port cache on GSMx6 benchmark
C.18 Results for parallel 4-port MP cache on GSMx6 benchmark
C.19 Results for parallel 4-port LVT cache on GSMx6 benchmark
C.20 Results for parallel 7-port LVT cache on GSMx6 benchmark
C.21 Results for sequential execution on Histogram benchmark
C.22 Results for parallel 2-port cache on Histogram benchmark
C.23 Results for parallel 4-port MP cache on Histogram benchmark
C.24 Results for parallel 4-port LVT cache on Histogram benchmark
C.25 Results for parallel 7-port LVT cache on Histogram benchmark
C.26 Results for sequential execution on Line of sight benchmark
C.27 Results for parallel 2-port cache on Line of sight benchmark
C.28 Results for parallel 4-port MP cache on Line of sight benchmark
C.29 Results for parallel 4-port LVT cache on Line of sight benchmark
C.30 Results for parallel 7-port LVT cache on Line of sight benchmark
C.31 Results for sequential execution on Matrix multiply benchmark
C.32 Results for parallel 2-port cache on Matrix multiply benchmark
C.33 Results for parallel 4-port MP cache on Matrix multiply benchmark
C.34 Results for parallel 4-port LVT cache on Matrix multiply benchmark
C.35 Results for parallel 7-port LVT cache on Matrix multiply benchmark
C.36 Results for sequential execution on Matrix transpose benchmark
C.37 Results for parallel 2-port cache on Matrix transpose benchmark
C.38 Results for parallel 4-port MP cache on Matrix transpose benchmark
C.39 Results for parallel 4-port LVT cache on Matrix transpose benchmark
C.40 Results for parallel 7-port LVT cache on Matrix transpose benchmark
C.41 Results for sequential execution on perfect hash benchmark
C.42 Results for parallel 2-port cache on perfect hash benchmark
C.43 Results for parallel 4-port MP cache on perfect hash benchmark
C.44 Results for parallel 4-port LVT cache on perfect hash benchmark
C.45 Results for parallel 7-port LVT cache on perfect hash benchmark
D.1 Heterogeneous execution results with parallel 2-port cache (processor input data size: 3,000/24,000)
D.2 Heterogeneous execution results with parallel 2-port cache (processor input data size: 6,000/24,000)
D.3 Heterogeneous execution results with parallel 2-port cache (processor input data size: 9,000/24,000)
D.4 Heterogeneous execution results with parallel 2-port cache (processor input data size: 12,000/24,000)
D.5 Heterogeneous execution results with parallel 2-port cache (processor input data size: 15,000/24,000)
D.6 Heterogeneous execution results with parallel 4-port MP cache (processor input data size: 750/24,000)
D.7 Heterogeneous execution results with parallel 4-port MP cache (processor input data size: 1,500/24,000)
D.8 Heterogeneous execution results with parallel 4-port MP cache (processor input data size: 3,000/24,000)
D.9 Heterogeneous execution results with parallel 4-port MP cache (processor input data size: 4,500/24,000)
D.10 Heterogeneous execution results with parallel 4-port MP cache (processor input data size: 6,000/24,000)

List of Figures

2.1 Target system architecture
Abstract system architecture
Default system architecture
Modified data cache architecture
Example software program structure
C function targeted for hardware
Wrapper for hardware-designed function in Figure 3.4
Accelerator architecture
Memory access steering logic for accelerator
Top-level verilog module for accelerator
Function argument receivers for accelerator
Processor/accelerator system generation flow
Performance and area results
Energy results
Loop unrolling to execute in parallel
Example outputs from the memory access profiler
Wrapper functions for parallel accelerators
write/4-read port memory with LVT
LVT-based 4-ported cache
4.6 Cache line conflict with LVT
ported cache with double-pumping
Architectures evaluated in this work
Execution time (geometric mean)
Execution cycles (geometric mean)
Fmax (geometric mean)
Area in Stratix IV ALMs (geometric mean)
Memory consumption (geometric mean)
Heterogenous results with parallel dual-port cache architecture
Heterogenous results with parallel 4-port MP cache architecture
A.1 Example output of cache simulator

Chapter 1

Introduction

1.1 Motivation

Two approaches are possible for implementing computations: software or hardware. Software involves a designer writing a program using a standard software language, such as C or C++, to express computations algorithmically. This software runs on existing hardware that interprets and executes the instructions. This hardware may be a general-purpose CPU, such as an Intel or an AMD x86 processor, or a special-purpose processor, such as a DSP or an embedded processor, which is tailored towards a particular type of application for increased efficiency. Implementing computations in hardware can bring orders of magnitude improvement in speed and power-efficiency compared to a software implementation [18]. However, hardware design can be extremely difficult, as designers are required to write complex code in an HDL, which can be error prone and difficult to debug. This is becoming increasingly arduous, as hardware designs grow bigger and more complicated with increasing chip sizes. Software design, on the other hand, is comparatively straightforward, with mature debugging and analysis tools freely accessible. Moreover, software engineering skills are widely available, with software engineers outnumbering hardware

engineers by a factor of 10 [40]. Despite its apparent energy and performance benefits, hardware design can be too difficult and time consuming for many applications. To make the advantages of hardware design more accessible, improved design flows, which allow software approaches to be used for hardware design, are needed. A promising approach in this direction is high-level synthesis (HLS), which compiles a software program written in a traditional software language, such as C, to hardware. For designers, this raises the abstraction from the register-transfer level (RTL) to the algorithmic level. Thus HLS promises to provide the performance and energy benefits of hardware, while retaining the ease-of-use of software.

However, not all software programs are suitable for hardware. There may be parts of a program which are better suited to stay in software. Software techniques, such as dynamic memory allocation or recursion, as well as inherently sequential computations, such as traversing a linked list, are better left in software. Computationally intensive and parallel applications, such as matrix multiply, are ideally suited for hardware. The need to mix software with hardware motivates hardware/software co-design, where hardware components are used to accelerate critical portions of a program to augment the software process, thereby enhancing the performance of the overall system.

To address this challenge, this thesis provides a framework which allows computationally intensive program segments to be automatically compiled into hardware accelerators, while the remaining program segments execute in software on a MIPS soft processor. The framework developed in this work, called LegUp 1, targets a reconfigurable platform called a field-programmable gate array (FPGA). FPGAs have recently been garnering attention for their successful use in computing applications, where they implement custom hardware specially tailored to a particular application. FPGAs can be instantly programmed to function as any digital circuit without incurring the overhead of custom chip fabrication.

1 LegUp is an HLS tool being developed at the University of Toronto, which encompasses several different efforts, one of which is the work done for this thesis.

In contrast, building an ASIC (application-specific integrated circuit), which has been the traditional method of creating hardware, suffers from high non-recurring engineering (NRE) costs, where the masks alone cost millions of dollars. The high cost of ASICs is a significant barrier to entry into the market, making ASICs feasible only for products with extremely high volumes. Exacerbating this, long design cycles involving the initial design, verification, timing closure, and fabrication make time-to-market long for ASICs. A long time-to-market, considering the short lifespan of modern electronic products, is a critical cost which can obliterate any potential success that could be gained from the design. Thus, FPGAs are a key enabling technology, as users are not faced with high NRE costs and can benefit from shorter time-to-market. Therefore, FPGAs are the IC (integrated circuit) medium targeted in this work.

1.2 Contributions

The principal objective of this research is to enable hardware/software co-design in the LegUp HLS framework. The contributions of this thesis are:

- Automating the generation of processor/accelerator systems where compute-intensive C functions are accelerated by hardware. A preliminary version of the work appears in [9, 10].
- Enhancing this architecture to allow parallel execution of accelerators with new multi-ported cache designs which provide high memory throughput [13].
- Analyzing the impact of cache architecture and interface on the performance and area of processor/parallel-accelerator systems [13].

1.3 Thesis Organization

The rest of this thesis is organized as follows:

Chapter 2 provides background information on the LegUp HLS framework. It also describes the different SoC interfaces which were investigated as a basis for building the processor/accelerator system. It outlines the basic communication protocol of these SoC interfaces and reviews previous efforts in high-level synthesis.

Chapter 3 introduces the default system architecture targeted by LegUp. It describes the communication interface between the processor and hardware accelerators and illustrates how the hybrid SoC 2 is automatically generated. It also shows results for sequential execution, where either the processor or a single accelerator executes at a time, but not both at the same time.

Chapter 4 describes LegUp's ability to execute multiple accelerators in parallel. A memory access profiler, which detects memory dependencies between functions, is also described. Two types of multi-ported caches, called the LVT cache and the MP cache, are presented. Previous work on creating multi-ported caches is also discussed. To the best of our knowledge, our multi-ported caches, which do not require memory partitioning and allow single-cycle access to all regions of the cache, are the first of their kind to be implemented on an FPGA.

Chapter 5 presents results for the parallel execution of accelerators using various multi-ported cache architectures. Results for a total of 1,760 different system architectures are presented. Two different scenarios are investigated: 1) when six accelerators execute in parallel to perform all of the computations, and 2) when six accelerators, as well as the processor, perform the computations in parallel.

Chapter 6 presents concluding remarks and suggestions for future work. It describes some of the work currently in progress, as well as other future extensions to the framework which will help to increase performance and make the tool more flexible.

2 A hybrid system comprises the MIPS soft processor and one or more accelerators.

Chapter 2

Background

2.1 LegUp

This research is part of a larger project called LegUp, whose overarching goal is to create a self-accelerating processor in which a program can be accelerated automatically using custom hardware accelerators. LegUp is an open-source HLS framework that compiles a standard C program to target a hybrid FPGA-based hardware/software system. Some program segments execute on a 32-bit MIPS soft processor, while other program segments are automatically synthesized into FPGA circuits (hardware accelerators) that communicate and work in tandem with the soft processor.

Two modes of execution exist in our system: sequential and parallel. In sequential mode, either the processor or a single accelerator executes at a given time, but not both (described in Chapter 3). Thus, once the processor starts an accelerator, it is stalled until the accelerator finishes. In parallel mode, the processor and all accelerators can execute at the same time (described in Chapter 4).

LegUp works at a function granularity. Hence it can compile one or more C functions into hardware accelerators, but it cannot work on a smaller granularity, such as loops within a function. LegUp supports a large subset of ANSI C, as shown in Table 2.1.

Table 2.1: Subset of C supported/unsupported by LegUp

    Supported            Unsupported
    Functions            Dynamic Memory
    Arrays, Structs      Floating Point
    Global Variables     Recursion
    Pointer Arithmetic

Hence any function which does not use the unsupported subset of C can be compiled into a hardware accelerator. If the entire C program does not contain any unsupported operations, LegUp can also compile the entire program to hardware, instead of targeting a processor/accelerator system.

LegUp leverages the low-level virtual machine (LLVM) compiler framework - the same framework used by Apple for iPhone/iPad development. At the core of LLVM is an intermediate representation (IR), which is essentially a machine-independent assembly language. C code is translated into LLVM IR, then analyzed and modified by a series of compiler optimization passes. Transformations and optimizations in the LLVM framework are structured as a series of compiler passes. Passes include optimization passes such as dead code elimination, analysis passes such as alias analysis, and back-end passes that produce assembly for a particular target machine (e.g. MIPS or ARM). The infrastructure is flexible, allowing passes to be reordered, substituted with alternatives, and disabled if needed. LegUp HLS algorithms have been implemented as LLVM passes that fit neatly into the existing framework. Implementing the HLS steps as distinct passes also allows easy experimentation with alternate HLS algorithms.
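To make Table 2.1 concrete before moving on to the design flow, the following function (a hypothetical example, not one of the thesis benchmarks) uses only supported constructs (global variables, arrays, pointer arithmetic, and plain function structure), and so could be compiled to a hardware accelerator; a variant that used recursion or dynamic memory allocation would have to remain in software on the MIPS processor:

    #define N 64

    int coeff[N];                         /* global variables are supported      */

    /* Dot product over a fixed-size array: arrays, pointer arithmetic, and
     * simple loops all fall within the supported subset of Table 2.1.          */
    int dot(int *x) {
        int i, sum = 0;
        for (i = 0; i < N; i++)
            sum += coeff[i] * *(x + i);   /* pointer arithmetic                  */
        return sum;
    }

    /* A recursive or malloc-based version of the same computation would fall
     * in the unsupported column and would stay in software.                    */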

2.1.1 Design Flow

The LegUp design flow comprises first compiling and running a program on a standard processor, profiling its execution, selecting program segments to target to hardware, and then re-compiling the program to a hybrid hardware/software system. Figure 2.1 illustrates the detailed flow. Referring to the labels in the figure, at step 1, the user compiles a standard C program to a binary executable using the LLVM compiler. At step 2, the executable is run on an FPGA-based MIPS processor. We evaluated several publicly available MIPS processor implementations and selected the Tiger MIPS processor from the University of Cambridge [47], based on its support for the full MIPS instruction set, established tool flow, and well-documented modular Verilog.

The MIPS processor has been augmented with extra circuitry to profile its own execution. Using its profiling ability, the processor is able to identify sections of program code that would benefit from hardware implementation, improving program throughput and power. Specifically, the profiling results drive the selection of program code segments to be re-targeted to custom hardware from the C source. Profiling a program's execution on the processor itself provides the highest possible accuracy, as the executing code does not need to be altered to be profiled and can run at full speed. Moreover, with hardware profiling, system-level characteristics that affect performance, such as off-chip memory access times, are properly accounted for. Given the profile results, the user places the names of the functions to be accelerated in a Tcl file that is read by LegUp.

Having chosen program segments to target to custom hardware, at step 3 LegUp is invoked to compile these segments to synthesizable Verilog RTL, called hardware accelerators in this work. Entire functions are synthesized to hardware from the C source. Moreover, if a hardware function calls other functions, such called functions are also synthesized to hardware. In other words, we do not allow a hardware-accelerated function to call a software function. As illustrated in the figure, LegUp's hardware synthesis and software compilation are part of the same LLVM-based compiler framework. In step 4, the SoC is assembled from the hardware accelerators and the MIPS soft processor. The hardware interface for the accelerators is created, which allows the accelerators to connect to the MIPS processor as well as to memories via an on-chip interconnect.

[Figure 2.1: Target system architecture. Original diagram omitted; it shows the six-step LegUp flow from the C program, through self-profiling on the MIPS processor (execution cycles, power, cache misses) and high-level synthesis of the suggested program segments, to the altered software binary that calls the hardware accelerators in the FPGA fabric.]

In step 5, the C source is modified such that the functions implemented as hardware accelerators are replaced by wrapper functions that call the accelerators (instead of doing the computations in software). This new, modified source is compiled to a MIPS binary executable. Finally, in step 6 the hybrid processor/accelerator system executes on the FPGA.

The research for this thesis pertains to steps 4 and 5 of Figure 2.1. These steps create the hybrid flow allowing hardware/software co-execution. Step 2 was described in [2, 3], and step 3, the compilation of C to Verilog, is ongoing work, described in particular in [9, 10].

An abstract view of the system targeted by LegUp is shown in Figure 2.2. It comprises the MIPS soft processor and one or more hardware accelerators which communicate over the Avalon Interconnect, Altera's on-chip interface. The details of Avalon are discussed in Section 2.2.1. Both the processor and the hardware accelerators share an on-chip data cache, which can also access off-chip memory. This architecture is described in more detail in Section 3.1.

[Figure 2.2: Abstract system architecture. Original diagram omitted; it shows the MIPS processor and hardware accelerators connected through the Avalon Interconnect to the on-chip cache and off-chip memory.]

2.2 Hardware/Software Interface

Efficient interfaces between the processor and hardware accelerators are required for the overall system to achieve high performance. As the number of processors and accelerators in the SoC increases, its interconnection network can easily become the bottleneck. Thus it is imperative for a system to have an efficient interconnection network which does not hinder the performance of its processing elements. This section describes the different SoC interfaces which were investigated as a basis for our system architecture.

Different SoC interfaces have been developed by commercial FPGA vendors. In addition to commercial products, open-source architectures exist which are developed and maintained by researchers and enthusiasts. We have evaluated three major FPGA interconnects, two of which are from industry, with the other from the research community. These are the Altera Avalon Interface, the Xilinx CoreConnect and AXI Interfaces, and the OpenCores Wishbone Interface. The notable features of each bus architecture are reviewed and analyzed based on the specifications published by the manufacturers, as well as papers published by other research groups.

2.2.1 Altera Avalon Interface

The Avalon Interface is developed by Altera Corporation and is available through their design software, Quartus II. More specifically, the SOPC (System-on-a-Programmable-Chip) Builder, which is part of the Quartus II design software, automatically generates an SoC when components with an Avalon Interface are selected by the end user. The SOPC Builder incorporates a library of pre-made components, including the Nios II processor, memory controllers, and peripherals, as well as an interface for including custom designs, which allows one to create a complete embedded system with simple interactions using the GUI. Bus arbitration, bus width matching, and even clock domain crossing are all handled automatically.

The Avalon Interface offers single-cycle read or write transfers [20]. It has separate dedicated data and address paths, where the width of the data path can be up to 1024 bits and the address path can be up to 32 bits. The data width of 1024 bits is especially large compared to other SoC interfaces. It can handle any number of master and slave components. A master is able to initiate a transaction (read/write), whereas a slave can only respond to a transaction from a master. The ability to create multi-master architectures is especially important since it permits one to build an SoC with many processors or DMA (Direct Memory Access) devices. Most bus architectures from other manufacturers limit the number of masters. The interconnection fabric implements a point-to-point structure that provides independent paths between masters and slaves.

The Avalon Interface supports pipelined read transfers, allowing a master to start multiple read transfers without waiting for the prior transfers to complete. The maximum number of pipelined transfers is a property of the slave interface and is not limited by the Avalon Interface [20]. Write transfers cannot be pipelined. Avalon also offers burst capability, which is especially useful for DMA to off-chip memory.

There are two types of Avalon Interfaces: the memory-mapped interface and the streaming interface. In the memory-mapped interface, the bus masters communicate with

their slaves via memory-mapped addresses. Thus, each slave is mapped to a certain address; a write to that address from a master sends data to the slave, and a read from that address reads data from the slave. The Avalon Streaming Interface is used for components with high I/O bandwidth, low latency, and unidirectional data. Typical applications include multiplexed streams, packets, and DSP data. It creates a point-to-point connection between a source and a sink. The streaming interface does not require the components to be mapped to any addresses.

The arbitration required for multi-master systems is especially important since it can significantly impact performance when there is a lot of network traffic. For the Avalon memory-mapped interface, multiple masters can be connected to a single slave, and the arbitration hardware is automatically created by the SOPC Builder. Unlike traditional central arbitration schemes, an arbiter is created for each slave; hence masters can simultaneously perform transfers with independent slaves, and transfers are only stalled when multiple masters attempt to access the same slave. The arbiter is also tunable: the user can give a particular master more access to a slave by setting its number of shares with respect to that slave. For example, when Master 1 is assigned three shares and Master 2 is assigned four shares for a slave, the arbiter grants Master 1 access for three transfers, then Master 2 for four transfers. When multiple masters contend for access to a slave, the arbiter grants shares in a round-robin order. The arbitration is done with zero latency; hence when two masters contend for a single slave, the granted master can access the slave in the same cycle.

The Avalon Interface offers many benefits, one of which is its ease of use. Based on the configuration, the SOPC Builder automatically generates HDL which can be directly synthesized to hardware.
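To illustrate the share-based round-robin arbitration described above, the following small C model (a hypothetical sketch, not Altera's implementation) grants a contended slave to two masters in round-robin order, with each master holding the slave for its configured number of shares, three for Master 1 and four for Master 2 as in the example:

    #include <stdio.h>

    #define NUM_MASTERS 2

    static const int shares[NUM_MASTERS] = { 3, 4 };  /* shares per master        */
    static int granted   = 0;                         /* master holding the slave */
    static int remaining = 3;                         /* = shares[0] initially    */

    /* Returns which master owns the slave for this transfer, assuming both
     * masters are requesting on every cycle (the contended case).              */
    int arbitrate(void) {
        int winner = granted;
        if (--remaining == 0) {                        /* turn exhausted          */
            granted   = (granted + 1) % NUM_MASTERS;   /* rotate round-robin      */
            remaining = shares[granted];               /* reload new share count  */
        }
        return winner;
    }

    int main(void) {
        int t;
        for (t = 0; t < 14; t++)     /* two full rotations: 3 + 4 transfers each */
            printf("transfer %2d -> Master %d\n", t, arbitrate() + 1);
        return 0;
    }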

2.2.2 Xilinx Interconnects

Xilinx offers two types of SoC interfaces: CoreConnect and the AXI Interface. CoreConnect was originally developed by IBM, and it has been integrated into Xilinx IP cores. Recently, Xilinx has also integrated a newer interconnect, called AXI, which conforms to the AMBA AXI version 4 specification from ARM.

CoreConnect

CoreConnect has a hierarchically organized architecture. It provides three types of buses, the Processor Local Bus (PLB), the On-chip Peripheral Bus (OPB), and the Device Control Register Bus (DCR), each of which can be used depending on the performance requirements [34]. The PLB is a high-bandwidth, low-latency bus which connects performance-critical components such as processors, memory, and DMA controllers. Bridged to the PLB, the OPB connects lower data rate peripherals. The DCR is a separate control bus that links to all of the devices to allow a user to monitor the individual control registers. The goal of providing different types of buses is to offload the negative effect of slower devices from the high-performance bus. Similar to the Avalon Interface, CoreConnect is shipped with design tools from Xilinx, which allow a user to construct an SoC by connecting different components such as the MicroBlaze processor, custom logic, and other peripherals. Xilinx's EDK 1 (Embedded Development Kit) is used to generate the interconnection fabric.

The CoreConnect architecture provides many features similar to the Altera Avalon Interface. For the PLB, the width of the data bus can be either 32, 64, or 128 bits, whereas the address bus can be up to 32 bits. Even though the maximum width of the data bus is much narrower than that of Avalon, the PLB implements separate read and write data buses, allowing concurrent read and write transfers in a clock cycle. It also allows multi-master architectures,

1 Xilinx EDK is an analogous piece of software to Altera's SOPC Builder.

although the number of masters and slaves is limited to 16 of each. The interconnection network can be configured as a bus topology for multiple masters and slaves or as a point-to-point topology between a single master/slave pair. Although the Avalon Interface only allows pipelined reads, the PLB allows both pipelined reads and writes. It is also capable of burst reads and writes.

AXI Interconnect

The new AXI Interconnect is supported on the newer Virtex-6 and Spartan-6 devices [33]. It connects one or more AXI memory-mapped master devices to one or more memory-mapped slave devices. AXI has the same limitation of supporting up to 16 masters and slaves, but the data bus width has been increased from 128 bits to 256 bits. The interconnection topology is implemented in a crossbar manner to allow concurrent transfers from multiple masters to independent slaves. It also includes automatic data-width conversion, where the conversion is performed for each master and slave connection whose width does not match the width of the crossbar. Similar to Avalon, AXI also allows built-in clock-rate conversion, where a master and a slave can use independent clock rates.

2.2.3 OpenCores Wishbone

Wishbone was originally developed by Silicore Corporation but was handed over to OpenCores in August 2002 [38]. Since it is now in the public domain, it can be freely used and distributed. The OpenCores Wishbone is not an IP core itself, but rather a specification for creating an IP core [42]. In other words, it does not have a system builder tool like Altera's SOPC Builder or Xilinx's EDK. Rather, it simply specifies a set of interfaces, signals, and timing information to achieve a high-performance bus. It aims to standardize bus interfaces to ensure compatibility between IP cores and to create a robust standard that does not constrain the creativity of the end user [41]. It is very flexible, since many

aspects, such as the interconnection topology or the arbitration mechanism, are left up to the designer. Thus, the designer is able to choose the most suitable implementation for the design. The specification does not require the use of any particular development tool and is technology independent, meaning that it is not vendor specific and can be targeted towards different types of media (ASICs or FPGAs). Furthermore, it is fully compliant with any synthesis tool.

Wishbone supports many of the features which are supported by both Avalon and CoreConnect/AXI. It allows single-cycle data transfers as well as multi-master architectures. The maximum bit width for both the data bus and the address bus is 64 bits. The interconnection topology is flexible, as the user can choose between a point-to-point connection, a dataflow network (ring), a shared bus, and a crossbar switch. Arbitration in Wishbone is completely up to the end user to implement. Hence, different arbitration schemes, such as time multiplexing, round-robin, or static priority, can be considered. The Wishbone architecture does not, however, support pipelined transfers.

The OpenCores Wishbone architecture can be openly distributed. Its strength is its portability, as it is not tied to any vendor, and its flexibility, as the user is not constrained to one type of arbiter or network topology. However, unlike using design tools from Altera or Xilinx, the user has to either build the network and its arbitration, or use existing IPs built by other users of Wishbone. Although OpenCores carries a number of IPs available for Wishbone, the performance/area of these IPs is unknown, which can be a significant overhead for a designer.

To summarize, the Altera Avalon Interface provides ease of use, support for large bit widths and an unlimited number of master/slave pairs, as well as different types of interconnects. With these versatile configurations available, and since our target medium is Altera FPGAs, the Avalon Interface was the logical choice to use as the SoC interconnect for LegUp processor/accelerator systems.

2.3 Related Work

Automatic compilation of a high-level language program to silicon has been a decades-long quest in the EDA field, with early seminal work done in the 1980s. We highlight several recent efforts, with emphasis on tools that target FPGAs.

Several HLS tools have been developed to target specific applications. GAUT is a high-level synthesis tool that is designed for DSP applications [23]. GAUT synthesizes a C program into an architecture with a processing unit, a memory unit, and a communication unit, and requires that the user supply specific constraints, such as the pipeline initiation interval. ROCCC is an open-source high-level synthesis tool that can create hardware accelerators from C [49]. ROCCC is designed to accelerate critical kernels that perform repeated computations on streams of data, for instance DSP applications such as FIR filters. ROCCC does not support several commonly used aspects of the C language, such as generic pointers, shifting by a variable amount, loops other than for loops, and the ternary operator. ROCCC has a bottom-up development process that involves partitioning one's application into modules and systems. Modules are C functions that are converted into computational datapaths with no FSM, with loops fully unrolled. These modules cannot access memory but have data pushed to them and output scalar values. Systems are C functions that instantiate modules to repeat computation on a stream of data or a window of memory, and usually consist of a loop nest with special function parameters for streams. ROCCC supports advanced optimizations such as systolic array generation and temporal common sub-expression elimination, and it can generate Xilinx PCore modules to be used with a Xilinx MicroBlaze processor. The tool is integrated with the Eclipse IDE to provide a GUI, which can be convenient for the user. However, ROCCC's strict subset of C is insufficient for compiling any non-trivial C programs. Broadly speaking, ROCCC works and excels for a specific class of applications (streaming-oriented applications), but it is not a general C-to-hardware compiler.

General (application-agnostic) tools have also been proposed in recent years. CHiMPS (Compiling High-level Languages into Massively Pipelined Systems) is a compiler that takes an ANSI C application and automatically generates a customized, parallel FPGA accelerator in VHDL [43]. It is a tool developed by Xilinx and the University of Washington that synthesizes programs into a many-cache architecture, taking advantage of the abundant small block RAMs available in modern FPGAs. Each cache corresponds to a particular region of global memory, based on an analysis of a program's access patterns. In CHiMPS, the regions of memory that are covered by different caches may overlap, and in such cases, cache coherency is maintained by flushing. Unfortunately, no source or binary is available for this tool, which makes it unusable for others in the research community.

Other general tools include LiquidMetal, a compiler being developed at IBM Research. LiquidMetal comprises an HLS compiler and a new (non-standard) language, LIME, that incorporates hardware-specific constructs, such as bit-width specification on integers [31]. xpilot is a tool that was developed at UCLA [15] and used successfully for a number of HLS studies (e.g., [12]). Trident is a tool developed at Los Alamos National Labs, with a focus on supporting floating-point operations [46]. xpilot and Trident have not been under active development for several years and are no longer maintained.

Among prior academic work, the Warp Processor proposed by Vahid, Stitt and Lysecky bears the most similarity to our framework [48]. In a Warp Processor, software running on a processor is profiled during its execution. The profiling results guide the selection of program segments to be synthesized to hardware. Such segments are disassembled from the software binary to a higher-level representation, which is then synthesized to hardware [44]. The software binary running on the processor is altered automatically to leverage the generated hardware. LegUp uses a somewhat similar approach, with the key difference being that we compile hardware from the high-level language source code (not from a disassembled binary). As is the case for CHiMPS, neither the source code nor the binary is available for the Warp Processor.

With regard to commercial tools, there has been considerable activity in recent years, both in start-ups and at major EDA vendors. Current offerings include AutoPilot from AutoESL [6] (a commercial version of xpilot, recently acquired by Xilinx, Inc.), Catapult C from Mentor Graphics [37], C2R from CebaTech [11], excite from Y Explorations [51], CoDeveloper from Impulse Accelerated Technologies [32], Cynthesizer from Forte [25], and C-to-Silicon from Cadence [8]. In our experience, obtaining a binary executable for evaluation has not been possible for most tools.

Also on the commercial front is Altera's C2H tool [19]. C2H allows a user to partition a C program's functions into a hardware set and a software set, where the software-designated functions execute on a Nios II soft processor, and the hardware-designated functions are synthesized into custom hardware accelerators that connect to the Nios II through the Avalon interface. The C2H target system architecture closely resembles that targeted by our tool. However, only the processor has access to the cache (the accelerators can only access off-chip memory), and as such, the cache must be flushed before an accelerator is activated if the two are to share memory. This can be a significant overhead, as it can cause excess off-chip memory accesses.

2.4 Summary

This chapter introduced the LegUp HLS framework and described the overall flow of compiling software to a processor/accelerator system. It discussed the three different SoC interfaces which were investigated, which led us to select the Avalon Interface as our SoC interconnect. It also reviewed previous efforts in creating HLS tools that target FPGAs. To our knowledge, there is currently no other open-source HLS tool that compiles a standard C program to a hybrid processor/accelerator system architecture with the flexibility to support both sequential and parallel accelerators while providing

high memory bandwidth with a parameterized multi-ported cache architecture. LegUp is freely distributed to the research community. It is a framework that allows researchers around the world to experiment with new approaches to HLS and hardware/software co-design.

Chapter 3

Sequential Execution

This chapter introduces the system architecture targeted by LegUp and describes how the accelerators communicate and work in tandem with the MIPS soft processor. It describes the flow that automatically generates the hybrid system, consisting of the MIPS soft processor and sequential accelerators, with a single Makefile command. Lastly, it presents results on accelerating the most compute-intensive and the second most compute-intensive functions in each of 13 benchmarks. These results are compared against executing the entire program in software as well as executing the entire program in hardware.

3.1 System Architecture

While a variety of different memory architectures are possible in processor/accelerator systems, a commonly used approach is one where data shared between the processor and accelerators resides in a shared memory hierarchy comprising a single cache and main memory. The advantage of such a model is its simplicity, as cache coherency mechanisms are not required. The disadvantage is the potential for contention when multiple accelerators and/or the processor access memory concurrently. Despite this potential limitation, we use this model as the basis of our initial investigation, but extend this architecture to multi-ported caches (see Chapter 4), which attempt to mitigate this limitation.

[Figure 3.1: Default system architecture. Original diagram omitted; it shows the MIPS processor and several hardware accelerators, each with optional local memories, connected through the Avalon Interconnect to the shared on-chip data cache and off-chip memory.]

We currently target the Altera DE2 board, with the Cyclone II FPGA, and the DE4 board, with the Stratix IV FPGA. In terms of off-chip memory, the DE2 board contains 8 MB of SDRAM, whereas the DE4 board supports up to 2 GB of high-performance DDR2 memory.

The default system architecture is shown in Figure 3.1. It is composed of a MIPS soft processor with one or more hardware accelerators, supported by memory components, including an on-chip dual-port data cache and off-chip memory. The MIPS soft processor is a 32-bit 5-stage RISC processor that supports the MIPS-I instruction set. It has both instruction and data caches, and the program instructions and data are stored in off-chip memory. The instruction cache is instantiated within the MIPS processor, as it is only accessed by the processor. As previously discussed, the components are connected to each other over the Avalon Interface in a point-to-point manner, and the interconnection network is generated by Altera's SOPC Builder tool. Communication between two components occurs via memory-mapped addresses. For example, the MIPS processor communicates with an accelerator by writing to the address associated with the accelerator. When multiple components are connected to a single component, such as the

on-chip data cache, a round-robin arbiter is automatically created by the SOPC Builder. The solid arrows in Figure 3.1 represent the communication links between the processor and accelerators. These links are used by the processor to send arguments to accelerators, invoke accelerators, query an accelerator's done status, and retrieve returned data, if necessary. This is described in more detail in Section 3.2. The dotted arrows represent communication links between the processor/accelerators and the shared memory hierarchy.

The data cache comprises on-chip dual-port block RAMs and memory controllers. On a read, if it is a cache hit, the data is returned to the requester in a single cycle. On a miss, the memory controller bursts to fetch a cache line from off-chip memory. In our system, this takes 20 cycles (depending on the type of off-chip memory) when there is no contention from other accesses. Depending on the cache line size, the number of bursts varies. On a burst, after an initial delay of 20 cycles, each additional cycle returns 32 bits of data for the DE2 board, and 256 bits of data for the DE4 board. This burst is continued until a cache line is filled. As with many L1 caches, we employ a write-through cache owing to its simplicity, as write-through caches do not require bookkeeping to track which cache lines are dirty. Note that our approach is not that of a single monolithic memory hierarchy. Each accelerator has its own local memory for data that is not shared with the processor or other accelerators. This allows single-cycle memory access for all local memories.

3.1.1 Data Cache Architecture

The on-chip data cache is implemented using block RAMs within the FPGA fabric (M4K blocks on Cyclone II, and M9K blocks on Stratix IV). The cache plays a crucial role in providing low-latency memory accesses to the processor and accelerators. The data cache was heavily modified from the original Tiger MIPS source code [47]. Initially, it was a single-port direct-mapped cache, 8 KB in size, which was instantiated inside the processor. The modified cache architecture is shown in Figure 3.2. We first removed the

cache from the processor to place it as a separate hardware component on the Avalon Interconnect. This was done so that accelerators could directly access the cache without having to go through the processor. Then, instead of a single-port RAM, a true dual-port RAM was used. A true dual-port RAM contains two ports that can both read and write. In this dual-port RAM, one port is used by the processor, with the other used by an accelerator. An accelerator memory controller, in addition to the existing memory controller for the processor, was designed to handle memory requests from accelerators. Each memory controller was connected to one port of the memory, and hence a dual-port cache was implemented. This architecture allows both the processor and an accelerator to directly access the cache without contention. Because the two memory controllers have different control signals, one port of the cache is reserved for the processor, with the other port reserved for any accelerators. Hence, if there is more than one accelerator, they share one port of the cache, as shown in Figure 3.1, with the arbitration logic automatically created by the SOPC Builder. This allows the architecture to support an arbitrary number of accelerators. In sequential execution, only the processor or a single accelerator executes at a time; thus this cache architecture provides contention-free cache access for any number of accelerators executing sequentially.

In terms of improvements to the features of the cache, we have added set-associativity, which allows a particular memory entry to be mapped to more than one line of the cache. This can lower cache miss rates, due to fewer conflict misses [30]. With fewer misses, runtime is improved, since off-chip main memory accesses cause lengthy processor/accelerator stalls. We also parameterized the cache so that the cache size, line size, and associativity can easily be configured by the user with simple Verilog parameters.

The memory controllers shown in Figure 3.2 work as follows. If a read is asserted from the processor or an accelerator, the controller checks the appropriate cache line to see if the data is present. This takes a single cycle, and this is the minimum access latency for on-chip memory.

[Figure 3.2: Modified data cache architecture. Original diagram omitted; it shows the processor memory controller and the accelerator memory controller each connected to one port of the true dual-port RAM, with a path to off-chip memory.]

If the requested data is present, it is returned to the requester in the same cycle. On a miss, the memory controller bursts to fetch the data from off-chip main memory over the Avalon Interface. Bursting allows high-bandwidth off-chip memory accesses which return large quantities of data after incurring an initial latency. On a write, the memory controller sends the write to main memory, and only writes to the cache if it is a cache hit.

Stall logic exists for each memory controller. On a memory access, the processor or an accelerator is stalled under the following conditions: 1) it wants to read but misses in the cache, at which time the processing element (the processor or an accelerator) is stalled until the fetch from main memory is complete, or 2) in attempting to communicate with the cache, another component requests access in the same cycle, and the arbiter grants access to the other requester 1. In both cases, the processor is sent a stall signal, which stalls its pipeline, and the accelerator is given a waitrequest signal. Waitrequest is an Avalon signal which is asserted when a component should wait. It can either be asserted by user-designed logic, as in case 1, or by the Avalon Interconnect (arbiter) in the case of contention, as described in case 2. An accelerator is stalled from executing as long as the waitrequest signal is asserted.

1 This case does not occur in sequential execution since only one core is executing at a time. It is covered here for completeness and will be relevant for parallel execution in Chapter 4.
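The read, write, and miss-handling behaviour just described can be summarized with a small C model of the write-through controller. This is a hypothetical sketch for illustration only (the real controllers are Verilog hardware); the 20-cycle miss latency and 32-bit-per-cycle burst follow the DE2 figures above, while the 8 KB direct-mapped geometry and 64-byte line size are assumptions:

    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES   64               /* assumed cache line size            */
    #define NUM_LINES    128              /* assumed: 8 KB direct-mapped        */
    #define MISS_LATENCY 20               /* initial off-chip delay, in cycles  */
    #define BURST_BITS   32               /* per-cycle burst width on the DE2   */

    typedef struct {
        int      valid;
        uint32_t tag;
        uint8_t  data[LINE_BYTES];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];
    static uint8_t main_mem[1 << 20];     /* stand-in for off-chip memory       */

    /* Read one word-aligned 32-bit word; returns how many cycles the requester stalls. */
    int cache_read(uint32_t addr, uint32_t *out) {
        uint32_t line = (addr / LINE_BYTES) % NUM_LINES;
        uint32_t tag  = addr / (LINE_BYTES * NUM_LINES);
        int cycles = 1;                                   /* hit: single cycle  */

        if (!cache[line].valid || cache[line].tag != tag) {
            /* Miss: burst the whole line in from off-chip memory.              */
            cycles += MISS_LATENCY + (LINE_BYTES * 8) / BURST_BITS;
            memcpy(cache[line].data, &main_mem[addr - addr % LINE_BYTES], LINE_BYTES);
            cache[line].valid = 1;
            cache[line].tag   = tag;
        }
        memcpy(out, &cache[line].data[addr % LINE_BYTES], 4);
        return cycles;
    }

    /* Write-through: main memory is always written; the cache is updated only
     * on a hit, so no dirty-line bookkeeping is needed.                        */
    void cache_write(uint32_t addr, uint32_t value) {
        uint32_t line = (addr / LINE_BYTES) % NUM_LINES;
        uint32_t tag  = addr / (LINE_BYTES * NUM_LINES);

        memcpy(&main_mem[addr], &value, 4);
        if (cache[line].valid && cache[line].tag == tag)
            memcpy(&cache[line].data[addr % LINE_BYTES], &value, 4);
    }

With a 64-byte line, for example, a DE2 miss costs roughly 20 + (64*8)/32 = 36 cycles before the requester resumes, which is why reducing misses through set-associativity matters for runtime.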

[Figure 3.3: Example software program structure. Original diagram omitted; it shows main calling compute and noncompute, with compute calling mult and div, and noncompute calling traverse.]

3.2 Processor/Accelerator System Generation

This section describes how a hybrid system, composed of a processor and accelerators, is automatically generated in LegUp. To create a hybrid system, the user simply has to place the name of the C function to accelerate in the config.tcl file and use a makefile command, make hybridsim, which compiles the C function to a hardware accelerator, connects the accelerator to the processor and the cache over Avalon, and finally simulates the complete system automatically 2. This process can be divided into the software flow and the hardware flow.

To illustrate this clearly, a simple example program structure is shown in Figure 3.3. In this program, main calls two functions, compute and noncompute. Function compute calls two functions, mult and div, whereas function noncompute calls another function, traverse. Consider that the user wants to accelerate function compute in this program. Figure 3.4 shows the C code for function compute, which is targeted for hardware. It has two function arguments, inputa and inputb, each of which is used as an argument to the mult and div functions, respectively. The compute function sums the return values from each function call and returns the total. A step-by-step description of how this function is accelerated is given below, with the software flow shown first and the hardware flow described next.

As described previously in Section 2.1, the LLVM compiler, which LegUp is built upon, executes compiler passes, and new passes can easily be created and added. An LLVM compiler pass has been created for each of the software and hardware flows, which are respectively called the software-only pass and the hardware-only pass.

2 make hybrid generates the system without running the simulation.
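Since the original Figure 3.3 diagram is not reproduced above, its call structure corresponds to a C skeleton along the following lines. The function bodies here are stubs invented purely for illustration; only compute paraphrases the code of Figure 3.4:

    /* Hypothetical skeleton of the program structure in Figure 3.3. */
    int mult(int *a)        { return a[0] * 2; }   /* called by compute    */
    int div(int *b)         { return b[0] / 2; }   /* called by compute    */
    int traverse(int *list) { return list[0]; }    /* called by noncompute */

    int compute(int *inputa, int *inputb) {        /* function to accelerate */
        return mult(inputa) + div(inputb);
    }

    int noncompute(int *data) {                    /* remains in software    */
        return traverse(data);
    }

    int main(void) {
        int a[1] = { 4 }, b[1] = { 8 }, list[1] = { 1 };
        int total = compute(a, b);   /* later redirected to the generated wrapper */
        total += noncompute(list);
        return total;
    }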

    int compute (int * inputa, int * inputb) {
        int result = 0;
        result += mult(inputa);
        result += div(inputb);
        return result;
    }

Figure 3.4: C function targeted for hardware.

These passes create the communication interface between software and hardware, which allows the processor and accelerators to work together.

3.2.1 Software Flow

At a high level, the software flow runs the software-only pass, which gets the names of the designated functions from the config.tcl file, generates a C wrapper function for each hardware-designated function, and replaces all function calls to the original C function with calls to the wrapper function. It also generates Tcl scripts which control the SOPC Builder.

Wrapper Function Generation

From the processor's perspective, the executing software is oblivious to the fact that there exists a hardware accelerator. Thus, this process must happen seamlessly, without any alterations to the rest of the program. The purpose of the wrapper function is to allow processor/accelerator communication without affecting the rest of the software. The wrapper function passes the function arguments to the corresponding hardware accelerator, asserts a start signal to the accelerator, waits until the accelerator has completed execution, and then receives the return value over the Avalon Interconnect. The wrapper function has the same function prototype as the original C function; however,

    #define STATUS (volatile int *) 0xf
    #define DATA   (volatile int *) 0xf
    #define ARG1   (volatile int *) 0xf
    #define ARG2   (volatile int *) 0xf

    int legup_wrap_compute (int * inputa, int * inputb) {
        // pass arguments to accelerator
        *ARG1 = inputa;
        *ARG2 = inputb;
        // give start signal
        *STATUS = 1;
        // wake up and get return data
        return *DATA;
    }

Figure 3.5: Wrapper for the hardware-designated function in Figure 3.4.

its function body is replaced with memory-mapped reads and writes. The hardware-designated function shown in Figure 3.4 has two arguments, inputa and inputb, and returns an integer type. The number of function arguments, the data types of the arguments, and the return type of the function are retrieved by iterating through the functions in LLVM. Figure 3.5 shows the wrapper function generated for the compute function. In SOPC Builder, each accelerator is assigned to a certain memory address range, as mentioned in Section 3.1. The memory addresses defined in the wrapper function correspond to the assigned memory address range of the hardware accelerator. Writes to this memory address range translate into data communicated across the Avalon interface to the accelerator. The wrapper function first sends all of the arguments of the function, inputa and inputb, then writes to the STATUS pointer, which starts the accelerator. At this point, in the case of sequential execution, the accelerator asserts a stall signal back to the processor, causing it to stall. When the accelerator's work is complete, a done signal is asserted to the processor, allowing the processor to move on to the next instruction. A read from the DATA address retrieves the return value from the accelerator.
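To see what the call-site rewriting described next looks like from the caller's side, here is a hypothetical, self-contained illustration. legup_wrap_compute stands for the generated wrapper of Figure 3.5, stubbed out here so the example compiles on its own:

    /* Stub standing in for the generated wrapper (Figure 3.5), which would
     * really forward the call to the hardware accelerator over Avalon.     */
    int legup_wrap_compute(int *inputa, int *inputb) {
        return inputa[0] + inputb[0];
    }

    int main(void) {
        int a[1] = { 3 }, b[1] = { 4 };
        /* Original software call:      total = compute(a, b);              */
        /* After the software-only pass the call targets the wrapper, and   */
        /* the software body of compute() has been deleted:                 */
        int total = legup_wrap_compute(a, b);
        return total == 7 ? 0 : 1;
    }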

Once the wrapper function is generated (an actual C file is generated for the wrapper functions, which is subsequently linked together with the main C file), the rest of the software is altered to call the wrapper function instead of the original C function, meaning that the function is executed in hardware instead of software. This is done by iterating through each function call instruction using LLVM to find calls to the original C function. Once they are found, they are replaced by calls to the generated wrapper functions. To reduce the program footprint, the original C functions are deleted from the software.

Remaining Software Flow

Once wrapper functions have been generated and function calls have been altered, the software flow also creates Tcl scripts which are used to control the Altera SOPC Builder (described in Section 3.2.4). The Tcl scripts contain SOPC commands which add accelerators to the system, make the necessary Avalon connections between the processor, cache, and accelerators, and assign each accelerator to the memory-mapped addresses defined in the wrapper functions. This script is executed after the accelerators are generated in the hardware flow. To make debugging more convenient, the software flow also generates a wave.do file, which is used by ModelSim to add signals to the waveform window. Signals which are considered important for debugging are predetermined and automatically added to the wave.do file. Such signals include the processor's program counter, the instruction signals, cache signals, and Avalon signals from each accelerator. As the last stage of the software flow, the modified software and the generated wrappers are linked and compiled into MIPS assembly. mips-binutils is used to compile the assembly into an ELF executable file, which is executed on the MIPS soft processor. For simulation, the ELF file is disassembled into a binary format. The disassembly is passed to a C++ application that we created, called elf2sdram, which stores it into a format

required by the test-bench of the MIPS processor. The test-bench uses this file to provide instructions and data to the processor and accelerators from off-chip memory.

Hardware Flow

The hardware flow runs the hardware-only pass, which generates a hardware interface when a designated C function is compiled to hardware. For each hardware accelerator, a top-level module is created, which allows the accelerator to communicate with the processor (analogous to the wrapper function in software) and with memory. This section describes how this interface is created, while conforming to the Avalon standards. The process of compiling C to Verilog is not discussed, since that is outside the scope of this thesis. Readers are encouraged to reference papers [9, 10] for our C-to-Verilog algorithms.

Accelerator Architecture

When the compute function from Figure 3.3 is compiled to hardware, an accelerator with the architecture shown in Figure 3.6 is created. It contains the designated function compute, as well as its descendants, mult and div, which are instantiated within compute. The accelerator may also have local memories.

Figure 3.6: Accelerator architecture.

The top-level of an accelerator serves two purposes: 1) it steers memory accesses between local and shared memories, and 2) it contains the interface logic which allows the accelerator to communicate with the processor. An accelerator can have both local and shared memories, which are translated from local and shared variables in C. A shared variable is any global variable, or any variable

that is on the stack of the processor (i.e., a local variable of the processor which is passed in as an argument to the accelerated function). Local memories, as shown in Figure 3.6, are only accessible by a single accelerator, as they are instantiated as block RAMs within the accelerator. A local memory offers single-cycle access times and obviates the need to go through Avalon arbitration or to fetch from high-latency off-chip memory. To minimize memory access times, any constant variables, whether global or local to the accelerator, are also stored in local memories, as these variables cannot be modified during program execution.

Figure 3.7 shows how a memory access is steered between shared and local memories. In an accelerator, each memory address is associated with a 9-bit number called a tag, which is embedded as part of the 32-bit address. This tag identifies which memory is to be accessed. Each local variable is stored in a separate memory block (LLVM can demote small local variables to registers if the accesses can be statically determined at compile time; in that case, the variables are stored in registers on the FPGA, not in block RAMs), and each memory block is assigned a unique tag number. A tag of 1 indicates that the access is to the shared memory, whereas other tags indicate local memory accesses. Thus, on a memory access, the tag is examined to steer the request either to shared memory, over Avalon to the data cache, or to the appropriate local memory block.

Figure 3.7: Memory access steering logic for accelerator.

Although the tag is currently configured to be a 9-bit number, this width can easily be changed depending on the size of the total memory space and the number of local memory blocks.
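To make the tag check concrete, the sketch below expresses it in C; in the generated accelerator this is steering logic in the top-level module, and the position of the tag within the 32-bit address (the top 9 bits here) is our assumption, not something specified in the text.

#include <stdint.h>

#define TAG_BITS   9
#define TAG_SHIFT  23                                    /* assumed: tag occupies the top 9 bits */
#define TAG_OF(a)  (((a) >> TAG_SHIFT) & ((1u << TAG_BITS) - 1u))

/* Returns 1 if the access should go to the shared memory (data cache over Avalon),
 * 0 if it should be routed to one of the accelerator's local memory blocks. */
static int is_shared_access(uint32_t addr)
{
    return TAG_OF(addr) == 1;                            /* a tag of 1 means shared memory */
}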

The top-level of an accelerator has the module declaration shown in Figure 3.8.

module accel_top (
    csi_clockreset_clk,      // clock
    csi_clockreset_reset,    // reset

    // Slave interface to talk to processor
    // Inputs
    avs_s1_address,          // address bits from processor
    avs_s1_read,             // read signal from processor
    avs_s1_write,            // write signal from processor
    avs_s1_writedata,        // data sent from processor
    // Outputs
    avs_s1_readdata,         // data returned to processor

    // Master interface to talk to data cache
    // Outputs
    avm_accel_address,       // address of cache
    avm_accel_read,          // read signal to cache
    avm_accel_write,         // write signal to cache
    avm_accel_writedata,     // data to write to cache
    // Inputs
    avm_accel_readdata,      // data returned from cache
    avm_accel_waitrequest    // stall signal from cache
);

Figure 3.8: Top-level Verilog module for accelerator.

The declaration shows the Avalon Interface signals which allow the accelerator to communicate with the processor and the cache. In Figure 3.8, avs indicates the slave interface whereas avm indicates the master interface. In a master/slave pair, a master is able to initiate a transaction (read/write), whereas a slave can only respond to a transaction from a master. For an accelerator, the slave interface is used to communicate with the MIPS processor. As discussed earlier, the processor uses this interface to start the accelerator,

transfer function arguments, and retrieve the return data. The master interface is used by the accelerator to access shared memory.

Processor/Accelerator Interface

As previously discussed, each accelerator is assigned a memory address range; the processor sends data to an accelerator by writing (store instruction) to addresses within the range, and retrieves data from an accelerator by reading (load instruction) from addresses in the range. The size of the address range depends on the bit width of the read/write data buses (i.e., avs_s1_readdata and avs_s1_writedata in Figure 3.8) as well as the width of the slave address signal (i.e., avs_s1_address in Figure 3.8). This is expressed mathematically in Equation 3.1, where base address indicates the starting address of an accelerator and base address + α indicates the ending address. All bus widths are expressed in bits.

memory mapped range := [base address : base address + α]    (3.1)

where α = 2^(slave address bus width) × (data bus width / 8) - 1

For instance, if the processor (master) has a 32-bit wide data bus connected to an accelerator (slave) with a 3-bit address bus, Equation 3.1 determines that the slave spans 32 bytes, i.e., it is mapped to the address range from its base address to its base address + 0x1F. When the processor writes to the accelerator, the accelerator does not receive the full 32-bit address (the address bus of the processor is 32 bits wide since it is a 32-bit processor), but only the offset from its base address. For example, if the 32-bit wide master writes to the slave's base address, the slave receives an address of 0. If it writes to the base address + 4, the slave receives an address of 1, since an increment of 4 bytes indicates that it is writing to the next 32 bits.
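The following worked example, with an assumed base address (0x01000000 is ours, not from the thesis), shows both the size of the mapped range from Equation 3.1 and the offset the slave sees for each processor write.

#include <assert.h>
#include <stdint.h>

#define BASE            0x01000000u   /* assumed base address, for illustration only */
#define ADDR_BUS_WIDTH  3u            /* slave address bus width, in bits            */
#define DATA_BUS_BYTES  4u            /* 32-bit data bus = 4 bytes                   */

/* Word offset delivered to the slave by the Avalon Interconnect. */
static uint32_t slave_offset(uint32_t full_addr)
{
    return (full_addr - BASE) / DATA_BUS_BYTES;
}

int main(void)
{
    uint32_t alpha = (1u << ADDR_BUS_WIDTH) * DATA_BUS_BYTES - 1u;  /* per Equation 3.1 */
    assert(alpha == 0x1F);                /* the slave spans base .. base + 0x1F        */
    assert(slave_offset(BASE)     == 0);  /* a write to the base address -> offset 0    */
    assert(slave_offset(BASE + 4) == 1);  /* a write to base + 4         -> offset 1    */
    return 0;
}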

The address that the slave receives is incremented by 1 every time the master writes to an address which is offset from the base address by a number of bytes equal to the width of the data bus. This address translation from full 32-bit addresses to base-address offsets is done automatically by the Avalon Interconnect. The address offsets are used by the accelerator to determine the type of data being transmitted by the processor. For instance, when the processor gives the start signal (by writing to the STATUS pointer as shown in Figure 3.5), the slave receives an offset of 0 (since the STATUS pointer is assigned to the base address). When the processor reads from the accelerator to retrieve the return value, the accelerator gets an offset of 1. Depending on the offset that the accelerator receives, the top-level logic controls how to send/receive data. The top-level of the accelerator also contains logic called argument receivers, which are used to receive the arguments of a function from the processor. The argument receivers for the compute function, which has two function arguments, are shown in Figure 3.9. The processor sends each argument over Avalon with a unique address, which is incremented for each new argument. Hence, as shown in Figure 3.9, a slave address offset of 2 indicates the first argument into the function, and an offset of 3 indicates the second argument (offset 0 is used for the start signal, and offset 1 is used for the return value, as shown in Figure 3.5). An argument receiver is generated for each argument into the function, and once all of the ARG_ready signals are asserted, indicating that all of the arguments have been received, the accelerator can start once it receives the start signal. An accelerator can be called multiple times in a program, hence the ARG_ready signals are reset each time the accelerator starts or resets.

always @ (posedge clk)
begin
    if (start | reset) begin
        ARG1_ready <= 1'b0;                    // clear the ready signal on start and reset
    end
    else if ((avs_s1_address == 2) & (avs_s1_write)) begin
        // when the processor asserts a write with 2 as the offset address
        ARG1[31:0] <= avs_s1_writedata[31:0];  // receive the first argument
        ARG1_ready <= 1'b1;                    // assert the ready signal for the first argument
    end
end

always @ (posedge clk)
begin
    if (start | reset) begin
        ARG2_ready <= 1'b0;                    // clear the ready signal on start and reset
    end
    else if ((avs_s1_address == 3) & (avs_s1_write)) begin
        // when the processor asserts a write with 3 as the offset address
        ARG2[31:0] <= avs_s1_writedata[31:0];  // receive the second argument
        ARG2_ready <= 1'b1;                    // assert the ready signal for the second argument
    end
end

Figure 3.9: Function argument receivers for accelerator.

Accelerator/Cache Interface

We now describe how an accelerator communicates with the data cache, so that it can access the shared memory space. The data buses (read bus and write bus) are 128 bits wide. The Avalon memory-mapped interface requires the read and write buses of an interface to have the same bit widths, and the widths can be 8, 16, 32, 64, 128, 256, 512, or 1024 bits. The signal assignments for the read/write data buses are shown in Table 3.1; 128-bit wide buses are used because the write bus requires a total of 100 bits. The unused bits can later be used to serve other purposes, as described later in this thesis.

Table 3.1: Interface between Accelerator/Cache

Bus Type    Bits      Signal Description
Write bus   63:0      Data to write
            65:64     Size of data
            66        Stall processor
            67        Unstall processor
            99:68     Address of data
            127:100   Not used
Read bus    63:0      Returned data
            127:64    Not used

As shown in Table 3.1, 64 bits are used for data on both the read and write buses, as an accelerator can access up to 64 bits of data at once (the largest data type is a double, which requires 64 bits). It can access either 8, 16, 32, or 64 bits of data, hence this size information is binary-encoded into 2 bits. For sequential accelerators, the processor is stalled during the execution of an accelerator. When an accelerator starts execution, it sends a signal to the cache to stall the processor, which is subsequently routed to the stall logic of the processor (this obviates the need to create another master/slave interface pair for an accelerator to send the stall signal directly to the processor; the existing accelerator/cache and cache/processor interfaces are used to route the stall signal). When the accelerator is finished, it sends a signal to unstall the processor, at which point processor execution resumes.
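As an illustration of the layout in Table 3.1, the write-bus fields could be described in C as below; the struct name and the particular 2-bit size encoding are our assumptions, since the thesis only states that the access size is binary-encoded into 2 bits.

#include <stdint.h>

/* Field-by-field view of the 128-bit accelerator-to-cache write bus (Table 3.1). */
typedef struct {
    uint64_t data;      /* bits 63:0    - data to write                                      */
    uint8_t  size;      /* bits 65:64   - access size, e.g. 0=8b, 1=16b, 2=32b, 3=64b (assumed) */
    uint8_t  stall;     /* bit  66      - request that the processor be stalled              */
    uint8_t  unstall;   /* bit  67      - request that the processor be unstalled            */
    uint32_t address;   /* bits 99:68   - address of the data                                */
                        /* bits 127:100 - not used                                           */
} accel_write_bus_t;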

Controlling Altera SOPC Builder

The Altera SOPC Builder has a Tcl interface which allows the user to build and generate a system via the command line. This is very useful for automation, as one does not have to open the GUI and manually add the components to generate a system. Once both the software and hardware flows are complete, LegUp runs the Tcl script generated in the software flow to add each accelerator to the system using SOPC Builder. SOPC Builder then generates the system, which creates the necessary Avalon Interconnect as well as the arbitration between the components. To summarize, a high-level flow diagram which shows how a processor/accelerator system is created is shown in Figure 3.10. The flow starts with the user placing the name of the function for acceleration in the config.tcl file and running make hybridsim. The flow is divided into software and hardware flows at this point. The software flow generates the wrapper file, replaces the calls to the original C function with calls to the generated wrapper function, links the wrapper function with the main C program to compile for MIPS, and finally generates the Tcl script for SOPC Builder. The hardware flow generates the hardware accelerator as well as the top-level Avalon interface. Once both the software and hardware flows are complete, the Tcl script runs the SOPC Builder to add the accelerator to the system. Finally, ModelSim simulates the SoC to verify correct functionality and to extract the total execution cycles.

3.3 Experimental Methodology

The goals of our experimental study for sequential execution are two-fold: 1) to demonstrate LegUp's ability to effectively explore the hardware/software co-design space, and 2) to compare the quality of hardware vs. software implementations of the benchmark programs. We use 13 benchmarks for this study, which are described in the next section.

Figure 3.10: Processor/accelerator system generation flow.

With the above goals in mind, we map each benchmark program using 4 different flows, representing implementations with successively increasing amounts of computation happening in hardware vs. software. The flows are as follows (labels appear in parentheses):

1. A software-only implementation running on the MIPS soft processor (MIPS-SW).

2. A hybrid software/hardware implementation where the second most compute-intensive function (and its descendants) in the benchmark is implemented as a hardware accelerator, with the balance of the benchmark running in software on the MIPS processor (LegUp-Hybrid2).

3. A hybrid software/hardware implementation where the most compute-intensive function (and its descendants) is implemented as a hardware accelerator, with the balance in software (LegUp-Hybrid1). Neither LegUp-Hybrid2 nor LegUp-Hybrid1 considers the main() function for acceleration.

4. A pure hardware implementation produced by LegUp (LegUp-HW).

The two hybrid flows correspond to a system that includes the MIPS processor and a single accelerator, where the accelerator implements a C function and all of its descendant functions. For the back-end of the flow, we use Quartus II ver. 9.1 SP2 to target the Cyclone II FPGA. Quartus II was executed in timing-driven mode with all physical synthesis optimizations turned on. The correctness of the LegUp implementations was verified using post-routed ModelSim simulations and also in hardware using the Altera DE2 board. Three metrics are employed to gauge quality of results: 1) circuit speed, 2) area, and 3) energy consumption. For circuit speed, we consider the cycle latency, clock frequency, and total execution time. Cycle latency refers to the number of clock cycles required for a complete execution of a benchmark.

Clock frequency refers to the reciprocal of the post-routed critical path delay reported by Altera timing analysis. Total execution time is simply the cycle latency multiplied by the clock period. For area, we consider the number of used Cyclone II logic elements (LEs), memory bits, and 9x9 multipliers. Energy is a key cost metric, as it directly impacts electricity costs, as well as influences battery life in mobile settings. To measure energy, we use Altera's PowerPlay power analyzer tool, applied to the switching activity data obtained through a post-routed full-delay simulation with Mentor Graphics ModelSim.

Benchmarks

For sequential execution, we used 13 benchmark C programs, summarized in Table 3.2.

Table 3.2: Benchmarks used for Sequential Execution

Category     Benchmarks                                  Lines of C
Arithmetic   64-bit dbl precision add, mult, div, sin
Encryption   AES, Blowfish, SHA
Processor    MIPS processor                              232
Media        JPEG decoder, Motion, GSM, ADPCM
General      Dhrystone                                   491

Included are all 12 programs in the CHStone high-level synthesis benchmark suite [29], as well as Dhrystone [50], a standard integer benchmark. The programs represent a diverse set of computations falling into several categories: arithmetic, encryption, media, processing, and general. Their sizes, in lines of C code, are listed in Table 3.2. The arithmetic benchmarks implement 64-bit double-precision floating-point operations in software using integer types. Notice that the CHStone suite contains a benchmark which is a software model of a MIPS processor (which we can then run on a MIPS processor). A key characteristic of the benchmarks is that inputs and expected outputs are

included in the programs themselves. The presence of the inputs and golden outputs for each program gives us assurance regarding the correctness of our synthesis results. Each benchmark program performs computations whose results are then checked against golden values. This is analogous to built-in self-test in design-for-test methodology. No inputs (e.g., from the keyboard or a file) are required to run the programs. As an example, for the MIPS benchmark program in the CHStone suite, the inputs comprise an array of integer data and a set of MIPS machine instructions that cause the integer array to be sorted in ascending order. The golden result is the same integer array in sorted order. Each program returns 0 on success (all results matched golden values), and non-zero otherwise.

Results

Table 3.3 presents speed performance results for all circuits and flows. (All results gathered in this section are from LegUp version 1.0, released in March 2011; the results shown in Chapter 5 for parallel execution use LegUp version 2.0, released in December 2011.) Three data columns are given for each flow: Cycles contains the latency in number of clock cycles; Freq presents the post-routed clock frequency in MHz; Time gives the total execution time in µs (Cycles/Freq). The flows are presented in the order specified above, from pure software on the left, to pure hardware on the right. The second-last row of the table contains geometric mean results for each column. The last row of the table presents the ratio of the geomean relative to the software flow (MIPS-SW). Beginning with the MIPS-SW flow, the data in Table 3.3 indicates that the processor runs at 74 MHz on the Cyclone II and the benchmarks take between 6.7K and 29M cycles to complete their execution, corresponding to a wide range of program execution times. In the LegUp-Hybrid2 flow, where the second most compute-intensive function (and its descendants) is implemented as a hardware accelerator, the number of cycles needed for execution is reduced by 50% compared with software, on average.

For hybrid systems, the upper bound on Fmax is the Fmax of the processor, 74 MHz, since there is only one clock in our system. For benchmarks where the Fmax is lower than 74 MHz, the generated accelerators exhibited long combinational delays within their computational logic, which is also reflected in the lower Fmax results in the LegUp-HW column. For the benchmarks where the Fmax is higher than 74 MHz (by 1 to 5 MHz), we attribute this small difference to algorithmic noise in the synthesis tool. On average, the Hybrid2 circuits run at 10% lower frequency than the processor. Overall, LegUp-Hybrid2 provides a 45% (1.8×) speed-up in program execution time vs. software (MIPS-SW). Moving on to the LegUp-Hybrid1 flow, which represents additional computations in hardware, Table 3.3 shows that cycle latency is 75% lower than software alone. However, clock speed is 12% worse for this flow, which, when combined with latency, results in a 72% reduction in program execution time vs. software (a 3.6× speed-up over software). Looking broadly at the data for MIPS-SW, LegUp-Hybrid1 and LegUp-Hybrid2, we observe a trend: execution time decreases substantially as more computations are mapped to hardware. Note that the MIPS processor would certainly run at a higher clock speed on a 40/45 nm FPGA, e.g., Stratix IV; however, the accelerators would also speed up commensurately. The right-most column in Table 3.3 corresponds to pure hardware implementations. Since none of the benchmarks have floating point operations, dynamic memory, or any C constructs unsupported by LegUp, each program can be compiled entirely to hardware. In this case, the MIPS soft processor does not exist in the system, hence the Fmax is not limited by the processor and the circuits can run faster than in the hybrid cases. The LegUp-HW flow requires just 12% of the execution time of the software implementations. Observe that neither of the hybrid scenarios provides a performance win over pure hardware for these particular benchmark circuits. Nevertheless, the hybrid scenarios do serve to demonstrate LegUp's ability to synthesize working systems that contain both hardware and software aspects. Moreover, portions of benchmarks using C language

constructs that are unsupported for HLS can run in software on the MIPS processor. For example, if a program had a single unsupported operation, such as malloc, then the pure hardware flow would not be feasible and the user would have to run everything in software. However, the hybrid flow allows a user to execute the unsupported portion of the program in software, while accelerating the rest of the program in hardware. This allows one to reap the performance/energy benefits of hardware with increased language coverage, meaning that a wider range of applications is amenable to hardware acceleration. Area results are provided for each circuit in Table 3.4. For each flow, three columns provide the number of Cyclone II logic elements (LEs), the number of memory bits used (# bits), as well as the number of 9x9 multipliers (Mults). As in the performance data above, the geometric mean and ratios relative to MIPS software alone are given in the last two rows of Table 3.4. Beginning with the area of the MIPS processor, the data in Table 3.4 shows it requires 12.2K LEs, 226K memory bits, and 16 multipliers. The hybrid flows include both the MIPS processor and custom hardware, and consequently, they consume considerably more area. When the LegUp-Hybrid2 flow is used, the number of LEs, memory bits, and multipliers increase by 2.23×, 1.14×, and 2.68×, respectively, vs. the MIPS processor alone, on average. The LegUp-Hybrid1 flow requires even more area: 2.75× LEs, 1.16× memory bits, and 3.18× multipliers vs. MIPS. Note that link-time optimization in LLVM (which permits code optimization across compilation modules) was disabled for the hybrid flows, as was necessary to preserve the integrity of the function boundaries. However, link-time optimization was enabled for the MIPS-SW and LegUp-HW flows, permitting greater compiler optimization for such flows, possibly improving area and speed. Turning to the pure hardware flows in Table 3.4, the LegUp-HW flow implementations require 28% more LEs than the MIPS processor on average.

Table 3.3: Speed performance results. (For each of MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, and LegUp-HW, the table reports Cycles, Freq., and Time for the benchmarks adpcm, aes, blowfish, dfadd, dfdiv, dfmul, dfsin, gsm, jpeg, mips, motion, sha, and dhrystone, followed by Geomean and Ratio rows.)

Table 3.4: Area results. (For each of the same four flows, the table reports LEs, # bits, and Mults for the same benchmarks, followed by Geomean and Ratio rows.)

Figure 3.11: Performance and area results.

In terms of memory bits, the LegUp-HW flow requires far fewer memory bits than the MIPS processor alone. This is because the MIPS processor contains large caches (instruction/data cache), which consume a lot of memory bits, in addition to FIFOs and memories which are used for peripherals. Figure 3.11 summarizes the speed and area results. The left vertical axis represents geometric mean execution time; the right axis represents area (number of LEs). Observe that execution time drops as more computations are implemented in hardware. While the data shows that pure hardware implementations offer superior speed performance to pure software or hybrid implementations, the plot demonstrates LegUp's usefulness as a tool for exploring the hardware/software co-design space. In terms of area, the pure hardware implementations take considerably less area than the hybrid implementations, as the MIPS processor is not included in the system (the MIPS processor consumes 12,243 LEs, which constitutes 45% of the area for LegUp-Hybrid2 and 36% of the area for LegUp-Hybrid1). Figure 3.12 presents the geomean energy results for each flow normalized against MIPS-SW. The energy results bear similarity to the trends observed for execution time, though the trends here are even more pronounced. Energy is reduced drastically as computations are increasingly implemented in hardware vs. software.

Figure 3.12: Energy results.

The LegUp-Hybrid2 and LegUp-Hybrid1 flows use 47% and 76% less energy than the MIPS-SW flow, respectively, representing 1.9× and 4.2× energy reductions. The pure hardware flows are even more promising from the energy standpoint. With LegUp-HW, the benchmarks use 94% less energy than if they are implemented with the MIPS-SW flow (an 18× reduction). This demonstrates that energy savings, in addition to performance improvements, is another major advantage of hardware over software. Appendix B presents the full set of results for accelerating each function in the 12 CHStone benchmarks as well as the Dhrystone benchmark (164 functions in total). This illustrates the flexibility and the robustness of LegUp's hardware/software co-design flow.

3.4 Summary

This chapter described the system architecture targeted by LegUp and the hardware modifications which enabled processor/accelerator co-execution. It described how the processor communicates with accelerators and how accelerators access memory. It illustrated the flow of automatically generating a processor/accelerator hybrid system and presented results across 13 benchmarks which showed the performance/energy benefits of

accelerating computationally intensive C functions in hardware. Even though the performance/energy improvements are smaller than those of the pure hardware case for the given benchmarks, the hybrid flow demonstrates the usefulness of the tool for design space exploration in the hardware/software co-design space.

Chapter 4

Parallel Execution

The performance of a single-core CPU has begun to plateau in recent years. The conventional approach of scaling the clock frequency with each new technology node is no longer feasible due to exponentially increasing power consumption. At ISSCC 2001 (International Solid-State Circuits Conference), Intel's Vice President Patrick Gelsinger claimed that, "If scaling continues at present pace, by 2005, high speed processors would have the power density of a nuclear reactor, by 2010, a rocket nozzle, and by 2015, the surface of the sun." A consequence of high power consumption is high heat dissipation, which increases cooling costs in addition to adding significant performance overhead to guarantee thermal safety. To overcome this barrier, the industry has decided to go parallel, by creating multiple cores which work together instead of having one fast core. Amdahl's Law models the maximum speed-up that can be achieved when a portion of a program is parallelized [4]. Specifically:

S(N) = 1 / ((1 - P) + P/N)    (4.1)

This equation states that if P is the proportion of a program that can be made parallel (i.e., benefit from parallelization), and (1 - P) is the proportion that cannot be parallelized (remains serial), then the maximum speed-up that can be achieved by using N processors is given by S(N).
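As a quick worked example (the values of P and N below are illustrative, not taken from the thesis), a program that is 90% parallelizable and runs on 8 processing elements achieves a speed-up of roughly 4.7×:

#include <stdio.h>

/* Amdahl's Law, Equation 4.1. */
static double amdahl_speedup(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / (double)N);
}

int main(void)
{
    printf("S(8) = %.2f\n", amdahl_speedup(0.9, 8));  /* prints S(8) = 4.71 */
    return 0;
}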

Hence, if the parallel portion of a program is large and a vast number of processors is available, a large speed-up can be achieved. FPGAs provide programmable hardware that executes in parallel. Contrary to multi-core CPUs or GPUs, where the number of processing elements is fixed, FPGAs can be programmed to hold any number of processing elements, as long as they fit on the chip. Thus, the value of N in Equation 4.1 can be adjusted to meet performance/area requirements. As in a multi-core environment, LegUp can take advantage of parallelism to execute program segments in parallel. This chapter describes how this is achieved by modifying both software and hardware components. It introduces the memory access profiler, a software application that helps a user determine which functions of a program can run in parallel. It describes changes to the software wrapper functions and introduces hardware implementations of multi-ported caches, which allow more than 2 processing elements to access the cache at the same time.

4.1 Parallel Execution

A number of steps are involved in executing functions in parallel. First, given a C program, parallel functions must be identified. These are functions which do not have any data dependencies between them. In our current system, locks are not supported (since the experiments for this thesis were performed, locks have been implemented in LegUp), hence two functions can run in parallel only if they are completely data-parallel. In other words, they cannot write to the same memory address (which may result in a race condition), but they can read from the same address. It is up to the user to determine which functions can run in parallel. A function can execute in parallel with another function, or it can also execute in parallel with itself. In the latter case, if the function is called multiple times in a loop, then this loop can be unrolled to call multiple instances of the function in parallel (where

each instance is mapped to an accelerator), as shown in Figure 4.1.

// Execute accel sequentially
for (i = 0; i < 3; i++) {
    compute(i);
}

// Execute accel in parallel
compute_0(0);
compute_1(1);
compute_2(2);

Figure 4.1: Loop unrolling to execute in parallel. (In this case, compute_0, compute_1, and compute_2 are executed in a non-blocking manner: the processor first calls compute_0, then continues to call compute_1 without waiting for compute_0 to finish execution. This is described in more detail later in this chapter.)

Similarly, if a function operates on a large chunk of data, such as an array, this array can be divided into multiple smaller arrays, allowing different instances of the function to operate in parallel. However, for any non-trivial software program, it can be difficult for a user to determine if two functions have any data dependencies. To address this issue, we developed a memory access profiler that profiles a program to help determine which functions can execute in parallel.

Memory Access Profiler

Our memory access profiler uses the Gxemul simulator to generate an instruction trace. Gxemul is an open-source full-system simulator with support for ARM, MIPS, Motorola 88K, PowerPC, and SuperH processors [26]. To use Gxemul, we first compile the entire software program to MIPS assembly using LLVM, and then assemble it using mips-binutils into an ELF file. The ELF file is passed to Gxemul, which runs the program and dumps the instruction trace to a text file. Our memory access profiler parses the instruction trace for load and store instructions to build a hash table of all memory addresses accessed. For each accessed address, it builds a list of all the functions that

access the address and keeps track of the number of times the address is accessed. The profiler differentiates between global and local variables, as well as between different invocations of a single function (if a single invocation of a function accesses the same address twice, this causes no inter-function dependency, hence the function can still be parallelized; if two different invocations of a function access the same address, then the function cannot be parallelized). Once the hash table is completely built, the profiler goes through each address and checks if there are memory dependencies. Example outputs from the memory access profiler are shown in Figure 4.2.

// Example 1: When there are no memory dependencies
Checking for function: logscl
local variable conflict with other: 0
local variable conflict with self: 0
global variable conflict with other: 0
global variable conflict with self: 0
DONE FOR: logscl
Invocation count of this function: 100

// Example 2: When there is a memory dependency
Checking for function: encode
global variable conflict with other at:
global variable conflict with self at:
local variable conflict with other: 0
local variable conflict with self: 0
global variable conflict with other: 1
global variable conflict with self: 1
DONE FOR: encode
Invocation count of this function: 50

Figure 4.2: Example outputs from the memory access profiler.

Example 1 in Figure 4.2 shows that there are no memory dependencies for function logscl, hence this function can be parallelized. It also shows that this particular function has been called 100 times in the program. Example 2 shows that two global variable conflicts occurred for function encode at a particular memory address. It has one global variable conflict with other, meaning that another function is also writing to the memory address, and one global variable conflict with self, meaning that different invocations

of encode write to the same address. If the conflicting variable is a global variable, as in this case, then the user can examine the MIPS disassembly to see which global variable is assigned to that address. This enables the user to possibly rearrange the code to make the function parallelizable. Note that the memory accesses, and hence the memory conflicts, depend on the program's input data. Thus, the profiler determines whether there are data dependencies with the current set of inputs, but cannot conclusively determine this for all possible inputs.

4.2 Enabling Parallel Execution

Both software and hardware modifications were made to LegUp to allow parallel execution. Minor modifications include removal of the stall/unstall signals that were sent from accelerators to the processor, as well as wrapper function changes which implement polling instead of stalling. This way, instead of stalling the processor after starting an accelerator, the processor polls on a set of parallel accelerators to check if they are done. This is described in more detail in the next section. A much more significant change is the implementation of a multi-ported cache, which is described in Section 4.2.2.

Parallel Wrapper Function

In parallel execution, each accelerator has two wrapper functions: a calling wrapper and a polling wrapper. Figure 4.3 shows the wrapper functions that are created when the compute function from Chapter 3 is parallelized by creating two instances of the same function. The numbers on the left of Figure 4.3 indicate line numbers. Each parallel accelerator requires its own set of calling/polling wrapper functions, where each set has unique memory-mapped addresses. That is, when there are two instances of accelerators for the compute function, there are two sets of calling/polling

wrapper functions. The calling wrapper (lines 4 to 8 and 13 to 17 in Figure 4.3) calls the accelerator by transferring the arguments and starting the accelerator. This is identical to the first portion of a sequential wrapper, shown in Figure 3.5, where it writes to memory-mapped addresses. Contrary to the sequential case, the accelerator does not send a stall signal back to the processor. Since the processor is not stalled, it can continue to execute while the accelerator is running. At this point, the processor can call other parallel accelerators, or perform computations on its own. A polling wrapper (lines 18 to 22 and 23 to 27) is used to check the done status of an accelerator by polling on the STATUS memory-mapped address (lines 0, 9) in a while loop. When the accelerator is done, it asserts a value of 1 as its return data (this is not the actual return value of the function, but a value which indicates that the accelerator is done). When the processor reads this value, it exits the while loop and reads the return data from the accelerator using its DATA pointer (lines 1, 10). At this time the accelerator sees that the processor is reading from an address with an offset of 1, as explained in Chapter 3. When receiving this offset, the accelerator asserts its actual return value (the return value of the function) to the processor. This is similar to the second portion of a sequential wrapper, where it reads from the DATA pointer. Note that the STATUS address is used both for starting an accelerator and for checking whether it is done. When the processor writes a 1 to the STATUS address, it asserts a write signal and sends a value of 1 on the writedata bus. When the processor reads from the STATUS address, it asserts a read signal and the data is returned on the readdata bus. Since there are separate read and write buses, as well as separate read and write signals, the same address can be used to achieve two different operations. The return type of a calling wrapper is always void, as it is used only for starting an accelerator. A polling wrapper function takes the return type of the original C function.

0:  #define compute1_status (volatile int *)0xf
1:  #define compute1_data   (volatile int *)0xf
2:  #define compute1_arg1   (volatile int *)0xf
3:  #define compute1_arg2   (volatile int *)0xf

    // Calling wrapper function for accel 1
4:  void legup_call_compute1 (int * inputa, int * inputb) {
5:      *compute1_arg1 = inputa;
6:      *compute1_arg2 = inputb;
7:      *compute1_status = 1;
8:  }

9:  #define compute2_status (volatile int *)0xf
10: #define compute2_data   (volatile int *)0xf
11: #define compute2_arg1   (volatile int *)0xf
12: #define compute2_arg2   (volatile int *)0xf

    // Calling wrapper function for accel 2
13: void legup_call_compute2 (int * inputa, int * inputb) {
14:     *compute2_arg1 = inputa;
15:     *compute2_arg2 = inputb;
16:     *compute2_status = 1;
17: }

    // Processor continues execution

    // Polling wrapper function for accel 1
18: int legup_poll_compute1 () {
19:     while (*compute1_status == 0)
20:     {}
21:     return *compute1_data;
22: }

    // Polling wrapper function for accel 2
23: int legup_poll_compute2 () {
24:     while (*compute2_status == 0)
25:     {}
26:     return *compute2_data;
27: }

Figure 4.3: Wrapper functions for parallel accelerators.
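A hypothetical use of the wrappers in Figure 4.3 is sketched below; the caller, its arrays, and the intermediate software work are invented for illustration, but the call/poll pattern is the one the generated wrappers support.

extern void legup_call_compute1(int *inputa, int *inputb);
extern void legup_call_compute2(int *inputa, int *inputb);
extern int  legup_poll_compute1(void);
extern int  legup_poll_compute2(void);

int run_parallel(int *a1, int *b1, int *a2, int *b2)
{
    legup_call_compute1(a1, b1);      /* start accelerator 1 (non-blocking)        */
    legup_call_compute2(a2, b2);      /* start accelerator 2 (non-blocking)        */

    /* ... the processor may execute other software here, in parallel ... */

    int r1 = legup_poll_compute1();   /* wait for accelerator 1 and get its result */
    int r2 = legup_poll_compute2();   /* wait for accelerator 2 and get its result */
    return r1 + r2;
}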

On the other hand, the polling wrapper function does not have any function arguments, as it is used only for checking the status and retrieving the return value. Thus, the calling wrapper takes the same function arguments as the original C function. Instead of just calling/polling the accelerators, the processor can freely execute other software segments between the calling and polling wrapper functions. This facilitates a heterogeneous computing environment where software computations run in parallel with hardware computations. This is explored in Chapter 5.

Multi-ported Cache

Today's high-performance computers, such as multi-core CPUs and GPUs, are often memory-bandwidth limited. FPGAs are no exception, as off-chip memory access is expensive. To alleviate this, FPGAs provide high-performance on-chip block RAMs which can run at very high speeds. Current FPGA architectures, however, limit the block RAMs to at most two ports, meaning that there can only be up to two reads or two writes at any point in time. This limitation led us to the default architecture that was shown in Figure 3.2, with the processor accessing one port of the RAM and accelerators accessing the other port. One port is always reserved for the processor, as the processor requires different control signals from the accelerators. Hence, with more than one accelerator, multiple accelerators share the second port, as was shown in Figure 3.1. This architecture is suitable in two scenarios: 1) sequential execution, where only the processor or a single accelerator is executing at a given time; 2) parallel execution either with a small number of accelerators, or for compute-intensive applications, where accelerators do not access memory often. For memory-intensive applications, with many accelerators operating in parallel, a dual-ported cache architecture may result in poor performance, as accelerators contend for one port of the cache, leading to the accelerators being stalled most of the time. To overcome this performance barrier, we implemented two types of multi-ported caches, both of which allow multiple (more than 2) concurrent accesses to all regions of the cache

in every cycle. The first approach is based on a recently proposed multi-ported memory, comprising multiple RAM banks and a small logical memory, called the live-value table, that tracks which RAM bank holds the most-recently-written value for a memory address [36]. The second approach is based on memory multi-pumping, where the underlying cache memory operates at a multiple of the system frequency, allowing multiple memory reads/writes to happen in a single system cycle. Each multi-ported cache offers different performance/area trade-offs, which enables a wide exploration of memory architectures.

Live-Value Table Approach

Original LVT approach: The first multi-ported cache is based on the work by LaForest et al. [36]. The original work in [36] replicates dual-ported RAM blocks to emulate a multi-ported memory which can have more than 2 ports. It replicates memory blocks for each read and write port, while keeping reads and writes as separate ports, and uses a live-value table (LVT) to indicate which of the replicated memories holds the most recent value for a given memory address. Each write port has its own write bank containing R memories, where R is the number of read ports. An example of a 4-read/4-write port (4 ports that can do reads and 4 ports that can do writes) memory is shown in Figure 4.4. In Figure 4.4, each outer box (labelled WB0 to WB3) indicates a write bank, each of which contains as many RAM blocks (labelled M0 to M3) as the number of read ports. Each RAM block is a simple dual-port memory, where one port is reserved for writes and one port is reserved for reads. This architecture allows up to 4 reads and 4 writes to occur at the same time. Keeping reads and writes separate is useful for certain applications, such as register files in processors, where separate reads and writes need to happen at the same time. Note that the read lines for R1, R2, and R3 are not shown for clarity. As shown for R0, each of the read ports (R1, R2, and R3) connects to exactly one memory from each write bank. All ports (both read and write) connect to the LVT.

Figure 4.4: 4-write/4-read port memory with LVT. (For clarity, read lines are not shown for R1, R2, R3; each port, both write and read, also connects to the LVT.)

This is the original architecture presented in [36]. On a write, the writing port writes to all memories in its bank, and also stores its write bank number at the corresponding memory address in the LVT, indicating that its write bank contains the most recent data for that address. On a read, the reading port reads from all of its connected memories (one from each write bank), and looks up the memory address in the LVT, which returns the write bank number that holds the most recent data. This is used to select the most-recently-written value from one of the connected memories. For example, let's consider a case where W3 writes last to address 0. W3 writes to M0, M1, M2, and M3 in WB3, and writes its write bank number, WB3, to address 0 in the LVT. When R0 wants to read from address 0, it reads from all of its connected memories, which are M0 in WB0, M0 in WB1, M0 in WB2, and M0 in WB3. At the same time, it looks up address 0 in the LVT, which indicates that WB3 holds the most recent value. Thus, the multiplexer selects the memory block (M0) in WB3, which holds the data most recently written by W3.
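The read/write behaviour just described can be summarized with a small behavioural C model (ours, purely for illustration; the real design consists of RAM blocks and multiplexers, not software):

#include <stdint.h>

#define NPORTS 4      /* write ports (equal to read ports in this example) */
#define DEPTH  256    /* illustrative memory depth                         */

static uint32_t bank[NPORTS][DEPTH];  /* write bank w: the copies written by port w */
static uint8_t  lvt[DEPTH];           /* which write bank wrote each address last   */

static void lvt_write(int wport, uint32_t addr, uint32_t data)
{
    bank[wport][addr] = data;         /* in hardware: written into all R RAMs of bank wport */
    lvt[addr] = (uint8_t)wport;       /* record the most recent writer                      */
}

static uint32_t lvt_read(uint32_t addr)
{
    return bank[lvt[addr]][addr];     /* the LVT entry selects among the banks' copies      */
}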

The LVT is implemented with registers, as multiple ports can read and write from different memory locations at the same time. A reading port can read from any of the registers in the LVT, hence for each reading port, a multiplexer is required to select between the registers in the LVT. As the size of the multi-ported memory increases, the number of registers in the LVT increases, which subsequently increases the size of the multiplexers. With this architecture, the total memory consumed is α × the original memory size, where α is equal to the number of write ports × the number of read ports. In the case of caches, the number of read and write ports is equal (we assume the standard case, in which a single processing element cannot perform multiple reads or multiple writes to the same cache at the same time), and with n read ports and n write ports, the cache size would grow by n².

Modified LVT approach: In our system, a single processing element can either read or write but cannot do both at the same time, hence the read and write ports do not need to be separate as shown in Figure 4.4. A read port and a write port can be combined into a single read/write port (a read/write port can perform both reads and writes, but not at the same time), and one true dual-ported memory can therefore be used for two read/write ports, instead of using 2 simple dual-ported memories with 2 read ports and 2 write ports (a true dual-ported memory has two ports, each of which can read and write, but not at the same time; it can therefore service up to 2 reads, or 2 writes, or 1 read and 1 write). As will be shown below, this reduces the total memory consumption to less than half of that in [36]. A 4-ported cache using the new architecture is shown in Figure 4.5, where each M represents a true dual-ported memory, and MC represents a memory controller. For clarity, only the output (read) lines are shown from each memory block. The input (write) lines follow in parallel with the output lines (without the multiplexer). A memory controller is connected to each port of the memory, which is subsequently connected to either the processor or an accelerator.

Figure 4.5: LVT-based 4-ported cache.

In our variant of the LVT memory approach, it is required that any two ports have one memory block in common. For example, in Figure 4.5, port 1 shares memory blocks M1, M3 and M4 with ports 2, 3, and 4, respectively. This allows data written by port 1 to be read by all other ports. On a write, a port writes to all of the memory blocks that it is connected to, which is n - 1 blocks. As in the original LVT memory implementation, the port also writes to the LVT to indicate that it has updated the memory location most recently. On a read, a port reads from all connected RAM blocks and selects the data according to the port number read from the LVT. Compared to the previous work, in which memory grows by n² with n ports, our LVT variant scales as:

New cache size = (n × (n - 1) / 2) × original cache size    (4.2)

The 4-port cache in Figure 4.5 replicates memory by a factor of 6, whereas the approach in [36] replicates memory by a factor of 16. The output multiplexer, which selects between the memory blocks, is also reduced from an n-to-1 multiplexer to an (n-1)-to-1 multiplexer. This new multi-ported cache based on the LVT approach is referred to as the LVT cache in this thesis. This can be compared to multi-cache architectures, where multiple


DDR and DDR2 SDRAM Controller Compiler User Guide DDR and DDR2 SDRAM Controller Compiler User Guide 101 Innovation Drive San Jose, CA 95134 www.altera.com Operations Part Number Compiler Version: 8.1 Document Date: November 2008 Copyright 2008 Altera

More information

USING C-TO-HARDWARE ACCELERATION IN FPGAS FOR WAVEFORM BASEBAND PROCESSING

USING C-TO-HARDWARE ACCELERATION IN FPGAS FOR WAVEFORM BASEBAND PROCESSING USING C-TO-HARDWARE ACCELERATION IN FPGAS FOR WAVEFORM BASEBAND PROCESSING David Lau (Altera Corporation, San Jose, CA, dlau@alteracom) Jarrod Blackburn, (Altera Corporation, San Jose, CA, jblackbu@alteracom)

More information

Overview of ROCCC 2.0

Overview of ROCCC 2.0 Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment

More information

Yet Another Implementation of CoRAM Memory

Yet Another Implementation of CoRAM Memory Dec 7, 2013 CARL2013@Davis, CA Py Yet Another Implementation of Memory Architecture for Modern FPGA-based Computing Shinya Takamaeda-Yamazaki, Kenji Kise, James C. Hoe * Tokyo Institute of Technology JSPS

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

SoC Platforms and CPU Cores

SoC Platforms and CPU Cores SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis

More information

Multi-core microcontroller design with Cortex-M processors and CoreSight SoC

Multi-core microcontroller design with Cortex-M processors and CoreSight SoC Multi-core microcontroller design with Cortex-M processors and CoreSight SoC Joseph Yiu, ARM Ian Johnson, ARM January 2013 Abstract: While the majority of Cortex -M processor-based microcontrollers are

More information

«Real Time Embedded systems» Multi Masters Systems

«Real Time Embedded systems» Multi Masters Systems «Real Time Embedded systems» Multi Masters Systems rene.beuchat@epfl.ch LAP/ISIM/IC/EPFL Chargé de cours rene.beuchat@hesge.ch LSN/hepia Prof. HES 1 Multi Master on Chip On a System On Chip, Master can

More information

FPGA based embedded processor

FPGA based embedded processor MOTIVATION FPGA based embedded processor With rising gate densities of FPGA devices, many FPGA vendors now offer a processor that either exists in silicon as a hard IP or can be incorporated within the

More information

Design of an Efficient FSM for an Implementation of AMBA AHB in SD Host Controller

Design of an Efficient FSM for an Implementation of AMBA AHB in SD Host Controller Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 11, November 2015,

More information

Chapter 5: ASICs Vs. PLDs

Chapter 5: ASICs Vs. PLDs Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.

More information

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning By: Roman Lysecky and Frank Vahid Presented By: Anton Kiriwas Disclaimer This specific

More information

Cadence SystemC Design and Verification. NMI FPGA Network Meeting Jan 21, 2015

Cadence SystemC Design and Verification. NMI FPGA Network Meeting Jan 21, 2015 Cadence SystemC Design and Verification NMI FPGA Network Meeting Jan 21, 2015 The High Level Synthesis Opportunity Raising Abstraction Improves Design & Verification Optimizes Power, Area and Timing for

More information

Hardware/Software Co-design

Hardware/Software Co-design Hardware/Software Co-design Zebo Peng, Department of Computer and Information Science (IDA) Linköping University Course page: http://www.ida.liu.se/~petel/codesign/ 1 of 52 Lecture 1/2: Outline : an Introduction

More information

Design of a System-on-Chip Switched Network and its Design Support Λ

Design of a System-on-Chip Switched Network and its Design Support Λ Design of a System-on-Chip Switched Network and its Design Support Λ Daniel Wiklund y, Dake Liu Dept. of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract As the degree of

More information

NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES

NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES Design: Part 1 High Level Synthesis (Xilinx Vivado HLS) Part 2 SDSoC (Xilinx, HLS + ARM) Part 3 OpenCL (Altera OpenCL SDK) Verification:

More information

AMBA Protocol for ALU

AMBA Protocol for ALU International Journal of Emerging Engineering Research and Technology Volume 2, Issue 5, August 2014, PP 51-59 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) AMBA Protocol for ALU K Swetha Student, Dept

More information

Digital Integrated Circuits

Digital Integrated Circuits Digital Integrated Circuits Lecture 9 Jaeyong Chung Robust Systems Laboratory Incheon National University DIGITAL DESIGN FLOW Chung EPC6055 2 FPGA vs. ASIC FPGA (A programmable Logic Device) Faster time-to-market

More information

Embedded Computing Platform. Architecture and Instruction Set

Embedded Computing Platform. Architecture and Instruction Set Embedded Computing Platform Microprocessor: Architecture and Instruction Set Ingo Sander ingo@kth.se Microprocessor A central part of the embedded platform A platform is the basic hardware and software

More information

System Debugging Tools Overview

System Debugging Tools Overview 9 QII53027 Subscribe About Altera System Debugging Tools The Altera system debugging tools help you verify your FPGA designs. As your product requirements continue to increase in complexity, the time you

More information

Keywords- AMBA, AHB, APB, AHB Master, SOC, Split transaction.

Keywords- AMBA, AHB, APB, AHB Master, SOC, Split transaction. Volume 4, Issue 3, March 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Design of an Efficient

More information

Microprocessor Soft-Cores: An Evaluation of Design Methods and Concepts on FPGAs

Microprocessor Soft-Cores: An Evaluation of Design Methods and Concepts on FPGAs Microprocessor Soft-Cores: An Evaluation of Design Methods and Concepts on FPGAs Pieter Anemaet (1159100), Thijs van As (1143840) {P.A.M.Anemaet, T.vanAs}@student.tudelft.nl Computer Architecture (Special

More information

ECE 699: Lecture 12. Introduction to High-Level Synthesis

ECE 699: Lecture 12. Introduction to High-Level Synthesis ECE 699: Lecture 12 Introduction to High-Level Synthesis Required Reading The ZYNQ Book Chapter 14: Spotlight on High-Level Synthesis Chapter 15: Vivado HLS: A Closer Look S. Neuendorffer and F. Martinez-Vallina,

More information

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can

More information

ECE332, Week 2, Lecture 3. September 5, 2007

ECE332, Week 2, Lecture 3. September 5, 2007 ECE332, Week 2, Lecture 3 September 5, 2007 1 Topics Introduction to embedded system Design metrics Definitions of general-purpose, single-purpose, and application-specific processors Introduction to Nios

More information

ECE332, Week 2, Lecture 3

ECE332, Week 2, Lecture 3 ECE332, Week 2, Lecture 3 September 5, 2007 1 Topics Introduction to embedded system Design metrics Definitions of general-purpose, single-purpose, and application-specific processors Introduction to Nios

More information

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Maheswari Murali * and Seetharaman Gopalakrishnan # * Assistant professor, J. J. College of Engineering and Technology,

More information

Multimedia Decoder Using the Nios II Processor

Multimedia Decoder Using the Nios II Processor Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra

More information

ECE 551 System on Chip Design

ECE 551 System on Chip Design ECE 551 System on Chip Design Introducing Bus Communications Garrett S. Rose Fall 2018 Emerging Applications Requirements Data Flow vs. Processing µp µp Mem Bus DRAMC Core 2 Core N Main Bus µp Core 1 SoCs

More information

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Siew-Kei Lam Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore (assklam@ntu.edu.sg)

More information

Advanced ALTERA FPGA Design

Advanced ALTERA FPGA Design Advanced ALTERA FPGA Design Course Description This course focuses on advanced FPGA design topics in Quartus software. The first part covers advanced timing closure problems, analysis and solutions. The

More information

08 - Address Generator Unit (AGU)

08 - Address Generator Unit (AGU) October 2, 2014 Todays lecture Memory subsystem Address Generator Unit (AGU) Schedule change A new lecture has been entered into the schedule (to compensate for the lost lecture last week) Memory subsystem

More information

Five Ways to Build Flexibility into Industrial Applications with FPGAs

Five Ways to Build Flexibility into Industrial Applications with FPGAs GM/M/A\ANNETTE\2015\06\wp-01154- flexible-industrial.docx Five Ways to Build Flexibility into Industrial Applications with FPGAs by Jason Chiang and Stefano Zammattio, Altera Corporation WP-01154-2.0 White

More information

FPGAs: FAST TRACK TO DSP

FPGAs: FAST TRACK TO DSP FPGAs: FAST TRACK TO DSP Revised February 2009 ABSRACT: Given the prevalence of digital signal processing in a variety of industry segments, several implementation solutions are available depending on

More information

OASIS Network-on-Chip Prototyping on FPGA

OASIS Network-on-Chip Prototyping on FPGA Master thesis of the University of Aizu, Feb. 20, 2012 OASIS Network-on-Chip Prototyping on FPGA m5141120, Kenichi Mori Supervised by Prof. Ben Abdallah Abderazek Adaptive Systems Laboratory, Master of

More information

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany 2013 The MathWorks, Inc. 1 Agenda Model-Based Design of embedded Systems Software Implementation

More information

Buses. Maurizio Palesi. Maurizio Palesi 1

Buses. Maurizio Palesi. Maurizio Palesi 1 Buses Maurizio Palesi Maurizio Palesi 1 Introduction Buses are the simplest and most widely used interconnection networks A number of modules is connected via a single shared channel Microcontroller Microcontroller

More information

FPGA memory performance

FPGA memory performance FPGA memory performance Sensor to Image GmbH Lechtorstrasse 20 D 86956 Schongau Website: www.sensor-to-image.de Email: email@sensor-to-image.de Sensor to Image GmbH Company Founded 1989 and privately owned

More information

AN OPEN-SOURCE VHDL IP LIBRARY WITH PLUG&PLAY CONFIGURATION

AN OPEN-SOURCE VHDL IP LIBRARY WITH PLUG&PLAY CONFIGURATION AN OPEN-SOURCE VHDL IP LIBRARY WITH PLUG&PLAY CONFIGURATION Jiri Gaisler Gaisler Research, Första Långgatan 19, 413 27 Göteborg, Sweden Abstract: Key words: An open-source IP library based on the AMBA-2.0

More information

Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path

Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path Michalis D. Galanis, Gregory Dimitroulakos, and Costas E. Goutis VLSI Design Laboratory, Electrical and Computer Engineering

More information

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

Efficient Hardware Acceleration on SoC- FPGA using OpenCL Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA

More information

The SOCks Design Platform. Johannes Grad

The SOCks Design Platform. Johannes Grad The SOCks Design Platform Johannes Grad System-on-Chip (SoC) Design Combines all elements of a computer onto a single chip Microprocessor Memory Address- and Databus Periphery Application specific logic

More information

SEMICON Solutions. Bus Structure. Created by: Duong Dang Date: 20 th Oct,2010

SEMICON Solutions. Bus Structure. Created by: Duong Dang Date: 20 th Oct,2010 SEMICON Solutions Bus Structure Created by: Duong Dang Date: 20 th Oct,2010 Introduction Buses are the simplest and most widely used interconnection networks A number of modules is connected via a single

More information

System-on Solution from Altera and Xilinx

System-on Solution from Altera and Xilinx System-on on-a-programmable-chip Solution from Altera and Xilinx Xun Yang VLSI CAD Lab, Computer Science Department, UCLA FPGAs with Embedded Microprocessors Combination of embedded processors and programmable

More information

Intel Arria 10 FPGA Performance Benchmarking Methodology and Results

Intel Arria 10 FPGA Performance Benchmarking Methodology and Results white paper FPGA Intel Arria 10 FPGA Performance Benchmarking Methodology and Results Intel Arria 10 FPGAs deliver more than a speed grade faster core performance and up to a 20% advantage for publicly

More information

Natalie Enright Jerger, Jason Anderson, University of Toronto November 5, 2010

Natalie Enright Jerger, Jason Anderson, University of Toronto November 5, 2010 Next Generation FPGA Research Natalie Enright Jerger, Jason Anderson, and Ali Sheikholeslami l i University of Toronto November 5, 2010 Outline Part (I): Next Generation FPGA Architectures Asynchronous

More information

Bitwidth-Optimized Hardware Accelerators with Software Fallback

Bitwidth-Optimized Hardware Accelerators with Software Fallback Bitwidth-Optimized Hardware Accelerators with Software Fallback Ana Klimovic and Jason H. Anderson Department of Electrical and Computer Engineering University of Toronto Toronto, Canada Email: ana.klimovic@alum.utoronto.ca,

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik

SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik SoC Design Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik Chapter 5 On-Chip Communication Outline 1. Introduction 2. Shared media 3. Switched media 4. Network on

More information

AS part of my MPhil course Advanced Computer Design, I was required to choose or define a challenging hardware project

AS part of my MPhil course Advanced Computer Design, I was required to choose or define a challenging hardware project OMAR CHOUDARY, ADVANCED COMPUTER DESIGN, APRIL 2010 1 From Verilog to Bluespec: Tales of an AES Implementation for FPGAs Omar Choudary, University of Cambridge Abstract In this paper I present a combined

More information

Energy scalability and the RESUME scalable video codec

Energy scalability and the RESUME scalable video codec Energy scalability and the RESUME scalable video codec Harald Devos, Hendrik Eeckhaut, Mark Christiaens ELIS/PARIS Ghent University pag. 1 Outline Introduction Scalable Video Reconfigurable HW: FPGAs Implementation

More information

Midterm Exam. Solutions

Midterm Exam. Solutions Midterm Exam Solutions Problem 1 List at least 3 advantages of implementing selected portions of a complex design in software Software vs. Hardware Trade-offs Improve Performance Improve Energy Efficiency

More information

Coherent Shared Memories for FPGAs. David Woods

Coherent Shared Memories for FPGAs. David Woods Coherent Shared Memories for FPGAs by David Woods A thesis submitted in conformity with the requirements for the degree of Master of Applied Sciences Graduate Department of Electrical and Computer Engineering

More information

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,

More information

CMPE 415 Programmable Logic Devices Introduction

CMPE 415 Programmable Logic Devices Introduction Department of Computer Science and Electrical Engineering CMPE 415 Programmable Logic Devices Introduction Prof. Ryan Robucci What are FPGAs? Field programmable Gate Array Typically re programmable as

More information

Profiling-Driven Multi-Cycling in FPGA High-Level Synthesis

Profiling-Driven Multi-Cycling in FPGA High-Level Synthesis Profiling-Driven Multi-Cycling in FPGA High-Level Synthesis Stefan Hadjis 1, Andrew Canis 1, Ryoya Sobue 2, Yuko Hara-Azumi 3, Hiroyuki Tomiyama 2, Jason Anderson 1 1 Dept. of Electrical and Computer Engineering,

More information

The Challenges of System Design. Raising Performance and Reducing Power Consumption

The Challenges of System Design. Raising Performance and Reducing Power Consumption The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software

More information

Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors

Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors Paul Ekas, DSP Engineering, Altera Corp. pekas@altera.com, Tel: (408) 544-8388, Fax: (408) 544-6424 Altera Corp., 101

More information

DEVELOPMENT AND VERIFICATION OF AHB2APB BRIDGE PROTOCOL USING UVM TECHNIQUE

DEVELOPMENT AND VERIFICATION OF AHB2APB BRIDGE PROTOCOL USING UVM TECHNIQUE DEVELOPMENT AND VERIFICATION OF AHB2APB BRIDGE PROTOCOL USING UVM TECHNIQUE N.G.N.PRASAD Assistant Professor K.I.E.T College, Korangi Abstract: The AMBA AHB is for high-performance, high clock frequency

More information

High-Performance Linear Algebra Processor using FPGA

High-Performance Linear Algebra Processor using FPGA High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible

More information

SECURE PARTIAL RECONFIGURATION OF FPGAs. Amir S. Zeineddini Kris Gaj

SECURE PARTIAL RECONFIGURATION OF FPGAs. Amir S. Zeineddini Kris Gaj SECURE PARTIAL RECONFIGURATION OF FPGAs Amir S. Zeineddini Kris Gaj Outline FPGAs Security Our scheme Implementation approach Experimental results Conclusions FPGAs SECURITY SRAM FPGA Security Designer/Vendor

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

Evolution of CAD Tools & Verilog HDL Definition

Evolution of CAD Tools & Verilog HDL Definition Evolution of CAD Tools & Verilog HDL Definition K.Sivasankaran Assistant Professor (Senior) VLSI Division School of Electronics Engineering VIT University Outline Evolution of CAD Different CAD Tools for

More information

Midterm Exam. Solutions

Midterm Exam. Solutions Midterm Exam Solutions Problem 1 List at least 3 advantages of implementing selected portions of a design in hardware, and at least 3 advantages of implementing the remaining portions of the design in

More information

The University of Reduced Instruction Set Computer (MARC)

The University of Reduced Instruction Set Computer (MARC) The University of Reduced Instruction Set Computer (MARC) Abstract We present our design of a VHDL-based, RISC processor instantiated on an FPGA for use in undergraduate electrical engineering courses

More information

Source-Level Debugging Framework Design for FPGA High-Level Synthesis

Source-Level Debugging Framework Design for FPGA High-Level Synthesis Source-Level Debugging Framework Design for FPGA High-Level Synthesis by Nazanin Calagar Darounkola A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Department

More information

Contemporary Design. Traditional Hardware Design. Traditional Hardware Design. HDL Based Hardware Design User Inputs. Requirements.

Contemporary Design. Traditional Hardware Design. Traditional Hardware Design. HDL Based Hardware Design User Inputs. Requirements. Contemporary Design We have been talking about design process Let s now take next steps into examining in some detail Increasing complexities of contemporary systems Demand the use of increasingly powerful

More information

EMBEDDED SOPC DESIGN WITH NIOS II PROCESSOR AND VHDL EXAMPLES

EMBEDDED SOPC DESIGN WITH NIOS II PROCESSOR AND VHDL EXAMPLES EMBEDDED SOPC DESIGN WITH NIOS II PROCESSOR AND VHDL EXAMPLES Pong P. Chu Cleveland State University A JOHN WILEY & SONS, INC., PUBLICATION PREFACE An SoC (system on a chip) integrates a processor, memory

More information

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,

More information

Embedded Systems. "System On Programmable Chip" NIOS II Avalon Bus. René Beuchat. Laboratoire d'architecture des Processeurs.

Embedded Systems. System On Programmable Chip NIOS II Avalon Bus. René Beuchat. Laboratoire d'architecture des Processeurs. Embedded Systems "System On Programmable Chip" NIOS II Avalon Bus René Beuchat Laboratoire d'architecture des Processeurs rene.beuchat@epfl.ch 3 Embedded system on Altera FPGA Goal : To understand the

More information

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments 8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments QII51017-9.0.0 Introduction The Quartus II incremental compilation feature allows you to partition a design, compile partitions

More information

ARM ARCHITECTURE. Contents at a glance:

ARM ARCHITECTURE. Contents at a glance: UNIT-III ARM ARCHITECTURE Contents at a glance: RISC Design Philosophy ARM Design Philosophy Registers Current Program Status Register(CPSR) Instruction Pipeline Interrupts and Vector Table Architecture

More information

AXI4 Interconnect Paves the Way to Plug-and-Play IP

AXI4 Interconnect Paves the Way to Plug-and-Play IP White Paper: Virtex-6 and Spartan-6 FPGAs WP379 (v1.0) October 5, 2010 4 Interconnect Paves the Way to Plug-and-Play IP By: Navanee Sundaramoorthy, Navneet Rao, and Tom Hill In the past decade, the size

More information