LegUp User Manual
Release 7.0
LegUp Computing Inc.
April 01, 2019


CONTENTS

1 Getting Started
   Installation
      Windows
      Linux
      ModelSim and Vendor CAD tools
   Quick Start Tutorial
   Using Example Projects

2 User Guide
   Introduction to High-Level Synthesis
      Instruction-level Parallelism
      Loop-level Parallelism
      Thread-level Parallelism
      Data Flow (Streaming) Parallelism
   LegUp Overview
      Hardware Flow
      Processor-Accelerator Hybrid Flow
      LLVM IR Input Flow
   Verification and Simulation
      SW/HW Co-Simulation
   LegUp Constraints
   Loop Pipelining
      A Simple Loop Pipelining Example
   Multi-threading with Pthreads
   Loop Multi-threading with OpenMP
   Supported Pthread/OpenMP APIs
   Data Flow Parallelism with Pthreads
      Further Throughput Enhancement with Loop Pipelining
   Function Pipelining
   LegUp C/C++ Library
      Streaming Library
      Arbitrary Bit-width Data Type Library
      Bit-level Operation Library
      AXI Slave Interface (Beta)
   Software Profiler
   Using Pre-existing Hardware Modules
      Custom Verilog Module Interface
   Specifying a Custom Top-level Function
   Specifying a Custom Test Bench
   LegUp Command Line Interface

3 Optimization Guide
   Loop Pipelining
   Loop Unrolling
   Function Pipelining
   Structs
   Inferring a Shift Register
   Inferring a Line Buffer
   Inferring Streaming Hardware via Producer-Consumer Pattern with Pthreads

4 Hardware Architecture
   Circuit Topology
   Threaded Hardware Modules
   Memory Architecture
      Local Memory
      Shared-Local Memory
      Aliased Memory
      Processor Memory
      Memory Storage
   Interfaces
      AXI Slave Interface (Beta)
      Hardware Accelerator Interface in Hybrid Flow
   ARM Processor-Accelerator Hybrid Architecture

5 Restrictions
   Hybrid Flow
   Struct Support
   Floating-point Support
   Function Pipelining
   LegUp C/C++ Library
      Bit-level Operation Library
      Streaming Library
   Using Pre-existing Hardware Modules
   Achronix Support (Beta)

6 Frequently Asked Questions

7 Constraints Manual
   7.1 Commonly Used Constraints
      CLOCK_PERIOD
      ENABLE_OPENMP
      loop_pipeline
      function_pipeline
      set_custom_top_level_module
      set_custom_test_bench_module
      set_custom_test_bench_file
      set_top_level_argument_num_elements
      set_resource_constraint
      set_operation_latency
      inline_function
      noinline_function
      flatten_function
      preserve_kernel
      MB_MINIMIZE_HW
      REPLICATE_ROMS
      set_synthesis_top_module

   7.2 Debugging Constraints
      KEEP_SIGNALS_WITH_NO_FANOUT
      VSIM_ASSERT
   7.3 Advanced Constraints
      CASE_FSM
      GROUP_RAMS
      GROUP_RAMS_SIMPLE_OFFSET
      LOCAL_RAMS
      NO_INLINE
      NO_OPT
      UNROLL
      set_accelerator_function
      set_argument_interface_type
      set_combine_basicblock
      set_project
      USE_MEM_FILE

LegUp automatically compiles a C/C++ program into hardware described in Verilog HDL (Hardware Description Language). The generated hardware can be programmed onto an FPGA (Field-Programmable Gate Array) from any FPGA vendor (Intel, Xilinx, Microsemi, Lattice, and Achronix). Hardware implemented on an FPGA can provide massive performance and power benefits over the same computation running on regular processors.

The documentation comprises the following sections:

Getting Started: Installation and a quick start guide
User Guide: How to use LegUp to generate hardware
Optimization Guide: How to optimize the generated hardware
Hardware Architecture: Synthesized hardware architecture
Constraints Manual: Constraints manual
Frequently Asked Questions: Frequently asked questions

For example applications using LegUp, please check out our github at:

For general inquiries, please contact info@legupcomputing.com. For technical questions and issues, please contact support@legupcomputing.com. To purchase a full license of LegUp, please contact sales@legupcomputing.com.


CHAPTER ONE

GETTING STARTED

First, download the LegUp installer from our website. We support both Windows (64-bit) and Linux (64-bit). We recommend Windows 10 or Ubuntu.

Installation

Windows

Run the Windows installer program, LegUp-7.0-Win64-Setup.exe. The first time you simulate with ModelSim, it will take a few minutes as we compile our hardware simulation libraries with ModelSim.

Linux

CentOS

The CentOS installer is packaged as a self-extracting shell script that can be installed to a local directory. To run the installer:

cd $(INSTALLATION_DIRECTORY)
sh /path/to/legup-7.0-centos-installer.run

This will install LegUp into the INSTALLATION_DIRECTORY. After the installation is completed, please add the LegUp bin directory to your path:

export PATH=$(INSTALLATION_DIRECTORY)/legup-7.0/legup/bin:$PATH

Ubuntu

Download the Linux installer file and double-click on it. When the installer opens up, click Install. You may also install it on the command line:

sudo dpkg -i legup-7.0-linux_distribution_version.deb
sudo apt-get install -f

If you receive an error about missing the package libc6-dev:i386, you should run the following to enable 32-bit packages:

sudo dpkg --add-architecture i386
sudo apt-get update

These 32-bit packages can be installed without issue on a 64-bit system. Once LegUp is installed, to enable simulation with ModelSim you will need to compile our hardware libraries:

cd /usr/lib/legup-7.0/legup/ip/libs
sudo make LEGUP_TOOL_PATH=$(PATH_TO_MODELSIM)/

Please note that a slash is required at the end of the path to the ModelSim binary directory.

ModelSim and Vendor CAD tools

You will need ModelSim to simulate the Verilog and vendor CAD tools to synthesize the generated hardware for each vendor's FPGA. For Intel, you can download Quartus Web Edition and ModelSim for free from Altera. LegUp HLS has been tested on Quartus. For Xilinx, you will need Vivado. For Microsemi, you can download Libero SoC. For Lattice, you will need Diamond. For Achronix, you will need to contact them to download their ACE Software.

Note: On some versions of Linux you need to edit line ~210 of /opt/altera/16.0/modelsim_ase/vco and change linux_rh60 to linux.

Quick Start Tutorial

To get started with LegUp HLS, we will create a pipelined FIR filter from C. First launch LegUp from the start menu in Windows or by running legup_hls from a terminal in Linux. Once the IDE opens, select a workspace. The workspace is where your projects and related preferences will be saved. You may click on Use this as the default and do not ask again if you do not want to see the workspace prompt every time you start LegUp HLS. Now to create a project, click File -> New -> LegUp C project in the top menu. You may also click on the drop-down menu icon next to the New icon on the top-left corner, then click on LegUp C Project.

Once the New Project window opens up, select the project name, and under Select Project Type:, choose the second option, Example LegUp Project: Filter (Function Pipelining). You may leave the Location as default if you would like to save this project in your default workspace. Click Next, and you will be asked to choose a target FPGA vendor and device. For this first example, we will use Intel's Arria 10 device. Then click on Finish.

Take a look at the source file fir.cpp, which should automatically open up. It contains four functions: FIRFilterStreaming, test_input_injector, test_output_checker, and main. These functions are also shown in the Outline pane on the right-hand side.

Looking at the main function, it first creates two FIFOs, input_fifo and output_fifo. Then, test_input_injector, FIRFilterStreaming, and test_output_checker are called in a loop. The test_input_injector function creates input values and writes them to the input_fifo. FIRFilterStreaming reads from the input_fifo, computes the filter output, and writes it to the output_fifo. Finally, test_output_checker reads from the output_fifo and verifies, at the end of the computation, that the final value is correct.

Let's first run this code in software. Click on the Compile Software icon in the middle of the toolbar at the top. You may also click File -> Compile Software in the top menu.

You should see in the Console window at the bottom of the screen that the software compiled successfully. Now, click on the Run Software icon next to the compile icon. You should see in the Console window that the total matches the expected value and hence, the result is a PASS.

You may also want to debug your software, and LegUp HLS has an integrated debugger (gdb), so that you can debug your code directly within the IDE. Click on the Debug Software icon (next to the Run Software icon). A prompt will appear, asking if you would like to switch to the Debug perspective. Click on Yes and you should see that the view changes. Try clicking on the Step Over icon on the top-left of the toolbar. As you step over the function calls of the loop in the main function, you should see the outputs produced by the software execution appear in the Console window. You may click on the Resume icon to the left to let the debugger finish, or stop the debugger by clicking on the Terminate icon. Switch back to the previous perspective by clicking on the C/C++ icon on the top-right corner of the toolbar.

The steps above have illustrated how to compile, debug, and run software using LegUp HLS. We now turn our attention to synthesis of hardware from software. That is, we can now synthesize the same C code into hardware. Before we execute the hardware synthesis, we can apply some constraints to LegUp HLS that influence the synthesis process and the hardware modules that will be generated. Click on the HLS Constraints icon (to the right of the Debug Software icon). You may also click LegUp HLS -> HLS Constraints on the top menu.

You should see that there are some pre-set constraints for the FIR filter example project already. The constraints specify the FIRFilterStreaming function to be pipelined and to be set as the top-level function. When a function is designated to be pipelined, it implies that the function's hardware implementation will be able to receive new inputs at regular intervals, ideally every clock cycle. The target clock period is set to 10 ns, and the test bench module and file are set to streaming_tb and streaming_tb.v. Click OK to exit the constraint window.

Let's perform the hardware synthesis now. Click on the Compile Software to Hardware icon (to the left of the HLS Constraints icon). In the Console window, you should see that the hardware has been generated properly, and a summary.legup.rpt file should automatically open up in the right panel of the IDE. You have just compiled your first C program into hardware with LegUp HLS!

Let's examine the report file, which provides an overview of several aspects of the generated hardware. Looking at the Scheduling Result, it shows that the Cycle Latency is 4, meaning that it takes 4 cycles for an output to be produced from start to finish. Looking at the Pipeline Result, it also shows that the Initiation Interval (II) is 1, meaning that the hardware is completely pipelined, so that new inputs can be injected into the hardware every clock cycle, and that a new output is produced every cycle. This implies that computations are overlapped (parallelized): as each new input is received, previous inputs are still being processed. An II of 1 produces the highest throughput for a pipelined circuit. The Memory Usage section shows that there are a number of registers (which form a shift register), as well as the input/output FIFO queues.

Let's now simulate this circuit's Verilog RTL implementation. First, make sure you set the path to ModelSim by going to LegUp HLS -> Tool Path Settings in the top menu. While you are there, also set the path to Intel Quartus, which is needed to synthesize the hardware later. Once they are set, click on the Simulate Hardware icon (to the right of the Compile Software to Hardware icon). You should see that ModelSim is running in the Console window, and that it prints out a PASS.

Having simulated the Verilog and demonstrated functional correctness, let's synthesize the circuit to get its maximum clock frequency (Fmax) and area when implemented on the Intel Arria 10 FPGA. Click on the Synthesize Hardware to FPGA icon (to the right of the Simulate Hardware icon).

This launches Intel Quartus in the background, and you can see the output in the Console window. When it finishes running, you will see that the summary.results.rpt file is automatically updated with new information which is extracted from the Intel synthesis reports. The Fmax of this circuit is MHz, and it consumes 393 ALMs and 682 registers on the Arria 10 FPGA.

You have learned how to compile, debug, and execute your software, then compile the software to hardware, simulate the hardware, and synthesize the hardware to an FPGA. Great work!

Using Example Projects

We provide a number of other examples to demonstrate LegUp's features and optimization techniques. Later sections in the user guide reference these examples, and we describe here how you can use them. The examples are contained in LEGUP_INSTALLATION_DIRECTORY/legup/examples/user_guide_examples. For Linux, the LEGUP_INSTALLATION_DIRECTORY is /usr/lib/legup-7.0. For Windows, it is the location you specified in the installer during installation.

For each example, you will need to create a new LegUp project. As described above, click on File -> New -> LegUp C project to create a LegUp project. Input a project name and choose New LegUp Project as the project type, then click on Next.

Now you can import our example source files into the project. The table below lists the source files for each example.

Example Name                     | C source file
---------------------------------|----------------------------------------
FIR_function_pipelining_wrapper  | fir.c
loop_pipelining_simple           | loop_pipelining_simple.c
multi_threading_simple           | multi_threading_simple.c, array_init.h
openmp_reduction                 | openmp_reduction.c, array_init.h

Import the appropriate source files for each project and click Next. Some examples may also come with a custom test bench file. When there is a custom test bench, you also need to specify the name of the test bench module as well as the top-level function of the C program. The table below lists the information about test benches and top-level functions for our example projects.

Example Name                     | Top-Level Function | Test Bench Module | Test Bench Files
---------------------------------|--------------------|-------------------|-----------------
FIR_function_pipelining_wrapper  | pipeline_wrapper   | streaming_tb      | streaming_tb.v

Specify the Top-Level Function and Test Bench Module and include the Test Bench File as shown in the table, then click Next. If your example project is not specified in the table above, you may simply click Next.

Now you can choose the target FPGA vendor and FPGA device. For these examples, let's choose Intel's Arria 10 FPGA as the target. When you click Finish, your new example project will be created.

For some examples, you will need to set some constraints. To set new constraints, click on LegUp HLS -> HLS Constraints on the top menu to open the constraints window. For the examples shown below, please make sure that you have the same constraints set. If your example is not shown below, you do not have to set any constraints at the moment. To set a new constraint, select the appropriate constraint from the Constraint Type dropdown menu, enter its corresponding Constraint Value, then click Add.

FIR_function_pipelining_wrapper:

openmp_reduction:


CHAPTER TWO

USER GUIDE

Introduction to High-Level Synthesis

High-level synthesis (HLS) refers to the synthesis of a hardware circuit from a software program specified in a high-level language, where the hardware circuit performs the same functionality as the software program. For LegUp, the input is a C-language program, and the output is a circuit specification in the Verilog hardware description language. The LegUp-generated Verilog can be input to RTL synthesis and place-and-route tools to produce an FPGA or ASIC implementation of the circuit. The underlying motivation for HLS is to raise the level of abstraction for hardware design, by allowing software methodologies to be used to design hardware. This eases the effort required for hardware design, improving productivity and reducing time-to-market.

While a detailed knowledge of HLS is not required for the use of LegUp, it is nevertheless worthwhile to highlight the key steps involved in converting software to hardware:

Allocation: The allocation step defines the constraints on the generated hardware, including the number of hardware resources of a given type that may be used (e.g. how many divider units may be used, the number of RAM ports, etc.), as well as the target clock period for the hardware, and other user-supplied constraints.

Scheduling: Software programs are written without any notion of a clock or finite state machine (FSM). The scheduling step of HLS bridges this gap by assigning the computations in the software to happen in specific clock cycles in the hardware. Putting this another way, the scheduling step defines the FSM for the hardware and associates the computations of the software with specific states in the FSM. This must be done in a manner that honors the data dependencies in the software to ensure that, for example, the inputs to a computation are available/ready before the computation executes. Likewise, given a user-supplied clock period constraint, e.g. 10 ns, the scheduling step must ensure that the computations in a single FSM state do not require time exceeding the target period.

Binding: While a software program may contain an arbitrary number of operations of a given type (e.g. multiplications), the hardware may contain only a limited number of units capable of performing such a computation. The binding step of HLS is to associate (bind) each computation in the software with a specific unit in the hardware.

Verilog generation: After the steps above, an FSM and datapath have been defined. The final step of HLS is to generate a description of the circuit in a hardware description language (Verilog).

Executing computations in hardware brings speed and energy advantages over performing the same computations in software running on a processor. The underlying reason for this is that the hardware is dedicated to the computational work being performed, whereas a processor is generic and has the consequent overheads of fetching and decoding instructions, loads/stores to memory, and so on. Drastic speed advantages are possible by using hardware if one is able to exploit hardware parallelism, namely, the execution of computations concurrently that otherwise, on a processor, would need to happen sequentially. With LegUp, one can exploit four styles of hardware parallelism.

Instruction-level Parallelism

Instruction-level parallelism refers to the ability to execute computations concurrently based on an analysis of data dependencies. Computations that do not depend on each other can be executed at the same time. Consider the following C-code snippet, which performs three addition operations:

z = a + b;
x = c + d;
q = z + x;
...

Observe that the first and second additions do not depend on one another. These additions can therefore be executed concurrently, as long as there are two adder units available in the hardware. LegUp automatically analyzes the dependencies between computations in the software to discover such opportunities and maximize concurrency in the execution of computations in hardware. In the above example, the third addition operation depends on the results of the first two, and hence, its execution cannot be overlapped with the others. Instruction-level parallelism is referred to as fine-grained parallelism, as the number of computations that are overlapped with one another in time is usually small.

Loop-level Parallelism

Software programs typically spend much of their time executing loops. From a software programmer's perspective, the execution of loop iterations is sequential and non-overlapping. That is, loop iteration i executes to completion before iteration i + 1 commences. Now, imagine a loop with N iterations and whose loop body takes 3 clock cycles to complete. Under typical sequential software execution, the loop would take 3N clock cycles to execute. With LegUp, however, it is possible to overlap the loop iterations with one another using a technique called loop pipelining. The idea is to execute a portion of a loop iteration i and then commence executing iteration i + 1 even before iteration i is complete. Returning to the previous example, if loop pipelining were able to commence a new loop iteration every clock cycle, then the total number of cycles to execute the loop would be approximately N, a significant reduction relative to 3N. For instance, with N = 100 and a 3-cycle loop body, the pipelined loop would finish in roughly 102 cycles (100 iterations plus 2 cycles to drain the pipeline) instead of 300. As will be elaborated upon below, LegUp performs loop pipelining on the basis of user constraints. By default, it is not applied automatically.

Thread-level Parallelism

Today's CPUs contain multiple processor cores and, with multi-threaded software, it is possible to parallelize an application to use multiple cores at once, thereby improving run-time. POSIX threads (Pthreads) is the most popular approach for multi-threaded programming in the C language, where parallelism is realized at the granularity of entire C functions. As such, thread-level parallelism is referred to as coarse-grained parallelism, since a large amount of computational work is overlapped in time. LegUp supports the synthesis of multi-threaded C programs into hardware, where concurrently executing software threads are synthesized into concurrently executing hardware units. This allows a software developer to take advantage of spatial parallelism in hardware using a programming style they are likely already familiar with. Moreover, the design and debugging of a parallel hardware implementation can happen in software, where it is considerably easier to identify and resolve bugs.

There are two challenging aspects to multi-threaded software programming: 1) synchronization among threads, and 2) interactions among threads through shared memory.
In the former case, because threads execute asynchronously within an operating system, it is often desirable to ensure multiple threads have reached a common point in the program's execution before allowing any to advance further. This is typically realized using a barrier. LegUp supports the synthesis of barriers into hardware, where the generated hardware's execution matches the software behaviour.

Regarding interactions through shared memory, it is typical that one wishes to restrict the ability of multiple threads to access a given memory location at the same time. This is done using critical sections, which are code segments wherein at most one thread can execute at a given time. Critical sections are specified in software using mutexes, and LegUp supports the synthesis of mutexes into hardware.

Data Flow (Streaming) Parallelism

The second style of coarse-grained parallelism is referred to as data flow parallelism. This form of parallelism arises frequently in streaming applications, which constitute the majority of video and audio processing, but also applications in machine learning and computational finance. In such applications, there is a stream of input data that is fed into the application at regular intervals. For example, in an audio processing application, a digital audio sample may be input to the circuit every clock cycle. In streaming applications, a succession of computational tasks is executed on the stream of input data, producing a stream of output data. For example, in audio processing, a first task may be to filter the input audio to remove high-frequency components. Subsequently, a second task may receive the filtered audio and boost the bass low-frequency components. Observe that, in such a scenario, the two tasks may be overlapped with one another. Input samples are received by the first task and second task on a continuous basis.

LegUp provides a unique way for a software developer to specify data flow parallelism, namely, through the use of Pthreads and the use of a special API to interconnect the computational tasks. As will be elaborated upon below, to specify data flow hardware, one creates a thread for each computational task, and then interconnects the tasks using FIFO buffers via a LegUp-provided API. In hardware, each thread is synthesized into a hardware module that receives/produces inputs/outputs at regular intervals. The FIFO buffers between the modules hold results produced by one module that are waiting to be picked up by a downstream module.

LegUp Overview

LegUp accepts a C software program as input and automatically generates hardware described in Verilog HDL (hardware description language) that can be synthesized onto an FPGA. LegUp has two different synthesis flows:

Hardware Flow: Synthesizes the whole C program (or user-specified C functions) into a hardware circuit.

Processor-Accelerator Hybrid Flow: Synthesizes the whole C program into a processor-accelerator SoC (System-on-Chip).

Hardware Flow

LegUp's hardware flow synthesizes a C program into a hardware circuit (without a processor). One can compile the entire C program, or specific functions of the program, into hardware. To specify certain C functions to be compiled to hardware in the hardware flow, you need to specify a top-level function (see Specifying a Custom Top-level Function); LegUp then compiles the top-level function and all of its descendant functions to hardware. By default, the main function is the top-level function, hence the entire program is compiled to hardware. To run the hardware flow, click the following button:

You can also run this flow on the command line by running legup hw.

Processor-Accelerator Hybrid Flow

LegUp can automatically compile a C program into a complete processor-accelerator SoC comprising an ARM processor and one or more hardware accelerators. One can designate C functions to be compiled to hardware accelerators, and LegUp will automatically partition the program so that the remaining program segments are executed on the ARM processor. The communication between the processor and hardware accelerators, as well as the generation of the system interconnect, are automatically handled by LegUp. To accelerate a function, you can specify the function name in the config.tcl file as shown below:

set_accelerator_function "function_name"

Then run legup hybrid on the command line, which generates the complete system. The architecture of a processor-accelerator hybrid system is described in Hardware Architecture.

Note: The hybrid flow is currently only available for Intel FPGAs in the fully licensed version of LegUp. It is not available with an evaluation license.

LLVM IR Input Flow

By default, LegUp accepts C/C++ as input. However, some advanced users may wish to input LLVM intermediate representation directly into LegUp. LegUp's LLVM IR flow synthesizes LLVM intermediate representation code into a hardware circuit. You can compile an LLVM IR into a hardware circuit by specifying an LLVM IR file (either a .ll or .bc file) using the INPUT_BITCODE variable inside the makefile of your project:

INPUT_BITCODE=input.ll

Verification and Simulation

The output of the circuit generated by LegUp should always match the output of the C program for the same set of inputs. Users should not modify the Verilog generated by LegUp, as it is overwritten when LegUp runs. For debugging purposes, LegUp converts any C printf statements into Verilog $write statements so that values printed during software execution will also be printed during hardware simulation. This allows easy verification of the correctness of the hardware circuit. Verilog $write statements are unsynthesizable and will not affect the final FPGA hardware.

SW/HW Co-Simulation

SW/HW co-simulation can be used when the user specifies a custom top-level function (Specifying a Custom Top-level Function). This avoids having to manually write a custom RTL testbench (Specifying a Custom Test Bench). To use SW/HW co-simulation, the input software program will be composed of two parts:

- A top-level function (which may call other sub-functions) to be synthesized to hardware by LegUp,
- A C/C++ test bench (typically implemented in main()) that invokes the top-level function with test inputs and verifies outputs.

SW/HW co-simulation consists of the following automated steps:

1. LegUp runs your software program and saves all the inputs passed to the top-level function.
2. LegUp automatically creates an RTL testbench that reads in the inputs from step 1 and passes them into the LegUp-generated hardware module.
3. ModelSim simulates the testbench and saves the LegUp-generated module outputs.
4. LegUp runs your software program again, but uses the simulation outputs as the output of your top-level function.

We assume that if the return value of main() is 0 then all tests pass, so you should set a non-zero return value from main() when the top-level function produces incorrect outputs. In step 1, we verify that the program returns 0. In step 4, we run the program using the outputs from simulation; if the LegUp-generated circuit matches the C program, then main() should still return 0. If the C program matches the RTL simulation, then you should see:

C/RTL co-simulation: PASS

Note that you must pass in as arguments into the top-level function any values that are shared between software and hardware. For example, if there are arrays that are initialized in the C testbench and used as inputs to the hardware function, they need to be passed in as arguments into the top-level function, even if the arrays are declared as global variables. In essence, by passing an argument into the top-level function, you are creating an interface for the hardware core generated by LegUp. Arguments into the top-level function can be constants, pointers, arrays, and FIFO data types. FIFO arguments use our C++ FIFO library (legup::fifo<int>), which also supports FIFOs of structs (legup::fifo<struct>). The top-level function can also have a return value. Please see the provided example, C++ Canny Edge Detection, as a reference. A sketch of such a top-level interface is shown below.

If a top-level argument is a dynamically allocated array (with malloc), the number of elements of the array must be specified with the Set top-level argument number of elements constraint. Please see the set_top_level_argument_num_elements page for more details. Statically allocated arrays do not need to be specified.

Limitations: For FIFOs of structs, there cannot be nested structs. A struct cannot contain another struct inside.

Possible reasons for SW/HW co-simulation failure: Assigning a value that overflows an arbitrary bit-width type (e.g., assigning 40 to an int2).
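For illustration, here is a minimal sketch of such a top-level interface; the function name filter_top, the coefficient array, and the computation are our own invention rather than a shipped example:

#include "legup/streaming.hpp"

#define NUM_TAPS 4

// Hypothetical top-level function: every value shared between the C test
// bench and the hardware is passed in as an argument, so each argument
// becomes a port on the generated hardware core.
void filter_top(legup::fifo<int> &input_fifo,  // streamed input samples
                legup::fifo<int> &output_fifo, // streamed results
                int coeffs[NUM_TAPS]) {        // array initialized by the test bench
    int sum = 0;
    for (int i = 0; i < NUM_TAPS; i++)
        sum += coeffs[i] * input_fifo.read();
    output_fifo.write(sum);
}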

LegUp Constraints

The input language to LegUp is C. LegUp does not require tool-specific pragmas or special keywords. Instead, LegUp accepts user-specified constraints that guide the generated hardware. Each project specifies its constraints in the config.tcl file in the project directory. This file is automatically generated by the LegUp IDE. To modify the constraints, click the HLS Constraints button:

The following window will open:

You can add, edit, or remove constraints from this window. Select a constraint type from the first dropdown menu. If you want more information about a constraint, click the Help button, which will open the corresponding Constraints Manual page.

An important constraint is the target clock period (shown as Set target clock period in the dropdown menu). With this constraint, LegUp schedules the operations of a program to meet the specified clock period. When this constraint is not given, LegUp uses the default clock period for each device, as shown below.

FPGA Vendor | Device             | Default Clock Frequency (MHz) | Default Clock Period (ns)
------------|--------------------|-------------------------------|--------------------------
Intel       | Arria              |                               |
Intel       | Arria V            |                               |
Intel       | Stratix V          |                               |
Intel       | Stratix IV         |                               |
Intel       | Cyclone V          |                               |
Intel       | Cyclone IV         |                               |
Xilinx      | Virtex UltraScale  |                               |
Xilinx      | Virtex             |                               |
Xilinx      | Virtex             |                               |
Lattice     | ECP                |                               |
Microsemi   | PolarFire          |                               |
Microsemi   | Fusion             |                               |
Achronix    | Speedster          |                               |

Details of all LegUp constraints are given in the Constraints Manual. A sketch of what a config.tcl might contain is shown below.
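The constraint names below are taken from the Constraints Manual chapter of this manual; the set_parameter form and the specific values are assumptions on our part, and the file the IDE writes out may differ:

# Hypothetical config.tcl sketch; values are illustrative.
set_parameter CLOCK_PERIOD 10    ;# target clock period of 10 ns
loop_pipeline "my_loop_label"    ;# pipeline the loop labelled my_loop_label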

Loop Pipelining

Loop pipelining is an optimization that can automatically extract loop-level parallelism to create an efficient hardware pipeline. It allows executing multiple loop iterations concurrently on the same pipelined hardware. To use loop pipelining, a label needs to be added to the loop:

my_loop_label:
for (i = 1; i < N; i++) {
    a[i] = a[i-1] + 2;
}

Then add a loop pipeline constraint in the HLS Constraints window:

An important concept in loop pipelining is the initiation interval (II), which is the cycle interval between starting successive iterations of the loop. The best performance and hardware utilization is achieved when II=1, which means that successive iterations of the loop can begin every clock cycle. A pipelined loop with an II=2 means that successive iterations of the loop can begin every two clock cycles, corresponding to half of the throughput of an II=1 loop.

By default, LegUp always attempts to create a pipeline with an II=1. However, this is not possible in some cases due to resource constraints or cross-iteration dependencies. This is described in more detail in the Optimization Guide. When II=1 cannot be met, LegUp's pipeline scheduling algorithm will try to find the smallest possible II that satisfies the constraints and dependencies.

A Simple Loop Pipelining Example

We will use a simple example to demonstrate loop pipelining. First import the provided example project, loop_pipelining_simple, contained within the LegUp installation directory (please refer to Using Example Projects for importing this example project into the LegUp IDE). Let's first run the C program in software by clicking the Compile Software and Run Software buttons.

#include <stdio.h>
#define N 4

// For the purpose of this example, we mark input arrays as volatile to
// prevent optimization based on input values. In a real program, the volatile keyword
// should only be used when absolutely necessary, as it prevents many optimizations.
volatile int a[N] = {1, 2, 3, 4};
volatile int b[N] = {5, 6, 7, 8};
int c[N] = {0};

// Simple loop with an array

int main() {
    int i = 0, sum = 0;

#pragma unroll 1 // Prevents the loop below from being unrolled.
    // The loop label is used for setting the pipelining constraint.
    my_loop_label:
    for (i = 0; i < N; i++) {
        printf("Loop body\n");
        printf("a[%d] = %d\n", i, a[i]);
        printf("b[%d] = %d\n", i, b[i]);
        c[i] = a[i] * b[i];
        printf("c[%d] = %d\n", i, c[i]);
        sum += c[i];
    }

    if (sum == 70)
        printf("PASS\n");
    else
        printf("FAIL\n");

    return sum;
}

Note: LegUp automatically unrolls loops with small trip counts. However, for a loop to be pipelined, it cannot be unrolled. Hence, for this example, we have added #pragma unroll 1 to prevent the loop from being unrolled.

When the program is executed, you can see in the console window that the array elements of a, b, and c are printed in order. Now click Compile Software to Hardware and Simulate Hardware. You can see that the simulation output matches the output from software execution, and the reported cycle latency is 24. Note that we have not yet pipelined the loop and it is still executed in sequential order.

To pipeline the loop, open up the HLS Constraints window and add the Pipeline loop constraint with the value set to the loop label, my_loop_label. Re-run Compile Software to Hardware and you should see that LegUp reports Pipeline Initiation Interval (II) = 1. When you click on Simulate Hardware, you should see that the cycle latency is now reduced to 12 clock cycles, and the arrays are no longer accessed in order. For instance, a[1] is printed out before c[0]. This is because the second iteration (that prints a[1]) now runs in parallel with the first iteration (that prints c[0]). The second iteration's load operations for a[1] and b[1] are now happening at the same time as the first iteration's multiply that computes c[0].

# Loop body
# Loop body
# a[ 0] = 1
# b[ 0] = 5
# Loop body
# a[ 1] = 2
# b[ 1] = 6
# c[ 0] = 5
# Loop body
# a[ 2] = 3
# b[ 2] = 7
# c[ 1] = 12
# a[ 3] = 4
# b[ 3] = 8
# c[ 2] = 21
# c[ 3] = 32
# PASS

# At t= clk=1 finish=1 return_val= 70
# Cycles: 12

To get more information about the schedule of the pipelined loop body, click Launch Schedule Viewer to open up LegUp's schedule viewer:

You will first see the call graph of the program: the main function calling the printf function. Double-click the main function's circle to see the control-flow graph of the function. As shown below, you should find a circle named BB_1 with a back edge (an arrow pointing to itself). This means that the block is a loop body, and in this case it corresponds to the pipelined loop. Double-clicking on BB_1 will take you to the Pipeline Viewer, which illustrates the pipeline schedule:

The top two rows show the time steps (in terms of clock cycles) and the number of pipeline stages. For instance, the loop in the example has 5 pipeline stages and each iteration takes 5 cycles to complete. Each cell in the table shows all operations that are scheduled for each time step of a loop iteration (hover the mouse over a cell to see the complete statement of an operation). When looking at the table horizontally, it shows all operations that are to occur (over the span of 5 clock cycles) for each loop iteration. When looking at the table vertically, it shows all concurrent operations for each time step. For example, the first iteration's multiply operation (%10 = mul nsw i32 %9, %8, shown in the Cycle 1 column of the Iteration: 0 row) is scheduled in cycle 1. Meanwhile, also in cycle 1, the second iteration's load operations are scheduled. In cycle 4 (circled by a bold box), the pipeline reaches steady state. This is when the pipelined hardware reaches maximum utilization and is concurrently executing 5 loop iterations (since the loop is divided into 5 time steps, the pipelined hardware can execute 5 different loop iterations at the same time).

Multi-threading with Pthreads

In an FPGA hardware system, the same module can be instantiated multiple times to exploit spatial parallelism, where all module instances execute in parallel to achieve higher throughput. LegUp allows easily inferring such parallelism with the use of POSIX Threads (Pthreads), a standard multi-threaded programming paradigm that is commonly used in software. Parallelism described in software with Pthreads is automatically compiled to parallel hardware with LegUp. Each thread in software becomes an independent module that concurrently executes in hardware.

For example, the code snippet below creates N threads running the Foo function in software. LegUp will correspondingly create N hardware instances all implementing the Foo function, and parallelize their executions.

LegUp also supports mutex and barrier APIs (from the Pthreads library) so that synchronization between threads can be specified using locks and barriers.

void* Foo (void* arg);

for (i = 0; i < N; i++) {
    pthread_create(&threads[i], NULL, Foo, &args[i]);
}

To see a complete multi-threading example, please refer to the example project, multi_threading_simple, contained within the LegUp installation directory (see Using Example Projects for importing this example project into the LegUp IDE). The example also demonstrates the use of a mutex lock to protect a critical region. LegUp supports the most commonly used Pthread APIs, which are listed in Supported Pthread/OpenMP APIs.

Loop Multi-threading with OpenMP

LegUp also supports the use of OpenMP, which allows parallelizing loops in a multi-threaded fashion (as opposed to loop pipelining, which parallelizes loop iterations in a pipelined fashion). OpenMP provides a simple and high-level approach for parallelization. With a single pragma, a user is able to parallelize loops without performing complicated code changes. For example, in the code snippet below, the loop performs an element-wise multiplication of two arrays, A_array and B_array. To parallelize this loop using OpenMP, one simply puts an OpenMP pragma before the loop.

#pragma omp parallel for num_threads(2) private(i)
for (i = 0; i < SIZE; i++) {
    output[i] = A_array[i] * B_array[i];
}

The OpenMP pragma, #pragma omp parallel for, is used to parallelize a for loop. The pragma uses a number of clauses. The num_threads clause sets the number of threads to use in the parallel execution of the for loop. The private clause declares the variables in its list to be private to each thread. In the above example, two threads will execute the loop in parallel, with one thread handling the first half of the arrays, and the other handling the second half. LegUp will synthesize the two parallel threads into two concurrently running accelerators, each working on half of the arrays. Note that the parallel pragma in OpenMP is blocking: all threads executing the parallel section need to finish before the program execution continues.

To see a complete OpenMP example, please refer to the example project, openmp_reduction, contained within the LegUp installation directory (see Using Example Projects for importing this example project into the LegUp IDE). In this example, you will see how one can simply use OpenMP's reduction clause to sum up all elements in an array with parallel threads; a sketch of this pattern is shown below. LegUp supports a subset of OpenMP that is used for parallelizing loops. The supported OpenMP pragmas and functions are listed in Supported Pthread/OpenMP APIs.
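For reference, a reduction loop in the style of the openmp_reduction example might look like the sketch below; the array contents and size are our own illustration, not the shipped source:

#include <stdio.h>
#define SIZE 1000

int A_array[SIZE]; // assume this is initialized elsewhere

int main() {
    int i, sum = 0;
    // Each thread accumulates a private partial sum; OpenMP combines the
    // partial sums into 'sum' when the parallel loop finishes.
    #pragma omp parallel for num_threads(2) private(i) reduction(+ : sum)
    for (i = 0; i < SIZE; i++) {
        sum += A_array[i];
    }
    printf("sum = %d\n", sum);
    return 0;
}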

Supported Pthread/OpenMP APIs

LegUp currently supports the following Pthread and OpenMP functions/pragmas:

Pthread Functions      | OpenMP Pragmas             | OpenMP Functions
-----------------------|----------------------------|--------------------
pthread_create         | omp parallel               | omp_get_num_threads
pthread_join           | omp parallel for           | omp_get_thread_num
pthread_exit           | omp master                 |
pthread_mutex_lock     | omp critical               |
pthread_mutex_unlock   | omp atomic                 |
pthread_barrier_init   | reduction(operation: var)  |
pthread_barrier_wait   |                            |

Data Flow Parallelism with Pthreads

Data flow parallelism is another commonly used technique to improve hardware throughput, where a succession of computational tasks that process continuous streams of data can execute in parallel. The concurrent execution of computational tasks can be accurately described in software using Pthread APIs. In addition, the continuous streams of data flowing through the tasks can be inferred using LegUp's built-in FIFO data structure (see Streaming Library).

Let's take a look at the example project, FIR Filter (Loop Pipelining with Pthreads), provided in the LegUp IDE (please refer to the Quick Start Tutorial for instructions on how to create a project from a provided example). In the example, the main function contains the following code snippet:

// Create input and output FIFOs.
legup::fifo<int> input_fifo;
legup::fifo<int> output_fifo;

// Build struct of FIFOs for the FIR thread.
struct thread_data data;
data.input = &input_fifo;
data.output = &output_fifo;

// Launch pthread kernels.
pthread_t thread_var_fir, thread_var_injector, thread_var_checker;
pthread_create(&thread_var_fir, NULL, FIRFilterStreaming, (void *)&data);
pthread_create(&thread_var_injector, NULL, test_input_injector, &input_fifo);
pthread_create(&thread_var_checker, NULL, test_output_checker, &output_fifo);

// Join threads.
pthread_join(thread_var_injector, NULL);
pthread_join(thread_var_checker, NULL);

The corresponding hardware is illustrated in the figure below:

Test Input Injector -> Input FIFO -> FIR Filter Streaming -> Output FIFO -> Test Output Checker

The two legup::fifo<int> declarations in the C++ code correspond to the creation of the two FIFOs, where the bit-width is set according to the type shown in the template argument <int>. The three pthread_create calls initiate and parallelize the executions of three computational tasks, where each task is passed a FIFO (or a pointer to a struct containing more than one FIFO pointer) as its argument. The FIFO connections and data flow directions are implied by the uses of the FIFO read() and write() APIs.

For example, the test_input_injector function has a write() call writing data into the input_fifo, and the FIRFilterStreaming function uses a read() call to read data out from the input_fifo. This means that the data flows through the input_fifo from test_input_injector to FIRFilterStreaming.

The pthread_join API is called to wait for the completion of test_input_injector and test_output_checker. We do not join the FIRFilterStreaming thread since it contains an infinite loop (see code below) that is always active and processes incoming data from input_fifo whenever the FIFO is not empty. This closely matches the always-running behaviour of streaming hardware.

Now let's take a look at the implementation of the main computational task (i.e., the FIRFilterStreaming threading function).

void *FIRFilterStreaming(void *threadarg) {
    struct thread_data *arg = (struct thread_data *)threadarg;
    legup::fifo<int> *input_fifo = arg->input, *output_fifo = arg->output;

loop_fir:
    // This loop is pipelined and will be "always running", just like how a
    // streaming module runs whenever a new input is available.
    while (1) {
        // Read from input FIFO.
        int in = input_fifo->read();

        // Need to store the last TAPS - 1 samples.
        static int previous[TAPS] = {0};
        const int coefficients[TAPS] = {0, 1, 2,  3,  4,  5,  6,  7,
                                        8, 9, 10, 11, 12, 13, 14, 15};

        int j = 0, temp = 0;
        for (j = (TAPS - 1); j >= 1; j -= 1)
            previous[j] = previous[j - 1];
        previous[0] = in;

        for (j = 0; j < TAPS; j++)
            temp += previous[TAPS - j - 1] * coefficients[j];

        int output = (previous[TAPS - 1] == 0) ? 0 : temp;

        // Write to output FIFO.
        output_fifo->write(output);
    }
    pthread_exit(NULL);
}

In the code shown in the example project, you will notice that all three threading functions contain a loop, which repeatedly reads and/or writes data from/to FIFOs to perform processing. In LegUp, this is how one can specify that functions are continuously processing data streams that are flowing through FIFOs.

Further Throughput Enhancement with Loop Pipelining

In this example, the throughput of the streaming circuit will be limited by how frequently the functions can start processing new data (i.e., how frequently new loop iterations can be started). For instance, if the slowest function among the three functions can only start a new loop iteration every 4 cycles, then the throughput of the entire streaming circuit will be limited to processing one piece of data every 4 cycles. Therefore, as you may have guessed, we can further improve the circuit throughput by pipelining the loops in the three functions. If you run LegUp synthesis for the example (Compile Software to Hardware), you should see in the report file, summary.legup.rpt, that all loops can be pipelined with an initiation interval of 1.

That means all functions can start a new iteration every clock cycle, and hence the entire streaming circuit can process one piece of data every clock cycle. Now run the simulation (Simulate Hardware) to confirm our expected throughput. The reported cycle latency should be just slightly more than the number of data samples to be processed (INPUTSIZE is set to 128; the extra cycles are spent on activating the parallel accelerators, flushing out the pipelines, and verifying the results).

Function Pipelining

You have just seen how an efficient streaming circuit can be described in software by using loop pipelining with Pthreads. An alternative way to describe such a streaming circuit is to use function pipelining. When a function is marked to be pipelined (by using the Pipeline Function constraint), LegUp will implement the function as a pipelined circuit that can start a new invocation every II cycles. That is, the circuit can execute again while its previous invocation is still executing, allowing it to continuously process incoming data in a pipelined fashion. This essentially achieves the same circuit behaviour as the previous example (loop pipelining with Pthreads) described in the Data Flow Parallelism with Pthreads section. This feature also allows multiple functions that are given the Pipeline function constraint to execute in parallel, achieving the same hardware behaviour as the previous loop pipelining with Pthreads example.

When using this feature, LegUp can only generate a hardware IP for the pipelined streaming circuit. That is, users will need to specify a custom top-level function by adding a Set top-level function constraint (LegUp Constraints). The top-level function has to be a function specified with the Pipeline function constraint, or a wrapper function that simply calls sub-functions that are specified with the Pipeline function constraint. The example introduced in the Quick Start Tutorial demonstrates the synthesis of a single pipelined function. You should see that the FIRFilterStreaming function is specified with both the Pipeline function and Set top-level function constraints. To test the generated IP, the example project comes with a custom test bench (in streaming_tb.v), which interfaces with the generated top module's FIFO ports to inject inputs and receive/verify outputs.

When synthesizing a top-level function with multiple pipelined sub-functions, LegUp will automatically parallelize the execution of all sub-functions that are called in the top-level function, forming a streaming circuit with data flow parallelism. Consider the following code snippet from the example project, FIR_function_pipelining_wrapper, contained within the LegUp installation directory (see Using Example Projects for importing this example project into the LegUp IDE).

void pipeline_wrapper(legup::fifo<bool> &done_signal_fifo) {
    legup::fifo<int> input_fifo(/*depth*/ 2);
    legup::fifo<int> output_fifo(/*depth*/ 2);

    test_input_injector(input_fifo);
    FIRFilterStreaming(input_fifo, output_fifo);
    test_output_checker(output_fifo, done_signal_fifo);
}

In this example, the wrapper function pipeline_wrapper is set as the top-level function. The three sub-functions, test_input_injector, FIRFilterStreaming, and test_output_checker, are specified with the Pipeline function constraint in LegUp. The input_fifo and output_fifo are arguments into the three sub-functions. test_input_injector writes to the input_fifo, which is read by FIRFilterStreaming. FIRFilterStreaming writes to the output_fifo, which is read by test_output_checker.
A function pipeline executes as soon as its inputs are ready. In this case, FIRFilterStreaming executes as soon as there is data in the input_fifo, and test_output_checker starts running as soon as there is data in the output_fifo. In other words, a function pipeline does not wait for the previous function pipeline to completely finish running before it starts to execute; rather, it starts running as early as possible. The test_input_injector function also starts working on the next data to write to the input_fifo while the previous data is being processed. If the initiation interval (II) is 1, a function pipeline starts processing new data every clock cycle. Once the function pipelines reach steady state, all function pipelines execute concurrently. The constraints needed for this example are sketched below.
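The constraint names below come from the Constraints Manual, and the test bench names from this example project; the exact Tcl syntax written out by the IDE may differ from this sketch:

# Sketch of the constraints for FIR_function_pipelining_wrapper.
function_pipeline "test_input_injector"
function_pipeline "FIRFilterStreaming"
function_pipeline "test_output_checker"
set_custom_top_level_module "pipeline_wrapper"
set_custom_test_bench_module "streaming_tb"
set_custom_test_bench_file "streaming_tb.v"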

This example showcases the synthesis of a streaming circuit that consists of a succession of concurrently executing pipelined functions. A custom test bench is also provided in the example project. This time the test_output_checker function is also part of the generated circuit, so the testbench does not need to verify the FIR filter's output, but simply waits for a done signal from the done_signal_fifo and terminates the simulation.

Note: The top-level wrapper function should not have any additional logic other than calling the functions that have the Pipeline function constraint.

Note: The start input port of the generated circuit (module top) serves as an enable signal to the circuit. The circuit stops running when the start signal is de-asserted. To have the circuit run continuously, the start input port should be kept high, as you can see in the provided custom test bench.

LegUp C/C++ Library

LegUp includes a number of C/C++ libraries that allow creation of efficient hardware.

Streaming Library

The streaming library includes the FIFO (first-in first-out) data structure along with its associated API functions. The library can be compiled in software to run on the user's development machine (e.g., x86). Each FIFO instance in software is implemented as a First Word Fall Through (FWFT) FIFO in hardware. The streaming library comes in two versions: one uses the plain C language and the other uses a C++ template class. When using the C version of the library, the data being written to or read from a FIFO must be an integer with a bit width of up to 64 bits. In the C++ template class version, the FIFO data type can be flexibly defined and specified as a template argument of the FIFO object. For example, the FIFO data type could be defined as a struct containing multiple integers:

struct AxisWord {
    uint64 data;
    uint8 keep;
    uint1 last;
};

legup::fifo<AxisWord> my_axi_stream_interface_fifo;

Streaming Library - C++

The C++ version of the library provides more flexibility to define the FIFO data type. The FIFO data type is defined as a template argument of the FIFO template class. A valid data type can be any of the C/C++ primitive integer types or a struct containing primitive integer types. You can use the C++ streaming library by including the header file:

#include "legup/streaming.hpp"

Note: Users should always use the provided APIs below to create and access FIFOs. Any other uses of FIFOs are not supported in LegUp.

Class Method          | Description                                  | Example Usage
----------------------|----------------------------------------------|------------------------------------------
fifo<T>()             | Constructor creating a new FIFO.             | legup::fifo<MyStructT> my_fifo;
void write(T data)    | Writes data to the FIFO.                     | my_fifo.write(data);
T read()              | Reads an element from the FIFO.              | MyStructT data = my_fifo.read();
bool empty()          | Returns 1 if and only if the FIFO is empty.  | bool is_empty = my_fifo.empty();
bool full()           | Returns 1 if and only if the FIFO is full.   | bool is_full = my_fifo.full();
unsigned get_usedw()  | Returns the number of elements in the FIFO.  | unsigned numwords = my_fifo.get_usedw();

Note: The C++ FIFO template class is a beta feature and will be improved and stabilized in later releases. The current version is not thread-safe, and does not print the deadlock warning messages in software (see below).

Streaming Library - Blocking Behaviour

Note that the FIFO read() and write() calls are blocking. Hence, if a module attempts to read from a FIFO that is empty, it will be stalled. Similarly, if it attempts to write to a FIFO that is full, it will be stalled. If you do not want the blocking behaviour, you can check whether the FIFO is empty (with empty()) before calling read(), and likewise, check whether the FIFO is full (with full()) before calling write(), as sketched below.

With the blocking behaviour, if the depths of FIFOs are not sized properly, a deadlock can occur. LegUp prints out messages to alert the user that a FIFO is causing stalls. In software and in hardware simulation, messages like the following are shown:

Warning: fifo_write() has been stalled for ... cycles due to FIFO being full.
Warning: fifo_read() has been stalled for ... cycles due to FIFO being empty.

If you continue to see these messages, you can suspect that there may be a deadlock, and we recommend increasing the depth of the FIFOs.

Note: We recommend a minimum FIFO depth of 2, as a FIFO of depth 1 can cause excessive stalls.
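For instance, a minimal non-blocking access pattern using empty() and full() could be sketched as follows; my_fifo, out_fifo, and the doubling computation are hypothetical:

#include "legup/streaming.hpp"

// Hypothetical non-blocking forwarding step between two FIFOs.
void forward_if_ready(legup::fifo<int> &my_fifo, legup::fifo<int> &out_fifo) {
    // Only proceed when a read and a write are both guaranteed not to stall.
    if (!my_fifo.empty() && !out_fifo.full()) {
        int data = my_fifo.read(); // safe: FIFO is known to be non-empty
        out_fifo.write(data * 2);  // safe: FIFO is known to be non-full
    }
}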

Arbitrary Bit-width Data Type Library

For efficient hardware generation, LegUp allows specifying data types of any bit-width from 1 to 64. You can use LegUp's arbitrary bit-width data type library by including the following header file:

#include "legup/types.h"

This library defines data types for unsigned integers from 1 bit (uint1) to 64 bits (uint64), as well as for signed integers from 1 bit (int1) to 64 bits (int64). In software, standard data types (i.e., char, int, long long) are used even though their entire bit-widths are not required in many cases. Hardware implemented on an FPGA operates at the bit level, hence a circuit can be optimized to the exact bit-width that is required. Using arbitrary bit-width data types provides information to LegUp which is used to produce hardware with the exact specified bit-widths. This can help to reduce circuit area. An example using arbitrary bit-width data types is shown below:

#include "legup/types.h"
#include <stdio.h>

#define SIZE 8

// Marked as volatile, as for a simple program such as this,
// LegUp optimizes away the arrays with constant propagation.
volatile uint3 a[SIZE] = {0, 1, 2, 3, 4, 5, 6, 7};
volatile uint4 b[SIZE] = {8, 9, 10, 11, 12, 13, 14, 15};

int main() {
    volatile uint7 result = 0;
    volatile uint4 i;

#pragma unroll 1
    for (i = 0; i < SIZE; i++) {
        result += a[i] + b[i];
    }

    printf("result = %d\n", result);
}

In the example, we have reduced all variables and arrays to their minimum bit-widths. This is translated to reduced widths in hardware. When LegUp runs, it prints out the following to inform the user that it has detected the use of arbitrary bit-width data types:

Info: Setting the width of memory 'a' to the user-specified custom width of 3 bit(s)
Info: Setting the width of memory 'b' to the user-specified custom width of 4 bit(s)
Info: Setting the width of memory 'main_0_result' to the user-specified custom width of 7 bit(s)
Info: Setting the width of memory 'main_0_i' to the user-specified custom width of 4 bit(s)

Note: We have used the volatile keyword to prevent the variables from being optimized away for the purpose of this example. We do not recommend using this keyword unless absolutely necessary, as it can lead to higher runtime and area.

Note: In an actual program (where the volatile keyword is not used), some variables may not show up in the LegUp printout. This occurs when LegUp optimizes the program so that the variables are no longer required.

Bit-level Operation Library

LegUp provides a library to efficiently perform bit-level operations in hardware. You can use LegUp's bit-level operation library by including the following header file:

#include "legup/bit_level_operations.h"

This library defines a number of C functions that perform common bit-level operations. The software implementation of the library matches that of the generated hardware, so users can accurately simulate the low-level bit operations in software prior to generating the hardware.

Note: For the arguments and return values of the API functions shown below, you may use either arbitrary bit-width data types (Arbitrary Bit-width Data Type Library) or standard data types. We use arbitrary bit-width data types to show the maximum allowable bit-widths of the arguments and return values. For instance, for the legup_bit_select function, the msb_index argument can be of int type, but its maximum value is 63 (since it selects between bit 63 and bit 0 of value v).

Note: The index and width arguments must be constant integers, and must be within the bit range of the variable being selected or updated.

Selecting A Range of Bits

uint64 legup_bit_select(uint64 v, uint6 msb_index, uint6 lsb_index);

This function selects a range of bits, from the msb_index bit down to the lsb_index bit (where the bit index starts from 0), from the input variable, v, and returns it. The lower (msb_index - lsb_index + 1) bits of the return value are set to the specified range of v, and the remaining upper bits of the return value are set to 0. The equivalent Verilog statement is:

    return_val[63:0] = {
        {(64 - (msb_index - lsb_index + 1)){1'b0}}, // Upper bits set to 0.
        v[msb_index : lsb_index]                    // Selected bits of v.
    };

Updating A Range of Bits

uint64 legup_bit_update(uint64 v, uint6 msb_index, uint6 lsb_index, uint64 value);

This function updates a range of bits, from the msb_index bit down to the lsb_index bit (where the bit index starts from 0), of the input variable, v, with the given value, value, and returns the updated value. The equivalent Verilog statement is:

    return_val[63:0] = v[63:0];
    return_val[msb_index : lsb_index]       // Update the selected range of bits with the
        = value[msb_index - lsb_index : 0]; // lower (msb_index - lsb_index + 1) bits of value.
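For example, the two functions can be combined to read and then replace a byte of a variable. The values below are chosen only for illustration:

    #include "legup/bit_level_operations.h"

    uint64 x = 0x12345678;
    // Select bits [15:8] of x; byte1 becomes 0x56 (in the low 8 bits).
    uint64 byte1 = legup_bit_select(x, 15, 8);
    // Update bits [15:8] of x with 0x3A; updated becomes 0x12343A78.
    uint64 updated = legup_bit_update(x, 15, 8, 0x3A);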

Concatenation

uint64 legup_bit_concat_2(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1);

This function returns the bit-level concatenation of the two input variables, v_0 and v_1. The lower width_0 bits of v_0 are concatenated with the lower width_1 bits of v_1, with the v_0 bits forming the upper bits of the result. The concatenated values are stored in the lower (width_0 + width_1) bits of the return value; hence, if the bit-width of the return value is larger than (width_0 + width_1), the remaining upper bits of the return value are set to 0. The equivalent Verilog statement is:

    return_val[63:0] = {
        {(64 - width_0 - width_1){1'b0}}, // Upper bits set to 0.
        v_0[width_0-1 : 0],               // Lower width_0 bits of v_0.
        v_1[width_1-1 : 0]                // Lower width_1 bits of v_1.
    };

Similarly, the following functions concatenate three, four, five, six, seven, and eight variables respectively.

uint64 legup_bit_concat_3(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1, uint64 v_2, uint6 width_2);
uint64 legup_bit_concat_4(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1, uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3);
uint64 legup_bit_concat_5(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1, uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3, uint64 v_4, uint6 width_4);
uint64 legup_bit_concat_6(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1, uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3, uint64 v_4, uint6 width_4, uint64 v_5, uint6 width_5);
uint64 legup_bit_concat_7(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1, uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3, uint64 v_4, uint6 width_4, uint64 v_5, uint6 width_5, uint64 v_6, uint6 width_6);
uint64 legup_bit_concat_8(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1, uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3, uint64 v_4, uint6 width_4, uint64 v_5, uint6 width_5, uint64 v_6, uint6 width_6, uint64 v_7, uint6 width_7);

Bit Reduction Operations

The functions shown below perform bitwise reduction operations on the input variable, v. All of these functions return a one-bit value.

Bit Reduction AND:

uint1 legup_bit_reduce_and(uint64 v, uint6 msb_index, uint6 lsb_index);

Applies the AND operation to the range of bits of v from the upper bit index, msb_index (index starts from 0), down to the lower bit index, lsb_index. Returns 0 if any bit in the specified range of v is 0, and 1 otherwise. The equivalent Verilog statement is: &(v[msb_index : lsb_index]).

Bit Reduction OR:

uint1 legup_bit_reduce_or(uint64 v, uint6 msb_index, uint6 lsb_index);

Applies the OR operation to the range of bits of v from the upper bit index, msb_index (index starts from 0), down to the lower bit index, lsb_index. Returns 1 if any bit in the specified range of v is 1, and 0 otherwise. The equivalent Verilog statement is: |(v[msb_index : lsb_index]).

Bit Reduction XOR:

uint1 legup_bit_reduce_xor(uint64 v, uint6 msb_index, uint6 lsb_index);

Applies the XOR operation to the range of bits of v from the upper bit index, msb_index (index starts from 0), down to the lower bit index, lsb_index. Returns 1 if there is an odd number of 1 bits in the specified range of v, and 0 otherwise. The equivalent Verilog statement is: ^(v[msb_index : lsb_index]).

Bit Reduction NAND:

uint1 legup_bit_reduce_nand(uint64 v, uint6 msb_index, uint6 lsb_index);

Applies the NAND operation to the range of bits of v from the upper bit index, msb_index (index starts from 0), down to the lower bit index, lsb_index. Returns 1 if any bit in the specified range of v is 0, and 0 otherwise. Equivalent to !legup_bit_reduce_and(v, msb_index, lsb_index). The equivalent Verilog statement is: ~&(v[msb_index : lsb_index]).

Bit Reduction NOR:

uint1 legup_bit_reduce_nor(uint64 v, uint6 msb_index, uint6 lsb_index);

Applies the NOR operation to the range of bits of v from the upper bit index, msb_index (index starts from 0), down to the lower bit index, lsb_index. Returns 0 if any bit in the specified range of v is 1, and 1 otherwise. Equivalent to !legup_bit_reduce_or(v, msb_index, lsb_index). The equivalent Verilog statement is: ~|(v[msb_index : lsb_index]).

Bit Reduction XNOR:

uint1 legup_bit_reduce_xnor(uint64 v, uint6 msb_index, uint6 lsb_index);

Applies the XNOR operation to the range of bits of v from the upper bit index, msb_index (index starts from 0), down to the lower bit index, lsb_index. Returns 1 if there is an even number of 1 bits in the specified range of v, and 0 otherwise. Equivalent to !legup_bit_reduce_xor(v, msb_index, lsb_index). The equivalent Verilog statement is: ~^(v[msb_index : lsb_index]).

AXI Slave Interface (Beta)

LegUp allows generating an AXI slave interface for the top-level function in hardware. The slave interface is implemented using FIFOs from the streaming library. To specify a slave interface, you will need to:

1. Define a struct for the slave memory in a header file.
2. Instantiate a global variable of the struct above.
3. Specify a top-level function in the Tcl configuration file.
4. Add set_argument_interface_type <global_variable_name> axi_s in the Tcl configuration file.

The AXI slave interface support is a beta feature with several limitations:

- Only one AXI slave interface can be generated for each LegUp project.
- The interface only supports the AXI-Lite protocol, with additional support for bursting.
- The interface always uses a 32-bit address and a 64-bit data width.
- SW/HW co-simulation is not supported for a LegUp project with an AXI slave interface specified.

We will improve and stabilize this feature in later releases. Please contact support@legupcomputing.com if you require more details about this feature.
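A hedged sketch of the four steps above is shown below; the struct layout and all names are hypothetical, chosen only for illustration:

    /* slave_mem.h -- step 1: struct for the slave memory */
    typedef struct {
        long long data[16]; /* 64-bit data width, per the limitation above */
        long long status;
    } SlaveMem;

    /* main.c -- step 2: global variable of the struct */
    SlaveMem slave_mem;

and in the Tcl configuration file (steps 3 and 4):

    set_custom_top_level_module "my_top"
    set_argument_interface_type slave_mem axi_s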

Software Profiler

The LegUp IDE comes integrated with a profiler (gprof) that allows you to profile your software program to determine its runtime hot spots. This information can be used to decide which program segments should be accelerated in hardware. To use the profiler, simply click the Profile Software button on an opened LegUp project:

Alternatively, you can click LegUp -> Profile Software on the top menu. This executes the software program, and when it finishes running, a gprof tab opens in the bottom window (where the Console window is). The gprof tab shows a hierarchy of the percentage of runtime spent executing each part of the program. In the figure shown above, you can see that most of the time is spent executing the FIRFilterStreaming function. It also shows which parts of the function contribute to that percentage of runtime. Clicking on one of the lines (e.g., FIRFilterStreaming (fir.c:42), which contributes the highest percentage of runtime) takes you to that line of code in the IDE.

Using the software profiler is a useful way to determine which parts of your program take up the most runtime. You can use this information to decide whether to accelerate a portion of code with LegUp, or to rewrite the code in software to be more efficient.

Note: gprof profiles a program based on samples that it takes, and its minimum sampling period is 10 ms. Hence, if your program's runtime is very short, it will not give any meaningful result. If this is the case, try to increase the runtime by providing more inputs to the program.

Note: The figure shown above for the profiled result is from LegUp running on Linux. On Windows, a text-based result is shown.

Using Pre-existing Hardware Modules

It is possible to connect existing hardware modules to the hardware generated by LegUp. This can be useful when there are optimized IP blocks that one wants to use as part of a larger circuit. This is called the custom Verilog flow in LegUp.

To use this flow, a wrapper C function needs to be created whose prototype matches the module declaration of the custom Verilog module. The name of the C function needs to match the name of the Verilog module, and the C function must have the same number of arguments as the number of input ports of the Verilog module. The bit-widths of the function arguments also need to match the bit-widths of the input ports. In addition, the bit-width of the return type in C must match the width of the output port corresponding to the return value.

To use the custom Verilog flow, specify the following in the config.tcl file:

    set_custom_verilog_function "function_name" nomemory
    set_custom_verilog_file "verilog_file_name"

Note: Currently, a custom Verilog module cannot access any memories; all arguments need to be passed in by value. It cannot be in a pipelined section (within a pipelined loop or function). In addition, a custom Verilog module cannot invoke other modules (functions or other custom Verilog modules).

There can be cases where a custom Verilog module needs to access I/Os. However, accessing I/Os cannot be easily described in C. This can be done in LegUp by simply specifying the I/Os in the config.tcl file as follows:

    set_custom_verilog_function "function_name" nomemory \
        input/output high_bit:low_bit IO_signal_name
    set_custom_verilog_file "verilog_file_name"

This connects the specified I/O signals of the custom Verilog module directly to the top-level module generated by LegUp. For example, if one specifies the following:

    set_custom_verilog_function "assignSwitchesToLEDs" nomemory \
        output 5:0 LEDR \
        input 5:0 SW \
        output 5:0 KEY
    set_custom_verilog_file "assignSwitchesToLEDs.v"

this indicates to LegUp that it needs to connect the LEDR, SW, and KEY ports directly to the ports of the top-level module generated by LegUp.

Custom Verilog Module Interface

A custom Verilog module needs to conform to the interface expected by LegUp in order to be connected properly. The following interface is required for the custom Verilog module:

input clk: Clock signal for the custom Verilog module.
input reset: Reset signal for the custom Verilog module.
input start: Signal set to start the custom Verilog module (set high for 1 clock cycle).
output finish: Set by the custom Verilog module to indicate to the rest of the LegUp-generated hardware that it has finished running.
return_val: Only necessary if the wrapper C function has a return value. Its width must be equal to the width of the return type of the C function.
arg_<argument_name>: Only necessary if the wrapper C function has arguments. <argument_name> must be the same as the name of the argument in the C wrapper function prototype, and its width must be equal to the bit-width of the argument type.

In addition, a custom Verilog module may have ports that connect directly to I/Os, as described above. As these ports are not specified in C, they do not have to follow any C conventions. LegUp will create ports at the top-level module with the user-specified names and widths, and connect them directly to the custom Verilog module.
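As an illustration, consider a hypothetical custom Verilog module mac32 with two 32-bit inputs and a 32-bit result (the module name, argument names, and widths are assumptions for this sketch). The C wrapper would be:

    // C wrapper: the name, argument count/widths, and return width all
    // match the custom Verilog module.
    int mac32(int a, int b);

    // The Verilog module "mac32" would then be expected to declare:
    //   input clk, input reset, input start, output finish,
    //   input [31:0] arg_a, input [31:0] arg_b,
    //   output [31:0] return_val

with the following in config.tcl:

    set_custom_verilog_function "mac32" nomemory
    set_custom_verilog_file "mac32.v"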

Specifying a Custom Top-level Function

In the hardware flow, if you only want to compile certain C functions to hardware, rather than the entire program, you can specify a custom top-level function. LegUp then compiles only the specified top-level function and all of its descendant functions to hardware. If multiple functions need to be compiled to hardware, you can create a wrapper function in C that calls all of the desired functions (a sketch is shown below).

The custom top-level function can be specified via the HLS Constraints window:

It can also be specified directly in the config.tcl file as follows:

    set_custom_top_level_module "toplevelmodulename"

This constraint is also described in set_custom_top_level_module.
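For example, if two functions, filter and postprocess, should both be compiled to hardware (the names here are hypothetical), a wrapper can serve as the custom top-level function:

    // Wrapper in C, specified as the custom top-level function below.
    void my_top(int *in, int *out) {
        filter(in, out);
        postprocess(out);
    }

with the corresponding constraint in config.tcl:

    set_custom_top_level_module "my_top"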

Specifying a Custom Test Bench

LegUp allows one to use a custom test bench to simulate the hardware generated by LegUp. When a custom top-level function is specified by the user, there are two options for simulation:

- Use SW/HW co-simulation.
- Provide a custom test bench.

A custom test bench can be specified to LegUp via the HLS Constraints window:

One must specify both the name of the custom test bench module and the name of the custom test bench file. They can also be specified directly in the config.tcl file as follows:

    set_custom_test_bench_module "testbenchmodulename"
    set_custom_test_bench_file "testbenchfilename.v"

These constraints are also described in set_custom_test_bench_module and set_custom_test_bench_file.

LegUp Command Line Interface

LegUp can be run on the command line with the following command:

    legup [-h] <cmd>

where <cmd> can be one of:

hw (default): Run the hardware-only flow.
sw: Compile and run the program in software on the host machine.
sim: Simulate the LegUp-generated circuit in ModelSim (only for the hardware flow).
cosim: Verify the LegUp-generated circuit using the C test bench (only for the hardware flow).
fpga: Fully synthesize the design for the target FPGA.
scheduleviewer: Show the schedule viewer.
clean: Delete the files generated by LegUp.

Commands for the hybrid flow:

hybrid: Run the hybrid flow.
hybrid_compile: Synthesize the generated SoC to an FPGA bitstream with Intel Quartus (legup hybrid must have been run previously).
program_board: Program the FPGA with the generated bitstream (.sof) file (legup hybrid_compile must have been run previously).
run_on_board: Download and run the program on the FPGA (legup program_board must have been run previously).

Intel (Altera) does not provide a simulation model for the ARM processor, hence it is not possible to simulate the ARM hybrid system.
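For reference, a typical hardware-flow session using the commands listed above could look like the following:

    legup hw        # compile the C/C++ program to Verilog
    legup sim       # simulate the generated circuit in ModelSim
    legup cosim     # verify the circuit against the C test bench
    legup fpga      # fully synthesize the design for the target FPGA
    legup clean     # remove the generated files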


CHAPTER THREE

OPTIMIZATION GUIDE

This chapter describes how to optimize the generated hardware through software code changes, pragmas, and LegUp constraints.

Loop Pipelining

Loop pipelining is a performance optimization in high-level synthesis (HLS) that extracts loop-level parallelism by executing multiple loop iterations concurrently, using the same hardware. The key performance metric for a pipelined loop is the time interval between the start of successive loop iterations, called the initiation interval (II). Ideally, the initiation interval should be one, meaning that a new loop iteration starts every clock cycle; this is the highest throughput that a pipelined loop can achieve without unrolling. Although LegUp always tries to achieve an II of 1, sometimes this is not possible, due to resource constraints or due to cross-iteration dependencies (recurrences) in the loop. Consider the following example:

    int sum = 0;
    for (i = 0; i < N; i++) {
    loop:
        for (j = 0; j < N; j++) {
            sum += A[i][j] * B[i][j];
        }
    }

This example shows a nested loop, which performs an element-wise multiplication of two 2-dimensional arrays and accumulates the sum. The inner loop is labeled with a loop label, loop. If we tell LegUp to pipeline the loop using the loop_pipeline constraint, LegUp shows the following messages when synthesizing the program:

Info: Pipelining the loop on line 61 of loop.c with label "loop".
Info: Done pipelining the loop on line 61 of loop.c with label "loop". Pipeline Initiation Interval (II) = 1.

These info messages let us know that LegUp successfully pipelined the inner loop with an II of 1.
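As a minimal sketch, the loop_pipeline constraint is set in the project's config.tcl; the exact argument form is an assumption here, so see the loop_pipeline entry in the Constraints Manual:

    # config.tcl -- assumed form; pipeline the loop with label "loop"
    loop_pipeline "loop"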

Even though an II of 1 has been achieved, the hardware may not meet our desired performance requirements. In this case, we can choose to pipeline the outer loop by moving the loop label to the outer loop:

    int sum = 0;
    loop:
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum += A[i][j] * B[i][j];
        }
    }

When an outer loop is specified to be pipelined, LegUp will automatically unroll all of the inner loops. This can provide higher performance at the expense of higher circuit area. In this example, N is 25, and when the inner loop is unrolled, LegUp will create 25 multipliers and adder units working in parallel. However, this does not mean that the performance will improve by 25x, due to the resource constraints on memories A and B. When LegUp runs, we will see the following messages:

Info: Unrolling the entire loop nest on line 61 of loop.c.
      This loop nest is inside a parent loop labelled 'loop', which is specified to be pipelined.
Info: Pipelining the loop on line 60 of loop.c with label "loop".
Info: Resource constraint limits initiation interval to 13.
      Resource 'A_local_memory_port' has 25 uses per cycle but only 2 ports available.
          Operation                        Location            # of Uses
          'load' operation for array 'A'   line 60 of loop.c   25
          Total # of Uses                                      25
Info: Resource constraint limits initiation interval to 13.
      Resource 'B_local_memory_port' has 25 uses per cycle but only 2 ports available.
          Operation                        Location            # of Uses
          'load' operation for array 'B'   line 60 of loop.c   25
          Total # of Uses                                      25
Info: Done pipelining the loop on line 60 of loop.c with label "loop". Pipeline Initiation Interval (II) = 13.

The first info message indicates that the inner loop is being unrolled, since the outer loop is specified to be pipelined. Next, the info messages tell us that 25 load operations to each of memories A and B would need to occur every clock cycle if the II were 1, but only two ports (allowing 2 loads per cycle) are available for each memory. Local_memory_port indicates that this resource is a memory port of a local memory, which is described in Hardware Architecture. Due to the limited memory ports, LegUp must increase the loop pipeline II until the constraint of 25 load operations to each memory can be met. When the II is 13, meaning that a new loop iteration starts every 13 cycles, there is enough time for 26 load operations, hence the constraint is met (each memory has 2 ports by default, allowing 2 memory accesses per cycle; in 13 cycles, 26 memory accesses can be performed in total).

For this particular example, when the outer loop is pipelined, the performance is about 2x higher than when the inner loop is pipelined. However, the area has also increased by about 25x, due to having 25 multipliers and adders. Therefore, we must take care when pipelining outer loops, due to the unrolling. In general, we recommend pipelining the innermost loop first, and if the performance requirement is not met, then trying the outer loops.

Note: If the loop specified to be pipelined contains any function calls (in the loop or in any of its inner loops), the function calls will be inlined into the loop. Any descendants of the called functions will also be inlined, and all of their loops will be unrolled automatically. If there are many descendant functions and loops, this can increase the area significantly (also described in Function Pipelining). We recommend examining the program for such cases before pipelining a loop.

Let's look at an image filtering example:

    for (y = 1; y < HEIGHT-1; y++) {
    loop:
        for (x = 1; x < WIDTH-1; x++) {
            out[y][x] = in[y-1][x-1]*k[0][0] + in[y-1][x]*k[0][1] + in[y-1][x+1]*k[0][2] +
                        in[y  ][x-1]*k[1][0] + in[y  ][x]*k[1][1] + in[y  ][x+1]*k[1][2] +
                        in[y+1][x-1]*k[2][0] + in[y+1][x]*k[2][1] + in[y+1][x+1]*k[2][2];
        }
    }

This example applies a 3 x 3 image kernel filter, array k, to an input image, array in, producing an output image, array out. When we turn on loop pipelining and run LegUp, we see the following messages:

Info: Pipelining the loop on line 22 of kernel.c with label "loop".
Info: Assigning new label to the loop on line 22 of kernel.c with label "loop".
Info: Resource constraint limits initiation interval to 5.
      Resource 'in_local_memory_port' has 9 uses per cycle but only 2 units available.
          Operation                         Location              # of Uses
          'load' operation for array 'in'   line 23 of kernel.c   9
          Total # of Uses                                         9
Info: Done pipelining the loop on line 22 of kernel.c with label "loop". Pipeline Initiation Interval (II) = 5.

The pipeline initiation interval is limited by the memory accesses to the input image (array in). There are 9 loads but only two memory ports, which forces the loop II to be 5, allowing up to 10 loads per iteration from array in. For loops where the II is constrained by memory accesses to an array, you can improve the II by manually splitting the C array into several smaller C arrays. Each C array can be accessed independently, which reduces resource contention. In this case, we can split the image into rows of pixels, where each row is stored in a separate array (in_0, in_1, and in_2):

    for (y = 1; y < HEIGHT-1; y++) {
    loop:
        for (x = 1; x < WIDTH-1; x++) {
            out[y][x] = in_0[x-1]*k[0][0] + in_0[x]*k[0][1] + in_0[x+1]*k[0][2] +
                        in_1[x-1]*k[1][0] + in_1[x]*k[1][1] + in_1[x+1]*k[1][2] +
                        in_2[x-1]*k[2][0] + in_2[x]*k[2][1] + in_2[x+1]*k[2][2];
        }
    }

Now when we run LegUp we will see:

Info: Pipelining the loop on line 22 of kernel.c with label "loop".
Info: Resource constraint limits initiation interval to 2.
      Resource 'in_0_local_memory_port' has 3 uses per cycle but only 2 units available.
          Operation                           Location              # of Uses
          'load' operation for array 'in_0'   line 33 of kernel.c   3
          Total # of Uses                                           3
Info: Resource constraint limits initiation interval to 2.
      Resource 'in_1_local_memory_port' has 3 uses per cycle but only 2 units available.
          Operation                           Location              # of Uses
          'load' operation for array 'in_1'   line 33 of kernel.c   3
          Total # of Uses                                           3

Info: Resource constraint limits initiation interval to 2.
      Resource 'in_2_local_memory_port' has 3 uses per cycle but only 2 units available.
          Operation                           Location              # of Uses
          'load' operation for array 'in_2'   line 33 of kernel.c   3
          Total # of Uses                                           3
Info: Done pipelining the loop on line 22 of kernel.c with label "loop". Pipeline Initiation Interval (II) = 2.

Now the initiation interval has improved from 5 to 2, more than a 2x performance improvement, just by manually partitioning the C arrays.

Consider another example below:

    int A = 1;
    loop:
    for (i = 0; i < N; i++) {
        A = A * B[i];
    }

We have a loop where the value of A in the current iteration depends on the previous iteration. This is called a cross-iteration dependency, or a loop recurrence. In order to achieve an II of 1, the value of A is required every clock cycle, meaning that the multiplication of A and B[i] has to complete every clock cycle.

Now, let's consider a case where we would like to pipeline the multiplier more deeply in order to get a higher Fmax. This can be done by changing the multiplier latency to 2 (using the set_operation_latency constraint in LegUp) from the default latency of 1. When LegUp runs, we will see the following messages:

Info: Pipelining the loop on line 10 of loop_recurrence.c with label "loop".
Info: Cross-iteration dependency does not allow initiation interval of 1.
      Dependency (distance = 1) from 'mul' operation (at line 11 of loop_recurrence.c)
      to 'phi' operation (at line 11 of loop_recurrence.c)
      Recurrence path:
          Operation         Location                       Cycle Latency
          'phi' operation   line 11 of loop_recurrence.c   0
          'mul' operation   line 11 of loop_recurrence.c   2
          Total Required Latency                           2
      Total required latency = 2.
      Maximum allowed latency = distance x II = 1 x 1 = 1.
      Total required latency > Maximum allowed latency, we must increase II.
Info: Done pipelining the loop on line 10 of loop_recurrence.c with label "loop". Pipeline Initiation Interval (II) = 2.

The messages tell us that the II cannot be 1, due to a cross-iteration dependency from the multiply operation of the current loop iteration to the phi operation of the next iteration. You can think of a phi as a select operation, which is needed to represent the program's intermediate representation in static single assignment (SSA) form. In this particular case, the phi selects the value of A between the initial value of 1 (in the first iteration of the loop) and the computed value of A from within the loop (in the iterations after the first).

The dependency distance of 1 means that the multiply value is used by the phi operation one loop iteration later; or, alternatively, the phi operation depends on the multiply value from one loop iteration ago. The Recurrence path table shows that the phi operation takes 0 cycles, but the multiply operation takes 2 cycles, hence the total required latency is 2 for the path. However, the maximum allowed latency, if the II were 1, is 1 (distance x II = 1 x 1 = 1). In this case, the next loop iteration should start after 1 clock cycle (II = 1), but we still have not finished calculating the result of the multiply, which is needed by the phi operation in the next iteration. Since the total required latency is greater than the maximum allowed latency, the II has to be increased to 2. With an II of 2, the maximum allowed latency becomes 2, which satisfies the total required latency: the first iteration starts, we wait two clock cycles for the multiply to finish (II = 2), then the next loop iteration starts.

In general, for pipelining, achieving a lower II is the highest priority for achieving the highest performance, even at the expense of a slightly lower Fmax. For example, if we can reduce the II from 2 to 1, we cut the clock cycles taken by the loop in half, but we are unlikely to double the Fmax by inserting one more pipeline stage (which changes the II from 1 to 2 in this case). If we use the set_operation_latency constraint to reduce the multiplier latency from 2 to 1 and run LegUp, we will see the following messages:

Info: Pipelining the loop on line 10 of loop_recurrence.c with label "loop".
Info: Done pipelining the loop on line 10 of loop_recurrence.c with label "loop". Pipeline Initiation Interval (II) = 1.

We have achieved an II of 1. Note that by default, LegUp sets the multiplier latency to 1, hence this particular case will not occur without changing the multiplier latency. However, this example demonstrates how you may vary the latency of operations to achieve a lower II when there is a loop recurrence.
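As a sketch, such a latency change would be made with the set_operation_latency constraint in config.tcl; the operation name and argument order below are assumptions, so consult the set_operation_latency entry in the Constraints Manual for the exact syntax:

    # config.tcl -- assumed form; set the multiplier latency to 2 (or back to 1)
    set_operation_latency multiply 2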

The above example illustrated a case of the II being increased due to the latency of operations in the presence of a loop recurrence. The II can also be increased due to the delay of operations in a loop with cross-iteration dependencies. Consider the following example:

    loop:
    for (iter = 0; iter < MAX_ITER; iter++) {
        long long squared = x * x + y * y;
        xtmp = x * x - y * y + x_0;
        y = 2 * x * y + y_0;
        x = xtmp;
        filter += squared <= 4;
    }

The code shows the main computation of the Mandelbrot set; the algorithm details are not important here. When we run LegUp on this code, we see the following messages:

Info: Pipelining the loop on line 39 of mandelbrot.c with label "loop".
Info: Cross-iteration dependency does not allow initiation interval (II) of 1.
      Dependency (distance = 1) from 'trunc' operation (at line 42 of mandelbrot.c)
      to 'phi' operation (at line 42 of mandelbrot.c)
      Recurrence path:
          Operation           Location                  Delay [ns]
          'phi' operation     line 42 of mandelbrot.c   0.00
          'sext' operation    line 40 of mandelbrot.c   0.00
          'mul' operation     line 42 of mandelbrot.c   8.00
          'ashr' operation    line 42 of mandelbrot.c   0.00
          'shl' operation     line 42 of mandelbrot.c   0.00
          'add' operation     line 42 of mandelbrot.c   6.40
          'trunc' operation   line 42 of mandelbrot.c   0.00
          Total Required Delay                          14.40
      Total required delay = 14.40 ns.
      Maximum allowed latency = distance x II = 1 x 1 = 1.
      Maximum allowed delay = Maximum allowed latency x clock period = 1 x 8.00 ns = 8.00 ns.
      Total required delay > Maximum allowed delay, we must increase II.
      Tip: Increase the clock period to be greater than the total required delay to improve II.
Info: Done pipelining the loop on line 39 of mandelbrot.c with label "loop". Pipeline Initiation Interval (II) = 2.

The messages indicate that there is a cross-iteration dependency from the truncate operation to the phi operation, where the total required delay along the path is 14.40 ns. On the other hand, the maximum allowed latency, if the II were 1, is 1, and the maximum allowed delay, based on the given clock period constraint (8 ns) and the maximum allowed latency (1), is 8 ns. Since the required delay of 14.4 ns for the path is greater than the maximum allowed delay of 8 ns, LegUp must increase the II to 2 to satisfy the required delay. With an II of 2, the maximum allowed latency (distance x II = 1 x 2) becomes 2, hence the maximum allowed delay becomes 16 ns (maximum allowed latency x clock period = 2 x 8 ns), and the required delay can be met.

As mentioned above, keeping the II low (ideally 1) should generally be the priority for achieving maximum performance. Another way to meet the required delay, following the equations and the Tip shown above, is to increase the clock period rather than the II. With an II of 1 (hence a maximum allowed latency of 1), if the clock period is greater than 14.4 ns, the maximum allowed delay exceeds the total required delay. Let's set the clock period to 15 ns (with the CLOCK_PERIOD constraint) and re-run LegUp:

Info: Generating pipeline for loop on line 39 of mandelbrot.c with label "loop". Pipeline initiation interval = 1.

You can see that LegUp was now able to generate a circuit with an II of 1. Loop pipelining is a great technique for achieving high performance; to get the most out of it, users should be mindful of the circuit's resource constraints and of the recurrences that exist in the loop.

Loop Unrolling

LegUp allows the user to specify a loop to be unrolled through the use of a pragma, #pragma unroll:

    #pragma unroll
    for (i = 0; i < N; i++) {
        ...
    }

This unrolls the loop completely. Unrolling a loop can improve performance, as the hardware units for the loop body are replicated, but it also increases area. You may also specify a loop to be partially unrolled, to prevent the area from increasing too much:

    #pragma unroll 2
    for (i = 0; i < N; i++) {
        ...
    }

This unrolls the loop two times.

You may also prevent a loop from being unrolled. LegUp automatically unrolls small loops, but you may not want a loop to be unrolled, due to area constraints or in order to pipeline the loop (if a loop is completely unrolled, the loop disappears, and hence cannot be pipelined):

    #pragma unroll 1
    for (i = 0; i < N; i++) {
        ...
    }

This prevents the loop from being unrolled.

Function Pipelining

Similar to loop pipelining, when a function is specified to be pipelined, LegUp will automatically inline all of its descendant functions and unroll all loops (in the specified function as well as in all of its descendant functions). This is done to create high-performance pipelined hardware. Consider the following call graph:

         main
        /    \
       a      b
      / \      \
     c   d      c

where function c contains a loop. If function a is specified to be pipelined, functions c and d will be automatically inlined. When LegUp runs, it will print out the following:

Info: Adding no_inline attribute to the user-specified function: a
Info: Inlining function 'c' into its parent function 'a' for pipelining.
Info: Inlining function 'd' into its parent function 'a' for pipelining.
Info: Unrolling the entire loop nest on line 22 of function_pipeline.c.
      This loop nest is inside function 'a', which is specified to be pipelined.
Info: Pipelining function 'a' on line 15 of function_pipeline.c.

This shows that LegUp first adds the no_inline attribute to function a, to prevent a itself from being inlined, then inlines its descendant functions and unrolls their loops. Care must be taken, though: if the function designated to be pipelined has many descendant functions, which in turn have many loops, the hardware area can increase significantly (as described above for Loop Pipelining). For instance, in the call graph shown above, if main is specified to be pipelined, functions a, b, c, and d will all be automatically inlined. There will be two copies of c, as the function is called from two different places. As there is also a loop in c that will be completely unrolled (in each copy of c), this can increase the area significantly. Hence, for function pipelining, one should examine the program before pipelining a function that has many descendant functions or loops.
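As with loop pipelining, the constraint itself is set in config.tcl. A minimal sketch, assuming function_pipeline takes the function name as its argument (see the function_pipeline entry in the Constraints Manual for the exact syntax):

    # config.tcl -- assumed form; pipeline function 'a'
    function_pipeline "a"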

Structs

LegUp attempts to automatically split up all structs into their individual elements. This improves performance, as the elements can then be accessed in parallel. However, there are cases when LegUp cannot split up a struct, such as when the struct holds other structs or arrays. (A struct can, however, hold a pointer that points to an array.) When LegUp cannot split up a struct, it will print out the following:

Warning: The struct, "struct1", on line 168 of struct.c can result in inefficient memory accesses. We recommend splitting it up into individual elements.

If a struct cannot be automatically split up, LegUp still generates proper hardware when targeting Intel FPGAs, but we still recommend splitting up the remaining structs manually in C for better performance. For the other FPGA vendors (Xilinx, Microsemi, Lattice, Achronix), when LegUp cannot automatically split up all structs, it will print out the following error:

Error: LegUp HLS does not support structs when targeting XILINX. The struct, "struct1", on line 168 of struct.c must be split up into individual elements.

In this case, such structs must be split up manually in the C code.

Inferring a Shift Register

A shift register is an efficient hardware unit composed of a chain of registers connected one to the next, allowing data to be continuously shifted from one register to its adjacent register. It is useful in applications where data continuously arrives in a streaming fashion and some amount of past data has to be kept around for processing. It differs from a memory in that all elements stored in a shift register can be accessed at the same time.

For example, in a FIR filter, time-shifted versions of the input data, commonly referred to as taps, are needed to compute the output. A FIR filter can be expressed with the following equation:

    y[n] = b0*x[n] + b1*x[n-1] + ... + bN*x[n-N]

where y[n] is the output, x[n] is the input, N indicates the filter order, and b0 to bN are the filter coefficients. As the equation shows, once an input is received, it is needed for N+1 computations of the output. This is a perfect application for a shift register. Let's see how one can infer a shift register from C using LegUp:

    int FIRFilter(int input) {
        static int shift_register[TAPS] = {0};

        #pragma unroll
        for (j = (TAPS - 1); j >= 1; j -= 1) {
            shift_register[j] = shift_register[j - 1];
        }
        shift_register[0] = input;

        ...

        return out;
    }

We show the part of the FIRFilter function that pertains to the shift register (x[n] in the equation above). Each time the FIRFilter function is called, it receives one input and produces one output. First, the shift_register array is declared as static; this is needed since the data stored in the shift register (time-shifted versions of the input data) must be kept around for the next invocation of the function. The loop stores each element of the array to the array index one greater than its current index, starting from the highest array index (TAPS - 1) all the way down to 1, effectively moving each element of the array up by one index. Then the newest input is stored at the lowest array index (0). It is important to note the unroll pragma, which causes the loop to be unrolled. Unrolling the loop splits the array into individual elements, with each element stored in a register, hence creating the shift register. Without the unroll pragma, the shift_register array is stored in a memory (RAM), which only allows up to 2 memory accesses per cycle.

Note that if the FIRFilter function is specified to be pipelined, or if the shift register loop is contained within another loop that is specified to be pipelined, the shift register loop will automatically be unrolled and the unroll pragma is not required.

Inferring a Line Buffer

A line buffer is used to buffer a line of pixels of an image or a video frame, in order to keep data around and reduce the overall required memory bandwidth. It is useful for image/video processing applications, where pixels are continuously streamed in and processed, and it is often used in conjunction with the shift register described above.

A good example of such an application is the Sobel filter, which is used as one of the key steps of edge detection, a widely used transformation that identifies the edges in an input image and produces an output image showing just those edges. At a high level, Sobel filtering involves applying a pair of 3x3 convolutional kernels (or filters), typically called Gx and Gy, to a 3x3 pixel stencil window. The stencil window slides over the entire image from left to right, and top to bottom, as shown below. The two kernels detect the edges in the image in the horizontal and vertical directions. At each position in the input image, the filters are applied separately and the computed values are combined together to produce a pixel value in the output image.

    a  b  c
    d  e  f
    g  h  i

At every position of the stencil, we calculate the edge value of the middle pixel e, using the adjacent pixels labeled a to i, each of which is multiplied by the value at its corresponding position of Gx and Gy, and then summed. An example C code for the Sobel filter is shown below:

    #define HEIGHT 512
    #define WIDTH 512

    for (y = 0; y < HEIGHT; y++) {
        for (x = 0; x < WIDTH; x++) {
            if (notInBounds(x, y)) continue;

            xdir = 0; ydir = 0;
            for (xOffset = -1; xOffset <= 1; xOffset++) {
                for (yOffset = -1; yOffset <= 1; yOffset++) {
                    pixel = input_image[y+yOffset][x+xOffset];
                    xdir += pixel * Gx[1+xOffset][1+yOffset];
                    ydir += pixel * Gy[1+xOffset][1+yOffset];
                }
            }
            edgeweight = bound(xdir) + bound(ydir);
            output_image[y][x] = edgeweight;
        }
    }

The outer two loops ensure that we visit every pixel in the image, while ignoring the image borders. The stencil gradient calculation is performed in the two inner loops. The x and y direction gradients are bounded to be between 0 and 255, and the final edge value is stored to the output image.

Consider a case where each pixel of a 512x512 image is received every clock cycle. One approach to implementing this in hardware is to store the entire image in memory, then perform the filtering on the image by loading it from memory. While this approach is certainly possible, it suffers from several weaknesses. First, with one pixel received every clock cycle, it would take 262,144 cycles to store the entire 512x512 image. This represents a significant wait time before seeing any output. Second, we would need to store the entire input image in memory. Assuming 8-bit pixel values, this would require 262 KB of memory. If the image is stored in off-chip memory, it would take a considerable amount of time to access each pixel, and performance would suffer significantly. An alternative, widely used approach is to use line buffers:

    [Figure: two 512-element shift registers (line buffers) hold the
     previous two image rows and feed the 3x3 window (a b c / d e f /
     g h i), while the 8-bit pixel_in enters the bottom row.]

The figure shows two buffers, each holding 512 pixels. Rather than storing the entire input image, we only need to store the previous two rows of the input image (as the 3x3 stencil window covers 3 rows), along with a few pixels from the current row being received (the bottom row of pixels in the figure). As new pixels are received, they are stored into the line buffers. Once the first two lines of the image (and the first three pixels of the third row) have been received, we can start computing edges. From this point onwards, the stencil moves with every new pixel received. When the stencil moves to the next row, its previous two rows are always stored in the line buffers. With the line buffers, we can start computing the edges much earlier, as we do not have to wait for the entire image to be stored. This also drastically reduces the amount of memory required, to just two rows of the input image. By storing the line buffers in on-chip memory, the data can be accessed very quickly (with a 1-cycle latency). Techniques such as this allow efficient real-time video processing on FPGAs.

We show below how one can create the line buffers with the 3x3 stencil window (shown in the figure above) in C using LegUp:

    void sf_window_3x3_and_line_buffer(unsigned char input_pixel,
                                       unsigned char window[3][3]) {
        // shift registers (line buffers)
        static unsigned prev_row_index = 0;
        static unsigned char prev_row1[WIDTH] = {0};
        static unsigned char prev_row2[WIDTH] = {0};

        // window buffer:
        // window[0][0], window[0][1], window[0][2]
        // window[1][0], window[1][1], window[1][2]
        // window[2][0], window[2][1], window[2][2]

        // shift the existing window to the left by one
        window[0][0] = window[0][1];
        window[0][1] = window[0][2];
        window[1][0] = window[1][1];
        window[1][1] = window[1][2];
        window[2][0] = window[2][1];
        window[2][1] = window[2][2];

        int prev_row_elem1 = prev_row1[prev_row_index];
        int prev_row_elem2 = prev_row2[prev_row_index];

        // grab the next column (the rightmost column of the sliding window)
        window[0][2] = prev_row_elem2;
        window[1][2] = prev_row_elem1;
        window[2][2] = input_pixel;

        prev_row1[prev_row_index] = input_pixel;
        prev_row2[prev_row_index] = prev_row_elem1;

        prev_row_index = (prev_row_index == WIDTH - 1) ? 0 : prev_row_index + 1;
    }

This function creates two line buffers, with the input pixel and the 3x3 stencil window passed in as arguments. The line buffers, prev_row1 and prev_row2, are declared as static arrays (as for Inferring a Shift Register), since the data in the line buffers needs to be kept around across successive invocations of the function. Another static variable, prev_row_index, keeps track of the position in the line buffers where data needs to be shifted out and where new data needs to be shifted in. We then shift each element in the 3x3 window to the left by one. The last elements of the line buffers are read out and stored into the rightmost column of the 3x3 window, along with the new input pixel. The new input pixel is also stored into the first line buffer, and the last element of the first line buffer is stored into the second line buffer. Then prev_row_index is incremented by one, unless it has gone through the entire row, in which case it is set to zero (indicating that we are moving on to a new row).

Inferring Streaming Hardware via Producer-Consumer Pattern with Pthreads

The producer-consumer pattern is a well-known concurrent programming paradigm. It typically comprises a finite-size buffer and two classes of threads, a producer and a consumer. The producer stores data into the buffer and the consumer takes data from the buffer to process. The producer must wait until the buffer has space before it can store new data, and the consumer must wait until the buffer is not empty before it can take data. The waiting is usually realized with semaphores. The pseudocode for a producer-consumer pattern with two threads is shown below:

    producer_thread {
        while (1) {
            // produce data
            item = produce();
            // wait for an empty space in the buffer
            sem_wait(numEmpty);
            // store item to buffer
            lock(mutex);
            write_to_buffer;
            unlock(mutex);
            // increment number of full spots in the buffer
            sem_post(numFull);
        }
    }

    consumer_thread {
        while (1) {
            // wait until buffer is not empty
            sem_wait(numFull);
            // get data from buffer
            lock(mutex);
            read_from_buffer;
            unlock(mutex);
            // increment number of empty spots in the buffer
            sem_post(numEmpty);
            // process data
            consume(item);
        }
    }

In a producer-consumer pattern, the independent producer and consumer threads are continuously running, and thus contain infinite loops. Semaphores are used to keep track of the number of spots available in the buffer and the number of items stored in the buffer. A mutex is also used to enforce mutual exclusion on accesses to the buffer.

The producer-consumer pattern is an ideal software approach to describing streaming hardware. Streaming hardware is always running, just like the producer-consumer threads shown above, and different streaming hardware modules execute concurrently and independently, just like the threads. LegUp supports the use of Pthreads, hence a producer-consumer pattern expressed with Pthreads can be directly synthesized into streaming hardware. Our easy-to-use FIFO library provides the required buffer between a producer and a consumer, without the user having to manage the low-level intricacies of semaphores and mutexes. An example pseudocode with three kernels, where function A is a producer to function B (B is a consumer to A), and function B is a producer to C (C is a consumer to B), is shown below:

    // Although a Pthread function takes a single void* argument,
    // the arguments are expanded below for readability.
    void *A(legup::FIFO<int> *FIFO0) {
        ...
        loop_a: while (1) {
            // do some work
            ...
            // write to the output FIFO
            FIFO0->write(out);
        }
    }

    void *B(legup::FIFO<int> *FIFO0, legup::FIFO<int> *FIFO1, legup::FIFO<int> *FIFO2) {
        ...
        loop_b: while (1) {
            // read from the input FIFO
            int a = FIFO0->read();
            // do some work
            ...
            // write to the output FIFOs
            FIFO1->write(data1);
            FIFO2->write(data2);
        }
    }

    void *C(legup::FIFO<int> *FIFO1, legup::FIFO<int> *FIFO2) {
        ...
        loop_c: while (1) {
            // read from the input FIFOs
            int a = FIFO1->read();
            int b = FIFO2->read();
            // do some work
            ...
        }
    }

    void top() {
        legup::FIFO<int> FIFO0;
        legup::FIFO<int> FIFO1;
        legup::FIFO<int> FIFO2;
        // Multiple arguments for a Pthread function must be passed in as a struct,
        // but the arguments are expanded below for readability.
        pthread_create(A, &FIFO0);
        pthread_create(B, &FIFO0, &FIFO1, &FIFO2);
        pthread_create(C, &FIFO1, &FIFO2);
    }

Each kernel contains an infinite loop, which keeps the loop body continuously running. We pipeline this loop to create a streaming circuit. The advantage of using loop pipelining, versus pipelining the entire function (with function pipelining), is that there can also be parts of the function that are not streaming (executed only once), such as initialization code. The top function forks a separate thread for each kernel function. The user does not have to specify the number of times the functions are executed; the threads automatically start executing when there is data in their input FIFOs. This matches the always-running behaviour of streaming hardware. The multi-threaded code above can be compiled, concurrently executed, and debugged using standard software tools (e.g., gcc, gdb). When compiled to hardware with LegUp, the following hardware architecture is created:

    Testbench -> A --FIFO0--> B --FIFO1,FIFO2--> C

Another advantage of using Pthreads in LegUp is that one can easily replicate streaming hardware. In LegUp, each thread is mapped to a hardware instance, hence forking multiple threads of the same function creates replicated hardware instances that execute concurrently. For instance, if the application shown above is completely parallelizable (i.e., data-parallel), one can exploit spatial hardware parallelism by forking two threads for each function, creating the architecture shown below. This methodology therefore allows exploiting both spatial (replication) and pipeline hardware parallelism, all from software.

    Testbench -> A0 --FIFO0--> B0 --FIFO2,FIFO3--> C0
              -> A1 --FIFO1--> B1 --FIFO4,FIFO5--> C1

For replication, some HLS tools require the hardware designer to manually instantiate a synthesized core multiple times and make the necessary connections in HDL. This is cumbersome for a hardware engineer and infeasible for a software engineer. Other HLS tools provide system generators, which allow users to connect hardware modules via a schematic-like block design entry methodology. This, too, is a foreign concept in the software domain. Our methodology uses purely software concepts to automatically create and connect multiple concurrently executing streaming modules.

To create the replicated architecture shown above, one simply changes the top function as follows:

    #define NUM_THREADS 2

    void top() {
        int i;
        legup::FIFO<int> FIFO0[NUM_THREADS], FIFO1[NUM_THREADS], FIFO2[NUM_THREADS];

        // Multiple arguments for a Pthread function must be passed in as a struct,
        // but the arguments are expanded below for readability.
        for (i = 0; i < NUM_THREADS; i++) {
            pthread_create(A, &FIFO0[i]);
            pthread_create(B, &FIFO0[i], &FIFO1[i], &FIFO2[i]);
            pthread_create(C, &FIFO1[i], &FIFO2[i]);
        }
    }

This simple yet powerful technique allows creating multiple replicated streaming hardware modules directly from standard software. As this is a purely standard software solution, without requiring any tool-specific pragmas, the concurrent execution behaviour of the replicated kernels can be modeled in software.

CHAPTER FOUR

HARDWARE ARCHITECTURE

This section describes the hardware architecture produced by LegUp.

Circuit Topology

Each C function corresponds to a hardware module in Verilog. For instance, if we have a software program with the following call graph:

         main
        /    \
       a      b
      / \      \
     c   d      c

where main calls a and b, which in turn call c and d (notice that function c is called by both a and b), one way to create this system in hardware is to instantiate one module within another, in a nested hierarchy, following how the functions are called in software:

    main
    |-- a
    |   |-- c
    |   `-- d
    `-- b
        `-- c

This architecture is employed by other HLS tools, but it can create unnecessary replication of hardware. Notice how module c has to be created twice, since function c is called from two different parent functions. In LegUp, we instead instantiate all modules at the same level of hierarchy, and automatically create the necessary interconnect to connect them together:

    main    a    b    c    d    FU
      |     |    |    |    |    |
      +-----+----+----+----+----+
              interconnect

This prevents modules from being unnecessarily replicated, saving area. The hardware system may also use a functional unit (denoted as FU in the figure), such as a floating-point unit, which is also created at the same level. This architecture allows such units, which typically consume a lot of area, to be shared between different modules.

Note: For a small function, or for a function that is only called once in software, LegUp may decide to inline the function to improve performance. Thus you may not find all of the software functions in the generated hardware.

Threaded Hardware Modules

When Pthreads/OpenMP are used in software, LegUp automatically compiles them to concurrently executing modules. This is synonymous with how multiple threads are compiled to execute on multiple processor cores in software. By default, each thread in software becomes an independent hardware module. For example, forking three threads of function a in software creates three concurrently executing instances of module a in hardware. In a processor-accelerator hybrid system, each thread is compiled to a concurrently executing hardware accelerator.

Memory Architecture

LegUp stores any arrays (local or global) and global variables in a memory. We describe below what types of memories they are, as well as where the memories are stored. In LegUp, there are four levels of memory: 1) local memory, 2) shared-local memory, 3) aliased memory, and 4) processor memory. Local, shared-local, and aliased memories exist in hardware (in a hardware-only system, or within a hardware accelerator in a hybrid system). In a hardware-only system, only the first three levels of memories exist; in a processor-accelerator hybrid system, all four levels exist. Any data that is only accessed by an accelerator is stored on the accelerator side. Any data that is only accessed by the processor, or is shared between multiple accelerators, or is shared between an accelerator and the processor, is stored in processor memory. The processor memory consists of an off-chip memory and an on-chip cache. The processor and all hardware accelerators have access to the on-chip cache. For the processor-accelerator hybrid architecture, see the ARM Processor-Accelerator Hybrid Architecture section below.

Local Memory

LegUp uses points-to analysis to determine which memories are used by which functions. If a memory is determined to be used by a single function, where the function is to be compiled to hardware (in a hardware-only system or as part of a hardware accelerator in a hybrid system), that array is implemented as a local memory. A local memory is created and connected directly inside the module that accesses it. Local memories have a latency of 1 clock cycle.
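A hedged illustration of local memory inference (the function and array names are hypothetical):

    // 'buf' is accessed only by filter(), so LegUp's points-to analysis
    // implements it as a local memory instantiated inside the filter
    // module, with a 1-cycle access latency.
    int filter(int idx, int val) {
        static int buf[64];
        buf[idx] = val;
        return buf[63 - idx];
    }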

Shared-Local Memory

If a memory is accessed by multiple functions, where all of the accessing functions are compiled to hardware, the memory is designated as a shared-local memory. A shared-local memory is instantiated outside the modules (at the same level of hierarchy as the modules, as shown above), and memory ports are created for the modules to connect to the memory. Arbitration is automatically created to handle contention for a shared-local memory. LegUp automatically optimizes the arbitration logic: if the modules that access the memory execute sequentially, a simple OR gate is used for arbitration (which consumes only a small amount of area), but if the accessing modules execute concurrently (with the use of Pthreads or OpenMP), a round-robin arbiter is automatically created to handle contention. Shared-local memories have a latency of 1 clock cycle.

Aliased Memory

There can be cases where a pointer can point to multiple arrays, causing pointer aliasing. Such pointers need to be resolved at runtime. We designate the memories that such a pointer can refer to as aliased memories, which are stored in a memory controller (described below). A memory controller contains all memories that can alias to each other, and allows memory accesses to be steered to the correct memory at runtime. There can be multiple memory controllers in a system, each containing a set of memories that alias to each other. Aliased memories have a latency of 2 clock cycles.

Memory Controller

The purpose of the memory controller is to automatically resolve pointer ambiguity at runtime. The memory controller is only created if there are aliased memories. The architecture of the memory controller is shown below:

    [Figure: the incoming addr/data_in and en/write_en signals fan out to
     the RAMs inside the controller; a tag comparison (e.g., tag == 2 for
     memA, tag == 3 for memB) gates each RAM's enable, and the same tag
     selects which RAM drives data_out.]

For clarity, some of the signals are combined together in the figure. Even though the figure depicts a single-ported memory, all memories are dual-ported by default. The memory controller steers memory accesses to the correct RAM using a tag, which is assigned to each aliased memory by LegUp. At runtime, the tag is used to determine which memory block to enable, with all other memory blocks disabled. The same tag is used to select the correct output data from among the memory blocks.

Processor Memory

The processor memory consists of the on-chip cache and off-chip memory in the processor-accelerator hybrid flow. Any memories accessed by functions running on the processor are stored in off-chip memory and can be brought into the on-chip cache during program execution.

Memories that are shared between multiple accelerators are also stored in this shared memory space. In the case of aliased memories, if a pointer that aliases points to a memory that is accessed by the processor, then all of the memories that the pointer aliases to are stored in this shared memory space.

Memory Storage

By default, each local, shared-local, and aliased memory is stored in a separate dual-ported on-chip RAM, where each RAM allows two accesses per clock cycle. All local memories can be accessed in parallel. All shared-local memories can be accessed concurrently when there are no accesses to the same memory in the same clock cycle; if there are concurrent accesses to the same RAM, our arbitration logic handles the contention automatically and stalls the appropriate modules. All independent memory controllers can also be accessed in parallel, but aliased memories belonging to the same memory controller must be accessed sequentially.

Memory Optimizations

LegUp automatically stores each single-element (non-array) global variable in a register, rather than a RAM, to reduce memory usage and improve performance. A RAM has a minimum read latency of 1 clock cycle, whereas a register can be read in the same clock cycle (0-cycle latency).

For small arrays, LegUp may decide to split them up and store the individual elements in separate registers, allowing all elements to be accessed at the same time. If an array is accessed in a loop, and the loop is unrolled, LegUp may also decide to split up the array.

For constant arrays, which are stored in read-only memories (ROMs), the user can choose to replicate them for each function that accesses them. This can be beneficial when the constant array is accessed frequently by multiple threads (in Pthreads/OpenMP). Creating a dedicated memory for each thread localizes the memory to each thread, making the constant array a local memory of each threaded module. This can improve performance by reducing stalls due to contention, and also saves the resources of the arbitration logic. This feature can be enabled with the Replicate ROM to each accessing module constraint in LegUp.

The figure below shows an example architecture with Pthread modules and all of the memories described above for a hardware-only system. The main function forks two threads of function d, hence two instances of d are created. main, d0, and d1 all execute in parallel, and they share the memory controller, shared-local memories 0 and 1, a register module, as well as a hardware lock module. The hardware lock module is automatically created when a mutex is used in software. All of the arbitration logic shown consists of round-robin arbiters, as all of the modules execute in parallel. A local memory used only by main is instantiated inside that module, and d0 and d1 each have a replicated constant memory (ROM) instantiated inside.

    [Figure: main (with its local memory) and the threaded instances d0
     and d1 (each with a replicated ROM inside) connect through
     round-robin arbiters to the global memory controller, shared-local
     memories 0 and 1, a register module, and a hardware lock module.]

Interfaces

LegUp generates a number of different interfaces for functions and memories. For instance, suppose we have the following C prototypes:

    int FunctionA(int a, int *mem, legup::FIFO<int> &fifo);
    void FunctionB(FIFO<char> &fifo);

where FunctionA calls FunctionB, with a FIFO being written to in FunctionA and read from in FunctionB. LegUp can generate the following module interfaces:

    module FunctionA (
        // default interface
        input clk,
        input reset,
        input start,
        output reg finish,
        input memory_controller_waitrequest,
        // for return value
        output reg [31:0] return_val,
        // for argument
        input [31:0] arg_a,
        // for calling FunctionB
        output FunctionB_start,
        input FunctionB_finish,
        // for accessing mem pointer
        output reg [7:0] mem_address_a,
        output reg mem_enable_a,
        output reg mem_write_enable_a,
        output reg [31:0] mem_in_a,
        input [31:0] mem_out_a,
        output reg [7:0] mem_address_b,
        output reg mem_enable_b,
        output reg mem_write_enable_b,
        output reg [31:0] mem_in_b,
        input [31:0] mem_out_b,
        // for writing to fifo
        input fifo_ready_from_sink,
        output fifo_valid_to_sink,
        output [31:0] fifo_data_to_sink
    );

    module FunctionB (
        // default interface
        input clk,
        input reset,
        input start,
        output reg finish,
        input memory_controller_waitrequest,
        // for reading from fifo
        input fifo_ready_to_source,
        output fifo_valid_from_source,
        output [31:0] fifo_data_from_source
    );

As seen above, each module contains a number of default interface signals: clk, reset, start, and finish. The start and reset signals are used by the first state of the state machine:

[Figure: first state of the FSM. While Reset = 1 or Start = 0, the FSM remains in the initial state; when Start = 1, it leaves the initial state.]

The finish signal is kept low until the last state of the state machine, where finish is set to 1. The memory_controller_waitrequest signal is the stall signal for the module; when asserted, it stalls the entire module.

Since FunctionA has a return type of int, it has a 32-bit return_val output. Each scalar argument becomes an input to the module. The int a argument creates the 32-bit arg_a input.

In the C program, FunctionA calls FunctionB, hence LegUp creates handshaking signals between the two modules. When the output signal FunctionB_start is asserted, FunctionB starts executing. FunctionA then waits until the input signal FunctionB_finish is asserted. This follows the sequential execution behavior of software. However, when Pthreads is used, the caller module continues to execute after forking its threaded modules.

In FunctionA, memory signals for the mem pointer argument are also shown. In this case, mem is designated as a shared-local memory and created outside the module, hence memory ports are created to access the memory. Two ports are created for the memory to allow two accesses per clock cycle. Note that if a pointer is passed into a function, but LegUp determines that the array the pointer refers to is only accessed by that function, the memory will be designated as a local memory and be instantiated within the module (removing the need for memory ports). The following memory signals are shown:

    Memory Signal Type    Description
    address               Address of memory
    enable                1 if reading or writing in this clock cycle
    write_enable          1 for write, 0 for read
    in                    Data being written to memory
    out                   Data being read from memory

The width of the address bus is determined by the depth of the memory.

The fifo argument is given to both FunctionA and FunctionB, where FunctionA writes to the FIFO and FunctionB reads from the FIFO. LegUp currently does not allow the same function to both write to and read from the same FIFO. In this case, the sink (reader) is FunctionB and the source (writer) is FunctionA.

    FIFO Signal Type    Description
    ready               Indicates whether the sink is ready to receive data
    valid               Indicates whether this data is valid
    data                Data being sent
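To make the producer-consumer relationship concrete, here is a hedged C++ sketch of the pair above, using a single int FIFO for both functions for consistency. The legup/streaming.hpp header name and the read()/write() method names are assumptions here; see the Streaming Library section of the User Guide for the exact API:

    #include "legup/streaming.hpp"     // assumed header name

    // Sink (reader): consumes one element per invocation.
    void FunctionB(legup::FIFO<int> &fifo) {
        int x = fifo.read();           // assumed API: stalls until data is valid
        // ... process x ...
    }

    // Source (writer): produces one element, then starts the sink.
    int FunctionA(int a, int *mem, legup::FIFO<int> &fifo) {
        fifo.write(a);                 // assumed API: stalls until the sink is ready
        FunctionB(fifo);               // generates the FunctionB_start/finish handshake
        return a + mem[0];
    }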

AXI Slave Interface (Beta)

LegUp currently provides beta support for an AXI slave interface. Please refer to the User Guide for more details.

Hardware Accelerator Interface in Hybrid Flow

In the hybrid flow, hardware accelerators can communicate with the processor and also access the on-chip cache. Currently, the hybrid flow is only supported for Intel FPGAs and is available in the fully licensed version of LegUp. We use Altera's Avalon Interface and Altera's system integration tool, Qsys. Each hardware accelerator contains a default slave interface for clock and reset, a slave interface for communicating with the processor, and a master interface to access the on-chip cache.

Default slave interface for clock and reset:

    Avalon signal           Description
    csi_clockreset_clk      hardware accelerator clock
    csi_clockreset_reset    hardware accelerator reset

Avalon slave signals (prefixed with avs_s1) are used by the processor to communicate with the hardware accelerator:

    Avalon signal       Description
    avs_s1_address      address sent from the processor to the hardware accelerator
    avs_s1_read         set high by the processor to read a return value from the hardware accelerator
    avs_s1_write        set high by the processor to write an argument or to start the accelerator
    avs_s1_readdata     data returned from the accelerator to the processor
    avs_s1_writedata    data written from the processor to the accelerator

Avalon master signals (prefixed with avm) are used by the accelerator to communicate with the on-chip cache:

    Avalon signal            Description
    avm_accel_address        address to send to the cache
    avm_accel_read           set high when the accelerator is reading from the cache
    avm_accel_write          set high when the accelerator is writing to the cache
    avm_accel_readdata       data returned from the cache
    avm_accel_writedata      data to write to the cache
    avm_accel_waitrequest    asserted until the readdata is returned from the cache

The on-chip data cache is a write-through cache; hence, when an accelerator or the processor writes to the cache, the cache controller also writes the data to off-chip memory. If a memory read results in a cache miss, the cache controller accesses off-chip memory to fetch the cache line and writes it to the cache, and the appropriate data is then returned to the accelerator.

ARM Processor-Accelerator Hybrid Architecture

LegUp can automatically compile a software program into a processor-accelerator SoC, comprising an ARM processor and hardware accelerators. Currently, this is available for the Intel (Altera) DE1-SoC Board and the Arria V SoC Development Board. Both boards contain an ARM HPS (Hard Processor System) SoC FPGA. The ARM HPS consists of (among other things) a dual-core ARM Cortex-A9 MPCore, on-chip instruction and data caches, and an SDRAM controller which connects to on-board DDR3 memory. A number of different interfaces are provided to communicate between the HPS and the FPGA, including the HPS-to-FPGA interface (denoted as H2F in the figure below) and the FPGA-to-HPS interface (denoted as F2H in the figure). The H2F is an AXI (Advanced eXtensible Interface) interface that allows the ARM processor to communicate with the hardware accelerators in a hybrid system. To access the processor memory space, hardware accelerators use the F2H interface, which is connected to the Accelerator Coherency Port (ACP); the ACP connects to the on-chip caches in the HPS to provide cache-coherent data. A hardware accelerator may also have any of the hardware memories described previously.

[Figure: ARM processor-accelerator hybrid architecture. The ARM HPS contains CPU0, CPU1, the on-chip cache, and the DDR3 SDRAM; it connects through the H2F and F2H interfaces to the Avalon interconnect on the FPGA, which links to the HW accelerators, each with its own local RAM, shared-local RAM, and replicated ROM.]

CHAPTER FIVE
RESTRICTIONS

LegUp has the following restrictions in this release.

Hybrid Flow

The processor-accelerator hybrid flow is currently limited to Intel FPGAs, and is only available in the fully licensed version of LegUp.

Struct Support

LegUp automatically attempts to split up all structs into their individual elements, but this may not be possible in some cases. When there are structs that cannot be automatically split up, the user has to manually split them up in C when targeting Xilinx, Lattice, Microsemi, and Achronix FPGAs. For more details, please refer to the Structs section in the Optimization Guide chapter.

Floating-point Support

LegUp currently supports floating-point operations for Intel FPGAs only.

Function Pipelining

When the function pipelining feature is used (i.e., when one or more functions are given the Pipeline function constraint), a custom top-level function must be specified. The top-level function has to be a function specified with the Pipeline function constraint, or a wrapper function that simply calls multiple sub-functions that are all specified with the Pipeline function constraint (Function Pipelining).
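For instance (a minimal sketch with hypothetical function names), a valid custom top level could be either a function-pipelined function itself, or a wrapper such as top below:

    // Both sub-functions carry the Pipeline function constraint.
    void filter(int in, int *out) { *out = in >> 1; }
    void scale(int in, int *out)  { *out = in * 3; }

    // Wrapper set as the custom top-level function: it only calls
    // sub-functions that are all function-pipelined.
    void top(int in, int *out1, int *out2) {
        filter(in, out1);
        scale(in, out2);
    }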

LegUp C/C++ Library

Bit-level Operation Library

When using the bit-level operation APIs, all index and width arguments must be constant integers (Bit-level Operation Library).

Streaming Library

A FIFO can only be written to by one function and read from by another function. It cannot be both written and read by the same function. In addition, there cannot be multiple functions writing to the same FIFO, or multiple functions reading from the same FIFO.

Using Pre-existing Hardware Modules

LegUp allows connecting existing hardware modules to the hardware generated by LegUp. Currently, an existing hardware module that is connected to LegUp-generated hardware cannot access any memories; all arguments need to be passed in by value. An existing hardware module cannot be invoked in a pipelined section (within a pipelined loop or function). In addition, an existing hardware module cannot invoke other modules (functions or other custom Verilog modules).

Achronix Support (Beta)

Support for Achronix is still in beta for this release. Memory inference is not yet supported (for both RAMs and ROMs); all memories will be synthesized into registers by Achronix's CAD tool. If there are large RAMs/ROMs in the design, this can cause the circuit area to increase significantly.

CHAPTER SIX
FREQUENTLY ASKED QUESTIONS

How is LegUp different from other high-level synthesis (HLS) tools?

LegUp is the only high-level synthesis tool that can target ALL FPGA vendors (Intel, Xilinx, Lattice, Microsemi, and Achronix FPGAs). This means that you are not tied to a vendor-specific HLS tool, many of which have very different design methodologies, constraints, and input languages. You can simply design your application once and target any FPGA you want!

LegUp is also the only HLS tool that supports the use of Pthreads and OpenMP to create parallel hardware. This powerful feature allows one to write standard multi-threaded software and automatically compile it to a multi-core hardware system.

Lastly, LegUp can generate a complete SoC, comprising a processor and hardware accelerators, with the click of a single button.

Tell me more about Pthreads and OpenMP support.

Pthreads and OpenMP are popular parallel programming methodologies in software. The threads described in software get compiled to execute on multiple processor cores. LegUp can take the same software program and automatically compile it to multiple hardware cores. This powerful technique allows one to create a multi-core hardware system directly from software, without any changes to the code. This feature is available both for a hardware-only system (without a processor) and for a processor-accelerator hybrid system.

Tell me more about the processor-accelerator hybrid flow.

Not all programs are amenable to hardware acceleration. One may want a processor for controlling the hardware, to run an operating system, or simply for ease of use. The processor-accelerator hybrid flow in LegUp allows one to accelerate a portion of the program while keeping the rest running in software on a processor. All that the user has to do is designate which function(s) to accelerate in hardware, and the rest is taken care of automatically. LegUp automatically partitions the program so that the designated function(s) and their descendant functions are compiled to hardware accelerators, with the remainder of the program compiled to execute on an ARM processor. The tool automatically handles the software/hardware partitioning, and the communication interfaces and interconnect are also generated automatically. This flow allows one to generate a complete SoC, comprising a processor and hardware accelerators, directly from software. Note that this processor-accelerator hybrid flow is currently only available for Intel FPGAs in the fully licensed version of LegUp.

What is the input language?

LegUp supports most of ANSI C, but it does not support recursive functions or dynamic memory. As mentioned above, it also supports Pthreads and OpenMP to create parallel hardware.

What is the output language?

Currently, LegUp only supports Verilog output.

Which FPGAs do you support?

LegUp supports Intel (Altera) Arria V, Cyclone IV, Cyclone V, Stratix IV, and Stratix V, Xilinx Virtex 6 and Virtex 7, Lattice ECP5, Microsemi Fusion, and Achronix Speedster FPGAs. These are the FPGAs that we have tested, but the majority of the output that LegUp produces is generic and not tied to a particular FPGA, so it can easily be ported to other FPGAs as well.

How can I get support?

For general inquiries, please contact info@legupcomputing.com. For technical questions and issues, please contact support@legupcomputing.com. To purchase a full license of LegUp, please contact sales@legupcomputing.com.

CHAPTER SEVEN
CONSTRAINTS MANUAL

LegUp accepts user-provided constraints that impact the automatically generated hardware. These constraints can be specified using the LegUp GUI and are stored in the Tcl configuration file config.tcl in your project directory. This reference section explains all of the constraints available for LegUp HLS.

The most commonly used constraints are available from the LegUp GUI:

Set target clock period: CLOCK_PERIOD
Pipeline loop: loop_pipeline
Pipeline function: function_pipeline
Set top-level function: set_custom_top_level_module
Set test bench module: set_custom_test_bench_module
Set test bench file: set_custom_test_bench_file
Set resource constraint: set_resource_constraint
Set operation latency: set_operation_latency
Inline function: inline_function
Do not inline function: noinline_function
Flatten function: flatten_function
Preserve function: preserve_kernel
Minimize bitwidth: MB_MINIMIZE_HW
Replicate ROM to each accessing module: REPLICATE_ROMS
Set FPGA synthesis top-level module: set_synthesis_top_module
Use generic memory initialization file: USE_MEM_FILE

A few debugging constraints are available from the LegUp GUI:

Keep signals with no fanout: KEEP_SIGNALS_WITH_NO_FANOUT
Insert simulation assertions (for debugging): VSIM_ASSERT

Note: Advanced users can read the defaults for each Tcl constraint in examples/legup.tcl. These defaults are read before being overridden by the project config.tcl file.

Commonly Used Constraints

CLOCK_PERIOD

This is a widely used constraint that allows the user to set the target clock period for a design. The clock period is specified in nanoseconds. It has a significant impact on scheduling: the scheduler will schedule operators into clock cycles using delay estimates for each operator, such that the specified clock period is honored. In other words, operators will be chained together combinationally to the extent allowed by the value of the CLOCK_PERIOD parameter. LegUp has a default CLOCK_PERIOD value for each supported device family. That default value was chosen to minimize the wall-clock time for a basket of benchmark programs (mainly the CHStone benchmark circuits). If the parameter SDC_NO_CHAINING is 1, then the CLOCK_PERIOD parameter has no effect.

Category
HLS Constraints

Value Type
Integer representing a value in nanoseconds

Valid Values
Integer

Default Value
Depends on the target device

Location Where Default is Specified
boards/cycloneii/cycloneii.tcl, boards/stratixiv/stratixiv.tcl, and so on

Dependencies
SDC_NO_CHAINING: If this parameter is set to 1, then CLOCK_PERIOD does nothing.

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_parameter CLOCK_PERIOD 15

ENABLE_OPENMP

This parameter enables the synthesis of OpenMP pragmas and APIs. To compile OpenMP pragmas and API functions to hardware, this parameter must be set to 1.

Category
HLS Constraints

Value Type
Integer

Valid Values
0, 1

Default Value
0

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_parameter ENABLE_OPENMP 1
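As a minimal illustration (hypothetical code, not taken from the manual), with ENABLE_OPENMP set to 1 a parallel loop such as the following is compiled into concurrent hardware, one instance per thread:

    #define N 1000
    int a[N], b[N], c[N];

    void vector_add() {
        // Two threads -> two concurrent hardware instances of the loop body.
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }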

loop_pipeline

This parameter enables pipelining for a given loop in the code. Loop pipelining allows a new iteration of the loop to begin before the current one has completed, achieving higher throughput. In a loop nest, only the innermost loop can be pipelined.

Optional arguments:

    Parameter           Description
    -ii                 Specify a pipeline initiation interval (default = 1)
    -ignore-mem-deps    Ignore loop-carried dependencies for all memory accesses in the loop

Category
HLS Constraints

Value Type
    Parameter           Value Type
    loop_pipeline       String
    -ii                 Integer
    -ignore-mem-deps    None

Valid Values
See Examples

Default Value
N/A

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
loop_pipeline "loop1"
loop_pipeline "loop2" -ii 1
loop_pipeline "loop3" -ignore-mem-deps

function_pipeline

This parameter enables pipelining for a given function in the code. Function pipelining allows a new invocation of a function to begin before the current one has completed, achieving higher throughput.

Optional arguments:

    Parameter    Description
    -ii          Specify a pipeline initiation interval (default = 1)

Category
HLS Constraints

Value Type
    Parameter            Value Type
    function_pipeline    String
    -ii                  Integer

Valid Values
See Examples

Default Value
N/A

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
function_pipeline "add"
function_pipeline "add" -ii 1

set_custom_top_level_module

This Tcl command specifies the top-level C function that will be compiled to hardware by LegUp. All descendant functions called by the top-level C function will be compiled by LegUp. By default, the top level is the main function. When a custom top-level module is specified, a custom test bench will be needed for running RTL simulation; please see set_custom_test_bench_module and set_custom_test_bench_file.

Category
HLS Constraints

Value Type
String

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_custom_top_level_module "accelerator_function"

set_custom_test_bench_module

This Tcl command overrides the name of the test bench module to be elaborated in simulation.

Category
Simulation

Value Type
String

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_custom_test_bench_module "custom_tb"

set_custom_test_bench_file

This Tcl command specifies the file that defines the custom test bench module, which is set via set_custom_test_bench_module. The test bench file will be compiled during RTL simulation.

Category
Simulation

Value Type
String

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_custom_test_bench_file custom_tb.v

set_top_level_argument_num_elements

This Tcl command specifies the number of elements for a top-level function's argument. This is needed for SW/HW Co-Simulation when an argument passed into the top-level function is a dynamically allocated array (with malloc). Statically allocated arrays do not need to be specified, but for any dynamically allocated array, the number of elements must be specified for the Co-Simulation flow to work. The name of the argument is given as a string and the number of elements as an unsigned integer. When using the GUI, choose Set top-level argument number of elements as the Constraint Type and type both the argument name and the number of elements into the Constraint Value box. For the Co-Simulation flow, the user must also specify the top-level function with set_custom_top_level_module.

Category
HLS Constraints

Value Type
Argument name as a string, number of elements as an unsigned integer

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_top_level_argument_num_elements array_in 20
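For example (a hypothetical sketch): together with set_custom_top_level_module "accelerator_function", the constraint above lets SW/HW Co-Simulation exercise a top-level function whose array argument is allocated dynamically:

    #include <stdlib.h>

    // Custom top-level function: array_in has 20 elements, which cannot be
    // inferred statically because the array is allocated with malloc.
    void accelerator_function(int *array_in) {
        for (int i = 0; i < 20; i++)
            array_in[i] *= 2;
    }

    int main() {
        int *array_in = (int *)malloc(20 * sizeof(int));  // 20 elements
        for (int i = 0; i < 20; i++)
            array_in[i] = i;
        accelerator_function(array_in);
        free(array_in);
        return 0;
    }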

set_resource_constraint

This parameter constrains the number of times a given operation can occur in a cycle.

Note: A constraint on signed_add will apply to all of:
signed_add_8, signed_add_16, signed_add_32, signed_add_64,
unsigned_add_8, unsigned_add_16, unsigned_add_32, unsigned_add_64

Category
HLS Constraints

Value Type
<operation> integer

Valid Values
See Default and Examples. Note: the operator name should match the device family operation database file: boards/stratixiv/stratixiv.tcl or boards/cycloneii/cycloneii.tcl

Default Values
memory_port 2
divide 1
modulus 1
multiply 2
altfp_add 1
altfp_subtract 1
altfp_multiply 1
altfp_divide 1
altfp 1

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_resource_constraint signed_divide_16 3
set_resource_constraint signed_divide 2
set_resource_constraint divide 1

set_operation_latency

This parameter sets the latency of a given operation when compiled in LegUp. Latency refers to the number of clock cycles required to complete the computation; an operation with latency one requires one cycle, while zero-latency operations are completely combinational, meaning multiple such operations can be chained together in a single clock cycle.

Category
HLS Constraints

Value Type
<operation> integer

Valid Values
See Default and Examples. Note: the operator name should match the device family operation database file: boards/stratixiv/stratixiv.tcl or boards/cycloneii/cycloneii.tcl

Default Values
altfp_add 14
altfp_subtract 14
altfp_multiply 11
altfp_divide_32 33
altfp_divide_64 61
altfp_truncate_64 3
altfp_extend_32 2
altfp_fptosi 6
altfp_sitofp 6
signed_comp_o 1
signed_comp_u 1
reg 2
memory_port 2
local_memory_port 1
multiply 1

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_operation_latency altfp_add_32 18
set_operation_latency multiply 0

inline_function

This parameter forces a given function to be inlined.

Category
HLS Constraints

Value Type
    Parameter          Value Type
    inline_function    String

Valid Values
See Examples

Default Value
N/A

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
inline_function "add"

noinline_function

This parameter prevents a given function from being inlined.

Category
HLS Constraints

Value Type
    Parameter            Value Type
    noinline_function    String

Valid Values
See Examples

Default Value
N/A

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
noinline_function "add"

flatten_function

This parameter unrolls all loops and inlines all subfunctions of a given function.

Category
HLS Constraints

Value Type
    Parameter           Value Type
    flatten_function    String

Valid Values
See Examples

Default Value
N/A

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
flatten_function "add"

preserve_kernel

The LegUp HLS compiler optimizes away functions that are not used. This parameter prevents a given function from being optimized away.

Category
HLS Constraints

Value Type
    Parameter          Value Type
    preserve_kernel    String

Valid Values
See Examples

Default Value
N/A

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
preserve_kernel "add"

MB_MINIMIZE_HW

This parameter toggles whether the reduced bitwidths computed by the bitwidth minimization pass are used when generating the Verilog design.

Category
HLS Constraints

Value Type
Integer

Valid Values
0, 1

Default Value
0

Location Where Default is Specified
examples/legup.tcl

Dependencies
None. Related parameters: MB_RANGE_FILE, MB_MAX_BACK_PASSES, MB_PRINT_STATS

Applicable Flows
All devices and flows

Test Status
Prototype functionality

Examples
set_parameter MB_MINIMIZE_HW 1

REPLICATE_ROMS

This parameter replicates read-only memories (ROMs), instantiating a copy in each accessing module. When the accessing modules execute concurrently, replicating the ROMs can reduce memory contention and increase performance, at the expense of higher memory usage. It can also reduce LUT usage, by removing the arbitration/multiplexing logic that is required when the ROM is shared.

Category
HLS Constraints

Value Type
Integer

Valid Values
0, 1

Default Value
0

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Examples
set_parameter REPLICATE_ROMS 1

set_synthesis_top_module

This Tcl command specifies the name of the Verilog module that should be set as the top level when running the FPGA vendor's synthesis flow. By default, the top-level module for FPGA synthesis is top, which instantiates the RTL module of the top-level C function (main by default, or the function specified by set_custom_top_level_module). This top-level name is also used when creating a new Quartus project.

Category
Quartus

Value Type
String

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_synthesis_top_module "accelerator_function"

Debugging Constraints

KEEP_SIGNALS_WITH_NO_FANOUT

If this parameter is enabled, all signals will be kept in the output Verilog file, even if they do not drive any outputs.

Category
HLS Constraints

Value Type
Integer

Valid Values
0, 1

Default Value
unset (0)

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_parameter KEEP_SIGNALS_WITH_NO_FANOUT 1

VSIM_ASSERT

When set to 1, this constraint causes assertions to be inserted in the Verilog produced by LegUp. This is useful for debugging the circuit to see where invalid values (X's) are being assigned.

Category
Simulation

Value Type
Integer

Valid Values
0, 1

Default Value
0

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_parameter VSIM_ASSERT 1

Advanced Constraints

These are not available from the GUI.

CASE_FSM

This parameter controls whether the finite state machine (FSM) in the Verilog output by LegUp is implemented with a case statement or with if-else statements. Both options are functionally equivalent, but some back-end RTL synthesis tools may be sensitive to the RTL coding style.

Category
HLS Constraints

Value Type
Integer

Valid Values
0, 1

Default Value
1

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_parameter CASE_FSM 1

GROUP_RAMS

This parameter groups all arrays in the global memory controller into four RAMs (one for each bitwidth: 8, 16, 32, 64). This saves M9K blocks by avoiding having a small array take up an entire M9K block.

Category
HLS Constraints

Value Type
Integer

Valid Values
0, 1

Default Value
0

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_parameter GROUP_RAMS 1

GROUP_RAMS_SIMPLE_OFFSET

When GROUP_RAMS is on, this option simplifies the address offset calculation. The offset of each array within the shared RAM is chosen to minimize addition: it is made a multiple of the size of the array in bytes, so that none of the lower bits of the base address overlap with any bits of the element offset, allowing an OR instead of an ADD:

before: addr = baseaddr + offset
after:  addr = baseaddr OR offset

For example, a 24-byte array can be padded to 32 bytes and placed at a multiple of 32; the base address then has no low bits set within the element-offset range, so baseaddr OR offset equals baseaddr + offset for every element. This improves area and fmax (fewer adders), at the cost of the wasted (padding) memory inside the shared RAM.

Category
HLS Constraints

Value Type
Integer

Valid Values
0, 1

Default Value
0

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_parameter GROUP_RAMS_SIMPLE_OFFSET 1

LOCAL_RAMS

This parameter turns on alias analysis to determine when an array is only used in one function. Such arrays can be placed in a block RAM inside that hardware module instead of in global memory. This increases performance, because local RAMs can be accessed in parallel, while the global memory is limited to two ports.

Category
HLS Constraints

Value Type
Integer

Valid Values
0, 1

Default Value
0

Location Where Default is Specified
examples/legup.tcl

Dependencies
None

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
set_parameter LOCAL_RAMS 1

NO_INLINE

This is a Makefile parameter that prevents the LLVM compiler from inlining functions. Note that all compiler optimizations will be turned off when NO_INLINE is enabled. This parameter can be set in examples/makefile.config or in a local Makefile.

Category
LLVM

Value Type
Integer

Valid Values
0, 1

Default Value
unset (0)

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
NO_INLINE=1

NO_OPT

This is a Makefile parameter that disables LLVM optimizations, which is equivalent to the -O0 flag. This parameter can be set in examples/makefile.config or in a local Makefile.

Category
LLVM

Value Type
Integer

Valid Values
0, 1

Default Value
unset (0)

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
NO_OPT=1

UNROLL

This is a Makefile parameter that allows the user to specify additional flags related to the unroll transformation in the LLVM compiler. This parameter can be set in examples/makefile.config or in a local Makefile. Please see the example settings in examples/makefile.config.

Category
LLVM

Value Type
String

Applicable Flows
All devices and flows

Test Status
Actively in-use

Examples
UNROLL = -unroll-allow-partial -unroll-threshold=1000

set_accelerator_function

This sets a C function to be accelerated in hardware in the hybrid flow. It can be used on one or more functions.

Category
HLS Constraints

Value Type
String

Valid Values
Name of the function

Default Value
NULL

Location Where Default is Specified
N/A

Dependencies
None

Applicable Flows
Hybrid flow

Test Status
Actively in-use

Examples
set_accelerator_function "add"
set_accelerator_function "div"

set_argument_interface_type

This parameter specifies the interface type for the top-level function that will be compiled to hardware by LegUp. It can be used to specify that a pointer argument passed into the top-level function should become an output register interface. A global variable can also be specified to become an AXI slave interface.

Category
HLS Constraints

Value Type
    Parameter         Value Type
    argument_name     String
    interface_type    String

Valid Values
Two types of interface can be specified: register and axi_s

Dependencies
Requires the top-level function to be set via set_custom_top_level_module.

Applicable Flows
All devices for the pure hardware flow

Test Status
Prototype functionality

Examples
set_argument_interface_type data_out register

data_out is a pointer argument passed into the top-level function that is specified to become a register interface.

set_argument_interface_type slave_memory axi_s

slave_memory is a global variable specified to become an AXI slave interface.

set_combine_basicblock

This parameter enables basic block merging within the LLVM IR, which can reduce the number of cycles of execution. There are two modes of operation: merge patterns throughout the program, or merge patterns only within loops. Currently, only two patterns are supported:

Pattern A:

[Figure: Pattern A, where A1, A2, and A3 are basic blocks.]

Pattern B:

[Figure: Pattern B, where B1, B2, B3, and B4 are basic blocks.]

Category
HLS Constraints

Value Type
Integer

Valid Values
0, 1, 2

Default Value
unset (off by default)

Location Where Default is Specified
examples/legup.tcl


More information

SDAccel Development Environment User Guide

SDAccel Development Environment User Guide SDAccel Development Environment User Guide Features and Development Flows Revision History The following table shows the revision history for this document. Date Version Revision 05/13/2016 2016.1 Added

More information

CS510 Operating System Foundations. Jonathan Walpole

CS510 Operating System Foundations. Jonathan Walpole CS510 Operating System Foundations Jonathan Walpole The Process Concept 2 The Process Concept Process a program in execution Program - description of how to perform an activity instructions and static

More information

Concurrent Programming with OpenMP

Concurrent Programming with OpenMP Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico March 7, 2016 CPD (DEI / IST) Parallel and Distributed

More information

Tutorial 2 Implementing Circuits in Altera Devices

Tutorial 2 Implementing Circuits in Altera Devices Appendix C Tutorial 2 Implementing Circuits in Altera Devices In this tutorial we describe how to use the physical design tools in Quartus II. In addition to the modules used in Tutorial 1, the following

More information

COMP Parallel Computing. SMM (2) OpenMP Programming Model

COMP Parallel Computing. SMM (2) OpenMP Programming Model COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel

More information

Laboratory Exercise 7

Laboratory Exercise 7 Laboratory Exercise 7 Finite State Machines This is an exercise in using finite state machines. Part I We wish to implement a finite state machine (FSM) that recognizes two specific sequences of applied

More information

Tutorial on Software-Hardware Codesign with CORDIC

Tutorial on Software-Hardware Codesign with CORDIC ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Tutorial on Software-Hardware Codesign with CORDIC 1 Introduction So far in ECE5775

More information

Concurrency, Thread. Dongkun Shin, SKKU

Concurrency, Thread. Dongkun Shin, SKKU Concurrency, Thread 1 Thread Classic view a single point of execution within a program a single PC where instructions are being fetched from and executed), Multi-threaded program Has more than one point

More information

ECE 454 Computer Systems Programming

ECE 454 Computer Systems Programming ECE 454 Computer Systems Programming The Edward S. Rogers Sr. Department of Electrical and Computer Engineering Final Examination Fall 2011 Name Student # Professor Greg Steffan Answer all questions. Write

More information

Tutorial for Altera DE1 and Quartus II

Tutorial for Altera DE1 and Quartus II Tutorial for Altera DE1 and Quartus II Qin-Zhong Ye December, 2013 This tutorial teaches you the basic steps to use Quartus II version 13.0 to program Altera s FPGA, Cyclone II EP2C20 on the Development

More information

Chapter 4: Threads. Operating System Concepts 9 th Edition

Chapter 4: Threads. Operating System Concepts 9 th Edition Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

DDR and DDR2 SDRAM Controller Compiler User Guide

DDR and DDR2 SDRAM Controller Compiler User Guide DDR and DDR2 SDRAM Controller Compiler User Guide 101 Innovation Drive San Jose, CA 95134 www.altera.com Operations Part Number Compiler Version: 8.1 Document Date: November 2008 Copyright 2008 Altera

More information

Designing with Nios II Processor for Hardware Engineers

Designing with Nios II Processor for Hardware Engineers Designing with Nios II Processor for Hardware Engineers Course Description This course provides all theoretical and practical know-how to design ALTERA SoC FPGAs based on the Nios II soft processor under

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm

More information

Advanced Design System 1.5. DSP Synthesis

Advanced Design System 1.5. DSP Synthesis Advanced Design System 1.5 DSP Synthesis December 2000 Notice The information contained in this document is subject to change without notice. Agilent Technologies makes no warranty of any kind with regard

More information

Tutorial on Quartus II Introduction Using Verilog Code

Tutorial on Quartus II Introduction Using Verilog Code Tutorial on Quartus II Introduction Using Verilog Code (Version 15) 1 Introduction This tutorial presents an introduction to the Quartus II CAD system. It gives a general overview of a typical CAD flow

More information

OpenMP 4. CSCI 4850/5850 High-Performance Computing Spring 2018

OpenMP 4. CSCI 4850/5850 High-Performance Computing Spring 2018 OpenMP 4 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives

More information

FPGAs: FAST TRACK TO DSP

FPGAs: FAST TRACK TO DSP FPGAs: FAST TRACK TO DSP Revised February 2009 ABSRACT: Given the prevalence of digital signal processing in a variety of industry segments, several implementation solutions are available depending on

More information