High Level Synthesis Evaluation of Tools and Methodology

AMIR NAMVAR GHAREHSHIRAN
Degree Project in System-on-Chip Design, Second level, 30.0 HEC
Stockholm, Sweden 2014
TRITA-ICT-EX-2014:133

Abstract

Advances in silicon technology, together with competitive time to market, have in the recent decade forced design tools and methodologies to progress towards higher levels of abstraction. Raising the level of abstraction shortens the design cycle by eliminating details from the design specification. One such methodology is High Level Synthesis (HLS). HLS tools accept a behavioral design at the abstract level as input and generate detailed Register Transfer Level (RTL) code. In this thesis project, the HLS methodology is introduced into the design flow and its advantages are outlined. We then evaluate and compare three HLS tools developed by market-leading vendors, namely C-to-Silicon, CatapultC and Synphony C. To compare the tools, an HLS input is developed for one of Ericsson's designs, and the generated RTL is compared with the hand-written RTL based on several performance criteria. Based on this comparison, we discuss the choice of the best tool so as to facilitate the adoption of HLS in Ericsson's design flow. Finally, the capability of the HLS tools to synthesize designs with pure control flow is investigated.

Acknowledgment

I would like to express my gratitude to my academic examiner at KTH, Professor Ahmed Hemani, and my supervisor Nasim Farahini for all their support, feedback, and encouragement. I would also like to thank my industrial supervisor Björn Fjellborg for his invaluable comments and remarks, and for his engagement throughout this thesis project. I would like to warmly thank my managers Hans Lundén and Jens Andersson, who generously helped me during my time at Ericsson. I am grateful to Henrik Svensson, Marcus Lövgren and Roger Engberg, who willingly shared their precious time; I benefited greatly from our technical discussions. I would further like to thank Frederic Genin, Jan Jezek, Frederic Pouyet and Richard Toone for their help and support. My deepest thoughts are, as always, with my loved ones, who have supported me throughout the entire process, both by keeping me harmonious and by helping me put the pieces together. I will be grateful forever for your love and support.

Table of Contents

Chapter 1: Introduction
  Background
  Motivation
Chapter 2: Design Flow with HLS
  What is High-Level Synthesis?
  Automated HLS processes
  Where is HLS in ASIC Design flow?
  User's Detailed Design flow from HLS to ASIC
  High Level Synthesis Benefits
  Which language is suitable for the HLS?
Chapter 3: Evaluated HLS Tools
  C-to-Silicon Compiler
    HLS Flow with C-to-Silicon Compiler
    C-to-Silicon Compiler's Features
  Catapult
    HLS Design Flow with Catapult SL
    Catapult's Features
  Synphony C Compiler
    HLS Flow with Synphony
    Synphony C Compiler's Features
  Micro-Architecture Constraints
    Loop Pipelining
    Loop Unrolling
    Allocating Memories for Arrays
Chapter 4: De-Rate Matcher (DERM) Algorithm and Code Development for HLS
  Original Code Structure
  Framework Development for DERM
  Test bench Development
Chapter 5: DERM HLS Using C-to-Silicon, Catapult SL and SynphonyC Compiler
  DERM HLS with C-to-Silicon
    Specific input modifications for CtoS
    Building the design in CtoS
    Exploring Designs and Micro-architectures
    Synthesis
    Analyses of the Results
    Output and Verify
  DERM HLS Using Catapult SL
    Writing and testing the C++ model
    Loading design files
    Setting design constraints
    Scheduling and analysis of the Results
    Generating and verifying the RTL
    Comparisons of Solutions
  DERM HLS by Synphony C Compiler
    Configure a project
    Configuring an Implementation
    Building the design
    Run the flow
    Analyzing the Design
    RTL verification
Chapter 6: Cache Controller HLS
  Cache Controller Architecture and model development
  Cache Controller HLS with Synphony C
  HLS Results from Synphony C
Chapter 7: HLS Results Analysis and Tools Evaluation
  Quantitative Metrics Comparison
    QoR (Latency and Area) Comparison
    Input model development time
  Qualitative metrics comparison
Chapter 8: Conclusions and Future Works
  Contributions
  Conclusions
  Recommendation
  Future works
Bibliography

Chapter 1: Introduction

The increasing size, complexity and variety of designs, together with competitive time-to-market pressure, force System-on-Chip (SoC) designers to employ modern design and verification flows. Over the last decade, Electronic Design Automation (EDA) tool vendors and research groups have focused on raising the level of abstraction in the ASIC/FPGA design flow. This trend is mainly aimed at accelerating the product development cycle, and it calls for a new methodology that replaces the error-prone flow of hand-coding at the Register Transfer Level (RTL) (Grant Martin, 2009).

1-1 Background

With the growth in silicon technology, SoC designs have become considerably more complex. Moore's law observes that the number of transistors on integrated circuits doubles approximately every two years. This has led to a rapid increase in the number of on-chip components such as processors, DSPs and ASICs. Figure 1.1 depicts Moore's law and the rate at which the number of transistors in microprocessors has increased over time. For instance, NVIDIA's graphics processor code-named GF100 accommodates more than three billion transistors (Grant Martin, 2009) (Philippe Coussy M. M., 2009).

Figure 1.1 Numbers of transistors on various chips (CPUs) versus the date of introduction

The prevalence of small mobile electronic devices has made some aspects of designs, such as area (dimension), performance (user response) and power consumption (battery life), even more critical. This imposes tight constraints on designs that must meet the requirements enforced by the market. At the same time, time to market is an acute issue, and manufacturers continuously strive to shorten it. A delay in the design process delays the launch of the product, which may cause irreversible losses in profit and/or market share.

The key to addressing these issues is to raise the level of abstraction. This accelerates the design and verification process by removing unnecessary details and reducing the amount of code to be written, which in turn introduces fewer bugs. Besides improving productivity, high-level abstraction enables the designer to explore more potential architectures, leading to higher design performance, better power efficiency, and smaller size (Philippe Coussy M. M., 2009).

Having the model at a higher level of abstraction requires a tool that transfers this algorithmic-level design to RTL or gate level, so that the rest of the flow to silicon implementation can continue. This is the role of High Level Synthesis (HLS) tools, which convert behavioral models into RTL through an automatic process. Several languages, e.g., Matlab, ANSI C/C++ and SystemC, allow design at a high level of abstraction and are accepted by HLS tools.

HLS has been discussed since the 1990s as a replacement for RTL synthesis, in the same way that RTL synthesis replaced logic synthesis in the 1980s. At about the same time, research was initiated into ways of designing hardware and software simultaneously. The first versions of the HLS tools focused on synthesizing designs with data flow. Hence, they were not useful for designs with both data and control flow. Fortunately, today's HLS tools are mature enough to synthesize designs with data/control flow or pure control flow (Grant Martin, 2009).

1-2 Motivation

Ericsson, a leading company in telecommunications, designs and manufactures several models of radio base stations, each accommodating digital units (DUs) with processors, DSPs and ASICs inside. Ericsson's Digital ASIC department focuses on efficient methodologies such as HLS and investigates how they can be deployed in the design flow to shorten the design cycle. This thesis project aims to evaluate three commercial HLS tools and the HLS methodology, and to identify which yields the best Quality of Results (QoR).

We consider two designs for this evaluation, namely the De-Rate Matcher (DERM) and a Cache Controller. DERM is selected because both a manually written RTL and a C model are available. The C model can therefore be input to the HLS tools, and the results can be compared with those of the manual RTL. In contrast, the Cache Controller design has no written RTL, so a sample code has been developed for it. The main idea is to investigate the ability of the HLS tools to synthesize designs with pure control flow rather than data flow. The evaluation with DERM was further aimed at achieving a QoR that, if not better, is at least the same as that of the manual RTL.

Chapter 2: Design Flow with HLS

The Introduction discussed how the tremendous complexity of today's designs has increased design and verification cost and time. This situation will worsen with the modern trend of increasing the level of concurrency. Raising the level of abstraction by using HLS tools is the key to improving productivity in many ways. Abstraction allows unnecessary details to be ignored; hence, functionality can be described in a much more compact way. As a result, the design cycle shortens and verification becomes easier. Using HLS tools has further advantages. For instance, one can explore a larger set of possible design implementations so as to optimize the design in terms of performance, area and power consumption. This chapter introduces the HLS concept in the design flow and investigates its benefits. We further compare different languages and their instrumental features for HLS.

2-1 What is High-Level Synthesis?

High Level Synthesis (HLS) is an automated process in the design flow of ASICs and FPGAs. An HLS tool accepts the design as a behavioral description and translates it into Register Transfer Level (RTL). The behavioral model is simply the algorithm, with no implementation details such as timing, clock, register widths and pipelining. Nonetheless, the behavioral model alone is not sufficient for the HLS tool to generate RTL. The designer also has to provide the following items to the HLS tool (Fingeroff, 2008):

- Target clock frequency
- Target technology library that embodies the available components
- Implementation constraints for the expected quality of results (area, performance and power)
- Interfaces, IPs and memories

2-2 Automated HLS processes

Many characteristics of HLS are the same as those of synthesis at other levels of abstraction, e.g., RTL synthesis. The primary synthesis steps are summarized below:

1. Behavioral Description Translation and Data Flow Analysis: The first step is the translation of the behavioral description into an intermediate structure that shows data and control dependencies and parallelism. As an example, consider the behavioral description (Hemani, 2012) (Peng):

    Read (A, B, C, D, E)
    For (K = 0; K < 10; K++) {
        F = E * (A + B);
        G = (A + B) * (C + D);
    }

The HLS tool then generates the intermediate description (Control Data Flow Graph, CDFG) that is illustrated in Figure 2.1.

Figure 2.1 CDFG extracted from behavioral description

2. Resource Allocation: Based on the design constraints, data types and provided library, the HLS tool allocates resources. Different resources and allocation possibilities are explored at this stage to satisfy the implementation constraints and requirements. The technology library usually determines the number and types of resources. The HLS tool requires information about component size, latency and power consumption. This information is commonly acquired through the Library Characterization process. During this process, the HLS tool reads the target library. Assisted by back-end synthesis tools such as Design Compiler, it then implements resource components, e.g., adders and multipliers, with various characteristics in terms of size, timing and power. These components are then used by the HLS tool in the allocation and scheduling process (Philippe Coussy M. M., 2009) (Fingeroff, 2008).

3. Scheduling: In this step, an order and a time are assigned to the operations. Scheduling has a large impact on the results, particularly on area and latency. Allocation is tightly coupled to scheduling. This is mainly because the tool first tries to find at least one feasible schedule; if it succeeds, it then searches for the optimal solution by examining different resources (fast and big, or slow and small) so as to balance the trade-off between area and time. As shown in Figure 2.2, the order of the operations can affect timing as well as the components used. Thus, during scheduling, several allocation experiments and scheduling solutions may be tried until the optimum with respect to the implementation constraints is reached (Fingeroff, 2008) (Peng).

Figure 2.2 Different schedule solutions with different number of resources and latency

4. Register Allocation: If data is transferred between cycles in the solution found by the scheduling process, it has to be stored in registers. Registers can be shared among operations if there are no data dependencies between them. Figure 2.3 demonstrates how registers are used to store input variables and hold processed data across cycle boundaries. In this example, four registers are required, and they can be shared among cycles (Peng) (Hemani, 2012).

Figure 2.3 Register allocations between the cycle boundaries

5. Resource Binding: In this step, the HLS tool assigns resources to the operations and connects them to registers, interconnects and multiplexers. The more resources the tool shares, the more multiplexers it requires. Figure 2.4 shows two different binding decisions and illustrates how the numbers of shared resources and multiplexers relate to each other (Peng) (Hemani, 2012) (Philippe Coussy A. M.).

Figure 2.4 Different binding examples

6. Data path and State Machine Generation: In the last step, the tool generates the data path and the state machine, which is a manual process in hand-coded RTL. After this step, the RTL is ready to be written to files (Peng).

2-3 Where is HLS in ASIC Design flow?

Figure 2.5 depicts the ASIC design flow with HLS. As can be seen, the only distinction is that the hand-coded RTL is replaced with the automatically generated RTL (Cadence, May 2012). This has a huge impact on the design and verification cycle. The steps preceding HLS, e.g., system specification and architecture design, remain in the flow but need to be defined at a higher level of abstraction. The steps following HLS are also similar to the flow with hand-coded RTL.

Figure 2.5 HLS position in ASIC design flows

2-4 User's Detailed Design flow from HLS to ASIC

Figure 2.6 illustrates the complete design flow with HLS. It starts with developing and verifying a synthesizable behavioral description that takes the architecture (Dan Gajski, 2010) and the design specification into account. The HLS user cannot take a pure software description of the design, written without any understanding of the hardware, and expect good quality RTL. In fact, the designer needs to take into account the interfaces, architecture and hardware resources when writing the input model. A good quality input, in addition to the target library, design constraints and micro-architecture decisions, enables the HLS tool to generate good quality RTL.

In the final step, the designer has to simulate the RTL and compare the results with the expected results and the goals for area and performance. HLS tools have made this process easier by generating a wrapper around the generated RTL or a new test bench in Verilog or VHDL. As a result, the designer is able to verify the generated RTL with the same testbench as the input model in a fast and error-free process.

If the results are unsatisfactory compared with the expected area and performance, the designer has to revise two steps. First, the specified micro-architecture decisions have to be modified. If that fails, the input code has to be modified to force the tool to generate the expected results. This operation requires experience and sufficient knowledge of the tool. It should further be noted that the area result from the HLS tool is not reliable; it only provides a conservative estimate of the area. To obtain a more accurate area result, back-end synthesis needs to be run. The remaining steps are similar to the design flow with manual RTL in all respects (Philippe Coussy M. M., 2009) (Philippe Coussy A. M.).
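
The trade-off between allocated resources and latency that the designer iterates over can be illustrated with a small sketch. It applies a simple resource-constrained list scheduler to the loop body used as the example in Section 2-2 (F = E * (A + B); G = (A + B) * (C + D)). This is only an illustration under stated assumptions: every operation takes one cycle, the shared term A + B is computed once, and the names `Op`, `list_schedule`, `latency` and `example_body` are hypothetical. It does not describe the scheduling algorithm of any of the evaluated tools.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One operation of the data-flow graph (hypothetical representation).
struct Op {
    std::string name;       // name of the produced value
    char kind;              // '+' needs an adder, '*' needs a multiplier
    std::vector<int> deps;  // indices of operations whose results are consumed
};

// Assigns a 1-based clock cycle to every operation, assuming single-cycle
// operations and a fixed number of adders and multipliers per cycle.
// The graph must be acyclic.
std::vector<int> list_schedule(const std::vector<Op>& ops, int adders, int multipliers) {
    assert(adders > 0 && multipliers > 0);
    std::vector<int> cycle(ops.size(), 0);  // 0 means "not scheduled yet"
    std::size_t scheduled = 0;
    for (int c = 1; scheduled < ops.size(); ++c) {
        int free_add = adders;
        int free_mul = multipliers;
        for (std::size_t i = 0; i < ops.size(); ++i) {
            if (cycle[i] != 0) continue;
            // An operation is ready when all inputs were produced in earlier cycles.
            bool ready = true;
            for (int d : ops[i].deps)
                if (cycle[d] == 0 || cycle[d] >= c) ready = false;
            int& free_units = (ops[i].kind == '+') ? free_add : free_mul;
            if (ready && free_units > 0) {
                cycle[i] = c;
                --free_units;
                ++scheduled;
            }
        }
    }
    return cycle;
}

// Latency of a schedule: the last cycle in which any operation executes.
int latency(const std::vector<int>& cycle) {
    int last = 0;
    for (int c : cycle)
        if (c > last) last = c;
    return last;
}

// Loop body of the Section 2-2 example with the common term A + B shared.
std::vector<Op> example_body() {
    return {
        {"t1", '+', {}},      // t1 = A + B
        {"t2", '+', {}},      // t2 = C + D
        {"F", '*', {0}},      // F  = E * t1
        {"G", '*', {0, 1}},   // G  = t1 * t2
    };
}
```

Under these assumptions, `latency(list_schedule(example_body(), 2, 2))` yields a two-cycle schedule, while allowing only one adder and one multiplier stretches the same body to three cycles: fewer resources, longer latency.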

Figure 2.6 Complete ASIC design flows

2-5 High Level Synthesis Benefits

This section concerns the advantages of the automatically generated RTL as compared with the hand-coded RTL. Table 2.1 summarizes the main differences.

Design Steps                   Manual RTL    High Level Synthesis
Design algorithm               Manual        Manual
I/O specification              Manual        Manual
Resource allocation            Manual        Automatic
Resource sharing               Manual        Automatic
Micro-architecture             Manual        Automatic
FSM generation                 Manual        Automatic
Performance/Area estimation    Guess         Approx. accurate

Table 2.1 Comparison of the HLS with the hand-coded RTL process

With the automation of the aforementioned design steps in the HLS design flow, the designer benefits in several respects, as outlined below:

1. Shorter design and verification cycle: By providing an error-free path from the abstract specification to the RTL, the HLS tool shortens the design cycle. Working at a higher level of abstraction, the designer does not need to care about details such as the control unit, resource sharing, and detailed architecture such as pipeline stages. The HLS tool handles these details automatically, and the designer's role is just to control the HLS process. Further, since the focus is on functionality (and not on implementation details), the code is shorter. Hence, debugging the code is much easier and faster. The HLS tools also offer an easy path to verification by generating test bench infrastructure that seamlessly reuses the original model's test bench and test vectors. Figure 2.7 compares the design times of the RTL and HLS flows. With the RTL flow, verification cannot start until the design is ready. In contrast, the HLS flow allows verification to start at the same time as the design. This is mainly because the first functional RTL is obtained much earlier in the HLS flow than with manual RTL. HLS then allows the input model to be refined to meet the target QoR. This parallel design and verification yields productivity improvements (Philippe Coussy A. M.).

Figure 2.7 Design-time comparisons of RTL and HLS flows

2. Exploration of various design architectures: The HLS tool allows the designer to explore different architecture solutions and find the optimal design with respect to the design goals.

3. Easy specification driven optimization: Because the HLS tool reports accurate design performance and an estimate of the area, the designer can choose the best version among several generated RTLs to meet the design goals.

4. Equal or better QoR in comparison with hand-coded RTL: Since the HLS tool allows various components and architectures to be explored in a short time, it can potentially produce better results than the hand-coded RTL. The results are mostly the same when compared to well-written and optimized RTL, yet the short design time is the key benefit.

5. Reusable HLS input models in other design steps and future products: The synthesizable code written for the HLS tools can in some cases be used in other ASIC design steps, such as system-level simulation and early software development. Besides, by changing only the clock frequency, technology library and design constraints (without editing the code), the newly generated RTL can be used in different products. As shown in Figure 2.8, the same HLS model for a video decoder can be used in various products such as cell phones, cameras or TVs. The desired chips differ only in performance, area and power consumption. These parameters are given to the HLS tool as constraints, and the tool generates three different RTLs.

6. Applicable to developing both ASICs and FPGAs: HLS can replace the hand-coded RTL in both FPGA and ASIC design.

Figure 2.8 Three different chips produced from the single behavioral model by using the HLS

2-6 Which language is suitable for the HLS?

This is a critical question, as the answer affects the infrastructure around the new flow with HLS. Companies thus need to consider various aspects when selecting a suitable language so as to maximize the benefits of using HLS. From the hardware designer's point of view, the HLS language is required to have particular characteristics, as outlined below:

- Easy to write the desired functionality at a higher level of abstraction. Languages such as ANSI C have this property. Many companies thus use C in the first step of implementing algorithms.
- Supports abstraction mechanisms and reuse. Object-oriented languages such as C++ satisfy this requirement. C++ has powerful features such as the class and template mechanisms, modular encapsulation and parameterization.
- Supports hardware-specific elements such as timing, concurrency, hierarchy, and bit accuracy. SystemC has all the features of C++ in addition to what a hardware designer needs.
- Produces code reusable in other ASIC design steps such as early software development and fast system-level simulation. The industry-standard SystemC Transaction Level Modeling (TLM) provides this functionality. It allows the untimed algorithm to be separated from the cycle-accurate interfaces at the transaction level.

In light of the above discussion, SystemC has all the fundamental features required for hardware design. It does, however, slightly lower the level of abstraction compared with C/C++. Table 2.2 compares C/C++ and SystemC from various aspects (Dan Gajski, 2010).

Table 2.2 Comparisons of ANSI C/C++ and SystemC
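
To make the bit-accuracy requirement concrete, the following sketch shows in plain C++ what a bit-accurate unsigned type has to guarantee: values are truncated to a fixed width, so arithmetic wraps around exactly as a hardware register of that width would. The type `UIntW` and the function `wraparound_example` are hypothetical and written only for this illustration; they are not the SystemC or vendor-provided data types, which offer far more functionality.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a bit-accurate unsigned type of width W (1..32 bits).
template <int W>
struct UIntW {
    static_assert(W >= 1 && W <= 32, "width must be between 1 and 32 bits");
    std::uint32_t value;

    // Bit mask that keeps only the W least significant bits.
    static constexpr std::uint32_t mask() {
        return (W == 32) ? 0xFFFFFFFFu : ((1u << (W % 32)) - 1u);
    }

    // Every stored value is truncated to W bits, like a W-bit register.
    explicit UIntW(std::uint64_t v = 0)
        : value(static_cast<std::uint32_t>(v) & mask()) {}
};

// Addition keeps the operand width, so the carry out of bit W-1 is dropped.
template <int W>
UIntW<W> operator+(UIntW<W> a, UIntW<W> b) {
    return UIntW<W>(static_cast<std::uint64_t>(a.value) + b.value);
}

// Usage example: an 8-bit addition that overflows wraps modulo 256.
void wraparound_example() {
    UIntW<8> a(200), b(100);
    assert((a + b).value == 44);  // 300 mod 256
}
```

A native `int` would silently keep the full result 300; the width-aware type makes the model behave like the eight-bit datapath it is meant to describe.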

Chapter 3: Evaluated HLS Tools

In the previous chapter, the HLS flow was elaborated and compared with the hand-coded RTL design flow. The advantages of using HLS were spelled out and suitable languages were explored. This chapter introduces the three HLS tools that we use in this project, namely C-to-Silicon, Catapult and Synphony, and investigates their appealing features in the design flow. Finally, we explore various micro-architecture constraints for HLS. This stage is identical for all HLS tools.

3-1 C-to-Silicon Compiler

C-to-Silicon Compiler (CtoS), developed by Cadence, automatically generates RTL from timed or untimed SystemC code written at higher levels of abstraction. CtoS, like other HLS tools, increases design productivity by shortening the design and verification cycle. Its tight integration with RTL Compiler (RC), Encounter, and Incisive verification ensures that the final design meets the target goals and specifications (Cadence, May 2012).

3-1-1 HLS Flow with C-to-Silicon Compiler

Figure 3.1 depicts the complete HLS flow with C-to-Silicon, as outlined below (Cadence, May 2012):

Figure 3.1 HLS flow with C-to-Silicon

1- A model has to be developed as the input to C-to-Silicon. Often, C and C++ models are initially written with performance in mind. SystemC wrappers are then written to specify concurrency.
2- CtoS reads and compiles the input model and extracts the data and control flow.
3- The user specifies design constraints, the target library and micro-architecture decisions.
4- RC, running under the hood, quickly characterizes the required components for more accurate scheduling and resource sharing. Note that library characterization is not a separate step; it is performed automatically in RC.
5- C-to-Silicon generates the optimal RTL, taking into account the design and architecture constraints.
6- The generated RTL is verified by means of waveform simulation or formal tools such as the Sequential Equivalence Checker (SLEC). CtoS automatically generates a wrapper that enables the designer to verify both the SystemC model and the RTL with the SystemC testbench. It also generates a script file that can be used as input for SLEC. SLEC then verifies the RTL generated by the HLS tool. Based on sequential analysis, SLEC helps eliminate functional errors in the generated RTL. Figure 3.2 illustrates the side-by-side verification of the generated RTL with the SystemC testbench.

Figure 3.2 Side-by-side verification methods

7- After the functionality of the RTL has been checked, back-end synthesis with tools such as Design Compiler (DC) or RTL Compiler (RC) is run to obtain accurate area and timing.

3-1-2 C-to-Silicon Compiler's Features

Salient features of C-to-Silicon include (Cadence, May 2012):

- It accepts a wide range of C/C++/SystemC coding styles and constructs including, but not limited to, templates, classes, user-defined types, and certain types of pointers.

- It automatically generates I/O cycle-accurate simulation models, assertions, and scripts for simulation.
- The interactive Graphical User Interface (GUI) integrated into CtoS provides a complete environment for synthesis, analysis, and debugging; it further allows maximum user control over the high-level synthesis process and visualization of results.
- It automatically generates SystemC wrappers to enable RTL verification with SystemC testbenches.
- It provides an integrated and tested flow and scripts for Calypto SLEC.
- It supports transaction-level modeling TLM 1.0 constructs (FlexChannels).

3-2 Catapult

The second tool that we evaluate in this project is Catapult. It accepts SystemC and ANSI C++ as the input for synthesis and generates the RTL. Catapult's synthesis flow enables the designer to automatically explore various micro-architectures and interfaces. Similar to other high-level synthesis tools, Catapult brings the design to a higher level of abstraction and hence improves design and verification productivity. The designer drives Catapult through incremental synthesis steps, and the graphical design analysis provides visibility into the process. Full control over the synthesis enables the designer to fine-tune the design and obtain the optimal QoR (Calypto, 2012).

3-2-1 HLS Design Flow with Catapult SL

1- Develop the abstract model and testbench in C++/SystemC (which together constitute the hardware design): Arguments and parameters of the top-level function become the hardware interfaces. To generate the hardware, the designer has to define the interface ports and their properties, such as port size, direction, timing and protocol. Bit-accurate data types should be used to declare the sizes of the variables that map to interfaces.
A sample of the code with Catapult's bit-accurate data types is provided below:

    #include <ac_int.h>

    // 8 is the width of the variable; false indicates that it is unsigned.
    // a and b are only read (inputs); c is a reference that is written (output).
    void sample(ac_int<8, false> a, ac_int<8, false> b, ac_int<8, false> &c) {
        c = a + b;
    }

The directions of the ports are recognized by Catapult. For instance, in the above example Catapult considers a and b as inputs and c as an output. Synthesizable C++ code has some limitations in comparison with pure C++ code. This is mainly because some constructs are not supported by Catapult, e.g.,

- Dynamic memory allocation
- Pointers for array indexing
- Unions
- float and double native data types

Data transfer between blocks, or between a block and the testbench, is implemented using AC channels, which operate as a First-In-First-Out (FIFO) buffer. To this end, the designer has to include the ac_channel.h header file in the code. A sample code that shows how to declare an ac_channel, read from a channel and write to a channel is presented below:

    #include <ac_channel.h>

    void sample_channel(ac_channel<int> &data_in, ac_channel<int> &data_out) {
        // Proceed only when two input values are available in the FIFO.
        if (data_in.available(2)) {
            int acc = 0;
            for (int i = 0; i < 2; i++)
                acc += data_in.read();
            data_out.write(acc << 1);
        }
    }

2- Specify the target technology library that is characterized for Catapult: Library characterization has to be done via the Catapult Library Builder and may take some time (2.5 days in my case). The characterization process requires the Liberty files and the back-end RTL synthesis tool.

3- Specify the interface details (interface synthesis): Interface synthesis is the process of mapping the top-level C++ variables to resources that implement a timed interface protocol (wire, handshake, memory). Interface protocols are set by selecting a resource and applying interface synthesis constraints.

4- Specify the architectural constraints such as loop unrolling/pipelining and memories: This step is accomplished by applying constraints inside Catapult (and not by modifying the code). Loops with an unknown number of iterations can lead to lower QoR in Catapult. It is thus recommended to set explicit upper limits in the C++ code.

5- Run scheduling: The synthesis tool attempts to find the best solution by performing several tasks:

- Multi-objective scheduling
- Arithmetic optimization and bit-width trimming
- Speculative execution

- Memory access splitting/merging
- Fine and coarse-grain resource sharing

6- Generate and verify the generated RTL: This can be done via either simulation or formal verification and emulation. Catapult SL supports all major verification tools. The verification steps in HLS with Catapult SL are shown in Figure 3.3.

Figure 3.3 Catapult SL flow with verification steps

3-2-2 Catapult's Features

The most noticeable features of Catapult include:

- SystemC and ANSI C++ synthesis
- Mixed data path and control logic synthesis
- Multi-abstraction synthesis
- Power, performance, and area optimization
- Push-button generation of the RTL verification infrastructure

- Top-down and bottom-up hierarchical design management
- Full and accurate control over design interfaces
- AXI interface library
- Silicon vendor certified synthesis libraries
- Integrated ECO and formal verification
- Generation of both VHDL and Verilog code

3-3 Synphony C Compiler

Synphony, developed by Synopsys, is the last commercial HLS tool that we evaluate in this project. Similar to other HLS tools, it reduces development time by taking the design at the algorithmic level and generating the RTL. This algorithmic model must be written in C/C++. Note that it is crucial to take the hardware architecture into account during the development of the C model. The other inputs to Synphony are design constraints, architecture constraints, a C/C++ testbench and the technology library. Given these inputs, Synphony can fine-tune the design for the target goals (performance, area or power) and generate the RTL implementation for ASICs and/or FPGAs (Synopsys, 2012).

3-3-1 HLS Flow with Synphony

1- Develop the untimed, sequential, pure C/C++ code for the input model and testbench. In Synphony C, micro-architecture constraints are specified in the C code as #pragma directives. In other words, the code needs to be edited several times during the HLS process until the expected results are obtained.
2- Generate the reference results that are used to verify the developed C/C++ code. The tool compiles the C/C++ code and compares its results with the references. This process is referred to as Golden Simulation.
3- Configure the design implementation by designating the target library and clock frequency. As mentioned earlier, the library must be characterized via the back-end synthesis tool before being used in Synphony.
4- Run Synphony to perform the build and verification process, as shown in Figure 3.4, and generate the RTL and the Verilog testbench.

Figure 3.4 Build and verification steps in Synphony C

5- Analyze the results by inspecting the reports generated by the tool. Change the architecture constraints in the original code and re-run the tool until the desired results are achieved.

Figure 3.5 depicts the HLS flow with Synphony.

Figure 3.5 HLS flow with Synphony C

3-3-2 Synphony C Compiler's Features
The important features of Synphony C Compiler are summarized below:
- High-level synthesis based on a single-threaded, untimed, sequential C/C++ model, constraints, testbench, and libraries
- Support of standard (AXI, AHB, OCP, etc.) and custom external interfaces to eliminate glue logic
- High-level synthesis optimizations for performance and area goals
- Automated single-to-multi-threaded transformations
- Hierarchical block-level resource sharing
- Automatic scheduling and pipelining
- Timing optimizations for variably bounded loops
- Recursive hierarchical compilation enabling an arbitrary number of hierarchy levels for optimization
- Architectural clock gating for fast implementation of complex, low-power designs
- Automatic generation of testbench, design files, constraints, and scripts for logic synthesis, power optimization, and verification tools

3-4 Micro-Architecture Constraints
As discussed earlier, specifying the micro-architecture constraints is a step shared by all HLS tools. These constraints have a big impact on the area and performance of the design and mainly relate to loop resolving and memory mapping, as listed below:
- Loop resolving
  - Unrolling
  - Pipelining
- Array mapping to
  - Flattened arrays
  - Memories (built-in RAM, prototype memory, and vendor RAM)

Micro-architecture constraints guide the tool in the scheduling process by providing information about timing, the order of processes, and how to bind them to clock cycles. They also define how arrays are mapped to the various memories.

3-4-1 Loop Pipelining
Loop pipelining allows the next iteration of a loop to start before the current iteration has finished. Therefore, the execution of loop iterations can be overlapped. This increases the design


performance by running loop iterations in parallel. Figure 3.7 illustrates the execution time of a loop with 4 processes that iterates 3 times, both before and after pipelining (Fingeroff, 2008).

Figure 3.7 Loop pipelining and performance improvement

The initiation interval (II) is set on a loop as a design constraint in the HLS design environment. It indicates how many clock cycles pass before the next iteration of the loop starts. Thus, II = 2 means that a new loop iteration starts every 2 clock cycles, as shown in Figure 3.8 (Fingeroff, 2008).

Figure 3.8 Pipelined loop with II = 2

3-4-2 Loop Unrolling
Loop unrolling is the primary mechanism for introducing parallelism into a design; it schedules multiple loop iterations in parallel. The degree of parallelism can be controlled by the user, which is referred to as partial loop unrolling. Full loop unrolling can theoretically execute all loop iterations in a single clock cycle. Figure 3.9 presents an example of a loop with 2 processes and 4 iterations together with its partially unrolled counterpart (Fingeroff, 2008).
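The effect of the initiation interval on execution time can be estimated with a simple cycle count. The sketch below is only an illustration: it assumes that every process in the loop body takes exactly one clock cycle and that the loop runs at least once; the function names are hypothetical and do not come from any HLS tool.

```cpp
#include <cassert>

// Cycles for a loop without pipelining: every iteration runs to completion
// before the next one starts.
inline int sequential_cycles(int iterations, int iteration_latency) {
    assert(iterations >= 1 && iteration_latency >= 1);
    return iterations * iteration_latency;
}

// Cycles for a pipelined loop: a new iteration starts every `ii` cycles, and
// the last iteration still needs its full latency to finish.
inline int pipelined_cycles(int iterations, int iteration_latency, int ii) {
    assert(iterations >= 1 && iteration_latency >= 1 && ii >= 1);
    return (iterations - 1) * ii + iteration_latency;
}
```

Under these assumptions, the loop with 4 processes and 3 iterations needs sequential_cycles(3, 4) = 12 cycles without pipelining, pipelined_cycles(3, 4, 1) = 6 cycles with II = 1, and pipelined_cycles(3, 4, 2) = 8 cycles with II = 2.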

Figure 3.9 Comparison between fully unrolled and partially unrolled loops

3-4-3 Allocating Memories for Arrays
There are several schemes for mapping an array in HLS to a physical memory in the implementation. As shown in Table 3.1, each scheme is best suited to a particular array use case or access pattern. Below, we outline these schemes (Fingeroff, 2008) (Cadence, May 2012).

Table 3.1 Comparison of different memory mapping options

Flatten Array: The designer can flatten an array (or several arrays), replacing all of its reads and writes with reads and writes of individual variables that represent each word in the array. Flattening arrays that are small and heavily used, or whose index is typically a constant, may lead to smaller, faster designs; however, flattening large arrays produces very large designs that may be impractical.

Built-in RAM: Arrays that are suitably small (roughly 256 words or fewer) can be implemented as built-in RAMs. This is a good choice when the number of array words is small but multi-process access is desired (note, however, that this feature supports both single- and multi-process access). The RAM is generated for synthesis as behavioral Verilog that accesses an array of the necessary dimensions. After synthesis, the array elements are stored in flip-flops, and the control and data path logic is built from primitive gates of the technology library (for example, muxes and bitwise logic). Flip-flops carry a large area penalty compared to SRAM technology cells; hence, a flip-flop-based RAM will always be less efficient in area and power consumption.

Prototype Memories: Prototype memories are useful during the micro-architectural and design exploration stages, when the final implementation of a memory may not yet be complete or the correct memory technology cell is not available. A prototype memory acts as a placeholder for the actual memory cell and thereby allows the scheduling step of the HLS tool to complete. The HLS tool or the Library Builder (depending on the tool) can create a prototype memory for an array, similar to using a built-in memory. A prototype memory does not require logic synthesis for area and timing estimates during resource and timing analysis.

Vendor RAM: While exploring the design micro-architecture, it is not advisable to map arrays to vendor RAMs, since the arrays may still be merged, flattened, or optimized, which can alter their dimensions and access properties. Once the arrays are in their final form, they can be allocated as vendor RAMs. After allocation, the implementation is understood to be final, and scheduling may begin. During allocation, the number of access ports of the RAM is determined, which affects the ability to schedule the design as well as the quality of results (QoR).
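To make the Flatten Array scheme above more concrete, the following sketch contrasts an indexed array with its flattened form, in which every word becomes a separate variable. It is a plain C++ illustration with hypothetical function names, not the output of any HLS tool; a real tool performs this transformation internally when the flatten directive is applied.

```cpp
#include <cassert>

// Array form: the four words are reached through an index, so an HLS tool
// may map `window` to a memory with a limited number of ports.
inline int sum_array(const int (&window)[4]) {
    int acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc += window[i];
    }
    return acc;
}

// Flattened form: each word is its own variable (a register in hardware),
// so all four values are available in the same clock cycle.
inline int sum_flattened(int w0, int w1, int w2, int w3) {
    return w0 + w1 + w2 + w3;
}

// Both forms compute the same result; only the storage mapping differs.
inline void check_flattening() {
    const int window[4] = {1, 2, 3, 4};
    assert(sum_array(window) ==
           sum_flattened(window[0], window[1], window[2], window[3]));
}
```

The flattened form scales poorly: an array of N words needs N separate variables, which is why flattening is only attractive for small arrays.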

Chapter 4: De-Rate Matcher (DERM) Algorithm and Code Development for HLS

In Chapter 3, we introduced three HLS tools and explored the complete design flow using them. This chapter aims at developing a general input C model for HLS. This code will then be modified specifically for each HLS tool. We use derm_core (de-rate matcher) as the design under study. Our choice is based on the following reasons:
- It has mixed data and control flow
- The hand-coded RTL is available for DERM and can be compared with the one generated by the HLS tool

DERM communicates with other blocks via pipe interfaces. This feature enables adding and removing blocks without affecting the functionality of the others. Figure 4.1 illustrates the signals that the blocks use to communicate with each other.

Figure 4.1 Common pipe interface setup in a block (signals between SRC and DST: valid, sot, eot, data, ready; modules: py_pipe_dst, dst2user, block (user), user2dst, py_pipe_src)

The boundary of the interfaces is chosen such that the py_pipe_dst and py_pipe_src modules are excluded. Therefore, by following the same protocol and interfaces between the block and py_pipe_dst/py_pipe_src, the RTL generated via the HLS process can replace the hand-coded version.

4-1 Original Code Structure
A system model in TLM 2.0 was available for DERM. This model and its algorithm are used as a benchmark for developing the HLS input model. The algorithm comprises 4 steps, as summarized below:
- Receive systematic bits and interlaced parity bits
- Write the bits into memory after adding dummy and filler bits
- Perform de-interleaving and de-interlacing while writing the bits into the memory (so that the memory can be read linearly)
- Send the de-interleaved systematic and parity bits after removing the dummy bits (filler bits are preserved)

Inspection of the source code shows that it consists of two functions. The second function calculates the addresses for the S, P1 and P2 bits, and the first function (de_rate_match) transfers data from linear addresses in harq_buffer to the calculated addresses in soft_values. The structure of the code is shown below:

    void de_rate_match(int *soft_values, int *harq_buffer, int harq_buffer_size,
                       int nrof_rows, int db_plus_fb_mod_cols,
                       int db_plus_fb_div_cols, int rm_fb)
    {
        int *circ_buff_add = malloc( );
        get_circular_buffer_add(circ_buff_add, );
        for(k)
            soft_values[circ_buff_add[k]] = harq_buffer[k];
    }

    void get_circular_buff_add( )
    {
        for () { } // systematic bits (s) addresses
        for () { } // interlaced parity bits (p1, p2) addresses
    }

Notes on the code:
- Dummy bits are not written to the memories
- The soft_values array is initialized with filler bit values first

This code is not synthesizable for the following reasons:
- It uses function calls such as malloc
- It lacks bit-width accurate interfaces
- It is not optimized for hardware resources

To address these issues, we first fixed the array sizes and replaced the malloc call with a constant-size array. Using the HLS process, the modified code led to the following results:
- harq_buffer is mapped to an SRAM
- soft_values is mapped to an SRAM
- circ_buffer_add is mapped to an SRAM
- De-interleaved and de-interlaced addresses are computed first. Then, reordering is performed from the harq_buffer memory to the soft_values memory.

The above code was written for early software development and fast simulation. The developer thus did not take the hardware into account; hence, it is not the optimal solution for a hardware implementation. Remember that HLS tools cannot convert a pure piece of software code into decent RTL that is optimized for hardware. As an HLS user, one needs to think about the architecture of the hardware and the required resources when developing the model for HLS. This method is referred to as Architecture Driven Design. It comprises several steps, as summarized below:
- Think about the architecture and hardware resource requirements
- Develop a coding framework to describe the desired architecture and interfaces
- Map the existing algorithm code into the framework to enable easier exploration and reuse (rather than modifying it incrementally)
- Add extra functions (temporary buffering at the input and output) to reach the specified throughput

This methodology leads to desirable results from the HLS, as well as a predictable and reliable HLS process.

4-2 Framework Development for DERM
We proceed to develop the platform by following the steps mentioned in the previous section and answering the following questions:

What is the desired hardware operation?
- Obtain harq values from a stream (handshake) and write them to local memory at de-interleaved and de-interlaced addresses
- Read soft values using linear addressing and put them to a stream (handshake)

How many resources should be used?
- A memory to store the incoming bits and de-interleave them. Addresses must be computed on the fly (no storage)

What is the loop structure?
- 2 loops to write bits (s loop, p1/p2 loop)
- 1 loop to read bits

The answers to the above questions were partly found with the help of Ericsson designers, particularly those who were involved in writing the hand-coded RTL. This was done mainly to be able to compare the results with the manual version. In view of the required architecture and resources, the synthesizable template is given below:

    /* Sample of the platform used for synthesizable code */
    void de_rate_match(int nrof_rows, int db_plus_fb_mod_cols,
                       int db_plus_fb_div_cols, int rm_fb)
    {
        for (column = 0; column < nrof_sub_block_interleaver_columns; column++) {
            for (row = 0; row < ((ci.nrof_rows + number_of_memories - 1) / number_of_memories); row++) {
                // get systematic bits (s) from input stream in the size of 8 bytes
                // insert dummy and filler bits
                // compute de-interleaved address
                // store bits to de-interleaved address
            }
        }
        for (column = 0; column < nrof_sub_block_interleaver_columns; column++) {
            for (row = 0; row < ((ci.nrof_rows + number_of_memories - 1) / number_of_memories); row++) {
                // get interlaced parity bits (p1, p2) from input stream in the size of 8 bytes
                // insert dummy and filler bits
                // compute de-interleaved address
                // store bits to de-interleaved address
            }
        }
        for (n = 0; n < 3; n++) {

            for (row = 0; row < ci.nrof_rows; row++) {
                for (column = 0; column < nrof_sub_block_interleaver_columns; column++) {
                    // read bits in linear order from memory
                    // remove dummy bits
                    // put bits to output stream in the size of 8 bytes
                    // do these steps three times, for s, p1 and p2
                }
            }
        }
    }

Modifications made during the code transformation and framework development include:
- Nested row/column loops are used instead of a single loop to avoid modulus operations
- The modulus operation in the parity bit address computation is replaced by simple overflow detection
- Constraints are added on variable loop bounds
- The address computation from the original code is used without editing
- Vendor-defined channels are used for the input and output streams
- A temporary buffering process is used for managing the throughput (8-byte reads and writes to the channels)
- Input and output interfaces are defined as variables with bit-accurate data types

The first version of the synthesizable code had an input/output rate of one byte per cycle and was synthesized through the HLS process. The generated RTL was functionally correct but did not reach the required latency and throughput. Several versions of the code were thus developed and synthesized until the desired latency was obtained. The sequence of modifications to the framework that led to the desired result is summarized below:
- 1-byte write/read with one memory block (SRAM)
- 1-byte write/read with 8 memory blocks
- 1-byte write and 8-byte read with 8 memory blocks
- 8-byte write/read with 8 memory blocks

4-3 Test Bench Development
No dedicated test bench existed for the DERM block; hence, we developed one with the following specifications:
- Read the test vectors from files and send them to the DUT
- Receive the output values from the DUT and compare them with the expected values

- Add dummy values (bytes) at the end of the input stream to make its length divisible by 8

Note that the test bench must use the same channels as the DUT's inputs and outputs. These channels are usually defined by the HLS tools in such a way as to provide higher QoR and faster simulation. This chapter presented tips for developing a general framework and test bench. Each HLS tool, however, requires specific modifications to the code; these are outlined in the next chapter, where we use the different HLS tools for synthesis.
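The padding step of the test bench can be sketched as follows. This is a minimal illustration, not the test bench used in the project: the function name pad_to_word, the byte container, and the dummy value 0 are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append dummy bytes until the stream length is a multiple of `word_bytes`
// (8 in the DERM test bench). A stream that already fits is left unchanged.
inline std::vector<std::uint8_t> pad_to_word(std::vector<std::uint8_t> stream,
                                             std::size_t word_bytes = 8,
                                             std::uint8_t dummy = 0) {
    assert(word_bytes > 0);
    const std::size_t remainder = stream.size() % word_bytes;
    if (remainder != 0) {
        // vector::insert(pos, count, value) appends `count` copies of `dummy`
        stream.insert(stream.end(), word_bytes - remainder, dummy);
    }
    return stream;
}
```

For example, a 13-byte input stream is extended by 3 dummy bytes to 16 bytes, so the DUT can always read complete 8-byte words.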

Chapter 5: DERM HLS Using C-to-Silicon, Catapult SL and Synphony C Compiler

In the previous chapter, we introduced the architecture driven method as an efficient way to develop an input model for HLS. In this chapter, we outline the steps of generating the RTL using the HLS tools.

5-1 DERM HLS with C-to-Silicon
The HLS process with C-to-Silicon (CtoS) involves the following steps:
1- Building the design in CtoS
2- Exploring various designs and architectures
3- Scheduling the design
4- Generating and verifying the RTL

Before performing the above steps, the framework developed in Chapter 4 needs to be modified and made specific to CtoS.

5-1-1 Specific input modifications for CtoS
The vendor provides a streaming platform as the starting point of the HLS process with CtoS. This streaming platform passes the received data to the output. In the testbench, the output data is compared with the expected values. The benefits of the streaming platform include (Cadence, May 2012):
- Complete design and verification environment for streaming designs
- Based on the FlexChannel library (vendor-defined channel that supports TLM)
- Maximizes code re-use
- Always-working design approach
- Same code used for TLM and cycle-accurate simulations
- Fast way to functionally correct RTL
- Structured C++ coding style for easy customization
- Separation of control and data path
- Standardized scripts, makefiles, etc. that work out of the box

In the first step, the synthesizable framework is merged with the streaming platform, and the following modifications are made:

- The input and output channels are replaced with FlexChannels in both the block and the test bench.
- Interfaces are defined inside the streaming platform.
- The original code uses a two-dimensional array M[8][2316] to represent 8 blocks of memory. Given the appropriate directives, CtoS generates the ports for the memory blocks; however, it cannot recognize that 8 different memory blocks are accessed in each cycle. By breaking the two-dimensional array down into 8 one-dimensional arrays and using specific flags and a commit step for each memory block access, the tool is forced to access different memory blocks in each cycle. The structure of the code used for the memory write is given below:

    for (k = 0; k < NUMBER_OF_MEMORIES; k++) {
        unsigned short row = row_index * NUMBER_OF_MEMORIES + k;
        if (row < ci.nrof_rows) {
            unsigned int addr = interleaved_col + row * nrof_sub_block_interleaver_columns;
            unsigned int mem_bank = (bank + k) % number_of_memories;
            commit_flag[mem_bank] = true;
            c_data[mem_bank] = _data[k];
            c_addr[mem_bank] = get_add(addr);
        }
    }
    wait();
    // commit batch of 8 data bytes to 8 memories
    if (commit_flag[0] == true) my_array_0[c_addr[0]] = c_data[0];
    if (commit_flag[1] == true) my_array_1[c_addr[1]] = c_data[1];
    if (commit_flag[2] == true) my_array_2[c_addr[2]] = c_data[2];
    if (commit_flag[3] == true) my_array_3[c_addr[3]] = c_data[3];
    if (commit_flag[4] == true) my_array_4[c_addr[4]] = c_data[4];

    if (commit_flag[5] == true) my_array_5[c_addr[5]] = c_data[5];
    if (commit_flag[6] == true) my_array_6[c_addr[6]] = c_data[6];
    if (commit_flag[7] == true) my_array_7[c_addr[7]] = c_data[7];

Another issue encountered during synthesis that requires modification of the code is that for loops with a dynamic number of iterations spend an extra cycle checking the loop condition before the first iteration. This problem can be addressed by using do {} while () loops instead. An example of this modification is given below:

    for (row_index = 0; row_index < ((ci.nrof_rows + NUMBER_OF_MEMORIES - 1) / NUMBER_OF_MEMORIES); row_index++)

is changed to:

    row_index = 0;
    do {
        // loop body, including the increment of row_index
    } while (row_index < (ci.nrof_rows + NUMBER_OF_MEMORIES - 1) / NUMBER_OF_MEMORIES);

The test bench is further modified as follows:
- The streaming platform test bench transmits all packets through the channel and receives all output packets at once. The original test bench, however, sends and receives packets one by one.

Finally, the design is tested with waveform simulation for functionality and timing. The waveform simulation of the input model with the developed testbench is shown in Figure 5.1. In this figure, it is essential to look at the data input/output and interface signals and compare them with the expected values to ensure correct functionality.
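The for-to-do-while rewrite above can be checked in plain C++. The sketch below only counts iterations; the function names are hypothetical, and the cycle saving itself is a property of the HLS scheduler that ordinary C++ execution cannot show. One caveat worth making explicit: a do {} while () loop always executes its body once, so the two forms are equivalent only when the row count is at least 1, which the second function asserts.

```cpp
#include <cassert>

// for-loop form: the condition is tested before the first iteration.
inline int iterations_for(int nrof_rows, int number_of_memories) {
    int count = 0;
    for (int row_index = 0;
         row_index < (nrof_rows + number_of_memories - 1) / number_of_memories;
         row_index++) {
        count++;  // stands in for the loop body
    }
    return count;
}

// do-while form: the body runs once before the condition is first tested.
// Equivalent to the for loop only if at least one iteration is required.
inline int iterations_do_while(int nrof_rows, int number_of_memories) {
    assert(nrof_rows >= 1 && number_of_memories >= 1);
    int count = 0;
    int row_index = 0;
    do {
        count++;  // stands in for the loop body
        row_index++;
    } while (row_index < (nrof_rows + number_of_memories - 1) / number_of_memories);
    return count;
}
```

With 20 rows and 8 memories, both forms run 3 iterations, since (20 + 8 - 1) / 8 = 3 in integer arithmetic.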

Figure 5.1 Waveform simulations of CtoS input model

5-1-2 Building the design in CtoS
CtoS can be run in two modes: Graphical User Interface (GUI) or text-based (batch) mode. Although the two modes are functionally equivalent, it is recommended to run CtoS in the GUI in the initial stages of the design because of its powerful Control Data Flow Graph (CDFG) view. This feature gives the designer a good grasp of the design during the synthesis process. Every action in GUI mode has a textual equivalent in batch mode; these commands can be captured during synthesis with the GUI and reused in later runs, in either GUI or batch mode. In the synthesis process, we followed this approach: we worked in the GUI and saved the equivalent commands to a *.tcl script file for the next runs (Cadence, May 2012). All files are then loaded into CtoS by sourcing the ctos.tcl file. This file contains the following commands:

    ## file name ctos.tcl
    new_design top
    set_attr auto_write_models "true" /designs/top
    define_sim_config -model_dir "./model" /designs/top
    set_attr source_files [list src/top.cc] /designs/top
    set_attr compile_flags "-I./include -w" /designs/top
    set_attr top_module_path "sc_main.top" /designs/top
    set_attr build_flat true [get_design]
    set_attr enable_multiple_pipeline_stalls "true" [get_design]
    source tech_and_clk.tcl
    build
    source micro_arch.tcl
    schedule -effort high -passes 200 /designs/top
    allocate_registers /designs/top
    write_rtl -o ./model/top_rtl.v /designs/top/modules/top
    write_rc_script -rtl_file ./model/top_rtl.v -o top_rc.tcl /designs/top/modules/top
    define_sim_config -makefile_name "Makefile.sim" \
        -testbench_files "tb/main.cc" \
        -testbench_kind "self_checking" \
        -simulator_args "-sctop sc_main -I./src -I./tb -I./include" \
        -success_msg "PASSED" /designs/top

    write_sim_makefile -overwrite

These commands provide the paths to all files that are necessary for the synthesis. One of the sourced files is tech_and_clk.tcl, which contains information about the technology library to be used and the clock characteristics. The other sourced file is micro_arch.tcl; this file encapsulates all the micro-architecture decisions made during the synthesis. GUI-generated commands can also be stored in this file for later runs. In the first run, the micro_arch.tcl file is empty; the architecture constraints are to be specified there. The rest of the commands in ctos.tcl relate to the scheduling process, the names of the output files, the simulation makefile (Makefile.sim), and the simulation configuration.

5-1-3 Exploring Designs and Micro-architectures
In this step, micro-architectural decisions are made. The Task Window table in the GUI lists the items for which micro-architecture decisions must be specified. The red items in Figure 5.2, namely Combinational Loops, Arrays and Functions, require attention before proceeding with the synthesis. Clicking on a red item opens a new window that shows the required actions (Cadence, May 2012).

Figure 5.2 CtoS task window

Functions
Using function calls makes the code comprehensible and allows parts of it to be reused in other places. In software, when a process calls a function, the processor registers need to be stored in memory and restored after the function has executed. Writing a piece of code as a function stores only one copy of it, and this copy is executed each time the function is called. There are two approaches to synthesizing functions: inlining them or keeping them as they are. With function inlining, every call to the function is replaced by a copy of the function body. In many cases, this leads to better synthesis results.
This is mainly because the function can be optimized in the context of the calling process, which also increases the opportunities for resource sharing. It may even lead to improvements in area. However, if a function is called frequently, inlining may increase the area. Thus, one needs to experiment with both approaches. For the synthesis of DERM, we preserve the functions that are called frequently or

require arithmetic computations inside. The rest of the functions are inlined. Figure 5.3 shows the functions that require inlining.

Figure 5.3 CtoS micro-architecture window, Functions

Combinational Loops
Several looping structures are available in C, such as:
- while () {}
- for () {}
- do {} while ()

The most common approach to resolving loops is loop unrolling. Complete unrolling replaces the loop with replicated copies of its body. This, however, is only possible when the number of iterations is known and static. With loop unrolling, all iterations are executed at the same time (in the same cycle); hence, the function is faster. This, however, increases the area. The other approach is pipelining. It enables the tool to share resources as well as improving the


latency and throughput of the design. Combining unrolling and pipelining yields several architecture possibilities with different latency and throughput characteristics. For the DERM synthesis, we unrolled all the innermost loops, as shown in Figure 5.4. By right-clicking on the loop name and selecting Show Input Source, the tool shows the location of the loop in the code.

Figure 5.4 CtoS micro-architecture window, Loops

Arrays
Arrays can be mapped to registers by using the Flatten command, or they can be stored in memory. Small arrays are often flattened, and big arrays are stored in external or on-chip RAM. Note, however, that the design specifications are important when deciding how to handle the arrays. For the DERM synthesis, we flatten all arrays except the my_array arrays, which represent the memory blocks in the code. These arrays are mapped to prototype memories. A prototype memory acts as a placeholder for the actual memory cell and thereby allows the CtoS scheduling step to complete. CtoS can create a prototype memory for an array, similar to using a built-in memory. Figure 5.5 illustrates the arrays that require attention, together with the needed actions.


Figure 5.5 CtoS Allocate IP window

State Adding
Outer loops and loops with a dynamic number of iterations still remain as combinational loops and require further attention. For such loops, one has to look at the CDFG and find out whether there exists a path without any state. If so, a state needs to be added before or after an op in that path. Figure 5.6 provides more details and shows the different components of the CDFG. These states can also be written in the SystemC input model by placing the wait() command in suitable places. The wait() command explicitly tells the tool to process the logic that follows it in the next clock cycle.


Figure 5.6 CtoS CDFG flow graph

Adding extra states is the last step in specifying the micro-architecture constraints. All the GUI commands can be found in the tool log file. The commands related to the architecture constraints are stored in micro_arch.tcl for future runs. However, most of them refer to specific lines of the code. Therefore, once the code has been edited, the old commands are no longer useful. The solution to this problem is to label the loops or to use a script that finds the operations and applies the architecture constraints. The following code, which is the latest version of micro_arch.tcl for the synthesis of DERM using CtoS, shows samples of such scripts:

    set_attr timing_criticality "high" [find -behavior *main]
    set_attr default_export_memories true [get_design]
    set_attr prototype_memory_launch_delay 0 [get_design]
    set_attr prototype_memory_setup_delay 15 [get_design]
    # inline everything except
    set func_calls [find -behavior *]

    foreach op $func_calls {
        if {!( [regexp {inside_dummy_bits} $op] || [regexp {inside_filler_bits} $op] || [regexp {get_bank} $op] || [regexp {get_add} $op] )} {
            inline $op
        }
    }
    unroll_loop [find_combinational_loops]
    set all_arrays [find -array *]
    foreach op $all_arrays {
        if {!( [regexp {my_array*} $op] )} {
            flatten_array $op
        }
    }
    set arrays [find -array *]
    foreach op $arrays {
        if {( [regexp {my_array*} $op] )} {
            allocate_prototype_memory $op
        }
    }
    create_state /designs/core/modules/core/behaviors/core_main/edges/forfork_ln46_1
    create_state /designs/core/modules/core/behaviors/core_main/edges/forfork_ln167_

5-1-4 Synthesis
Synthesis involves several steps, as outlined in Chapter 3, including scheduling and optimization of the results. Scheduling can be easily performed by setting a few options, as shown in Figure 5.6. As can be seen, the effort level is set to high and the rest of the options are kept at their defaults for the synthesis of DERM. After pressing OK, the tool attempts to schedule the processes and bind resources to them.


Figure 5.6 CtoS Scheduler setting

If errors occur in scheduling, one needs to return to the micro-architecture decision step. (Oftentimes, the scheduler cannot meet the timing.) There are two approaches to address this problem:
- One can enable the Relax Latency box in the scheduling behavior setting. This gives the tool permission to add states where necessary. In most cases, however, this is not the most effective way.
- The second approach is to first locate the source of the problem and then manually add enough states to enable the tool to pass the scheduling.

If a negative slack warning is issued during the synthesis, the back-end synthesis tool has to be run after generating the RTL. If it is unable to fix the issue, the synthesis process has to be re-run with new constraints until the problem is resolved.

5-1-5 Analyses of the Results
When the synthesis steps are finished, we need to analyze the results using the reports. These reports help the user to find the problem if the result does not match the predefined goals, and they can also be used for further optimization. One such report is the Summary Report, which is shown in Figure 5.7.


Figure 5.7 CtoS Summary Report

In the summary report, the following items are of significant importance:
1) Minimum Slack after Schedule (ps): If the negative slack is large, one needs to perform a critical path analysis. The critical path can be found in the reports on the task bar by selecting Timing and then clicking on Cycle Analysis; see Figure 5.8 for an example.

Figure 5.8 Critical path cycle analyses

As is evident from the critical path, negative slack is present in this case. This was solved after running the RTL compiler. Other solutions to the timing problem include using faster resources for the most time-consuming processes, and adding pipeline stages or extra states where negative slack exists.
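The relation between slack, clock period and extra states can be illustrated with a small calculation. The sketch below is a simplified model under stated assumptions: slack is taken as the clock period minus the combinational path delay (setup time and clock uncertainty are ignored), and a long path is assumed to be divisible into equal parts. The function names and the numbers in the usage note are hypothetical and are not taken from the CtoS reports.

```cpp
#include <cassert>

// Slack in picoseconds: positive if the path fits into one clock period,
// negative if the path is too slow for the clock.
inline long slack_ps(long clock_period_ps, long path_delay_ps) {
    return clock_period_ps - path_delay_ps;
}

// Minimum number of states (clock cycles) needed so that each segment of an
// evenly divided path fits into one clock period (ceiling division).
inline long min_states_for_path(long clock_period_ps, long path_delay_ps) {
    assert(clock_period_ps > 0 && path_delay_ps > 0);
    return (path_delay_ps + clock_period_ps - 1) / clock_period_ps;
}
```

For instance, a 2600 ps path under a 2000 ps clock has a slack of -600 ps and needs at least 2 states in this model, which corresponds to adding one extra state on that path.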


Another way to investigate the negative slack is to check the slack box in the CDFG, as shown in Figure 5.9. The timing can then be observed in the CDFG. It is straightforward to find paths with negative slack and add extra states to solve the timing issue.

Figure 5.9 CtoS timing report on CDFG

Other useful items in the summary report are Flip Flops (bits) and Muxes (bits). These two are more important in cases where we aim to reduce the size of the design. We can further optimize the code manually in cases where there are complex muxes or where the tool could not determine the maximum size of arrays or variables. The last item in the report that could be of interest is Area in the Tree Map, as shown in Figure 5.10. The most important data we can obtain from this map are the rough area calculation, broken down by resources, and the shared components, which are highlighted in green.


Figure 5.10 CtoS Area Tree Map

5-1-6 Output and Verify
The last step in the HLS process with CtoS is RTL generation and verification. CtoS can further generate script files for the Sequential Equivalence Checker (SLEC) and the RTL Compiler (RC). This makes verification and back-end synthesis easier. The Generate RTL window and its features are illustrated in Figure 5.11. In this window, we can control and name the output files and specify the path where they are stored. After pressing OK, the tool generates the RTL and directive files (Cadence, May 2012).


Figure 5.11 CtoS Generate RTL window

Finally, the generated RTL has to be verified. In this project, verification is performed by running the RTL simulation for all available test vectors. The results are then compared with the expected values. Waveform simulation is also performed to check the functionality of the extra interface signals. Figure 5.12 illustrates the waveforms of the input and output signals for the generated RTL. In this figure, we observe the input-output data and interfaces and can easily compare them with the expected values.

Figure 5.12 Generated RTL waveform simulation
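The comparison of simulation results against expected values can be sketched as follows. This is an illustrative helper only, not the test bench used in the project: the function name and the use of plain int samples are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first sample where the RTL output differs from
// the expected value, or -1 if the two sequences match completely.
// A length mismatch counts as a difference at the shorter length.
long first_mismatch(const std::vector<int>& rtl_out,
                    const std::vector<int>& expected) {
    const std::size_t n =
        rtl_out.size() < expected.size() ? rtl_out.size() : expected.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (rtl_out[i] != expected[i]) return static_cast<long>(i);
    }
    if (rtl_out.size() != expected.size()) return static_cast<long>(n);
    return -1;
}
```

Reporting the first mismatching index, rather than only pass or fail, points directly at the cycle to inspect in the waveform viewer.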

5-2 DERM HLS Using Catapult SL

The HLS flow using Catapult comprises six major steps, as summarized below (Calypto, 2012):

1- Writing and testing the C++ input model
2- Loading design files
3- Setting design constraints
4- Scheduling and analyzing the results
5- Generating and verifying the RTL
6- Comparing solutions

5-2-1 Writing and testing the C++ model

In the HLS process using Catapult, the designer starts by developing the behavioral code in pure untimed C++ without any concurrency. This is what we did when developing the framework described in chapter 4. That framework still requires specific modifications, as outlined below, before it can be used with Catapult (Fingeroff, 2008) (Calypto, 2012):

- Bit-accurate data types (AC data types) should be used for input and output ports and some internal variables
- AC channels are to be used for the input and output data streams
- The testbench should be modified by changing some of the syntax used for SCVerify

Using Catapult, we face the same problem as with CtoS when accessing memories. Therefore, the two-dimensional array is replaced with 8 one-dimensional arrays. Finally, the developed input model is verified before starting synthesis by running the OSCI simulation inside Catapult.

5-2-2 Loading design files

All source files, except header files, should be loaded into Catapult. Source files such as the testbench or test helpers should be marked as excluded in the loading process. Loading the design files can be performed either by running a script or in GUI mode. Figure 5.13 shows the loading window inside the Catapult GUI and how we exclude the testbench file from the synthesis process (Calypto, 2012).
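The replacement of a two-dimensional array by 8 one-dimensional arrays, mentioned in the model-writing step above, can be sketched as follows. This is a plain C++ illustration under stated assumptions: std::int8_t stands in for the bit-accurate AC types, the depth of 16 is a placeholder rather than the real memory depth, and the names Banks, write_mem and read_mem are hypothetical.

```cpp
#include <cstdint>

constexpr int NUMBER_OF_MEMORIES = 8;  // eight banks, as in the text
constexpr int DEPTH_OF_MEMORIES = 16;  // illustrative depth only

// Eight separate one-dimensional arrays standing in for M[8][DEPTH].
// Each array can then be mapped to its own single-port memory.
struct Banks {
    std::int8_t m0[DEPTH_OF_MEMORIES] = {};
    std::int8_t m1[DEPTH_OF_MEMORIES] = {};
    std::int8_t m2[DEPTH_OF_MEMORIES] = {};
    std::int8_t m3[DEPTH_OF_MEMORIES] = {};
    std::int8_t m4[DEPTH_OF_MEMORIES] = {};
    std::int8_t m5[DEPTH_OF_MEMORIES] = {};
    std::int8_t m6[DEPTH_OF_MEMORIES] = {};
    std::int8_t m7[DEPTH_OF_MEMORIES] = {};
};

// The switch makes every access name exactly one array explicitly.
void write_mem(Banks& b, int mem, int addr, std::int8_t value) {
    switch (mem) {
        case 0: b.m0[addr] = value; break;
        case 1: b.m1[addr] = value; break;
        case 2: b.m2[addr] = value; break;
        case 3: b.m3[addr] = value; break;
        case 4: b.m4[addr] = value; break;
        case 5: b.m5[addr] = value; break;
        case 6: b.m6[addr] = value; break;
        default: b.m7[addr] = value; break;
    }
}

std::int8_t read_mem(const Banks& b, int mem, int addr) {
    switch (mem) {
        case 0: return b.m0[addr];
        case 1: return b.m1[addr];
        case 2: return b.m2[addr];
        case 3: return b.m3[addr];
        case 4: return b.m4[addr];
        case 5: return b.m5[addr];
        case 6: return b.m6[addr];
        default: return b.m7[addr];
    }
}
```

The code is more verbose than indexing M[mem][addr], but each array is a distinct object, which is what allows a tool to assign it to a separate memory resource.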

Figure 5.13 Catapult Add Input Files window

5-2-3 Setting design constraints

This process has several steps, as summarized below:

1- Select process level handshake (Transaction Done, Start/Done, Reset behavior): Transaction Done is selected as the process level interface for the synthesis of DERM.
2- Select functionality hierarchy: py_derm is selected as the top of the hierarchy and the rest of the functions are left inline.
3- Select synthesis library: The characterized library is generated using Catapult Library Builder. We then select the characterized library with single-port SRAM.
4- Set clock frequency.
5- Specify micro-architecture constraints: These constraints can be applied to the I/O interfaces, loops, and storage. By modifying these constraints, the designer explores various solutions. In the synthesis of DERM with Catapult, these constraints are set as described below:

- Interfaces Resource Type: The interface resource types are left as default. The only exception is the data stream, for which the full handshake protocol is selected.
- Arrays Resource Type: All internal arrays are mapped to registers, while my_arrays, which represents the memories in our code, is mapped to RAM_SinglePort.
- Loops unrolling / Pipelining: All inner loops are unrolled; loops with a dynamic number of iterations and their outer loops are pipelined. As mentioned before, we have three main loops: 1) writing systematic bits to the memory; 2) writing p1 and p2 to the memory; and 3) reading the output data from the memory. In view of the algorithm, it

can be seen that the first loop accesses the memories once in each clock cycle, so it can be pipelined with II = 1 (Initiation Interval). In other words, one clock cycle elapses before the next loop iteration starts. The second main loop, which writes p1 and p2, is pipelined with II = 2 since it accesses the memories twice in each iteration. Finally, the third loop is pipelined with II = 1 because of the single access to the memories in each iteration. Figure 5.14 shows the loop table inside the Catapult GUI.

Figure 5.14 Catapult architectural constraints window

6- Specify resource constraints: Resource constraints are left as default to give the tool maximum freedom. In this step, the user can edit the characteristics of resource components such as Adder, And, Equal, MUX, etc. These characteristics are area and delay.

5-2-4 Scheduling and analysis of the results

By running the schedule, Catapult allocates operations to clock cycles. This is an automatic process and is discussed in chapter 2. At the end of the scheduling process, a C-Step Gantt chart is generated. The Gantt chart provides full insight into loop profiles, algorithmic dependencies and functional units in the design by indicating the states of the finite state machine; simply put, more C-Steps in the schedule translate to more states in the RTL FSM. With close scrutiny of this Gantt chart, the designer can easily obtain information about the components shared among different C-Steps. It also helps to track down problems when unexpected performance results from the way operations are assigned to C-Steps. For example, if we decide to access 8 memories in the same clock cycle, we should look at the Gantt chart and check whether all 8 memory accesses are in the same C-Step. Figure 5.15 depicts the C-Step Gantt chart focused on the memory write part of the first main loop.
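The choice of II for the three main loops follows from how many accesses a single-port memory can serve per cycle. The sketch below expresses that reasoning in plain C++; it is not Catapult's scheduling algorithm, the function names are hypothetical, and the cycle formula assumes an ideal pipeline without stalls.

```cpp
// Smallest initiation interval allowed by memory ports alone:
// ceil(accesses per iteration / accesses the memory can serve per cycle).
int min_ii_for_memory(int accesses_per_iteration, int ports) {
    return (accesses_per_iteration + ports - 1) / ports;
}

// Cycles taken by a pipelined loop: a new iteration starts every `ii`
// cycles and the last iteration still needs `depth` cycles to complete.
int pipelined_loop_cycles(int iterations, int ii, int depth) {
    if (iterations <= 0) return 0;
    return (iterations - 1) * ii + depth;
}
```

With one access per iteration to a single-port memory the bound is II = 1, and with two accesses it is II = 2, which matches the settings chosen for the first and second main loops.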

Figure 5.15 Catapult Gantt chart

5-2-5 Generating and verifying the RTL

Once the proper architecture constraints are specified and the scheduling process has passed, the RTL can be generated. Catapult generates RTL in VHDL and Verilog. It also generates a wrapper that simplifies verification by allowing the same test bench to be used for the generated RTL and the input model, as well as Makefiles to perform compilation for waveform simulation and back-end synthesis. The RTL is verified by running the waveform simulation in Questasim so as to observe functionality and interface behavior. The verification can also be done using SLEC, which integrates well with the verification of Catapult's results.

5-2-6 Comparisons of Solutions

Report Analysis: The output files include the reports. It is necessary to look at the reports and compare them with the expected results. They also help to find the parts that could be further optimized. Four files are included in the report directory, as follows:

Commands: This file includes all commands used in the synthesis process from the first to the last step. By saving and editing this file, one can re-run the synthesis process easily and quickly. Figure 5.16 depicts the Catapult GUI showing the Command report.


Figure 5.16 Catapult command window

Messages: In this report, we can find all messages that are printed in the transcript during the HLS process.

Cycle: This report mainly contains information about the total iterations of the loops and the throughput and duration of the processes, in both time and clock cycles. The cycle report generated for the DERM synthesis is illustrated in Figure 5.17. As can be seen, the total number of cycles is 120. However, the waveform simulation reveals that this is not accurate and the precise value is


Figure 5.17 Catapult cycle report

RTL: This report specifies the various components used in the RTL and explains their characteristics and quantity. One of the important parts of the RTL report is the critical path timing report. This timing will be more accurate after RTL synthesis. At this point, negative slack is acceptable. The critical path can also be seen in schematic form; see Figure 5.18. These figures help to find the timing bottleneck. The negative slack is usually negligible and is resolved during back-end synthesis.


Figure 5.18 Catapult RTL report

Comparison of Solutions: The General report shows the characteristics of the different solutions, such as latency and throughput. The general report for the synthesis of DERM using Catapult is shown in Figure 5.19. The numbers in this report are calculated conservatively using the characterized library


information. For more precise estimates, back-end synthesis and waveform simulation should be performed for area calculation and latency evaluation, respectively.

Figure 5.19 General report from the Catapult GUI for several solutions

The area estimates can be compared across solutions using the bar chart, which breaks down the area according to various criteria. The bar chart for the different solutions of DERM is shown in Figure 5.20. This bar chart helps considerably when we have numerous solutions and want to compare their areas in detail for different fragments.

Figure 5.20 Catapult Bar Chart for area report

5-3 DERM HLS by Synphony C Compiler

Synphony is the last HLS tool evaluated for the synthesis of DERM. As mentioned earlier, this tool accepts pure C/C++ code as input. The original framework that we developed was also in C++; hence, one only needs to replace the data types (with bit-accurate data types) and the input/output channels. The test bench is further modified to take the test vectors as a command line argument:

- example: make; ./run_py-derm /vobs/asic/cab/pyradonis_core/tb/tv/lte/1_2/

- set_implementation_params -cexec_args /vobs/asic/cab/pyradonis_core/tb/tv/lte/1_2/

This makes it easy to run the complete set of test vectors either on the C++ source code or on the generated RTL without having to re-compile or re-synthesize the design.

5-3-1 Configure a project

As the first step, the project parameters should be set in the GUI as follows: Select Project Parameters Text Editor. Like other synthesis tools, these commands can be put together in a *.tcl file and then sourced in GUI or batch mode. Subsequently, the project files have to be loaded. The project files comprise:

- C/C++ code to be synthesized
- C/C++ test bench code
- Test bench input files
- Test bench reference output files

Files are loaded into the design implementation in the configuration process. In this step, we load the files and specify their type, such as Source, Header, Data or Results. A sample *.tcl file, which contains all commands from the initial to the final step of synthesis, is shown below (Synopsys, 2012):

    set env(ncv_path) /proj/asic/tools/cadence/incisive/incisiv_ /tools
    set_project_params -sources "derm.cpp main.cpp"
    set_project_params -headers "scc_types.h derm.h"
    set_project_params -results "decoder_ulma_cmodel_soft_values.ascii"
    if {[info exists env(derm_frequency)]} {
        set frequency $env(derm_frequency)
    } else {
        set frequency 400
    }
    set implementation imp-py_derm
    if { [file exists ${implementation}] } {
        delete_implementation ${implementation}
    }
    create_implementation ${implementation}
    set_implementation_params -appfiles "derm.cpp"
    set_implementation_params -proc py_derm

    set_implementation_params -clock_freq ${frequency}
    set_implementation_params -system_port_name "clock:clk"
    set_implementation_params -cexec_args "/vobs/asic/cab/src/pyradonis_core/tb/tv/lte/1_2/"
    # get maximum performance (no dead cycles between loops)
    set_implementation_params -continuous_processing always
    set_implementation_params -task_overlap never
    # memory path delay
    set_implementation_params -memory_return_path_external_delay 0.50
    set_implementation_params -memory_forward_path_external_delay 0.50
    # schedule exit conditions as early as possible
    setvar -a synthesize_auxopts "-Fsched_early_exit_asap=yes"
    # II sharing of operators >= 12 bits (default is 16)
    setvar -a synthesize_auxopts "-Fdata_width_for_II_resource_sharing=12"
    csim -golden
    #exit
    preprocess
    csim -preprocess
    schedule
    csim -schedule
    synthesize
    create_rtl_package
    vlogsim -offline -sim ncv -dump_vcd -detailed_perf_report
    # create DC synthesis scripts
    enable_scc_tcl_lib
    syn_setup -clock 705
    # edit sdc file to add specific constraints
    set filename ${implementation}/rtl_package/synth/synopsys/synopsys.sdc
    if {[catch {open ${filename} a} fileid] != 0} {
        return -code error ${fileid}
    }

    puts ${fileid} { set_max_transition 0.3 [all_outputs] }
    puts ${fileid} { set_max_transition 0.3 [remove_from_collection [all_inputs] \"${SCC_CLOCK_PORTS} reset\"] }
    puts ${fileid} { set_max_fanout 12 [remove_from_collection [all_inputs] \"${SCC_CLOCK_PORTS}\"] }
    close ${fileid}

5-3-2 Configuring an Implementation

In this step, we need to specify the implementation name, the application file's name and important information such as the clock frequency and technology library. Here, we use the sample library as the target library; however, the Design Compiler (DC) library can be characterized for Synphony. This is a one-time job but may take a few days. Instead, we decided to use a frequency sweep, which will be introduced later.

5-3-3 Building the design

In this step, we specify the architecture constraints. These constraints are partly given in the run.tcl file, which relates to the general process. The remaining constraints, e.g., loop unrolling, pipelining, and array mapping, should be placed in the source code file as #pragmas; see below for an example:

    #pragma ii 1
    for (col = 0; col < NROF_SUB_BLOCK_INTERLEAVER_COLUMNS; col++) {
    #pragma num_iterations(1,,32)
        for (row_index = 0; row_index < ((ci.nrof_rows + NUMBER_OF_MEMORIES - 1)/NUMBER_OF_MEMORIES); row_index++) {

The designer needs a complete understanding of the design architecture in order to specify micro-architecture constraints in the design code. It is naturally frustrating to remember the various pragmas and their proper usage in the initial steps of the design. Beginners are thus advised to obtain suggestions from the tool (similar to the HLS process using CtoS and Catapult). In our last experiment with Synphony, we unroll the entire inner loop; outer loops are pipelined with II = 1. A two-dimensional array is used in this version to specify the user-supplied memory. The directives used for this setup are given below:

    int8 M[NUMBER_OF_MEMORIES][DEPTH_OF_MEMORIES];
    #pragma user_supplied M

    #pragma no_dependence M write:write_parity_loop write:write_parity_loop

The second pragma explicitly states that the two write loops have no dependency. This directive enables us to use a two-dimensional array instead of 8 one-dimensional arrays.

5-3-4 Run the flow

After editing the code and inserting pragmas as directives to the tool, we proceed to run Synphony to generate the RTL. The RTL was successfully generated with a 500 MHz clock and the sample library. In our experiment with Synphony, we did not characterize the library. Instead, we used the frequency sweep approach to find the clock frequency for the HLS sample library that generates the best RTL in terms of area after back-end synthesis with the actual frequency and library. We targeted various clock frequencies ranging from the 1st frequency to the 10th (values removed at Ericsson's request). The results obtained from running DC are presented in Table 5.1.

Table 5.1 Frequency sweep results for the generated RTL after back-end synthesis (columns: HLS process frequency with the 65 nm library; resulting area from DC with the actual library and frequency; values removed)

5-3-5 Analyzing the Design

After generating the RTL, the design has to be analyzed by inspecting the reports and comparing the expected results with those obtained. These reports can be seen in the GUI and are located under the implementation. The analysis process comprises four important steps, as follows (Synopsys, 2012):

1- Sequential graph analysis: This graph illustrates dependencies among design components such as loops, memories, and registers. The graph for the DERM block is shown in Figure 5.21. As is evident in the diagram, L0 and L1, the first two loops, read the data from the input stream and write them into M, which is the memory. The L2 loop

reads the memory and sends it to the output stream. Therefore, L2 cannot be initiated before L0 and L1 have finished.

Figure 5.21 Sequential graph generated for DERM by Synphony

2- Resource analysis: Information regarding the resources is included in two reports: report_cost.txt and report_register.txt. These two reports are used for area optimization. The register report is the most important report during the area optimization process since it provides information about the number and width of registers. The designer has to compare this report with the expectations on variable widths and the number of registers.

3- Dynamic performance graph analysis: This graph provides information about the latency of each process and how the processes are pipelined. Figure 5.22 illustrates the various processes inside the DERM and their latency. As can be clearly seen in the figure, it is impossible to process subsequent packets before the current packet has been processed. Therefore, the main loop cannot be pipelined, i.e., we only pipeline L0, L1 and L2.

Figure 5.22 Performance graph generated by Synphony for DERM

4- Other useful reports that could assist in the analysis are listed below:

- report_summary.txt: This report summarizes the most useful information about the design. The final report for DERM HLS is provided below. This report contains information about the design, the loop latencies, and how the loops are pipelined.

Local memories:

    Memory name (ID)                              Description   Parameters               Indexed ports   Bandwidth        Arbitration
                                                                (words,width,#buf,lat)   (#RO,#WO,#RW)   (#read,#write)   (#reqstrs,#ports) RO WO RW
    SUB_BLOCK_INTERLEAVER_PERMUTATION_TABLE (14)  (S,-,ROM,I)   (32,6,2,0)               (1,0,0)         (2,0)
    M[0] (3)                                      (U,A,CRW,E)   (2318,8,1,1)             (0,0,1)         (8,16)
    M[1] (4)                                      (U,A,CRW,E)   (2318,8,1,1)             (0,0,1)         (8,16)
    M[2] (5)                                      (U,A,CRW,E)   (2318,8,1,1)             (0,0,1)         (8,16)
    M[3] (6)                                      (U,A,CRW,E)   (2318,8,1,1)             (0,0,1)         (8,16)
    M[4] (7)                                      (U,A,CRW,E)   (2318,8,1,1)             (0,0,1)         (8,16)
    M[5] (8)                                      (U,A,CRW,E)   (2318,8,1,1)             (0,0,1)         (8,16)
    M[6] (9)                                      (U,A,CRW,E)   (2318,8,1,1)             (0,0,1)         (8,16)
    M[7] (10)                                     (U,A,CRW,E)   (2318,8,1,1)             (0,0,1)         (8,16)

    'Memory description' is a quadruple (U/S, A/N, CRW/SRW/ROM, I/E): U=>From-user, S=>Synthesized, A=>Artisan, N=>Native, CRW=>Combined read-write ports, SRW=>Read-only and write-only ports, ROM=>ROM ports, 'I'=>Internal, 'E'=>External.
    'RO'=>Read-only, 'WO'=>Write-only, 'RW'=>Read-write. '*'=>Existence of raw ports
    '#bufs' w.r.t ROM => multi-port ROM requirement met by multiple single-port ROM instances

Streams:

    Name                          ID   Property           Width   Length   Source Position
    py_derm::dst2user_if:input    0    input(passthru)                     derm.cpp:60:38
    py_derm::user2src_if:output   1    output(passthru)                    derm.cpp:61:38

Loop Statistics:

    Loop Name   II Achieved   Schedule Length   Target Iterations   Task Latency   Pipelining Loss Between Tasks
    PA
    PA
    PA

    Task Processing:
    Task Overlap : No
    Continuous_processing : Yes

    Reasons for PA Pipelining Loss Between Tasks :
    PA0 task pipelining is turned off
    PA1 task pipelining is turned off
    PA2 task pipelining is turned off

    Synphony C Compiler estimated cost: gates

- report_interface.txt: This report includes information about the interfaces of the generated RTL modules. The designer needs to check whether the generated interfaces satisfy expectations.
- report_constraints.txt: This report lists the design constraints used during synthesis.
- report_interface_timing.txt: This report provides information about the delays from input ports to output ports.
- report_feedback_path.txt: Feedback paths, also available from the GUI (Feedback Path Viewer). This report helps to see the timing and slack in various parts of the design, which is useful for performance optimization.
- report_pseudo_code.txt: Pseudo C code describing the generated pipelines, also available from the GUI (Operation Schedule Viewer).
- report_address_map.txt: Mapping of memory-mapped variables, if any.

5-3-6 RTL verification

The tool verifies the results in various steps of the synthesis process, as stated in chapter 3. However, as with the other tools, the generated RTL is verified via waveform simulation for assurance.

Chapter 6: Cache Controller HLS

In this chapter, a sample cache controller is used for the HLS process. This enables us to evaluate the power of the HLS tools in the synthesis of control flows.

6-1 Cache Controller Architecture and model development

The specifications of this memory cache controller are listed below:

- Receives read/write requests from host
- Checks if the corresponding data is already in cache
- Sends requests to Direct Memory Access (DMA) in case of a miss
  - Write request to flush a dirty cache line
  - Read request to fetch a cache line
- Waits for DMA acknowledgement in case of a miss
- Sends the cache line index to host

Remarks:

- The block only contains the controller, and not the actual cache memory.
- In reality, there exist several requesters; however, the initial template implements a single agent.

The structure of the code for this design is presented below:

    while(1) {
        if (in.nb_get(host_req)) {
            // check if hit or miss
            if (hit) {
                // set dirty bit if write access
            } else {
                index = get_index(); // cache replacement
                while(1) {
                    if (dirty) {
                        // send write request to DMA

                        dirty = false;
                    } else {
                        // send read request to DMA
                        break;
                    }
                }
                // update tag memory
                // set dirty bit if write access
            }
            inter.put(info); // forward to next loop
        }
    }

    while(1) {
        if (inter.nb_get(info)) {
            if (!info.hit) {
                // get ack from DMA
            }
            out.put(); // send cache line index to HOST
            break;
        }
    }

The block diagram of the two main while loops and their interaction with the Direct Memory Access (DMA) and HOST is shown in Figure 6.1.

Figure 6.1 Cache controller loop architecture

6-2 Cache Controller HLS with Synphony C

The cache controller is synthesized with Synphony using the following architecture constraints and directives:

Performance Constraints

- All loops are running at II = 1
    #pragma ii
- Auto Start is enabled to restart the design automatically
    set_implementation_params auto_start_ppa yes
- Continuous processing is enabled to obtain full pipelining
    set_implementation_params -continuous_processing always
- Non-blocking stream calls are used to get a drainable design

Scheduling Constraints

- One cycle delay between DMA ack. and HOST ack. (could be more)
    set_implementation_params -wait "get_dma_ack:put_host_ack:1"

FIFO Depth

- Indicates the number of requests that can be buffered
    #pragma fifo_length inter 2
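The combination of non-blocking stream calls and a FIFO depth of 2 can be illustrated with a small software model. This is only a sketch of the semantics, not the Synphony stream class: the class name BoundedFifo and the method nb_put are hypothetical, and nb_get merely mirrors the call name used in the code structure above.

```cpp
#include <array>
#include <cstddef>

// Software stand-in for a bounded stream with non-blocking access.
template <typename T, std::size_t Depth>
class BoundedFifo {
public:
    // Returns false instead of stalling when the FIFO is full.
    bool nb_put(const T& value) {
        if (count_ == Depth) return false;
        data_[(head_ + count_) % Depth] = value;
        ++count_;
        return true;
    }

    // Returns false instead of stalling when the FIFO is empty.
    bool nb_get(T& out) {
        if (count_ == 0) return false;
        out = data_[head_];
        head_ = (head_ + 1) % Depth;
        --count_;
        return true;
    }

private:
    std::array<T, Depth> data_{};  // storage for at most Depth requests
    std::size_t head_ = 0;         // index of the oldest element
    std::size_t count_ = 0;        // number of buffered elements
};
```

With a depth of 2, the first loop can forward two requests before the second loop has consumed any; a third request is refused until one entry has been read.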

6-3 HLS Results from Synphony C

The design passed synthesis and the RTL was generated. The architecture of the design from the Synphony C report is shown in Figure 6.2.

Figure 6.2 Synphony C report for design architecture

The cost of this design and its latency are reported below:

    Application function: cmc
    TASK_II: 0 (unconstrained)
    Clock: 250MHz
    Techlib: artisan_tsmc65gp-rvt_adv10
    Base address: 0x10000
    Delivered Task II [vlogsim -offline] (avg): 2
    Delivered task latency [vlogsim -offline] (avg): 3
    Delivered throughput [vlogsim -offline] :

    Loop Statistics:
    Loop Name   II Achieved   Schedule Length   Target Iterations   Task Latency   Pipelining Loss Between Tasks
    PA
    PA

    Task Processing:
    Task Overlap : Yes
    Continuous_processing : Yes

    Reasons for PA Pipelining Loss Between Tasks: None
    Synphony C Compiler estimated cost: 2598 gates

There are some possible modifications to improve the functionality of the cache controller:

- The current design only handles one requester
  - The code can be modified to add stream arbitration (input) and stream demux (output)
- The current design implements a simple clock Least Recently Used (LRU) cache replacement algorithm
  - The current code is not optimal for the hardware if the number of lines increases
  - Other algorithms/implementations are possible

Finally, the design is verified using RTL simulation. The results are illustrated in Figure 6.3.

Figure 6.3 Waveform simulation of generated RTL for cache controller
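The thesis does not list the replacement code itself. As an illustration of what a clock replacement policy does, the following is a generic sketch of the clock (second-chance) algorithm, which approximates LRU; the class and method names are hypothetical and the code is not taken from the design.

```cpp
#include <array>
#include <cstddef>

// Clock (second-chance) replacement over a fixed number of cache lines.
template <std::size_t Lines>
class ClockReplacer {
public:
    // Mark a line as recently used (on a hit or after a fill).
    void touch(std::size_t line) { referenced_[line] = true; }

    // Pick a victim: advance the hand, clearing reference bits, until a
    // line whose bit is already clear is found. Ends within two sweeps.
    std::size_t victim() {
        for (;;) {
            const std::size_t line = hand_;
            hand_ = (hand_ + 1) % Lines;
            if (!referenced_[line]) return line;
            referenced_[line] = false;  // give the line a second chance
        }
    }

private:
    std::array<bool, Lines> referenced_{};  // all bits start cleared
    std::size_t hand_ = 0;                  // current clock hand position
};
```

The sweep in victim() has a data-dependent number of iterations, which suggests why such code becomes less suitable for hardware as the number of lines grows.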

Chapter 7: HLS Results Analysis and Tools Evaluation

In the previous chapters, the DERM was introduced and its implementation with HLS was explained in detail for the various HLS tools. In this chapter, we evaluate the tools based on two quantitative and several qualitative metrics. These metrics are summarized below:

Quantitative:

- QoR
  - Implementation performance (latency)
  - Implementation area (size)
- Code development time and code lines
  - HLS input development time
  - Number of code lines for the input

Qualitative:

- Ease of use
- HLS tool running time
- Verification time
- Visibility
- Controllability
- User learning curve
- Reports
- Maturity

This project aimed at implementing the DERM block using HLS tools with the same interfaces (on the input and output side), functionality, throughput and latency as the hand-coded RTL. We use the quantitative metrics to compare the results with the hand-coded RTL. The qualitative metrics are compared among the tools based on the user experience in this project.

7-1 Quantitative Metrics Comparison

7-1-1 QoR (Latency and Area) Comparison

In order to evaluate the QoR of the generated RTL code, the latency of the block when processing one packet of a test vector is measured and compared with the hand-coded RTL. Some tools include the latency in their reports; however, more accurate results can be obtained from the waveform simulation by counting the number of clock cycles. The test vector for the TLM model is used for the waveform simulation.

The second quantitative metric in our QoR evaluation is the silicon area required for implementation, which can be calculated by running the Synopsys Design Compiler (DC) with the desired ASIC technology library and clock frequency. To obtain more accurate results, the desired technology library is used during synthesis with CtoS and Catapult. However, in synthesis with Synphony C, due to the lack of a library builder license, library characterization is replaced with the time-consuming frequency sweep process, as explained in the previous chapters. Table 7.1 compares the results of back-end synthesis for the RTL generated for DERM by the HLS tools with the hand-coded RTL. The latency results are for one of the test vectors only, taken as a sample, and differ for other test vectors.

    Metric                 Tool C      Tool B      Tool A      Manual RTL
    Block Latency          -           -           -           -
    Logic. Area (µm²)      -           -           -           -
    Comb. Area (µm²)       -           -           9615        -
    Gate Count             31 kgates   33 kgates   23 kgates   28 kgates
    Comb. Gate Count       26 kgates   23 kgates   13 kgates   26 kgates
    Number of Reg.         -           -           -           -
    Number of gated Reg.   -           -           -           -
    Gated registers (%)    -           44 %        98 %        98 %

Table 7.1 Back-end synthesis results for the generated RTL vs. hand-coded RTL

In the test vector, the variable nrof_rows = 2 and, according to the DERM algorithm, we write the same number of values as nrof_rows to the SRAM in each clock cycle. That is, 3*32 = 96 clock cycles

are required to write the S, P1 and P2 bytes to the SRAM and (3*2*32) / 8 = 24 clock cycles to read out the SRAM. The total writing and reading latency is 120 clock cycles in the best-case scenario, where no time is spent on reading the input channels and calculating the SRAM addresses. We obtained the interval of clock cycles from the HLS tools, which is close to our reference, the hand-coded RTL, which takes 128 clock cycles. This difference is retained when using different test vectors (different input data sizes).

The area figures also differ among the tools. However, it is evident that the HLS tools optimize the combinational logic better, at the cost of using more registers. Tool C proved to provide the best register usage and sharing among the tools. On the other hand, Tool A yielded well-designed hardware in terms of size, particularly for the combinational section. It, however, used three times more registers than the hand-coded RTL.

The experiment with DERM synthesis revealed that higher QoR can be achieved by first investing time in learning the complete set of tool features and their functionality, and then spending time on optimizing the code structure specifically for each tool. In this evaluation, the same code is used for all tools and the hand-coded RTL is considered as our reference for latency and area. (It would be a very challenging task for an inexperienced user to obtain acceptable results with no reference.) Tool C yielded the best result in terms of area after time was invested in optimization. Below, we show via an example how a small change in the code can lead to improvements in the QoR of Catapult.

After running DC, we received the following results for Tool B:

    Design: py_derm
    Logic. Area (µm²): µm² (excl MBIST)
    Comb. Area (µm²): µm²
    Number of memories: 0
    Active Area (µm²): µm²
    Gate Count: 46 kgates
    Comb. Gate Count: 30 kgates
    Number of Cells: 25 k
    Number of Comb Cells: 22 k
    Number of Nets: 25 k
    Number of Reg.: 2544
    Number of gated Reg.: 1722
    Gated registers (%): 67 %
    Comb. Gates per Reg Ratio: 12
    Number of clock gating cells: 38

In comparison with the other HLS tools, the area was bigger and the number of registers was larger. Investigating the register reports revealed that some arrays defined in the global scope resulted in a huge number of registers. We therefore defined these arrays in the local scope of the processes, which led to improvements in register sharing. The results obtained from this new version are provided below:

    Design: py_derm
    Max Internal Frequency: 705 MHz
    Logic. Area (µm²): µm² (excl MBIST)
    Comb. Area (µm²): µm²
    Number of memories: 0

    Active Area (µm²): 23911 µm²
    Gate Count: 33 kgates
    Comb. Gate Count: 23 kgates
    Number of Cells: 19 k
    Number of Comb Cells: 18 k
    Number of Nets: 20 k
    Number of Reg.: 1787
    Number of gated Reg.: 798
    Gated registers (%): 44 %
    Comb. Gates per Reg Ratio: 13
    Number of clock gating cells:

7-1-2 Input model development time

Most of the time spent on input development is usually dedicated to defining the architecture. In the DERM experiment, the architecture had previously been defined by the manual RTL designer. This architecture was used as the reference during the HLS input model development. If one starts the design from scratch using the HLS flow, the architecture should be defined at a higher level of abstraction, without considering hardware details such as pipeline stages, number of registers and sharing policy. This will shorten the input development process.

The other difference between the hand-coded RTL and the behavioral model is the size of the code. This affects the development cycle in two ways: first, it reduces the code development time; second, less code requires less time for debugging. Table 7.2 compares the size of the code for the various versions of the HLS input and the hand-coded RTL. Here, we count the code lines in both *.h and *.cpp files, in addition to the #pragmas that are used in the code for Synphony. All code versions have identical functionality but different architecture, throughput and latency.

Table 7.2 Code size comparison

7-2 Qualitative metrics comparison

For this evaluation, C-to-Silicon Compiler 12.2, Catapult SL 2011a update5 and Synphony C Compiler G SP2 are used. Qualitative metrics are evaluated based on the experience of a new user.

Ease of Use

Although all these tools follow the same process, Tool C requires more coding effort, mainly due to the SystemC coding style. A SystemC wrapper is thus needed around the C++ RTL. During the synthesis process, Tool B provides the most user-friendly environment, with straightforward steps, especially for defining the architecture constraints. This makes it not only easy to use but also fast to work with. Tool A comes next, with clear steps for the synthesis and for specifying the architecture constraints. However, the architecture constraints and hints have to be written inside the code. This slows down the synthesis process when exploring various architectures. Besides, it is hard for a new user to remember all the pragmas while writing the code, which requires more frequent referral to the user manual during the synthesis. The last tool in this comparison is the Tool C compiler. It takes a unique approach to the analysis of the design by working with a particular model, and it uses a distinct way of specifying the micro-architecture constraints (loops). As a result, it requires more time to understand the design and make decisions about micro-architectures.

HLS Tool Running Time

This metric is the time that the tool requires for scheduling and generating the RTL. The fastest tool in the DERM experiment was Tool C, followed by Tool A and Tool B. (In Tool B we encountered several crashes in the start-up phase and during the process, for which we could not find the source.)

Verification time

All three tools take an identical approach in the verification phase: they either generate a wrapper around the DUT or generate the Verilog RTL from the C test bench. This allows simulating and verifying the generated RTL with the same test bench as the input C model. Tools C and B, however, produce a script file for running the Sequential Equivalence Checker (SLEC).
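The idea of reusing one test bench for both the input C model and the generated RTL can be sketched in C++ as follows. This is a minimal illustration under assumptions, not the flow of any of the three tools: the block under test (a saturating 8-bit adder), the names golden_model, rtl_wrapper_stub and run_testbench, and the stimulus vectors are all hypothetical. In a real flow the stub would be replaced by the tool-generated wrapper that drives the RTL simulation.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Any design under test is seen by the test bench through this interface.
using Dut = std::function<std::uint8_t(std::uint8_t, std::uint8_t)>;

// Behavioral C++ model: saturating 8-bit addition (hypothetical example block).
std::uint8_t golden_model(std::uint8_t a, std::uint8_t b) {
    const unsigned sum = static_cast<unsigned>(a) + static_cast<unsigned>(b);
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// Stand-in for a generated wrapper around the RTL. It is written in a more
// hardware-like style (9-bit sum, carry bit selects saturation).
std::uint8_t rtl_wrapper_stub(std::uint8_t a, std::uint8_t b) {
    const std::uint16_t wide = static_cast<std::uint16_t>(a + b);
    const bool carry = (wide & 0x100u) != 0;
    return carry ? std::uint8_t{0xFF} : static_cast<std::uint8_t>(wide & 0xFFu);
}

struct Stimulus {
    std::uint8_t a;
    std::uint8_t b;
};

// The single test bench: applies the same vectors to whichever DUT it is
// given and returns the number of mismatches against the behavioral model.
int run_testbench(const Dut& dut) {
    const std::vector<Stimulus> vectors = {
        {0, 0}, {1, 2}, {200, 55}, {200, 56}, {255, 255}};
    int mismatches = 0;
    for (const auto& v : vectors) {
        if (dut(v.a, v.b) != golden_model(v.a, v.b)) ++mismatches;
    }
    return mismatches;
}
```

Calling run_testbench(golden_model) and run_testbench(rtl_wrapper_stub) exercises both implementations with identical stimuli, so any mismatch points at the implementation rather than at a difference between test benches.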
Visibility

As mentioned before, Tool C introduces a unique way to show the architecture, which provides the user with more visibility into the design. Tool B is placed second because it shows the processes with C-Steps. Its way of showing the interfaces and schematics provides extra visibility of the design during the synthesis. A unique feature of this tool is a table that enables the user to compare the results of various versions (clicking on a version loads it automatically). Tool A is placed third. It provides acceptable visibility during the synthesis through the dynamic performance graph and the processes graph.

Controllability

Tool C provides the highest controllability during the synthesis process. This is mainly due to the benefits of the SystemC cycle-accurate interfaces. Tool B is placed second because previous designs can easily be reloaded and the micro-architectural decisions changed. The last in the ranking is Tool A, since the user needs to complete all processes of the synthesis to see the result, then edit the input code and re-run the whole HLS process until the desired results are obtained.

User Learning Curve

The easiest tool to learn is Tool B, owing to its simple ways of building interfaces, defining architectural constraints, and mapping memories. The second is Tool A, which demands more effort to learn the different pragmas and their functions. The last one is Tool C, which requires more time (particularly for a new user) to understand how the HLS architecture maps the design and deals with the loops. However, once that stage has passed, it provides good controllability and visibility and generates high-quality results.
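What it means for an architecture decision to live inside the source code, as with the pragmas of Tool A, can be sketched in plain C++. The example below does not reproduce any vendor pragma; instead it uses a hypothetical template parameter UNROLL as a stand-in for an in-source hint, and the function dot and the length kLen are likewise assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLen = 16;  // hypothetical vector length

// UNROLL stands in for an in-source architecture hint: it fixes how many
// multiply-accumulate operations belong to one outer-loop iteration.
template <std::size_t UNROLL>
std::int32_t dot(const std::array<std::int32_t, kLen>& a,
                 const std::array<std::int32_t, kLen>& b) {
    static_assert(UNROLL > 0 && kLen % UNROLL == 0,
                  "UNROLL must divide kLen");
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < kLen; i += UNROLL) {
        // Body of one outer iteration: UNROLL multiply-accumulates.
        for (std::size_t j = 0; j < UNROLL; ++j) {
            acc += a[i + j] * b[i + j];
        }
    }
    return acc;
}
```

Every choice of UNROLL yields the same numerical result, so the functionality is identical while the intended hardware structure differs. Because the choice is written in the source, trying another architecture means editing the code and repeating the whole synthesis run, which is the cost described above for tools that keep constraints in the code.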

Reports Accuracy

The area report is only an estimation based on the information of the characterized library. More accurate results can be obtained via the back-end synthesis. Tool B provided incorrect latency and throughput results when the data types of some variables were changed.

Table 7.3 ranks the tools based on the above qualitative metrics.

Metrics                     Tool C               Tool B              Tool A
Input Development Effort
Ease of Use                 2 (3 as beginner)    2                   3
Run Time                    1                    3 (many crashes)    2
Verification Time
Learning Time
Visibility
Controllability
Maturity

Table 7.3 Qualitative comparisons of the HLS tools

Chapter 8: Conclusions and Future Works

8-1 Contributions

In this thesis, three commercial HLS tools were evaluated and compared. The results validated the capability of HLS tools to generate RTL with at least the same performance as the manual RTL. The HLS design flow further proved to be capable of synthesizing designs with pure control flow. Using HLS tools in ASIC design shortens the design cycle and gives the designer the opportunity to explore various architectures and select the optimal design. The results further revealed the importance of carrying out architectural design as a part of the HLS process in the initial stages.

8-2 Conclusions

The area and latency of the generated RTL for DERM yielded promising results as compared with the hand-coded RTL. The HLS tools were able to generate results of at least the same quality as the hand-coded version. The experiment with DERM revealed that, by investing time, significant QoR gains can be achieved via a fairly easy procedure for optimizing the design. Here, this was evident since the hand-coded RTL was available. However, the advantage of using the HLS tools may be undervalued in new designs, where no benchmark is available. Our experiment with the cache controller further shows that the HLS tools are able to handle designs with both data flow and control flow.

Two design experiments with different HLS tools indicated that defining the architecture and hardware resources is the preliminary step in developing the model for HLS. This method, referred to as architecture-driven design, comprises the following steps (Fingeroff, 2008) (Philippe Coussy A. M.) (Brian Bailey, June 2010):

- Taking into account the hardware resources and structure
- Developing a coding framework to describe the desired architecture and interfaces
- Mapping the existing algorithm code into the framework to enable easier exploration and reuse, rather than modifying it incrementally
- Adding extra functions (temporary buffering in input and output) to have the same throughput as specified

8-3 Recommendation

Our evaluation of the three HLS tools advocates using Tool C. Although not the smallest, the generated hardware was the fastest and the most similar to the hand-coded RTL. As an example, the number of registers was close to that of the handwritten version, whereas Tools B and A generated too many registers (three times more) because of the many pipeline stages needed to meet the latency constraints.

Tool C further provides full controllability and visibility by illustrating the design. Using the SystemC model benefits the users in that the design can be optimized during the input model development by using exclusive features of SystemC such as time stamp printout. Finally, it supports Transaction-Level Modeling (TLM) driven designs. Therefore, the model developed for hardware design can be used in early software development stages.

8-4 Future works

1- Verification of the generated RTL: 50-80% of the effort in a new design relates to verification. HLS tools offer automated generation of both the RTL and the testbench. This implies that all test cases and test benches developed for the input C model can be used for the generated RTL. The HLS vendors claim that the generated RTL is correct by construction, so that one only needs to concentrate on verification at the C model level. In our experiment with DERM, however, we found that this claim may not hold at all times, and the generated RTL has to be verified thoroughly. The Sequential Equivalence Checker (SLEC) is a solution for the verification of the generated RTL. This tool is from Calypto and is able to perform an equivalence check between the SystemC/C++ source code and the generated RTL. If it works reliably, it can shorten the verification cycle and increase the verification quality.

2- Developing the same TLM code for HLS and Virtual System Platform (VSP): Figure 8.1 illustrates the flow from algorithm-level code development to the HLS input model. It shows that the VSP model requires refinement in order to become synthesizable. One can investigate the possibility of having one code that satisfies both the TLM and the implementation priorities. These priorities are as follows:

TLM priorities:
- Interoperability with other TLM-2 models
- Maximum simulation speed

Implementation priorities:
- All details needed for implementation
- Acceptable QoR

Figure 8.1 Flow from algorithm to HLS model

3- Complex systems with several blocks: Although the evaluation results are promising, our experiment with only two designs cannot cover all aspects of HLS. Synthesizing a complex system with multiple blocks will help to understand more deeply the advantages and disadvantages of using HLS tools.

4- Comparison of the generated RTL with the handwritten RTL: This helps to find out how the tool optimizes various logic and functions so as to improve the QoR of the block when replacing the hand-coded RTL. The generated RTL may not be easy to read; however, using the HLS tools, one can map the functions inside the C model to the generated RTL, which requires going through all parts of the code.
