Simulation and Exploration of LAURA Processor Architectures with SystemC

M.Sc. thesis of Feraaz Imami
July 9, 2009
Leiden Institute of Advanced Computer Science, Leiden University
Supervisor: Dr. Ir. A.C.J. Kienhuis
Second reader: Dr. Ir. T.P. Stefanov

Abstract

The COMPAAN/LAURA tool flow automatically maps digital signal processing applications onto reconfigurable platforms such as FPGAs. The COMPAAN tool generates a Kahn Process Network (KPN) description for a program written in Matlab. For this KPN, LAURA generates VHDL code that is synthesized into a description that programs the FPGAs; this synthesis step, however, takes a long time. Instead of VHDL, we can use SystemC as a new design methodology. It speeds up the design process because it allows faster exploration of different implementations with almost the same accuracy as VHDL. SystemC simulates faster because it raises the abstraction level at which hardware models are described. In this thesis, we created the IMASC tool; IMASC stands for Implementations of Multi-processor Architectures with SystemC. For every process in a Kahn process network, this tool automatically generates the cycle-accurate models of the Read, Execute, and Write units that make up the LAURA hardware processor model. We validated the correctness of these models and calibrated them against real hardware results for several applications. We found over 99% accuracy, with execution times a factor of 3 faster than realizations in VHDL. We also show a simple design space exploration in which we study the effect of different levels of pipelining, and we show how IMASC is used to simulate a heterogeneous KPN realization consisting of hardware and software components.

Acknowledgements

I would like to thank my supervisor Dr. Ir. Bart Kienhuis for his guidance and support during my research and development work at the Leiden Embedded Research Center (LERC), and for making it possible for me to do this in the unique environment he creates with people who share a common interest in core computer technologies. One of these people is Sven van Haastregt, whom I also want to thank for his technical advice and the insightful discussions we had. Further, I would like to thank Eyal Halm, who integrated the hardware accelerators into his design and thereby provided me with the information on heterogeneous networks. I would also like to thank Bin Jiang for his interest and the useful discussions we had. Finally, I would like to thank my parents, who always encouraged me, allowing me to realize my own potential.

Contents

1 Introduction
    Problem Definition
    Solution Approach
    Related Work
    Thesis Organization
2 Background
    Kahn Process Network MoC
    LAURA Tool
        Abstract Architecture Model
        LAURA Processor Model
    SystemC
        Transaction Level Modeling
        Register Transfer Level
        SystemC Design Methodology
        SystemC RTL Constructs
3 Implementation
    ToolFlow
    Running Example: merge2write
    Modeling the LAURA Architecture
    Read Unit
    Execute Unit
    Write Unit
    Control Unit
    Evaluate Logic Unit
    Counter Unit
    Adopting the SystemC sc_fifo model
4 Experiments & Results
    Experiment Setup
    Applications
    Show (featrier)
    Sobel Edge Detection
    QR-Decomposition
    Motion JPEG
    Results
    Calibration
    Design & Compilation Times
    Design Space Exploration
    Heterogeneous Networks
5 Future Work
6 Conclusions

Chapter 1

Introduction

With the advent of Systems-on-Chip (SoCs), new design methods are emerging to cope with the increasing design complexity of systems. These methods cut design time and enhance productivity for large and complex systems by providing means to detect design mistakes as early in the design process as possible, eliminating or reducing the need to cycle back to correct them. They also close the gap between the system level and the register transfer level (RTL), allowing simulation at higher levels of abstraction and refinement down to the RTL within the same environment. This allows Electronic Design Automation (EDA) vendors to provide system companies with a high-level specification of a design, on top of which these companies can add their value.

The current generation of hardware description languages is insufficiently equipped to deal with the increase in complexity of hardware design and system level design typically associated with SoCs. At present, hardware modules often co-exist on the same chip with processor cores, embedded software, and other complex intellectual property (IP) blocks. This requires designers to perform slow and inefficient co-simulation of hardware and software parts when trying to simulate the entire system together. A way to solve this is to raise the abstraction level to the system level and to enable hardware/software co-design. This has led to a new design methodology, which is referred to as SystemC.

SystemC is a modeling platform that can deal with the issues discussed above. SystemC is both a system level and a hardware description language. It is a hardware description language because it allows modeling the design at the RTL, and it is a system level specification language because it allows modeling at the algorithmic level. SystemC supports design abstractions at the system level, the behavior level and the RTL. It provides the means for hardware/software co-design, the ability to exchange IP easily and efficiently, and the ability to reuse test benches across different levels of modeling abstraction. SystemC is supported by the Open SystemC Initiative (OSCI) [1], a consortium of a wide range of system houses, semiconductor companies, IP providers, embedded software developers, and design automation tool vendors.

The LAURA tool automatically converts a process network, representing a digital signal processing (DSP) application, to a hardware implementation. Each process or node in the network consists of the LAURA processor model, executing a particular function. This LAURA multiprocessor network, described in a particular HDL, needs to cope with the current SystemC design methods involved in SoC design.

1.1 Problem Definition

The COMPAAN [2] framework automatically transforms digital signal processing applications, written in a subset of Matlab, into Kahn Process Networks (KPNs) [3]. The LAURA tool [4] takes this KPN specification as input and generates hardware descriptions in the VHDL language, which allows the modeling of hardware modules. After the hardware modules are modeled, one can simulate the design to verify its behavior. When the model shows the desired behavior, the VHDL description is synthesized and mapped to the gate level (see Figure 1.1).

[Figure 1.1: VHDL flow (System Specification, VHDL, Test Bench, Simulation, Synthesis, Simulation at Gate Level, Chip).]

There are some issues with the VHDL flow depicted in Figure 1.1. The problem is that synthesis to gates can take a large amount of time, especially when the design is large. After synthesis, simulation is very slow because every gate is simulated to guarantee the correctness of the model. Simulation can also be done before synthesis, but as designs grow larger, errors or design flaws can still be overlooked. Iterating back to the stage where the model needs to be re-designed is a time-consuming process that increases the design time, which is unacceptable, especially when time-to-market requirements are also tightening due to consumer expectations.

One can never be fully confident that all errors within the design are covered. Test benches therefore allow the designer to verify the functionality of the hardware model. A test bench is written separately, once the system specification is known. When design mistakes show up, not only do the VHDL descriptions need to be modified, but part of the test bench also needs to be rewritten.

VHDL hardware descriptions are sufficient when systems are composed primarily of discrete parts such as microprocessors, memory chips, analog devices and application-specific integrated circuits (ASICs). A specification for an ASIC of a few thousand to a few hundred thousand gates can be written in natural language and handed off to an ASIC designer or team, who would start by capturing the design at the RTL, for which VHDL is the perfect match. But current advances in computer technology allow the integration of two or more complex micro-electronic components onto a single die. Complex functionality that previously required heterogeneous components to be connected on a Printed Circuit Board (PCB) is now being integrated within one single silicon chip. Having multiple complex entities, i.e., hardware modules, on a single piece of silicon real estate is referred to as a System-on-Chip.

SoC design introduces new design methods that shorten the design time by raising the abstraction level to the system level. Since simulation time and design verification are important issues within the LAURA design as it is, we need to adopt these methods, which allow:

- Faster simulation.
- Hardware/software co-design.
- Architectural exploration.

To make the LAURA architecture part of these new advancements in computer technology, mainly SoCs, these new methodologies, in particular SystemC, need to be incorporated within the LAURA model.

1.2 Solution Approach

To equip the LAURA architecture with the possibility to shorten the simulation and design time, we use the C++-based hardware description language SystemC [1]. We use this methodology to create a cycle-accurate model of the LAURA hardware models. More generally, the methodology makes it possible to create cycle-accurate models of the software algorithms, hardware architecture, and interfaces of SoC and system-level designs. To develop the cycle-accurate models, we extend the COMPAAN/LAURA tool with a tool called IMASC, which automatically generates SystemC RTL hardware descriptions for a particular application. IMASC is an acronym for Implementations of Multi-processor Architectures with SystemC. To achieve this, we extended the Visitor Design Structure in the KPNFORMAT tool.

Because of the higher level of abstraction provided by generating SystemC, we expect designs to simulate at least two times faster than with VHDL models. In this thesis, we are interested in how the simulation and design times compare with those of the current LAURA designs based on VHDL. Furthermore, we show how we can perform a design space exploration, and we introduce a way to simulate heterogeneous multiprocessors within the same SystemC environment.

1.3 Related Work

This section discusses some of the work that relates to ours. Most of this work describes modeling architectures at higher levels of abstraction, whereas we allow the automatic generation of hardware models at the register transfer level, for fast simulation and design space exploration.
SystemC offers three entry points for a particular design: the system level, the behavior level and the RTL. We enter the design at the RTL and explore several architectures as part of a design space exploration.

SYSTEMCODESIGNER [5] tries to integrate behavioral synthesis into electronic system level (ESL) design space exploration tools. While SYSTEMCODESIGNER is able to provide hardware/software implementations from the behavior level, we only provide models that can directly be mapped to a hardware platform. To obtain synthesizable modules from the behavioral models, the authors of SYSTEMCODESIGNER integrated Forte's Cynthesizer, which translates behavioral SystemC to SystemC RTL. Both our approach and that of SYSTEMCODESIGNER aim at the automatic mapping of DSP applications to hardware platforms, e.g., FPGAs.

Another approach that includes automatic mapping of applications to hardware platforms is Koski, described in [5]. Koski differs from the COMPAAN/LAURA tool in that it takes a Kahn Process Network specification modeled in UML, which needs further refinement. With COMPAAN/LAURA, however, no further refinements are needed and the specification can directly be converted to a hardware platform.

A further approach for the automatic mapping of applications to hardware platforms is the Daedalus design flow [6]. Its design flow is comparable with the COMPAAN/LAURA tool. It automatically converts a C loop program into a Kahn process network. As in IMASC, this process network can be converted into a hardware model. There, processors and IP cores are instantiated as predefined components and FIFOs are connected between the processes. The LAURA tool does not have this notion of components, but generates fully elaborated VHDL. It relies on the Embedded Development Kit (EDK) from Xilinx to elaborate to a complete design. IMASC generates SystemC code to enable hardware/software co-design through heterogeneous multi-processor simulation. The benefit of our approach is the ability to prototype designs quickly.

In [7], Bombana and Bruschi discuss how to define, apply and evaluate a methodological framework and design flow allowing co-simulation of VHDL and SystemC modules. They accomplish this by satisfying the following:

- modeling the applications by mixing SystemC and VHDL modules in the model itself and in the test bench;
- applying synthesis from the RTL or the behavior level, for SystemC and VHDL models;
- simulating each representation of the model and co-simulating mixed representations;
- using commercially supported tools.

Within their proposed design flow, modeling starts at the behavior level, where the SystemC subset for synthesis is used as a high-level description of the application. After system level verification, they use the SystemC compiler to generate a VHDL netlist at the RTL. Co-simulation is performed to avoid the manual translation of the test benches. This is done by giving the test bench to an intermediate model called HDL Cosim, which creates an environment around the VHDL descriptions to allow their connection with the SystemC environment.

Integrating IP components into a given design is a very complex task in the whole reuse process. The ROSES [8] design environment introduces an IP integration approach that presents a unique combination of features that enhance IP reuse. One example is the automatic assembly of interfaces between heterogeneous software and hardware IP components. IMASC allows software IP components as predefined functions for easy integration of IP cores.

In IMASC, we do not apply any co-simulation with our hardware models. We also do not start the design at the behavior level; instead, we use a KPN as a high-level specification of how the design should be modeled in RTL.

1.4 Thesis Organization

In this section, we give an overview of how the thesis is organized. Section 2 discusses the Kahn process network Model of Computation (MoC) and the LAURA tool, explaining the generation of VHDL hardware descriptions. We also discuss the importance of SystemC at both the register transfer level and the transaction level. We conclude that section by introducing some of the SystemC language constructs we used. In Section 3, we show how we implemented the LAURA architecture using SystemC RTL. Section 4 presents the experiments we performed and the results we obtained by calibrating design points against real hardware implementations. We show how, and by how much, the design and compilation times of our hardware modules differ from those of the VHDL models. Design space exploration and the simulation of heterogeneous multi-processor networks are also discussed. In Section 5, we discuss future work, and we conclude the thesis in Section 6.

Chapter 2

Background

This section provides background information on some of the concepts we used to realize the IMASC tool.

2.1 Kahn Process Network Model of Computation

Streaming-based applications are usually modeled using a Model of Computation (MoC) pertaining to a specific application domain, e.g., audio, video and 3D multimedia applications such as encoding and decoding MPEG video streams. A commonly used MoC is the Kahn Process Network (KPN) Model of Computation, in which multiple parallel processes can process data simultaneously and communication goes through unbounded First In First Out (FIFO) channels. A KPN can be portrayed as a directed graph $G = (V, E)$, where $V = \{p_1, \ldots, p_n\}$ is the set of processes of the network and $E = \{e_1, \ldots, e_n\}$ is the set of edges, which are the FIFO channels (see Figure 2.1).

[Figure 2.1: Simple Kahn Process Network (processes p1, p2, p3; channels e1, e2).]
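
The graph view of a KPN can be made concrete with a small sketch. The C++ fragment below is only an illustration under stated assumptions, not code produced by COMPAAN or LAURA: the names `Channel`, `Process` and `run_pipeline` are hypothetical, the network is assumed to be a simple chain p1 -> e1 -> p2 -> e2 -> p3 (which may differ from the exact topology of Figure 2.1), the three process bodies are invented for the example, and a sequential round-robin loop stands in for truly parallel execution. A process advances only when a token is available on its input channel. The optional `capacity` argument (unbounded by default, as in the KPN definition) is an addition for experimenting with finite channels.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <limits>
#include <vector>

// A FIFO channel, i.e. an edge of the graph. Unbounded unless a capacity is set.
struct Channel {
    std::deque<int> tokens;
    std::size_t capacity = std::numeric_limits<std::size_t>::max();
    bool can_read() const { return !tokens.empty(); }
    bool can_write() const { return tokens.size() < capacity; }
};

// A process, i.e. a node of the graph. fire() attempts one step and
// returns true only if the process actually made progress.
struct Process {
    std::function<bool()> fire;
};

// Hypothetical chain p1 -> e1 -> p2 -> e2 -> p3, run until no process can advance.
std::vector<int> run_pipeline(
    const std::vector<int>& input,
    std::size_t capacity = std::numeric_limits<std::size_t>::max()) {
    assert(capacity > 0);  // a zero-sized channel could never carry a token
    Channel e1, e2;
    e1.capacity = e2.capacity = capacity;
    std::size_t next = 0;
    std::vector<int> output;

    Process p1{[&] {  // producer: emits the input tokens one by one
        if (next == input.size() || !e1.can_write()) return false;
        e1.tokens.push_back(input[next++]);
        return true;
    }};
    Process p2{[&] {  // transformer: doubles every token it reads
        if (!e1.can_read() || !e2.can_write()) return false;
        e2.tokens.push_back(2 * e1.tokens.front());
        e1.tokens.pop_front();
        return true;
    }};
    Process p3{[&] {  // consumer: collects the results
        if (!e2.can_read()) return false;
        output.push_back(e2.tokens.front());
        e2.tokens.pop_front();
        return true;
    }};

    // Round-robin scheduling in an arbitrary order; a process that cannot
    // read (or write) simply does nothing in this round.
    std::vector<Process*> order = {&p3, &p1, &p2};
    bool progress = true;
    while (progress) {
        progress = false;
        for (Process* p : order) progress = p->fire() || progress;
    }
    return output;
}
```

Calling `run_pipeline({1, 2, 3})` yields the tokens 2, 4, 6, and the same sequence results with a capacity of 1, since only the interleaving of the processes changes.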

Writing to such channels may occur without blocking; reading, however, always blocks if no tokens are available on the channel to read from. In the latter case, a process stalls until tokens are available. This blocking-read mechanism keeps the network deterministic: for a given input sequence, the output produced by the network is always the same. Neither the order in which processes are executed nor the timing of processes within the network affects this deterministic behavior.

In real applications, however, FIFO channels are usually bounded and therefore use a fixed amount of memory. This introduces a so-called artificial deadlock [9]. This deadlock happens because a blocking-write can now occur. A blocking-write happens when processes are not consuming tokens from a channel while other processes are continuously producing tokens, filling up that channel. When the channel is full, no more tokens can be written to it; hence the name blocking-write. When one or more blocking-write conditions prevent every process in the network from advancing, we say that the network has deadlocked. To avoid deadlock, FIFO sizes must be chosen large enough to accommodate the given application. This can be done by computing the absolute minimum FIFO sizes, i.e., a correct network with minimum memory resources. This, however, does not guarantee the fastest possible network performance. In [10], it is shown how to compute the minimum FIFO sizes in a KPN such that the maximum possible network performance can be achieved.

2.2 Leiden Architecture Research and Exploration Tool

LAURA, which is part of the COMPAAN/LAURA tool chain [4], incorporates an approach that allows the automatic generation of hardware modules. These hardware modules are described using a hardware description language (HDL), which can be mapped onto a reconfigurable platform.
The COMPAAN tool [2] accepts as input a parameterized static nested loop program, written in a subset of the Matlab language. COMPAAN generates a dependence graph of the program and converts it to a KPN specification (see Figure 2.2). This KPN specification can then be processed by the LAURA tool, which creates hardware modules by describing them in the Very high speed integrated circuit Hardware Description Language (VHDL). This description of the hardware modules can subsequently be mapped onto a Field Programmable Gate Array (FPGA), or can be used for simulation to obtain bit-accurate data, such as performance in terms of clock cycles, time delays and silicon area.

[Figure 2.2: COMPAAN/LAURA Tool chain (Matlab, Compaan, KPN, Laura, VHDL Modules).]

2.2.1 Abstract Architecture Model

In LAURA, the process of transforming a KPN into a hardware module is separated into a platform-independent and a platform-dependent part. In the platform-independent part, an abstract model of the architecture is built that is independent of any target architecture, and the KPN is mapped onto it in a one-to-one fashion. Each node of the KPN becomes a separate process. The abstract architecture model defines a semantic model of how the different components interact with each other, and it can be described as a network of virtual processors [4]. Each virtual processor is based on the LAURA Processor Model, which consists of a Read Unit, a Write Unit, an Execute Unit and a Controller Unit. The LAURA processor model is discussed in more detail in Section 2.2.2.

In the platform-dependent part, target-specific information is added to the abstract architecture model. This includes, for example, the bit-width and size of the hardware channels. Once the abstract architecture model is established for a particular KPN, it is converted into VHDL code using the Visitor Design Structure. This Visitor Design Structure visits every structure or component within the abstract architecture model and generates the VHDL code that represents that component on the target platform.
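
The visiting scheme described above can be sketched as follows. This is a minimal illustration, assuming that the Visitor Design Structure behaves like the classic visitor pattern; it is not the code of LAURA or KPNFORMAT. The class names (`Component`, `VirtualProcessor`, `FifoChannel`, `VhdlVisitor`, `SystemCVisitor`), the helper functions and the emitted strings are all hypothetical placeholders, and real generated VHDL or SystemC would be far more elaborate.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct VirtualProcessor;
struct FifoChannel;

// One visit method per component type of the (simplified) architecture model.
struct Visitor {
    virtual ~Visitor() = default;
    virtual void visit(const VirtualProcessor& p) = 0;
    virtual void visit(const FifoChannel& c) = 0;
};

// Every component of the model accepts a visitor and dispatches to it.
struct Component {
    virtual ~Component() = default;
    virtual void accept(Visitor& v) const = 0;
};

struct VirtualProcessor : Component {
    std::string name;
    explicit VirtualProcessor(std::string n) : name(std::move(n)) {}
    void accept(Visitor& v) const override { v.visit(*this); }
};

struct FifoChannel : Component {
    std::string name;
    int depth;  // target-specific information, e.g. the channel size
    FifoChannel(std::string n, int d) : name(std::move(n)), depth(d) {}
    void accept(Visitor& v) const override { v.visit(*this); }
};

// Back end 1: emits placeholder VHDL-like text.
struct VhdlVisitor : Visitor {
    std::ostringstream out;
    void visit(const VirtualProcessor& p) override { out << "entity " << p.name << " is\n"; }
    void visit(const FifoChannel& c) override { out << "-- fifo " << c.name << " depth " << c.depth << "\n"; }
};

// Back end 2: same model, different output language.
struct SystemCVisitor : Visitor {
    std::ostringstream out;
    void visit(const VirtualProcessor& p) override { out << "SC_MODULE(" << p.name << ")\n"; }
    void visit(const FifoChannel& c) override { out << "sc_fifo<int> " << c.name << "(" << c.depth << ");\n"; }
};

using Model = std::vector<std::unique_ptr<Component>>;

// A tiny hypothetical model: one virtual processor and one channel.
Model example_model() {
    Model m;
    m.push_back(std::make_unique<VirtualProcessor>("p1"));
    m.push_back(std::make_unique<FifoChannel>("e1", 16));
    return m;
}

// Walks the model once and returns whatever the chosen back end produced.
template <typename V>
std::string generate(const Model& model) {
    V visitor;
    for (const auto& component : model) {
        assert(component != nullptr);
        component->accept(visitor);
    }
    return visitor.out.str();
}
```

With this structure, `generate<VhdlVisitor>(m)` and `generate<SystemCVisitor>(m)` traverse the same model `m`; only the visitor class differs, which is why adding another output language does not require touching the model classes.
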
The Visitor Design Structure can easily be adapted to generate other HDLs, e.g., Verilog, SystemVerilog or SystemC.

2.2.2 LAURA Processor Model

A virtual processor is built out of four units: the Read Unit, the Execute Unit, the Write Unit and the Controller Unit. These units make up the LAURA Processor Model [4] (see Figure 2.3). The Read Unit is responsible for assigning valid data to all the input arguments of the Execute Unit. The Execute Unit takes the data and computes a function. The Write Unit distributes the result of the Execute Unit to other processors in the network. The Controller Unit orchestrates the Read, Execute, and Write Units. Section 3 explains this in more detail.
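
The division of labor between the four units can be illustrated with a behavioral sketch. The fragment below is an assumption-laden simplification, not the LAURA implementation and not cycle-accurate: the names `Fifo`, `Step`, `VirtualProcessor` and `add` are hypothetical, the processor is fixed to two inputs and one output, and the blocking-read and blocking-write conditions of Section 2.1 are reported as return values instead of stalling a hardware pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Bounded FIFO channel attached to an input or output port.
struct Fifo {
    std::deque<int> data;
    std::size_t capacity;
    bool empty() const { return data.empty(); }
    bool full() const { return data.size() >= capacity; }
};

// Outcome of one attempt to fire the processor.
enum class Step { Fired, BlockedOnRead, BlockedOnWrite };

// Example function for the Execute Unit (hypothetical).
int add(int x, int y) { return x + y; }

struct VirtualProcessor {
    Fifo& in_a;
    Fifo& in_b;
    Fifo& out;
    int (*execute)(int, int);  // the function computed by the Execute Unit

    Step step() {
        assert(execute != nullptr);
        // Controller: fire only if every input has data and the output has room.
        if (in_a.empty() || in_b.empty()) return Step::BlockedOnRead;
        if (out.full()) return Step::BlockedOnWrite;

        // Read Unit: assign valid data to all input arguments.
        const int a = in_a.data.front();
        in_a.data.pop_front();
        const int b = in_b.data.front();
        in_b.data.pop_front();

        // Execute Unit: compute the function.
        const int result = execute(a, b);

        // Write Unit: pass the result on towards the next processor.
        out.data.push_back(result);
        return Step::Fired;
    }
};
```

In this sketch nothing is consumed from the inputs unless the whole read, execute, write sequence can complete, which is one simple way to keep a blocked processor from losing tokens.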

[Figure 2.3: LAURA Processor Model (input FIFOs; Read Unit with MUX, Logic and Counter; Execution Unit with IP Core; Write Unit with DEMUX, Logic and Counter; output FIFOs; Controller Unit).]

The model follows a well-defined convention on how data is read, executed and written. The Execute Unit, for instance, only computes a function when all the input arguments have data. Consider the simple function [x, y] = f(z, a, b). The function f(z, a, b) must know all its input arguments, which are z, a and b. Once the values of these arguments are obtained, the function is able to produce data for all its output arguments x and y. Otherwise, the function waits until all the input arguments become available.

The Read Unit is composed of a Read Multiplexer, a Logic Unit and a Counter. A Read Multiplexer is instantiated for every argument of the Execute Unit. The Counter iterates from a certain lower bound to an upper bound. The Logic Unit decides, for a specific iteration, which data is needed. The Write Unit has a similar structure and behavior. The only difference is that it has a Write DeMultiplexer and that its Logic Unit decides, based on the iteration, whether a write operation must occur. The Controller, however, blocks the Read, Execute and Write Units as a whole when the outgoing FIFO channels are full.

The Read Unit and Write Unit can block the computation of a function, thereby stalling the complete processor. A blocking-read situation occurs when data is not available at a given input port when it is needed. A blocking-write situation occurs when data cannot be written to a particular output port when needed. The input ports and output ports of a virtual processor are the I/O interfaces that connect the virtual processor with a communication channel. The Read Unit selects data from a specified input port. If data is available at that port, it is taken. A write operation happens only when all the output arguments of the Execute Unit are available to the Write Unit. Where to read from or write to is determined by the Logic Unit.

2.3 SystemC

SystemC is a hardware description language for modeling and describing hardware components. It provides a single language for hardware/software co-simulation and for the step-by-step refinement of a system design down to the register transfer level for synthesis. Hardware components can be described at the transaction level (TLM) or at the register transfer level (RTL). Section 2.3.1 explains transaction level modeling in terms of time-to-market requirements and the increasing complexity of systems [11]. In Section 2.3.2, we discuss the SystemC register transfer level and compare the two design methodologies. We also discuss some of the language constructs (e.g., modules, processes, time, etc.) in Section 2.3.4, as we use them in Section 3 when describing the various components of the LAURA model in more detail.

2.3.1 Transaction Level Modeling

As designs grow in size, speed and complexity, it becomes necessary to describe them at higher levels of abstraction. Moore's law, which states that the number of transistors incorporated in an Integrated Circuit (IC) doubles every two years, has been an important trend in the history of computer hardware. The explosion of complexity predicted by Moore's law has driven the semiconductor industry to take on another revolution: the System-on-Chip.
With the advent of SoCs, where distinct electronic components form an entire electronic system, and platform based design, where silicon producers develop a basic silicon platform for system companies who add their value on top of the platform: make it extremely difficult to communicate a platform to different design teams. System designers can not start adding features to the platform until the platform or a prototype has been made available by the hardware designer, see Figure 2.4 [12]. As a result, system exploration and verification happens very late in the design process. When the target platform has 11

17 CHAPTER 2. BACKGROUND 2.3. SYSTEMC System Specification Hardware Development No Communication Software Development Hardware Re-design Prototype Software Re-design System Integration System Validation Chip Fabrication Figure 2.4: Traditional Design Flow. been made available by the manufacturer, or hardware designer, and verification does not reveal the expected behavior, the target platform and perhaps part of the software must be re-designed, which is a costly and tedious process. SystemC closes this communication gap by raising the abstraction level to the transaction level modeling (TLM). With TLM, software developers for an embedded system can start the software development at an early phase of the design process since a platform or a prototype is already available as an executable specification. An executable specification is a simulation model of the platform at either the system level, the behavior level or the RTL level. With this, system engineering teams can get knowledge of the platform that they are about to customize. They can observe the behavior of different parts of that platform. The executable specification can be easily distributed among different design teams and enables early system exploration and verification. More important, both the hardware and 12

18 CHAPTER 2. BACKGROUND 2.3. SYSTEMC Customer Specification Paper Specification HW/SW Partitioning TLM Hardware Development Concurrent HW/SW Engineering Based on TLM Software Development Test Chip System Integration & Validation Chip Fabrication Figure 2.5: TLM design flow. software designers can work concurrently on the same platform, see Figure 2.5 [12]. In TLM, the communication is modeled by the use of function calls that represent the transactions, typically supported by the target platform. When designing a system at the TLM level, signals are usually avoided entirely, and instead data is exchanged between different processes by reading and writing shared data variables. The TLM designs are usually more concise and simulate order of magnitude faster than corresponding register transfer level designs. TLM designs can initially be written in an untimed manner, i.e., without regarding the clocking scheme that will be actually implemented in a device. At the TLM level, designers can easily verify the design at a very fast rate, since they do not depend on changes in clock signals. One main advantage of SystemC is that it also provides a notion of time. So, one can also 13

19 CHAPTER 2. BACKGROUND 2.3. SYSTEMC model systems at the TLM level in a timed manner. Modeling at the transaction level allows design teams to easily verify system components and helps to understand the behavior of the entire system. Once the functional verification of the system is finished, the refinement of the system toward the RTL can start Register Transfer Level In the previous section, we have seen that System level design offers a way to have a fast executable specification of the design that can be used for validation. At this level, the model is manageable enough such that different architectures can be explored and changes can be made quickly, see Figure 2.6. When the verification is done one can refine to the register transfer level. The refinement can be done in the same SystemC environment, which introduces a new system design methodology which will be discussed in Section Register Transfer Level is the next level above gates. SystemC RTL, which is a subset of SystemC, can be used to describe hardware models at the RTL. At the RTL level, communication is modeled using signals. The RTL style of modeling corresponds to digital hardware synchronized by clock signals. The basic building block in a SystemC RTL is a module, which we will discuss in Section It is a container in which processes and other modules can be instantiated. A module Fastest System Iteration time RTL Chip Slowest Figure 2.6: Exploring alternate architectures at different levels. 14

20 Input ports Output ports CHAPTER 2. BACKGROUND 2.3. SYSTEMC Module RTL Process RTL Process Signals RTL Process Signals Figure 2.7: Simple SystemC RTL module. typically exists of one or more RTL processes, see Figure 2.7. Processes may represent sequential logic, in which cases they are sensitive to a clock edge or represent combinational logic, in which case they are sensitive to all their inputs. Ports directly correspond to wires in the real world and therefore SystemC RTL models are pin-accurate. Such models can be automatically synthesized to gates using RTL synthesis tools. The RTL modeling style is also widely used in languages such as Verilog and VHDL SystemC Design Methodology In the current system design methodology, one would have a system level model written in C/C ++. The model is analyzed to verify the concepts and algorithms at the system level. When the model is validated, it is converted manually to a hardware implementation i.e., VHDL/Verilog, see Figure 2.8. There are problems with this approach. First of all, the manual translation from C/C ++ to a hardware description is error prone. Secondly, after the translation, the hardware description model becomes the focus of development and the C model quickly becomes out of date. Most of the time, changes are made to the hardware description model and not the implemented C model. One last issue is that the system engineer would have to create multiple system tests to validate the functionality. The C/C++ test of the system level model cannot run against the HDL model without conversion. So, in this case the entire test suite needs to be converted to the HDL 15

21 CHAPTER 2. BACKGROUND 2.3. SYSTEMC C/C++ System level Model Manual Conversion Refine Analysis VHDL/Verilog Results Simulation Modification Synthesis Figure 2.8: Traditional System Design Methodology. environment. With the SystemC approach, the design is not converted from a C/C ++ level description to an hardware description language. The design is slowly refined in small sections to add the necessary hardware and timing constructs to produce a good design. Using this refinement methodology, the designer can more easily implement design changes and detect bugs during refinement, see Figure 2.9. This new design methodology allows that the same language can be used to write a design, verify it and further refine the design all the way to the implementation level. The SystemC design method allows hardware and software to be developed at the same time but independently from one another. This technique is called hardware/software co-design SystemC RTL Constructs SystemC offers a class library that extends C ++ to model hardware descriptions. Its purpose is providing implementations of many types of objects that are hardware specific. Modeling hardware requires C ++ to handle multiple processes executing concurrently, hardware timing and 16

reactive behavior. The class library enables the user to define modules, processes, and communication through ports and signals. These can carry a range of data types, from bits and bit vectors to standard C++ types and user-defined data types such as enumeration types and structure types. The SystemC model is called an executable specification, i.e., you can compile and execute the SystemC model to understand the behavior of the system. The executable specification of the design or system is simulated using the simulation kernel that is provided with SystemC.

Figure 2.9: SystemC Design Methodology.

Since SystemC is C++, one can use standard C++ development tools to create, simulate, debug and explore different architectural and algorithmic descriptions of a design. Figure 2.10 shows how to include a SystemC model in a standard C++ design flow [11]. The designer writes the SystemC models at system level, behavioral level or RTL using the SystemC library. The compiler compiles the source and links the library, producing an executable that contains the SystemC simulator. Input and output files can be used for several purposes, for example test benches, traces and waveforms.

Including a software development environment in the hardware and system design flow brings some powerful advantages. Software developers may use C++ for verification


and debugging tasks. Hardware designers can preview simulation data in the form of waveform displays. But the most powerful feature is that the hardware, software, and test bench parts of the design can be simulated in one simple and unified simulation environment, without the need for awkward co-simulations of disparate modeling paradigms.

Figure 2.10: SystemC in a C++ development environment.

Modules

Modules allow the partitioning of a complex design into smaller entities. In SystemC, structural decomposition is specified with modules, which are the basic building blocks. A module is declared with the SC_MODULE macro, which declares a C++ class. Modules can be hierarchical, which is an important requirement for structural design representation. A variety of elements make up the body of a module: ports, signals, sub-modules, constructors and processes.

    SC_MODULE(module_name) {
        // MODULE BODY
    };

Signals and Ports

Ports allow modules to communicate with their surroundings. They are used to specify the interface of a module and are declared as sc_in or sc_out. Signals, on the other hand, are used for communication between processes and for connecting module instances. A signal is declared using the sc_signal declaration. Ports and signals carry a type, e.g., bool, int or sc_bv<size>.

Constructors

Within a module constructor, one can perform the following tasks:

- Register a process with the SystemC kernel.
- Set sensitivity on ports or signals.
- Connect modules.

When the constructor is placed in the implementation file (e.g., a .cc file) or requires more than one argument, the syntax of the constructor is as follows:

    SC_MODULE(module_name) {
        // constructor
        SC_HAS_PROCESS(module_name);
        module_name(sc_module_name instname [, other_args...]);
    };

When the constructor is placed in the class file (e.g., a .h file) and does not require more than one argument, the syntax is simpler:

    SC_MODULE(module_name) {
        // constructor
        SC_CTOR(module_name) {
            // constructor body
        }
    };

The extra arguments can be used to pass memories, address ranges for decoders, FIFO depths and other configuration information. So, SC_HAS_PROCESS is used when modules need to be parameterized, otherwise SC_CTOR is used.

Processes

All executing code is initiated from one or more processes. Processes execute concurrently. A process in SystemC takes no arguments and has no return value.

    SC_MODULE(module_name) {
        // constructor
        SC_HAS_PROCESS(module_name);
        module_name(sc_module_name instname [, other_args...]);

        // prototype of a process
        void PROCESS_NAME();
    };

There are two kinds of processes: SC_THREAD, modeling sequential behavior, and SC_METHOD, modeling combinational behavior. The SC_THREAD behaves much like a software thread. It is called once by the simulator, and while it executes it has complete control of the simulation until it chooses to return control to the SystemC simulator. One way to give control back to the simulator is a simple exit, e.g., return. This terminates the thread for the rest of the simulation. Once an SC_THREAD process has terminated it is gone forever; therefore an SC_THREAD typically contains at least one wait() enclosed within a while or for loop. This wait suspends the SC_THREAD process by giving control back to the simulator.

The SC_METHOD process differs from the SC_THREAD process in that it can be invoked more than once by the simulator and cannot be suspended by a wait() statement during execution. So, once the last statement within a method is done, control is given back to the simulator.

When a process is defined, it must be registered with the simulation kernel. This allows the process to be invoked by the simulation kernel scheduler. The registration is done within the constructor.

    SC_MODULE(module_name) {
        // ports and signals
        sc_in<int> inport;
        sc_out<int> outport;
        sc_signal<sc_bv<4> > sign;

        // constructor
        SC_CTOR(module_name) {
            SC_THREAD(PROCESS_NAME_1);
            SC_METHOD(PROCESS_NAME_2);
        }

        // prototype of a process
        void PROCESS_NAME_1();
        void PROCESS_NAME_2();
    };

Time

SystemC provides the notion of time. Time is expressed using the sc_time data type. Possible time units include SC_NS and SC_PS: the first represents time in nanoseconds, the second time in picoseconds. The sc_time data type is used to represent simulation time, time intervals, delays and time-outs. Time values are expressed as multiples of the time resolution, which defaults to 1 ps. We can change the time resolution with sc_set_time_resolution. Changes in signals are visible at the specified resolution; changes on a finer time scale than that are not visible.

Test Benches

Setting up a test bench in SystemC is very easy. It is specified with an SC_THREAD, just like a process, and is easily integrated into the overall design. More sophisticated test benches can be built using the constructs available in C++, in contrast to the relatively primitive capabilities of VHDL for file I/O, data abstraction, and text processing.

Top-level

The top-level is the part of the implementation where all components of a design are instantiated and connected. This is done with signals. It is also the part of the implementation where the simulation of the design is started. Modules at this level will also be connected to a

clock. At line 7 in the code below, one can see how the time resolution is set. It allows changes to signals to be visible at 0.1 ns and higher. The code also initializes the start time and the period of the clock. The start time is simply the time when the clock starts ticking; the period is the time a clock cycle takes, in this case 2 ns.

    #include <systemc.h>
    #include "module_name_1.h"
    #include "module_name_2.h"

    int sc_main(int argc, char* argv[]) {
        // Set time resolution
        sc_set_time_resolution(0.1, SC_NS);

        // Declare clock
        sc_time start_time(0, SC_NS);
        sc_time period(2, SC_NS);
        sc_clock clock("clock_name", period, 0.5, start_time, true);

        // Declare signal
        sc_signal<int> data_sig;

        module_name_1 my_instance_1("my_instance_1");
        module_name_2 my_instance_2("my_instance_2");

        // Connecting modules through signals
        my_instance_1.outport(data_sig);
        my_instance_2.inport(data_sig);

        my_instance_1.clk(clock);
        my_instance_2.clk(clock);

        sc_start(); // start simulation
        return(0);
    }

The clock is a data type and is declared as sc_clock. It takes the name, period, duty cycle, start time and starting edge as arguments, see line 12. Before we can connect the desired components to one another, we need to instantiate them. Two components are instantiated at lines 17 and 18. The parameters my_instance_1 and my_instance_2 are stored as strings for debugging purposes. The SystemC function name() can be used to acquire a particular module or instance name. To connect the modules, we need a signal, which is declared at line 15. The modules are connected by binding the signal to their ports, as depicted in lines 21 and 22. At lines 24 and 25 we also connect the clock to the modules to synchronize their behavior. Finally, at line 27 we start the simulation

of the design.
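To make the timing parameters of the top-level example more concrete, the following sketch works out, in plain C++ and without the SystemC library, how many resolution steps fit in one clock period and where the clock edges fall. It assumes that simulation time is counted in integer multiples of the resolution and that the clock starts with a rising edge. The names ClockSpec, ticks_per_period, rising_edge_ps and falling_edge_ps are hypothetical helpers for illustration, not SystemC functions.

```cpp
#include <cstdint>

// Hypothetical helper, not part of SystemC: all times are kept as integer
// picoseconds so that no floating-point rounding is involved.
struct ClockSpec {
    std::int64_t resolution_ps;  // 0.1 ns resolution -> 100 ps
    std::int64_t period_ps;      // 2 ns period       -> 2000 ps
    std::int64_t start_ps;       // start time        -> 0 ps
    std::int64_t high_percent;   // duty cycle 0.5    -> 50
};

// Number of resolution steps in one clock period.
inline std::int64_t ticks_per_period(const ClockSpec& c) {
    return c.period_ps / c.resolution_ps;
}

// Time (in ps) of the n-th rising edge, n = 0, 1, 2, ...
// Assumes the clock starts with a rising edge at start_ps.
inline std::int64_t rising_edge_ps(const ClockSpec& c, std::int64_t n) {
    return c.start_ps + n * c.period_ps;
}

// Time (in ps) of the falling edge that follows the n-th rising edge.
inline std::int64_t falling_edge_ps(const ClockSpec& c, std::int64_t n) {
    return rising_edge_ps(c, n) + c.period_ps * c.high_percent / 100;
}
```

With the values of the example (resolution 0.1 ns, period 2 ns, duty cycle 0.5, start time 0), one clock period spans 20 resolution steps and the clock is high during the first 1 ns of each period.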

Chapter 3

Implementation

This chapter shows how to generate the LAURA multi-processor network, and discusses how we have modeled the LAURA architecture model within the SystemC formalism. We finally discuss the use of sc_fifo, a SystemC primitive.

3.1 Tool Flow

To realize a network of processors within the SystemC formalism, we must follow some necessary steps. The sequential Matlab code needs to be converted into a KPN specification. To achieve this, the Matlab code is provided to the COMPAAN tool. The generated KPN is subsequently given to the IMASC tool, which creates the RTL hardware implementations in SystemC, as depicted in Figure 3.1. Afterwards, the SystemC hardware implementations can be simulated, or synthesized to gates using commercially available tools.

To let COMPAAN generate a KPN, take one of the Matlab files that can be found in compaan/test/algorithms and provide it as input to the COMPAAN tool as follows:

    ./compaan matlab_file

This generates a file, matlab_file.kpn, that can be used as input to the IMASC tool, which transforms the KPN into a network of processors, where each processor corresponds to the LAURA processor model as discussed in Section. To enable the generation of hardware modules at the RTL level one can do the following:

    ./Imasc -f matlab_file.kpn --rtl

The --rtl argument makes IMASC generate SystemC at the RTL level. If the argument --systemc or -s is provided, IMASC generates SystemC TLM code. The IMASC tool can also be invoked with options to validate the behavior of the network using traces, or to generate waveforms. The options to do this are:

- test or -t, for enabling a test environment in which we check whether the inspected value equals the expected value.
- traces or -tr, for enabling the generation of waveforms for each component within the LAURA processor.

When the -t option is included, the Execute Unit of each module indicates where the user has to include the values. The user has to define an array with these values. When a multi-processor network is simulated, the values read by a node in the network are compared with the expected values. These values are provided as traces. The main purpose of the traces is to allow validation of the network.

Figure 3.1: Tool Flow.

In order to obtain the traces, assuming that the user is located at compaan/bin, one must execute the following statements at the command line:

    mkdir kpn_file_name
    $COMPAAN/bin/dgparser -f kpn_file_name.m --test --param N:7 --param K:5
    $COMPAAN/bin/kpnformat -f kpn_file_name.kpn --test -p
    cd kpn_file_name
    $JAVA_HOME/bin/javac -cp . kpn_file_name_testbench.java
    $JAVA_HOME/bin/java -cp . kpn_file_name_testbench

3.2 Running Example: merge2write

In the upcoming sections, we show how we have modeled the LAURA architecture for a particular application. For this purpose, we will use the KPN merge2write.kpn, as shown in Figure 3.3. This KPN has five processes: two source nodes, two sink nodes and one pass node.

    for j=1:1:5,
        [a(j),b(j)] = InitA;
    end
    for j=6:1:10,
        [a(j),b(j)] = InitB;
    end
    for j=1:1:10,
        [a(j),b(j)] = pass(a(j),b(j));
    end
    for j=1:1:5,
        [] = SinkA(a(j),b(j));
    end
    for j=6:1:10,
        [] = SinkB(a(j),b(j));
    end

Figure 3.2: Matlab code of merge2write.

When we examine the Matlab code from Figure 3.2, the functions (InitA, InitB, pass, SinkA and SinkB) represent the nodes that are depicted in the process network of Figure 3.3. The arrays a and b represent the FIFOs to which a node is connected. For example, InitA is connected to the FIFOs a and b. The nodes InitA and InitB do not read tokens from FIFOs, whereas SinkA and SinkB only read tokens. We call InitA and InitB sources and SinkA and SinkB sinks. Node pass we call a transformer. Each of the nodes is nested within a for-loop. InitA produces tokens to FIFOs a and b in iterations 1 to 5. Afterwards, InitB produces tokens in iterations 6 to 10. The node pass consumes the

tokens in iterations 1 to 10 and writes them to the corresponding FIFOs. Such a node is typically called a transformer since it consumes, changes, and then produces values. SinkA consumes tokens in iterations 1 to 5 and SinkB in iterations 6 to 10. Since these are sink nodes, they hold the final outcome. Usually this value is displayed or written to file.

Figure 3.3: merge2write Network with multiple arguments to the Execute Unit.

In this running example, the two source nodes (InitA and InitB) keep on producing tokens to the FIFO channels to which they are linked. Meanwhile, the node named pass selects a token from one of the channels and performs some calculation. Afterwards it selects one of the outgoing channels

to write the token to. If there are tokens available on one of the outgoing channels, one of the sink nodes (SinkA or SinkB) will read the token from the FIFO.

Figure 3.4: mergewrite Network with one argument to the Execute Unit.

This running example was a case study we did to check whether the multiplexers corresponding to the input/output arguments of the Execute Unit were selected correctly. This case has more than one multiplexer. A simpler case is mergewrite, see Figure 3.4, where there is only one argument to the Execute Unit (see also Figure 3.5). To understand how the distinct components of the LAURA architecture are implemented in SystemC, we discuss them in later sections for the transformer node pass.
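The data flow of the merge2write example can be sketched in plain C++ with ordinary queues standing in for the FIFO channels. This is only an illustration under several assumptions: each (a(j), b(j)) pair travels as one token instead of over two separate FIFOs, the FIFOs are unbounded, the nodes run one after another instead of concurrently, the token values are made up, and the pass function simply forwards its input. The names Token, Fifo, Result and run_merge2write are hypothetical.

```cpp
#include <queue>
#include <utility>
#include <vector>

using Token = std::pair<int, int>;   // one (a(j), b(j)) pair (simplification)
using Fifo  = std::queue<Token>;     // unbounded FIFO channel (assumption)

// Hypothetical stand-in for the pass function; the real computation is
// supplied as an IP core. Here it simply forwards its input.
inline Token pass(const Token& t) { return t; }

struct Result {
    std::vector<Token> sinkA;  // tokens consumed by SinkA (j = 1..5)
    std::vector<Token> sinkB;  // tokens consumed by SinkB (j = 6..10)
};

// Runs the five nodes one after another, in the order of the Matlab code.
inline Result run_merge2write() {
    Fifo initA_to_pass, initB_to_pass, pass_to_sinkA, pass_to_sinkB;

    // Source nodes: the token values (j, 10*j) are made up for illustration.
    for (int j = 1; j <= 5; ++j)  initA_to_pass.push({j, 10 * j});  // InitA
    for (int j = 6; j <= 10; ++j) initB_to_pass.push({j, 10 * j});  // InitB

    // Transformer node: the iteration number selects the input channel
    // (read multiplexer) and the output channel (write de-multiplexer).
    for (int j = 1; j <= 10; ++j) {
        Fifo& in  = (j <= 5) ? initA_to_pass : initB_to_pass;
        Fifo& out = (j <= 5) ? pass_to_sinkA : pass_to_sinkB;
        out.push(pass(in.front()));
        in.pop();
    }

    // Sink nodes: drain the outgoing channels.
    Result r;
    while (!pass_to_sinkA.empty()) {
        r.sinkA.push_back(pass_to_sinkA.front());
        pass_to_sinkA.pop();
    }
    while (!pass_to_sinkB.empty()) {
        r.sinkB.push_back(pass_to_sinkB.front());
        pass_to_sinkB.pop();
    }
    return r;
}
```

Calling run_merge2write() yields five tokens at each sink, which mirrors the iteration ranges 1 to 5 and 6 to 10 in the Matlab code.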

    for j=1:1:5,
        [a(j)] = InitA;
    end
    for j=6:1:10,
        [a(j)] = InitB;
    end
    for j=1:1:10,
        [a(j)] = pass(a(j));
    end
    for j=1:1:5,
        [] = sinka(a(j));
    end
    for j=6:1:10,
        [] = sinkb(a(j));
    end

Figure 3.5: Matlab code of mergewrite.

3.3 Modeling the LAURA Architecture

In this section we explain the LAURA processor model and how we have modeled it at the Register Transfer Level (RTL) using SystemC. We also discuss how the components of LAURA are created and bound to form a fully functional hardware processor model.

Read Unit

The read multiplexer has the task of selecting tokens or data from one of its ports. When data has been selected, we say that a read operation has occurred. Selection occurs only when certain conditions apply. Figure 3.6 portrays the read multiplexer, together with the Logic Unit, which we will discuss in Section 3.3.5, and the Controller Unit, which will be the subject of Section. To be able to select data from one of the ports p_0 and p_1, the following conditions must be satisfied:

- Data is needed, as indicated by the Logic Unit.
- Data exists on the port that needs the data.

If the conditions are satisfied, the Control Unit enables the read mux and data is read from the port on which data was needed. In Figure 3.6, the various signals discussed are shown. A read by the read multiplexer from a particular FIFO can be blocking or non-blocking. For example, suppose the FIFO connected to p_0 has at least one token, and the FIFO connected to p_1 is empty. When the read multiplexer wants to select a token from port p_1, a blocking read occurs. In this case, the read multiplexer suspends until a token becomes available. Once

available, the token will be read by the read multiplexer and propagated to the out port, towards the Execute Unit.

Figure 3.6: Conceptual diagram of a read multiplexer, which is an input argument of the Execute Unit.

In our SystemC model a Read Unit is named using a specific scheme. A read module is named Readi_j, where i and j are integer numbers: i stands for the KPN node the module resides in and j for the argument it represents. A port is named ND_iIP_j_Din, where i and j are integer numbers: i stands for the KPN node the module resides in and j for the port of the KPN node it is connected to. Figure 3.7 depicts the module Read3_1. Data is read from input ports ND_3IP_1_Din and ND_3IP_2_Din. When data is selected from one of the input ports, it leaves the module through output port out.

Figure 3.8 shows the waveform analysis of node 3 from our running example. It portrays two read multiplexers separated by a blank line. These are the two arguments to the Execute Unit. We observe that each of the arguments is bound to two input ports. The first argument is connected to the inputs ND_3IP_1_Din and ND_3IP_2_Din, and the second one to ND_3IP_3_Din and ND_3IP_4_Din. The read multiplexer knows from which port to select by looking at neededrd. It contains a hot bit (one-hot) encoding, which allows a read multiplexer to select data from only one of its ports at a time. In Figure 3.8, we observe that when the read multiplexers are

enabled, and there exists data on one of the FIFOs, and the data is needed for a particular port, then a read operation takes place.

    SC_MODULE(Read3_1) {
    public:
        sc_in<int>       ND_3IP_1_Din;
        sc_in<int>       ND_3IP_2_Din;
        sc_out<int>      out;

        sc_in<sc_bv<4> > exist;
        sc_out<bool>     exists;
        sc_in<sc_bv<4> > readfromfifo;
        sc_in<bool>      enable;
        sc_in<sc_bv<4> > neededrd;
        sc_in<bool>      sync2fifo;

        SC_HAS_PROCESS(Read3_1);
        Read3_1(sc_module_name mn);

    protected:
        void Read_Prc();
        void Exists_Prc();
    };

Figure 3.7: Read Multiplexer Module of the first argument to the Execute Unit.

Figure 3.8: Waveform analysis of two arguments of the Execute Unit.

Let us look at how we have realized the concepts explained previously. The top level contains a process called Exist_Prc. This process checks if at least one token on one of the FIFOs

exists. It does this by propagating the state of all the FIFOs, through the signal existr_sig, towards the exist ports of all the read multiplexers (see Figure 3.9).

    for(;;) {
        if( ND_3IP_1_Din->num_available() == 0 )
            _exist[0] = false;
        else
            _exist[0] = true;
        // more code omitted
        existr_sig.write(_exist);
        wait(0.1, SC_NS);
    }

Figure 3.9: Top level process Exist_Prc, verifying if a FIFO contains at least one token.

The read multiplexer contains two processes, Read_Prc and Exists_Prc, which are sensitive to some ports, as indicated in Figure. When the exist port changes, it triggers the Exists_Prc process to execute (see Figure 3.11). The Exists_Prc process evaluates whether there are tokens available and whether they are needed.

    SC_METHOD(Read_Prc);
    sensitive << sync2fifo;
    dont_initialize();

    SC_METHOD(Exists_Prc);
    sensitive << exist << neededrd << readfromfifo;
    dont_initialize();

Figure 3.10: Process registration and Sensitivity list.

When at least one token is available on at least one of the FIFOs, this is propagated through the exists port towards the Control Unit. When it is necessary to read a token from one of the input ports and the read multiplexer is enabled, the Read_Prc at the top level (see Figure 3.12) writes the token to the corresponding input port of the read multiplexer. This triggers the Read_Prc of the read multiplexer immediately, since it is sensitive to sync2fifo. The purpose of sync2fifo is to trigger the read process inside the read multiplexer at the moment a read has occurred at the top level. Once triggered, it checks if the token is needed. If so, the token is written through the out port towards the Execute Unit (see Figure 3.13).
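The selection rule of the read multiplexer can be summarized in a small plain C++ model. This is a simplified sketch, not the generated SystemC code: it assumes two input ports, ignores the readfromfifo and sync2fifo signals, and uses std::bitset<4> in place of sc_bv<4>. The struct ReadMuxInputs and the functions exists and read_token are hypothetical names.

```cpp
#include <bitset>

// Plain C++ model of one read multiplexer with two input ports.
// Bit i of each vector refers to port i.
struct ReadMuxInputs {
    std::bitset<4> exist;     // bit i set: the FIFO on port i holds a token
    std::bitset<4> neededrd;  // one-hot: bit i set: port i must be read
    bool enable;              // driven by the Control Unit
    int port_data[2];         // token at the head of each FIFO
};

// True when the port that is needed also has a token available.
inline bool exists(const ReadMuxInputs& in) {
    return (in.exist & in.neededrd).any();
}

// Stores the selected token in out and returns true when a read takes
// place; returns false when the multiplexer is disabled or the needed
// port is empty (blocking read).
inline bool read_token(const ReadMuxInputs& in, int& out) {
    if (!in.enable) return false;
    for (int i = 0; i < 2; ++i) {
        if (in.neededrd[i] && in.exist[i]) {
            out = in.port_data[i];
            return true;
        }
    }
    return false;
}
```

For example, when only port 0 holds a token but neededrd selects port 1, exists() is false and no read takes place; as soon as port 1 also holds a token, read_token() delivers the token of port 1.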

    bool _ex0;
    if (readfromfifo->read()[0].to_bool() )
        _ex0 = exist->read()[0].to_bool() && neededrd->read()[0].to_bool();
    else
        _ex0 = !neededrd->read()[0].to_bool() && !neededrd->read()[1].to_bool();
    bool _ex1;
    // more code omitted
    exists->write(_ex0 || _ex1);

Figure 3.11: Exists_Prc, propagates whether tokens are available to the Control Unit.

    for(;;) {
        if( renable_sig.read() && neededrd_sig.read()[0].to_bool() ) {
            read0_sig.write(ND_3IP_1_Din->read());
            sync2fifo_sig.write(!sync2fifo_sig.read());
        }
        // more code omitted
    }

Figure 3.12: Top level Read_Prc.

    static int value0;
    if ( neededrd->read()[0].to_bool() ) {
        if ( readfromfifo->read()[0].to_bool() ) {
            value0 = ND_3IP_1_Din->read();
            out->write(value0);
        } else
            out->write(value0);
    }
    // more code omitted

Figure 3.13: Read_Prc, reading tokens.

Execute Unit

The Execute Unit is the part of the model where the actual computation takes place. It may contain a function with several input arguments and output arguments, depending on the application. The function of an application is provided as an IP core and must be connected manually to the input and output arguments by the user. To mimic the delays of a particular application, the user can change the pipeline depth associated with the Execute Unit.

Figure 3.14: Data must be present at arg1 and arg2 for execution to take place.

The Execute Unit executes a particular function only when all the arguments are available. Thus, data must be present at all input ports of the Execute Unit, as indicated in Figure. When data is not present at arg1 or arg2, this means (as discussed in Section) that the multiplexers have blocked the read operation from happening. When all the arguments are present at the input ports of the Execute Unit, the function is carried out as soon as the execute signal is given by the controller, and the result is dispatched to the available output arguments (out1 and out2). This, however, may not happen immediately, because the function can be pipelined in hardware.

Figure 3.15: Pipelining: Read (R), Execute (E) and Write (W) operations.
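The overlap of the Read, Execute and Write operations in such a pipeline can be worked out with a few lines of plain C++. The sketch below assumes an ideal pipeline in which a new iteration starts every clock cycle and nothing ever blocks; iterations and cycles are counted from zero. The names Slot, schedule and total_cycles are hypothetical.

```cpp
// Cycle numbers for one iteration in an ideal (stall-free) pipeline.
struct Slot {
    int read;        // cycle of the Read stage
    int exec_first;  // cycle of the first Execute stage (E1)
    int exec_last;   // cycle of the last Execute stage
    int write;       // cycle of the Write stage
};

// iteration is 0-based; depth is the pipeline depth of the Execute Unit.
// Assumes one new iteration starts every cycle and nothing ever blocks.
inline Slot schedule(int iteration, int depth) {
    Slot s;
    s.read = iteration;
    s.exec_first = iteration + 1;
    s.exec_last = iteration + depth;
    s.write = iteration + depth + 1;
    return s;
}

// Total number of cycles needed to complete n iterations.
inline int total_cycles(int n, int depth) {
    return n == 0 ? 0 : schedule(n - 1, depth).write + 1;
}
```

With a pipeline depth of one, the write of the first iteration, the execute of the second iteration and the read of the third iteration all fall in cycle 2, so three iterations finish in 5 cycles instead of 9. With a depth of five, the first write happens in cycle 6.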

With pipelining, we are able to overlap multiple operations so that they execute in parallel. As a consequence, we increase the overall throughput. Figure 3.15 depicts a three-stage pipeline with overlapping operations Read, Execute and Write. Notice that where the operations overlap, they are carried out in parallel. The figure shows that when the write of the first iteration occurs, an execute from the second iteration and a read from the third iteration can happen simultaneously. The figure also clearly shows that all the pipeline stages are fully utilized. The Execute Unit in the figure has a pipeline depth of one (i.e., only one E is shown).

Figure 3.16: Pipeline depth of five: Read (R), Execute (E) and Write (W) operations.

Within our implementation of the Execute Unit, the pipeline depth can be adjusted by the user using a simple parameter, allowing more execute operations to be carried out simultaneously. In SystemC this can be done easily. Figure 3.16 shows that the third execute operation (E3) of iteration one happens in parallel with the second execute operation (E2) of iteration two and the first execute operation (E1) of iteration three. Notice that the pipeline depth is set to five. So, the result of an iteration is presented to the Write Unit after five execute steps. The R, E and W stages are fully utilized in this figure.

Now let us look at how pipelining is implemented. The Execute Unit is enabled by the Control Unit (discussed in Section 3.3.4), which can be seen in Figure. The Execute Unit has two input arguments (iargs) and two output arguments (oargs). The two input arguments to the function, which is provided to the Execute Unit as an IP core, must be available to the output arguments at the same time. That is why two sc_signal objects are used to model pipelining

(see lines 9 and 10 of Figure). The Execute_Prc process models the pipeline behavior. Because there are two input arguments to the Execute Unit, two sc_signal objects are used to model pipelining (see lines 9 and 10 of Figure).

Figure 3.17: Execute Unit is enabled by the Control Unit.

    SC_MODULE(Execute3) {
    public:
        // code omitted

    protected:
        void Execute_Prc();
        void Out_Prc();
        static const int PIPELINE_STAGES = 1;
        sc_signal<int> delay1_pipe[PIPELINE_STAGES+1];
        sc_signal<int> delay2_pipe[PIPELINE_STAGES+1];
    };

Figure 3.18: Execute Unit Module.

Modeling the pipelining mechanism is simple. Once the Control Unit enables the Execute Unit, it means that all tokens are available. When we have a pipeline depth of five, the tokens are put in the pipelines delay1_pipe and delay2_pipe. A pipeline depth of five means that we have a delay of five clock cycles. When the first token read reaches the end of the pipeline, it is sent to all the output arguments. In general, the arguments in_1 and in_2 are connected to the IP core. Thus, the Execute Unit works as an IP core wrapper [13]. In the waveform from Figure 3.20, we observe the five clock


cycles delay for each token. Thus, the data is written to the Write Unit when, after five clock cycles, there is a positive clock edge and the enable signal is high (1).

    void Execute3::Execute_Prc() {
        // code omitted
        if (enable->read() ) {
            // connect IP here
            delay1_pipe[0].write(in_1->read());
            delay2_pipe[0].write(in_2->read());
            for(int i = PIPELINE_STAGES; i >= 1; i--) {
                delay1_pipe[i] = delay1_pipe[i-1];
                delay2_pipe[i] = delay2_pipe[i-1];
            }
        }
        // code omitted
    }

Figure 3.19: Modeling the pipelining mechanism.

Figure 3.20: Waveform analysis depicting five cycles of pipeline delays.

Write Unit

The Write Unit consists of a write de-multiplexer, a Logic Unit and a Counter Unit. The latter two will be discussed in Section and. The write de-multiplexer is responsible for writing the results provided by the Execute Unit to the outgoing FIFO channels. Besides this, it also monitors whether the FIFOs are full and reports this to the Controller Unit. The write de-multiplexer is very similar to the read multiplexer. To be able to write data to one of the ports p_0 and p_1, the following conditions must all be satisfied:

- Data is needed.
- Data can be written.

If the conditions are satisfied, the Controller Unit enables the write mux and data is written to the port out. In Figure 3.21, the various signals we discussed are shown. Before the data is written, the de-multiplexer verifies whether the FIFO to which the data needs to be written is full or has room (hasroom). When the FIFO is full a blocking write occurs, stalling the complete processor (Read, Execute and Write Unit).

Figure 3.21: Conceptual diagram of a write de-multiplexer, which is an output argument of the Execute Unit.

A Write Unit is named using a specific scheme. A write module is named Writei_j, where i and j are integer numbers: i stands for the KPN node the module resides in and j for the argument it represents. A port is named ND_iIP_j_Din, where i and j are integer numbers: i stands for the KPN node the module resides in and j for the port of the KPN node it is connected to.

There are two processes, called Write_Prc and Full_Prc. The first one writes data to the output ports, but when the corresponding FIFO is blocking it does not write the data. Thus, we check whether the FIFOs are full in order to know if a write must happen or not (see Figure 3.24). The sync2fifo mechanism is the same as what we have discussed in Section. The second process propagates to the Controller Unit whether the corresponding FIFO is full or not. A top level process evaluates every 0.1 ns whether the FIFO is full, and provides this to the full port of the write de-multiplexer. This is exactly the same as what we have discussed in Section
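The write conditions can be summarized in a plain C++ model, analogous to the read side. This is a simplified sketch under the assumption of two output ports; it leaves out the sync2fifo signal and uses std::bitset<4> in place of sc_bv<4>. The struct WriteDemuxInputs and the functions write_port and blocked are hypothetical names.

```cpp
#include <bitset>

// Plain C++ model of a write de-multiplexer with two output ports.
// Bit i of each vector refers to port i.
struct WriteDemuxInputs {
    std::bitset<4> full;      // bit i set: the FIFO on port i has no room
    std::bitset<4> neededwr;  // one-hot: bit i set: port i must be written
    bool enable;              // driven by the Controller Unit
};

// Index of the port that is written in this cycle, or -1 when no write
// takes place (de-multiplexer disabled, or the needed FIFO is full).
inline int write_port(const WriteDemuxInputs& in) {
    if (!in.enable) return -1;
    for (int i = 0; i < 2; ++i) {
        if (in.neededwr[i] && !in.full[i]) return i;
    }
    return -1;
}

// True when the port that must be written is full, i.e. the write blocks.
inline bool blocked(const WriteDemuxInputs& in) {
    return (in.full & in.neededwr).any();
}
```

A full FIFO on a port that is not needed does not block the write; only a full FIFO on the needed port does, which is the condition reported to the Controller Unit.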

    SC_MODULE(Write3_1) {
    public:
        sc_in<int>       in;
        sc_out<int>      ND_3OP_1;
        sc_out<int>      ND_3OP_1_d1;
        sc_in<sc_bv<4> > full;
        sc_in<bool>      enable;
        sc_in<sc_bv<4> > neededwr;
        // more code omitted

        SC_HAS_PROCESS(Write3_1);
        Write3_1(sc_module_name mn);

    protected:
        void Write_Prc();
        void Full_Prc();
    };

Figure 3.22: Module of Write DeMultiplexer.

    if( enable->read() && !full->read()[0].to_bool() && neededwr->read()[0].to_bool() ) {
        ND_3OP_1->write(in->read());
        sync2fifo->write(!(sync2fifo->read()));
    }
    if( enable->read() && !full->read()[1].to_bool() && neededwr->read()[1].to_bool() ) {
        ND_3OP_1_d1->write(in->read());
        sync2fifo->write(!(sync2fifo->read()));
    }

Figure 3.23: Write_Prc process.

    for(int i = 0; i<=1; i++) {
        if ( full->read()[i].to_bool() && neededwr->read()[i].to_bool() ) {
            cfull->write(true);
            break;
        } else
            cfull->write(false);
    }

Figure 3.24: Full_Prc process.

Figure 3.25 depicts the waveform analysis of the write de-multiplexer we discussed in the


previous paragraphs. At a rising clock edge, when the enable port is high (1), a token is written to the FIFO. Afterwards the full port becomes high (1) (with the FIFO size set to one). Another processor from the network will read that token, making the FIFO empty, which causes the full port to become low (0) again. The first two bits (from the left) of the neededwr vector, which are hot bit encoded, indicate to which port, ND_3OP_1 or ND_3OP_1_d1, the data needs to be written. Notice that when these bits change value, the port the data is written to also switches. Hot bit encoding means that only one output port per de-multiplexer can be selected to write the data to.

Figure 3.25: Waveform analysis from the write de-multiplexer.

Control Unit

The Control Unit controls the Read, Execute and Write Unit. The read, execute and write operations occur in succession. This means that a write operation can only occur after an execute operation, and an execute operation can only occur after a read operation. The Control Unit drives the counters of the Read Unit and Write Unit. The counters inform the Control Unit whether reading and writing are done, which will be discussed in Section. The execution order of the units is implemented with three processes: ReadSig_Prc, ExecuteSig_Prc and WriteSig_Prc. The first process activates the Read Unit by enabling the read multiplexers (ren) and the counter (actcrd), which can be seen in Figure. The second process enables the Execute Unit and the third the Write Unit. Figure 3.6 shows that the read multiplexer informs the Control Unit whether data exists or not. And in Figure 3.21 from Section 3.3.3, we have seen that the write de-multiplexer informs the Control Unit whether the outgoing FIFOs are full or not. These conditions, together with whether the Read Unit is done reading, are evaluated within ReadSig_Prc as depicted in Figure. So, when there exists data on the input ports, and the outgoing FIFOs are

CHAPTER 3. IMPLEMENTATION 3.3. MODELING THE LAURA ARCHITECTURE M UX D M UX M UX EXECUTE UNIT D M UX LOGIC ren een wen CONTROLLER UNIT LOGIC actcrd actcwr COUNTER doner donew COUNTER Figure 3.

46 CHAPTER 3. IMPLEMENTATION 3.3. MODELING THE LAURA ARCHITECTURE M UX D M UX M UX EXECUTE UNIT D M UX LOGIC ren een wen CONTROLLER UNIT LOGIC actcrd actcwr COUNTER doner donew COUNTER Figure 3.26: Control Unit: transmission of control signals. not full, and we are not done reading, then we activate the read multiplexer and the counter. Thus, we read a token or data. The read operation happens in one clock cycle. When the ren is high (1), two processes are triggered. The first one is the Execute Unit and the second one is Pipeline_Prc. The latter one controls the pipeline from the Control Unit, and it executes in parallel with the pipeline from the Execute Unit. The only thing that differs, is that it takes control values (0 or 1), instead of data, provided with ren (See Figure 3.28). Once delay_pipe[0] contains the control value, as indicated at line 6 from Figure 3.28, the Execute Unit will also be enabled (See Figure 3.29 line 2). Depending on the pipeline depth say N, it will void Controller3::ReadSig_Prc() ren->write(exist->read() &&!full->read() &&!doner->read() ); activatecounterrd->write(exist->read() &&!full->read() &&!doner->read()); Figure 3.27: ReadSig Prc, enabling the multiplexers and counters. 41

47 CHAPTER 3. IMPLEMENTATION 3.3. MODELING THE LAURA ARCHITECTURE // more code omitted if(!full->read() ) for(int i = PIPELINE_STAGES; i >= 1; i--) delay_pipe[i] = delay_pipe[i-1]; delay_pipe[0] = ren->read(); Figure 3.28: Pipeline Prc, executes in parallel with pipeline from the Execute Unit. take N clock cycles before data is actually executed. Ones the control value reaches the end of the 1 2 // more code omitted een->write(delay_pipe[0] &&!full->read()); Figure 3.29: ExecuteSig Prc, enables the Execute Unit pipeline, i.e., PIPELINE STAGES, data is executed within the Execute Unit. The control signal, inside delay pipe[pipeline STAGES] is forwarded to wen. This activates the Write Unit which needs one clock cycle to write the data to the outgoing FIFOs // more code omitted wen->write( delay_pipe[pipeline_stages].read() &&!full->read() &&!donew->read() ); activatecounterwr->write( delay_pipe[pipeline_stages].read() &&!full->read() &&!donew->read()); Figure 3.30: WriteSig Prc, enables the Write Unit The waveform from Figure 3.31 shows the control three signals from the Control Unit. In this case we have a pipeline depth of five and a FIFO size of one. As one can see, the first write operation happens five clock cycles after the first execute operation. Also notice, that the execute operations are not efficient as there are holes in the execute signals. The IP Core could fire, but apparently there is no data. The reason for its inefficiency is the small FIFO size of one. Once the first token is read, the FIFO becomes empty. The waveform from Figure 3.32, has the same pipeline depth as in the waveform from Figure The FIFO sizes, however, are set to five. By increasing the FIFO size of self-loops we observe that the Execute Unit runs more efficiently as the Execute is constantly high i.e., at every clock cycle the Execute Unit is busy. There are, however, more factors that can influence the pipeline utilization. 42
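The delay between the read enable and the write enable can also be illustrated outside SystemC. The following is a minimal sketch in plain C++, not the thesis code: the names ControlSignals, ControlPipe and cyclesUntilWrite are hypothetical, the signals are modeled as plain booleans rather than SystemC signals, and it assumes that the complete shift, including the load of delay_pipe[0], is stalled while full is high.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical plain C++ stand-in for the control pipeline of the Control Unit.
struct ControlSignals {
    bool een;  // enable for the Execute Unit
    bool wen;  // enable for the Write Unit
};

class ControlPipe {
public:
    // A pipeline depth of N needs N + 1 slots: delay_pipe[0] .. delay_pipe[N].
    explicit ControlPipe(std::size_t stages) : delay_pipe_(stages + 1, false) {}

    // Advance the model by one clock cycle.
    ControlSignals step(bool ren, bool full, bool donew) {
        if (!full) {  // assumption: the whole shift stalls while full is high
            for (std::size_t i = delay_pipe_.size() - 1; i >= 1; --i) {
                delay_pipe_[i] = static_cast<bool>(delay_pipe_[i - 1]);
            }
            delay_pipe_[0] = ren;
        }
        const bool een = delay_pipe_.front() && !full;
        const bool wen = delay_pipe_.back() && !full && !donew;
        return ControlSignals{een, wen};
    }

private:
    std::vector<bool> delay_pipe_;
};

// Pulse ren in cycle 0 and return the cycle in which wen is first asserted,
// or -1 if that does not happen within `limit` cycles.
int cyclesUntilWrite(std::size_t stages, int limit) {
    assert(limit > 0);
    ControlPipe pipe(stages);
    for (int cycle = 0; cycle < limit; ++cycle) {
        const ControlSignals s = pipe.step(cycle == 0, false, false);
        if (s.wen) return cycle;
    }
    return -1;
}
```

In this simplified model, cyclesUntilWrite(5, 20) returns 5, which matches the five-cycle distance between the first execute and the first write operation described for a pipeline depth of five. The exact delta-cycle behavior of SystemC signals is not modeled.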

Figure 3.31: Waveform analysis depicting control signals from the Control Unit (signals: CLK, ren, een, wen).

Figure 3.32: Waveform analysis from the Control Unit for a streaming application (signals: CLK, ren, een, wen).

Evaluate Logic Unit

In an earlier section we discussed the Read Multiplexer. We have seen that the Read Multiplexer requires information from the logic component, which indicates when and how a read operation must occur. The Logic Unit tells from which FIFO to consume data.

    int k = registercounter0_sig.read().to_uint();
    int j = registercounter1_sig.read().to_uint();
    e0_sig.write(k-2 >= 0);
    e1_sig.write(1 >= 0);
    e2_sig.write(k-1 == 0);
    e3_sig.write(1 >= 0);
    e4_sig.write(j-2 >= 0);
    e5_sig.write(1 >= 0);
    e6_sig.write(j-1 == 0);
    e7_sig.write(1 >= 0);

Figure 3.33: Evaluate Process, evaluates the expressions.

The Logic Unit uses the current iteration (provided by the Counter Unit) to evaluate whether a token is needed in that particular iteration. It does this by evaluating the expressions depicted in Figure 3.33. These expressions are established by examining the constraint matrix within the KPN file [2]. Each expression pertains to a specific input port and is used to select the port from which to read the data. The Logic Unit evaluates the expressions within the evaluate process, as indicated in Figure 3.33. The control process, shown in Figure 3.34, indicates which input port is selected. The selection is hot bit encoded, i.e., per multiplexer, data is read from only one port.

    sc_bv<4> _ctrl;
    // ND_3IP_1
    _ctrl[0] = e0_sig && e1_sig;
    // ND_3IP_2
    _ctrl[1] = e2_sig && e3_sig;
    // ND_3IP_3
    _ctrl[2] = e4_sig && e5_sig;
    // ND_3IP_4
    _ctrl[3] = e6_sig && e7_sig;
    ctrl->write(_ctrl);

Figure 3.34: Control Process, selecting input ports.

Figure 3.35 shows that at iterations k = ... and j = ..., data is needed from port ND_3IP_1. Data is needed and will be read from port ND_3IP_2 at iterations k = 1 and j = .... Port ND_3IP_3 will be read from at iterations k = ... and j = .... Finally, at iterations k = ... and j = 1, data is read from port ND_3IP_4. The Evaluate Logic Unit propagates this as a control vector (ctrl) to the read multiplexers, indicating from which port data is needed. Every port corresponds to a particular FIFO, so the figure shows from which FIFOs data is needed.

Figure 3.35: Input port domains from iteration (axes: k and j; domains: ND_3IP_1, ND_3IP_2, ND_3IP_3, ND_3IP_4).
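The evaluate and control processes can be combined into one small function to see which ports are selected for a given iteration. This is a sketch in plain C++ under stated assumptions, not the generated IMASC code: the function names selectInputPorts and isOneHotPair are hypothetical, the two damaged comparisons of Figure 3.33 are read as equality tests (k-1 == 0 and j-1 == 0), and ports ND_3IP_1/ND_3IP_2 are assumed to belong to one read multiplexer and ND_3IP_3/ND_3IP_4 to a second one.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Evaluate the port expressions for iteration (k, j) and build the control
// vector. Bit 0 corresponds to ND_3IP_1, bit 3 to ND_3IP_4.
std::bitset<4> selectInputPorts(int k, int j) {
    const bool e0 = (k - 2 >= 0);
    const bool e1 = (1 >= 0);
    const bool e2 = (k - 1 == 0);  // assumed reading of the damaged expression
    const bool e3 = (1 >= 0);
    const bool e4 = (j - 2 >= 0);
    const bool e5 = (1 >= 0);
    const bool e6 = (j - 1 == 0);  // assumed reading of the damaged expression
    const bool e7 = (1 >= 0);

    std::bitset<4> ctrl;
    ctrl[0] = e0 && e1;  // ND_3IP_1
    ctrl[1] = e2 && e3;  // ND_3IP_2
    ctrl[2] = e4 && e5;  // ND_3IP_3
    ctrl[3] = e6 && e7;  // ND_3IP_4
    return ctrl;
}

// Hot bit check for one multiplexer that owns two adjacent control bits:
// exactly one of the two bits must be set.
bool isOneHotPair(const std::bitset<4>& ctrl, std::size_t first_bit) {
    return ctrl[first_bit] != ctrl[first_bit + 1];
}
```

For example, selectInputPorts(1, 1) sets the bits for ND_3IP_2 and ND_3IP_4, while selectInputPorts(3, 2) sets the bits for ND_3IP_1 and ND_3IP_3. For every k >= 1 and j >= 1, each assumed multiplexer pair has exactly one bit set.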

Counter Unit

In the previous section we have seen why the Logic Unit depends on the Counter. The Counter simply tells the Logic Unit which iteration it is in. To make this possible, the Counter (see Figure 3.36) has two output ports through which this information is propagated to the Logic Unit, i.e., iterator and registercounter. The Logic Unit sets the correct lower bound and upper bound values for the Counter Unit. Therefore the Counter Unit has two input ports called upperbound and lowerbound.

Figure 3.36: Counter (blocks: read multiplexers with input ports IP0 to IP3, Evaluate Logic, Counter Unit; signals: isneeded, rcntr, it, lb, ub, executeenable, done).

When the Counter has reached its upper bound, there are no more iterations to perform calculations on. The Counter signals this through port done_0. The Counter Unit contains another module, GenCounter, declared as _gencounter0. The actual counting and the check whether an upper bound is reached happen inside this module. Therefore the ports of the Counter Unit are mapped to module _gencounter0. There are two important processes within the GenCounter: RegisterSig_Prc and CounterSig_Prc. The count is initiated from the process RegisterSig_Prc, see Figure 3.37. When we start the simulation of our hardware modules, the first thing we do is reset all components to their lower bound. RegisterSig_Prc is sensitive to changes on rst, which means that when rst changes value, RegisterSig_Prc is executed. As we can see from Figure 3.37, the statement within the if clause is executed only once, writing the lower bound into the signal register_sig.

    if (rst->read())
        register_sig.write(lbound_sig.read());
    else if (en->read())
        register_sig.write(counter_sig.read());

Figure 3.37: RegisterSig_Prc: initiating the count.

The CounterSig_Prc process (see Figure 3.38) is sensitive to changes on register_sig and therefore executes every time register_sig changes. In the first part of the if-statement we check whether gencounter has reached its upper bound. If so, we put the count back to its lower bound, so that we can start counting again when needed. Otherwise, we go into the else clause and check whether the count is still below the upper bound, i.e., we are not done counting. If so, the value from register_sig is incremented by one and written into counter_sig. This triggers RegisterSig_Prc to register the new value, which in turn triggers CounterSig_Prc again.

    if (done_sig.read())
        counter_sig.write(lbound_sig.read());
    else if (counter_sig.read().to_uint() < ubound_sig.read().to_uint())
        counter_sig.write(register_sig.read().to_uint() + 1);
    else
        counter_sig.write(lbound_sig.read());

Figure 3.38: CounterSig_Prc: incrementing the count by one.
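The interplay between the registered value and the next count value can be condensed into a small untimed model. The sketch below is plain C++ and only illustrates the counting scheme: the class name GenCounterModel and its methods are hypothetical, the two SystemC processes are folded into a single tick, and done is assumed to be raised when the registered value equals the upper bound, since the derivation of done_sig is not shown in the figures.

```cpp
#include <cassert>

// Hypothetical untimed model of the counting scheme inside GenCounter.
class GenCounterModel {
public:
    GenCounterModel(unsigned lbound, unsigned ubound)
        : lbound_(lbound), ubound_(ubound), register_(lbound) {
        assert(lbound <= ubound);
    }

    // Corresponds to the rst branch: load the lower bound.
    void reset() { register_ = lbound_; }

    // Corresponds to one clock cycle: register the next value when enabled.
    void tick(bool en) {
        if (en) register_ = next();
    }

    unsigned value() const { return register_; }

    // Assumption: done means the upper bound has been reached.
    bool done() const { return register_ == ubound_; }

private:
    // Next count value: increment below the upper bound, otherwise wrap
    // around to the lower bound.
    unsigned next() const {
        return (register_ < ubound_) ? register_ + 1 : lbound_;
    }

    unsigned lbound_;
    unsigned ubound_;
    unsigned register_;
};
```

With bounds 1 and 3, the model counts 1, 2, 3 on enabled ticks, reports done at 3, and wraps back to 1 on the next enabled tick. A tick with en low leaves the value unchanged.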

3.4 Adopting the SystemC sc_fifo model

SystemC uses sc_fifo to model the behavior of a FIFO. We have used this SystemC FIFO to model the FIFO channels in our network of processors. By adopting the sc_fifo model, the network we generate is a mixed TLM/RTL network: the LAURA processor models are implemented at the RTL, but the connections between them are TLM. Using TLM FIFO models allows easy integration of heterogeneous networks, which we will see in Section 4.5. The sc_fifo has several predefined methods. Two of them are num_available() and num_free(). The first is used to check whether data is available in the FIFO. The second is used to check how many slots are free. We use these two functions in the read unit to update the exist signal and in the write unit to update the full signal. The two processes in which this is done are Exist_Prc and Full_Prc, residing in the top level of the hardware modules, see Figure 3.39 and Figure 3.40.

    void HW_ND_3::Exist_Prc()
    {
        sc_bv<4> _exist;
        for (;;) {
            if (ND_3IP_1_Din->num_available() == 0)
                _exist[0] = false;
            else
                _exist[0] = true;
            if (ND_3IP_2_Din->num_available() == 0)
                _exist[1] = false;
            // More code left out.
            existr_sig.write(_exist);
            wait(0.1, SC_NS);
        }
    }

Figure 3.39: Top-level process updating the exist signal.

These processes do not contain a sensitivity list, but instead are triggered every 0.1 ns. Normally in our design a clock cycle takes 2 ns. By triggering these processes every 0.1 ns we force the information to be available before the next rising clock edge occurs. We do the same for the Full_Prc process: the information on whether a FIFO is full is available in time, that is, before a token is written to one of the channels.

    void HW_ND_3::Full_Prc()
    {
        sc_bv<4> _full;
        for (;;) {
            if (ND_3OP_1_Dout->num_free() != 0)
                _full[0] = false;
            if (ND_3OP_1_Dout->num_free() == 0)
                _full[0] = true;
            // More code left out.
            full_sig.write(_full);
            wait(0.1, SC_NS);
        }
    }

Figure 3.40: Top-level process updating the full signal.
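How the exist and full vectors follow from the FIFO fill levels can be shown without a SystemC installation. The sketch below is plain C++ under stated assumptions: BoundedFifo is a hypothetical stand-in that only mirrors the sc_fifo method names num_available(), num_free(), nb_write() and nb_read(), and existBits and fullBits are hypothetical helpers that compute one snapshot of the vectors instead of polling every 0.1 ns.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical bounded FIFO that mirrors a few sc_fifo method names.
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t capacity) : capacity_(capacity) {}

    // Non-blocking write: fails when no slot is free.
    bool nb_write(int token) {
        if (num_free() == 0) return false;
        tokens_.push_back(token);
        return true;
    }

    // Non-blocking read: fails when no token is available.
    bool nb_read(int& token) {
        if (num_available() == 0) return false;
        token = tokens_.front();
        tokens_.pop_front();
        return true;
    }

    int num_available() const { return static_cast<int>(tokens_.size()); }
    int num_free() const { return static_cast<int>(capacity_ - tokens_.size()); }

private:
    std::size_t capacity_;
    std::deque<int> tokens_;
};

// Bit i is set when input FIFO i holds at least one token.
std::bitset<4> existBits(const std::vector<BoundedFifo>& inputs) {
    std::bitset<4> exist;
    for (std::size_t i = 0; i < inputs.size() && i < exist.size(); ++i) {
        exist[i] = (inputs[i].num_available() != 0);
    }
    return exist;
}

// Bit i is set when output FIFO i has no free slot left.
std::bitset<4> fullBits(const std::vector<BoundedFifo>& outputs) {
    std::bitset<4> full;
    for (std::size_t i = 0; i < outputs.size() && i < full.size(); ++i) {
        full[i] = (outputs[i].num_free() == 0);
    }
    return full;
}
```

With a FIFO size of one, a single write makes the FIFO both non-empty and full, so the corresponding exist and full bits are set together until the token is read again. This is the situation behind the holes in the execute signal discussed for Figure 3.31.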

Chapter 4

Experiments & Results

In this chapter, we discuss the applications we have applied to the SystemC hardware models. We present results on the validation and calibration of design points, and on how long the design and compilation take for a particular design. Of particular interest is the comparison of these results with the numbers available for a VHDL design. First we discuss how we set up the experiments.

4.1 Experiment Setup

To perform the experiments we used a system with the following specifications:

System: Intel(R) Pentium D CPU 3.40 GHz
Cache: 2048 KB
Memory: 2048 MB

For compiling, running and simulating the hardware implementations, we used:

Operating System: SUSE Linux (kernel ...)
Compiler: GCC version ... with optimization step -O2
Simulator: SystemC version ...

For comparison of the simulation times, all models were simulated under the same conditions, i.e., the time resolution was set to 0.1 ns and the clock period to 2 ns. Simulation ends when all processors are done executing. That is why we did not provide a finite simulation time to the simulator. Instead, we wrote a simple test bench that waits for all processors to finish execution.

    // Code omitted
    // Constructor
    NetworkObserver::NetworkObserver(sc_module_name mn)
        : sc_module(mn), time(0, SC_NS), period(2, SC_NS)
    {
        SC_METHOD(getCycles_Prc);
        sensitive << clk.pos();
        dont_initialize();
    }

    // Stop simulation when all processors are done executing.
    void NetworkObserver::getCycles_Prc()
    {
        if (done_1->read() ... done_n->read()) {
            time = sc_time_stamp();
            cycles = time / period;
            cout << "Network finished at: " << time << endl;
            cout << "clock period: " << period << endl;
            cout << "Cycles: " << cycles << endl;
            sc_stop();
        }
    }

Figure 4.1: Simple test bench: network observer.

4.2 Applications

This section explains the applications for which we generated the hardware components in SystemC. All the SystemC networks have been validated using traces. This means that the input/output behavior of the Matlab code and of the network representation in SystemC are equivalent. We look at what the generated network looks like for each application and discuss some of the special features.

Show (featrier)

Show is an artificial application. We used it to validate the behavior of the hardware modules in SystemC. The hardware modules are generated by providing the sequential Matlab code, see Figure 4.2, to the tool flow as discussed in Section 3.1. The tool also provides the corresponding KPN depicted in Figure 4.3.

    for i = 1 : 1 : 10,
        [ a(i) ] = init;
    end
    for i = 1 : 1 : 5,
        for j = 1 : 1 : 5,
            [ a(i+j) ] = f(a(i+j));
        end
    end

Figure 4.2: Sequential Matlab code: show example.

The show example, given in Figure 4.2, has a for-loop containing an init function, and a doubly nested for-loop with the function f(a(i+j)). For iterations 1 to 10, the function init produces tokens into array a, which corresponds to ED_2 in Figure 4.3. Within the doubly nested for-loops, a token is written five times for each iteration of the outer loop to array a(i+j), corresponding to ED_1. Once a token has been read and written to ED_1 at iteration (3, 4), and later on it needs to be read at iteration (4, 3), the read happens from the self-loop ED_1.

Figure 4.3: Kahn process network of show (nodes: ND_1 init, ND_2 f; edges: ED_1, ED_2).

Sobel Edge Detection

Sobel edge detection is a common image processing operation. It implements a gradient measurement on an image, giving the direction of the largest possible increase from light to dark and the rate of change in that direction. In the resulting image the edges will be shown intensified, since

the sudden changes in lighting are typically found at those regions in an image.

Figure 4.4: Original image (left) and the result of the Sobel operation applied to it (right).

Sobel can be used to find the approximate absolute gradient magnitude at each point in an input grayscale image. Sobel edge detection is based on convolving the image with a small, separable, integer-valued filter in the horizontal ($J_x$) and vertical ($J_y$) directions. The actual filter is a pair of 3x3 convolution masks, one producing $J_x$ and one producing $J_y$ from the original image, which is denoted by $I$. For each pixel, the magnitude of the gradient can be computed using the formula:

$$G = \sqrt{J_x^2 + J_y^2}$$

An approximate magnitude can be calculated using:

$$|G| = |J_x| + |J_y|$$

For our experiments we have used the Sobel edge detection application as described above. The application was provided in Matlab as a sequential loop program (see Figure 4.6). By providing the Matlab code to COMPAAN, as described in Section 3.1, we obtain a KPN as depicted in Figure 4.5. In order to validate and calibrate our design we used a grayscale image, as depicted in Figure 4.4, with width set to 280 and height to 200. The pipeline depth for the gradient and absolute value calculations has been set to 3.

Figure 4.5: KPN for Sobel Application (nodes: _Read_m, Copy, two Jc nodes, Sb, _Write_m; edges: ED_1 to ED_16).

    for j = 1:1:M,
        for i = 1:1:N,
            [ a(j,i) ] = _Read_m();
            [ image(j,i) ] = Copy(a(j,i));
        end
    end
    for j = 2:1:M-1,
        for i = 2:1:N-1,
            [ Jx(j,i) ] = Jc( image(j-1,i-1), image(j,i-1), image(j+1,i-1),
                              image(j-1,i+1), image(j,i+1), image(j+1,i+1) );
        end
    end
    for j = 2:1:M-1,
        for i = 2:1:N-1,
            [ Jy(j,i) ] = Jc( image(j-1,i-1), image(j-1,i), image(j-1,i+1),
                              image(j+1,i-1), image(j+1,i), image(j+1,i+1) );
        end
    end
    for j = 2:1:M-1,
        for i = 2:1:N-1,
            [ Sbl(j,i) ] = Sb( Jx(j,i), Jy(j,i) );
        end
    end
    for j = 2:1:M-1,
        for i = 2:1:N-1,
            [ Sink(j,i) ] = _Write_m( Sbl(j,i) );
        end
    end

Figure 4.6: Sequential Matlab code: Sobel edge detection.

QR-Decomposition

The QR-Decomposition factors an M x N matrix A into an orthogonal matrix Q and an upper triangular matrix R as A = QR. During the factorization the matrix A is reduced to an upper triangular matrix by elementary reflection transformations, which zero all the elements of a column that are located below the matrix diagonal. The QR-Decomposition is essential for solving the least squares minimization problem, where it is desired to fit a linear mathematical model to measurements obtained from experiments. QR-Decomposition finds its application in space and time adaptive filtering [14] and beam forming [15] tasks. QR-Decomposition can be computed using various methods. For the purpose of our experiments we used two Matlab implementations (Figure 4.7) of QR using Givens rotations, namely QR implemented with lookup tables (QR LUT) and QR implemented with Taylor approximations (QR TA). For QR LUT the parameters N and K are set to 7 and 21. In order to simulate the delays of the vectorize and rotate functions, the pipeline depth is set to 10 for vectorize and to 4 for rotate. For QR TA the parameters are equal to those of QR LUT; the pipeline depths are set to 19 for vectorize and 10 for rotate. The resulting KPN is depicted in Figure 4.8.

    for j = 1:1:N,
        for i = j:1:N,
            [r(j,i)] = ReadMatrix_Zeros_64x64();
        end
    end
    for k = 1:1:K,
        for j = 1:1:N,
            [x(k,j)] = Read();
        end
    end
    for k = 1:1:K,
        for j = 1:1:N,
            [r(j,j), x(k,j), t] = Vectorize( r(j,j), x(k,j) );
            for i = j+1:1:N,
                [r(j,i), x(k,i), t] = Rotate( r(j,i), x(k,i), t );
            end
        end
    end
    for j = 1:1:N,
        for i = j:1:N,
            [ Sink(j,i) ] = Pass( r(j,i) );
        end
    end

Figure 4.7: Sequential Matlab code: QR decomposition.

Figure 4.8: KPN of QR Decomposition (nodes: ND_1 ReadMatrix_Zero_64x64, ND_2 Read, ND_3 Vectorize, ND_4 Rotate, ND_5 Pass; edges: ED_1 to ED_12).

Motion JPEG

M_JPEG, which stands for motion JPEG, is a video compression technique based on JPEG image compression. The JPEG image compression algorithm is applied to each separate frame of the video. M_JPEG is often used in mobile devices and digital content applications, e.g., to transform analog video to a digital format or to access every separate frame when editing the video. We used the M_JPEG algorithm implemented as sequential Matlab code. The functions are provided as IP cores to the execute units of processors a and b, see Figure 4.9. We carried out two experiments in which the pipeline depth differs. In the first experiment the pipeline depth for DCT and Quantizer was set to one, which is not realistic. A real implementation always shows some delay, which is why in the second experiment the pipeline depths of DCT and Quantizer were set to 94 and 27. In both experiments the number of frames, the height (VNumBlocks) and the width (HNumBlocks) were set to 8, 16 and 8, respectively, and each block consisted of 8x8 pixels.

Figure 4.9: KPN of Motion JPEG.

4.3 Results

In this section we present the results obtained from our experiments. We have done two kinds of experiments: calibration, and measuring the design and compilation times.

Calibration

After validation of the applications, as discussed in Section 4.2, the question is whether the simulations are accurate enough compared to the real hardware implementations from VHDL. Therefore, we have calibrated three applications, QRvr_B(LUT), QRvr_B(TA) and Sobel edge detection, against results available from previous research [16]. Table 4.1 shows how long it takes to simulate an application on the SystemC hardware models. The simulation time is measured in clock cycles. Table 4.2 lists the execution time of the applications on a real hardware implementation described in VHDL. This execution time is also measured in clock cycles.

Experiment  | Parameters                             | Pipeline depth         | Clock cycles
M_JPEG 1    | NumFrames:8 VNumBlocks:16 HNumBlocks:8 |                        |
M_JPEG 2    | NumFrames:8 VNumB:16 HNumB:8           | DCT:94 Q:27            |
QRvr        | N:5 K:7                                | Vectorize:1 Rotate:1   | 227
QRvr_B(LUT) | N:7 K:21                               | Vectorize:10 Rotate:4  |
QRvr_B(TA)  | N:7 K:21                               | Vectorize:19 Rotate:10 |
Sobel       | W:280 H:200                            | Gradient:3 absval:3    |
Show        | N/A                                    | pipelines:1            | 42

Table 4.1: Simulation with SystemC.

We have done two experiments with M_JPEG, indicated as M_JPEG 1 and M_JPEG 2 in Table 4.1. In the first experiment, all the processes had a pipeline depth of 1. This does not simulate the delays that would occur when running a real implementation of M_JPEG. Therefore we have chosen more realistic pipeline depths for our second experiment, M_JPEG 2, see Table 4.1. The process DCT was given a pipeline depth of 94 and the process Q a pipeline depth of 27. What we notice is an increase of 291 cycles, which appears to be a lot. We also performed three experiments using QR-Decomposition with different parameters and pipeline depths. In the QRvr experiment all processes have a pipeline depth of one. Of interest are QRvr_B(LUT) and QRvr_B(TA), which have the same parameter settings. When the pipeline depths are increased, however, we notice that the number of clock cycles before simulation ends is much larger. Sobel takes parameters that indicate the width and height of the image given as input to the application. From the parameters we see that the image has a width of 280 and a height of 200. The pipeline depths for the processes gradient and absval are set to 3. Notice that, of all the applications in Table 4.1, it uses the most clock cycles. In the show application all processes have a pipeline depth of one. Show was simply used as a test case; no comparison material was available for it.

The differences between the execution times are shown in Table 4.2 and visualized in Figure 4.10. Execution of Sobel takes minus ... %. This means that the SystemC simulation is ... % faster than the VHDL hardware implementation. For QRvr_B(LUT) we notice that the execution with VHDL is 0.39 percent faster than the simulation with SystemC. QRvr_B(TA) also shows a faster execution time, by 0.21 %, compared with the SystemC simulation.

Experiment  | Parameters  | Pipeline depth         | Clock cycles | Diff. Perc.
QRvr_B(LUT) | N:7 K:21    | Vectorize:10 Rotate:4  |              |
QRvr_B(TA)  | N:7 K:21    | Vectorize:19 Rotate:10 |              |
Sobel       | W:280 H:200 | Gradient:3 absval:3    |              |

Table 4.2: Execution with VHDL.

Thus, we observe that the execution times of the simulation and of the real hardware implementations are very close to one another. We may assume that this will also be the case for the remaining applications from Table 4.1. This means that we can simulate the real hardware implementation from VHDL very accurately with our SystemC hardware models.

Design & Compilation Times

By looking at the design and compilation times, we observe how long it takes to evaluate a design point. There are several design points we take into consideration:

KPN-to-SystemC: Converting a Kahn Process Network specification into a SystemC hardware implementation. This is equivalent to generating a VHDL & ISE script within the VHDL environment.

SC-to-EXE: Compiling the generated SystemC hardware implementations to an executable specification, which is the same as compiling VHDL to an ISim executable.

Run EXE: Running the executable specification, i.e., simulating the design. This is the same as running the ISim executable.

Running an executable can be done in two ways, which we have discussed in Section 3.1: with traces on or off. When the traces are on we observe a significant increase in the time it takes to simulate designs, see Table 4.3 and Figure 4.10.

Experiment  | Avg. time w/o traces (s) | Avg. time w. traces (s)
M_JPEG 1    |                          |
M_JPEG 2    |                          |
QRvr        |                          |
QRvr_B(LUT) |                          |
QRvr_B(TA)  |                          |
Sobel       |                          |
Show        |                          |

Table 4.3: Timing experiments with traces on & off.

When the traces are turned on, changes that occur to a signal are written to a trace file. This I/O operation is expensive, especially when there are many hardware modules, because signal changes are then written to more than one file, one for every hardware module. A hardware module may also have multiple input arguments (for the execute unit). In that case it has more than one read multiplexer, each having its signal changes written to a trace file. So, the design complexity of a SystemC hardware module may also affect simulation time when the traces are turned on.

Design point   | Sobel SystemC (s) | Sobel VHDL (s)
KPN-to-SC      |                   |
SC-to-EXE      |                   | 49
Run EXE w/tr   |                   | 77
Run EXE w/o tr |                   | N/A
Tot. w/tr      |                   |
Tot. w/o tr    |                   | N/A

Table 4.4: Comparison of Sobel design times between SystemC and VHDL.

Design point   | Show (s) | Sobel (s) | QRvr (s)
KPN-to-SC      |          |           |
SC-to-EXE      |          |           |
Run EXE w/tr   |          |           |
Run EXE w/o tr |          |           |
Tot. w/tr      |          |           |
Tot. w/o tr    |          |           |

Table 4.5: Design times: Show, Sobel and QR Decomposition.

In Table 4.4 and Figure 4.10 one can find the results obtained from measuring the design times for Sobel with SystemC and VHDL. Within VHDL, traces cannot be turned on or off; they are automatically generated by the VHDL simulation. Notice that SystemC is much faster in generating the hardware modules and compiling them to an executable. With traces turned on, however, the SystemC hardware model is slower than VHDL. Still, with SystemC we already have an executable specification describing the model's behavior, an advantage that outweighs the 16 seconds more it takes to complete. When the traces for the SystemC hardware models are turned off, the timing experiment for Sobel clearly shows that SystemC is much faster than VHDL. In Tables 4.5 and 4.6 one can find the timing experiments for which we did not have comparison material. Nevertheless, the results in these tables can be used in future research.

Design point   | QRvr_B(LUT) (s) | QRvr_B(TA) (s) | M_JPEG (s)
KPN-to-SC      |                 |                |
SC-to-EXE      |                 |                |
Run EXE w/tr   |                 |                |
Run EXE w/o tr |                 |                |
Tot. w/tr      |                 |                |
Tot. w/o tr    |                 |                |

Table 4.6: Design times: QRvr-LUT, QRvr-TA and Motion JPEG.

Figure 4.10: Comparing design times and calibrating designs (above), and execution with traces on/off (below). Panels: "Comparing design times of SystemC and VHDL using Sobel Edge Detection" (design points KPN-to-SC, SC-to-EXE, Run model and Total against time in seconds, for Sobel SystemC and Sobel VHDL); "Calibration of Designs from SystemC and VHDL" (clock cycles for QRvr_B(LUT), QRvr_B(TA) and Sobel, SystemC versus VHDL); and two panels "Simulation with traces on and off" (QRvr, QRvr_B(LUT) and QRvr_B(TA); M_JPEG1, M_JPEG2, Sobel and Show; averages with and without traces).
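The calibration compares cycle counts of the SystemC simulation with those of the VHDL implementation as a percentage difference. The sketch below shows one way such a difference could be computed in plain C++. It is an illustration under assumptions, not the procedure used for Table 4.2: the function names are hypothetical, the difference is assumed to be taken relative to the VHDL cycle count, and a negative value is assumed to mean that the SystemC simulation needs fewer cycles.

```cpp
#include <cassert>
#include <cmath>

// Percentage difference of the SystemC cycle count relative to the VHDL
// cycle count. Negative: SystemC needs fewer cycles (assumed convention).
double cycleDifferencePercent(long systemc_cycles, long vhdl_cycles) {
    assert(vhdl_cycles > 0);
    return 100.0 * static_cast<double>(systemc_cycles - vhdl_cycles) /
           static_cast<double>(vhdl_cycles);
}

// Accuracy expressed as 100 % minus the absolute percentage difference.
double accuracyPercent(long systemc_cycles, long vhdl_cycles) {
    return 100.0 - std::fabs(cycleDifferencePercent(systemc_cycles, vhdl_cycles));
}
```

With purely hypothetical cycle counts of 1010 for SystemC and 1000 for VHDL, the difference is 1 % and the accuracy 99 %. Differences of 0.39 % and 0.21 %, as reported for the QR experiments, correspond to accuracies above 99 % under this definition.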

4.4 Design Space Exploration

In our design space exploration experiment we were able to easily run many different implementations within a reasonable amount of time. We used the QR application and its skewed version to explore different pipeline depths. The results are laid out in Figures 4.12 and 4.13, and show how the individual pipeline depths influence the efficiency of the network. A simple shell script was written to modify the pipeline depths of the vectorize and rotate nodes within these applications (see Figure 4.11). Pipeline depths ranging from 1 to 39 were explored for both the vectorize and rotate nodes.

    #!/bin/bash
    for (( i = 1 ; i<40; i++ ))
    do
      if [ -e HW_ND_3 ] # Vect
      then
        echo "\$i..."
        cd HW_ND_3
        sed -i -r -e "s/static const int PIPELINE_STAGES = [0-9]*;/static const in
        sed -i -r -e "s/static const int PIPELINE_STAGES = [0-9]*;/static const in
        cd ..
      fi
      for (( j = 1; j<40; j++ ))
      do
        if [ -e HW_ND_4 ] # Rot
        then
          cd HW_ND_4
          sed -i -r -e "s/static const int PIPELINE_STAGES = [0-9]*;/static const
          sed -i -r -e "s/static const int PIPELINE_STAGES = [0-9]*;/static const
          cd ..
          cd Debug
          make all
          echo -n "$i $j " >> "designse.dat"
          ./tqrvr_ds | grep "Cycles:" | sed "s/Cycles: //" >> "designse.dat"
        fi
      done
    done
    cd ..

Figure 4.11: Shell script changing pipeline depths of the Vectorize and Rotate nodes.

We have conducted a total of 1521 (39x39) experiments for each of the applications. We approximated the amount of time needed for a single experiment by dividing the total simulation time (about 2385 minutes) by the number of experiments (1521). The time per experiment is thus about 1.57 minutes, i.e., roughly 1 minute and 34 seconds. As we have seen from the experiments concerning the design times (see Section 4.3), QR design times are faster in SystemC than in VHDL. Thus, exploring different implementations using our SystemC models will be faster than exploration with VHDL models. Using SystemC we can get faster feedback or explore even more designs within a given time. Because we are interested in performance numbers, the traces were turned off.

Figure 4.12: Exploring different implementations of QR (axes: vectorize, rotate).

In Figure 4.12 we observe how the network performance (z-axis) correlates with different pipeline depths of the vectorize and rotate nodes. What we notice is that the rotate node has a greater influence on the network performance than the vectorize node. The surface is slightly curved, which indicates that at a certain point the effect of the pipeline depths on the performance changes. This may occur when the pipeline is not fully utilized. Figure 4.13 shows the skewed version of QR. We clearly see a sudden change in performance between 0 and 5 on the rotate axis. Afterwards the performance decreases smoothly as the pipeline depth of the rotate node is increased. From Figure 4.13, the skewed version, we notice that in the beginning it does not matter if you increase the pipeline depth of vectorize by one clock cycle. In Figure 4.12, where QR is not skewed, we clearly observe that increasing the pipeline depth of vectorize by one clock cycle is very expensive.

Figure 4.13: Exploring different implementations of QR Skewed (axes: vectorize, rotate).

4.5 Heterogeneous Networks

An advantage of IMASC producing SystemC code is that it allows simulation of heterogeneous networks. IMASC provides the generated hardware modules as a library of hardware accelerators, see Figure 4.14. These are the SystemC hardware components generated for a particular application, e.g., Motion JPEG or Sobel. The hardware components available from the library can be connected to a network of processors from a different platform.


Figure 4.14: Heterogeneous SystemC Networks (ESPAMSC and IMASC; HPs connected through Unix sockets to a SystemC interface; LWPs simulating MicroBlaze processors that run C code on an ISS; hardware accelerators ND_1, ND_2 and ND_3 implemented at SystemC RTL level, simulating the LAURA architecture; communication through SC_FIFO channels).

With the ESPAMSC design flow [17] one is able to generate microprocessors for each process in a KPN. Instead of making a detailed simulation of the microprocessor, an Instruction Set Simulator (ISS) is used. Current work by [18] allows the HPs to communicate through UNIX sockets with the SystemC environment. The SystemC environment acts as an interface to the HPs. Each UNIX socket connects an HP to what is called a lightweight process (LWP). An LWP is a SystemC module that simulates the HP to which it is connected by means of a socket. Having LWPs simulate the HPs within the SystemC environment enables certain measurements to be done on the HPs. So, it
