DTNS: a Discrete Time Network Simulator for C/C++ Language Based Digital Hardware Simulations


KIMMO KUUSILINNA, JOUNI RIIHIMÄKI, TIMO HÄMÄLÄINEN, and JUKKA SAARINEN
Digital and Computer Systems Laboratory
Tampere University of Technology
Hermiankatu 12 C, FIN Tampere
FINLAND

Abstract: - This paper introduces a way to enhance digital system design, and multimedia hardware design in particular, through high-level discrete time system simulations. Towards the same end, a bus-based interconnection architecture is utilized for intellectual property interfacing. The emphasis is on architecture design, design space exploration, and hardware/software co-simulation. A Discrete Time Network Simulator (DTNS) is described as a novel extension to current simulators. DTNS simplifies many aspects of architecture exploration and supports technology optimization for continuous-media applications. Design re-use is facilitated by the use of a common programming language, and data transmission efficiency by the use of the Heterogeneous IP Block Interconnection (HIBI) scheme. The fixed-time-increment, discrete-time approach allows rapid model development and fast simulations. Long simulations can be run to obtain statistical performance information. A simulation based on a video encoder (H.263) is used to demonstrate these capabilities.

Key-Words: - Simulation, Modeling, Intellectual Property, Computer Aided Design, and Multimedia.

1 Introduction

The paradigm shift from desktop applications to real-time, continuous-media, mobile computing is challenging contemporary digital design practices [6]. Shrinking device sizes drive whole electronic systems into a single integrated circuit (IC). These solutions are called systems-on-a-chip (SoC). To cope with the complexity and time-to-market constraints, some design objects are prepared for reuse and called intellectual property (IP). These IP blocks can be shared within an organization or obtained from external sources.
Architecture exploration is a laborious task with current design tools. The most important decisions during the design process are made at the highest levels. However, acquiring reliable information to form a basis for these decisions is difficult. The traditional system simulation tools for signal processing applications use heavy abstractions for communication between functional blocks. These abstractions are fine for algorithm design but cannot be considered valid for hardware design in systems where the communication between blocks forms a significant portion of the complexity of the design. In addition, the lack of hardware models for processors and interconnection architectures appears to be a critical deficiency.

Contemporary design tools have many other weak points as well. If detailed models are used, simulation times for complex designs can easily become prohibitive. Hardware description languages (HDLs), such as VHDL (VHSIC Hardware Description Language), and their simulators are a representative example of simulation environments that can represent designs in the necessary detail but may be too slow for the architectural design phase. In addition, their expressive power is quite limited compared to contemporary object-oriented programming languages such as C++ or Java. Thus, describing architectures and their communications can be very time consuming.

To solve the aforementioned problems, one design tool or a specific technology is not enough. Thus, we propose an interconnection-based design flow. We propose an interconnection, the Heterogeneous IP Block Interconnection (HIBI), which is capable of exploiting the predictable nature of continuous-media processing [7]. Therefore, HIBI should be able to provide the same throughput with fewer signal lines than conventional bus architectures. In addition, we propose a tool for architectural exploration and interconnection parameter optimization: the Discrete Time Network Simulator (DTNS).
DTNS is a C (C++) language based extension to current simulators. It is intended for the highest level of design and it exploits the fact that many contemporary multimedia algorithms are specified in the C language, thus bridging the often-encountered tool flow gap between specifications and the first behavioral implementation. In addition, by using standardized interconnection architectures from a library of such components, this tool facilitates design reuse. DTNS is based on the discrete time simulation methodology, which enables fast simulations while retaining the correct logical operation of the interconnection structure. Furthermore, we describe a top-down, simulation-based design flow for SoC designs.

The paper is organized as follows. Section 2 describes the design flow that is required for efficient use of simulations in SoC designs. In Section 3 the DTNS simulation details are explained. Section 4 applies the Discrete Time Network Simulator with the HIBI bus to an H.263 video encoder. Finally, in Section 5 the conclusions are given.

2 Simulation Based Design Flow

There is no single way to classify hardware/software simulators. Our classification is based on [1] and [5], and uses the design abstraction level as the main distinguishing factor. It is usually possible and desirable to mix components from different abstraction levels to perform multi-level simulations. During different phases of the design flow, different simulation levels are used. Notably, non-trivial designs need to be simulated at several levels. Contemporary simulation environments usually differ from one level to another. Therefore, to avoid writing completely different descriptions for each simulation level, a flexible and continuous tool flow is necessary.

Statistical simulations describe the problems as high-level mathematical models. Simulators of this type are useful for validating generic concepts. They do not necessarily provide any detailed information about a specific system. Despite the lack of exact information, the growing complexity of SoC designs requires designers to adopt design tools that operate at this level.
However, most simulators dealing with data transfers are so-called network simulators. These simulators are mainly intended for multicomputer or multiprocessor simulations, which limits their usability for embedded designs. Simulators at this level typically rely on statistical distributions, and some generic simulators are heavily biased towards solving differential equations. This may be useful in modelling the complex physical systems in which the embedded design must operate.

Data flow simulations are very similar to statistical simulations. However, their data flow is based on the actual application run on the system and not just on statistical models of system inputs. This approach is more suitable for embedded designs because the special nature of custom computational nodes can be better accommodated. The downside is that such systems can no longer be conveniently analyzed by analytical means.

Algorithm level simulations are essentially simulations of the specifications or simulations of behavioural implementations of specifications. Communication between different components is usually abstracted as streams of information in a communication channel. The division between hardware and software parts need not have been performed. Typically, very little, if any, timing information is available.

At instruction level, the design has been divided into hardware and software parts. The software part is simulated in an Instruction Set Simulator (ISS), which, at this point, typically uses very little timing information. The hardware is either still at algorithm level or described in a behavioural hardware description language (HDL). If component communications are deemed less important, they can still be very abstract. Otherwise, communications can be described in HDL. Approximate instruction order and clock cycle based timing information may be available.

Architecture level simulations describe the functionality of the hardware in detail.
This requires the design to be specified at least in behavioural HDL. Intellectual property blocks may have been instantiated. The communication between components is also specified at the corresponding level. Thus, relatively accurate clock cycle based timing information should be available.

Register transfer level (RTL) describes hardware that can be synthesised. Timing information is available in terms of clock cycles. Simulation times for large systems are very long.

We can see four major problems with this traditional design flow.
(1) The lack of timing information in algorithm level descriptions, even though it may be present in the original specifications.
(2) The design description and tool gap between the specification / algorithm level and the instruction / architecture level implementation.
(3) The abstraction of inter-component communications to communication streams. Due to the difficulty of describing communication structures and the prohibitive simulation times, the possibility for architectural exploration is limited to the highest abstraction levels. However, the communication stream abstraction makes the performance evaluation of different architectures difficult. This criticism is directed at dedicated, embedded applications, where statistical load estimates cannot be considered valid.
(4) Simulation times with the more accurate simulators may become prohibitive. Thus, architectural design must be performed at higher levels, typically with less information than would be available at the lower abstraction levels.

3 Discrete Time Network Simulator

The Discrete Time Network Simulator can be used as a stand-alone simulator. However, it is mainly a C language function library intended to support synchronous interconnection design and analysis. DTNS supplements any C or C++ language based simulator. The actual simulator provides the framework for the simulations. In DTNS, all signal transitions are tied to the system clock, making it a fixed time increment, or time driven, simulator. If used with a suitable simulator, this software may support event driven simulation mechanics. The discrete time and fixed time increment approximations are made to keep the simulation update cycle uncomplicated and, thus, faster. Even so, these approximations are accurate enough for many logical simulations in systems based on a global clock.

The framework for using DTNS in a stand-alone configuration is depicted in Fig. 1. The DTNS main function can also be viewed as a top-level testbench. It is responsible for keeping the simulation running and instantiating the simulation modules. A typical DTNS simulation consists of an interconnection model, one or more agents for the interconnection, a monitor object to track and preprocess simulation variables, and a collection of assorted support functions. The support functions perform, for example, translations between hexadecimal and binary numbers. All signalling in the interconnections is based on a signal model analogous to the IEEE std_logic signal type in VHDL.
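As an illustration of a multi-valued signal model of this kind, the following C sketch resolves several drivers on one wire. It is not taken from DTNS: the type sig_t, its value names, and the functions sig_resolve2 and sig_resolve are hypothetical, and the four values used here are a reduced subset of the nine std_logic values, chosen only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced four-value signal type (assumption: the real std_logic type
   has nine values; only driven 0, driven 1, high impedance, and
   conflict/unknown are modeled here). */
typedef enum { SIG_0, SIG_1, SIG_Z, SIG_X } sig_t;

/* Resolve two drivers of the same wire. */
static sig_t sig_resolve2(sig_t a, sig_t b)
{
    if (a == SIG_Z) return b;                    /* undriven side defers */
    if (b == SIG_Z) return a;
    if (a == SIG_X || b == SIG_X) return SIG_X;  /* unknown propagates */
    return (a == b) ? a : SIG_X;                 /* 0 against 1: conflict */
}

/* Resolve the outputs of n agents that all drive the same wire. */
static sig_t sig_resolve(const sig_t *drivers, size_t n)
{
    sig_t result = SIG_Z;                        /* nobody driving yet */
    assert(drivers != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        result = sig_resolve2(result, drivers[i]);
    return result;
}
```

With this sketch, a wire driven by one agent while the others output SIG_Z resolves to that agent's value, whereas two agents driving opposite values resolve to SIG_X, which a testbench could report as a bus conflict.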
Properly designed interconnection models, agents, and other functions are re-usable in other designs, and they form the DTNS Library.

Fig. 1. DTNS simulation flow. (Diagram labels: main() with simulation control and instances of the interconnection architecture, interface blocks, interface monitor, and agents; DTNS Library containing interconnection models, agent models, mathematical functions, and other support functions; input vectors; agent monitor objects; outputs of interface signals to a wave window and interface statistics to mathematical software; protocol check and analysis.)

Pseudocode for a DTNS main() function is shown in Fig. 2. First, the interconnection model is defined and initialized. The actual simulation occurs in a loop that is executed for the specified number of simulation cycles. One iteration of the loop corresponds to half a clock cycle in the system. In the loop, the control for the simulation is first read from a file. This control can take the form of, for example, interconnection values. The current values for the interconnection are resolved. Then, all agents are run with the current interconnection values. The agents thus produce the interconnection values for the next phase. The values from all agents are combined and the final signal values are resolved. The multi-type logic signals can conveniently be used to catch errors, such as multiple agents driving the interconnection at the same time. The clock phase count is advanced by one. Current interconnection information is saved into files. Finally, in the loop, the next phase values are assigned as the current interconnection values. When all the simulation loops are done, statistical information gathered during the simulation is saved to a file. A methodology for describing multiple clock signals, should they be needed, is briefly discussed in [4].

    main()
      initialize interconnection
      do
        read testvectors and control
        resolve current cycle interconnection
        for (each interconnection agent) {
          run agent with current values
        }
        resolve next cycle interconnection
        increment clock phase count by one
        output interconnection info
        assign current values <= next values
      while (specified simulation cycles)
      output info from simulation run

Fig. 2. Pseudocode for the DTNS main() function.

The inputs to the simulation are either prepared testvectors or testbench modules that provide values based on the information they are given. Typically the testvectors provide the control for the simulation and the modules provide the bulk of the data transferred on the interconnection. The data and its arrival rate can be based on statistical distributions or directly on the actual application being simulated.

Being a fully programmable simulator, DTNS does not limit the types of outputs available from the simulation. However, we have mainly used three different kinds of outputs. The main outputs are the signals in the interconnection. These are collected to a separate file that can be viewed, for example, with a standard VHDL waveform viewer. The monitor object collects data about the simulation that is mainly statistical in nature. For example, idle cycles in the interconnection, ratios between read and write transactions, utilization, and throughput can be calculated.
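The loop and the monitor statistics described above can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not the DTNS code: the bus is reduced to a single integer word, agents are plain function pointers, and the names sim_run, agent_fn, monitor_t, monitor_utilization, and the two example agents are all hypothetical. File input and waveform output are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define BUS_IDLE     (-1)  /* no agent drives the bus in this phase */
#define BUS_CONFLICT (-2)  /* more than one agent drives the bus */

/* Hypothetical agent interface: sees the current bus value and returns
   the non-negative word it drives for the next phase, or BUS_IDLE. */
typedef int (*agent_fn)(int current_bus, unsigned long phase);

typedef struct {
    unsigned long phases;     /* clock phases simulated */
    unsigned long idle;       /* phases in which nobody drove the bus */
    unsigned long conflicts;  /* phases with several simultaneous drivers */
} monitor_t;

/* Fixed time increment loop: all agents read the same current value,
   their outputs are resolved into the next value, then next becomes
   current. */
static monitor_t sim_run(const agent_fn *agents, size_t n_agents,
                         unsigned long n_phases)
{
    monitor_t mon = {0, 0, 0};
    int current = BUS_IDLE;
    assert(agents != NULL || n_agents == 0);
    for (unsigned long phase = 0; phase < n_phases; ++phase) {
        int next = BUS_IDLE;
        for (size_t i = 0; i < n_agents; ++i) {
            int out = agents[i](current, phase);
            if (out == BUS_IDLE) continue;
            next = (next == BUS_IDLE) ? out : BUS_CONFLICT;
        }
        if (next == BUS_IDLE) mon.idle++;
        else if (next == BUS_CONFLICT) mon.conflicts++;
        mon.phases++;
        current = next;            /* assign current values <= next values */
    }
    return mon;
}

/* Fraction of phases in which exactly one agent transferred a word. */
static double monitor_utilization(const monitor_t *m)
{
    if (m->phases == 0) return 0.0;
    return (double)(m->phases - m->idle - m->conflicts) / (double)m->phases;
}

/* Two illustrative agents: one drives on every even phase, the other on
   every fourth phase, so they collide on phases 0, 4, 8, ... */
static int agent_even(int current_bus, unsigned long phase)
{
    (void)current_bus;
    return (phase % 2 == 0) ? (int)phase : BUS_IDLE;
}

static int agent_every_fourth(int current_bus, unsigned long phase)
{
    (void)current_bus;
    return (phase % 4 == 0) ? 100 : BUS_IDLE;
}
```

Running both example agents for eight phases yields four idle phases, two conflicts, and a utilization of 0.25, the kind of figures a monitor object could hand over for further analysis.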
The monitor's statistics can be passed on to mathematical software for analysis and graphical representation. The third possibility is to gather detailed data from the operation of the agents in the simulation. This information is very application specific and can be used for debugging and fine-tuning the performance of the interconnection network.

The nature of the agent descriptions is not discussed here in detail, because they can be at very different abstraction levels depending on the intended application. In statistical and data flow simulations, properties such as data arrival rates, service times, and processor loads may suffice to model the systems. Sequential algorithms with approximate delays may be enough for algorithm level simulations, whereas a strict coding style with a lot of additional implementation information must be adhered to if the design is to be synthesized.

DTNS can conveniently be used in the three highest simulation levels of our classification, namely the statistical, data flow, and algorithm simulations, combining attributes from all types. If DTNS is combined with a C or C++ based statistical level simulator, the statistical functions are usable in DTNS. Data flow level is the most natural level for DTNS simulations. The application data flow can be emulated and the analysis capabilities used for architectural exploration, performance evaluation, and interconnection and algorithm development. In some cases, the specifications of algorithms in the C language can be used to provide fairly realistic data. At algorithm level, the system specification, containing the rest of the algorithms, is determined.

The strength of DTNS becomes apparent when moving from the algorithm descriptions to a co-simulation and design environment. Typically, converting the algorithms for the co-simulation environment has required a lot of work.
The communication structure is well defined in DTNS simulations, and thus both the communication structure and the C language based agents are readily usable in leading co-design environments through their C language interfaces. From there, the design flow continues according to the software vendor's tool flow.

It is conceivable that DTNS could be used in instruction level simulations if combined with a simulator that compiles the operation of the processors into the C code itself [1]. However, this concept has not been tested. Compared to typical simulation environments, the difference in DTNS at all of these simulation levels is the emphasis on a more accurate communication model.

4 Simulation Example

The purpose of our study was to test HIBI performance in a multimedia environment and to illustrate the design, reuse, analysis, and representation capabilities of the DTNS simulation method. The hardware behavior of an ITU-T H.263 video encoder is analyzed in high-level simulations without actually implementing the algorithm.

4.1 Heterogeneous IP Block Interconnection

A typical Heterogeneous IP Block Interconnection based system consists of several heterogeneous IP blocks connected with the HIBI interconnection. The IP blocks could be processor cores, memories, special computation accelerator units, or interfaces to other systems. A basic HIBI system is bus based. The HIBI design is based on the concept that a bus system can function without a central arbiter. This eliminates the need for dedicated control signals for each of the IP blocks. Each agent has a unique priority. In addition, time-slots are allocated based on a priori knowledge of system operations. The HIBI specification enforces transactions that have no handshaking and a minimal number of wait states. Wait states are not allowed at all during the actual data transmission operations. HIBI signals consist of scalable Data and Address buses, a Command bus, and three system control signals.

The HIBI time-slots offer quality of service (QoS) type services in an on-chip environment. This is a clear improvement on traditional computer buses when such properties are required. One of the most important aspects of HIBI is that there is no initial or subsequent data latency. All clock cycles in an access after the arbitration has been completed should transfer data. This can be accomplished because there is no handshaking between the transmitter and the receiver. If an agent runs out of data to send, it must release the HIBI bus for other agents to use. It is theoretically conceivable to invent strategies for an agent to hold the bus without actually transmitting data, but such schemes are discouraged.
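One way to picture arbitration without a central arbiter is a function that every agent evaluates on the same inputs, so that all agents reach the same grant decision. The C sketch below is only an assumption about how unique priorities and time-slots might be combined; the paper does not give the HIBI arbitration rule. The names agent_req_t, arbitrate, and NO_GRANT are hypothetical, as are the conventions that a lower priority number wins and that the owner of the current time-slot takes precedence.

```c
#include <assert.h>
#include <stddef.h>

#define NO_GRANT (-1)

typedef struct {
    int priority;    /* unique per agent; lower value wins (assumed) */
    int requesting;  /* nonzero when the agent has data to send */
} agent_req_t;

/* Returns the index of the agent granted the bus, or NO_GRANT.
   slot_owner is the index of the agent owning the current time-slot,
   or NO_GRANT when the current cycle lies outside any time-slot. */
static int arbitrate(const agent_req_t *agents, size_t n, int slot_owner)
{
    int winner = NO_GRANT;
    assert(agents != NULL || n == 0);

    /* Assumed rule 1: a requesting time-slot owner always gets the bus. */
    if (slot_owner >= 0 && (size_t)slot_owner < n &&
        agents[slot_owner].requesting)
        return slot_owner;

    /* Assumed rule 2: otherwise the requester with the best priority. */
    for (size_t i = 0; i < n; ++i) {
        if (!agents[i].requesting) continue;
        if (winner == NO_GRANT ||
            agents[i].priority < agents[winner].priority)
            winner = (int)i;
    }
    return winner;
}
```

Because the decision depends only on the shared request state and the slot schedule, no per-agent grant lines from a central unit are needed in this sketch; an agent that stops requesting simply drops out of the next evaluation.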
The absence of data latency and handshaking is important because these properties are usually what separates the peak throughput from the throughput actually achieved.

Efficient design with IP blocks, and HIBI based design in particular, requires a high-level design and co-simulation environment. For system level architecture exploration, it has been traditional to write a custom cycle-based simulator in C/C++. Considerable work has recently gone into enhancing C/C++ based hardware simulation and synthesis [2,3]. Our contribution to these efforts is embodied in the Discrete Time Network Simulator. In our case, DTNS is used for initial HIBI parametrization and configuration.

4.2 The H.263 Video Encoder Environment

The ITU-T video coding standard H.263 utilizes motion estimation, together with motion prediction and motion compensation, to compress video data by using the temporal correlation between picture frames. Encoding and decoding are performed on pixels on a block-by-block basis. Fig. 3 depicts the H.263 encoder used in our simulations. Motion estimation is usually done by searching the previous frame for blocks obtained from the current picture. The discrete cosine transform can also be used together with quantization to compress data. Memory is needed in the design to store the picture frames being processed. An external bridge is required to communicate with the off-chip components. A control processor directs the coding process and constructs the final bit-stream.

Fig. 3. Simulation of an H.263 video encoder with the HIBI bus architecture. (Diagram labels: Motion Estimation, DCT / IDCT, Processor core (ARM), Memory, and External Bridge connected to the HIBI Bus; test vectors; H.263 system input / output; outputs of bus signals and bus statistics.)

4.3 Simulation Result Analysis

The block is instantiated five times in our system (Fig. 3) and every block has FIFO memories to buffer data transactions. The size of these FIFOs is an important trade-off between system performance and the consumed silicon area.
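To make the FIFO sizing trade-off concrete, a bounded FIFO model can record its peak occupancy and the number of writes it had to refuse, so that runs with different depths can be compared. The following C sketch is illustrative only; the type fifo_t, the functions fifo_init, fifo_push, and fifo_pop, and the limit FIFO_MAX are hypothetical and not part of DTNS or HIBI.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_MAX 64  /* upper bound on the modeled depth (assumption) */

typedef struct {
    int data[FIFO_MAX];
    size_t capacity;         /* modeled FIFO depth, 1..FIFO_MAX */
    size_t head;             /* index of the oldest word */
    size_t count;            /* words currently stored */
    size_t peak;             /* highest occupancy seen so far */
    unsigned long rejected;  /* writes refused because the FIFO was full */
} fifo_t;

static void fifo_init(fifo_t *f, size_t capacity)
{
    assert(f != NULL && capacity > 0 && capacity <= FIFO_MAX);
    f->capacity = capacity;
    f->head = 0;
    f->count = 0;
    f->peak = 0;
    f->rejected = 0;
}

/* Returns 1 if the word was stored, 0 if the FIFO was full. */
static int fifo_push(fifo_t *f, int word)
{
    if (f->count == f->capacity) {
        f->rejected++;
        return 0;
    }
    f->data[(f->head + f->count) % f->capacity] = word;
    f->count++;
    if (f->count > f->peak) f->peak = f->count;
    return 1;
}

/* Returns 1 and writes the oldest word to *word, or 0 if empty. */
static int fifo_pop(fifo_t *f, int *word)
{
    if (f->count == 0) return 0;
    *word = f->data[f->head];
    f->head = (f->head + 1) % f->capacity;
    f->count--;
    return 1;
}
```

Repeating the same traffic with several values of capacity and comparing rejected and peak gives a rough picture of how much buffering a unit needs before additional depth only costs area.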
The functional units are modeled as servers with queues (FIFOs), response times, and instructions on how to respond to writes to the particular unit. Instead of the processor core, the test vectors control the simulation progress. As an example analysis, the bus utilization factor and the utilization of some of the FIFOs are depicted in Fig. 4. Bus utilization is calculated in a ten clock cycle window. The usual main simulation results, the signal waveforms, are shown in Fig. 5.

Fig. 4. Utilization of HIBI bus (solid line), ARM input FIFO (dash), and External Bridge FIFO (dash-dot). (Axes: utilization versus clock cycles.)

Fig. 5. HIBI bus signal waveforms.

5 Conclusion

The specification planning phase needs more powerful tools to address system performance issues. Finding the most efficient configuration for the overall system is often guesswork at the high level and can be very tedious work when approached from the lowest levels of hardware description. Thus, high-level simulations need some practical estimates of hardware performance.

System-on-a-chip designs pose a novel problem for digital design. The complexity of designs has risen steeply. DTNS facilitates architectural exploration and design parameter optimization, and provides a tool for design partitioning based on the external communications of the partitions. In addition, the tool flow with DTNS from architectural exploration to physical design is continuous.

DTNS itself does not support event based modelling; thus, accurate timing information for the time between clock cycles is not available. The IP shortage also affects DTNS: if no IP blocks are available, designing them can be a considerable effort. However, a lot of C code is available that should be adaptable to DTNS with reasonable ease.

The simulation results show the power of the described method for analyzing design functionality and performance. Long simulation runs can be made with different configurations to optimize the design. DTNS presents interesting future research possibilities and needs a lot of further work.
The statistical functions and measures for the interconnection performance still need a closer look. The focus of the research will probably be automatic parameter optimization with, for example, simulated annealing. A more distant research idea is to distribute the DTNS simulations to a computer cluster. This should be relatively easy because a DTNS simulation has already partitioned the design into agents that can be processed concurrently.

References:
[1] L. Guerra, et al., Cycle and Phase Accurate DSP Modeling and Integration for HW/SW Co-Verification, Proc. 36th Design Automation Conference, 1999, pp
[2] R. Gupta and S. Liao, Using a Programming Language for Digital System Design, IEEE Design & Test of Computers, Vol. 14, No. 2, 1997, pp
[3] R. Gupta and G. De Micheli, Hardware-Software Cosynthesis for Digital Systems, IEEE Design & Test of Computers, Vol. 10, No. 3, 1993, pp
[4] C. Hansen, Hardware Logic Simulation by Compilation, Proc. 25th Design Automation Conference, 1988, pp
[5] J. Hennessy and M. Heinrich, Hardware/Software Co-Design of Processors: Concepts and Examples, Hardware/Software Co-Design, G. De Micheli and M. Sami, ed., Kluwer,
[6] C. Kozyrakis and D. Patterson, A New Direction for Computer Architecture Research, IEEE Computer, Vol. 31, No. 11, 1998, pp
[7] K. Kuusilinna, et al., Low Latency Interconnection for IP-block Based Multimedia Chips, Proc. 2nd IASTED International Conference Parallel and Distributed Computing and Networks, 1998, pp
