Interface Design of VHDL Simulation for Hardware-Software Cosimulation

Wonyong Sung, Moonwook Oh, Soonhoi Ha
Seoul National University
Codesign and Parallel Processing Laboratory
Seoul, Korea
TEL : , FAX :

Abstract

To perform cosimulation, an interface for VHDL simulation is needed. This interface is responsible for communicating packets between any VHDL simulator and the cosimulation backplane, PeaCE, which is a Ptolemy extension serving as a codesign environment. The interface also manages the VHDL simulation so that timed cosimulation is correct. Through the automatic interface generation mechanism, the interface is generated without user intervention. The proposed interface mechanism is implemented for two VHDL simulators and verified by covalidation with a QAM-16 modulation example. The results and lessons from the experiment are described.

1 Introduction

Cosimulation is a key facet of codesign methodology. Through cosimulation, we can validate the functional correctness of the hardware and software working together, ahead of the final synthesis step. Cosimulation also allows us to evaluate each design decision, such as partitioning or component selection. As codesign proceeds, the level of cosimulation evolves from the behavioral level to the implementation level, at which timed cosimulation is necessary to validate the timing requirements. To our knowledge, no cosimulation environment cooperates with various existing VHDL simulators across the whole range from the behavioral level to the timed level. This paper presents an interface mechanism that combines any VHDL simulator with the proposed cosimulation environment.

In most codesign systems, the software part is usually a C program and the hardware part is written in a hardware description language (HDL) such as VHDL, Verilog, or Hardware-C. For hardware simulation, most people use a hardware simulator such as a VHDL simulator. For software simulation, however, there are various approaches, from using a processor simulator to making stand-alone UNIX processes. We use the latter approach. To perform cosimulation, the two simulators must be combined, which requires interfaces for communication between them. To relieve the designer of the burden of making interfaces at each design iteration step, an automatic interface generation facility is devised in this paper. Since the software part runs as a UNIX process, most of our effort in interface design is focused on the interface for the VHDL simulation.

In the next section, we give a short introduction to our codesign workflow. The requirements and solutions for the VHDL interface are described in section 3. The implementation details that depend on our environment are explained in section 4. Experimental results and discussion are presented in section 5, and section 6 gives the conclusion and future work.

2 Codesign Methodology

The codesign workflow in PeaCE, a codesign framework we are developing, is represented in figure 1.

A dataflow graph is chosen as the initial specification model for a given application and is simulated for algorithm validation. Dataflow models are very effective for specifying most types of DSP applications [2, 3]. During algorithm development and simulation, the use of a dataflow specification allows measures of algorithmic performance (e.g. bit error ratio) to be obtained quickly, because there is no need to determine the timing of the operations [9]. Clock, reset and other implementation dependent signals are not required at this level of abstraction. The simulation efficiency is higher than that of discrete event simulation [4].

Figure 1. Hardware Software Codesign Workflow

The next step is to partition the initial dataflow graph into subgraphs: software graphs and hardware graphs. By cosimulation and evaluation, the feasibility and the cost effectiveness of a partition are examined. At each iteration of the codesign process, a new partition is made, which requires rebuilding the interface between the two subgraphs. Since making an interface is tedious and error-prone work, it is desirable to generate the interface automatically [10]. This is the main topic of this paper.

After partitioning, each subgraph is modified in order to add interface nodes at the boundary. From the partitioned graphs, C and VHDL codes are generated. No user intervention is needed for the addition of interface nodes and code generation. The interface node to be added is chosen from the reusable interface node library according to the synthesis target and communication protocol.

Through cosimulation, a designer can check the functional correctness ahead of the final synthesis step. Cosimulation is also used to obtain profiling information. As shown in [1], the profiling results help the designer partition the target system more efficiently. Another role is to identify the performance bottleneck. Moreover, by timed cosimulation, the exact timing behavior of the whole system can be found. To be simulated, the generated C code is compiled into a UNIX process and the generated VHDL code from the hardware graph is passed to the VHDL simulator for hardware simulation.

Figure 2. The cosimulation backplane and communication with client simulators

To combine two concurrent simulators (a C process and a VHDL simulator), PeaCE introduces and implements a backplane concept, which reduces the number of interface modules from N(N-1) to N. In the backplane approach, a software or a hardware simulator interacts with PeaCE, the cosimulation backplane, without knowing of the existence of its counterpart. The backplane monitors and manages all communication events between the software and the hardware simulators. On the other hand, the Ptolemy group of U.C.Berkeley [8] devised a common interface mechanism between any pair of simulators, so that they also achieve the same reduction of interface modules to N. Their approach is called heterogeneous simulation [12].

Therefore, as shown in figure 2, while the cosimulation is in progress, several UNIX processes are running concurrently and cooperatively: C processes, a VHDL simulator and PeaCE itself. In figure 2, a C process and a VHDL simulator communicate with each other through the cosimulation backplane, PeaCE. In the backplane, we can use the simulation and visualization capabilities of Ptolemy. In figure 2, a dashed line represents a flow of data within the backplane through a function call, and a solid line represents a data transmission through a socket. The user specifies the dataflow with the dotted lines as shown in figure 4.

The backplane supports event-driven scheduling with an event queue, which holds future events sorted by time. Any data between processes is transferred through the cosimulation backplane, which makes the event queue in the backplane a global event queue. The backplane scheduler manages the event queue and transmits packets to a client simulator in the order of event generation time. If the destination is an external process such as a C process or a VHDL simulator, the backplane scheduler calls the utility functions to send the packet via a socket. The interface node, automatically inserted by PeaCE, receives the packets and passes them to the C or VHDL module. After transmitting packets to a client simulator, the backplane scheduler waits in a polling loop until it receives the results. To avoid deadlock, the client simulator must always send a response packet to the backplane, even when there is no result data.

Based on the simulation results, the evaluation module decides whether the current partition satisfies the system requirements. If they are not satisfied, the codesign process continues to iterate from the partition stage to the evaluation stage. Otherwise, the codesign process reaches the synthesis and testing step. After the validation check through timed cosimulation and the evaluation step, C and VHDL codes are regenerated from the hardware and software subgraphs. In this stage, the generated code includes interface code suitable for the prototyping board, which consists of a DSP and an FPGA.

3 Design and generation of cosimulation interface

In this paper, we use VHDL simulators and UNIX processes as concurrent processes communicating with each other through BSD sockets. An alternative approach is to treat the entire system as a single VHDL process by making the C portions of the system procedures called from the VHDL modules [7]. A serious drawback of this approach is that not all software parts can be expressed as procedures. In our environment, the communication between the hardware and the software is modeled as a message passing system. We aim to design a flexible and extensible message passing interface for cosimulation. We first list the desired characteristics of the message passing interface for generic cosimulation.
No modification of the initial specification: Generally, the initial algorithm specification contains no considerations for partitioning and interfacing. So, whenever cosimulation is needed, interface code should be generated automatically for every hardware-software partition. A cosimulation is constructed only by adding the new interface, without any modification of the user design.

No modification of VHDL simulator: Since we use existing VHDL simulators for hardware simulation, this requirement is crucial. Many of the problems we met in this work stem from this requirement. It turns out that the interface design would improve significantly if we could add some capabilities to the VHDL simulator. The desirable capabilities are described in section 4.3.

No modification of VHDL module libraries: We do not change the code of VHDL modules but augment them with interface code automatically, without user's intervention. This distinguishes our approach from Sari's approach of Carnegie Mellon University [11].

No restriction on VHDL specification: A previous work from U.C.Berkeley restricted the VHDL program, which is generated from a program graph with SDF semantics, to a single thread of control [8]. Even though this approach schedules the communication statically for deadlock avoidance as well as runtime performance improvement, it is too restrictive for general VHDL applications. In [8], only one sequential process is running on the VHDL simulator. Our specification places no such restrictions on the VHDL model. In fact, we add a new VHDL module as the interface module, which runs concurrently with the VHDL graph.

Timed cosimulation: After functional correctness is validated, we also need to check the timing requirements of the system. Since the VHDL simulator has the notion of time, we can perform timed cosimulation if software processes are managed by event-driven scheduling. We then need to define a synchronization protocol between two concurrent event-driven simulators.

4 Implementation

Our codesign environment, PeaCE, is based on Ptolemy, which is a framework for heterogeneous system specification, simulation and synthesis [6]. In Ptolemy, an application is represented as a block diagram which is given an appropriate semantics. For example, a DSP algorithm is represented as a block diagram with dataflow graph semantics. Each block contains the code to be executed, so the block diagram forms a coarse grain dataflow graph. In this paper, we consider the category of applications that can be represented with dataflow graphs, for example, DSP systems.

Ptolemy generates C codes from the partitioned dataflow graphs for software and makes stand-alone C processes to be run on the host computer. From the partitioned dataflow graphs for hardware, Ptolemy generates VHDL codes that are passed to the VHDL simulator. Before C codes and VHDL codes are generated from the partitioned graphs, communication blocks are automatically inserted at each partition boundary. The type and connectivity information of a communication block are decided by those of the partitioned arc. These communication blocks contain all communication and synchronization routines for cosimulation: socket establishment, socket termination, data communication, buffer management and synchronization by timestamp management. By the foreign interface mechanism according to the ANSI/IEEE std , the communication codes for VHDL are written in C and called by the VHDL code in the newly added communication modules.

Figure 3. Function of three entities of VHDL interface

For the VHDL simulator we append three types of interface nodes: master node, send node, and receive node. For each partitioned arc, we add either a receive node for an input arc or a send node for an output arc. Send nodes and receive nodes are implemented separately for each data type: float or integer. A VHDL simulator schedules interface nodes when communication occurs. When there is more than one input communication link between hardware and software, there is a risk of deadlock unless we carefully manage the firing order of the receiver modules. We solve this problem by adding one entity, called the master node, which serializes the firings of communication nodes. The architecture of the interface for VHDL simulation is depicted in figure 3.

In the initialization stage, the master node establishes the socket connection and sets up the input and output queues, one for each partitioned link. The master node also scans the output buffer of the send blocks to export new data to the outside. Even when there is no output data, it generates the END signal to the outside to indicate that the partitioned graph has finished its execution.
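The bookkeeping performed by the master node can be pictured with a minimal sketch in Python. It is not the generated VHDL or C interface code: the class name MasterNode, its method names, the link identifiers, and the string used to encode the END signal are all assumptions made for illustration. The sketch shows only the two properties stated above, namely one input queue and one output queue per partitioned link, and a response that always ends with END even when no output data exists.

```python
from collections import deque

END = "END"  # control marker named in the text; this string encoding is an assumption


class MasterNode:
    """Toy model of the master node's queues (hypothetical, for illustration only)."""

    def __init__(self, input_links, output_links):
        # One queue per partitioned link, as set up in the initialization stage.
        self.input_queues = {link: deque() for link in input_links}
        self.output_queues = {link: deque() for link in output_links}

    def deliver(self, link_id, value):
        """Store an incoming value for the receive node attached to link_id."""
        self.input_queues[link_id].append(value)

    def receive(self, link_id):
        """Called on behalf of a receive node to fetch its next input value."""
        return self.input_queues[link_id].popleft()

    def send(self, link_id, value):
        """Called on behalf of a send node to buffer a result."""
        self.output_queues[link_id].append(value)

    def flush_outputs(self):
        """Scan every output queue and build the response for the backplane.

        The response always ends with END, even when no data was produced.
        """
        response = []
        for link_id, queue in self.output_queues.items():
            while queue:
                response.append((link_id, queue.popleft()))
        response.append(END)
        return response
```

With a single master object owning all queues, the order in which receive nodes obtain their data is decided in one place, which is the serialization role the text assigns to the master node.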
If we define a simulation loop as the duration between receiving input packets from the backplane and transmitting response packets, a simulation loop is divided into two parts: the interface time (inf(1) + inf(2) in figure 3) and the user module time (object time in figure 3). At the beginning of a simulation loop, the master node calls the socket receive function, which scans all incoming channels to read all simultaneous data from the backplane until the GO control packet is received. Each packet from the outside contains an identification field that specifies which receiver node it is destined for. After transferring all received data to the input buffer, the master node generates an enable signal to each receive node to run the partitioned graph. The receive nodes get the data from the buffer, which defines the end point of the inf(1) duration. When a send node is scheduled, it writes result data into an output buffer in the foreign module. This is the beginning of the inf(2) duration. After a delta delay in the VHDL simulator, the master node scans the output buffer and transmits the data to the backplane through the socket. This marks the end of inf(2). The time measurement is described in section 5.2.

To perform concurrent event-driven simulation, we use a conservative approach so that the clock of the VHDL simulator never runs ahead of the global clock. In a conservative timed cosimulation, a client simulator can advance its local time only when it receives a packet which has a time stamp larger than its local time. Since we cannot expect the VHDL simulator to offer an interface through which the external cosimulation engine can prevent the advancement of the local clock, we let the master node keep the time advancement of the VHDL simulator from running ahead of the global clock. The master node thus plays the role of runtime manager of the VHDL simulator. For correct timed cosimulation, the behavior of the master node described in the previous subsection is divided into two parts.
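The conservative rule can be summarized in a few lines. The sketch below is in Python; the class ConservativeClient and its on_packet method are hypothetical names, and advancing exactly to the time stamp of the packet is an assumption made here for illustration. It shows only that the local clock of a client moves forward when, and only when, a packet with a larger time stamp arrives.

```python
class ConservativeClient:
    """Toy client simulator clock under the conservative rule (hypothetical)."""

    def __init__(self):
        self.local_time = 0

    def on_packet(self, timestamp):
        """Apply the conservative rule to one incoming packet.

        Returns True if the local time advanced, False otherwise.
        """
        if timestamp > self.local_time:
            # Assumption: the client advances exactly to the packet's time stamp.
            self.local_time = timestamp
            return True
        # Equal or older time stamps never move the local clock.
        return False
```

Because the client never advances on its own, its local clock cannot pass the time stamp most recently delivered by the backplane.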
At each execution, the master node checks the input connection and sends signals to wake up the receive nodes. After all events at the current time are processed in the VHDL simulator, the master node is scheduled again to check the output buffer and to send packets to the cosimulation backplane. Then, the master node goes into a wait loop until it is woken from the outside.

Figure 4 and figure 5 are screen dumps from QAM cosimulation in PeaCE. Figure 4 shows a top view of the QAM system, which has two super nodes: one has a C subgraph and the other has a VHDL subgraph that is displayed in figure 5. The top view window also has two simulation support nodes in the backplane: a clock node (a test pattern generation node) and an XGraph node (a visualization node). While the VHDL module graph has 8 nodes, 12 entities are simulated because of the insertion of a master node, two receive nodes, and a send node. Each of the 12 entities has its own process. Thus 12 concurrent processes are running within the VHDL simulator.

Figure 4. A screen dump of top view window of QAM in PeaCE

Figure 5. A screen dump of QAM cosimulation in PeaCE

For the master node to check the output buffer, it should be scheduled only after all events in the same time wheel are processed. Otherwise, the VHDL simulator may not deliver the current output to the backplane on time. Recall that once the master node receives a packet from the backplane, a response packet must be sent to the backplane, even if there is no result data at that time, so that the backplane scheduler can exit from its wait loop. In the VHDL module graph shown in figure 5, there is a source node, a ramp node, which produces an incremented value whenever it is scheduled. The designer's intention is that the ramp node is scheduled and produces an output value when a packet is received from the backplane. If the ramp node is scheduled after the master node checks the output buffer and then produces a result, the result will be checked and sent out at the next time slot. The backplane will receive the packet after it advances to the next time slot. So, the packet will be an old packet and the conservative cosimulation will be broken. Another problem is that the value responded first is a glitch.

Since it is usually not possible to schedule a certain process at the end of the current time wheel, as shown in figure 6, we advance the local time by a delta delay before the master node checks the output buffer and sends response packets to the cosimulation backplane. Although the master node is scheduled at the beginning of the next time slot rather than at the end of the current time slot, we can deliver the final results of the previous time slot to the backplane if we subtract the delta time from the time-stamp of the output packets.

    while (GO packet is not received) loop
        receive packet from socket;
        write data into input queue;
    end loop;
    Check timestamp of received packet;
    if (timestamp > simulator's time) advance time;
    Send ENABLE signal to receiver nodes;
    Advance local time for delta delay (1 NS);
    Check data in output queue;
    if data exists then send data to socket;
    Send END signal with next time;

Figure 6. Behavior of the master node

The advancement of the local clock by a delta delay is prohibited in conservative distributed simulation. To cure this problem, the duration of the arbitrary time advancement should be small enough to be negligible at the simulation interface. The easiest way is to use a very small time unit such as 1 femtosecond. Since such a small unit of time makes the internal data structure of the VHDL simulator inefficient, we use a SCALE parameter, which is the ratio between the time unit in the entire cosimulation and that within the VHDL simulator. When the master node communicates with the backplane, the time-stamp of a packet is interpreted by multiplying or dividing the time-stamp by the SCALE value.

The behavior of the master node is shown in figure 6. There are two points where time advancement occurs. After packets from the backplane are received, if a future time stamp is detected by calling the check time advance() foreign procedure, the VHDL simulator advances its local time. The amount of time advancement is computed by multiplying the SCALE value by the time difference between the backplane and the simulator. The other advancement is the delta delay advancement described earlier. It is used only once per simulation cycle. If SCALE is larger than 1, timed simulation works well.

However, another problem is caused by using the SCALE parameter. If there is a VHDL module which uses a time unit smaller than SCALE, the module works differently from the designer's intention. In QAM, for example, a ramp node has a "wait for 1ns;" statement. After we replace the statement with "wait for SCALE ns;", the system works correctly. We will construct a VHDL module library in which SCALE is used as one of the generic parameters.

From our experience of interface design and implementation, we list the facilities that VHDL simulators should support for cosimulation.

A callback function for hooking the scheduler: Before the simulator's scheduler advances to the next cycle, a function referenced by a pointer is called. In the normal situation the pointer points to a null function. If a cosimulation environment designer wants to hook the scheduler, he defines a function body and sets the pointer to the new address. If the supported language is C++, this can be done with a virtual function. With the callback mechanism, the master node can be executed at the end of the current wheel, and the delta delay management described above becomes unnecessary.

The time of nearest future event: To perform more efficient cosimulation, information about the nearest future event is required. The current implementation of PeaCE uses the next time increment as the nearest future event, which is a major source of inefficiency in cosimulation time. The C language interface in the Synopsys VHDL simulator (VSS) supports this facility with the cligetnexteventtime function.

5 Experiences of cosimulation with designed interface mechanism

We implemented two sets of the proposed interface generation mechanism with the QAM modulation example.
One implementation is for the Synopsys VSS simulator and the other is for IVSIM, which was developed at the same university [13]. Though both support the VHDL foreign interface, the implementation details of the foreign interface are not specified in the VHDL standard, so the interface mechanism in each simulator depends on the simulator implementation.

The fixed C module in table 1 is a foreign module which includes the socket handling routine and the interface routine. In VSS, a foreign module defines the body of a VHDL entity itself, and thus needs more extra code to interface with the scheduler of the VSS simulator. On the other hand, the C modules in IVSIM are just procedure calls. As a result, the fixed C part in VSS is larger than that in IVSIM. In VSS, however, the interface designer can exploit more facilities related to the simulator kernel using the CLI (C Level Interface) facility. The fixed VHDL module in VSS includes entity definitions of the send, receive, and master nodes, while that in IVSIM has only a master node. On the contrary, the proportional parts in IVSIM are larger than those in VSS. The proportional part in VSS has only entity instantiation codes while, in IVSIM, the proportional part defines and instantiates entities of send or receive nodes. Although there are many differences in the detailed implementation, the same interface mechanism using the master node is used for both simulators, and the size of the interface part is about 10% of the whole simulated VHDL code for both VSS and IVSIM in the QAM example.

Table 1. Interface code overhead in VSS and IVSIM simulators
Columns: Fixed C Module (bytes), Fixed VHDL module (lines), lines per Receive, lines per Send
Rows: VSS, IVSIM

The result of runtime monitoring of the VHDL simulator and the interface module is presented in table 2. To measure the execution time of the user design modules and that of the interface modules, we use the gettimeofday() UNIX system call. The current time is expressed in elapsed seconds and microseconds since 00:00 GMT, January 1, 1970.
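As an illustration of the measurement, the sketch below (Python, with hypothetical function names and made-up time values) converts four (seconds, microseconds) readings of the kind returned by gettimeofday() into the two durations reported for each simulation loop. The four readings are assumed to be taken at the start of inf(1), the end of inf(1), the start of inf(2), and the end of inf(2), following the definition of the simulation loop given in section 4.

```python
def to_microseconds(seconds, microseconds):
    """Collapse a (seconds, microseconds) pair into a single microsecond count."""
    return seconds * 1_000_000 + microseconds


def loop_times(inf1_start, inf1_end, inf2_start, inf2_end):
    """Return (user_module_time, interface_time) in microseconds.

    Each argument is a (seconds, microseconds) pair. The user module time is
    the span between the end of inf(1) and the start of inf(2); the interface
    time is inf(1) + inf(2).
    """
    t0, t1, t2, t3 = (
        to_microseconds(*reading)
        for reading in (inf1_start, inf1_end, inf2_start, inf2_end)
    )
    interface_time = (t1 - t0) + (t3 - t2)
    user_module_time = t2 - t1
    return user_module_time, interface_time


# Made-up readings for one simulation loop (not measured data).
user, interface = loop_times(
    (100, 999_000), (101, 1_000), (101, 51_000), (101, 54_000)
)
```

Working in whole microseconds avoids the borrow handling that is otherwise needed when the microsecond field wraps between two readings, as it does between the first two readings in the example.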
We obtain the system time at four points as described in section 4. The time spent in the interface modules is the time taken to move data between the UNIX socket and the input/output buffers, while the VHDL module time is the duration between when the receive nodes get the input data from the input buffer and when the send nodes write the results into the output buffer. The time overhead of the interface block is small enough to be ignored.

Table 2. Interface time overhead in QAM cosimulation. (IVSIM simulator is used and all values are presented in microseconds.)
Columns: Simulation Loop, User Module, Interface Module
96 65,905 5, ,768 8, ,005 11, ,545 15, ,625 18,399

6 Conclusion

We have presented a new interface mechanism for hardware-software cosimulation. We think that the approach, which satisfies all of the requirements in the wish-list, is applicable to all detailed levels of cosimulation and to various existing VHDL simulators. To our knowledge, there has been no cosimulation environment which works with any VHDL simulator and at the same time performs timed cosimulation. We implemented the interface mechanism for both the VSS and IVSIM simulators and compared them. We also verified the feasibility of the proposed approach by conservative timed cosimulation with a QAM-16 modulation example. The implemented interface and the lessons for more efficient timed cosimulation have been described. As future work we will improve our cosimulation environment, construct a generic and parameterized VHDL module library, and make a smooth migration path to cosynthesis.

References

[1] C. Passerone, et. al. Fast and accurate hardware-software co-simulation using software timing estimates. CODE/CASHE96,
[2] E. A. Lee. Recurrences, Iteration, and Conditionals in Statically Scheduled Block Diagram Language in VLSI Signal Processing III. IEEE Press,
[3] E. A. Lee, and D. G. Messerschmitt. Synchronous data flow. IEEE Proceedings, September
[4] G. Jennings. A case against event driven simulation of digital system design. The 24th Annual Simulation Symposium, pages , April
[5] IEEE. IEEE Standard VHDL Language Reference Manual. IEEE, Inc., 345 East 47th Street, New York, NY 10017, USA,
[6] J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt. Ptolemy: A framework for simulating and prototyping heterogeneous systems. International Journal of Computer Simulation, 4: , April
[7] J. P. Soninen, et. al. Co-simulation of real-time control systems. IEEE/ACM Proc. of Euro-Dac 95, pages ,
[8] J. Pino, Michael C. Williamson, and Edward A. Lee. Interface Synthesis in Heterogeneous System-Level DSP Design Tool. IEEE International Conference on Acoustics, Speech, and Signal Processing,
[9] Peter Zepter, Thorsten Grotker, and Heinrich Meyr. Digital Receiver Design Using VHDL Generation From Data Flow Graphs. Proceedings of 34th DAC, June
[10] S. Schemeler, et. al. A backplane approach for cosimulation in high-level system specification environments. European Design and Test Conference,
[11] Sari L. Coumeri and Donald E. Thomas. A Simulation Environment for Hardware-Software Codesign. IEEE Design and Test of Computers, pages 16-28, September
[12] Wayne Wolf. Hardware-software codesign of embedded systems. Proceedings of IEEE, 82: , July
[13] Y. Kim, K. Kim, Y. Shin, T. Ahn, W. Sung, K. Choi, and S. Ha. An integrated hardware-software cosimulation environment for heterogeneous system prototyping. Proc. of ASPDAC, pages , August
