A study on transactors in multi language, mixed-level simulation of digital electronic systems


Master Thesis IMIT/LECS/ [ ]

A study on transactors in multi language, mixed-level simulation of digital electronic systems

Master of Science Thesis in Electronic System Design
by Pablo Fernández Carmona

Stockholm, April 2007

Supervisor: Wido Kruijtzer
Examiner: Axel Jantsch

Abstract

In today's semiconductor industry, there is an increasing distance between engineers' productivity and state-of-the-art fabrication technology. This is known as the Productivity Gap. More efficient design methodologies are needed to close this gap. In order to manage the growing complexity of designs, and to provide verification tools at early stages, different levels of abstraction are used at different stages of the design process. The so-called Electronic System Level is emerging as the next level of design. Electronic System Level design tools based on SystemC and Transaction Level Modelling (TLM) are likely to drive the electronic design automation market in the coming years. SystemC based design flows have clear benefits when compared to conventional methodologies. They provide an executable specification of the system's behaviour, which can be used as a replacement for ambiguous textual specifications. Such an executable specification also serves as a system level test bench for the next steps in the design flow (i.e. RTL implementation and software development). This dramatically reduces the time and effort required for verification.

Within the context of enabling such efficient verification in the design flow, this thesis project examines the application of transactors. Transactors are pieces of software, or a mixture of hardware and software, that facilitate the connection of models described at different levels of abstraction, in order to enable mixed-level cosimulation. As an illustrative example, a system composed of SystemC TLM models and Verilog RTL blocks could be cosimulated with the use of transactors.

The aim of this thesis project is to evaluate the possibilities and limitations of transactors in designs using multiple levels of abstraction. This includes a literature study of the most widespread transactor use cases, and an experiment with commercial products.
As a vehicle of research, a USB core was integrated in an ARM11 subsystem. For this purpose a commercial transactor from SpiraTech was chosen. The USB core is modelled at the RTL level of abstraction and written in the hardware description language Verilog. The rest of the system is modelled at TLM level in SystemC. The purpose of this experiment is to use the experience as feedback for improving the modelling style, and to get a qualitative idea of the benefits and problems that may arise when using mixed-level simulation in the development of real complex systems.

This thesis project is part of the SPRINT (Open SoC Design Platform for Reuse and Integration of IPs) European project, which researches the easy reuse of third party IPs in complex SoC designs, including system verification.

Acknowledgements

To my family, who made it possible for me to study what and where I really wanted to.

I also want to dedicate this work to all the members of the System Design Methodology department of NXP in Eindhoven. I have to thank them for their help and support with all the technical difficulties and, more importantly, for making me feel like one more member of the team.

Glossary

Term   Description
API    Application Programming Interface
CA     Cycle Accurate
CC     Cycle Callable
DUT    Design Under Test
EDA    Electronic Design Automation
ESL    Electronic System Level
HDL    Hardware Description Language
OSCI   Open SystemC Initiative
PSE    Philips SystemC Environment
PV     Programmers View
PVT    Programmers View with Timing
RTL    Register Transfer Level
SoC    System on Chip
TLM    Transaction Level Modelling
USB    Universal Serial Bus

Table of Contents

1 Introduction
Brief review on Transactors and TLM
Electronic System Level design. SystemC
Transaction Level Modelling
HDL and RTL
Transactors
Cosimulation
PSE
SPRINT project
Main transactor use cases
Transactor roles
Enable communication between abstraction levels
Protocol checking
Performance analysis
Test coverage analysis
Industrial use cases
SoC design with strong IP reuse
Co-emulation using synthesizable transactors
Architecture exploration
Faster test writing in high level languages
Guaranteeing models equivalence
SoC verification against real target environment
Constructing reusable test benches
Ideal transactor requirements
Multi-level, multi-language simulation: Problem statement
Main parameters to control in cosimulation
Interoperability of different levels
Intercommunication between languages
Synchronization between abstraction levels
Compatibility of different model sources
Thesis scope
Proof of Concept Design: solution to the problem
Base subsystem
Intellectual Properties integrated
USB core
AHB transactor AMBA_AHB_v1p0_32_32_ps_NCSIM_5p
Accomplished Experiments
Environment
Verification of components
6.3.3 System set-up
Creation of TLM adaptors
Transactor integration with the subsystem
Transactor integration test
Fixing PV-RTL incompatibility
Shell generation
Connection of AHB wires
UTMI interface of the USB
Fixing protocol problem
Verification software
Performance analysis
Analysis of results
Requirements fulfilled
Use cases suitability
Conclusions
Reflections:
Further work:
References
Appendix A. SystemC ARM Subsystem relevant modules
Appendix B. SPRINT project results
Activity 3: SpiraTech transactor integration
Activity 4: Mixed Integration Verification

Illustration Index

Fig. 1: Productivity gap
Fig. 2: System design flow
Fig. 3: Abstraction levels
Fig. 4: Equivalence between OSCI TLM abstraction levels
Fig. 5: Example schematic of a circuit represented at RTL level
Fig. 6: Transactor connecting PVT and CA blocks
Fig. 7: Virtual prototype with parts modelled at different abstraction levels
Fig. 8: PSE positioning for model development
Fig. 9: Transactors enable communication between abstraction levels
Fig. 10: Built-in AHB protocol tree in a transactor
Fig. 11: Multi level view of a communication channel
Fig. 12: NXP performance analysis tool window
Fig. 13: Use of transactors for analysing the coverage of test applied to DUT
Fig. 14: TLM System performing Cosimulation and Co-emulation
Fig. 15: System architecture exploration process
Fig. 16: Application of high level test benches to an RTL design
Fig. 17: Test scenario being driven into XVC environment via XVC Manager
Fig. 18: isave design flow, and its physical mapping
Fig. 19: Layered test-bench container structure
Fig. 20: Cohesive visualization GUI
Fig. 21: Need for synchronization of clocks of different transactors
Fig. 22: Correlation between Use cases and properties of transactors used
Fig. 23: Designs modelled at different abstraction levels can not be directly connected
Fig. 24: Multi language simulation
Fig. 25: Timing equivalence between CA and RTL is straight forward
Fig. 26: TLM models of the same protocol coming from different sources might not be compatible
Fig. 27: Elements involving cosimulation experiment
Fig. 28: Components integrating the ARM11 subsystem
Fig. 29: USB core Block diagram
Fig. 30: Diagram of a SpiraTech transactor
Fig. 31: Schematic design of the system to be build, representing main blocks
Fig. 32: Direct connection between blocks with different TLM standards is not possible
Fig. 33: PV functions in SpiraTech TLM
Fig. 34: PV TLM write function in NXP
Fig. 35: PV TLM write data class in NXP
Fig. 36: Possible TLM transactions and their equivalent correspondence
Fig. 37: Chain of connections and adaptors from system bus to transactors
Fig. 38: System set-up with integrated memory, showing the adaptors and transactor
Fig. 39: C code executed for testing the VHDL memory
Fig. 40: Time processing in standard NXP models at PV level
Fig. 41: RTL and PV cosimulation timing problem
Fig. 42: Modification proposed to NXP models time processing at PV level
Fig. 43: Test alternating read and write accesses to RTL and TLM memories
Fig. 44: Wires connected between USB slave port and transactor
Fig. 45: Extract of Top Level file, showing connections between USB and transactors
Fig. 46: Ambassize definition and use
Fig. 47: USB interface
Fig. 48: USB scheme of connections to the system
Fig. 49: Extract of USB registers test provided by Evatronix
Fig. 50: Extract of the waveform activity
Fig. 51: Extract from execution result of registers test
Fig. 52: Test data flow in time line for a single read transaction

1 Introduction

Several types of representations, models and design techniques are used in the semiconductor industry design flow nowadays. As technology evolves, allowing increasingly complex designs, the abstraction required to manage complexity increases. This leads to a situation where a range of description languages and models, each representing a design at a different stage, coexist. In this scenario, an engineer would ask two questions: how can I guarantee the equivalence of different representations of the same entity, and how much work can I reuse to shorten development time? The first question is addressed by verification specialists, with techniques such as formal verification or simulation against a specification. This master thesis falls within the second question.

One way of reusing work is to simulate together parts of a system that are written in different languages and modelled at different levels of abstraction. This is called cosimulation. Cosimulation of blocks described at the same level of abstraction but in different languages is an old problem, already addressed by commercial solutions integrated in most simulation tools on the market. The new problem is how to simulate designs made up of blocks modelled at different levels of abstraction. The solution to this issue is to include transactors, which are pieces of software or a mixture of hardware and software. Transactors have interfaces at different levels of abstraction, and act as transparent pipes for information flowing in and out of those interfaces. Internally the data is processed and made readily available at each output at the required level.

Some use cases of transactors with cosimulation are: speeding up test bench writing, verifying RTL IPs integrated in SystemC virtual prototypes, performing bus protocol checking, testing designs on mixed hardware-software platforms, doing performance analysis, and many others.
This thesis will evaluate a particular set-up built for cosimulation using transactors, which serves a number of use cases. More specifically, the main experiment is the integration of a low level USB 2.0 controller in a high level subsystem by means of transactors. All the blocks are commercial products.

The subsystem consists of an ARM1176 processor, memory and interrupt controllers, an LCD display and some memories. It uses AMBA™-AXI™ and NXP proprietary VPB buses. This system is modelled at a high level of abstraction and written in SystemC, with no notion of time, only of order.

The USB controller is an Intellectual Property of the Polish company Evatronix. It has AMBA™-AHB™ master and slave ports, and implements Direct Memory Access and the On-The-Go supplement. It is written in Verilog.

The transactor used is the AMBA_AHB_v1p0_32_32_ps_NCSIM_5p5 from the company SpiraTech. The vendor claims that it enables seamless connectivity and visualisation between multiple levels of abstraction, permitting mixed-level and mixed-language simulations to be carried out [1]. In order to connect the previously described components, a number of bridges, adaptors and glue logic are also necessary.

An expected result of the thesis is a qualitative idea of the benefits and problems that may arise when using transactors in mixed-level simulation for the development of complex systems. We also expect to come up with some guidelines to improve the modelling style for multilevel environments. The particular set-up will be examined to evaluate whether it fulfils the requirements of the stated use cases. A list of requirements for transactors in the use cases under study will also be elaborated, as a sort of check-list for designers.

2 Brief review on Transactors and TLM

The aim of this chapter is to introduce the main concepts necessary to understand the work carried out in this thesis. It intends to show the big picture, so that the reader can place the new topics introduced in later chapters within their context. In the following pages abstraction levels, description languages, modelling styles and other topics related to transactor usage in cosimulation will be briefly discussed.

2.1 Electronic System Level design. SystemC.

Electronic System Level (ESL) design encompasses the concurrent design of the hardware and software parts of an electronic product [2]. It is motivated by the growing difference between the productivity of engineers and the possibilities that technology offers. This is known as the productivity gap, see Fig. 1.

Fig. 1: Productivity gap. Source: International Technology Roadmap for Semiconductors

A generic design activity usually starts from some sort of description of what the system does, called the specifications. The description can be of many kinds: text in natural language, schematics, algorithms, UML diagrams, and many more. System design aims at describing how the system performs its functionality, and at providing a solution for further implementation. ESL design ends where implementation starts, with implementation models for all software and hardware elements in the system. These include:
- Embedded software application.
- Custom integrated circuits.
- Hardware platform.

Fig. 2 illustrates the system design flow, starting from an idea, fixing the specifications, and refining them down to a physical system formed by hardware and software.

The use of system specifications has two main purposes [3]:
- Application oriented: The specifications are intended to verify whether the system under design will solve the posed problem with all its functional and non functional requirements. This purpose is directed towards the problem domain.
- Implementation oriented: Functionality and constraints of the system are defined. The specification itself can be used as a base for further development through refinement. In order to do this with sufficient accuracy, system level designers usually assume a structural partitioning and an approximate assignment of functions to components.

SystemC is a C++ class library, together with a design methodology for using the library to model hardware architectures, software algorithms and interfaces at the system level. The SystemC class library provides the necessary constructs to model system architecture, including timing and concurrency [4]. Since it is C++, there is a broad set of compilers, debuggers and development environments available. One of the main advantages of this language is the possibility of creating executable specifications. An executable specification ensures unambiguous interpretation, allows validation prior to implementation, and can be used as a base design to be refined into RTL. Some of the constructs and features of SystemC are:
- Modules.
- Processes.
- Ports.
- Signals.
- Rich set of ports and signal types.
- Rich set of data types.
- Clocks.
- Cycle-based simulation kernel.

- Multiple abstraction levels models of the same design.
- Multi-level communication protocols.
- Debugging support.
- Waveform tracing.

For further information please refer to the SystemC standard [4] and the literature [5], [6], [7], [8], [9].

2.2 Transaction Level Modelling.

Traditionally, systems were modelled at a cycle and pin accurate level in RTL, and then simulated before synthesis. However, System on Chip (SoC) designs are nowadays large and complex, which makes design and simulation directly in RTL too slow. To overcome these limitations, system designers tend to raise the abstraction level of system models.

The primary goal of Transaction Level Modelling is to increase simulation speed, while maintaining enough accuracy for the design task. This speed-up is achieved by reducing the number of events and the amount of information processed during simulation to the minimum required. For example, instead of driving all the individual signals of a bus protocol, models exchange only the data payload and control directives.

The Open SystemC Initiative (OSCI) is an independent not-for-profit organization composed of a broad range of companies, universities and individuals dedicated to supporting and advancing SystemC as an open source standard for System Level design [10]. OSCI proposes a series of TLM interfaces that define how models communicate, and a set of levels of abstraction to model designs. Those levels aim to cover different representations of a complete system, ranging in detail from a functional description to a detailed synthesisable hardware model. Fig. 3 shows the abstraction levels, from the most abstract Algorithmic Level down to the more concrete RTL. Abstract levels are commonly referred to as high levels, while the more detailed ones are referred to as low levels.

Fig. 3: Abstraction levels. Source ARM

Those levels are:
- Communicating Processes View (CP): An algorithmic or functional model of the system, not associated with a particular architecture. At this level the system is partitioned into concurrent activities. The CP is intended for very early verification of the system functionality and for assisting architects and marketing forces with early customer engagements.
- Programmer's View (PV): A functionally correct model of the System on Chip that enables embedded software developers to start development of the system's firmware. The PV level allows the system to be represented as it logically appears to the embedded applications programmer. Time is not represented, but partitioning between hardware and software tasks exists. Tasks are mapped to devices that model at bit level all the registers accessible to the programmer. The memory map needs to be accurately modelled. Skipping the modelling of time speeds up simulation, while remaining suitable for most firmware development.
- Programmer's View with timing (PVT): A model that allows architects to identify and solve potential bottlenecks. It is similar to PV, but includes some timing information. Architects require a functional model of all elements in the system, and models that provide sufficient timing information to analyse the performance of different architectures.
- Cycle Accurate (CA) or Cycle Callable (CC): Accurate implementation down to the cycle and register-transfer level. It allows verification engineers to use system level models as testbenches for RTL implementations. Models at this level have to be able to interface with RTL models described in an HDL, such as Verilog or VHDL.

Fig. 4: Equivalence between OSCI TLM abstraction levels (CP: Communicating Processes (YAPI); PV: Programmer's View; PVT: Programmer's View with Time; CA: RTL implementation / cycle accurate models). Source NXP

Transaction Level Modelling (TLM) is motivated by a number of practical advantages [11], [12].
These include:
- Providing an early platform for software development.
- System Level Design Exploration and Verification.
- The need to use System Level Models in Block Level Verification.
- Modelling of communication infrastructure at different levels of abstraction.
- High-level transactions allow efficient links to prototyping hardware.
- High-level transactions can (and should) be used for testbenches.
- Use of transaction refinement to increase modularity and reusability.

There is currently no industry standard for TLM. Work is therefore ongoing in different consortia of companies and research centres to try to agree on one. Cadence [12], a company strongly involved in the standardisation process, suggests that a successful standard requires the following properties:
- It must be easy, efficient and safe to use in a concurrent environment.
- It must enable reuse between projects and between abstraction levels within the same project.
- It must easily model hardware, software and designs which cross the hardware-software boundary.
- It must enable the design of generic components such as routers and arbiters.

For further details please refer to [11], [12], [13].

2.3 HDL and RTL.

Wikipedia [14] defines Register Transfer Level (RTL) description as a way of describing the operation of a synchronous digital circuit. In RTL design, a circuit's behavior is defined in terms of the flow of signals (or transfer of data) between hardware registers, and the logical operations performed on those signals. RTL is at the same level of abstraction as CA, or at a lower level, but designs modelled at these two levels can always be interconnected, since CA is defined at cycle and pin accurate level.

Hardware Description Languages (HDL) are languages designed to represent circuits at the register transfer level of abstraction, from which lower-level representations can be derived. An RTL description is usually distilled to a gate-level description of the circuit by a logic synthesis tool, and the results can be used by placement and routing tools to create a physical layout. See Fig. 5 for an example of an RTL circuit.
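The contrast between the transaction level view of Section 2.2 and the register transfer view of Section 2.3 can be illustrated with a minimal sketch. It is written in Python rather than SystemC or Verilog purely to stay self-contained. The class names, the helper functions and the two-phase bus (address phase, then data phase) are hypothetical simplifications chosen for illustration, not a model of AHB or of any block used in this thesis. The sketch shows that one transaction level call corresponds to several clock evaluations at the cycle level, which is the source of the TLM speed-up.

```python
from dataclasses import dataclass, field


@dataclass
class TlmMemory:
    """Transaction level view: one function call moves a whole payload."""
    storage: dict = field(default_factory=dict)
    calls: int = 0  # number of simulation events (function calls)

    def write(self, address: int, data: int) -> None:
        self.calls += 1
        self.storage[address] = data

    def read(self, address: int) -> int:
        self.calls += 1
        return self.storage.get(address, 0)


@dataclass
class CycleMemory:
    """Cycle level view: registers capture their inputs on every clock edge."""
    storage: dict = field(default_factory=dict)
    addr_reg: int = 0         # address captured during the address phase
    write_reg: bool = False   # direction captured during the address phase
    active_reg: bool = False  # True when a data phase is pending
    cycles: int = 0           # number of clock edges evaluated

    def clock(self, select: bool, write: bool, address: int, wdata: int) -> int:
        """Evaluate one clock edge; return read data of a pending data phase."""
        self.cycles += 1
        rdata = 0
        # Data phase of the transfer whose address was registered last cycle.
        if self.active_reg:
            if self.write_reg:
                self.storage[self.addr_reg] = wdata
            else:
                rdata = self.storage.get(self.addr_reg, 0)
        # Address phase: register the control signals for the next cycle.
        self.active_reg = select
        self.write_reg = write
        self.addr_reg = address
        return rdata


def cycle_write(mem: CycleMemory, address: int, data: int) -> None:
    """Drive the two clock cycles that make up one write (hypothetical protocol)."""
    mem.clock(True, True, address, 0)   # address phase
    mem.clock(False, False, 0, data)    # data phase


def cycle_read(mem: CycleMemory, address: int) -> int:
    """Drive the two clock cycles that make up one read (hypothetical protocol)."""
    mem.clock(True, False, address, 0)  # address phase
    return mem.clock(False, False, 0, 0)  # data phase returns the stored value
```

With this sketch, a write followed by a read costs two calls on `TlmMemory` and four clock evaluations on `CycleMemory`; a real pin accurate bus would add many more signals and wait states per transfer.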

2.4 Transactors

A transactor is an entity that provides communication between modules written at different levels of abstraction. In order to communicate with a model at a particular abstraction level, a transactor must include a functional model of the communication channel at that level.

Transactors only make sense in communication channels. They are basically formal definitions of communication protocols, where the protocol is defined at different levels of abstraction. Those levels are generally related through hierarchical trees, where high level transactions branch into one or several lower level transactions. Those transactions branch in turn into more transactions, messages or wire sequences.

A transactor can have many added features, but its main functionality is to transform communication at one level of abstraction down to the level below, and to recognize activity at one abstraction level and transform it up to the level above. See Fig. 6, where R arrows mean Recognition and G arrows Generation.

Fig. 6: Transactor connecting PVT and CA blocks. Source [15]

Transactors can support two or more abstraction levels, and can be unidirectional, bidirectional, or have all the communication information available simultaneously at all the supported abstraction levels. In the following chapters we will analyse more exhaustively the properties expected from transactors, as well as some of their uses in the digital design flow.

2.5 Cosimulation

In the scope of the present work, cosimulation refers to the joint simulation of a number of designs of different nature. When the difference lies in the language in which the designs are described, we talk about mixed language cosimulation. When the difference lies in the level at which the designs are described, we talk about mixed level, or multilevel, cosimulation.

Fig. 7 represents a system formed by blocks modelled at different levels. Such systems are commonly used as virtual prototypes at various design stages. In Fig. 7 the abstraction level is represented as height. It can be seen that it is not possible to connect blocks at different levels directly. The gap is represented in black, and specific connectors are necessary. These connectors are the transactors.

Fig. 7: Virtual prototype with parts modelled at different abstraction levels

Co-emulation is similar to cosimulation. In co-emulation, part of a system is simulated in software and part is mapped onto actual hardware. This hardware is commonly programmable logic, such as FPGAs.

2.6 PSE

The purpose of the Philips SystemC Environment (PSE) is to provide a standard modelling methodology for SystemC users at NXP, formerly Philips Semiconductors. The PSE library and the CoReuse standard are designed together to provide the infrastructure for creating highly reusable TLM models [16]. PSE arranges TLM communication, storage, timing and behavioural aspects in a systematic way, to make model design more modular and more efficient to develop and test. It is based on the current IEEE SystemC, Open SystemC Initiative (OSCI) and OSCI TLM standards. Fig. 8 shows the position of PSE in relation to SystemC and other standards.

Fig. 8: PSE positioning for model development. Source [16]

2.7 SPRINT project

The Open SoC Design Platform for Reuse and Integration of IPs (SPRINT) is a European project to research and promote open interface and modelling standards for IP integration in the European semiconductor industry. It promotes enhanced IP reuse, consistent design across abstraction levels, and enhanced automation of IP integration, verification and debug. The project's main objective is:

To enable Europe to be the leader in design productivity and quality in System-on-Chip design, by mastering the SoC design complexity with effective standards and design technology for reuse and integration of Intellectual Property (IP) modules.

The SPRINT Project is partly funded under the European Union's IST Sixth Framework Program and partly by the project members. The alliance was started in February 2006 for an initial period of 30 months. Project members include chipmakers NXP Semiconductors, STMicroelectronics and Infineon Technologies; IP vendors ARM, Evatronix S.A., and Syosil; EDA vendors Spiratech Ltd, Lauterbach and KeesDA; research groups at Paderborn University, TIMA and the Royal Institute of Technology (KTH); and the ECSI association for training and dissemination.

The present work is part of Work Package 1: Validate standards for SoC Design. This first phase intends to generate requirements for SPRINT design technologies and standards, carry out proof-of-concept design activities, update IP and tools with SPRINT standards, and demonstrate and measure improvements.

3 Main transactor use cases.

This chapter summarizes the practical possibilities offered by transactors. It first presents a generic compendium of transactor roles and standard features. It then presents current industrial use cases for transactors and cosimulation. These use cases are intended to be representative of the different activities in a design flow that make use of transactor technology.

3.1 Transactor roles

The position of a transactor, placed in a communication channel, turns it into a rich source of information. Its architectural support for different levels of abstraction allows many possibilities for communication, debug and analysis in multi level systems. Some of the main features of transactors are described here [15]. The combination of some of these features for a specific purpose is called a use case.

Enable communication between abstraction levels

This is the foremost and most basic feature of transactors: allowing communication between designs that implement the same communication protocol but are modelled at different abstraction levels. Fig. 9 shows graphically the interconnection of the Programmers View and Cycle Accurate levels. Transactors are able to recognise automatically the high level messages or transactions formed by low level sequences. Once the recognition is done, and provided that there are no protocol violations, the transactor generates the corresponding low level activity and high level messages at all the abstraction levels.

Fig. 9: Transactors enable communication between abstraction levels. Source [17]

The availability of the same information at all the levels of abstraction at the same time allows the transactor to be used as a transparent glue adaptor between the desired designs. From the behavioural point of view, the transactor does not exist. In a hypothetical automated design environment, a system could be set up with designs modelled at different levels, and the environment would place the necessary transactors in a way that is transparent to the user. Such an environment is an active research topic in semiconductor companies and EDA vendors.

Protocol checking

In order to recognize and generate transactions, transactors have a built-in formal definition of the protocol they are designed for (see Fig. 10). Strict error checking is therefore carried out on the activity at all their interfaces.

Fig. 10: Built-in AHB protocol tree in a transactor. Source: SpiraTech [1]

Transactors can have added functionality, with algorithms that act in a specific way when a violation is found in a transaction. For test purposes, for example, it might be interesting to accept an erroneous transaction, correct it to its most likely value, and then throw a warning to the simulator. For other simulation purposes it may be better for the transactor simply to ignore defective transactions, or to return an error to the sender. These are all configurable functions that a transactor can be programmed to perform. Fig. 11 shows how transactors monitor communication channels at very different levels of abstraction, checking for protocol violations. In order to be reliable, the protocol checking must be exhaustive in all its aspects:
- Timing: between transactions, and at signal level (hold, set up times).
- Structure: illegal combination of signals, incomplete structures.
- Value: illegal value usage, data coherence.
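The generation and recognition roles, together with the three kinds of protocol check listed above, can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the `Write` transaction, the `Cycle` signal bundle, the two-phase toy protocol and the word alignment rule are all hypothetical and much simpler than AHB, and the sketch does not reflect how the SpiraTech transactor is implemented. Timing is checked only as ordering between phases, since the toy protocol has no notion of set up or hold times.

```python
from typing import List, NamedTuple


class Write(NamedTuple):
    """High level transaction: one write of `data` to `address`."""
    address: int
    data: int


class Cycle(NamedTuple):
    """Low level view: signal values sampled on one clock edge."""
    req: int    # 1 during the address phase
    addr: int
    valid: int  # 1 during the data phase
    data: int


class ProtocolError(Exception):
    """Raised when low level activity violates the toy protocol."""


def generate(transaction: Write) -> List[Cycle]:
    """Generation: expand one transaction into its cycle level sequence."""
    return [
        Cycle(req=1, addr=transaction.address, valid=0, data=0),
        Cycle(req=0, addr=0, valid=1, data=transaction.data),
    ]


def recognise(cycles: List[Cycle]) -> List[Write]:
    """Recognition: rebuild transactions from cycles, checking the protocol."""
    transactions = []
    pending_addr = None  # address phase seen, data phase still expected
    for n, c in enumerate(cycles):
        # Structure check: illegal combination of signals.
        if c.req and c.valid:
            raise ProtocolError(f"cycle {n}: req and valid asserted together")
        if c.req:
            # Timing (ordering) check: one outstanding request at a time.
            if pending_addr is not None:
                raise ProtocolError(f"cycle {n}: new request before data phase")
            # Value check: the toy protocol only allows word aligned addresses.
            if c.addr % 4 != 0:
                raise ProtocolError(f"cycle {n}: unaligned address")
            pending_addr = c.addr
        elif c.valid:
            if pending_addr is None:
                raise ProtocolError(f"cycle {n}: data phase without request")
            transactions.append(Write(pending_addr, c.data))
            pending_addr = None
        # Idle cycles (req=0, valid=0) are accepted anywhere.
    # Structure check: incomplete transaction at the end of the trace.
    if pending_addr is not None:
        raise ProtocolError("incomplete transaction at end of trace")
    return transactions
```

In this sketch `recognise(generate(w))` returns `[w]` for any aligned `Write`, which is the round trip property that lets a transactor act as a transparent adaptor. Here every violation raises an error; a configurable transactor could instead correct or drop the offending transaction, as discussed above.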

Fig. 11: Multi level view of a communication channel (PV: Programmer's View; PVT: Programmer's View + Timing; CA: Cycle Accurate; CC: Cycle Callable). Source: NXP [18]

Performance analysis

Transactors are situated in communication channels, so all the data flowing in the channel passes through them. This places the transactors in a strategic situation to collect traffic, performance and usage data of the channels. The availability of multiple views of the communication protocols at different levels facilitates the gathering of statistics.

Fig. 12: NXP performance analysis tool window

We can analyse the following aspects:
- Bandwidth: bits per second transferred.
- Throughput: data units crossing the bus per second.
- Latency: time to complete a transaction.
- Occupancy: percentage of simulation time that the bus is active. It has a different meaning when measured at different abstraction levels.
- Concurrency parameters: percentage of out of order completions, late arrivals or discarded data units. Specific for protocols supporting concurrency.

This information can be provided by the transactor as elaborated statistics, or simply as raw data for further processing.

Test coverage analysis

Transactors can provide useful information about tests applied to designs, when they are connected between the test-bench and the Design Under Test (DUT). The quality of the test can be measured, in terms of coverage analysis and distribution of the vectors applied. A simulation set-up for this purpose can be seen in Fig. 13.

Fig. 13: Use of transactors for analysing the coverage of test applied to DUT. Source [19]

This analysis can be done at different levels:
- Value level: Monitoring of value and toggling of individual parameters or groups of them at all the abstraction levels.
- Structure level: patterns of combination of protocol elements, where the result of applied vectors might result in responses having different structures.
- Time level: Analysis in terms of order, precedence and in general not precise timing relations.
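Both the performance aspects and the value level coverage described above can be derived from a log of the transactions recognised by a transactor. The following Python sketch illustrates this under stated assumptions: the `Record` structure and all function names are hypothetical, transactions are assumed not to overlap in time, and coverage is reduced to counting which equally sized address bins were hit. It is not the algorithm of the NXP analysis tool or of any commercial transactor.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """One recognised transaction, as a transactor might log it (hypothetical)."""
    start: float   # simulation time at which the transaction began, in seconds
    end: float     # simulation time at which it completed, in seconds
    address: int
    bits: int      # payload size in bits


def bandwidth(log: List[Record], sim_time: float) -> float:
    """Bits transferred per second of simulated time."""
    return sum(r.bits for r in log) / sim_time


def throughput(log: List[Record], sim_time: float) -> float:
    """Data units (here: transactions) crossing the bus per second."""
    return len(log) / sim_time


def mean_latency(log: List[Record]) -> float:
    """Average time to complete a transaction."""
    return sum(r.end - r.start for r in log) / len(log)


def occupancy(log: List[Record], sim_time: float) -> float:
    """Percentage of simulated time the bus is active (assumes no overlap)."""
    return 100.0 * sum(r.end - r.start for r in log) / sim_time


def address_coverage(log: List[Record], bins: int, address_space: int) -> float:
    """Value level coverage: fraction of address bins hit at least once."""
    bin_size = address_space // bins
    hit = {r.address // bin_size for r in log}
    return len(hit) / bins


# Example log: two 32-bit transfers observed during 8 seconds of simulated time.
example_log = [
    Record(start=0.0, end=1.0, address=0x00, bits=32),
    Record(start=2.0, end=4.0, address=0x40, bits=32),
]
```

Because the functions take raw records as input, the same log could be delivered by the transactor as raw data and processed afterwards, or reduced to these statistics inside the transactor itself.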

3.2 Industrial use cases

A series of common use cases for transactors and cosimulation is presented, along with a concise description of specific projects implementing each of them. These use cases are intended to cover the different activities in a design flow that involve transactor usage.

3.2.1 SoC design with strong IP reuse.

This is the most widespread use of cosimulation, encouraged by all the main EDA vendors and a growing trend in industry. Nowadays, as designs get bigger, Systems on Chip rely more and more on IP reuse and software programmability [20]. In many cases these IPs are assumed to be bug free. Verification then focuses on testing the interoperability of the system rather than the IPs themselves. Transaction Level Models are considered to be very effective in verifying such interactions between blocks.

High time-to-market pressure and the high programmability of blocks drive the main verification efforts towards hardware flaws that would prevent the hardware from being manufactured. Other errors that could be fixed afterwards in software are regarded as less important, and may not be fully verified prior to silicon fabrication.

On the other hand, timing constraints are heavily verified, because many SoCs are designed for real time applications. Critical blocks, as well as newly designed blocks, are therefore modelled at a highly accurate level, often in RTL. Well-known blocks such as processors or bus models tend to be kept at a high level in order to speed up simulations. While the newly designed blocks might constitute only a small fraction of the system, their complexity and the potentially high cost of undiscovered bugs force an extensive verification of these blocks. Transactors are used to connect these low abstraction level SystemC or HDL blocks to the rest of the high level blocks.

For these set-ups, highly integrated simulation tools supporting mixed language simulation, debugging and analysis capabilities are used. Cadence, CoWare and Mentor are among the EDA vendors providing these simulation environments.

3.2.2 Co-emulation using synthesizable transactors.

Sometimes RTL designs are simply too big to be simulated at a reasonable speed for a specific purpose. This is the case for the validation of complex IPs integrated in TLM subsystems. The cosimulation speed can be dragged down by RTL processors or heavy multimedia cores. Fig. 14 shows a system including both cosimulation and co-emulation of RTL cores. This TLM system performs cosimulation with Module A and co-emulation with Module B. Co-emulation requires the use of synthesizable transactors.

In these cases a solution can be to synthesize the RTL block in real hardware and provide a communication link to the simulated rest of the system. For such a link, a transactor is needed. One possibility is for the transactor to be fully simulated, to map the RTL wires to a physical interface, and to connect the wires to the hardware where the IP has been


synthesized. This approach has a big drawback: communication is done at RTL level, which makes the simulation slow, with no real advantage over pure simulation.

The second approach is to place a transactor in the interface, with a simulated part and a hardware synthesized part. As an example of this case, let us take an NXP case. The System Design Methodology department developed a prototype intended to maximize the cosimulation speed by minimizing the amount of data flowing through the hardware-software interface. Only high level transactions cross the interface, instead of wire values. In this way the number of events going from the simulator to the hardware board is minimized. The transactor translates read and write calls on its TLM interface into AHB signals on its RTL interface. It acts as an AHB slave on the TLM bus and as an AHB master on the RTL bus.

The tools used for this development were Esterel Studio, as a means to describe the transactors at a higher level, and EVE Corporation boards for the physical connection.

Esterel's tool provides graphical and textual means to create finite state machines, and also provides a way of describing assertions in its code. A protocol can therefore be modelled and the result synthesized to create a transactor. The defined assertions can also be synthesized along with the design to increase error visibility [21].

EVE technology provides tools for hardware-software co-verification. It allows a hardware portion of a design to be mapped onto specific logic and memory resources, and supports communication between hardware and software through a PCI interface. In this way it can execute software drivers, operating systems or applications at high speed, while providing full hardware and software debugging capabilities. For this project a standard API was used to send data messages through the interface.

The described set-up allows fast simulation of virtual prototypes with complex IPs for verification purposes.
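To illustrate why sending transactions instead of wire values reduces the traffic over the simulator-to-hardware link, the sketch below expands one write call into per-cycle pin values and counts the resulting pin changes. It is a simplified two-phase write (address phase, then data phase) whose signal names are borrowed loosely from AHB for readability; it is not a compliant AHB model and not the NXP transactor. The functions `write_to_cycles` and `count_wire_events` are hypothetical.

```python
IDLE_BUS = {"HTRANS": "IDLE", "HWRITE": 0, "HADDR": 0, "HWDATA": None}


def write_to_cycles(addr, data):
    """Expand one transaction-level write into two cycles of pin values."""
    return [
        # Address phase: the transfer is announced, no write data yet.
        {"HTRANS": "NONSEQ", "HWRITE": 1, "HADDR": addr, "HWDATA": None},
        # Data phase: the control signals return to idle, write data is driven.
        {"HTRANS": "IDLE", "HWRITE": 0, "HADDR": 0, "HWDATA": data},
    ]


def count_wire_events(cycles):
    """Count pin value changes between consecutive cycles, starting from an idle bus."""
    previous = IDLE_BUS
    events = 0
    for pins in cycles:
        events += sum(1 for name in pins if pins[name] != previous[name])
        previous = pins
    return events


cycles = write_to_cycles(0x100, 0xAA)
print("messages at transaction level:", 1)
print("pin changes at wire level:", count_wire_events(cycles))
```

Under these assumptions a single write is one message at transaction level but seven pin changes spread over two clock cycles at wire level, which is the kind of reduction the hardware-synthesized half of the transactor is meant to provide.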
The complex RTL model under test is synthesized in the hardware

board while the rest of the system remains in fast SystemC high level models. This provides full accuracy in the synthesized IP, with much higher speeds than RTL simulation, while reducing the bottleneck effect of the hardware/software interface.

3.2.3 Architecture exploration.

Modifications in the behaviour and architecture of systems and components are much easier to make with high level models than with RTL IPs. The use of transactors provides more realistic performance statistics during architecture exploration, since it allows sensitive IPs to be kept in RTL and so takes advantage of their full accuracy.

Fig. 15: System architecture exploration process. Source [22]

The traffic, usage and performance statistics provided by transactors inserted in, or connected to, buses are accurate data. This information helps the designer choose the optimal system configuration. The visualisation tools available in some advanced transactors, such as graphical user interfaces or traffic log files, also give a more intuitive view of the system.

Virtual system prototypes for architecture exploration, among other purposes, are widely used in industry and academia. One example of the use of such systems with mixed level blocks and transactors is Toshiba. Its design kit for its user-configurable media embedded processor is based on an ESL design environment that enables the designer to customize the configuration for a particular application. Designers can explore different configurations to determine the optimal one, and not only validate the architecture but also verify that individual hardware and software modules meet the system requirements [23].
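The bus statistics mentioned above (bandwidth, throughput, latency and occupancy, as defined in the performance analysis section) could be derived from a transaction log roughly as follows. This is a sketch under stated assumptions: the log format `(start, end, n_bytes)`, the function name `bus_statistics` and the choice of one transaction as the "data unit" are all hypothetical, and the log is assumed to be non-empty.

```python
def bus_statistics(log, sim_time):
    """Summarise a transaction log.

    log: list of (start, end, n_bytes) tuples, times in seconds.
    sim_time: total simulated time in seconds.
    """
    total_bits = sum(8 * n_bytes for _, _, n_bytes in log)
    latencies = [end - start for start, end, _ in log]

    # Merge overlapping intervals so concurrent transfers are not counted twice.
    busy = 0.0
    current_start = current_end = None
    for start, end, _ in sorted(log):
        if current_end is None or start > current_end:
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start

    return {
        "bandwidth_bps": total_bits / sim_time,
        "throughput_per_s": len(log) / sim_time,
        "mean_latency_s": sum(latencies) / len(latencies),
        "occupancy": busy / sim_time,
    }


example_log = [(0, 4, 4), (2, 6, 4), (10, 12, 8)]
print(bus_statistics(example_log, sim_time=20))
```

Merging overlapping intervals matters for protocols that support concurrency: two transfers in flight at the same time keep the bus busy once, not twice.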


3.2.4 Faster test writing in high level languages.

Generation of complete test benches for RTL designs is a slow, painful manual process. Exhaustive testing is hardly ever possible, and the selection of the optimal test vectors determines the duration of the verification phase and is crucial for its success. High level tests can be written in a faster and more abstract way with transactors. C language tests, for example, can be elaborated to cover the most interesting cases, and easily modified to take into account newly found bugs or design changes [24].

Massachusetts-based Ammasso designs Ethernet-based products that extend the capabilities of existing Ethernet solutions. They used Cadence PCI-X Transactor technology for the development of a new gigabit Ethernet adaptor, in order to meet the product deadline with a reduced number of engineers. The product was designed on a board with a 2-million-gate Virtex-II Pro FPGA [25]. Applying high-level SystemC tests to the design through transactors allowed Ammasso to spend more time running tests than writing them.

Another feature that makes high level tests faster is that they can be easily modified to target newly found bugs and problematic areas. Complex combinations of events, corner cases and large cross conditions are also easier to cover. The quality of the test can also be measured automatically by the transactor in terms of coverage at all the levels: value, structure and time.

3.2.5 Guaranteeing models equivalence.

A unified test applied to all the abstraction levels of the design models guarantees the equivalence of the models. Refinement from a high level model to RTL is error prone, and so is abstraction from low level IPs to SystemC models. Applying the same test to all the models of a design both shortens the verification time for each model and guarantees the equivalence between them.

Despite the facts above, test benches are level specific, so an intermediate layer is needed to

adapt the test to the different abstraction levels. Transactors maintain the consistency of the communications sent through them across all their interfaces at different levels, so they seem ideal to act as such a layer.

This approach was used by ARM Ltd. for the development of the ARM PrimeCell PL190 Vectored Interrupt Controller. They chose transactor technology from SpiraTech, and RTL model compilation tools from Tenison Design Automation [26].

The test bench module (XVC environment) consists of a DUT container and a standard test. The test, or set of tests, is applied to the container instead of directly to the design. The container is divided into two elements. See Fig. 17 for the structure of the set-up described.

Fig. 17: Test scenario being driven into the XVC environment via the XVC Manager. Source [26]

The top layer is a user-extensible library of defined test actions. The tests make use of this layer. This top, or action, layer is designed to be integrated with verification environments. It can also be connected to a specific manager (XVC Manager) that can drive one or many of these containers (XVC) at a time.

For the system test bench to apply to all the abstraction levels without alteration, the test stimulus must be passed to the DUT through a driver that can handle multiple abstractions. Those drivers are the transactors. The lower layer of the container is based on a transactor that supports bidirectionality, since the top layer of the container must be able to both drive and monitor the DUT. This restriction is not significant for commercial or generic use transactors, since most of them support bidirectionality and simultaneous availability of the


transaction at all the levels. On the other hand, specific-use, hand-coded transactors are usually simple unidirectional drivers, unsuitable for this use.

3.2.6 SoC verification against real target environment.

This technique proposes verifying a SoC, or any design, without writing specific RTL test vectors. Instead, the design is integrated and run in the target environment. This can be used, for example, in the development of a new video processing chip for an existing PCB system.

This is the case of the isave verification methodology developed by Dynalith Systems Co. and the Korea Advanced Institute of Science and Technology [27]. The main idea behind it is to enable fast verification of early stage designs by getting fully accurate test vectors from the target environment. In this way test vectors do not have to be coded by hand for the design. Instead, the target environment is programmed to run in all its possible modes, which saves time and avoids human mistakes.

Fig. 18 shows the isave design flow and its physical mapping. The target PCB is the system in which the SoC under development will be integrated once it is ready. Physical connections between the interface sub-model mapped on an FPGA and the target PCB are represented with triangles.

Fig. 18: isave design flow and its physical mapping. Source [27]

The methodology consists of separating the design model into two sub-models. One is the functional sub-model, describing the function of the design at a purely algorithmic level. This can be done in C++ or SystemC, and might have a notion of the structure in blocks. The

other part is the interface sub-model, which holds the information about the inputs, outputs, protocols and other design boundaries. This second sub-model needs to be described at pin and cycle accurate level, because it will have to interact with physical hardware. Verilog and VHDL are the most common languages for describing it. These two sub-models represent the behavioural and black box views of the design.

Once the design is split into two sub-models, the interface sub-model is synthesized in an FPGA along with a transactor and a debugger circuit. The transactor used is therefore synthesizable, and is in charge of translating the high level messages coming from the functional sub-model into wire activity. The interface sub-model is physically connected to the interfaces of the target system, and has as many independent communication ports as the system requires.

3.2.7 Constructing reusable test benches

Dividing tests into layers helps to simplify complex problems. Layering reduces test complexity and increases reusability, at the cost of sticking to a fixed methodology.

Synopsys proposes a structure of vertical and horizontal layers, where the vertical ones provide communication and the horizontal ones the major functionality of the tests. See Fig. 19 for a graphical representation of the layers described.

The higher the layer, the greater the abstraction and the smaller the control over the DUT. This allows engineers to write tests more efficiently and, most importantly, allows tests to be independent of the hardware. The lower the layer, the greater the control over the DUT, and the more details the layer has about protocols, timing and pins. At the lowest level there is a specific transactor for the protocol, but it has no notion of the structure of the tests.

Fig. 19: Layered test-bench container structure. Source [28]

Many layers make up the test structure, each with specific functions:

Utility: It provides a set of global functions, such as printing and logging.
Communication: It is associated with multiple protocols, interconnects components and provides packet queues to store messages.

Transactor: It is associated with a specific protocol. It can typically be reused for any test bench using that protocol. It acts as an interface between the DUT and the test bench.
Generate/Check: A generator is also associated with a single protocol. It creates random data units based on a series of parameters. It then sends one copy to the layer below and another to a Predictor. The Predictor generates the expected response, which is compared to the actual response by the Checker. In case of a mismatch an error is raised.
Configuration: It has two functions: configuring the test benches and providing a stable interface between the tests and the lower layers.
Tests: The test code communicates with the test bench by making calls to the configuration layer. Ideally the tests are small, easy to write and independent of the lower layers. They can contain randomized constraints and parameter settings that will be translated into different actual vectors depending on the rest of the layers.

Synopsys proposes a quantitative measure of test reusability based on coefficient tables. For further reference see [28].
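The interplay of the Transactor and Generate/Check layers described above can be sketched in a few lines of Python. This is an illustration only, not the structure of the Synopsys methodology: the DUT is stood in for by a dictionary-backed memory, and all names (`Transactor`, `Predictor`, `generate`, `run_test`, `FaultyDut`) are hypothetical. The generator here produces a random write followed by a read-back of the same address, so that every pair exercises the checker.

```python
import random


class Transactor:
    """Protocol-specific layer: applies a transaction to the DUT."""

    def __init__(self, dut):
        self.dut = dut

    def execute(self, txn):
        kind, addr, data = txn
        if kind == "write":
            self.dut[addr] = data
            return None
        return self.dut.get(addr, 0)


class Predictor:
    """Reference model that produces the expected response."""

    def __init__(self):
        self.model = {}

    def expected(self, txn):
        kind, addr, data = txn
        if kind == "write":
            self.model[addr] = data
            return None
        return self.model.get(addr, 0)


def generate(pairs, seed, addresses=4):
    """Generator layer: random write followed by a read-back."""
    rng = random.Random(seed)
    for _ in range(pairs):
        addr, data = rng.randrange(addresses), rng.randrange(256)
        yield ("write", addr, data)
        yield ("read", addr, None)


def run_test(dut, pairs=50, seed=1):
    """Checker: compare actual and expected responses, return the mismatches."""
    transactor, predictor = Transactor(dut), Predictor()
    mismatches = []
    for txn in generate(pairs, seed):
        actual, expected = transactor.execute(txn), predictor.expected(txn)
        if actual != expected:
            mismatches.append((txn, expected, actual))
    return mismatches


class FaultyDut(dict):
    """Stand-in DUT with an injected fault: bit 0 is flipped on every read."""

    def get(self, key, default=0):
        return super().get(key, default) ^ 1


print(len(run_test({})), "mismatches on the correct DUT")
print(len(run_test(FaultyDut(), pairs=10)), "mismatches on the faulty DUT")
```

Only `Transactor` knows how to talk to the DUT, so replacing it with a transactor for another abstraction level would leave the generator, predictor and checker unchanged, which is the reuse argument made in this section.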

4 Ideal transactor requirements.

A list of desirable features of commercial transactors has been elaborated from the bibliography studied. It evaluates a series of aspects that are not necessary for the basic functionality of transactors, but which represent added value. It is not intended to be exhaustive, but only a reference for designers who are new to the use of transactors.

Bidirectionality: Simple unidirectional transactors can easily be hand written as drivers from one abstraction level to another. However, bidirectionality is necessary for driving and monitoring an interface at the same time.

Simultaneous coherence at all the abstraction levels: The number of abstraction levels supported by a transactor can vary from two to an arbitrary number. Maintaining coherence at all of them all the time means that a transaction can be monitored at various abstraction levels at once. A transaction can also be initiated at any abstraction level.

Transactions visualisation: A built-in graphical user interface used to display graphical information during a simulation. A time line on which transactions are represented is of particular interest, because transactions have a format that cannot be represented with standard visualization tools, due to their special treatment of time. Fig. 20 shows the runtime transaction visualization tool of SpiraTech Ltd. Recall that some transactions, such as those at PV level, are timeless and only have a defined order. A series of high level timeless transactions represented with a standard visualization tool would result in a single black line at time zero, since real time does not move forward.

Fig. 20: Cohesive visualization GUI. Source: SpiraTech

Connection to an external database: Many of the most interesting features a transactor can have require communication with external databases or files. Data gathered during simulation can then be stored in a database, either in a standard format to be processed by third party programs, or in a proprietary format to be used with a debug environment from the transactor's vendor. A very useful feature for

performance analysis, statistics elaboration, and timeless transactions visualisation.

Configurability: Support for different bus bit lengths, or for different equivalent logic representations such as binary, standard logic or boolean, makes the integration engineer's life easier.

Full support of protocol standards: In order to act successfully as a protocol checker and as a transparent adaptor, the whole set of protocols must be supported. Some standards have reduced subsets of protocols, such as the Lite version of AMBA™ AHB™. It is desirable that transactors support the widest possible set.

Standard TLM interfaces: It is desirable that transactors follow standards in their interfaces. If such standards do not exist, either an exhaustive description of the abstraction process followed, or a set of adaptors to widely used existing TLM styles, would be of the utmost help.

Synchronisation support for low level clocks of different transactors instantiated in the same design: This is required in systems where two or more ports connect an RTL block to a TLM system (see Fig. 21). One transactor is instantiated per port, and each one generates one clock signal. However, the RTL core has only one clock input for all the ports. In order to drive the core, a mechanism is necessary that synchronizes the clocks of all the transactors.

Fig. 21: Need for synchronization of the clocks of different transactors. Two TLM transactors, each with its own clock (Clk), drive the wires of a single RTL block; one port will have timing problems.

Synthesizability: The possibility of mapping a transactor into real or simulated hardware. In systems having parts simulated in TLM and parts implemented in physical hardware, a transactor has to be situated in the interface. This requires the transactor, or part of it, to be synthesizable. In such a situation it is convenient to use a high abstraction level in the hardware-software interface. Using high level calls, as in PV interfaces, reduces the communication overhead and minimizes the impact on the system speed.

Modelling high level TLM communications using blocking function calls: This is

necessary for guaranteeing execution order in the case of cosimulation of timeless TLM models and RTL blocks.

Fig. 22 correlates the studied use cases of transactors, in the columns, with the properties the transactors have to fulfil, in the rows. It can be seen that some properties are required for the basic functionality. Others result in functional improvements or extra facilities for the designer. There are also a number of transactor properties that are indifferent for specific uses.

Fig. 22: Correlation between use cases and properties of the transactors used. Each cell is marked as a required property or a desirable property.
Use cases (columns): IP reuse; Co-emulation; Architecture exploration; High level test writing; Guaranteeing models equivalence; Verification against real target; Reusable test benches.
Properties (rows): Bidirectionality; Simultaneous abstraction coherence; Transactions visualisation; External database; Configurability; Full standard support; Standard TLM interfaces; Clocks synchronisation; Synthesizability; Blocking function calls.
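The blocking function call requirement listed above can be illustrated with a small Python sketch. An untimed initiator calls `read`, and the call returns only after the cycle-based side has produced the data, so successive calls are necessarily executed in order. `RtlSlave` and `BlockingTransactor` are hypothetical stand-ins, and the fixed response latency is an assumption made for the example.

```python
class RtlSlave:
    """Cycle-based stand-in for an RTL block: answers a read after `latency` cycles."""

    def __init__(self, memory, latency=3):
        self.memory = memory
        self.latency = latency
        self.cycle = 0
        self._pending = None  # (cycle at which the data is ready, address)

    def clock(self, request=None):
        """Advance one clock cycle; return read data when it becomes available."""
        self.cycle += 1
        if request is not None:
            self._pending = (self.cycle + self.latency, request)
        if self._pending and self.cycle >= self._pending[0]:
            addr = self._pending[1]
            self._pending = None
            return self.memory[addr]
        return None


class BlockingTransactor:
    """Turns an untimed, blocking read into clock cycles on the RTL side."""

    def __init__(self, slave):
        self.slave = slave

    def read(self, addr):
        data = self.slave.clock(request=addr)
        while data is None:  # keep clocking until the slave responds
            data = self.slave.clock()
        return data


slave = RtlSlave({0: 11, 4: 22})
bus = BlockingTransactor(slave)
print([bus.read(0), bus.read(4)], "after", slave.cycle, "clock cycles")
```

The initiator never sees a clock. The order of its two calls is preserved simply because the second call cannot start before the first one has returned.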


5 Multi-level, multi-language simulation: Problem statement.

Multi-level, multi-language cosimulation is needed in industry principally as a means of decreasing time to market. Being a rather recent activity, it has no consolidated standard, and many techniques are still immature. Many problems remain unsolved, and possible solutions are yet to be evaluated. In this thesis work, we are interested in the ones directly related to the use of transactors.

In the early stages of design, abstract models are made as an aid to clarifying the designer's ideas and fixing the specifications. For example, SystemC allows executable abstract models, which can be used as an unambiguous specification reference. As the design advances, the description level of the models becomes more detailed, finally resulting in synthesizable RTL.

In typical complex designs, more and more programmable devices, such as DSPs and processors, are part of the system. If software designers had to wait until the whole design was ready, the resulting design cycle would be too long. Verification engineers also want to start testing the system as soon as possible. The ability to set up a simulation where parts described at different levels of abstraction could be plugged together seamlessly would allow verification and software engineers to start their job before the designers have finished theirs. It would also help architects and designers to check the functional behaviour of the system, and to evaluate the effect of making specific changes.

There are also other reasons that lead design methodology engineers to look at mixed level simulation. Some of these are, as discussed previously: to speed up test bench writing, to verify RTL IPs integrated in SystemC virtual prototypes, to perform bus protocol checking, to test designs on mixed hardware-software platforms, to do performance analysis, to enable early software development, etc.

5.1 Main parameters to control in cosimulation

Besides the possibilities and advantages of a multilevel environment, there are also many difficulties. A number of issues have to be controlled. The main ones are time coherence, synchronization, intercommunication between abstraction levels and interconnection of multiple languages. These issues are explained in more detail below.

5.1.1 Interoperability of different levels.

The main function a transactor has to fulfil is to act as a transparent glue layer between IPs that are modelled at different levels of abstraction. This translation of data flowing from one abstraction level to another can be done in many different ways. This is not important from a behavioural point of view. On the other hand, from the point of view of the designer building the simulation set-up, some parameters are important. These are the delay introduced, whether the translations are


bidirectional, and the availability of the data flowing through its various interfaces. All these aspects will be studied in this work with the help of an experiment.

Fig. 23: Designs modelled at different abstraction levels cannot be directly connected

5.1.2 Intercommunication between languages.

Multi-language communication consists of the intercommunication of blocks described at the same level of abstraction but in different languages. It is a well covered feature in many available commercial products. Because of the maturity of the commercial solutions, we will not cover this problem in this thesis work. It will nevertheless be used in our experiments as a necessary tool for addressing the multi-level cosimulation problem in a realistic way.

Fig. 24: Multi language simulation. A system composed of Design A (Verilog), Design B (VHDL) and Design C (SystemC).

5.1.3 Synchronization between abstraction levels

Synchronization is the most challenging problem to be solved in multi-level cosimulation. There is a variety of cases, depending on the abstraction level of the IPs that we want to simulate. The difficulty of maintaining time coherence between different abstraction levels lies in the fact that, by abstracting the functionality of a design, the notion of time itself changes.


Low abstraction levels such as RTL and Cycle Accurate (CA) do not usually pose great difficulties for synchronization, since their notions of time are very similar. The most common situation is a shared clock. Continuing with the RTL and CA example, in RTL time is continuous, while in CA time only advances when signals change. This is not a complication, since every signal change is taken into account in CA models, as seen in Fig. 25. Synchronisation can easily be maintained using the signal toggling events as a reference.

Fig. 25: Timing equivalence between CA (top) and RTL (bottom) is straightforward

On the other hand, there are levels where no concept of time exists at all, such as PV or the algorithmic view. In this case, blocking mechanisms are necessary to reconcile their notion of order with the other levels of abstraction, where every event occurs on a clock basis. These mechanisms have to be implemented in the simulation engine, and the modelling style must allow by construction for the possibility of multi-level cosimulation.

Another extreme cosimulation environment with regard to timing is the use of real hardware to synthesize parts of a system while others are simulated on a computer. In such a situation synthesizable transactors are needed to act as a bridge between the two areas, real and simulated. There are therefore at least two clock domains, physical and simulated, plus other possible simulated parts modelled at different levels with a different treatment of time, such as PV TLM.

An example of a synchronization problem is a PV modelling style that bases its ordering on delta events provided by the simulation engine. It will work efficiently and without problems until cosimulation with an RTL block is attempted. The RTL clock will then wait forever to change its state, because only delta events move forward.

Since there is no modelling standard for time in SystemC at abstract levels, different parties offer different solutions to the synchronization problem. This is therefore an open topic for further improvement.
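The delta event problem described above can be reproduced with a toy event kernel in Python. This is a sketch of the scheduling principle only, not of the SystemC kernel: events are ordered by simulated time and then by insertion order, so a process that keeps re-notifying itself at the current time is always served before a clock edge scheduled in the future. The function `run` and its parameters are hypothetical.

```python
import heapq


def run(pv_wait, max_steps=1000, clock_period=10):
    """Tiny event kernel.

    pv_wait: simulated time the PV process waits between transactions
             (0 models a pure delta notification).
    Returns the final simulated time and the activation count per process.
    """
    queue = []  # entries: (time, insertion order, process name)
    order = 0

    def schedule(time, name):
        nonlocal order
        heapq.heappush(queue, (time, order, name))
        order += 1

    schedule(0, "pv")
    schedule(clock_period, "clk")
    now = 0
    activations = {"pv": 0, "clk": 0}
    for _ in range(max_steps):
        now, _order, name = heapq.heappop(queue)
        activations[name] += 1
        if name == "pv":
            schedule(now + pv_wait, "pv")        # next transaction
        else:
            schedule(now + clock_period, "clk")  # next clock edge
    return now, activations


print("delta only:", run(pv_wait=0))
print("timed wait:", run(pv_wait=10))
```

With `pv_wait=0` the simulated time stays at zero and the clock process is never activated, however many steps are run. As soon as the PV process consumes some simulated time between transactions, both processes make progress.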

5.1.4 Compatibility of different model sources.

In the context of reuse and exchange of designs between development groups or companies, having standards for model representation is crucial. Lower level designs are usually written in one of the de facto industry standards such as VHDL (IEEE-1076) or Verilog (IEEE-1364). Due to their low level of abstraction and their age (Verilog dates from 1985 and VHDL from 1987), there is not much room for ambiguity in the designs. Higher level designs, however, are subject to ambiguity and incompatibilities. Their relative newness and the lack of an industry agreement, added to the intrinsic distance of abstract models from reality, make design exchange at these levels error prone.

Fig. 26: TLM models of the same protocol coming from different sources might not be compatible

There are several projects aimed at agreeing on modelling standards at high levels. Some of them are OSCI (Open SystemC Initiative) and SPRINT (Open SoC Design Platform for Reuse and Integration of IPs). In spite of this, there is no complete standard yet.

5.2 Thesis scope.

The purpose of this thesis is to evaluate a transactor set-up that covers a number of use cases within the Philips SystemC standard methodology. The particular set-up will be examined in order to evaluate whether it fulfils the requirements needed for applying the stated use cases. A correlation will be made between the literature study, the typical issues in cosimulation, and the experiments.

The expected results are a qualitative idea of the benefits and problems of using mixed level simulation, and some guidelines to improve the modelling style and design methodology in multilevel environments.

The experimental set-up will be analysed to see which requirements are fulfilled and which use cases it is suitable for. A correlation will be made with chapters 3 and 4, where widespread industrial use cases and ideal transactor requirements are presented.

6 Proof of Concept Design: solution to the problem.

As a vehicle for research, and in order to face the real problems that a fairly complex multilevel, multi-language system will have, some experiments are carried out. To make the results more realistic and therefore more valuable, a real system with commercial products has been chosen. A set-up for cosimulation has been constructed out of a high level SystemC ARM subsystem, an out-of-the-box Verilog USB core IP, and a commercial transactor (Fig. 27). To demonstrate that the set-up is correct, the verification software provided by the IP vendor has been run unmodified on the ARM processor.

This set-up is intended to be a functional environment that makes use of many of the advantages transactors have in various use cases. Its main uses are software development for the USB core, bus performance analysis, faster simulation for RTL verification, searching for protocol violations and architecture exploration.

The next sections cover a specification of the parts used in the project, followed by a description of the experiments carried out and of the adaptors specifically developed for the set-up.

Fig. 27: Elements involved in the cosimulation experiment. System set-up (cosimulation) = ARM subsystem (TLM level) + transactor (TLM/RTL level) + USB core (RTL level).

The experiments performed involve the following activities:

Verification of the SystemC subsystem.
Verification of the transactor.
Verification of the USB RTL IP.
Development of interface adaptors.
Integration of the ARM subsystem with a transactor and a VHDL memory.
Integration of the ARM subsystem with the USB core and transactors.
Verification of the system set-up through cosimulation.
Measurements and data gathering for performance characterization.
Development of software for demonstration purposes.

6.1 Base subsystem.

The ARM11 SoC Virtual Prototype is a SystemC subsystem used in NXP for architecture


exploration, performance analysis, verification and software development. The various components of this subsystem are TLM SystemC models of actual IPs from the company portfolio. Fig. 28 shows a block diagram of the base subsystem. The system is based on the AMBA™ AXI™ protocol. There is also a proprietary VPB bus. All the models are available at the PV and PVT abstraction levels, and support the Transaction Level Model of the AXI bus at these levels. The most important components are an ARM processor, memory and interrupt controllers, an LCD display, a timer and some memories. See the appendix for a detailed description of the components directly involved in the integration of the USB core.

All the components are instantiated in a top-level entity, defined in the files phtop.h and phtop.cpp. Compiling these files generates a standalone executable system. Software, once compiled for the ARM architecture, can be loaded into the processor with the RealView debugger tool. To give an idea of the complexity of the subsystem, some figures are given:
- Approximately 120K lines of code.
- 643 code files.
- Approximately 50 configuration files, 15 of which were modified for the experiments.
- Able to boot Linux.
- 25 components.

Fig. 28: Components integrating the ARM11 subsystem (NXP)

6.2 Intellectual Properties integrated.

Two designs from companies external to NXP are used in the project: a Verilog USB IP to be verified, from Evatronix, and a transactor that makes the intercommunication possible, from SpiraTech.

6.2.1 USB core.

The USB_OTG_MPD Intellectual Property is a USB On-The-Go controller. It complies with version 2.0 of the USB protocol, with the On-The-Go supplement. It has an integrated DMA controller that handles byte transfers autonomously over the AMBA™ AHB™ bus. The IP connects to the system through two AHB ports, one slave and one master, of configurable bit width. It supports single transfers and bursts. The core application interface module can also generate interrupt signals for a microprocessor. The design is strictly synchronous with positive-edge clocking, has a synchronous reset and has no internal tri-states. It has two separate clock domains, one for the AHB system bus and one for the UTMI+ interface (USB Transceiver Macrocell Interface). The estimated size of the core is approximately 8500 gates (see footnote 1), excluding memory area.

Footnote 1: The calculations have been done indirectly from the product data sheet. A synthesis of the core with speed optimization in a UMC 0.13 µm process results in a µm² area. The average density for such a process is 220K gates per mm² (source which results in 8634 gates.

Fig. 29: USB core block diagram (Evatronix SA)

The USB_OTG_MPD can be used as a dual-role device and can act as a USB peripheral or as a USB host. Fig. 29 shows the internal structure of the core and its interfaces. Detailed list of features [29]:

- Complies with the USB 2.0 specification and On-The-Go supplement.
- Supports HS hubs and multiple Low-, Full- or High-Speed devices in Host mode.
- Supports Full-Speed and High-Speed data transfer in Peripheral mode.
- UTMI+ (Level 3) or ULPI Transceiver Macrocell Interface.
- 32-bit AMBA AHB slave interface, 32-bit AMBA AHB master interface.
- Integrated USB protocol-aware multichannel DMA module.
- Remote Wake-Up function, suspend and resume power management functions.

6.2.2 AHB transactor.

In order to allow communication between the RTL level and the PV level, a transactor is needed. A transactor from the company SpiraTech is chosen. It is a commercial product with support for multiple abstraction levels, and it provides visualisation and debug tools.

AMBA_AHB_v1p0_32_32_ps_NCSIM_5p5

The transactor used in this project is a 32-bit AMBA™ AHB™ protocol definition, pre-compiled to run under the Cadence NCSIM simulator. It supports the Lite subset of the AHB protocol. It is provided with a SystemC interface header, so that it can be included in the system as a standard component. SpiraTech transactors also have their own graphical user interface, which displays performance analysis information and protocol violation warnings. Fig. 30 shows a SpiraTech transactor acting as a bridge between abstraction levels.

Fig. 30: Diagram of a SpiraTech transactor. Source: [15]

6.3 Accomplished Experiments

The idea behind the use of transactors and cosimulation is to improve the tools available to designers for developing projects. For this reason, this section is presented from the design methodology point of view, in chronological order.

The main difficulties encountered were:
- Incompatibilities with libraries: the visualization and debug tools of the SpiraTech transactors are incompatible with the Cadence NCSIM libraries. This made it impossible to use these tools in the experiments.
- Incomplete Transaction Level Modelling standard: the AHB bus TLM models at PV level are different for SpiraTech and NXP, which made the creation of adaptors necessary. The reason is that, even though there is a TLM standard, it does not define the convenience layer, which is responsible for the data structures.
- Reduced protocol subset supported by the transactors: only the Lite subset of the AHB protocol is supported by the transactors used. This required the inclusion of extra logic to generate some missing signals.
- TLM AHB model limitation: the SpiraTech model of the AHB protocol does not support burst transfers, while the USB core generates burst transfers on its master port.
- NXP TLM timing limitation: the NXP modelling style for PV models has a timing problem that makes it unsuitable for cosimulation with RTL. A temporary solution was implemented to fix this problem, modifying all the IPs of the subsystem that have master ports.

The main activities carried out are:
- Setting up the environment: compilation of the OSCI SystemC and Philips SystemC Environment libraries, configuration of the mixed-language compiler and simulator, and configuration of the Verilog compiler on the computer system used.
- Compilation, configuration and verification of the SystemC ARM Virtual Prototype and the Verilog USB core, and verification of the SpiraTech transactor.
- Creation of a SystemC AXI memory and of C verification software, to test the functionality of the ARM subsystem and to become familiar with the creation and integration of new modules.
- Creation of adaptors for connecting TLM ports of the different NXP and SpiraTech interfaces.
- Generation of auxiliary SystemC blocks to generate the signals missing in the transactor interfaces, and to adapt different RTL port bit widths.
- Generation and manual modification of a SystemC shell wrapping the Verilog USB core, with the Cadence Ncshell tool.
- Hand coding of a temporary solution for the timing incompatibility of NXP TLM at PV level with RTL cosimulation.

- Set-up of a cosimulation experiment integrating the SystemC TLM subsystem with a SpiraTech transactor and a VHDL memory IP. Verification of the set-up with software written in C running on the subsystem's ARM processor.
- Set-up of a cosimulation of SystemC PV and CA blocks using the OSCI kernel, a transactor and the SpiraTech transaction visualization graphical user interface.
- Set-up of a cosimulation experiment integrating the SystemC TLM subsystem with two SpiraTech transactors and a Verilog USB core. Verification by running the unmodified test provided by the USB vendor, intended for prototype testing, on the simulated subsystem's processor.
- Writing demonstration C software that accesses the USB core and prints the results on the subsystem's LCD screen using the PicoTk drawing library. The execution of the software shows alternating execution in the RTL and SystemC domains of the cosimulated system.

6.3.1 Environment

The computer system used for the experiments is a cluster of machines, each with dual 64-bit AMD Opteron processors at 2.8 GHz, 1 MB cache and 16 GB RAM, running Red Hat Linux. The simulation environment is the Cadence NC-SystemC simulator 5.5, with Cadence Simvision for visualization. The RTL block has been compiled with Cadence NC-Verilog. The SystemC language version used is 2.1, with the PSE 1.1 (Philips SystemC Environment) extension library.

6.3.2 Verification of components

Every component has been recompiled and verified to work in the local environment. All the tests provided by the component vendors have been executed to check for possible incompatibilities with the computer architecture or the simulation programs. An incompatibility was found that made it impossible to use the transactor waveform viewer and performance analyser graphical user interface with the NC-SystemC simulator. Apparently the GUI only works when the simulation engine is the open source OSCI kernel. Since OSCI does not support the multiple-language simulation needed for including the Verilog USB core, the SpiraTech GUI cannot be used, and indirect measurement methods had to be used instead. For visualization, Cadence Simvision is used.

6.3.3 System set-up

The simulation had to run in a single engine. There are two ways to do this: a Verilog simulator with built-in support for SystemC blocks, or a SystemC simulator with support for Verilog. There are several reasons to choose the second option. The main one is that most of the design is written in SystemC, and only a small block is RTL. Also, both the SpiraTech transactor and the ARM subsystem have been extensively tested for the SystemC simulators

OSCI kernel and Cadence NC-SC. Since OSCI does not support simulation of Verilog code, NC-SC is chosen as the best solution. The simulation therefore has two language domains, a Verilog one and a SystemC one. Fig. 31 shows a simplified view of the system connections. The upper area represents the RTL Verilog domain, while the bottom one is high-level SystemC.

Fig. 31: Schematic design of the system to be built, representing the main blocks

In order to interconnect those language worlds, the simulation engine provides a special wrapping layer called a SystemC shell. It is an entity that acts both as a SystemC object and as an ensemble of Verilog wires. Its only function is to glue SystemC wires to Verilog wires. As can be seen in the figure, the transactor acts as a bridge between the USB core and the system bus. This transactor is transparent to both sides. It will be shown later that the transactor shown in the picture is actually a set of two transactors, one for each bus port.

6.3.4 Creation of TLM adaptors

The SpiraTech transactor is compiled into a SystemC library, and has a header file detailing its interface. It provides five master and five slave ports, each at a different abstraction level: PV, PVT, CC, CA and wire levels are available. The ARM subsystem supports the PV and PVT levels. PV is chosen because it is the most abstract level, in contrast to the RTL core, which is at the lowest abstraction level supported by the transactor. Connecting PV to RTL through a transactor is therefore the most challenging and interesting configuration.

The USB core has one slave port and one master port. For that reason two transactors are necessary. One will connect its PV master port to a slave port on the system bus, and its RTL slave signals to the USB master port. The other one will connect its PV slave port to

the system bus and its RTL master wires to the USB slave port.

PV ports of the transactor and of the subsystem cannot be directly connected. It is necessary to use an adaptor in the interface. This is due to the differences between the Transaction Level Models for AHB bus transactions of SpiraTech and NXP. Since there is no standard for the way protocols are abstracted, the differences can be quite big. One of the objectives of the SPRINT project is to reach an agreement on this topic, in order to have a standard that enables an easy exchange of IPs between different companies. Fig. 32 shows the mismatch in the TLM interfaces between the transactor ports and the NXP subsystem ports. In this case no direct connection is possible, and adaptors are necessary.

Fig. 32: Direct connection between blocks with different TLM standards is not possible.

As can be seen in the following table, different functions are used in each TLM style. Each TLM interface has a set of callable functions, and a data class that defines the objects that can be sent through those functions. This is explained in more detail in the following paragraphs.

Transaction Level Modelling style   Master request               Slave execution
NXP TLM (PSE)                       pv_read_request_send         pv_read_request_receive
                                    pv_write_request_send        pv_write_request_receive
SpiraTech TLM                       activate_ahb_single_read     join_ahb_single_read
                                    activate_ahb_single_write    join_ahb_single_write

Fig. 33 includes the signatures of the functions that form the SpiraTech TLM API. The SpiraTech AHB transaction model at the PV level consists of an API with four functions: activate_ahb_single_write and activate_ahb_single_read, available from the master port, and join_ahb_single_write and join_ahb_single_read, implemented in the slave port.

    //Master request
    void activate_ahb_single_write (
        unsigned int _p addr,
        T_AHB_lock _p lock,
        T_AHB_transfer_type _p transfer_type,
        T_AHB_transfer_size _p transfer_size,
        T_AHB_burst_type _p burst_type,
        T_AHB_prot _p protection_ctrl,
        unsigned int _p wdata,
        T_AHB_slave_response& _p resp );

    void activate_ahb_single_read (
        unsigned int _p addr,
        T_AHB_lock _p lock,
        T_AHB_transfer_type _p transfer_type,
        T_AHB_transfer_size _p transfer_size,
        T_AHB_burst_type _p burst_type,
        T_AHB_prot _p protection_ctrl,
        unsigned int& _p rdata,
        T_AHB_slave_response& _p resp );

    //Slave execution
    void* join_ahb_single_write (
        unsigned int addr,
        T_AHB_lock lock,
        T_AHB_transfer_type transfer_type,
        T_AHB_transfer_size transfer_size,
        T_AHB_burst_type burst_type,
        T_AHB_prot protection_ctrl,
        AHB_DATA_TYPE wdata,
        T_AHB_slave_response& resp );

    void *join_ahb_single_read (
        unsigned int _p addr,
        T_AHB_lock _p lock,
        T_AHB_transfer_type& _p transfer_type,
        T_AHB_transfer_size _p transfer_size,
        T_AHB_burst_type _p burst_type,
        T_AHB_prot _p protection_ctrl,
        unsigned int& _p rdata,
        T_AHB_slave_response& _p resp );

Fig. 33: PV functions in SpiraTech TLM.

There is also a specific data class, but its data types are inherited from standard types. Each parameter, such as address or data length, is passed to the API functions separately.

The NXP AHB bus is modelled at Programmer's View level with a data class and an API. The signatures of the API functions are detailed in Fig. 34.

    //Master call
    data_pv_write_response_channel pv_write_request_send (
        data_pv_write_request_channel<config32>,
        unsigned int )

    //Slave implementation
    data_pv_write_response_channel pv_write_request_receive (
        data_pv_write_request_channel<config32>,
        unsigned int )

    data_pv_read_response_channel<config32> pv_read_request_send (
        data_pv_read_request_channel<config32> &request,
        unsigned int port_number )

    data_pv_read_response_channel<config32> pv_read_request_receive (
        data_pv_read_request_channel<config32> &request,
        unsigned int port_number )

Fig. 34: PV TLM write function in NXP

It consists of two functions available from the master port: pv_read_request_send and

pv_write_request_send, and their counterparts implemented in the slave port: pv_read_request_receive and pv_write_request_receive. The TLM model interfaces for NXP are defined in the PSE library, and follow the CoReuse directives for TLM modelling. The data class is detailed in Fig. 35.

    //Write transactions
    data_pv_write_request_channel (
        <config32>::pvwdata_t  //data,
        <config32>::pvaddr_t   //address,
        <config32>::pvsize_t   //data size,
        unsigned int           //burst length,
        <config32>::pvprot_t   //protection control,
        bool                   //debug mode
    )

    //Read transactions
    data_pv_read_request_channel (
        <config32>::pvaddr_t   //address,
        <config32>::pvsize_t   //data size,
        unsigned int           //burst length,
        <config32>::pvprot_t   //protection control,
        bool                   //debug mode
    )

    data_pv_write_response_channel ()  //==bool

    data_pv_read_response_channel (
        <config32>::hrdata_t   //data,
        <config32>::hresp_t    //response
    )

Fig. 35: PV TLM write data class in NXP

It also comprises four objects: data_pv_write_request_channel, data_pv_read_request_channel, data_pv_write_response_channel and data_pv_read_response_channel. Those objects are the parameters fed to the API functions, and include all the data necessary to complete the transaction, such as address, size, type of burst or protection.

Even though the information transmitted is essentially the same, the differences in the structures make it necessary to build the TLM adaptor depicted in Fig. 36. The arrows represent the functions, while the blocks inside them represent the data payload. Note that NXP TLM functions carry a single object, while SpiraTech TLM functions carry one object per parameter encoding the transaction.

In order to simplify the adaptor, and to make it as efficient as possible, it is written without sc_methods or sc_threads. It only maps plain C++ calls from SpiraTech TLM to functions of NXP TLM and vice versa, building or extracting the necessary data for the data objects sent. It extracts the parameters needed by SpiraTech join transactions from the NXP request_channel objects, and generates response_channel objects out of the parameters provided by SpiraTech activate transactions.

This architecture proved to be simple and worked satisfactorily. Other similar adaptors were later built in NXP following the same structure to adapt TLM models from different vendors. Those adaptors are currently used as standard IPs for allowing cosimulation with designs following different modelling standards. For further references on different TLM standards used for the AHB bus see [1], [16], [30], [31].
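The mapping performed by such an adaptor can be sketched in plain C++. The following is only a minimal illustration under stated assumptions, not the NXP or SpiraTech code: WriteRequest, ReadRequest, ReadResponse, SlaveApi, ObjectToParamAdaptor and ToyMemory are hypothetical names standing in for the PSE data classes and the per-parameter slave functions, and only address, size and data are carried.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical single-object requests, standing in for an object-based TLM
// style (one object per transaction).
struct WriteRequest { std::uint32_t addr; std::uint32_t size; std::uint32_t data; };
struct ReadRequest  { std::uint32_t addr; std::uint32_t size; };
struct ReadResponse { std::uint32_t data; bool ok; };

// Hypothetical per-parameter slave API, standing in for a TLM style in which
// every field of the transaction is a separate argument.
struct SlaveApi {
    virtual ~SlaveApi() = default;
    virtual void single_write(std::uint32_t addr, std::uint32_t size,
                              std::uint32_t wdata, bool& ok) = 0;
    virtual void single_read(std::uint32_t addr, std::uint32_t size,
                             std::uint32_t& rdata, bool& ok) = 0;
};

// Adaptor: plain function calls only, no threads or processes. It unpacks the
// request object into separate arguments and packs the results back.
class ObjectToParamAdaptor {
public:
    explicit ObjectToParamAdaptor(SlaveApi& slave) : slave_(slave) {}

    bool write(const WriteRequest& req) {
        // Only 1-, 2- and 4-byte single transfers are modelled in this sketch.
        assert(req.size == 1 || req.size == 2 || req.size == 4);
        bool ok = false;
        slave_.single_write(req.addr, req.size, req.data, ok);
        return ok;  // the write response carries only a status flag
    }

    ReadResponse read(const ReadRequest& req) {
        assert(req.size == 1 || req.size == 2 || req.size == 4);
        ReadResponse resp{0, false};
        slave_.single_read(req.addr, req.size, resp.data, resp.ok);
        return resp;
    }

private:
    SlaveApi& slave_;
};

// Toy memory used to exercise the adaptor; unwritten addresses read as errors.
class ToyMemory : public SlaveApi {
public:
    void single_write(std::uint32_t addr, std::uint32_t, std::uint32_t wdata,
                      bool& ok) override {
        words_[addr] = wdata;
        ok = true;
    }
    void single_read(std::uint32_t addr, std::uint32_t, std::uint32_t& rdata,
                     bool& ok) override {
        auto it = words_.find(addr);
        ok = (it != words_.end());
        rdata = ok ? it->second : 0;
    }

private:
    std::map<std::uint32_t, std::uint32_t> words_;
};
```

Because the adaptor holds no state and spawns no process, each call costs one extra function call, which is consistent with the efficiency argument given above.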


Fig. 36: Possible TLM transactions and their equivalent correspondence

6.3.5 Transactor integration with the subsystem.

The mismatch in the TLM interface used by NXP and SpiraTech is not the only one in the integration of the transactor with the bus of the subsystem. The transactor follows the AHB protocol and has a bit width of 32 bits, whereas the bus it has to be connected to is a 64-bit wide AXI bus. This can be easily solved by including a pair of bridges converting 32-bit transfers into 64-bit ones and vice versa, and a pair converting AXI traffic into AHB and vice versa. AHB and AXI are very similar protocols, being part of versions 2 and 3 of the ARM AMBA™ specification [32]. For this reason, adaptors for these two protocols are rather simple.

These adaptors are part of the standard set of NXP SystemC models. The AXI-AHB adapter (BP137) and the AHB-AXI adapter (BP136) convert incoming AXI transactions into outgoing AHB transactions and vice versa. They also generate the required byte-lane and strobe signals. The AXI Expander (BP129) and AXI Downsizer (BP131) adapters expand the 32-bit input interface into a 64-bit output, or reduce a 64-bit input into a 32-bit interface. They manipulate the byte flow as required.

Fig. 37 shows the whole chain of adapters needed to finally connect the transactors to the subsystem. From top to bottom, the blocks represent the bit-width adapters connected to the AXI bus, followed by the protocol bridges (AXI-AHB) and finally the TLM interface adapters.
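The byte handling performed by width adapters of this kind can be illustrated with a small sketch. It is not the BP129 or BP131 model: split64, join32 and strobe_for_word are hypothetical names, and a little-endian lane order (low word on the lower byte lanes) is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Downsizing: one 64-bit beat becomes two 32-bit beats.
// Assumption: little-endian lane order, so index 0 holds bits [31:0].
std::array<std::uint32_t, 2> split64(std::uint64_t beat) {
    return { static_cast<std::uint32_t>(beat & 0xFFFFFFFFu),
             static_cast<std::uint32_t>(beat >> 32) };
}

// Expanding: two 32-bit beats are merged back into one 64-bit beat.
std::uint64_t join32(std::uint32_t low, std::uint32_t high) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// Byte strobes for a 32-bit access placed on a 64-bit bus, one bit per byte
// lane. Address bit 2 selects the lower (0x0F) or upper (0xF0) half of the bus.
std::uint8_t strobe_for_word(std::uint32_t addr) {
    return static_cast<std::uint8_t>((addr & 0x4u) ? 0xF0u : 0x0Fu);
}
```

Splitting and then joining a beat returns the original value, which is the property a pair of expander and downsizer bridges has to preserve along the chain.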


Fig. 37: Chain of connections and adaptors from system bus to transactors

6.3.6 Transactor integration test.

Once all the adaptors are placed in the subsystem and the two transactors are connected, verification is needed. In order to check the correctness of the connections and configuration, a simple VHDL memory with its memory controller is connected to the slave transactor. For this integration a simple VHDL shell has also been generated and instantiated. A memory is the simplest device that can be used, since it only provides read and write functions. The chosen IPs are ip_ssd_2107 and a standard SSD SRAM memory from the NXP portfolio. The ip_ssd_2107 is a small (1.6K gates) controller that supports the AHB protocol and can drive SRAM and ROM memory devices.

A netlist is generated with all the previously discussed models, and a new top-level design of the ARM subsystem is created and compiled as depicted in Fig. 38. After modifying the memory map of the processor, it is possible to run software that makes use of the new memory.

[Figure: ARM11 processor, AXI 64 bit bus, TLM memory, 64 to 32 bit adaptor, AXI 32 bit bus, AXI to AHB adaptor, SpiraTech to NXP adaptor, SpiraTech TLM port, SpiraTech transactor, signal-level wires (slave_hclk, slave_hresetn, slave_haddr, slave_htrans, slave_hburst, slave_hsize, slave_hwrite, slave_hwdata, slave_hrdata, slave_hresp, slave_hready, ahb_ready_out, sel), 32 to 18 bit adaptor, RTL slave (memory)]

Fig. 38: System set-up with integrated memory, showing the adaptors and transactor needed.

The new memory has been assigned to addresses 0xA to 0xA in Read/Write mode. Simple software that reads and writes data should write and read successfully when using this range, and produce an error when accessing addresses higher than 0xA See the test code in Fig. 39.

The result of the executed test is fully satisfactory. The written data matches the data read, and the last access results in six errors, because nothing is mapped to addresses higher than 0xA The packed qualifier (__packed in the ARM compiler) is a directive that allows the use of unaligned memory addresses. With this small example we check that read and write work for 1, 2 and 4 byte words, at any address, and that memory violations are detected.

Although only the slave transactor is actually tested with this method, for symmetry reasons this test is considered sufficient for this stage of the design. Nevertheless, an exhaustive verification will be required once the final system is ready, in order to check that there are no problems in the master transactor configuration.

    #include <stdio.h>

    int main (void)
    {
        int offset;  /* loop index, doubled on every iteration */
        volatile long * base_ram_long=(volatile long *)0xA ;
        /* __packed: ARM compiler qualifier allowing unaligned accesses */
        __packed volatile short * base_ram_short=(volatile short *)0xA ;
        __packed volatile char * base_ram_char=(volatile char *)0xA ;

        for(offset=1;offset<0xa ;offset=2*offset)
        {
            /* 4-byte access */
            *(base_ram_long+offset)=(long)offset;
            printf ("writing long %d in address : %p \n", offset,(base_ram_long+offset));
            printf ("reading long %ld in address : %p \n", *(base_ram_long+offset),(base_ram_long+offset));
            /* 2-byte access */
            *(base_ram_short+offset)=(short)offset;
            printf ("writing short %d in address : %p \n", offset,(base_ram_short+offset));
            printf ("reading short %d in address : %p \n", *(base_ram_short+offset),(base_ram_short+offset));
            /* 1-byte access */
            *(base_ram_char+offset)=(char)offset;
            printf ("writing char %d in address : %p \n", offset,(base_ram_char+offset));
            printf ("reading char %d in address : %p \n", *(base_ram_char+offset),(base_ram_char+offset));
        }
        return 1;
    }

Fig. 39: C code executed for testing the VHDL memory

6.3.7 Fixing PV-RTL incompatibility

During the integration of the system, a flaw in the modelling style at Programmer's View (PV) level was found. It affects NXP models, but it is a general problem when modelling timeless IPs with SystemC. The flaw lies in the way synchronization is done when there is no notion of time, and it prevents PV models from being cosimulated with RTL in the same simulation kernel.

In order to emulate hardware parallelism, SystemC simulators are event based. This means that the simulator executes all the actions required by all the elements in a system without advancing the simulation time. Instead, delta events with zero delay are used. When there are no more actions scheduled for a specific time, the simulator evaluates all the variables changed in a cycle and updates them to their final values. In case of conflict, precedence and strength rules are applied.

In models described at the PV level there is no notion of time, only of order. There are different techniques to keep the order of transactions without using any time reference. Some of these are the scheduling of events, registration of calls, external arbitration, or the use of delta time. NXP PV models use delta delays to guarantee the order of the transactions. Blocks that can initiate transactions are always coded as threads, so the simulator can schedule their execution or put them to sleep when necessary. Once a block has initiated a transaction, it asks the simulator to wait for zero time. This is done with wait(SC_ZERO_TIME) statements inside infinite loops. SC_ZERO_TIME means a delta amount of time, corresponding to zero seconds of simulation time. This puts the block immediately into wait mode, and other blocks start


execution. When all the blocks scheduled for execution have finished, the simulator returns control to the first block that issued a wait statement. Fig. 40 shows how time is handled in PV models. The function ClockProcess() is called when a specific thread of the my_ip model becomes active. The function things_to_do() represents all the actions that a SystemC thread takes in every simulation cycle. After finishing its duty, the thread releases control by calling a wait function.

    void my_ip :: ClockProcess()
    {
        for (;;)
        {
            things_to_do();        // all actions of this thread for the current cycle
            wait (SC_ZERO_TIME);   // yield for one delta cycle; simulated time does not advance
        }
    }

Fig. 40: Time processing in standard NXP models at PV level

Ordering in the simulation is guaranteed by the use of delta delays, and simulation time does not advance. Those are the two requirements that a PV model must fulfil. In spite of the simplicity of this solution, there are some drawbacks. One of them is the impossibility of performing cosimulation with RTL blocks.

The problem of cosimulating PV with RTL comes from the RTL clocks. The period of a clock is twice the amount of time between toggles. Since there are always PV blocks scheduled for execution after each SC_ZERO_TIME wait, the simulator never starts a new time cycle, so simulated time never advances and the clocks keep the same value. When the clocks do not advance, the whole RTL block is stopped, and only the PV blocks are active.

As an example, consider a system with two PV blocks and one RTL block. Fig. 41 depicts the order in which a SystemC simulator kernel schedules the execution of the different blocks. When the simulation starts, all blocks are given execution time, and after a first round of executions each one is scheduled for the next moment at which it has activity. The clock is scheduled for the time at which its value has to toggle, and the PV blocks are scheduled for zero time later. As a result, there is an infinite loop of execution of PV blocks with delta events at time zero, and simulation time never moves forward.
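The starvation effect can be reproduced with a toy scheduler. The sketch below is not SystemC and does not model the real kernel; RunResult, run and its parameters are hypothetical names. It only assumes that the kernel always serves the earliest pending wake-up first, and that time advances only when nothing is pending at the current time. A PV thread that re-arms itself after pv_wait time units competes with a clock that toggles every half_period units; a non-zero pv_wait is accepted for contrast.

```cpp
#include <cassert>
#include <cstdint>

// Result of a bounded run of the toy kernel.
struct RunResult {
    std::uint64_t final_time;      // simulated time when the run stopped
    std::uint64_t clock_toggles;   // number of times the clock toggled
    std::uint64_t pv_activations;  // number of times the PV thread ran
};

// Toy kernel with two processes: a PV thread that re-schedules itself
// pv_wait time units ahead (0 models a zero-time wait), and a clock that
// toggles every half_period units. A wake-up due at the current time always
// runs before time is allowed to advance.
RunResult run(std::uint64_t pv_wait, std::uint64_t half_period,
              std::uint64_t max_activations) {
    assert(half_period > 0);
    std::uint64_t now = 0;
    std::uint64_t next_pv = 0;             // PV thread is runnable at start
    std::uint64_t next_clk = half_period;  // first clock toggle
    RunResult r{0, 0, 0};

    for (std::uint64_t i = 0; i < max_activations; ++i) {
        if (next_pv <= next_clk) {
            now = next_pv;                 // PV thread runs ...
            ++r.pv_activations;
            next_pv = now + pv_wait;       // ... and waits again
        } else {
            now = next_clk;                // clock toggles
            ++r.clock_toggles;
            next_clk = now + half_period;
        }
    }
    r.final_time = now;
    return r;
}
```

With pv_wait equal to zero the PV thread is always the earliest pending wake-up, so the clock branch is never taken and final_time stays at zero, whatever the number of activations allowed. With any non-zero pv_wait the clock eventually becomes the earliest wake-up and toggles.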


The solution to this problem is to use an ordering mechanism other than delta delays. Two different simulators can also be used to avoid the problem, such as OSCI for the PV blocks and Cadence for RTL. The same applies to co-emulation with RTL implemented in hardware. In those cases time advances because execution time cannot be monopolized by the PV blocks, and there is no need to change the PV models.

If a single simulation engine is to be used, the modelling style needs to be changed. Possible solutions are to modify the simulator so that it supports an alternative synchronization method, or to create a specific model in the system that guarantees the ordering of the execution. This second approach could be considered the timeless abstract model of a clock, in a broad sense.

All the solutions proposed above require structural changes in the models or simulators, which are clearly out of the scope of this thesis. However, in order to have the system running, a temporary solution was implemented. It consists of replacing the delta time increments with time increments that are small compared to the RTL clock period. This change has to be made in every master block of the design. Ordering among the PV blocks is guaranteed, while time is allowed to advance. Since time is meaningless in PV, this time advance has no importance there, while in RTL it allows the clocks to move. This solution is strongly discouraged, and must only be seen as a temporary workaround that allows the use of existing PV models while a final solution is developed.

An exhaustive test was later carried out in order to check the correctness of these modifications in the subsystem models. All the tests provided with the standard subsystem were run again with the new blocks. New test software was also written, see Fig. 43. It

57 consists on a series of sequential accesses addressed to the PV subsystem memory and the RTL previously integrated memory. The system set-up structure in which this test was carried out was shown in Fig. 38. The results of all the tests was satisfactory, in functionality and order. #include <stdio.h> int main (void) { volatile long * base_rtl=(volatile long *)0xB ; volatile long * base_pv=(volatile long *)0xE ; unsigned int data; unsigned int offset=0; for(offset=0;offset<150;offset++) { *(base_rtl+offset)=(long)offset; printf ("RTL:writing %d in address : %p \n", offset,(base_rtl+offset)); data=*(base_rtl+offset); printf ("RTL:reading %d in address : %p \n", data,(base_rtl+offset)); *(base_pv+offset)=(long)offset; printf ("TLM:writing %d in address : %p \n", offset,(base_pv+offset)); data=*(base_pv+offset); printf ("TLM:reading %d in address : %p \n", data,(base_pv+offset)); } return 1; } Fig. 43: Test alternating read and write accesses to RTL and TLM memories While this solution has been tested and proved functional, we do not recommend such approach as a definitive solution. Other techniques such as external arbitration or event scheduling in the kernel may be more elegant, and provide higher simulation speeds by avoiding the overhead resulting from 1 ps wait statements Shell generation. Once the transactor has been verified to work as expected with the rest of the subsystem, we can build the final System including the two transactors and the USB core. As previously commented, every component written in a language other than SystemC needs to be wrapped in a shell that allows its interconnection with the rest of the blocks. A shell is a wrapper surrounding the child instance written in the language of the parent. This shell is read by the simulator, and translates the wire activity in the SystemC shell into wire activity in the wrapped object (child) language. In this case, it will be Verilog. 
Cadence provides a utility named Ncshell, which facilitates shell generation and model import for use with the simulator. With the help of this tool, a SystemC entity is automatically generated with the same interface as the Verilog top level. Verilog does not have the same types that exist in SystemC. For that reason the following equivalence between types will be used:

    Verilog          SystemC shell
    input            sc_in < sc_bit >
    input [n:0]      sc_in < sc_bv<n+1> >
    output           sc_out < sc_bit >
    output [n:0]     sc_out < sc_bv<n+1> >

These equivalences must be indicated to the Ncshell tool with the option -SCTYPE var_name:sc_bit, where var_name is the name of a variable of the interface. This option must be added for every variable whose type we want to change from the default [33], [34]. Since Verilog inputs and outputs are mapped by default to sc_logic, and the transactor interface only has sc_bit and sc_bv (bit vector) ports, these modifiers must be added for every signal in the interface.

The shell code created by Ncshell still needs a slight modification: the lines #ifndef Verilog_USB_SHELL and #define Verilog_USB_SHELL have to be added at the beginning of the header, and #endif at the end. This is done so that no duplicate library is generated during the system build, and it is a requirement for successful integration with the SystemC ARM subsystem. After this, the shell can be instantiated in the system top level and is ready to be connected to the transactor and used.

The ports of the transactors and of the USB shell now carry the wires of an AHB bus. Please refer to [35] for further information on the AMBA™ AHB™ protocol. In order to avoid compilation errors due to unconnected signals, the transactor interface defines its ports as sc_signal_in_if, sc_signal_out_if or sc_signal_inout_if. This means that, to connect the signals of the USB core to the transactors, we use a syntax different from the more usual BlockA->wire1( BlockB->wire2 ). Instead of connecting inputs to outputs in a net-list style, we connect inputs to pointers to input interfaces and outputs to pointers to output interfaces, using C++ code.
As an example to clarify the previous point, let us examine the connection of the signal Master Write, a single bit wire:

    USB Shell:    sc_out < sc_bit > mhwrite;
    Transactor:   sc_signal_out_if< sc_bit > *master_hwrite;
    Connection:   chip_usbhs_otg_mpd_inst->mhwrite ( *(transactor_master->master_hwrite));

Connection of AHB wires

At this point of the integration the Verilog USB core is wrapped in SystemC, and all the definitive components are ready to be instantiated in the Top Level entity. Most of the AHB signals can be directly connected from the USB shell to one transactor or the other: the USB slave port to the slave transactor, and the master port to the master transactor. There is, however, a slight port mismatch. While address is defined in the

slave USB port as an 11-bit word, the transactor has a 32-bit output for the same variable. To solve this, a simple adaptor has been added that discards the 21 most significant bits of the address word, feeding the USB variable with the required 11 bits. The address adaptor is modelled as an SC_METHOD belonging to the system Top Level, sensitive to changes in the 32-bit address coming from the transactor. It has a single 32-bit input and a single 11-bit output. The adaptor's input and output are connected to variables of the Top Level, which are themselves connected to the transactor and the USB port respectively.

Fig. 44 details all the lines connected between a transactor and a USB core port, including the bit width adaptor. Only one of the ports is represented, since master and slave ports are symmetrical.

[Figure: TLM port and SpiraTech transactor connected to the USB slave port through the wires slave_hresetn, slave_hwdata, slave_hsize, slave_hwrite, slave_hrdata, slave_hburst, slave_hresp, slave_htrans, slave_hclk and slave_hready (ahb_ready_out), with slave_haddr passing through adaptor32to11]
Fig. 44: Wires connected between USB slave port and transactor, including address adaptor

The snippet in Fig. 45 displays all the connections corresponding to the AHB bus. It is an extract of the Top Level file top.cpp. The instance chip_usbhs_otg_mpd_inst is the USB core in the system. The instance transactor_master is the transactor connected to the USB master port, and transactor_slave is the second transactor, connected to the slave port of the USB core. The instance address_adaptor is the adaptor that cuts the 32-bit addresses provided by transactor_slave to fit the 11-bit address words required by the USB port. The first block of connections corresponds to the master port wires, followed by the slave port wires. Finally, it can be seen how slave_haddr receives special treatment, as it is connected to the previously discussed Top Level variables for the bit width adaptation.
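The bit width adaptation described above can be sketched in plain C++, outside of SystemC. This is only an illustration of the masking step under stated assumptions: the function name adapt_address_32_to_11 and the constant kAddressMask11 are hypothetical and do not come from the thesis sources, where the adaptor is an SC_METHOD operating on bit vector signals.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constant: mask selecting the 11 least significant bits (0x7FF).
constexpr std::uint32_t kAddressMask11 = (1u << 11) - 1u;

// Hypothetical helper mirroring the role of the address adaptor:
// it keeps the 11 least significant bits of the 32-bit address and
// discards the 21 most significant bits.
constexpr std::uint16_t adapt_address_32_to_11(std::uint32_t address32) {
    return static_cast<std::uint16_t>(address32 & kAddressMask11);
}
```

With this sketch, an address such as 0x19E passes through unchanged because it already fits in 11 bits, while any higher bits set by the system address map are dropped before the value reaches the USB slave port.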
Other signals used for configuration of the USB have been connected to variables initialized to the desired values. These variables are: ambassize, ambamsize, mhprot, hsdisable, wakeup, mhgrantdma, shseldma, ahb_ready, endian, scanen, scanmode, scanclk, scanin, onbist, and slave_reset_low_clocks.

    // Master port wires
    chip_usbhs_otg_mpd_inst->mhrdata   (*( transactor_master->master_hrdata ));
    chip_usbhs_otg_mpd_inst->mhresp    (*( transactor_master->master_hresp ));
    chip_usbhs_otg_mpd_inst->mhaddr    (*( transactor_master->master_haddr ));
    chip_usbhs_otg_mpd_inst->mhtrans   (*( transactor_master->master_htrans ));
    chip_usbhs_otg_mpd_inst->mhwrite   (*( transactor_master->master_hwrite ));
    chip_usbhs_otg_mpd_inst->mhsize    (*( transactor_master->master_hsize ));
    chip_usbhs_otg_mpd_inst->mhburst   (*( transactor_master->master_hburst ));
    chip_usbhs_otg_mpd_inst->mhwdata   (*( transactor_master->master_hwdata ));
    chip_usbhs_otg_mpd_inst->mhlockdma (*( transactor_master->master_hlock ));
    chip_usbhs_otg_mpd_inst->mhready   (*( transactor_master->master_hready ));
    chip_usbhs_otg_mpd_inst->clk5k     (*( transactor_master->master_hclk ));

    // Slave port wires
    chip_usbhs_otg_mpd_inst->reset     (*( transactor_slave->slave_hresetn ));
    chip_usbhs_otg_mpd_inst->shwrite   (*( transactor_slave->slave_hwrite ));
    chip_usbhs_otg_mpd_inst->shtrans   (*( transactor_slave->slave_htrans ));
    chip_usbhs_otg_mpd_inst->shsize    (*( transactor_slave->slave_hsize ));
    chip_usbhs_otg_mpd_inst->shwdata   (*( transactor_slave->slave_hwdata ));
    chip_usbhs_otg_mpd_inst->shresp    (*( transactor_slave->slave_hresp ));
    chip_usbhs_otg_mpd_inst->shrdata   (*( transactor_slave->slave_hrdata ));
    chip_usbhs_otg_mpd_inst->shburst   (*( transactor_slave->slave_hburst ));
    chip_usbhs_otg_mpd_inst->shreadyo  (*( transactor_slave->slave_hready ));
    chip_usbhs_otg_mpd_inst->hclk      (*( transactor_slave->slave_hclk ));

    // Slave address, routed through the bit width adaptor
    address_adaptor->address11         (signal_address11);
    chip_usbhs_otg_mpd_inst->shaddr    (signal_address11);
    address_adaptor->address32         (*( transactor_slave->slave_haddr ));

Fig. 45: Extract of Top Level file, showing connections between USB and transactors

As an example of a configuration signal, the connection of ambassize, which defines the width of the data bus, is shown in Fig. 46. The code snippet represents the structure of the Top Level constructor.
The variable ambassize is declared at the beginning of the constructor. Later, it is connected to the USB core shell, like the rest of the wires previously seen in Fig. 45. At the end of the constructor, the variable is assigned a fixed value of 00. The name ambassize stands for AMBA Slave Size, and the value 00 defines a width of 32 bits.

    Top::Top(args) : //constructor
        ...,ambassize("ambassize")...
    {
        ...
        chip_usbhs_otg_mpd_inst->ambassize(ambassize );
        ...
        ambassize.write((sc_bv<2>)"00");
        ...
    }

Fig. 46: ambassize definition and use

Note that the USB core has two clock domains. The clock of the AHB domain (hclk) has been connected to the one provided by the slave transactor. The UTMI+ domain clock (clk5k) has been connected to the master transactor clock. Both transactors have the same clock output for this specific configuration, and the clock fed to the UTMI+ domain is irrelevant for the tests carried out.

UTMI interface of the USB

Besides the AHB interfaces, the USB core has other ports and wires. The main one is the UTMI interface, which connects to the physical interface. Others are configuration lines, debug facilities, resets and interrupts. See Fig. 47 for a detailed picture of the interfaces of the USB core.

[Figure: USB core block with its interfaces: configuration, debug, UTMI+ (physical), AHB slave, AHB master, interrupts, wakeup, resets and clk5k]
Fig. 47: USB interface

The initial idea for the system was to connect the UTMI interface to a Verilog model of the physical interface provided by Evatronix. The IP vendor has a set of verification tests composed of an external USB simulator connected to the UTMI interface and some Verilog blocks simulating the bus, system memory and processor. Porting the processor model instructions to ARM code, substituting the bus model with the real AHB subsystem bus and leaving the USB simulator in the Verilog domain would make it possible to mimic the standalone test within our cosimulation environment. While this would be very desirable, a number of circumstances, in particular problems with the transactor's protocol support, made such a set-up infeasible.

Fixing protocol problem

During the development of the work, a major incompatibility was found between the SpiraTech transactors and the USB core. The master port of the IP generates DMA transactions to transmit data to the system memory, and uses burst transfers to do so. The SpiraTech transactors have two limitations: they only support AHB Lite, that is, single master and single slave, and they do not support burst transfers, only single transactions. The limitation of supporting communication for only a single master and a single slave on a bus was circumvented with the generation of some local arbitration logic.

The AHB protocol supports multiple masters and slaves, with master arbitration, while the AXI protocol is based on a switch fabric in which every connection is point to point. This means that, for an AXI master, the rest of the subsystem behaves like a single slave. The network takes care of making all the internal connections and of the arbitration. For this reason, all the communications with the USB master and slave AHB ports are virtually point to point, that is, single master to single slave.

The only issues to take care of are the select, bus request and bus grant AHB wires of the USB ports, which are not available in the transactor interfaces. Since the bus is arbitrated in the SystemC domain, the Verilog code corresponding to the USB slave port is only executed when it is called from a SystemC model and allowed by that arbiter. In the same way, transfers coming from the USB master port are blocked until the arbiter allows them. This means that the previously mentioned signals can be generated locally, without compromising the correct functioning of the model or its fidelity to the real system. The select and bus grant signals remain always asserted. Bus request becomes a dummy signal connected to nothing.

The second limitation, the lack of burst transfer support, takes us to a dead end. While this limitation does not interfere with the functioning of the USB slave port, all the bursts generated by DMA transfers from the USB core to the system memory will result in access errors in the master transactor. The possibility of writing an adaptor with a buffer to convert burst accesses into multiple single accesses was evaluated, but then discarded because of the distortion introduced with respect to the real system activity. Models should be as close as possible to the original. The transactor vendor was contacted and informed about this incompatibility, and agreed to ship a new transactor with full support for the AHB protocol.
This new item had not been received at the time of writing of the present report. While waiting for the new transactor, an alternative set-up was designed. Fig. 48 displays all the wires and ports connected to the USB core in detail. Instead of the Verilog USB traffic generator, the UTMI interface was connected to a dummy block modelling an unconnected port, therefore with no traffic. In addition, the power saving related features were disconnected, so that the IP would always be awake, but with no traffic in the physical interface. The debug lines, such as the scan chain, were disabled. Verilog simulation provides full visibility of the core, so these wires are unnecessary. All the interrupt lines are nevertheless connected to the system, and the interrupt controller and ARM processor are configured to assign IRQ 6 to USB related interrupts and IRQ 7 to USB DMA interrupts. These interrupts will never occur because there is no physical activity, but leaving them configured will allow faster development for future designers once a fully compatible transactor is available.

The result of this limited set-up is that only a reduced set of tests can be applied to the USB core. Specifically, all the tests involving the AHB functionality and the correctness of the internal registers can be executed without problems. A set of tests provided by the vendor, which sequentially check the internal registers after a set-up time, was chosen to be run on the ARM processor. This test only makes sequential accesses to the USB core through the slave AHB

port. Configuration lines with a fixed value, such as ambassize and ambamsize, and unused debug lines such as scan enable are not represented in Fig. 48 for clarity.

[Figure: ARM11 with verification software and interrupt controller attached to the AXI bus; two adaptor layers lead to transactor_master and transactor_slave, which connect to the master and slave AHB ports of the USB core; the USB core also receives sel, the AHB clk and the USB clk, raises interrupt requests, and is attached to a dummy UTMI interface and a dummy sleep control]
Fig. 48: USB scheme of connections to the system

As said before, good evidence of the transparency of the transactors is to run USB verification software designed for a prototype in the SystemC subsystem. Such an experiment demonstrates both the correctness of the set-up and the time savings that engineers will experience due to the reuse of verification software in the cosimulation set-up. As a visual example, let us take a short part of the USB registers test. The source code cannot be reproduced here for confidentiality reasons, but its functioning will be explained step by step. Fig. 49 gives an idea of the test code.

A data structure is created representing all the registers of the core, with their length in bytes and name. Each register can later be accessed independently with a reference to its name. There is also a method called checkregister which receives a pointer to the USB base address, a pointer to the register to be read, its size and its expected value as parameters. It finally prints the result of the comparison to the standard output.

    typedef unsigned char  u8;
    typedef unsigned short u16;
    typedef unsigned long  u32;

    typedef struct StructUsb {
        ...
        u16 indmaien ;   // 0x19C;
        u16 outdmaien ;  // 0x19E;
        ...
        u8 dmaivect ;    // 0x1A1;
        ...
        u8 usbcs ;       // 0x1A3;
        ...
    }StructUsb;

    int checkregister (void* USB, u8* a, u32 size, u32 value)
    {
        ...
        // Compare read and expected value
        ...
    }

    int USBCoreTest (StructUsb* usb_device)
    {
        ...
        checkregister((void*)&usb_device->outdmaien, " usb_device->outdmaien ", 2, 0x0); // 0x19E;
        checkregister((void*)&usb_device->indmaien, " usb_device->indmaien ", 2, 0x0);   // 0x19C;
        checkregister((void*)&usb_device->usbcs, " usb_device->usbcs", 1, 0x0);          // 0x1A3;
        checkregister((void*)&usb_device->dmaivect, " usb_device->dmaivect", 1, 0x0);    // 0x1A1;
        ...
    }

Fig. 49: Extract of USB registers test provided by Evatronix

The part of the test that we are going to analyse is the checking of the registers at addresses 0x19E, 0x19C, 0x1A3 and 0x1A1. This part of the test has been chosen because it shows sequential accesses to registers of 8 and 16 bits. The sequence of events for the execution of each register test is as follows.

First, the ARM processor fetches the instruction that reads a 2-byte word (short integer) at address 0x19E. This instruction is not actually in the subsystem memory, but is read directly by the simulator to improve speed. Then the processor tries to access the AXI bus with a single read TLM transaction with length 1 (2^1 bytes), address 0x19E, and cache and protection flags set to 0. After this, the ARM processor releases control.
Then the AXI bus takes control of the execution. In fact, when the processor tries to read the memory, it does so through a blocking call to a slave port of the bus. Blocking is therefore achieved natively by the normal program execution flow. The bus then arbitrates the call and forwards the transaction to the slave port of the first adaptor in the chain to which the slave transactor is connected. This is the AXI downsizer, which calls the next adaptor itself after

converting the 64-bit transaction into its 32-bit equivalent. The same is repeated with the AXI to AHB converter. All the calls are blocking. Once the transaction arrives at the bridge that adapts NXP TLM to SpiraTech TLM, its data structure is changed to match the required format, and finally the transactor is called. This call is also blocking, which guarantees the order of execution.

The transactor has a set of state machines that generate a sequence of signal activity on its RTL interface. Once valid data, address and control information is available at the wire interface of the transactor, it releases control, and the Verilog simulation starts. The USB core runs for the time required to complete the assigned task, sets the port wires to the values indicating the requested data and the result of the operation, and finally indicates that it is ready by setting the HREADY signal high. The wire activity resulting from the execution is shown in Fig. 50.

Fig. 50: Extract of the waveform activity

When HREADY is asserted, simulation control is recovered by the transactor, which checks that no protocol violations occurred and, in this case, generates a valid transaction containing all the information returned by the USB core. Since all the calls in the chain of adaptors to the AXI bus were blocking, the transaction data generated by the transactor travels back from adaptor to adaptor as the return value of the calls. At the AXI bus no new arbitration is needed because the return value is still part of the first transaction. The data is passed to the ARM processor as the return value of the first function called, and therefore there is no need for further control. The processor reads the data received and continues execution of the program, that is, it compares the result to the expected value and prints the result of the comparison to the standard output.
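The chain of blocking calls described above can be illustrated with ordinary nested function calls in C++. This is a minimal sketch under stated assumptions, not the actual NXP or SpiraTech code: every name (Context, bus_read, downsizer_read, axi_to_ahb_read, bridge_read, transactor_read, usb_core_read) is hypothetical, the data conversions done by each adaptor are omitted, and the Verilog simulation is replaced by a lookup in a small register map.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical shared state: a stand-in register file for the RTL core
// and a log recording the order in which the stages execute.
struct Context {
    std::map<std::uint32_t, std::uint32_t> usb_registers;
    std::vector<std::string> log;
};

// Stand-in for the Verilog simulation: returns the register content,
// or 0 when the address is not present in the map.
std::uint32_t usb_core_read(Context& ctx, std::uint32_t address) {
    ctx.log.push_back("usb core");
    auto it = ctx.usb_registers.find(address);
    return it != ctx.usb_registers.end() ? it->second : 0u;
}

// Each stage below calls the next one and cannot continue until that
// call returns, so ordering follows from normal program flow and the
// read data travels back as the return value.
std::uint32_t transactor_read(Context& ctx, std::uint32_t address) {
    ctx.log.push_back("transactor");
    return usb_core_read(ctx, address);
}

std::uint32_t bridge_read(Context& ctx, std::uint32_t address) {
    ctx.log.push_back("tlm bridge");
    return transactor_read(ctx, address);
}

std::uint32_t axi_to_ahb_read(Context& ctx, std::uint32_t address) {
    ctx.log.push_back("axi to ahb");
    return bridge_read(ctx, address);
}

std::uint32_t downsizer_read(Context& ctx, std::uint32_t address) {
    ctx.log.push_back("downsizer");
    return axi_to_ahb_read(ctx, address);
}

std::uint32_t bus_read(Context& ctx, std::uint32_t address) {
    ctx.log.push_back("axi bus");
    return downsizer_read(ctx, address);
}
```

A caller playing the role of the processor would invoke bus_read once per register and compare the returned value with the expected one. Because no stage returns before the stage below it has returned, no extra synchronization is needed in this sketch, which is the property the text relies on.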
The log of the standard system output, printed by the test software running on the ARM processor can be read in Fig
