Institutionen för systemteknik


Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete

Bus System for Coresonic SIMT DSP

Master's thesis in Electrical Engineering, carried out at the Institute of Technology, Linköping University,
by Gustav Svensk

LiTH-ISY-EX--16/4944--SE
Linköping 2016

Department of Electrical Engineering, Linköpings universitet, Linköping, Sweden


Bus System for Coresonic SIMT DSP

Master's thesis in Electrical Engineering, carried out at the Institute of Technology, Linköping University,
by Gustav Svensk

LiTH-ISY-EX--16/4944--SE

Handledare (Supervisor): Mikael Rudberg, MediaTek
Examinator (Examiner): Andreas Ehliar, ISY, Linköpings universitet

Linköping, 20 June 2016


Avdelning, Institution / Division, Department: Computer Engineering, Department of Electrical Engineering, Linköping
Språk / Language: Engelska/English
Rapporttyp / Report category: Examensarbete
ISRN: LiTH-ISY-EX--16/4944--SE
Titel / Title: Bussystem för Coresonic SIMT DSP / Bus System for Coresonic SIMT DSP
Författare / Author: Gustav Svensk

Sammanfattning / Abstract:
This thesis consists of designing and implementing a bus system for a specific computer system for MediaTek Sweden AB. The focus of the report is to show the considerations and choices made in the design of a suitable bus system. Implementation details describe how the system is constructed. The results show that it is possible to maintain a high bandwidth in many parts of the system if an appropriate topology is chosen. If all units in a bus system are synchronous, it is difficult to reach low latency in the communication.

Nyckelord / Keywords: On chip communication, Bus system, AXI4


Abstract

This thesis consists of designing and implementing a bus system for a specific computer system for MediaTek Sweden AB. The focus of the report is to show the considerations and choices made in the design of a suitable bus system. Implementation details describe how the system is constructed. The results show that it is possible to maintain a high bandwidth in many parts of the system if an appropriate topology is chosen. If all units in a bus system are synchronous, it is difficult to reach low latency in the communication.


Acknowledgments

I want to thank my supervisor, Mikael Rudberg, for his help and guidance throughout the project. I also want to thank the staff at MediaTek Sweden AB for the motivation and for making me feel welcome. Finally, I want to thank Olle Seger and Andreas Ehliar at Linköping University.

Linköping, June 2016
Gustav Svensk


Contents

Notation

1 Introduction
    1.1 Problem Definition
    1.2 Methodology

2 Theory
    2.1 On Chip Communication
        Protocol
        Topology

3 Initial Study
    3.1 Computer System
    3.2 Protocol
        AXI4
        CoreConnect
        Wishbone
        Specialised Solution
        Summary
        Decision
    3.3 Topology
        Shared Bus
        Shared Address Bus, Multiple Data Bus
        Switch Fabric
        Bus Hierarchy
        Decision

4 Implementation
    4.1 Protocol
        Terminology
        Architecture
        Ordering Model
        Handshake
        Write Address- and Transaction ID Map
        Protocol Assertions
        Master and Slave
    4.2 Bus System
        Handshake Unit
        Pipeline Stage
        Arbiter
        Decoder
        Bus Bridge
    4.3 Problems
        Deadlock
    4.4 Tools and Scripts
        Python cog
        (System)Verilog
        VCS
        DesignCompiler
        Scripts

5 Results and Discussion
    5.1 Functionality
    5.2 Performance
    5.3 Area and Timing
    5.4 Discussion

6 Conclusions
    6.1 Conclusions
    6.2 Improvements and Future Work
        Latency Improvements
        Area Improvements
        Stalling Improvements
        Future Work

A Figures and Tables
    A.1 Topologies
    A.2 Protocol
    A.3 Problems
    A.4 Results

Bibliography

Notation

Abbreviations:

rtl: Register Transfer Level
soc: System on Chip
dma: Direct Memory Access
cpu: Central Processing Unit
noc: Network on Chip
simt: Single Instruction, Multiple Threads
dsp: Digital Signal Processor
ip: Intellectual Property
axi4: Advanced eXtensible Interface 4
amba: Advanced Microcontroller Bus Architecture
ahb: Advanced High-performance Bus
apb: Advanced Peripheral Bus


1 Introduction

Computer systems are rapidly increasing in complexity, and often a whole system is integrated on a single chip, a so-called system on chip (soc). As the complexity and component count of these systems grow, a fast and efficient communication system is necessary. The speed of the communication system has not increased at the same rate as the speed of processors and memories; this has led the communication system to become a bottleneck, both in terms of performance and energy efficiency.

The goal of this thesis has been to investigate how a bus system is designed and implemented to fit a specific computer system and achieve high bandwidth and low latency. The requirements of the bus system are chosen so that the Coresonic simt dsp can utilize the bus system efficiently.

The thesis was performed at MediaTek Sweden AB in Linköping, who also suggested the subject of the thesis. MediaTek Sweden AB develops digital processing solutions such as the Coresonic simt dsp. The company was founded in 2004 as Coresonic but was acquired in 2012 by the Taiwanese company MediaTek Inc., one of the largest fabless semiconductor companies in the world.

This chapter presents the topic of the thesis and the methodology used to produce the results.

1.1 Problem Definition

The thesis studies different bus protocols and topologies. A bus system (here, a bus protocol together with a topology) with arbitration, multiple masters and pipelining will be implemented at Register Transfer Level (rtl) and evaluated. The thesis is divided into two parts:

- Choosing a computer system that the bus system will be designed for. Study and compare different bus protocols and topologies. Key aspects of bus protocols will be compared between standardised protocols; more specialized protocols will also be compared. A similar comparison will be made between different topologies. When the study is finished, a bus protocol and topology suitable for the computer system will be chosen for implementation.

- Implementation and evaluation of the chosen bus protocol and topology. A proof-of-concept implementation in rtl will be made to ensure that the protocol works. Next, an implementation of the bus system will be made in rtl and its functionality will be verified. Finally, statistical data from the bus system will be collected, for example bus utilization and average latency.

The goal of the thesis is to study how difficult and time-consuming it is to implement a certain bus system.

1.2 Methodology

The project has been divided into the tasks listed below. The tasks have been carried out in the order they are listed, except for the documentation, which has been written in parallel with the other tasks.

1. Choose a computer system that the bus will be designed for.
2. Identify aspects of bus protocols and topologies that are important for the computer system.
3. Study different bus protocols and topologies, focusing on the identified aspects.
4. Decide on a bus protocol and a topology.
5. Study the chosen bus system in detail.
6. Implement a proof of concept in rtl.
7. Implement the bus system in rtl.
8. Evaluate and validate the bus system.
9. Write documentation.

2 Theory

This chapter presents the theory of on-chip communication and motivates its importance.

2.1 On Chip Communication

The information in the following paragraph is from [4]. On-chip communication, and interconnect in general, is important because it is often the limiting factor of the system. Communication also becomes more of a problem with time because it does not scale with technology the same way that processing and memory elements do.

The complexity of systems on chip (soc) is increasing rapidly. One way socs are evolving is that they have more and more components, such as processors, memories, Direct Memory Access units (dma), accelerators and peripherals. All these components need to communicate with each other.

One of the most widely used means of communication between components is a shared bus [11, chapter 2]. The shared bus consists of parallel wires that are connected to the various components. Only one component can control the bus at any given time. To make sure that only one component tries to control the bus, an arbiter is used: any component that wants to control the bus has to request bus access from the arbiter and can only control the bus when the request has been granted. Since only one component can control the bus at a time, it is not possible to perform parallel data transfers. The shared bus therefore quickly becomes a bottleneck as more components are added.

The requirements for on-chip communication are not the same as for communication between chips or between computer systems. This difference means that

the communication solutions for computer systems or inter-chip communication do not necessarily solve the problems of on-chip communication. Some typical requirements for on-chip communication are high bandwidth, low latency, low energy consumption and low area.

Two important choices when designing the on-chip communication are the choice of protocol and the choice of topology. These two aspects are described in the following sections.

Protocol

A protocol is a set of rules that allows communication between components that follow these rules. The rules can define the structure of the sent messages as well as the physical interface (e.g. how many pins are needed to interface with the communication system). Protocols facilitate the communication between components: if a component follows a certain protocol, it can communicate with any other component that follows the same protocol. This simplifies design reuse, since it is possible to reuse, say, a memory from another project without having to redesign its interface. Protocols also simplify the use of ip blocks, since they are able to communicate with the rest of the system without modification as long as they follow the communication protocol. Examples of features in protocols for on-chip buses are bursts, priority and error detection. These features are detailed further below.

Bursts

A protocol with burst support allows transferring multiple data items without paying the overhead for every data item. There are various ways of avoiding the per-item overhead in a burst, such as not sending the destination address for every data item; instead, the address is calculated from control information sent when the burst is started. Another way a burst can reduce the overhead is if the receiver acknowledges only that the burst in its entirety has arrived, instead of acknowledging every data item in the burst.
Bursts are very useful when reading data from a component that is far away, i.e. when the latency to it is several clock cycles. With bursts there are fewer transfers that have to go there and back, as illustrated in figures 2.1 and 2.2.

Priority

Priority levels can be attached to transactions to indicate how important a transaction is. When a communication system with priority support chooses between two transactions, it will choose the one with the highest priority. Various priority schemes are possible, e.g. static priority levels per master, each master assigning a priority level to every transaction, or rotating priority levels to avoid starvation.
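The round-trip saving from bursts can be made concrete with a rough cycle-count model. This is illustrative arithmetic of our own (the function name, the cost model and the parameters are assumptions, not taken from the thesis or from any particular protocol):

```python
def transfer_cycles(n_items, link_latency, burst=False):
    """Rough cycle count for reading n_items from a slave that is
    link_latency cycles away (illustrative model only).

    Without bursts every item needs its own address/acknowledge round
    trip. With a burst, one address and one acknowledge cover the whole
    transfer, and the data items then stream back one per cycle.
    """
    round_trip = 2 * link_latency
    if burst:
        return round_trip + n_items      # one round trip + streamed data
    return n_items * (round_trip + 1)    # one round trip per item

print(transfer_cycles(8, 4, burst=False))  # -> 72
print(transfer_cycles(8, 4, burst=True))   # -> 16
```

Even in this crude model the benefit grows with distance: the non-burst cost scales with latency times item count, while the burst cost pays the latency only once.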

Figure 2.1: Two data items transferred without burst. One address is sent for each data item. One data_ok signal is sent for each data item, and no new data items can be sent before the previous data item is confirmed to have arrived. The data_ok signal for the second data item does not fit in the illustration.

Figure 2.2: Two data items transferred with burst. Only the address for the burst is sent, not one for each data item. Only one data_ok is sent for the whole burst, instead of one for each data item.

Error Detection

Sometimes errors occur during a transaction, with the result that the destination does not receive the exact data that the source sent. To detect these errors, some extra bits, set to the result of a function of the message, are sent with the transaction. The destination can then apply the same function to the received message and compare the result to the extra bits in the message. To get the correct data, the destination can send an error response to the transaction, which triggers the source to resend the data. Another way to get the correct data is to encode the sent data using an error-correcting code, which includes enough additional information to recover the original data when an error has occurred.

Topology

The topology of a computer system refers to how the components are logically and physically connected. Examples of topologies are bus, star, ring and tree (see figure 2.3). Topologies can be combined to create more specialized topologies.

Figure 2.3: Topology examples: (a) bus topology, (b) star topology, (c) ring topology, (d) tree topology.

The shared bus is a widely used topology for socs [11, chapter 2]. But this topology scales well neither with the number of connected components nor with the technology, so processors spend a lot of time waiting for the communication system [4]. Other topologies are used to avoid the problems of the shared bus.
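The shared bus depends on the arbiter described in section 2.1, and a rotating (round-robin) priority scheme is one of the starvation-avoiding schemes mentioned under Priority. The following is a minimal illustrative model in Python, not the thesis's rtl implementation; all names are our own:

```python
def make_round_robin_arbiter(n_masters):
    """Return a function that grants the shared bus to one requester
    per cycle using rotating priority, so no master is starved."""
    state = {"last": n_masters - 1}  # most recently granted master

    def arbitrate(requests):
        """requests: set of master indices asserting a bus request."""
        if not requests:
            return None  # bus idle this cycle
        # Search in rotating order, starting just after the last winner.
        for offset in range(1, n_masters + 1):
            candidate = (state["last"] + offset) % n_masters
            if candidate in requests:
                state["last"] = candidate
                return candidate

    return arbitrate

arbiter = make_round_robin_arbiter(3)
# Masters 0 and 2 keep requesting: grants alternate instead of
# starving one of them, as a fixed-priority arbiter would.
grants = [arbiter({0, 2}) for _ in range(4)]
print(grants)  # -> [0, 2, 0, 2]
```

A fixed-priority arbiter would grant master 0 every cycle in this scenario; the rotation is what bounds each master's waiting time.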

Two different approaches are used to solve these problems: system-specific topologies and network-on-chip (noc) topologies. A system-specific topology is designed specifically for the soc, so the requirements of the system are known and the topology can be tweaked to fit those requirements [15]. The noc approach is to use a structured and scalable topology to keep the energy consumption low while maintaining high bandwidth and low latency [8, 14, 5].
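Returning to the error-detection feature described earlier in this section: the simplest instance of "extra bits set to the result of a function of the message" is a single even-parity bit. A minimal sketch follows (all names are our own; real buses typically use wider checksums or error-correcting codes):

```python
def parity_bit(word):
    """Even parity over a data word: one extra bit chosen so that the
    total number of one-bits (word + parity) is even. This detects any
    single-bit error, but not all multi-bit errors."""
    return bin(word).count("1") % 2

def send(word):
    # Source: attach the function-of-the-message bits to the transfer.
    return (word, parity_bit(word))

def receive(word, parity):
    # Destination: recompute the same function and compare.
    if parity_bit(word) != parity:
        return None  # signal an error response; source should resend
    return word

word, p = send(0b1011_0010)
assert receive(word, p) == 0b1011_0010   # clean transfer accepted
corrupted = word ^ 0b0000_0100           # one bit flipped in transit
assert receive(corrupted, p) is None     # error detected
```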


3 Initial Study

This chapter describes the initial study done before the implementation. It contains three sections. The first section describes the computer system that the bus system will be designed for. The second section discusses different bus protocols and how suitable they are for the computer system. The third section discusses bus topologies and how well the computer system maps to them, and finally a topology is chosen.

3.1 Computer System

A bus system connects various components together to form a computer system. This section describes the computer system and what interconnections should be possible within it.

The computer system in this project should be able to handle variable workloads, and therefore a modular architecture is chosen where some modules can be turned off when the workload is small and turned on when the workload is high. The computer system should have an architecture that can work as a base for both high- and middle-end products, so it should be scalable in the sense that adding a module should not drastically reduce performance nor require a major architectural change of the system.

The modules are processing blocks consisting of a cpu, a local memory, a dma and an accelerator. The modules should be able to communicate with each other and have access to a shared memory outside of the module. A debug unit should be able to read the local memory in the modules and the shared memory outside of the modules. The computer system only has one clock domain.

A figure of the computer system is shown in figure 3.1. Note that the bus system

in the figure just indicates what connections should be possible, not the actual topology.

Figure 3.1: Architecture of the computer system. It consists of two processing blocks with local memory and accelerators. Outside the processing blocks are shared memories, a debug unit and an IO unit. M indicates a master and S a slave.

3.2 Protocol

To be able to choose a suitable bus protocol, a set of key aspects that are important for the computer system has been selected. The compared aspects are described in the list below. Each aspect has a priority of High, Medium or Low, indicating how important it is for the computer system. The priority of each aspect has been chosen to suit the specified computer system through discussions with the supervisor.

Pipeline (High): Does the protocol have support for pipelining?
Deadlock free (High): Is the protocol guaranteed to be free from deadlocks?
Testability (High): How easy is it to verify the functionality of an implementation of the protocol?
Burst (High): What kind of bursts does the protocol support?
Handshake latency (High): What is the minimum latency of a read/write cycle?
Write data width (Medium): What write data widths does the protocol support?
Read data width (Medium): What read data widths does the protocol support?
Complexity (Medium): How complex must the interconnection network be to support the protocol? How many wires are needed? How complex must the intermediary nodes be?
Compatibility (Medium): What other components can the protocol communicate with?
Priority levels (Medium): Does the protocol support transactions with different priority levels?
Number of masters (Medium): How many masters does the protocol support?
Number of slaves (Medium): How many slaves does the protocol support?
Error signaling (Medium): Can the protocol signal that an error has occurred?

Flexibility (Medium): How much of the interconnection network is determined by the protocol?
Low-power state (Low): Is it possible to signal nodes to go into low-power states to save power?
Coherency (Low): Does the protocol support coherency between its nodes?
Extensibility (Low): Is it possible to add new (user-defined) features to the protocol?

The compared protocols were AXI4, CoreConnect and Wishbone. These protocols were chosen because they are among the most popular protocols for on-chip buses [9, p. 7-15]. A solution specialised for the application has also been considered, based on solutions from scientific papers.

AXI4

AXI4 (Advanced eXtensible Interface 4) is defined in the amba 4 (Advanced Microcontroller Bus Architecture) specification made by arm [3]. The AXI4 protocol is designed for high-bandwidth and low-latency designs. The specification includes AXI4-Lite, a subset of AXI4 designed for simpler control interfaces. In this project the AXI4-Lite protocol has not been considered, because the application demands high bandwidth and low latency.

The AXI4 protocol defines five independent transaction channels:

- Write address
- Write data
- Write response
- Read address
- Read data

The protocol allows for multiple outstanding transactions, which means that a master can request multiple reads or writes even if the previous transactions are not yet finished. The protocol also supports out-of-order transaction completion, which means that transactions are not required to complete in the same order they were issued. Timing diagrams of a read and a write cycle in the AXI4 protocol are shown in figures 3.2 and 3.3; note that some signals are not shown in the figures.
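The outstanding and out-of-order behaviour rests on transaction IDs: each response carries the ID of its request, so the master can match responses to requests regardless of completion order (in AXI4 the IDs travel on dedicated signals such as ARID/RID). A toy Python model of the idea, with names of our own invention:

```python
class Master:
    """Toy model of outstanding reads matched by transaction ID.

    Issued reads are remembered per ID; the slave side may complete
    them in a different order than they were issued, and the master
    pairs each response with its request purely by ID.
    """
    def __init__(self):
        self.outstanding = {}  # txn_id -> address

    def issue_read(self, txn_id, addr):
        self.outstanding[txn_id] = addr  # read requested, not yet done

    def complete(self, txn_id, data):
        addr = self.outstanding.pop(txn_id)  # match response by ID
        return (addr, data)

m = Master()
m.issue_read(0, 0x1000)   # issued first
m.issue_read(1, 0x2000)   # issued second
# Responses arrive out of order: ID 1 completes before ID 0.
assert m.complete(1, "B") == (0x2000, "B")
assert m.complete(0, "A") == (0x1000, "A")
assert not m.outstanding  # nothing left pending
```

This is why, as noted later in the Complexity discussion, AXI4 interconnect nodes must track ordering per ID, whereas a protocol without multiple outstanding transactions can use simpler nodes.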

Listed below is how the AXI4 specification handles the key aspects that are of importance for the computer system.

Pipeline: Each channel transfers information in one direction. This allows for pipelining of each individual channel, which makes the protocol flexible in regards to timing.
Deadlock free: If the protocol is implemented correctly, it is guaranteed to be free from deadlocks.
Testability: The AXI4 protocol comes with protocol assertions written in SystemVerilog [2]. The assertions can be added to the design when it is simulated, to verify that it complies with the protocol.
Burst: AXI4 is burst based; in a transaction the master sends the start address and burst information, and the slave must calculate the subsequent addresses. The protocol defines three types of bursts: fixed, incrementing and wrapped. The fixed and wrapped burst types support burst lengths of 1 to 16 transfers and the incrementing type supports burst lengths of 1 to 256. A burst must not cross a 4 KB address boundary.
Handshake latency: The minimum latency for the first read or write data transfer in a burst is 2 clock cycles. The following transfers in the burst can be finished in 1 clock cycle per data transfer.
Write data width: 8, 16, 32, 64, 128, 256, 512 or 1024 bits.
Read data width: 8, 16, 32, 64, 128, 256, 512 or 1024 bits.
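The burst rules above (the slave calculates the subsequent addresses; fixed, incrementing and wrapped types; no 4 KB boundary crossing) can be sketched as an address generator. This is an illustrative model of our own, not the AXI4 address arithmetic verbatim, and it checks the 4 KB rule only for the incrementing type:

```python
def burst_addresses(start, size_bytes, length, burst_type="INCR"):
    """Addresses a slave would generate for one burst (sketch).

    FIXED repeats the start address, INCR steps by the transfer size,
    WRAP increments but wraps inside an aligned window of
    length * size_bytes bytes (start is assumed aligned to size_bytes).
    """
    if burst_type == "FIXED":
        return [start] * length
    if burst_type == "INCR":
        addrs = [start + i * size_bytes for i in range(length)]
        # An incrementing burst must not cross a 4 KB address boundary.
        assert addrs[0] // 4096 == addrs[-1] // 4096, "crosses 4 KB"
        return addrs
    if burst_type == "WRAP":
        window = length * size_bytes        # size of the wrap container
        base = (start // window) * window   # aligned start of the window
        return [base + (start - base + i * size_bytes) % window
                for i in range(length)]

print(burst_addresses(0x100, 4, 4, "INCR"))  # -> [256, 260, 264, 268]
print(burst_addresses(0x108, 4, 4, "WRAP"))  # -> [264, 268, 256, 260]
```

The wrapped example shows why that type suits cache-line fills: the critical word at 0x108 arrives first, then the addresses wrap around to complete the aligned 16-byte block.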

Complexity: Since separate read and write address channels and separate read and write data channels are used, the protocol will use a lot of wires. The nodes in the interconnection network need to keep track of the order of read and write transactions, to ensure that certain transactions reach the slave or master in the same order as they were issued. The complexity of the nodes in the interconnection network is largely dependent on the topology of the interconnect.
Compatibility: AXI4 is backwards compatible with earlier AXI versions, and there are bridges available to the ahb and apb protocols.
Priority levels: The protocol includes quality of service (QoS) signals, which can be used as priority levels for each transaction.
Number of masters: The protocol does not limit the number of masters, but recommends a maximum of 16.
Number of slaves: The protocol does not limit the number of slaves.
Error signaling: The protocol contains signals to alert the master of slave and decoding errors.
Flexibility: The protocol makes no assumptions about the interconnection network. The fact that each channel can be pipelined individually makes the protocol flexible in regards to the interconnection network.
Low-power state: The protocol has support for asking peripherals to enter and wake up from low-power states.
Coherency: AXI4 does not support coherency in itself, but the AXI Coherency Extension (ace) can add that functionality if it is necessary.
Extensibility: AXI4 includes USER signals that can be used to extend the protocol to the user's needs.

To support all features of AXI4 a quite complex system is needed. Since not all

features are necessary, only a subset of them needs to be supported, which reduces the complexity of the system. The fact that there are protocol assertions to verify that an implementation conforms to the protocol is a big advantage. If no such assertions had been available they would have had to be written, and it would then be possible to make the same mistake in the rtl code as in the assertions, so the error would go unnoticed. For this project it is good that AXI4 does not define a topology, since a part of the project is to decide on one. In general the AXI4 protocol is flexible and seems to suit this project well.

Figure 3.2: AXI4 read cycle, from the AXI4 specification [3]. An AXI4 master starts a read transaction by putting a read address on the ARADDR bus and asserting ARVALID. The AXI4 slave accepts the read address by asserting ARREADY. The slave then puts out the requested read data on the RDATA bus and asserts RVALID. In the same cycle as the last read data is sent, the RLAST signal is asserted.

Figure 3.3: AXI4 write cycle, from the AXI4 specification [3]. An AXI4 master starts a write transaction by putting a write address on the AWADDR bus and asserting AWVALID. The AXI4 slave accepts the write address by asserting AWREADY. The master then puts the write data on the WDATA bus and asserts the WVALID signal. The slave asserts WREADY to indicate that it can accept the data. The master asserts WLAST when the last write data is put on the bus. After the slave has accepted the last write data, it sends a write response indicating that the write data has arrived: it puts an OKAY code on the BRESP bus and asserts the BVALID signal. The master has already asserted the BREADY signal, indicating that it is ready to receive the write response.

CoreConnect

CoreConnect is a bus protocol from ibm. It is segmented into three buses: the processor local bus (plb), the on-chip peripheral bus (opb) and the device control register bus (dcr) [6]. In this project only the processor local bus has been considered, since that is the bus designed for high throughput and low latency. Furthermore, only the latest version of the plb, version 6, has been studied. Timing diagrams of a read and a write cycle in the PLB6 protocol are shown in figures 3.4 and 3.5.

Listed below is how the CoreConnect specification with the plb bus controller core handles the key aspects that are of importance for the computer system.

Pipeline: plb supports pipelining in the sense that a second command can be requested before the first command completes. There is no limit to the number of pending commands, except that a master can only have two pending commands if it has not yet won arbitration of the bus.
Deadlock free: The plb is free from deadlocks if implemented correctly.
Testability: No assertions for the protocol are available from ibm. If the soft core bus controller is used, it contains error registers that can be used to detect internal errors caused by masters and slaves not following the protocol.
Burst: Bursts of size 1-8 are supported.
Handshake latency: For a read cycle it takes a minimum of 10 clock cycles from the cycle that the master requests access to the bus until it receives the first data. For a write cycle it takes a minimum of 13 clock cycles from the cycle that the master requests access to the bus until the slave receives the first data.
Write data width: 128 bits.
Read data width: 128 bits.

Complexity: With the soft core bus controller already implemented, most of the complexity is taken care of. Intermediary nodes can be relatively simple, since they mostly forward data between the bus controller and the slave or master.
Compatibility: There are bridges available to the other buses in the CoreConnect specification, OPB and DCR. There is also a bridge available to the AMBA AHB protocol.
Priority levels: The different masters have programmable priorities, but the protocol does not support different priority levels for individual transfers.
Number of masters: Up to 16 masters.
Number of slaves: Up to 8 segments with up to 4 slaves in each segment.
Error signaling: The protocol supports error signaling, and slaves can signal an error or request a retried transfer. The bus controller has registers to indicate internally detected errors.
Flexibility: The protocol is designed to use the plb bus controller core, a soft core which defines the routing and arbitration. The bus controller core is parameterised, so a core specific to the application can be generated.
Low-power state: plb does not support the signaling of low-power requests.
Coherency: plb supports cache coherency between up to 8 masters and up to 4 slaves. There is also support for hardware coherency via a module in the plb bus controller core.
Extensibility: There are no user-defined signals. With the plb bus controller core there is no extensibility, because the interface is fixed.

The plb protocol is quite complex, with a lot of fixed timings between signals,

partly because of the snooping and coherency capabilities of the protocol. It is possible to reduce the complexity of the protocol by removing features such as coherency and removing the signals relating to those features. It might also be possible to relax some of the fixed timings when features are removed, but at that point more work is being put into reworking the protocol than into making a simplified implementation of it. The complexity, and the high latency from a request until data is received, are problems with this protocol.

Figure 3.4: plb read cycle, from the plb specification [7]. The arrows indicate dependencies. Signals starting with mst_m0 are driven by the master, plb_m0 by the bus controller to the master, plb_s1 by the bus controller to the slave and slv_s1 by the slave. The read transaction starts with the master asserting mst_m0_req and putting the MSBs of the read address on mst_m0_addr. The next clock cycle the LSBs of the read address are put on mst_m0_addr. The bus controller accepts the transaction by asserting plb_m0_ready and signals the slave by asserting plb_s1_req. The slave receives the read address on the plb_s1_addr signals. The slave then asserts slv_s1_rd_req to indicate that it needs to send data, and the bus controller grants the request by asserting plb_s1_rd_gnt. The slave then sends the data by asserting slv_s1_rd_val and putting the data on the slv_s1_rd_data bus. The bus controller then signals the master that read data is available by asserting plb_m0_rd_val, and the read data is available on the plb_m0_rd_data bus.

Figure 3.5: plb write cycle, from the plb specification [7]. The arrows indicate dependencies. Signals starting with mst_m0 are driven by the master, plb_m0 by the bus controller to the master, plb_s1 by the bus controller to the slave and slv_s1 by the slave. The master starts a write transaction by asserting mst_m0_req and putting the MSBs of the write address on mst_m0_addr. The next clock cycle the LSBs of the write address are put on mst_m0_addr. The bus controller grants the write request by asserting plb_m0_ready. The bus controller then checks that there is a slave on the requested address and returns an ack to the master on the plb_m0_comb signals. The master then indicates that it needs to send data by asserting mst_m0_wr_req, and the bus controller grants the request by asserting plb_m0_wr_gnt. The master then sends the data by asserting mst_m0_wr_val and putting the write data on the mst_m0_wr_data bus. The bus controller then signals the slave that write data is available by asserting plb_s1_wr_val, and the data is available on the plb_s1_wr_data bus.

Wishbone

The Wishbone protocol was designed by Silicore Corporation, but has been released into the public domain [10]. The protocol is maintained by the OpenCores Organization. The Wishbone protocol studied in this report is revision B.4. Timing diagrams of a read and a write cycle in Wishbone are shown in figures 3.6 and 3.7.

Listed below is how the Wishbone specification handles the key aspects that are of importance for the computer system.

Pipeline: There is a pipelined mode in the specification, but according to the protocol at least one signal (STALL_O -> STALL_I) must be non-registered [10, p. 36]. It is probably possible to insert a small flow control block to allow the pipelining of this signal as well. The pipelined mode allows a second transfer to begin before the first is completed.
Deadlock free: It is possible to make masters and slaves that are compatible with the protocol, but not with each other, causing a deadlock [10, p. 35]. The specification recommends that the interconnection module should be designed in a way that prevents deadlocks, for example with watchdog timers.
Testability: The Wishbone protocol does not come with official assertions to verify its functionality. There are community-created assertions available, but it is not clear that they cover the entire protocol. The protocol is relatively simple, with one data channel and one address channel that do not work independently. This increases the testability of the protocol.
Burst: The protocol supports bursts, but the address must still be supplied, so the address channel is busy during the entire transfer. Bursts are used to avoid the flow control overhead for every transfer. There is support for constant, incrementing and wrapping bursts. There is no limit on the burst length.

Handshake latency: If bursts are not used the latency for a read and a write is 2 clock cycles. If bursts are used the latency for the first transfer in the burst is 2 clock cycles and the rest of the transfers finish in 1 clock cycle per transfer. It is possible for the slave to respond to a master's request combinationally to reduce the handshake latency to 1 clock cycle per transfer, but this normally leads to a long combinational path which prevents a high clock frequency.

Write data width: 8, 16, 32 or 64 bits.

Read data width: 8, 16, 32 or 64 bits.

Complexity: The Wishbone protocol uses comparatively few wires due to it being a simple protocol. The nodes in the interconnection network do not have to remember the order of transfers because the protocol does not support multiple outstanding transactions. This allows for simple nodes.

Compatibility: The Wishbone protocol is compatible with any component that implements the Wishbone protocol.

Priority levels: The protocol in itself does not support priority levels, but it is still possible to implement them via arbiters with configurable priorities.

Number of masters: The protocol does not limit the number of masters.

Number of slaves: The protocol does not limit the number of slaves.

Error signaling: The protocol has signals to alert the master of an abnormal cycle termination or if the transfer should be retried.

Low-power state: The protocol does not support low-power states other than reducing the clock frequency of the master, slave and interconnection interfaces.

Flexibility: The protocol makes no assumptions about the interconnection network except that the connection from STALL_O to STALL_I must not be registered.

Coherency: The protocol does not support coherency.

Extensibility: The Wishbone interface supports adding user-defined signals associated with an address, a data word or a bus cycle. The technique is called tagging.

Wishbone seems like a good and simple protocol and is probably the easiest to implement of the protocols compared. But since it does not have support for multiple outstanding transactions the performance will probably not be enough for the computer system. The fact that it needs to have a path from slaves to masters that is non-registered makes the protocol less flexible, because it is no longer possible to trade latency in clock cycles against clock frequency.

Figure 3.6: Wishbone read cycle, from the Wishbone specification [10].

The Wishbone master starts a read transaction by asserting STB_O and CYC_O, negating WE_O, putting the read address on ADR_O and indicating on SEL_O which parts of the data bus it expects data on. The slave responds to the request by asserting ACK_I and putting the read data on DAT_I. The TGA, TGD and TGC signals are tag signals that are not used in this example.

Figure 3.7: Wishbone write cycle, from the Wishbone specification [10].

The Wishbone master starts a write transaction by asserting STB_O and CYC_O, driving WE_O high, putting a valid address on ADR_O, indicating on SEL_O which parts of the data bus contain valid data and putting valid data on DAT_O. The slave responds to the request by asserting ACK_I. The TGA, TGD and TGC signals are tag signals that are not used in this example.

Specialised Solution

A specialised solution allows for a protocol that is tailored specifically for the computer system, so most of the key aspects will be fulfilled. But designing a protocol is not easy, especially one with support for outstanding transactions and pipelining. A specialised protocol will not have perfect compatibility with other protocols without some kind of bridge, which can be troublesome if the system needs to interface with other components. To verify that an implementation conforms to the protocol, test assertions have to be written.

To avoid spending a lot of time and most of the difficulties of designing a new protocol, one can modify an existing protocol to better suit the intended application. Unused signals can be removed to reduce the wire count, and features can be removed to reduce the complexity or added to better suit the application. The master, slave and interconnections will be designed in this project, so it is not necessary to include all signals of the original protocol. Since the specialised protocol is based on another protocol it is easier to achieve compatibility if that should be needed in the future.

Scientific papers have been studied to see current trends and where the field is heading. Most interconnect solutions in scientific papers seem to focus on packet-based routing with a network topology [5, 14]. This solution tends to have lower latency, higher throughput and improved duty factor of the wires at the cost of extra area. In some cases packet switching networks can suffer from congestion at high loads, which leads to long delays [14]. One downside with this method is that no standard protocol has emerged, so it is difficult to get compatibility between components unless you design all of them or create a bridge between the protocols. Due to the lack of standardised protocols for this solution it will not be studied further.
For this project it would be too time consuming to design a completely new protocol and verify its performance. A more feasible alternative is to modify a protocol to suit the computer system.

Summary

This section provides a summary of the different protocols and how they perform in the aspects with high priority.

Pipeline
- AXI4: Each channel can be pipelined individually. A simple flow control block is needed for each pipeline stage.
- CoreConnect: Each master can only have two pending commands if it has not yet won arbitration of the bus.
- Wishbone: Support for pipelined mode, but according to the specification at least one signal must be non-registered. It is probably possible to insert some flow control to pipeline that signal as well.

Deadlock free
- AXI4: Deadlock free.
- CoreConnect: Deadlock free.
- Wishbone: It is possible to design a master and slave that will deadlock if the slave implements some optional features that the master does not [10, p. 35].

Testability
- AXI4: The protocol assertions from ARM make it easy to verify that the implementation follows the protocol.
- CoreConnect: The protocol is complex and no protocol assertions are available from IBM.
- Wishbone: Community-created assertions are available and the protocol is much simpler than AXI4 or CoreConnect, which increases testability.

Burst
- AXI4: The protocol is burst based with three types of burst: fixed, incrementing and wrapped. Burst lengths 1-16 are supported for all burst types. For incrementing bursts, burst lengths up to 256 are supported.
- CoreConnect: The protocol supports burst lengths of 1-8.
- Wishbone: The protocol supports bursts, but some control information, such as the address, must still be supplied for every data transfer. The bursts are used to avoid the flow control overhead for every transfer. There are no limits on the burst lengths.

Handshake latency
- AXI4: Minimum 2 clock cycles of latency for the first read or write data transfer; the following transfers in the burst have a minimum of 1 clock cycle of latency.
- CoreConnect: Minimum of 10 cycles from a read request to data arrival. Minimum of 13 cycles from a write request to the first data arrival at the slave. The long delays stem from coherency requirements that are not present in the other protocols.
- Wishbone: The minimum latency for the first read or write in a burst is 2 clock cycles; the following transfers have a minimum of 1 clock cycle of latency.

Decision

The protocol chosen for this project is a subset of AXI4, with some optional features removed because they are not necessary for this project.

The AXI4 protocol was chosen because of its flexibility with respect to topologies, its test assertions and because the other protocols were lacking in important aspects. Wishbone was deemed too simple due to the lack of proper support for outstanding transactions and pipelining. CoreConnect was too complex due to its coherency features. Table 3.6 describes the features and the corresponding wires of AXI4 that were removed.

Cache: The cache signals are used to indicate how the transaction will progress through the system. The cache features will not be used because they demand increased complexity of the interconnection system. Furthermore, the system will not contain many different memory types to benefit from this feature. To follow the AXI4 protocol these signals can be hardwired since neither the masters nor the slaves in this project use the signals. Signals hardwired: AWCACHE and ARCACHE.

Protection: The protection type indicates the privilege and security level of a transaction and whether it is a data or instruction access. The signals will not be used since the protection features will be implemented outside of the interconnection protocol. To follow the AXI4 protocol these signals can be hardwired since neither the masters nor the slaves in this project will use the signals. Signals hardwired: AWPROT and ARPROT.

User: The user signals are for adding user-defined signals. This feature is optional and will not be used in the project. Signals removed: AWUSER, ARUSER, WUSER, RUSER and BUSER.

Low power: Low power interface signals can signal low power requests to masters and slaves. This feature is optional and will not be used in the project. Signals removed: CSYSREQ, CSYSACK and CACTIVE.

Region: The region signals in AXI4 are used for providing multiple logical interfaces to one physical interface and to make address decoding in slaves simpler. The region signals are an optional extension to the AXI4 interface signal set. They will not be used in this project because the application will not benefit from them. Signals removed: AWREGION and ARREGION.

Table 3.6: Removed features from AXI4

3.3 Topology

Key aspects for the topology have been identified and assigned different priorities according to their importance for the computer system. The compared aspects for the topology are listed in table 3.7. Note that the topology aspects are also affected by the protocol, so it is important to take the chosen protocol into account when deciding on a topology. The compared topologies were shared bus, shared address bus multiple data bus, switch fabric and bus hierarchy.

Parallel transactions (High): Can multiple transactions be communicated on the interconnection network at the same time?

Latency (High): What is the expected latency of the topology? What is the average latency? How does it vary with load?

Throughput (High): What is the maximum throughput? What is the expected throughput?

Area (Medium): How much area is used by the interconnections and the nodes?

Node complexity (Medium): How complex must the nodes be to support the protocol and topology?

Power consumption (Medium): How much power is consumed by the interconnections and the nodes?

Scalability (Low): How does the topology handle more communication nodes?

Table 3.7: Topology aspects

Shared Bus

A shared bus consists of an address bus and a data bus that connect multiple components (fig 3.8). Since the bus is shared, only one master can access the bus at a time. Listed below are the key topology aspects and how a shared bus handles them.

Parallel transactions: The shared bus topology does not support parallel transactions because only one master has access to the bus at a time.

Latency: The latency is highly dependent on the load of the bus. If the load is high, masters will have to wait for other masters to complete their accesses, which can result in high latencies.

Throughput: The throughput is quite low, since no parallel transactions can be made.

Area: This topology uses little area. There are no intermediary nodes in a shared bus.

Node complexity: The only component with logic in it is the arbiter, which will be fairly simple.

Power consumption: Due to the small area and simple control, a shared bus topology does not consume much power.

Scalability: The performance does not scale well in this topology, especially not when adding additional masters. The area scales very well when adding additional masters or slaves.

A shared bus topology is not suitable for the computer system in this project because the performance of the topology does not scale well when adding more masters. The computer system also loses its modularity with a shared bus topology since the components are no longer grouped together. A block diagram of the computer system with a shared bus topology is shown in figure A.1 on page 83.

Figure 3.8: Shared bus topology with multiple masters.
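To make the arbiter's role in a shared bus concrete, the following Python model sketches round-robin arbitration (an illustration of one common policy; the thesis does not specify which arbitration scheme is used, and all names here are invented). Each cycle the arbiter grants the bus to at most one requesting master, resuming the search after the last winner so that no master is starved.

```python
class RoundRobinArbiter:
    """Grants a shared bus to one requesting master per cycle, round-robin."""

    def __init__(self, n_masters: int):
        self.n = n_masters
        self.last = n_masters - 1  # so the first search starts at master 0

    def grant(self, requests: list) -> "int | None":
        """requests[i] is True if master i wants the bus this cycle.
        Returns the index of the granted master, or None if nobody requests."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(3)
# With all three masters requesting, the grant rotates: 0, 1, 2, 0, ...
grants = [arbiter.grant([True, True, True]) for _ in range(4)]
print(grants)  # [0, 1, 2, 0]
```

Because the search always resumes after the previous winner, a master that keeps requesting is served within n cycles, which matches the "fairly simple" arbiter the text describes.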

Shared Address Bus Multiple Data Bus

A shared address bus multiple data bus is similar to a shared bus, but has multiple data buses (fig 3.9). The motivation for this topology is that data is sent over the bus more frequently than addresses. With this topology there can be parallel data transactions as long as they are on different data buses. Listed below is how the topology handles the key aspects.

Parallel transactions: Transactions cannot be issued at the same time, since the address bus is used when issuing a transaction. But a new transaction can be issued while another is being handled, that is, transactions can be overlapped. The maximum number of parallel transactions is the number of data buses in the topology.

Latency: The latency is dependent on the load of the bus. Since there are parallel data buses, less time will be spent waiting for the bus to be available. If many short data transactions are made, the address bus will be congested and the latency will be high.

Throughput: A higher throughput than a shared bus is possible due to the multiple data buses.

Area: This topology is quite area efficient if the number of parallel data buses is not too high.

Node complexity: The arbiter is more complex than in a shared bus topology since it must control access to the data buses individually.

Power consumption: This topology uses more power than a shared bus because of the added wires and logic. The power consumption varies a lot because the buses tend to be used intensively for a short while and then not used for a while.

Scalability: The scalability depends on the number of data buses and the access patterns. In general the performance scalability is much better than for a shared bus. If many components are added, the topology might benefit from adding another data bus. The area scales well, but increases when adding additional data buses.

The shared address bus multiple data bus topology can be quite a good fit for the computer system, if there is one data bus per module and one extra for the shared memories and other non-local components. One downside of this topology is that the locality of components in the computer system is not taken into account at all. Also, if many modules are used the address bus will be congested and the throughput and latency will degrade. A block diagram of the computer system with the shared address bus multiple data bus is shown in figure A.2 on page 84.

Figure 3.9: Shared address bus multiple data bus topology with two parallel data buses.

Switch Fabric

Switch fabric is a many-to-many topology supporting multiple parallel transactions. Masters compete for access to slaves or a group of slaves instead of access to the bus as in a shared bus topology. Figure 3.10 shows a switch fabric network with three masters and four slaves. The network is built up of 2x2 switches, meaning that each switch has two inputs

and two outputs and can route each input to any output if that output is not being used at the moment. Listed below are the key topology aspects and how a switch fabric handles them.

Parallel transactions: Parallel transactions are supported, but not without restrictions. Two masters connected to the same switch cannot access the same output at the same time. In general, more parallel transactions can be made with this topology than with the other studied topologies.

Latency: Since the switches all contain logic, the delay can be quite big. To prevent low clock speeds, pipeline registers can be inserted before or after the switches. These registers increase the latency of the protocol. To reduce the latency, the nodes can buffer write data if the next switch or the slave is busy.

Throughput: The switch fabric topology is capable of high throughput due to the many parallel transactions.

Area: This topology occupies a lot of area due to the many switches. The area depends a lot on the complexity of the nodes, whether they have internal buffers or not.

Node complexity: The node complexity depends on the implementation. The nodes can be simple switches or they can contain big internal buffers and understand the bus protocol. The more advanced switches can be used to answer handshake signals for the slaves to increase the throughput.

Power consumption: The power consumption is relatively high due to the high area usage and the logic in the switches.

Scalability: This topology scales well with respect to latency and throughput because it is modular and distributed; there are no central control elements. But the area and power consumption increase a lot when more masters or slaves are added to the system.

This topology fits the protocol well since both are suitable for packet-based transactions. On the other hand it is not suitable for the computer system because the topology does not exploit that certain connections will be made more often than others, for example connections within a processing block.

Figure 3.10: Switch fabric topology. Each intermediary node can route one of its two inputs to any of its two outputs if it is not busy.

A block diagram of the computer system with the switch fabric topology is shown in figure A.3.

Bus Hierarchy

The bus hierarchy topology consists of dividing a bus into multiple buses connected with another bus, creating a two-level bus hierarchy (fig 3.11). The bus being divided can be one shared bus or multiple parallel buses. It is possible to have higher-level hierarchies by dividing the buses further. Bus bridges are used to transfer transactions from one hierarchy level to another. This topology is suitable when there are groups of components that communicate frequently within the group and not so frequently between the groups. The performance of a bus hierarchy depends on whether the access patterns show a lot of locality or not. If there is no locality a bus hierarchy will not perform well.

Parallel transactions: Parallel transactions are supported because a transaction on one bus can be handled at the same time as another transaction on another bus in the same hierarchy level. Only one transaction at a time can be handled on the highest bus in the hierarchy.

Latency: The latency can be kept low if most of the communication is on the local buses. If there are a lot of transactions on the highest bus there is a risk of congestion and high latencies due to stalling.

Throughput: If the traffic is mostly local the throughput will be high because not much time is spent waiting for the bus.

Area: This topology uses relatively little area.

Node complexity: Bus bridges are needed to separate the bus levels. Their complexity depends on the implementation; they can contain internal buffers to avoid waiting for the bus when it is busy. Arbiters are needed when there are multiple masters on a bus, which will always be the case for the highest bus.

Power consumption: The power consumption is low since most transactions will be made on small, energy-efficient local buses.

Scalability: The performance scalability depends on the traffic patterns. If most accesses are local the topology scales well. If the traffic is random, adding another slave or master can drastically decrease performance.

This topology fits the computer system well since the computer system consists of multiple processing blocks with a lot of internal communication and not so much external communication. A block diagram of the computer system with the bus hierarchy topology is shown in figure A.4 on page 85. The topology on a specific hierarchy level does not have to be a bus. It is possible to have multiple buses connected by a switch fabric or the other way around. This makes the hierarchical solution flexible.
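How strongly the hierarchy depends on locality can be illustrated with a back-of-the-envelope model (my own illustration, not a calculation from the thesis; the function names are invented): if each of n processing blocks issues r transactions per cycle and a fraction p of them stays on the block's local bus, the shared top-level bus must carry n * r * (1 - p) transactions per cycle, while a single shared bus can complete at most one transfer per cycle.

```python
def top_bus_load(n_blocks: int, rate: float, locality: float) -> float:
    """Transactions per cycle offered to the top-level bus.

    n_blocks: number of processing blocks on the top-level bus
    rate:     transactions per cycle issued by each block
    locality: fraction of each block's traffic that stays on its local bus
    """
    return n_blocks * rate * (1.0 - locality)


def hierarchy_keeps_up(n_blocks: int, rate: float, locality: float) -> bool:
    # A single shared top-level bus completes at most one transfer per cycle.
    return top_bus_load(n_blocks, rate, locality) <= 1.0


# Mostly local traffic: 4 blocks, 0.5 transactions/cycle each, 90 % local.
print(hierarchy_keeps_up(4, 0.5, 0.9))  # True  (load 0.2)
# The same system with only 20 % locality saturates the top-level bus.
print(hierarchy_keeps_up(4, 0.5, 0.2))  # False (load 1.6)
```

The model makes the scalability claim above concrete: with random traffic (low p) adding blocks quickly saturates the highest bus, while highly local traffic leaves it almost idle.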

Decision

The topology chosen for this project is the bus hierarchy. This topology is chosen because it fits the computer system well and provides good flexibility. The shared bus topology will be used on all levels of the hierarchy.

Figure 3.11: Bus hierarchy topology with two levels.

4 Implementation

This chapter explains how the bus system is built, what components it is built from and how those components work. It contains four sections. The first section explains the parts of the protocol that are necessary to understand why the bus system is built the way it is. The second section describes how the master and slave are designed and why. The third section describes the implementation of the bus system and its components. The fourth section describes the problems encountered in the implementation and how they were solved. The bus system only has one clock domain, so no special care is taken for clock domain borders. A block diagram of the bus system is shown in figure 4.1.

Figure 4.1: The computer system with bus components.

4.1 Protocol

This section describes the protocol used in the bus system. For details of the protocol that are not mentioned in this section, see the AXI4 specification. The protocol used in the bus system is a subset of the AXI4 protocol [3]. Table A.1 on page 88 lists all the signals in the protocol and their usage. The features that are removed from AXI4 are described in table 3.6.

4.1.1 Terminology

When a master initiates a bus operation targeting a slave:
- The complete set of required operations on the bus form a transaction.
- Any required data is transferred as a burst.
- A burst can comprise multiple data transfers.

4.1.2 Architecture

The protocol contains five independent transaction channels:
- Write address
- Write data
- Write response
- Read address
- Read data

The address channels transfer address and control information. The read data channel transfers read data as well as read response information from the slave to the master. The write data channel transfers write data from the master to the slave. The write response channel is used by the slave to respond to a write transaction. The following relationships between the channels must be maintained:
- A write response must always follow the last write transfer in the write transaction of which it is part.
- Read data must always follow the address to which the data relates.

The few relationship requirements make it possible to insert small blocks of register slices with simple flow control in any channel, at the cost of increasing the latency by one clock cycle per register slice. This allows for a trade-off between latency and the maximum clock frequency. There must not be any combinatorial paths between input and output signals.

The protocol is designed to be as stateless as possible, meaning that the information used for routing in the bus system is included in each bus transfer. This

allows for simpler components in the bus system, since they do not have to remember where to route each transfer. Instead a component just needs to inspect the bus transfer and route it accordingly. The only exception to this statelessness is the write data channel, see section 4.1.5 for details.

4.1.3 Ordering Model

Every channel except write data has ID signals. Transactions with the same ID must be returned in the same order as they were issued. The ID signals of the read data and write response channels must match the IDs of the read address and write address respectively. There are no ordering restrictions between read and write transactions, even if they have the same ID. The write data channel has no ID (details in section 4.1.5).

When a master is connected to the bus system, the interconnect appends additional bits to the ID signals that are unique to the master port. This has the effect that transactions from different masters have no ordering requirements. For read data the additional bits in the read data ID signal are used to determine which master to route the data to. For write responses the additional bits in the write response ID are used to determine which master to route the response to. The additional bits are removed by the interconnect before they are passed back to the master.

There are four different IDs used for ordering (2 bits). The choice of four IDs for ordering is based on the AXI4 specification's recommendation that the width of the ID fields inside the interconnect is 8 bits; 6 bits are needed for routing in this implementation, see the following sections for details.

4.1.4 Handshake

Each channel has its own pair of handshake signals, VALID and READY. VALID is asserted by the source to indicate that the information on that channel is valid. READY is asserted by the destination to indicate that it is ready to receive the information on that channel. The transfer occurs only when both the VALID and READY signals are asserted.
A source is not permitted to wait until READY is asserted before asserting VALID as this could cause deadlock. A destination is permitted to wait until VALID is asserted before asserting READY. Once VALID is asserted it must remain asserted until the end of the handshake. If possible the READY signal should be asserted as often as possible and not wait for the VALID signal to be asserted. This is because if READY is not asserted by default a transfer takes at least two clock cycles (fig 4.2), but if it is asserted it only takes one clock cycle (fig 4.3).

Figure 4.2: The READY signal waits for the VALID signal. The transfer takes two clock cycles.

Figure 4.3: The READY signal is asserted before the VALID signal. The transfer takes one clock cycle.
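The difference between figures 4.2 and 4.3 can be reproduced with a small cycle-level model. The Python sketch below (an illustration of the handshake rule, not part of the thesis implementation; the function name is invented) counts how many clock cycles a single transfer needs when the destination asserts READY by default versus when it waits for VALID first.

```python
def cycles_per_transfer(ready_by_default: bool) -> int:
    """Clock cycles from the source asserting VALID until the transfer occurs."""
    valid = False
    ready = ready_by_default
    for cycle in range(1, 10):
        valid = True            # the source asserts VALID and must keep it high
        if valid and ready:     # the transfer occurs when both are asserted
            return cycle
        ready = True            # a waiting destination sees VALID and asserts
                                # READY in the next clock cycle
    raise RuntimeError("handshake never completed")


print(cycles_per_transfer(ready_by_default=True))   # 1, as in figure 4.3
print(cycles_per_transfer(ready_by_default=False))  # 2, as in figure 4.2
```

The one-cycle penalty per transfer is why the text recommends asserting READY by default whenever possible.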

4.1.5 Write

In this protocol the write data channel has no ID signals. This means that write data must be issued in the same order as the write addresses. If the bus system combines write transactions from various masters it must ensure that the write data is output in the address order. These restrictions apply even if the write transactions have different IDs, because there is no way to know which transaction the write data belongs to except for the ordering. The lack of ID signals on the write data channel means that the interconnect must track to which destination the write data should be sent. In the other channels the destination is included in the ID or the address of the transfer.

4.1.6 Address- and Transaction ID Map

The system has two address mappings, one for units inside processing blocks, such as CPUs and DMAs (table 4.1), and one for units outside the processing blocks, such as debug units (table 4.2).

Address (upper 4 bits) | Slave
0xxx                   | local memory / local accelerator
1000                   | memory in block 1
1001                   | memory in block 2
1010                   | reserved
1011                   | reserved
1100                   | external memory 1
1101                   | external memory 2
1110                   | I/O
1111                   | unused

Table 4.1: Address map for units inside a processing block (e.g. CPU).

Address (upper 3 bits) | Slave
000                    | memory in block 1
001                    | memory in block 2
010                    | reserved
011                    | reserved
100                    | external memory 1
101                    | external memory 2
110                    | I/O
111                    | unused

Table 4.2: Address map for units outside the processing blocks (e.g. debug unit).

The transaction ID is 8 bits wide and is divided into four 2-bit fields (fig 4.4). The first field is used for ordering (see section 4.1.3). The other three fields are used

for routing the read data and write responses. The last field is used to route the transfer in the current environment, where the environment can be a processing block or the external bus. This means that there can never be more than four masters inside a processing block or outside of the processing blocks. The bus bridge counts as a master, both inside a processing block and on the external bus. The remaining two fields are used as a routing stack by the bus bridges. Two fields are needed because a transaction can cross two bridges, if it starts in one processing block and ends in another. For more information about how the routing stack is used, see the description of the bridge in section 4.3.

Figure 4.4: The structure of the 8-bit transaction ID. Bits 7:6 and 5:4 form the routing stack, bits 3:2 are used for routing in the current environment and bits 1:0 are used for ordering.

4.1.7 Protocol Assertions

The AXI4 protocol comes with protocol assertions to ensure that the protocol is being followed [3]. The protocol assertions are written in SystemVerilog and can be attached to a design. They give warnings during simulation when the design does not follow the protocol. The assertions have been modified to only assert the subset of AXI4 that is used in this project. The protocol assertions make it easy to check that the protocol is still being followed every time something is changed in the implementation. Without these protocol assertions it would be difficult to ensure that the bus system always follows the protocol. A pseudocode example of an assertion is shown in listing 4.1.

Listing 4.1: Pseudocode example of a protocol assertion checking that ARADDR is stable when it is valid.

    assert @(posedge clk)
        (ARVALID && !ARREADY) |-> stable(ARADDR)
        // If ARADDR is valid but not accepted, assert that
        // ARADDR is stable until the next clock cycle
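The append/strip behaviour of the routing fields can be sketched in software. The Python model below is my own illustration of the scheme described above (it is not the SystemVerilog implementation, and the function names are invented): an interconnect pushes its 2-bit master-port number into the ID when a transaction enters, shifting the existing routing fields up into the stack, and pops the bits off again to route the read data or write response back.

```python
def push_port(txn_id: int, port: int) -> int:
    """Insert a 2-bit port number at bits 3:2, shifting the old routing
    fields up into the routing stack. Bits 1:0 (ordering) stay in place."""
    assert 0 <= port < 4, "at most four masters per environment"
    order = txn_id & 0b11
    routing = txn_id >> 2
    new_id = ((routing << 2) | port) << 2 | order
    assert new_id < 256, "the transaction ID is 8 bits wide"
    return new_id


def pop_port(txn_id: int) -> tuple:
    """Remove bits 3:2 and return (restored ID, port to route back to)."""
    order = txn_id & 0b11
    routing = txn_id >> 2
    port = routing & 0b11
    return ((routing >> 2) << 2) | order, port


# A master issues ordering ID 2; it enters its block on interconnect port 1
# and crosses a bridge that is master port 3 on the external bus.
txn = push_port(0b10, 1)    # appended inside the processing block
txn = push_port(txn, 3)     # pushed by the bridge onto the routing stack
# On the way back the bits are popped in reverse order:
txn, port = pop_port(txn)   # port 3: route back through the bridge
txn, port = pop_port(txn)   # port 1: route back to the master; ID is 0b10 again
```

The last-in-first-out order of the pops mirrors how a response retraces the transaction's path, which is why two stack fields suffice for a transaction that crosses at most two bridges.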

4.2 Master and Slave

The master and slave units are written in SystemVerilog. Their purpose is to generate and respond to read and write transactions. They are not synthesizable because they are only used for simulation purposes. They are used in the project to emulate the interfaces of different units in a computer system, such as CPUs, DMAs, memories or accelerators. The master unit reads an input file and generates read and write transactions according to that file. The master and slave units write logs when a transaction is started or finished and when data is sent or received. These logs are then analyzed by a script to ensure that all transactions are successful.

The master and slave units only test whether the data that is being written or read gets through the bus system correctly. There are features in the AXI4 protocol, such as writing a single byte instead of a whole word, that are never tested by the master and slave units. Those features are handled only by the master and the slave; the bus system only passes the data from one end to the other. The focus of this thesis is the bus system, not the masters and the slaves, so these features are ignored.

The slave units can produce read data or accept write data in one or two clock cycles. This may not be the case in the real world, especially with large memories. The fast slaves are used to simplify the implementation and to analyze the best case, so that all the delay comes from the bus system and not the masters or slaves.

4.3 Bus System

The bus system consists of four different components:
- Pipeline stage - to break long paths into shorter paths
- Arbiter - to connect multiple masters to the bus
- Decoder - to connect multiple slaves to the bus
- Bridge - to connect buses between hierarchy levels

All components use double buffering to maintain full throughput even though they break a path from one or more masters to one or more slaves.
Double buffering allows a component to accept data even though it does not know whether the data can be received immediately at the other end. This is made possible by an internal buffer capable of storing the data from one transfer. Without double buffering, data is only transferred every second clock cycle, even though the destination is always ready to receive data (READY signal asserted) and the source is always ready to send data (VALID signal asserted) (fig 4.5). With double buffering, data is sent every clock cycle, so full throughput is maintained (fig 4.6). The only effect the pipeline stage has on the transaction is that the latency is increased by one clock cycle. The implementation of the double buffering is described later in this chapter.

[Figure 4.5 is a timing diagram of the signals clock, data_master, valid_master, ready_master, data_slave, valid_slave and ready_slave over the cycles T2-T7.]

Figure 4.5: The handshake for a pipeline stage without double buffering. Blue lines indicate that the pipeline stage has accepted data from the master. Red lines indicate that the slave has accepted data from the pipeline stage and that the pipeline stage is ready to receive more data from the master. At T2, T4 and T6 the data from the master is accepted and forwarded to the slave by asserting the valid_slave signal and putting the data on the data_slave signals. The pipeline stage must now deassert the ready_master signal to indicate that it cannot accept more data at the moment, because it does not know whether the slave can accept the incoming transfer or not. At T3, T5 and T7 the slave accepts the transfer because both valid_slave and ready_slave are asserted. The pipeline stage can now assert ready_master to indicate that it is ready to accept more data.

[Figure 4.6 is the corresponding timing diagram with double buffering, including the internal buffer signal.]

Figure 4.6: The handshake for a pipeline stage with double buffering. Blue lines indicate that the pipeline stage has accepted data from the master. Red lines indicate that the slave has accepted data from the pipeline stage. At T2 the transfer from the master is accepted and forwarded to the slave by asserting valid_slave and putting the data on the data_slave signals. This pipeline stage uses double buffering, meaning that it has an internal buffer where it can store the data from one transfer. So even though it does not know whether the slave can accept the incoming data in the next clock cycle, it can accept one more transfer from the master. At T3 and T4 the pipeline stage accepts another transfer from the master and must decide what to do with the data. If the slave accepted the previous transfer, the data can be forwarded to the slave (T3 in the figure). If the slave did not accept the previous transfer, the data is stored in the internal buffer and the pipeline stage deasserts ready_master (T4 in the figure), since the internal buffer is now full. At T5 the slave accepts the second transfer and the pipeline stage forwards the buffered data to the slave. The pipeline stage now asserts ready_master to indicate that it can receive transfers again, since the buffer is empty.

Handshake Unit

The handshake unit handles the flow control for the double buffering (fig 4.7). All bus components have at least one handshake unit to control the double buffering. The handshake unit has two inputs and five outputs:

- valid_m (input) - The valid signal from the source.
- ready_s (input) - The ready signal from the destination.
- ready_m (output) - The ready signal from the handshake unit to the master, indicating that it is ready to receive data.
- valid_s (output) - The valid signal from the handshake unit to the slave, indicating that the output data is valid.
- enable_buffer (output) - When this signal is active, the incoming data from the master (data_m) should be written to an internal buffer.
- enable_out (output) - When this signal is active, data should be written to the output signals (data_s).
- ctrl (output) - Indicates whether buffered data or data from the master should be written to the output signals.

The handshake unit is implemented as a state machine (fig 4.8).

Figure 4.7: A handshake unit used in a pipeline stage.
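One plausible form of the state machine in figure 4.8 can be sketched as a three-state transition function. The state names EMPTY/HALF/FULL and the exact output encoding are assumptions; the thesis figure may use different names:

```python
def handshake(state, valid_m, ready_s):
    """One clock edge of a hypothetical handshake FSM.
    EMPTY: no data held; HALF: output register holds one transfer;
    FULL: output register and internal buffer both hold a transfer.
    Returns (next_state, outputs)."""
    out = dict(ready_m=state != "FULL",   # cannot accept more when both slots are full
               valid_s=state != "EMPTY",  # output data is valid when something is held
               enable_buffer=False, enable_out=False, ctrl="master")
    if state == "EMPTY":
        nxt = "HALF" if valid_m else "EMPTY"
        out["enable_out"] = valid_m       # latch incoming data into the output register
    elif state == "HALF":
        if valid_m and not ready_s:       # new data but slave stalls: use the buffer
            nxt, out["enable_buffer"] = "FULL", True
        elif valid_m and ready_s:         # output drains and refills in the same cycle
            nxt, out["enable_out"] = "HALF", True
        elif ready_s:                     # output drains, nothing new arrives
            nxt = "EMPTY"
        else:
            nxt = "HALF"
    else:  # FULL
        if ready_s:                       # output drains; refill it from the buffer
            nxt, out["enable_out"], out["ctrl"] = "HALF", True, "buffer"
        else:
            nxt = "FULL"
    return nxt, out
```

Note how ready_m is deasserted only in FULL, which is exactly what makes full throughput possible in figure 4.6.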

Figure 4.8: Handshake unit state machine handling flow control.

Pipeline Stage

The pipeline stage is used to register interconnects, especially to break a long path into two shorter paths. Each channel can be pipelined individually and a configurable number of registers can be inserted in each channel. Figure 4.7 shows the block diagram of a pipeline stage for one channel. The only differences between the pipeline stages in the different channels are the names, numbers and sizes of the interconnect signals that are being registered. This component is similar to a fully registered ARM BP130 [1], which is used to register AXI interconnects.

Arbiter

The arbiter combines multiple masters into one master interface (fig 4.9). The arbiter is divided into one read arbiter and one write arbiter, since reads and writes are independent of each other.

Figure 4.9: An arbiter combining two masters into one master interface.

Read Arbiter

The read arbiter controls the read address and read data channels of the arbiter. The read address and the read data are independent of each other in the read arbiter. A round-robin priority system is used for read addresses to avoid starvation. When more than one master tries to issue a read address, the one with the highest priority is accepted and forwarded to the slave. When the slave accepts a read address the priorities are rotated one step. No priority is needed for the read data, since read data only comes from one source. The transaction ID in the read data is used to route the data to the correct master. Double buffering is used for the read data to maintain full throughput. A block diagram of the read data part of the read arbiter is shown in figure 4.10.
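The round-robin scheme for read addresses can be sketched as follows; the scan-from-priority grant and the rotate-on-accept step are taken from the description above, while the function names and representation are illustrative:

```python
def round_robin_grant(requests, priority):
    """Grant the requesting master closest to `priority` (inclusive),
    scanning upward with wrap-around. `requests` is a list of booleans,
    one per master. Returns the granted index, or None if nobody requests."""
    n = len(requests)
    for i in range(n):
        idx = (priority + i) % n
        if requests[idx]:
            return idx
    return None

def rotate(priority, n):
    """When the slave accepts a read address, rotate the priority one step
    so that no master can be starved."""
    return (priority + 1) % n
```

With three masters and priority at 0, a request from masters 1 and 2 grants master 1; after the slave accepts, the priority moves to 1.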

Figure 4.10: Block diagram for the read data part of the arbiter.

Write Arbiter

The write arbiter controls the write address, write data and write response channels of the arbiter. The write address and write data channels are dependent on each other. The write arbiter cannot accept any write data until it has received a write address, and a write address cannot be accepted until the last write data from the previous transaction has been accepted by the slave. Write data is only accepted from the master that most recently got its write address accepted; this is because the write data does not contain any routing information. The write response is independent of the other channels. Double buffering is used in the write data and write response channels. The state machine controlling the write arbiter is shown in figure 4.11.

If there are multiple valid write addresses available at the same time, the one with the highest priority is accepted and forwarded. The priority changes every time the last write data is accepted by the slave. The priority system is similar to the one used for the read address. There is only one possible destination for write addresses and write data, so no address decoding needs to be done in the arbiter. The block diagram for the write data is shown in figure 4.12. The transaction ID in the write response is used to determine which master to route the response to. The block diagram for the write response is identical to the one for the read data in the read arbiter (see figure 4.10).

Figure 4.11: State machine for the write arbiter. The write arbiter can only accept incoming write data in the WRITING state and incoming write addresses in the IDLE state.
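The IDLE/WRITING behaviour described in figure 4.11 can be condensed to a two-state transition function. This is a sketch following the caption's description; signal names and the exact trigger conditions are assumptions:

```python
def write_arbiter_step(state, aw_valids, w_last_accepted):
    """One control step of a two-state write arbiter.
    `aw_valids` lists which masters present a valid write address;
    `w_last_accepted` is true when the slave takes the last data beat.
    Returns (next_state, aw_accepted)."""
    if state == "IDLE":
        # a write address may only be accepted in IDLE
        return ("WRITING", True) if any(aw_valids) else ("IDLE", False)
    # WRITING: locked to the granted master until its last write data beat,
    # since write data carries no routing information
    return ("IDLE", False) if w_last_accepted else ("WRITING", False)
```

The lock until the last beat is precisely what later makes the cross-block deadlock in section 4.4 possible.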

Figure 4.12: Block diagram for the write data part of the arbiter.

Decoder

The decoder combines multiple slaves into a single slave interface (fig 4.13). The decoder is divided into one read decoder and one write decoder, since reads and writes are independent of each other.

Figure 4.13: A decoder combining two slaves into one slave interface.

Read Decoder

The read decoder controls the read address and read data channels. The read address and read data channels are independent of each other in the decoder. The read data channel uses double buffering. Before the read decoder can accept an incoming read address it must first ensure that:

- there are no unfinished read transactions with the same transaction ID as the incoming read address
- the decoder is not already sending a read address to the destination slave that the slave has not yet accepted.

If these conditions are met, the incoming read address is accepted and forwarded to the destination slave. If not, the decoder waits until they are. The destination slave is determined by decoding the address. Round-robin priority is used for the read data since there are multiple sources. The priority shifts every time the decoder accepts the last read data from a slave. When the decoder accepts the first read data transfer from a slave, it only accepts read data transfers from this slave until the last read data has been accepted by the master. The state machine that controls the read data channel is shown in figure 4.14. A block diagram of the read data part of the decoder is shown in figure 4.15.
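The two acceptance conditions can be expressed as a small predicate. The data structures (a set of outstanding IDs and a per-slave pending flag) are illustrative, not the thesis implementation:

```python
def can_accept_read_address(ar_id, dest_slave, outstanding_ids, pending_ar):
    """Sketch of the read decoder's acceptance check.
    `outstanding_ids` is the set of transaction IDs with unfinished reads;
    `pending_ar` maps each slave to True while a read address has been
    forwarded to it but not yet accepted."""
    same_id_in_flight = ar_id in outstanding_ids        # condition 1
    slave_busy = pending_ar.get(dest_slave, False)      # condition 2
    return not same_id_in_flight and not slave_busy
```

The first condition preserves AXI read-data ordering per transaction ID; the second avoids queuing a second unaccepted address toward the same slave.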

Figure 4.14: State machine for the read data in the decoder.

Figure 4.15: Block diagram for the read data in the decoder. The currently_reading register is set by the state machine in figure 4.14.

Write Decoder

The write decoder controls the write address, write data and write response channels. The write response channel works independently of the other channels. The write address and write data channels are dependent on each other: the decoder cannot accept any write data before it has accepted the corresponding write address, because without the write address it is impossible to know where to route the write data. The address in the write address is used to route the write address and write data transfers. The decoder cannot accept a write address until the previous write transaction has completed. Double buffering is used only in the write data channel. Figure 4.16 shows the state machine controlling the write decoder. The block diagram for the write data is shown in figure 4.17. A priority system is used in the write response channel, since the other channels only have one source. The priority system is the same round-robin system as in the read decoder.

Figure 4.16: State machine for the write decoder. The decoder can only accept write addresses in the IDLE state.

Figure 4.17: Block diagram for the write data part of the decoder.

Bus Bridge

The bus bridge handles the transactions that cross a processing block border. The bridge is divided into two units, one handling transactions from the processing block to the external bus and one handling transactions from the external bus to the processing block. Both units rewrite the address and/or the ID fields of the transactions they are handling.

Bridge from Processing Block to External Bus

When a transaction crosses the bridge from the processing block to the external bus the address must be changed, because the processing block does not have the same address map as the external bus. The upper 4 bits of the address are left shifted one step to obtain the new address that is valid on the external bus. The address maps are described earlier in the report.

For write address and read address transfers the transaction ID must be rewritten. The ID of the master inside the processing block that started the transaction must be saved. This ID is pushed to the routing stack, which is a part of the transaction ID. The ID of the processing block must also be added to the transaction ID if the responses are to be routed correctly: the two least significant bits are overwritten with the ID of the processing block. Figure 4.18 shows how the transaction ID field is changed when the transaction crosses a bus bridge from the processing block to the external bus.

The transaction IDs of read data and write responses that are crossing the bridge from the processing block to the external bus must be rewritten to ensure correct routing. When the response reaches the bus bridge, the ID of the master that started the transaction is popped from the stack and written to the two least significant bits of the transaction ID (fig 4.19).

[Figure 4.18 diagram: the ID fields o_id, m_i,id and b_e,id inside the processing block and on the external bus, before and after the bridge.]

Figure 4.18: Transaction ID changes for write and read addresses from a processing block to the external bus. m_i,id is the ID of the master inside the processing block that started the transaction. b_e,id is the index of the bridge on the external bus (which is the same as the index of the processing block). o_id is used for transaction ordering.
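The routing-stack manipulation can be sketched with bit operations. The two-bit entry size comes from the text; the total ID field width is an assumption, and for simplicity this sketch shifts the whole ID field, whereas in the real bridge the ordering bits o_id presumably stay in place:

```python
ID_BITS = 2  # one routing-stack entry occupies the two least significant bits

def push_routing_id(txid, new_id, field_bits=8):
    """Shift the routing stack up one entry and write `new_id` into the
    two LSBs (as when an address transfer crosses the bridge). The
    8-bit field width is an assumption."""
    mask = (1 << field_bits) - 1
    return ((txid << ID_BITS) | (new_id & 0b11)) & mask

def pop_routing_id(txid):
    """Recover the response's target ID from the two LSBs and the
    remaining stack contents (as when a response crosses back)."""
    return txid >> ID_BITS, txid & 0b11
```

A push followed by a pop restores the original ID, which is what routes read data and write responses back to the originating master.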

[Figure 4.19 diagram: the ID fields o_id, s_0, m_e,id and b_i,id inside the processing block and on the external bus, before and after the bridge.]

Figure 4.19: Transaction IDs for read data and write responses from the processing block to the external bus. m_e,id is the index of the master on the external bus that started the transaction, which is the target for the response. b_i,id is the index of the bridge inside the processing block. o_id is used for transaction ordering. s_0 is an ID on the stack.

Bridge from External Bus to Processing Block

The address of transactions that cross the bridge must be changed, since the address map is different inside and outside of the processing blocks. The four upper bits of the address are set to zero, so the only slave inside the processing block that can be accessed is the memory.

For write and read addresses the transaction ID must be rewritten. The ID of the master from the external bus that started the transaction is pushed to the routing stack, and the two least significant bits of the transaction ID are set to the ID of the bridge inside the processing block (fig 4.20). This way the responses of the transaction are routed to the bus bridge.

The transaction IDs for read data and write responses must be rewritten when they are crossing the bridge. The ID of the master inside the processing block (the target for the response) is popped from the stack and written to the two least significant bits of the transaction ID (fig 4.21).

[Figure 4.20 diagram: the ID fields o_id, s_0, m_e,id and b_i,id on the external bus and inside the processing block, before and after the bridge.]

Figure 4.20: Transaction ID changes for write and read addresses from the external bus to a processing block. m_e,id is the ID of the master on the external bus that started the transaction. b_i,id is the index of the bridge inside the processing block. o_id is used for transaction ordering. s_0 is an ID on the routing stack.
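The address rewriting in the two directions can be sketched as follows. A 32-bit address width is an assumption; the thesis only states that the upper four bits are rewritten:

```python
ADDR_BITS = 32  # assumed address width

def pb_to_ext_addr(addr):
    """Leaving a processing block: the upper four address bits are
    shifted left one step to form the external-bus address."""
    upper = (addr >> 28) & 0xF
    return ((upper << 1) & 0xF) << 28 | (addr & 0x0FFFFFFF)

def ext_to_pb_addr(addr):
    """Entering a processing block: the four upper bits are cleared,
    so only the local memory can be addressed."""
    return addr & 0x0FFFFFFF
```

For example, an internal address with upper nibble 0x3 maps to an external address with upper nibble 0x6, while any external address entering a block lands in the memory's region at nibble 0x0.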

[Figure 4.21 diagram: the ID fields o_id, m_i,id and b_e,id on the external bus and inside the processing block, before and after the bridge.]

Figure 4.21: Transaction IDs for read data and write responses from the external bus to a processing block. m_i,id is the index of the master inside the processing block that started the transaction, which is the target for the response. b_e,id is the index of the bridge on the external bus (which is the same as the index of the processing block). o_id is used for transaction ordering.

4.4 Problems

This section discusses the main problem encountered in the implementation and how it has been handled.

Deadlock

There has been only one problem major enough to change the architecture of the bus system: deadlock. The only difference between the initial architecture and the final one is that an additional arbiter has been added inside the processing blocks. The initial architecture of the system is shown in figure A.5 on page 89.

Deadlock could occur when one master in each of the processing blocks tried to write to a slave in the other processing block. It could only occur in write transactions, since the arbiters and decoders only accept write data from the last master that got its write address accepted; the arbiter or decoder is therefore locked to one master for the entire transaction. The cause of the deadlock was that both arbiters in the processing blocks accepted the write address of their local sending master and would therefore only accept write data from that master. The destination of each write transaction is a slave in the other processing block, so it must pass the arbiter in that processing block. But that arbiter cannot accept the write address, since it has already accepted a write address from the master in its own processing block. Neither transaction will finish, since both transactions are waiting for a unit that is locked by the other transaction. Figure 4.22 shows a simplified system with the same deadlock problem.
The problem was solved by adding another arbiter inside the processing blocks. This arbiter ensures that transactions from outside the processing block do not pass through the same arbiter as the masters inside the processing block use to access units outside of the processing block. This solution maintains the connectivity of the original architecture but increases the latency by one clock cycle, because the path from master to slave now contains one more synchronous unit. Figure 4.23 shows a simplified system with the same fix to the deadlock problem.

Figure 4.22: A simplified system showing how deadlock can occur. Blue components are handling a write transaction issued by master M0,0. Red components are handling a write transaction issued by master M1,0.

Figure 4.23: A simplified system showing how to solve the deadlock problem. Blue components are handling a write transaction issued by master M0,0. Red components are handling a write transaction issued by master M1,0.

4.5 Tools and Scripts

This section describes the tools and scripts that have been used throughout the project.

Python

Python is a general purpose, high level programming language. All the scripts used in this project to build test benches, analyze logs and verify the functionality of the system are written in Python.

Cog

Cog is a code generation tool written in Python. Cog transforms files by finding embedded Python code in them, executing that code and inserting the output into the file. In this project Cog is used to efficiently rebuild test benches when the architecture of the system changes.

(System)Verilog

Verilog is a hardware description language used to model electronic systems. Verilog can be used for both simulation and synthesis of hardware. SystemVerilog is based on extensions to Verilog that allow for a higher level of abstraction for modeling and verification. All the components in the bus system in this project are written mostly in Verilog with some features from SystemVerilog. The test benches used to verify the functionality of the bus system are written in SystemVerilog.

VCS

VCS is a functional verification solution for SoC designs from Synopsys. VCS can be used for simulation, functional verification, coverage analysis and assertions. VCS has support for Verilog, SystemVerilog and SystemC. In this project VCS has been used to simulate and debug the bus system.

DesignCompiler

DesignCompiler is a tool from Synopsys that synthesizes HDL designs (in this case written in SystemVerilog) into optimized, technology dependent netlists. DesignCompiler also gives estimates for the timing, power consumption and area of the design.

Scripts

Various scripts have been written to simplify and automate tasks throughout the project. All of these are written in the Python programming language.
Listed below are the tasks that have been simplified or automated:

- Generate the files, read by the masters in the bus system, that control how many transactions are issued and to what destinations.

- Generate bus systems with different component configurations.
- Generate test benches with specific configurations.
- Analyze logs written by the master and slave components to assert that all transactions completed successfully.
- Analyze logs to measure the performance of the bus system.
- Generate plots.


5 Results and Discussion

This chapter presents the results of the project. All results are based on simulations of the bus system. The whole bus system has been simulated to verify the functionality and the performance of the system. The goal of the functionality testing is to verify that the bus system works as specified. The performance simulations have been made to see how well the bus system handles different workloads.

This chapter contains four sections. The first presents the results from the functionality testing. The second section discusses performance testing and results. The third section presents the area and the timing of the bus system. The last section discusses the results.

5.1 Functionality

The functionality of the bus system has been verified by running simulations of the whole system with a modified version of ARM's AXI4 protocol assertions [2] on all endpoints of the system. The protocol assertions only show whether the protocol is being followed or not; they do not guarantee that a transaction always reaches its destination. In the simulations the masters and slaves of the system write a log entry every time a transfer is sent or received. These logs are then analyzed by a script to make sure that all sent transfers reached their intended destination. The bus system has been simulated more than a thousand times, and each simulation consists of around 2500 transfers, for a total of more than 2.5 million simulated transfers without any protocol violations or failed transfers.
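The core of such a log comparison can be sketched in a few lines of Python. The (id, destination, data) record format is hypothetical; the project's real log format is not shown in the report:

```python
from collections import Counter

def verify_logs(sent, received):
    """Compare master-side and slave-side logs: every transfer that was
    sent must also appear at its destination (duplicates counted).
    Returns (ok, missing_records)."""
    missing = Counter(sent) - Counter(received)
    return not missing, dict(missing)

ok, missing = verify_logs(
    sent=[(1, "mem0", 0xAB), (2, "acc0", 0xCD)],
    received=[(2, "acc0", 0xCD), (1, "mem0", 0xAB)],  # order does not matter
)
```

Using a multiset difference rather than a set difference means repeated identical transfers are also checked, which matters when the same master writes the same data twice.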

The synthesis tool DesignCompiler [13] has been used to verify that the bus system is synthesizable. No actual masters or slaves have been designed in this project, so synthesis results from just the bus system could be misleading.

5.2 Performance

The performance of the bus system has been measured by simulating the system and analyzing the logs written by the masters and the slaves. Different measures of performance have been used: how often transactions can be issued in the best case, how often a unit stalls, how long a transfer takes to reach its destination, how the bus system handles congestion and how much the different buses are utilized.

In the best case one read transaction can be issued every second clock cycle (see fig A.6 on page 90). Note that this is only possible if no other read transaction with the same transaction ID is being handled at the same time. A write transaction can be issued every third clock cycle if the transaction only consists of one transfer (see fig A.7 on page 92). This is only possible if the write channels are unused. Within a transaction it is possible to transmit data every clock cycle, both for reads and writes (see figures A.8 & A.9 on page 93).

Different access and timing patterns have been simulated to see how the system performs under various types of loads. The access and timing patterns affect where transactions are headed, how many transfers there are per transaction, how often new transactions are issued, how many transactions each master issues and the distribution of read and write transactions. Below is a list of all the simulated patterns and how they affect the transactions:

- normal - The CPU mainly issues transactions to the local memory, some transactions are issued to the local accelerator and even fewer to the external memories. Almost all the transactions issued by the CPU consist of only one transfer (one word, 32 bits). The DMA issues most of its transactions to the local accelerator or the external memories. The transactions from the DMA usually consist of 4, 8, 12 or 16 transfers. Each master waits between 0 and 6 clock cycles before issuing a new transaction. The CPU issues 200 transactions, the DMA 40 and the debug unit is unused. Half of the transactions are reads and the rest are writes.
- random - All masters issue transactions to all the slaves with equal probability. Each transaction consists of 1 to 16 transfers, with a uniform distribution. Each master waits between 0 and 6 clock cycles before issuing a new transaction. Each master issues 80 transactions. Half of the transactions are reads and the rest are writes.
- internal - This pattern is the same as the normal pattern except that all transactions are issued to slaves within the same processing block as the master.
- external - This pattern is the same as the normal pattern except that all transactions are issued to the external memories or the IO unit.
- one_slave - This pattern is the same as the normal pattern except that all transactions are issued to the memory in processing block 0. This pattern is used to show how the bus system handles congestion.
- cpu_only - This pattern is the same as the normal pattern except that only the CPUs issue transactions.
- one_slave_burst - This pattern is the same as one_slave except that the masters issue transactions as often as they can. This pattern is used to show how the bus system handles congestion.
- normal_burst - This pattern is the same as normal except that the masters issue transactions as often as they can.

To measure the performance of the system, the stall time and transfer time have been calculated from the logs. Stall time is how long a master or slave must wait after it has put valid data on the bus until that data is accepted by the bus system, freeing the sending unit to send more data. If the data is accepted in the next clock cycle the stall time is 0; if the data is accepted one clock cycle later the stall time is one clock cycle. Transfer time is the time between when a master or slave puts valid data on the bus and when that data is accepted by the intended destination. The time for a whole transaction has not been measured, but it can be estimated from the transfer times: the first read data arrives at the master after about two times the average transfer time, and the first write data arrives at the slave about the average transfer time after the transaction has been issued by the master. Figure 5.1 shows the average number of clock cycles a unit stalled per transfer and the average number of clock cycles to complete a transfer.
To measure how the bus system reacts to congestion, the stall time has been measured on the write address channel while increasing the fraction of write transactions from 0 (no write transactions) to 1 (only write transactions). In these simulations all the masters write to the same slave as often as they can, to create maximum congestion. The bus utilization on the write channel has been measured to quantify the congestion. Figure 5.2 shows the bus utilization and stall time of the bus system. The write address channel was used to measure the stall time because in that channel every transfer is subject to stalling; in the write data channel generally only the first transfer of a transaction can stall. The write data channel was used to measure the bus utilization, since the actual data is transferred on that channel. The bus utilization for each channel and for each bus is shown in figure 5.3. The normal access and timing pattern was used to generate this data.
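The stall-time and transfer-time definitions above translate directly into a small log-analysis sketch. The (valid_cycle, accepted_cycle, delivered_cycle) record format is hypothetical:

```python
def analyze(log):
    """Average stall time and transfer time, in clock cycles, over a list
    of (valid_cycle, accepted_cycle, delivered_cycle) records.
    Stall time is 0 when the bus takes the data the cycle after VALID is
    raised; transfer time runs from VALID until the destination accepts."""
    stalls = [acc - valid - 1 for valid, acc, _ in log]
    transfers = [done - valid for valid, _, done in log]
    return sum(stalls) / len(log), sum(transfers) / len(log)

# a transfer accepted immediately (stall 0) and one stalled two cycles
avg_stall, avg_transfer = analyze([(0, 1, 4), (2, 5, 8)])
```

From these per-transfer times a whole-transaction estimate follows as in the text: roughly two average transfer times until the first read data returns.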

86 72 5 Results and Discussion Figure 5.1: The average number of clock cycles stalled and the number of clock cycles the average transfer takes to complete. The x-axis shows different settings used to generate the transactions.

87 5.2 Performance 73 Figure 5.2: Bus utilization and stall time of write channel vs fraction of transactions that are writes. Figure 5.3: Bus utilization for the different channels and the different buses in the system.


More information

Flit Synchronous Aelite Network on Chip. Mahesh Balaji Subburaman

Flit Synchronous Aelite Network on Chip. Mahesh Balaji Subburaman Flit Synchronous Aelite Network on Chip Examensarbete utfört i Elektroniksystem vid Tekniska högskolan i Linköping av Mahesh Balaji Subburaman LiTH - ISY - EX -- 08 / 4198 -- SE Linköping 2008 Flit Synchronous

More information

Embedded Systems: Hardware Components (part II) Todor Stefanov

Embedded Systems: Hardware Components (part II) Todor Stefanov Embedded Systems: Hardware Components (part II) Todor Stefanov Leiden Embedded Research Center, Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded

More information

Chapter 6 Storage and Other I/O Topics

Chapter 6 Storage and Other I/O Topics Department of Electr rical Eng ineering, Chapter 6 Storage and Other I/O Topics 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Feng-Chia Unive ersity Outline 6.1 Introduction 6.2 Dependability,

More information

2. System Interconnect Fabric for Memory-Mapped Interfaces

2. System Interconnect Fabric for Memory-Mapped Interfaces 2. System Interconnect Fabric for Memory-Mapped Interfaces QII54003-8.1.0 Introduction The system interconnect fabric for memory-mapped interfaces is a high-bandwidth interconnect structure for connecting

More information

VLSI Design of Multichannel AMBA AHB

VLSI Design of Multichannel AMBA AHB RESEARCH ARTICLE OPEN ACCESS VLSI Design of Multichannel AMBA AHB Shraddha Divekar,Archana Tiwari M-Tech, Department Of Electronics, Assistant professor, Department Of Electronics RKNEC Nagpur,RKNEC Nagpur

More information

Verification of AMBA AXI4 Protocol Using UVM

Verification of AMBA AXI4 Protocol Using UVM Verification of AMBA AXI4 Protocol Using UVM G Sai Divya 1, K. Niranjan Reddy 2 1 M-Tech Scholar, Department of ECE, Malla Reddy Engineering College for Women, Hyderabad 2 Assistant Professor, Department

More information

Design of an AMBA AHB Reconfigurable Arbiter for On-chip Bus Architecture

Design of an AMBA AHB Reconfigurable Arbiter for On-chip Bus Architecture Design of an AMBA AHB Reconfigurable Arbiter for On-chip Bus Architecture Pravin S. Shete 1, Dr. Shruti Oza 2 1 Research Fellow, Electronics Department, BVDU College of Engineering, Pune, India. 2 Department

More information

AMBA Protocol for ALU

AMBA Protocol for ALU International Journal of Emerging Engineering Research and Technology Volume 2, Issue 5, August 2014, PP 51-59 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) AMBA Protocol for ALU K Swetha Student, Dept

More information

Bus AMBA. Advanced Microcontroller Bus Architecture (AMBA)

Bus AMBA. Advanced Microcontroller Bus Architecture (AMBA) Bus AMBA Advanced Microcontroller Bus Architecture (AMBA) Rene.beuchat@epfl.ch Rene.beuchat@hesge.ch Réf: AMBA Specification (Rev 2.0) www.arm.com ARM IHI 0011A 1 What to see AMBA system architecture Derivatives

More information

Chapter 3. Top Level View of Computer Function and Interconnection. Yonsei University

Chapter 3. Top Level View of Computer Function and Interconnection. Yonsei University Chapter 3 Top Level View of Computer Function and Interconnection Contents Computer Components Computer Function Interconnection Structures Bus Interconnection PCI 3-2 Program Concept Computer components

More information

ΗΥ220 Εργαστήριο Ψηφιακών Κυκλωμάτων

ΗΥ220 Εργαστήριο Ψηφιακών Κυκλωμάτων ΗΥ220 Εργαστήριο Ψηφιακών Κυκλωμάτων Χειμερινό Εξάμηνο 2017-2018 Interconnects: AXI Protocol ΗΥ220 - Γιώργος Καλοκαιρινός & Βασίλης Παπαευσταθίου 1 AXI AMBA AXI protocol is targeted at high-performance,

More information

DESIGN AND VERIFICATION ANALYSIS OF APB3 PROTOCOL WITH COVERAGE

DESIGN AND VERIFICATION ANALYSIS OF APB3 PROTOCOL WITH COVERAGE DESIGN AND VERIFICATION ANALYSIS OF APB3 PROTOCOL WITH COVERAGE Akhilesh Kumar and Richa Sinha Department of E&C Engineering, NIT Jamshedpur, Jharkhand, India ABSTRACT Today in the era of modern technology

More information

DEVELOPMENT AND VERIFICATION OF AHB2APB BRIDGE PROTOCOL USING UVM TECHNIQUE

DEVELOPMENT AND VERIFICATION OF AHB2APB BRIDGE PROTOCOL USING UVM TECHNIQUE DEVELOPMENT AND VERIFICATION OF AHB2APB BRIDGE PROTOCOL USING UVM TECHNIQUE N.G.N.PRASAD Assistant Professor K.I.E.T College, Korangi Abstract: The AMBA AHB is for high-performance, high clock frequency

More information

AMBA 3 AXI. Protocol Checker. User Guide. r0p1. Copyright 2005, 2006, 2009 ARM. All rights reserved. ARM DUI 0305C (ID071309)

AMBA 3 AXI. Protocol Checker. User Guide. r0p1. Copyright 2005, 2006, 2009 ARM. All rights reserved. ARM DUI 0305C (ID071309) AMBA 3 AXI Protocol Checker r0p1 User Guide Copyright 2005, 2006, 2009 ARM. All rights reserved. ARM DUI 0305C () AMBA 3 AXI Protocol Checker User Guide Copyright 2005, 2006, 2009 ARM. All rights reserved.

More information

Midterm Exam. Solutions

Midterm Exam. Solutions Midterm Exam Solutions Problem 1 List at least 3 advantages of implementing selected portions of a design in hardware, and at least 3 advantages of implementing the remaining portions of the design in

More information

Institutionen för systemteknik Department of Electrical Engineering

Institutionen för systemteknik Department of Electrical Engineering Institutionen för systemteknik Department of Electrical Engineering Examensarbete Automatic Parallel Memory Address Generation for Parallel DSP Computing Master thesis performed in Computer Engineering

More information

Design and Implementation of AMBA AXI to AHB Bridge K. Lakshmi Shirisha 1 A.Ramkumar 2

Design and Implementation of AMBA AXI to AHB Bridge K. Lakshmi Shirisha 1 A.Ramkumar 2 IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 01, 2015 ISSN (online): 2321-0613 K. Lakshmi Shirisha 1 A.Ramkumar 2 2 Assistant Professor 1,2 Department of Electronic

More information

SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik

SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik SoC Design Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik Chapter 5 On-Chip Communication Outline 1. Introduction 2. Shared media 3. Switched media 4. Network on

More information

Interfacing. Introduction. Introduction Addressing Interrupt DMA Arbitration Advanced communication architectures. Vahid, Givargis

Interfacing. Introduction. Introduction Addressing Interrupt DMA Arbitration Advanced communication architectures. Vahid, Givargis Interfacing Introduction Addressing Interrupt DMA Arbitration Advanced communication architectures Vahid, Givargis Introduction Embedded system functionality aspects Processing Transformation of data Implemented

More information

AXI and OCP protocol Interface for Sytem on Chip

AXI and OCP protocol Interface for Sytem on Chip AXI and OCP protocol Interface for Sytem on Chip Ms. Monica Damor 1, Mr Gardas Naresh Kumar 2, Mr. Santosh Jagtap 3 1 Research Scholar, GTU PG School,Gujarat,India 2 Course Co-Ordinator, CDAC ACTS, Maharashtra,

More information

NoC Generic Scoreboard VIP by François Cerisier and Mathieu Maisonneuve, Test and Verification Solutions

NoC Generic Scoreboard VIP by François Cerisier and Mathieu Maisonneuve, Test and Verification Solutions NoC Generic Scoreboard VIP by François Cerisier and Mathieu Maisonneuve, Test and Verification Solutions Abstract The increase of SoC complexity with more cores, IPs and other subsystems has led SoC architects

More information

Analyzing and Debugging Performance Issues with Advanced ARM CoreLink System IP Components

Analyzing and Debugging Performance Issues with Advanced ARM CoreLink System IP Components Analyzing and Debugging Performance Issues with Advanced ARM CoreLink System IP Components By William Orme, Strategic Marketing Manager, ARM Ltd. and Nick Heaton, Senior Solutions Architect, Cadence Finding

More information

Institutionen för datavetenskap Department of Computer and Information Science

Institutionen för datavetenskap Department of Computer and Information Science Institutionen för datavetenskap Department of Computer and Information Science Final thesis Load management for a telecom charging system by Johan Bjerre LIU-IDA/LITH-EX-A--08/043--SE 2008-10-13 1 Linköpings

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Low Cost Floating-Point Extensions to a Fixed-Point SIMD Datapath Examensarbete utfört i Datorteknik vid Tekniska högskolan

More information

Design of a System-on-Chip Switched Network and its Design Support Λ

Design of a System-on-Chip Switched Network and its Design Support Λ Design of a System-on-Chip Switched Network and its Design Support Λ Daniel Wiklund y, Dake Liu Dept. of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract As the degree of

More information

6 Direct Memory Access (DMA)

6 Direct Memory Access (DMA) 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 6 Direct Access (DMA) DMA technique is used to transfer large volumes of data between I/O interfaces and the memory. Example: Disk drive controllers,

More information

Hardware Design. MicroBlaze 7.1. This material exempt per Department of Commerce license exception TSU Xilinx, Inc. All Rights Reserved

Hardware Design. MicroBlaze 7.1. This material exempt per Department of Commerce license exception TSU Xilinx, Inc. All Rights Reserved Hardware Design MicroBlaze 7.1 This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: List the MicroBlaze 7.1 Features List

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Automated Fault Tree Generation from Requirement Structures Examensarbete utfört i Fordonssystem vid Tekniska högskolan

More information

IMPROVES. Initial Investment is Low Compared to SoC Performance and Cost Benefits

IMPROVES. Initial Investment is Low Compared to SoC Performance and Cost Benefits NOC INTERCONNECT IMPROVES SOC ECONO CONOMICS Initial Investment is Low Compared to SoC Performance and Cost Benefits A s systems on chip (SoCs) have interconnect, along with its configuration, verification,

More information

Design and Implementation of AXI to AHB Bridge Based on AMBA 4.0

Design and Implementation of AXI to AHB Bridge Based on AMBA 4.0 Design and Implementation of AXI to AHB Bridge Based on AMBA 4.0 1 K. Lakshmi Shirisha & 2 A. Ramkumar 1,2 C R Reddy College of Engineering Email : 1 lakshmishirisha.69@gmail.com, 2 ramkumar434@gmail.com

More information

IMPLEMENTATION OF LOW POWER INTERFACE FOR VERIFICATION IP (VIP) OF AXI4 PROTOCOL

IMPLEMENTATION OF LOW POWER INTERFACE FOR VERIFICATION IP (VIP) OF AXI4 PROTOCOL e-issn 2455 1392 Volume 2 Issue 8, August 2016 pp. 1 8 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com IMPLEMENTATION OF LOW POWER INTERFACE FOR VERIFICATION IP (VIP) OF AXI4 PROTOCOL Bhavana

More information

A High Performance Bus Communication Architecture through Bus Splitting

A High Performance Bus Communication Architecture through Bus Splitting A High Performance Communication Architecture through Splitting Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University,West Lafayette, IN, 797, USA {lur, chengkok}@ecn.purdue.edu

More information

08 - Address Generator Unit (AGU)

08 - Address Generator Unit (AGU) October 2, 2014 Todays lecture Memory subsystem Address Generator Unit (AGU) Schedule change A new lecture has been entered into the schedule (to compensate for the lost lecture last week) Memory subsystem

More information

An Efficient AXI Read and Write Channel for Memory Interface in System-on-Chip

An Efficient AXI Read and Write Channel for Memory Interface in System-on-Chip An Efficient AXI Read and Write Channel for Memory Interface in System-on-Chip Abhinav Tiwari M. Tech. Scholar, Embedded System and VLSI Design Acropolis Institute of Technology and Research, Indore (India)

More information

Bus Interfaces and Standards. Zeljko Zilic

Bus Interfaces and Standards. Zeljko Zilic Bus Interfaces and Standards Zeljko Zilic Overview Principles of Digital System Interconnect Modern bus Standards: PCI, AMBA, USB Scalable Interconnect: Infiniband Intellectual Property (IP) Reuse Reusable

More information

Hardware Design. University of Pannonia Dept. Of Electrical Engineering and Information Systems. MicroBlaze v.8.10 / v.8.20

Hardware Design. University of Pannonia Dept. Of Electrical Engineering and Information Systems. MicroBlaze v.8.10 / v.8.20 University of Pannonia Dept. Of Electrical Engineering and Information Systems Hardware Design MicroBlaze v.8.10 / v.8.20 Instructor: Zsolt Vörösházi, PhD. This material exempt per Department of Commerce

More information

Unit 3 and Unit 4: Chapter 4 INPUT/OUTPUT ORGANIZATION

Unit 3 and Unit 4: Chapter 4 INPUT/OUTPUT ORGANIZATION Unit 3 and Unit 4: Chapter 4 INPUT/OUTPUT ORGANIZATION Introduction A general purpose computer should have the ability to exchange information with a wide range of devices in varying environments. Computers

More information

VERIFICATION OF AHB PROTOCOL USING SYSTEM VERILOG ASSERTIONS

VERIFICATION OF AHB PROTOCOL USING SYSTEM VERILOG ASSERTIONS VERIFICATION OF AHB PROTOCOL USING SYSTEM VERILOG ASSERTIONS Nikhil B. Gaikwad 1, Vijay N. Patil 2 1 P.G. Student, Electronics & Telecommunication Department, Pimpri Chinchwad College of Engineering, Pune,

More information

Welcome to this presentation of the STM32 direct memory access controller (DMA). It covers the main features of this module, which is widely used to

Welcome to this presentation of the STM32 direct memory access controller (DMA). It covers the main features of this module, which is widely used to Welcome to this presentation of the STM32 direct memory access controller (DMA). It covers the main features of this module, which is widely used to handle the STM32 peripheral data transfers. 1 The Direct

More information

ISSN Vol.03, Issue.08, October-2015, Pages:

ISSN Vol.03, Issue.08, October-2015, Pages: ISSN 2322-0929 Vol.03, Issue.08, October-2015, Pages:1284-1288 www.ijvdcs.org An Overview of Advance Microcontroller Bus Architecture Relate on AHB Bridge K. VAMSI KRISHNA 1, K.AMARENDRA PRASAD 2 1 Research

More information

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,

More information

AHB-Lite Multilayer Interconnect IP. AHB-Lite Multilayer Interconnect IP User Guide Roa Logic, All rights reserved

AHB-Lite Multilayer Interconnect IP. AHB-Lite Multilayer Interconnect IP User Guide Roa Logic, All rights reserved 1 AHB-Lite Multilayer Interconnect IP User Guide 2 Introduction The Roa Logic AHB-Lite Multi-layer Interconnect is a fully parameterized soft IP High Performance, Low Latency Interconnect Fabric for AHB-Lite.

More information

1. INTRODUCTION OF AMBA

1. INTRODUCTION OF AMBA 1 1. INTRODUCTION OF AMBA 1.1 Overview of the AMBA specification The Advanced Microcontroller Bus Architecture (AMBA) specification defines an on chip communications standard for designing high-performance

More information

SONA: An On-Chip Network for Scalable Interconnection of AMBA-Based IPs*

SONA: An On-Chip Network for Scalable Interconnection of AMBA-Based IPs* SONA: An On-Chip Network for Scalable Interconnection of AMBA-Based IPs* Eui Bong Jung 1, Han Wook Cho 1, Neungsoo Park 2, and Yong Ho Song 1 1 College of Information and Communications, Hanyang University,

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

STANDARD I/O INTERFACES

STANDARD I/O INTERFACES STANDARD I/O INTERFACES The processor bus is the bus defied by the signals on the processor chip itself. Devices that require a very high-speed connection to the processor, such as the main memory, may

More information

AMBA 3 AHB Lite Bus Architecture

AMBA 3 AHB Lite Bus Architecture AMBA 3 AHB Lite Bus Architecture 1 Module Syllabus What is a Bus Bus Types ARM AMBA System Buses AMBA3 AHB-Lite Bus Bus Operation in General AHB Bus Components AHB Bus Signals AHB Bus Basic Timing AHB

More information

Copyright 2016 Xilinx

Copyright 2016 Xilinx Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building

More information

Pooja Kawale* et al ISSN: [IJESAT] [International Journal of Engineering Science & Advanced Technology] Volume-6, Issue-3,

Pooja Kawale* et al ISSN: [IJESAT] [International Journal of Engineering Science & Advanced Technology] Volume-6, Issue-3, Pooja Kawale* et al ISSN: 2250-3676 [IJESAT] [International Journal of Engineering Science & Advanced Technology] Volume-6, Issue-3, 161-165 Design of AMBA Based AHB2APB Bridge Ms. Pooja Kawale Student

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete EtherCAT Communication on FPGA Based Sensor System Examensarbete utfört i Elektroniksystem vid Tekniska Högskolan, Linköpings

More information

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Volker Lindenstruth; lindenstruth@computer.org The continued increase in Internet throughput and the emergence of broadband access networks

More information

EE108B Lecture 17 I/O Buses and Interfacing to CPU. Christos Kozyrakis Stanford University

EE108B Lecture 17 I/O Buses and Interfacing to CPU. Christos Kozyrakis Stanford University EE108B Lecture 17 I/O Buses and Interfacing to CPU Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements Remaining deliverables PA2.2. today HW4 on 3/13 Lab4 on 3/19

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

AMBA AHB Bus Protocol Checker

AMBA AHB Bus Protocol Checker AMBA AHB Bus Protocol Checker 1 Sidhartha Velpula, student, ECE Department, KL University, India, 2 Vivek Obilineni, student, ECE Department, KL University, India 3 Syed Inthiyaz, Asst.Professor, ECE Department,

More information

VERIFICATION OF DRIVER LOGIC USING AMBA- AXI UVM

VERIFICATION OF DRIVER LOGIC USING AMBA- AXI UVM VERIFICATION OF DRIVER LOGIC USING AMBA- AXI UVM Bijal Thakkar 1 and V Jayashree 2 1 Research Scholar, Electronics Dept., D.K.T.E. Society's Textile and Engineering Institute, Ichalkaranji, Maharashtra,

More information

Terrain Rendering using Multiple Optimally Adapting Meshes (MOAM)

Terrain Rendering using Multiple Optimally Adapting Meshes (MOAM) Examensarbete LITH-ITN-MT-EX--04/018--SE Terrain Rendering using Multiple Optimally Adapting Meshes (MOAM) Mårten Larsson 2004-02-23 Department of Science and Technology Linköpings Universitet SE-601 74

More information

Design of High Speed AMBA Advanced Peripheral Bus Master Data Transfer for Microcontroller

Design of High Speed AMBA Advanced Peripheral Bus Master Data Transfer for Microcontroller Design of High Speed AMBA Advanced Peripheral Bus Master Data Transfer for Microcontroller Ch.Krishnam Raju M.Tech (ES) Department of ECE Jogaiah Institute of Technology and Sciences, Kalagampudi, Palakol

More information

Design and Verification Point-to-Point Architecture of WISHBONE Bus for System-on-Chip

Design and Verification Point-to-Point Architecture of WISHBONE Bus for System-on-Chip International Journal of Emerging Engineering Research and Technology Volume 2, Issue 2, May 2014, PP 155-159 Design and Verification Point-to-Point Architecture of WISHBONE Bus for System-on-Chip Chandrala

More information

Introduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses

Introduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses Introduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses 1 Most of the integrated I/O subsystems are connected to the

More information

DESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER

DESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER G MAHESH BABU, et al, Volume 2, Issue 7, PP:, SEPTEMBER 2014. DESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER G.Mahesh Babu 1*, Prof. Ch.Srinivasa Kumar 2* 1. II. M.Tech (VLSI), Dept of ECE,

More information

FPGA memory performance

FPGA memory performance FPGA memory performance Sensor to Image GmbH Lechtorstrasse 20 D 86956 Schongau Website: www.sensor-to-image.de Email: email@sensor-to-image.de Sensor to Image GmbH Company Founded 1989 and privately owned

More information

Design And Implementation of Efficient FSM For AHB Master And Arbiter

Design And Implementation of Efficient FSM For AHB Master And Arbiter Design And Implementation of Efficient FSM For AHB Master And Arbiter K. Manikanta Sai Kishore, M.Tech Student, GITAM University, Hyderabad Mr. M. Naresh Kumar, M. Tech (JNTUK), Assistant Professor, GITAM

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Rugged Portable Communication System Examensarbete utfört i Computer Engineering vid Tekniska högskolan vid Linköpings

More information

VERIFICATION OF AMBA AXI BUS PROTOCOL IMPLEMENTING INCR AND WRAP BURST USING SYSTEM VERILOG

VERIFICATION OF AMBA AXI BUS PROTOCOL IMPLEMENTING INCR AND WRAP BURST USING SYSTEM VERILOG VERIFICATION OF AMBA AXI BUS PROTOCOL IMPLEMENTING INCR AND WRAP BURST USING SYSTEM VERILOG Harsha Garua 1, Keshav Sharma 2, Chusen Duari 3 1 Manipal University Jaipur 2 Manipal University Jaipur 3 Assistant

More information

Unit 1. Chapter 3 Top Level View of Computer Function and Interconnection

Unit 1. Chapter 3 Top Level View of Computer Function and Interconnection Unit 1 Chapter 3 Top Level View of Computer Function and Interconnection Program Concept Hardwired systems are inflexible General purpose hardware can do different tasks, given correct control signals

More information

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations 1 Design Issues, Optimizations When does memory get updated? demotion from modified to shared? move from modified in

More information

Cycle accurate transaction-driven simulation with multiple processor simulators

Cycle accurate transaction-driven simulation with multiple processor simulators Cycle accurate transaction-driven simulation with multiple processor simulators Dohyung Kim 1a) and Rajesh Gupta 2 1 Engineering Center, Google Korea Ltd. 737 Yeoksam-dong, Gangnam-gu, Seoul 135 984, Korea

More information

Combining Arm & RISC-V in Heterogeneous Designs

Combining Arm & RISC-V in Heterogeneous Designs Combining Arm & RISC-V in Heterogeneous Designs Gajinder Panesar, CTO, UltraSoC gajinder.panesar@ultrasoc.com RISC-V Summit 3 5 December 2018 Santa Clara, USA Problem statement Deterministic multi-core

More information

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture II Benny Thörnberg Associate Professor in Electronics Parallelism Microscopic vs Macroscopic Microscopic parallelism hardware solutions inside system components providing

More information

UVM BASED TEST BENCH TO VERIFY AMBA AXI4 SLAVE PROTOCOL

UVM BASED TEST BENCH TO VERIFY AMBA AXI4 SLAVE PROTOCOL UVM BASED TEST BENCH TO VERIFY AMBA AXI4 SLAVE PROTOCOL Smitha A P1, Ashwini S2 1 M.Tech VLSI Design and Embedded Systems, ECE Dept. 2 Assistant Professor, ECE Dept. NCET, Bengaluru, India. ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

VERIFICATION ANALYSIS OF AHB-LITE PROTOCOL WITH COVERAGE

VERIFICATION ANALYSIS OF AHB-LITE PROTOCOL WITH COVERAGE VERIFICATION ANALYSIS OF AHB-LITE PROTOCOL WITH COVERAGE Richa Sinha 1, Akhilesh Kumar 2 and Archana Kumari Sinha 3 1&2 Department of E&C Engineering, NIT Jamshedpur, Jharkhand, India 3 Department of Physics,

More information

The Challenges of System Design. Raising Performance and Reducing Power Consumption

The Challenges of System Design. Raising Performance and Reducing Power Consumption The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software

More information

OCB-Based SoC Integration

OCB-Based SoC Integration The Present and The Future 黃俊達助理教授 Juinn-Dar Huang, Assistant Professor March 11, 2005 jdhuang@mail.nctu.edu.tw Department of Electronics Engineering National Chiao Tung University 1 Outlines Present Why

More information

Design and Implementation of A Reconfigurable Arbiter

Design and Implementation of A Reconfigurable Arbiter Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 2007 100 Design and Implementation of A Reconfigurable Arbiter YU-JUNG HUANG,

More information

Intellectual Property Macrocell for. SpaceWire Interface. Compliant with AMBA-APB Bus

Intellectual Property Macrocell for. SpaceWire Interface. Compliant with AMBA-APB Bus Intellectual Property Macrocell for SpaceWire Interface Compliant with AMBA-APB Bus L. Fanucci, A. Renieri, P. Terreni Tel. +39 050 2217 668, Fax. +39 050 2217522 Email: luca.fanucci@iet.unipi.it - 1 -

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Real-Time Multi-Dimensional Fast Fourier Transforms on FPGAs Examensarbete utfört i Datorteknik vid Tekniska högskolan

More information

Institutionen för systemteknik Department of Electrical Engineering

Institutionen för systemteknik Department of Electrical Engineering Institutionen för systemteknik Department of Electrical Engineering Examensarbete Multiple Synchronized Video Streams on IP Network Examensarbete utfört i infomationskodning i sammabete med Kapsch TrafficCom

More information

Computers and Electrical Engineering

Computers and Electrical Engineering Computers and Electrical Engineering 40 (2014) 1838 1857 Contents lists available at ScienceDirect Computers and Electrical Engineering journal homepage: www.elsevier.com/locate/compeleceng Design and

More information

Design & Implementation of AHB Interface for SOC Application

Design & Implementation of AHB Interface for SOC Application Design & Implementation of AHB Interface for SOC Application Sangeeta Mangal M. Tech. Scholar Department of Electronics & Communication Pacific University, Udaipur (India) enggsangeetajain@gmail.com Nakul

More information

Design and Verification of AMBA AHB- Lite protocol using Verilog HDL

Design and Verification of AMBA AHB- Lite protocol using Verilog HDL Design and Verification of AMBA AHB- Lite protocol using Verilog HDL Sravya Kante #1, Hari KishoreKakarla *2, Avinash Yadlapati #3 1, 2 Department of ECE, KL University Green Fields, Vaddeswaram-522502,

More information