Hardware Design, Synthesis, and Verification of a Multicore Communications API


Benjamin Meakin and Ganesh Gopalakrishnan
University of Utah, School of Computing
{meakin, ganesh}@cs.utah.edu

Abstract

Modern trends in computer architecture and semiconductor scaling are leading towards the design of chips with more and more processor cores. Highly concurrent hardware and software architectures are inevitable in future systems. One of the greatest problems in these systems is communication. Providing coherence, consistency, synchronization, and sharing of data in a multicore system requires that communication overhead be minimal. It is essential that scalable, flexible, and efficient hardware/software mechanisms be researched and developed to ease the technical community into developing concurrent systems. This research effort creates such mechanisms by designing a scalable hardware implementation of a multicore communications API. This API, developed by the Multicore Association, targets embedded devices and aims to provide communication primitives for embedded systems on chips. It is a lightweight message passing interface that offers the potential for greater communication performance and lower power than other solutions, at the expense of broad functionality. Realizing this potential is almost entirely up to the implementation. This paper describes the design, synthesis, and verification of such an implementation. The low-latency, low-power, and high-throughput aspirations of the API are the very performance metrics this implementation seeks to optimize. This is achieved by designing an on-chip network as the physical communication medium and modifying a MIPS processor to implement an extension to the instruction set, which permits the API to be implemented with direct hardware support. The result of this effort is a useful case study of a hardware design, synthesis, and verification flow for a possible implementation of an emerging multicore communications API. It shows that such an organization yields performance comparable to current communication architectures, with greater scalability and potential for future innovation.

1. Introduction

It is widely accepted that modern and future computing systems will see performance improvements primarily through exploiting increased process/thread-level parallelism. One of the main prerequisites to exploiting this parallelism efficiently is the availability of APIs that are well matched to the communication and synchronization needs of this area. Clearly, one size does not fit all. In the area of large-scale cluster computing, the Message Passing Interface (MPI), a very sophisticated API with over 300 functions, is the lingua franca. MPI is used to program cluster computers with up to hundreds of thousands of processing nodes. In other realms, such as embedded systems using commodity microprocessors, various real-time operating system primitives and shared-memory threads serve the needs of communication and synchronization. However, in the rapidly growing area of embedded systems based on multiple cores, chips contain not only multiple general-purpose computing cores but also, for cost effectiveness and performance, application-specific accelerators, I/O interfaces, and memory controllers. All of these on-chip devices need low-overhead communication. As semiconductors continue to scale, more and more of these devices will be found on the same chip.
Instead of reinventing the wheel, it is imperative that semiconductor companies agree on a standard software API that, on one hand, offers high efficiency and, on the other, the ability to build and reuse application software. The need for such a standardized API is underscored by the emergence of on-chip networks (as opposed to busses) as the physical communication mechanism. Thus the standard API must be able to mimic the functionality offered by MPI and threads, but in a much more lightweight manner, and in a manner that meshes well with existing (bus-based) and emerging (network-based) hardware transport mechanisms.

This paper describes an effort to merge these two trends: sophisticated transport mechanisms in hardware, driven by reusable, standard-API-based, high-performance software. This effort is part of the emerging Multicore Communications API (MCAPI), led by over two dozen Multicore Association (MCA) member companies. There are also a few university members in the MCA, including our research group. Our primary goal in joining the MCA (by paying an annual membership fee) has been to observe the creation of MCAPI at an early stage, understand the motivations behind it, and develop meaningful formal methods solutions for this area. This paper describes our efforts to understand the hardware design of MCAPI thoroughly (a companion paper describes our efforts to thoroughly understand the software formal verification needs of this area). Our approach follows the plea "get real, get physical": one could endlessly tout the virtues of MCAPI, but unless one has an actual implementation of MCAPI in silicon, one cannot assess its true merits. This paper represents the following ambitious journey embarked on by one graduate student and his advisor: design a nine-core MIPS-based MCAPI fabric on an FPGA; modify the MIPS ISA to support MCAPI primitives efficiently; and demo this subsystem and release it in the public domain so that everyone in the community can play with an actual MCAPI in silicon. Thereafter, work on several follow-on projects: (i) write real MCAPI applications in C, compile them, and run them on our target; (ii) write a Bluespec specification for our architecture and re-derive our hardware; and (iii) apply IBM's SixthSense tools to formally verify our hand-designed MCAPI silicon. (Note: thanks to IBM, we are, so far as we know, the only university with a license for IBM's SixthSense tools. It should be possible for other universities to benefit from the results IBM observes from SixthSense use in our group. Our work is also being presented as a poster at the Multicore Expo in Spring 2009.)

MCAPI is a lightweight message passing specification targeted towards embedded SoCs. It has been shown in [2, 3, 8], among many other publications, that on-chip networks are a necessary direction in parallel architectures to circumvent the issues associated with rapidly increasing wire delays and increasing latency due to contention for shared buses. Any future concurrent system should have some sort of scalable interconnection network, and this is the main hardware design focus of this work. A high-throughput, low-latency on-chip network has been designed with MCAPI and embedded applications in mind; this design is described in section 3, and the communication performance it achieves in section 5.

The phases of our long-term work are the following. First, we offer a detailed assessment of MCAPI by providing the first public-domain design (on FPGA) of an MCAPI-based communication architecture. This design consists of nine modified MIPS cores connected through a wormhole-routed NoC fabric. The second contribution will be to formally verify parts of our implementation using the IBM SixthSense tools. The third contribution will be to re-derive parts of our design using the Bluespec language and compilation system. Because of the scale of these steps, what we have concretely achieved so far is the first of these three phases.

2. Multicore API Implementation
2.1. MCAPI Overview

The Multicore Communications API is a message passing interface similar to MPI. However, it is designed primarily for embedded devices, where broad functionality is not as important as high performance in a few types of communication. MCAPI provides the communication primitives that can be used by operating systems, libraries, and applications to improve code portability across different hardware generations. Since the API is designed for on-chip communication, it makes few assumptions about the hardware architecture and leaves the implementation considerable freedom to take advantage of whatever optimizations the architecture permits. For example, if shared memory is available, transmission of data can be implemented as pointer passing, which eliminates unnecessary copies that may be required by other solutions.

There are two types of communication defined in MCAPI. The first is connected packet and scalar channels. These channels require that the user define two endpoints and a communication link between them; data can then be sent on this connection with very high throughput. The second type is connectionless messages. These messages do not require a connection between two endpoints to be established; they can be sent to any other endpoint. However, they incur greater overhead to transmit, so throughput is lower. While the aspirations of the API sound impressive, it is largely up to the implementation to realize their potential.

2.2. MIPS ISA Extension

The core functionality of the implementation described in this paper is controllable via a set of RISC-type assembly instructions added as an extension to the MIPS instruction set. These instructions are given in figure 1. The decision to make the control of the communication hardware programmable reflects an effort to avoid over-complicating the hardware.

Fig. 1: The MCAPI instruction-set extension.

Using these instructions, all of the communication functionality of MCAPI can be implemented. The send-header instruction builds the packet header and sends it on the network. It includes source and destination node identifiers, as well as a packet class that indicates the type of data being sent (i.e., pointer/buffer, short, integer, or long). The receive-header instruction subsequently gets the packet header and writes it to a register; the header data can then be parsed using bit-mask and shift instructions (standard in the MIPS ISA). The send/receive-word instructions send or receive word-length chunks of data. The get-ID and get-flag instructions write the local node ID and the specified network flag to registers, respectively. The available network flags are described in detail in the next section; they include the information needed to determine when a packet is available, when the network is busy, and some simple error checking.

2.3. Example Send/Receive

With these added instructions, MCAPI can be implemented as a C library with inline assembly code. Example implementations of the MCAPI message send and receive functions are given in figures 2 and 3, respectively. For brevity, some error-checking code has been omitted. Note that the code size of these functions is relatively small; this is critical to minimizing the memory footprint of the library implementation so that it is suitable for embedded applications.

Fig. 2: MCAPI message send function.
Fig. 3: MCAPI message receive function.

Both send and receive functions can be expected to return very quickly. Since the underlying hardware supports zero-copy data transfers through pointer passing, the send function is very fast. The receive function contains loops that check for data being available; these are mostly to ensure correctness, since it is expected that a user would call mcapi_msg_available before calling the receive function.
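As an illustration of how these instructions compose, the following minimal C sketch shows a message-send path in the spirit of figure 2. The mnemonics (sndhdr, sndwd, getflg), the flag index, and the packet-class encoding used here are hypothetical placeholders, not the actual encodings defined by figure 1 and the project assembler.

/* Hedged sketch of an MCAPI message send built on the ISA extension.
 * The mnemonics and encodings below are placeholders for illustration. */
#include <stdint.h>
#include <stddef.h>

#define PKT_CLASS_BUFFER 0      /* hypothetical packet-class encoding   */
#define FLAG_NET_BUSY    1      /* hypothetical network-busy flag index */

static inline int net_busy(void)
{
    uint32_t f;
    __asm__ volatile ("getflg %0, %1" : "=r"(f) : "i"(FLAG_NET_BUSY));
    return (int)f;
}

int mcapi_msg_send(uint32_t dest_node, uint32_t dest_port,
                   const void *buf, size_t len)
{
    const uint32_t *words = (const uint32_t *)buf;
    size_t nwords = (len + 3) / 4;

    while (net_busy())                      /* wait for a free channel */
        ;

    /* Build and inject the header flit: destination node, port, class. */
    __asm__ volatile ("sndhdr %0, %1, %2"
                      :: "r"(dest_node), "r"(dest_port), "r"(PKT_CLASS_BUFFER));

    /* Stream the payload word by word; a zero-copy send would instead
     * send a single word carrying the buffer address. */
    for (size_t i = 0; i < nwords; i++)
        __asm__ volatile ("sndwd %0" :: "r"(words[i]));

    return 0;                               /* MCAPI status codes omitted */
}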

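At the application level, such library functions would be used roughly as follows. This is a minimal producer/consumer exchange in the spirit of the test program described later in section 4.2; the simplified signatures are assumptions for illustration, not the endpoint-based signatures of the MCAPI specification [1].

/* Minimal producer/consumer usage sketch; signatures are illustrative. */
#include <stdint.h>
#include <stddef.h>

int mcapi_msg_send(uint32_t dest_node, uint32_t dest_port,
                   const void *buf, size_t len);
int mcapi_msg_recv(void *buf, size_t max_len, uint32_t *sender);
int mcapi_msg_available(void);

#define CONSUMER_NODE 4   /* hypothetical node number in the 3x3 mesh */

void producer(void)
{
    uint32_t samples[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    mcapi_msg_send(CONSUMER_NODE, 0, samples, sizeof samples);
}

void consumer(void)
{
    uint32_t buf[8];
    uint32_t sender;

    while (!mcapi_msg_available())   /* poll the NIU flags (section 3.3) */
        ;
    mcapi_msg_recv(buf, sizeof buf, &sender);
    /* ... process buf ... */
}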
3. On-Chip Network Design

3.1. Network Topology

As the physical communication medium, a two-dimensional grid network was designed to meet the data transport needs of MCAPI. Grid network topologies have several properties that make them attractive for this type of application. First, they are highly scalable. For an N x N grid network, the number of cores is N^2, while the worst-case communication latency in network hops grows linearly, following a curve of about 2N. Compare this to a bus, where the worst-case latency is equal to the number of cores, assuming a fair arbitration scheme. So, with respect to the number of cores, grid networks have sub-linear scaling of latency.

This reasoning does not even consider the second advantage of grid networks: short wire lengths. Increasing wire delays are becoming a major problem in designing chips in modern process technologies [4]. Grid networks partition communication paths such that adding more cores to a design will not increase the length of the wires.

Fig. 4: Example NoC layout.

Figure 4 shows an example network layout similar to what has been designed here. The only difference is that this diagram also shows accelerators, I/O interfaces, and L2 caches, illustrating the heterogeneous nature of systems that may use this type of MCAPI implementation. Note that the tiled layout of this network physically distributes the L2 cache, but logically the L2 cache is shared. This architecture has been proposed in [5].
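To make the scaling argument concrete, the short C sketch below computes hop counts under dimension-order routing for a row-major node numbering; the numbering is an assumption for illustration, since the paper does not fix one.

/* Hop count in an N x N mesh is the Manhattan distance between nodes;
 * the worst case (opposite corners) is 2*(N-1) hops, roughly 2N, while
 * the core count grows as N^2.  Row-major numbering is assumed. */
#include <stdio.h>
#include <stdlib.h>

static int mesh_hops(int src, int dst, int n)
{
    int sx = src % n, sy = src / n;
    int dx = dst % n, dy = dst / n;
    return abs(dx - sx) + abs(dy - sy);
}

int main(void)
{
    int n = 3;                                  /* the 9-core system of Fig. 4 */
    printf("worst case: %d hops for %d cores\n",
           mesh_hops(0, n * n - 1, n), n * n);  /* prints: 4 hops for 9 cores  */
    return 0;
}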

3.2. Wormhole Router with Virtual Channels

The key hardware component in a scalable interconnect is the on-chip router. There are many different types of routers, but wormhole routers utilizing virtual channels have been shown in [6] to be efficient designs for on-chip networks. In a wormhole router, packets are divided into flits, and flits are passed through the network in a pipelined fashion. This means that routers only need buffer space for a small portion of a packet, which saves power and chip area, both critical design metrics in an embedded chip.

MCAPI packets are divided into flits as shown in figure 5. The head flit consists of a destination node and port ID, a packet class, and a sender ID. In general, each core is a node. The port ID is included for future extensions that may support multiple endpoints per node. The sender ID is necessary so that the destination can determine where the packet came from. Different packet classes are used to implement the functionality of MCAPI: since data can be sent as scalar values of various bit widths or as pointers, the packet class tells the receiver how to interpret the received data. It will be shown through the description of the implementation that opening a channel, and thus reserving network resources, is as simple as sending a header flit. Therefore, packet and scalar channels are implemented with the same instructions used to send connectionless messages; the only difference is that channel communication can consist of an arbitrary number of packets. It will be shown that this can negatively affect overall network performance.

Fig. 5: MCAPI packet format (division into flits).

The on-chip router designed for this implementation resembles the block diagram in figure 6. The key features include: 5 input and output channels, each with two virtual channels; a single-cycle 16-bit data path; fair round-robin channel arbitration; and deadlock-free dimension-order routing. The flow control of the router is a simple token scheme in which each VC has buffer space for two flits. Once a VC holds a flit, it sets its token signal high to stall the pipeline. One flit's worth of buffer space is insufficient because another flit may already have been sent at the same time the token goes high; to prevent buffer overflow, two buffer slots are needed in this single-cycle design.

Fig. 6: Router block diagram.

The key to permitting high throughput in a wormhole routing scheme is virtual channels. The flits of a packet can be strung out all across the network, and since only the header flit contains routing information, the body flits must follow immediately behind the header. This causes other packets contending for the same channel to be stalled. It is for this reason that packet and scalar channels can negatively impact overall network performance, since they carry arbitrary lengths of data. Virtual channels allow these packets to continue moving through the network by allocating resources to packets at the buffer level rather than at the physical channel. Since there are two VCs per physical channel, each node may open only a single packet or scalar channel at a time; this helps ensure that a VC will usually be available and that there is an upper bound on how long a packet may be stalled at each hop. In this design, the decision of which virtual channel to use for a packet is determined by the token inputs and by saturating counters that track network traffic on each physical channel. The VC decision can only be made for the header flit; subsequent flits must follow the header. Therefore, if two packets are sent back-to-back across the same physical link, the second packet will choose the opposite VC from the first. This balances network load and improves average-case throughput and latency.
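The head-flit fields above fit in the 16-bit flit, but the paper does not give the exact bit widths; the 4/2/2/4-bit split in the C sketch below is an assumption chosen to cover the 9-node system and is for illustration only.

/* Head-flit layout sketch.  Field widths are assumed, not specified
 * by the design: dest node [15:12], port [11:10], class [9:8],
 * sender [7:4], with the low nibble unused. */
#include <stdint.h>

typedef uint16_t flit_t;

#define HDR_DEST(node)   ((flit_t)(((node)   & 0xF) << 12))
#define HDR_PORT(port)   ((flit_t)(((port)   & 0x3) << 10))
#define HDR_CLASS(cls)   ((flit_t)(((cls)    & 0x3) <<  8))
#define HDR_SENDER(node) ((flit_t)(((node)   & 0xF) <<  4))

static flit_t make_header(unsigned dest, unsigned port,
                          unsigned cls, unsigned sender)
{
    return HDR_DEST(dest) | HDR_PORT(port) | HDR_CLASS(cls) | HDR_SENDER(sender);
}

static unsigned header_dest(flit_t f)   { return (f >> 12) & 0xF; }
static unsigned header_sender(flit_t f) { return (f >>  4) & 0xF; }

A receiver would recover these fields in software with the bit-mask and shift instructions mentioned in section 2.2.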

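The deadlock-free dimension-order routing named in the key-feature list reduces to a small decision function: route fully in X first, then in Y. The port naming and coordinate handling in this C sketch are illustrative only.

/* Dimension-order (XY) routing decision for one hop. */
enum port { PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH, PORT_LOCAL };

static enum port xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL;   /* arrived: deliver to the attached core */
}

Because a packet never turns from the Y dimension back into the X dimension, the routing function cannot form cyclic channel dependencies, which is what makes dimension-order routing deadlock-free.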
The arbitration process is fairly simple and is based on the techniques described in [6]. In the first phase, the router data path computes the routing direction, latches the data, and sends a request to the arbiter for the output channel. In the second phase, the arbiter sends a grant signal to the VC and sets the appropriate control signals for the crossbar, which forwards the data to the next router. When two flits want to use the same output channel, the decision is made based on the value of a ring counter. The goal of the arbiter is to forward as many flits as possible every clock cycle.

The router module is the main bottleneck in system clock speed. The decision to use a single-cycle design for the router is part of the effort to achieve low latency, but such a router cannot be clocked as fast as a pipelined one. However, because MCAPI is targeted towards embedded designs, which have lower clock rates, this seemed a worthwhile trade-off. Research in [7] demonstrates several techniques for decreasing single-cycle clock time in routers by taking the routing function and arbitration process off of the critical path, which further justifies the use of a single-cycle router. It is, however, very important for overall system clock speed that there are no combinational paths through the router; this ensures that no signal propagates from one network node to another without being latched. If that were allowed to happen, the system clock period would become the time taken to traverse multiple network hops.

3.3. Network Interface Unit

The network interface unit (NIU) provides the functionality necessary to easily interface a modified MIPS processor to the grid network. A block diagram is given in figure 7.

Fig. 7: NIU block diagram.

The NIU consists of send and receive modules, send and receive buffers, and a register containing operation flags. The send module builds flits and sends them on the network according to the bus inputs from the processor and the opcode; it also makes an initial decision about which VC to send the packet on. The receive module observes the front of each VC receive buffer, which drives state transitions that determine whether the available data is a header, body, or tail flit. It then removes buffer items when the appropriate opcode is seen. The flags that are set include send and receive port busy flags, header and data available flags, send and receive error flags, and a network busy flag. These flags provide the information the user needs to detect buffer overflow and network overload, and to determine when a message is available.
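Given these flags, a function such as mcapi_msg_available (section 2.3) reduces to reading one flag through the get-flag instruction. The sketch below assumes the same hypothetical getflg mnemonic used in the send sketch of section 2.3, plus a hypothetical flag index.

/* Sketch of message polling against the NIU flags.  The flag index
 * and the getflg mnemonic are placeholders for the real encodings
 * defined by the ISA extension of Fig. 1. */
#include <stdint.h>

#define FLAG_HDR_AVAIL 0   /* hypothetical "header available" flag index */

int mcapi_msg_available(void)
{
    uint32_t avail;
    __asm__ volatile ("getflg %0, %1" : "=r"(avail) : "i"(FLAG_HDR_AVAIL));
    return avail != 0;
}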
4. Synthesis and Verification

4.1. Synthesis for Virtex-5

The target platform for this implementation is a Xilinx Virtex-5 FPGA. Programmable logic is used so that the designs here can be reused in other research efforts exploring multicore system innovations. The synthesis results, in terms of device utilization and performance, are given in table 1 for each module and for an entire 9-core system laid out similarly to the design in figure 4. Note that the complete system excludes the L2 cache but does include 16 B of L1 cache. Future extensions may include a shared L2 cache, because the total device utilization for the existing 9-core system is only about 50%.

Table 1: Synthesis results
Module       LUTs     Registers   Clock Rate
MIPS Core    468      182         200 MHz
NIU          221      54          326 MHz
Router       876      123         155 MHz
9-Core NoC   11309    3133        130 MHz

4.2. Testing Methodology

Testing is performed by running test programs and observing the output waveforms in a VHDL simulator. An assembler has been created for this system, and an example program has been run successfully. However, this test program did not stress the network: it involved a simple producer/consumer communication pattern between two MIPS cores while the remaining cores were idle. It is important that more extensive applications be used to accurately observe the average-case communication latency. Formal verification of critical hardware components is also important, but at the time of this writing no significant formal verification has been done on these designs.

5. Communication Performance

The best- and worst-case communication performance in terms of clock cycles can be summarized by equations 1 and 2. These represent the latency of the header flit; subsequent flits follow directly behind the header.

Best case:  L = H + k        (1)
Worst case: L = 5*H*S + k    (2)

In both equations, L is the latency in cycles, H is the number of hops, S is the average size in flits of other packets, and k is a constant derived from other, non-communication instructions in the implementation. From an initial test of a lightly loaded network, the latency for a single flit traveling across 3 hops is 9 cycles; this represents the best-case latency. Due to the design characteristics of the network, the worst case is very unlikely: for it to occur, 5 packets would need to arrive at each router at the same time, and at each router the flit being sent would have to be last on the arbitration schedule. It has been observed that the average-case latency is much closer to the best-case latency. For connected packet and scalar channels, the body packets achieve best-case latency. Throughput is also increased because there is much less header/tail overhead, and since network resources are reserved when the channel is opened, a flit can progress through the connection every clock cycle. At 130 MHz, the throughput of a channel connection is about 260 MB/s, because each body flit is 2 bytes.

6. Conclusions and Future Work

The work shown here demonstrates the viability and usefulness of hardware support for inter-core communication. Even in an FPGA implementation, low communication latency and high throughput (260 MB/s) are possible. Current research in on-chip network design provides solutions for minimizing cost and power while improving performance; these innovations will continue, and they further justify implementing communication APIs in hardware. What has been presented here is an efficient and scalable concurrent computing platform with greater potential for future innovation than existing solutions. Several immediate directions for future work related to this project include: the evaluation of IBM's SixthSense VHDL model checker and Bluespec's automated design tools using the hardware designs presented in this paper; and the development of automated network synthesis and optimization algorithms for MCAPI workloads. Evaluation of SixthSense has already begun; a tutorial for using the tool has been created and is available in [11]. The key units in this design that need formal verification are the arbitration and VC allocation units of the router module, because it is difficult to create workloads that will stress the network. For correctness, it must be verified that a packet will never be sent on a VC being used by another packet until that packet releases the resource. This is a perfect case study for hardware verification because it tests a situation that traditional test vectors would be unlikely to catch.

7. Acknowledgments

This work has been funded by SRC task ID 1847.001.
8. References

[1] Multicore Association, Communication API Specification V1.063, http://www.multicoreassociation.org
[2] M. Ali, M. Welzl, M. Zwicknagl, Networks on Chips: Scalable Interconnects for Future Systems on Chips, IEEE, 2008.
[3] R. Das, S. Eachempati, A. K. Mishra, V. Narayanan, C. R. Das, Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs, HPCA 2009.
[4] R. Ho, K. Mai, M. Horowitz, The Future of Wires, Proceedings of the IEEE, April 2001, pp. 490-504.
[5] N. Hardavellas, M. Ferdman, B. Falsafi, A. Ailamaki, R-NUCA: Data Placement in Distributed Shared Caches, ISCA 2009.
[6] E. Shin, V. Mooney, G. Riley, Round-Robin Arbiter Design and Generation, ISSS 2002.
[7] R. Mullins, A. West, S. Moore, Low-Latency Virtual-Channel Routers for On-Chip Networks, ISCA 2004.
[8] W. Dally, B. Towles, Route Packets, Not Wires: On-Chip Interconnection Networks, Proceedings of the IEEE Design Automation Conference, 2001, pp. 684-689.
[9] V. Dvorak, Communication Performance of Mesh- and Ring-Based NoCs, 7th International Conference on Networking, 2008, pp. 156-161.
[10] I. Nousias, T. Arslan, Wormhole Routing with Virtual Channels using Adaptive Rate Control for Network-on-Chip (NoC), Proceedings of the 1st NASA/ESA Conference on Adaptive Hardware and Systems, 2006, pp. 420-423.
[11] B. Meakin, SixthSense Tutorial, www.cs.utah.edu/formal_verification/mediawiki/index.php/sixthsense