NoC Test-Chip Project: Working Document


Michele Petracca, Omar Ahmad, Young Jin Yoon, Frank Zovko, Luca Carloni and Kenneth Shepard

I. INTRODUCTION

This document describes the low-power, high-performance network-on-chip (NoC) architecture that we propose to connect 16 processing cores in the Test-Chip, which will be realized in a 90nm technology process. The main purpose of the Test-Chip is to test the functionality and performance of the long-distance links (LDLs) when they operate as part of a fairly complex NoC. The LDLs employ current-mode and low-swing signalling techniques to achieve low-power communication at transmission speeds near the speed of light, and are expected to offer a major improvement in NoC performance-per-watt. Channels made of such LDLs can have very high bandwidth (10Gbps) and span an end-to-end physical distance of 1cm or more.

Figure 1 illustrates the proposed architecture for the Test-Chip. All 16 processors run concurrently, each executing a simple sequential program that is stored in the local RAM together with the necessary data. Occasionally the processors exchange data through the network by using a DMA mechanism. Each processor is connected to the NoC through two network interfaces (NIs), which are responsible for sending and receiving data and control messages. A data message can be segmented into a sequence of packets. A control message is usually a small message, one clock cycle long and as wide as the channel.

The chip alternates between two main modes: configuration and execution. In configuration mode the NoC and the processors are off while program code and data are stored in the local RAMs using an external FPGA. In execution mode the chip is effectively a closed system that does not communicate with the outside world. Various communication scenarios can be tested by properly programming the processors.
Essentially, a testing session consists of a sequence of three steps:
1) Uploading: the Test-Chip is in configuration mode. Programs and data are loaded onto the RAMs. After the uploading is complete, the overall content of the 16 RAMs is called the starting configuration. At this point the processors receive a signal from the external FPGA to start execution.
2) Run: the Test-Chip is in execution mode. Each processor runs its program, processing local data and exchanging data with other processors. Each processor notifies the external FPGA as soon as it completes the program execution, and then stops and waits. After all the processors have notified their completion, the RAMs contain the final configuration and the Test-Chip can be switched back to configuration mode.
3) Downloading: the Test-Chip is in configuration mode. The final configuration is downloaded from the RAMs and then compared against an expected final configuration to validate the correctness of the computation.

II. TOPOLOGY

The proposed NoC topology, which is shown in Figure 1, combines a standard packet-switched mesh with a circuit-switched fully-connected network based on LDLs. The mesh transports short data messages and control messages, while the circuit-switched network conveys large DMA transfers or, possibly, high-priority data. The mesh is a 16-node network while the circuit-switched network has only 4 nodes. Each network link is bidirectional.

The two networks connect 16 tiles, each containing a processing core. Each tile is equipped with two network interfaces (NIs). One NI connects the processor with a router of the mesh, while the other NI connects it to one of the four low-swing network nodes. Different messages are exchanged on the two networks based on the software program running on the local processor. Notice that any given message exchange occurs using only one of the two networks.
Each tile also contains a 5x5 router: the router has 4 bidirectional ports to relay packets, while the fifth port is used by the processor to inject (eject) messages into (from) the packet-switched mesh network. The packet-switched mesh network is thus composed of 16 routers, each connected to a processor.

The circuit-switched network has a different architecture. It has just 4 low-swing network nodes that manage the injection and ejection of traffic over the circuit-switched network. Each node is attached to 4 different processors and communicates directly with the other 3 low-swing network nodes through LDLs. The processor layout can be divided into 4 square quadrants, each containing 4 processors that are attached to the same low-swing network node. Each node includes an arbitration mechanism to grant each of its 3 low-swing interfaces to one of the 4 processors. The group of 4 processors together with its low-swing network node is called a low-swing island.

Fig. 1. Proposed Architecture for the NoC Test Chip.

This topology is innovative. It combines a packet-switched network with a circuit-switched network, a pairing that has recently been advocated by many NoC researchers. The packet-switched network uses traditional full-swing signaling and is used for short data/control packets. The circuit-switched network offers a low-latency, high-bandwidth, low-power solution for the exchange of large messages. It is interesting to analyze the possibility of having a higher bandwidth on the LDLs than on the mesh links, because on the LDLs the penalty in terms of power dissipation when the bandwidth grows is lower. It is also interesting to perform a performance analysis to find the best way of distributing the traffic across the two networks.

The LDLs are used to connect pairs of processors from different low-swing islands, because in general the number of mesh hops between them is high. For processors at a distance of 1 or 2 hops the packet-switched network can be used. This leads to lower power dissipation for long-distance communications.
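The quadrant structure and the hop-distance rule of thumb described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 4x4 row-major tile numbering, the island numbering, the function names and the `hop_threshold` parameter are all hypothetical and not defined in this document. On the Test-Chip the network is selected by the program itself, so `preferred_network` is a planning aid for writing traffic scenarios, not a hardware mechanism.

```python
MESH_SIDE = 4       # assumption: the 16 tiles form a 4x4 mesh, numbered row-major 0..15
QUADRANT_SIDE = 2   # each low-swing island covers a 2x2 quadrant of tiles


def tile_coords(tile_id):
    """Return the (x, y) mesh coordinates of a tile (row-major numbering assumed)."""
    if not 0 <= tile_id < MESH_SIDE * MESH_SIDE:
        raise ValueError("tile_id must be in 0..15")
    return tile_id % MESH_SIDE, tile_id // MESH_SIDE


def island_of(tile_id):
    """Return the low-swing island (0..3) whose network node serves this tile."""
    x, y = tile_coords(tile_id)
    islands_per_row = MESH_SIDE // QUADRANT_SIDE
    return (y // QUADRANT_SIDE) * islands_per_row + (x // QUADRANT_SIDE)


def mesh_hops(src, dst):
    """Number of router-to-router hops between two tiles on the mesh."""
    sx, sy = tile_coords(src)
    dx, dy = tile_coords(dst)
    return abs(sx - dx) + abs(sy - dy)


def preferred_network(src, dst, hop_threshold=2):
    """Suggest a network for a transfer (hypothetical policy, not chip logic).

    An LDL only links different islands, so the circuit network is suggested
    when the tiles sit in different islands and are more than hop_threshold
    mesh hops apart; otherwise the packet-switched mesh is suggested.
    """
    if island_of(src) != island_of(dst) and mesh_hops(src, dst) > hop_threshold:
        return "circuit"
    return "packet"
```

With this numbering, tiles 0 and 15 sit in opposite corners (6 hops, different islands), so the sketch suggests the circuit-switched network, while neighbouring tiles 0 and 1 stay on the mesh.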

Fig. 2. Micro-Architecture of the Processing Core.

III. CORE ARCHITECTURE

To simplify the design task, we propose that instead of designing a fully-functional processor we build a software-configurable traffic generator. The capability of programming the generation of different traffic patterns will allow us to perform in-the-field analysis of the NoC correctness and performance. Hence, the main components of our simplified processor are (see Figure 2): a local memory (RAM), a Processing Unit (PU), a DMA engine, a Full-Swing Network Interface (FSNI), and a Low-Swing Network Interface (LSNI).

The RAM has 3 sections: 1) the code area, 2) the statistics area, 3) the data area. The RAM has one output port for the code and one I/O port for the data, so that a DMA engine can manage the memory while just one incoming communication and one outgoing communication are possible at any given time.

The PU executes a program that is stored in the code area of the RAM. The program contains a sequence of instructions that allow us to emulate a real program running on the processor and generating data transfers over the networks. For this purpose we have defined a simple assembly code (see Section VI).

The DMA engine is a functional block that reads a portion of data from the RAM, after receiving as input the base address and the size of the data transfer. The output of the memory is then demultiplexed toward one of the two network interfaces according to the specification given in the communication instruction.

The FSNI is the interface toward the packet-switched mesh network. Over this network, small data messages and control messages are allowed. The small data messages are generated by the DMA, while the control messages are one flit long and are generated directly by the PU.

The LSNI is the interface toward the low-swing circuit network.
It forwards long data messages on this network and is also equipped with a control interface that allows the reservation of circuits when a transfer has to take place. The control messages related to a low-swing communication are carried by the packet network before the circuit transfer can happen.

The processor architecture is shown in Figure 2. The PU and the DMA read the code and the data from the RAM. The PU generates control messages to be sent over the packet network. Each control message corresponds to a distinct data transfer (either incoming or outgoing) that takes place over one of the two networks. The data read by the DMA engine from the RAM are forwarded to the appropriate network interface.

IV. LOW-SWING NETWORK NODE

The node of the low-swing network is a circuit that manages the traffic coming from the four tiles. It is equipped with 4 tile-interfaces (TIs) and with 3 low-swing transceivers (LSTxRx). The TIs are used to receive the communication requests from the processors and to return the grants. The low-swing network node essentially behaves as an arbiter: it receives many requests and allocates a number of resources (3 transceivers) lower than the number of entities that make the requests (4 tiles). When a circuit interface is reserved (either a Tx or an Rx), a grant signal is sent to the tile that will send or receive on that circuit.

V. HIGH-LEVEL INTER-PROCESSOR COMMUNICATION

The Test-Chip is a distributed concurrent system where 16 processors compute concurrently and exchange data via either the packet-switched network or point-to-point channels on the circuit-switched network. From a model of computation (MOC) point of view, the system is a collection of sequential processes running concurrently. Each process is specified as a program in the local memory. The program is made of an interleaved sequence of computation and communication instructions.

During a communication phase two processors communicate through a rendezvous mechanism. In a two-processor rendezvous communication the sender processor writes a message into the RAM of the receiver processor. Both processors are running their own process and each of these processes contains an instruction related to this communication, respectively a SEND and a RECEIVE instruction. The two processes synchronize as they execute these instructions.

Two protocols are suitable to implement the control phase of a data transfer using a rendezvous scheme:
1) The first protocol is illustrated in Figure 3. When the destination reaches the RECEIVE instruction (at time $t_{rcv}$) it sends a RECEIVE control message to the sender to signal that it is ready to receive the data.
When the sender reaches the SEND instruction (at time $t_{snd}$), it waits for the reception of the RECEIVE control message (unless it has already arrived and has been buffered) before sending the message over the network.
2) The second protocol is illustrated in Figure 4. Here, the sender starts the communication by emitting a SEND control message when it reaches the SEND instruction. After executing the RECEIVE instruction, the receiver waits for the incoming SEND control message (unless it has already arrived and has been buffered). Then the LDL is requested from the low-swing network node and, when the link is granted, the RECEIVE control message is sent back. The sender starts to transfer the data just after receiving the RECEIVE control message.

The first protocol is faster because only one control message is sent on the network, and the time when the transfer starts ($t_{tx}$) is

$t_{tx} = \max(t_{snd},\ t_{rcv} + RTT/2)$,

where $RTT$ is the round-trip time for a message. The second protocol instead requires the exchange of two control messages and is slower because

$t_{tx} = \max(t_{snd} + d_{rsv} + RTT,\ t_{rcv} + d_{rsv} + RTT/2)$,

where $d_{rsv}$ is the delay imposed by the arbitration of the low-swing link. The second protocol, however, has an advantage: if the RECEIVE control message is also used to reserve resources over the network, those resources are reserved only when both processors are ready for the communication. In the first protocol the reservation would happen when the receiver is ready, which could occur much earlier than the time the communication really starts. The trade-offs between the two protocols can be exploited by using a different protocol in each of the two networks, as discussed in Sections VIII and IX.

VI. ASSEMBLY LANGUAGE

In order to emulate a real program execution together with its various possible communication traffic patterns, we propose to implement a simple instruction set that includes the following instructions:
- SEND (source address, destination processor, destination address, length, network)
- RECEIVE (source processor, destination address, network)
- WHILE (number of instructions, number of iterations)
- NOP (number of cycles)
- STOP

These instructions are sufficient to generate traffic patterns on both the packet-switched and the circuit-switched networks, as well as to implement communications that protect memory consistency through the rendezvous mechanism. When a process reaches a SEND it has to write a message into another processor's RAM. When a process reaches a RECEIVE it has to wait for the reception of a message from another processor.
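As a minimal sketch of how the instruction set could be represented in a software model (for example in the system-level simulator), the five instructions can be written as Python data classes, with a helper that expands WHILE loops into a flat trace. The class and field names, the string encoding of the network parameter, and the function `unroll` are hypothetical. The sketch also makes two assumptions that this document does not settle: the iteration count of WHILE includes the first pass over the loop body, and WHILE loops are not nested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Send:
    src_addr: int
    dst_proc: int
    dst_addr: int
    length: int
    network: str  # assumption: "packet" or "circuit"


@dataclass(frozen=True)
class Receive:
    src_proc: int
    dst_addr: int
    network: str


@dataclass(frozen=True)
class While:
    n_instructions: int  # how many preceding instructions form the loop body
    n_iterations: int    # assumption: total passes, including the first one


@dataclass(frozen=True)
class Nop:
    cycles: int  # emulated computation delay


@dataclass(frozen=True)
class Stop:
    pass


def unroll(program):
    """Expand WHILE instructions into a flat trace that ends at the first STOP."""
    trace = []
    for i, instr in enumerate(program):
        if isinstance(instr, Stop):
            trace.append(instr)
            break
        if isinstance(instr, While):
            n = instr.n_instructions
            if n > i:
                raise ValueError("WHILE covers more instructions than precede it")
            body = list(program[i - n:i])
            if any(isinstance(b, While) for b in body):
                raise ValueError("nested WHILE is not modelled in this sketch")
            # The body has already been emitted once, so add the remaining passes.
            trace.extend(body * (instr.n_iterations - 1))
        else:
            trace.append(instr)
    return trace
```

For example, a program made of `Nop(10)`, a `Send`, `While(2, 3)` and `Stop()` unrolls into three NOP/SEND pairs followed by STOP, which shows how WHILE lets a short program repeat a traffic pattern many times.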

Fig. 3. First rendezvous communication protocol.

The WHILE instruction loops over a group of instructions executed just before the WHILE itself. The two parameters indicate the number of instructions to consider and the number of iterations. This instruction makes a program more compact, saving RAM, and allows a traffic pattern to be repeated many times. The NOP instruction is used to emulate the computation delay. The STOP instruction ends a program.

VII. ROUTING AND FLOW CONTROL

The packet-switched network uses wormhole flow control, which is based on the segmentation of a packet into a train of flits. A flit is defined as a set of bits that matches the link parallelism. The first flit contains the information used by the routers to forward the packet. All the following flits are sequentially forwarded along the same path, following the first flit. If there is a resource conflict on a router port, the first flit can be blocked by another flit: in this case, a back-pressure mechanism stops the rest of the flits, which, depending on the length of the packet, can be stored in the router buffers and, possibly, even in the other routers along the path.

Wormhole flow control may lead to deadlock because, unlike store-and-forward mechanisms, the resources are allocated along the path in more than a single router. In order to avoid deadlock it is necessary to avoid the creation of cyclic dependencies by carefully choosing the routing algorithm. One of the simplest routing algorithms for a packet-switched mesh network is XY dimension-order routing, which routes a packet first along the X dimension until it reaches the destination column and then along the Y dimension until it reaches the final destination. XY dimension-order routing is known to be deadlock free. It is possible to introduce more efficient (but also more complex) routing algorithms and to make them deadlock free by adding a proper number of virtual channels (VCs).
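XY dimension-order routing as described above can be sketched as a per-router decision function plus a helper that walks the resulting path. This is an illustrative model, not the router RTL: the port names, the (x, y) coordinate convention with y growing toward SOUTH, and the function names are assumptions.

```python
# Coordinate change associated with each output port (assumed convention).
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "SOUTH": (0, 1), "NORTH": (0, -1)}


def xy_next_port(cur, dst):
    """Choose the output port at router `cur` for a packet headed to `dst`.

    The X offset is corrected first; the Y offset is corrected only once the
    packet is in the destination column. LOCAL ejects the packet to the core.
    """
    cx, cy = cur
    dx, dy = dst
    if cx < dx:
        return "EAST"
    if cx > dx:
        return "WEST"
    if cy < dy:
        return "SOUTH"
    if cy > dy:
        return "NORTH"
    return "LOCAL"


def xy_route(src, dst):
    """Return the list of router coordinates visited from src to dst."""
    path = [src]
    cur = src
    while cur != dst:
        sx, sy = STEP[xy_next_port(cur, dst)]
        cur = (cur[0] + sx, cur[1] + sy)
        path.append(cur)
    return path
```

Because every packet finishes all of its X moves before making any Y move, a packet never turns from the Y dimension back into the X dimension, which is the property that rules out cyclic channel dependencies on a mesh.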
VCs allow a packet to be forwarded on a port even if that port is the destination of a blocked packet, thereby also improving the network throughput. However, the insertion of VCs not only adds complexity in terms of buffer management and router design, but also requires more buffering resources. For our purpose, since the mesh mainly carries control messages and small data packets, we plan to use a simple deadlock-free routing algorithm without the introduction of VCs.

VIII. TRANSFER PROTOCOL OVER THE PACKET-SWITCHED MESH NETWORK

As discussed in Section V, there is more than one high-level communication protocol that can be used to implement the rendezvous communication scheme. For the packet-switched network we plan to use the first protocol discussed above (Figure 3) because it uses just one control message and has a lower latency.

IX. CIRCUIT RESERVATION PROTOCOL

The LDLs are used in a circuit-switched network that is suitable for large and infrequent messages, because there is a higher overhead in setting up the path. Once a point-to-point communication circuit is reserved, there is no need to implement a flow control mechanism or to segment the message into multiple packets, because a dedicated channel is available to transfer all the data. Besides, the longer the message, the lower the relative overhead of the circuit set-up procedure.

Fig. 4. Second rendezvous communication protocol.

For the circuit-switched network we use the second protocol proposed in Section V (see Figure 4). With this protocol it is possible to carry the path-setup process on the RECEIVE control message. Before sending the RECEIVE message, the receiver processor must request the reservation of the Rx on the low-swing link that will be used for the communication and receive the grant for that link (if the link is already busy, a connection over the same link is taking place, so the new communication is postponed). When the RECEIVE control message is received, the sender can turn on the Tx on the low-swing link and send the data. The reservation protocol guarantees that no other communication is in progress on that link at the time the sender asks for it, because the Rx is already locked.

The protocol above uses two control messages, thus increasing the load on the packet-switched network and the latency of the control phase. On the other hand, with this protocol it is possible to reserve the resources for the circuit communication, and the circuit communication has a negligible latency once it is set up, because it directly connects the two processors and no routing decisions (one or more clock cycles per router) have to be made.

The discussed protocol resolves conflicts among different communications toward the same destination. Suppose that processors A and C both need to send data to the same processor B, and that for both processors this operation is the next rendezvous point. In this case, B has two RECEIVE instructions in sequence, one for each of the two processors, possibly with some computation between them. Suppose that the code of B contains RECEIVE(A) and then RECEIVE(C), but that C executes its SEND(B) instruction before A does.
In this situation B stores the SEND message coming from C, reaches the RECEIVE(A) instruction and, if necessary, waits for the SEND message from A. Only after receiving this message does B request the resources for the circuit and send a RECEIVE message to A. The communication with C takes place only after B sends a RECEIVE message to C, i.e. when the RECEIVE(C) instruction is executed by B, which happens only after the A-to-B transfer is completed.

According to the scenario described above, a core is able to store all the SEND requests that may arrive from the other cores (15 in this case) and then serves them by sending the RECEIVE messages in the same order in which it executes the corresponding rendezvous instructions. This allows each core to reserve a circuit only when both ends of the communication have reached the rendezvous point, so as to avoid any possibility of deadlock. On the other hand, as previously discussed, the receiver core sends the RECEIVE message only when the receiver of the LDL has already been reserved, i.e. when the grant from the low-swing network node has been received. With this constraint, race conditions are avoided between pairs of cores that share the same LDL.
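The ordering rule in the A/B/C scenario above can be illustrated with a small Python model of the receiver side. It is a sketch under simplifying assumptions: the LDL grant is treated as immediately available and each transfer as completing before the next RECEIVE instruction is reached, so only the order of the RECEIVE control messages is modelled. The function name `serve_order` and its arguments are hypothetical.

```python
from collections import Counter, deque


def serve_order(receive_program, send_arrivals):
    """Return the order in which a receiver sends its RECEIVE control messages.

    receive_program: source cores, in the order of the receiver's RECEIVE
                     instructions (e.g. ["A", "C"]).
    send_arrivals:   sender cores, in the order their SEND control messages
                     reach the receiver (e.g. ["C", "A"]).
    """
    waiting = deque(receive_program)  # RECEIVE instructions not yet served
    buffered = Counter()              # SEND messages stored but not yet served
    replies = []
    for sender in send_arrivals:
        buffered[sender] += 1
        # The receiver is blocked on the RECEIVE at the head of its program:
        # it answers only when the matching SEND has been buffered.
        while waiting and buffered[waiting[0]] > 0:
            src = waiting.popleft()
            buffered[src] -= 1
            replies.append(src)
    return replies
```

With `receive_program = ["A", "C"]` and `send_arrivals = ["C", "A"]`, the SEND from C is buffered first, but the RECEIVE messages go out to A and then to C, matching the scenario in the text.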

Fig. 5. Interface between LSNI and low-swing network node.

X. MAIN DESIGN AREAS

The design effort for the proposed NoC can be divided into four main parts:
1) LSNI and low-swing network node development and testing;
2) Router and full-swing network design with routing and flow control testing;
3) RAM, PU, DMA and FSNI design, with instruction fetch, instruction translation and data transfer testing;
4) Development of a system-level simulator (based on Omnet++) for application-level testing and to support the programming of the various traffic scenarios for the Test-Chip.

XI. LSNI AND LOW-SWING NETWORK NODE INTERFACE

The interface between the LSNI and the low-swing network node allows the LSNI to request an LDL and to receive the grant to use that resource, after an arbitration protocol that is executed in the low-swing network node and takes into account the requests coming from all four connected cores. A bidirectional data bus is also provided to allow the transfer of data between the core RAM and the low-swing network node. Figure 5 shows the signals of the interface:
- Request TX: for the LSNI to request a transmission
- Request RX: for the LSNI to request a reception
- Grant: for the low-swing network node to grant the requested link
- Destination: for the LSNI to indicate the addressed remote low-swing network node, i.e. the requested link
- Data: to transfer the data in both directions (a single bus is sufficient because a core cannot receive and send data at the same time)
- Data valid: to validate the data on the bus on the rising edge of the clock

The Data and Data valid signals run at 500MHz to simplify the design of the RAM controller, while the other signals run at 1GHz in order to speed up the reservation protocol. Figure 6 shows the request protocol for the reservation of an LDL. The Request (either TX or RX) signal is asserted when the core asks for an LDL and remains high until the end of the transfer.
The Grant signal is asserted when the resource is available for the requesting core, and it remains high for as long as the Request is high.

XII. OMNET++ SIMULATOR

We plan to model the entire system at two levels of granularity:
- message based: every transaction is a message containing all the data, delivered to the destination and representing the data transfer (including the control message exchange). At this level of granularity we speed up the simulation by neglecting the flow control, we verify that the communication pattern is deadlock free, and we obtain the expected final configuration at the end of the simulation.

Fig. 6. LDL request protocol.

- flit based: the data messages (worms) are composed of a head flit and a tail flit, without considering all the flits between them. The head flit is routed and reserves the path. The tail flit follows the same path at the same speed as the head flit, and is blocked when the head does not win a contention. This approach keeps the computation effort of the simulation low while still modelling the flow control, and makes design exploration possible. All the data are stored in and carried by the tail flit. The expected final configuration is the same because it is not affected by the flow control.

For both levels of abstraction we plan to model, besides the channels:
- the core, following the model in Figure 2
- the router, with two modules modelling the arbitration mechanism and the routing algorithm
- the low-swing network node

It is necessary to have full control of the memory. So, the data memory is represented as a HEX file, while the program memory is represented as an ASCII file containing the assembly code for the program. A traffic scenario is represented by the following files:
- 16 .TXT files, each containing the program for a core, as input of the simulator
- 16 .ASM files, each containing the translation of the corresponding .TXT file, to load into the program memories of the real chip
- 16 .HEX files, each containing the initial configuration of the data memory for a core, as input of both the simulator and the real chip
- 16 .HEX files, each containing the expected final configuration of the core data memory, as output of the simulator
- 16 .HEX files, each containing the final configuration of the core data memory, as content of the memories after the real chip activity

At the end of the test of a scenario, the expected final configuration and the final configuration must fully match.

XIII. QUANTITATIVE DESIGN DIMENSIONING

- Global clock frequency: 1GHz
- RAM clock frequency: 500MHz
- Circuit-switched network line-rate: 80Gbps
- Flit width: 18bit (16bit for data + 2bit for control)
- Packet-switched network line-rate: 16Gbps
- Memory word: 16bit
- Data bus width between LSNI and low-swing network node: 160bit
- Data bus frequency between LSNI and low-swing network node: 500MHz
- Number of interleaved RAM units: 10
- The code and the data RAM are managed and loaded separately
- Data RAM block size: TBD
- Code RAM size: TBD

These figures are mutually consistent: 10 interleaved RAM units with 16bit words give the 160bit data bus, and 160bit at 500MHz give the 80Gbps circuit-switched line-rate; likewise, 16 data bits per flit at 1GHz give the 16Gbps packet-switched line-rate.

XIV. OPEN ISSUES

The data bus in the interface between the LSNI and the low-swing network node is composed of 160 wires, with a clock frequency of 500MHz and a non-negligible length (around 0.8mm for an 8mm side chip). Those factors suggest a high power dissipation on that bus. A low-power interconnection is preferred on that interface.

The number of interleaved RAM blocks is set to 10. This makes the generation of the address less clean, because a modulo-10 counter is needed to access every single block. A number of blocks that is a power of 2 would help the design of the memory controller. To achieve that, a line rate that is a multiple of 2Gbps is needed for the LDLs, i.e. 8Gbps instead of 10Gbps per wire. In other words, it is possible to consider reducing the LDL bandwidth in order to simplify the RAM access.

The size of each memory block and of the code RAM has to be defined based on the layout constraints. The division of the memory between data and code has to be made depending on the size of the transfers and on the number of communications that should be run for a complete study of the system.

XV. TIMELINE

- Dec 7: Modelling of the LDL and low-swing network node RTL description
- Jan 8: RTL description of the entire chip for cycle-accurate simulations
- Jan 8: Simulation environment with some deadlock-free traffic patterns
- Jan 15: Automatic stochastic generator of deadlock-free traffic patterns
- Apr 15: Tapeout
