Hardware Design, Synthesis, and Verification of a Multicore Communications API
Benjamin Meakin and Ganesh Gopalakrishnan
University of Utah School of Computing
{meakin, ganesh}@cs.utah.edu

Abstract

Modern trends in computer architecture and semiconductor scaling are leading towards the design of chips with more and more processor cores. Highly concurrent hardware and software architectures are inevitable in future systems. One of the greatest problems in these systems is communication. Providing coherence, consistency, synchronization, and sharing of data in a multicore system requires that communication overhead be minimal. It is essential that scalable, flexible, and efficient hardware/software mechanisms be researched and developed to ease the technical community into developing concurrent systems. This research effort creates such mechanisms by designing a scalable hardware implementation of a multicore communication API. This API, developed by the Multicore Association, targets embedded devices and aims to provide communication primitives for embedded systems on chips. It is a lightweight message passing interface that offers the potential for greater communication performance and lower power than other solutions, at the expense of broad functionality. Realizing this potential is almost entirely up to the implementation. This paper describes the design, synthesis, and verification of such an implementation. The low latency, low power, and high throughput aspirations of the API are the very performance metrics this implementation seeks to optimize. This is achieved by designing an on-chip network as the physical communication medium and modifying a MIPS processor to implement an extension to the instruction set, permitting implementation of the API with direct hardware support. The result of this effort is a useful case study of a hardware design, synthesis, and verification flow for a possible implementation of an emerging multicore communication API.
It will show that such an organization yields performance comparable to current communication architectures, with greater scalability and potential for future innovation.

1. Introduction

It is widely accepted that modern and future computing systems will see performance improvements primarily through exploiting increased process/thread level parallelism. One of the main prerequisites to exploiting this parallelism efficiently is the availability of APIs that are well matched with the communication and synchronization needs of this area. Clearly, one size does not fit all. In the area of large-scale cluster computing, the Message Passing Interface (MPI), a very sophisticated API with over 300 functions, is the lingua franca. MPI is used to program cluster computers with up to hundreds of thousands of processing nodes. In other realms, such as embedded systems using commodity microprocessors, various real-time operating system primitives and shared memory threads serve the needs of communication and synchronization. However, for the rapidly exploding area of embedded systems based on multiple cores, chips not only contain multiple general purpose computing cores, but for cost effectiveness and performance also contain application specific accelerators, I/O interfaces, and memory controllers. All of these on-chip devices need low overhead communication. As semiconductors continue to scale, more and more of these devices will be found on the same chip. Instead of reinventing the wheel, it is imperative that semiconductor companies agree on a standard software API that, on one hand, offers high efficiency, but on the other hand offers the ability to build and re-use application software. The need for such a standardized API is underscored by the emergence of on-chip networks as the physical communication mechanism (as opposed to busses). Thus the standard API must be able to mimic the functionality offered by MPI and threads, but in a
much more lightweight manner, and in a manner that meshes well with the existing (bus-based) and emerging (network-based) hardware transport mechanisms. This paper describes an effort to merge these two trends: sophisticated transport mechanisms in hardware, driven by reusable, standard-API-based high performance software. This effort is part of the emerging Multicore Communication API (MCAPI), led by over two dozen MCA member companies. There are also a few university members in the MCA, including our own research group. Our primary goal in joining the MCA (by paying an annual membership fee) has been to observe the creation of MCAPI at its early stages, understand the motivations behind it, and develop meaningful formal methods solutions for this area. This paper describes our efforts to understand the hardware design of MCAPI thoroughly (a companion paper describes our efforts to thoroughly understand the software formal verification needs of this area). Our approach was similar to the plea "get real; get physical": one could endlessly tout the virtues of MCAPI, but without an actual implementation of MCAPI in silicon, one cannot assess its true merits. This paper represents the following ambitious journey embarked on by one graduate student and his advisor: design a nine-MIPS-core MCAPI fabric on FPGA; modify the MIPS core ISA to support MCAPI primitives efficiently; demo this subsystem and release it in the public domain so that everyone in the community can play with an actual MCAPI in silicon. Thereafter, work on several follow-on projects: (i) write real MCAPI applications in C, compile them, and run them on our target; (ii) write a Bluespec specification for our architecture and re-derive our hardware; and (iii) apply IBM's SixthSense tools to formally verify our hand-designed MCAPI in silicon. (Note: thanks to IBM, we are, so far as we know, the only university to have a license for IBM's SixthSense tools.
It should be possible for other universities to benefit from the results IBM observes happening vis-à-vis SixthSense in our group. Also, our work is being presented as a poster at the Multicore Expo in Spring 2009.)

MCAPI is a lightweight message passing specification targeted towards embedded SoCs. It has been shown in [2, 3, 8], among many other publications, that on-chip networks are a necessary direction in parallel architectures to circumvent the issues associated with rapidly increasing wire delays and increasing latency due to contention for shared buses. Any future concurrent system should have some sort of scalable interconnection network. This is the main hardware design focus of this work. A high-throughput, low-latency on-chip network has been designed with MCAPI and embedded applications in mind. This design is described in section 3, and the communication performance it achieves in section 5. The phases of our long-term work are the following. First, we offer a detailed assessment of MCAPI by providing the first public domain design (in FPGA) of an MCAPI-based communication architecture. This design consists of nine modified MIPS cores connected through a wormhole-routed NoC fabric. The second contribution will be to formally verify parts of our implementation using the IBM SixthSense tools. The third contribution will be to re-derive parts of our design using the Bluespec language and compilation system. Because of the scale of these steps, what we have concretely achieved consists of the first of these three phases.

2. Multicore API Implementation

2.1. MCAPI Overview

The Multicore Communication API is a message passing interface that is similar to MPI. However, it is designed primarily for embedded devices, where broad functionality is not as important as high performance in a few types of communication.
MCAPI provides the communication primitives that can be used by operating systems, libraries, and applications to improve code portability across different hardware generations. Since the API is designed for on-chip communication, it makes few assumptions about the hardware architecture and leaves a lot of freedom for the implementation to take advantage of whatever optimizations may be permitted through the architecture. For example, if there is shared memory available, transmission of data can be implemented as pointer passing. This eliminates unnecessary copies that may be required by other solutions. There are two types of communication defined in MCAPI. The first is connected packet and scalar channels. These channels require that the user define two endpoints and a communication link between them. Data can then be sent on this connection with very high throughput. The second type of communication is connectionless messages. These messages do not require a connection between two
endpoints to be established; they can be sent to any other endpoint. However, they require greater overhead to transmit, so throughput is not as good. While the aspirations of the API sound impressive, they are largely dependent on the implementation to realize the potential.

2.2. MIPS ISA Extension

The core functionality of the implementation described in this paper is controllable via a set of RISC-type assembly instructions added as an extension to the MIPS instruction set. These instructions are given in figure 1. The decision to make the control of the communication hardware programmable is based on an effort to avoid over-complicating the hardware.

2.3. Example Send/Receive

With these added instructions, MCAPI can be implemented as a C library with inline assembly code. Example implementations of MCAPI message send and receive functions are given in figures 2 and 3, respectively. For brevity, some error checking code has been omitted. Note that the code size of these functions is relatively small. This is critical to minimizing the memory footprint of the library implementation such that it is suitable for embedded applications.

Fig. 1

Using these instructions, all of the communication functionality of MCAPI can be implemented. The send header instruction builds the packet header and sends it on the network. It includes source and destination node identifiers, as well as a packet class that indicates the type of data being sent (i.e., pointer/buffer, short, integer, or long). The receive header instruction subsequently gets the packet header and writes it to a register. The header data can then be parsed using bit mask and shift instructions (standard in the MIPS ISA). The send/receive word instructions send or receive word-length chunks of data. The get ID and get flag instructions write the local node ID and the specified network flag to registers, respectively.
Available network flags are described in detail in the next section, but they include the necessary information for determining when a packet is available, when the network is busy, and some simple error checking.

Fig. 2

Both send and receive functions can be expected to return very quickly. Since the underlying hardware supports zero-copy data transfers through pointer passing, the send function is very fast. The receive function contains loops to check for data being available. These are mostly to ensure correctness, since it is expected that a user would call mcapi_msg_available before calling the receive function.
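The structure of these send/receive functions can be sketched in plain C. The real library wraps the custom MIPS instructions in inline assembly; in the sketch below those instructions are stood in for by niu_* helpers backed by a software FIFO, purely so the control flow can be shown and exercised. The helper names, the FIFO, and the header layout are illustrative assumptions, not the actual XUM code from figures 2 and 3.

```c
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 64              /* power of two for cheap wraparound */
static uint16_t fifo[FIFO_DEPTH];
static size_t fifo_head, fifo_tail;

/* Stand-ins for the "send header" / "send word" instructions. */
static void niu_send_header(uint16_t hdr) { fifo[fifo_tail++ & (FIFO_DEPTH - 1)] = hdr; }
static void niu_send_word(uint16_t w)     { fifo[fifo_tail++ & (FIFO_DEPTH - 1)] = w; }

/* Stand-ins for the flag register and "receive header" / "receive word". */
static int      niu_flit_available(void) { return fifo_head != fifo_tail; }
static uint16_t niu_recv(void)           { return fifo[fifo_head++ & (FIFO_DEPTH - 1)]; }

/* Send: build and emit the head flit, then stream the body flits. */
static size_t mcapi_msg_send(unsigned dest_node, const uint16_t *buf, size_t nflits)
{
    niu_send_header((uint16_t)((dest_node & 0xF) << 12));  /* assumed header layout */
    for (size_t i = 0; i < nflits; i++)
        niu_send_word(buf[i]);
    return nflits;
}

/* Receive: check for a header, then drain body flits (cf. the polling loops
 * in figure 3; mcapi_msg_available would normally be called first). */
static size_t mcapi_msg_recv(uint16_t *buf, size_t max)
{
    size_t n = 0;
    if (!niu_flit_available())
        return 0;                  /* no header: nothing to receive */
    (void)niu_recv();              /* consume the head flit */
    while (niu_flit_available() && n < max)
        buf[n++] = niu_recv();
    return n;
}
```

The small, loop-light shape of these functions is what keeps the library's memory footprint suitable for embedded targets.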
3. On-Chip Network Design

3.1. Network Topology

As the physical communication medium, a two-dimensional grid network was designed to meet the data transport needs of MCAPI. Grid network topologies have several properties which make them attractive for this type of application. First, they are highly scalable. For an N x N grid network, the number of cores is N^2. The worst-case communication latency in network hops is linear, following a curve of about 2N. Compare this to a bus, where the worst-case latency for communication is equal to the number of cores, given a fair arbitration scheme. So with respect to the number of cores, grid networks have sub-linear scalability for performance.

This reasoning does not even consider the second advantage of grid networks, which is short wire lengths. Increasing wire delays are becoming a major problem in designing chips in modern process technologies [4]. Grid networks partition communication paths such that adding more cores to a design will not increase the length of the wires.

Fig. 4 Example NoC Layout

Figure 4 shows an example network layout similar to what has been designed here. The only differences are that this diagram shows accelerators, I/O interfaces, and L2 caches, illustrating the heterogeneous nature of systems that may use this type of MCAPI implementation. Note that the tiled layout of this network physically distributes the L2 cache, but logically the L2 cache is shared. This architecture has been proposed in [5].

3.2. Wormhole Router with Virtual Channels

The key hardware component in a scalable interconnect is an on-chip router. There are many different types of routers, but wormhole routers utilizing virtual channels have been shown in [6] to be efficient designs for on-chip networks. In a wormhole router, packets are divided into flits, and flits are passed through the network in a pipelined fashion. This means that routers only need buffer space for a small portion of a packet. This saves
power and chip area, which are critical design metrics in an embedded chip. MCAPI packets are divided into flits as shown in figure 5. The head flit consists of a destination node and port ID, a packet class, and a sender ID. In general, each core is a node. The port ID is included for future extensions which may support multiple endpoints per node. The sender ID is necessary so that the destination can determine where the packet came from. Different packet classes are used to implement the functionality of MCAPI. Since data can be sent as scalar values of various bit widths and as pointers, these packet classes tell the user how to interpret the received data. It will be shown through the description of the implementation that opening a channel, and thus reserving network resources, is as simple as sending a header flit. Therefore, packet and scalar channels are implemented with the same instructions used to send connectionless messages. The only difference is that channel communication can consist of an arbitrary number of packets. It will be shown that this can negatively affect overall network performance.

Fig. 5

The on-chip router that has been designed for this implementation resembles the block diagram in figure 6. The key features include: 5 input and output channels, each with two virtual channels; a single-cycle 16-bit data path; fair round-robin channel arbitration; and deadlock-free dimension-order routing. The flow control of the router is a simple token scheme where each VC has buffer space for two flits. Once a VC has a flit, it sets its token signal high to stall the pipeline. One flit's worth of buffer space is insufficient because another flit may have already been sent at the same time the token gets set high. To prevent buffer overflow, two buffer slots are needed in this single-cycle design.

Fig. 6 Router Block Diagram

The key to permitting high throughput in a wormhole routing scheme is virtual channels.
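The head-flit fields and the mask-and-shift parsing mentioned in section 2.3 can be made concrete with a small sketch. The bit widths below are assumptions chosen to fit the 16-bit data path; the actual XUM encoding in figure 5 may differ.

```c
#include <stdint.h>

/* Hypothetical 16-bit head-flit layout (field widths are assumptions):
 *   [15:12] destination node   [11:8] destination port
 *   [7:4]   packet class       [3:0]  sender node
 */
enum packet_class { PC_PTR = 0, PC_SHORT = 1, PC_INT = 2, PC_LONG = 3 };

static uint16_t pack_head_flit(unsigned dest, unsigned port,
                               unsigned pclass, unsigned sender)
{
    return (uint16_t)(((dest   & 0xF) << 12) |
                      ((port   & 0xF) <<  8) |
                      ((pclass & 0xF) <<  4) |
                      ( sender & 0xF));
}

/* The receive side parses the header register with the same mask-and-shift
 * idiom the paper describes for the standard MIPS instructions. */
static unsigned head_dest(uint16_t f)   { return (f >> 12) & 0xF; }
static unsigned head_port(uint16_t f)   { return (f >>  8) & 0xF; }
static unsigned head_class(uint16_t f)  { return (f >>  4) & 0xF; }
static unsigned head_sender(uint16_t f) { return f & 0xF; }
```

Four-bit node fields comfortably cover the 9-node system described here while leaving room for the per-node endpoint extension the port ID anticipates.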
Flits of a packet can be strung out all across the network. Since only the header flit contains routing information, the body flits must follow immediately behind the header. This causes other packets contending for the same channel to be stalled. It is for this reason that packet and scalar channels can negatively impact overall network performance, since they have arbitrary lengths of data. Virtual channels allow these packets to continue moving through the network by allocating resources to packets at the buffer level, not at the physical channel level. Since there are two VCs per physical channel, each node may only open a single packet or scalar channel at a time. This helps ensure that a VC will usually be available and that there is an upper bound on how long a packet may be stalled at each hop. In this design, the decision of which virtual channel to use for a packet is determined by token inputs and by saturating counters that track network traffic on each physical channel. Decisions about which VC to use can only be made for the header flit; subsequent flits must follow the header. Therefore, if two packets are sequentially sent across the same physical link, the second packet will choose the opposite VC from the first. This balances network load and improves average-case throughput and latency.
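The alternating VC-selection policy just described can be sketched as a small state machine. This is an illustrative software model of the policy (token inputs plus "opposite VC from the last packet"), not the synthesized allocation logic, which also consults saturating traffic counters.

```c
#include <stdint.h>

/* Per-physical-channel VC selection for head flits. A token of 1 means that
 * VC's two-flit buffer at the next hop is full. When both VCs are free, the
 * packet takes the opposite VC from the previous packet on this link, which
 * balances load across the two virtual channels. */
struct vc_select {
    uint8_t last_vc;   /* VC chosen for the previous packet on this link */
};

/* Returns 0 or 1 for the VC to use, or -1 if the header must stall. */
static int choose_vc(struct vc_select *s, int token0, int token1)
{
    int tok[2] = { token0, token1 };
    int prefer = s->last_vc ^ 1;               /* opposite of previous choice */
    if (!tok[prefer])     { s->last_vc = (uint8_t)prefer;       return prefer; }
    if (!tok[prefer ^ 1]) { s->last_vc = (uint8_t)(prefer ^ 1); return prefer ^ 1; }
    return -1;                                 /* both VCs busy: stall header */
}
```

Only head flits run this decision; body and tail flits inherit the VC assigned to their header, as the text requires.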
The arbitration process is fairly simple, and is based on the techniques described in [6]. In the first phase, the router data path computes the routing direction, latches the data, and sends a request to the arbiter for the output channel. In the second phase, the arbiter sends a grant signal to the VC and sets the appropriate control signals for the crossbar, which forwards the data to the next router. When two flits want to use the same output channel, a decision is made based on the value of a ring counter. The goal of the arbiter is to forward as many flits as possible every clock cycle. The router module is the main bottleneck in system clock speed. The decision to use a single-cycle design for the router is part of an effort to achieve low latency, but it cannot be clocked as fast as a pipelined router. However, because MCAPI is targeted towards embedded designs, this seemed to be a worthwhile trade-off, since embedded chips have lower clock rates. Research in [7] demonstrates several techniques for decreasing single-cycle clock time in routers by taking the routing function and arbitration process off of the critical path. This further justifies the use of a single-cycle router. However, it is very important for overall system clock speed that there are no combinational paths through the router. This ensures that no signal propagates from one network node to another without being latched. If this were allowed to happen, the system clock period would be the time taken to traverse multiple network hops.

3.3. Network Interface Unit

The network interface unit (NIU) provides the necessary functionality to easily interface a modified MIPS processor to the grid network. A block diagram is given in figure 7.

Fig. 7 NIU Block Diagram

The NIU consists of send and receive modules, send and receive buffers, and a register containing operation flags.
The send module builds flits and sends them on the network according to the bus inputs from the processor and the opcode. It also makes an initial decision about which VC to send the packet on. The receive module observes the front of each VC receive buffer, using state transitions to determine whether the available data is a header, body, or tail flit. It then removes buffer items when the appropriate opcode is seen. The flags that are set include send and receive port busy flags, header and data available flags, send and receive error flags, and a network busy flag. These flags provide the necessary information to the user for detecting buffer overflow and network overload, and for determining when a message is available.

4. Synthesis and Verification

4.1. Synthesis for Virtex-5

The target platform for this implementation is a Xilinx Virtex-5 FPGA. Programmable logic is used so that the designs here can be used in other research efforts exploring multicore system innovations. The synthesis results in terms of device utilization and performance are given in table 1 for each module and for an entire 9-core system laid out similarly to the design in figure 4. Note that the complete system excludes L2 cache but does include 16 B of L1 cache. Future extensions may include shared L2 cache, because the total device utilization for the existing 9-core system is only about 50%.

Table 1. Synthesis Results
Module       LUTs    Registers    Clk Rate
MIPS Core                         MHz
NIU                               MHz
Router                            MHz
9-Core NoC                        MHz

4.2. Testing Methodology

Testing is performed by running test programs and observing the output waveforms in a VHDL simulator. An assembler has been created for this system, and an example program has been run successfully. However, this test program did not stress the network. It involved a simple producer/consumer communication pattern between two MIPS cores while the remaining cores were idle. It
is important that more extensive applications be used to accurately observe the average-case communication latency. Formal verification of critical hardware components is also important, but at the time of this writing no significant formal verification has been done on these designs.

5. Communication Performance

The best- and worst-case communication performance in terms of clock cycles can be summarized by equations 1 and 2. These represent the latency for the header flit. Subsequent flits follow directly behind the header.

Best Case:  L = H + k        (1)
Worst Case: L = 5*H*S + k    (2)

In both equations, L is the latency in cycles, H is the number of hops, S is the average size in flits of other packets, and k is a constant derived from other non-communication instructions in the implementation. From initial tests of a lightly loaded network, the latency for a single flit traveling across 3 hops is 9 cycles. This represents the best-case latency. Due to the design characteristics of the network, the worst-case latency is very unlikely: for it to occur, 5 packets would need to arrive at each router at the same time, and at each router the flit being sent would have to be last on the arbitration schedule. It has been observed that the average-case latency is much closer to the best-case latency. For connected packet and scalar channels, the body packets achieve best-case latency. Throughput is also increased because there is much less header/tail overhead, and since network resources are reserved when the channel is opened, a flit can progress through the connection every clock cycle. At 130 MHz, the throughput of a channel connection is about 260 MB/s, because each body flit is 2 bytes.

6. Conclusions and Future Work

The work shown here demonstrates the viability and usefulness of hardware support for inter-core communication. Even in an FPGA implementation, low communication latency and high throughput (260 MB/s) are possible.
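For reference, the performance model from section 5 can be checked numerically. The constants come from the paper's own measurements: best case L = H + k with a measured 9 cycles over 3 hops implies k = 6, and one 2-byte body flit per cycle at 130 MHz gives the quoted 260 MB/s. The helper functions below are just this arithmetic, not part of the design.

```c
/* Best case (eq. 1): L = H + k. */
static int best_case_latency(int hops, int k) { return hops + k; }

/* Worst case (eq. 2): L = 5*H*S + k, where S is the average packet size
 * in flits and 5 is the router's channel count. */
static int worst_case_latency(int hops, int avg_pkt_flits, int k)
{
    return 5 * hops * avg_pkt_flits + k;
}

/* Channel throughput: one flit per cycle once the connection is open. */
static int channel_throughput_mb_s(int clk_mhz, int flit_bytes)
{
    return clk_mhz * flit_bytes;
}

/* Corner-to-corner hop count in an N x N mesh under dimension-order
 * routing: 2*(N-1), i.e. the "about 2N" curve from section 3.1. */
static int worst_case_hops(int n) { return 2 * (n - 1); }
```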
Current research in on-chip network design provides solutions for minimizing cost and power while improving performance. These innovations will continue and will further justify the implementation of communication APIs in hardware. What has been presented here is an efficient and scalable concurrent computing platform with greater potential for future innovation than existing solutions. Several immediate directions for future work related to this project include: the evaluation of IBM's SixthSense VHDL model checker and Bluespec's automated design tools using the hardware designs presented in this paper; and the development of automated network synthesis and optimization algorithms for MCAPI workloads. Evaluation of SixthSense has already begun. A tutorial for using the tool has been created and is available in [11]. The key units in this design that need formal verification are the arbitration and VC allocation units of the router module, because it is difficult to create workloads that will stress the network. For correctness, it must be verified that a packet will never be sent on a VC being used by another packet until that packet releases the resource. This is a perfect case study for hardware verification because it tests a situation that traditional test vectors would be unlikely to catch.

7. Acknowledgments

This work has been funded by SRC task ID

References

[1] Multicore Association, Communication API Specification V1.063.
[2] M. Ali, M. Welzl, M. Zwicknagl, "Networks on Chips: Scalable Interconnects for Future Systems on Chips," IEEE.
[3] R. Das, S. Eachempati, A. K. Mishra, V. Narayanan, C. R. Das, "Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs," HPCA.
[4] R. Ho, K. Mai, M. Horowitz, "The Future of Wires," Proceedings of the IEEE, April 2001.
[5] N. Hardavellas, M. Ferdman, B. Falsafi, A. Ailamaki, "R-NUCA: Data Placement in Distributed Shared Caches," ISCA.
[6] E. Shin, V. Mooney, G. Riley, "Round-Robin Arbiter Design and Generation," ISSS 2002.
[7] R. Mullins, A. West, S. Moore, "Low-Latency Virtual-Channel Routers for On-Chip Networks," ISCA.
[8] W. Dally, B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," Proceedings of the IEEE Design Automation Conference, 2001.
[9] V. Dvorak, "Communication Performance of Mesh and Ring Based NoCs," 7th International Conference on Networking, 2008.
[10] I. Nousias, T. Arslan, "Wormhole Routing with Virtual Channels using Adaptive Rate Control for Network-on-Chip (NoC)," Proceedings of the 1st NASA/ESA Conference on Adaptive Hardware and Systems, 2006.
[11] B. Meakin, SixthSense Tutorial, p/sixthsense.
The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo
More informationOpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel
OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel Hyoukjun Kwon and Tushar Krishna Georgia Institute of Technology Synergy Lab (http://synergy.ece.gatech.edu) hyoukjun@gatech.edu April
More informationFault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson
Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies Mohsin Y Ahmed Conlan Wesson Overview NoC: Future generation of many core processor on a single chip
More informationNetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013
NetSpeed ORION: A New Approach to Design On-chip Interconnects August 26 th, 2013 INTERCONNECTS BECOMING INCREASINGLY IMPORTANT Growing number of IP cores Average SoCs today have 100+ IPs Mixing and matching
More informationLecture 7: Flow Control - I
ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 7: Flow Control - I Tushar Krishna Assistant Professor School of Electrical
More informationHARDWARE IMPLEMENTATION OF PIPELINE BASED ROUTER DESIGN FOR ON- CHIP NETWORK
DOI: 10.21917/ijct.2012.0092 HARDWARE IMPLEMENTATION OF PIPELINE BASED ROUTER DESIGN FOR ON- CHIP NETWORK U. Saravanakumar 1, R. Rangarajan 2 and K. Rajasekar 3 1,3 Department of Electronics and Communication
More informationDESIGN OF EFFICIENT ROUTING ALGORITHM FOR CONGESTION CONTROL IN NOC
DESIGN OF EFFICIENT ROUTING ALGORITHM FOR CONGESTION CONTROL IN NOC 1 Pawar Ruchira Pradeep M. E, E&TC Signal Processing, Dr. D Y Patil School of engineering, Ambi, Pune Email: 1 ruchira4391@gmail.com
More informationMinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect
MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin, Greg Nazario, Xiangyao Yu*, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu Carnegie Mellon University *CMU
More informationDesign of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques
Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques Nandini Sultanpure M.Tech (VLSI Design and Embedded System), Dept of Electronics and Communication Engineering, Lingaraj
More informationDesign and Implementation of Buffer Loan Algorithm for BiNoC Router
Design and Implementation of Buffer Loan Algorithm for BiNoC Router Deepa S Dev Student, Department of Electronics and Communication, Sree Buddha College of Engineering, University of Kerala, Kerala, India
More informationECE/CS 757: Advanced Computer Architecture II Interconnects
ECE/CS 757: Advanced Computer Architecture II Interconnects Instructor:Mikko H Lipasti Spring 2017 University of Wisconsin-Madison Lecture notes created by Natalie Enright Jerger Lecture Outline Introduction
More informationPacket Switch Architecture
Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.
More informationPacket Switch Architecture
Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.
More informationA VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK
A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER A Thesis by SUNGHO PARK Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements
More informationA Survey of Techniques for Power Aware On-Chip Networks.
A Survey of Techniques for Power Aware On-Chip Networks. Samir Chopra Ji Young Park May 2, 2005 1. Introduction On-chip networks have been proposed as a solution for challenges from process technology
More informationJoint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals
Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals Philipp Gorski, Tim Wegner, Dirk Timmermann University
More informationBasic Low Level Concepts
Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock
More informationEfficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip
ASP-DAC 2010 20 Jan 2010 Session 6C Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip Jonas Diemer, Rolf Ernst TU Braunschweig, Germany diemer@ida.ing.tu-bs.de Michael Kauschke Intel,
More informationNetwork-on-Chip Architecture
Multiple Processor Systems(CMPE-655) Network-on-Chip Architecture Performance aspect and Firefly network architecture By Siva Shankar Chandrasekaran and SreeGowri Shankar Agenda (Enhancing performance)
More informationLecture 22: Router Design
Lecture 22: Router Design Papers: Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO 03, Princeton A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip
More informationInterconnection Networks: Flow Control. Prof. Natalie Enright Jerger
Interconnection Networks: Flow Control Prof. Natalie Enright Jerger Switching/Flow Control Overview Topology: determines connectivity of network Routing: determines paths through network Flow Control:
More informationISSN Vol.03, Issue.02, March-2015, Pages:
ISSN 2322-0929 Vol.03, Issue.02, March-2015, Pages:0122-0126 www.ijvdcs.org Design and Simulation Five Port Router using Verilog HDL CH.KARTHIK 1, R.S.UMA SUSEELA 2 1 PG Scholar, Dept of VLSI, Gokaraju
More informationInterconnection Networks: Topology. Prof. Natalie Enright Jerger
Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design
More informationDynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers
Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Young Hoon Kang, Taek-Jun Kwon, and Jeff Draper {youngkan, tjkwon, draper}@isi.edu University of Southern California
More informationFlow Control can be viewed as a problem of
NOC Flow Control 1 Flow Control Flow Control determines how the resources of a network, such as channel bandwidth and buffer capacity are allocated to packets traversing a network Goal is to use resources
More informationLecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks
Lecture: Transactional Memory, Networks Topics: TM implementations, on-chip networks 1 Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock
More informationApplying the Benefits of Network on a Chip Architecture to FPGA System Design
white paper Intel FPGA Applying the Benefits of on a Chip Architecture to FPGA System Design Authors Kent Orthner Senior Manager, Software and IP Intel Corporation Table of Contents Abstract...1 Introduction...1
More informationSoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik
SoC Design Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik Chapter 5 On-Chip Communication Outline 1. Introduction 2. Shared media 3. Switched media 4. Network on
More informationDesign and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA
Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Maheswari Murali * and Seetharaman Gopalakrishnan # * Assistant professor, J. J. College of Engineering and Technology,
More informationSwizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems
1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna
More informationSONA: An On-Chip Network for Scalable Interconnection of AMBA-Based IPs*
SONA: An On-Chip Network for Scalable Interconnection of AMBA-Based IPs* Eui Bong Jung 1, Han Wook Cho 1, Neungsoo Park 2, and Yong Ho Song 1 1 College of Information and Communications, Hanyang University,
More informationEECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects
1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip
More informationOn RTL to TLM Abstraction to Benefit Simulation Performance and Modeling Productivity in NoC Design Exploration
On to TLM Abstraction to Benefit Simulation Performance and Modeling Productivity in NoC Design Exploration Sven Alexander Horsinka, Rolf Meyer, Jan Wagner, Rainer Buchty and Mladen Berekovic TU Braunschweig,
More informationLOW POWER REDUCED ROUTER NOC ARCHITECTURE DESIGN WITH CLASSICAL BUS BASED SYSTEM
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 5, May 2015, pg.705
More informationFast Flexible FPGA-Tuned Networks-on-Chip
This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Fast Flexible FPGA-Tuned Networks-on-Chip Michael K. Papamichael, James C. Hoe
More informationTDT Appendix E Interconnection Networks
TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages
More informationModule 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth
Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012
More informationA Novel Energy Efficient Source Routing for Mesh NoCs
2014 Fourth International Conference on Advances in Computing and Communications A ovel Energy Efficient Source Routing for Mesh ocs Meril Rani John, Reenu James, John Jose, Elizabeth Isaac, Jobin K. Antony
More informationDesign of Synchronous NoC Router for System-on-Chip Communication and Implement in FPGA using VHDL
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
More informationOASIS NoC Architecture Design in Verilog HDL Technical Report: TR OASIS
OASIS NoC Architecture Design in Verilog HDL Technical Report: TR-062010-OASIS Written by Kenichi Mori ASL-Ben Abdallah Group Graduate School of Computer Science and Engineering The University of Aizu
More informationDesign and Simulation of Router Using WWF Arbiter and Crossbar
Design and Simulation of Router Using WWF Arbiter and Crossbar M.Saravana Kumar, K.Rajasekar Electronics and Communication Engineering PSG College of Technology, Coimbatore, India Abstract - Packet scheduling
More informationA Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing
727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni
More informationA unified multicore programming model
A unified multicore programming model Simplifying multicore migration By Sven Brehmer Abstract There are a number of different multicore architectures and programming models available, making it challenging
More informationCS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011
CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 9 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationQuality-of-Service for a High-Radix Switch
Quality-of-Service for a High-Radix Switch Nilmini Abeyratne, Supreet Jeloka, Yiping Kang, David Blaauw, Ronald G. Dreslinski, Reetuparna Das, and Trevor Mudge University of Michigan 51 st DAC 06/05/2014
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationPerformance of Multihop Communications Using Logical Topologies on Optical Torus Networks
Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,
More informationQuest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling
Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Bhavya K. Daya, Li-Shiuan Peh, Anantha P. Chandrakasan Dept. of Electrical Engineering and Computer
More informationMinRoot and CMesh: Interconnection Architectures for Network-on-Chip Systems
MinRoot and CMesh: Interconnection Architectures for Network-on-Chip Systems Mohammad Ali Jabraeil Jamali, Ahmad Khademzadeh Abstract The success of an electronic system in a System-on- Chip is highly
More informationEC 513 Computer Architecture
EC 513 Computer Architecture On-chip Networking Prof. Michel A. Kinsy Virtual Channel Router VC 0 Routing Computation Virtual Channel Allocator Switch Allocator Input Ports VC x VC 0 VC x It s a system
More informationCS/COE1541: Intro. to Computer Architecture
CS/COE1541: Intro. to Computer Architecture Multiprocessors Sangyeun Cho Computer Science Department Tilera TILE64 IBM BlueGene/L nvidia GPGPU Intel Core 2 Duo 2 Why multiprocessors? For improved latency
More informationLecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel
More informationInterconnection Networks
Lecture 15: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Credit: some slides created by Michael Papamichael, others based on slides from Onur Mutlu
More informationMeet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors
Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors Sandro Bartolini* Department of Information Engineering, University of Siena, Italy bartolini@dii.unisi.it
More informationA NEW ROUTER ARCHITECTURE FOR DIFFERENT NETWORK- ON-CHIP TOPOLOGIES
A NEW ROUTER ARCHITECTURE FOR DIFFERENT NETWORK- ON-CHIP TOPOLOGIES 1 Jaya R. Surywanshi, 2 Dr. Dinesh V. Padole 1,2 Department of Electronics Engineering, G. H. Raisoni College of Engineering, Nagpur
More informationSoC Design Lecture 13: NoC (Network-on-Chip) Department of Computer Engineering Sharif University of Technology
SoC Design Lecture 13: NoC (Network-on-Chip) Department of Computer Engineering Sharif University of Technology Outline SoC Interconnect NoC Introduction NoC layers Typical NoC Router NoC Issues Switching
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationThomas Moscibroda Microsoft Research. Onur Mutlu CMU
Thomas Moscibroda Microsoft Research Onur Mutlu CMU CPU+L1 CPU+L1 CPU+L1 CPU+L1 Multi-core Chip Cache -Bank Cache -Bank Cache -Bank Cache -Bank CPU+L1 CPU+L1 CPU+L1 CPU+L1 Accelerator, etc Cache -Bank
More informationLecture 23: Router Design
Lecture 23: Router Design Papers: A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks, ISCA 06, Penn-State ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip
More informationLecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control
Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control 1 Topology Examples Grid Torus Hypercube Criteria Bus Ring 2Dtorus 6-cube Fully connected Performance Bisection
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationBrief Background in Fiber Optics
The Future of Photonics in Upcoming Processors ECE 4750 Fall 08 Brief Background in Fiber Optics Light can travel down an optical fiber if it is completely confined Determined by Snells Law Various modes
More informationLecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels
Lecture: Interconnection Networks Topics: TM wrap-up, routing, deadlock, flow control, virtual channels 1 TM wrap-up Eager versioning: create a log of old values Handling problematic situations with a
More informationNoc Evolution and Performance Optimization by Addition of Long Range Links: A Survey. By Naveen Choudhary & Vaishali Maheshwari
Global Journal of Computer Science and Technology: E Network, Web & Security Volume 15 Issue 6 Version 1.0 Year 2015 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals
More informationArchitecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC
BWCCA 2010 Fukuoka, Japan November 4-6 2010 Architecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC Akram Ben Ahmed, Abderazek Ben Abdallah, Kenichi Kuroda The University of Aizu
More informationInterconnect Technology and Computational Speed
Interconnect Technology and Computational Speed From Chapter 1 of B. Wilkinson et al., PARAL- LEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers, augmented
More informationA Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on
A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on on-chip Donghyun Kim, Kangmin Lee, Se-joong Lee and Hoi-Jun Yoo Semiconductor System Laboratory, Dept. of EECS, Korea Advanced
More informationRe-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs
This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs
More information