The Design and Implementation of a Low-Latency On-Chip Network

Similar documents
HARDWARE IMPLEMENTATION OF PIPELINE BASED ROUTER DESIGN FOR ON- CHIP NETWORK

Low-Power Interconnection Networks

Lecture: Interconnection Networks

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals

Hardware Design, Synthesis, and Verification of a Multicore Communications API

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Prediction Router: Yet another low-latency on-chip router architecture

Evaluating Bufferless Flow Control for On-Chip Networks

Design and Implementation of a Packet Switched Dynamic Buffer Resize Router on FPGA Vivek Raj.K 1 Prasad Kumar 2 Shashi Raj.K 3

High-Level Simulations of On-Chip Networks

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

Ultra-Fast NoC Emulation on a Single FPGA

Switching/Flow Control Overview. Interconnection Networks: Flow Control and Microarchitecture. Packets. Switching.

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

The Challenges of System Design. Raising Performance and Reducing Power Consumption

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID

A Novel Energy Efficient Source Routing for Mesh NoCs

Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson

Lecture 18: Communication Models and Architectures: Interconnection Networks

NEtwork-on-Chip (NoC) [3], [6] is a scalable interconnect

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

Cross Clock-Domain TDM Virtual Circuits for Networks on Chips

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

ISSN Vol.03,Issue.06, August-2015, Pages:

Brief Background in Fiber Optics

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip

CAD System Lab Graduate Institute of Electronics Engineering National Taiwan University Taipei, Taiwan, ROC

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background

SURVEY ON LOW-LATENCY AND LOW-POWER SCHEMES FOR ON-CHIP NETWORKS

Interconnection Networks

On-Die Interconnects for next generation CMPs

DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip

ES1 An Introduction to On-chip Networks

Packet Switch Architecture

Packet Switch Architecture

ISSN Vol.03, Issue.02, March-2015, Pages:

Design of a System-on-Chip Switched Network and its Design Support Λ

DSENT A Tool Connecting Emerging Photonics with Electronics for Opto- Electronic Networks-on-Chip Modeling Chen Sun

LOW POWER REDUCED ROUTER NOC ARCHITECTURE DESIGN WITH CLASSICAL BUS BASED SYSTEM

High Data Transmission with Shared Buffer Routers Using RoShaQ Architecture

SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK

Mapping and Configuration Methods for Multi-Use-Case Networks on Chips

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

Cluster-based approach eases clock tree synthesis

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

High Performance Interconnect and NoC Router Design

Route Packets, Not Wires: On-Chip Interconnection Networks

A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design

Noc Evolution and Performance Optimization by Addition of Long Range Links: A Survey. By Naveen Choudhary & Vaishali Maheshwari

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

DESIGN OF EFFICIENT ROUTING ALGORITHM FOR CONGESTION CONTROL IN NOC

Real-Time Mixed-Criticality Wormhole Networks

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics

Asynchronous on-chip Communication: Explorations on the Intel PXA27x Peripheral Bus

The RM9150 and the Fast Device Bus High Speed Interconnect

CHAPTER 1 INTRODUCTION

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique

Interconnection Networks

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad

4. Networks. in parallel computers. Advances in Computer Architecture

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

NoC Test-Chip Project: Working Document

A Survey of Techniques for Power Aware On-Chip Networks.

Lecture 12: SMART NoC

A Literature Review of on-chip Network Design using an Agent-based Management Method

Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques

A Router Architecture for Connection-Oriented Service Guarantees in the MANGO Clockless Network-on-Chip

Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture

SoC Design Lecture 13: NoC (Network-on-Chip) Department of Computer Engineering Sharif University of Technology

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study

From Temporal Partitioning and Temporal Placement to Algorithmic Skeletons

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

Mapping of Real-time Applications on

Interconnection Networks

Globally Asynchronous Locally Synchronous FPGA Architectures

Deadlock-free XY-YX router for on-chip interconnection network

ECE 697J Advanced Topics in Computer Networks

Modeling the Traffic Effect for the Application Cores Mapping Problem onto NoCs

Lecture 3: Flow-Control

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Reconfigurable Multicore Server Processors for Low Power Operation

SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto

SoC Design Lecture 11: SoC Bus Architectures. Shaahin Hessabi Department of Computer Engineering Sharif University of Technology

Quality-of-Service for a High-Radix Switch

Challenges for Future Interconnection Networks Hot Interconnects Panel August 24, Dennis Abts Sr. Principal Engineer

3D WiNoC Architectures

4. Hardware Platform: Real-Time Requirements

POLYMORPHIC ON-CHIP NETWORKS

DATA REUSE DRIVEN MEMORY AND NETWORK-ON-CHIP CO-SYNTHESIS *

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

SCORPIO: 36-Core Shared Memory Processor

Modeling and Simulation of System-on. Platorms. Politecnico di Milano. Donatella Sciuto. Piazza Leonardo da Vinci 32, 20131, Milano

Transcription:

The Design and Implementation of a Low-Latency On-Chip Network Robert Mullins 11 th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 24-27 th, 2006, Yokohama, Japan.

Introduction Current economic and technology scaling trends will force a step change in computing architectures and approaches to VLSI design Design methodologies will shift from computation-centric to communicationcentric ones. This talk will examine a major component of such approaches: the on-chip network 2

Economic Trends Falling chip design budgets Hardware budgets squeezed as software complexity grows Rising Non-Recoverable Engineering (NRE) costs as fabrication technologies scale Continued time to market pressures Need to reduce complexity and risk 3

Technology Scaling Trends Interconnect scales poorly Begins to dominate delay, power budgets and area Benefits of regular interconnects increase Ability to better optimise power and delay Reduced verification effort Simple to analyse, low risk Yield and reliability issues Fault tolerant design, remapping and reconfiguration Power limited designs Optimizing power boosts performance 4

5 Design Trends Systems will be continue to be composed from larger numbers of IP blocks Increasing use of coarse-grain parallelism The last remaining tool to maintain historical performance gains in a power constrained environment Economic and risk pressures are forcing designs to become increasingly programmable and general purpose Ability to map many applications to a single chip

6 Communication-Centric SoC Design Scalable communication infrastructure Regular and optimised Network eases application mapping, reuse and integration issues General purpose interconnect Network schedules compute resources: Optimises/manages power Has global view and influence Manages local thermal budgets Central to fault tolerant abilities Much more than simply a move from buses to networks Processors /DSPs ROM and Flash Memories Configurable I/Os On-Chip Network edram, SRAM and Cache Blocks Custom IP Embedded FPGAs

7 Many Challenges Application mapping Network topologies Fault-tolerant techniques System-level communication-centric power management Guaranteeing correctness in these increasingly distributed systems Low-power techniques for on-chip networks.. This talk will look at: Building low-latency on-chip routers How to clock on-chip networks

8 Introduction to On-Chip Networks All chip-wide communications are handled by an on-chip network Packet-switched network Each router contains Input buffers Routing logic Scheduling hardware Arbitration Crossbar

Virtual-Channel Flow Control 9

Synchronous Router Pipeline Router Pipeline may be many stages Increases communication latency Can make packet buffers less effective Incurs pipelining overheads 10

Speculative Router Architecture VC and switch allocation may be performed concurrently: Speculate that waiting packets will be successful in acquiring a VC Prioritize non-speculative requests over speculative ones Li-Shiuan Peh and William J. Dally, A Delay Model and Speculative Architecture for Pipelined Routers, In Proceedings HPCA 01, 2001. 11

Single Cycle Speculative Router R. D. Mullins, A. West and S. W. Moore, Low-Latency Virtual-Channel Routers for On-Chip Networks, In Proceedings ISCA 04. 12

Basic Concept Consider two extremes of operation: Multiple flits are queued waiting for access to the same output port We have all the information we need to schedule the output port accurately ahead of time No requests are outstanding for a particular output port In this case we speculate that arbitration will be unnecessary and permit any new flit to be routed to its required output immediately Easy to abort if things go wrong. Just look at newly arriving flits and the output ports they require 13

14 Optimisations To produce control signals for the next clock cycle we compute the requests (VC or switch allocation) that we know will remain In the case of the VC allocator it is important for performance that this is accurate For the switch allocator logic a better trade-off is to minimise this logic and obtain gains through reduction in cycletime

15 Results Comparison to single-cycle router without speculative optimisations 4x4 mesh network, random traffic, 4 flit (256-bit) packets

16 The LOCHSIDE Testchip UMC 0.18um Process 4x4 mesh network, 25mm 2 Single Cycle Routers (router + link = 1 clock) May be clocked by both traditional H-tree and DCG 4 virtual-channels/input 80-bit links 64-bit data + 16-bit control 250MHz (worst-case PVT) 16Gb/s/channel (~35 FO4) Approx 5M transistors TILE Traffic Generator, Debug & Test R

Clocking On-Chip Networks Challenges: Clock Distribution Issues Challenging due to networks physically distributed implementation Potentially a high-frequency clock Power and skew concerns Synchronization IP Blocks will run at many different or even adaptive clock frequencies What frequency does network run at? Interesting problem! Would like to avoid running at max. freq all the time - may not want to increase latency? 17

Data-Driven Clocking Idea: Generate the clock locally at each router Generate clock pulses only when required! Existence of data on router s input triggers new clock pulses Local calibrated delay line ensures clock frequency never exceeds router s maximum Clock is aperiodic 18

Benefits of Data-Driven Clocking Robust value safe synchronization No synchronization delay if router is quiescent Event-driven synchronous system! Benefits of asynchronous implementation but router remains fast and simple Can still exploit synchronous single-cycle router design No one single network operating frequency No global clock! Network links can be fully-asynchronous if beneficial 19

20 Data-Driven Clocking Arbitration is necessary to determine whether input data is admitted on the subsequent clock cycle or not. If there are always input requests waiting the clock will be periodic and operating at its maximum frequency

Summary Single cycle speculative routers Reduce router pipeline to single stage This provides a significant reduction in network latency Data-driven clocking for on-chip networks Removes need for global clock Network router are clocked at rate determined by traffic 21

Conclusion Current trends suggest a major shift to a communication-centric approach will be inevitable On-chip networks are one important piece of the puzzle! Continued performance gains depend on shift in design practices End of the road for evolutionary advances Cannot rely on technology alone for gains 22

23 Thank You Comments/Questions? Email: Robert.Mullins@cl.cam.ac.uk Papers, slides and tutorial at http://www.cl.cam.ac.uk/users/rdm34

Other Slides. 24

Distributed Clock Generator (DCG) Exploits self-timed circuitry to generate and distribute a clock in a distributed fashion Low-skew and lowpower solution to providing global synchrony Topology matches that of a mesh network Single Frequency clock gating? S. Fairbanks and S. Moore Self timed circuitry for global clocking, ASYNC 05 25