Lecture 22: Router Design

Similar documents
Lecture 15: NoC Innovations. Today: power and performance innovations for NoCs

Lecture 23: Router Design

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture: Interconnection Networks

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID

Switching/Flow Control Overview. Interconnection Networks: Flow Control and Microarchitecture. Packets. Switching.

A Survey of Techniques for Power Aware On-Chip Networks.

Prediction Router: Yet another low-latency on-chip router architecture

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background

Packet Switch Architecture

Packet Switch Architecture

Phastlane: A Rapid Transit Optical Routing Network

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control

Interconnection Networks

Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Interconnection Networks

Network-on-chip (NOC) Topologies

Lecture 18: Communication Models and Architectures: Interconnection Networks

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

CS575 Parallel Processing

Ultra-Fast NoC Emulation on a Single FPGA

Flow Control can be viewed as a problem of

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1

Lecture 3: Flow-Control

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control

A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK

Express Virtual Channels: Towards the Ideal Interconnection Fabric

Interconnect Technology and Computational Speed

ES1 An Introduction to On-chip Networks

Lecture 3: Topology - II

Low-Power Interconnection Networks

Evaluating Bufferless Flow Control for On-Chip Networks

A thesis presented to. the faculty of. In partial fulfillment. of the requirements for the degree. Master of Science. Yixuan Zhang.

Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip

ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

Basic Low Level Concepts

Lecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels

4. Networks. in parallel computers. Advances in Computer Architecture

Interconnection Networks

Brief Background in Fiber Optics

EC 513 Computer Architecture

in Oblivious Routing

udirec: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

Lecture 7: Flow Control - I

EECS 570 Final Exam - SOLUTIONS Winter 2015

ACCELERATING COMMUNICATION IN ON-CHIP INTERCONNECTION NETWORKS. A Dissertation MIN SEON AHN

EECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal

Early Transition for Fully Adaptive Routing Algorithms in On-Chip Interconnection Networks

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

ECE 669 Parallel Computer Architecture

TDT Appendix E Interconnection Networks

Randomized Partially-Minimal Routing: Near-Optimal Oblivious Routing for 3-D Mesh Networks

A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin

NETWORK-ON-CHIPS (NoCs) [1], [2] design paradigm

Tackling Permanent Faults in the Network-on-Chip Router Pipeline

Asynchronous Bypass Channel Routers

[ ] In earlier lectures, we have seen that switches in an interconnection network connect inputs to outputs, usually with some kind buffering.

Basic Switch Organization

NOC: Networks on Chip SoC Interconnection Structures

Deadlock-free XY-YX router for on-chip interconnection network

CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION. Stephen Chui Bachelor of Engineering Ryerson University, 2012.

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

EE/CSCI 451: Parallel and Distributed Computation

PRIORITY BASED SWITCH ALLOCATOR IN ADAPTIVE PHYSICAL CHANNEL REGULATOR FOR ON CHIP INTERCONNECTS. A Thesis SONALI MAHAPATRA

548 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 4, APRIL 2011

NOW Handout Page 1. Outline. Networks: Routing and Design. Routing. Routing Mechanism. Routing Mechanism (cont) Properties of Routing Algorithms

Chapter 1 Bufferless and Minimally-Buffered Deflection Routing

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

18-740/640 Computer Architecture Lecture 16: Interconnection Networks. Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 11/4/2015

Overlaid Mesh Topology Design and Deadlock Free Routing in Wireless Network-on-Chip. Danella Zhao and Ruizhe Wu Presented by Zhonghai Lu, KTH

OpenSMART: An Opensource Singlecycle Multi-hop NoC Generator

ISSN Vol.03,Issue.06, August-2015, Pages:

DESIGN, IMPLEMENTATION AND EVALUATION OF A CONFIGURABLE. NoC FOR AcENoCS FPGA ACCELERATED EMULATION PLATFORM. A Thesis SWAPNIL SUBHASH LOTLIKAR

Design of a high-throughput distributed shared-buffer NoC router

Lecture 2: Topology - I

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms

Challenges for Future Interconnection Networks Hot Interconnects Panel August 24, Dennis Abts Sr. Principal Engineer

Network on Chip Architecture: An Overview

OASIS Network-on-Chip Prototyping on FPGA

SURVEY ON LOW-LATENCY AND LOW-POWER SCHEMES FOR ON-CHIP NETWORKS

Leaving One Slot Empty: Flit Bubble Flow Control for Torus Cache-coherent NoCs

WITH THE CONTINUED advance of Moore s law, ever

CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers

SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto

Routing Algorithms. Review

NoC Test-Chip Project: Working Document

Hybrid On-chip Data Networks. Gilbert Hendry Keren Bergman. Lightwave Research Lab. Columbia University

Transcription:

Lecture 22: Router Design Papers: Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO 03, Princeton A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks, ISCA 06, Penn-State 1

Router Pipeline Four typical stages: RC routing computation: compute the output channel VA virtual-channel allocation: allocate VC for the head flit SA switch allocation: compete for output physical channel ST switch traversal: transfer data on output physical channel Cycle 1 2 3 4 5 6 7 STALL Head flit Body flit 1 RC VA SA ST -- -- SA ST RC VA SA SA ST -- -- -- SA ST Body flit 2 -- -- SA ST -- -- -- SA ST Tail flit -- -- SA ST -- -- -- SA ST 2

Data Points On-chip network s power contribution in RAW (tiled) processor: 36% in network of compute-bound elements (Intel): 20% in network of storage elements (Intel): 36% bus-based coherence (Kumar et al. 05): ~12% Contributors: RAW: links 39%; buffers 31%; crossbar 30% TRIPS: links 31%; buffers 35%; crossbar 33% Intel: links 18%; buffers 38%; crossbar 29%; clock 13% Unlike traditional off-chip networks, link power is not dominant 3

Network Power Energy for a flit = E R. H + E wire. D = (E buf + E xbar + E arb ). H + E wire. D E R = router energy H = number of hops E wire = wire transmission energy D = physical Manhattan distance E buf = router buffer energy E xbar = router crossbar energy E arb = router arbiter energy This paper assumes that E wire. D is ideal network energy (assuming no change to the application and how it is mapped to physical nodes) Optimizations are attempted to E R and H 4

Segmented Crossbar By segmenting the row and column lines, parts of these lines need not switch less switching capacitance (especially if your output and input ports are close to the bottom-left in the figure above) Need a few additional control signals to activate the tri-state buffers (~2 control signals, ~64 data signals) Overall crossbar power savings: ~15-30% 5

Cut-Through Crossbar Attempts to optimize the common case: in dimension-order routing, flits make up to one turn and usually travel straight 2/3 rd the number of tristate buffers and 1/2 the number of data wires Straight traffic does not go thru tristate buffers Some combinations of turns are not allowed: such as E N and N W (note that such a combination cannot happen with dimension-order routing) Crossbar energy savings of 39-52%; at full-load, with a worst-case routing algorithm, the probability of a conflict is ~50% 6

Write-Through Input Buffer Input flits must be buffered in case there is a conflict in a later pipeline stage If the queue is empty, the input flit can move straight to the next stage: helps avoid the buffer read To reduce the datapaths, the write bitlines can serve as the bypass path Power savings are a function of rd/wr energy ratios and probability of finding an empty queue 7

Express Channels Express channels connect non-adjacent nodes flits traveling a long distance can use express channels for most of the way and navigate on local channels near the source/destination (like taking the freeway) Helps reduce the number of hops The router in each express node is much bigger now 8

Express Channels Routing: in a ring, there are 5 possible routes and the best is chosen; in a torus, there are 17 possible routes A large express interval results in fewer savings because fewer messages exercise the express channels 9

Results Uniform random traffic (synthetic) Write-thru savings are small Exp-channel network has half the flit size to maintain the same bisection-bandwidth as other models (express interval of 2) Baseline model power breakdown: link 44%, crossbar 33%, buffers 23% Express cubes also improve 0-load latency by 23% -- the others have a negligible impact on performance 10

Conventional Router Slide taken from presentation at OCIN 06 11

The RoCo Router 12

VC Allocation XY routing is deadlock-free; need a minimum of 8 VCs to allow every possible flit traversal: 2 d x, 2 d y, 2 t xy, 1 Inj xy, 1 Inj yx XY-YX routing needs 2 more VCs to enable deadlock freedom Adaptive routing needs 12 VCs Additional constraints on VCs may lower performance 13

RoCo Router Key features: Early ejection mechanism for flits destined for the PE (saves 2 cycles since they don t have to go through SA and xbar stages) Flits are steered to the appropriate crossbar thanks to routing info computed in previous stage enables use of 2 2x2 crossbars instead of 1 5x5 crossbar Results show much lower contention probability for RoCo (?!) Need fewer and smaller arbiters: 2x2 xbar arbiter algorithm: (mirroring) Generic case: for each input port, one arbiter selects the winner for each output port, an arbiter selects the winner RoCo: for each input port, two arbiters select two winners for each 2x2 xbar, one arbiter selects the winner for one port and the outcome is mirrored on the 2 nd port 14

Results 15

Title Bullet 16