Thomas Moscibroda (Microsoft Research), Onur Mutlu (CMU)



[Diagram: a multi-core chip in which CPU+L1 nodes, cache banks, an accelerator, and a memory controller are connected by the on-chip network.]

On-chip networks connect cores, caches, memory controllers, etc. Examples: Intel 80-core Terascale chip, MIT RAW chip. Design goals in NoC design: high throughput, low latency; fairness between cores, QoS; low complexity, low cost; low power, low energy consumption.

Energy/Power in On-Chip Networks: power is a key constraint in the design of high-performance processors, and NoCs consume a substantial portion of system power: ~30% in the Intel 80-core Terascale chip [IEEE Micro 07], ~40% in the MIT RAW chip [ISCA 04]. NoCs are estimated to consume 100s of Watts [Borkar, DAC 07].

Existing approaches differ in numerous ways: network topology [Kim et al, ISCA 07; Kim et al, ISCA 08; etc.], flow control [Michelogiannakis et al, HPCA 09; Kumar et al, MICRO 08; etc.], virtual channels [Nicopoulos et al, MICRO 06; etc.], QoS & fairness mechanisms [Lee et al, ISCA 08; etc.], routing algorithms [Singh et al, CAL 04], router architecture [Park et al, ISCA 08], broadcast/multicast [Jerger et al, ISCA 08; Rodrigo et al, MICRO 08]. Existing work assumes the existence of buffers in routers!

[Diagram: a conventional virtual-channel router: each input port has VC buffers (VC1..VCv) and routing computation; a VC arbiter and a switch arbiter feed an N x N crossbar, with credit flow to the upstream router.] Buffers are an integral part of existing NoC routers.

Buffers are necessary for high network throughput: buffers increase the total available bandwidth in the network. [Plot: average packet latency vs. injection rate for small, medium, and large buffers; larger buffers push out the saturation point.]

Buffers are necessary for high network throughput: buffers increase the total available bandwidth in the network. But buffers consume significant energy/power: dynamic energy when read/written, static energy even when not occupied. Buffers add complexity and latency: logic for buffer management, virtual channel allocation, credit-based flow control. Buffers require significant chip area: e.g., in the TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al, ICCD 06].

How much throughput do we lose? How is latency affected? [Plot: latency vs. injection rate, with and without buffers.] Up to what injection rates can we use bufferless routing? Are there realistic scenarios in which the NoC is operated at injection rates below that threshold? Can we achieve energy reduction? If so, how much? Can we reduce area, complexity, etc.? Answers in our paper!

Outline: Introduction and Background; Bufferless Routing (BLESS): FLIT-BLESS, WORM-BLESS, BLESS with buffers; Advantages and Disadvantages; Evaluations; Conclusions.

BLESS: always forward all incoming flits to some output port. If no productive direction is available, send the flit in another direction: the packet is deflected (hot-potato routing [Baran 64, etc.]).
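The deflection rule can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it assumes a 2D mesh with ports E/W/N/S, and names such as `productive_dirs` and `route_one` are ours.

```python
# Toy sketch of the deflection principle on a 2D mesh.
# Assumption (ours): XY coordinates, output ports "E"/"W"/"N"/"S".

def productive_dirs(cur, dst):
    """Directions that move a flit closer to its destination (x, y)."""
    dirs = []
    if dst[0] > cur[0]: dirs.append("E")
    if dst[0] < cur[0]: dirs.append("W")
    if dst[1] > cur[1]: dirs.append("N")
    if dst[1] < cur[1]: dirs.append("S")
    return dirs

def route_one(cur, dst, free_ports):
    """Forward to a productive port if one is free; otherwise deflect."""
    for d in productive_dirs(cur, dst):
        if d in free_ports:
            return d, False      # productive hop
    return free_ports[0], True   # deflected (hot-potato)

# A flit at (1, 1) heading to (3, 1) with "E" taken gets deflected:
port, deflected = route_one((1, 1), (3, 1), ["N", "S", "W"])
```

Because every router has at least as many output ports as input ports, some free port always exists, so a flit is never dropped and never waits.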

BLESS arbitration policy: Flit-Ranking and Port-Prioritization replace the routing/VC arbiter/switch arbiter of a buffered router. 1. Create a ranking over all incoming flits. 2. For a given flit in this ranking, find the best free output port. Apply to each flit in order of ranking.

FLIT-BLESS: each flit is routed independently, with oldest-first arbitration (other policies are evaluated in the paper). Flit-Ranking: 1. Oldest-first ranking. Port-Prioritization: 2. Assign the flit to a productive port, if possible; otherwise, assign it to a non-productive port. Network topology: can be applied to most topologies (Mesh, Torus, Hypercube, Trees, ...) provided that 1) #output ports >= #input ports at every router and 2) every router is reachable from every other router. Flow control & injection policy: completely local; inject whenever the local input port is free. Absence of deadlocks: every flit is always moving. Absence of livelocks: guaranteed with oldest-first ranking.
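The two-step FLIT-BLESS arbitration (oldest-first ranking, then port-prioritization) can be sketched as below. The `Flit` class and its field names are hypothetical, and a real router does this in parallel hardware rather than sequential code.

```python
# Sketch of FLIT-BLESS arbitration. Since #output ports >= #input ports,
# every incoming flit can always be assigned some output port.
from dataclasses import dataclass, field

@dataclass
class Flit:
    age: int          # injection timestamp; smaller = older
    productive: list  # output ports that move this flit toward its destination

def bless_arbitrate(flits, output_ports):
    """Assign every incoming flit to a distinct output port."""
    free = list(output_ports)
    assignment = {}
    # 1. Flit-ranking: oldest first (smallest timestamp wins).
    for flit in sorted(flits, key=lambda f: f.age):
        # 2. Port-prioritization: free productive port if any, else any free port.
        choices = [p for p in flit.productive if p in free] or free
        port = choices[0]
        free.remove(port)
        assignment[id(flit)] = port   # a deflection if port not in flit.productive
    return assignment
```

For example, if two flits both want port "E", the older flit gets "E" and the younger one is deflected to some other free port; the oldest flit in the network always makes progress, which is why oldest-first ranking rules out livelock.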

Potential downsides of FLIT-BLESS: not energy-optimal (each flit needs header information), increased latency (different flits take different paths), increased receive buffer size. BLESS with wormhole routing [Dally, Seitz 86]? Problems: the injection problem (it is not known when it is safe to inject a new worm) and the livelock problem (packets can be deflected forever).

WORM-BLESS: deflect worms if necessary! Flit-Ranking: 1. Oldest-first ranking. Port-Prioritization: 2. If the flit is a head-flit, a) assign it to an unallocated, productive port; b) assign it to an allocated, productive port; c) assign it to an unallocated, non-productive port; d) assign it to an allocated, non-productive port. Else, assign the flit to the port that is already allocated to its worm. Truncate worms if necessary! At low congestion, packets travel as worms; when a worm is truncated and deflected, a body-flit turns into a head-flit. See paper for details.
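The four head-flit priority levels (a-d) can be read as a ranked filter. A rough sketch under assumed names (`free` and `allocated` are ours), omitting the truncation bookkeeping described in the paper:

```python
# Sketch of WORM-BLESS head-flit port-prioritization (rules a-d).
def worm_bless_port(productive, free, allocated):
    """Pick an output port for a head-flit.

    productive: ports moving the flit toward its destination
    free:       ports not yet taken by a higher-ranked flit this cycle
    allocated:  ports currently allocated to an in-flight worm
    """
    ranked = [
        # a) unallocated, productive
        [p for p in free if p in productive and p not in allocated],
        # b) allocated, productive (truncates the worm holding that port)
        [p for p in free if p in productive and p in allocated],
        # c) unallocated, non-productive (deflection)
        [p for p in free if p not in productive and p not in allocated],
        # d) allocated, non-productive (deflection + truncation)
        [p for p in free if p not in productive and p in allocated],
    ]
    for bucket in ranked:
        if bucket:
            return bucket[0]
    raise RuntimeError("no free port; cannot happen when #outputs >= #inputs")
```

Body-flits skip this logic entirely: they simply follow the port already allocated to their worm, unless the worm has been truncated.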

BLESS without buffers is the extreme end of a continuum: BLESS can be integrated with buffers (FLIT-BLESS with buffers, WORM-BLESS with buffers). Whenever a buffer is full, its first flit becomes must-schedule; must-schedule flits must be assigned an output port, deflecting other flits if necessary. See paper for details.
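The must-schedule rule can be sketched as a simple selection step. This is a toy with names of our own choosing (`input_buffers`, `capacity`); see the paper for the full buffered-BLESS policy.

```python
# Sketch of the "must-schedule" rule for BLESS with small buffers: when an
# input buffer is full, its oldest (first) flit is marked must-schedule and
# is guaranteed an output port this cycle, deflecting other flits if needed.
def pick_must_schedule(input_buffers, capacity):
    """Return the flits that must leave this cycle (one per full buffer)."""
    return [buf[0] for buf in input_buffers if len(buf) >= capacity]

# Two buffers, capacity 2: only the full buffer's head flit is must-schedule.
heads = pick_must_schedule([["f1", "f2"], ["f3"]], capacity=2)
# heads == ["f1"]
```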

Outline: Introduction and Background; Bufferless Routing (BLESS): FLIT-BLESS, WORM-BLESS, BLESS with buffers; Advantages and Disadvantages; Evaluations; Conclusions.

Advantages: no buffers; purely local flow control; simplicity (no credit flows, no virtual channels, simplified router design); no deadlocks or livelocks; adaptivity (packets are deflected around congested areas!); router latency reduction; area savings. Disadvantages: increased latency; reduced bandwidth; increased buffering at the receiver; header information at each flit. Impact on energy?

Baseline router (speculative): a head flit goes through BW, RC, VA, SA, ST, LT; a body flit goes through BW, SA, ST, LT. (BW: Buffer Write; RC: Route Computation; VA: Virtual Channel Allocation; SA: Switch Allocation; ST: Switch Traversal; LT: Link Traversal; LA LT: Link Traversal of Lookahead [Dally, Towles 04].) Router latency = 3; can be improved to 2. BLESS gets rid of input buffers and virtual channels. BLESS router (standard): each router does RC, ST, LT; router latency = 2. BLESS router (optimized): lookahead link traversal (LA LT) overlaps with the next router's stages; router latency = 1.
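As a back-of-the-envelope illustration of the pipeline-depth difference, one can compare zero-load latencies; the hop count and the one-cycle-link assumption below are ours, not the slide's.

```python
# Toy zero-load latency comparison using the slide's router latencies.
# Assumptions (ours): each hop pays router latency plus one link cycle,
# and serialization latency is ignored.
def zero_load_latency(hops, router_cycles, link_cycles=1):
    return hops * (router_cycles + link_cycles)

hops = 6  # e.g., an average-length path on a mesh (illustrative)
baseline  = zero_load_latency(hops, router_cycles=3)  # buffered, speculative
bless_std = zero_load_latency(hops, router_cycles=2)  # BLESS standard
bless_opt = zero_load_latency(hops, router_cycles=1)  # BLESS optimized
# baseline=24, bless_std=18, bless_opt=12 cycles
```

Under these assumptions the optimized BLESS router halves the zero-load latency of the 3-cycle buffered baseline, which is consistent with the slide's claim that BLESS with router latency 1 can outperform the baseline.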

Do the advantages outweigh the disadvantages? In particular, what is the impact on energy? Extensive evaluations in the paper!

2D mesh network, router latency is 2 cycles:
o 4x4, 8 cores, 8 L2 cache banks (each node is a core or an L2 bank)
o 4x4, 16 cores, 16 L2 cache banks (each node is a core and an L2 bank)
o 8x8, 16 cores, 64 L2 cache banks (each node is an L2 bank and may be a core)
o 128-bit wide links, 4-flit data packets, 1-flit address packets
o For baseline configuration: 4 VCs per physical input port, 1 packet deep
Simulation is cycle-accurate: it models stalls in the network and processors, self-throttling behavior, and an aggressive processor model.
Benchmarks:
o Multiprogrammed SPEC CPU2006 and Windows Desktop applications
o Heterogeneous and homogeneous application mixes
o Synthetic traffic patterns: UR, Transpose, Tornado, Bit Complement
x86 processor model based on Intel Pentium M:
o 2 GHz processor, 128-entry instruction window
o 64 Kbyte private L1 caches
o Total 16 Mbyte shared L2 caches; 16 MSHRs per bank
o DRAM model based on Micron DDR2-800
o Most of our evaluations use perfect L2 caches (puts maximal stress on the NoC)

Energy model provided by the Orion simulator [MICRO 02]: 70nm technology, 2 GHz routers at 1.0 Vdd. For BLESS, we model: additional energy to transmit header information, additional buffers needed on the receiver side, and additional logic to reorder flits of individual packets at the receiver. We partition network energy into buffer energy, router energy, and link energy, each having static and dynamic components. Comparisons against non-adaptive and aggressive adaptive buffered routing algorithms (DO, ROMM).

First, the bad news: under uniform random injection, BLESS has significantly lower saturation throughput compared to the buffered baseline. [Plot: average latency vs. injection rate (flits per cycle per node, up to ~0.49) for BLESS and the best baseline.]

milc benchmarks (moderately intensive), perfect caches: very little performance degradation with BLESS (less than 4% in the dense network). With router latency 1, BLESS can even outperform the baseline (by ~10%). Significant energy improvements (almost 40%). [Charts: W-Speedup and normalized network energy (buffer, link, router) for DO, ROMM, Baseline, and BLESS RL=1 on the 4x4 8x-milc, 4x4 16x-milc, and 8x8 16x-milc configurations.]

Observations (milc benchmarks, perfect caches): 1) Injection rates are not extremely high on average: self-throttling! 2) For bursts and temporary hotspots, use the network links as buffers!

BLESS increases the buffer requirement at the receiver by at most 2x; overall, energy is still reduced (see paper for details). Impact of memory latency: with real caches, very little slowdown! (at most 1.5%). [Chart: W-Speedup for the 4x4 8x-matlab, 4x4 16x-matlab, and 8x8 16x-matlab configurations.]

Heterogeneous application mixes (we evaluate several mixes of intensive and non-intensive applications): little performance degradation, significant energy savings in all cases, and no significant increase in unfairness across different applications. Area savings: ~60% of network area can be saved!

Aggregate results over all 29 applications, sparse network:

                       Perfect L2             Realistic L2
                       Average   Worst-Case   Average   Worst-Case
Network Energy         -39.4%    -28.1%       -46.4%    -41.0%
System Performance     -0.5%     -3.2%        -0.15%    -0.55%

[Charts: normalized network energy (buffer, link, router) and W-Speedup for BASE, FLIT, and WORM.]

Aggregate results over all 29 applications:

Sparse Network         Perfect L2             Realistic L2
                       Average   Worst-Case   Average   Worst-Case
Network Energy         -39.4%    -28.1%       -46.4%    -41.0%
System Performance     -0.5%     -3.2%        -0.15%    -0.55%

Dense Network          Perfect L2             Realistic L2
                       Average   Worst-Case   Average   Worst-Case
Network Energy         -32.8%    -14.0%       -42.5%    -33.7%
System Performance     -3.6%     -17.1%       -0.7%     -1.5%

Conclusion: for a very wide range of applications and network settings, buffers are not needed in NoCs. Significant energy savings (32% even in dense networks with perfect caches); area savings of ~60%; simplified router and network design (flow control, etc.). Performance slowdown is minimal (performance can even increase!). A strong case for a rethinking of NoC design! We are currently working on future research: support for quality of service, different traffic classes, energy management, etc.