Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger

Switching/Flow Control Overview
- Topology: determines the connectivity of the network
- Routing: determines paths through the network
- Flow control: determines the allocation of resources (buffers and links) to messages as they traverse the network
- Flow control has a significant impact on the throughput and latency of the network

Flow Control
- Resources managed: control state, channel bandwidth, buffer capacity
- Control state records
  - the allocation of channels and buffers to packets
  - the current state of each packet traversing the node
- Channel bandwidth advances flits from this node to the next
- Buffers hold flits waiting for channel bandwidth

Packets
- Messages: composed of one or more packets
  - If the message size is <= the maximum packet size, only one packet is created
- Packets: composed of one or more flits
- Flit: flow control digit
- Phit: physical digit
  - Subdivides a flit into chunks equal to the link width
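
The message/packet/flit/phit hierarchy above can be sketched as a small size calculation. This is a minimal illustration, not part of the lecture: the function name `packetize` and all of its parameters are hypothetical, and header overhead is ignored.

```python
from math import ceil

def packetize(message_bytes, max_packet_bytes, flit_bytes, link_bytes):
    """Return (number of packets, flits in each packet, phits per flit).

    Assumes payload only: no header, route, or sequence-number overhead.
    """
    n_packets = max(1, ceil(message_bytes / max_packet_bytes))
    flits_per_packet = []
    remaining = message_bytes
    for _ in range(n_packets):
        size = min(remaining, max_packet_bytes)
        flits_per_packet.append(max(1, ceil(size / flit_bytes)))
        remaining -= size
    # A flit is subdivided into phits, each as wide as the link.
    phits_per_flit = ceil(flit_bytes / link_bytes)
    return n_packets, flits_per_packet, phits_per_flit
```

With `flit_bytes == link_bytes` (the on-chip case) each flit is a single phit; a narrower link (the off-chip case) yields several phits per flit.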

Packets (2)
[Figure: a message is divided into packets (header and payload; route and sequence number), a packet into head, body, and tail flits (flit type, VCID), and a flit into phits. Flit types: head, body, tail, head & tail.]
- Off-chip: channel width is limited by pins
  - Requires phits
- On-chip: abundant wiring means phit size == flit size

Packets (3)
[Figure: a cache-line packet carries RC, Type, VCID, and Addr, followed by Bytes 0-15, 16-31, 32-47, and 48-63, spread over a head flit, body flits, and a tail flit. A coherence-command packet carries RC, Type, VCID, Addr, and Cmd in a single head & tail flit.]
- The packet contains destination/route information
  - Individual flits may not, so all flits of a packet must take the same route

Switching
- Different flow control techniques are distinguished by allocation granularity
- Message-based: allocation made at message granularity (circuit switching)
- Packet-based: allocation made to whole packets
- Flit-based: allocation made on a flit-by-flit basis

Message-Based Flow Control
- Coarsest granularity
- Circuit switching
  - Pre-allocates resources across multiple hops, from source to destination
  - Resources = links
  - Buffers are not necessary
- A probe is sent into the network to reserve resources

Circuit Switching
- Once the probe sets up the circuit, the message does not need to perform any routing or allocation at each network hop
- Good for transferring large amounts of data
  - Can amortize the circuit setup cost by sending data with very low per-hop overheads
- No other message can use those resources until the transfer is complete
  - Throughput can suffer due to setup and hold time for circuits
  - Links are idle until setup is complete

Circuit Switching Example
[Figure: network with nodes 0 and 5 labeled; legend: configuration probe, data circuit, acknowledgement.]
- Significant latency overhead prior to data transfer
- Data transfer does not pay per-hop overhead for routing and allocation

Circuit Switching Example (2)
[Figure: network with nodes 0, 1, 2, and 5 labeled; legend: configuration probe, data circuit, acknowledgement.]
- When there is contention
  - Significant wait time
  - Message from 1 to 2 must wait

Time-Space Diagram: Circuit-Switching
[Figure: time-space diagram, location (nodes 0, 1, 2, 5, 8) versus time (cycles 0 to 20). Setup (S), acknowledgement (A), data (D), and tail (T) flits of the 0-to-8 circuit are shown at each node, along with a setup probe from 2 to 8. Annotations: "Time to setup + ack circuit from 0 to 8" and "Time setup from 2 to 8 is blocked".]

Packet-based Flow Control
- Break messages into packets
- Interleave packets on links
  - Better utilization
- Requires per-node buffering to store in-flight packets
- Two types of packet-based techniques: store and forward, and virtual cut-through

Store and Forward
- Links and buffers are allocated to entire packets
- The head flit waits at the router until the entire packet is received before being forwarded to the next hop
- Not suitable for on-chip
  - Requires buffering at each router to hold an entire packet
  - A packet cannot traverse a link until buffering is allocated to the entire packet
  - Incurs high latencies (pays serialization latency at each hop)

Store and Forward Example
[Figure: network with nodes 0 and 5 labeled.]
- Total delay = 4 cycles per hop x 3 hops = 12 cycles
- High per-hop latency
  - Serialization delay is paid at each hop
- Larger buffering required

Time-Space Diagram: Store and Forward
[Figure: time-space diagram, location (nodes 0, 1, 2, 5, 8) versus time (cycles 0 to 24). At each node the whole packet (H B B B T) is received before it is forwarded to the next node.]

Packet-based: Virtual Cut Through
- Links and buffers are allocated to entire packets
- Flits can proceed to the next hop before the tail flit has been received by the current router
  - But only if the next router has enough buffer space for the entire packet
- Reduces latency significantly compared to SAF
- But still requires large buffers
  - Unsuitable for on-chip

Virtual Cut Through Example
[Figure: network with nodes 0 and 5 labeled; annotation at two routers: "Allocate 4 flit-sized buffers before head proceeds".]
- Total delay = 1 cycle per hop x 3 hops + serialization = 6 cycles
- Lower per-hop latency
- Large buffering required
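
The two example totals (12 cycles for store and forward, 6 cycles for virtual cut-through) can be reproduced with a small latency model. This is a sketch under stated assumptions, not the lecture's own formula: the packet is taken to be 4 flits, links move one flit per cycle, there is no contention, and the "serialization" term for cut-through is taken to be the packet length minus one. The function names are hypothetical.

```python
def saf_latency(hops, packet_flits):
    """Store and forward: the whole packet is serialized again at every hop."""
    return hops * packet_flits

def vct_latency(hops, packet_flits, per_hop_cycles=1):
    """Virtual cut-through: the head pays the per-hop delay, and the
    remaining flits follow in a pipeline (serialization is paid once)."""
    return hops * per_hop_cycles + (packet_flits - 1)

# Assumed example parameters: 3 hops, 4-flit packet.
print(saf_latency(3, 4), vct_latency(3, 4))
```

Under these assumptions the store-and-forward latency grows with the product of hops and packet length, while the cut-through latency grows with their sum.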

Time-Space Diagram: VCT
[Figure: time-space diagram, location (nodes 0, 1, 2, 5, 8) versus time (cycles 0 to 8). The packet (H B B B T) advances one node per cycle, with flits pipelined across hops.]

Virtual Cut Through
[Figure: a packet stalled at a router; annotation: "Cannot proceed because only 2 flit buffers available".]
- Throughput suffers from inefficient buffer allocation

Time-Space Diagram: VCT (2)
[Figure: time-space diagram, location (nodes 0, 1, 2, 5, 8) versus time (cycles 0 to 11). The packet (H B B B T) stalls at one hop; annotation: "Insufficient Buffers".]

Flit-Level Flow Control
- Helps routers meet tight area/power constraints
- A flit can proceed to the next router when there is buffer space available for that flit
  - Improves over SAF and VCT by allocating buffers on a flit-by-flit basis
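
The buffer conditions that separate store and forward, virtual cut-through, and flit-level (wormhole) flow control can be summarized in one predicate. This is an illustrative sketch only: the function name, the policy strings, and the `whole_packet_received` flag are hypothetical, and link and VC allocation are not modeled.

```python
def head_may_advance(policy, free_flit_buffers, packet_flits,
                     whole_packet_received=True):
    """Decide whether a head flit may move to the next router.

    free_flit_buffers: free flit slots at the downstream router.
    """
    if policy == "saf":
        # Whole packet must be here, and must fit downstream.
        return whole_packet_received and free_flit_buffers >= packet_flits
    if policy == "vct":
        # Head may cut through, but only if the whole packet fits downstream.
        return free_flit_buffers >= packet_flits
    if policy == "wormhole":
        # Flit-level allocation: room for one flit is enough.
        return free_flit_buffers >= 1
    raise ValueError(f"unknown policy: {policy}")
```

For a 4-flit packet with only 2 free downstream buffers, the virtual cut-through check fails while the wormhole check succeeds, which is the case shown in the earlier cut-through example.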

Wormhole Flow Control
- Pros
  - More efficient buffer utilization (good for on-chip)
  - Low latency
- Cons
  - Poor link utilization: if the head flit becomes blocked, all links spanning the length of the packet are idle
    - They cannot be re-allocated to a different packet
  - Suffers from head-of-line (HOL) blocking

Wormhole Example
[Figure: red and blue packets in a network with 6 flit buffers per input port. Annotations:]
- Red holds this channel: the channel remains idle until red proceeds
- Channel idle, but the red packet is blocked behind blue
- Buffer full: blue cannot proceed
- Blocked by other packets

Time-Space Diagram: Wormhole
[Figure: time-space diagram, location (nodes 0, 1, 2, 5, 8) versus time (cycles 0 to 11). The packet (H B B B T) is delayed at one hop; annotation: "Contention".]

Virtual Channels
- First proposed for deadlock avoidance
  - We'll come back to this
- Can be applied to any flow control
  - First proposed with wormhole

Virtual Channel Flow Control
- Virtual channels are used to combat HOL blocking in wormhole
- Virtual channels: multiple flit queues per input port
  - Share the same physical link (channel)
- Link utilization is improved
  - Flits on a different VC can pass a blocked packet

Virtual Channel Flow Control (2)
[Figure: packets A and B entering on separate inputs, A (in) and B (in), and leaving on A (out) and B (out).]

Virtual Channel Flow Control (3)
[Figure: timing diagram. Packet A (AH, A1 to A5, AT) arrives on In1 and packet B (BH, B1 to B5, BT) on In2. Their flits alternate on the shared output (AH BH A1 B1 A2 B2 ... AT BT) and then continue on separate downstream links (A downstream, B downstream). The numeric rows, ranging from 1 to 3, appear to be per-packet buffer occupancy counts.]

Virtual Channel Flow Control (4)
- Packets compete for the VC on a flit-by-flit basis
- In the example: on downstream links, flits of each packet are available every other cycle
- Upstream links throttle because of limited buffers
- This does not mean links are idle
  - They may be used by packets allocated to other VCs
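
The flit-by-flit sharing of one physical link by several VCs can be sketched with a round-robin arbiter. This is a simplified illustration: the function name `interleave` is hypothetical, one flit crosses the link per cycle, and credits and backpressure are not modeled.

```python
from collections import deque

def interleave(vc_queues):
    """Serve one flit per turn from each non-empty VC queue, round-robin,
    and return the order in which flits appear on the shared link."""
    queues = [deque(q) for q in vc_queues]
    link = []
    while any(queues):          # an empty deque is falsy
        for q in queues:
            if q:
                link.append(q.popleft())
    return link

order = interleave([["AH", "A1", "AT"], ["BH", "B1", "BT"]])
print(order)
```

With two active packets, each packet's flits appear on the link every other cycle, matching the behavior described in the example above.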

Virtual Channel Example
[Figure: the wormhole example repeated with virtual channels: 6 flit buffers per input port, 3 flit buffers per VC. Annotations:]
- Buffer full: blue cannot proceed
- Blocked by other packets

Summary of techniques
Technique            Links     Buffers            Comments
Circuit-Switching    Messages  N/A (buffer-less)  Setup & Ack
Store and Forward    Packet    Packet             Head flit waits for tail
Virtual Cut Through  Packet    Packet             Head can proceed
Wormhole             Packet    Flit               HOL
Virtual Channel      Flit      Flit               Interleave flits of different packets

Deadlock
- Using flow control to guarantee deadlock freedom gives more flexible routing
  - Recall: routing restrictions are needed for deadlock freedom
- If the routing algorithm is not deadlock free
  - VCs can break the resource cycle
- Each VC is time-multiplexed onto the physical link
  - Holding a VC implies holding the associated buffer queue
  - Not tying up the physical link resource
- Enforce order on VCs

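Deadlock: Enforce Order
[Figure: ring of nodes A, B, C, D; each link has two virtual channels (A0/A1, B0/B1, C0/C1, D0/D1), and a dateline is marked on the ring.]
- All messages are sent through VC 0 until they cross the dateline
- After the dateline, they are assigned to VC 1
  - They cannot be allocated to VC 0 again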
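
The dateline rule can be sketched for a unidirectional ring. This is an illustration under assumptions not stated in the slides: nodes are numbered 0 to `ring_size - 1`, packets travel in increasing node order, the dateline sits on the link entering `dateline_node`, and the hop that crosses the dateline already uses VC 1. The function name is hypothetical.

```python
def dateline_vcs(src, dst, ring_size, dateline_node=0):
    """Return the hops of a packet as (from_node, to_node, vc) tuples."""
    hops = []
    vc = 0                      # every packet starts on VC 0
    node = src
    while node != dst:
        nxt = (node + 1) % ring_size
        if nxt == dateline_node:
            vc = 1              # crossed the dateline: VC 1 from here on
        hops.append((node, nxt, vc))
        node = nxt
    return hops

print(dateline_vcs(2, 1, 4))
```

Because no packet ever moves from VC 1 back to VC 0, the channel dependences cannot close into a cycle around the ring.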

Deadlock: Escape VCs
- Enforcing order lowers VC utilization
  - Previous example: VC 1 is underutilized
- Escape virtual channels
  - Provide an escape path for every packet in a potential cycle
  - Have 1 VC that uses a deadlock-free routing subfunction
    - Example: VC 0 uses DOR, other VCs use an arbitrary routing function
  - Access to VCs is arbitrated fairly: a packet always has a chance of landing on the escape VC
  - Once in the escape VC, a packet must remain there (routing restrictions)
- Assign different message classes to different VCs to prevent protocol-level deadlock
  - Prevent req-ack message cycles

Virtual Channel Reallocation
[Figure: routers R0, R1, R2 with virtual channels VC0 to VC3 holding head (H), body (B), and tail (T) flits of packets P1, P2, and P3.]
- When can a VC be reallocated to a new packet?
- Two options: conservative and aggressive
  - Conservative (atomic): only reallocate an empty VC
  - Aggressive (non-atomic): reallocate once the tail flit has been received
- Implications for deadlock
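
The two reallocation policies can be sketched as a check on the state of a VC. This is a minimal model for illustration: the class `VirtualChannel`, its method names, and the flit-kind strings ("H", "B", "T", "HT") are hypothetical.

```python
class VirtualChannel:
    """Tracks the flits buffered in one VC and whether the current
    packet's tail flit has arrived."""

    def __init__(self):
        self.flits = []
        self.tail_seen = True   # no packet in progress yet

    def receive(self, kind):
        self.flits.append(kind)
        self.tail_seen = kind in ("T", "HT")

    def forward(self):
        return self.flits.pop(0)

    def can_reallocate(self, policy):
        if policy == "aggressive":
            # Non-atomic: the tail has been received, flits may remain.
            return self.tail_seen
        if policy == "conservative":
            # Atomic: the tail has been received and the VC has drained.
            return self.tail_seen and not self.flits
        raise ValueError(f"unknown policy: {policy}")
```

Under the aggressive policy a VC can hold flits of two packets at once, which is where the deadlock implications come from.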

VCT + non-atomic allocation
- VCT typically uses non-atomic VC allocation
- A VC is allocated only if there is enough free buffer space to hold the entire packet
  - Guarantees that the whole packet can be received by the VC
  - The packet will entirely vacate the upstream VC
- A new packet at the head of the VC can bid for resources

VC Reallocation: Wormhole + Aggressive
- Turn-model and deterministic routing
  - No cyclic channel dependence
  - Can use aggressive reallocation
- High VC utilization but low adaptivity
[Figure: routers R0, R1, R2 with virtual channels VC0 to VC3 holding flits of packets P1, P2, and P3; annotation: "P1 can be forwarded immediately".]

VC Reallocation: Wormhole + Adaptive Routing
- Adaptive routing
  - Cyclic channel dependence
- With aggressive reallocation
  - A packet spans 2 VCs and its head flit is not at the head of one VC
  - It cannot reach the escape VC
  - Leads to deadlock
- Requires conservative reallocation
  - Can only reallocate an empty VC
  - Low VC utilization
[Figure: routers R0 to R5 connected by escape VCs (EVC0 to EVC6) and adaptive VCs (AVC0 to AVC6) holding flits of packets P0 to P7. Destinations: Dest(P0)=R0, Dest(P1)=R1, Dest(P2)=R3, Dest(P3)=R2, Dest(P4)=R4, Dest(P5)=R5, Dest(P6)=R3, Dest(P7)=R2.]

Buffer Backpressure
- Need a mechanism to prevent buffer overflow
  - Avoid dropping packets
  - Upstream nodes need to know buffer availability at downstream routers
- Significant impact on the throughput achieved by flow control
- Two common mechanisms
  - Credits
  - On-off

Credit-Based Flow Control
- The upstream router stores credit counts for each downstream VC
- Upstream router
  - When a flit is forwarded: decrement the credit count
  - Count == 0: buffer full, stop sending
- Downstream router
  - When a flit is forwarded and its buffer freed: send a credit to the upstream router
  - Upstream increments the credit count
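
The upstream side of the credit protocol can be sketched as a counter per downstream VC. This is a minimal illustration: the class and method names are hypothetical, and the credit return path and its delay are not modeled.

```python
class CreditCounter:
    """Upstream view of one downstream VC's free buffer slots."""

    def __init__(self, buffer_depth):
        self.credits = buffer_depth   # one credit per downstream flit buffer

    def can_send(self):
        return self.credits > 0       # count == 0 means buffer full

    def on_flit_sent(self):
        if self.credits == 0:
            raise RuntimeError("no credits: downstream buffer is full")
        self.credits -= 1

    def on_credit_received(self):
        # Downstream forwarded a flit and freed a buffer slot.
        self.credits += 1

counter = CreditCounter(buffer_depth=2)
counter.on_flit_sent()
counter.on_flit_sent()
print(counter.can_send())   # False until a credit comes back
```

One credit is returned per flit, so the upstream count never exceeds the true number of free downstream slots.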

Credit Timeline
[Figure: timeline between Node 1 and Node 2 marked t1 to t5, with "Process" steps, "Flit departs router", and an interval labeled "Credit round trip delay".]
- Round-trip credit delay: the time between when a buffer empties and when the next flit can be processed from that buffer entry
  - With only a single-entry buffer, this would result in significant throughput degradation
  - It is important to size buffers to tolerate credit turn-around

On-Off Flow Control
- Credit: requires upstream signaling for every flit
- On-off: decreases upstream signaling
- Off signal
  - Sent when the number of free buffers falls below threshold F_off
- On signal
  - Sent when the number of free buffers rises above threshold F_on
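
The threshold behavior can be sketched from the downstream router's side. This is an illustrative model only: the class `OnOffReceiver`, its method names, and the example thresholds are hypothetical, and signal propagation delay is ignored.

```python
class OnOffReceiver:
    """Downstream buffer that emits 'off'/'on' signals at two thresholds."""

    def __init__(self, depth, f_off, f_on):
        self.free = depth
        self.f_off = f_off
        self.f_on = f_on
        self.sender_on = True

    def accept_flit(self):
        """A flit arrives; return 'off' if the sender must be stopped."""
        self.free -= 1
        if self.sender_on and self.free < self.f_off:
            self.sender_on = False
            return "off"
        return None

    def release_flit(self):
        """A flit departs; return 'on' if the sender may resume."""
        self.free += 1
        if not self.sender_on and self.free > self.f_on:
            self.sender_on = True
            return "on"
        return None

rx = OnOffReceiver(depth=8, f_off=3, f_on=5)
print([rx.accept_flit() for _ in range(6)])
```

A signal is sent only when a threshold is crossed, rather than once per flit as with credits.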

On-Off Timeline
[Figure: timeline between Node 1 and Node 2 marked t1 to t8, with "Process" steps and the points "F_off threshold reached" and "F_on threshold reached".]
- F_off is set to prevent flits arriving before t4 from overflowing
- F_on is set so that Node 2 does not run out of flits between t5 and t8
- Less signaling but more buffering
  - On-chip buffers are more expensive than wires
[Slide footer: Fall 2014, ECE 1749H: Interconnection Networks (Enright Jerger)]

Buffer Sizing
- Prevent backpressure from limiting throughput
  - The number of flit buffers must be >= the turnaround time in cycles
- Assume:
  - 1 cycle propagation delay for data and credits
  - 1 cycle credit processing delay
  - 3 cycle router pipeline
- At least 6 flit buffers
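
The 6-buffer figure follows from adding up the turnaround components listed above. A small sketch of that arithmetic, with a hypothetical function name and parameter names, assuming one flit is sent per cycle:

```python
def min_flit_buffers(flit_propagation=1, credit_propagation=1,
                     credit_processing=1, router_pipeline=3):
    """Credit turnaround time in cycles, which is also the minimum number
    of flit buffers needed to keep the link busy at one flit per cycle."""
    return (flit_propagation + credit_propagation
            + credit_processing + router_pipeline)

print(min_flit_buffers())
```

Any change to one of the delays, for example a deeper router pipeline, raises the minimum buffer count by the same number of cycles.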

Actual Buffer Usage & Turnaround Delay
[Figure: timeline showing actual buffer usage followed by the turnaround components: credit propagation delay (1), credit pipeline delay (1), flit pipeline delay (3), flit propagation delay (1).]
Events in order:
- Flit arrives at node 1 and uses a buffer
- Flit leaves node 1 and a credit is sent to node 0
- Node 0 receives the credit
- Node 0 processes the credit; the freed buffer is reallocated to a new flit
- New flit leaves node 0 for node 1
- New flit arrives at node 1 and reuses the buffer

Flow Control Summary
- On-chip networks require techniques with lower buffering requirements
  - Wormhole or virtual channel flow control
- Avoid dropping packets in the on-chip environment
  - Requires a buffer backpressure mechanism
- The complexity of flow control impacts router microarchitecture (next)