ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture Lecture 21 Routing

Outline
- Routing
- Switch Design
- Flow Control
- Case Studies

Routing
- The routing algorithm determines
  - which of the possible paths are used as routes
  - how the route is determined
- Issues:
  - Routing mechanism
    - arithmetic
    - source-based port select
    - table driven
    - general computation
  - Properties of the routes
  - Deadlock free

Routing Mechanism
- Need to select an output port for each input packet, in a few cycles
- Simple arithmetic in regular topologies
  - ex: Dx, Dy routing in a grid
    - west (-x): Dx < 0
    - east (+x): Dx > 0
    - south (-y): Dx = 0, Dy < 0
    - north (+y): Dx = 0, Dy > 0
    - processor: Dx = 0, Dy = 0
- Reduce the relative address of each dimension in order
  - Dimension-order routing in k-ary n-cubes
  - Routing in hypercubes
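The port-select rules on the slide above can be sketched in a few lines of Python. This is only an illustration: the names `select_port` and `route` are hypothetical, the grid has no wraparound links, and the relative address is recomputed from absolute coordinates at each hop instead of being decremented in the packet header.

```python
def select_port(dx: int, dy: int) -> str:
    """Pick the output port from the remaining relative address (dx, dy)."""
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "processor"


# Coordinate change caused by leaving through each port.
STEP = {"west": (-1, 0), "east": (1, 0), "south": (0, -1), "north": (0, 1)}


def route(src, dst):
    """Return the ports taken from src to dst: x is corrected first, then y."""
    x, y = src
    ports = []
    while True:
        port = select_port(dst[0] - x, dst[1] - y)
        ports.append(port)
        if port == "processor":
            return ports
        sx, sy = STEP[port]
        x, y = x + sx, y + sy


print(route((0, 0), (2, -1)))  # ['east', 'east', 'south', 'processor']
```

Because the x offset is always reduced to zero before y is touched, every packet makes at most one turn, which is the property the deadlock argument later in the lecture relies on.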

Routing Mechanism
[Figure: message header carrying port selects P3, P2, P1, P0]
- Source-based
  - message header carries a series of port selects
  - used and stripped en route
  - CRC? Packet format?
  - CS-2, Myrinet, MIT Arctic
- Table-driven
  - message header carries an index for the next port at the next switch: o = R[i]
  - table also gives the index for the following hop: o, i' = R[i]
  - ATM, HIPPI
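A minimal sketch of the two header-driven mechanisms, under assumed data layouts that are not from the slides: the source-based header is a Python list of port selects, and each table-driven switch holds a dictionary R mapping an index i to a pair (o, i'). The switch names, port numbers, and the `links` map are hypothetical.

```python
def source_based_hop(header):
    """Use the leading port select and strip it; the rest travels on."""
    return header[0], header[1:]


def table_driven_hop(table, index):
    """Look up the output port and the index for the next switch: o, i' = R[i]."""
    return table[index]


def follow_tables(tables, links, switch, index):
    """Follow table lookups switch by switch until a port with no onward link."""
    ports = []
    while True:
        port, index = table_driven_hop(tables[switch], index)
        ports.append(port)
        if (switch, port) not in links:
            return ports
        switch = links[(switch, port)]


# Hypothetical three-switch chain A -> B -> C.
tables = {"A": {5: (1, 7)}, "B": {7: (2, 3)}, "C": {3: ("host", None)}}
links = {("A", 1): "B", ("B", 2): "C"}

print(source_based_hop([2, 0, 1]))            # (2, [0, 1])
print(follow_tables(tables, links, "A", 5))   # [1, 2, 'host']
```

The trade-off the sketch makes visible: the source-based header shrinks at every hop and needs no switch state, while the table-driven header stays one index wide but requires every switch to hold a table.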

Properties of Routing Algorithms
- Deterministic: route determined by (source, dest), not intermediate state (i.e. traffic)
- Adaptive: route influenced by traffic along the way
- Minimal: only selects shortest paths
- Deadlock free: no traffic pattern can lead to a situation where no packets move forward

Deadlock Freedom
- How can it arise?
  - necessary conditions:
    - shared resource
    - incrementally allocated
    - non-preemptible
  - think of a channel as a shared resource that is acquired incrementally
    - source buffer, then dest. buffer
    - channels along a route
- How do you avoid it?
  - constrain how channel resources are allocated
  - ex: dimension order
- How do you prove that a routing algorithm is deadlock free?

Proof Technique
- Resources are logically associated with channels
- Messages introduce dependences between resources as they move forward
- Need to articulate the possible dependences that can arise between channels
- Show that there are no cycles in the Channel Dependence Graph
  - find a numbering of channel resources such that every legal route follows a monotonic sequence
- => no traffic pattern can lead to deadlock
- The network need not be acyclic, only the channel dependence graph
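One way to make the proof technique concrete is to build the channel dependence graph from a set of routes and test it for cycles. The sketch below assumes each route is given as a list of channel names (hypothetical labels such as "c0") and uses a standard topological-sort check; it illustrates the idea and is not a tool from the lecture.

```python
from collections import defaultdict, deque


def channel_dependence_graph(routes):
    """Add an edge held -> wanted for every pair of consecutive channels."""
    deps = defaultdict(set)
    for route in routes:
        for held, wanted in zip(route, route[1:]):
            deps[held].add(wanted)
    return deps


def is_acyclic(deps):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    nodes = set(deps) | {c for targets in deps.values() for c in targets}
    indegree = {n: 0 for n in nodes}
    for targets in deps.values():
        for c in targets:
            indegree[c] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while ready:
        n = ready.popleft()
        removed += 1
        for c in deps.get(n, ()):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return removed == len(nodes)


# Two-hop routes around a 4-node unidirectional ring: dependences wrap around.
ring = [[f"c{i}", f"c{(i + 1) % 4}"] for i in range(4)]
# The same channels without the wraparound route.
line = [["c0", "c1"], ["c1", "c2"], ["c2", "c3"]]

print(is_acyclic(channel_dependence_graph(ring)))  # False
print(is_acyclic(channel_dependence_graph(line)))  # True
```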

Example: k-ary 2D array
- Thm: x,y routing is deadlock free
- Numbering
  - +x channel (i,y) -> (i+1,y) gets i
  - similarly for -x, with 0 as the most positive edge
  - +y channel (x,j) -> (x,j+1) gets N+j
  - similarly for -y channels
- Any routing sequence (x direction, turn, y direction) is increasing
[Figure: 4 x 4 array with nodes 00 to 33 and numbered channels]
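The numbering argument can be checked by brute force on a small array. The sketch below makes two assumptions that the slide leaves open: N is taken to be k*k (the number of nodes), and the -x and -y channels are numbered k-1-i and N+(k-1-j) so that numbers grow in the direction of travel. The function names are hypothetical.

```python
from itertools import product


def channel_number(src, dst, k):
    """Number the channel src -> dst between adjacent nodes of a k x k array."""
    (x0, y0), (x1, y1) = src, dst
    n = k * k  # assumed value of N: larger than every x-channel number
    if y0 == y1:  # x channel
        return x0 if x1 > x0 else (k - 1) - x0
    return n + y0 if y1 > y0 else n + (k - 1) - y0  # y channel


def xy_route(src, dst):
    """Return the hops (node, next_node) of an x-then-y route."""
    x, y = src
    hops = []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        hops.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        hops.append(((x, y), (x, ny)))
        y = ny
    return hops


def all_routes_increasing(k):
    """Check that every x,y route uses strictly increasing channel numbers."""
    nodes = list(product(range(k), repeat=2))
    for src, dst in product(nodes, repeat=2):
        numbers = [channel_number(a, b, k) for a, b in xy_route(src, dst)]
        if any(later <= earlier for earlier, later in zip(numbers, numbers[1:])):
            return False
    return True


print(all_routes_increasing(4))  # True
```

Since every route climbs through the numbering, no set of packets can wait on each other in a circle, which is the content of the theorem.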

More examples:
- Consider other topologies
  - butterfly?
  - tree?
  - fat tree?
- Any assumptions about routing mechanism? amount of buffering?
- What about wormhole routing on a ring?
[Figure: ring of eight nodes, 0 to 7]

Deadlock free wormhole networks?
- Basic dimension order routing techniques don't work for k-ary n-cubes
  - only for k-ary n-arrays (bi-directional)
- Idea: add channels!
  - provide multiple virtual channels to break the dependence cycle
  - good for BW too!
- Do not need to add links, or xbar, only buffer resources
[Figure: switch with input ports, cross-bar, and output ports]

Breaking deadlock with virtual channels
- Packet switches from lo to hi channel
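A small sketch of the lo-to-hi switch on a unidirectional ring, under assumptions not spelled out on the slide: link i goes from node i to node (i+1) mod k, every packet starts on the lo virtual channel, and it moves to hi after crossing the wraparound link into node 0. Numbering lo channels i and hi channels k+i then gives every route an increasing sequence. All names are hypothetical.

```python
def ring_route(src, dst, k, use_virtual_channels):
    """Return the (vc, link) pairs used going from src to dst around the ring."""
    hops = []
    node, vc = src, "lo"
    while node != dst:
        hops.append((vc, node))  # link `node` connects node -> (node + 1) % k
        node = (node + 1) % k
        if node == 0 and use_virtual_channels:
            vc = "hi"  # crossed the wraparound link: switch channel class
    return hops


def number(hop, k):
    """lo channel on link i gets i; hi channel on link i gets k + i."""
    vc, link = hop
    return link + (k if vc == "hi" else 0)


def monotonic(k, use_virtual_channels):
    """Check that every route's channel numbers are strictly increasing."""
    for src in range(k):
        for dst in range(k):
            hops = ring_route(src, dst, k, use_virtual_channels)
            nums = [number(h, k) for h in hops]
            if any(b <= a for a, b in zip(nums, nums[1:])):
                return False
    return True


print(ring_route(2, 1, 4, True))  # [('lo', 2), ('lo', 3), ('hi', 0)]
print(monotonic(4, True))         # True
print(monotonic(4, False))        # False
```

With a single channel class the check fails under this numbering, and no other numbering can rescue it because the dependences c0 -> c1 -> ... -> c0 form a cycle. The second class only costs buffer space, not extra links.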

Up*-Down* routing
- Given any bidirectional network
- Construct a spanning tree
- Number the nodes, increasing from leaves to root
- UP edges increase node numbers
- Any Source -> Dest by an UP*-DOWN* route
  - up edges, single turn, down edges
- Performance?
  - Some numberings and routes much better than others
  - interacts with topology in strange ways
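The legality rule (up edges, a single turn, then down edges) can be checked with a short function. This sketch assumes every node has a distinct number and that a route is given as a list of node names; the example labels are hypothetical.

```python
def is_up_down(route, label):
    """True if the route takes zero or more up edges, then zero or more down edges."""
    gone_down = False
    for a, b in zip(route, route[1:]):
        going_up = label[b] > label[a]  # an UP edge increases the node number
        if going_up and gone_down:
            return False  # an up edge after a down edge is not allowed
        if not going_up:
            gone_down = True
    return True


# Hypothetical numbering: larger numbers are closer to the root.
label = {"root": 3, "a": 2, "b": 1, "c": 0}

print(is_up_down(["c", "a", "root", "b"], label))  # True
print(is_up_down(["a", "b", "root"], label))       # False
```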

Turn Restrictions in X,Y
- XY routing forbids 4 of 8 turns and leaves no room for adaptive routing
- Can you allow more turns and still be deadlock free?
[Figure: the eight turns among +X, -X, +Y, -Y]

Minimal turn restrictions in 2D
- West-first
- North-last
- Negative-first
[Figure: allowed turns among +x, -x, +y, -y for each restriction]
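Since the figure is lost in this transcript, here is a sketch of the turn restrictions as sets of forbidden 90-degree turns. The sets follow the usual statement of the turn model (each of the three minimal restrictions forbids two turns) and should be read as an assumption about what the figure showed; routes are lists of travel directions, and 180-degree reversals are not modeled.

```python
# Each entry is (direction travelled, direction turned into).
FORBIDDEN = {
    # x,y routing: never turn from a y direction back into an x direction.
    "xy": {("north", "east"), ("north", "west"),
           ("south", "east"), ("south", "west")},
    # West-first: all westward hops come first, so no turn into west.
    "west-first": {("north", "west"), ("south", "west")},
    # North-last: once heading north, no further turns.
    "north-last": {("north", "east"), ("north", "west")},
    # Negative-first: no turn from a positive direction into a negative one.
    "negative-first": {("north", "west"), ("east", "south")},
}


def legal(route, model):
    """True if no consecutive pair of hops makes a forbidden turn."""
    return all(turn not in FORBIDDEN[model] for turn in zip(route, route[1:]))


print(legal(["west", "north", "east"], "west-first"))  # True
print(legal(["north", "west"], "west-first"))          # False
print(legal(["north", "east"], "xy"))                  # False
```

The west-first example shows the extra freedom: after the westward hops are done, the packet may still choose between north, south, and east at each step.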

Example legal west-first routes
- Can route around failures or congestion
- Can combine turn restrictions with virtual channels

Adaptive Routing
- R: C x N x S -> C
- Essential for fault tolerance
  - at least multipath
- Can improve utilization of the network
- Simple deterministic algorithms easily run into bad permutations
- fully/partially adaptive, minimal/non-minimal
- can introduce complexity or anomalies
- A little adaptation goes a long way!

Switch Design
[Figure: switch datapath with input ports, receiver, input buffer, cross-bar, output buffer, transmitter, and output ports; control performs routing and scheduling]

How do you build a crossbar?
[Figure: crossbar implementations connecting inputs I0 to I3 with outputs O0 to O3, including a RAM-based design with phase, addr, Din, and Dout]

Input buffered switch
- Independent routing logic per input
  - FSM
- Scheduler logic arbitrates each output
  - priority, FIFO, random
- Head-of-line blocking problem
[Figure: input ports with routing logic R0 to R3, cross-bar, output ports, scheduling]
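Head-of-line blocking is easy to see in a toy model. The sketch below assumes one FIFO per input holding destination output numbers, and a fixed-priority scheduler in which the lowest-numbered input wins each output; both are simplifications chosen for illustration, not the design of any switch in the lecture.

```python
from collections import deque


def switch_cycle(queues):
    """Advance one cycle and return {output: input} for the delivered packets.

    Only the head of each input FIFO may compete for an output.
    """
    granted = {}
    for i, q in enumerate(queues):
        if q and q[0] not in granted:
            granted[q[0]] = i  # lowest-numbered requesting input wins
    for i in granted.values():
        queues[i].popleft()
    return granted


# Input 0 holds one packet for output 0; input 1 holds packets for outputs 0 and 1.
queues = [deque([0]), deque([0, 1])]

print(switch_cycle(queues))  # {0: 0}  output 1 idles: its packet is stuck behind the head
print(switch_cycle(queues))  # {0: 1}
print(switch_cycle(queues))  # {1: 1}
```

In the first cycle output 1 is free and a packet for it is waiting, yet nothing is sent there because that packet sits behind a blocked head.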

Output Buffered Switch
- How would you build a shared pool?
[Figure: input ports with routing logic R0 to R3, each feeding buffers at every output port; control]

Example: IBM SP Vulcan switch
- Many gigabit Ethernet switches use a similar design without the cut-through
[Figure: each input port has flow control, FIFO, CRC check, route control, and a deserializer (8 -> 64); a central queue (RAM 64x128) with In Arb and Out Arb sits beside an 8 x 8 crossbar; each output port has a serializer (64 -> 8), FIFO, XBar Arb, CRC Gen, and flow control]

Output scheduling
- n independent arbitration problems?
  - static priority, random, round-robin
- simplifications due to routing algorithm?
- general case is max bipartite matching
[Figure: input buffers with routing logic R0 to R3, cross-bar, output ports O0 to O2]
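Of the per-output policies listed above, round-robin is simple to sketch. The class below is a hypothetical illustration: one arbiter instance per output, with requests passed as a set of input indices.

```python
class RoundRobinArbiter:
    """Grant one requesting input per call, rotating priority after each grant."""

    def __init__(self, n_inputs):
        self.n = n_inputs
        self.last = n_inputs - 1  # so that input 0 has top priority at first

    def grant(self, requests):
        """Return the winning input index, or None if nobody requests."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in requests:
                self.last = candidate  # the winner gets lowest priority next time
                return candidate
        return None


arb = RoundRobinArbiter(4)
print([arb.grant({1, 3}) for _ in range(3)])  # [1, 3, 1]
```

Running one such arbiter per output treats scheduling as n independent problems. It does not find a maximum bipartite matching, because an input whose head could go to several outputs is not coordinated across arbiters.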

Stacked Dimension Switches
- Dimension order on 3D cube?
- Cube connected cycles?
[Figure: three stacked 2x2 switches between Host In and Host Out, with ports Zin/Zout, Yin/Yout, and Xin/Xout]

Flow Control
- What do you do when push comes to shove?
  - Ethernet: collision detection and retry after delay
  - FDDI, token ring: arbitration token
  - TCP/WAN: buffer, drop, adjust rate
  - any solution must adjust to output rate
- Link-level flow control
[Figure: link with Ready and Data signals]

Examples
- Short links
- Long links
  - several flits on the wire
[Figure: source and destination buffers with F/E bits, connected by Req, Ready/Ack, and Data signals]

Smoothing the flow
- How much slack do you need to maximize bandwidth?
[Figure: buffer with incoming and outgoing phits, levels Full, High Mark, Low Mark, and Empty, and flow-control symbols Stop and Go]
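One common back-of-the-envelope answer to the slack question, offered here as an assumption and not as the lecture's stated result: after the receiver sends Stop, phits keep arriving for about one round trip (the Stop travels one way while the phits already on the wire travel the other), so the space above the high mark should hold a round trip's worth of phits. The same amount below the low mark keeps the output busy until Go takes effect. The function names and the neglect of processing delay are simplifications.

```python
def min_slack_phits(one_way_delay_cycles, phits_per_cycle=1):
    """Phits that can still arrive after Stop is sent: one round trip of traffic."""
    return 2 * one_way_delay_cycles * phits_per_cycle


def marks(capacity_phits, one_way_delay_cycles, phits_per_cycle=1):
    """Return (low_mark, high_mark) leaving a round trip of slack at both ends."""
    slack = min_slack_phits(one_way_delay_cycles, phits_per_cycle)
    if capacity_phits < 2 * slack:
        raise ValueError("buffer too small to keep the link busy at this delay")
    return slack, capacity_phits - slack


print(min_slack_phits(4))  # 8
print(marks(32, 4))        # (8, 24)
```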

Example: T3D
- 3D bidirectional torus, dimension order (NIC selected), virtual cut-through, packet sw.
- 16 bit x 150 MHz, short, wide, synch.
- rotating priority per output
- logically separate request/response
- 3 independent, stacked switches
- 8 16-bit flits on each of 4 VC in each direction
- Packet formats
  - Read Req (no cache, cache, prefetch, fetch&inc): Route Tag, Dest PE, Command, Addr 0, Addr 1, Src PE
  - Read Resp: Route Tag, Dest PE, Command, Word 0
  - Read Resp (cached): Route Tag, Dest PE, Command, Word 0, Word 1, Word 2, Word 3
  - Write Req (Proc, BLT 1, fetch&inc): Route Tag, Dest PE, Command, Addr 0, Addr 1, Src PE, Word 0
  - Write Req (proc 4, BLT 4): Route Tag, Dest PE, Command, Addr 0, Addr 1, Src PE, Word 0, Word 1, Word 2, Word 3
  - Write Resp: Route Tag, Dest PE, Command
  - BLT Read Req: Route Tag, Dest PE, Command, Addr 0, Addr 1, Src PE, Addr 0, Addr 1
  - Command fields: Packet Type, req/resp, command (3, 1, 8)

Example: SP
- 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock
- packet sw, cut-through, no virtual channel, source-based routing
- variable packet <= 255 bytes, 31 byte fifo per input, 7 bytes per output, 16 phit links
- 128 8-byte chunks in central queue, LRU per output
- run in shadow mode
[Figure: 16-node rack with a switch board, intra-rack host ports P0 to P15, and inter-rack external switch ports E0 to E15; multi-rack configuration]

Summary
- Routing algorithms restrict the set of routes within the topology
  - simple mechanism selects turn at each hop
  - arithmetic, selection, lookup
- Deadlock-free if channel dependence graph is acyclic
  - limit turns to eliminate dependences
  - add separate channel resources to break dependences
  - combination of topology, algorithm, and switch design
- Deterministic vs adaptive routing
- Switch design issues
  - input/output/pooled buffering, routing logic, selection logic
- Flow control
- Real networks are a package of design choices