Interconnection Networks: Topology. Prof. Natalie Enright Jerger


Interconnection Networks: Topology Prof. Natalie Enright Jerger

Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design Significant impact on network cost-performance Determines number of hops Latency Network energy consumption Implementation complexity Node degree Ease of layout 2

ABSTRACT METRICS 3

Abstract Metrics Use metrics to evaluate performance and cost of topology Also influenced by routing/flow control At this stage Assume ideal routing (perfect load balancing) Assume ideal flow control (no idle cycles on any channel) 4

Abstract Metrics: Degree Switch degree: number of links at a node Proxy for estimating cost Higher degree requires more links and port counts at each router [Figure: example networks with node degrees 2; 2, 3, and 4; and 4] 5

Abstract Metrics: Hop Count Path: ordered set of channels between source and destination Hop Count: number of hops a message takes from source to destination Simple, useful proxy for network latency Every node, link incurs some propagation delay even when no contention Minimal hop count: smallest hop count connecting two nodes 6

Hop Count Network diameter: largest min hop count in network Average minimum hop count: average across all src/dst pairs Implementation may incorporate non-minimal paths Increases average hop count 7

Hop Count [Figure: small ring, mesh, and torus examples] Ring: Max = 4, Avg = 2.2 Mesh: Max = 4, Avg = 1.77 Torus: Max = 2, Avg = 1.33 Uniform random traffic: Ring > Mesh > Torus Derivations later 8
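The hop-count figures above can be reproduced with a short breadth-first search. A minimal Python sketch, assuming the figure shows a 9-node ring and a 3x3 mesh and torus (sizes chosen because they reproduce the quoted averages; averages include self pairs):

```python
from collections import deque
from itertools import product

def avg_and_max_hops(nodes, neighbors):
    """BFS from every node: average min hop count over all src/dst
    pairs (self pairs included) and the network diameter."""
    total, diameter = 0, 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    return total / len(nodes) ** 2, diameter

k = 3  # 3 nodes per dimension for the mesh and torus

ring = list(range(9))
ring_nb = lambda u: [(u - 1) % 9, (u + 1) % 9]

mesh = list(product(range(k), repeat=2))
def mesh_nb(u):
    x, y = u
    out = []
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < k and 0 <= ny < k:
            out.append((nx, ny))
    return out

torus_nb = lambda u: [((u[0] + 1) % k, u[1]), ((u[0] - 1) % k, u[1]),
                      (u[0], (u[1] + 1) % k), (u[0], (u[1] - 1) % k)]

print(avg_and_max_hops(ring, ring_nb))   # avg ~2.22, max 4
print(avg_and_max_hops(mesh, mesh_nb))   # avg ~1.78, max 4
print(avg_and_max_hops(mesh, torus_nb))  # avg ~1.33, max 2
```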

Latency Time for packet to traverse network Start: head arrives at input port End: tail departs output port Latency = head latency + serialization latency Serialization latency: time for packet of length L to cross channel with bandwidth b (L/b) Approximate head latency with hop count Other design choices (routing, flow control) impact latency Unknown at this stage 9

Abstract Metrics: Maximum Channel Load Estimate max bandwidth the network can support Max bits per second (bps) that can be injected by every node before it saturates Saturation: network cannot accept any more traffic Determine most congested link For given traffic pattern Will limit overall network bandwidth Estimate load on this channel 10

Preliminary Maximum Channel Load Don't know specifics of link yet Define relative to injection load Channel load of 2: channel is loaded with twice injection bandwidth If each node injects a flit every cycle, 2 flits will want to traverse bottleneck channel every cycle If bottleneck channel can only handle 1 flit per cycle, max network bandwidth is ½ link bandwidth: a flit can be injected every other cycle 11

Maximum Channel Load Example [Figure: 8-node network, nodes A-H, with a bottleneck channel] Uniform random: every node has equal probability of sending to every node Identify bottleneck channel Half of traffic from every node will cross bottleneck channel: 8 × ½ = 4 Network saturates at ¼ injection bandwidth 12

Bisection Bandwidth Common off-chip metric Proxy for cost: amount of global wiring that will be necessary Less useful for on-chip: global on-chip wiring considered abundant Cut: partitions all the nodes into two disjoint sets; each cut has a bandwidth Bisection: a cut which divides all nodes into (nearly) half Channel bisection: min channel count over all bisections Bisection bandwidth: min bandwidth over all bisections With uniform traffic, ½ of traffic crosses bisection 13

Throughput Example [Figure: 8-node ring, nodes 0-7] Bisection = 4 (2 in each direction) With uniform random traffic: 3 sends 1/8 of its traffic to 4, 5, 6 3 sends 1/16 of its traffic to 7 (2 possible shortest paths) 2 sends 1/8 of its traffic to 4, 5 Etc. Channel load = 1 14
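The channel load of 1 can be checked by distributing each source's traffic over minimal ring paths. A small Python sketch (each node sends 1/N of its traffic to every node, itself included; tied minimal paths split the flow evenly):

```python
from collections import defaultdict

N = 8  # 8-node ring, nodes 0..7

def min_paths(s, d):
    """Directed-edge lists of the minimal ring paths s->d (1, or 2 if tied)."""
    if s == d:
        return []
    cw  = [((s + i) % N, (s + i + 1) % N) for i in range((d - s) % N)]
    ccw = [((s - i) % N, (s - i - 1) % N) for i in range((s - d) % N)]
    if len(cw) < len(ccw):
        return [cw]
    if len(ccw) < len(cw):
        return [ccw]
    return [cw, ccw]

# Accumulate load on every directed channel under uniform random traffic.
load = defaultdict(float)
for s in range(N):
    for d in range(N):
        paths = min_paths(s, d)
        for p in paths:
            for e in p:
                load[e] += (1 / N) / len(paths)

print(max(load.values()))  # 1.0: bottleneck loaded at injection bandwidth
```

By edge symmetry of the ring, every directed channel ends up with the same load, so the maximum channel load equals the average.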

Path Diversity Multiple shortest paths (R) between a source/destination pair Fault tolerance Better load balancing in network Routing algorithm should be able to exploit path diversity [Figure: three example networks with R(A-B) = 1, R(A-B) = 6, R(A-B) = 2] 15

NETWORK EVALUATION 16

Evaluating Networks Analytical and theoretical analysis E.g. mathematical derivations of max channel load Analytical models for power (DSENT) Simulation-based analysis Network-only simulation with synthetic traffic patterns Full system simulation with real application benchmarks Hardware implementation HDL implementation to measure power, area, frequency etc. Measurement on real hardware Profiling and analyzing communication 17

Evaluating Topologies Important to consider traffic pattern Talked about system architecture impact on traffic If actual traffic pattern unknown Synthetic traffic patterns Evaluate common scenarios Stress test network Derive various properties of network 18

Traffic Patterns Historically derived from particular applications of interest Spatial distribution Matrix transpose: transpose traffic pattern d_i = s_((i + b/2) mod b) (b-bit address, d_i: ith bit of destination) Fall 2014 19

Traffic Patterns Examples Fast Fourier Transform (FFT) or sorting application: shuffle permutation Fluid dynamics: neighbor patterns Shuffle: d_i = s_((i - 1) mod b) Neighbor: d_x = (s_x + 1) mod k 20
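These bit-permutation patterns can be written down directly. A Python sketch (function names are illustrative, not from any standard library):

```python
def transpose(s, b):
    """d_i = s_((i + b/2) mod b): swap the two halves of a b-bit address."""
    d = 0
    for i in range(b):
        d |= ((s >> ((i + b // 2) % b)) & 1) << i
    return d

def shuffle(s, b):
    """d_i = s_((i - 1) mod b): rotate the b-bit address left by one."""
    return ((s << 1) | (s >> (b - 1))) & ((1 << b) - 1)

def neighbor(coords, k):
    """d_x = (s_x + 1) mod k in every dimension of a k-ary network."""
    return tuple((c + 1) % k for c in coords)

b = 4
print([transpose(s, b) for s in range(2 ** b)])
print([shuffle(s, b) for s in range(2 ** b)])
```

Transpose is an involution (applying it twice returns the original address), which is a handy sanity check for an implementation.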

Traffic Patterns (3) Uniform random: each source equally likely to communicate with each destination Most commonly used traffic pattern Very benign: traffic is uniformly distributed Balances load even if topology/routing algorithm has very poor load balancing Need to be careful But can be good for debugging/verifying implementation: well-understood pattern 21

Stress-testing Network Uniform random can make bad topologies look good Permutation traffic will stress-test the network Many types of permutation (ex: shuffle, transpose, neighbor) Each source sends all traffic to single destination Concentration of load on individual pairs Stresses load balancing 22

Traffic Patterns For topology/routing discussion Focus on spatial distribution Traffic patterns also have temporal aspects Bursty behavior Important to capture temporal behavior as well Motivate need for new traffic patterns 23

Application Full System Simulation [Figure: full-system simulator (processor, cache, disk, other components) sending packets to a NoC simulator, with feedback on packet arrival] Accurate but slow ECE 1749H: Interconnection Networks (Enright Jerger) 24

Application Trace Simulation [Figure: trace simulator (processor, cache, disk) producing a trace of packets sent, driving NoC simulators for NoC A and NoC B] Faster but less accurate 25

Application Traffic Patterns [Figure: synthetic traffic driver feeding a NoC simulator] Traffic patterns: uniform random, bit complement, bit reverse, bit rotation, shuffle, transpose, tornado, neighbour Very fast but inaccurate 26

COMMON TOPOLOGIES 27

Types of Topologies Focus on switched topologies Alternatives: bus and crossbar Bus Connects a set of components to a single shared channel Effective broadcast medium Crossbar Directly connects n inputs to m outputs without intermediate stages Fully connected, single hop network Component of routers 28

Types of Topologies Direct Each router is associated with a terminal node All routers are sources and destinations of traffic Indirect Routers are distinct from terminal nodes Terminal nodes can source/sink traffic Intermediate nodes switch traffic between terminal nodes To date: Most on-chip networks use direct topologies 29

Torus (1) k-ary n-cube: k^n network nodes n-dimensional grid with k nodes in each dimension [Figure: 3-ary 2-mesh, 2-cube, and 2,3,4-ary 3-mesh examples] 30

Torus (2) Maps well to planar substrate for on-chip Topologies in torus family, e.g. ring = k-ary 1-cube Edge symmetric: good for load balancing Removing wrap-around links for mesh loses edge symmetry More traffic concentrated on center channels Good path diversity Exploits locality for near-neighbor traffic 31

Hop Count Average shortest distance over all pairs of nodes For uniform random traffic, a packet travels on average k/4 hops in each of n dimensions Torus: H_min = nk/4 (k even), n(k/4 - 1/(4k)) (k odd) Mesh: H_min = nk/3 (k even), n(k/3 - 1/(3k)) (k odd) 32
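These closed-form averages can be evaluated directly. A quick Python check against the 3x3 averages quoted earlier (1.33 for the torus, 1.77 for the mesh):

```python
def torus_avg_hops(k, n):
    """H_min for a k-ary n-cube: nk/4 (k even), n(k/4 - 1/(4k)) (k odd)."""
    return n * k / 4 if k % 2 == 0 else n * (k / 4 - 1 / (4 * k))

def mesh_avg_hops(k, n):
    """H_min for a k-ary n-mesh: nk/3 (k even), n(k/3 - 1/(3k)) (k odd)."""
    return n * k / 3 if k % 2 == 0 else n * (k / 3 - 1 / (3 * k))

print(torus_avg_hops(3, 2))  # 1.333..., the 3x3 torus average
print(mesh_avg_hops(3, 2))   # 1.777..., the 3x3 mesh average
print(torus_avg_hops(4, 2))  # 2.0, even-k case: nk/4
```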

Torus (4) Degree = 2n, 2 channels per dimension All nodes have same degree Total channels = 2nN 33

Channel Load for Torus Even number of k-ary (n-1)-cubes in outer dimension Dividing these k-ary (n-1)-cubes gives 2 sets of k^(n-1) bidirectional channels, i.e. 4k^(n-1) unidirectional channels ½ of the traffic from each node crosses the bisection: channel load = (N/2) / (4k^(n-1)) = (N/2) / (4N/k) = k/8 Mesh has ½ the bisection bandwidth of torus 34

Deriving Channel Load: 4-ary 2-cube Divide network in half Number of bisection channels: 8 bidirectional links = 16 channels (4N/k = 4 · 16/4 = 16) ½ of the traffic crosses the bisection: N/2 = 8 N/2 traffic distributed over 16 channels Channel load = 8/16 = ½ Network saturates at twice the injection bandwidth 35
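The arithmetic on this slide, spelled out as a few lines of Python:

```python
k, n = 4, 2                            # 4-ary 2-cube
N = k ** n                             # 16 nodes
bisection_channels = 4 * k ** (n - 1)  # 4N/k = 16 unidirectional channels
crossing_traffic = N / 2               # half of all injected traffic: 8
load = crossing_traffic / bisection_channels
print(load)                            # 0.5, i.e. k/8
```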

Torus Path Diversity 2 dimensions* NW, NE, SW, SE combos 2 edge and node disjoint minimum paths *assume single direction for x and y 36

Mesh A torus with end-around connection removed Same node degree Bisection channels halved Max channel load = k/4 Higher demand for central channels Load imbalance 37

Butterfly Indirect network k-ary n-fly: k^n network nodes [Figure: 2-ary 3-fly, terminals 0-7, switches 00-23: 2-input switches, 3 stages] Routing from 000 to 010: destination address used to directly route packet Bit n used to select output port at stage n 38

Butterfly (2) No path diversity: R(x-y) = 1 Can add extra stages for diversity Increases network diameter [Figure: butterfly with an extra stage, switches x0-x3] 39

Butterfly (3) Hop count: log_k(N) + 1 Does not exploit locality: hop count same regardless of location Switch degree = 2k Requires long wires to implement 40

Butterfly: Channel Load Channel demand: H_min × N, the number of channel traversals required to deliver one round of packets Uniform traffic equally loads channels: channel load = N·H_min / C = k^n(n+1) / (k^n(n+1)) = 1 Increases for adversarial traffic 41
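A quick numeric check of the uniform-traffic load for the 2-ary 3-fly used in these slides:

```python
k, n = 2, 3
N = k ** n            # 8 terminal nodes
H_min = n + 1         # log_k(N) + 1 = 4 hops
C = k ** n * (n + 1)  # total channel count: 32
demand = N * H_min    # channel traversals per round of packets: 32
print(demand / C)     # 1.0 under uniform traffic
```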

Butterfly: Deriving Channel Load Divide network in half Number of bisection channels: 4 [Figure: 2-ary 3-fly cut between top and bottom halves] 4 nodes (top half) send ½ of their traffic to lower half: 4 × ½ = 2 Distributed across 2 downward channels (C) Channel load = 1 42

Butterfly: Channel Load Adversarial traffic: all traffic from top half sent to bottom half (e.g. 0 sends to 4, 1 sends to 5) [Figure: 2-ary 3-fly] Channel load = 2 Network saturates at ½ the injection bandwidth 43

Clos Network 3-stage indirect network Larger number of stages: built recursively by replacing middle stage with a 3-stage Clos Characterized by triple (m, n, r) m: # of middle-stage switches n: # of input/output ports on each input/output switch r: # of input/output switches Hop count = 4 44

Clos Network [Figure: r n×m input switches, m r×r middle switches, r m×n output switches] 45

Clos Network Strictly non-blocking when m ≥ 2n - 1: any input can connect to any unused output port r × n terminal nodes Degree: first and last stages n + m, middle stage 2r Path diversity: m Can be folded along middle switches: input and output switches are shared 46
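The (m, n, r) parameters above can be bundled into a small helper. A sketch in Python (the function and dictionary keys are illustrative, not from any library):

```python
def clos_properties(m, n, r):
    """Basic properties of an (m, n, r) Clos network:
    m middle switches, n ports per I/O switch, r I/O switches."""
    return {
        "terminals": r * n,                       # r x n terminal nodes
        "strictly_nonblocking": m >= 2 * n - 1,   # Clos's condition
        "io_switch_degree": n + m,                # first/last stage degree
        "middle_switch_degree": 2 * r,            # middle stage degree
        "path_diversity": m,                      # one path per middle switch
        "hop_count": 4,                           # fixed for 3-stage Clos
    }

print(clos_properties(m=5, n=3, r=4))
```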

Folded Clos (Fat Tree) Bandwidth remains constant at each level Regular Tree: Bandwidth decreases closer to root 47

Fat Tree (2) Provides path diversity 48

Application of Topologies to On-Chip Networks Flattened butterfly (FBFly): convert butterfly to direct network Swizzle Switch: circuit-optimized crossbar Rings: simple, low hardware cost Mesh networks: several products/prototypes MECS and bus-based networks: broadcast and multicast capabilities 49

Implementation Folding Equalize path lengths Reduces max link length Increases length of other links [Figure: linear vs folded placement of nodes 0-3] 50

Concentration Don t need 1:1 ratio of routers to cores Ex: 4 cores concentrated to 1 router Can save area and power Increases network complexity Concentrator must implement policy for sharing injection bandwidth During bursty communication Can bottleneck 51

Implication of Abstract Metrics on Implementation Degree: useful proxy for router complexity Increasing ports requires additional buffer queues, requestors to allocators, ports to crossbar All contribute to critical path delay, area and power Link complexity does not correlate with degree Link complexity depends on link width With a fixed number of wires, link complexity for 2-port vs 3-port is the same 52

Implications (2) Hop Count: useful proxy for overall latency and power Does not always correlate with latency Depends heavily on router pipeline and link propagation Example: Network A with 2 hops, 5-stage pipeline, 4-cycle link traversal vs. Network B with 3 hops, 1-stage pipeline, 1-cycle link traversal Hop count says A is better than B But A has 18-cycle latency vs 6-cycle latency for B 54
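The A-vs-B comparison is just arithmetic under a rough per-hop model (each hop pays one router pipeline traversal plus one link traversal; a sketch, ignoring contention and serialization):

```python
def network_latency(hops, router_cycles, link_cycles):
    """Head latency: each hop pays one router pipeline + one link traversal."""
    return hops * (router_cycles + link_cycles)

a = network_latency(hops=2, router_cycles=5, link_cycles=4)
b = network_latency(hops=3, router_cycles=1, link_cycles=1)
print(a, b)  # 18 6: fewer hops, yet A is 3x slower than B
```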

Implications (3) Topologies typically trade off hop count and node degree Max channel load: useful proxy for network saturation and max power Higher max channel load means greater network congestion Traffic pattern impacts max channel load: representative traffic patterns important Max power: dynamic power is highest with peak switching activity and utilization in network 55

Topology Summary First network design decision Critical impact on network latency and throughput Hop count provides first order approximation of message latency Bottleneck channels determine saturation throughput 56