Topologies. Maurizio Palesi. Maurizio Palesi 1

Similar documents
Topologies. Maurizio Palesi. Maurizio Palesi 1

Network-on-chip (NOC) Topologies

Chapter 3 : Topology basics

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Topology basics. Constraints and measures. Butterfly networks.

Lecture 3: Topology - II

ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts

Lecture 2: Topology - I

Flow Control can be viewed as a problem of

4. Networks. in parallel computers. Advances in Computer Architecture

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Lecture: Interconnection Networks

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control

Homework Assignment #1: Topology Kelly Shaw

CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley Wide links, smaller routing delay Tremendous variation 3/19/99 CS258 S99 2

Interconnection Network

Interconnect Technology and Computational Speed

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control

Lecture 7: Flow Control - I

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms

Physical Organization of Parallel Platforms. Alexandre David

Routing Algorithms. Review

Chapter 4 : Butterfly Networks

Lecture 22: Router Design

EECS 578 Interconnect Mini-project

Interconnection networks

Lecture 2 Parallel Programming Platforms

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

INTERCONNECTION networks are used in a variety of applications,

Basic Switch Organization

Interconnection Networks: Routing. Prof. Natalie Enright Jerger

TDT Appendix E Interconnection Networks

Packet Switch Architecture

Packet Switch Architecture

CS575 Parallel Processing

Phastlane: A Rapid Transit Optical Routing Network

The Effect of Adaptivity on the Performance of the OTIS-Hypercube under Different Traffic Patterns

Lecture 18: Communication Models and Architectures: Interconnection Networks

Deadlock and Livelock. Maurizio Palesi

Communication Performance in Network-on-Chips

INTERCONNECTION NETWORKS LECTURE 4

The final publication is available at

Multiprocessor Interconnection Networks- Part Three

CS 614 COMPUTER ARCHITECTURE II FALL 2005

CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management

Interconnection Networks

Interconnection Network. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

A Survey of Techniques for Power Aware On-Chip Networks.

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background

Interconnection Networks

Lecture 3: Flow-Control

Bandwidth Aware Routing Algorithms for Networks-on-Chip

Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture

Interconnection Network

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

CS 498 Hot Topics in High Performance Computing. Networks and Fault Tolerance. 9. Routing and Flow Control

Global Adaptive Routing Algorithm Without Additional Congestion Propagation Network

Deadlock and Router Micro-Architecture

Overlaid Mesh Topology Design and Deadlock Free Routing in Wireless Network-on-Chip. Danella Zhao and Ruizhe Wu Presented by Zhonghai Lu, KTH

Finding Worst-case Permutations for Oblivious Routing Algorithms

CS 6143 COMPUTER ARCHITECTURE II SPRING 2014

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin

SHARED MEMORY VS DISTRIBUTED MEMORY

Interconnection Network Project EE482 Advanced Computer Organization May 28, 1999

EE/CSCI 451: Parallel and Distributed Computation

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger

A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin

VIII. Communication costs, routing mechanism, mapping techniques, cost-performance tradeoffs. April 6 th, 2009

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors

Parallel Computer Architecture II

Network Infrastructure

Interconnection topologies (cont.) [ ] In meshes and hypercubes, the average distance increases with the dth root of N.

Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals

WITH THE CONTINUED advance of Moore s law, ever

BARP-A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs

Wormhole Routing Techniques for Directly Connected Multicomputer Systems

Express Virtual Channels: Towards the Ideal Interconnection Fabric

CPS 303 High Performance Computing. Wensheng Shen Department of Computational Science SUNY Brockport

Asynchronous Bypass Channel Routers

Network on Chip Architecture: An Overview

NOC Deadlock and Livelock

Fast-Response Multipath Routing Policy for High-Speed Interconnection Networks

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Design of Parallel Algorithms. The Architecture of a Parallel Computer

Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996

Randomized Partially-Minimal Routing: Near-Optimal Oblivious Routing for 3-D Mesh Networks

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

Topology Awareness in the Tofu Interconnect Series

Parallel Computing Interconnection Networks

Parallel Computing Platforms

Network Properties, Scalability and Requirements For Parallel Processing. Communication assist (CA)

Computer Engineering Mekelweg 4, 2628 CD Delft The Netherlands MSc THESIS

BlueGene/L. Computer Science, University of Warwick. Source: IBM

Slim Fly: A Cost Effective Low-Diameter Network Topology

EE/CSCI 451: Parallel and Distributed Computation

Transcription:

Topologies Maurizio Palesi Maurizio Palesi 1

Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance Cost and performance determided by many factors (flow control, routing, traffic) Measures to evaluate just the topology Bisection bandwidth Channel load Path delay Maurizio Palesi 2

Units of Resource Allocation A message is a contiguous group of bits that is delivered from source terminal to destination terminal. A message consists of packets A packet is the basic unit for routing and sequencing Packets maybe divided into flits A flit (flow control digit) is the basic unit of bandwidth and storage allocation. Flits do not have any routing or sequence information and have to follow the route for the whole packet A phit (physical transfer digits) is the unit that is transfered across a channel in a single clock cycle Maurizio Palesi 3

Units of Resource Allocation Maurizio Palesi 4

Network-on-Chip Factors that influence the performance of a NoC are Topology (static arrangement of channels and nodes) Routing Technique (selection of a path through the network) Flow Control (how are network resources allocated, if packets traverse the network) Router Architecture (buffers and switches) Traffic Pattern Maurizio Palesi 5

Topologies Direct network (Every node is both a terminal and a switch) Indirect network Switch Terminal node Maurizio Palesi 6

Direct and Indirect Networks Direct Network Every Node in the network is both a terminal and a switch Indirect Network Nodes are either switches or terminal Maurizio Palesi 7

Direct Network Topologies Most popular direct network topologies fall into n dimensional meshes Nodes have from n to 2n neighbors depending on their location k ary n cubes All nodes have the same number of neighbors (2n) 2 ary 4 cube 3 ary 2 cube 3D mesh Maurizio Palesi 8

Direct to Indirect Maurizio Palesi 9

Cuts A cut of a network, C(N 1,N 2 ), is a set of channels that partitions the set of all nodes into two disjoint sets, N 1 and N 2 Each element in C(N 1,N 2 ) is a channel with a source in N 1 and destination in N 2 or vice versa Total bandwidth of the cut C(N 1,N 2 ) B N 1,N 2 = c C N 1, N 2 b c Maurizio Palesi 10

Bisection The bisection is a cut that partitions the entire network nearly in half The channel bisection of a network, B C, is the minimum channel count over all bisections B C = min bisections The bisection bandwidth of a network, B B, is the minimum bandwidth over all bisections B B = min bisections C N 1, N 2 B N 1, N 2 Maurizio Palesi 11

Diameter The diameter of a network, H max, is the largest, minimal hop count over all pairs of terminal nodes H max = max x,y N H x, y For a fully connected network with N terminals built from switches with out degree O, H max is bounded by H max log δo N (1) Each terminal can reach at most O other terminals after one hop At most O 2 after two hops, and at most O H after H hops If we set O H = N and solve for H, we get (1) Maurizio Palesi 12

Average Minimum Hop count The average minimum hop count of a network, H min, is defined as the average hop count over all sources and destinations H min = 1 N 2 x, y N H x, y Maurizio Palesi 13

Physical Distance and Delay The physical distance of a path is D P = c P l c The delay of a path is t P =D P /v Maurizio Palesi 14

Traffic Patterns We represents message distributions with a traffic matrix where each matrix element s,d gives the fraction of traffic sent from node s to node d Hisorically based on communication patterns that arise in particular applications Matrix transpose or corner turn operations FFT, sorting shuffle Fluid dynamics simulations neighbor patterns Maurizio Palesi 15

Traffic Patterns Uniform Transpose 1 dst.x W src.x, dst.y H src.y Transpose 2 dst.x src.x, dst.y src.y Bit reversal Butterfly Shuffle Maurizio Palesi 16

Performance Throughput Data rate in bits/s that the network accepts per input port It is a property of the entire network It depends on Routing Flow control Topology Maurizio Palesi 17

Ideal Throughput Ideal throughput of a topology Throughput that the network could carry with perfect flow control (no contention) and routing (load balanced over alternative paths) Maximum throughput It occurs when some channel in the network becomes saturated Maurizio Palesi 18

Channel Load We define the load of a channel c, c, as γ c = bandwidth demanded from channel c bandwidth of the input ports Equivalently Amount of traffic that must cross c if each input injects one unit of traffic Of course, it depends on the traffic pattern considered We will assume uniform traffic Maurizio Palesi 19

Maximum Channel Load Under a particular traffic pattern, the channel that carries the largest fraction of traffic determines the maximum channel load max of the topology γ max =max c C γ c Maurizio Palesi 20

Ideal Throughput When the offered traffic reaches the throughput of the network, the load on the bottleneck channel will be equal to the channel bandwidth b Any additional traffic would overload this channel The ideal throughput Θ ideal is the input bandwidth that saturates the bottleneck channel bandwidth demanded from channel c γ c = bandwidth of the input ports γ c =γ max = b Θ ideal Θ ideal = b γ max Maurizio Palesi 21

Bounds for max max is very hard to compute for the general case (arbitrary topology and arbitrary traffic pattern) For uniform traffic some upper and lower bounds can be computed with much less effort Maurizio Palesi 22

Lower Bound on max The load on the bisection channels gives a lower bound on γ max Let us assume uniform traffic On average, half of the traffic (N/2 packets) must cross the B C bisection channels The best throughput occurs when these packets are distributed evenly across the bisection channels Thus, the load on each bisection channel γ B is at least γ max γ B = N 2B C ½(N/2) ½(N/2) ½(N/2) ½(N/2) Maurizio Palesi 23

Upper Bound on ideal We found that Θ ideal = b γ max and γ max γ B = N 2B C Combining the above equations we have Θ ideal 2 bb C N = 2B B N Maurizio Palesi 24

Another Lower Bound on max The average number of channel traversals required to deliver a packet for a given traffic is H min N Let us assume the best case in which all channels are loaded equally The load on every channel in the network is γ c, LB =γ max,lb = H min N C Maurizio Palesi 25

Simple Upper Bound on max Let us suppose a routing function that balances load across all minimal paths equally E.g., If there are R xy paths from x to y, 1/ R xy is credited to each channel of each path The maximum load over a channel c is γ c,ub = 1 N x N y N { P xy 1 / R xy if c P 0 otherwise So, the maximum load max,ub is the largest c,up over all channels γ max,ub =max c C γ c, UB Maurizio Palesi 26

Example (1/3) Let us consider a eight node ring network 0 1 2 3 4 5 6 7 γ c,ub = 1 N x N y N { P xy 1 / R xy if c P 0 otherwise γ max,ub =max c C γ c, UB Maurizio Palesi 27

Example (2/3) Let us focus on channel (3,4) γ 3,4, UB = 1 N x N y N P xy { 1 / R xy if c P 0 otherwise 0 1 2 3 4 5 6 7 γ 3,4, UB = 1 8 [ 6 1 4 1 2 ] =1 Maurizio Palesi 28

Example (3/3) It is simple to observe that the load on all channels is the same max,ub = c,ub = 1 Now, let us compute the lower bound γ c, LB =γ max,lb = H min N C H min = 2 hops max,lb = 2 8/16 = 1 max,lb max max,ub max = 1 Maurizio Palesi 29

Latency The latency of a network is the time required for a packet to traverse the network From the time the head of the packet arrives at the input port to the time the tail of the packet departs the output port Maurizio Palesi 30

Components of the Latency We separate latency, T, into two components Head latency (T h ): time required for the head to traverse the network Serialization latency (T s ): time for a packet of length L to cross a channel with bandwidth b T =T h T s =T h L b Maurizio Palesi 31

Contributions Like throughput, latency depends on Routing Flow control Design of the router Topology Maurizio Palesi 32

Latency at Zero Load We consider latency at zero load, T 0 Latency when no contention occurs T h : sum of two factors determined by the topology Router delay (T r ): time spent in the routers Time of flight (T w ): time spent on the wires T h =T r T w =H min t r D min v T 0 =H min t r D min v L b Maurizio Palesi 33

Latency at Zero Load Topology Technology T 0 =H min t r D min v L b Router design Node degree Maurizio Palesi 34

Packet Propagation time Arrive at x Leave x t r x y z Arrive at y t xy Leave y Arrive at z t r t yz Routing delay Link latency Serialization latency L/b Maurizio Palesi 35

Case Study A good topology exploits characteristics of the available packaging technology to meet bandwidth and latency requirements of the application To maximize bandwidth a topology should saturate the bisection bandwidth Maurizio Palesi 36

Bandwidth Analysis (Torus) 03 13 23 33 02 12 22 32 01 11 21 31 00 10 20 30 Asume: 256 signals @ 1Gbits/s Bisection bandwidth 256 Gbits/s Maurizio Palesi 37

Bandwidth Analysis (Torus) 16 unidirectional channels cross the midpoint of the topology To saturate the bisection of 256 signals Each channel crossing the bisection should be 256/16 = 16 signals wide Constraints Each node packaged on a IC Limited number of I/O pins (e.g., 128) 8 channels per node 8x16=128 pins OK Maurizio Palesi 38

Bandwidth Analysis (Ring) 0 1 2 3 4 5 6 7 4 unidirectional channels cross the mid point of the topology To saturate the bisection of 256 signals Each channel crossing the bisection should be 256/4 = 64 signals wide Constraints 15 14 13 12 11 10 9 8 Each node packaged on a IC Limited number of I/O pins (e.g., 128) 4 channels per node 4x64=256 pins INVALID With identical technology constraints, the ring provides only half the bandwidth of the torus Maurizio Palesi 39

Delay Analysis The application requires only 16Gbits/s but also minimum latency The application uses long 4,096 bit packets Suppose random traffic Average hop count Torus = 2 Ring = 4 Channel size Torus = 16 bits Ring = 32 bits Maurizio Palesi 40

Delay Analysis Serialization latency (channel speed 1GHz) Torus = 4,096/16 * 1ns = 256 ns Ring = 4,096/32 * 1ns = 128 ns Latency assuming 20ns hop delay Torus = 256 + 20*2 = 296 ns Ring = 128 + 20*4 = 208 ns No one topology is optimal for all applications Different topologies are appropriate for different constraints and requirements Maurizio Palesi 41