Chapter 3 : Topology basics

1 Chapter 3: Topology basics
What is the network topology
Nomenclature
Traffic patterns
Performance
Packaging cost
Case study: the SGI Origin 2000

2 Network topology (1)
The topology is the static arrangement of channels and nodes in an interconnection network.
Topology selection is the first step in the design of a network.
It specifies both the type of network and the associated details.
Selecting a good topology means fitting the requirements to the available packaging technology.
The design depends on the number of ports and on their duty factor.
It also depends on the pins available per chip and board, the wire density, the signaling rate and the length of cables.
The choice is based on cost and performance.
Performance is evaluated in terms of throughput and latency.
Cost is determined by the number and complexity of the chips, as well as by the density and length of the interconnections used.

3 Network topology (2)
The choice cannot be based only on the data communication model of the problem.
Matching the network to the problem seems a good choice, but a special-purpose network is generally a bad idea:
The load is poorly balanced, because of dynamic load imbalance and/or a mismatch between problem size and machine size.
If data and threads are relocated to balance the load, the initial match is lost.
The available packaging often does not allow such networks to be implemented.
The network is inflexible: if the algorithm changes, the network cannot be modified accordingly.
Some examples are shown in Fig. 3.1.

4 Nomenclature (1)
Nodes and channels
$N^*$ is the set of nodes, $N \subseteq N^*$ the set of terminal nodes, and $C$ the set of channels.
A channel is $c = (x, y) \in C$, where $x, y \in N^*$.
$s_c$: source node; $d_c$: destination node.
$w_c$: channel width.
$f_c$: channel frequency.
$l_c$: physical length.
$t_c$: latency; in general $l_c = v t_c$, where $v$ is the propagation velocity.
$b_c$: channel bandwidth; it is $b_c = w_c f_c$.
Switch node $x$
Channel set: $C_x = C_{Ix} \cup C_{Ox}$.
Degree: $\delta_x = |C_x| = \delta_{Ix} + \delta_{Ox}$, the sum of the in degree and the out degree.
If the degree is the same for every node, it is indicated simply as $\delta$.

5 Nomenclature (2)
Direct and indirect networks
In a direct network, every node is both a terminal and a switch (Fig. 3.1a).
Packets are forwarded directly between terminal nodes.
The resources of a terminal are available to each switch.
In an indirect network, a node is either a terminal or a switch (Fig. 3.1b).
Packets are forwarded indirectly through dedicated switch nodes.
Every direct network can be redrawn as an indirect network by splitting each node into a terminal and a switch (Fig. 3.2).

6 Nomenclature (3)
Cuts
A cut $C(N_1, N_2)$ is a set of channels that partitions the set of all nodes into two disjoint sets $N_1$ and $N_2$.
Each channel of the cut connects a node in $N_1$ to a node in $N_2$.
The total bandwidth of the cut is $B(N_1, N_2) = \sum_{c \in C(N_1, N_2)} b_c$.
Bisections
A bisection is a cut that divides the entire network nearly in half.
The channel bisection is the minimum channel count over all bisections, $B_C = \min_{\text{bisections}} |C(N_1, N_2)|$ (Sec. 3.1.3).
The bisection bandwidth is the minimum bandwidth over all bisections, $B_B = \min_{\text{bisections}} B(N_1, N_2)$ (Sec. 3.1.3).
If the network has a uniform channel bandwidth $b$, then $B_B = b B_C$.

7 Nomenclature (4)
Paths
A path (or route) is an ordered set of channels $P$ in which the destination node of each channel is the source node of the next one.
If at least one path exists between every source-destination pair, the network is connected.
A minimal path from $x$ to $y$ is a path with the smallest hop count connecting the two nodes.
The set of all minimal paths from $x$ to $y$ is denoted $R_{xy}$.
The hop count of a minimal path is $H(x, y)$.
The diameter $H_{max}$ is the largest minimal hop count over all pairs.
It is bounded for a fully connected network (eq. 3.1).
The average minimum hop count $H_{min}$ is the average of $H(x, y)$ over all sources and destinations (Sec. 3.1.4).
The physical distance of a path is $D(P) = \sum_{c \in P} l_c$ and its delay is $t(P) = D(P) / v$.
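The hop-count definitions can be checked on a small network. The sketch below is only an illustration: the helper names (`ring_channels`, `hop_counts`, `diameter_and_average`) are hypothetical, the network is assumed to be a bidirectional ring with nodes numbered 0 to k-1, and the average is assumed to include the zero-hop pairs where source and destination coincide.

```python
from collections import deque

def ring_channels(k):
    """Unidirectional channels of a k-node bidirectional ring (assumed test network)."""
    return [(x, (x + 1) % k) for x in range(k)] + [(x, (x - 1) % k) for x in range(k)]

def hop_counts(num_nodes, channels, src):
    """Breadth-first search giving the minimal hop count H(src, y) for each reachable y."""
    out = {x: [] for x in range(num_nodes)}
    for s, d in channels:
        out[s].append(d)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        for y in out[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def diameter_and_average(num_nodes, channels):
    """Return (H_max, H_min) over all source-destination pairs, self pairs included."""
    all_h = []
    for s in range(num_nodes):
        dist = hop_counts(num_nodes, channels, s)
        if len(dist) < num_nodes:
            raise ValueError("network is not connected")
        all_h.extend(dist.values())
    return max(all_h), sum(all_h) / len(all_h)

h_max, h_min = diameter_and_average(8, ring_channels(8))
print(h_max, h_min)  # 4 2.0
```

For the eight-node ring this gives a diameter of 4 and an average minimum hop count of 2, the value used later in the ring throughput example.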

8 Nomenclature (5)
Symmetry
A network is vertex-symmetric if there exists an automorphism that maps any node $a$ into any other node $b$.
Basically, the topology looks the same from the point of view of every node.
This can simplify routing.
A network is edge-symmetric if there exists an automorphism that maps any channel $a$ into any other channel $b$.
Edge symmetry improves load balance.

9 Traffic patterns (1)
A traffic pattern is the spatial distribution of messages in the interconnection network.
Traffic matrix $\Lambda$: each element $\lambda_{s,d}$ gives the fraction of traffic sent from $s$ to $d$.
Common static traffic patterns are listed in Tab. 3.1.
Random traffic
Each source $s$ is equally likely to send to each destination.
It balances load even for topologies and routing algorithms with very poor load balance.
Permutation traffic
Each source $s$ sends all its traffic to a single destination.
Permutations stress the load balance of a topology and a routing algorithm.

10 Traffic patterns (2)
Bit permutations
The destination address is computed by permuting and selectively complementing the bits of the source address.
Digit permutations
The digits of the destination address are calculated from the digits of the source address.
They apply only to networks in which the terminal addresses can be expressed as n-digit numbers.
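A traffic matrix is easy to build explicitly for small networks. The following sketch is an illustration only: the function names are hypothetical, and bit complement is used as an assumed example of a bit permutation in which every address bit is complemented.

```python
def uniform_matrix(n):
    """Random (uniform) traffic: every source sends 1/n of its traffic to each destination."""
    return [[1.0 / n] * n for _ in range(n)]

def permutation_matrix(n, dest):
    """Permutation traffic: source s sends all of its traffic to dest(s)."""
    lam = [[0.0] * n for _ in range(n)]
    for s in range(n):
        lam[s][dest(s)] = 1.0
    return lam

def bit_complement(bits):
    """Assumed example of a bit permutation: complement every address bit."""
    mask = (1 << bits) - 1
    return lambda s: ~s & mask

uniform = uniform_matrix(16)
complement = permutation_matrix(16, bit_complement(4))
print(complement[5].index(1.0))  # 10, since 0101 maps to 1010
```

Every row of either matrix sums to 1, because each row describes how one source divides its traffic among the destinations.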

11 Performance and cost
A topology is selected on the basis of performance and cost.
Performance is evaluated in terms of:
Throughput and maximum channel load
Latency
Path diversity
The cost of a topology is the sum of all the constraints that derive from the packaging technology used.

12 Throughput
Throughput is the data rate, in bits per second, that the network accepts per input port.
It depends on routing and flow control as much as on the topology.
The ideal throughput is evaluated assuming perfect flow control and balanced routing.
The ideal throughput of a network on uniform traffic is often called its capacity.
Maximum throughput occurs when some channel becomes saturated.
Calculating the throughput therefore requires the channel load.

13 Channel load
The channel load is the ratio of the bandwidth demanded from channel $c$ to the bandwidth of the input ports.
The maximum channel load is the load of the channel that carries the largest fraction of the traffic for a specific traffic pattern.
When the offered traffic reaches the throughput of the network, the load on this channel equals the channel bandwidth.
Any additional traffic overloads the channel.
The ideal throughput of a topology is expressed in (eq. 3.2).
Maximum channel load and throughput can be computed by solving a multicommodity flow problem.
For uniform traffic, it is possible to calculate upper and lower bounds.

14 Throughput upper bound in a uniform traffic pattern
The load on the bisection channels gives a lower bound on the maximum channel load, and hence an upper bound on throughput.
For uniform traffic, on average half of the traffic ($N/2$ packets) must cross the $B_C$ bisection channels.
As a consequence, the load on each bisection channel is at least the value given in (eq. 3.3).
This gives an upper bound on the throughput (eq. 3.4).
For example, in a $k$-node ring $B_C = 4$ and the ideal throughput is bounded by $8b/k$.
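The bisection argument can be turned into a two-line calculation. This is a sketch under an assumption: the slides only cite (eq. 3.3) and (eq. 3.4), so the forms used here, a bisection load of N / (2 B_C) and a throughput bound of b divided by that load, are inferred from the N/2 argument and from the 8b/k ring result quoted above. The function name is hypothetical.

```python
def bisection_throughput_bound(n_terminals, channel_bisection, b):
    """Upper bound on ideal throughput under uniform traffic.

    Assumed forms: load on each bisection channel >= N / (2 * B_C),
    throughput <= b / load.
    """
    bisection_load = n_terminals / (2 * channel_bisection)
    return b / bisection_load

# k-node ring with B_C = 4: the bound reduces to 8b/k.
for k in (8, 16, 64):
    print(k, bisection_throughput_bound(k, 4, b=1.0), 8 * 1.0 / k)
```

For k = 8 the bound equals b, and for larger rings it shrinks in proportion to 1/k.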

15 Channel load bounds in a uniform traffic pattern
A lower bound on channel load can be computed as follows.
$H_{min} N$ gives the total channel demand for a given traffic pattern.
Dividing this demand by the number of channels $C$ bounds the load (eq. 3.5).
This lower bound can be complemented with a simple upper bound by considering a balanced routing function.
If a pair $(x, y)$ has $|R_{xy}|$ minimal paths, a share $1/|R_{xy}|$ of its traffic is loaded on each channel of each minimal path.
The maximum load is defined mathematically in (eq. 3.6).
For any topology, $\gamma_{max,LB} \le \gamma_{max} \le \gamma_{max,UB}$.
For an edge-symmetric topology, both bounds coincide with the maximum channel load.

16 Example of ideal throughput estimation in an eight-node ring network
The topology of the network is shown in Fig. 3.3.
The upper-bound approach is applied to channel (3,4). In Fig. 3.3:
Dotted lines represent paths that count as half, because the pair has two minimal paths.
There are 6 solid lines and 4 dotted lines, so the channel carries $(6 + 4 \cdot 0.5)/8 = 1$.
The maximum channel load is therefore equal to 1.
The lower bound gives the same result: $H_{min} N / C = 2 \cdot 8 / 16 = 1$.
In the general case, an optimal distribution that minimizes the channel load must be computed.
Computing the solution is beyond the scope of this book.
It is enough to describe the problem formulation.
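The ring example can be reproduced numerically. The sketch below is an illustration under stated assumptions: nodes are numbered 0 to k-1, each source sends 1/k of one unit of traffic to every destination, and when a pair has two minimal paths the traffic is split equally between them, which is the balanced routing described for the upper bound. The function names are hypothetical.

```python
def ring_channel_loads(k):
    """Load on every unidirectional channel of a k-node ring (k >= 3) under uniform traffic."""
    load = {}
    for x in range(k):
        load[(x, (x + 1) % k)] = 0.0
        load[(x, (x - 1) % k)] = 0.0
    for x in range(k):
        for y in range(k):
            cw, ccw = (y - x) % k, (x - y) % k
            if cw == 0:
                continue  # source equals destination: no channel is used
            if cw < ccw:
                routes = [(+1, cw)]
            elif ccw < cw:
                routes = [(-1, ccw)]
            else:
                routes = [(+1, cw), (-1, ccw)]  # two minimal paths, half each
            share = 1.0 / (k * len(routes))
            for step, hops in routes:
                node = x
                for _ in range(hops):
                    nxt = (node + step) % k
                    load[(node, nxt)] += share
                    node = nxt
    return load

def ring_lower_bound(k):
    """H_min * N / C with self pairs included in the average, as in the slide."""
    h_min = sum(min((y - x) % k, (x - y) % k) for x in range(k) for y in range(k)) / k**2
    return h_min * k / (2 * k)

loads = ring_channel_loads(8)
print(loads[(3, 4)], max(loads.values()), ring_lower_bound(8))  # 1.0 1.0 1.0
```

Both bounds agree on the ring, as expected for an edge-symmetric topology.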

17 Formulation of the mathematical problem
For each destination $d$, a vector $x_d$ defines the average distribution of packets over the channels.
A valid distribution is obtained by adding flow balance equations at each node.
The difference between the incoming and the outgoing flows at a node must equal the average amount that the node is sourcing (+) or sinking (-).
For a distribution under uniform traffic, all terminal nodes source $1/N$ units and the destination sinks 1 unit.
This is represented by the balance vector $f_d$ (eq. 3.7).
The topology is expressed with the matrix $A$ (eq. 3.8).
The optimization problem is written in (eq. 3.9).
By modifying (eq. 3.6) and (eq. 3.9), the problem can be generalized to an arbitrary traffic pattern.

18 Latency
Latency is the time required for a packet to traverse the network.
It can be divided into two components (eq. 3.10):
The head latency is the time required for the head of the message to traverse the network.
The serialization latency is the time required for the tail to catch up.
Latency depends on the topology, the routing, the flow control and also the design of the router.
We will focus on the contribution of the topology.

19 Dependency of the latency on the topology choice
When no contention occurs, head latency depends on two factors connected with the topology:
The router delay, that is, the time spent in the routers.
The flight delay, that is, the time spent on the wires.
The average router delay is $H_{min} t_r$, while the average flight delay is $D_{min}/v$.
The resulting expression for the zero-load latency is given in (eq. 3.11).
In case of contention, an additional term $T_c$ has to be added to account for the time spent waiting for resources.
$H_{min}$, $D_{min}$ and $b$ in (eq. 3.11) depend mostly on the topology, but also on the packaging.

20 Examples
First example: a packet propagating on a two-hop route from node x to node z, via node y (Fig. 3.4).
First row: each phit of the packet arriving at node x.
Second row: leaving x (routing delay $t_r$).
Third row: arriving at y (link latency $t_{xy}$).
Fourth row: leaving y (second routing delay $t_r$).
Fifth row: arriving at z (link latency $t_{yz}$).
The serialization latency $L/b$ must be added to this head latency.
Second example: a 64-node network with $H_{avg} = 4$ hops and 16-bit wide channels.
The frequency is $f_c = 1$ GHz, with $t_c = 5$ ns and $t_r = 8$ ns.
The total routing delay is 32 ns (8 * 4).
The total wire delay is 20 ns (5 * 4).
If $L = 64$ bytes and $b = 2$ Gbytes/s, the serialization delay is 32 ns.
The total latency is 84 ns.
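The arithmetic of the 64-node example can be written out as a small function. This is a sketch, not the book's formulation: the function and parameter names are hypothetical, and it assumes that every hop contributes one router delay and one link delay and that channel bandwidth is width times frequency.

```python
def zero_load_latency_ns(hops, t_r_ns, t_c_ns, packet_bits, width_bits, f_ghz):
    """Zero-load latency split into router, wire and serialization components (all in ns)."""
    router = hops * t_r_ns
    wire = hops * t_c_ns
    bits_per_ns = width_bits * f_ghz          # 1 Gbit/s equals 1 bit per ns
    serialization = packet_bits / bits_per_ns
    return router, wire, serialization, router + wire + serialization

router, wire, serialization, total = zero_load_latency_ns(
    hops=4, t_r_ns=8, t_c_ns=5, packet_bits=64 * 8, width_bits=16, f_ghz=1.0)
print(router, wire, serialization, total)  # 32 20 32.0 84.0
```

A 16-bit channel at 1 GHz carries 16 Gbit/s, that is 2 Gbytes/s, which matches the bandwidth used for the serialization delay in the example.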

21 Path diversity
A network with multiple routes between most pairs of nodes is more robust than a network with only a single route.
This property is called path diversity.
It improves both the balance of the channel load and the fault tolerance.
Path diversity can be illustrated by considering a network with arbitrary permutation traffic.
Arbitrary permutation traffic is more challenging than uniform traffic.
Without path diversity, traffic can be concentrated on a single bottleneck channel.
Path diversity also makes it possible to handle faults.
It is critical for large networks to tolerate faulty nodes or links.
One measure of network fault tolerance is the number of edge-disjoint or node-disjoint paths between two nodes.
However, if a fault affects all the neighbors of a node, there is no solution: the network is no longer connected.

22 Example
Bit permutation traffic: every node sends a packet to the destination whose address is a bit permutation of its own.
The destination sequence is {0,2,4,6,8,10,12,14,1,3,5,7,9,11,13,15}.
Behavior of a 2-ary 4-fly butterfly (Fig. 3.5):
All the packets from nodes 0, 1, 8 and 9 traverse channel (10,20).
The same situation occurs for the other nodes.
The channel load is equal to 4 and the throughput is 25% of capacity.
Behavior of a 4-ary 2-cube network (Fig. 3.6):
2 routes traverse no channel, 4 routes one channel, 4 routes two channels, 4 routes three channels and 2 routes four channels.
The one-hop channel is the bottleneck.
For this network the throughput is 50% of capacity.
If the 4 one-hop routes also use non-minimal paths, the traffic is spread uniformly.
The resulting throughput can then reach 89% of capacity.
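The destination sequence above is what a one-bit left rotation of the 4-bit source address produces, which can be checked directly. The sketch below also counts minimal hop counts on a 4-ary 2-cube for this pattern. Both the rotation interpretation and the address mapping (low two bits as the x coordinate, high two bits as the y coordinate) are assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter

def rotate_left(s, bits):
    """Rotate the bits-wide address s left by one position."""
    mask = (1 << bits) - 1
    return ((s << 1) | (s >> (bits - 1))) & mask

def torus_hops(a, b, k=4):
    """Minimal hop count between nodes a and b of a k-ary 2-cube (assumed address mapping)."""
    ax, ay = a % k, a // k
    bx, by = b % k, b // k
    dx = min((ax - bx) % k, (bx - ax) % k)
    dy = min((ay - by) % k, (by - ay) % k)
    return dx + dy

dests = [rotate_left(s, 4) for s in range(16)]
print(dests)  # [0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15]

hist = Counter(torus_hops(s, d) for s, d in enumerate(dests))
print(sorted(hist.items()))  # [(0, 2), (1, 4), (2, 4), (3, 4), (4, 2)]
```

Under these assumptions the 16 routes split into 2 with no hops, 4 each with one, two and three hops, and 2 with four hops.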

23 Packaging costs
When a network is built, the nodes of the topology are mapped to packaging modules (chips, boards, chassis).
Topology and packaging together impose constraints on channel bandwidth, which can be used to compare different topologies.
As an example we consider a two-level packaging hierarchy.
The channel width is indicated as $w$.
Two constraints are fixed: the number of pins per node $W_n$ and the amount of global wiring $W_s$.
We will also discuss how channel frequency is affected by the choice of topology and packaging.

24 Constraints on a two-level packaging hierarchy: channel width (1)
At the first level, individual routers are connected by local wiring.
Local wiring is inexpensive and abundant.
For an example, see Figure 3.7.
With an efficient local arrangement of nodes, the constraint on channel width depends only on the available number of pins.
In particular, $w \le W_n / \delta$.
The second level connects blocks via global wiring.
For an example, see Figure 3.8.
The number of available global wires bounds the width of individual channels.
It is a good idea to use the bisection to partition the nodes into local groups.
Using the minimum bisection, the constraint is expressed by $w \le W_s / B_C$.

25 Constraints on a two-level packaging hierarchy: channel width (2)
Combining the two expressions gives equation 3.14: $w \le \min(W_n / \delta, W_s / B_C)$.
Networks with low degree are constrained by the first term.
Generally they are node-pin limited.
Networks with high degree are constrained by the second term.
The constraint can also be expressed in terms of bandwidth.
Equation 3.14 can be rewritten as $b = f w \le f \min(W_n / \delta, W_s / B_C)$.

26 Constraints on a two-level packaging hierarchy: wire length
In addition to the width of the available wires, their length must be considered.
Wires should be kept short, because the achievable frequency falls quadratically with length beyond a critical length.
The critical length is related to the maximum frequency-dependent attenuation tolerated by the system.
Table 3.2 shows the critical length of common types of wires at a 2 GHz rate.
The length can be increased by inserting repeaters.
However, a repeater costs about as much as a switch.
It is therefore advisable to keep channels within the critical length and to insert switches on the longest routes.
It is impractical to build electrical networks using topologies that require long channels.
Optical signaling is more convenient for long channels, but more expensive.

27 Example
Comparison between two six-node networks (Fig. 3.9):
A simple ring with degree 4 and $B_C = 4$.
A Cayley graph with degree 6 and $B_C = 10$.
The maximum number of pins per node is 140 and the global wiring is 200 signals wide.
Applying the previous equation gives equation 3.15 for the first network and equation 3.16 for the second one.
The results for a signal frequency of 1 GHz are in Table 3.3.
The Cayley graph has better throughput, but the ring has lower zero-load latency.
The Cayley graph takes advantage of its larger bisection width, but its higher degree limits the width of an individual channel, increasing the serialization latency.
This result is counterintuitive, given that the Cayley graph has a lower hop count.
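The channel widths in this comparison follow from the two packaging constraints. The sketch below assumes the constraint has the form w <= min(W_n / degree, W_s / B_C), the combination of the pin limit and the global wiring limit described earlier, and it assumes that a channel uses a whole number of wires, so the result is rounded down. The function name is hypothetical, and the bandwidth figures simply multiply the width by the 1 GHz signal frequency.

```python
def channel_width(w_n, w_s, degree, channel_bisection):
    """Widest channel allowed by the pin limit and by the global wiring limit."""
    pin_limit = w_n // degree                 # node pins shared by all channels of a node
    wiring_limit = w_s // channel_bisection   # global wires shared by the bisection channels
    return min(pin_limit, wiring_limit)

F_GHZ = 1  # signal frequency from the example

ring_w = channel_width(w_n=140, w_s=200, degree=4, channel_bisection=4)
cayley_w = channel_width(w_n=140, w_s=200, degree=6, channel_bisection=10)
print(ring_w, ring_w * F_GHZ)      # 35 bits wide, 35 Gbit/s
print(cayley_w, cayley_w * F_GHZ)  # 20 bits wide, 20 Gbit/s
```

The narrower channel of the Cayley graph is what raises its serialization latency for a given packet length.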

28 Case study: SGI Origin 2000 (1)
The system supports up to 512 nodes with 2 MIPS R10000 processors each.
Its network is based on the SGI SPIDER routing chip, which has 6 bidirectional network channels.
Each channel is 20 bits wide and operates at 400 MHz.
The channel bandwidth is 6.4 Gbits/s and the total node bandwidth is 38.4 Gbits/s.
The channels can be driven across a backplane, and three of them can drive up to 5 meters of cable.
Figure 3.11 illustrates how the topology changes as the number of nodes increases.
Processing nodes are connected to a router using 2 of its 6 available channels, leaving four channels for connections to other routers.
Systems with up to 16 routers are configured as binary n-cubes.
If there are unused channels, they can be connected across the machine to reduce the network diameter.

29 Case study: SGI Origin 2000 (2)
Figure 3.12 shows the topology used when there are more than 16 routers.
It is a hierarchical approach:
Local subnetworks of 8 routers each are configured as binary 3-cubes.
Eight router-only global subnetworks are used to connect the local ones together.
A maximal 256-router configuration uses eight 32-node binary 5-cubes for the global interconnections.
The Origin 2000 is packaged in a hierarchy of boards, modules and racks, as shown in Figure 3.13.
Each node is packaged on a single board, and each router is packaged on a separate board.
4 node boards and 2 router boards are packaged in a chassis and connected by a midplane.
2 chassis are placed in each cabinet.
The maximum number of cabinets for a system is 64 (256 routers).

30 Case study: SGI Origin 2000 (3)
Table 3.4 shows the performance of the Origin 2000 as a function of the number of nodes.
Zero-load latency grows with average hop count and distance.
It depends on the diameter and on the serialization latency.
Serialization latency is fixed by the 20-bit channel width.
To keep latency low, the Origin 2000 uses a topology in which diameter and hop count increase logarithmically with machine size.
The hierarchical topology preserves the logarithmic growth of diameter and hop count in configurations with more than 16 routers.
The Origin topology provides a flat bisection bandwidth per node.
For small machines, the bisection cuts $N$ channels, where $N = 2^n$ is the number of routers.
For large machines, each node has a channel to a global subnet, and each global subnet has a bisection bandwidth equal to the input bandwidth.