Interconnection Networks



1 Interconnection Networks

[Figure: end nodes connected through an interconnection network. Internetworking: interconnection of multiple networks.]

2 Interconnection Networks: Introduction
- How to connect individual devices together into a group of communicating devices?
- Device:
  - Component within a computer
  - Single computer
  - System of computers
- Types of elements:
  - End nodes (device + interface)
  - Links
  - Interconnection network
- Internetworking: interconnection of multiple networks
[Figure: four end nodes, each a device with a SW interface and a HW interface, attached by links to the interconnection network]

3 Interconnection Networks: Introduction
- Interconnection networks should be designed to transfer the maximum amount of information in the least amount of time (within cost and power constraints) so as not to bottleneck the system

4 Types of Interconnection Networks
- Four different domains
  - Depending on number and proximity of connected devices
- On-chip networks (OCNs or NoCs)
  - Devices are microarchitectural elements (functional units, register files), caches, directories, processors
  - Latest systems: dozens, hundreds of devices
    - Ex: Intel TeraFLOPS research prototype, 80 cores
    - Xeon Phi, 60 cores
  - Proximity: millimeters

5 System/Storage Area Networks (SANs)
- Multiprocessor and multicomputer systems
  - Interprocessor and processor-memory interconnections
- Server and data center environments
  - Storage and I/O components
- Hundreds to thousands of devices interconnected
  - IBM Blue Gene/L supercomputer (64K nodes, each with 2 processors)
- Maximum interconnect distance
  - Tens of meters (typical)
  - A few hundred meters (some)
    - InfiniBand: 120 Gbps over a distance of 300 m
- Examples (standards and proprietary)
  - InfiniBand, Myrinet, Quadrics, Advanced Switching Interconnect

6 Local Area Networks (LANs)
- Interconnect autonomous computer systems
- Machine room or throughout a building or campus
- Hundreds of devices interconnected (1,000s with bridging)
- Maximum interconnect distance
  - A few kilometers
  - A few tens of kilometers (some)
- Example (most popular): Ethernet, with 10 Gbps over 40 km

7 Wide Area Networks (WANs)
- Interconnect systems distributed across the globe
- Internetworking support is required
- Many millions of devices interconnected
- Maximum interconnect distance
  - Many thousands of kilometers
- Example: ATM (asynchronous transfer mode)

8 Interconnection Network Domains
[Figure: distance (meters) versus number of devices interconnected, showing the regions occupied by OCNs, SANs, LANs, and WANs]

9 Focus: On-Chip Networks

10 On-Chip Networks (OCNs or NoCs)
- Why an on-chip network?
  - Ad hoc wiring does not scale beyond a small number of cores
    - Prohibitive area
    - Long latency
- An OCN offers
  - Scalability
  - Efficient multiplexing of communication
  - Often modular in nature (eases verification)

11 Differences between on-chip and off-chip networks
- Significant research in multi-chassis interconnection networks (off-chip)
  - Supercomputers
  - Clusters of workstations
  - Internet routers
- Leverage research and insight, but
  - Constraints are different

12 Off-chip vs. on-chip
- Off-chip: I/O bottlenecks
  - Pin-limited bandwidth
  - Inherent overheads of off-chip I/O transmission
- On-chip
  - Wiring constraints
    - Metal layer limitations
    - Horizontal and vertical layout
    - Short, fixed length
    - Repeater insertion limits routing of wires
      - Avoid routing over dense logic
      - Impacts wiring density
  - Power
    - Consumes 10-15% or more of die power budget
  - Latency
    - Different order of magnitude
    - Routers consume a significant fraction of latency

13 On-Chip Network Evolution
- Ad hoc wiring
  - Small number of nodes
- Buses and crossbars
  - Simplest variant of on-chip networks
  - Low core counts
  - Like traditional multiprocessors
    - Bus traffic quickly saturates with a modest number of cores
  - Crossbars: higher bandwidth
    - Poor area and power scaling

14 Multicore Examples (1): XBAR
- Sun Niagara
  - Niagara 2: 8x9 crossbar (area ~= core)
  - Rock: hierarchical crossbar (5x5 crossbar connecting clusters of 4 cores)

15 Multicore Examples (2): RING
- IBM Cell Element Interconnect Bus
  - 12 elements
  - 4 unidirectional rings
    - 16 bytes wide
    - Operates at 1.6 GHz

16 Many-Core Example: 2D MESH
- Intel TeraFLOPS
  - 80-core prototype
  - 5 GHz
  - Each tile:
    - Processing engine + on-chip network router

17 Many-Core Example (2): Intel SCC
- Intel's Single-chip Cloud Computer (SCC) uses a 2D mesh with state-of-the-art routers

18 Performance and Cost
- Performance: latency and throughput
- Cost: area and power
[Figure: latency (sec) versus offered traffic (bits/sec), marking the zero-load latency and the saturation throughput]

19 Topics to be covered
- Interfaces
- Topology
- Routing
- Flow Control
- Router Microarchitecture

20 System Interfaces

21 Systems and Interfaces
- Look at how systems interact and interface with the network
- Types of multiprocessors
  - Shared memory
    - From high-end servers to embedded products
  - Message passing
    - Multiprocessor System on Chip (MPSoC)
      - Mobile consumer market
    - Clusters
- We focus on on-chip networks for shared-memory multi-core

22 Shared Memory CMP Architecture
[Figure: tile containing a core, L1 I/D cache, L2 cache (tags, data, controller logic), and a router]
- L2: private or distributed shared cache
- Centralized shared cache will have a different organization
- A tile could be a core or an L2 bank

23 Impact of Coherence Protocol on Network Performance
- Coherence protocol shapes the communication needed by the system
- Single-writer, multiple-reader invariant
- Requires:
  - Data requests
  - Data responses
  - Coherence permissions

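24 Broadcast vs. Directory
- Broadcast
  - 1: Read, cache miss
  - 2: Request broadcast
  - 3: Send data
- Directory
  - 1: Read, cache miss
  - 2: Directory receives request
  - 3: Send data
[Figure: the two request flows on a tiled chip, with the memory controller and the directory marked]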

25 Coherence Protocol Requirements
- Different message types
  - Unicast, multicast, broadcast
- Directory protocol
  - Majority of requests: unicast
    - Lower bandwidth demands on network
  - More scalable due to point-to-point communication
- Broadcast protocol
  - Majority of requests: broadcast
    - Higher bandwidth demands
  - Often rely on network ordering

26 Protocol-Level Deadlock
[Figure: network end node with a request queue and a reply queue between the interconnection network and the memory/cache controller]
- Request-reply dependency
  - Network becomes flooded with requests that cannot be consumed until the network interface has generated a reply
- Deadlock dependency between multiple message classes
- Virtual channels can prevent protocol-level deadlock (to be discussed later)

27 Home Node/Memory Controller Issues
- Heterogeneity in network
  - Some tiles are memory controllers
    - Co-located with processor/cache or separate tile
    - Share injection/ejection bandwidth?
- Home node
  - Directory coherence information
  - <= number of tiles
- Potential hot spots in network?

28 Network Interface

29 Network Interface: Miss Status Handling Registers
[Figure: network interface between the core/cache and the network. The cache exchanges requests and replies (Type, Addr, Data) with a cache protocol finite state machine backed by MSHRs (Status, Addr, Data). Message format and send, to network: RdReq (Dest, Addr), Writeback (Dest, Addr, Data). Message receive, from network: RdReply (Addr, Data), WriteAck (Addr).]

30 Transaction Status Handling Registers (for centralized directory)
[Figure: directory-side network interface. Message receive, from network: RdReq (Src, Addr), Writeback (Src, Addr, Data). Message format and send, to network: RdReply (Dest, Addr, Data), WriteAck (Dest, Addr). Directory cache with TSHRs (Status, Src, Addr, Data), memory controller, and off-chip memory.]

31 MPSoCs

32 Synthesized NoCs for MPSoCs
- System-on-Chip (SoC)
  - Chips tailored to specific applications or domains
  - Designed quickly through composition of IP blocks
- Fundamental NoC concepts applicable to both CMP and MPSoC
- Key characteristics
  - Applications known a priori
  - Automated design process
  - Standardized interfaces
  - Area/power constraints tighter

33 Application Characterization
- Describe application with task graphs
- Annotate with traffic volumes
[Figure: task graph with nodes vld, run length decode, inverse scan, AC/DC prediction, iquant, idct, up samp, vop reconstruction, padding, vop memory, stripe memory, and ARM; edges are annotated with traffic volumes such as 70, 362, 353, and 16]

34 Design Requirements
- Less aggressive
  - CMPs: GHz clock frequencies
  - MPSoCs: MHz clock frequencies
  - Pipelining may not be necessary
  - Standardized interfaces add significant delay
- Area and power
  - CMPs: 100 W for server
  - MPSoCs: several watts only
- Time to market
  - Automatic composition and generation

35 NoC Synthesis
[Figure: NoC synthesis flow. Labels include: application; input traffic model; comm graph; constraint graph; user objectives (power, hop delay); constraints (area, power, hop delay, wire length); topology synthesis (includes floorplanner and NoC router), labeled SunFloor; NoC area models; NoC power models; NoC component library; IP core models; system specs; platform generation (xpipes-Compiler); SystemC code; RTL; architectural simulation; codesign simulation; FPGA emulation; floorplanning specifications; synthesis; placement and routing; to fab; area and power characterization.]

36 NoC Synthesis
- Tool chain
  - Requires accurate power and area models
  - Quickly iterate through many designs
  - Library of soft macros for all NoC building blocks
  - Floorplanner
    - Determine router locations
    - Determine link lengths (delay)

37 NoC Network Interface Standards
- Standardized protocols
  - Plug and play with different IP blocks
- Bus-based semantics
  - Widely used
- Out-of-order transactions
  - Relax strict bus ordering semantics
  - Migrating MPSoCs from buses to NoCs

38 Summary
- Architecture
  - Impacts communication requirements
  - Coherence protocol: broadcast vs. directory
  - Shared vs. private caches
- CMP vs. MPSoC
  - General vs. application specific
  - Custom interfaces vs. standardized interfaces

39 Topics to be covered
- Interfaces
- Topology
- Routing
- Flow Control
- Router Microarchitecture

40 Types of Topologies

41 Types of Topologies
- Focus on switched topologies
  - Alternatives: bus and crossbar
  - Bus
    - Connects a set of components to a single shared channel
    - Effective broadcast medium
  - Crossbar
    - Directly connects n inputs to m outputs without intermediate stages
    - Fully connected, single-hop network
    - Component of routers

42 Types of Topologies
- Direct
  - Each router is associated with a terminal node
  - All routers are sources and destinations of traffic
- Indirect
  - Routers are distinct from terminal nodes
  - Terminal nodes can source/sink traffic
  - Intermediate nodes switch traffic between terminal nodes
- Most on-chip networks use direct topologies

43 Torus (1)
- k-ary n-cube: k^n network nodes
- n-dimensional grid with k nodes in each dimension
[Figure labels: 3-ary 2-mesh, 2-cube, 2,3,4-ary 3-mesh]

44 Torus (2)
- 1D or 2D torus maps well to a planar substrate for on-chip
- Topologies in torus family
  - Ex: ring is a k-ary 1-cube
- Edge symmetric
  - Good for load balancing
  - Removing wrap-around links for mesh loses edge symmetry
    - More traffic concentrated on center channels
- Good path diversity
- Exploits locality for near-neighbor traffic

45 Torus (3)
- Degree = 2n, 2 channels per dimension
  - All nodes have the same degree
- Total channels = 2nN
  - N is the total number of nodes
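The counts on the torus slides can be checked with a small sketch. The function names `torus_properties` and `torus_neighbors` are illustrative and not from the slides; channels are counted as unidirectional, and k > 2 is assumed so that the two neighbors in each dimension are distinct.

```python
def torus_properties(k: int, n: int) -> dict:
    """Node count, node degree and channel count of a k-ary n-cube (torus)."""
    nodes = k ** n             # k nodes in each of n dimensions
    degree = 2 * n             # 2 channels per dimension
    channels = degree * nodes  # total channels = 2nN
    return {"nodes": nodes, "degree": degree, "channels": channels}


def torus_neighbors(node: tuple, k: int) -> list:
    """Neighbors of `node` (a tuple of n coordinates), with wrap-around links."""
    result = []
    for dim in range(len(node)):
        for step in (-1, 1):
            coords = list(node)
            coords[dim] = (coords[dim] + step) % k  # modulo gives the wrap-around
            result.append(tuple(coords))
    return result
```

For a 4-ary 2-cube, `torus_properties(4, 2)` gives 16 nodes of degree 4, and corner node (0, 0) still has four neighbors because of the wrap-around links.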

46 Mesh
- A torus with end-around connections removed
- Same node degree
- Higher demand for central channels
  - Load imbalance

47 Butterfly
- Indirect network
- k-ary n-fly: k^n network nodes
- Routing from 000 to 010
  - Destination address used to directly route packet
  - Bit n used to select output port at stage n
[Figure: 2-ary 3-fly, 2-input switches, 3 stages]
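Destination-tag routing in a butterfly can be sketched as follows. This is a minimal illustration, assuming the most significant digit of the destination address is consumed at the first stage; the function name `destination_tag_ports` is hypothetical. Note that the source address plays no part in the decision.

```python
def destination_tag_ports(dest: int, k: int, n: int) -> list:
    """Output port selected at each of the n stages of a k-ary n-fly.

    The destination address is read as n base-k digits; stage i uses digit i.
    """
    digits = []
    for _ in range(n):
        digits.append(dest % k)  # peel off the least significant digit
        dest //= k
    return digits[::-1]          # assumed order: most significant digit first
```

For the slide's example of routing to 010 in a 2-ary 3-fly, `destination_tag_ports(0b010, 2, 3)` selects port 0, then port 1, then port 0.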

48 Butterfly (2)
- No path diversity: |R_xy| = 1
  - Can add extra stages for diversity
    - Increases network diameter
[Figure: butterfly with additional stages]

49 Butterfly (3)
- Hop count
  - log_k N + 1
  - Does not exploit locality
    - Hop count same regardless of location
- Switch degree = 2k
- Requires long wires to implement

50 Clos Network
- 3-stage network where all input/output nodes are connected to all middle routers
- Key attribute: path diversity
  - Input node can select any middle router
  - Can enable non-blocking routing algorithms
[Figure: (5,3,4) Clos network]

51 Fat Tree
- Fat tree: bandwidth remains constant at each level
- Regular tree: bandwidth decreases closer to root

52 Fat Tree (2)
- Provides path diversity

53 Irregular Topologies

54 Irregular Topologies
- MPSoC design leverages a wide variety of IP blocks
  - Regular topologies may not be appropriate given heterogeneity
  - Customized topology
    - Often more power efficient and delivers better performance
- Customize based on traffic characterization

55 Irregular Topology Example
[Figure: two alternative networks connecting the same blocks (VLD, run length decoder, inverse scan, AC/DC predict, iquant, idct, up samp, VOP reconstr, ARM core, VOP memory, stripe memory, padding) through routers (R)]

56 Topology Customization
- Merging
  - Start with a large number of switches
  - Merge adjacent routers to reduce area and power
- Splitting
  - Start with a large crossbar connecting all nodes
  - Iteratively split into multiple small switches
    - Accommodate design constraints

57 Topology Implementation

58 Implementation
- Folding
  - Equalize path lengths
    - Reduces max link length
    - Increases length of other links

59 Concentration
- Don't need a 1:1 ratio of routers to cores
  - Ex: 4 cores concentrated to 1 router
- Can save area and power
- Increases network complexity
  - Concentrator must implement a policy for sharing injection bandwidth
  - During bursty communication
    - Can bottleneck

60 Implication of Abstract Metrics on Implementation
- Degree: useful proxy for router complexity
  - Increasing ports requires additional buffer queues, requestors to allocators, ports to crossbar
  - All contribute to critical path delay, area and power
  - Link complexity does not correlate with degree
    - Link complexity depends on link width
    - For a fixed number of wires, link complexity for 2-port vs. 3-port is the same


61 Implications (2)
- Hop count: useful proxy for overall latency and power
  - Does not always correlate with latency
    - Depends heavily on router pipeline and link propagation
  - Example: hop count says A is better than B
    - Network A with 2 hops, 5-stage pipeline, 4-cycle link traversal vs.
    - Network B with 3 hops, 1-stage pipeline, 1-cycle link traversal
    - But A has 18-cycle latency vs. 6-cycle latency for B
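The arithmetic behind this example can be reproduced with a short sketch. It assumes each hop costs one full router pipeline traversal plus one link traversal, and it ignores serialization and contention; the function name `zero_load_latency` is illustrative.

```python
def zero_load_latency(hops: int, pipeline_stages: int, link_cycles: int) -> int:
    """Cycles to cross `hops` hops, each costing one router plus one link traversal."""
    return hops * (pipeline_stages + link_cycles)


latency_a = zero_load_latency(hops=2, pipeline_stages=5, link_cycles=4)
latency_b = zero_load_latency(hops=3, pipeline_stages=1, link_cycles=1)
```

Under these assumptions network A takes 18 cycles and network B takes 6, even though A has the lower hop count.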

62 Topology Summary
- First network design decision
- Critical impact on network latency and throughput
  - Hop count provides a first-order approximation of message latency
  - Bottleneck channels determine saturation throughput

63 Routing

64 Routing Overview
- Discussion of topologies assumed ideal routing
- In practice
  - Routing algorithms are not ideal
- Goal: distribute traffic evenly among paths
  - Avoid hot spots, contention
  - More balanced -> throughput is closer to ideal
- Keep complexity in mind

65 Routing Basics
- Once the topology is fixed
- Routing algorithm determines path(s) from source to destination

66 Routing Algorithm Attributes
- Types
  - Deterministic, oblivious, adaptive
- Number of destinations
  - Unicast, multicast, broadcast?
- Adaptivity
  - Oblivious or adaptive? Local or global knowledge?
  - Minimal or non-minimal?
- Implementation
  - Source or node routing?
  - Table or circuit?

67 Routing Deadlock
[Figure: packets A, B, C, and D forming a cycle of links]
- Each packet is occupying a link and waiting for a link
- Without routing restrictions, a resource cycle can occur
  - Leads to deadlock

68 Types of Routing Algorithms

69 Deterministic
- All messages from source to destination traverse the same path
- Common example: Dimension Order Routing (DOR)
  - Message traverses network dimension by dimension
  - Aka XY routing
- Cons:
  - Eliminates any path diversity provided by topology
  - Poor load balancing
- Pros:
  - Simple and inexpensive to implement
  - Deadlock-free

70 Dimension Order Routing
- A.k.a. X-Y routing
  - Traverse network dimension by dimension
  - Can only turn to the Y dimension after finishing X
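A minimal sketch of X-Y routing on a 2D mesh follows. The coordinate convention (E is +x, N is +y) and the function name `xy_route` are assumptions made for illustration.

```python
def xy_route(src: tuple, dst: tuple) -> list:
    """Hops ('E', 'W', 'N', 'S') from src to dst under dimension order routing."""
    (sx, sy), (dx, dy) = src, dst
    x_dir = "E" if dx > sx else "W"
    y_dir = "N" if dy > sy else "S"
    # Finish all X hops first; only then turn into the Y dimension.
    return [x_dir] * abs(dx - sx) + [y_dir] * abs(dy - sy)
```

Every packet between a given source and destination gets the same hop list, which is why the scheme is deterministic and offers no path diversity.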

71 Oblivious
- Routing decisions are made without regard to network state
  - Keeps algorithms simple
  - Unable to adapt
- Deterministic algorithms are a subset of oblivious

72 Valiant's Routing Algorithm
- To route from s to d
  - Randomly choose intermediate node d'
  - Route from s to d' and from d' to d
- Randomizes any traffic pattern
  - All patterns appear uniform random
  - Balances network load
- Non-minimal
- Destroys locality
[Figure: route from s to d through intermediate node d']
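The two-phase idea can be sketched as below. This assumes a k x k mesh and X-Y routing in each phase; the slide does not fix either choice, and the names `valiant_route`, `xy_route`, and `follow` are illustrative.

```python
import random

STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_route(src, dst):
    """Dimension order route: all X hops, then all Y hops."""
    (sx, sy), (dx, dy) = src, dst
    x_dir = "E" if dx > sx else "W"
    y_dir = "N" if dy > sy else "S"
    return [x_dir] * abs(dx - sx) + [y_dir] * abs(dy - sy)


def valiant_route(src, dst, k, rng=random):
    """Pick a random intermediate node d', then route s -> d' -> d."""
    mid = (rng.randrange(k), rng.randrange(k))
    return mid, xy_route(src, mid) + xy_route(mid, dst)


def follow(src, hops):
    """Apply a hop list to a starting node and return the node reached."""
    x, y = src
    for hop in hops:
        dx, dy = STEP[hop]
        x, y = x + dx, y + dy
    return (x, y)
```

Because d' is chosen anywhere in the network, the resulting path is usually longer than the minimal one, which is the locality cost noted on the slide.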

73 Minimal Oblivious
- Valiant's: load balancing but significant increase in hop count
- Minimal oblivious: some load balancing, but uses shortest paths
  - d' must lie within the minimal quadrant
  - 6 options for d'
  - Only 3 different paths
[Figure: minimal quadrant between s and d]
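The "6 options, 3 paths" observation can be reproduced with a sketch. The source and destination below are an assumption chosen so that the minimal quadrant has 6 nodes (the slide's figure is not recoverable), and X-Y routing is assumed in both phases; the function names are illustrative.

```python
def xy_route(src, dst):
    """Dimension order route: all X hops, then all Y hops."""
    (sx, sy), (dx, dy) = src, dst
    x_dir = "E" if dx > sx else "W"
    y_dir = "N" if dy > sy else "S"
    return [x_dir] * abs(dx - sx) + [y_dir] * abs(dy - sy)


def minimal_quadrant(src, dst):
    """All nodes inside the bounding box spanned by src and dst."""
    (sx, sy), (dx, dy) = src, dst
    xs = range(min(sx, dx), max(sx, dx) + 1)
    ys = range(min(sy, dy), max(sy, dy) + 1)
    return [(x, y) for x in xs for y in ys]


def minimal_oblivious_paths(src, dst):
    """Distinct paths obtained over every choice of intermediate node d'."""
    return {
        "".join(xy_route(src, mid) + xy_route(mid, dst))
        for mid in minimal_quadrant(src, dst)
    }


paths = minimal_oblivious_paths((0, 0), (2, 1))
```

With s = (0, 0) and d = (2, 1), six intermediate nodes collapse onto three distinct shortest paths, since several choices of d' lie on the same X-Y route.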

74 Oblivious Routing
- Valiant's and minimal oblivious
  - Deadlock free
    - When used in conjunction with X-Y routing
- Randomly choose between X-Y and Y-X routes
  - Oblivious but not deadlock free!

75 Adaptive
- Exploits path diversity
- Uses network state to make routing decisions
  - Buffer occupancies often used
  - Coupled with flow control mechanism
- Local information readily available
  - Global information more costly to obtain
  - Network state can change rapidly
  - Use of local information can lead to non-optimal choices
- Can be minimal or non-minimal

76 Minimal Adaptive Routing
[Figure: minimal adaptive route from s to d]
- Local info can result in sub-optimal choices

77 Non-minimal Adaptive
- Fully adaptive
- Not restricted to taking the shortest path
- Misrouting: directing a packet along a non-productive channel
  - Priority given to productive output
  - Some algorithms forbid U-turns
- Livelock potential: traversing the network without ever reaching the destination
  - Mechanism to guarantee forward progress
    - Limit number of misroutings

78 Non-minimal Routing Example
[Figure: two routes from s to d]
- Longer path with potentially lower latency
- Livelock: continue routing in a cycle

79 Adaptive Routing Example
- Should 3 route clockwise or counterclockwise to 7?
  - 5 is using all the capacity of link 5 -> 6
- Queue at node 5 will sense contention, but not the queue at node 3
- Backpressure: allows nodes to indirectly sense congestion
  - When the queue in one node fills up, it will stop receiving flits
  - Previous queue will fill up
- If each queue holds 4 packets
  - 3 will send 8 packets before sensing congestion

80 Adaptive Routing: Turn Model
- DOR eliminates 4 turns
  - N to E, N to W, S to E, S to W
  - No adaptivity
- Some adaptivity by removing 2 of 8 turns
  - Remains deadlock free (like DOR)
- West first
  - Eliminates S to W and N to W

81 Turn Model Routing
- Negative first
  - Eliminates E to S and N to W
- North last
  - Eliminates N to E and N to W
- Odd-Even
  - Eliminates 2 turns depending on whether the current node is in an odd or even column
    - Even column: E to N and N to W
    - Odd column: E to S and S to W
  - Deadlock free (disallow 180-degree turns)
  - Better adaptivity
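The prohibited-turn sets listed for west-first, negative-first, and north-last can be written down directly and used to check whether a path is legal. This is a sketch under the assumption that a path is given as a string of hop directions and that a turn "A to B" means a hop in direction A followed by a hop in direction B; the names `PROHIBITED_TURNS` and `is_legal` are illustrative, and 180-degree turns are not checked.

```python
# Turns eliminated by each turn-model variant, as (incoming, outgoing) directions.
PROHIBITED_TURNS = {
    "west-first": {("S", "W"), ("N", "W")},
    "negative-first": {("E", "S"), ("N", "W")},
    "north-last": {("N", "E"), ("N", "W")},
}


def is_legal(path: str, model: str) -> bool:
    """True if no consecutive pair of hops in `path` is a prohibited turn."""
    banned = PROHIBITED_TURNS[model]
    return all((a, b) not in banned for a, b in zip(path, path[1:]))
```

For example, `is_legal("WWN", "west-first")` holds because all westward hops come first, while `is_legal("NW", "west-first")` does not.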

82 Negative-First Routing Example
[Figure: mesh with nodes (0,0), (2,0), (0,3), and (2,3) marked]
- Limited or no adaptivity for certain source-destination pairs

83 Turn Model Routing Deadlock
- What about eliminating turns NW and WN?
- Not a valid turn elimination
  - Resource cycle results

84 Adaptive Routing and Deadlock
- Option 1: Eliminate turns that lead to deadlock
  - Limits flexibility
- Option 2: Allow all turns
  - Gives more flexibility
  - Must use another mechanism to prevent deadlock
  - Rely on flow control (later)
    - Escape virtual channels

85 Routing Algorithm Implementation

86 Routing Implementation
- Source tables
  - Entire route specified at source
  - Avoids per-hop routing latency
  - Unable to adapt dynamically to network conditions
  - Can specify multiple routes per destination
    - Gives fault tolerance and load balance
  - Support reconfiguration (not specific to topology)

87 Source Table Routing
Routes from source node (0,0):

Destination  Route 1  Route 2
00           X        X
10           EX       EX
20           EEX      EEX
01           NX       NX
11           NEX      ENX
21           NEEX     ENEX
02           NNX      NNX
12           ENNX     NNEX
22           EENNX    NNEEX
03           NNNX     NNNX
13           NENNX    ENNNX
23           EENNNX   NNNEEX

- Arbitrary length paths: storage overhead and packet overhead
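The source table can be exercised with a short sketch that walks each stored route and confirms it ends at the listed destination. The interpretation here is an assumption: destinations are read as "xy" coordinates, E is +x, N is +y, and X marks ejection at the destination. The names `SOURCE_TABLE` and `follow_route` are illustrative.

```python
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

# Routes from source node (0,0), copied from the slide's table.
SOURCE_TABLE = {
    (0, 0): ["X", "X"],
    (1, 0): ["EX", "EX"],
    (2, 0): ["EEX", "EEX"],
    (0, 1): ["NX", "NX"],
    (1, 1): ["NEX", "ENX"],
    (2, 1): ["NEEX", "ENEX"],
    (0, 2): ["NNX", "NNX"],
    (1, 2): ["ENNX", "NNEX"],
    (2, 2): ["EENNX", "NNEEX"],
    (0, 3): ["NNNX", "NNNX"],
    (1, 3): ["NENNX", "ENNNX"],
    (2, 3): ["EENNNX", "NNNEEX"],
}


def follow_route(src, route):
    """Consume one route character per hop until the exit marker X."""
    x, y = src
    for hop in route:
        if hop == "X":  # packet ejects at this node
            break
        dx, dy = STEP[hop]
        x, y = x + dx, y + dy
    return (x, y)
```

Each route string travels in the packet header, so its length grows with the hop count; that is the packet overhead mentioned on the slide.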

88 Node Tables
- Store only the next direction at each node
- Smaller tables than source routing
- Adds per-hop routing latency
- Can adapt to network conditions
  - Specify multiple possible outputs per destination
  - Select randomly to improve load balancing

89 Node Table Routing
- Implements west-first routing
- Each node would have 1 row of the table
  - Max two possible output ports

From \ To  00   01   02   10   11   12   20   21   22
00         X -  N -  N -  E -  E N  E N  E -  E N  E N
01         S -  X -  N -  E S  E -  E N  E S  E -  E N
02         S -  S -  X -  E S  E S  E -  E S  E S  E -
10         W -  W -  W -  X -  N -  N -  E -  E N  E N
11         W -  W -  W -  S -  X -  N -  E S  E -  E N
12         W -  W -  W -  S -  S -  X -  E S  E S  E -
20         W -  W -  W -  W -  W -  W -  X -  N -  N -
21         W -  W -  W -  W -  W -  W -  S -  X -  N -
22         W -  W -  W -  W -  W -  W -  S -  S -  X -

[Figure label: (1,0)]
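A west-first node table like the one on this slide can be generated with a short sketch. It assumes node labels are "xy" coordinates with E as +x and N as +y; the function names `west_first_ports` and `node_table` are illustrative.

```python
def west_first_ports(cur, dst):
    """Permitted output ports at node `cur` for a packet headed to `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    if (cx, cy) == (dx, dy):
        return ["X"]             # eject at the destination
    if dx < cx:
        return ["W"]             # all westward hops must be taken first
    ports = []
    if dx > cx:
        ports.append("E")
    if dy > cy:
        ports.append("N")
    elif dy < cy:
        ports.append("S")
    return ports                 # at most two productive ports


def node_table(k):
    """One row per node: destination -> permitted output ports, for a k x k mesh."""
    nodes = [(x, y) for x in range(k) for y in range(k)]
    return {cur: {dst: west_first_ports(cur, dst) for dst in nodes} for cur in nodes}


table = node_table(3)
```

Each router only needs its own row, for example `table[(0, 1)]`, and can pick between the listed ports when two are available.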

90 Implementation
- Combinational circuits can be used
  - Simple (e.g. DOR): low router overhead
  - Specific to one topology and one routing algorithm
    - Limits fault tolerance
- Tables can be updated to reflect new configuration, network faults, etc.

91 Circuit Based
[Figure: routing circuit. Inputs sx, x, sy, y with =0 comparators produce a productive direction vector (exit, +x, -x, +y, -y); route selection combines it with queue lengths to produce the selected direction vector (exit, +x, -x, +y, -y)]
- Next hop based on buffer occupancies
- Or could implement simple DOR
- Fixed w.r.t. topology

92 Routing Algorithms: Implementation

Routing Algorithm       Source Routing  Combinational  Node Table
Deterministic: DOR      Yes             Yes            Yes
Oblivious: Valiant's    Yes             Yes            Yes
Oblivious: Minimal      Yes             Yes            Yes
Adaptive                No              Yes            Yes

93 Routing: Irregular Topologies
- MPSoCs
  - Power and performance benefits from irregular/custom topologies
- Common routing implementations
  - Rely on source or node table routing
- Maintain deadlock freedom
  - Turn model may not be feasible
    - Limited connectivity

94 Routing Summary
- Latency is the paramount concern
  - Minimal routing most common for NoCs
  - Non-minimal can avoid congestion and deliver low latency
- To date: NoC research favors DOR for simplicity and deadlock freedom
  - On-chip networks are often lightly loaded
- Only covered unicast routing
  - Recent work on extending on-chip routing to support multicast
