Lecture 3: Topology - II


ECE 8823 A / CS 8803 - ICN: Interconnection Networks, Spring 2017
http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/

Lecture 3: Topology - II

Tushar Krishna
Assistant Professor, School of Electrical and Computer Engineering
Georgia Institute of Technology
tushar@ece.gatech.edu

Topology: how to connect the nodes with links (analogous to a road network).

Run-Time Metrics
- Hop Count → Latency
- Maximum Channel Load → Throughput

Maximum Channel Load
- Identify the channel with maximum traffic and count the total flows through it.
- Maximum Throughput = 1 / (max channel load)

Maximum Channel Load: two 4-node rings (A-D and E-H) joined by a single channel
- Identify the bottleneck channel. For uniform random traffic, it is the bisection channel.
- Suppose each node generates p messages per cycle: 4p messages per cycle originate in the left ring, and 2p messages per cycle must cross to the other ring.
- The link can handle one message per cycle, so the maximum injection rate is p = 1/2.

Maximum Channel Load: what about hot-spot traffic?
- Suppose every node sends to node G. Which is the bottleneck channel?
- The bottleneck channel is used by A, B, C, D, E, and F to send to G, so Max Throughput = 1/6.

Maximum Channel Load: 8-node ring (nodes 0-7) with uniform random traffic
- Node 3 sends 1/8 of its traffic to each of nodes 4, 5, and 6.
- Node 3 sends 1/16 of its traffic to node 7 (two possible shortest paths).
- Node 2 sends 1/8 of its traffic to each of nodes 4 and 5, etc.
- Max channel load = 1
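
The ring calculation above is mechanical enough to script. Below is a minimal Python sketch (not from the lecture; all names are made up for illustration) that routes every flow along its shortest ring direction, splitting ties evenly across the two directions, and reports the load on each directed channel. For the 8-node ring under uniform random traffic it reproduces the slide's max channel load of 1.

```python
from fractions import Fraction

def ring_channel_loads(n, traffic):
    """Channel loads on a bidirectional n-node ring with shortest-path routing.

    traffic[(s, d)] = fraction of a node's injected traffic sent s -> d.
    Ties (distance n/2) are split evenly across the two directions.
    Returns the load on each directed channel (i, i+1 mod n) / (i, i-1 mod n).
    """
    load = {}
    for (s, d), f in traffic.items():
        if s == d:
            continue                       # self-traffic loads no channel
        cw = (d - s) % n                   # hops clockwise
        ccw = (s - d) % n                  # hops counter-clockwise
        if cw < ccw:
            paths = [(+1, cw, Fraction(f))]
        elif ccw < cw:
            paths = [(-1, ccw, Fraction(f))]
        else:                              # tie: split across both shortest paths
            paths = [(+1, cw, Fraction(f) / 2), (-1, ccw, Fraction(f) / 2)]
        for step, hops, share in paths:
            node = s
            for _ in range(hops):          # add this flow's share to every channel on the path
                nxt = (node + step) % n
                load[(node, nxt)] = load.get((node, nxt), Fraction(0)) + share
                node = nxt
    return load

N = 8
# Uniform random: each node sends 1/N of its traffic to every node (as on the slide)
uniform = {(s, d): Fraction(1, N) for s in range(N) for d in range(N)}
print(max(ring_channel_loads(N, uniform).values()))   # 1 -> peak throughput = 1/1

# Hot-spot: every node sends all its traffic to node 6
hotspot = {(s, 6): Fraction(1) for s in range(N) if s != 6}
print(max(ring_channel_loads(N, hotspot).values()))   # 7/2 on this single ring
# (the slide's 1/6 hot-spot figure is for its two-ring topology, not this ring)
```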

Traffic Patterns
- Historically derived from particular applications of interest.
- It is important to stress-test the network with different patterns; uniform random can make bad topologies look good.
- For a particular topology and traffic pattern, one can derive:
  - Avg Hop Count (→ low-load latency)
  - Max Channel Load (→ peak throughput)

Is it possible to achieve the derived low-load latency and peak throughput?

[Figure: latency vs. offered traffic (bits/sec). The topology sets the minimum latency and the maximum throughput; the routing algorithm tightens both bounds; flow control tightens them further. The actual zero-load latency is determined by topology + routing + flow control together.]

Uniform Random Traffic on a k x k Mesh: zero-load latency ("ideal latency")

T = (H+1)·(t_router + t_stall_avg) + (H+2)·t_wire + T_ser

- H = number of hops inside the network (H hops traverse H+1 routers and, counting the injection and ejection links, H+2 wires)
- t_router = per-hop router pipeline delay
- t_wire = per-hop link delay
- t_stall_avg = average per-hop stall delay (due to contention)
- T_ser = serialization delay

Ideal case: t_router = 1, t_wire = 1. Assume 1-flit packets (T_ser = 0). Zero load ⇒ t_stall_avg ≈ 0. For k = 8, H_avg = 5.333, so T_zero-load = 6.333 + 7.333 = 13.666 cycles.
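
To make the arithmetic concrete, here is a small sketch (function and variable names are my own, not from the lecture or from Garnet) that evaluates the zero-load latency formula above:

```python
def zero_load_latency(h_avg, t_router=1, t_wire=1, t_ser=0, t_stall_avg=0):
    """T = (H+1)(t_router + t_stall_avg) + (H+2)(t_wire) + T_ser.
    H hops traverse H+1 routers; counting the injection and ejection
    links gives H+2 wire traversals. At zero load, t_stall_avg ~ 0."""
    return (h_avg + 1) * (t_router + t_stall_avg) + (h_avg + 2) * t_wire + t_ser

k = 8
h_avg = 2 * k / 3                 # the slide's H_avg for a k x k mesh (16/3 = 5.333)
print(zero_load_latency(h_avg))   # 13.666... cycles, matching the slide
```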

Uniform Random Traffic on a k x k Mesh: saturation throughput ("ideal throughput", or peak injection rate) = 1 / (max channel load)

Let's calculate the load on one of the bisection links:
- k²/2 nodes are on the left.
- Half of their messages (k²/4 per cycle at an injection rate of 1) cross the bisection links.
- There are k bisection links from left to right.
- Load on each bisection link = (k²/4)/k = k/4, so peak throughput = 4/k.

For k = 4, peak throughput = 1 flit/node/cycle. For k = 8 (a 64-core mesh), peak throughput = 1/2 flit/node/cycle.
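
The bisection argument likewise reduces to one line; a sketch under the slide's assumptions (uniform random traffic, 1-flit packets):

```python
def mesh_peak_throughput(k):
    """Peak injection rate (flits/node/cycle) for a k x k mesh under
    uniform random traffic: at injection rate 1, k^2/4 flits/cycle cross
    the k bisection links, loading each by k/4, so saturation is at 4/k."""
    return 4 / k

print(mesh_peak_throughput(4))   # 1.0
print(mesh_peak_throughput(8))   # 0.5 (the 64-core mesh above)
```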

Latency vs. throughput for the 8x8 Mesh in Lab 1

[Figure: latency vs. offered traffic (flits/node/cycle). The topology bounds give a minimum latency of 13.66 cycles and a peak throughput of 0.5; Lab 1 measured an actual zero-load latency of 34.4 cycles and an actual saturation throughput of 0.37*, leaving both a latency gap and a throughput gap.]

*Garnet injected 1-flit and 5-flit packets with probabilities 2/3 and 1/3.

What is the default router delay in Garnet?

Another representation: injection rate as a % of capacity
- For a 4x4 Mesh, 100% ⇒ 1 flit/node/cycle.
- For an 8x8 Mesh, 100% ⇒ 0.5 flits/node/cycle.

[Figure: latency vs. offered traffic (% of capacity), marked at 50% and 100%.]

This representation makes it easier to see whether we achieve the throughput the network was actually designed for.

Topology Classification

Direct:
- Each router is associated with a terminal node; all routers are sources and destinations of traffic.
- Examples: Ring, Mesh, Torus. Most on-chip networks use direct topologies.

Indirect:
- Routers are distinct from terminal nodes: terminal nodes source/sink traffic, while intermediate nodes only switch traffic.
- Examples: Crossbar, Butterfly, Clos, Omega, Beneš, ...

Crossbar

- Diameter? 1. Degree? 1. Bisection BW? N.

Pros:
- Every node is connected to all others (non-blocking), giving low latency and high bandwidth.
- Used by GPUs.

Cons:
- Area and power grow quadratically (O(N²) cost).
- Expensive to lay out; difficult to arbitrate.

Butterfly (k-ary n-fly)

By convention, source and destination nodes are drawn logically separate on the left and right, though physically the two 0s, two 1s, etc. are often the same node.

- Radix of each switch = k (i.e., k inputs and k outputs); each switch is a k x k crossbar.
- Number of stages = n, with k^(n-1) switches per stage.
- Total source/destination terminal nodes = k^n.

[Figure: a 2-ary 4-fly, sources on the left, destinations on the right.]

Butterfly (k-ary n-fly): Metrics

- Degree? k
- Diameter? n+1
- Bisection Bandwidth? N/4 (where N = k^n)
- Hop Count? n+1
- Channel Load (for uniform traffic)? 1
- Path Diversity? None: only one route between any pair.
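
These metrics follow directly from k and n; a small illustrative calculator (the naming is mine, the values are as stated on the slide):

```python
def butterfly_metrics(k, n):
    """Metrics of a k-ary n-fly, using the slide's definitions."""
    N = k ** n                    # terminal nodes on each side
    return {
        "terminals": N,
        "switches_per_stage": k ** (n - 1),   # each a k x k crossbar
        "degree": k,
        "diameter_hops": n + 1,   # terminal -> n switch stages -> terminal
        "bisection_bw": N // 4,   # N/4, as stated on the slide
        "channel_load_uniform": 1,
        "path_diversity": 1,      # exactly one route between any pair
    }

print(butterfly_metrics(2, 4))   # the 2-ary 4-fly above: 16 terminals, 8 switches/stage
```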

Tackling path diversity in a butterfly: add an additional stage.

Beneš Network (pronounced "Ben-esh")

- Two back-to-back butterflies, giving N alternate paths between any pair.
- It is (rearrangeably) non-blocking.

Shuffle/Omega Network (isomorphic to the Butterfly)

[Figure: a 2-ary 3-fly redrawn as a shuffle network; the switches of the two networks (labeled 00-23 stage by stage) correspond one-to-one.]

Clos Networks: (m, n, r)

Three stages, where:
- m = number of middle switches
- n = number of input (output) ports on each input (output) switch
- r = number of input / output switches

[Figure: a Clos(5, 3, 4).]

Non-blocking Clos

A Clos network is strictly non-blocking for unicast traffic iff m ≥ 2n − 1: an unused input on an ingress switch can always be connected to an unused output on an egress switch without re-arranging existing routes.

Proof (Clos, 1953): Suppose an input switch has one free terminal that must be connected to a free terminal of an output switch. In the worst case, the other (n−1) input terminals of the input switch use (n−1) separate middle switches, and the other (n−1) output terminals of the output switch use (n−1) different middle switches. We then need one more middle switch to connect this input to this output, so in total m = (n−1) + (n−1) + 1 = 2n − 1 suffices.

Non-blocking Clos

A Clos network is rearrangeably non-blocking for unicast traffic iff m ≥ n: an unused input on an ingress switch can always be connected to an unused output on an egress switch, but this might require re-arranging existing routes.

Proof sketch (1953): if m = n, each input can use one middle switch to connect to its output.
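
The two 1953 conditions make a tidy classifier; a minimal sketch (names are mine) that tags an (m, n, r) Clos for unicast traffic:

```python
def clos_blocking_class(m, n, r):
    """Classify an (m, n, r) Clos for unicast traffic (Clos, 1953):
    strictly non-blocking iff m >= 2n - 1; rearrangeably iff m >= n."""
    if m >= 2 * n - 1:
        return "strictly non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking"

print(clos_blocking_class(5, 3, 4))  # the Clos(5, 3, 4) above: 5 >= 2*3 - 1 -> strictly non-blocking
print(clos_blocking_class(2, 2, 4))  # a Benes stage has m = n = 2 -> rearrangeably non-blocking
```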

Binary Fat Tree

- Can be built by folding a multi-stage Clos.
- Bisection Bandwidth? N
- Diameter? 2·log2(N)

[Figure: a binary fat tree with link capacities 1, 2, 4 doubling toward the root.]

Beneš → Folded Clos

[Figure: folding a Beneš network about its middle stage yields a folded Clos.]

Hierarchical Topologies: Concentrators

Advantages:
- Low diameter
- Fewer links

Disadvantages:
- Lower bisection bandwidth
- The link at the concentrator can become a bottleneck

More Hierarchical Topologies

Example: ATAC (PACT 2010).

Which topology should you choose? It is hard to optimize for everything:

- Desired bandwidth and desired latency
- Physical constraints:
  - Wire budget: indirect topologies are popular off-chip, while on-chip networks often use direct topologies due to wiring constraints.
  - Wire layout: topologies should be easy to lay out on a planar 2D substrate.
- Router complexity (number of ports)

Lab 2: Topology Comparison!

- You will implement a Flattened Butterfly topology (Kim et al., "Flattened Butterfly Topology for On-Chip Networks", MICRO 2007).
- Read the paper to understand the topology. You can ignore the routing/flow-control details for now, as we haven't covered those in class yet.
- Compare performance against the Mesh while keeping the bisection bandwidth constant. I will email out details.
- Look at $gem5/configs/topologies to see how topologies are implemented; you will write a FlattenedButterfly.py file. A connectivity sketch follows below.
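
As a starting point, the connectivity itself is easy to state: per Kim et al., a k x k flattened butterfly fully connects the routers within each row and within each column of the array. The pure-Python sketch below is illustrative only (it does not use the gem5 topology API, and the helper names are made up); it enumerates those router pairs:

```python
def flattened_butterfly_links(k):
    """Links of a k x k flattened butterfly (Kim et al., MICRO 2007):
    every router connects directly to all routers sharing its row and
    to all routers sharing its column. Returns undirected router-ID pairs."""
    def rid(x, y):
        return y * k + x
    links = set()
    for y in range(k):
        for x1 in range(k):
            for x2 in range(x1 + 1, k):
                links.add((rid(x1, y), rid(x2, y)))   # row (x-dimension) links
    for x in range(k):
        for y1 in range(k):
            for y2 in range(y1 + 1, k):
                links.add((rid(x, y1), rid(x, y2)))   # column (y-dimension) links
    return sorted(links)

links = flattened_butterfly_links(4)
print(len(links))   # k^2 * (k-1) = 48 links for k = 4
```

In FlattenedButterfly.py you would translate these router pairs into the internal links of the network, following the pattern of the existing topology files in $gem5/configs/topologies.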