CS-534 Packet Switch Architecture

The Hardware Architect's Perspective on High-Speed Networking and Interconnects
Manolis Katevenis, University of Crete and FORTH, Greece
http://archvlsi.ics.forth.gr/~kateveni/534

1. Basic Concepts and Queueing Architectures
Table of Contents:
1.1 Problem Statement
    scalable, distributed, multi-party communication
1.2 Basic Concepts and Terminology
    feasible traffic, internal blocking
    output contention, buffering, flow control, admission control
    circuit versus packet switching
    time vs. space switching, multiplexor, crossbar, buffer memories
1.3 Queueing Architectures Family 1:
    shared buffer, output queueing, crosspoint queueing
1.4 Queueing Architectures Family 2:
    input queueing: head-of-line blocking, per-output queues
U.Crete - M. Katevenis - CS-534

1.1 Problem Statement: Scalable, Distributed, Multi-party Communication
[Figure: ports Port 0 ... Port N-1 attached to a network; link capacity = 1.00 on both sides.]
Output contention: incoming rates > outgoing link capacity

1.1 Problem Statement: Scalable, Distributed, Multi-party Communication
[Figure: feedback loop between a source Pi and a destination Pj.]
(1) am I the only one sending to Pj?
(2) no, you are not!
(3) adjust rate...
(a) event, (b) feedback, (c) reaction effect
Reaction Effect Delay = Round-Trip Time (RTT)

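Example: Interdependent Constraints
[Figure: six flows with rates $\lambda_1, \dots, \lambda_6$; flows 1, 2, 3 share one output link of capacity 100%, and flows 3, 4, 5, 6 share one input link of capacity 100%.]
$\lambda_1 + \lambda_2 + \lambda_3 \le 100\%$ (output contention): 33% + 33% + 33%? (fairness)
$\lambda_3 + \lambda_4 + \lambda_5 + \lambda_6 \le 100\%$ (input rate limitation): 25% + 25% + 25% + 25%?
$\lambda_3 = 25\% \Rightarrow \lambda_1 + \lambda_2 = 75\% \Rightarrow \lambda_1 = \lambda_2 = 37.5\%$ (max-min fairness)?
or $\lambda_3 = 0$, $\lambda_1 = \lambda_2 = 50\%$, $\lambda_4 = \lambda_5 = \lambda_6 = 33.3\%$ (maximum utilization)?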
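
The max-min fair allocation in the example above can be reproduced with a progressive-filling procedure: raise all rates together and freeze the flows of any link that saturates. The following is a minimal sketch, assuming the two shared links read off the slide; the names `max_min_fair`, `out_link`, `in_link`, and `l1` ... `l6` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def max_min_fair(flows, links):
    """Progressive filling: raise all unfrozen rates equally until a link saturates.

    flows: list of flow names
    links: dict link_name -> (capacity, set of flows crossing that link)
    """
    rate = {f: Fraction(0) for f in flows}
    frozen = set()
    while len(frozen) < len(flows):
        # Largest equal increment that no link with active flows can exceed.
        step = None
        for cap, members in links.values():
            active = [f for f in members if f not in frozen]
            if not active:
                continue
            slack = cap - sum(rate[f] for f in members)
            inc = slack / len(active)
            if step is None or inc < step:
                step = inc
        if step is None:
            break  # the remaining flows cross no link at all
        for f in flows:
            if f not in frozen:
                rate[f] += step
        # Freeze every flow that crosses a link that is now saturated.
        for cap, members in links.values():
            if sum(rate[f] for f in members) == cap:
                frozen |= set(members)
    return rate

# Assumed topology of the slide: one contended output, one rate-limited input.
flows = ["l1", "l2", "l3", "l4", "l5", "l6"]
links = {
    "out_link": (Fraction(1), {"l1", "l2", "l3"}),
    "in_link": (Fraction(1), {"l3", "l4", "l5", "l6"}),
}
rates = max_min_fair(flows, links)
total_max_min = sum(rates.values())   # aggregate throughput under max-min fairness
total_max_util = Fraction(2)          # both links fully used when l3 = 0
```

Under these assumptions the sketch yields 37.5% for flows 1 and 2 and 25% for flows 3 to 6, i.e. an aggregate of 175% versus the 200% of the maximum-utilization alternative, which is the trade-off the slide asks about.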

1.1 Problem Statement: Scalable, Distributed, Multi-party Communication
[Figure: (a) ports P0 ... P7 with all-to-all connectivity; (b) ports P0 ... P7 interconnected through 4x4 switches.]
(a) all-to-all connectivity costs $O(N^2)$
(b) lower-cost scalability is feasible: with or without compromise on performance?

1.2 Basic Concepts & Terminology: Feasible Traffic
[Figure: input ports 0 ... i ... N-1 and output ports 0 ... j ... N-1 attached to a network; $\lambda_{i,j}$ is the rate from input i to output j; link capacities = 1.00 each.]
Rates satisfying:
$\forall i: \sum_j \lambda_{i,j} \le 1$, i.e. do not violate input link capacities; and:
$\forall j: \sum_i \lambda_{i,j} \le 1$, i.e. do not violate output link capacities (do not create output contention).
Flow control and congestion management strive to determine and enforce feasible rates at the sources: not at all an easy problem.
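
The two feasibility conditions above amount to bounding every row sum and every column sum of the rate matrix by the link capacity. A minimal sketch, assuming rates are given as a list of rows normalized to a link capacity of 1.0; the function name `is_feasible` and the tolerance `eps` are illustrative choices, not part of the slides.

```python
def is_feasible(rates, capacity=1.0, eps=1e-9):
    """rates[i][j] = rate from input i to output j, as a fraction of link capacity."""
    n_in = len(rates)
    n_out = len(rates[0])
    # Input link capacities: each row sum must not exceed the capacity.
    rows_ok = all(sum(rates[i]) <= capacity + eps for i in range(n_in))
    # Output link capacities: each column sum must not exceed the capacity.
    cols_ok = all(
        sum(rates[i][j] for i in range(n_in)) <= capacity + eps
        for j in range(n_out)
    )
    return rows_ok and cols_ok

uniform = [[1 / 3] * 3 for _ in range(3)]   # every input spreads its load evenly
hot_spot = [[0.6, 0.0, 0.0],
            [0.6, 0.0, 0.0],
            [0.0, 0.5, 0.5]]                # output 0 is asked for 1.2 > 1

uniform_ok = is_feasible(uniform)
hot_spot_ok = is_feasible(hot_spot)
```

The second matrix respects every input capacity yet oversubscribes output 0, which is exactly the output contention case.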

1.2 Basic Concepts & Terminology: Internal Blocking
[Figure: overall network made of switch 1 and switch 2; local traffic at each switch is OK, while remote traffic between them is infeasible beyond a limit.]
Externally feasible traffic for the overall network, but internally not feasible, due to internal link oversubscription.
Internal blocking in a network is output contention in a subnetwork, but:
output contention is the customer's responsibility, while
internal blocking is the network provider's responsibility.

Internal Blocking: a Quiz and a Preview of a Key Result
[Figure: network of end-stations and 6x6 switches.]
Does this network have internal blocking
(a) if each flow is always routed through a fixed path?
(b) if routing paths can adapt to traffic patterns?
Hint: consider the traffic crossing a 45° bisection ("bisectional throughput").
Key Result (see later): an $N \times N$ network made of $(N/2) \log_2 N$ or fewer switch elements will always have internal blocking. The Benes network, using multipath routing, has about $N \log_2 N$ switch elements and is internally non-blocking.

Dealing with Contention for Link Throughput
Option 1: Ensure contention never appears
    preschedule everything: fixed-rate traffic, circuit switching
Option 2: Allow dynamically varying rates: packet switching
  (a) Dealing with short-term contention (manageable volume of excess traffic)
      either: buffer excess packets, temporarily, in memories,
      or: drop excess packets and possibly retransmit later; OK in some applications, and if we ensure it rarely happens, e.g. if it only happens on memory overflow, or with massive overspeed
  (b) Dealing with long-term contention (unmanageable volume if excess traffic is allowed to persist)
      either, beforehand, use admission control: increased latency before traffic is allowed to start or change rate
      or, after the fact, use flow control and congestion management: needs large (RTT) buffer space(s) and multiple queues

1.2 Basic Concepts & Terminology: Circuit Switching
[Figure: circuits A, B, C on a time axis; each periodic frame carries time slots 1 2 3 4; some slots are marked as unused, wasted capacity.]
Originates from telephone circuits, digitized and time-multiplexed
Fixed-rate, prescheduled at connection set-up time: like trains
Data only, no headers needed: the time-slot position in the frame implicitly provides the circuit ID (flow ID) and routing information
+ Simplicity: static, off-line routing decisions and contention resolution
- Partitioned capacity: throughput is statically partitioned among circuits; unused capacity in one circuit is wasted and cannot be used by other circuits
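
The point that slot position alone identifies the circuit can be illustrated with a small demultiplexer. This is only a sketch under assumed data: the slot table, the payload names, and the function `demultiplex` are hypothetical, and an idle circuit is modeled as a `None` payload in its slot.

```python
def demultiplex(stream, slot_to_circuit):
    """Split a slotted stream into per-circuit payload lists.

    stream: payload per time slot (None = the owning circuit had nothing to send)
    slot_to_circuit: circuit ID owning each slot position within one frame
    """
    frame_len = len(slot_to_circuit)
    out = {c: [] for c in slot_to_circuit}
    wasted = 0
    for t, payload in enumerate(stream):
        circuit = slot_to_circuit[t % frame_len]   # position in frame = circuit ID
        if payload is None:
            wasted += 1                            # nobody else may use this slot
        else:
            out[circuit].append(payload)
    return out, wasted

slot_table = ["A", "B", "C", "A"]                  # assumed 4-slot periodic frame
stream = ["a0", None, "c0", "a1",                  # frame 1: circuit B is idle
          "a2", None, "c1", "a3"]                  # frame 2: circuit B is idle
per_circuit, wasted_slots = demultiplex(stream, slot_table)
```

No header is consulted anywhere in the sketch, and the two idle slots of circuit B stay empty even though A or C might have had more data to send.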

1.2 Basic Concepts & Terminology: Packet Switching
[Figure: flows A, B, C sharing a link along the time axis; only globally unused capacity is left idle.]
Varying or unpredictable traffic: like automobiles
Self-describing packets: the header provides the destination address
+ Transmission capacity of the link is dynamically shared among flows
- Demanding: dynamic, on-line, run-time routing decisions and contention resolution

1.2 Basic Concepts & Terminology: Time Switching
[Figure: Time-Division Multiplexing (TDM): an aggregation-type multiplexer feeds a high-throughput link (C=3), followed by a demultiplexer.]
All packets pass through a single point in space, at different times
Similar to time-sharing (multiprogramming) on a single processor
Buses are in this category (distributed multiplexor, built with tristate drivers)
+ Economizes on datapaths, wires, memories
+ Easy to share aggregate capacity among competing flows
- Non-scalable: infeasible beyond the technology limit for aggregate capacity

1.2 Basic Concepts & Terminology: Space Switching
[Figure: crossbar built from selection-type multiplexers.]
Packets at a given time pass through different paths in space
Similar to multiprocessing on parallel processors
Crossbars are in this category (single-stage space switches)
+ Scalable: use when aggregate throughput > upper limit of time switching
- Partitioned memories and wires: harder to route, schedule, load balance

Combination Example: Time-Space-Time Circuit Switch
[Figure: input TSIs, a crossbar, and output TSIs operating on periodic frames that carry slots A0, A1, B0, B1.]
Time-Slot Interchange (TSI): one frame worth of buffer memory
Output conflict: same output port, same input time: requires input TSI
Input conflict: same input port, same output time: requires output TSI
Time switching (TSIs) is needed to resolve output and input conflicts
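
A time-slot interchange can be viewed as writing one frame into memory and reading it back in a different order. The sketch below assumes a two-slot frame per input in which each entry names the destination output of that slot; the frames, `time_slot_interchange`, and `has_output_conflict` are hypothetical illustrations rather than the example drawn on the slide.

```python
def time_slot_interchange(frame, read_order):
    """Buffer one whole frame, then read its slots out in the order `read_order`."""
    if sorted(read_order) != list(range(len(frame))):
        raise ValueError("read_order must be a permutation of the slot indices")
    return [frame[k] for k in read_order]

def has_output_conflict(frames):
    """True if, in some time slot, two inputs want the same crossbar output."""
    for slot in zip(*frames):
        if len(set(slot)) < len(slot):
            return True
    return False

# Assumed frames: entry = destination output requested in that time slot.
input0 = [0, 1]
input1 = [0, 1]          # same output at the same time as input0 in both slots

before = has_output_conflict([input0, input1])
input1_reordered = time_slot_interchange(input1, [1, 0])   # input TSI swaps the slots
after = has_output_conflict([input0, input1_reordered])
```

Swapping the two slots of one input removes the output conflict at the crossbar; an output TSI would then restore whatever slot order the outgoing frame requires.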

Contention Resolution: Buffer Memories
[Figure: aggregation-type multiplexors with link capacities C=4, each followed by a buffer memory.]
Memories are needed to temporarily buffer packets that cannot proceed due to (hopefully short-term) output contention

1.3 Queueing Architectures, Family 1
Output Queueing: the reference architecture
[Figure: aggregation-type multiplexors with links of C=4, each feeding a per-output buffer memory.]
Packets are buffered in per-output memories, right next to their desired output
Work-conserving operation: an output will never remain idle while even a single packet destined to it exists in some switch buffer memory
+ Minimum possible delay
+ Full (100%) utilization of outgoing link capacity
+ Adaptable to any quality-of-service (QoS) policy: organize queues and scheduler as desired within each per-output buffer; but
- Wasteful in buffer-memory throughput: see the shared buffer architecture, below
- Partitioned buffer space is less efficient than shared: see below

Shared Buffer: the best architecture (when feasible)
[Figure: output queueing, with per-output memories M(1), M(2), ..., M(N) fed by links of C=N, next to a shared buffer with a single memory; ports 1 ... N.]
+ Aggregate memory throughput = $2N$, versus $N(N+1) = N^2 + N$ for output queueing
+ Same high performance and minimum delay as output queueing, with proper data structures
+ Shared buffer space is used more efficiently than the partitioned space of output queueing
- Non-scalable: requires building a buffer memory with throughput = $2N$ (output queueing is not scalable either: it requires memories of throughput $N+1$ each)

Memory Throughput determines Feasibility & Cost
Is memory throughput arbitrarily scalable by increasing its width?
Not for memory widths exceeding the packet size! Multi-packet-wide memories are in reality full-fledged switches.
Example: Internet traffic consists of ~60% min-size packets, of size 40 Bytes (320 bits) each; assume a memory cycle time of 2 ns: peak memory throughput = 500 M accesses/s = 500 M packets/s = 500 M × 320 bits/s = 160 Gbps; for 10 Gbps links, this allows the shared buffer architecture to scale only up to N = 16/2 = 8, i.e. an 8x8 switch.
On-chip memory: power consumption ~ throughput, e.g. at 130 nm, 1.5 to 2 mW/Gbps (for small memory blocks, consumption is dominated by the sense amplifiers; for large blocks, by size)
Off-chip memory: wire, pin, and chip count ~ throughput; RAM chip address and data throughput is 500 to 800 Mbps/pin; pin and wire count determine package size, board area, power consumption
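
The arithmetic of the example above can be checked in a few lines. This sketch assumes one minimum-size packet per memory access and a shared buffer that must sustain one write plus one read per port, i.e. an aggregate throughput of 2N link rates; the function and parameter names are illustrative only.

```python
def shared_buffer_max_ports(packet_bytes, cycle_ns, link_gbps):
    """Largest N for which a single memory can serve an N x N shared-buffer switch."""
    packet_bits = packet_bytes * 8
    # One packet per access: bits per nanosecond equals Gbit/s.
    peak_gbps = packet_bits / cycle_ns
    # Each port needs one write stream and one read stream at link rate.
    max_ports = int(peak_gbps // (2 * link_gbps))
    return peak_gbps, max_ports

peak_gbps, max_ports = shared_buffer_max_ports(packet_bytes=40, cycle_ns=2, link_gbps=10)
```

With 40-byte packets, a 2 ns cycle, and 10 Gbps links the sketch gives 160 Gbps of peak memory throughput and N = 8. Widening the memory beyond one packet would not help for minimum-size packets, which is the point made at the top of the slide.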

Crosspoint Queueing: scalable but very costly implementation of OQ
[Figure: array of crosspoint buffers with inputs 1 ... N and outputs 1 ... N; each output has a selection-type multiplexor, forming a module that implements one output queue.]
+ Same high performance and minimum delay as output queueing or shared buffer
+ Scalable: each memory needs a throughput of only 2, independent of N
- Very expensive: total memory throughput = $2N^2$, versus $2N$ for shared buffer
- Highly partitioned memory: very poor buffer space utilization

1.4 Queueing Architectures, Family 2
Input Queueing: the practical architecture
[Figure: per-input buffer memories 1 ... N feeding selection-type multiplexors for outputs 1 ... N, controlled by a scheduler.]
+ Scalable: each memory needs throughput = 2, feasible independently of N
+ Low cost: total memory throughput = $2N$, same as shared buffer
- Performance suffers a lot, unless (i) multiple queues per input, and (ii) sophisticated scheduler, and usually (iii) other modifications, to be seen later (small crosspoint queues, or internal speedup and output queues)

Input Queueing is not the dual of Output Queueing
[Figure: input queueing next to the dual of output queueing, each with ports 1 ... N and links of C=N.]
Asymmetry between packet arrival and packet departure conflicts:
Simultaneous arrivals may conflict with each other (packets destined to the same output), and the switch is obliged to accept them.
Simultaneous departures are scheduled by the switch: they can be made to not originate from the same input, albeit at a potential performance cost.

Head-of-Line (HOL) Blocking
[Figure: single FIFO queue per input, holding packets labeled with their destination outputs 1 ... 4; some packets are blocked due to output throughput, some due to the input memory throughput limitation, and some are needlessly blocked (HOL blocking).]
Whenever one First-In-First-Out (FIFO) queue feeds multiple destinations, beware of the danger of head-of-line (HOL) blocking (it is called HOL blocking when the bottleneck is the FIFO organization, not when there is another bottleneck, e.g. memory read throughput or output port throughput)
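
The effect can be seen in a toy slotted model where each input sends at most one packet per slot and each output accepts at most one. This is a minimal sketch with an assumed two-input workload and a simple fixed-priority arbiter; `run_fifo`, `run_per_output_queues`, and the packet lists are hypothetical and are not the scenario drawn on the slide.

```python
from collections import deque

def run_fifo(queues):
    """Slots needed to drain the switch when only the head of each FIFO may depart."""
    fifos = [deque(q) for q in queues]
    slots = 0
    while any(fifos):
        busy_outputs = set()
        for q in fifos:                          # fixed-priority arbitration
            if q and q[0] not in busy_outputs:
                busy_outputs.add(q.popleft())
        slots += 1
    return slots

def run_per_output_queues(queues):
    """Same model, but each input may send its oldest packet to any free output."""
    pending = [list(q) for q in queues]
    slots = 0
    while any(pending):
        busy_outputs = set()
        for q in pending:
            for k, dest in enumerate(q):
                if dest not in busy_outputs:
                    busy_outputs.add(dest)
                    del q[k]                     # per-output order is preserved
                    break
        slots += 1
    return slots

# Entries are destination outputs, oldest first (assumed workload).
workload = [[1, 2],
            [1, 3]]
fifo_slots = run_fifo(workload)
voq_slots = run_per_output_queues(workload)
```

In the FIFO case the packet for output 3 waits behind a packet for the contended output 1 although output 3 is idle; with per-output queues it departs in the first slot and the switch drains one slot earlier.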

Multiple Lanes needed to resolve HOL blocking
As with cars in the road network, HOL blocking spreads the negative effects of congestion (blue cars) to other, unrelated traffic (green cars)
Multiple lanes (queues, virtual circuits) (as many as the congested destinations???) can resolve this problem when properly architected (when packets heading to a common congestion point are prevented from occupying more than one lane): how should we do this?...

Crossbar Scheduling with Multiple Queues is tricky
[Figure: two configurations, (a) and (b), of a 2x2 crossbar with outputs A and B; input 1 has queues Q1A (cells A A A) and Q1B (cells B B), input 2 has queues Q2A (cell A) and Q2B (empty); some queues are marked Blocked.]
Per-output queues at the inputs avoid HOL blocking.
(a): even if Q1A cells are older, the younger Q1B cells can bypass them, since they reside in a separate queue and the scheduler can see them.
Which of the two configurations should the scheduler choose? (a), for higher aggregate throughput? But then, flow Q1A will starve!
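
The choice between the two configurations can be made concrete by enumerating every crossbar matching and counting the cells each one would serve. The sketch below assumes the queue occupancies suggested by the figure (3, 2, 1, and 0 cells) and uses brute-force enumeration purely for illustration; it is not a practical scheduler, and the names are hypothetical.

```python
from itertools import permutations

def matchings(queue_len):
    """Yield (matching, cells_served); matching[i] is the output given to input i."""
    n = len(queue_len)
    for perm in permutations(range(n)):
        served = sum(1 for i, j in enumerate(perm) if queue_len[i][j] > 0)
        yield perm, served

# Rows: inputs 1 and 2; columns: outputs A and B (assumed occupancies).
queue_len = [[3, 2],    # Q1A, Q1B
             [1, 0]]    # Q2A, Q2B

served_by = dict(matchings(queue_len))
best = max(served_by, key=served_by.get)
q1a_served_by_best = best[0] == 0      # does the best matching connect input 1 to A?
```

Here the matching that sends input 1 to B and input 2 to A serves two cells per slot, while the one that sends input 1 to A serves only one. A scheduler that always maximizes the number of cells served would therefore keep skipping Q1A for as long as Q1B and Q2A stay non-empty, which is the starvation risk named above.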

Summary of Queueing Architectures ($N \times N$ switch)

Architecture        Per-mem.     Num. of   Total mem.         Mem. space    Performance   Complexity                 Conclusions
                    throughput   mem's     throughput         utilization
Shared Buffer       $2N$         1         $2N$               best          best          multiple queues            best if feasible
Output Queueing     $N+1$        $N$       $N^2+N$            medium        best          simple                     reference only
Crosspoint Q'ing    $2$          $N^2$     $2N^2$             worst         best          simple                     simple, scalable
Inp.Q, single Q     $2$          $N$       $2N$               medium        worst         simple                     simple, poor perf.
Inp.Q, multiple Q   $2$          $N$       $2N$               medium        medium        multiple Q's, scheduler    textbook only
Variants (later)    2 to 4       $N$++     $4N$ to $8N$++     medium        very good     multiple Q's, scheduler    practical, scalable
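
The memory-throughput columns of the summary follow directly from the per-memory throughput and the number of memories of each architecture. A small sketch that evaluates them for a given N, in units of one link rate; the function name and dictionary keys are illustrative, and the "Variants" row is left out because the slides defer its details.

```python
def memory_cost(n):
    """Return {architecture: (per-memory throughput, number of memories, total)}."""
    per_mem_and_count = {
        "shared buffer": (2 * n, 1),          # N writes + N reads into one memory
        "output queueing": (n + 1, n),        # up to N writes + 1 read per output
        "crosspoint queueing": (2, n * n),    # 1 write + 1 read per crosspoint
        "input queueing": (2, n),             # 1 write + 1 read per input
    }
    return {
        name: (per_mem, count, per_mem * count)
        for name, (per_mem, count) in per_mem_and_count.items()
    }

costs = memory_cost(8)
```

For N = 8 the totals are 16 for shared buffer and input queueing, 72 for output queueing, and 128 for crosspoint queueing, which shows why input queueing matches the shared buffer in cost while keeping each individual memory at a throughput of 2.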