Crossbar and time-space switching. Simple space-division switch: crosspoints can be turned on or off.


Crossbar
Simple space-division switch. Crosspoints can be turned on or off.

Crossbar - example
[Figure: crossbar with inputs on one side and outputs on the other, carrying four sessions; the session pairs are not recoverable.]

Crossbar
Advantages:
- simple to implement
- simple control
- flexible
Drawback: again, limited scale:
- number of crosspoints, $N^2$
- large VLSI layout area
- vulnerable to single faults
Bottleneck: number of pins! Need $\log n$ pins for each input port to encode the output port.
Bottom line: good for small N (hundreds).

Combination: Time-space switching
Precede each input link in a crossbar with a TSI (time-slot interchange). Delay samples so that they arrive at the right time for the space-division switch's schedule.
[Figure: two muxes, TSIs, a crossbar, and two demuxes. Crosspoints: 4 (not 16). The memory speed figures are not recoverable.]

Time-Space: Example
Need a schedule! Internal speed = twice link speed.
[Figure: muxes, crossbar, and demuxes with time axes; the desired permutation is not recoverable.]

Definition of the problem
Input:
- A permutation σ of [n]; σ(i) = j means input port i goes to output port j.
- Size of crossbar, s × s.
- Implied: each crossbar line handles m calls, where m = n/s.
Output: a schedule of length t ≥ m realizing σ:
- For each crossbar input line: the order in which to place samples.
- For each time step: which permutation to realize at the crossbar.

Finding a schedule
Goal: minimize the length of the schedule.
Model by a routing graph:
- a node for each crossbar input, a node for each crossbar output
- an edge connects nodes a, b if there is a call from a to b
Need to find a good edge coloring.

Some Graph Theory
Edge coloring: assignment of colors to edges so that no two incident edges have the same color. Note: always #colors ≥ max degree.
Bipartite graph: two sets of nodes, with edges only between the sets (not inside a set). G = (A, B, E), where E ⊆ A × B.
Theorem: a graph is bipartite iff all its cycles have even length.
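To make the routing-graph model concrete, here is a minimal Python sketch that turns a permutation into crossbar time steps by edge coloring. It assumes the standard alternating-path method for bipartite graphs, which never needs more colors than the maximum degree. The function name `edge_color_bipartite`, the sample permutation `sigma`, and the rule that port i sits on line i // m are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def edge_color_bipartite(edges):
    """Properly color the edges of a bipartite multigraph.

    edges: list of (a, b) pairs; a is a crossbar input line, b an output line.
    Returns (colors, delta): colors[i] in range(delta) is the time step of
    edges[i], and delta is the maximum degree of the routing graph.
    """
    degree = defaultdict(int)
    for a, b in edges:
        degree[("in", a)] += 1
        degree[("out", b)] += 1
    delta = max(degree.values(), default=0)

    at = defaultdict(dict)           # at[node][color] -> index of the edge using it
    colors = [None] * len(edges)

    def ends(e):
        return ("in", edges[e][0]), ("out", edges[e][1])

    for e in range(len(edges)):
        u, v = ends(e)
        cu = next(c for c in range(delta) if c not in at[u])   # color free at u
        cv = next(c for c in range(delta) if c not in at[v])   # color free at v
        # Collect the path from v whose edges alternate cu, cv, cu, ...
        # (empty when cu is already free at v). In a bipartite graph it never reaches u.
        path, node, c = [], v, cu
        while c in at[node]:
            f = at[node][c]
            path.append(f)
            x, y = ends(f)
            node = y if node == x else x
            c = cv if c == cu else cu
        # Swap cu and cv along the path so that cu becomes free at v.
        for f in path:
            x, y = ends(f)
            del at[x][colors[f]]
            del at[y][colors[f]]
        for f in path:
            colors[f] = cv if colors[f] == cu else cu
            x, y = ends(f)
            at[x][colors[f]] = f
            at[y][colors[f]] = f
        colors[e] = cu
        at[u][cu] = e
        at[v][cu] = e
    return colors, delta

# Hypothetical instance: n = 4 ports, s = 2 crossbar lines, m = 2 calls per line.
sigma = [2, 0, 3, 1]                 # sigma[i] = output port of input port i
m = 2
edges = [(i // m, sigma[i] // m) for i in range(len(sigma))]
colors, delta = edge_color_bipartite(edges)
for i, step in enumerate(colors):
    print(f"input {i} -> output {sigma[i]}: time step {step}")
```

Each color class is a matching between crossbar input and output lines, so it can be realized by the crossbar in one time step; the number of colors used is the schedule length.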

König's Theorem [1916]
The edges of any bipartite graph with maximal degree Δ can be colored in Δ colors.
Proof: by an algorithm.
1. Pick an edge from some node of degree Δ.
2. If it leads to an unvisited node of degree Δ, continue.
3. When stopped: remove path edges number 1, 3, 5, ...
4. Repeat until done: max degree is now Δ-1, apply induction.
Analysis: the removal step is well defined because all cycles are of even length (bipartite graph). After it, all nodes on the path have degree < Δ.

Corollary
Can find an optimal schedule! Because the length of a schedule is at least the number of inputs per crossbar input.
The computation is a bit time consuming, but it needs to be done only at setup time.

General Space division
Goal: extend the benefits of crossbars by better topology design.
Graph representation:
- Nodes for inputs, outputs, and internal switches
- Edges for links between them
Routing: edge-disjoint paths.
Cost: number of crosspoints, VLSI layout area.

Multistage switch
In a crossbar, in each step, only one crosspoint may be active in each row or column: a call goes through a single crosspoint (internal switch).
Multistage switch basic idea: allow calls to go through more than one internal switch.

Example
[Figure: multistage switch carrying six sessions; the session pairs are only partly recoverable.]

Clos networks: (N, n, k)
N × N switch, three layers (stages):
- Layers 1, 3: N/n sub-switches; each sub-switch is n × k (k × n in layer 3).
- Layer 2: k sub-switches, each N/n × N/n.
- Each layer-2 sub-switch is connected to all layer-1 and layer-3 sub-switches.
# crosspoints: $2 \cdot (N/n) \cdot nk + k (N/n)^2 = 2Nk + k (N/n)^2$
This is less than $N^2$ for suitable n and k.
[Figure: a Clos network with the number and size of the switches in stages 1, 2, and 3; the parameters are not recoverable.]

Clos Networks
Can suffer internal blocking, e.g., if k < n, because the total width of the 2nd stage is k · N/n.
Types of non-blocking:
- Strict sense: can find a route from a free input to a free output without changing existing routes.
- Rearrangeable: can route any set of pairs (possibly destroying existing state).

Example
[Figure: a new call from input 6 is blocked by existing calls; N = 8, the remaining parameters are not recoverable.]
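The crosspoint count of a three-stage Clos network follows directly from the sub-switch sizes listed above. The sketch below is a small illustration under assumed parameters; the function names and the values N = 36, n = 6 are hypothetical and not from the slides. The two thresholds used are the standard Clos results (k ≥ n for rearrangeable, k ≥ 2n-1 for strict-sense non-blocking).

```python
def clos_crosspoints(N, n, k):
    """Crosspoints of a three-stage Clos network (N, n, k)."""
    if N % n != 0:
        raise ValueError("n must divide N")
    r = N // n                      # number of sub-switches in stages 1 and 3
    outer = 2 * r * n * k           # stages 1 and 3: r sub-switches of size n x k each
    middle = k * r * r              # stage 2: k sub-switches of size r x r
    return outer + middle

def is_rearrangeable(n, k):
    return k >= n                   # standard condition for rearrangeable non-blocking

def is_strict_sense_nonblocking(n, k):
    return k >= 2 * n - 1           # standard condition for strict-sense non-blocking

# Hypothetical sizing: N = 36 ports, n = 6 ports per first-stage sub-switch.
N, n = 36, 6
for k in (n, 2 * n - 1):
    print(k, clos_crosspoints(N, n, k), "vs crossbar", N * N)
```

For these assumed values the rearrangeable design (k = 6) needs 648 crosspoints and the strict-sense design (k = 11) needs 1188, both below the 1296 of a single 36 × 36 crossbar.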

Clos Theorem [1953]
A Clos network is strict-sense non-blocking iff $k \ge 2n-1$.
Proof:
If $k \ge 2n-1$: consider a call from sub-switch a in stage 1 to sub-switch b in stage 3. The number of current calls at a and b is at most $2(n-1)$, hence there exists a free stage-2 sub-switch to connect a and b.
If $k < 2n-1$: from a sub-switch a in stage 1, connect $n-1$ calls to different stage-3 sub-switches; they use $n-1$ different stage-2 sub-switches. Likewise, $n-1$ calls into a stage-3 sub-switch b can use $n-1$ other stage-2 sub-switches. With $k \le 2n-2$, no stage-2 sub-switch is left to connect a to b.

Blocking Example
[Figure: Clos network with n × k first-stage, (N/n) × (N/n) middle-stage, and k × n last-stage sub-switches; the parameter values are not recoverable.]

Rearrangeable Clos networks
Theorem [Slepian, Duguid]: Clos (N, n, k) is rearrangeably non-blocking iff $k \ge n$.
Proof: If rearrangeable, then $k \ge n$, since layer 2 can carry only $kN/n$ different calls.
To show sufficiency, build a bipartite routing graph:
- Nodes: layer-1 and layer-3 switches
- Edges: calls (input node to output node)
The max degree is at most n, so by König's Theorem we can color the edges in n colors. Assign a 2nd-stage switch to each color (coloring = routing!). Since $k \ge n$, we're done.

Recursive construction: Benes
Recursive construction: first and last stage are N/2 crossbars of size 2 × 2; the middle stage is two N/2 × N/2 recursive switches.
# crosspoints: $C(N) = 4N + 2\,C(N/2)$

Benes network
Benes network (back-to-back Butterfly).
# stages: $2\log_2 N - 1$
# crosspoints: $4N\log_2 N - 2N$

Greedy permutation routing
It is sufficient to choose whether to go up or down at each intermediate node (solve for the upper and lower sub-switches recursively).
Idea: go back and forth, alternating up and down.
- Start from input i, choose up.
- Get to its output o.
- From o, go back, choosing down (up is taken!).
- Continue until completing a cycle, and then start with the remainder.

Example
[Figure: routing a permutation of 8 inputs through a Benes network; the permutation is not recoverable.]

Recursive strictly non-blocking Clos Networks
# crosspoints: if $k = 2n$, choose $n = (N/2)^{1/2}$, and get a cost of $O(N^{3/2})$ crosspoints.
Recursive construction: first and last stage have $N^{1/2}$ sub-switches, each $N^{1/2} \times 2N^{1/2}$; the middle stage has $2N^{1/2}$ sub-switches of size $N^{1/2} \times N^{1/2}$.
[Figure: recursive construction, from level 0 switches to level r switches.]
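The back-and-forth idea for greedy permutation routing can be sketched for one level of the recursion as follows. This is a minimal sketch under stated assumptions: ports 2j and 2j+1 share an outer 2 × 2 switch, 0 stands for "up" and 1 for "down", and the function name `loop_assign` and the sample permutation are hypothetical.

```python
def loop_assign(perm):
    """One level of up/down assignment for routing perm through a Benes network.

    perm[i] is the output port of input port i (len(perm) must be even).
    Ports 2j and 2j+1 share an outer 2x2 switch.
    Returns side[i]: 0 if input i is routed through the upper sub-network,
    1 if it goes through the lower one.
    """
    n = len(perm)
    inv = [0] * n
    for i, o in enumerate(perm):
        inv[o] = i
    side = [None] * n
    for start in range(n):
        i = start
        while side[i] is None:
            side[i] = 0                  # start from input i, choose up
            sibling_out = perm[i] ^ 1    # the other port of i's output switch
            j = inv[sibling_out]         # go back from there: up is taken, so down
            side[j] = 1
            i = j ^ 1                    # j's switch partner must go up; continue the cycle
    return side

# Hypothetical permutation on 8 ports.
perm = [2, 5, 0, 7, 1, 6, 3, 4]
side = loop_assign(perm)
print(side)
```

The two inputs of every outer switch end up on different sides, and so do the two inputs feeding every output switch, so the upper and lower sub-networks each receive a permutation of half the size that can be routed recursively in the same way.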

Strict-sense Non-blocking: Better constructions
Cantor networks:
- log N copies of a Benes network
- Network size $N \log^2 N$
Sorting networks (self-routing):
- Batcher network: ½ log N comparators
- AKS network: ~6000 log N comparators

Fabrics: Summary
Rearrangeably non-blocking networks can be constructed with $O(N \log N)$ crosspoints.
Strictly non-blocking networks can be constructed with $O(N \log^2 N)$ crosspoints (Cantor, Batcher-Banyan).
Theoretical construction: $O(N \log N)$ crosspoints for strictly non-blocking. Not practical due to large hidden constants (based on the AKS sorting network).

Next: Switch Scheduling
Saw: how to deal with a single permutation.
Problem: how to deal with many- demand?
- Buffering and scheduling strategies
- General traffic model
- Scheduling of a congested link

Switch model
- N × N switch.
- Cell: a packet (for simplicity, all packets are the same size).
- All line rates (of inputs and outputs) are the same.
- Time slot: the arrival time between cells (an indicator of the line rate).

Input Queuing
The fabric implements a partial permutation. Packets wait in input buffers for the fabric.
[Figure: input buffers in front of the fabric.]

Head of Line (HOL) Blocking
When the first packet can't move, all packets behind it are stuck, even if their destination is free!
Approximate calculation: suppose destinations are random. Then one can show that throughput is limited to about 58.6%.

Output queuing
The fabric takes a packet from each input in each line clock. Possibly more than one packet goes to an output port. Packets wait for their link in output buffers.
[Figure: fabric followed by output buffers.]
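The 58.6% figure is the large-N limit $2 - \sqrt{2} \approx 0.586$ for saturated FIFO input queues with uniformly random destinations. A rough way to see it is to simulate that model. The sketch below assumes every input always has a cell waiting, contention at each output is resolved uniformly at random, and a served input immediately exposes a fresh random head-of-line destination; the function name and parameters are hypothetical.

```python
import random

def hol_throughput(n_ports, n_slots, seed=0):
    """Estimate saturation throughput per port of a FIFO input-queued switch."""
    rng = random.Random(seed)
    # hol[i] = destination of the head-of-line cell at input i (queues never run empty)
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(n_slots):
        contenders = {}
        for inp, dst in enumerate(hol):
            contenders.setdefault(dst, []).append(inp)
        # Each requested output serves exactly one of its contending inputs.
        for dst, inputs in contenders.items():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)   # next cell moves to the head of line
            delivered += 1
    return delivered / (n_ports * n_slots)

estimate = hol_throughput(n_ports=32, n_slots=5000, seed=1)
print(f"simulated: {estimate:.3f}, limit 2 - sqrt(2): {2 - 2 ** 0.5:.3f}")
```

For a finite number of ports the simulated value sits slightly above the limit and approaches it as the switch grows.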

Output queuing
- No HOL blocking: best possible throughput
- Blocking depends only on the arrival pattern
- Can implement different queuing disciplines
- But: the fabric must run N times faster than the inputs

Virtual output queuing
In each input port: a queue for each output port ($N^2$ queues overall).
Can get 100% utilization on traffic with random destinations.
Requires scheduling!
[Figure: inputs with per-output queues and buffer control in front of the fabric.]

Combined Input Output Queuing (CIOQ) Switch
[Figure: inputs 1..N with virtual output queuing, the fabric, and output queues at outputs 1..N.]

Combined Input Output Queuing (CIOQ) Switch
Queues both in the input and in the output.
For a speedup of 1 < S < N, buffering is required in both inputs and outputs.
Can get full emulation of an output-queued switch.

Mimic OQ with CIOQ switch
We'll see that a CIOQ switch with a speedup of 2 can behave identically to any OQ switch.

Problem definition
Find the smallest speedup and the appropriate scheduling algorithm that:
- allows CIOQ to exactly mimic OQ, for any input pattern and any output queue policy
- is independent of switch size
"Exactly mimic" means that under identical input, the departure time of every cell from both switches is identical.

Scheduling Framework
Each time slot is broken into 4 phases:
1. Arrival phase.
2. First scheduling phase.
3. Departure phase.
4. Second scheduling phase.
Scheduling phase: the algorithm chooses a set of cells to send to the output ports.
We use the stable marriage algorithm.

The stable marriage problem
n boys {b ∈ B}, n girls {g ∈ G}.
Each girl g ranks all n boys: $r_g(b) = n$ for the b most preferred by g, $n-1$ for the second best, ..., $r_g(b) = 1$ for the b least preferred by g.
Same for boys: $r_b(g)$.
A matching (n marriages) is stable if there is no unmarried pair who both prefer each other to their spouses.

Example
[Example: three boys and three girls, each with a preference list over the other side; the indices are not recoverable. The first matching shown is not stable, because a boy and a girl prefer each other to their assigned partners. The second matching is stable. Why?]

Stable marriage algorithm
Definitions:
- M: partial matching.
- $B_M$, $G_M$: boys, girls matched in M.
- M(b) = g if (b,g) ∈ M; M(g) = b if (b,g) ∈ M.

Stable marriage algorithm
M := ∅.
Repeat
  Pick g ∈ G - $G_M$ (arbitrary).
  Find the first b in g's list such that either:
  (i) b ∉ $B_M$ (b is not matched yet): M := M ∪ {(b,g)}.
  (ii) b ∈ $B_M$ and $r_b(g) > r_b(M(b))$ (b prefers g to his current match): M := M ∪ {(b,g)} - {(b,M(b))}.
Until |M| = n.
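The algorithm above can be sketched in Python as follows. This is a minimal sketch with assumptions: preference lists are written most-preferred first (instead of the numeric ranks r used in the slides), each girl remembers how far down her list she has already asked (a common shortcut that avoids rescanning), and all names and the sample preferences are hypothetical.

```python
def stable_matching(girl_prefs, boy_prefs):
    """Girl-proposing stable marriage. Preference lists are most-preferred first."""
    # rank[b][g]: position of g in b's list (smaller = more preferred)
    rank = {b: {g: pos for pos, g in enumerate(prefs)} for b, prefs in boy_prefs.items()}
    match_of_boy = {}                        # the partial matching M, indexed by boy
    next_choice = {g: 0 for g in girl_prefs}
    free = list(girl_prefs)                  # girls not matched in M
    while free:
        g = free.pop()                       # pick an arbitrary unmatched girl
        b = girl_prefs[g][next_choice[g]]    # next boy on her list
        next_choice[g] += 1
        current = match_of_boy.get(b)
        if current is None:                  # (i) b is not matched yet
            match_of_boy[b] = g
        elif rank[b][g] < rank[b][current]:  # (ii) b prefers g to his current match
            match_of_boy[b] = g
            free.append(current)
        else:
            free.append(g)                   # b refuses; g tries her next choice later
    return {g: b for b, g in match_of_boy.items()}

def is_stable(matching, girl_prefs, boy_prefs):
    """True if no girl and boy both prefer each other to their assigned partners."""
    boy_of = matching
    girl_of = {b: g for g, b in matching.items()}
    for g, prefs in girl_prefs.items():
        for b in prefs[:prefs.index(boy_of[g])]:            # boys g prefers to her own
            if boy_prefs[b].index(g) < boy_prefs[b].index(girl_of[b]):
                return False
    return True

# Hypothetical preference lists (not the example from the slides).
girl_prefs = {"G1": ["B1", "B2", "B3"], "G2": ["B1", "B3", "B2"], "G3": ["B2", "B1", "B3"]}
boy_prefs = {"B1": ["G2", "G1", "G3"], "B2": ["G1", "G3", "G2"], "B3": ["G1", "G2", "G3"]}
matching = stable_matching(girl_prefs, boy_prefs)
print(matching, is_stable(matching, girl_prefs, boy_prefs))
```

In the switch setting, the two sides of the matching are input ports and output ports rather than girls and boys.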

Stable marriage algorithm
Lemma 1: At all times, M is stable: if $r_g(b) > r_g(M(g))$ then b ∈ $B_M$ and $r_b(g) < r_b(M(b))$.
Lemma 2: $n^2$ iterations suffice.
Given a partial matching M, define a potential φ(M). φ(M) increases by at least one unit in each iteration, and cannot exceed $n^2$. QED

Input & output priority list
Ports rank their counterparts by the deadlines dictated by the emulated OQ switch:
- In an output port, most important is the input port holding the cell with the closest deadline.
- In an input port, most important is the output port to which the most urgent cell is destined.
Stable marriage property: if a cell is not matched in a scheduling phase, then one cell with higher priority (= closer deadline) was matched either to its input or to its output.

Example
Notation: "A 5" means destination = A, deadline = 5.
[Figure: input queues X, Y, Z holding cells labelled with destination and deadline, and output queues A, B, C; the individual deadlines are not recoverable.]
Input X prefers B, then A, then C.
Output A prefers X, then Y, then Z.

Speedup 2 is sufficient
Theorem: The algorithm can emulate any OQ switch, under any arrival pattern.
Proof.
Def: For a time slot t, $L_t(c)$ of cell c is the number of cells in its output with a closer deadline, minus the number of cells preceding c in its input.
Intuitively: how tight the deadline of c is.

Example
[Figure: the same input and output queues, with L computed for two of the cells destined to A; the values are not recoverable.]

Proof (cont.)
Lemma: For all t ≥ 0, $L_{t+1}(c) \ge L_t(c)$.
Proof: In the worst case, L(c) can decrease by 2 in a time slot:
- In the arrival phase, one cell may arrive at the input.
- In the departure phase, one cell may leave the output.
But in each of the two scheduling phases, by the stable marriage property, if c is not scheduled then either the number of cells more urgent than c in its output increases by 1, or the number of cells preceding c in its input decreases by 1. QED

So?
By the Lemma, and since $L_0(c) \ge 0$, we have $L_t(c) \ge 0$ for all t.
Hence, when a cell reaches its deadline (start of time slot), there is no one ahead of it in its input and its output, so the Stable Marriage Algorithm will transfer it to the output (if it is not already there) in the first scheduling phase, and the output queue will transmit it in the departure phase. QED

Practical Algorithms
Objective: find a maximal matching between input ports and output ports.
Note: maximal ≠ maximum! But maximal ≥ ½ maximum (why?)
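The difference between maximal and maximum can be seen on a three-request instance. The sketch below is an illustration only: the function name and the request lists are hypothetical, and the greedy scan is one simple way to get a maximal matching, not necessarily the one a real switch uses.

```python
def greedy_maximal_matching(requests):
    """One pass over (input, output) requests: keep a pair if both ports are still free."""
    used_inputs, used_outputs, matching = set(), set(), []
    for i, o in requests:
        if i not in used_inputs and o not in used_outputs:
            matching.append((i, o))
            used_inputs.add(i)
            used_outputs.add(o)
    return matching

# Input 0 has cells for outputs 0 and 1; input 1 has a cell for output 0 only.
print(greedy_maximal_matching([(0, 0), (0, 1), (1, 0)]))   # maximal, size 1
print(greedy_maximal_matching([(0, 1), (1, 0), (0, 0)]))   # maximum, size 2
```

Both results are maximal, since neither can be extended, but only the second is maximum. The factor of ½ holds in general: every pair of a maximum matching shares a port with some pair of the maximal matching (otherwise it could be added), and each maximal pair has only two ports, so it can account for at most two maximum pairs.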

Practical Algorithm: PIM
Parallel Iterative Matching
1. Each input sends REQUEST to all outputs it needs.
2. Each output selects randomly one of the requesting inputs and sends GRANT.
3. Each input selects randomly one of the granting outputs and sends ACCEPT.
O(log n) iterations guarantee a maximal matching with high probability!

Parallel Iterative Matching
[Figure: iterations #1 and #2 of PIM: requests, grant (random selection), accept/match (random selection).]

iSLIP
[Figure: iterations #1 and #2 of iSLIP: requests, grant (round-robin selection), accept/match (round-robin selection).]

iSLIP Properties
- Random under low load
- TDM under high load
- Lowest priority to most recently matched
- 1 iteration: fair to outputs
- At most N iterations until a request is granted
- Implementation: N priority encoders
- Up to 100% throughput for uniform traffic
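The three PIM steps can be sketched as a sequential simulation. This is a minimal sketch under assumptions: the function name `pim`, the request table, and the choice to iterate until no request remains are hypothetical; real hardware runs the request, grant, and accept steps in parallel and usually stops after a fixed number of iterations.

```python
import random

def pim(requests, rng):
    """Parallel Iterative Matching, simulated sequentially.

    requests: dict mapping each input to the set of outputs it has cells for.
    Returns (match, iterations), where match maps inputs to outputs.
    """
    match, taken = {}, set()
    iterations = 0
    while True:
        # Step 1: every unmatched input requests all unmatched outputs it needs.
        asking = {}
        for i, outs in requests.items():
            if i in match:
                continue
            for o in sorted(outs):
                if o not in taken:
                    asking.setdefault(o, []).append(i)
        if not asking:                       # nothing left to add: matching is maximal
            return match, iterations
        iterations += 1
        # Step 2: every requested output grants one requesting input at random.
        granted = {}
        for o, ins in asking.items():
            granted.setdefault(rng.choice(ins), []).append(o)
        # Step 3: every input that received grants accepts one at random.
        for i, outs in granted.items():
            o = rng.choice(outs)
            match[i] = o
            taken.add(o)

# Hypothetical request pattern for a 4 x 4 switch.
requests = {0: {0, 1}, 1: {0, 2}, 2: {0}, 3: {3}}
match, iterations = pim(requests, random.Random(0))
print(match, iterations)
```

Every iteration with at least one outstanding request adds at least one pair, so the loop ends with a matching that cannot be extended.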

Summary
Implication: a speedup of 2 is enough! But the computational cost of the algorithm is too high.
In practice, maximal matchings are good enough, with moderate speedup.
Maximal matching: a matching that cannot be extended.
- Size: at least half of maximum.
- Can be computed in linear time.
- Can be approximated distributively (random, iSLIP).