Crossbar

- Simple space-division switch
- Crosspoints can be turned on or off

Crossbar - example

- [Figure: crossbar with inputs on one side and outputs on the other, carrying four sessions (input, output)]

Crossbar

- Advantages:
  - simple to implement
  - simple control
  - flexible
- Drawback: again, limited scale
  - number of crosspoints: N^2
  - large VLSI layout area
  - vulnerable to single faults
- Bottleneck: number of pins! Need log n pins for each input port to encode the output port
- Bottom line: good for small N (hundreds)

Combination: Time-space switching

- Precede each input link in a crossbar with a TSI
- Delay samples so that they arrive at the right time for the space-division switch's schedule
- [Figure: two muxes feeding a crossbar, followed by two demuxes; annotations give the crosspoint count and the memory speed, each compared with a larger alternative]
Time-Space: Example

- Need a schedule!
- Internal speed = twice link speed
- [Figure: desired permutation of four inputs passing through mux, crossbar, and demux stages, drawn along a time axis]

Definition of the problem

- Input:
  - a permutation σ of [n]; σ(i) = j means input port i goes to output port j
  - size of crossbar, s × s
  - implied: each crossbar line handles m calls, where m = n/s
- Output: a schedule of length t ≥ m realizing σ
  - for each crossbar input line: the order in which to place samples
  - for each time step: which permutation to realize at the crossbar

Finding a schedule

- Goal: minimize the length of the schedule
- Model by a routing graph:
  - a node for each crossbar input, a node for each crossbar output
  - an edge connects nodes a, b if there is a call from a to b
- Need to find a good edge coloring
- [Figure: routing graph with four calls]

Some Graph Theory

- Edge coloring: assignment of colors to edges so that no two edges incident to the same node have the same color.
- Note: always #colors ≥ max degree.
- Bipartite graph: two sets of nodes, edges only between sets (not inside a set). G = (A, B, E), where E ⊆ A × B.
- Theorem: a graph is bipartite iff all its cycles have even length.
König's Theorem [1916]

The edges of any bipartite graph with maximal degree Δ can be colored in Δ colors.

Proof: By an algorithm.
1. Pick an edge from some node of degree Δ.
2. If it leads to an unvisited node of degree Δ, continue.
3. When stopped: remove path edges number 1, 3, 5, ...
4. Repeat until done; the max degree is then Δ-1, apply induction.

Analysis:
- The procedure is well defined because all cycles are of even length (bipartite graph).
- After the removal step, all nodes on the path have degree < Δ.

Corollary

- Can find an optimal schedule!
- Because any schedule has length at least m, the number of inputs per crossbar input line, and the routing graph has maximal degree m, so m colors (time steps) suffice.
- Computation is a bit time consuming, but it needs to be done only at setup time.

General Space division

- Goal: extend the benefits of crossbars by better topology design
- Graph representation:
  - nodes for inputs, outputs, and internal switches
  - edges for links between them
- Routing: edge-disjoint paths
- Cost:
  - number of crosspoints
  - VLSI layout area

Multistage switch

- In a crossbar, in each step, only one crosspoint may be active in each row or column: a call goes through a single crosspoint (internal switch)
- Multistage switch basic idea: allow calls to go through more than one internal switch
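The corollary can be made concrete with a small sketch. The code below is not the slide's induction proof; it swaps in the standard alternating-path recoloring method, which also colors a bipartite multigraph with Δ colors. The function names, and the assumption that input i sits on crossbar input line i // m and output j on crossbar output line j // m, are illustrative choices, not part of the slides.

```python
from collections import Counter

def color_bipartite_edges(edges):
    """Color the edges (a, b) of a bipartite multigraph with max-degree colors.

    a is a left node, b is a right node. Returns one color per edge.
    """
    deg = Counter()
    for a, b in edges:
        deg[("L", a)] += 1
        deg[("R", b)] += 1
    delta = max(deg.values(), default=0)

    used = {}                      # used[node][color] = (neighbor, edge index)
    colors = [None] * len(edges)

    def free_color(node):
        taken = used.setdefault(node, {})
        return next(c for c in range(delta) if c not in taken)

    for idx, (a, b) in enumerate(edges):
        u, v = ("L", a), ("R", b)
        ca, cb = free_color(u), free_color(v)
        if ca in used[v]:
            # ca is busy at v: collect the path from v alternating ca, cb, ...
            path, node, c = [], v, ca
            while c in used[node]:
                nxt, e = used[node][c]
                path.append((node, nxt, e, c))
                node, c = nxt, (cb if c == ca else ca)
            # Swap the two colors along the path, which frees ca at v.
            for x, y, e, c in path:
                del used[x][c]
                del used[y][c]
            for x, y, e, c in path:
                new = cb if c == ca else ca
                used[x][new] = (y, e)
                used[y][new] = (x, e)
                colors[e] = new
        used[u][ca] = (v, idx)
        used[v][ca] = (u, idx)
        colors[idx] = ca
    return colors

def time_space_schedule(sigma, s):
    """slots[i] = time step in which input i crosses the s x s crossbar.

    Assumes (hypothetically) that port p is attached to crossbar line p // m.
    """
    m = len(sigma) // s            # calls per crossbar line
    edges = [(i // m, sigma[i] // m) for i in range(len(sigma))]
    return color_bipartite_edges(edges)
```

For example, `time_space_schedule([3, 7, 0, 4, 1, 6, 2, 5], 4)` assigns each of the 8 calls to one of m = 2 time steps so that no crossbar line is used twice in the same step.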
Clos networks: (N, n, k)

- [Figure: example network carrying six sessions (input, output)]
- N × N switch, three layers (stages):
  - Layers 1, 3: N/n sub-switches each; a layer-1 sub-switch is n × k, a layer-3 sub-switch is k × n
  - Layer 2: k sub-switches, each N/n × N/n
  - Each layer-2 sub-switch is connected to all layer-1 and layer-3 sub-switches
- # crosspoints: 2(N/n)nk + k(N/n)^2 = 2Nk + k(N/n)^2
- [Figure: example Clos network listing the number and size of the sub-switches in stages 1, 2, and 3]
- For suitable n and k this is less than the N^2 crosspoints of a crossbar

Clos Networks

- Can suffer internal blocking, e.g., if k < n, because the total width of the 2nd stage is k·N/n.
- Types of non-blocking:
  - Strict sense: can find a route from a free input to a free output without changing existing routes
  - Rearrangeable: can route any set of pairs (destroying existing state)

Example

- [Figure: Clos network with N = 8 in which existing routes (marked x) block a new call from input 6]
Clos Theorem [1953]

A Clos network is strict-sense nonblocking iff k ≥ 2n-1.

Proof:
- If k ≥ 2n-1: consider a call from sub-switch a in stage 1 to sub-switch b in stage 3. The number of current calls at a and b is at most 2(n-1), hence there exists a free stage-2 sub-switch to connect a and b.
- If k < 2n-1: let sub-switch a in stage 1 carry n-1 calls through n-1 different stage-2 sub-switches, and let sub-switch b in stage 3 carry calls through all of the remaining k-(n-1) ≤ n-1 stage-2 sub-switches. Every stage-2 sub-switch is then busy at a or at b, so a new call from a to b can't be connected.

Blocking Example

- [Figure: Clos network with sub-switch sizes n × k, (N/n) × (N/n), and k × n, showing a blocked call]

Rearrangeable Clos networks

Theorem [Slepian, Duguid]: Clos (N, n, k) is rearrangeably non-blocking iff k ≥ n.

Proof:
- If rearrangeable then k ≥ n, since layer 2 can connect only kN/n different calls.
- To show sufficiency, build a bipartite routing graph:
  - Nodes: layer 1 and layer 3 switches
  - Edges: calls (input node to output node)
- Max degree is n, so by König's Theorem the edges can be colored in n colors.
- Assign a 2nd-stage switch to each color (coloring = routing!)
- Since k ≥ n, we're done.

Recursive construction: Benes

- Recursive construction: first and last stage are N/2 crossbars of size 2 × 2; middle stage is 2 recursive switches of size N/2 × N/2
- #crosspoints: C(N) = 2 C(N/2) + 4N
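As a quick sanity check on the counts and conditions above, here is a minimal sketch. The function names are made up for illustration, and it assumes n divides N.

```python
def clos_crosspoints(N, n, k):
    """Crosspoints of a three-stage Clos (N, n, k) network; assumes n divides N."""
    r = N // n                      # number of sub-switches in stages 1 and 3
    outer = 2 * r * n * k           # stage 1 (n x k) plus stage 3 (k x n)
    middle = k * r * r              # k sub-switches of size r x r
    return outer + middle

def is_strict_sense_nonblocking(n, k):
    """Clos condition: k >= 2n - 1."""
    return k >= 2 * n - 1

def is_rearrangeable(n, k):
    """Slepian-Duguid condition: k >= n."""
    return k >= n

if __name__ == "__main__":
    N, n = 100, 10
    for k in (10, 19):
        print(k, clos_crosspoints(N, n, k),
              is_rearrangeable(n, k), is_strict_sense_nonblocking(n, k))
```

With N = 100 and n = 10, the rearrangeable choice k = 10 needs 3000 crosspoints and the strict-sense choice k = 19 needs 5700, both below the 10000 of a 100 × 100 crossbar.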
Benes network

- Benes network (back-to-back Butterfly)
- #stages: 2 log N - 1
- #crosspoints: 4N log N - 2N
- [Figure: Benes network]

Greedy permutation routing

- Sufficient to choose whether to go up or down at each intermediate node (solve for the upper and lower sub-switches recursively).
- Idea: go back and forth, alternating up and down.
  - Start from input i, choose up.
  - Get to its output o.
  - From o, go back, choosing down (up is taken!)
  - Continue until completing a cycle, and then start with the remainder.

Example

- [Figure: routing a permutation of inputs 1..8 through the outer stages of a Benes network]

Recursive strictly non-blocking Clos Networks

- #crosspoints: if k = 2n, choose n = (N/2)^(1/2), get a cost of O(N^(3/2)) crosspoints.
- Recursive construction: first and last stage have N^(1/2) sub-switches, each N^(1/2) × 2N^(1/2); middle stage has 2N^(1/2) sub-switches of size N^(1/2) × N^(1/2).
- [Figure: recursive network, from level 0 switches to level r switches]
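The back-and-forth idea can be sketched for the outermost stage only. This is an illustrative reading of the slide, assuming 0-based port numbers and that ports 2j and 2j+1 share a 2 × 2 switch on both the input and the output side; `split_up_down` is a hypothetical name. Routing the two halves recursively is left out.

```python
def split_up_down(perm):
    """Decide, for each input, whether it uses the upper (0) or lower (1) sub-network.

    perm[i] is the output of input i. Ports 2j and 2j+1 share a 2x2 switch
    (an assumption of this sketch), so they must take different sub-networks.
    """
    n = len(perm)
    inv = [0] * n
    for i, o in enumerate(perm):
        inv[o] = i

    side = [None] * n
    for start in range(n):
        i = start
        while side[i] is None:
            side[i] = 0                 # start from input i, choose up
            partner_out = perm[i] ^ 1   # output sharing the switch with i's output
            back = inv[partner_out]     # go back from that output ...
            side[back] = 1              # ... choosing down (up is taken)
            i = back ^ 1                # its switch partner must go up: continue
    return side
```

Each pass of the inner loop walks one cycle of the constraint graph; the outer loop then starts again with the remaining unassigned inputs.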
Strict-sense Non-blocking: Better constructions

- Cantor Networks
  - log N copies of a Benes network
  - Network size N log^2 N
- Sorting networks: self routing
  - Batcher network: depth of about ½ log^2 N comparators
  - AKS network: depth of about 6000 log N comparators

Fabrics: Summary

- Rearrangeably non-blocking networks can be constructed in O(N log N) crosspoints.
- Strictly non-blocking networks can be constructed in O(N log^2 N) crosspoints (Cantor, Batcher-Banyan).
- Theoretical construction: O(N log N) crosspoints for strictly non-blocking. Not practical due to large hidden constants (based on the AKS sorting network).

Next: Switch Scheduling

- Saw: how to deal with a single permutation
- Problem: how to deal with many-to-one demand?
  - Buffering and scheduling strategies
- General traffic model
- Scheduling of a congested link

Switch model

- N×N switch
- Cell: a packet (for simplicity, all packets are the same size).
- All line rates (of inputs and outputs) are the same.
- Time slot: the arrival time between cells (an indicator of the line rate).
Input Queuing

- Fabric implements a partial permutation
- Packets wait in input buffers for the fabric
- [Figure: input buffers in front of the fabric]

Head of Line (HOL) Blocking

- When the first packet can't move, all packets behind it are stuck, even if their destination is free!
- Approximate calculation: suppose destinations are random. Can show: throughput is limited to 2 - √2 ≈ 58.6% for large N.

Output queuing

- Fabric takes a packet from each input in each line clock
- Possibly more than one packet to an output port
- Packets wait for their link in output buffers
- [Figure: fabric followed by output buffers]
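The 58.6% figure can be approximated with a short simulation. This is a rough sketch under assumed conditions that the slides do not spell out: every input queue is always backlogged, destinations are uniform and independent, and each contended output picks one head-of-line cell at random. The function name and parameters are illustrative.

```python
import random

def hol_saturation_throughput(N, slots, seed=0):
    """Fraction of output capacity used by a saturated FIFO input-queued switch."""
    rng = random.Random(seed)
    # hol[i] is the destination of the head-of-line cell at input i.
    hol = [rng.randrange(N) for _ in range(N)]
    served = 0
    for _ in range(slots):
        contenders = {}
        for i, dst in enumerate(hol):
            contenders.setdefault(dst, []).append(i)
        for dst, inputs in contenders.items():
            winner = rng.choice(inputs)       # one cell per output per slot
            hol[winner] = rng.randrange(N)    # the next cell moves to the head
            served += 1
    return served / (N * slots)

if __name__ == "__main__":
    for N in (2, 16, 64):
        print(N, round(hol_saturation_throughput(N, 5000), 3))
```

For N = 2 the throughput settles near 0.75, and it drifts down toward roughly 0.59 as N grows.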
Output queuing

- No HOL blocking: best possible throughput
- Blocking depends only on the arrival pattern
- Can implement different queuing disciplines
- But: the fabric must run N times faster than the inputs
- [Figure: fabric with buffer control at the outputs]

Virtual output queuing

- In each input port: a queue for each output port (N^2 queues overall)
- Can get 100% utilization on traffic with random destinations
- Requires scheduling!
- [Figure: buffer control at each input, feeding the fabric]

Combined Input Output Queuing (CIOQ) Switch

- [Figure: inputs 1..N with virtual output queues, the fabric, and output queues at outputs 1..N]
Combined Input Output Queuing (CIOQ) Switch

- Queues both in the input and in the output.
- For a speedup of 1 < S < N, buffering is required in both inputs and outputs.
- Can get full emulation of an output-queued switch.

Mimic OQ with CIOQ switch

- We'll see that: a CIOQ switch with a speedup of 2 can behave identically to any OQ switch.

Problem definition

- Find the smallest speedup and the appropriate scheduling algorithm that:
  - allows CIOQ to exactly mimic OQ, for any input pattern and any output queue policy
  - is independent of switch size.
- "Exactly mimic" means that under identical input, the departure time of every cell from both switches is identical.

Scheduling Framework

- Each time slot is broken into four phases:
  1. Arrival phase.
  2. First scheduling phase.
  3. Departure phase.
  4. Second scheduling phase.
- Scheduling phase: the algorithm chooses a set of cells to send to output ports
- We use the stable marriage algorithm.
The stable marriage problem

- n boys {b ∈ B}, n girls {g ∈ G}.
- Each girl g ranks all n boys: r_g(b) = n for the b most preferred by g, n-1 for the second best, ..., r_g(b) = 1 for the b least preferred by g.
- Same for boys: r_b(g).
- A matching (n marriages) is stable if there is no unmarried pair who both prefer each other to their spouses.

Example

- [Example: three boys and three girls, each with a full preference list. One perfect matching is not stable, because some boy and some girl both prefer each other to their assigned partners; a second perfect matching is stable. Why?]

Stable marriage algorithm

Definitions:
- M: partial matching
- B_M, G_M: boys, girls matched in M
- M(b) = g if (b,g) ∈ M, M(g) = b if (b,g) ∈ M.

Stable marriage algorithm

M := Ø.
Repeat
  Pick g ∈ G - G_M (arbitrary).
  Find the first b in g's list s.t. either:
    (i) b ∉ B_M (b is not matched yet):
        M := M ∪ {(b,g)}.
    (ii) b ∈ B_M and r_b(g) > r_b(M(b)) (b prefers g to his current match):
        M := M ∪ {(b,g)} - {(b,M(b))}
Until |M| = n.
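A direct transcription of the algorithm into Python might look as follows. It is a minimal sketch: boys and girls are numbered 0..n-1, preference lists are given most-preferred first, and the helper `is_stable` and the sample preference lists are made up for illustration.

```python
def stable_marriage(girl_prefs, boy_prefs):
    """girl_prefs[g] lists boys, most preferred first; boy_prefs[b] likewise.

    Returns a dict mapping each girl g to her partner M(g).
    """
    n = len(girl_prefs)
    # rank[b][g] plays the role of r_b(g): higher means more preferred.
    rank = [{g: n - pos for pos, g in enumerate(boy_prefs[b])} for b in range(n)]
    match_of_boy = {}    # M(b)
    match_of_girl = {}   # M(g)
    while len(match_of_girl) < n:
        g = next(x for x in range(n) if x not in match_of_girl)
        for b in girl_prefs[g]:
            if b not in match_of_boy:
                break                                   # case (i)
            if rank[b][g] > rank[b][match_of_boy[b]]:
                del match_of_girl[match_of_boy[b]]      # case (ii): b leaves M(b)
                break
        match_of_boy[b] = g
        match_of_girl[g] = b
    return match_of_girl

def is_stable(match_of_girl, girl_prefs, boy_prefs):
    """True if no boy and girl both prefer each other to their partners."""
    match_of_boy = {b: g for g, b in match_of_girl.items()}
    for g, prefs in enumerate(girl_prefs):
        for b in prefs:
            if b == match_of_girl[g]:
                break                    # the remaining boys are less preferred by g
            if boy_prefs[b].index(g) < boy_prefs[b].index(match_of_boy[b]):
                return False
    return True

if __name__ == "__main__":
    girls = [[0, 1, 2], [0, 2, 1], [1, 0, 2]]   # hypothetical preferences
    boys = [[1, 0, 2], [0, 1, 2], [0, 1, 2]]
    m = stable_marriage(girls, boys)
    print(m, is_stable(m, girls, boys))
```

The inner loop always finds a boy, since an unmatched girl implies at least one unmatched boy.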
Stable marriage algorithm

- Lemma 1: At all times, M is stable: if r_g(b) > r_g(M(g)) then b ∈ B_M and r_b(g) < r_b(M(b)).
- Lemma 2: n^2 iterations suffice.
  - Given a partial matching M, define a potential φ(M) (for example, the sum of r_b(M(b)) over all matched boys b).
  - φ(M) increases by at least one unit in each iteration, and cannot exceed n^2. QED

Input & output priority list

- Ports rank their counterparts by deadlines dictated by the emulated OQ switch:
  - In an output port, most important is the input port holding the cell with the closest deadline.
  - In an input port, most important is the output port to which the most urgent cell is destined.
- Stable marriage property: if a cell is not matched in a scheduling phase, then one cell with higher priority (= closer deadline) was matched either to its input, or to its output.

Example

- [Figure: input queues X, Y, Z and output queues A, B, C; each cell is labeled with its destination and deadline, e.g. "A 5" means destination = A, deadline = 5]
- Input X prefers B, then A, then C
- Output A prefers X, then Y, then Z

Speedup 2 is sufficient

- Theorem: The algorithm can emulate any OQ switch, under any arrival pattern.
- Proof
  - Def: For a time slot t, L_t(c) of cell c is the number of cells in its output with a closer deadline, minus the number of cells preceding c in its input.
  - Intuitively: how tight is the deadline of c.
Example

- [Figure: the same input queues X, Y, Z and output queues A, B, C, with L(c) computed for two of the cells destined to A; "A 5" means destination = A, deadline = 5]

Proof (cont.)

- Lemma: For all t ≥ 0, L_{t+1}(c) ≥ L_t(c).
- Proof: In the worst case, L(c) can decrease by 2 in a time slot:
  - In the arrival phase, one cell may arrive at the input;
  - In the departure phase, one cell may leave the output.
- But in each of the two scheduling phases: by the stable marriage property, if c is not scheduled, then either the number of cells more urgent than c in its output increases by 1, or the number of cells preceding c in its input decreases by 1. QED

So?

- By the Lemma, and since L_0(c) ≥ 0, L_t(c) ≥ 0 for all t.
- So when a cell reaches its deadline (start of time slot), there's no one ahead of it in its input and its output, so
  - the Stable Marriage Algorithm will transfer it to the output (if not already there) in the first scheduling phase,
  - and the output queue will transmit it in the departure phase. QED

Practical Algorithms

- Objective: find a maximal matching between input ports and output ports
- Note: maximal ≠ maximum!
  - but maximal ≥ ½ maximum (why?)
Practical Algorithm: PIM

Parallel Iterative Matching
1. Each input sends a REQUEST to all outputs it needs.
2. Each output randomly selects one of the requesting inputs and sends a GRANT.
3. Each input randomly selects one of the granting outputs and sends an ACCEPT.

O(log n) iterations guarantee a maximal matching with high probability!

Parallel Iterative Matching

- [Figure: one PIM iteration in three panels: Requests, Grant (random selection), Accept/Match (random selection)]

iSLIP

- [Figure: one iSLIP iteration in three panels: Requests, Grant (round-robin selection), Accept/Match (round-robin selection)]

iSLIP Properties

- Random under low load
- TDM under high load
- Lowest priority to most recently matched
- 1 iteration: fair to outputs
- At most N iterations until a request is granted
- Implementation: N priority encoders
- Up to 100% throughput for uniform traffic
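The three PIM steps can be sketched as a sequential simulation. This is an illustrative sketch, not a hardware description: `pim` and its arguments are hypothetical names, the parallel ports are simulated one after another, and the loop simply runs until no request can still be matched, which yields a maximal matching.

```python
import random

def pim(requests, seed=0):
    """requests: iterable of (input, output) pairs that have queued cells.

    Returns (match, iterations), where match maps input -> output.
    """
    rng = random.Random(seed)
    requests = sorted(set(requests))
    match = {}              # input -> output
    matched_out = set()
    iterations = 0
    while True:
        # Step 1: every unmatched input requests its unmatched outputs.
        reqs = {}
        for i, o in requests:
            if i not in match and o not in matched_out:
                reqs.setdefault(o, []).append(i)
        if not reqs:
            return match, iterations
        # Step 2: each output grants one requesting input at random.
        grants = {}
        for o, inputs in reqs.items():
            grants.setdefault(rng.choice(inputs), []).append(o)
        # Step 3: each input accepts one granting output at random.
        for i, outputs in grants.items():
            o = rng.choice(outputs)
            match[i] = o
            matched_out.add(o)
        iterations += 1
```

Every iteration with a pending request adds at least one pair, so the loop terminates; the number of iterations it takes varies with the random choices.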
Summary

- Implication: a speedup of 2 is enough!
- But the computational cost of the algorithm is too high.
- In practice, maximal matchings are good enough, with moderate speedup.
- Maximal matching: a matching that cannot be extended.
  - Size: at least half of the maximum.
  - Can be computed in linear time
  - Can be approximated distributively (random, iSLIP)
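A maximal matching can be built greedily in one pass over the requests, as in the following sketch (the function name is illustrative). The small example also shows why maximal is not the same as maximum: depending on the order, greedy may stop at half the maximum size.

```python
def greedy_maximal_matching(requests):
    """requests: list of (input, output) pairs. One pass, linear in the list length."""
    used_in, used_out, matching = set(), set(), []
    for i, o in requests:
        if i not in used_in and o not in used_out:
            matching.append((i, o))
            used_in.add(i)
            used_out.add(o)
    return matching

if __name__ == "__main__":
    # Taking (0, 0) first blocks both other requests: size 1.
    print(greedy_maximal_matching([(0, 0), (0, 1), (1, 0)]))
    # A different order reaches the maximum matching of size 2.
    print(greedy_maximal_matching([(0, 1), (1, 0), (0, 0)]))
```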