High Performance Switching and Routing Telecom Center Workshop: Sept 4, 1997.
EE384Y: Packet Switch Architectures, Part II
Scaling Crossbar Switches
Nick McKeown
Professor of Electrical Engineering and Computer Science, Stanford University
nickm@stanford.edu
http://www.stanford.edu/~nickm
Outline
Up until now, we have focused on high performance packet switches with:
1. A crossbar switching fabric,
2. Input queues (and possibly output queues as well),
3. Virtual output queues, and
4. A centralized arbitration/scheduling algorithm.
Today we'll talk about the implementation of the crossbar switch fabric itself. How are crossbars built, how do they scale, and what limits their capacity?
Crossbar switch
Limiting factors:
1. N^2 crosspoints per chip, or N x N-to-1 multiplexors,
2. It's not obvious how to build a crossbar from multiple chips,
3. Capacity of I/Os per chip.
State of the art: About 300 pins each operating at 3.125 Gb/s ~= 1 Tb/s per chip.
About 1/3 to 1/2 of this capacity is available in practice because of overhead and speedup.
Crossbar chips today are limited by I/O capacity.
Scaling number of outputs: Trying to build a crossbar from multiple chips
[Figure: a 16x16 crossbar switch laid out from building-block chips, each with 4 inputs and 4 outputs.]
Eight inputs and eight outputs required!
Scaling line-rate: Bit-sliced parallelism
[Figure: a linecard striping each cell across k identical crossbar planes (1, 2, ..., k), all controlled by one scheduler.]
Cell is striped across multiple identical planes.
Crossbar switched bus.
Scheduler makes same decision for all slices.
Scaling line-rate: Time-sliced parallelism
[Figure: a linecard sending successive whole cells to k crossbar planes (1, 2, ..., k) in turn, all controlled by one scheduler.]
Cell carried by one plane; takes k cell times.
Scheduler is unchanged.
Scheduler makes decision for each slice in turn.
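The difference between the two forms of parallelism can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and striping is done per byte here for readability, whereas a hardware design would pick its own slice width.

```python
def bit_slice(cell: bytes, k: int) -> list[bytes]:
    """Stripe one cell across k planes: plane p carries bytes p, p+k, p+2k, ..."""
    return [cell[p::k] for p in range(k)]


def bit_unslice(slices: list[bytes]) -> bytes:
    """Reassemble a cell at the output linecard from its k slices."""
    k = len(slices)
    out = bytearray(sum(len(s) for s in slices))
    for p, piece in enumerate(slices):
        out[p::k] = piece          # put each slice back into its interleaved positions
    return bytes(out)


def time_slice(cells: list[bytes], k: int) -> list[list[bytes]]:
    """Send whole cells to planes in turn: cell t is carried by plane t mod k."""
    planes = [[] for _ in range(k)]
    for t, cell in enumerate(cells):
        planes[t % k].append(cell)
    return planes


cell = bytes(range(16))
slices = bit_slice(cell, 8)        # every plane gets 2 of the 16 bytes
restored = bit_unslice(slices)     # identical to the original cell
per_plane = time_slice([b"a", b"b", b"c"], 2)
```

In the bit-sliced case every plane must use the same crossbar configuration, since all slices of a cell travel together; in the time-sliced case each plane holds a different cell and needs its own decision.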
Scaling a crossbar
Conclusion: scaling the capacity is relatively straightforward (although the chip count and power may become a problem).
What if we want to increase the number of ports?
Can we build a crossbar-equivalent from multiple stages of smaller crossbars?
If so, what properties should it have?
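3-stage Clos Network
[Figure: three-stage Clos network with N inputs and N outputs. First stage: m switches of size n x k. Middle stage: k switches of size m x m. Third stage: m switches of size k x n.]
N = n x m
k >= n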
With k = n, is a Clos network non-blocking like a crossbar?
Consider the example: scheduler chooses to match (1,1), (2,4), (3,3), (4,2)
With k = n, is a Clos network non-blocking like a crossbar?
Consider the example: scheduler chooses to match (1,1), (2,2), (4,4), (5,3), ...
By rearranging matches, the connections could be added.
Q: Is this Clos network rearrangeably non-blocking?
With k = n, a Clos network is rearrangeably non-blocking
Routing matches is equivalent to edge-coloring in a bipartite multigraph.
Colors correspond to middle-stage switches.
Example matches: (1,1), (2,4), (3,3), (4,2)
Each vertex corresponds to an n x k or k x n switch.
No two edges at a vertex may be colored the same.
Vizing '64: a D-degree bipartite graph can be colored in D colors.
Therefore, if k = n, a 3-stage Clos network is rearrangeably non-blocking (and can therefore perform any permutation).
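To make the edge-coloring view concrete, here is a minimal Python sketch. It uses the classical alternating-path recoloring argument for bipartite graphs, which is a swapped-in textbook technique and not one of the faster methods discussed on the following slides. The function name `edge_color` and the choice n = 2 for the example matches are assumptions made for illustration.

```python
def edge_color(edges, num_colors):
    """Color a bipartite multigraph so no two edges at a vertex share a color.

    edges: list of (input_switch, output_switch) pairs.
    Returns color[e] in range(num_colors); needs max degree <= num_colors.
    """
    left, right = {}, {}            # (vertex, color) -> edge index, per side
    color = [None] * len(edges)

    def free(table, vertex):
        return next(c for c in range(num_colors) if (vertex, c) not in table)

    for e, (u, v) in enumerate(edges):
        a, b = free(left, u), free(right, v)
        if (v, a) in right:
            # Color a is busy at v: collect the a/b alternating path from v.
            path, x, on_right, c = [], v, True, a
            while True:
                f = (right if on_right else left).get((x, c))
                if f is None:
                    break
                path.append(f)
                x = edges[f][0] if on_right else edges[f][1]
                on_right = not on_right
                c = b if c == a else a
            # Swap a and b along the path; this frees a at v and keeps a free at u.
            for f in path:
                del left[(edges[f][0], color[f])]
                del right[(edges[f][1], color[f])]
            for f in path:
                color[f] = b if color[f] == a else a
                left[(edges[f][0], color[f])] = f
                right[(edges[f][1], color[f])] = f
        color[e] = a
        left[(u, a)] = e
        right[(v, a)] = e
    return color


# Example matches from the slide, assuming n = 2 ports per edge switch.
n = 2
matches = [(1, 1), (2, 4), (3, 3), (4, 2)]
edges = [((i - 1) // n, (o - 1) // n) for i, o in matches]
middle = edge_color(edges, n)       # middle[j] is the middle-stage switch for matches[j]
```

Each color is a middle-stage switch, so a valid coloring with k = n colors is exactly a conflict-free routing of the permutation.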
How complex is the rearrangement?
Method 1: Find a maximum size bipartite matching for each of D colors in turn, O(D N^2.5).
Method 2: Partition graph into Euler sets, O(N log D) [Cole et al. '00]
Edge-Coloring using Euler sets
Make the graph regular: Modify the graph so that every vertex has the same degree, D. [combine vertices and add edges; O(E)].
For D = 2^i, perform i Euler splits and 1-color each resulting graph. This is log D operations, each of O(E).
Euler partition of a graph
Euler partition of graph G:
1. Each odd degree vertex is at the end of one open path.
2. Each even degree vertex is at the end of no open path.
Euler split of a graph
[Figure: graph G and the two graphs G_1 and G_2 produced by the split.]
Euler split of G into G_1 and G_2:
1. Scan each path in an Euler partition.
2. Place alternate edges into G_1 and G_2.
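The two steps can be sketched in Python as follows. This is a minimal illustration under a stated assumption: every vertex has even degree (as in a regular graph with D = 2^i), so the Euler partition consists only of closed paths. The function name `euler_split` is hypothetical.

```python
from collections import defaultdict


def euler_split(edges):
    """Split a bipartite multigraph with all-even degrees into halves G_1 and G_2.

    edges: list of (input_vertex, output_vertex) pairs.
    Each vertex ends up with exactly half of its edges in each half.
    """
    adj = defaultdict(list)                      # vertex -> incident edge indices
    for e, (u, v) in enumerate(edges):
        adj[("in", u)].append(e)
        adj[("out", v)].append(e)
    assert all(len(es) % 2 == 0 for es in adj.values()), "even degrees assumed"

    used = [False] * len(edges)
    g1, g2 = [], []
    for start in list(adj):
        x, target = start, g1
        while True:
            while adj[x] and used[adj[x][-1]]:
                adj[x].pop()                     # drop edges consumed from the other end
            if not adj[x]:
                break                            # with even degrees, only happens back at start
            e = adj[x].pop()
            used[e] = True
            target.append(edges[e])              # alternate edges between the two halves
            target = g2 if target is g1 else g1
            u, v = edges[e]
            x = ("out", v) if x[0] == "in" else ("in", u)
    return g1, g2


# A 4-regular bipartite multigraph on 2 + 2 vertices (every edge doubled).
doubled = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
half_a, half_b = euler_split(doubled)            # each half is 2-regular
```

Because every closed path in a bipartite graph has even length, alternating along it gives each vertex the same number of edges in both halves, which is what lets the split be repeated log D times.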
Implementation
[Figure: Request graph -> Scheduler -> Permutation -> Route connections -> Paths]
Implementation
Pros:
A rearrangeably non-blocking switch can perform any permutation.
A cell switch is time-slotted, so all connections are rearranged every time slot anyway.
Cons:
Rearrangement algorithms are complex (in addition to the scheduler).
Can we eliminate the need to rearrange?
Strictly non-blocking Clos Network
Clos' Theorem: If k >= 2n - 1, then a new connection can always be added without rearrangement.
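[Figure: three-stage Clos network with N inputs and N outputs. First stage: n x k switches I_1, I_2, ..., I_m. Middle stage: m x m switches M_1, M_2, ..., M_k. Third stage: k x n switches O_1, O_2, ..., O_m.]
N = n x m
k >= n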
Clos Theorem
[Figure: first-stage switch I_a and third-stage switch O_b with the k middle-stage switches between them; n - 1 connections are already in use at the input and at the output.]
1. Consider adding the n-th connection between 1st stage I_a and 3rd stage O_b.
2. We need to ensure that there is always some center-stage M available.
3. If k > (n - 1) + (n - 1), then there is always an M available, i.e. we need k >= 2n - 1.
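The counting argument can be checked with a short Python sketch of the worst case: the n - 1 existing connections at I_a and the n - 1 at O_b occupy disjoint sets of middle-stage switches. The helper names `free_middle` and `worst_case` are hypothetical and exist only for this illustration.

```python
def free_middle(k, used_at_input, used_at_output):
    """Return a middle-stage switch unused at both I_a and O_b, or None if blocked."""
    for mid in range(k):
        if mid not in used_at_input and mid not in used_at_output:
            return mid
    return None


def worst_case(n, k):
    """Try to add the n-th connection when the busy middle switches do not overlap."""
    used_at_input = set(range(n - 1))                 # n - 1 middles busy at I_a
    used_at_output = set(range(n - 1, 2 * (n - 1)))   # a disjoint n - 1 busy at O_b
    return free_middle(k, used_at_input, used_at_output)


blocked = worst_case(4, 6)      # k = 2n - 2: every middle switch is taken
routed = worst_case(4, 7)       # k = 2n - 1: exactly one middle switch is left
```

With k = 2n - 2 the new connection finds no free middle switch, while one extra middle switch is always enough, matching step 3 above.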
Scaling Crossbars: Summary
Scaling capacity through parallelism (bit-slicing and time-slicing) is straightforward.
Scaling number of ports is harder.
Clos network:
Rearrangeably non-blocking with k = n, but routing is complicated.
Strictly non-blocking with k >= 2n - 1, so routing is simple, but it requires more bisection bandwidth.