BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers
Chuanxiong Guo1, Guohan Lu1, Dan Li1, Haitao Wu1, Xuan Zhang1,2, Yunfeng Shi1,3, Chen Tian1,4, Yongguang Zhang1, Songwu Lu1,5
1: Microsoft Research Asia, 2: Tsinghua, 3: PKU, 4: HUST, 5: UCLA
{chguo,lguohan,danil,hwu}@microsoft.com, xuan-zhang05@mails.tsinghua.edu.cn, shiyunfeng@pku.edu.cn, tianchen@mail.hust.edu.cn, ygz@microsoft.com, slu@cs.ucla.edu
Presented by: Rami Jiossy at Technion
Container-based Modular Data Center
A couple of thousand servers (1000-2000) in a 20- to 40-foot shipping container
Difficult to service an MDC once deployed
Sun Microsystems states that such a system can be made operational for 1% of the cost of building a traditional data center
Main benefits:
High mobility; just plug in power, water (cooling), and network
Increased cooling efficiency
Manufacturing & hardware administration savings
BCube Network Architecture
Design and implementation derived from data-intensive applications and MDC requirements
Graceful performance degradation upon server/switch failures
Supports various bandwidth-intensive traffic patterns:
One-to-one
One-to-several
One-to-all
All-to-all
Uses only COTS mini-switches (low cost)
BCube Structure
[Figure: a BCube1 built from four BCube0s — level-1 switches <1,0>..<1,3>, level-0 switches <0,0>..<0,3>, servers 00..33]
Connecting rule: the i-th server in the j-th BCube0 connects to the j-th port of the i-th level-1 switch
Example: server 13 is connected to switches <0,1> and <1,3>
Bigger BCube: 3 levels (k=2)
Notations and Observations
A BCube_k has:
k+1 levels: 0 through k
n-port switches, with the same count (n^k) at each level
n^(k+1) total servers and (k+1)·n^k total switches
Example: n=8, k=3 gives 4 levels connecting 4096 servers using 512 8-port switches at each level
A server is assigned a BCube address (a_k, a_{k-1}, ..., a_0), where a_i ∈ [0, n-1]
Neighboring server addresses differ in exactly one digit [h(A,B) = 1]
How many neighbors does a server have? (k+1)(n-1)
Switches only connect to servers (they act as dummy L2 crossbars)
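The counts above follow directly from the structure; a minimal Python sketch (function name illustrative, not from the paper) makes the bookkeeping concrete:

```python
# Sketch of BCube(n, k) bookkeeping, following the notation on this slide.
# n = switch port count, k = highest level (so k+1 levels in total).

def bcube_stats(n, k):
    servers = n ** (k + 1)               # n^(k+1) servers
    switches_per_level = n ** k          # n^k switches at each of the k+1 levels
    total_switches = (k + 1) * switches_per_level
    # a server has one neighbor per alternative value of each of its k+1 digits
    neighbors_per_server = (k + 1) * (n - 1)
    return servers, switches_per_level, total_switches, neighbors_per_server

# The slide's example: n=8, k=3.
print(bcube_stats(8, 3))  # (4096, 512, 2048, 28)
```

For n=8, k=3 this reproduces the slide's numbers: 4096 servers and 512 switches per level, and shows each server has 4×7 = 28 neighbors.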
BCubeRouting: how to route from server 00 to server 21?
1. Choose a permutation π of the digit indices 0..k
2. Correct the digits of the source address, one per hop, in the order dictated by π
What is the diameter of a BCube network? At most k+1 hops, since two addresses differ in at most k+1 digits.
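The two steps above can be sketched in a few lines of Python (a hedged illustration of the digit-correction idea, not the paper's actual code; addresses are digit lists, and list index i stands for digit a_i):

```python
# Minimal sketch of BCubeRouting: correct the digits of the source address
# toward the destination in the order given by a permutation pi of 0..k.
# Each corrected digit is one hop through a switch at that digit's level.

def bcube_route(src, dst, pi):
    path = [tuple(src)]
    cur = list(src)
    for i in pi:                 # fix one digit per hop
        if cur[i] != dst[i]:
            cur[i] = dst[i]
            path.append(tuple(cur))
    return path

# Route from server 00 to server 21 in a BCube1 (k=1):
print(bcube_route([0, 0], [2, 1], pi=[0, 1]))  # [(0, 0), (2, 0), (2, 1)]
```

Since a digit already equal to the destination's is skipped, the path length equals the number of differing digits, which is why the diameter is k+1.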
Parallel paths in BCube
Two paths between two servers A and B are parallel if they are node- and switch-disjoint.
THEOREM 2. For two servers A = a_k a_{k-1} ... a_0 and B = b_k b_{k-1} ... b_0 with a_i ≠ b_i for all i, and for the permutations
π_0 = [i_0, (i_0 - 1) mod (k+1), ..., (i_0 - k) mod (k+1)]
π_1 = [i_1, (i_1 - 1) mod (k+1), ..., (i_1 - k) mod (k+1)]
with i_0 ≠ i_1 and i_0, i_1 ∈ [0, k], BCubeRouting will produce two parallel paths.
Multi-paths for one-to-one traffic
THEOREM 3. There are k+1 parallel paths between any two servers in a BCube_k (BuildPathSet algorithm).
Useful when a server pair exchanges a large amount of data.
[Figure: parallel paths between a server pair in a BCube1]
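The core idea behind the k+1 paths can be sketched by starting the digit correction at each of the k+1 positions (per Theorem 2). This is a simplified illustration, not the paper's full BuildPathSet: when some digits of A and B already agree, the real algorithm inserts short detours to keep the paths disjoint, which this sketch omits.

```python
# Sketch: k+1 candidate paths between src and dst, one per rotation of the
# digit-correction order. Parallel (node/switch-disjoint) when all digits differ.

def correct_digits(src, dst, order):
    cur, path = list(src), [tuple(src)]
    for i in order:              # one hop per corrected digit
        if cur[i] != dst[i]:
            cur[i] = dst[i]
            path.append(tuple(cur))
    return path

def path_set(src, dst):
    k = len(src) - 1
    base = list(range(k, -1, -1))                 # [k, k-1, ..., 0]
    return [correct_digits(src, dst, base[i:] + base[:i]) for i in range(k + 1)]

# Two paths between 00 and 21 in a BCube1: their intermediate servers differ.
for p in path_set([0, 0], [2, 1]):
    print(p)
```

For the example, one path corrects the level-1 digit first and the other the level-0 digit first, so their intermediate servers are disjoint.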
Speedup for one-to-several traffic
THEOREM 4. Server A and a set of servers {d_i | d_i is A's level-i neighbor} form an edge-disjoint complete graph.
Writing to r servers is r times faster than pipelined replication.
[Figure: parts P1 and P2 delivered over edge-disjoint paths in a BCube1]
Speedup for one-to-all traffic
THEOREM 5. There are k+1 edge-disjoint spanning trees in a BCube_k.
One server transmits to all other servers, e.g., upgrading a system image or distributing application binaries.
A source can deliver a file of size L to all the other servers in time L/(k+1) in a BCube_k.
[Figure: edge-disjoint spanning trees rooted at the source server in a BCube1]
ABT for all-to-all traffic
All-to-all: shuffles data among all servers
Flow = a connection (path) between two servers
Aggregate bottleneck throughput (ABT) = (# flows) × (throughput of the bottleneck flow); it reflects the capacity of the network
In BCube there are no bottleneck links, since all links are used equally
THEOREM 6. The ABT for a BCube network is (n/(n-1))·N, where n is the switch port number and N is the total server count
ABT for BCube therefore increases linearly with the number of servers
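A quick arithmetic check of the linear-growth claim, assuming the Theorem 6 formula ABT = (n/(n-1))·N as stated on this slide (function name illustrative):

```python
# ABT of a BCube(n, k) under all-to-all traffic, per Theorem 6:
# ABT = n/(n-1) * N, where N = n^(k+1) is the total server count.
# For large n the factor n/(n-1) approaches 1, so ABT ~ N: linear in servers.

def abt(n, k):
    N = n ** (k + 1)
    return n / (n - 1) * N

print(abt(8, 3))  # ~4681.1 for the 4096-server example (factor 8/7)
```

Doubling the server count (e.g., adding a level) doubles the ABT, which is the linear scaling the slide refers to.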
BCube Source Routing (BSR)
Server-centric source routing: the source server decides the best path for a flow and encodes the path in the packet header (how is the best path chosen? see path selection)
Intermediate servers only forward packets based on the packet header
Example packet header when sending from server 00 to 13: Path(00,13) = {02, 22, 23, 13}
Path Selection
BSR design goals: scalability and routing performance
Source server:
1. Constructs k+1 paths using BuildPathSet
2. Probes all these paths (no link-status broadcasting)
3. If a path is not found, uses BFS to find an alternative (after removing the nodes of the other selected paths)
4. Uses a metric to select the best path (maximum available bandwidth, or end-to-end delay)
Intermediate servers:
Update the probe's bandwidth field: min(packet bw, in-link bw, out-link bw)
If the next hop is not found, return a failure to the source
Destination server:
Updates the probe's bandwidth field: min(packet bw, in-link bw)
Sends the probe response to the source on the reverse path
While path selection is in progress, the source sends on one of the selected parallel paths, and switches paths if a better one is found
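The probing logic above amounts to a min-over-hops, max-over-paths computation. A hedged sketch, with illustrative names and data (the real probes are packets whose bandwidth field is tightened at each hop):

```python
# Sketch of probe-based selection: each probe ends up carrying the minimum
# available bandwidth along its path; the source keeps the path whose probe
# reported the largest such minimum (the "maximum available bandwidth" metric).

def probe(path_links):
    # the probe's bandwidth field is tightened at every hop: min(pkt bw, link bw)
    bw = float("inf")
    for link_bw in path_links:
        bw = min(bw, link_bw)
    return bw

def select_best(paths):
    # paths: {path_id: [available bandwidth of each hop, in Mb/s]}
    return max(paths, key=lambda p: probe(paths[p]))

paths = {"via <1,0>": [900, 400, 900], "via <1,2>": [700, 700, 650]}
print(select_best(paths))  # "via <1,2>": its bottleneck (650) beats 400
```

The end-to-end delay metric mentioned on the slide would simply replace the min/max with a sum/min over per-hop delays.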
Path Adaptation
The source performs path selection periodically (say, every 10 seconds) to adapt to failures and changing network conditions.
If a failure is reported, the source switches to an available path right away, but waits for the next timer expiry before running the next selection round rather than reselecting immediately.
A random component in the timer avoids path oscillation.
Packet Forwarding
Each server has two components:
Neighbor status table: (k+1)×(n-1) entries
Maintained by the neighbor maintenance protocol (updated upon probing / packet forwarding)
Uses NHI (next-hop index) encoding [DP:DV] for indexing neighbors:
DP: the position of the differing digit (2 bits for 2 levels)
DV: the value of the differing digit (the remaining bits)
Almost static (only the Status field changes)
Packet forwarding procedure:
An intermediate server updates the next-hop MAC address in the packet if the next hop is alive
An intermediate server updates neighbor status from received packets
Only one table lookup per packet

Forwarding table of server 23:
NHI | Output port | MAC   | Status
0:0 | 0           | Mac20 | 1
0:1 | 0           | Mac21 | 1
0:2 | 0           | Mac22 | 0
1:0 | 1           | Mac03 | 0
1:1 | 1           | Mac13 | 1
1:3 | 1           | Mac33 | 1
Path compression and fast packet forwarding
A traditional address array needs 16 bytes: Path(00,13) = {02, 22, 23, 13}
The next-hop index (NHI) array needs only 4 bytes: Path(00,13) = {0:2, 1:2, 0:3, 1:1}
Each forwarding node resolves its next hop with a single lookup of the NHI entry in its table.
[Figure: the path 00 → 02 → 22 → 23 → 13 in a BCube1, with the forwarding table of server 23]
DCell
Graceful degradation
Metric: aggregate bottleneck throughput (ABT) under different server and switch failure rates (simulation-based)
[Figures: ABT vs. server failure rate and vs. switch failure rate, comparing BCube, fat-tree, and DCell]
Routing to external networks
Ethernet has a two-level link-rate hierarchy: 1G for end hosts and 10G for the uplink aggregator
[Figure: gateway servers in a BCube1 connecting 1G server links to a 10G aggregator]
Implementation
Software (kernel):
BCube configuration; TCP/IP protocol driver
BCube intermediate driver: neighbor maintenance, packet send/receive, packet forwarding, available-bandwidth calculation, BSR path probing & selection, flow-path cache
Ethernet miniport driver over interfaces IF 0 .. IF k
Hardware:
Intel PRO/1000 PT quad-port server adapter (server ports)
NetFPGA: neighbor maintenance, packet forwarding, available-bandwidth calculation
Testbed
A BCube testbed with 16 servers (Dell Precision 490 workstations with Intel 2.00GHz dual-core CPU, 4GB DRAM, 160GB disk) in a BCube1 (four BCube0s)
8 mini-switches (D-Link DGS-1008D 8-port Gigabit switch), each used as a 4-port switch
NIC: Intel PRO/1000 PT quad-port Ethernet NIC (only 2 of its 4 ports are used in a BCube1); NetFPGA
Because of the NetFPGA's PCI interface limitation (160Mb/s), the software implementation is used
CPU Overhead for Packet Forwarding
Packet forwarding is ideally placed in hardware.
In the testbed, the MTU is limited to a 9KB threshold.
[Figure: CPU overhead vs. forwarding rate]
Bandwidth-intensive application support
[Figure: per-server throughput]
Support for all-to-all traffic
[Figure: total throughput for all-to-all]
Conclusions
By installing a small number of network ports at each server, using COTS mini-switches as crossbars, and putting routing intelligence at the server side, BCube forms a server-centric architecture
We have shown that BCube significantly accelerates one-to-x traffic patterns and provides high network capacity for all-to-all traffic
The BSR routing protocol further enables graceful performance degradation
Future work will study how to scale the current server-centric design from a single container to multiple containers
Q & A