Spanning Tree and Datacenters
EE 122, Fall 2013, Sylvia Ratnasamy
http://inst.eecs.berkeley.edu/~ee122/
Material thanks to Mike Freedman, Scott Shenker, Ion Stoica, Jennifer Rexford, and many other colleagues
Last time: Self-Learning in Ethernet
- Approach: flood the first packet toward the node you are trying to reach
- Avoids loops by restricting flooding to a spanning tree
- Flooding allows the packet to reach the destination
- And in the process, switches learn how to reach the source of the flood
Missing Piece: Building the Spanning Tree
- Distributed
- Self-configuring: no global information
- Must adapt when failures occur
A convenient spanning tree
- Shortest paths to (or from) a node form a tree
- Covers all nodes in the graph
- No shortest path can have a cycle
Algorithm Has Two Aspects
- Pick a root: the switch with the smallest identifier (MAC address)
  - This will be the destination to which shortest paths go
- Compute shortest paths to the root
  - Only keep the links on shortest paths
  - Break ties by picking the lowest neighbor switch address
- Ethernet's spanning tree construction does both with a single algorithm
Constructing a Spanning Tree
- Messages (Y, d, X): from node X, proposing Y as the root, and advertising a distance d to Y
- Switches elect the node with the smallest identifier (MAC address) as root (the Y in the messages)
- Each switch determines whether a link is on a shortest path from the root, and excludes it from the tree if not (using the d, the distance to Y, in the message)
Steps in Spanning Tree Algorithm
- Initially, each switch proposes itself as the root
  - Example: switch X announces (X, 0, X)
- Switches update their view of the root
  - Upon receiving message (Y, d, Z) from Z, check Y's id
  - If Y's id < current root's id: set root = Y
- Switches compute their distance from the root
  - Add 1 to the shortest distance received from a neighbor
- If the root or the shortest distance to it changed, flood the updated message (Y, d+1, X)
(A minimal sketch of this update rule follows.)
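Below is a minimal Python sketch of the per-switch update rule, assuming the simplified (Y, d, Z) message format from the slides; it illustrates the algorithm above and is not the actual 802.1D STP implementation. The class and method names are invented for the example.

```python
class Switch:
    def __init__(self, my_id):
        self.my_id = my_id
        self.root = my_id      # initially, propose myself as the root
        self.dist = 0          # my distance to the root
        self.parent = None     # neighbor on my shortest path to the root

    def message(self):
        """The (root, distance, sender) announcement this switch floods."""
        return (self.root, self.dist, self.my_id)

    def receive(self, y, d, z):
        """Process announcement (Y, d, Z) from neighbor Z.

        Returns True if our view changed, i.e., we should re-flood."""
        better = (
            y < self.root                                   # smaller root id wins
            or (y == self.root and d + 1 < self.dist)       # shorter path to the root
            or (y == self.root and d + 1 == self.dist       # tie: lower neighbor id
                and self.parent is not None and z < self.parent)
        )
        if better:
            self.root, self.dist, self.parent = y, d + 1, z
            return True
        return False
```

A link then stays in the tree exactly when one of its endpoints is the other's chosen parent; all other links are excluded from forwarding.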
Example (root, dist, from)
[Figure: four snapshots of a seven-switch topology. Initially every switch i announces (i, 0, i); as announcements propagate, switches adopt the smallest id heard so far (here, switch 1) as root and record their distance to it, until the whole network converges on root 1.]
Robust Spanning Tree Algorithm
- Algorithm must react to failures
  - Failure of the root node
  - Failure of other switches and links
- Root switch continues sending messages
  - Periodically re-announcing itself as the root
- Failures detected through timeout (soft state)
  - If no word from the root, time out and claim to be the root! (sketched below)
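Continuing the sketch above, here is one hedged way the soft-state timeout could look; the timeout value and the tick()-based scheduling are illustrative assumptions, not parameters from the lecture.

```python
import time

ROOT_TIMEOUT = 3.0  # seconds with no news of the root before giving up (assumed value)

class SoftStateSwitch(Switch):        # Switch is the class from the sketch above
    def __init__(self, my_id):
        super().__init__(my_id)
        self.last_root_news = time.monotonic()

    def receive(self, y, d, z):
        changed = super().receive(y, d, z)
        if y == self.root:
            self.last_root_news = time.monotonic()   # refresh the soft state
        return changed

    def tick(self):
        """Called periodically; if the root seems dead, claim to be the root."""
        if self.root != self.my_id and \
                time.monotonic() - self.last_root_news > ROOT_TIMEOUT:
            self.root, self.dist, self.parent = self.my_id, 0, None
```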
Problems with Spanning Tree?
- Delay in re-establishing the spanning tree
  - Network is down until the spanning tree is rebuilt
- Much of the network bandwidth goes unused
  - Forwarding happens only over the spanning tree
- A real problem for datacenter networks
Datacenters
What you need to know
- Characteristics of a datacenter environment: goals, constraints, workloads, etc.
- How and why DC networks are different (vs. WAN): e.g., latency, geography, autonomy, ...
- How traditional solutions fare in this environment: e.g., IP, Ethernet, TCP, ARP, DHCP
- Not the details of how datacenter networks operate
Disclaimer
- Material is emerging (not established) wisdom
- Material is incomplete: many details of how and why datacenter networks operate aren't public
What goes into a datacenter (network)?
- Servers organized in racks
- Each rack has a 'Top of Rack' (ToR) switch
- An 'aggregation fabric' interconnects ToR switches
- Connected to the outside via 'core' switches
  - Note: blurry line between aggregation and core
- Network redundancy of ~2x for robustness
Example 1: Brocade reference design (figure)
Example 2: Cisco reference design
[Figure: the Internet at the top, then core routers (CR), aggregation routers (AR), switches (S), and top-of-rack access switches (A), with ~40-80 servers/rack.]
Observations on DC architecture
- Regular, well-defined arrangement
- Hierarchical structure with rack/aggregation/core layers
- Mostly homogeneous within a layer
- Supports communication between servers, and between servers and the external world
- Contrast: the ad-hoc structure and heterogeneity of WANs
Datacenters have been around for a while: 1961, Information Processing Center at the National Bank of Arizona
What's new?
SCALE!
How big exactly?
- ~1M servers [Microsoft] (less than Google, more than Amazon)
- > $1B to build one site [Facebook]
- > $20M/month/site operational costs [Microsoft '09]
- But only O(10-100) sites
What's new?
- Scale
- Service model
  - User-facing, revenue-generating services
  - Multi-tenancy
  - Jargon: SaaS, PaaS, DaaS, IaaS, ...
Implications
- Scale
  - Need scalable solutions (duh)
  - Improving efficiency and lowering cost are critical
  - → 'scale-out' solutions with commodity technologies
- Service model
  - Performance means $$
  - Virtualization for isolation and portability
Multi-Tier Applications
- Applications decomposed into tasks
- Many separate components
- Running in parallel on different machines
Componentization leads to different types of network traffic
- North-South traffic
  - Traffic between external clients and the datacenter
  - Handled by front-end (web) servers, mid-tier application servers, and back-end databases
  - Traffic patterns fairly stable, though with diurnal variations
North-South Traffic
[Figure: user requests arrive from the Internet through a router, pass through front-end proxies to web servers, which consult data caches and back-end databases.]
Componentization leads to different types of network traffic (cont.)
- East-West traffic
  - Traffic between machines in the datacenter
  - Communication within big-data computations (e.g., MapReduce)
  - Traffic may shift on small timescales (e.g., minutes)
East-West Traffic
[Figure: data flows from distributed storage into map tasks, from map tasks to reduce tasks, and from reduce tasks back into distributed storage.]
East-West Traffic
[Figure: the same CR/AR/S/A tree topology, highlighting east-west flows between racks that must climb through the aggregation and core layers.]
East-West Traffic
[Figure: the storage/map/reduce pipeline, annotated:]
- Distributed storage → map tasks: often doesn't cross the network (tasks scheduled near their data)
- Map tasks → reduce tasks: always goes over the network
- Reduce tasks → distributed storage: some fraction (typically 2/3) crosses the network (e.g., with 3-way replication, one copy is written locally and two remotely)
What's different about DC networks? Characteristics
- Huge scale: ~20,000 switches/routers (contrast: AT&T's WAN has ~500 routers)
- Limited geographic scope
  - High bandwidth: 10/40/100G (contrast: DSL/WiFi)
  - Very low RTT: 10s of microseconds (contrast: 100s of milliseconds in the WAN)
- Single administrative domain
  - Can deviate from standards, invent your own, etc.
  - Green-field deployment is still feasible
- Control over one/both endpoints
  - Can change (say) addressing, congestion control, etc.
  - Can add mechanisms for security/policy/etc. at the endpoints (typically in the hypervisor)
- Control over the placement of traffic sources/sinks
  - e.g., the MapReduce scheduler chooses where tasks run, altering the traffic pattern (what traffic crosses which links)
- Regular/planned topologies (e.g., trees/fat-trees)
  - Contrast: ad-hoc WAN topologies (dictated by real-world geography and facilities)
- Limited heterogeneity: link speeds, technologies, latencies, ...
What's different about DC networks? Goals
- Extreme bisection bandwidth requirements
  - Recall: all that east-west traffic
  - Target: any server can communicate at its full link speed
  - Problem: a server's access link is 10 Gbps!
Full Bisection Bandwidth
[Figure: the tree topology again, labeled with what full bisection would require at each layer: 10 Gbps per server access link, O(40x10) Gbps up from each ToR, O(40x10x100) Gbps across the core; ~40-80 servers/rack.]
- Traditional tree topologies 'scale up'
- Full bisection bandwidth is expensive
- Typically, tree topologies are oversubscribed instead (see the back-of-the-envelope sketch below)
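The arithmetic behind the figure's labels, as a tiny script; the per-rack server count is the slide's lower-end value, and the 100-rack count is an assumption chosen to match the O(40x10x100) core label.

```python
# Layer-by-layer demand if every server could send at full access-link speed.
servers_per_rack = 40   # the slide's ~40-80 servers/rack, lower end
access_gbps = 10        # per-server access link
racks = 100             # assumed, to match the O(40x10x100) core label

tor_uplink_gbps = servers_per_rack * access_gbps   # O(40x10) = 400 Gbps per rack uplink
core_gbps = tor_uplink_gbps * racks                # O(40x10x100) = 40,000 Gbps in the core

print(tor_uplink_gbps, core_gbps)  # 400 40000
```

Provisioning the upper layers for these totals is what makes full bisection expensive; oversubscription means provisioning them for less and hoping not everyone talks at once.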
A Scale-Out Design
- Build multi-stage 'fat trees' out of k-port switches
  - k/2 ports up, k/2 down
  - Supports k³/4 hosts; e.g., 48-port switches give 27,648 hosts
- All links are the same speed (e.g., 10 Gbps)
(A small sizing check follows.)
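A quick sanity check of the k³/4 claim; the function name is ours, and the derivation follows the standard fat-tree construction of k pods, each with k/2 edge switches serving k/2 hosts.

```python
def fat_tree_hosts(k: int) -> int:
    """Hosts supported by a 3-stage fat tree of k-port switches."""
    assert k % 2 == 0, "fat trees are built from even-port-count switches"
    hosts_per_pod = (k // 2) * (k // 2)   # k/2 edge switches, each serving k/2 hosts
    return k * hosts_per_pod              # k pods -> k^3/4 hosts total

print(fat_tree_hosts(48))  # 27648, matching the slide's 48-port example
```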
Full Bisection Bandwidth Is Not Sufficient
- To realize full bisection throughput, routing must spread traffic across paths
- Enter load-balanced routing
  - How? (1) Let the network split traffic/flows at random, e.g., ECMP (RFC 2991/2992); a sketch of the idea follows this list
  - How? (2) Centralized flow scheduling (e.g., with SDN; Dec 2nd lecture)
  - Many more research proposals
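To make option (1) concrete, here is a minimal sketch of the ECMP idea from RFC 2991/2992: hash a flow's five-tuple to pick one of the equal-cost next hops, so every packet of a flow follows the same path (avoiding reordering) while different flows spread across paths. Real switches do this in hardware; the hash choice and names here are illustrative.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick a next hop by hashing the five-tuple (per-flow, not per-packet)."""
    five_tuple = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    h = int.from_bytes(hashlib.sha256(five_tuple).digest()[:8], "big")
    return next_hops[h % len(next_hops)]

# Every packet of this flow picks the same path; other flows spread out.
paths = ["agg-switch-1", "agg-switch-2", "agg-switch-3", "agg-switch-4"]
print(ecmp_next_hop("10.0.0.1", "10.0.1.7", 12345, 80, "tcp", paths))
```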
What's different about DC networks? Goals (cont.)
- Extreme latency requirements
  - Real money on the line
  - Current target: 1μs RTTs
  - How? Cut-through switches making a comeback (lecture 2!), reducing switching time
  - How? Avoid congestion, reducing queuing delay
  - How? Fix TCP timers (e.g., the default timeout is 500ms!)
  - How? Fix/replace TCP to fill the pipe more rapidly
What's different about DC networks? Goals (cont.)
- Predictable, deterministic performance
  - 'Your packet will arrive within X ms, or not at all'
  - 'Your VM will always see at least Y Gbps throughput'
  - Resurrecting the 'best effort' vs. 'Quality of Service' debates
  - How is still an open question
- Differentiating between tenants is key
  - e.g., no traffic between VMs of tenant A and tenant B
  - Tenant X cannot consume more than X Gbps
  - Tenant Y's traffic is low priority
  - We'll see how in the lecture on SDN
- Scalability (of course)
  - Q: How's Ethernet's spanning tree looking?
- Cost/efficiency
  - Focus on commodity solutions and ease of management
  - Some debate over its importance in the network case
Announcements/Summary
- I will not have office hours tomorrow; email me to schedule an alternate time
- Recap: datacenters
  - New characteristics and goals: some liberating, some constraining
  - Scalability is the baseline requirement
  - More emphasis on performance
  - Less emphasis on heterogeneity
  - Less emphasis on interoperability