BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers


BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers
Chuanxiong Guo(1), Guohan Lu(1), Dan Li(1), Haitao Wu(1), Xuan Zhang(1,2), Yunfeng Shi(1,3), Chen Tian(1,4), Yongguang Zhang(1), Songwu Lu(1,5)
1: Microsoft Research Asia, 2: Tsinghua, 3: PKU, 4: HUST, 5: UCLA
{chguo,lguohan,danil,hwu}@microsoft.com, xuan-zhang05@mails.tsinghua.edu.cn, shiyunfeng@pku.edu.cn, tianchen@mail.hust.edu.cn, ygz@microsoft.com, slu@cs.ucla.edu
Presented by: Rami Jiossy at Technion

Container-based Modular Data Center (MDC)
A couple of thousand servers (1000-2000) packed into a 20- to 40-foot shipping container; the MDC is difficult to service once deployed. Sun Microsystems states that such a system can be made operational for 1% of the cost of building a traditional data center.
Main benefits: high mobility; just plug in power, water (cooling), and network; increased cooling efficiency; manufacturing and hardware administration savings.

BCube Network Architecture
Design and implementation driven by data-intensive applications and MDC requirements. Graceful performance degradation upon server/switch failures. Supports various bandwidth-intensive traffic patterns: one-to-one, one-to-several, one-to-all, all-to-all. Uses only COTS mini-switches (low cost).

BCube structure
[Figure: a BCube_1 built from four BCube_0s; level-1 switches <1,0>..<1,3>, level-0 switches <0,0>..<0,3>, servers 00..33.]
Connecting rule: the i-th server in the j-th BCube_0 connects to the j-th port of the i-th level-1 switch. For example, server 13 is connected to switches <0,1> and <1,3>.
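A minimal sketch (assumed names, not the authors' code) of the general wiring rule behind this slide: server a_k..a_0 connects at level i to switch <i, a_k..a_{i+1} a_{i-1}..a_0> on port a_i.

```python
from itertools import product

def bcube_links(n, k):
    """Return (server, level, switch, port) tuples for a BCube_k with n-port switches."""
    links = []
    for addr in product(range(n), repeat=k + 1):        # addr = (a_k, ..., a_0)
        for i in range(k + 1):                          # level i uses digit a_i
            digit = addr[k - i]                         # a_i (addr is written high..low)
            switch_id = addr[:k - i] + addr[k - i + 1:] # switch label: address with a_i removed
            links.append((addr, i, (i, switch_id), digit))
    return links

# BCube_1 with 4-port switches: server 13 = (1, 3) attaches to <0,(1,)> port 3
# and <1,(3,)> port 1, i.e. switches <0,1> and <1,3>, matching the slide's example.
for link in bcube_links(4, 1):
    if link[0] == (1, 3):
        print(link)
```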

Bigger BCube: 3 levels (k = 2). [Figure: the corresponding 3-level topology.]

Notations and observations
A BCube_k has:
- k+1 levels, numbered 0 through k
- n-port switches, with the same count at each level (n^k per level)
- n^(k+1) servers in total and (k+1)n^k switches in total
- Example: n = 8, k = 3 gives 4 levels connecting 4096 servers, using 512 eight-port switches at each level.
A server is assigned a BCube address (a_k, a_{k-1}, ..., a_0), where each digit a_i is in [0, n-1]. Neighboring servers' addresses differ in exactly one digit (Hamming distance h(A,B) = 1).
How many neighbors does a server have? (k+1)(n-1). Switches only connect to servers; they act as dumb L2 crossbars.
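A small sanity check (illustrative helper, not from the paper) of the counts quoted on this slide:

```python
def bcube_sizes(n, k):
    servers = n ** (k + 1)            # n^(k+1) servers
    switches_per_level = n ** k       # n^k switches at each of the k+1 levels
    total_switches = (k + 1) * switches_per_level
    neighbors_per_server = (k + 1) * (n - 1)
    return servers, switches_per_level, total_switches, neighbors_per_server

# n = 8, k = 3: 4 levels, 4096 servers, 512 eight-port switches per level,
# 2048 switches overall, 28 neighbors per server.
print(bcube_sizes(8, 3))   # (4096, 512, 2048, 28)
```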

How to route from server 00 to server 21?
1. Decide on a permutation π of the digit indices 0..k.
2. Correct the digits of the source address, one per hop, in the order dictated by π, until the destination address is reached.
What is the diameter of a BCube network? At most k+1 server hops: one hop per digit.
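A sketch (assumed function and parameter names) of this single-path routing idea: walk a permutation of the digit positions and correct one digit per hop.

```python
def bcube_route(src, dst, perm):
    """src, dst: address tuples (a_k, ..., a_0); perm: the order in which
    digit positions are corrected. Returns the list of servers visited."""
    path, cur = [], list(src)
    for pos in perm:                  # pos indexes into the address tuple
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]       # one digit corrected -> one server hop
            path.append(tuple(cur))
    return path

# Route 00 -> 21 in a BCube_1, correcting the high digit first, then the low one:
print(bcube_route((0, 0), (2, 1), perm=[0, 1]))   # [(2, 0), (2, 1)]
```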

Parallel paths in BCube
Two paths between servers A and B are parallel if they are node- and switch-disjoint.
Theorem 2. If two servers A = a_k a_{k-1} ... a_0 and B = b_k b_{k-1} ... b_0 satisfy a_i != b_i for every i, then for the permutations
  π_0 = [i_0, (i_0 - 1) mod (k+1), ..., (i_0 - k) mod (k+1)]
  π_1 = [i_1, (i_1 - 1) mod (k+1), ..., (i_1 - k) mod (k+1)]
with i_0 != i_1 and i_0, i_1 in [0, k], BCubeRouting produces two parallel paths.

Multi-paths for one-to-one traffic
Theorem 3. There are k+1 parallel paths between any two servers in a BCube_k (the BuildPathSet algorithm).
Useful when a server pair exchanges a large amount of data.
[Figure: BCube_1 with switches <1,0>..<1,3>, <0,0>..<0,3> and servers 00..33, showing the parallel paths.]
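A simplified sketch of the BuildPathSet idea, under the assumption that A and B differ in every digit (Hamming distance k+1): starting the digit correction at each of the k+1 levels in turn (the rotated permutations of Theorem 2) yields k+1 parallel paths. The full algorithm in the paper also handles matching digits by detouring through a neighbor; this sketch omits that case.

```python
def rotated_perms(k):
    """The k+1 rotations of the digit order, one starting at each position."""
    return [[(start - j) % (k + 1) for j in range(k + 1)] for start in range(k + 1)]

def route(src, dst, perm):
    """Same digit-correction routing as in the earlier sketch."""
    path, cur = [], list(src)
    for pos in perm:
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]
            path.append(tuple(cur))
    return path

A, B, k = (0, 0), (2, 1), 1            # BCube_1 example; h(A, B) = k + 1 = 2
for perm in rotated_perms(k):
    print(route(A, B, perm))            # [(2, 0), (2, 1)] and [(0, 1), (2, 1)]
```

The two printed paths go through different intermediate servers (20 and 01) and different switches, so they are parallel in the sense defined above.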

Speedup for one-to-several traffic
Theorem 4. Server A and the set of servers {d_i | d_i is A's level-i neighbor} form an edge-disjoint complete graph.
Writing to r servers is therefore r times faster than pipelined replication.
[Figure: BCube_1 where the source stripes chunks P1 and P2 to two different neighbors.]

Speedup for one-to-all traffic
Theorem 5. There are k+1 edge-disjoint spanning trees in a BCube_k.
One-to-all: one server transmits to all other servers, e.g., upgrading a system image or distributing application binaries. By striping the file across the k+1 spanning trees, a source can deliver a file of size L to all the other servers in time L/(k+1) in a BCube_k.
[Figure: BCube_1 with the source server and servers 00..33.]

ABT for all-to-all traffic
All-to-all traffic shuffles data among all servers. A flow is a connection (path) between two servers.
Aggregate bottleneck throughput (ABT) = (number of flows) x (throughput of the bottleneck flow); it reflects the capacity of the network.
In BCube there are no bottleneck links, since all links are used equally.
Theorem 6. The ABT of a BCube network is n(N-1)/(n-1), where n is the switch port count and N is the total server count. The ABT of BCube therefore increases linearly with the number of servers.
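A worked example (illustrative helper) of Theorem 6's formula:

```python
def bcube_abt(n, k):
    """ABT = n(N-1)/(n-1) in units of the per-link rate, with N = n^(k+1) servers."""
    N = n ** (k + 1)
    return n * (N - 1) / (n - 1)

# n = 8, k = 3 (4096 servers): ABT = 8 * 4095 / 7 = 4680 times the per-link rate.
# Since n/(n-1) is close to 1, ABT grows roughly linearly with the server count N.
print(bcube_abt(8, 3))   # 4680.0
```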

BCube Source Routing (BSR)
Server-centric source routing: the source server decides the best path for a flow and encodes the path in the packet header (how "best" is chosen is described next). Intermediate servers only forward packets based on the packet header.
Packet header when sending from server 00 to 13: Path(00,13) = {02,22,23,13}.

Path selection
BSR design goals: scalability and performance (no link-state broadcasting).
Source server:
1. Constructs k+1 parallel paths using BuildPathSet.
2. Probes all of these paths.
3. If a path is not found, uses BFS to find an alternative (after removing the links of the other selected paths).
4. Uses a metric to select the best path: maximum available bandwidth, then end-to-end delay.
Intermediate servers: update the probe's bandwidth field to min(packet bw, incoming-link BW, outgoing-link BW); if the next hop is not found, return a failure to the source.
Destination server: updates the bandwidth field to min(packet bw, incoming-link BW) and sends the probe response back to the source on the reverse path.
During path selection, the source sends on one of the selected parallel paths and switches paths if a better one is found.
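A sketch (with assumed data shapes and illustrative probe values) of the selection metric described above: a path's available bandwidth is the minimum over its hops, and the source keeps the probed path with the largest available bandwidth, breaking ties by lower end-to-end delay.

```python
def path_available_bw(hop_bws):
    """Available bandwidth of a path is its bottleneck hop."""
    return min(hop_bws)

def select_best(probed):
    """probed: list of (path, per-hop bandwidths in Mb/s, end-to-end delay in s)."""
    return max(probed, key=lambda p: (path_available_bw(p[1]), -p[2]))[0]

# Illustrative probe results for flows from server 00 to server 13.
probes = [
    (["02", "22", "23", "13"], [900, 400, 950, 800], 0.21),   # bottleneck 400
    (["03", "13"],             [600, 650],           0.10),   # bottleneck 600
    (["10", "13"],             [700, 300],           0.12),   # bottleneck 300
]
print(select_best(probes))   # ['03', '13'] wins on available bandwidth
```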

Path adaptation
The source performs path selection periodically (say, every 10 seconds) to adapt to failures and changing network conditions. If a failure is reported, the source switches to an available path and waits for the next timer to expire before running the next selection round, rather than re-probing immediately. Randomness (jitter) is added to the timer to avoid path oscillation.

Packet forwarding
Each server maintains two components:
- A neighbor status table with (k+1)(n-1) entries, maintained by the neighbor maintenance protocol and updated upon probing and packet forwarding. Neighbors are indexed by the next hop index (NHI) encoding [DP:DV], where DP is the position of the differing digit (2 bits suffice for 2 levels) and DV is the value of that digit (the remaining bits). The table is almost static (only the status changes).
- The packet forwarding procedure: an intermediate server updates the packet's next-hop MAC address if the next hop is alive, updates neighbor status from received packets, and needs only one table lookup per packet. (An example forwarding table for server 23 is shown on the next slide.)

Path compression and fast packet forwarding
A traditional address-array path needs 16 bytes: Path(00,13) = {02,22,23,13}. The next hop index (NHI) array needs only 4 bytes: Path(00,13) = {0:2, 1:2, 0:3, 1:1}.
Forwarding table of server 23:
  NHI | Output port | MAC   | Status
  0:0 | 0           | Mac20 | 1
  0:1 | 0           | Mac21 | 1
  0:2 | 0           | Mac22 | 0
  1:0 | 1           | Mac03 | 0
  1:1 | 1           | Mac13 | 1
  1:3 | 1           | Mac33 | 1
[Figure: BCube_1 topology illustrating the forwarding of the path from 00 to 13.]
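A sketch (illustrative helper, not the kernel-driver code) of the NHI compression: each hop is encoded as DP:DV, where DP is the digit position (level) that changes at that hop and DV is the new value of that digit. It reproduces the slide's example.

```python
def to_nhi(src, path):
    """src: address string like '00'; path: list of next-hop address strings."""
    nhi, prev = [], src
    for hop in path:
        diffs = [i for i in range(len(hop)) if hop[i] != prev[i]]
        assert len(diffs) == 1, "consecutive servers on a BCube path differ in one digit"
        pos = diffs[0]
        level = len(hop) - 1 - pos    # string index 0 is the highest digit (level k)
        nhi.append(f"{level}:{hop[pos]}")
        prev = hop
    return nhi

# Path(00,13) = {02,22,23,13} compresses to {0:2, 1:2, 0:3, 1:1}, as on the slide.
print(to_nhi("00", ["02", "22", "23", "13"]))   # ['0:2', '1:2', '0:3', '1:1']
```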

DCell. [Figure: the DCell structure, shown for comparison.]

Graceful degradation
Metric: aggregate bottleneck throughput (ABT) under different server and switch failure rates (simulation based).
[Figures: ABT versus server failure rate and versus switch failure rate, comparing BCube, fat-tree, and DCell.]

Routing to external networks
Ethernet has a two-level link-rate hierarchy: 1G for end hosts and 10G for the uplink aggregator.
[Figure: BCube_1 with a 10G aggregator at level 1 and four gateway servers connecting to the external network.]

Implementation
Software: BCube configuration; the TCP/IP protocol driver (application/kernel); a BCube intermediate driver sitting between the TCP/IP stack and the Ethernet miniport driver, spanning interfaces IF 0 .. IF k. The BCube driver implements neighbor maintenance, packet send/receive, packet forwarding, available-bandwidth calculation, BSR path probing and selection, and a flow-path cache.
Hardware: Intel PRO/1000 PT Quad Port Server Adapter; NetFPGA (neighbor maintenance, packet forwarding, and available-bandwidth calculation offloaded to hardware); server ports.

Testbed
A BCube testbed with 16 servers (Dell Precision 490 workstations with an Intel 2.00 GHz dual-core CPU, 4 GB DRAM, and a 160 GB disk), arranged as a BCube_1 built from four BCube_0s.
8 four-port mini-switches (D-Link DGS-1008D 8-port Gigabit switches).
NIC: Intel Pro/1000 PT quad-port Ethernet NIC; only 2 of the 4 ports are used.
NetFPGA: because of PCI interface limitations (160 Mb/s), the software implementation is used instead.

CPU overhead for packet forwarding
Packet forwarding is ideally placed at the hardware level. In the testbed the MTU is limited to a 9 KB threshold (jumbo frames) to reduce per-packet CPU overhead.

Bandwidth-intensive application support. [Figure: per-server throughput.]

Support for all-to-all traffic. [Figure: total throughput for all-to-all.]

Conclusions
By installing a small number of network ports on each server, using COTS mini-switches as crossbars, and placing the routing intelligence on the server side, BCube forms a server-centric architecture.
We have shown that BCube significantly accelerates one-to-x traffic patterns and provides high network capacity for all-to-all traffic. The BSR routing protocol further enables graceful performance degradation.
Future work will study how to scale the current server-centric design from a single container to multiple containers.

Q & A