Baidu's Best Practice with Low Latency Networks


Baidu's Best Practice with Low Latency Networks
Feng Gao
IEEE 802 IC NEND, Orlando, FL, November 2017
Presented by Huawei

01 Low Latency Network Solutions
1. Background Introduction
2. Network Latency Analysis
3. Low Latency Network Solutions
4. Best Practice

Background Introduction
Application drivers: Artificial Intelligence, High Performance Computing, Cloud, Real-Time Big Data Analysis.
Latency-sensitive applications are being developed and deployed in data centers, and the goal has moved from the simple pursuit of high bandwidth and non-blocking fabrics to the pursuit of low latency and no packet loss.
Network design is switching from bandwidth-centric to latency-centric, and latency jitter must be reduced as well.

Network Latency Analysis
L = P + S + N + D + H
  P: photoelectric propagation delay
  S: serialization delay
  N: node forwarding delay
  D: retransmission and queuing delay
  H: host processing delay
Term categories shown on the slide: Fixed, Fundamental, Key.
Ways to reduce each component (keywords from the slide diagram):
  Reduce the device forwarding delay: NP -> pipeline, low latency chip
  Reduce host processing delay: host acceleration (RDMA: InfiniBand, RoCE, iWARP), PCIe
  Reduce serialization delay: 10G -> 40G -> 100G, large capacity chip
  Reduce packet retransmission delay: no packet loss (IB, DCB, ECN)
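The serialization term S in the formula above is the one component that follows directly from link speed, which is why the slide lists the 10G -> 40G -> 100G upgrade against it. A minimal sketch of that arithmetic, assuming a 1500-byte frame (the frame size and the function name are illustrative choices, not values from the slides):

```python
def serialization_delay_us(frame_bytes: int, link_gbps: float) -> float:
    """Time to clock one frame onto the wire, in microseconds."""
    bits = frame_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds * 1e6


FRAME_BYTES = 1500  # assumed frame size, not taken from the slides

for gbps in (10, 40, 100):
    delay = serialization_delay_us(FRAME_BYTES, gbps)
    print(f"{gbps}G link: {delay:.2f} us per {FRAME_BYTES}-byte frame")
```

Each speed step shrinks S proportionally, but the other terms (queuing, retransmission, host processing) are untouched by link speed, which is why the deck treats them separately.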

Low Latency Solution: Host Acceleration
Flow and congestion control combinations shown: Credit + CNP; PFC + QCN; PFC + ECN.
RDMA vs. TCP/IP: the kernel bypass brought by RDMA reduces latency on the host.
RoCEv2:
  Compatible with the current Ethernet-based DCN
  Low CAPEX/OPEX
  Easy to deploy, and existing operational capability can be reused

Low Latency Solution: PFC + ECN
1. PFC (Priority Flow Control) is a back-pressure protocol based on priority queues. A congested node sends a Pause frame to notify its upstream node to stop sending, which prevents buffer overflow and packet loss.
PFC problems:
  HOL blocking
  PFC unfairness
  PFC deadlock
  Pause storm
2. ECN (Explicit Congestion Notification) is a flow-based, end-to-end congestion control mechanism.
(Diagram: NIC, switches (SW), CE marking, and the CNP returned to the sending NIC.)
ECN problems:
  Long control loop
  Randomness
  Complexity of setting thresholds
  Different congestion control mechanisms on NICs
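The back-pressure behaviour described for PFC can be sketched as a queue with two thresholds. This is a toy model under stated assumptions, not switch firmware: the class name, the XOFF/XON threshold names, and the unit of "cells" are illustrative, and the buffer space above XOFF stands in for the headroom that absorbs frames already in flight when the Pause is sent.

```python
class PfcQueue:
    """Toy ingress queue for one priority with XOFF/XON thresholds (in cells)."""

    def __init__(self, xoff: int, xon: int, capacity: int):
        if not (0 <= xon < xoff <= capacity):
            raise ValueError("need 0 <= xon < xoff <= capacity")
        self.xoff = xoff          # depth at which a Pause is sent upstream
        self.xon = xon            # depth at which the upstream may resume
        self.capacity = capacity  # capacity - xoff acts as headroom
        self.depth = 0
        self.paused = False       # True while the upstream is paused
        self.dropped = 0

    def enqueue(self, cells: int = 1) -> None:
        accepted = min(cells, self.capacity - self.depth)
        self.dropped += cells - accepted  # loss only once headroom is exhausted
        self.depth += accepted
        if not self.paused and self.depth >= self.xoff:
            self.paused = True  # congested node sends Pause upstream

    def dequeue(self, cells: int = 1) -> None:
        self.depth = max(0, self.depth - cells)
        if self.paused and self.depth <= self.xon:
            self.paused = False  # queue drained enough, upstream resumes


q = PfcQueue(xoff=8, xon=4, capacity=12)
q.enqueue(8)                  # reaches XOFF, Pause is sent
q.enqueue(4)                  # in-flight frames land in the headroom
print(q.paused, q.dropped)    # True 0
```

The model also makes the HOL-blocking problem visible: the pause applies to the whole priority queue, so flows that are not responsible for the congestion are stopped along with the ones that are.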

Low Latency Solution: Network Architecture Upgrade
(Diagram: two TOR designs. The first shows 12*100GE above and 12*100GE below the TOR; the second shows 16*100GE above and 12*100GE below.)
1. Non-blocking when scaling out: speedup = 1:1
2. Speedup > 1 when fanning out: speedup = 4:3
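The two ratios on this slide follow from the port counts if speedup is read as fabric-facing capacity divided by server-facing capacity at the TOR. That reading, and the function and parameter names, are assumptions made for this sketch:

```python
from fractions import Fraction


def speedup(uplink_ports: int, downlink_ports: int,
            uplink_gbps: int = 100, downlink_gbps: int = 100) -> Fraction:
    """Ratio of total uplink capacity to total downlink capacity (assumed definition)."""
    return Fraction(uplink_ports * uplink_gbps, downlink_ports * downlink_gbps)


print(speedup(12, 12))  # 12*100GE up, 12*100GE down
print(speedup(16, 12))  # 16*100GE up, 12*100GE down
```

With 12 uplinks the ratio reduces to 1 (the 1:1 non-blocking case), and with 16 uplinks it reduces to 4/3, matching the 4:3 figure on the slide.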

Best Practice - 1
Evaluation objects:
1. PFC only vs. ECN+PFC
2. Different network utilization levels and speedup ratios (1:1 and 4:3)
Conclusions:
1. ECN+PFC outperforms PFC alone across the different levels of network utilization.
2. A higher speedup ratio improves the efficiency of the network: the higher, the better.
3. Thresholds should be configured properly: once the headroom is provisioned, the PFC threshold should be set as high as possible. The ECN threshold should be set based on the traffic pattern.

Best Practice - 2
Evaluation objects:
1. DCQCN: PFC+ECN
2. New solution: ECN enabled on TOR downlinks, plus per-packet load balancing
Conclusions:
1. An ideal load balancing algorithm is needed: increasing the speedup ratio can mitigate congestion on the fabric's internal ports, but the packet loss caused by uneven distribution of traffic still exists.

02 Innovations on Low Latency Network Technology
1. Control Plane: Feedback Mechanism
2. Data Plane: Multipath Load Balancing
3. Management Plane: Self-Adaptive Network
4. Function Enhancement: Queuing Optimization

Control Plane: Feedback Mechanism Optimization
Traditional congestion notification (congestion notification / packet loss notification):
  Feedback information is simple: it only marks congested or uncongested and carries no quantized congestion information.
  The notification loop is long: the NIC generates the congestion notification, so the control loop is long. Congestion notification packets are mixed with normal traffic, with no prioritization.
Notification message improvements:
  Introduce a congestion notification mechanism with more quantized levels instead of two states.
  Multiple ways to accelerate feedback:
    The switch feeds back congestion or packet loss directly, shortening the control loop.
    Set a higher priority for notification messages.
    TCP fast retransmission.
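The difference between a two-state mark and a quantized notification can be illustrated with a small mapping from queue depth to a congestion level. The slides do not specify an encoding, so everything here is hypothetical: the function name, the `kmin`/`kmax` thresholds, and the choice of eight levels with linear spacing.

```python
def congestion_level(queue_depth: int, kmin: int, kmax: int, levels: int = 8) -> int:
    """Map a queue depth to a level in [0, levels - 1]; 0 means uncongested.

    Hypothetical encoding: linear steps between kmin and kmax.
    With levels=2 this degenerates to the traditional binary mark.
    """
    if levels < 2 or kmax <= kmin:
        raise ValueError("need levels >= 2 and kmax > kmin")
    if queue_depth <= kmin:
        return 0
    if queue_depth >= kmax:
        return levels - 1
    # Intermediate depths are spread over levels 1 .. levels - 2.
    return 1 + (queue_depth - kmin) * (levels - 2) // (kmax - kmin)


for depth in (50, 150, 300, 499, 600):
    binary = congestion_level(depth, kmin=100, kmax=500, levels=2)
    graded = congestion_level(depth, kmin=100, kmax=500, levels=8)
    print(f"depth={depth:3d}  binary={binary}  quantized={graded}")
```

With the binary mark, depths of 150 and 499 look identical to the sender; the graded value lets a sender scale its reaction to how deep the queue actually is.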

Data Plane: Multipath Load Balancing
The traditional hash algorithm distributes traffic unevenly:
  In a multipath scenario, a hash over the flow 5-tuple may map several elephant flows to the same link, introducing persistent congestion on that link.
New multipath load balancing (dynamic load balancing):
  Select an idle path based on the measured load of the multiple paths.
  Use the length of the egress queue as a hash key for the load balancing algorithm.
  Cut elephant flows into flowlets and schedule them onto different paths while ensuring no packets arrive out of order.
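The flowlet idea above can be sketched as follows: a flow stays on its current path while packets arrive close together, and may move to the least-loaded path only after an idle gap. This is a simplified illustration, not the scheme evaluated in the slides; the class name, the `gap_us` parameter, and the use of bytes sent as the load measure are assumptions. Ordering is only preserved if the gap exceeds the delay difference between paths.

```python
class FlowletBalancer:
    """Toy flowlet-based path selector."""

    def __init__(self, num_paths: int, gap_us: int):
        self.gap_us = gap_us                # idle gap that starts a new flowlet
        self.path_load = [0] * num_paths    # bytes sent per path (stand-in for measured load)
        self.flow_state = {}                # flow id -> (path, last packet time in us)

    def route(self, flow: str, now_us: int, size_bytes: int) -> int:
        state = self.flow_state.get(flow)
        if state is not None and now_us - state[1] < self.gap_us:
            path = state[0]  # same flowlet: keep the path so packets stay in order
        else:
            # New flowlet: pick the currently least-loaded path.
            path = min(range(len(self.path_load)), key=self.path_load.__getitem__)
        self.flow_state[flow] = (path, now_us)
        self.path_load[path] += size_bytes
        return path


lb = FlowletBalancer(num_paths=2, gap_us=100)
print(lb.route("A", 0, 1000))    # first packet of flow A
print(lb.route("A", 50, 1000))   # within the gap: same path
print(lb.route("B", 60, 1000))   # new flow: least-loaded path
print(lb.route("A", 500, 1000))  # after an idle gap: flow A may move
```

Unlike a static 5-tuple hash, two elephant flows that start on the same link get a chance to separate at every flowlet boundary.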

Management Plane: Adaptive Network Analyzer
A low latency network places higher requirements on the automation of operation and maintenance management. Given the strict requirements on packet loss and latency, the network configuration needs to adapt dynamically so that the online configuration is always the best one.
Effect of the adaptive network:
1. Detection and discovery: measure traffic and mark information along the network nodes (timestamp, ingress port, egress port, queue).
2. Computing and characteristic analysis: analyze real-time service characteristics and calculate the optimal scheduling strategy.
3. Instruction distribution and continuous optimization: self-configure and dynamically tune the parameters according to the traffic pattern.

Function Enhancement: Queuing Optimization
(Diagram: back-pressure between nodes.)
Advantages of the technical solution:
  Low latency: isolate the congested flows so that non-congested flows keep low latency.
  High throughput: buffer congested flows along the path and fully utilize the link capacity without slowing down the mice flows.
  Quick response: zero-hop response.
Traffic characteristics:
  Elephant flows: contribute 80 percent of total traffic. Packet loss has little influence on overall performance. Latency-insensitive.
  Mice flows: contribute 20 percent of the data traffic load. Packet loss has a serious influence on overall performance. Latency-sensitive.

03 Summary

Summary
Business orientation: internal requirements from Baidu Cloud and artificial intelligence applications.
Network orientation: under the overall layout of the network, achieve network acceleration within part of the data center network.
Product orientation: promote industrial development; products need to be optimized.
Architecture evolution: invest at small scale, then optimize and iterate gradually.

THANKS