Advanced Computer Networks. Flow Control

Advanced Computer Networks 263-3501-00: Flow Control. Patrick Stuedi, Spring Semester 2017. Oriana Riva, Department of Computer Science, ETH Zürich. 1

Last week TCP in Datacenters Avoid incast problem - Reduce RTO_MIN Avoid buffer overflow - DCTCP: Proportional window decrease Make sure flows meet their deadlines - D3: Allocate deadline-proportional bandwidth in routers - D2TCP: Deadline proportional variation of sender rate Use all the available bandwidth in Fat Trees - Multipath TCP 2

Today Flow Control Store-and-forward, cut-through, wormhole Head-of-line blocking Infiniband Lossless Ethernet Flow coordination 3

Flow Control Basics 4

Where to best put flow control So far we have discussed the TCP/IP/Ethernet stack. TCP flow control: avoid receiver buffer overflow. TCP congestion control: avoid switch buffer overflow. TCP's congestion control is reactive: first lose packets, then adjust the data rate. Is reactive congestion control a good choice at a 1 Gbit/s data rate? 10 Gbit/s? 100 Gbit/s? 5
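
A back-of-the-envelope sketch (not from the slides) of why reactivity gets more expensive as links get faster: it estimates how much data a sender pushes into the network during one reaction delay (roughly one RTT) before a reactive scheme such as TCP can notice loss and slow down. The 100 microsecond RTT is an assumed, typical datacenter value.

RTT_S = 100e-6  # assumed reaction delay: 100 microseconds

for gbit in (1, 10, 100):
    rate_bytes_per_s = gbit * 1e9 / 8
    in_flight = rate_bytes_per_s * RTT_S
    print(f"{gbit:>3} Gbit/s: ~{in_flight / 1024:.0f} KiB sent per reaction delay")

# Approximate output: 12 KiB at 1 Gbit/s, 122 KiB at 10 Gbit/s, 1221 KiB at 100 Gbit/s.
# The faster the link, the more data is already in flight (and potentially lost)
# by the time a reactive controller reacts.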

Supercomputer interconnect technologies (2003-2014) [Chart: share of Top500 interconnects over time, including Gigabit Ethernet, 10G Ethernet, and Infiniband DDR/QDR/FDR.] Infiniband: low latency, high bandwidth interconnect technology. Layers similar to TCP/IP: physical, link, routing, transport. Key differentiating factors: link-level flow control (today), kernel bypass and new network semantics (next week). 6-7

Flow control Determines how a network's resources are allocated to packets traversing the network. Flow-controlled network resources: channel bandwidth, buffer capacity, control state. Goal: allocate resources in an efficient manner to achieve a high fraction of the network's ideal bandwidth and deliver packets with low, predictable latency. 8

Flow control mechanisms Bufferless: Drop or misroute packets if the outgoing channel at a switch is blocked. Example: the red packet cannot be forwarded on the requested channel because the channel is occupied by the blue packet; route the red packet on the upper channel instead and re-route it later. Circuit switching: Only headers are buffered; the header reserves a path, and header forwarding is paused if the path is not available. Buffered: Decouples channel allocation in time. 9

Flits A packet consists of a head flit, body flits, and a tail flit. Packets are MTU-sized, typically several thousand bytes. Switch buffering and forwarding often works at a smaller granularity (several bytes). Flit: FLow control unIT. Packets are divided into flits by the hardware, typically without extra headers. 10
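
A minimal sketch (not from the slides) of how an MTU-sized packet could be segmented into flits, tagging the first flit as head and the last as tail. The 16-byte flit size is an arbitrary assumption for illustration.

FLIT_SIZE = 16  # assumed flit payload size in bytes

def packet_to_flits(packet: bytes):
    chunks = [packet[i:i + FLIT_SIZE] for i in range(0, len(packet), FLIT_SIZE)]
    flits = []
    for idx, chunk in enumerate(chunks):
        kind = "head" if idx == 0 else "tail" if idx == len(chunks) - 1 else "body"
        flits.append((kind, chunk))  # hardware typically adds no extra per-flit headers
    return flits

flits = packet_to_flits(b"x" * 1500)          # one MTU-sized packet
print(len(flits), flits[0][0], flits[-1][0])  # 94 head tail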

Time-space diagram: bufferless flow control [Diagram: a packet's flits (H, B, B, B, T) advance over channels 0-3 across cycles 0-17; ACKs and NACKs travel on the reverse channel. The packet cannot acquire channel 3 and is dropped; the NACK leads to a retransmission that eventually completes and is acknowledged.] 11

Bufferless flow control using timeouts [Time-space diagram: flits (H, B, B, B, T) on channels 0-3 across cycles 0-21.] The packet is unable to acquire channel 3 in cycle 3 and is dropped. Preceding channels continue to transmit the packet until the tail flit is received. Eventually a timeout triggers retransmission. 12

Pros/cons of bufferless flow control Simple. Inefficient: valuable bandwidth is used for packets that are dropped later. Misrouting does not drop packets, but may lead to instability (a packet may never reach its destination). 13

Buffered flow control: overview Store-and-forward: Channel and buffer allocation on a per-packet basis. Receives full packets into a buffer and forwards them after they have been received. High latency (each switch waits for the full packet). Cut-through: Channel and buffer allocation on a per-packet basis. Forwards the packet as soon as the first (header) flit arrives and outgoing resources are available. Low latency, but still blocks the channel for the duration of a whole packet transmission. Wormhole: Buffers are allocated on a per-flit basis. Low latency and efficient buffer usage. 14
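
A rough latency sketch (not from the slides) contrasting store-and-forward and cut-through on an uncongested path, ignoring per-hop routing delay. The hop count, link rate, packet size and head-flit size are illustrative assumptions.

HOPS = 3             # assumed number of switches traversed
LINK_BPS = 10e9      # assumed 10 Gbit/s links
PKT_BITS = 1500 * 8  # one MTU-sized packet
HDR_BITS = 16 * 8    # assumed head-flit size

serialization = PKT_BITS / LINK_BPS        # time to push the packet onto one link

# Store-and-forward: every hop waits for the full packet before forwarding.
saf_latency = (HOPS + 1) * serialization

# Cut-through: only the head flit is delayed per hop; the rest pipelines behind it.
ct_latency = HOPS * (HDR_BITS / LINK_BPS) + serialization

print(f"store-and-forward: {saf_latency * 1e6:.2f} us")  # 4.80 us
print(f"cut-through:       {ct_latency * 1e6:.2f} us")   # 1.24 us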

Store-and-forward flow control [Time-space diagram: the full packet (H, B, B, B, T) is received on each of the channels 0-3 before being forwarded on the next, spanning cycles 0-21.] The packet must acquire a full packet buffer at the next hop, plus channel bandwidth, before forwarding. The entire packet is transmitted over the channel before proceeding to the next channel. High latency! 15

Cut-through flow control [Two time-space diagrams over channels 0-3 and cycles 0-10: without contention the head flit is forwarded immediately at every hop and the packet pipelines through; with contention the packet stalls at the blocked channel before resuming.] The packet must acquire a full packet buffer at the next hop, and channel bandwidth, before forwarding. But: each flit of the packet is forwarded as soon as it is received. 16

Wormhole routing Just like cut-through, but with buffers allocated to flits (not to packets). A head flit must acquire two resources before forwarding: one flit buffer at the next switch, and one flit of channel bandwidth. Consumes much less buffer space than cut-through flow control. 17

Head-of-line (HoL) blocking in wormhole switching Intended routing of packets: A is going from Node-1 to Node-4; B is going from Node-0 to Node-5. Situation: B is blocked at Node-3 because it cannot acquire a flit buffer at Node-5. HoL blocking: A cannot progress to Node-4 even though the channels it needs are idle. No intermixing of packets on a channel is possible in wormhole switching, and B is locking the channel from Node-0 to Node-3. 18

Virtual channels Each switch has multiple virtual channels per physical channel. Each virtual channel contains separate buffer space. A head flit must acquire two resources before forwarding: a virtual channel on the next switch (including buffer space), and channel bandwidth at the switch. 19

Virtual channel flow control Example: A can proceed even though B is blocked at Node-3 20
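
A minimal sketch (not from the slides) of two virtual channels multiplexing one physical channel: a packet blocked on one VC does not prevent the other VC from using the link, which is how A can bypass the blocked B in the example above. Class and method names are illustrative.

from collections import deque

class PhysicalChannel:
    def __init__(self, num_vcs=2):
        self.vcs = [deque() for _ in range(num_vcs)]   # per-VC flit buffers
        self.blocked = [False] * num_vcs               # VC cannot drain downstream

    def enqueue(self, vc, flit):
        self.vcs[vc].append(flit)

    def transmit_one_cycle(self):
        # Simple arbitration: pick the first VC that has flits and is not blocked.
        for vc, queue in enumerate(self.vcs):
            if queue and not self.blocked[vc]:
                return vc, queue.popleft()
        return None

ch = PhysicalChannel()
ch.enqueue(0, "B-head")
ch.enqueue(1, "A-head")
ch.blocked[0] = True                 # B is blocked downstream (as at Node-3)
print(ch.transmit_one_cycle())       # (1, 'A-head') -> A still makes progress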

Buffer management How to communicate the availability of buffers between switches? Common types used today: Credit-based - Switch keeps a count of the number of free flit buffers at the downstream switch (credits) - Counter is decreased when sending a flit to the downstream switch - Stop sending when the counter reaches zero - Downstream switch sends back a signal to increment the credit counter when a buffer is freed (flit forwarded) On/off - Downstream switches send an on or off flag to start and stop the incoming flit stream 21
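
A minimal sketch (not from the slides) of the credit-based scheme between an upstream and a downstream switch; the buffer count and flit representation are illustrative assumptions, and credit-return latency is ignored.

from collections import deque

class CreditLink:
    def __init__(self, downstream_buffers: int):
        self.credits = downstream_buffers   # one credit per free flit buffer downstream
        self.downstream = deque()

    def send_flit(self, flit) -> bool:
        if self.credits == 0:
            return False                    # stop sending: no free buffer downstream
        self.credits -= 1                   # consume one credit per flit sent
        self.downstream.append(flit)
        return True

    def downstream_forwards_flit(self):
        self.downstream.popleft()           # buffer freed when the flit moves on
        self.credits += 1                   # downstream returns a credit upstream

link = CreditLink(downstream_buffers=4)
print([link.send_flit(f"flit{i}") for i in range(6)])  # [True, True, True, True, False, False]
link.downstream_forwards_flit()
print(link.send_flit("flit4"))                         # True again once a credit came back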

Credit-based flow control [Timeline between Node 1 and Node 2: a credit travels upstream, a flit is sent and processed, and a new credit is returned; the time between successive uses of the same buffer is the credit round-trip delay tcrt.] Need enough buffers to hide the round-trip time. With only one flit buffer: throughput = Lf / tcrt. With F buffers: throughput = F * Lf / tcrt (up to the channel bandwidth). Lf: flit length, F: number of flit buffers. 22
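
A worked example (not from the slides) of the formula above: the number of flit buffers F needed so that F * Lf / tcrt reaches the full channel bandwidth. The flit size, bandwidth and round-trip delay are assumed values.

import math

Lf = 16 * 8     # assumed flit length: 16 bytes, in bits
b = 100e9       # assumed channel bandwidth: 100 Gbit/s
tcrt = 500e-9   # assumed credit round-trip delay: 500 ns

# Saturate the link when F * Lf / tcrt >= b, i.e. F >= tcrt * b / Lf
F_min = math.ceil(tcrt * b / Lf)
print(F_min)                       # 391 flit buffers
print(F_min * Lf / tcrt / 1e9)     # ~100.1 Gbit/s achievable throughput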

On/off flow control [Timeline between Node 1 and Node 2: flits stream downstream; the downstream node sends an 'off' signal, keeps processing, and later sends an 'on' signal to restart the stream; trt is the round-trip delay of the control signal.] The downstream switch sends 'off' if its free buffer count falls below a limit Foff, and sends 'on' if its free buffer count rises above Fon. To keep the flits still in flight during the round trip from overflowing the buffer: Foff >= trt * b / Lf. b: bandwidth, Lf: flit length. 23
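
A short worked example (not from the slides) of the Foff bound for two assumed control round-trip delays, showing how the required free-buffer headroom grows with the delay; all numbers are assumptions.

import math

b = 100e9     # assumed 100 Gbit/s link
Lf = 16 * 8   # assumed 16-byte flits, in bits

for trt in (500e-9, 5e-6):  # assumed 500 ns (short link) vs 5 us (long link)
    print(f"trt = {trt * 1e9:.0f} ns -> Foff >= {math.ceil(trt * b / Lf)} free flit buffers")

# -> 391 buffers at 500 ns, 3907 at 5000 ns: after sending 'off', the switch must still
#    absorb a full round-trip worth of in-flight flits.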

Infiniband [Image: Mellanox ConnectX-4 adapter] Interconnect technology designed for high bandwidth and low latency. Bandwidth: up to 100 Gbit/s (EDR, Mellanox ConnectX-4). RTT latencies: 1-2 us. Layered architecture: physical layer; link layer (credit-based flow control, virtual lanes); network layer; transport layer (reliable/unreliable, connection/datagram). 24

Ethernet: PFC Priority Flow Control (PFC, IEEE 802.1Qbb): Pause/start flow control for individual priority classes. A PFC frame is sent to the immediate upstream entity (NIC or switch). PFC issues: PFC is coarse grained: it stops an entire priority class. Unfair: it blocks all flows in the class, even those that didn't cause the congestion. 25

Ethernet: Other Approaches Quantized Congestion Notification (QCN): Add flow information to MAC packets. Detect congestion at switches. Send a congestion notification message to the end host. The end host reacts and throttles the rate of the flow. QCN limits: does not work across subnet boundaries. Data center QCN: Like QCN, but congestion notification messages are sent using UDP, so it works across IP subnets. Moral: Ethernet was not designed with built-in flow control. Retro-fitting flow control is difficult. Maybe this should not be done at the Ethernet layer then? DCTCP is similar to QCN but is implemented at the transport layer. 26

Limits of Flow-based Congestion Control We discussed several techniques to improve networking in partition/aggregate type applications: Fine-grain TCP timers: reduce long-tail effects. DCTCP: reduces queue buildup. D3 and D2TCP: meet deadlines and SLAs. Link-level flow control. These approaches all work on a per-flow basis. None of them looks at the collective behavior of flows by taking job semantics into account. No coordination between individual network transfers within a single job. 27

Lack of coordination can hurt performance 28

Scalability of Netflix recommendation system Bottlenecked by communication as cluster size increases 29

Two key traffic patterns in MapReduce [Diagram: a job moving from the map stage to the reduce stage.] Broadcast: one-to-many, used to partition work across the map stage. Shuffle: many-to-many, used to aggregate results in the reduce stage. 30

Orchestra: Managing Data Transfers in Computer Clusters

Orchestra: key idea Optimize at the level of transfers instead of individual flows Transfer: set of all flows transporting data between two stages of a job Coordination done through three control components Cornet: cooperative broadcast Weighted shuffle scheduling: shuffle coordination Inter-transfer controller (ITC): global coordination 32

Cornet: cooperative broadcast Key idea: Broadcasting in MapReduce is very similar to data distribution mechanisms in the Internet like BitTorrent BitTorrent-like protocol optimized for data centers Split data up into blocks and distribute them across nodes in the data center On receiving: request/gather blocks from various nodes Receivers of blocks become part of sender set (BitTorrent) Cornet differs from classical BitTorrent: Blocks are much larger (4MB) No need for incentives Data center is assumed to have high-bandwidth No selfish peers in the data center Topology aware Topology of data center is known Receiver chooses sender in the same rack 33
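
A minimal sketch (not from the slides) of Cornet's topology-aware sender selection: among the peers that already hold a wanted block, prefer one in the receiver's own rack. The peer names, rack map and helper function are illustrative assumptions.

import random

def pick_sender(peers_with_block, receiver_rack, rack_of):
    same_rack = [p for p in peers_with_block if rack_of[p] == receiver_rack]
    return random.choice(same_rack or peers_with_block)

rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB"}
print(pick_sender(["n2", "n3"], receiver_rack="rackA", rack_of=rack_of))  # n2 (same rack)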

Cornet performance Experiment: 100GB data to 100 receivers on Amazon EC2 cluster Traditional broadcast implementations use distributed file system to store and retrieve broadcast data Cornet is about 4-5 times more efficient 34

Shuffle: Status Quo [Diagram: two reducers (top) fetching map output from five mappers (bottom).] Two receivers (top) need to fetch separate pieces of data from each sender. If every sender has an equal amount of data, all links are equally loaded and utilized. What if data sizes are unbalanced? 35

Shuffle: Sender Bottleneck [Diagram: five mappers s1-s5 (bottom), two reducers (top).] Senders s1, s2, s4 and s5 have one data unit for each receiver. Sender s3 has two data units for each receiver. The outgoing link of sender s3 becomes the bottleneck if flows share bandwidth in a fair way. 37

Orchestra: Weighted Shuffle Scheduling (WSS) Key idea: Assign weights to each flow in a shuffle Make the weight proportional to the data that needs to be transported 38
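
A minimal sketch (not from the slides) of the WSS idea: at a bottleneck link, split the capacity across flows in proportion to how much data each flow still has to transport, rather than into equal fair shares. Function and variable names are illustrative.

def wss_rates(remaining_units_per_flow, link_capacity=1.0):
    total = sum(remaining_units_per_flow.values())
    return {flow: link_capacity * units / total
            for flow, units in remaining_units_per_flow.items()}

# One receiver of the shuffle example below: 1 unit each from two light senders,
# 2 units from the heavy sender s3.
print(wss_rates({"s1": 1, "s2": 1, "s3": 2}))
# {'s1': 0.25, 's2': 0.25, 's3': 0.5} -> all three flows finish at the same time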

Example: shuffle with fair bandwidth sharing [Diagram: the same shuffle as above.] Each receiver fetches data at 1/3 units/second from its three senders (three flows sharing bandwidth at the receiver). After 3 seconds, all data from s1, s2, s4 and s5 is fetched, but one unit of data is left for each receiver at s3. s3 then transmits the two remaining units at 1/2 units per second to each receiver (two flows sharing the bandwidth at the sender). After two more seconds all units are transferred. Total time = 5 seconds. 39

Example: shuffle with weighted scheduling [Diagram: the same shuffle as above.] Receivers fetch data at 1/4 units/second from s1, s2, s4 and s5, and at 1/2 units/second from s3. Fetching data from s1, s2, s4 and s5 takes 4 seconds; fetching data from s3 takes 4 seconds. Total time = 4 seconds (25% faster than fair sharing). 44
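
A back-of-the-envelope check (not from the slides, but using the numbers above): completion time of this shuffle under fair sharing versus weighted shuffle scheduling, assuming every link has a capacity of 1 unit/second.

# Fair sharing: each receiver splits its link equally over its 3 flows (1/3 each).
t_light_done = 1 / (1 / 3)              # 3 s until the light senders are drained
t_fair = t_light_done + (2 - 1) / 0.5   # s3 then serves its leftover unit at 1/2
print(t_fair)                           # 5.0 seconds

# WSS: rates proportional to remaining data (1/4, 1/4, 1/2), all flows finish together.
t_wss = 2 / 0.5                         # equivalently 1 / 0.25
print(t_wss)                            # 4.0 seconds (25% faster)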

Orchestra: End-to-end evaluation 1.9x faster on 90 nodes 47

Next time End host optimizations User-level networking RDMA 48