The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook)

Workshop on New Visions for Large-Scale Networks: Research & Applications, Vienna, VA, USA, March 12-14, 2001

The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook)

Wu-chun Feng
feng@lanl.gov
http://home.lanl.gov/feng

Research & Development in Advanced Network Technology (RADIANT)
Computer & Computational Science (CCS) Division, Los Alamos National Laboratory
and
Department of Computer & Information Science, The Ohio State University

Outline
- History of Networking in High-Performance Computing (HPC)
- TCP in HPC?
- Current Solutions for HPC
- Future of High-Performance Networking: TCP vs. ULNI
- The Road to a HP-TCP
  - What's Wrong with TCP?
  - Solutions?
- Crossroads for High-Performance Networking

TCP for High-Performance Computing?

Problems with TCP for HPC in the mid-1990s (and today & tomorrow)
- Computing paradigm: cluster or supercomputer; today & tomorrow, + computational grid.
- Network environment: system-area network (SAN); today & tomorrow, + wide-area network (WAN).
- TCP performance (mid-90s): latency O(1000 µs), BW O(10 Mb/s). Too heavyweight.
- TCP performance (today): latency O(100 µs), BW O(500 Mb/s).
- TCP performance (optimized): latency 95 µs, BW 1.77 Gb/s. [Trapeze TCP]

Solution: user-level network interfaces (ULNIs) or OS-bypass protocols
- Active Messages, FM, PM, U-Net; recently, VIA (Compaq, Intel, Microsoft).
- ULNI performance (mid-90s): latency O(10 µs), BW O(600-800 Mb/s).
- ULNI performance (Petrini, Hoisie, Feng, Graham, 2000): latency 1.9 µs, BW 392 MB/s = 3.14 Gb/s; user-level latency 4.5 µs, user-level BW 307 MB/s = 2.46 Gb/s.

Problem: ULNIs do not scale to the WAN. Must use TCP (despite the conflicts).

Latency & BW problems are not specific to HPC
- Medicine, M. Ackerman (NLM): ECG, 20 GB now @ 100% reliability; neurological, smoothness @ 80% ok.
- Remote collaboration & bulk-data transfer, R. Mount (SLAC): 100 GB in 22 hrs.
- Integrated multi-level collaboration, T. Znati (NSF/ANIR).

Current Solutions for HPC

Computational Grid (WAN): TCP
  Cost-effective for commodity, high-end supercomputing.
  Better fault tolerance than large-scale supercomputers.
  + Infrastructure to deal with routing (ARP, IP).
  o Congestion control (conflict between the Internet & HPC communities).
  - Non-adaptive flow control.
  - High latency (even over short distances) and low bandwidth.

Cluster Computing or Supercomputing (SAN): User-Level Network Interface (ULNI) or OS-Bypass Protocol
  + Negotiated flow control.
  + Low latency and high bandwidth.
  - No automatic infrastructure to deal with routing.
  - No congestion control.

Each community is working to address its negatives, in effect re-inventing ULNI on one side and re-inventing TCP on the other. Will each community continue as separate thrusts? Sociologically? Technically?

Future of High-Performance Networking

Internet Community:
- Mid-1990s: grids
- Late 1990s: TCP Vegas
- 2010 & beyond: TCP Vegas + active queue management in the network (to ensure fairness)

HPC Community:
- Early 1990s: SANs
- Mid-1990s: specialized ULNI
- Late 1990s: standardized ULNI
- 2010 & beyond: standardized ULNI (extended to WAN)

Future of High-Performance Networking

Internet Community:
- Mid-1990s: grids
- Late 1990s: TCP Vegas

HPC Community:
- Early 1990s: SANs
- Mid-1990s: specialized ULNI
- Late 1990s: standardized ULNI

2010 & beyond (both communities): HP-TCP + active queue management in the network (to ensure fairness)

The Road to a HP-TCP: What's Wrong with TCP?

Host-Interface Bottleneck
- Software:
  - 10GigE packet interarrival: 1.2 µs. Null system call in Linux: 10 µs.
  - A host can only send & receive packets as fast as the OS can process them.
  - Excessive copying. Excessive CPU utilization.
- [Hardware (PC): not anything wrong with TCP per se. PCI I/O bus: 64-bit, 66 MHz = 4.2 Gb/s. Solution: InfiniBand?]

Adaptation Bottlenecks
- Flow control:
  - No adaptation currently being done in any standard TCP.
  - A static-sized window/buffer is supposed to work for both the LAN & WAN.
- Congestion control:
  - Adaptation mechanisms will not scale, particularly TCP Reno.
  - Adaptation mechanisms induce burstiness in the aggregate traffic stream.
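To sanity-check the software-bottleneck numbers above, here is a minimal back-of-the-envelope sketch; the figures (1500-byte packets, 10 Gb/s link, 10 µs null system call) come from the slide, while the assumption that each packet costs at least one null system call is purely for illustration.

```python
# Back-of-the-envelope check of the host-interface bottleneck numbers.
# Values follow the slide: 1500-byte packets, 10 Gb/s link, ~10 us null syscall.

MTU_BYTES = 1500
LINK_RATE_BPS = 10e9          # 10 Gigabit Ethernet
NULL_SYSCALL_US = 10.0        # quoted null-system-call latency

interarrival_us = MTU_BYTES * 8 / LINK_RATE_BPS * 1e6
print(f"10GigE packet interarrival: {interarrival_us:.1f} us")   # ~1.2 us

# If each packet costs even one null system call, the host falls behind:
packets_per_sec_wire = LINK_RATE_BPS / (MTU_BYTES * 8)
packets_per_sec_host = 1e6 / NULL_SYSCALL_US
print(f"Wire delivers {packets_per_sec_wire:,.0f} pkts/s; "
      f"host can process only {packets_per_sec_host:,.0f} pkts/s")
```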

Host-Interface Bottleneck (Software)

First-Order Approximation
- deliverable bandwidth = maximum-sized packet / interrupt latency
- e.g., 1500-byte MTU / 50 µs = 30 MB/s = 240 Mb/s

Problems
- The maximum-sized packet (or MTU) is only 1500 bytes for Ethernet.
- The interrupt latency to process a packet is quite high.
- CPU utilization for network tasks is too high. (See next slide.)

[Chart: roughly 625 Mb/s with congestion control (CC) vs. 900+ Mb/s without.]

Solutions Intended to Boost TCP Performance
- Reduce the frequency of interrupts, e.g., interrupt coalescing or OS-bypass.
- Increase the effective MTU size, e.g., interrupt coalescing or jumbograms.
- Reduce interrupt latency, e.g., push checksums into hardware, zero-copy.
- Reduce CPU utilization, e.g., offload protocol processing to the NIC.
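A small sketch of the slide's first-order approximation, using its 50 µs interrupt latency; the 4000- and 9000-byte MTU cases are added only for comparison with the next slide's figure.

```python
# First-order approximation from the slide:
#   deliverable bandwidth = maximum-sized packet / interrupt latency

def deliverable_bw_mbps(mtu_bytes, interrupt_latency_us):
    """Deliverable bandwidth in Mb/s, assuming one packet per interrupt."""
    return mtu_bytes * 8 / interrupt_latency_us   # bits per microsecond == Mb/s

for mtu in (1500, 4000, 9000):
    bw = deliverable_bw_mbps(mtu, 50.0)
    print(f"{mtu:5d}-byte MTU, 50 us interrupt latency -> {bw:6.0f} Mb/s")
# 1500-byte MTU -> 240 Mb/s, matching the slide's 30 MB/s figure.
```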

666-MHz Alpha with Linux (Courtesy: USC/ISI)
[Figure: CPU utilization (%) vs. throughput (Mb/s) for 1500-byte, 4000-byte, and 9000-byte MTUs. Note: the congestion-control mechanism does not get activated in these tests.]

Solutions to Boost TCP Performance (many non-TCP & non-standard)

- Interrupt coalescing: increases bandwidth (BW) at the expense of even higher latency.
- Jumbograms: increase BW with a minimal increase in latency, but at the expense of more blocking in switches/routers and a lack of interoperability.
- ULNI or OS-bypass protocol: increases BW & decreases latency by an order of magnitude or more. Integrate OS-bypass into TCP? VIA over TCP (IETF Internet Draft, GigaNet, July 2000).
- Interrupt latency reduction (possible remedy for TCP): provide zero-copy TCP (a la OS-bypass), but the OS is still the middleman; push protocol processing into hardware, e.g., checksums.

Which ones will guide the design of a HP-TCP?
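To make the bandwidth/latency trade-off of the first two remedies concrete, the sketch below extends the slide's first-order model. The assumption that coalescing k packets per interrupt adds roughly (k - 1) packet interarrival times of latency is an illustrative simplification, not something stated in the talk.

```python
# Extending the first-order model to interrupt coalescing and jumbograms.
# Assumptions (illustrative, not measured): 50 us interrupt latency on a
# Gigabit Ethernet link; coalescing latency penalty ~= (k - 1) interarrivals.

INT_LATENCY_US = 50.0
LINK_RATE_BPS = 1e9   # Gigabit Ethernet

def model(mtu_bytes, pkts_per_interrupt):
    """Return (deliverable bandwidth in Mb/s, added latency in us)."""
    bw_mbps = pkts_per_interrupt * mtu_bytes * 8 / INT_LATENCY_US
    interarrival_us = mtu_bytes * 8 / LINK_RATE_BPS * 1e6
    added_latency_us = (pkts_per_interrupt - 1) * interarrival_us
    return bw_mbps, added_latency_us

for mtu, k, label in [(1500, 1, "baseline"),
                      (1500, 8, "interrupt coalescing (8 pkts/intr)"),
                      (9000, 1, "jumbograms")]:
    bw, extra = model(mtu, k)
    print(f"{label:36s}: ~{bw:5.0f} Mb/s, +{extra:4.0f} us latency")
```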

Benchmarks: TCP

TCP over Gigabit Ethernet (via the loopback interface)
- Environment: Red Hat Linux 6.2 on 400-MHz & 733-MHz Intel PCs; Alteon AceNIC GigE cards; 32-bit, 33-MHz PCI bus.
- Test: latency & bandwidth over the loopback interface.
- Latency: O(50 µs).
- Peak BW with the default set-up: 335 Mb/s (400 MHz) & 420 Mb/s (733 MHz).
- Peak BW with manual tweaks by network gurus at both ends: 625 Mb/s.
  - Change the default send/receive buffer size from 64 KB to 512 KB.
  - Enable interrupt coalescing (2 packets per interrupt).
  - Jumbograms. Theoretical BW: 18000 bytes / 50 µs = 360 MB/s = 2880 Mb/s.
- Theoretical upper bound: 750 Mb/s, due to the nature of TCP Reno.

Problems? Congestion control, data copies, and the OS as the middleman. Faster CPUs provide slightly less latency and slightly more BW, but the BW of a high-speed (10GigE) connection is largely wasted. Solution: ULNI?
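The first of the "manual tweaks" above amounts to enlarging the per-connection socket buffers. A minimal sketch of that tuning is shown below; the 512 KB value is from the slide, while the use of Python's socket module is simply for illustration.

```python
# Minimal sketch of the manual socket-buffer tweak described above:
# raise the send/receive buffers from the (then) 64 KB default to 512 KB.
import socket

BUF_BYTES = 512 * 1024   # 512 KB, as in the slide's tuned configuration

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_BYTES)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)

# The kernel may round or cap these values; report what was actually granted.
print("SO_SNDBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```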

What's Wrong with TCP?

Host-Interface Bottleneck
- Software:
  - A host can only send and receive packets as fast as the OS can process the packets.
  - Excessive copying. Excessive CPU utilization.
- [Hardware (PC): not anything wrong with TCP per se. PCI I/O bus: 64-bit, 66 MHz = 4.2 Gb/s. Solution: InfiniBand?]

Adaptation Bottlenecks
- Flow control: no adaptation currently being done in any standard TCP; a static-sized window/buffer is supposed to work for both the LAN and WAN.
- Congestion control: adaptation mechanisms will not scale, particularly TCP Reno.

Adaptation Bottleneck: Flow Control

Issues
- No adaptation currently being done in any standard TCP: a 32-KB static-sized window/buffer is supposed to work for both the LAN and WAN.
- Problem: large bandwidth-delay products require flow-control windows as large as 1024 KB to fill the network pipe.
- Consequence: as little as 3% of the network pipe is filled.

Solutions
- Manual tuning of buffers at the send and receive end-hosts. Too small: low bandwidth. Too large: wasted memory (LAN).
- Automatic tuning of buffers.
  - PSC: auto-tuning, but does not abide by TCP semantics (1998).
  - LANL: dynamic flow control (2000). Increases BW by 8 times. Fair?
- Network striping & pipelining with default buffers. Unfair. (UIC, 2000.)
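A quick sketch of the window-sizing arithmetic behind the 3% figure; the 100 Mb/s, 80 ms example path is an assumption chosen to give a bandwidth-delay product near the slide's 1024 KB.

```python
# Why a 32 KB static window fills only a few percent of a WAN pipe:
# the window must cover the bandwidth-delay product (BDP) to keep it full.

def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes."""
    return bandwidth_bps * rtt_s / 8

def pipe_utilization(window_bytes, bandwidth_bps, rtt_s):
    """Fraction of the pipe a fixed window can fill (capped at 1.0)."""
    return min(1.0, window_bytes / bdp_bytes(bandwidth_bps, rtt_s))

# Assumed example path: 100 Mb/s, 80 ms RTT -> BDP of roughly 1000 KB.
bw, rtt = 100e6, 80e-3
for window_kb in (32, 64, 512, 1024):
    u = pipe_utilization(window_kb * 1024, bw, rtt)
    print(f"{window_kb:5d} KB window -> {u:6.1%} of the pipe filled")
# The 32 KB default yields ~3%, matching the slide's "as little as 3%".
```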

Adaptation Bottleneck: Congestion Control

Adaptation mechanisms will not scale due to the additive-increase / multiplicative-decrease algorithm (see next slide), which also induces bursty (i.e., self-similar or fractal) traffic.

TCP Reno congestion control
- Bad: allow/induce congestion, then detect & recover from it.
- Analogy: deadlock detection & recovery in an OS.
- Result: at best 75% utilization in steady state.

TCP Vegas congestion control
- Better: approach congestion but try to avoid it. Usually results in better network utilization.
- Analogy: deadlock avoidance in an OS.

[Figure: utilization vs. time for Reno and Vegas (y-axis marked 50% and 100%).]
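A toy simulation of the Reno sawtooth illustrates the "at best 75%" figure; the pipe size (in segments) and the assumption that loss occurs exactly when the pipe is full are idealizations for illustration only.

```python
# Toy model of TCP Reno's AIMD sawtooth: additive increase of one segment
# per RTT until the pipe is full, then multiplicative decrease (halving).
PIPE_SEGMENTS = 1000     # window (in segments) that exactly fills the pipe
RTTS = 20000

window = PIPE_SEGMENTS // 2
utilization_sum = 0.0
for _ in range(RTTS):
    utilization_sum += min(window, PIPE_SEGMENTS) / PIPE_SEGMENTS
    if window >= PIPE_SEGMENTS:      # "congestion": halve the window
        window //= 2
    else:                            # congestion avoidance: +1 segment/RTT
        window += 1

print(f"Average utilization over {RTTS} RTTs: {utilization_sum / RTTS:.1%}")
# Sawtooth between 50% and 100% of the pipe -> roughly 75% on average,
# the "at best 75% utilization in steady state" figure from the slide.
```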

Optimal Bandwidth

The future performance of computational grids (as well as clusters & supercomputers trying to get away from ULNI scalability problems) looks bad if we continue to rely on the widely-deployed TCP Reno.

Example: high BW-delay product: 1 Gb/s WAN * 100 ms RTT = 100 Mb.
- Additive increase when the window size is 1: a 100% increase in window size.
- Additive increase when the window size is 1000: a 0.1% increase in window size.

[Figure: window size vs. time, sawing between 50 Mb and the 100 Mb of available BW.]

Re-convergence to the optimal bandwidth takes nearly 7 minutes! (Performance is awful even if the network is uncongested.)
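The "nearly 7 minutes" figure can be reproduced with one-segment-per-RTT arithmetic; the 1500-byte MSS below is an assumption (the slide does not state a segment size).

```python
# Reproducing the "nearly 7 minutes" re-convergence estimate: after halving,
# Reno recovers the lost half of the window by adding one MSS per RTT.
LINK_BW_BPS = 1e9        # 1 Gb/s WAN
RTT_S = 0.100            # 100 ms round-trip time
MSS_BYTES = 1500         # assumed maximum segment size

bdp_bits = LINK_BW_BPS * RTT_S                    # 100 Mb, as on the slide
half_window_segments = (bdp_bits / 2) / (MSS_BYTES * 8)
recovery_s = half_window_segments * RTT_S         # one segment added per RTT

print(f"BDP: {bdp_bits / 1e6:.0f} Mb")
print(f"Segments to recover: {half_window_segments:,.0f}")
print(f"Recovery time: {recovery_s:.0f} s (~{recovery_s / 60:.1f} minutes)")
```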

AIMD Congestion Control

Stable & fair (under certain assumptions of synchronized feedback), but not well-suited for emerging applications (e.g., streaming & real-time audio and video):
- Its reliability and ordering semantics increase end-to-end delays and delay variations.
- Multimedia applications generally do not react well to the large and abrupt reductions in transmission rate caused by AIMD.

Solutions
- Deploy TCP-friendly (non-AIMD) congestion-control algorithms, e.g., binomial congestion-control algorithms such as inverse increase / additive decrease (MIT).
- Provide a protocol that functionally sits between UDP & TCP, e.g., RAPID: Rate-Adjusting Protocol for Internet Delivery.
  - n% reliability (vs. the 100% reliability of TCP), where 0% < n <= 100%.
  - Provides high performance and utilization regardless of network conditions.
- Hmm... what about wireless?
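For concreteness, here is a sketch of the binomial increase/decrease rules (Bansal & Balakrishnan, MIT), with IIAD as the k = 1, l = 0 case; the alpha and beta constants are illustrative defaults, not values from the talk.

```python
# Binomial congestion control:
#   on each ACKed window: w <- w + alpha / w**k
#   on loss:              w <- w - beta  * w**l
# AIMD is the special case k=0, l=1; IIAD ("inverse increase / additive
# decrease") is k=1, l=0.  alpha and beta below are illustrative choices.

def binomial_update(w, loss, k, l, alpha=1.0, beta=0.5):
    """Return the new congestion window after one RTT."""
    if loss:
        return max(1.0, w - beta * w ** l)
    return w + alpha / w ** k

# Compare AIMD (Reno-like) and IIAD reactions to a single loss at w = 100:
for name, (k, l) in {"AIMD": (0.0, 1.0), "IIAD": (1.0, 0.0)}.items():
    after_loss = binomial_update(100.0, loss=True, k=k, l=l)
    after_ack = binomial_update(after_loss, loss=False, k=k, l=l)
    print(f"{name}: loss at w=100 -> {after_loss:.2f}, "
          f"next RTT of ACKs -> {after_ack:.3f}")
# IIAD's decrease is far gentler than AIMD's halving, which is why such
# schemes are attractive for streaming media.
```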

How to Build a HP-TCP?

Host-Interface Bottleneck
- Software: a host can only send and receive packets as fast as the OS can process the packets. The BW problems are potentially solvable. Latency? What happens when we go optical to the chip?
- Hardware (PC): based on past trends, the I/O bus will continue to be a bottleneck. PCI I/O bus: 64-bit, 66 MHz = 4.2 Gb/s. Solution: InfiniBand?

Adaptation Bottlenecks
- Flow control: no adaptation currently being done in any standard TCP; a static-sized window/buffer is supposed to work for both the LAN and WAN. Solutions exist but are not widely deployed.
- Congestion control: adaptation mechanisms will not scale, particularly TCP Reno. TCP Vegas? Binomial congestion control?

Crossroads for High-Performance Networking: Hardware/Architecture

http://www.canet3.net/library/papers/wavedisk.doc

InfiniBand helps, but a network-peer CPU (or co-processor) or a network-integrated microprocessor architecture may be needed. Software bottleneck? Offload as much protocol processing as possible to the NIC.

[Diagram: Today, Near Future, and Far Future host architectures (CPU, cache ($), main memory, memory bus, I/O bridge, I/O bus, NIC), with the NIC moving from the I/O bus toward the CPU and main memory. Far Future: the network is storage, since memory & disk access are too slow.]

Crossroads for High-Performance Networking: Software

Internet Community:
- Mid-1990s: grids
- Late 1990s: TCP Vegas
- Tomorrow: TCP Vegas

HPC Community:
- Early 1990s: SANs
- Mid-1990s: specialized ULNI
- Late 1990s: standardized ULNI
- Tomorrow: standardized ULNI (extended to the WAN, e.g., add IP routing, congestion control, MTU discovery)

Crossroads for High-Performance Networking: Software

Internet Community:
- Mid-1990s: grids
- Late 1990s: TCP Vegas

HPC Community:
- Early 1990s: SANs
- Mid-1990s: specialized ULNI
- Late 1990s: standardized ULNI

Tomorrow (both communities): HP-TCP

That's all folks!