Outrunning Moore's Law: Can IP-SANs close the host-network gap? Jeff Chase, Duke University

But first: this work addresses questions that are important in the industry right now. It is an outgrowth of the Trapeze project (1996-2000) and is tangential to my primary research agenda: resource management for large-scale shared service infrastructure, including self-managing computing/storage utilities, the Internet service economy, and federated distributed systems. Amin Vahdat will speak about our work on Secure Highly Available Resource Peering (SHARP) in a few weeks.

A brief history. There was much research on fast communication and end-system TCP/IP performance through the 1980s and early 1990s, with a common theme: advanced NIC features and the host/NIC boundary. TCP/IP offload was controversial, and early efforts failed. User-level messaging and Remote Direct Memory Access (RDMA) emerged (e.g., U-Net). The SAN market grew enormously in the mid-1990s; the VI Architecture standardized the SAN messaging host interface in 1997-1998, and Fibre Channel (FC) created a market for network block storage. Then came Gigabit Ethernet.

A brief history, part 2. Zero-copy TCP/IP and the first gigabit TCP [1999]; consensus that zero-copy sockets are not general [2001]; the IETF RDMA working group [2002]; the Direct Access File System (DAFS) [2002]; iSCSI block storage for TCP/IP; a revival of TCP/IP offload with 10+GE, NFS/RDMA, offload chips, etc.; and uncalibrated marketing claims. [Diagram: TCP/IP, Ethernet, and iSCSI alongside SAN and DAFS.]

Ethernet/IP in the data center. 10+Gb/s Ethernet continues the trend of Ethernet speeds outrunning Moore's Law, and Ethernet runs IP. This trend increasingly enables IP to compete in high-performance domains: data centers and other SAN markets ({System, Storage, Server, Small} Area Networks, often specialized/proprietary/nonstandard); network storage, iSCSI vs. FC; and InfiniBand vs. IP over 10+GE.

Ethernet/IP vs. Real SANs. IP offers many advantages: one network, a global standard, unified management, etc. But can IP really compete? What do real SANs really offer? Fatter wires? Lower latency? Lower host overhead?

SAN vs. Ethernet Wire Speeds. [Graph: two scenarios plotting log bandwidth (a smoothed step function) against time for SAN and Ethernet.]

Outrunning Moore's Law? Whichever scenario comes to pass, both SANs and Ethernet are advancing ahead of Moore's Law. How much bandwidth do data center applications need (Amdahl's other law)? [Graph: network bandwidth per CPU cycle vs. time for SAN and Ethernet, with bands for high-performance (data center?), I/O-intensive, and compute-intensive apps.]

The problem: overhead. Ethernet is cheap, and cheap NICs are dumb. Although TCP/IP protocol processing itself is reasonably efficient, managing a dumb NIC steals CPU/memory cycles away from the application. [Diagram: stacked bars of a and o for SAN and TCP/IP.] Here a = application processing per unit of bandwidth and o = host communication overhead per unit of bandwidth.

The host/network gap. The host saturation throughput curve is 1/(a+o): low-overhead SANs can deliver higher throughput, even when the wires are the same speed. [Graph: application (server) throughput vs. bandwidth (wire speed), showing a gap between the SAN and TCP/IP curves that grows with host overhead (o).]
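As a back-of-the-envelope illustration (not from the slides), here is a minimal sketch of that saturation model, assuming a and o are expressed in CPU-seconds per gigabit handled (hypothetical units) and delivered throughput is the lesser of the wire speed and the host saturation rate 1/(a+o):

    # Minimal sketch of the host-saturation model; units and values are illustrative.
    def delivered_throughput(wire_gbps, a, o):
        """Throughput is capped by the wire and by host saturation at 1/(a+o)."""
        host_limit = 1.0 / (a + o)      # Gb/s the host can sustain
        return min(wire_gbps, host_limit)

    # Same 10 Gb/s wire, same application work, different per-unit overhead:
    tcp_ip = delivered_throughput(10.0, a=0.2, o=0.3)    # dumb NIC: higher o
    san    = delivered_throughput(10.0, a=0.2, o=0.05)   # low-overhead SAN
    print(tcp_ip, san)   # 2.0 vs. 4.0 Gb/s: the gap at equal wire speed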

Hitting the wall. Throughput improves as hosts advance, but bandwidth per cycle is constant once the host saturation point is reached. [Graph: bandwidth per CPU cycle vs. time for SAN and Ethernet, flattening at the host saturation point.]

IP-SANs. If you believe in the problem, then the solution is to attach hosts to the faster wires with smarter NICs: hardware checksums and interrupt suppression; transport offload (TOE); connection-aware NICs with early demultiplexing; ULP offload (e.g., iSCSI); and direct data placement/RDMA. Since these NICs take on the key characteristics of SANs, let's use the generic term IP-SAN (or just "offload").

How much can IP-SANs help? IP-SAN is a difficult engineering challenge: it takes time and money to get it right. LAWS [Shivam & Chase 2003] is a back-of-napkin analysis to explore the potential benefits and limitations. The figure of merit is the marginal improvement in peak application throughput ("speedup"). Premises: Internet servers are fully pipelined, and latency is ignored (your mileage may vary). IP-SANs can improve throughput if the host saturates.

What you need to know (about): the importance of overhead and its effect on performance, as distinct from latency and bandwidth; the sources of overhead in TCP/IP communication, per segment vs. per byte (copy and checksum), MSS/MTU size, jumbo frames, and path MTU discovery; data movement from the NIC through the kernel to the app; RFC 793 copy semantics and their impact on the socket model and data copying overhead; the approaches that exist to reduce it and the critical architectural issues they raise (app vs. OS vs. NIC); RDMA+offload and the layering controversy; skepticism of marketing claims for proposed fixes; Amdahl's Law; and LFNs.

Focusing on the Issue. The key issue IS NOT: the pipes (Ethernet has come a long way since 1981; add another zero every three years?); transport architecture (the generality of IP is worth the cost); protocol overhead (run better code on a faster CPU); or interrupts, checksums, etc. (the NIC vendors can innovate here without us). All of these are part of the bigger picture, but we don't need an IETF working group to fix them.

The Copy Problem. The key issue IS data movement within the host. Combined with other overheads, copying sucks up resources needed for application processing. The problem won't go away with better technology; faster CPUs don't help: it's the memory. General solutions are elusive on the receive side. The problem exposes basic structural issues: interactions among the NIC, OS, APIs, and protocols.

Zero-Copy Alternatives. Option 1: page flipping. The NIC places payloads in aligned memory; the OS uses virtual memory to map it where the app wants it. Option 2: scatter/gather API. The NIC puts the data wherever it wants; the app accepts the data wherever it lands. Option 3: direct data placement. The NIC puts data where the headers say it should go. Each solution involves the OS, application, and NIC to some degree.

Page Flipping: the Basics. Goal: deposit payloads in aligned buffer blocks suitable for the OS VM and I/O system. The receiving app specifies buffers (per RFC 793 copy semantics). [Diagram: the NIC splits headers from aligned payload buffers in the kernel (K); the VM remaps pages into user space (U) at the socket layer.]

Page Flipping with Small MTUs. Give up on jumbo frames: the NIC must split transport headers, then sequence and coalesce payloads for each connection/stream/flow. [Diagram: NIC coalescing payloads into host memory across the kernel (K) / user (U) boundary.]

Page Flipping with a ULP. ULP PDUs are encapsulated in a stream transport (TCP, SCTP); the NIC must split transport and ULP headers and coalesce payloads for each stream (or ULP PDU). Example: an NFS client reading a file. [Diagram: NIC delivering coalesced payloads to the host across the kernel/user boundary.]

Page Flipping: Pros and Cons. Pro: it sometimes works. Cons: application buffers must match transport alignment; the NIC must split headers and coalesce payloads to fill aligned buffer pages; the NIC must recognize and separate ULP headers as well as transport headers; and page remapping requires TLB shootdown on SMPs, a cost that scales with the number of processors.

Option 2: Scatter/Gather. The NIC demultiplexes packets by the ID of the receiving process; the system and apps see data as arbitrary scatter/gather buffer chains (read-only). [Diagram: the NIC deposits data anywhere in the buffer pool for the recipient, across the kernel/user boundary.] Examples: fbufs and IO-Lite [Rice].

Scatter/Gather: Pros and Cons. Pro: it just might work. Cons: it requires new APIs, new applications, new NICs, and a new OS, and it may not meet app alignment constraints.
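For flavor only (this is not the fbufs/IO-Lite API), a minimal Python sketch of a scatter/gather receive using the standard socket.recvmsg_into call, which lets the kernel fill a chain of caller-supplied buffers; the endpoint and buffer sizes are hypothetical:

    import socket

    # Hypothetical pre-allocated buffer chain; received bytes land across these
    # buffers in order, wherever the kernel chooses to split them.
    chain = [bytearray(4096) for _ in range(4)]

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(("server.example.com", 9000))   # hypothetical endpoint

    # Scatter receive: one call fills the buffer chain; the app consumes data in place.
    nbytes, ancdata, flags, addr = sock.recvmsg_into(chain)
    views = [memoryview(b) for b in chain]        # no extra application-level copy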

Option 3: Direct Data Placement. The NIC steers payloads directly to app buffers, as directed by transport and/or ULP headers.

DDP: Pros and Cons. Effective: it deposits payloads directly in designated receive buffers, without copying or flipping. General: it works independent of MTU, page size, buffer alignment, presence of ULP headers, etc. Low-impact: if the NIC is magic, DDP is compatible with existing apps, APIs, ULPs, and OSes. Of course, there are no magic NICs.

DDP: Examples. TCP Offload Engines (TOEs) can steer payloads directly to preposted buffers; this is similar to page flipping (pack each flow into buffers), but it relies on preposting and doesn't work for ULPs. ULP-specific NICs (e.g., iSCSI) lead to a proliferation of special-purpose NICs and are expensive for future ULPs. RDMA also exists on non-IP networks: VIA, InfiniBand, ServerNet, etc.

Remote Direct Memory Access. Register buffer steering tags with the NIC and pass them to the remote peer. An RDMA-like transport shim carries directives and steering tags in the data stream; the directives and steering tags guide NIC data placement. [Diagram: the remote peer's NIC placing data directly into registered buffers.]
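A minimal sketch of that flow in pseudocode-style Python; every name here (nic.register_buffer, advertise, rdma_write) is a hypothetical placeholder standing in for the verbs of a real RDMA stack, which the slides do not specify:

    # Hypothetical RDMA-style flow; names are placeholders, not a real API.

    # Receiver: register a buffer with the NIC and obtain a steering tag (stag).
    recv_buf = bytearray(1 << 20)
    stag = nic.register_buffer(recv_buf)              # NIC may now DMA into recv_buf
    control_channel.advertise(stag, len(recv_buf))    # hand the tag to the remote peer

    # Sender: the transport shim carries the steering tag and offset with the data,
    # so the receiving NIC places each payload directly at its final location.
    stag, length = control_channel.receive_advertisement()
    nic.rdma_write(stag, offset=0, data=payload)      # no receive-side copy or page flip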

LAWS ratios. α (Lag ratio): the ratio of host CPU speed to NIC processing speed. γ (Application ratio): the CPU intensity (compute/communication) of the application. σ (Wire ratio): the percentage of wire speed the host can deliver for raw communication without offload. β (Structural ratio): the portion of network work not eliminated by offload. See "On the Elusive Benefits of Protocol Offload", Shivam and Chase, NICELI 2003.

Application ratio (γ). The application ratio γ = a/o captures compute-intensity. [Diagram: bars of a and o for different workloads.] For a given application, lower overhead increases γ. For a given communication system, γ is a property of the application: it captures processing per unit of bandwidth.

γ and Amdahl's Law. Amdahl's Law bounds the potential improvement to 1/γ when the system is still host-limited after offload: network-intensive apps (low γ) see high benefit, CPU-intensive apps (high γ) see low benefit. What is γ for typical services in the data center? Apache? Apache with Perl? [Graph: throughput increase (%) vs. compute-intensity (γ), falling off as 1/γ.]
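A one-line derivation of that bound, reconstructed from the earlier definitions of a and o (it is not spelled out on the slide): if the host remains the bottleneck after offload, then

    T_{\text{before}} = \frac{1}{a+o}, \qquad
    T_{\text{after}} \le \frac{1}{a}
    \;\Rightarrow\;
    \frac{T_{\text{after}}}{T_{\text{before}}} - 1 \;\le\; \frac{a+o}{a} - 1 = \frac{o}{a} = \frac{1}{\gamma}.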

Wire ratio (σ). The wire ratio σ = (1/o) / B captures host speed relative to the network, where B is the network bandwidth and 1/o is the host saturation throughput for raw communication. Network processing cannot saturate the CPU when σ > 1. For example, a host whose raw communication saturates at 2 Gb/s attached to a 1 Gb/s wire has σ = 2. σ → 0 means a fast network and a slow host; σ >> 1 means a slow network and a fast host. The best realistic scenario is σ = 1: wire speed just saturates the host.

Effect of wire ratio (σ). When the system is network-limited after transport offload, the improvement is ((γ+1)/σ) - 1. Lower σ (faster network) makes the benefit grow rapidly; higher σ (slower network) yields little benefit. [Graph: throughput increase (%) vs. compute-intensity (γ) for several values of σ.]

Putting it all together. The peak benefit from offload occurs at the intersection of the network-limited bound, ((γ+1)/σ) - 1, and the host-limited bound, 1/γ, which happens at σ = γ. The peak benefit occurs when the application drives the full network bandwidth with no host cycles left idle. [Graph: throughput increase (%) vs. compute-intensity (γ), with the two bounds crossing at σ = γ.]
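A minimal sketch of those two bounds as code, assuming the normalizations above (o = 1, so a = γ and B = 1/σ); this is my reading of the LAWS bounds as stated on the slides, not code from the paper:

    def offload_improvement(gamma, sigma):
        """Fractional throughput gain from full transport offload (LAWS, alpha = beta = 1)."""
        host_limited    = 1.0 / gamma                   # host still saturates after offload
        network_limited = (gamma + 1.0) / sigma - 1.0   # wire saturates after offload
        return max(0.0, min(host_limited, network_limited))

    print(offload_improvement(1.0, 1.0))    # 1.0  -> up to 100% (2x) at the sigma = gamma peak
    print(offload_improvement(4.0, 1.0))    # 0.25 -> compute-intensive app, bounded by 1/gamma
    print(offload_improvement(4.0, 10.0))   # 0.0  -> slow network: the host never saturates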

Offload for fast hosts. Faster hosts, better protocol implementations, and slower networks all push σ higher; e.g., a 100 Mb/s network on a gigabit-ready host gives σ = 10. The throughput improvement is then bounded by 1/σ (e.g., 10%). Key question: will network advances continue to outrun Moore's Law and push σ lower over the long term? [Graph: throughput increase (%) vs. compute-intensity (γ), peaking at σ = γ.]

Offload for fast networks. As σ → 0 (network speed advancing relative to the host), the peak benefit is unbounded! But those benefits apply only to a narrow range of low-γ applications; with real application processing (higher γ), the potential benefit is always bounded by 1/γ. [Graph: throughput increase (%) vs. compute-intensity (γ), peak at σ = γ.]

Offload for a realistic network. The network is realistic if the host can handle raw communication at wire speed (σ ≥ 1). The best realistic scenario is σ = 1: raw communication just saturates the host. In this case, offload improves throughput by up to a factor of two (100%), but no more. The peak benefit occurs when γ = 1: the host is evenly split between overhead and app processing before offload. [Graph: throughput increase (%) vs. compute-intensity (γ), peaking at σ = γ = 1.]

Pitfall: offload to a slow NIC. If the NIC is too slow, it may limit throughput when γ is low. The slow NIC has no impact on throughput unless it saturates, but offload may do more harm than good for low-γ applications. [Graph: throughput impact (%) vs. compute-intensity (γ).]

Quantifying the impact of a slow NIC. The lag ratio (α) captures the relative speed of the host and NIC for communication processing. When the NIC lags behind the host (α > 1), the peak benefit occurs when α = γ and is bounded by 1/α. We can think of the lag ratio in terms of Moore's Law: e.g., α = 2 when NIC technology lags the host by 18 months. Then the peak benefit from offload is 50%, and it occurs for an application that wasted 33% of system CPU cycles on overhead. [Graph: throughput impact (%) vs. compute-intensity (γ).]

IP transport offload: a dumb idea whose time has come? Offload enables structural improvements such as direct data placement (RDMA) that eliminate some overhead from the system rather than merely shifting it to the NIC. If a share β of the overhead remains, then the peak benefit occurs when αβ = γ and is bounded by 1/αβ. (Jeff Mogul, "TCP offload is a dumb idea whose time has come", HotOS 2003.) If β = 50%, then we can get the full benefit from offload with 18-month-old NIC technology: DDP/RDMA eases the time-to-market pressure for offload NICs. [Graph: throughput impact (%) vs. compute-intensity (γ).]
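Extending the earlier sketch with the lag and structural ratios (again my reading of the LAWS bounds, with o normalized to 1, so the NIC's residual work per unit of bandwidth is αβ):

    def offload_improvement_full(gamma, sigma, alpha=1.0, beta=1.0):
        """Fractional gain with NIC lag (alpha) and residual overhead share (beta)."""
        before = min(1.0 / sigma, 1.0 / (gamma + 1.0))                 # wire- or host-limited
        after  = min(1.0 / sigma, 1.0 / gamma, 1.0 / (alpha * beta))   # wire-, host-, or NIC-limited
        return max(0.0, after / before - 1.0)

    # The slide's example: an 18-month-old NIC (alpha = 2) with half the overhead
    # eliminated structurally (beta = 0.5) behaves like a NIC that keeps pace.
    print(offload_improvement_full(gamma=1.0, sigma=1.0, alpha=2.0, beta=0.5))  # 1.0 (100%)
    print(offload_improvement_full(gamma=2.0, sigma=2.0, alpha=2.0, beta=1.0))  # 0.5 (peak at alpha = gamma)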

Outrunning Moore's Law, revisited. IP-SANs will free IP/Ethernet technology to advance along the curve to higher bandwidth per CPU cycle. But how far up the curve do we need to go? If we get ahead of our applications, then the benefits fall off quickly. What if Amdahl was right? [Graph: network bandwidth per CPU cycle vs. time (Amdahl's other law), with Ethernet crossing bands for the high-volume, high-margin, and niche markets.]

Conclusion. To understand the role of 10+GE and IP-SANs in the data center, we must understand the applications (γ). Lies, damn lies, and point studies: careful selection of γ and σ can yield arbitrarily large benefits from SAN technology, but those benefits may be elusive in practice. The LAWS analysis exposes fundamental opportunities and limitations of IP-SANs and other approaches to low-overhead I/O (including non-IP SANs), and it helps guide development, evaluation, and deployment.