THE U-NET USER-LEVEL NETWORK ARCHITECTURE. Joint work with Werner Vogels, Anindya Basu, and Vineet Buch. Or: it's easy to buy high-speed networks, but making them work is another story.


Thorsten von Eicken, Dept. of Computer Science, Cornell University (tve@cs.cornell.edu)

THE U-NET USER-LEVEL NETWORK ARCHITECTURE
or: it's easy to buy high-speed networks, but making them work is another story

NoW retreat, June 7th-9th, 1995
Joint work with Werner Vogels, Anindya Basu, and Vineet Buch

Why ATM & U-Net goals

Why ATM?
- could be a decent LAN standard
- ok if one ignores 99% of the standards
- ok if one ignores 99% of the vendor software
- shoot yourself in the foot and then try to run? yup, but building your own hardware is even worse...

Why U-Net?
- need user-level access to the NI (for all the good ol' reasons)
- not everyone has bought into Active Messages (yet :-)
- provide a simple abstraction over the network: send + receive queues, flexible buffers managed by the user
- enable (but not require) true zero copy
- build Active Messages over U-Net efficiently

Experimental Set-up

Standard workstations
- 4 SS-20 @ 60 MHz, 64 MB memory each: $19,250 each
- Total cost: $77,000

ATM network
- Switch chassis (1/4 of a Fore Systems ASX-200): $5,500
- Network module for the switch: $6,000
- Network interfaces (4x Fore Systems SBA-200): $1,900 each
- Fiber (4x lab fiber): $100 each
- Total cost: $19,500
- Fraction of total cost: 20%

(list prices, July '94)
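For reference, the two totals and the cost fraction follow from the per-item prices above:

4 x $19,250 = $77,000 (workstations)
$5,500 + $6,000 + 4 x $1,900 + 4 x $100 = $19,500 (ATM network)
$19,500 / ($77,000 + $19,500) ≈ 20%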

Test 1: ATM bandwidth straight out of the box
- measure bandwidth using an infinite stream out of cached buffers

Results:
- TCP: at most 60% of the bandwidth
- UDP: >80% only with large buffers; 20% drops if the wrong buffer size is used
- AAL5: >80% only with buffers 3KB < size < 4KB

[Chart: bandwidth (Mbits/s and Mbytes/s) vs. message size (bytes) for AAL5 send, UDP send, UDP recv, and TCP]

Test 2: ATM latency straight out of the box
- measure round-trip time using 1,000 ping-pongs

Results: worse than Ethernet, unless bandwidth matters

[Chart: round-trip time (µs) vs. message size (bytes) for Ethernet TCP, Ethernet UDP, Fore ATM TCP, and Fore ATM UDP]

It must be the ATM network's fault!

Fore Systems ASX-200 switch:
- up to 16 140/155 Mbit ports
- full-bandwidth broadcast architecture (equivalent to a crossbar)
- about 7µs latency per switch

140 Mbit fiber, TAXI chip-set:
- 175 MHz clock, 4b/5b bit encoding -> 140 Mbit/s
- 55 bytes/cell -> 122.2 Mbit/s payload bandwidth
- about 3µs to serialize a cell
- about 1µs optical conversion delay (unlike when using SONET!)
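For reference, these link figures are consistent with one another (an ATM cell carries 48 bytes of payload; the 55 bytes/cell include the cell header plus TAXI framing):

175 MHz x 4/5 (4b/5b encoding) = 140 Mbit/s line rate
140 Mbit/s x 48/55 ≈ 122.2 Mbit/s payload bandwidth
55 bytes x 8 / 140 Mbit/s ≈ 3.1 µs to serialize one cell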

It must be the ATM interface's fault!

SBA-200 Network Interface:
- 25 MHz i960 dual-issue processor
- burst DMA onto the host bus
- AAL5 CRC calculation in hardware

[Block diagram: the SBA-200 hardware: host bus, bus slave and bus master interfaces, IN/OUT FIFOs and DMA, board control, Intel i960 control processor, 256K SRAM, boot PROM, receive and transmit buffers, CRC, network control, physical layer, to/from the ATM network]

... But no, it's the Software

Good ol' UNIX

UNIX networking layers
- regular TCP/IP stack accounts for ~70% of the round-trip latency
- -> Werner Vogels will explain...

Device layer: SBA-200 device driver
- maps or copies mbufs into DMA space
- sends by pointing the SBA-200 at PDU descriptors
- receives by handling interrupts and fetching PDU descriptors

SBA-200 firmware
- deals with AAL5 segmentation/reassembly
- sends: queue of PDU descriptors pointing to buffer descriptors pointing to buffers
- receives: into a queue of free buffer descriptors pointing to buffers
- provides a queue of PDU descriptors pointing to buffer descriptors pointing to buffers, plus an interrupt
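To make the descriptor chaining concrete, here is a minimal sketch of the kind of structures the driver and firmware exchange; all names are invented for illustration and are not the actual Fore driver or firmware interface:

#include <stdint.h>
#include <stdio.h>

struct buf_desc {               /* one DMA-able buffer fragment */
    uint32_t dma_addr;          /* bus address the SBA-200 can DMA to/from */
    uint32_t len;               /* bytes used in this fragment */
};

struct pdu_desc {               /* one AAL5 PDU, possibly scattered */
    uint16_t vci;               /* virtual circuit it travels on */
    uint16_t nbufs;             /* number of buffer fragments */
    struct buf_desc buf[8];     /* buffer descriptors pointing to buffers */
};

/* Send path: the driver fills a pdu_desc and points the i960 firmware at it;
 * the firmware segments the buffers into cells and appends the AAL5 CRC.
 * Receive path: the firmware consumes free buffer descriptors, reassembles
 * cells into the buffers, and hands back a pdu_desc plus an interrupt. */
int main(void)
{
    struct pdu_desc d = { .vci = 42, .nbufs = 1 };
    d.buf[0].dma_addr = 0x10000;   /* hypothetical bus address */
    d.buf[0].len = 1500;
    printf("PDU on VCI %u: %u fragment(s), first fragment %u bytes\n",
           (unsigned)d.vci, (unsigned)d.nbufs, (unsigned)d.buf[0].len);
    return 0;
}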

U-Net: Basic Idea

Traditional: kernel controls the network
- all communication goes via the kernel

U-Net: applications access the network directly via a simple mux
- kernel only involved in connection set-up

[Diagram: node 1 and node 2 communicating through the kernel (traditional) vs. through a message mux/demux (U-Net). Legend: U = user application, K = operating system kernel, M = message mux/demux]

U-Net Building Blocks

U-Net: User-level Network Interface

[Diagram: a U-Net endpoint consisting of a communication segment plus send queue, free-buffers queue, and receive queue, split between main memory and SBA-200 SRAM]

U-Net Characteristics

Each user process communicates directly with the NI
- per-process queues and communication segment, protected from other processes
- per-process U-Net channels, converted to/from VCIs in the NI

Connection set-up still handled via the kernel
- kernel informs the NI about per-process channel <-> VCI mappings
- kernel can enforce protection/authorization/authentication

Optimized short messages
- single-packet send is optimized (ATM: single cell = 40 bytes of payload)
- single-packet receives fit in the receive queue: no buffer allocation necessary

Supports scatter/gather
- one PDU can consist of multiple buffers

Various reception models
- polling the receive queue
- going to sleep and waking up (blocking read or select)
- getting an interrupt (UNIX signal)
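These abstractions are concrete enough to sketch. Below is a minimal, self-contained sketch of what an endpoint and a polling receive might look like from the application's side; every name, size, and the slot-ownership convention are invented for illustration (this is not the actual U-Net interface), and the NI side is omitted, so the poll finds nothing:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define SEG_BYTES  (64 * 1024)   /* communication segment (pinned in reality) */
#define QLEN       64            /* entries per queue */
#define BUF_BYTES  1024          /* fixed-size buffers carved from the segment */

struct descr {                   /* one queue entry */
    uint32_t offset;             /* buffer location within the segment */
    uint32_t length;             /* message length in bytes */
    uint16_t channel;            /* U-Net channel; the kernel maps it to a VCI */
    uint16_t valid;              /* set while the slot holds a message for its consumer */
};

struct endpoint {
    uint8_t      segment[SEG_BYTES];
    struct descr send_q[QLEN];   /* user fills and marks valid; NI sends and clears */
    struct descr recv_q[QLEN];   /* NI fills and marks valid; user consumes and clears */
    struct descr free_q[QLEN];   /* user posts free buffers for the NI to receive into */
    unsigned     send_head, recv_head;
};

/* Enqueue a small message: copy it into the segment, publish a descriptor. */
int ep_send(struct endpoint *ep, uint16_t chan, const void *msg, uint32_t len)
{
    struct descr *d = &ep->send_q[ep->send_head % QLEN];
    if (d->valid || len > BUF_BYTES)
        return -1;                               /* queue full or message too big */
    uint32_t off = (ep->send_head % QLEN) * BUF_BYTES;
    memcpy(ep->segment + off, msg, len);
    d->offset = off;
    d->length = len;
    d->channel = chan;
    d->valid = 1;                                /* hand the slot to the NI */
    ep->send_head++;
    return 0;
}

/* Polling reception: check whether the NI has filled the next receive slot. */
int ep_poll(struct endpoint *ep, void *buf, uint32_t maxlen)
{
    struct descr *d = &ep->recv_q[ep->recv_head % QLEN];
    if (!d->valid)
        return 0;                                /* nothing has arrived */
    uint32_t n = d->length < maxlen ? d->length : maxlen;
    memcpy(buf, ep->segment + d->offset, n);
    d->valid = 0;                                /* give the slot back to the NI */
    ep->recv_head++;
    return (int)n;
}

int main(void)
{
    static struct endpoint ep;                   /* static: the segment is large */
    char reply[BUF_BYTES];
    ep_send(&ep, 1, "ping", 5);
    int n = ep_poll(&ep, reply, sizeof reply);   /* no NI here, so n == 0 */
    printf("queued 5 bytes on channel 1, polled %d bytes\n", n);
    return 0;
}

The point of the layout is that sends and receives touch only the shared queues and the segment; the kernel appears only when the endpoint is created and channels are mapped to VCIs.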

Raw U-Net Performance

- Expected low latency
- Expected high bandwidth, even with small messages

[Charts: U-Net AAL5 round-trip time (µs) and U-Net AAL5 bandwidth (Mbytes/s) vs. message size (bytes)]

U-Net issues

How much memory per process?
- the send queue is small; the receive queue should be larger
- the communication segment could be huge
- all this memory is pinned
- what are the limiting factors? main memory size? DMA space? SBus address space? SBA-200 SRAM?

How about a cheap NI that doesn't do U-Net?

Solution: emulated U-Net endpoints for applications which don't need the high performance
- same interface, but serviced by the kernel, not by the NI
- kernel muxes all emulated endpoints over its own endpoint
- involves a system call + copy
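One way to picture the emulated-endpoint fallback (again, a sketch with invented names): the application keeps the same send call, but on an emulated endpoint it resolves to a kernel path (system call plus copy, multiplexed over the kernel's own endpoint) rather than to the NI-serviced queues:

#include <stdio.h>

struct u_endpoint {
    int emulated;        /* 1: serviced by the kernel, 0: serviced by the NI */
    /* ... communication segment and queues as on the previous slides ... */
};

/* Stand-in for the kernel path: a system call; the kernel copies the data
 * and forwards it over its own real endpoint. */
static int kernel_emulated_send(struct u_endpoint *ep, const void *msg, unsigned len)
{
    (void)ep; (void)msg;
    return (int)len;
}

/* Stand-in for the direct path: enqueue a descriptor in the NI-visible send
 * queue, no kernel involvement. */
static int ni_direct_send(struct u_endpoint *ep, const void *msg, unsigned len)
{
    (void)ep; (void)msg;
    return (int)len;
}

/* Same interface for both endpoint kinds; only the back-end differs. */
int u_send(struct u_endpoint *ep, const void *msg, unsigned len)
{
    return ep->emulated ? kernel_emulated_send(ep, msg, len)
                        : ni_direct_send(ep, msg, len);
}

int main(void)
{
    struct u_endpoint fast = { 0 }, slow = { 1 };
    printf("%d %d\n", u_send(&fast, "x", 1), u_send(&slow, "x", 1));
    return 0;
}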

TCP/IP over U-Net

Parameters
- full IP & TCP headers
- regular TCP checksum (in addition to the AAL5 CRC!)
- one VCI per TCP connection

Improvements
- simple connection mux/demux based on VCIs
- custom buffering: no buffer copy, no strange fragmentation, simple allocation, pre-aligned
- straightforward acks: no strange delays
- flow control: can provide feedback to the application
- few buffers: 8KB window & 2KB PDUs

[Charts: bandwidth (Mbits/s) and round-trip time (µs) vs. message size (bytes) for U-Net TCP, Fore TCP, U-Net UDP, and Fore UDP]

U-Net AM: Micro-benchmarks

Performance
- small-message round-trip time: 66µs (AM = 1%)
- bulk transfer bandwidth: 15 MB/sec @ 3 Kbytes
- comparisons:
  CM-5: 12µs round-trip, 10 MB/sec bandwidth
  SP-2: 52µs round-trip, 35 MB/sec bandwidth
  CS-2: ~25µs round-trip, ~20 MB/sec bandwidth

Issues remaining to be resolved
- improving the flow control
- reducing the memory requirements

Split-C: Application benchmarks

Machines
- CM-5: 33 MHz SS-2
- CS-2: 40 MHz SuperSPARC
- ATM cluster: 50/60 MHz SuperSPARC

Results on 8 processors, normalized to the CM-5
- compute phases: ATM > CS-2 > CM-5
- small-message comm phases: CM-5 > CS-2 > ATM
- large-message comm phases: CS-2 > ATM > CM-5

Caveat: the ATM cluster has no coordinated scheduling

[Bar charts: execution time, split into network and CPU components, normalized to the CM-5, for matrix multiply (128x128, 16x16 blocks), sample sort (512K, small msg and bulk msg), radix sort (small msg and bulk msg), connected components, and conjugate gradient, on the CM-5, the ATM cluster, and the Meiko]

Other uses of U-Net (Student Projects)

Real-time video transport
- snarf X-window data directly from the 8-bit frame buffer
- transmit to a remote workstation
- paste into an X-window directly on the frame buffer
- over 90 Mbit/sec bandwidth using a custom (broken) protocol
- needs more research into real-time communication protocols

Distributed Shared Virtual Memory
- port of the Quarks DSM from UDP to U-Net: replaces the comm module
- U-Net works fine, but the optimizations in Quarks are for slow nets
- most of the time is spent sending page deltas instead of raw data

Remote Procedure Call
- ground-up implementation using the DCE stub compiler
- avoids complex marshalling if same architecture; reliable packet stream
- approx. 200µs round-trip RPC

Summary

An order-of-magnitude increase in network bandwidth requires system-wide rethinking!

Networking layers: in conventional systems the kernel is in the way
1. kernel layers cannot be optimized for all networks (from SLIP to ATM)
2. kernel layers cannot be optimized for all applications (telnet to video)
3. protection boundary crossings cost
- UDP/TCP are not a problem, but they're not cheap either

Application layers: the network ceases to be the bottleneck
- first, got to undo a decade of optimizations against slow UDP/TCP Ethernet
- then, got to think hard about compute phases & overall scheduling

Summary (cont.)

U-Net offers the full performance of ATM networks
- required a redesign of all network-related software
- the hardware is not a problem (could be faster though...)
- ATM is not a problem (could be better though...)

Result: a simple user-level network interface
- access model is independent of ATM and independent of the communication model
- full ATM performance without dedicating all the memory and all the processor to it
- supports hot parallel languages as well as legacy protocols
- tremendous protocol flexibility at the application level, enabling new modes of use

Next:
- true zero copy: communication segment = user space
- 622 Mbit/sec (?)