AN O/S PERSPECTIVE ON NETWORKS
Adem Efe Gencer
October 4th, 2012
Department of Computer Science, Cornell University

Papers
- Active Messages: a Mechanism for Integrated Communication and Computation. Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. In Proceedings of the 19th Annual International Symposium on Computer Architecture, 1992.
- U-Net: A User-Level Network Interface for Parallel and Distributed Computing. Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), December 1995.

Authors - I
- Thorsten von Eicken: founder of RightScale; Ph.D. in CS from the University of California, Berkeley; Assistant Professor at Cornell (1993-1999)
- David E. Culler: Chair, Electrical Engineering and Computer Sciences at Berkeley
- Seth Copen Goldstein: Associate Professor at Carnegie Mellon University
- Klaus Erik Schauser: Associate Professor at UCSB

Authors - II
- Thorsten von Eicken
- Anindya Basu: advised by von Eicken
- Vineet Buch: Google; M.S. from Cornell
- Werner Vogels: Research Scientist at Cornell (1994-2004); CTO and Vice President of Amazon

Active Messages
- Introduction
- Preview: Active Message
- Traditional programming models
- Active Messages
- Environment
- Split-C
- Message Driven Model vs. Active Messages
- Final Notes

Introduction
- Inefficient use of the underlying hardware
- Poor overlap between communication and computation
- High communication overhead
(Figure: timeline of communication vs. computation)

Introduction: Objective
- Eliminate inefficiency by increasing the overlap between communication and computation!

Preview: Active Message
- Solution: a novel asynchronous communication mechanism - active messages
- An active message consists of:
  - Header: the address of a user-space handler
  - Body: the arguments for that handler
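A minimal C sketch of this layout, under illustrative assumptions (the am_message struct, am_dispatch, and the fixed argument count are not from the paper): the header carries the handler's address, the body carries its arguments, so the receiver can jump straight into user code on arrival.

```c
#include <stdio.h>

/* Hypothetical argument payload: a fixed number of word-sized arguments. */
#define AM_MAX_ARGS 4

struct am_message;                                  /* forward declaration     */
typedef void (*am_handler_t)(struct am_message *);  /* user-space handler type */

/* An active message: header = handler address, body = handler arguments. */
struct am_message {
    am_handler_t handler;           /* header: address of a user-space handler */
    long         args[AM_MAX_ARGS]; /* body: arguments for that handler        */
};

/* On arrival, the receiver simply invokes the handler named in the header. */
static void am_dispatch(struct am_message *msg) {
    msg->handler(msg);
}

/* Example handler: integrates one received value into the computation. */
static void add_handler(struct am_message *msg) {
    printf("received value %ld for slot %ld\n", msg->args[0], msg->args[1]);
}

int main(void) {
    struct am_message msg = { add_handler, { 42, 7, 0, 0 } };
    am_dispatch(&msg);   /* stands in for "message arrives from the network" */
    return 0;
}
```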

Traditional Programming Models
Synchronous communication:
- 3-phase protocol
- Send / receive are blocking
- Simple; no buffering
- High communication latency
- Underutilized network bandwidth

Traditional Programming Models
Asynchronous communication:
- Send / receive are non-blocking
- The message layer buffers the message
- Communication and computation can overlap

Traditional Programming Models
- Synchronous:  pros - no buffering;  cons - no overlap, high communication latency
- Asynchronous: pros - overlap;       cons - buffering
- Active Messages sit between the two: overlap without buffering

Active Messages
- Requires the SPMD programming model
- Key optimization: messages are not buffered
  - Except for network transport buffering
  - For large messages, buffering is needed on the sender side
  - The receiving side pre-allocates structures
- Handlers do not block (deadlock avoidance)

Active Messages
Sender:
- Packs the message with a header containing the handler's address
- Passes the message to the network
- Continues computing
Receiver:
- Message arrival interrupts the computation and invokes the handler
Handler:
- Extracts the data from the network
- Passes it to the ongoing computation
Implementations of active messages differ across platforms (nCUBE/2 vs. CM-5). A sketch of this send/dispatch flow follows below.
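A hedged C sketch of the flow above, using a toy in-process "network" (a one-slot mailbox) as a stand-in for real transport; the names am_send, am_poll, and deposit_handler are illustrative, not the paper's API.

```c
#include <stdio.h>

struct am_message;
typedef void (*am_handler_t)(struct am_message *);

struct am_message {
    am_handler_t handler;   /* header: user-space handler address */
    long         arg;       /* body: handler argument             */
};

/* Toy stand-in for the network: a one-slot mailbox on the "receiver". */
static struct am_message mailbox;
static int mailbox_full = 0;

/* Sender: pack header + body, hand the message to the network, keep computing. */
static void am_send(am_handler_t h, long arg) {
    mailbox.handler = h;     /* a real system injects this into the NI */
    mailbox.arg     = arg;
    mailbox_full    = 1;
}

/* Receiver: message arrival invokes the handler named in the header. */
static void am_poll(void) {
    if (mailbox_full) {
        mailbox_full = 0;
        mailbox.handler(&mailbox);   /* handler runs to completion, never blocks */
    }
}

static long result;   /* state of the "ongoing computation" on the receiver */

static void deposit_handler(struct am_message *msg) {
    result += msg->arg;   /* integrate the data into the ongoing computation */
}

int main(void) {
    am_send(deposit_handler, 42);   /* sender continues computing after this */
    am_poll();                      /* receiver: arrival dispatches handler  */
    printf("result = %ld\n", result);
    return 0;
}
```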

Environment
Implementation on parallel supercomputers:
- nCUBE/2
  - Binary hypercube network
  - Network interface with 28 DMA channels
- CM-5 (Connection Machine 5)
  - Hypertree network
- Split-C: a parallel extension of C
(Photo: a CM-5 machine at the NSA)

Split-C: PUT and GET
- Non-blocking
- Built from a message formatter and a message handler
- Used in matrix multiplication to show the performance of active messages
A sketch of a GET-style remote read built on active messages follows below.
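A hedged sketch of how a split-phase GET can be built on active messages, reusing the same toy in-process dispatch idea; the names (am_send, get_request_handler, get_reply_handler) are illustrative assumptions, not Split-C's actual implementation.

```c
#include <stdio.h>

struct am_message;
typedef void (*am_handler_t)(struct am_message *);

struct am_message {
    am_handler_t  handler;      /* header: handler to run at the destination     */
    long         *remote_addr;  /* body: address to read on the "remote" node    */
    long         *local_dest;   /* body: where the reply should deposit the data */
    long          value;        /* body: payload carried by the reply            */
    volatile int *done_flag;    /* body: completion flag for the split-phase op  */
};

/* Toy "network": deliver a message by dispatching its handler immediately. */
static void am_send(struct am_message *msg) { msg->handler(msg); }

/* Runs on the remote node: read memory and fire a reply active message. */
static void get_reply_handler(struct am_message *msg);
static void get_request_handler(struct am_message *msg) {
    struct am_message reply = *msg;
    reply.handler = get_reply_handler;
    reply.value   = *msg->remote_addr;   /* read the requested word */
    am_send(&reply);
}

/* Runs back on the requesting node: deposit the value, mark completion. */
static void get_reply_handler(struct am_message *msg) {
    *msg->local_dest = msg->value;
    *msg->done_flag  = 1;
}

int main(void) {
    long remote_word = 1234;          /* pretend this lives on another node */
    long local_copy  = 0;
    volatile int done = 0;

    struct am_message req = { get_request_handler,
                              &remote_word, &local_copy, 0, &done };
    am_send(&req);                    /* phase 1: issue the GET              */
    /* ... overlap: the requester could compute here ...                     */
    while (!done) { /* phase 2: wait for the reply */ }
    printf("GET returned %ld\n", local_copy);
    return 0;
}
```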

Split-C: GET
- Matrix multiply achieves 95% performance in the nCUBE/2 configuration
- Xmit: time to inject the message into the network
- Hops: time for the network hops

Split-C: Matrix Multiply
- m: number of columns per processor
- The number of nodes is constant (N = 128)

Message Driven Model vs. Active Messages
Message driven model:
- Computation happens in message handlers
- A handler may suspend
- Memory allocation and scheduling on message arrival
Active Messages:
- Computation happens in the background (the handler only extracts the message)
- Immediate execution

Final Notes
- Not a new parallel programming paradigm!
- A primitive communication mechanism, used to implement such paradigms efficiently
- The need for the user-space handler address leads to programming difficulty
- Requires the SPMD model (though some modified versions lift this requirement)

U-Net
- Introduction
- U-Net
- Remove the Kernel from the Critical Path
- Communication in U-Net
- Protection
- Zero Copy
  - True zero copy vs. zero copy
- Environment
- Performance
- Final Notes

Introduction
- Low bandwidth and high communication latency are no longer hardware issues!
- The problem is the message path through the kernel
- Small messages:
  - Common in various applications
  - Require low round-trip latency
- Latency = processing overhead + network latency

Introduction: Objective
- Make communication FASTER!

U-Net
- Low communication latency
- High bandwidth even with small messages!
- Support for workstations with off-the-shelf networks
- Keeps providing full protection
- Kernel-controlled channel setup / tear-down
- TCP, UDP, Active Messages, etc. can be implemented on top

Remove the Kernel from the Critical Path
- Solution: enable user-level access to the network interface to implement user-level communication protocols
- Does not require OS modification or custom hardware
- Provides higher flexibility (application-specific protocols!)

Communication in U-Net
- Each process creates endpoints to access the network
- U-Net endpoint: the application's handle into the network
- Communication segment: a region of memory that holds message data
- Send / receive / free queues: hold message descriptors
A sketch of these data structures follows below.
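A minimal C sketch of these abstractions, with illustrative field names and sizes (not U-Net's actual layout): an endpoint owns a communication segment plus three descriptor queues shared with the network interface.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sizes, not U-Net's actual parameters. */
#define SEG_BYTES   (64 * 1024)
#define QUEUE_SLOTS 64

/* A message descriptor: where in the communication segment the data lives. */
struct unet_desc {
    uint32_t offset;   /* offset of the buffer within the communication segment */
    uint32_t length;   /* number of valid bytes                                 */
    uint32_t tag;      /* identifies the channel / destination endpoint         */
};

/* A descriptor ring shared between the application and the network interface. */
struct unet_queue {
    struct unet_desc slots[QUEUE_SLOTS];
    volatile uint32_t head;   /* consumer index */
    volatile uint32_t tail;   /* producer index */
};

/* A U-Net endpoint: the application's handle into the network. */
struct unet_endpoint {
    uint8_t comm_segment[SEG_BYTES];   /* holds message data (pinned in a real system) */
    struct unet_queue send_queue;      /* descriptors for outgoing messages            */
    struct unet_queue recv_queue;      /* descriptors for messages that have arrived   */
    struct unet_queue free_queue;      /* empty buffers handed to the NI for receives  */
};

int main(void) {
    static struct unet_endpoint ep;   /* zero-initialized endpoint */
    printf("endpoint: %zu-byte segment, %d-slot queues\n",
           sizeof ep.comm_segment, QUEUE_SLOTS);
    return 0;
}
```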

Communication in U-Net: Send

Communication in U-Net: Receive
A sketch of the send and receive paths follows below.
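A hedged C sketch of the send and receive paths over the endpoint structures sketched earlier (declarations repeated so the example stands alone); the handshake with the NI is simulated by a loopback copy, and all names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SEG_BYTES   4096
#define QUEUE_SLOTS 8

struct unet_desc  { uint32_t offset, length, tag; };
struct unet_queue { struct unet_desc slots[QUEUE_SLOTS]; uint32_t head, tail; };
struct unet_endpoint {
    uint8_t comm_segment[SEG_BYTES];
    struct unet_queue send_queue, recv_queue, free_queue;
};

/* Send: copy data into the communication segment, then post a descriptor. */
static void unet_send(struct unet_endpoint *ep, uint32_t tag,
                      const void *data, uint32_t len) {
    uint32_t slot = ep->send_queue.tail % QUEUE_SLOTS;
    uint32_t off  = slot * (SEG_BYTES / QUEUE_SLOTS);
    memcpy(ep->comm_segment + off, data, len);        /* base-level U-Net: one copy */
    ep->send_queue.slots[slot] = (struct unet_desc){ off, len, tag };
    ep->send_queue.tail++;                            /* the NI would now DMA it out */
}

/* Receive: poll the receive queue for a descriptor posted by the NI. */
static int unet_recv(struct unet_endpoint *ep, void *buf, uint32_t maxlen) {
    if (ep->recv_queue.head == ep->recv_queue.tail)
        return 0;                                      /* nothing has arrived yet */
    struct unet_desc *d = &ep->recv_queue.slots[ep->recv_queue.head % QUEUE_SLOTS];
    uint32_t n = d->length < maxlen ? d->length : maxlen;
    memcpy(buf, ep->comm_segment + d->offset, n);
    ep->recv_queue.head++;
    return (int)n;
}

int main(void) {
    static struct unet_endpoint ep;   /* zero-initialized */
    unet_send(&ep, 1, "hello", 5);

    /* Simulate the NI delivering the message into the same endpoint's
       receive queue (loopback), which a real NI would do via DMA. */
    ep.recv_queue.slots[0] = ep.send_queue.slots[0];
    ep.recv_queue.tail = 1;

    char buf[16] = {0};
    int n = unet_recv(&ep, buf, sizeof buf - 1);
    printf("received %d bytes: %s\n", n, buf);
    return 0;
}
```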

Protection
Owning-process protection:
- Endpoints, communication segments, and send/receive/free queues are accessible by the owning process only!
Tag protection:
- Outgoing messages are tagged with the originating address
- Incoming messages are delivered to the correct destination endpoint only!

Zero Copy
Base-level U-Net ("zero copy"):
- Send/receive go through a buffer in the communication segment (not really zero copy!)
- Requires a copy between application data structures and that buffer
Direct-Access U-Net (true zero copy):
- The communication segment spans the entire address space
- Has special hardware requirements
A sketch of the difference follows below.
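A hedged sketch of the difference, with illustrative names: base-level U-Net stages data through the communication segment (one copy), while direct-access U-Net lets a descriptor point at application memory directly (no copy). The NI-side behavior is only indicated in comments.

```c
#include <stdint.h>
#include <string.h>

#define SEG_BYTES 4096

/* Base-level U-Net: the send buffer must live inside the communication
   segment, so application data is copied into it before the descriptor
   is posted ("zero copy" in name, one copy in practice). */
static void send_base_level(uint8_t comm_segment[SEG_BYTES],
                            const void *app_data, uint32_t len) {
    memcpy(comm_segment, app_data, len);   /* the extra copy */
    /* post descriptor { offset = 0, length = len } to the send queue;
       the NI DMAs out of the communication segment */
}

/* Direct-access U-Net (true zero copy): the communication segment spans
   the whole address space, so the descriptor can name the application's
   own buffer and the NI DMAs straight from it, with no staging copy. */
static void send_direct_access(const void *app_data, uint32_t len) {
    (void)app_data; (void)len;
    /* post descriptor { address = app_data, length = len };
       requires NI hardware able to translate and protect arbitrary addresses */
}

int main(void) {
    static uint8_t segment[SEG_BYTES];
    const char msg[] = "hello";
    send_base_level(segment, msg, sizeof msg);
    send_direct_access(msg, sizeof msg);
    return 0;
}
```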

Environment
Implementation on an off-the-shelf hardware platform:
- SPARCstations
- Fore SBA-100 and SBA-200 ATM NICs
- SunOS 4.1.3
  - BSD-based Unix OS; ancestor of Solaris (after version 5.0)
- Split-C: a parallel extension of C

Performance: UDP vs. TCP
- UDP bandwidth as a function of message size
- TCP bandwidth as a function of the data generation rate of the application

Performance: UDP vs. TCP
Fast U-Net round trips:
- Approximately 7x speedup for TCP
- Approximately 5x speedup for UDP
- Fast U-Net TCP round trips allow the use of a small window size
(Figure: end-to-end round-trip latency as a function of message size)

Final Notes
- Low communication latency
- High bandwidth
- Good flexibility
- Clusters of workstations vs. supercomputers:
  - Comparable performance using U-Net
  - Additional system resources (parallel process scheduler, file systems, ...) are still needed