Ethan Kao, CS 6410, Oct. 18th, 2011


2 Active Messages: A Mechanism for Integrated Communication and Control. Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. In Proceedings of the 19th Annual International Symposium on Computer Architecture.
U-Net: A User-Level Network Interface for Parallel and Distributed Computing. von Eicken, Basu, Buch, and Werner Vogels. 15th SOSP, December 1995.

3 Parallel System:
- Multiple processors, one machine
- Shared memory
- Supercomputing

4 Distributed System:
- Multiple machines linked together
- Distributed memory
- Cloud computing

5 How to efficiently communicate?
- Between processors: Active Messages
- Between machines: U-Net

6 Thorsten von Eicken
- Berkeley Ph.D. -> Assistant professor at Cornell -> UCSB
- Founded RightScale, Chief Architect at Expertcity.com
David E. Culler
- Professor at Berkeley
Seth Copen Goldstein
- Berkeley Ph.D. -> Associate professor at CMU
Klaus Erik Schauser
- Berkeley Ph.D. -> Associate professor at UCSB

7 Existing message-passing multiprocessors had high communication costs
Message-passing machines made inefficient use of underlying hardware capabilities
- nCUBE/2
- CM-5
- Thousands of nodes interconnected
Poor overlap between computation and communication

8 Improve overlap between computation and communication
Aim for 100% utilization of resources
Low start-up costs for network usage

9 Asynchronous communication
Minimal buffering
Handler interface
Weaknesses:
- Address of the message handler must be known
- Design needs to be hardware specific?

10 Asynchronous communication mechanism
Messages contain user-level handler address
Handler executed on message arrival
- Takes message off network
- Message body is argument
- Does not block

11 Sender blocks until messages can be injected into network
Receiver interrupted on message arrival, then runs handler
User-level program pre-allocates receiving structures
- Eliminates buffering
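The mechanism on the two slides above can be sketched in a few lines. This is a minimal single-process simulation, not the nCUBE/2 or CM-5 implementation: `Node`, `register`, `send`, and `poll` are hypothetical names, an index into a handler table stands in for the handler address, and an explicit `poll` call stands in for the arrival interrupt.

```python
from collections import deque


class Node:
    """Toy node: a handler table, a simulated network input, and user data."""

    def __init__(self, name):
        self.name = name
        self.handlers = []            # list index stands in for a handler address
        self.network_in = deque()     # messages sitting on the simulated network
        self.counters = {"hits": 0}   # receiving structure pre-allocated by the user program

    def register(self, fn):
        """Add a handler and return its 'address' (an index, by assumption)."""
        self.handlers.append(fn)
        return len(self.handlers) - 1

    def send(self, dest, handler_id, *args):
        """Message head is the handler address; the body is the argument list."""
        dest.network_in.append((handler_id, args))

    def poll(self):
        """Stand-in for the arrival interrupt: run each handler to completion."""
        while self.network_in:
            handler_id, args = self.network_in.popleft()
            self.handlers[handler_id](self, *args)


def add_hits(node, amount):
    # The handler only pulls the data off the network and stores it
    # into the pre-allocated structure; it never blocks.
    node.counters["hits"] += amount


a = Node("a")
b = Node("b")
ADD = b.register(add_hits)   # the sender must know this address in advance
a.send(b, ADD, 3)
a.send(b, ADD, 4)
b.poll()
print(b.counters["hits"])
```

The sender needing `ADD` ahead of time is the weakness noted earlier: the address of the message handler must be known.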

12 Traditional send/receive models

13 Key optimization in AM vs. send/receive is reduction of buffering
AM can achieve a near order of magnitude reduction:
- nCUBE/2 AM send/handle: 11us/15us overhead
- nCUBE/2 async send/receive: 160us overhead
- CM-5 AM: <2us overhead
- CM-5 blocking: 86us overhead
Prototype of blocking send/receive on top of AM: 23us overhead
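As a quick check on the "near order of magnitude" claim, the ratios can be worked out from the numbers on this slide. Two assumptions are made here: the nCUBE/2 Active Messages cost is taken as the sum of the send and handle overheads, and the CM-5 figure of "<2us" is treated as 2us, so the CM-5 ratio is a lower bound.

```python
# Overheads in microseconds, copied from the slide.
ncube_am_send = 11
ncube_am_handle = 15
ncube_async_send_recv = 160
cm5_am_upper_bound = 2      # slide says "<2us"; 2 is used as an upper bound
cm5_blocking = 86

# Assumption: compare send + handle against the async send/receive total.
ncube_ratio = ncube_async_send_recv / (ncube_am_send + ncube_am_handle)

# Because the AM cost is an upper bound, this ratio is a lower bound.
cm5_ratio = cm5_blocking / cm5_am_upper_bound

print(f"nCUBE/2: about {ncube_ratio:.1f}x less overhead")
print(f"CM-5: at least {cm5_ratio:.0f}x less overhead")
```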

14 Non-blocking implementations of PUT and GET
- Implementations consist of a message formatter and a message handler
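One way to picture the formatter/handler split is the sketch below. It is a single-process simulation under stated assumptions: `NODES`, `inject`, `put`, `get`, `put_handler`, `get_handler`, and `run_until_idle` are all hypothetical names, and memory is a plain Python list. The formatters (`put`, `get`) only build and inject a message and return immediately; the handlers do the remote work, and a GET is answered by a PUT back to the requester.

```python
from collections import deque


class Node:
    def __init__(self, node_id, size):
        self.node_id = node_id
        self.memory = [0] * size
        self.inbox = deque()


NODES = {}  # node id -> Node; stands in for the interconnect


def inject(dest_id, handler, *args):
    """Place a message (handler plus arguments) on the destination's input."""
    NODES[dest_id].inbox.append((handler, args))


def put_handler(node, addr, value):
    """Runs on the destination: store the value at the given address."""
    node.memory[addr] = value


def get_handler(node, remote_addr, reply_to, local_addr):
    """Runs on the data owner: reply with a PUT to the requester."""
    inject(reply_to, put_handler, local_addr, node.memory[remote_addr])


def put(dest_id, addr, value):
    """Formatter for PUT: returns as soon as the message is injected."""
    inject(dest_id, put_handler, addr, value)


def get(src_id, dest_id, remote_addr, local_addr):
    """Formatter for GET: the request carries the return address."""
    inject(dest_id, get_handler, remote_addr, src_id, local_addr)


def run_until_idle():
    """Deliver messages until no node has anything left to handle."""
    progress = True
    while progress:
        progress = False
        for node in NODES.values():
            while node.inbox:
                handler, args = node.inbox.popleft()
                handler(node, *args)
                progress = True


NODES[0] = Node(0, 4)
NODES[1] = Node(1, 4)
put(1, 2, 99)      # node 1, address 2 <- 99
get(0, 1, 2, 0)    # node 0, address 0 <- node 1, address 2
run_until_idle()
print(NODES[1].memory, NODES[0].memory)
```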

15 Multiplication of C = A x B
- Each processor GETs one column of A after another and performs a rank-1 update with its own columns of B
- Achieves 95% of peak performance
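The per-processor loop can be sketched as follows, assuming matrices are stored as lists of columns and each processor owns a subset of the columns of B and C. `get_column` and `local_multiply` are hypothetical names; `get_column` is an ordinary function call standing in for the GET, so this sketch shows only the arithmetic and not the overlap of communication with computation that produces the reported performance.

```python
def get_column(a_cols, k):
    """Stand-in for a GET of column k of A from whichever node owns it."""
    return a_cols[k]


def local_multiply(a_cols, b_cols_owned, n):
    """Compute this processor's columns of C = A x B.

    a_cols: all n columns of A (fetched one at a time via get_column)
    b_cols_owned: dict mapping column index j -> column j of B owned here
    """
    c_cols = {j: [0.0] * n for j in b_cols_owned}
    for k in range(n):
        a_k = get_column(a_cols, k)
        # Rank-1 update: C[:, j] += A[:, k] * B[k, j] for each owned column j.
        for j, b_j in b_cols_owned.items():
            scale = b_j[k]
            c_j = c_cols[j]
            for i in range(n):
                c_j[i] += a_k[i] * scale
    return c_cols


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], stored column by column.
A_cols = [[1, 3], [2, 4]]
B_cols = [[5, 7], [6, 8]]

c0 = local_multiply(A_cols, {0: B_cols[0]}, 2)   # "processor 0" owns column 0
c1 = local_multiply(A_cols, {1: B_cols[1]}, 2)   # "processor 1" owns column 1
print(c0, c1)
```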

16 Computation occurs in the message handler
- Specialized hardware -> Monsoon, J-Machine
- Memory allocation and scheduling required upon message arrival
- Tricky to implement in hardware
- Expensive
- In Active Messages, handler only removes messages from the network
Threaded Abstract Machine (TAM)
- Parallel execution model based on Active Messages
- Typically no memory allocation upon message arrival
- No test results

17 Good performance
Not a new parallel programming paradigm
Evolutionary, not revolutionary
AM systems?
Multiprocessor vs. cluster

18 Thorsten von Eicken
Anindya Basu
- Advised by von Eicken
Vineet Buch
- M.S. from Cornell
- Co-founded Like.com -> Google
Werner Vogels
- Research Scientist at Cornell -> CTO of Amazon

19 Bottleneck of local area communication at kernel
- Several copies of messages made
- Processing overhead dominates for small messages
Low round-trip latencies growing in importance
- Especially for small messages
Traditional networking architecture inflexible
- Cannot easily support new protocols or send/receive interfaces

20 Remove kernel from critical path of communication
Provide low-latency communication in local area settings
Exploit full network bandwidth even with small messages
Facilitate the use of novel communication protocols

21 Flexible
Low latency for smaller messages
Off-the-shelf hardware, good performance
Weaknesses:
- Multiplexing resources between processes not in kernel
- Specialized NI needed?

22 User-level communication architecture independent
Virtualizes network devices
Kernel control of channel set-up and tear-down

23 Remove kernel from critical path: send/recv

24 U-Net:
- Multiplexes NI among all processes accessing network
- Enforces protection boundaries and resource limits
Process:
- Contents of each message and management of send/recv resources (i.e. buffers)

25 Main building blocks of U-Net:
- Endpoints
- Communication segments
- Message queues
Each process that wishes to access the network:
- Creates one or more endpoints
- Associates a communication segment with each endpoint
- Associates a set of send, receive, and free message queues with each endpoint
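A rough picture of how these building blocks fit together is sketched below. The layout is an assumption made for illustration only: the class names, the fixed buffer size, the `tag` field, and the choice to hand the first `recv_buffers` buffers to the free queue are all hypothetical and are not taken from the U-Net paper.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    offset: int    # where the message buffer starts in the communication segment
    length: int    # number of bytes described
    tag: int = 0   # hypothetical field naming the channel or destination


class Endpoint:
    """A process's handle onto the network: one segment plus three queues."""

    def __init__(self, segment_size=4096, buffer_size=1024, recv_buffers=2):
        self.buffer_size = buffer_size
        self.segment = bytearray(segment_size)   # communication segment
        self.send_queue = deque()                # descriptors of data to transmit
        self.recv_queue = deque()                # descriptors of data that arrived
        self.free_queue = deque()                # empty buffers offered for incoming data
        # Assumption: the first few buffers are offered for receiving,
        # the rest stay with the process for composing outgoing messages.
        for i in range(recv_buffers):
            self.free_queue.append(Descriptor(i * buffer_size, buffer_size))


ep = Endpoint()
print(len(ep.segment), len(ep.free_queue), ep.free_queue[1].offset)
```

A process that needs several independent channels would create several `Endpoint` objects, each with its own segment and queues.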

27 Prepare packet -> place it in the comm seg
Place descriptor on the send queue
U-Net takes descriptor from queue
Transfer packet from memory to network
[Figure: packet, U-Net NI, network. From Itamar Sagi]

28 U-Net receives message and identifies endpoint
Takes free space from free queue
Places message in communication segment
Places descriptor in receive queue
Process takes descriptor from receive queue and reads message
[Figure: U-Net NI, packet, network. From Itamar Sagi]
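The send and receive steps on the two slides above can be traced in a small simulation. Everything here is a sketch under assumptions: `ToyNI` stands in for the U-Net network interface, descriptors are plain tuples, buffers are a fixed 64 bytes, buffer 0 is reserved for sending, and a message is silently dropped if the destination has no free buffer. None of these details come from the U-Net paper.

```python
from collections import deque

BUF = 64  # hypothetical fixed buffer size in bytes


class Endpoint:
    def __init__(self, n_buffers=4):
        self.segment = bytearray(BUF * n_buffers)
        self.send_queue = deque()
        self.recv_queue = deque()
        # Buffer 0 is kept for sending; the others are offered for receiving.
        self.free_queue = deque(i * BUF for i in range(1, n_buffers))


class ToyNI:
    """Stand-in for the network interface; tags identify endpoints."""

    def __init__(self):
        self.endpoints = {}

    def attach(self, tag, endpoint):
        self.endpoints[tag] = endpoint

    def service_send(self, src):
        # Take descriptors off the send queue and move the data to the "network".
        while src.send_queue:
            dest_tag, offset, length = src.send_queue.popleft()
            packet = bytes(src.segment[offset:offset + length])
            self.deliver(dest_tag, packet)

    def deliver(self, dest_tag, packet):
        # Identify the endpoint, take a free buffer, copy in, post a descriptor.
        dest = self.endpoints.get(dest_tag)
        if dest is None or not dest.free_queue or len(packet) > BUF:
            return False  # assumption: undeliverable messages are dropped
        offset = dest.free_queue.popleft()
        dest.segment[offset:offset + len(packet)] = packet
        dest.recv_queue.append((offset, len(packet)))
        return True


def send(endpoint, dest_tag, payload):
    """Process side: place the packet in the segment, then queue a descriptor."""
    endpoint.segment[0:len(payload)] = payload
    endpoint.send_queue.append((dest_tag, 0, len(payload)))


def receive(endpoint):
    """Process side: read the message, then return the buffer to the free queue."""
    offset, length = endpoint.recv_queue.popleft()
    data = bytes(endpoint.segment[offset:offset + length])
    endpoint.free_queue.append(offset)
    return data


ni = ToyNI()
a, b = Endpoint(), Endpoint()
ni.attach(1, a)
ni.attach(2, b)
send(a, 2, b"ping")
ni.service_send(a)
msg = receive(b)
print(msg)
```

Note that no kernel call appears anywhere on the send or receive path; in this sketch the process and the interface communicate only through the segment and the queues.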

29 Only owning process can access:
- Endpoints
- Communication segments
- Message queues
Outgoing messages tagged with the originating endpoint
Incoming messages demultiplexed by U-Net

30 Base-level: "zero-copy"
- Comm segments not regarded as memory regions
- 1 copy between application data structure and buffer in comm segment
- Small messages held entirely in queue
Direct-access: true zero copy
- Comm segments can span entire process address space
- Sender can specify offset within destination comm seg for data
- Difficult to implement on existing workstation hardware

31 U-Net implementations support base-level
- Hardware for direct-access not available
- Copy overhead not a dominant cost
Kernel-emulated endpoints

32 Implemented on SPARCstations running SunOS 4.1.3
Fore SBA-100 interface
- Lack of hardware for CRC computation = overhead
Fore SBA-200 interface
- Uses custom firmware to implement base-level architecture
- i960 processor reprogrammed to implement U-Net directly
Small messages: 65us RTT vs. 12us for CM-5
Fiber saturated with packet sizes of 800 bytes

35 Traditional UDP and TCP over ATM performance disappointing
- < 55% max bandwidth for TCP
Better performance with UDP and TCP over U-Net
- Not bounded by kernel resources
- More state awareness = better application-network relationships

37 Main goals were to achieve low-latency communication and flexibility
NetBump
