AC: COMPOSABLE ASYNCHRONOUS IO FOR NATIVE LANGUAGES. Tim Harris, Martín Abadi, Rebecca Isaacs & Ross McIlroy

Similar documents
Extensions to Barrelfish Asynchronous C

AC: Composable Asynchronous IO for Native Languages

CS 153 Design of Operating Systems Winter 2016

INTERFACING REAL-TIME AUDIO AND FILE I/O

Remote Procedure Call (RPC) and Transparency

Ben Walker Data Center Group Intel Corporation

CSE 153 Design of Operating Systems

GPUfs: Integrating a file system with GPUs

CSE 153 Design of Operating Systems Fall 2018

Lecture 5: Synchronization w/locks

RDMA-like VirtIO Network Device for Palacios Virtual Machines

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs

Operating System Design Issues. I/O Management

1. With respect to space, linked lists are and arrays are.

Beyond Block I/O: Rethinking

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 22: Remote Procedure Call (RPC)

Chapter 3 Process Description and Control

Scalable Software Transactional Memory for Chapel High-Productivity Language

Systèmes d Exploitation Avancés

Predictable Programming on a Precision Timed Architecture

LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT

Breaking Down Barriers: An Intro to GPU Synchronization. Matt Pettineo Lead Engine Programmer Ready At Dawn Studios

Control Abstraction. Hwansoo Han

CODE TIME TECHNOLOGIES. Abassi RTOS. CMSIS Version 3.0 RTOS API

Exception-Less System Calls for Event-Driven Servers

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Multi-core Architecture and Programming

Operating System Services

Flash: an efficient and portable web server

Xen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016

Lightweight Concurrency Primitives for GHC. Peng Li Simon Peyton Jones Andrew Tolmach Simon Marlow

Distributed Systems. Distributed Shared Memory. Paul Krzyzanowski

JANUARY 28, 2014, SAN JOSE, CA. Microsoft Lead Partner Architect OS Vendors: What NVM Means to Them

Disruptor Using High Performance, Low Latency Technology in the CERN Control System

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Caching and Demand-Paged Virtual Memory

Operating System: Chap2 OS Structure. National Tsing-Hua University 2016, Fall Semester

Composable Shared Memory Transactions Lecture 20-2

JavaScript. The Bad Parts. Patrick Behr

Parallel Programming in Distributed Systems Or Distributed Systems in Parallel Programming

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Linux Storage System Analysis for e.mmc With Command Queuing

CSL373/CSL633 Major Exam Solutions Operating Systems Sem II, May 6, 2013 Answer all 8 questions Max. Marks: 56

COS 318: Operating Systems. Virtual Machine Monitors

CS3350B Computer Architecture

ECE7995 (6) Improving Cache Performance. [Adapted from Mary Jane Irwin s slides (PSU)]

Swapping. Operating Systems I. Swapping. Motivation. Paging Implementation. Demand Paging. Active processes use more physical memory than system has

CALIFORNIA SOFTWARE LABS

Final Exam Solutions May 11, 2012 CS162 Operating Systems

AD910A M.2 (NGFF) to SATA III Converter Card

Order Is A Lie. Are you sure you know how your code runs?

Vulkan (including Vulkan Fast Paths)

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Operating Systems. Synchronization

Lecture 4: Threads; weaving control flow

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 17 November 2017

Java Internals. Frank Yellin Tim Lindholm JavaSoft

PCI-EK01 Driver Level Programming Guide

Asynchronous OSGi: Promises for the masses. Tim Ward.

Design Principles for End-to-End Multicore Schedulers

Threads Implementation. Jo, Heeseung

Chapter 6: Process Synchronization. Module 6: Process Synchronization

Minerva. Performance & Burn In Test Rev AD903A/AD903D Converter Card. Table of Contents. 1. Overview

MySQL Performance Optimization and Troubleshooting with PMM. Peter Zaitsev, CEO, Percona

CS 167 Final Exam Solutions

Martin Kruliš, v

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

3/7/2018. Sometimes, Knowing Which Thing is Enough. ECE 220: Computer Systems & Programming. Often Want to Group Data Together Conceptually

Dell EMC CIFS-ECS Tool

Concurrent & Distributed Systems Supervision Exercises

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 21 November 2014

Simplify Software Integration for FPGA Accelerators with OPAE

Final Exam Preparation Questions

iseries Job Attributes

ECE 574 Cluster Computing Lecture 8

Memory Management Outline. Operating Systems. Motivation. Paging Implementation. Accessing Invalid Pages. Performance of Demand Paging

Computer Science 161

CPS 310 first midterm exam, 10/6/2014

CSL373/CSL633 Major Exam Operating Systems Sem II, May 6, 2013 Answer all 8 questions Max. Marks: 56

CS 134: Operating Systems

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

SPDK China Summit Ziye Yang. Senior Software Engineer. Network Platforms Group, Intel Corporation

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.

Myths and reality of communication/computation overlap in MPI applications

EbbRT: A Framework for Building Per-Application Library Operating Systems

10/17/ Gribble, Lazowska, Levy, Zahorjan 2. 10/17/ Gribble, Lazowska, Levy, Zahorjan 4

MINERVA. Performance & Burn In Test Rev AD912A Interposer Card. Table of Contents. 1. Overview

Lustre * Features In Development Fan Yong High Performance Data Division, Intel CLUG

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Programming with MPI

Threads in Java. Threads (part 2) 1/18

Threads, Synchronization, and Scheduling. Eric Wu

Processors, Performance, and Profiling

Non-Blocking Writes to Files

Design Patterns for Real-Time Computer Music Systems

CS510 Advanced Topics in Concurrency. Jonathan Walpole

MPI: A Message-Passing Interface Standard

Windows 7 Overview. Windows 7. Objectives. The History of Windows. CS140M Fall Lake 1

Transcription:

AC: COMPOSABLE ASYNCHRONOUS IO FOR NATIVE LANGUAGES Tim Harris, Martín Abadi, Rebecca Isaacs & Ross McIlroy

Synchronous IO in the Windows API Read the contents of h, and compute a result BOOL ProcessFile(HANDLE h, int *output) { BOOL ok = TRUE; DWORD size = FileSize(h); LPVOID buf = malloc(size); for (int i = 0; ok && i<(size/1024); i++) { ok &= ReadFile(h, buf+(i*1024), 1024, NULL, NULL); if (ok) { *output = free(buf); return ok; It exposes reads one by one to the IO system: no pipelining, poor performance If we want to process multiple files Implemented using synchronous calls to we must either (i) handle them ReadFile to read each block in turn serially, (ii) introduce threading 15/11/2011 2

Asynchronous IO static OVERLAPPED o[pd]; BOOL ProcessFile(HANDLE h, int *output) { DWORD size = FileSize(h); LPVOID buf = malloc(size); HANDLE cp = CreateIoCompletionPort(h, NULL, 0, 0); BOOL ok = TRUE; DWORD idx; LPOVERLAPPED *op; for (int i = 0; i<pd; i++) { PostQueuedCompletionStatus(cp, 0, i, &o[i]); for (int i = 0; i<size/1024; i++) { DWORD idx; LPOVERLAPPED *op; GetQueuedCompletionStatus(cp, NULL, &idx, &op, INFINITE); op >offset = i*1024; ReadFile(h, buf+(i*1024), 1024, NULL, op); for (int i = 0; i<pd; i++) { GetQueuedCompletionStatus(cp, NULL, &idx, &op, INFINITE); if (ok) { *output =... CloseHandle(cp); free(buf); return ok; Create completion port to synchronize with IOs Main loop, issuing IO operations Drain the completion port, waiting for all IOs to be done and asynchronous IO is not composable: ProcessFile is just a normal synchronous function (And this is with a lot of error handling code omitted) 15/11/2011 3

AC: asynchronous C Synchronous IO Easy to use Poor perf AC: similar programming model to synchronous, similar performance to async Async IO 1. Focus on performance and ability for use in low level code (e.g., monitor process in Barrelfish research OS) Difficult to use 2. Small number of constructs: starting async work, synchronize with async work, stop Good waiting perf for async work 3. Enable higher level features to be built over this join patterns, futures, 15/11/2011 4

AC: almost the same as synchronous BOOL ProcessFileAC(HANDLE h, int *output) { BOOL ok = TRUE; DWORD size = FileSize(h); LPVOID buf = malloc(size); do { for (int i = 0; ok && i<n; i++) { async { ok &= ReadFileAC(h, i*1024, buf+(i*1024), 1024); finish; if (ok) { *output = free(buf); return ok; 1. Use the AC primitives for data access 2. Use async to identify work that can continue while waiting: { async X ; Y start X and, if it blocks, continue with Y 3. Add finish to identify synchronization: do { X finish don t go past finish until all the async work in X is done 15/11/2011 5

AC is composable We can build layered abstractions using AC: user code can be called asynchronously, not just primitives: BOOL ProcessTwoFilesAC(HANDLE h1, HANDLE h2, int *output) { int o1, o2; BOOL ok = TRUE; do { async { ok &= ProcessFileAC(h1, &o1); async { ok &= ProcessFileAC(h2, &o2); finish; if (ok) *output = return ok; 15/11/2011 6

Making multiple requests // Counting semaphore to limit requests SemAC_t sem; SemACInit(&sem, 100); // Loop potentially starting lots of work do { for (int i = 0; i < 1000000; i++) { P_SemAC(&sem); async { V_SemAC(&sem); finish; // Do work 15/11/2011 7

Cancellation Suppose we want to add a timeout 2. It marks the named do..finish as cancelled, and pokes blocked BOOL TryProcessFileAC(HANDLE h, int timeout, 1. Suppose int *result) we execute operations dynamically within it { BOOL ok = TRUE; this cancel statement with_timoeut: do { async { ok = ProcessFileAC(h, result); cancel with_timeout; async { SleepAC(result); cancel with_timeout; finish; return ok; 5. The finish works as before: wait until both asyncs are done 3. SleepAC is woken and terminates early with FALSE 4. This second cancellation has no effect 15/11/2011 8

Cancellation BOOL ProcessFileAC(HANDLE h, int *output) { BOOL ok = TRUE; DWORD size = FileSize(h); 2. It pokes blocked LPVOID buf = malloc(size); operations in the do { 3. Cancellation BOOL TryProcessFileAC(HANDLE named do finish for (int i = h, 0; int ok propagates timeout, && i<n; i++) into *result) { { BOOL ok = TRUE; async { ProcessFileAC with_timoeut: do { ok &= ReadFileAC(h, i*1024, buf+(i*1024), 1024); async { ok = ProcessFileAC(h, result); cancel with_timeout; async { SleepAC(result); finish; cancel with_timeout; finish; if (ok) { return ok; *output = Suppose we want to add a timeout 4. The result of ProcessFileAC free(buf); forms the overall return result ok; 1. Suppose we execute this cancel statement 15/11/2011 9

Exact cancellation A convention: provided by built primitives, recommended for user code An *AC function either Returns TRUE if it completes successfully without cancellation Returns FALSE and undoes its partial side effects if it is cancelled Easier said than done 1. Remember cancellation is co operative 2. *NAC non cancellable primitives 3. For responsiveness, push clean up work to a thread pool 10

Implementation Thread Run queue Current FB Finish block (FB) ACTV Enclosing FB Count = 0 Completion AWI Stack main First Last Techniques: Per thread run queue of atomic work items ( AWIs ) Wait for next IO completion when AWI queue empty Book keeping data all stack allocated Cactus stack implementation, creating a new stack lazily when blocking (rather than eagerly on entering an async) 15/11/2011 11

Performance Function call latency (cycles) Direct (normal function call) async X() (X does not block) async X() (X blocks) 8 12 1692 Do not fear async Think about correctness: if the callee doesn t block then perf is basically unchanged 15/11/2011 12

Performance: Barrelfish ping pong test Using ring-buffer channel directly Using portable event-based interface Using AC (one side only) Using AC (both sides) MPI (Visual Studio 2008 + HPC-Pack 2008 SDK) Latency (cycles) 931 1134 1266 1405 2780 Ping pong test Minimum sized messages AMD 4 * 4-core machine Using cores sharing L3 cache 15/11/2011 13

2 phase commit between cores AC Async Sync 19 event handlers, 5 temporary structures 15/11/2011 14

Software RAID0 model, 64KB transfers 2 * Intel X25 M SSDs Each 250MB/s AC Async Sync 15/11/2011 15

Performance (embedded web server) Apache benchmark workload driver Running across WAN between CA and UK 15/11/2011 16

Summary AC: composable asynchronous IO for native languages Similar programming style to synchronous IO APIs Similar performance for asynchronous IO APIs Also in the paper: Operational semantics for core language Integration of IO with runtime system Available as part of the Barrelfish research OS http://www.barrelfish.org We re hiring: tharris@microsoft.com 15/11/2011 17

www.research.microsoft.com 2011 Microsoft Corporation. All rights reserved. This material is provided for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.