PTask: Operating System Abstractions To Manage GPUs as Compute Devices
Chris Rossbach, Jon Currey (Microsoft Research); Mark Silberstein (Technion); Baishakhi Ray, Emmett Witchel (UT Austin). SOSP, October 25, 2011

There are lots of GPUs: 3 of the top 5 supercomputers use GPUs, and they are in all new PCs, smart phones, and tablets. They are great for gaming and HPC/batch, yet unusable in other application domains. The GPU programming challenges: GPU and main memory are disjoint, and the GPU is treated as an I/O device by the OS. (PTask, SOSP 2011)

The same points again, with the connection called out: these two things are related, and we need OS abstractions. The GPU programming challenges remain: GPU and main memory are disjoint, and the GPU is treated as an I/O device by the OS.

Outline:
- The case for OS support
- PTask: Dataflow for GPUs
- Evaluation
- Related Work
- Conclusion

The traditional stack: a programmer-visible interface on top of OS-level abstractions on top of the hardware interface, with a 1:1 correspondence between OS-level and user-level abstractions.

The GPU stack: the programmer-visible interface (GPGPU APIs, shaders/kernels, language integration) sits on the DirectX/CUDA/OpenCL runtime, with only one OS-level abstraction! The consequences: 1. no kernel-facing API; 2. no OS resource management; 3. poor composability.

[Chart: GPU benchmark throughput (higher is better) with no CPU load vs. high CPU load. Benchmark: image convolution in CUDA on Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230.] The CPU scheduler and the GPU scheduler are not integrated!

The OS cannot prioritize cursor updates: WDDM + DWM + CUDA == dysfunction. [Chart: cursor latency over time; flatter lines are better. Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230.]

The motivating application: a gestural interface (NOT Kinect: this is a harder problem!). The pipeline: capture (raw camera images) -> xform (geometric transformation of the noisy point cloud) -> filter (noise filtering) -> detect (gesture detection, emitting hand events). It has high data rates, and its data-parallel algorithms are a good fit for the GPU.

Expressed with OS-provided tools (processes and pipes): #> capture | xform | filter | detect &
- Modular design: flexibility and reuse
- Utilizes heterogeneous hardware: data-parallel components on the GPU, sequential components on the CPU

GPUs cannot run an OS: they have a different ISA and a disjoint, non-coherent memory space. The host CPU must manage GPU execution: program inputs are explicitly transferred and bound at runtime, and device buffers are pre-allocated. User-mode apps must implement it all themselves: copy inputs from main memory to GPU memory, send commands, copy outputs back.

#> capture | xform | filter | detect &
Run through the OS, every stage pays: read() and write() calls into the OS executive, IRPs through camdrv, the GPU driver, and HIDdrv, and PCI transfers to copy data to the GPU and back for each stage before the GPU can run. The composed pipeline moves the same data across the PCI bus again and again.

What is needed:
- GPU analogues for the process API, the IPC API, and scheduler hints
- Abstractions that enable fairness/isolation, OS use of the GPU, and composition/data-movement optimization

PTask: Dataflow for GPUs

The PTask API's OS objects (they make OS resource management possible, and let programs specify where data goes, not how):
- ptask (parallel task): analogous to a process, but for GPU execution; has a priority for fairness and a list of input/output resources (like stdin, stdout)
- ports: data sources or sinks; can be mapped to ptask inputs/outputs
- channels: similar to pipes; connect arbitrary ports; specialized to eliminate double-buffering
- graph: a DAG of connected ptasks, ports, and channels
- datablocks: memory-space-transparent buffers

The gesture pipeline as a ptask graph: capture remains a process on the CPU; xform, filter, and detect become ptasks on the GPU, their ports (rawimg, cloud, f-in, f-out) connected by channels, with datablocks flowing through mapped memory and GPU memory. Data movement is optimized, and data arrival triggers computation.

Graphs are scheduled dynamically: ptasks queue for dispatch when their inputs are ready, and the queue is kept in dynamic priority order. The ptask priority is user-settable and normalized to OS priority. Multiple GPUs are supported transparently, and ptasks are scheduled for input locality.

A datablock is a logical buffer backed by multiple physical buffers, one per memory space:

  space  V  M  RW  data
  main   1  1  1   1
  gpu0   0  1  1   0
  gpu1   1  1  1   0

Buffers are created and updated lazily, and mem-mapping is used to share them across process boundaries. Buffer validity is tracked per memory space: writes invalidate the other views. Flags control access and data placement.

[Diagram: datablock state as it flows through capture -> xform -> filter (ports rawimg, cloud, f-in); the V/M/RW bits of the main-memory and GPU-memory views flip as stages read and write the block.]

With ports and datablocks, there is again a 1:1 correspondence between programmer and OS abstractions, and GPU APIs can be built on top of the new OS abstractions.

Evaluation

Implementation. Windows 7:
- Full PTask API implementation
- Stacked UMDF/KMDF driver; the kernel component handles mem-mapping and signaling, the user component wraps DirectX, CUDA, and OpenCL
- Syscalls become DeviceIoControl() calls
Linux 2.6.33.2:
- Changed OS scheduling to manage the GPU
- GPU accounting added to task_struct

Gesture-recognition evaluation on Windows 7, Core 2 Quad, GTX580 (EVGA). Implementations:
- pipes: capture | xform | filter | detect as separate processes
- modular: capture+xform+filter+detect in one process
- handcode: hand-optimized data movement, one process
- ptask: the ptask graph
Configurations:
- real-time: driven by cameras
- unconstrained: driven by in-memory playback

[Chart: runtime relative to handcode (lower is better), split into user and sys time, for handcode, modular, pipes, and ptask, on Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, GTX580 (EVGA).] Compared to hand-code, ptask achieves 11.6% higher throughput with lower CPU utilization (no driver program); compared to pipes, ~2.7x less CPU usage, 16x higher throughput, and ~45% less memory usage.

[Chart: PTask invocations/second (higher is better) as a function of PTask priority (2, 4, 6, 8), for a FIFO queue vs. the ptask scheduler, on Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, GTX580 (EVGA).]
- FIFO: queue invocations in arrival order
- ptask: aged priority queue with OS priority
- Graphs: 6x6 matrix multiply, with the same priority for every PTask node
PTask provides throughput proportional to priority.

[Chart: speedup over 1 GPU (higher is better) for synthetic graphs of varying depths, comparing priority-only and data-aware scheduling, on Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, 2 x GTX580 (EVGA).] Data-aware scheduling (priority + locality) provides the best throughput while preserving priority; graph depth > 1 is required for any benefit.

The EncFS experiment stack: user programs (an R/W benchmark, cuda-1, cuda-2) run over user libraries (EncFS, FUSE, libc), over the OS (Linux 2.6.33 with PTask), over the hardware (SSD1, SSD2, GPU). Simple GPU usage accounting restores performance:

         GPU/CPU  cuda-1 Linux  cuda-2 Linux  cuda-1 PTask  cuda-2 PTask
  Read   1.17x    -10.3x        -30.8x        1.16x         1.16x
  Write  1.28x    -4.6x         -10.3x        1.21x         1.20x

Setup: EncFS at nice -20, cuda-* at nice +19; AES with XTS chaining; SATA SSD, RAID; sequential R/W of 200 MB.

Related Work

Related work:
- OS support for heterogeneous platforms: Helios [Nightingale 09], Barrelfish [Baumann 09], Offcodes [Weinsberg 08]
- GPU scheduling: TimeGraph [Kato 11], Pegasus [Gupta 11]
- Graph-based programming models: Synthesis [Massalin 89], Monsoon/Id [Arvind], Dryad [Isard 07], StreamIt [Thies 02], DirectShow, TCP offload [Currid 04]
- Tasking: Tessellation, Apple GCD

Conclusions: OS abstractions for GPUs are critical; they enable fairness and priority, and they let the OS itself use the GPU. Dataflow is a good fit as the abstraction: the system manages data movement, and the performance benefits are significant. Thank you. Questions?