GPUfs: Integrating a file system with GPUs


GPUfs: Integrating a file system with GPUs. Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)

Traditional system architecture: applications run on top of the OS, which runs on the CPU.

Modern system architecture: accelerated applications and the OS run on the CPU, alongside manycore processors, FPGAs, hybrid CPU-GPUs, and GPUs.

The software-hardware gap is widening: the OS manages only the CPU, while accelerated applications reach manycore processors, FPGAs, hybrid CPU-GPUs, and GPUs through ad-hoc abstractions and management mechanisms.

On-accelerator OS support closes the programmability gap: native accelerator applications get OS services on the accelerator itself, coordinated with the host OS on the CPU.

GPUfs: file I/O support for GPUs. Outline: motivation, goals, understanding the hardware, design, implementation, evaluation.

Building systems with GPUs is hard. Why?

The goal of GPU programming frameworks: the CPU handles data transfers, GPU invocation, and memory management, so the GPU can run the parallel algorithm.

The headache for GPU programmers: the data transfer, invocation, and memory management code dwarfs the parallel algorithm itself. In half of the CUDA SDK 4.1 samples there are at least 9 lines of CPU code per 1 line of GPU code.
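
A minimal sketch of that host-side boilerplate, assuming a trivial kernel (the kernel name, sizes, and launch geometry are illustrative, not from the talk):

    // Even a one-line parallel algorithm needs CPU-side allocation,
    // two copies, a launch, and cleanup.
    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void scale_kernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;          // the actual parallel algorithm
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *host = (float *)calloc(n, sizeof(float));
        float *dev;
        cudaMalloc(&dev, bytes);                               // memory management
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // transfer in
        scale_kernel<<<(n + 255) / 256, 256>>>(dev, n);        // invocation
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // transfer out
        cudaFree(dev);
        free(host);
        return 0;
    }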

GPU kernels are isolated: the parallel algorithm on the GPU can reach the outside world only through CPU-managed data transfers, invocation, and memory management.

Example: accelerating a photo collage (http://www.codeproject.com/articles/36347/face-collage):

    while (Unhappy()) {
        Read_next_image_file();
        Decide_placement();
        Remove_outliers();
    }

CPU implementation: the whole loop runs inside the CPU application.

Offloading computations to the GPU: move the loop body from the CPU application to the GPU.

Offloading computations to the GPU means adopting the co-processor programming model: the CPU drives the data transfers, kernel starts, and kernel terminations.

Kernel start/stop overheads: copy to GPU, invoke, copy back to CPU, paying for cache flushes, invocation latency, and synchronization on every round trip. [Figure: CPU-GPU timeline of copy-to-GPU, kernel invocation, and copy-to-CPU]

Hiding the overheads: asynchronous invocation, manual data reuse management, and double buffering. [Figure: pipelined timeline overlapping copy-to-GPU, invocation, and copy-to-CPU]

The price is implementation complexity and management overhead. Why do we need to deal with low-level system details?
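
A hedged sketch of that double buffering with CUDA streams (the chunk workload, sizes, and two-stream structure are illustrative assumptions; for truly asynchronous copies the host buffer would also need to be pinned, e.g. allocated with cudaMallocHost):

    // While one chunk is processed on the GPU, the next is copied in
    // on the other stream.
    #include <cuda_runtime.h>

    __global__ void process_chunk(float *chunk, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] = chunk[i] * chunk[i];   // placeholder work
    }

    void pipeline(float *host, int nchunks, int chunk_elems) {
        size_t bytes = chunk_elems * sizeof(float);
        cudaStream_t stream[2];
        float *dev[2];
        for (int s = 0; s < 2; s++) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc(&dev[s], bytes);
        }
        for (int c = 0; c < nchunks; c++) {
            int s = c % 2;   // alternate between the two buffers/streams
            cudaMemcpyAsync(dev[s], host + (size_t)c * chunk_elems, bytes,
                            cudaMemcpyHostToDevice, stream[s]);
            process_chunk<<<(chunk_elems + 255) / 256, 256, 0, stream[s]>>>(
                dev[s], chunk_elems);
            cudaMemcpyAsync(host + (size_t)c * chunk_elems, dev[s], bytes,
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();
        for (int s = 0; s < 2; s++) {
            cudaStreamDestroy(stream[s]);
            cudaFree(dev[s]);
        }
    }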

The reason is that GPUs are peer processors, and they need OS services such as file I/O.

GPUfs: application view. GPU1, GPU2, and GPU3 call open("shared_file"), mmap(), and write() directly through GPUfs, which connects them to the host file system on the CPUs. The result: a system-wide shared namespace, a POSIX-like API, and persistent storage. [Figure: three GPUs issuing file calls through GPUfs to the host file system]

Accelerating the collage app with GPUfs: no CPU management code; the GPU opens and reads files itself.

Accelerating the collage app with GPUfs: read-ahead and the GPUfs buffer cache overlap computations and transfers.

Accelerating the collage app with GPUfs: the buffer cache also gives data reuse and random data access.
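
A hedged sketch of the collage loop as GPUfs device code (gopen/gread/gclose and O_GRDWR are the API names used in the talk; the file-name array, buffer handling, and helper functions are illustrative stand-ins for the app's logic):

    // The collage loop reads image files from the GPU itself:
    // no CPU-side transfer or invocation code.
    __device__ bool unhappy();
    __device__ void decide_placement(unsigned char *img);
    __device__ void remove_outliers(unsigned char *img);

    __device__ void collage_loop(char (*filenames)[64], int nfiles,
                                 unsigned char *image_buf, size_t image_bytes) {
        for (int f = 0; f < nfiles && unhappy(); f++) {
            int fd = gopen(filenames[f], O_GRDWR);   // open on the GPU
            gread(fd, image_bytes, image_buf, 0);    // read on the GPU
            decide_placement(image_buf);
            remove_outliers(image_buf);
            gclose(fd);
        }
    }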

The challenge: GPU hardware is fundamentally different from CPU hardware.

Massive parallelism: parallelism is essential for performance in deeply multi-threaded, wide-vector hardware. AMD HD5870: ~31,000 active threads; NVIDIA Fermi: ~23,000 active threads. (From M. Houston, A. Lefohn, and K. Fatahalian, "A trip through the architecture of modern GPUs.")

Heterogeneous memory: GPUs inherently impose high bandwidth demands on memory. GPU memory delivers 288-360 GB/s versus 10-32 GB/s for CPU memory, and the CPU-GPU interconnect provides only 6-16 GB/s, roughly 20x slower than local GPU memory.

How do we build a file system layer on this hardware?

GPUfs: a principled redesign of the whole file system stack. Relaxed FS API semantics for parallelism; relaxed FS consistency for heterogeneous memory; GPU-specific implementations of synchronization primitives, lock-free data structures, memory allocation, and more.

GPUfs high-level design. On the CPU, unchanged applications use the OS file API; GPUfs hooks into the OS file system interface above the host file system and disk. On the GPU, applications use the GPUfs file API, implemented by the GPUfs GPU file I/O library. A GPUfs distributed buffer cache (page cache) spans CPU and GPU memory. The GPU-side library copes with massive parallelism; the distributed buffer cache copes with heterogeneous memory.

Buffer cache semantics: local or distributed file system data consistency?

GPUfs buffer cache: a weak data consistency model with close(sync)-to-open semantics, as in AFS. [Figure: GPU2 calls write(1), then fsync(), then write(2); GPU1's subsequent open() and read(1) see the synced write, while write(2) is not yet visible to the CPU.] This fits the hardware: the remote-to-local memory performance ratio is similar to that of a distributed system.
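
What close(sync)-to-open consistency means for code, as a hedged sketch (the producer/consumer split and the gfsync(fd) signature are assumptions; gopen/gread/gwrite/gclose are the API names from the talk):

    // A write on one GPU becomes visible to other processors only after
    // gfsync()/gclose(), and a reader observes it only through a
    // subsequent gopen().
    __device__ void producer(const char *fname, float *data, size_t bytes) {
        int fd = gopen(fname, O_GRDWR);
        gwrite(fd, bytes, data, 0);   // not yet visible to CPU or other GPUs
        gfsync(fd);                   // now visible to later opens
        gclose(fd);
    }

    __device__ void consumer(const char *fname, float *data, size_t bytes) {
        int fd = gopen(fname, O_GRDWR);  // opened after sync: sees the data
        gread(fd, bytes, data, 0);
        gclose(fd);
    }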

In the paper: the on-GPU file I/O API, mirroring the CPU calls:

    CPU API        GPUfs API
    open/close     gopen/gclose
    read/write     gread/gwrite
    mmap/munmap    gmmap/gmunmap
    fsync/msync    gfsync/gmsync
    ftrunc         gftrunc

The changes in semantics are crucial.

Implementation bits (details in the paper): paging support; dynamic data structures and memory allocators; a lock-free radix tree; inter-processor communication (IPC); hybrid hardware/software barriers; a consistency module in the OS kernel. About 1.5K GPU LOC and about 600 CPU LOC.

Evaluation: all benchmarks are written as a GPU kernel, with no CPU-side development.

Matrix-vector product (inputs/outputs in files). Vector: 1x128K elements; page size = 2 MB; GPU = Tesla C2075. [Figure: throughput (MB/s, up to ~3,500) vs. input matrix size (280-11,200 MB) for CUDA pipelined, CUDA optimized, and GPU file I/O]

Word frequency count in text: count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree. English dictionary: 58,000 words. Challenges: dynamic working set; small files; lots of file I/O (33,000 files, 1-5 KB each); unpredictable output size.

Results:

    Workload                              8 CPUs    GPU-vanilla    GPU-GPUfs
    Linux source (33,000 files, 524 MB)   6h        50m (7.2X)     53m (6.8X)
    Shakespeare (1 file, 6 MB)            292s      40s (7.3X)     40s (7.3X)

GPUfs adds only 8% overhead over the hand-tuned vanilla GPU version, while supporting unbounded input/output sizes.

GPUfs is the first system to provide native access to host OS services from GPU programs. Code is available for download at https://sites.google.com/site/silbersteinmark/home/gpufs (http://goo.gl/ofj6j).

Our life would have been easier with: PCI atomics; preemptive background daemons; GPU-CPU signaling support; in-GPU exceptions; a GPU virtual memory API (host-based or device); and compiler optimizations for register-heavy libraries (the last seemingly accomplished in CUDA 5.0).

Sequential access to a file, three versions: (1) CUDA whole-file transfer: the CPU reads the file, then transfers it to the GPU in one piece; (2) CUDA pipelined transfer: the CPU reads chunks and transfers them to the GPU in a pipeline; (3) GPU file I/O: the GPU calls gmmap() directly.
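
A hedged sketch of the third version (gmmap/gmunmap are the GPUfs calls named in the talk, but the (fd, offset, size) signature shown here is an assumption modeled on mmap; the grid-strided summing workload is illustrative):

    // Each threadblock maps a window of the file with gmmap() and
    // processes it, with no CPU-side read/transfer code. Per GPUfs
    // semantics, the file calls are made cooperatively by all threads
    // of the block at the same point in the code.
    __device__ float consume(const float *w, size_t elems);

    __global__ void sequential_read(const char *fname, size_t file_bytes,
                                    size_t window_bytes, float *partial) {
        int fd = gopen(fname, O_GRDWR);
        float acc = 0.0f;
        for (size_t off = (size_t)blockIdx.x * window_bytes; off < file_bytes;
             off += (size_t)gridDim.x * window_bytes) {
            float *window = (float *)gmmap(fd, off, window_bytes);
            size_t elems = window_bytes / sizeof(float);
            for (size_t i = threadIdx.x; i < elems; i += blockDim.x)
                acc += window[i];                  // each thread sums a stride
            gmunmap(window, window_bytes);
        }
        partial[blockIdx.x * blockDim.x + threadIdx.x] = acc;
        gclose(fd);
    }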

Sequential read, throughput vs. page size. [Figure: throughput (MB/s, up to ~4,000) vs. page size (16K, 64K, 256K, 512K, 1M, 2M) for GPU file I/O, CUDA whole file, and CUDA pipeline.] Benefit: performance constraints are decoupled from application logic.

Yesterday: accelerators as co-processors. Tomorrow, with on-accelerator OS support: accelerators as peers.

What about software? Yesterday: the CPU at the center, with accelerators as co-processors. Tomorrow: CPU and GPUs as peers.

Set GPUs free!

Parallel square root on GPU. The same code runs in all of the thousands of GPU threads:

    gpu_thread(thread_id i) {
        float buffer;
        int fd = gopen(filename, O_GRDWR);
        offset = sizeof(float) * i;
        gread(fd, sizeof(float), &buffer, offset);
        buffer = sqrt(buffer);
        gwrite(fd, sizeof(float), &buffer, offset);
        gclose(fd);
    }

GPUfs impact on GPU programs: some memory overhead and register pressure, but very little CPU coding; makes exit-less GPU kernels possible; pay-as-you-go design.

Preserve CPU semantics? What does it mean to open/read/write/close/mmap a file in thousands of threads? GPU threads are different from CPU threads: a GPU kernel is a single data-parallel application whose threads are grouped into SIMD vectors.

GPUfs semantics (see more discussion in the paper): int fd = gopen(filename, O_GRDWR). One call per SIMD vector: bulk-synchronous, cooperative execution. One file descriptor per file: open()/close() results are cached on the GPU.
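
A hedged sketch of what those semantics look like from the programmer's side (the squaring workload is illustrative, and the kernel assumes the grid exactly covers the data, as in the square-root example above):

    // All threads reach each GPUfs call together, with no divergence
    // around the calls; gopen() executes once per SIMD vector,
    // cooperatively, and returns the same descriptor to every thread,
    // since open()/close() results are cached per file on the GPU.
    __global__ void per_vector_calls(const char *fname, float *out) {
        int fd = gopen(fname, O_GRDWR);
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        float v;
        gread(fd, sizeof(float), &v, i * sizeof(float));  // per-thread offset
        out[i] = v * v;
        gclose(fd);   // again one logical call, made by all threads together
    }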

GPU hardware characteristics: parallelism and heterogeneous memory.

API semantics: int fd = gopen(filename, O_GRDWR). On the CPU, this is a single call in a single thread. On the GPU, this same line runs in 100,000 GPU threads.