GPUnet: Networking Abstractions for GPU Programs. Author: Andrzej Jackowski


Transcription:


GPU programming problem

GPU distributed application flow (diagram): 1. receive request from the network, 2. execute on the GPU, 3. get results back into CPU memory, 4. send reply.
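To make the four steps concrete, here is a minimal sketch of this CPU-centric loop, one request at a time; recv_request, send_reply and handle_kernel are hypothetical placeholders, not names from the paper.

    #include <cuda_runtime.h>

    __global__ void handle_kernel(char *req, size_t n);   // hypothetical kernel
    size_t recv_request(char *dst, size_t cap);           // hypothetical network layer
    void   send_reply(const char *src, size_t n);         // hypothetical network layer

    void serve_one(char *host_buf, char *dev_buf, size_t cap) {
        size_t n = recv_request(host_buf, cap);                   // 1. recv req
        cudaMemcpy(dev_buf, host_buf, n, cudaMemcpyHostToDevice);
        handle_kernel<<<64, 256>>>(dev_buf, n);                   // 2. exec on GPU
        cudaMemcpy(host_buf, dev_buf, n, cudaMemcpyDeviceToHost); // 3. get results
        send_reply(host_buf, n);                                  // 4. send repl
    }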

The tip of the iceberg: double buffering (not rocket science, still annoying)
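Since the slide only names the technique, here is a minimal CUDA double-buffering sketch under stated assumptions: process_chunk is a hypothetical kernel, and host_src must be pinned (cudaMallocHost) for the copy/compute overlap to actually happen.

    #include <cuda_runtime.h>

    __global__ void process_chunk(char *data, size_t n);  // hypothetical kernel

    // Two device buffers + two streams: while the kernel crunches one buffer,
    // the copy engine fills the other. Within each stream, copy -> kernel order
    // is guaranteed; across streams, the work overlaps.
    void process_pipelined(const char *host_src, size_t chunk, int n_chunks) {
        char *dev[2];
        cudaStream_t s[2];
        for (int i = 0; i < 2; i++) {
            cudaMalloc(&dev[i], chunk);
            cudaStreamCreate(&s[i]);
        }
        for (int c = 0; c < n_chunks; c++) {
            int b = c & 1;  // ping-pong between the two buffers
            cudaMemcpyAsync(dev[b], host_src + (size_t)c * chunk, chunk,
                            cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<128, 256, 0, s[b]>>>(dev[b], chunk);
        }
        cudaDeviceSynchronize();
        for (int i = 0; i < 2; i++) { cudaFree(dev[i]); cudaStreamDestroy(s[i]); }
    }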

GPU distributed application flow (same diagram as before). If we want to use the GPU efficiently, this design gets more complicated: we have to batch requests, add multithreading or asynchrony, etc.


GPUfs: Integrating a File System with GPUs [2013] (diagram: CPU & Memory, Filesystem, GPU & Memory)

GPUfs: Integrating a File System with GPUs [2013] (diagram: Filesystem, GPU)

GPUfs - main goals
- POSIX-like filesystem API for GPU programs
- Making common API calls efficient when executed in a data-parallel fashion
- Consistency semantics that minimize expensive CPU <-> GPU transfers
- Generic, software-only buffer cache with key features like read-ahead, data-transfer scheduling, and asynchronous writes
- Proof-of-concept running on NVIDIA Fermi cards (GeForce 400/500 series, etc.)
- Quantitative evaluation

GPUfs implementation
- GPUfs API - a library linked into your GPU application, providing among others gread/gwrite, gopen/gclose, gfstat, gfsync
- GPU file state - a structure holding the open file descriptors
- GPU buffer cache - one per application (instead of one per filesystem), but many kernels of one process can use the same cache!
- CPU-GPU RPC/daemon - the layer responsible for CPU <-> GPU communication (the CPU does the actual file reading); there is a lot of interesting material about it in the paper!
- GPUfs consistency module - ensures that each file has only one writer (a GPUfs assumption)

Example: gread(). gread/gwrite are like pread/pwrite (offsets are passed explicitly; there is no seek pointer). Calls have per-threadblock granularity, which is why you can't open a file per thread, but you can open a file per threadblock.

Performance results. GPUfs code can be faster than standard CUDA code. In the optimized CUDA version we read 1 GB chunks into multiple buffers, which sometimes causes slowdowns due to buffer-cache paging. GPUfs reads are more granular (2 MB per page). Moreover, the GPUfs code is simpler.

GPUfs code example (from GitHub):

    zfd = gopen(src, O_GRDONLY);
    zfd1 = gopen(dst, O_GWRONCE);
    filesize = fstat(zfd);
    // each threadblock copies its own strided range of the file
    for (size_t me = blockIdx.x * FS_BLOCKSIZE; me < filesize;
         me += FS_BLOCKSIZE * gridDim.x) {
        int toread = min((unsigned int)FS_BLOCKSIZE, (unsigned int)(filesize - me));
        if (toread != gread(zfd, me, toread, scratch)) { assert(NULL); }
        if (toread != gwrite(zfd1, me, toread, scratch)) { assert(NULL); }
    }
    gclose(zfd);
    gclose(zfd1);

Summary of GPUfs
- For now there are some hardware limitations (because the GPU is designed as a coprocessor)
- Code written with GPUfs can be simple and efficient
- You can't get rid of the CPU, but you can get rid of CPU code (in the GPUfs test, the CPU code had only 1 line: exec the GPU program)


GPUnet design: server. In GPU-accelerated servers, large batches are needed to amortize CPU-GPU communication. With GPUnet we don't have to batch requests, so the server is lightweight and simple to use. Moreover, we don't have to mix CPU and GPU code to handle a single request.

GPUnet design: performance benefits. CPU-mediated communication also penalizes small requests, which is why batching is otherwise needed.

GPUnet design: library API
- Sockets are commonly used and versatile
- Functions similar to standard socket libraries: gconnect, gbind, gsend, grecv [..]
- "All threads in a threadblock must invoke the same GPUnet call together in a coalesced manner: with the same arguments, at the same point in application code [...]"

Client snippet from the slide (a server-side counterpart sketch follows below):

    __shared__ int sock;
    __shared__ uchar buf[BUF_SIZE];
    int ret, i;

    while ((sock = gconnect_in(addr)) < 0) {}
    assert(sock >= 0);

    for (i = 0; i < NR_MSG; i++) {
        int sent = 0;
        do {
            ret = gsend(sock, buf + sent, BUF_SIZE - sent);
            if (ret < 0) { goto out; }    // error: bail out
            else         { sent += ret; } // partial send, like send()
        } while (sent < BUF_SIZE);
    }
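The slide shows the client side; a hedged sketch of the matching server side follows. gbind_in, gaccept and grecv are assumed by analogy with gconnect_in/gsend and may not match GPUnet's exact signatures.

    __shared__ int lsock, sock;
    __shared__ uchar buf[BUF_SIZE];

    lsock = gbind_in(addr);            // bind a listening socket (assumed name)
    sock = gaccept(lsock);             // every thread of the block calls together
    int got = 0;
    do {
        int ret = grecv(sock, buf + got, BUF_SIZE - got);
        if (ret < 0) break;            // error on the connection
        got += ret;                    // grecv may return partial data, like recv()
    } while (got < BUF_SIZE);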

GPUnet design: library assumptions
- Simplicity - provide a sophisticated, effective solution behind an easy-to-use socket abstraction
- Compatibility with GPU programming - support common GPU programming idioms
- Compatibility with CPU endpoints - a CPU can talk with a remote GPU using sockets
- Network Interface Controller sharing - both CPU and GPU can use the NIC at the same time
- Namespace sharing - a single network namespace (IPs, ports, socket names) among CPUs and GPUs

Peer-to-peer Direct Memory Access
- P2P DMA allows graphics cards to exchange data without going through CPU memory
- In the same way, the GPU can work with other PCIe devices, e.g. a network interface controller
- Improves both throughput and latency
- Moreover, it relieves the CPU and host memory

Remote Direct Memory Access
- DMA over the network
- Low latency, high bandwidth
- No context switches needed
- The CPU isn't involved (so the CPU cache is not touched)
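As a concrete illustration of the previous two slides: on Linux with NVIDIA's GPUDirect RDMA module (nv_peer_mem) loaded, a verbs NIC can register GPU memory directly, so RDMA operations move data straight over PCIe without a bounce through host RAM. A minimal sketch, error handling elided:

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    // Register GPU memory with the HCA so the NIC DMAs straight to/from the GPU.
    // Assumes the GPUDirect RDMA (nv_peer_mem) kernel module is loaded.
    struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len) {
        void *gpu_buf = NULL;
        cudaMalloc(&gpu_buf, len);          // device memory, not host memory
        return ibv_reg_mr(pd, gpu_buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_REMOTE_READ);
    }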

GPUnet design: rsockets compatibility. rsockets is a library that provides a socket-level API over RDMA. GPUnet extends rsockets, offering a similar API to GPU code. rsockets and GPUnet endpoints are compatible with each other.

GPUnet design: for discrete GPUs. Discrete means not integrated (see the picture). A discrete GPU outperforms a hybrid (CPU-GPU) configuration by an order of magnitude. The authors believe that discrete GPUs will stay popular for years [it's 2016 and they're still right!]

Design vs Implementation

We can't get rid of the CPU, again!
- Flow control - letting the sender block when the receiver's buffer is full - cannot be implemented using only the GPU:
- The GPU can't access the HCA's (host channel adapter, the InfiniBand NIC) doorbell registers to trigger a send operation
- The GPU can't map the completion queue, the structure that delivers completion notifications, e.g. when new data arrives
- So the CPU is necessary to assist every GPU send and receive operation
- This assist is implemented efficiently by mapping GPU memory into the CPU's address space (see the sketch below)
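A hedged sketch of that CPU-assist pattern: the GPU raises a flag that a host proxy polls, and the proxy performs the actual send. GPUnet maps GPU memory into the CPU's address space; the portable stand-in below uses CUDA zero-copy host memory (cudaHostAllocMapped) instead, and post_send is a hypothetical placeholder for ringing the HCA's doorbell.

    #include <cuda_runtime.h>

    // A doorbell the GPU can raise and the CPU can poll.
    struct doorbell { volatile int pending; size_t len; };

    __global__ void gpu_side(doorbell *db) {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            db->len = 4096;            // describe the send we want the CPU to post
            __threadfence_system();    // make the write visible to the host
            db->pending = 1;           // "ring" the doorbell
        }
    }

    void cpu_proxy(void) {
        doorbell *db;
        cudaHostAlloc((void **)&db, sizeof(*db), cudaHostAllocMapped);
        db->pending = 0;
        gpu_side<<<1, 32>>>(db);       // with UVA the host pointer works on the GPU
        while (!db->pending) { }       // poll until the GPU asks for a send
        // post_send(db->len);         // hypothetical: CPU triggers the HCA here
        cudaDeviceSynchronize();
        cudaFreeHost(db);
    }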

Non-RDMA transport. GPUnet also works on hardware without RDMA, since RDMA is not yet widespread. This transport has higher latency and needs bigger buffers than the RDMA-based one.

Example: GPUnet MapReduce
- (Almost) 0 lines of CPU code
- GPUfs for file access
- GPUnet for network usage

Example: GPUnet MapReduce. A pity that the scalability tests were so small (4 nodes max), and only two machines used GPU-NIC RDMA (the other two used bounce buffers).

                 phoenix++       Hadoop          GPUnet          GPUnet
                 8 cores         8 cores         1 GPU           4 GPUs
                 (1 node)        (1 node)        (1 node)        (4 nodes)
    K-means      12.2 s          71.0 s          5.6 s           1.6 s
    Word-count   6.23 s          211.0 s         29.6 s          10.0 s

Example: Face Verification Server
1. The CPU client sends an image with a label
2. The GPU server gets the label's image histogram from memcached
3. The GPU server computes the image's histogram
4. The GPU server compares the histograms
5. The GPU server sends a reply with the result to the CPU client

Three different GPU server implementations
- CPU version - the whole server runs as a multi-threaded CPU process {506 lines of code}
- GPU (no GPUnet) version - the CPU version with the algorithm executed as CUDA code {596 lines of code}
- GPUnet version - written as a native GPU application with GPUnet {245 lines of code}

Performance results

Scaling results: CPU & GPU servers working together!

Limitations
- GPUnet doesn't provide a mechanism for migrating sockets between CPU and GPU (which could be convenient, e.g. for load balancing)
- GPUnet relies on consistent reads of GPU memory when it is used by both a kernel and NIC RDMA; tests with CRC codes showed no violations, despite the lack of a guarantee from NVIDIA
- GPUnet, like GPUfs, is available on GitHub, but it is of prototype quality

Summary
- GPUnet shows a new approach to GPU programming
- It's not the fastest possible solution, because you can always write dedicated code using P2P DMA & RDMA, but it's definitely fast and simple to use!
- There are a lot of interesting low-level investigations in the paper
- You can learn even more by studying the GPUfs/GPUnet source code

Questions, thoughts? https://github.com/gpufs/gpufs https://github.com/ut-osa/gpunet