An Evaluation of Unified Memory Technology on NVIDIA GPUs

Wenqiang Li (1), Guanghao Jin (2), Xuewen Cui (1), Simon See (1,3)

(1) Center for High Performance Computing, Shanghai Jiao Tong University, China
(2) Tokyo Institute of Technology, Japan
(3) NVIDIA, Singapore

May 4, 2015

Outline

1. Introduction
   - Unified Memory Programming Model
   - Problem Statements
   - Contributions
2. Evaluation Approach
   - Hardware Platforms
   - Benchmarks
3. Results and Discussion
   - Performance Results
   - Discussion
4. Conclusion and Future Work
   - Conclusion
   - Related Work
   - Future Work

Introduction: Unified Memory Programming Model

[Figure: Traditional and Unified Memory access models. Traditional: the CPU uses system memory and the GPU uses GPU memory. Unified: the CPU and GPU share a single Unified Memory.]

Introduction: Unified Memory Programming Model

Code written for the Unified Memory model is simpler and easier to follow than the traditional version.

Listing 1: Normal

    begin
        elem, d_elem;
        malloc(&elem); cudaMalloc(&d_elem);
        Initialize elem;
        MemcpyFromHostToDevice;
        launch kernel;
        MemcpyFromDeviceToHost;
    end

Listing 2: Unified Memory

    begin
        elem;
        cudaMallocManaged(&elem);
        Initialize elem;
        launch kernel;
        Sync;
    end

Figure: Code examples

Introduction: Unified Memory Programming Model

Unified Memory is a component of the new CUDA programming model.

- It defines CPU and GPU memory as a single coherent memory space.
- It simplifies GPU programming and migrates data transparently.
- Data movement still takes place.

Requirements:

- a GPU with SM architecture 3.0 or higher (Kepler or newer)
- a 64-bit host application and operating system

Introduction: Problem Statements

- Does the Unified Memory programming model affect application performance, and if so, why?
- Do memory transfers still occur on a device whose physical memory is shared between the CPU and GPU?
  - TK1 is the first mobile platform with a Kepler GPU.
  - TK1 supports Unified Memory and offers 2 GB of physically shared memory.

Introduction: Contributions

To the best of our knowledge, our contributions are the following:

- We explain and validate that the performance loss is caused by redundant memory transfers and page faults.
- We propose a memory state transition diagram and explain when redundant memory transfers happen.
- We study the memory behavior on TK1. It is very likely that memory transfers between the CPU and GPU still occur under the Unified Memory programming model.

Evaluation Approach: Hardware Platforms

We conducted the experiments on a GPU node with a K40 and on the NVIDIA TK1.

[Figure: Anatomy pictures. (a) GPU node. (b) TK1 block diagram: four ARM cores, 192-core GPU, memory controller, PCIe root complex, GbE, USB, UART, GPIO, H.264, HDMI, DP, CSI-2.]

Evaluation Approach: Hardware Platforms

We conducted the experiments on our supercomputer π and the NVIDIA TK1.

Table: Architecture and characteristics

Device   | Processor | Features
GPU node | CPU       | 2 Intel Xeon CPUs, 8 cores, 16 threads
GPU node | GPU       | K40, 15 SMXs, 2880 CUDA cores, 12 GB of GDDR5 memory
TK1      | CPU       | 4 ARM Cortex
TK1      | GPU       | 1 SMX, 192 CUDA cores, 2 GB of DDR3L memory shared with the CPU

Evaluation Approach: Benchmarks

We selected a series of benchmarks, from simple to complex, to evaluate the performance of Unified Memory.

- Matrix Multiplication
  - from the CUDA 6.0 SDK Samples
- Diffusion 3D Benchmark
  - a 3D 7-point stencil code developed by Tokyo Tech
  - diffusion3d variants: standard, register reuse, and shared memory
- Parboil Benchmark Suite
  - a collection of applications from many different scientific and commercial fields

Evaluation Approach: Benchmarks

To test the performance of the Unified Memory programming model, we:

- Modified these benchmarks to use Unified Memory:
  - removed all references to device pointers
  - removed explicit memory transfers
  - converted host pointers to managed memory pointers
- Measured the running time and averaged the results of ten runs
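The measurement step above can be sketched on the host side as follows. This is a minimal illustration, not the authors' harness: the helper names `mean`, `time_runs`, and `normalized_runtime` are hypothetical, and `work` stands in for one complete benchmark run. The normalization (Unified Memory runtime divided by the runtime of the version without Unified Memory) follows the way the results are reported.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of runtimes (seconds).
inline double mean(const std::vector<double>& samples) {
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// Times `work` `runs` times with a monotonic clock and returns each runtime
// in seconds. `work` is a hypothetical stand-in for one complete benchmark
// run (allocation, initialization, kernel, synchronization, checking).
inline std::vector<double> time_runs(const std::function<void()>& work,
                                     int runs = 10) {
    assert(runs > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return samples;
}

// Mean Unified Memory runtime divided by mean baseline runtime.
// A value above 1.0 means the Unified Memory version is slower.
inline double normalized_runtime(const std::vector<double>& unified,
                                 const std::vector<double>& baseline) {
    return mean(unified) / mean(baseline);
}
```

With this convention, a normalized runtime of 1.0 means the Unified Memory version matches the original, and larger values indicate a slowdown.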

Results and Discussion: Performance Results

Performance results for these benchmarks, normalized to the version without Unified Memory.

[Figure: Performance of the benchmark suites on the K40. Y axis: Normalized Runtime(s). Benchmarks: cutcp, spmv, bfs, mri-q, stencil, shared memory, register reuse, standard, MM-CUBLAS. Green and purple bars correspond to the Unified Memory version with CUDA 6.0 and CUDA 6.5.]

Results and Discussion: Performance Results

The applications using Unified Memory perform worse than the normal versions.

[Figure: Part of the experiment results on the TK1. Y axis: Normalized Runtime(s). Benchmarks: MatrixMul, Diffusion-reg, cutcp, stencil, spmv.]

Results and Discussion: Timeline

- We generated the PTX code (the pseudo-assembly code for CUDA) for both versions. The kernel part is the same.
- We profiled the Diffusion3D standard benchmark and generated the timeline using NVIDIA Visual Profiler.

[Figure: Diffusion3D standard timeline. (a) Normal version. (b) Unified Memory version.]

Results and Discussion: Page Faults

A page fault is a type of interrupt raised when a program accesses a memory page that is mapped into the virtual address space but not loaded in physical memory.

- An extra DtoH (device-to-host) memory copy coincides with the page faults.
- The DtoH transfer costs more time and overlaps with initialization and accuracy checking.
- The total size of the page faults equals the size of the array used in the program.

[Figure: Diffusion3D standard timeline. (a) Normal version. (b) Unified Memory version.]

Results and Discussion: Page Faults

- The data are invalid to the CPU after the GPU has accessed them.
- The memory returned by cudaMallocManaged() is also invalid to the CPU.
- The performance loss is caused by page faults and redundant memory transfers.
- These two problems stem from the operating system and the hardware, so they cannot be solved at present.

Results and Discussion: Redundant Memory Transfers

Case: no kernel calls

- The size of x is 16 KB, or four pages.
- A redundant DtoH memory transfer occurs.

Steps: Allocate x, Initialize x, Sync. Transfers observed: DtoH.

[Table: Device to Host (bytes); CPU page faults: 4]
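The page arithmetic behind this case can be written out as a short sketch. It assumes a 4 KB page size, which is inferred from "16 KB, or four pages" rather than stated, and assumes one CPU page fault per page on first touch. The names `kPageBytes`, `pages_for`, `first_touch_faults`, and `faulted_bytes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed page size: 4 KB, inferred from "16 KB, or four pages".
constexpr std::size_t kPageBytes = 4096;

// Number of pages needed to hold `bytes`, rounding up to whole pages.
constexpr std::size_t pages_for(std::size_t bytes) {
    return (bytes + kPageBytes - 1) / kPageBytes;
}

// CPU page faults expected when the host touches every page of a freshly
// allocated managed buffer once (one fault per page, by assumption).
inline std::size_t first_touch_faults(std::size_t bytes) {
    assert(bytes > 0);
    return pages_for(bytes);
}

// Bytes covered by a given number of page faults.
constexpr std::size_t faulted_bytes(std::size_t page_faults) {
    return page_faults * kPageBytes;
}
```

Under these assumptions a 16 KB array yields four faults on initialization, and four faults cover 16 KB, which is consistent with the observation that the total size of the page faults equals the array size.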

Results and Discussion: Redundant Memory Transfers

Case: the data is read-only for the GPU

- The size of x is 16 KB, or four pages.
- Unified Memory always assumes the GPU has the newest data.

Steps: Allocate x, Initialize x, Launch kernel (the GPU reads x), Sync, CPU reads x. Transfers observed: DtoH, HtoD, DtoH.

[Table: Host to Device (bytes); Device to Host (bytes); CPU page faults: 8]

Results and Discussion: Redundant Memory Transfers

Case: the kernel modifies the data

- The size of x is 16 KB, or four pages.

Steps: Allocate x, Initialize x, Launch kernel (the GPU writes x), Sync, CPU reads x. Transfers observed: DtoH, HtoD, DtoH.

[Table: Host to Device (bytes); Device to Host (bytes); CPU page faults: 8]

Results and Discussion: Redundant Memory Transfers

Case: no kernel references the data in y

- The size of x is 16 KB; the size of y is 4 KB.
- The kernel never references the variable y.
- Unified Memory assumes that any active kernel may use any managed memory.

Steps: Allocate x and y, Initialize x and y, Launch kernel (the kernel references only x). Transfers observed: DtoH of x and y, HtoD of x and y.

[Table: Host to Device (bytes); Device to Host (bytes); CPU page faults: 5]

Results and Discussion: Redundant Memory Transfers

Case: the host never reads or writes the managed memory

- No redundant memory transfer occurs.

Steps: Allocate x, Launch kernel (the kernel reads and writes x), Sync.

[Figure: State transition diagram. States: Invalid, Valid, Dirty. Transition labels: DtoH, Write, Kernel Call, HtoD.]
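The cases above can be replayed with a small host-only model of the three states. This is a sketch of one possible reading of the diagram, not the CUDA runtime's actual implementation. The assumed rules are: managed pages start Invalid on the CPU side; a CPU access to an Invalid page raises a page fault and triggers a DtoH copy of that page; a CPU write makes the page Dirty; a kernel launch copies every Dirty page HtoD and invalidates all managed pages, whether or not the kernel references them. The 4 KB page size and all type and function names are assumptions, Sync is not modeled, and the byte totals are model outputs rather than measured values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kModelPageBytes = 4096;  // assumed page size

enum class PageState { Invalid, Valid, Dirty };  // as seen from the CPU

struct Counters {
    std::size_t host_to_device_bytes = 0;
    std::size_t device_to_host_bytes = 0;
    std::size_t cpu_page_faults = 0;
};

struct Range {
    std::size_t first;  // index of the first page
    std::size_t count;  // number of pages
};

class ManagedMemoryModel {
public:
    // New managed pages start Invalid on the CPU side.
    Range allocate(std::size_t bytes) {
        const Range r{pages_.size(),
                      (bytes + kModelPageBytes - 1) / kModelPageBytes};
        pages_.insert(pages_.end(), r.count, PageState::Invalid);
        return r;
    }

    // CPU read: Invalid pages fault and are copied DtoH, then become Valid.
    void cpu_read(Range r) {
        assert(r.first + r.count <= pages_.size());
        for (std::size_t i = r.first; i < r.first + r.count; ++i) {
            fault_in(pages_[i]);
        }
    }

    // CPU write: same fault path as a read, then the page becomes Dirty.
    void cpu_write(Range r) {
        assert(r.first + r.count <= pages_.size());
        for (std::size_t i = r.first; i < r.first + r.count; ++i) {
            fault_in(pages_[i]);
            pages_[i] = PageState::Dirty;
        }
    }

    // Kernel call: every Dirty page is copied HtoD, and every managed page
    // becomes Invalid for the CPU, even if the kernel never references it.
    void launch_kernel() {
        for (PageState& p : pages_) {
            if (p == PageState::Dirty) {
                counters_.host_to_device_bytes += kModelPageBytes;
            }
            p = PageState::Invalid;
        }
    }

    const Counters& counters() const { return counters_; }

private:
    void fault_in(PageState& p) {
        if (p != PageState::Invalid) return;
        ++counters_.cpu_page_faults;
        counters_.device_to_host_bytes += kModelPageBytes;
        p = PageState::Valid;
    }

    std::vector<PageState> pages_;
    Counters counters_;
};
```

For the read-only case, calling `allocate` for 16 KB, then `cpu_write`, `launch_kernel`, and `cpu_read` in that order produces 8 page faults in this model, matching the count reported for that case. The second DtoH copy is the redundant one, since the GPU never changed the data.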

Conclusion and Future Work: Conclusion

- We introduced an evaluation methodology for Unified Memory and presented a performance evaluation of the Unified Memory programming model on the K40 and TK1.
- We validated that the performance loss is caused by redundant memory transfers and page faults.
- We proposed a memory state transition diagram and explained when redundant memory transfers happen.
- We found that Unified Memory cannot use the 2 GB of physically unified memory on the TK1 efficiently.

Conclusion and Future Work: Related Work

- Gelado et al. presented a programming model called Asymmetric Distributed Shared Memory (ADSM), which maintains a shared logical memory space so that CPUs can access objects in the accelerator's physical memory.
- Nickolls et al. investigated the Unified Memory programming model and evaluated its performance.

Conclusion and Future Work: Future Work

- Test Unified Memory performance with NVLink.
  - NVLink provides at least 80 GB/s of bandwidth, 5 times that of current PCIe Gen3 x16.
- Survey Unified Memory on multiple GPUs.
- Use the features of Unified Memory to optimize CUDA code based on the Unified Memory programming model.
