X10-specific Optimization of CPU-GPU Data Transfer with Pinned Memory Management

X10-specific Optimization of CPU-GPU Data Transfer with Pinned Memory Management
Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi
Tokyo Institute of Technology / IBM Research - Tokyo

Programming for large-scale heterogeneous architectures
- Growing number of GPU-based heterogeneous architectures, e.g. Titan (18,688 CPUs, 18,688 GPUs), TSUBAME2.5 (2,816 CPUs, 4,224 GPUs)
- GPUs offer high compute and memory performance
- But it is hard to program on a distributed system, and hard to program with GPU devices
- X10 + CUDA addresses both, yet a large performance gap remains between X10+CUDA and plain CUDA

Performance gap between CUDA and X10
- Operation: CPU-GPU memcpy + kernel (add two rails into a rail)
- Environment: TSUBAME2.5
- The CPU-GPU memcpy performance of X10 is worse than that of CUDA
[Figure: elapsed time in seconds (0-0.06) for CUDA vs. X10, broken down into memcpy and kernel time]

Performance of CPU-GPU memcpy
- In many applications (sorting, merging, etc.) the cost of CPU-GPU data transfer is larger than the cost of running the kernel on the GPU
- The CPU-GPU memcpy mechanism of X10 degrades memcpy performance compared to CUDA
- This performance gap lowers the overall performance of X10+CUDA applications

Approach & Contribution
- Approach: reduce the overheads of X10's existing CPU-GPU memcpy with dynamic memory pinning. Instead of preallocating pinned buffers, pin the memory each time memcpy is called
- Contribution: a 27% runtime reduction for CPU-GPU memcpy compared to the current X10 implementation on TSUBAME2.5

Writing GPU programs in X10
- How to write? A GPU is a remote place in X10: naively, do async at (gpu) @CUDA
- Operations are parallelized with async: 240 × 64 threads work in parallel (a plain-CUDA equivalent is sketched below)
- Source: http://x10-lang.org/documentation/practical-x10-programming/x10-on-gpus.html
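For reference, a rough plain-CUDA equivalent of that pattern, matching the benchmark's "add two rails into a rail" operation (our own sketch; the kernel and variable names are illustrative, not from the talk):

```cuda
#include <cuda_runtime.h>

// Add two arrays ("rails") element-wise with 240 blocks x 64 threads,
// the same degree of parallelism as the X10 example above.
__global__ void addRails(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void launchAdd(const float* dA, const float* dB, float* dC, int n) {
    addRails<<<240, 64>>>(dA, dB, dC, n);   // 240 * 64 threads in flight
    cudaDeviceSynchronize();
}
```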

Pros and cons of X10 with CUDA
- Pros: easier to program than CUDA; no need to handle pointers; no need to specify the number of threads and blocks (autothreads(), autoblocks(), etc.)
- Cons: lower performance than CUDA; lacks some features that are available in CUDA

Source-code-level optimizations
- Coalescing memory accesses: neighboring threads access neighboring memory locations
- Overlapping the kernel with bidirectional memory copies, using three CUDA streams (the current X10 implementation doesn't support using more than two streams); see the sketch after this list
- Selecting the optimal number of threads and blocks: autoblocks()/autothreads() automatically choose enough blocks/threads to saturate the GPU (source: http://x10.sourceforge.net/x10doc/latest/)
- These optimizations are available in both X10 and CUDA, so to reduce the performance gap we optimize at the runtime level
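A minimal sketch of the three-stream overlap idea in plain CUDA, assuming chunked processing (our own sketch; the kernel, names, and launch configuration are illustrative):

```cuda
#include <cuda_runtime.h>

// Stand-in kernel; any per-chunk computation works here.
__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// One stream each for host-to-device copies, kernels, and device-to-host
// copies; events chain the three stages of each chunk. hIn/hOut must be
// pinned (e.g. cudaMallocHost) for the copies to actually overlap.
void pipeline(const float* hIn, float* hOut, float* dBuf,
              int nChunks, int chunkElems) {
    cudaStream_t h2d, krn, d2h;
    cudaStreamCreate(&h2d); cudaStreamCreate(&krn); cudaStreamCreate(&d2h);

    for (int c = 0; c < nChunks; ++c) {
        float* dChunk = dBuf + (size_t)c * chunkElems;
        size_t bytes = (size_t)chunkElems * sizeof(float);
        cudaEvent_t up, done;
        cudaEventCreate(&up); cudaEventCreate(&done);

        cudaMemcpyAsync(dChunk, hIn + (size_t)c * chunkElems, bytes,
                        cudaMemcpyHostToDevice, h2d);
        cudaEventRecord(up, h2d);

        cudaStreamWaitEvent(krn, up, 0);       // kernel waits for its input
        process<<<240, 64, 0, krn>>>(dChunk, chunkElems);
        cudaEventRecord(done, krn);

        cudaStreamWaitEvent(d2h, done, 0);     // copy-back waits for kernel
        cudaMemcpyAsync(hOut + (size_t)c * chunkElems, dChunk, bytes,
                        cudaMemcpyDeviceToHost, d2h);

        cudaEventDestroy(up);                  // deferred until completion
        cudaEventDestroy(done);
    }
    cudaDeviceSynchronize();
}
```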

CPU-GPU memcpy of X10 (current implementation)
- Two pinned buffers are allocated at the beginning
- Double buffering: copy the part of the data to be transferred to the GPU in the next step into one pinned buffer, while the data in the other pinned buffer is copied to the GPU
- The chunk size for these operations is 1 MB
- Copy operations are managed by a queuing system (see the sketch below)
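As we read the slide, the existing mechanism corresponds roughly to the following CUDA-level sketch (our reconstruction for illustration, not the X10 runtime's actual source):

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstring>

#define CHUNK (1u << 20)   // 1 MB chunks, as in the current X10 runtime

// Host-to-device copy staged through two preallocated pinned buffers:
// the host memcpy into one buffer overlaps the DMA out of the other.
void x10StyleCopyToDevice(char* dDst, const char* hSrc, size_t bytes,
                          cudaStream_t stream) {
    static char* pinned[2];
    static cudaEvent_t freed[2];
    if (!pinned[0]) {                        // done once at startup in X10
        for (int i = 0; i < 2; ++i) {
            cudaMallocHost(&pinned[i], CHUNK);
            cudaEventCreate(&freed[i]);
        }
    }
    int buf = 0;
    for (size_t off = 0; off < bytes; off += CHUNK, buf ^= 1) {
        size_t n = std::min((size_t)CHUNK, bytes - off);
        cudaEventSynchronize(freed[buf]);    // wait until DMA freed this buffer
        memcpy(pinned[buf], hSrc + off, n);  // the extra host-side copy
        cudaMemcpyAsync(dDst + off, pinned[buf], n,
                        cudaMemcpyHostToDevice, stream);
        cudaEventRecord(freed[buf], stream);
    }
    cudaStreamSynchronize(stream);
}
```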

Analysis of the memcpy
[Figure: profile of copying a 1 MB chunk from host to device]

Accelerating CPU-GPU memcpy with dynamic pinning
- Instead of using buffers pinned at the beginning, we pin the buffers each time a CPU-GPU memcpy is called: no extra buffers to allocate, and no data copied into staging pinned buffers
- Other modifications: do not split the data into 1 MB chunks; pin the data at once, and copy it in a single operation instead of iterating 1 MB copies. This reduces the cost of managing the queue
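The talk does not name the exact CUDA calls; assuming cudaHostRegister/cudaHostUnregister are the pinning primitives, the proposed path reduces to something like this minimal sketch:

```cuda
#include <cuda_runtime.h>

// Pin the caller's buffer in place for the duration of the transfer,
// copy it in one call, then unpin: no staging buffers, no extra host
// memcpy, no 1 MB chunk loop.
void dynamicPinCopyToDevice(char* dDst, char* hSrc, size_t bytes) {
    cudaHostRegister(hSrc, bytes, cudaHostRegisterDefault);  // pin in place
    cudaMemcpy(dDst, hSrc, bytes, cudaMemcpyHostToDevice);   // single DMA
    cudaHostUnregister(hSrc);                                // unpin
}
```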

Experiment
- Evaluate the proposed memcpy mechanism by iterating Rail.asyncCopy() 2 times (iterating the asyncCopy makes the memcpy overhead easy to see in the profiling phase)
- TSUBAME2.5 node: CPU: Intel Xeon X5670 2.93 GHz (6 cores) × 2; GPU: NVIDIA Tesla K20X × 3; DRAM: 54 GB
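The talk drives the measurement from X10 via Rail.asyncCopy(); a plain-CUDA harness of the same shape might look like this (our sketch; the buffer size is illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time repeated host-to-device copies of a pageable buffer (as X10 Rails
// are) using CUDA events.
int main() {
    const size_t bytes = 256u << 20;        // 256 MB, illustrative size
    char* hBuf = (char*)malloc(bytes);
    char* dBuf;
    cudaMalloc(&dBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 2; ++i)             // iterate twice, as in the talk
        cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D copies: %.3f ms\n", ms);

    cudaFree(dBuf);
    free(hBuf);
    return 0;
}
```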

Default memcpy vs. optimized memcpy
[Figure: runtime comparison showing a 27% runtime reduction]

Profiling the overhead reduction
[Figures: profiles of the current X10 implementation and the optimized implementation; the ratio of overhead to memcpy time is reduced]

Performance evaluation
[Figure: the optimized X10 still shows a 25% longer runtime compared to CUDA]

Conclusion
- Achieved a 27% runtime reduction for CPU-GPU memcpy compared to the original X10 implementation on TSUBAME2.5
- The CPU-GPU data transfer runtime of X10 is still approximately 25% longer than that of CUDA; the overheads of memcpy have not been eliminated completely
- Next step: applying the double-buffering technique to our method

Future Work
- Double buffering: overlapping the pinning of memory with the CPU-GPU memory copy
- Managing pinned memory spaces: release pinned memory when the total pinned space grows too large, but keep memory pinned when it is used multiple times (see the hypothetical sketch below)
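One hypothetical shape for such a pinned-memory manager (entirely our sketch; the slide only states the policy, and the budget and eviction order are illustrative):

```cuda
#include <cuda_runtime.h>
#include <unordered_map>

// Cache of pinned host regions: keep regions pinned for reuse, but unpin
// (evict) once the total pinned footprint exceeds a budget. Real code must
// also ensure no in-flight transfer still uses an evicted region.
class PinnedCache {
    std::unordered_map<void*, size_t> pinned_;    // base pointer -> bytes
    size_t total_ = 0;
    static constexpr size_t kBudget = 256ull << 20;  // 256 MB cap, illustrative
public:
    void pin(void* p, size_t bytes) {
        if (pinned_.count(p)) return;             // already pinned: reuse it
        while (total_ + bytes > kBudget && !pinned_.empty()) {
            auto victim = pinned_.begin();        // arbitrary eviction order
            cudaHostUnregister(victim->first);
            total_ -= victim->second;
            pinned_.erase(victim);
        }
        cudaHostRegister(p, bytes, cudaHostRegisterDefault);
        pinned_[p] = bytes;
        total_ += bytes;
    }
};
```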