High Performance Computing


High Performance Computing, 9th Lecture, 2016/10/28, YUKI ITO

Selected Paper: vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler (NVIDIA). Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016.

Outline: 1. Introduction 2. Background and Motivation 3. Virtualized DNN 4. Methodology 5. Results 6. Related Work 7. Conclusion

1. Introduction: Deep neural networks (DNNs) have recently been deployed successfully in various application domains. Thanks to the tremendous compute horsepower offered by GPUs, popular machine learning frameworks use GPUs for accelerated deep learning. However, the DRAM capacity of the GPU limits the size of the DNN that can be trained.

1. Introduction: e.g., the Titan X has a 12 GB memory capacity. The trend in deep learning is to move towards larger and deeper network designs, so the physical memory limitations of GPUs are becoming an important constraint.

1. Introduction: In this paper, the authors propose vDNN (virtualized Deep Neural Network), a runtime memory management solution that virtualizes the memory usage of DNNs across both GPU and CPU memory. vDNN makes it possible to train larger and deeper networks than the GPU's capacity alone would allow.

2. Background and Motivation: DNNs are designed using a combination of multiple layer types: convolutional, activation, pooling, and fully-connected layers. The early layers serve as feature extraction layers, the later ones as classification layers. Convolutional neural networks are among the most popular ML algorithms for image recognition. These DNNs are trained using the backpropagation algorithm.

2. Background and Motivation: Forward propagation is performed from the first layer to the last layer; backward propagation is performed in the opposite direction. Both propagations traverse the network layer-wise.

2. Background and Motivation: The memory allocations required per layer are determined by the layer's input-output relationships and its mathematical function. e.g., a convolutional layer needs, for the forward pass, the input/output feature maps (X and Y) and the weights of the layer (W); for the backward pass, the input/output gradient maps (dY and dX), the weight gradients (dW), and X and W again. If an FFT-based convolution algorithm is used, it additionally needs a workspace (WS) buffer. (A back-of-the-envelope sizing sketch follows below.)
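
As a rough illustration of these allocations, here is a back-of-the-envelope sizing sketch in C++. The dimensions are assumed VGG-16-style values (batch 256, first CONV layer), not numbers taken from the paper:

    // Illustrative sizing of one convolutional layer's forward allocations.
    #include <cstdio>
    #include <cstddef>

    int main() {
        // Assumed VGG-16-style shapes: batch N, input channels C, output
        // channels K, spatial size HxW, filter size RxS.
        const size_t N = 256, C = 3, K = 64, H = 224, W = 224, R = 3, S = 3;
        const size_t fp32 = 4;  // bytes per float

        size_t x = N * C * H * W * fp32;  // input feature maps (X)
        size_t y = N * K * H * W * fp32;  // output feature maps (Y)
        size_t w = K * C * R * S * fp32;  // weights (W)
        // The backward pass additionally needs dX, dY, and dW of the
        // same shapes as X, Y, and W respectively.

        printf("X: %8.1f MB\n", x / 1e6);
        printf("Y: %8.1f MB\n", y / 1e6);
        printf("W: %8.1f KB\n", w / 1e3);
        return 0;
    }

With these shapes, Y alone is roughly 3.3 GB while W is only a few KB, which previews the observation below that feature maps, not weights, dominate memory usage.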

2. Background and Motivation: Because of the layer-wise gradient update rule of the backward propagation algorithm, each layer's feature maps (X) are reused later during its own backward propagation pass. All Xs must therefore remain available in GPU memory until the backward computation is completed. As the number of layers increases, the fraction of memory allocated to feature maps grows.

2. Background and Motivation: (figure) Per-layer memory usage of the VGG-16 network (batch size 256) during forward propagation.

2. Background and Motivation: Key observations about memory usage: the intermediate feature maps (X) and workspace (WS) incur higher memory usage than the weights (W) of each layer; most of the Xs are concentrated in the feature extraction layers; most of the Ws are concentrated in the later classifier layers; and the per-layer memory usage is much smaller than the memory usage of the entire network.

3. Virtualized DNN: Design Principle. The design objective of the vDNN memory manager is to virtualize the memory usage of DNNs across both GPU and CPU memory while minimizing the impact on performance. (figure: feature maps move between GPU device memory and CPU host memory via DtoH offload and HtoD prefetch copies.) There is overhead due to this communication between GPU and CPU.

3. Virtualized DNN: Design Principle. vDNN is based on three key observations: 1. DNNs trained via SGD are designed and structured with multiple layers; 2. the training of these neural networks involves a series of layer-wise computations; 3. the GPU only processes a single layer's computation at any given time. vDNN therefore adopts a sliding-window based, layer-wise memory management strategy: the runtime memory manager allocates memory only for the immediate usage of the layer currently being processed by the GPU (see the sketch below).
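
A minimal sketch of this sliding-window loop for the forward pass, with hypothetical stand-in functions (allocateOnGpu, computeForward, offloadToHost, releaseFromGpu) replacing the real runtime:

    #include <cstdio>
    #include <vector>

    struct Layer { int id; };

    // Stand-ins for the real runtime operations (hypothetical names).
    void allocateOnGpu(Layer& l)  { printf("alloc   layer %d\n", l.id); }
    void computeForward(Layer& l) { printf("forward layer %d\n", l.id); }
    void offloadToHost(Layer& l)  { printf("offload layer %d\n", l.id); }
    void releaseFromGpu(Layer& l) { printf("release layer %d\n", l.id); }

    int main() {
        std::vector<Layer> net{{0}, {1}, {2}};
        // Sliding window: only the current layer's buffers live on the GPU.
        for (Layer& l : net) {
            allocateOnGpu(l);    // X, Y, W, and workspace for this layer only
            computeForward(l);   // issued on stream_compute in the real design
            offloadToHost(l);    // X copied DtoH on stream_memory, overlapped
            releaseFromGpu(l);   // buffers returned to vDNN's memory pool
        }
        return 0;
    }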

3. Virtualized DNN: Design Principle. Forward propagation: vDNN allocates the current layer's X on the GPU; the other layers' Xs are offloaded to CPU memory.

3. Virtualized DNN: Design Principle. Backward propagation: similar to forward propagation, vDNN aggressively releases data that are not needed for the current layer's backward computation.

3. Virtualized DNN: Design Principle. A feedforward network with non-linear connectivity still involves a series of layer-wise computations, so vDNN can also handle non-linear network topologies.

3. Virtualized DNN: Core Operations and Its Design. vDNN uses cuDNN (https://developer.nvidia.com/cudnn) for computation on the GPU. cuDNN is a GPU-accelerated library for DNNs used by various frameworks (including Caffe, TensorFlow, and Chainer). cuDNN provides several algorithms for each layer's operation and can find the best-suited one. e.g., for the forward convolution (cudnnConvolutionForward): IMPLICIT_GEMM, GEMM, FFT, etc. (A workspace-size query sketch follows below.)
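
As an illustration of how the workspace (WS) requirement depends on the chosen algorithm, here is a hedged cuDNN sketch that queries the forward-convolution workspace size for a memory-optimal and a potentially faster algorithm. The tensor shapes are assumed for the example:

    // Build with: nvcc ws_query.cu -lcudnn
    #include <cudnn.h>
    #include <cstdio>

    int main() {
        cudnnHandle_t h;
        cudnnCreate(&h);

        cudnnTensorDescriptor_t x, y;
        cudnnFilterDescriptor_t w;
        cudnnConvolutionDescriptor_t conv;
        cudnnCreateTensorDescriptor(&x);
        cudnnCreateTensorDescriptor(&y);
        cudnnCreateFilterDescriptor(&w);
        cudnnCreateConvolutionDescriptor(&conv);

        // Assumed shapes: N=64, C=64, H=W=56; K=64 3x3 filters, pad 1, stride 1.
        cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   64, 64, 56, 56);
        cudnnSetFilter4dDescriptor(w, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                   64, 64, 3, 3);
        cudnnSetConvolution2dDescriptor(conv, 1, 1, 1, 1, 1, 1,
                                        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
        cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   64, 64, 56, 56);

        cudnnConvolutionFwdAlgo_t algos[] = {
            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,  // memory-optimal (WS = 0)
            CUDNN_CONVOLUTION_FWD_ALGO_FFT             // often faster, larger WS
        };
        for (cudnnConvolutionFwdAlgo_t a : algos) {
            size_t ws = 0;
            if (cudnnGetConvolutionForwardWorkspaceSize(h, x, w, conv, y, a, &ws)
                    == CUDNN_STATUS_SUCCESS)
                printf("algo %d needs %zu bytes of workspace\n", (int)a, ws);
        }
        cudnnDestroy(h);
        return 0;
    }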

3. Virtualized DNN: Core Operations and Its Design. vDNN uses CUDA streams (https://docs.nvidia.com/cuda/cuda-c-programming-guide/). A stream is a sequence of operations that execute in order on the GPU; different streams may execute their operations out of order with respect to one another, or concurrently. cudaMemcpyAsync(HtoD, stream1); Convolution(stream1); cudaMemcpyAsync(DtoH, stream1); → timeline: stream1: HtoD, CONV, DtoH (serialized).

3. Virtualized DNN: Core Operations and Its Design. cudaMemcpyAsync(HtoD, stream1); Convolution(stream2); cudaMemcpyAsync(DtoH, stream1); → timeline: stream1: HtoD, DtoH; stream2: CONV (the copies and the convolution can now overlap in time).

3. Virtualized DNN: Core Operations and Its Design. cudaMemcpyAsync(HtoD, stream1); Convolution(stream2); cudaStreamSynchronize(stream2); cudaMemcpyAsync(DtoH, stream1); → timeline: stream1: HtoD, DtoH; stream2: CONV (the synchronization ensures the DtoH copy starts only after the convolution finishes). A complete two-stream sketch follows below.
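
Putting the three slides together, a minimal runnable CUDA sketch of the synchronized two-stream pattern (the convolution is replaced by a placeholder kernel, and the buffer size is illustrative):

    #include <cuda_runtime.h>

    __global__ void convolutionStandIn(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;  // placeholder for the real CONV kernel
    }

    int main() {
        const int n = 1 << 20;
        float *hBuf, *dBuf;
        cudaMallocHost(&hBuf, n * sizeof(float));  // pinned memory: required
                                                   // for truly async copies
        cudaMalloc(&dBuf, n * sizeof(float));

        cudaStream_t stream1, stream2;
        cudaStreamCreate(&stream1);
        cudaStreamCreate(&stream2);

        cudaMemcpyAsync(dBuf, hBuf, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream1);  // HtoD on stream1
        cudaStreamSynchronize(stream1);  // input must be resident before CONV
        convolutionStandIn<<<(n + 255) / 256, 256, 0, stream2>>>(dBuf, n);
        cudaStreamSynchronize(stream2);  // DtoH must wait for CONV (slide above)
        cudaMemcpyAsync(hBuf, dBuf, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream1);  // DtoH on stream1
        cudaStreamSynchronize(stream1);

        cudaStreamDestroy(stream1); cudaStreamDestroy(stream2);
        cudaFree(dBuf); cudaFreeHost(hBuf);
        return 0;
    }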

3. Virtualized DNN: Core Operations and Its Design. vDNN employs two streams, stream_compute and stream_memory. stream_compute handles all the layers' forward and backward computation; stream_memory handles memory allocation/release, offload, and prefetch. Memory allocation/release: when the program launches, vDNN allocates a memory pool. Whenever vDNN allocates (or releases) a data structure, the memory is taken from (returned to) this pool without calling cudaMalloc() or cudaFree(). (A toy pool allocator sketch follows below.)
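
A toy sketch of such a pool: one large cudaMalloc() up front, then cheap suballocation, so the training loop itself never touches cudaMalloc()/cudaFree(). This bump allocator is only an illustration, not vDNN's actual allocator:

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdio>

    class DevicePool {
        char*  base_   = nullptr;
        size_t size_   = 0;
        size_t offset_ = 0;
    public:
        explicit DevicePool(size_t bytes) : size_(bytes) {
            cudaMalloc(&base_, bytes);        // the only cudaMalloc() call
        }
        ~DevicePool() { cudaFree(base_); }    // the only cudaFree() call

        void* allocate(size_t bytes) {
            bytes = (bytes + 255) & ~size_t(255);         // 256-byte alignment
            if (offset_ + bytes > size_) return nullptr;  // pool exhausted
            void* p = base_ + offset_;
            offset_ += bytes;
            return p;
        }
        void releaseAll() { offset_ = 0; }    // recycle, e.g., per iteration
    };

    int main() {
        DevicePool pool(512ull << 20);        // 512 MB pool (illustrative)
        float* x = static_cast<float*>(pool.allocate(64u << 20));
        float* y = static_cast<float*>(pool.allocate(64u << 20));
        printf("x=%p y=%p\n", (void*)x, (void*)y);
        pool.releaseAll();
        return 0;
    }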

3. Virtualized DNN: Core Operations and Its Design. Memory offload: input feature maps (Xs) are offloaded from the GPU to the CPU. vDNN overlaps the offloading with the same layer's forward computation.

3. Virtualized DNN: Core Operations and Its Design. Memory prefetch: offloaded Xs are prefetched back from the CPU to the GPU. vDNN overlaps a layer's prefetch with the current layer's backward computation. (A combined offload/prefetch sketch follows below.)
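
A hedged sketch of both operations with the two streams (placeholder kernel and assumed sizes; not the authors' code):

    #include <cuda_runtime.h>

    __global__ void layerCompute(float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += 1.0f;  // stand-in for the layer's real computation
    }

    int main() {
        const int n = 1 << 22;
        float *dX, *dY, *hX;
        cudaMalloc(&dX, n * sizeof(float));
        cudaMalloc(&dY, n * sizeof(float));
        cudaMallocHost(&hX, n * sizeof(float));  // pinned host buffer for X

        cudaStream_t streamCompute, streamMemory;
        cudaStreamCreate(&streamCompute);
        cudaStreamCreate(&streamMemory);

        // Offload: the current layer's forward computation runs on
        // stream_compute while X is copied out on stream_memory.
        layerCompute<<<(n + 255) / 256, 256, 0, streamCompute>>>(dY, n);
        cudaMemcpyAsync(hX, dX, n * sizeof(float),
                        cudaMemcpyDeviceToHost, streamMemory);
        cudaDeviceSynchronize();

        // Prefetch: during backward propagation, X is copied back on
        // stream_memory while another layer's backward pass keeps
        // stream_compute busy.
        layerCompute<<<(n + 255) / 256, 256, 0, streamCompute>>>(dY, n);
        cudaMemcpyAsync(dX, hX, n * sizeof(float),
                        cudaMemcpyHostToDevice, streamMemory);
        cudaDeviceSynchronize();

        cudaStreamDestroy(streamCompute); cudaStreamDestroy(streamMemory);
        cudaFree(dX); cudaFree(dY); cudaFreeHost(hX);
        return 0;
    }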

3. Virtualized DNN: vDNN Memory Transfer Policy. Determining the best layers whose X should be offloaded is a multi-dimensional optimization problem that must consider: 1. the GPU memory capacity; 2. the convolution algorithms used and the overall layer-wise memory usage (memory-optimal implicit GEMM vs. a performance-optimal convolution algorithm); 3. the network-wide performance (the additional latency possibly incurred by offload/prefetch).

3. Virtualized DNN: vDNN Memory Transfer Policy. Static vDNN: vDNN_all offloads all layers' X from the GPU (the most memory-efficient solution); vDNN_conv offloads only the CONV layers' X. The latter policy is based on the observation that CONV layers have a computation latency long enough to hide the latency of offload/prefetch.

3. Virtualized DNN: vDNN Memory Transfer Policy. Static vDNN: the convolution algorithm is chosen to be either memory-optimal or performance-optimal. While static vDNN is simple and easy to implement, it does not account for the system architectural components that determine the trainability and performance of a DNN.

3. Virtualized DNN: vDNN Memory Transfer Policy. Dynamic vDNN: vDNN_dyn automatically determines the layers to offload and the convolution algorithms at runtime, balancing the trainability and performance of the DNN. Dynamic vDNN runs a profiling and optimization pass before the actual training iterations. This profiling is based on a greedy algorithm that locally optimizes each layer's memory usage and performance while seeking a good global configuration in terms of trainability and performance (a toy sketch follows below).
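
A toy sketch of such a greedy search (fitsInMemory() and measuredTime() are hypothetical stubs standing in for real profiling runs; the actual vDNN_dyn profiling pass is more involved):

    #include <cstdio>
    #include <vector>

    enum Algo { MemoryOptimal, PerformanceOptimal };

    // Hypothetical stubs: in reality these would run a profiled
    // training iteration with the candidate configuration.
    bool   fitsInMemory(const std::vector<Algo>&) { return true; }
    double measuredTime(const std::vector<Algo>&) { return 1.0; }

    int main() {
        const int numLayers = 4;
        // Start from the most memory-efficient configuration.
        std::vector<Algo> config(numLayers, MemoryOptimal);

        // Greedily upgrade one layer at a time to the performance-optimal
        // algorithm, keeping the change only if the network still fits in
        // GPU memory and does not get slower.
        for (int l = 0; l < numLayers; ++l) {
            std::vector<Algo> trial = config;
            trial[l] = PerformanceOptimal;
            if (fitsInMemory(trial) && measuredTime(trial) <= measuredTime(config))
                config = trial;
        }
        for (int l = 0; l < numLayers; ++l)
            printf("layer %d: %s\n", l,
                   config[l] == PerformanceOptimal ? "perf-optimal"
                                                   : "memory-optimal");
        return 0;
    }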

4. Methodology: GPU Node Topology. NVIDIA Titan X: single-precision throughput 7 TFLOPS, memory bandwidth 336 GB/sec, memory capacity 12 GB. The GPU communicates with an Intel i7-5930K via a PCIe switch; the CPU-GPU communication bandwidth is 16 GB/sec. (A quick transfer-time calculation follows below.)
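
To see why offload latency can be hidden, a quick back-of-the-envelope transfer-time check over the 16 GB/sec PCIe link (the feature-map size is an assumed example, not a figure from the paper):

    #include <cstdio>

    int main() {
        const double pcieGBps = 16.0;   // CPU-GPU bandwidth from the slide
        const double xBytes   = 822e6;  // e.g., 64x64x224x224 floats ~ 822 MB
        const double ms = xBytes / (pcieGBps * 1e9) * 1e3;
        printf("offloading %.0f MB takes about %.1f ms\n", xBytes / 1e6, ms);
        // If the layer's forward computation takes longer than this, the
        // offload is completely hidden behind the compute stream.
        return 0;
    }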

4. Methodology: DNN Benchmarks. Conventional DNNs: AlexNet (128), OverFeat (128), GoogLeNet (128), VGG-16 (64), VGG-16 (128), VGG-16 (256); the number in parentheses is the batch size.

4. Methodology: DNN Benchmarks. Very deep networks: extending the number of CONV layers of VGG yields VGG-116, VGG-216, VGG-316, and VGG-416 (batch size: 32).

5. Results: all = static vDNN_all; conv = static vDNN_conv; dyn = dynamic vDNN_dyn; base = no vDNN (all memory allocated on the GPU). vDNN_all, vDNN_conv, and base are each evaluated with both the memory-optimal (m) and the performance-optimal (p) algorithm.

5. Results: GPU Memory Usage. Performance-optimal vDNN tends to allocate more memory on the GPU to improve performance: the performance-efficient algorithms require a larger workspace, and the total number of offloaded layers is reduced.

5. Results: Impact on the Memory System. vDNN does come at the cost of adding read/write traffic to the GPU memory subsystem, potentially interfering with the normal cuDNN operations. However, the feature extraction layers rarely saturate the 336 GB/sec peak memory bandwidth.

5. Results: Performance. vDNN_conv's throughput reaches an average of 97% of the baseline's throughput. Dynamic vDNN does much better in terms of balancing memory efficiency and overall throughput.

5. Results: Power. The nvprof system profiling utility is used to measure the average GPU power consumption (energy/time). The additional energy overhead of vDNN_dyn's memory traffic is negligible on average, and vDNN_dyn does not incur any noticeable performance overhead, since the studied DNNs rarely saturate the peak DRAM bandwidth.

5. Results: Training Very Deep Networks. vDNN allocates most of the memory in CPU memory, so very deep networks can be trained. vDNN_dyn did not incur any noticeable performance degradation, because the offload/prefetch latency is completely hidden.

6. Related Work: There have been a variety of proposals aiming to reduce the memory usage of neural networks: network pruning techniques remove small-valued weight connections, and others reduce the number of bits required to represent the network. Several prior works discuss mechanisms to support virtualized memory on GPUs, e.g., TLB implementations that consider the unique memory access patterns of GPUs.

7. Conclusion: Existing ML frameworks require users to carefully manage their GPU memory usage. The vDNN solution improves the memory efficiency of DNN training. The authors also study the scalability of vDNN to extremely deep networks: vDNN can train networks with hundreds of layers without any noticeable performance loss.