High-Performance Data Loading and Augmentation for Deep Neural Network Training


High-Performance Data Loading and Augmentation for Deep Neural Network Training
Trevor Gale (tgale@ece.neu.edu), Steven Eliuk (steven.eliuk@gmail.com), Cameron Upright (c.upright@samsung.com)

Roadmap
1. The General-Purpose Acceleration Framework (GPAF) project
   1. Systems & software
   2. Key features
2. Data loading & augmentation systems
   1. The data loading & augmentation task
   2. Motivation for our system
3. Data augmentation pipeline implementation
   1. Data augmentation on CPU & GPU
   2. Multi-threading augmentation on CPU
   3. Automatic performance tuning
   4. Memory management
   5. Levels of parallelism
4. Results & analysis

General-Purpose Acceleration Framework Project

Systems & Software
Goal: Design & build software & hardware infrastructure to accelerate machine learning and mathematical workloads, specifically through the use of many-GPU, distributed systems.
Hardware: Custom GPU clusters used across Samsung
Software: Distributed math library (dmath), used to accelerate popular machine learning frameworks:
- Kaldi speech recognition toolkit
- Caffe deep learning framework + Samsung Advanced Learning v1.0 (dmath) = Expresso (topic for today)

Key Features
- Pooled memory management, avoiding costly allocation, de-allocation, and registration for RDMA transfers
- Asynchronous replication of shared data, overlapping parameter distribution with forward-pass computation
- Caching of distributed job metadata to minimize overhead when starting common tasks
- Multi-threaded, asynchronous CPU/GPU data loading pipeline, automatically tuned at runtime (topic for today)
- Highly optimized DNN operations (cuDNN, cuBLAS, custom combined backward convolution)
- Distributed batch norm (strict or relaxed)
- Half-precision support for storage and computation on supporting hardware
(See EDMNN@NIPS, GTC 2016 talk)

Data Loading & Augmentation System Design & Motivation

Data Loading & Augmentation Task
1. Load images from database
2. Decode image
3. Perform any data augmentation
4. Copy image to the GPU for training

Typical Augmentation System
- Multiple threads are used to accelerate data augmentation
- Data loading for the next batch is done in parallel with the forward-backward pass for the previous batch
(Figure: timeline showing data loading overlapped with previous-batch computation; a sketch of this scheme follows below)
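To make this typical design concrete, here is a minimal sketch assuming hypothetical placeholders load_and_augment and train_on_batch (not part of any real system), showing multi-threaded preparation of the next batch overlapped with training on the previous one:

```cpp
// Minimal sketch of the conventional overlap scheme (illustrative only).
// load_and_augment() and train_on_batch() are hypothetical placeholders.
#include <future>
#include <vector>

struct Batch { std::vector<float> data; };

// Stand-in for the multi-threaded CPU-side load/decode/augment step.
Batch load_and_augment(int batch_idx, int num_threads) {
    (void)batch_idx; (void)num_threads;
    return Batch{std::vector<float>(3 * 224 * 224)};
}

// Stand-in for the forward-backward pass on the GPU.
void train_on_batch(const Batch& batch) { (void)batch; }

void train_loop(int num_batches, int num_threads) {
    Batch current = load_and_augment(0, num_threads);    // prefetch the first batch
    for (int i = 0; i < num_batches; ++i) {
        std::future<Batch> next;
        if (i + 1 < num_batches)                          // prepare batch i+1 ...
            next = std::async(std::launch::async, load_and_augment, i + 1, num_threads);
        train_on_batch(current);                          // ... while batch i trains
        if (i + 1 < num_batches)
            current = next.get();                         // wait for the prefetched batch
    }
}
```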

Motivation for Our System
Advances in GPUs and systems for deep learning have accelerated DNN training to the point where data loading and augmentation can be the main bottleneck. The typical approach of multi-threading and overlapping preprocessing with training on the previous batch is no longer sufficient for some networks.
(Figure: peak training speed in frames/second (bars) vs. batch size for AlexNet on 8 GPUs; dotted lines mark peak data loading speeds with 5 threads / GPU)

Question: How can we accelerate data loading & augmentation so that we can continue to leverage training speedups from the latest GPUs and software libraries?
Our solution: Utilize the GPU for data augmentation

Problem: Data loading is the main bottleneck only for some networks. How can we accelerate data loading with the GPU when necessary, but avoid wasting GPU resources on data augmentation when data loading is not the main bottleneck?

Goal for Our System
We need a data loading & augmentation system that can adapt to the computational needs of the network, so that we can continue to leverage training speedups from the latest GPUs & software systems.
(Figure source: developer.nvidia.com/cudnn)

Data Augmentation Pipeline Implementation

Key Features
1. Data augmentation on CPU & GPU
2. Multi-threading augmentation on CPU
3. Automatic performance tuning
4. Memory management
5. Levels of parallelism

Data Augmentation on CPU & GPU
- Central augmentation pipeline composed of data augmentation operations on each worker process
- All operations implemented on both CPU and GPU
- Used BLAS, cuBLAS, and OpenCV (CPU/GPU) to build fast data augmentation operations
- Operations are moved between CPU and GPU only between batches to avoid thread-safety issues
(Note: we refer to the stage of the pipeline at which the data is moved to the GPU as the transfer index; a sketch of this idea follows below)
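A minimal sketch of the transfer-index idea, where the AugmentOp interface and copy_sample_to_device helper are assumptions for illustration, not the actual dmath API:

```cpp
// Illustrative sketch: a pipeline of augmentation ops split at a "transfer index".
// Ops before the index run on the CPU; the sample is then moved to the GPU and the
// remaining ops run there. All names here are hypothetical.
#include <cstddef>
#include <memory>
#include <vector>

struct Sample { /* image data, shape, dtype, host/device pointers ... */ };

struct AugmentOp {
    virtual ~AugmentOp() = default;
    virtual void run_cpu(Sample& s) = 0;   // CPU implementation of the operation
    virtual void run_gpu(Sample& s) = 0;   // GPU implementation of the operation
};

// Stand-in for the host-to-device copy (cudaMemcpyAsync in a real system).
void copy_sample_to_device(Sample& s) { (void)s; }

void run_pipeline(std::vector<std::unique_ptr<AugmentOp>>& ops,
                  Sample& sample, std::size_t transfer_index) {
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (i == transfer_index)
            copy_sample_to_device(sample);   // move the sample to the GPU at this stage
        if (i < transfer_index)
            ops[i]->run_cpu(sample);         // before the transfer index: CPU
        else
            ops[i]->run_gpu(sample);         // at or after the transfer index: GPU
    }
    if (transfer_index >= ops.size())
        copy_sample_to_device(sample);       // all ops ran on the CPU; transfer at the end
}
```

With num_ops operations the transfer index can take num_ops + 1 values (before any op, between any two ops, or after all of them), which is where the num_ops + 1 factor in the tuning state space below comes from.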

Multi-Threading Augmentation on CPU
- Multiple threads are used by each worker to augment data on the CPU
- The number of worker threads is the same across workers
- The transfer index and the number of threads per worker are managed centrally by the master process for all workers, to avoid resource imbalances
- Each batch is prepared in parallel with training on the previous batch

Automatic Performance Tuning
- The user would need to try max_num_threads * (num_ops + 1) settings to find the optimal number of threads & transfer index
  - Harms experimentation speed
  - Could be a waste of time for networks where data loading is insignificant compared to network training (e.g. very deep networks)
- However, the state space is small enough for us to search programmatically at runtime

Automatic Performance Tuning
- At runtime, the master process samples performance for all combinations of thread count and transfer index (referred to as states)
- Samples for each state are taken over batches; we take N samples for each state, where N = ceil(128 / batch_size_per_worker)
- The ideal setting is found by selecting the state with the lowest median runtime over its N samples
(A sketch of this search follows below.)
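A minimal sketch of this search, assuming a hypothetical time_batches() helper that configures the pipeline for a given (thread count, transfer index) state and returns the wall-clock time of each sampled batch:

```cpp
// Illustrative auto-tuning search over (num_threads, transfer_index) states,
// picking the state whose median sampled batch time is lowest.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical: runs n_samples batches under the given setting and returns
// per-batch wall-clock times (placeholder body; a real system would drive
// the actual loading pipeline and training step here).
std::vector<double> time_batches(int num_threads, int transfer_index, int n_samples) {
    (void)num_threads; (void)transfer_index;
    return std::vector<double>(n_samples, 0.0);
}

struct Setting { int num_threads; int transfer_index; };

Setting auto_tune(int max_num_threads, int num_ops, int batch_size_per_worker) {
    const int n_samples = static_cast<int>(std::ceil(128.0 / batch_size_per_worker));
    Setting best{1, 0};
    double best_median = std::numeric_limits<double>::infinity();
    for (int threads = 1; threads <= max_num_threads; ++threads) {
        for (int idx = 0; idx <= num_ops; ++idx) {            // num_ops + 1 transfer indices
            std::vector<double> t = time_batches(threads, idx, n_samples);
            std::nth_element(t.begin(), t.begin() + t.size() / 2, t.end());
            const double median = t[t.size() / 2];            // median of the N samples
            if (median < best_median) { best_median = median; best = {threads, idx}; }
        }
    }
    return best;
}
```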

Memory Management
- Page-locked host memory allocations and device memory allocations cause implicit synchronization; we need to avoid synchronizing with the GPU so that we do not interfere with network training
- Host & device buffers are allocated by each thread on startup and only resized when samples exceed the current buffer size
- Page-locked host memory is used to allow overlap of transfers to the device with computation on the device
- Data types are promoted lazily to avoid unnecessary data transfers
(A sketch of the buffer handling follows below.)
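A minimal sketch of such a grow-only, page-locked host buffer, using standard CUDA runtime calls (the wrapper class itself is an assumption for illustration):

```cpp
// Grow-only pinned host buffer: allocated once per thread, reallocated only when a
// sample is larger than the current capacity, so the costly (and implicitly
// synchronizing) cudaHostAlloc stays off the steady-state path.
#include <cuda_runtime.h>
#include <cstddef>

class PinnedBuffer {
public:
    ~PinnedBuffer() { if (ptr_) cudaFreeHost(ptr_); }

    // Returns a pinned host pointer with at least `bytes` of capacity.
    void* reserve(std::size_t bytes) {
        if (bytes > capacity_) {                 // only reallocate when we must
            if (ptr_) cudaFreeHost(ptr_);
            cudaHostAlloc(&ptr_, bytes, cudaHostAllocDefault);
            capacity_ = bytes;
        }
        return ptr_;
    }

private:
    void* ptr_ = nullptr;
    std::size_t capacity_ = 0;
};
```

A device-side buffer would follow the same pattern with cudaMalloc / cudaFree, again reallocating only when a sample exceeds the current capacity.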

Memory Management: Lazy Data Type Promotion
- RGB images are loaded in uint8
- Operations like mean subtraction and scaling of the data are very sensitive to precision and cannot be done in uint8
- Rather than performing all ops in float whenever mean subtraction or scaling is present, we wait until the higher precision is actually needed before promoting the data type
(Example pipeline: Crop (uint8) -> Mirror (uint8) -> Mean Sub (float))

Memory Management: Benefits
- We can avoid unnecessary memory transfers to the GPU when mean subtraction is moved to the GPU
  - CPU: Crop (uint8), Mirror (uint8) -> transfer (uint8) -> GPU: Mean Sub (float)
  - Avoids a 4x increase in traffic to the GPU (uint8 vs. float)
- Auto-tuning frequently achieves significant performance improvements by
  a. Moving high-precision ops to the GPU (saves data transfer)
  b. Moving computationally intensive ops to the GPU (slow on CPU)
(A sketch of this lazy promotion follows below.)
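A minimal sketch of the lazy promotion when mean subtraction has been moved to the GPU; the kernel and helper names are assumptions, not the real pipeline API. The sample stays in uint8 across the PCIe transfer and is promoted to float only where the mean subtraction needs it:

```cpp
// Illustrative only: the kernel and function names here are hypothetical.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

// Hypothetical GPU kernel: converts uint8 pixels to float and subtracts the mean
// in one pass, so a float copy of the image never crosses the PCIe bus.
__global__ void mean_sub_and_promote(const std::uint8_t* in, const float* mean,
                                     float* out, std::size_t n) {
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<float>(in[i]) - mean[i];
}

void augment_sample(const std::uint8_t* h_cropped_mirrored,  // uint8 result of CPU-side crop/mirror
                    const float* d_mean, float* d_out,
                    std::uint8_t* d_staging, std::size_t n, cudaStream_t stream) {
    // Transfer 1 byte per pixel instead of 4: lazy promotion avoids the 4x traffic increase.
    cudaMemcpyAsync(d_staging, h_cropped_mirrored, n * sizeof(std::uint8_t),
                    cudaMemcpyHostToDevice, stream);
    const int threads = 256;
    const int blocks = static_cast<int>((n + threads - 1) / threads);
    mean_sub_and_promote<<<blocks, threads, 0, stream>>>(d_staging, d_mean, d_out, n);
}
```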

Levels of Parallelism
1. The processing pipeline is replicated across the workers and controlled centrally by the master process
2. Within each worker, data augmentation is threaded
   - Threads load from a central DB to ensure replicable training and testing results

Levels of Parallelism
3. Within each thread, processing on the host is overlapped with transfers to the device and computation on the GPU
   - Pinned memory is used for host buffers
   - Host buffers are associated with CUDA events
   - Device buffers are associated with CUDA streams
   - Each thread keeps multiple of each buffer and ping-pongs between them

Levels of Parallelism
For each sample:
1. Select the next host & device buffers
2. Block on the host buffer's event (confirm any previous copy is complete)
3. Fill the host buffer with the training sample
4. Run host-side augmentation
5. Block on the device buffer's stream (confirm all work in this buffer is complete)
6. Start an async copy from the host buffer to the device buffer
7. Enqueue the host buffer's event in the device buffer's stream
8. Launch all device-side augmentation & the final async copy into the batch
(A sketch of this loop follows below.)
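A minimal sketch of steps 1-8 with two ping-pong buffer pairs, using standard CUDA events and streams; the fill/augment helpers and the buffer layout are assumptions for illustration:

```cpp
// Illustrative per-thread loop: double-buffered pinned host buffers (each tracked
// by a CUDA event) and device buffers (each with its own CUDA stream).
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helpers: load/decode a sample into pinned memory, run CPU-side ops,
// and launch GPU-side ops plus the final async copy into the batch buffer.
void fill_host_sample(void* dst, int sample_idx);
void augment_on_host(void* buf);
void augment_on_device_async(void* buf, cudaStream_t stream);

constexpr int kNumBuffers = 2;

struct HostBuffer   { void* ptr; cudaEvent_t copied; };   // pinned memory + "copy done" event
struct DeviceBuffer { void* ptr; cudaStream_t stream; };  // device memory + its own stream

void process_samples(HostBuffer h[kNumBuffers], DeviceBuffer d[kNumBuffers],
                     std::size_t sample_bytes, int num_samples) {
    for (int s = 0; s < num_samples; ++s) {
        HostBuffer&   hb = h[s % kNumBuffers];            // 1. select next host & device buffers
        DeviceBuffer& db = d[s % kNumBuffers];

        cudaEventSynchronize(hb.copied);                  // 2. wait: previous copy out of this host buffer done
        fill_host_sample(hb.ptr, s);                      // 3. fill host buffer with the training sample
        augment_on_host(hb.ptr);                          // 4. host-side augmentation (ops before the transfer index)

        cudaStreamSynchronize(db.stream);                 // 5. wait: all prior work using this device buffer done
        cudaMemcpyAsync(db.ptr, hb.ptr, sample_bytes,     // 6. async host-to-device copy
                        cudaMemcpyHostToDevice, db.stream);
        cudaEventRecord(hb.copied, db.stream);            // 7. host buffer reusable once the copy completes
        augment_on_device_async(db.ptr, db.stream);       // 8. device-side augmentation + final async copy into batch
    }
}
```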

Performance Results & Analysis

Peak Data Loading Performance
(Figure: peak training speed (bars) for AlexNet on 8 GPUs; dotted lines mark peak data loading speeds with 5 threads / GPU)
Peak processing rates with our system:
- Crop/mirror/mean-sub: 18910 FPS (2.13x speedup)
- Crop/mirror/mean-sub/interp/color-dist: 9475 FPS (2.45x speedup)

Results With AlexNet
(Figures: AlexNet on 8 M40 GPUs; AlexNet on 8 P100 GPUs)
- DataLoader V2 (orange) is the system described in this talk
- DataLoader V1 (dark blue) is the previous system
  - Extremely pipelined: crop is done as the copy into GPU memory; a single kernel does mean sub, scale, and mirror
  - Only supports crop, mean subtraction, scale, and mirror
  - 1 thread per worker
- No DataLoader (light blue) is the training speed with dummy data
- Augmentation pipeline is basic crop, mirror, and mean subtraction

Results With AlexNet
(Figures: AlexNet on 8 M40 GPUs; AlexNet on 8 P100 GPUs)
Observations:
- DataLoader V2 provides a 19.4% speedup on average
- The performance gain for more complex pipelines is likely to be larger
- The move from M40 to P100 significantly increased the problem, and it is likely to continue to grow as systems advance
- There is still room for improvement: DataLoader V2 provides 87.4% of peak performance on average

Results With AlexNet
Looking at auto-tuner decisions (num_threads/transfer_index):
(Figure: AlexNet on 8 M40 GPUs, chosen settings 1/3, 4/1, 5/0, 3/0, 2/0, 1/1, 1/3, 1/2; AlexNet on 8 P100 GPUs, chosen settings 4/0, 4/0, 3/0, 3/0, 2/2, 1/2)
- The auto-tuner generally moves augmentation to the GPU before threading
- GPU-side augmentation + high thread counts are chosen as the amount of work increases

Results With AlexNet
(Figures: AlexNet on 8 M40 GPUs; AlexNet on 8 P100 GPUs)
Room for improvement: the new data loader's peak with crop, mirror, and mean sub is 18910 FPS
Questions:
1. What is stopping us from matching peak performance with no data loader?
2. Max speed is not too far off our peak; what else can we do to improve?

Results With AlexNet
(Figures: AlexNet on 8 M40 GPUs; AlexNet on 8 P100 GPUs)
Question 1: What is stopping us from matching peak performance with no data loader?
Contention for resources with training:
- A large amount of data moving to each worker from the host CPU
- Many data augmentation kernels competing with network computation

Results With AlexNet
(Figures: AlexNet on 8 M40 GPUs; AlexNet on 8 P100 GPUs)
Question 2: Max speed on P100 is relatively close to our peak; what else can we do to improve?
- A lock-free system for loading data from the DB
- Optimization of some operations (avoiding synchronous calls, better kernels)
- Work to increase overlap between augmentation and training
- Reduced system overhead (stalling threads between batches, global sync to get performance samples, etc.)
- Decoding images on the GPU
Note: because data augmentation can be done on the GPU, it will be able to leverage speedups from new GPUs as well

Results With GoogLeNet
(Figure: GoogLeNet on 8 M40 GPUs)
- The network is ~3x more difficult to train than AlexNet; data loading is not a bottleneck yet
- 6.5% average speedup from reduced device-side synchronization

Takeaways
Benefits:
- 15.5% speedup on average across our experiments
- Reduced interference with network training
- Reduced memory transfers
- GPU-side augmentation to accelerate complex pipelines
- Automatic performance tuning finds optimal performance settings, reducing user burden
Room for improvement:
- Dedicated network between GPUs (NVLink!)
- Optimizing augmentation operations
- Reducing system overhead

Thank You.

Appendix
Expresso AlexNet training speed compared to BVLC Caffe (12/09/16): 2.12x speedup on average