Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks, Inc. 1
Talk Outline Design Deep Learning & Vision Algorithms Manage large image sets Automate image labeling Easy access to models Pre-built training frameworks Accelerate and Scale Training Acceleration with GPU s Scale to clusters High Performance Deployment Automate compilation with GPU Coder On TitanXP: 7x faster than TensorFlow 5x faster than pycaffe2 On Jetson: On par with TensorRT 2x faster than C++-Caffe 2
Example: Transfer Learning Workflow Transfer Learning Images Labels Load Reference Network Modify Network Structure Learn New Weights New Classifier Training Data Labels: Cars, Trucks, BigTrucks, SUVs, Vans 3
Example: Transfer Learning in MATLAB Set up training dataset Split, shuffle, re-arrange images Read image, Data augmentation (clip, rotate, resize, etc) Easily manage large sets of images Single line of code to access images Operates on disk, database, big-data file system 4
Example: Transfer Learning in MATLAB Set up training dataset Load Reference Network Create DNNs in MATLAB 1. Easy access to research models 2. Caffe Model importer 3. Build from scratch 5
Example: Transfer Learning in MATLAB Set up training dataset Load Reference Network Modify Network Structure 6
Example: Transfer Learning in MATLAB Set up training dataset Load Reference Network Modify Network Structure 7
Example: Transfer Learning in MATLAB Set up training dataset Load Reference Network Modify Network Structure Learn New Weights Many more training options 8
Deep learning on CPU, GPU, multi-gpu and clusters More GPUs 9
More CPUs Deep learning on CPU, GPU, multi-gpu and clusters More GPUs 10
More CPUs Deep learning on CPU, GPU, multi-gpu and clusters More GPUs 11
Visualizing and Debugging Intermediate Results Training Accuracy Visualization Deep Dream Filters Many options for visualizations and debugging Examples to get started Layer Activations Feature Visualization Deep Dream Activations 13
GPU Coder for Deployment: New Product in R2017b GPU Coder Accelerated implementation of parallel algorithms on GPUs Neural Networks Deep Learning, machine learning Image Processing and Computer Vision Image filtering, feature detection/extraction Signal Processing and Communications FFT, filtering, cross correlation, 7x faster than state-of-art 700x faster than CPUs for feature extraction 20x faster than CPUs for FFTs 14
GPU Coder Compilation Flow GPU Coder CUDA Kernel creation Memory allocation Data transfer minimization Library function mapping Loop optimizations Dependence analysis Data locality analysis GPU memory allocation Data-dependence analysis Dynamic memcpy reduction 15
GPU Coder Generates CUDA from MATLAB: saxpy Scalarized MATLAB CUDA Vectorized MATLAB CUDA kernel for GPU parallelization Loops and matrix operations are directly compiled in to kernels 16
Generated CUDA Optimized for Memory Performance Kernel data allocation is automatically optimized CUDA kernel for GPU parallelization CUDA Mandelbrot space 17
Algorithm Design to Embedded Deployment Workflow MATLAB algorithm (functional reference) Build type Call CUDA from MATLAB directly Desktop GPU Call CUDA from (C++) handcoded main() Desktop GPU Embedded GPU Call CUDA from (C++) hand-coded main()..mex.lib Cross-compiled.lib C++ C++ 1 Functional test 2 Deployment unit-test 3 Deployment integration-test 4 Real-time test 20
Demo: Alexnet Deployment with mex Code Generation 21
Algorithm Design to Embedded Deployment on Tegra GPU MATLAB algorithm (functional reference) Build type Call CUDA from MATLAB directly Tesla GPU Call CUDA from (C++) handcoded main() Tesla GPU Tegra GPU Call CUDA from (C++) hand-coded main(). Cross-compiled on host with Linaro toolchain.mex.lib Cross-compiled.lib C++ C++ 1 Functional test 2 Deployment unit-test 3 Deployment integration-test 4 Real-time test (Test in MATLAB on host) (Test generated code in MATLAB on host + GPU) (Test generated code within C/C++ app on host + GPU) (Test generated code within C/C++ app on Tegra target) 22
Alexnet Deployment to Tegra: Cross-Compiled with lib Two small changes 1. Change build-type to lib 2. Select cross-compile toolchain 23
End-to-End Application: Lane Detection https://tinyurl.com/ybaxnxjg Alexnet Transfer Learning Output of CNN is lane parabola co-efficients according to: y = ax^2 + bx + c Image Lane detection CNN Left lane co-efficients Right lane co-efficients Post-processing (find left/right lane points) Image with marked lanes GPU coder generates code for whole application 24
How Good is Generated Code Performance Performance of image processing and computer vision Performance of CNN inference (Alexnet) on Titan XP GPU Performance of CNN inference (Alexnet) on Jetson (Tegra) TX2 25
GPU Coder for Image Processing and Computer Vision Fog removal Stereo disparity Distance transform SURF feature extraction Ray tracing Orders magnitude speedup over CPU 26
Frames per second Alexnet Inference on NVIDIA Titan XP MATLAB GPU Coder (R2017b) 2x 5x 7x mxnet (0.10) MATLAB (R2017b) Caffe2 (0.8.1) TensorFlow (1.2.0) Batch Size CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz Testing platform GPU cudnn Pascal Titan Xp v5 27
Frames per second Alexnet Inference on Jetson TX2: Frame-Rate Performance 400 350 300 250 0.85x TensorRT (2.1) MATLAB GPU Coder (R2017b) 200 2x 150 100 C++ Caffe (1.0.0-rc5) 50 0 1 16 32 64 128 256 Batch Size 28
Peak Memory (MB) Alexnet Inference on Jetson TX2: Memory Performance C++ Caffe (1.0.0-rc5) MATLAB GPU Coder (R2017b) TensorRT 2.1 (using giexec wrapper) Batch Size 30
Design Your DNNs in MATLAB, Deploy with GPU Coder Design Deep Learning & Vision Algorithms Manage large image sets Automate image labeling Easy access to models Pre-built training frameworks Accelerate and Scale Training Acceleration with GPU s Scale to clusters High Performance Deployment Automate compilation with GPU Coder On TitanXP: 7x faster than TensorFlow 5x faster than pycaffe2 On Jetson TX2: On par with TensorRT 2x faster than C++-Caffe 31
Check Out Deep Learning in MATLAB and GPU Coder Deep learning in MATLAB GPU Coder systematics.co.il\mwevents 32