Deep Learning in MATLAB: From Concept to CUDA Code

Roy Fahn, Applications Engineer, Systematics (royf@systematics.co.il, 03-7660111)
Ram Kokku, Principal Engineer, MathWorks (ram.kokku@mathworks.com)
© 2017 The MathWorks, Inc.

Talk Outline
Design deep learning and vision algorithms: manage large image sets; automate image labeling; easy access to models; pre-built training frameworks.
Accelerate and scale training: acceleration with GPUs; scale to clusters.
High-performance deployment: automate compilation with GPU Coder. On Titan Xp: 7x faster than TensorFlow, 5x faster than pycaffe2. On Jetson: on par with TensorRT, 2x faster than C++ Caffe.

Example: Transfer Learning Workflow
Training data (images, labels) -> load reference network -> modify network structure -> learn new weights -> new classifier.
Labels: Cars, Trucks, BigTrucks, SUVs, Vans.

Example: Transfer Learning in MATLAB
Step: set up training dataset.
Split, shuffle, and rearrange images; read images and apply data augmentation (clip, rotate, resize, etc.).
Easily manage large sets of images: a single line of code to access images; operates on disk, database, or big-data file system.
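The "split, shuffle" step above can be sketched as follows. This is a minimal illustration in plain Python, not the MATLAB code shown on the slide: the function name `split_each_label`, the file names, and the 80/20 split are all assumptions made for the example. Only the five labels come from the talk.

```python
import random
from collections import defaultdict

def split_each_label(items, train_fraction, seed=0):
    """Shuffle (path, label) pairs and split them per label.

    Hypothetical helper for illustration; it keeps the same
    train fraction for every label so no class is under-represented.
    """
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        paths = by_label[label][:]
        rng.shuffle(paths)
        cut = round(train_fraction * len(paths))
        train += [(p, label) for p in paths[:cut]]
        test += [(p, label) for p in paths[cut:]]
    return train, test

# Labels from the talk; the image paths are made up.
labels = ["Cars", "Trucks", "BigTrucks", "SUVs", "Vans"]
items = [(f"{lab}/img_{i:03d}.png", lab) for lab in labels for i in range(10)]

train_set, test_set = split_each_label(items, train_fraction=0.8)
print(len(train_set), len(test_set))
```

Splitting per label rather than over the whole list is a design choice: it keeps the class balance identical in the training and test sets.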

Example: Transfer Learning in MATLAB
Steps: set up training dataset; load reference network.
Create DNNs in MATLAB: 1. Easy access to research models. 2. Caffe model importer. 3. Build from scratch.

Example: Transfer Learning in MATLAB
Steps: set up training dataset; load reference network; modify network structure.

Example: Transfer Learning in MATLAB
Steps: set up training dataset; load reference network; modify network structure; learn new weights.
Many more training options are available.

Deep Learning on CPU, GPU, Multi-GPU and Clusters
[Figure labels: More CPUs, More GPUs]

Visualizing and Debugging Intermediate Results
Many options for visualization and debugging, with examples to get started: training accuracy visualization, Deep Dream, filters, layer activations, feature visualization.

GPU Coder for Deployment: New Product in R2017b
GPU Coder: accelerated implementation of parallel algorithms on GPUs.
Neural networks: deep learning, machine learning.
Image processing and computer vision: image filtering, feature detection/extraction.
Signal processing and communications: FFT, filtering, cross-correlation.
Reported speedups: 7x faster than state of the art; 700x faster than CPUs for feature extraction; 20x faster than CPUs for FFTs.

GPU Coder Compilation Flow
GPU Coder stages: CUDA kernel creation; memory allocation; data transfer minimization.
Techniques: library function mapping; loop optimizations; dependence analysis; data locality analysis; GPU memory allocation; data-dependence analysis; dynamic memcpy reduction.

GPU Coder Generates CUDA from MATLAB: saxpy
[Code panels: scalarized MATLAB; vectorized MATLAB; CUDA; CUDA kernel for GPU parallelization]
Loops and matrix operations are compiled directly into kernels.
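For readers without the slide images, saxpy (the BLAS operation "single-precision a times x plus y") computes z = a*x + y element by element. The sketch below shows the scalarized (explicit loop) and vectorized (whole-array) forms in plain Python, purely to illustrate the two styles the slide contrasts. It is not the MATLAB source or the CUDA that GPU Coder generates, the function names are illustrative, and Python floats are double precision rather than single.

```python
def saxpy_scalar(a, x, y):
    """Scalarized form: an explicit loop, z[i] = a * x[i] + y[i].

    Each iteration reads and writes only index i, so the iterations
    are independent of one another.
    """
    z = [0.0] * len(x)
    for i in range(len(x)):
        z[i] = a * x[i] + y[i]
    return z

def saxpy_vector(a, x, y):
    """Vectorized form: the same computation written over whole arrays."""
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

z = saxpy_scalar(2.0, x, y)
print(z)
print(saxpy_vector(2.0, x, y) == z)
```

Because no iteration depends on another, a loop of this shape is a natural candidate for a GPU kernel in which each thread handles one element.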

Generated CUDA Optimized for Memory Performance
Kernel data allocation is automatically optimized.
[Code panel labels: Mandelbrot space; CUDA; CUDA kernel for GPU parallelization]

Algorithm Design to Embedded Deployment Workflow
1. Functional test: MATLAB algorithm (functional reference).
2. Deployment unit-test: call CUDA from MATLAB directly; build type .mex; desktop GPU.
3. Deployment integration-test: call CUDA from a hand-coded C++ main(); build type .lib; desktop GPU.
4. Real-time test: call CUDA from a hand-coded C++ main(); build type cross-compiled .lib; embedded GPU.

Demo: AlexNet Deployment with MEX Code Generation

Algorithm Design to Embedded Deployment on Tegra GPU
1. Functional test: MATLAB algorithm (functional reference). Test in MATLAB on host.
2. Deployment unit-test: call CUDA from MATLAB directly; build type .mex; Tesla GPU. Test generated code in MATLAB on host + GPU.
3. Deployment integration-test: call CUDA from a hand-coded C++ main(); build type .lib; Tesla GPU. Test generated code within C/C++ app on host + GPU.
4. Real-time test: call CUDA from a hand-coded C++ main(), cross-compiled on host with the Linaro toolchain; build type cross-compiled .lib; Tegra GPU. Test generated code within C/C++ app on Tegra target.

AlexNet Deployment to Tegra: Cross-Compiled with lib
Two small changes: 1. Change the build type to lib. 2. Select the cross-compile toolchain.

End-to-End Application: Lane Detection
https://tinyurl.com/ybaxnxjg
AlexNet transfer learning: the output of the CNN is the lane parabola coefficients according to y = ax^2 + bx + c.
Pipeline: image -> lane detection CNN -> left lane coefficients and right lane coefficients -> post-processing (find left/right lane points) -> image with marked lanes.
GPU Coder generates code for the whole application.
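The post-processing step can be pictured as evaluating each parabola at a set of x positions to obtain points to draw. The sketch below is an assumption about what "find left/right lane points" involves, written in plain Python. The function name `lane_points`, the coefficient values, and the sample x positions are hypothetical and do not come from the talk.

```python
def lane_points(coeffs, xs):
    """Evaluate y = a*x^2 + b*x + c at each x and return (x, y) pairs."""
    a, b, c = coeffs
    return [(x, a * x * x + b * x + c) for x in xs]

# Hypothetical CNN outputs: one (a, b, c) triple per lane boundary.
left_coeffs = (0.5, -1.0, 2.0)
right_coeffs = (0.5, -1.0, 6.0)

xs = [0, 1, 2, 3]  # sample positions, arbitrary units
left = lane_points(left_coeffs, xs)
right = lane_points(right_coeffs, xs)

for (x, yl), (_, yr) in zip(left, right):
    print(x, yl, yr)
```

Representing each lane by three coefficients keeps the network output small (six numbers per image here), and the marked points are recovered afterwards by this kind of evaluation.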

How Good is Generated Code Performance?
Performance of image processing and computer vision.
Performance of CNN inference (AlexNet) on Titan Xp GPU.
Performance of CNN inference (AlexNet) on Jetson (Tegra) TX2.

GPU Coder for Image Processing and Computer Vision
Examples: fog removal, stereo disparity, distance transform, SURF feature extraction, ray tracing.
Orders of magnitude speedup over CPU.

AlexNet Inference on NVIDIA Titan Xp
[Chart: frames per second vs. batch size. Series: MATLAB GPU Coder (R2017b), mxnet (0.10), MATLAB (R2017b), Caffe2 (0.8.1), TensorFlow (1.2.0). Speedup annotations: 2x, 5x, 7x.]
Testing platform: CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz; GPU Pascal Titan Xp; cuDNN v5.

AlexNet Inference on Jetson TX2: Frame-Rate Performance
[Chart: frames per second vs. batch size (1, 16, 32, 64, 128, 256). Series: TensorRT (2.1), MATLAB GPU Coder (R2017b), C++ Caffe (1.0.0-rc5). Annotations: 0.85x, 2x.]

AlexNet Inference on Jetson TX2: Memory Performance
[Chart: peak memory (MB) vs. batch size. Series: C++ Caffe (1.0.0-rc5), MATLAB GPU Coder (R2017b), TensorRT 2.1 (using giexec wrapper).]

Design Your DNNs in MATLAB, Deploy with GPU Coder
Design deep learning and vision algorithms: manage large image sets; automate image labeling; easy access to models; pre-built training frameworks.
Accelerate and scale training: acceleration with GPUs; scale to clusters.
High-performance deployment: automate compilation with GPU Coder. On Titan Xp: 7x faster than TensorFlow, 5x faster than pycaffe2. On Jetson TX2: on par with TensorRT, 2x faster than C++ Caffe.

Check Out Deep Learning in MATLAB and GPU Coder
Deep learning in MATLAB; GPU Coder
systematics.co.il\mwevents