GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB


GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB Ram Kokku 2018 The MathWorks, Inc. 1

GPUs and CUDA programming. [Chart: performance (faster) vs. ease of programming (easier); CUDA and OpenCL C/C++ sit at the high-performance end, MATLAB and Python at the easy end, with GPU Coder bridging the two.] GPUs are hardware on steroids, but programming them is hard. 2

Consider an example: saxpy. Scalarized MATLAB vs. vectorized MATLAB: automatic compilation from an expressive language to a high-performance language. 3
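A minimal sketch, assuming the slide's code (not captured in the transcript) looks roughly like this, of scalarized vs. vectorized saxpy in MATLAB:

    % Scalarized MATLAB: an explicit element-wise loop that GPU Coder can
    % turn into a CUDA kernel.
    function C = saxpy_scalarized(a, X, Y)   %#codegen
        C = zeros(size(X), 'like', X);
        for k = 1:numel(X)
            C(k) = a * X(k) + Y(k);
        end
    end

    % Vectorized MATLAB: the same computation as a single array expression.
    function C = saxpy_vectorized(a, X, Y)   %#codegen
        C = a .* X + Y;
    end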

Deep learning applications: traffic sign recognition with TensorRT. TensorRT is great for inference, but applications require more than inference. 4

GPU Coder is new technology released in September 2017: accelerated implementation of parallel algorithms on GPUs. Neural networks: deep learning, machine learning. Image processing and computer vision: image filtering, feature detection/extraction. Signal processing and communications: FFT, filtering, cross-correlation. 5x faster than TensorFlow, 2x faster than MXNet, 60x faster than CPUs for stereo disparity, 20x faster than CPUs for FFTs. 5

Example: Lidar semantic segmentation 6

Talk outline: Introduction; GPU Coder internals (automatic parallelization, memory optimization, deep learning compilation); Application demo: Lidar processing in MATLAB using deep learning. 7

GPU Coder automatically extracts parallelism from MATLAB: 1. Scalarized MATLAB (for-all loops): infer CUDA kernels from MATLAB loops. 2. Vectorized MATLAB (math operators and library functions). 3. Composite functions in MATLAB: library replacement (maps to cuBLAS, cuFFT, cuSOLVER, cuDNN, TensorRT). 8

From a loop to a CUDA kernel (extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops), 2. Vectorized MATLAB, 3. Composite functions).

MATLAB loop:
    for k = 1:n
        t = A(k) .* X(k);
        C(k) = t + Y(k);
    end

Generated CUDA (pseudo code):
    mykernel<<< f(n) >>>(A, X, Y, C, n);

    static __global__ void mykernel(A, X, Y, C, n) {
        int k = getThreadIndex(n);
        int t = A[k] * X[k];
        C[k] = t + Y[k];
    }

Steps: Is this loop parallel? (dependence analysis to understand the iteration space.) Compute the kernel size f(n). Classify kernel variables (ins: A, X, Y, n; outs: C; local: t, k). Create the kernel from the loop body. 9

From a loop nesting to CUDA kernels (extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops), 2. Vectorized MATLAB, 3. Composite functions).

Imperfect loops:
    for i = 1:p
        (outer prologue code)
        for j = 1:m
            for k = 1:n
                (inner loop)
        (outer epilogue code)

Perfect loops:
    for i = 1:p
        for j = 1:m
            for k = 1:n
                (inner loop)

Fundamental requirement: loops need to be contiguous for parallelization. 10

From a loop nesting to CUDA kernels (extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops), 2. Vectorized MATLAB, 3. Composite functions).

Loop perfectization (if possible) rewrites the imperfect nest

    for i = 1:p
        (outer prologue code)
        for j = 1:m
            for k = 1:n
                (inner loop)
        (outer epilogue code)

as a perfect nest

    for i = 1:p
        for j = 1:m
            for k = 1:n
                (outer prologue code)
                (inner loop)
                if k == n
                    (outer epilogue code)

Fundamental requirement: loops need to be contiguous for parallelization. 11

From a loop nesting to CUDA kernels (extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops), 2. Vectorized MATLAB, 3. Composite functions).

Example 1 (candidate iteration spaces M x N and K x P):
    for a = 1:M
        for b = 1:N
            (outer prologue code)
            for c = 1:K
                for d = 1:P
                    (inner loop)

Example 2 (candidate iteration spaces P and MxN + KxL):
    for i = 1:P
        for a = 1:M
            for b = 1:N
                (inner loop)
        for x = 1:K
            for y = 1:L
                (inner loop)

Find parallel loops (dependence analysis), then partition the loop nesting (the heuristic may favor the larger iteration space). Fundamental requirement: loops need to be contiguous for parallelization. 12

From a loop nesting to CUDA kernels (extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops), 2. Vectorized MATLAB, 3. Composite functions).

Example 1 (candidate iteration spaces M x N and K x P):
    for a = 1:M
        for b = 1:N
            (outer prologue code)
            for c = 1:K
                for d = 1:P
                    (inner loop)

Example 2 (candidate iteration spaces P and MxN + KxL):
    for i = 1:P
        for a = 1:M
            for b = 1:N
                (inner loop)
        for x = 1:K
            for y = 1:L
                (inner loop)

Find parallel loops (dependence analysis), partition the loop nesting (the heuristic may favor the larger iteration space), and create a kernel for each partition, using the process from the single-loop conversion. Fundamental requirement: loops need to be contiguous for parallelization. 13

From vectorized MATLAB to CUDA kernels (extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops), 2. Vectorized MATLAB, 3. Composite functions).

    output(:, 1) = (input(:, 1) - x_im) .* factor;

Assume the following sizes: output: M x 3, input: M x 3, x_im: M x 1, factor: M x 1.

Scalarization (reduce to for-loops):
    for i = 1:M
        diff(i) = input(i, 1) - x_im(i);
    end
    for i = 1:M
        output(i, 1) = diff(i) * factor(i);
    end

Loop fusion (create larger parallel loops, and hence larger CUDA kernels):
    for i = 1:M
        diff(i) = input(i, 1) - x_im(i);
        output(i, 1) = diff(i) * factor(i);
    end
14

From vectorized MATLAB to CUDA kernels (extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops), 2. Vectorized MATLAB, 3. Composite functions).

    output(:, 1) = (input(:, 1) - x_im) .* factor;

Assume the following sizes: output: M x 3, input: M x 3, x_im: M x 1, factor: M x 1.

Scalarization (reduce to for-loops):
    for i = 1:M
        diff(i) = input(i, 1) - x_im(i);
    end
    for i = 1:M
        output(i, 1) = diff(i) * factor(i);
    end

Loop fusion (create larger parallel loops, and hence larger CUDA kernels):
    for i = 1:M
        diff(i) = input(i, 1) - x_im(i);
        output(i, 1) = diff(i) * factor(i);
    end

Scalar replacement (reduce temporary matrices to temporary scalars):
    for i = 1:M
        tmp = input(i, 1) - x_im(i);
        output(i, 1) = tmp * factor(i);
    end
15

From composite functions to optimized CUDA (extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops), 2. Vectorized MATLAB, 3. Composite functions). Core math: matrix multiply (cuBLAS), linear algebra (cuSOLVER), FFT functions (cuFFT), convolution. Image processing and computer vision: imfilter, imresize, imerode, imdilate, bwlabel, imwarp. Neural networks: deep learning inference (cuDNN, TensorRT). Over 300 MATLAB functions are optimized for CUDA code generation. 16
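The library replacement happens during ordinary code generation. A minimal sketch, with an illustrative entry-point function xcorr_fft (not from the talk), of generating CUDA where the fft/ifft calls map to cuFFT:

    % Entry point: frequency-domain cross-correlation of two signals.
    function y = xcorr_fft(a, b)   %#codegen
        % fft/ifft are replaced with cuFFT calls; the element-wise
        % multiply becomes a generated CUDA kernel.
        y = ifft(fft(a) .* conj(fft(b)));
    end

    % At the MATLAB prompt:
    cfg = coder.gpuConfig('mex');    % GPU MEX target
    codegen -config cfg xcorr_fft -args {rand(1,4096), rand(1,4096)}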

Talk outline: Introduction; GPU Coder internals (automatic parallelization, memory optimization, deep learning compilation); Application demo: Lidar processing in MATLAB using deep learning. 17

Optimizing CPU-GPU data movement is a challenge.

    A = ...
    for i = 1:N
        A(i) = ...
        ...
    end
    ...
    imfilter(...)

Where is the ideal placement of memcpy? A naive placement copies around every kernel:

    A = ...
    cudaMemcpyHtoD(gA, A);
    kernel1<<< ... >>>(gA);
    cudaMemcpyDtoH(...);
    cudaMemcpyHtoD(...);
    imfilter_kernel(...);
    cudaMemcpyDtoH(...);

An optimized placement keeps the data on the GPU across the kernels K1, K2, K3 and copies only at the boundaries. 18

GPU Coder optimizes memcpy placement.

    A(:) = ...
    C(:) = ...
    for i = 1:N
        ...
        gB = kernel1(gA);
        gA = kernel2(gB);
        if (some_condition)
            gC = kernel3(gA, gB);
        end
        ...
    end
    ... = C;

Assume gA, gB and gC are mapped to GPU memory. The slide's callouts mark points where a cudaMemcpy is *not* needed, *definitely* needed, or *may be* needed.

Observations: the problem is equivalent to partial redundancy elimination (PRE). GPU Coder uses a dynamic strategy: track the memory location with a status flag per variable, and use use-def analysis to determine where to insert the memcpy.

Generated (pseudo) code:
    A(:) = ...
    A_isDirtyOnCpu = true;
    ...
    for i = 1:N
        if (A_isDirtyOnCpu)
            cudaMemcpy(gA, A);
            A_isDirtyOnCpu = false;
        end
        gB = kernel1(gA);
        gA = kernel2(gB);
        if (some_condition)
            gC = kernel3(gA, gB);
            C_isDirtyOnGpu = true;
        end
    end
    ...
    if (C_isDirtyOnGpu)
        cudaMemcpy(C, gC);
        C_isDirtyOnGpu = false;
    end
    ... = C;
19

GPU memory hierarchy is deep.

    Memory                   | Visibility     | Heuristics/notes                        | GPU Coder support
    Global memory            | CPU + GPU      | Share data between CPU and GPU          | Yes
    Local memory / registers | per GPU thread | Thread-local data                       | Yes
    Shared memory            | per GPU block  | Shared data between threads             | Yes
    Texture memory           | CPU + GPU      | Shared read-only data with 2D alignment | Future
    Constant memory          | GPU            | Read-only constants                     | Yes
20
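A minimal sketch (the function and its lookup-table argument are illustrative assumptions, not from the talk) of the GPU Coder pragmas that influence kernel creation and memory placement:

    function out = applyLUT(img, lut)   %#codegen
        coder.gpu.kernelfun();            % map the loops in this function to CUDA kernels
        coder.gpu.constantMemory(lut);    % place the read-only table in GPU constant memory
        out = coder.nullcopy(zeros(size(img), 'like', img));
        for k = 1:numel(img)
            out(k) = lut(img(k) + 1);     % img holds 0-based indices into lut
        end
    end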

GPU Coder automatically maps data to shared memory. [Figure: input image (rows x cols), convolution kernel (kh x kw), output image; each output pixel is the dot product of the kernel with an input window.] GPU Coder automatically creates shared memory for many MATLAB image processing functions: imfilter, imerode, imdilate, conv2, ... 21

NVIDIA Hardware Support Package (HSP): simple out-of-the-box targeting of NVIDIA boards (Jetson, DRIVE platform). 24
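A minimal sketch (board name, credentials, and the entry-point name are illustrative assumptions) of targeting a Jetson board through the support package:

    hwobj = jetson('jetson-board-name', 'ubuntu', 'ubuntu');   % connect to the board
    cfg = coder.gpuConfig('exe');
    cfg.GenerateExampleMain = 'GenerateCodeAndCompile';
    cfg.Hardware = coder.hardware('NVIDIA Jetson');
    cfg.Hardware.BuildDir = '~/remoteBuildDir';
    codegen -config cfg myEntryPoint -args {ones(480,640,'single')}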

Talk outline: Introduction; GPU Coder internals (automatic parallelization, memory optimization, deep learning compilation); Application demo: Lidar processing in MATLAB using deep learning. 25

Deep learning workflow in MATLAB. DNN design + training: design and train in MATLAB, or bring in a trained DNN through the model importer. Design in MATLAB: manage large image sets, automate data labeling, easy access to models. Training in MATLAB: acceleration with GPUs, scale to clusters. 26
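A minimal sketch (file names are illustrative assumptions) of the model importer and pretrained-model routes into MATLAB:

    net = importCaffeNetwork('deploy.prototxt', 'weights.caffemodel');   % Caffe importer
    net = importKerasNetwork('model.h5');                                % Keras importer
    net = resnet50();                                                    % or a built-in pretrained model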

Deep learning workflow in MATLAB. DNN design + training (train in MATLAB or import a trained DNN via the model importer) feeds application design (application logic). 27

Deep learning workflow in MATLAB. DNN design + training (train in MATLAB or import a trained DNN via the model importer), application design (application logic), and standalone deployment (C++/CUDA + TensorRT or C++/CUDA + cuDNN). 28

Deep learning workflow in MATLAB. DNN design + training (train in MATLAB or import a trained DNN via the model importer), application design (application logic), and standalone deployment (C++/CUDA + TensorRT or C++/CUDA + cuDNN). 29

Traffic sign detection and recognition pipeline: object detection DNN, strongest bounding box, classifier DNN. 30

GPU Coder allows for choice in deployment (cuDNN, TensorRT): the application logic can be compiled to C++/CUDA + cuDNN, C++/CUDA + TensorRT, or C++/CUDA + TensorRT (int8). 31
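A minimal sketch (the entry-point name and input size are illustrative assumptions) of how the deep learning backend is selected at code-generation time:

    cfg = coder.gpuConfig('dll');
    cfg.TargetLang = 'C++';
    cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');        % C++/CUDA + cuDNN
    % cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');   % C++/CUDA + TensorRT
    % cfg.DeepLearningConfig.DataType = 'int8';                        % TensorRT int8 mode
    codegen -config cfg trafficSignNet_predict -args {ones(224,224,3,'single')}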

Performance summary (VGG-16) on a Titan Xp. [Bar chart comparing GPU Coder (TensorRT int8), GPU Coder (TensorRT fp32), GPU Coder (cuDNN fp32), and MATLAB (cuDNN fp32).] 32

MATLAB and GPU Coder support state-of-the-art classification networks (GoogLeNet, ResNet-50, AlexNet, SqueezeNet):

    Network    | # Layers | Size   | Frame rate (GPU Coder)
    AlexNet    | 25       | 227 MB | 500 fps
    ResNet-50  | 177      | 96 MB  | 160 fps
    GoogLeNet  | 144      | 27 MB  | 190 fps
    SqueezeNet | 68       | 5 MB   | 615 fps
33

ResNet-50 inference on NVIDIA Titan V. [Chart: frames per second vs. batch size for MATLAB GPU Coder + TensorRT 4 (int8), MATLAB GPU Coder + TensorRT 4, MATLAB GPU Coder + cuDNN, PyTorch, and TensorFlow.] Testing platform: CPU Intel Xeon E5-1650 v3 @ 3.5 GHz, GPU NVIDIA Titan V. 34

ResNet-50 inference on NVIDIA Titan V. [Chart: frames per second vs. batch size for MATLAB GPU Coder + TensorRT 4 (int8), TensorFlow + TensorRT 4 (int8), MATLAB GPU Coder + TensorRT 4 (fp32), and TensorFlow + TensorRT 4 (fp32).] Testing platform: CPU Intel Xeon E5-1650 v3 @ 3.5 GHz, GPU NVIDIA Titan V. 35

Lidar processing application design is easy in MATLAB. DNN design + training (train in MATLAB or import a trained DNN via the model importer) feeds application design (application logic). 36

Lidar processing application design is easy in MATLAB. DNN design + training (train in MATLAB or import a trained DNN via the model importer) feeds application design (application logic). 37

Talk outline: Introduction; GPU Coder internals (automatic parallelization, memory optimization, deep learning compilation); Application demo: Lidar processing in MATLAB using deep learning. 38

Why use lidar and deep learning for autonomous vehicles? Why use lidar: provides accurate 3-D structure of the scene; a required sensor for autonomous driving. Why use deep learning: best accuracy; high-performance inference with GPU Coder. Applications: classifying point cloud clusters (ground, cars), lidar semantic segmentation. 39

What Does Lidar Data Look Like? 40

Lidar processing application design is easy in MATLAB. DNN design + training (data prep, labeling, training, trained DNN), application design (application logic), and standalone deployment (C++/CUDA + TensorRT or C++/CUDA + cuDNN). 41

Data preparation and labeling of lidar is a challenge. DNN design + training: accessing lidar data, lidar preprocessing, labeling lidar data, training, trained DNN. 42

Access and visualize lidar data. Access stored lidar data: Velodyne file I/O (pcap), individual point clouds (.pcd, .ply). Visualize lidar data: streaming lidar player, static point cloud display, point cloud differences. 43

Lidar preprocessing. Remove ground: fit a plane using RANSAC. Cluster: segment clusters using Euclidean distance. 44
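A minimal sketch, assuming a pointCloud object ptCloud and illustrative thresholds, of these two preprocessing steps with pcfitplane and pcsegdist:

    maxDistance = 0.3;                 % plane-fit inlier tolerance (m)
    referenceVector = [0 0 1];         % ground is roughly horizontal
    [~, groundIdx, nonGroundIdx] = pcfitplane(ptCloud, maxDistance, referenceVector);
    nonGround = select(ptCloud, nonGroundIdx);           % remove ground returns

    minDistance = 0.5;                 % Euclidean clustering threshold (m)
    [labels, numClusters] = pcsegdist(nonGround, minDistance);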

Ground Truth Labeling of Lidar Data 45

Ground Truth Labeling of Lidar Data 46

Ground Truth Labeling of Lidar Data 47

Organize data for training: raw point cloud data and ground truth labels are transformed to a label mask and projected to 2D. 48

Create Network Architecture Easy MATLAB API to create network 49
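A minimal sketch (image size, class count, and training options are illustrative assumptions, not the talk's network) of creating and training a segmentation network with this API:

    imageSize  = [64 1024 5];     % projected 2-D lidar image: height x width x channels
    numClasses = 2;               % e.g. ground vs. vehicle
    lgraph = segnetLayers(imageSize, numClasses, 2);    % SegNet with encoder depth 2

    opts = trainingOptions('sgdm', ...
        'InitialLearnRate', 1e-3, ...
        'MaxEpochs', 30, ...
        'ExecutionEnvironment', 'multi-gpu');           % scale training across GPUs
    % net = trainNetwork(trainingData, lgraph, opts);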

Deployment using GPU Coder: C++/CUDA + TensorRT or C++/CUDA + cuDNN. 50

Lidar semantic segmentation 51

Deep learning workflow in MATLAB. DNN design + training (train in MATLAB or import a trained DNN via the model importer), application design (application logic), and standalone deployment (C++/CUDA + TensorRT or C++/CUDA + cuDNN). 52

Thank you 53

Backup 54

Outline #1 (without lidar demo): Parallelism and GPUs (why GPU Coder, use cases, etc.) [4 mins]. Compiler overview [15 mins]: targeting loops, memory optimizations, library replacements; Demo: workflow demo using stereo disparity (extend to run on HW); Benchmarks: IPT functions, stereo disparity. Workflow and hardware targeting [10 mins]: from algorithm to embedded; Demo: extend the stereo-disparity demo to run on Jetson or DRIVE PX2. Deep learning compiler overview [15 mins]: design philosophy, dataflow graph optimizations, library plugins; Demo: applications using DNNs such as traffic signs and age detection (both cuDNN + TensorRT); Demo: quantization with TensorRT/int8 (logo detection); Benchmarks: AlexNet, ResNet, SegNet. 55

Outline #2 (with lidar demo): Parallelism and GPUs (why GPU Coder, use cases, etc.) [4 mins]. Compiler and workflow [15 mins]: loops, memory, library mapping and optimizations; rapid prototyping and deployment with the HSP; Demo: workflow demo using stereo disparity (extend to run on HW); Benchmarks: IPT functions, stereo disparity. Deep learning compiler overview [15 mins]: design philosophy, dataflow graph optimizations, library plugins; Demo: applications using DNNs such as traffic signs and age detection (both cuDNN + TensorRT); Demo: quantization with TensorRT/int8 (logo detection); Benchmarks: AlexNet, ResNet, SegNet. End-to-end workflow demo (lidar) [15 mins]: labeling, training, deployment on DRIVE PX2. 56

Demos for talk: workflow demo (AlexNet, ResNet, traffic signs, stereo disparity); board demo (Jetson, DRIVE PX2); TensorRT demos (traffic signs, LogoNet + int8); Arvind's lidar demo (end-to-end). Performance benchmarks: IPT functions; deep learning inference (GPU Coder (cuDNN, TensorRT) vs. Caffe2, TensorFlow, MXNet): AlexNet, ResNet, GoogLeNet, SegNet. 57

Autonomous systems: perception, localization, planning, controls. 58

Consider an example: saxpy Scalarized MATLAB Vectorized MATLAB 59

Deep learning workflow in MATLAB. Train in MATLAB or import a trained DNN via the model importer; the trained DNN and application logic are compiled to C++/CUDA + TensorRT or C++/CUDA + cuDNN. 60

Re-targetability: a target-agnostic interface layer. [Diagram: a pre-trained DL network enters the front end; cross-layer optimizations produce auto-generated, target-agnostic code (DL structure, malloc, scheduling; algorithm only) through C and GPU back ends; a hand-written interface layer (ARM Neon, ARM Mali, MKL-DNN, cuDNN, TensorRT interfaces) binds it to the ARM Compute Library, MKL-DNN, cuDNN, or TensorRT libraries.] 61

Deep learning workflow in MATLAB. DNN design + training (train in MATLAB or import a trained DNN via the model importer), application logic, and standalone C++/CUDA + TensorRT or C++/CUDA + cuDNN outputs. 62

Design pattern: stencil kernel. Problem: how to auto-generate the hand-coded version from

    for x = 1:orow
        for y = 1:ocol
            C(x,y) = f( In(x:x+window_x, y:y+window_y) );
        end
    end

Key insight: this loop structure is always the same; the user only needs to specify f(). f() takes an input window and returns a scalar output, e.g. convolution: x = sum( input(:) .* kernel(:) ). 63
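A minimal sketch (the 3x3 averaging function is an illustrative assumption) of expressing this pattern with gpucoder.stencilKernel so that GPU Coder generates the tiled, shared-memory CUDA kernel:

    function out = blur3x3(im)   %#codegen
        kernel = ones(3,3) / 9;
        out = gpucoder.stencilKernel(@avg_win, im, [3 3], 'same', kernel);
    end

    function y = avg_win(win, kernel)
        % f(): takes one 3x3 input window, returns one scalar output pixel
        y = sum(win(:) .* kernel(:));
    end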

Design pattern: stencil kernel. [Diagram: myfunc applied with a sliding kernel window over the n-by-m input image im to produce out.] 64

GPU Coder automatically maps data to shared memory. [Figure: input image (rows x cols), convolution kernel (kh x kw), output image; each output pixel is a dot product.] GPU Coder automatically infers data locality, creates GPU shared memory, performs collaborative loading, and rewrites the algorithm to use shared memory. Optimized Image Processing Toolbox functions: conv2, imfilter, imerode, imdilate. 65

Memory management modes are configurable. Standard (explicit copy): cudaMalloc; separate host and GPU memory allocations (and variables); CUDA 7 or earlier for desktop GPUs; optimal dynamic memcpy minimization. Unified memory (implicit copy): cudaMallocManaged; no need for separate variables; CUDA 8 or embedded GPUs; optimal dynamic device sync. 66
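A minimal sketch (assuming the MallocMode property of the GPU Coder configuration object, with an illustrative entry point) of selecting between the two modes:

    cfg = coder.gpuConfig('lib');
    cfg.GpuConfig.MallocMode = 'discrete';    % explicit copy: cudaMalloc + cudaMemcpy
    % cfg.GpuConfig.MallocMode = 'unified';   % implicit copy: cudaMallocManaged
    codegen -config cfg myEntryPoint -args {ones(480,640,'single')}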

Design pattern: stencil kernel. [Diagram: myfunc applied with a sliding kernel window over the n-by-m input image im to produce out.] Optimized Image Processing Toolbox functions: conv2, imfilter, imerode, imdilate. 67

Example generated code: 1-D FIR filter. [Generated CUDA kernel with callouts: shared memory declaration, collaborative load, sync point, and the algorithm with accesses translated to shared memory.] 68

Inference benchmarking on a Titan Xp. [Charts: AlexNet and VGG-16 throughput vs. batch size (1 to 128), with TensorFlow (cuDNN fp32) shown as the baseline.] 69

Inference benchmarking on a Titan Xp. [Charts: images/sec vs. batch size (1 to 128) for AlexNet and VGG-16, comparing MATLAB GPU Coder (TensorRT int8), MATLAB GPU Coder (TensorRT fp32), MATLAB GPU Coder (cuDNN fp32), MXNet (cuDNN fp32), and TensorFlow (cuDNN fp32).] 70

Deep learning workflow: access data (raw point clouds), preprocess and label (label data), train network (create deep network, train on multi-GPU), inference. 71

Access and visualize lidar data (access data). Access stored lidar data: Velodyne file I/O (pcap), individual point clouds (.pcd, .ply). Visualize lidar data: streaming lidar player, static point cloud display, point cloud differences. 72

Lidar preprocessing (preprocess and label). ROI + remove ground: fit a plane using RANSAC. Cluster: segment clusters using Euclidean distance. 73

Ground Truth Labeling of Lidar Data Preprocess and Label 74

Ground Truth Labeling of Lidar Data Preprocess and Label 75

Ground Truth Labeling of Lidar Data Preprocess and Label 76

Organize data for training (train network): raw point cloud data and ground truth labels are transformed to a label mask and projected to 2D. 77

Create Network Architecture Train Network Easy API to create network structure 78

Compile the model to CUDA code (inference) with GPU Coder. 79