GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB


1 GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB
Ram Kokku
2018 The MathWorks, Inc.

2 GPUs and CUDA programming
[Figure: languages placed on two axes, Performance (faster) and Ease of programming (expressivity, algorithmic, ...; easier). Labels: CUDA, OpenCL, C/C++, GPU Coder, MATLAB, Python]
GPUs are hardware on steroids, but programming them is hard

3 Consider an example: saxpy
[Code panels: Scalarized MATLAB, Vectorized MATLAB]
Automatic compilation from an expressive language to a high-performance language

4 Deep learning applications
[Figure labels: TensorRT, Deep learning, Traffic Sign Recognition]
TensorRT is great for inference, but applications require more than inference

5 GPU Coder is new technology released in September 2017
Accelerated implementation of parallel algorithms on GPUs
Neural networks: deep learning, machine learning
Image processing and computer vision: image filtering, feature detection/extraction
Signal processing and communications: FFT, filtering, cross correlation, ...
Speedups quoted on the slide:
    5x faster than TensorFlow
    2x faster than MXNet
    60x faster than CPUs for stereo disparity
    20x faster than CPUs for FFTs

6 Example: Lidar semantic segmentation 6

7 Talk outline Introduction GPU Coder internals Automatic parallelization Memory optimization Deep learning compilation Application demo: Lidar processing in MATLAB using deep learning 7

8 GPU Coder automatically extracts parallelism from MATLAB
1. Scalarized MATLAB ("for-all" loops)
    Infer CUDA kernels from MATLAB loops
2. Vectorized MATLAB (math operators and library functions)
3. Composite functions in MATLAB (maps to cuBLAS, cuFFT, cuSOLVER, cuDNN, TensorRT)
    Library replacement

9 From a loop to a CUDA kernel
Extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops) 2. Vectorized MATLAB 3. Composite functions
MATLAB loop:
    for k = 1:n
        t = A(k) .* X(k);
        C(k) = t + Y(k);
    end
Generated CUDA (pseudocode):
    mykernel<<< f(n) >>>(A, X, Y, C, n);
    static __global__ mykernel(A, X, Y, C, n)
    {
        int k = getthreadindex(n);
        int t = A[k] * X[k];
        C[k] = t + Y[k];
    }
Steps:
    Is this loop parallel? Yes. Dependence analysis to understand the iteration space.
    Compute kernel size: f(n)
    Classify kernel variables (input, output, local): Ins: A, X, Y, n. Outs: C. Local: t, k.
    Create kernel from loop body
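The loop-to-kernel mapping on this slide can be imitated on the CPU. The sketch below is only an illustration in Python, not generated GPU Coder output: the block size of 4, the ceiling-division form of f(n), the bounds check inside the kernel body, and the names kernel_size, my_kernel and launch are all assumptions made for the example.

```python
import math

def kernel_size(n, threads_per_block=4):
    """f(n): number of blocks so that blocks * threads_per_block >= n (assumed form)."""
    return math.ceil(n / threads_per_block)

def my_kernel(block, thread, threads_per_block, A, X, Y, C, n):
    """Work done by one simulated GPU thread: exactly one loop iteration."""
    k = block * threads_per_block + thread  # global thread index (0-based)
    if k < n:                               # surplus threads in the last block do nothing
        t = A[k] * X[k]                     # t is local to the thread
        C[k] = t + Y[k]                     # C is the only output

def launch(A, X, Y, n, threads_per_block=4):
    """Sequential stand-in for mykernel<<< f(n) >>>(A, X, Y, C, n)."""
    C = [0] * n
    for block in range(kernel_size(n, threads_per_block)):
        for thread in range(threads_per_block):
            my_kernel(block, thread, threads_per_block, A, X, Y, C, n)
    return C

result = launch([1, 2, 3, 4, 5], [2, 2, 2, 2, 2], [10, 10, 10, 10, 10], 5)
```

Because every iteration writes a different C[k] and reads only inputs, the simulated threads could run in any order, which is the property the dependence analysis has to establish.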

10 From a loop-nesting to CUDA kernels
Imperfect loops:
    for i = 1:p
        (outer prologue code)
        for j = 1:m
            for k = 1:n
                (inner loop)
            end
        end
        (outer epilogue code)
    end
Perfect loops:
    for i = 1:p
        for j = 1:m
            for k = 1:n
                (inner loop)
            end
        end
    end
Fundamental requirement: loops need to be contiguous for parallelization

11 From a loop-nesting to CUDA kernels
Imperfect loops:
    for i = 1:p
        (outer prologue code)
        for j = 1:m
            for k = 1:n
                (inner loop)
            end
        end
        (outer epilogue code)
    end
Loop perfectization (if possible) turns them into perfect loops:
    for i = 1:p
        for j = 1:m
            for k = 1:n
                (outer prologue code)
                (inner loop)
                if k == n
                    (outer epilogue code)
                end
            end
        end
    end
Fundamental requirement: loops need to be contiguous for parallelization

13 From a loop-nesting to CUDA kernels
Example 1, candidate iteration spaces (M x N) and (K x P):
    for a = 1:M
        for b = 1:N
            (outer prologue code)
            for c = 1:K
                for d = 1:P
                    (inner loop)
                end
            end
        end
    end
Example 2, candidate iteration spaces (P) and (M x N + K x L):
    for i = 1:P
        for a = 1:M
            for b = 1:N
                (inner loop)
            end
        end
        for x = 1:K
            for y = 1:L
                (inner loop)
            end
        end
    end
Steps:
    Find parallel loops: dependence analysis
    Partition loop nesting: heuristic may favor larger iteration space
    Create kernel for each partition: use process from single loop conversion
Fundamental requirement: loops need to be contiguous for parallelization
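The slide says only that the partitioning heuristic "may favor larger iteration space". As a rough illustration of that one idea, the Python sketch below compares the two candidate partitions of Example 2 by their total number of parallel iterations. The real GPU Coder heuristic is not described in the talk, so the scoring rule, the function names and the loop bounds used here are all hypothetical.

```python
from math import prod

def iteration_space(partition):
    """Total parallel iterations: one kernel per loop group, sizes summed."""
    return sum(prod(bounds) for bounds in partition)

def pick_partition(candidates):
    """Return the name of the candidate with the largest iteration space."""
    return max(candidates, key=lambda name: iteration_space(candidates[name]))

# Hypothetical bounds for Example 2: P = 8, M = N = 16, K = L = 4
P, M, N, K, L = 8, 16, 16, 4, 4
candidates = {
    "outer loop (P)": [(P,)],                       # one kernel over i
    "inner nests (MxN + KxL)": [(M, N), (K, L)],    # two kernels inside a serial i loop
}
choice = pick_partition(candidates)
```

With these bounds the inner nests offer 272 parallel iterations against 8 for the outer loop, so this toy rule picks the inner nests.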

15 From vectorized MATLAB to CUDA kernels
Vectorized MATLAB:
    output(:, 1) = (input(:, 1) - x_im) .* factor;
Assume the following sizes: output: M x 3, input: M x 3, x_im: M x 1, factor: M x 1
Scalarization (reduce to for-loops):
    for i = 1:M
        diff(i) = input(i, 1) - x_im(i);
    end
    for i = 1:M
        output(i, 1) = diff(i) * factor(i);
    end
Loop fusion (create larger parallel loops, and hence CUDA kernels):
    for i = 1:M
        diff(i) = input(i, 1) - x_im(i);
        output(i, 1) = diff(i) * factor(i);
    end
Scalar replacement (reduce temp matrices to temp scalars):
    for i = 1:M
        tmp = input(i, 1) - x_im(i);
        output(i, 1) = tmp * factor(i);
    end
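The three rewriting stages can be checked against each other with a small Python sketch. It assumes the operator between input(:, 1) and x_im is subtraction (suggested by the temporary being named diff), uses plain lists in place of MATLAB arrays, and the function names are invented for the example.

```python
def scalarized(inp, x_im, factor):
    """Stage 1: one loop per vector operation, with a temporary array."""
    m = len(inp)
    diff = [0] * m
    out = [0] * m
    for i in range(m):
        diff[i] = inp[i][0] - x_im[i]
    for i in range(m):
        out[i] = diff[i] * factor[i]
    return out

def fused(inp, x_im, factor):
    """Stage 2: the two loops share an iteration space, so they merge into one."""
    m = len(inp)
    diff = [0] * m
    out = [0] * m
    for i in range(m):
        diff[i] = inp[i][0] - x_im[i]
        out[i] = diff[i] * factor[i]
    return out

def scalar_replaced(inp, x_im, factor):
    """Stage 3: diff[i] is only used in the same iteration, so a scalar suffices."""
    m = len(inp)
    out = [0] * m
    for i in range(m):
        tmp = inp[i][0] - x_im[i]
        out[i] = tmp * factor[i]
    return out

inp = [[5, 0, 0], [7, 0, 0], [9, 0, 0]]   # M x 3, only column 1 is read
x_im = [1, 2, 3]
factor = [2, 2, 2]
results = [f(inp, x_im, factor) for f in (scalarized, fused, scalar_replaced)]
```

All three stages give the same output; what changes is the number of loops (and hence kernels) and whether an M-element temporary has to live in memory.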

16 From composite functions to optimized CUDA
Core math: matrix multiply (cuBLAS), linear algebra (cuSOLVER), FFT functions (cuFFT), convolution, ...
Image processing and computer vision: imfilter, imresize, imerode, imdilate, bwlabel, imwarp, ...
Neural networks: deep learning inference (cuDNN, TensorRT)
More than 300 MATLAB functions are optimized for CUDA code generation

17 Talk outline Introduction GPU Coder internals Automatic parallelization Memory optimization Deep learning compilation Application demo: Lidar processing in MATLAB using deep learning 17

18 Optimizing CPU-GPU data movement is a challenge
MATLAB code:
    A = ...
    for i = 1:N
        ... A(i) ...
    end
    imfilter ...
Where is the ideal placement of memcpy?
Generated code with naive placement (pseudocode):
    A = ...
    cudamemcpyhtod(ga, a);
    kernel1<<< ... >>>(ga)
    cudamemcpydtoh( ... )
    cudamemcpyhtod( ... )
    imfilter_kernel( ... )
    cudamemcpydtoh( ... )
[Figure: naive placement vs. optimized placement of copies around kernels K1, K2, K3]

19 GPU Coder optimizes memcpy placement
Assume ga, gb and gc are mapped to GPU memory.
Input code (annotations on the slide mark points where cudamemcpy is *not* needed, *definitely* needed, and *may be* needed):
    A(:) = ...
    C(:) = ...
    for i = 1:N
        ...
        gb = kernel1(ga);
        ga = kernel2(gb);
        if (some_condition)
            gc = kernel3(ga, gb);
        end
        ...
    end
    ... = C;
Observations:
    Equivalent to partial redundancy elimination (PRE)
    Dynamic strategy: track memory location with a status flag per variable
    Use-def analysis determines where to insert memcpy
Generated (pseudo) code:
    A(:) = ...
    A_isDirtyOnCpu = true;
    ...
    for i = 1:N
        if (A_isDirtyOnCpu)
            cudamemcpy(ga, A);
            A_isDirtyOnCpu = false;
        end
        gb = kernel1(ga);
        ga = kernel2(gb);
        if (some_condition)
            gc = kernel3(ga, gb);
            C_isDirtyOnGpu = true;
        end
        ...
    end
    ...
    if (C_isDirtyOnGpu)
        cudamemcpy(C, gc);
        C_isDirtyOnGpu = false;
    end
    ... = C;
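The status-flag strategy can be simulated without a GPU. The Python sketch below is a toy model, not GPU Coder's implementation: "device memory" is just a second list, the TrackedBuffer class, the three kernel bodies and the copy counter are invented for illustration, and only the two flags named on the slide are modelled.

```python
class TrackedBuffer:
    """A host array with a device mirror and flags recording which side is stale."""

    def __init__(self, host):
        self.host = list(host)
        self.device = [0] * len(host)
        self.dirty_on_cpu = True    # host holds data the device has not seen
        self.dirty_on_gpu = False   # device holds data the host has not seen
        self.copies = 0             # number of simulated cudamemcpy calls

    def to_device(self):
        if self.dirty_on_cpu:       # copy only when the device copy is stale
            self.device = list(self.host)
            self.copies += 1
            self.dirty_on_cpu = False
        return self.device

    def to_host(self):
        if self.dirty_on_gpu:       # copy only when the host copy is stale
            self.host = list(self.device)
            self.copies += 1
            self.dirty_on_gpu = False
        return self.host

def kernel1(ga):
    return [x + 1 for x in ga]

def kernel2(gb):
    return [2 * x for x in gb]

def kernel3(ga, gb):
    return [a + b for a, b in zip(ga, gb)]

def run(n_iters, condition):
    A = TrackedBuffer([1, 2, 3])
    C = TrackedBuffer([0, 0, 0])
    for i in range(n_iters):
        ga = A.to_device()              # real copy happens on the first iteration only
        gb = kernel1(ga)
        A.device = kernel2(gb)          # ga is rewritten on the device
        if condition(i):
            C.device = kernel3(A.device, gb)
            C.dirty_on_gpu = True       # C now lives on the device
    return C.to_host(), A.copies, C.copies

with_kernel3 = run(3, lambda i: i == 1)
without_kernel3 = run(3, lambda i: False)
```

In this model A is copied to the device once regardless of the trip count, and C is copied back only if kernel3 actually ran, which is the "may be needed" case resolved at run time.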

20 GPU memory hierarchy is deep
Memory                 | Visibility     | Heuristics/notes                        | GPU Coder support
Global memory          | CPU + GPU      | Share data b/w CPU and GPU              | Yes
Local memory/registers | per GPU thread | Thread local data                       | Yes
Shared memory          | per GPU block  | Shared data between threads             | Yes
Texture memory         | CPU + GPU      | Shared read-only data with 2D alignment | Future
Constant memory        | GPU            | Read-only constants                     | Yes

21 GPU Coder automatically maps data to shared memory
[Figure: input image (rows x cols), convolution kernel (kh x kw), dot product, output image]
GPU Coder automatically creates shared memory for many MATLAB image processing functions: imfilter, imerode, imdilate, conv2, ...

24 NVIDIA Hardware Support Package (HSP)
Simple out-of-box targeting to NVIDIA boards: Jetson, Drive platform

25 Talk outline Introduction GPU Coder internals Automatic parallelization Memory optimization Deep learning compilation Application demo: Lidar processing in MATLAB using deep learning 25

26 Deep learning workflow in MATLAB
[Diagram: DNN design + training. Train in MATLAB or use the model importer to obtain a trained DNN]
Design in MATLAB: manage large image sets, automate data labeling, easy access to models
Training in MATLAB: acceleration with GPUs, scale to clusters

27 Deep learning workflow in MATLAB DNN design + training Model importer Application design Application logic Train in MATLAB Trained DNN Model importer 27

28 Deep learning workflow in MATLAB
DNN design + training: train in MATLAB or use the model importer to obtain a trained DNN
Application design: application logic
Standalone deployment: C++/CUDA + TensorRT, C++/CUDA + cuDNN

30 Traffic sign detection and recognition
Pipeline: object detection DNN, strongest bounding box, classifier DNN

31 GPU Coder allows for choice in deployment (cuDNN, TensorRT)
Application logic can be deployed as: C++/CUDA + cuDNN, C++/CUDA + TensorRT, or C++/CUDA + TensorRT (int8)

32 Performance summary (VGG-16) on Titan Xp
[Chart series: GPU Coder (TensorRT int8), GPU Coder (TensorRT fp32), GPU Coder (cuDNN fp32), MATLAB (cuDNN fp32)]

33 MATLAB and GPU Coder support state-of-art classification networks
[Figure panels: GoogLeNet, ResNet50, AlexNet vs SqueezeNet]
Network    | # Layers | Size | Frame-rate (GPU Coder)
AlexNet    |          |   MB | 500 fps
ResNet     |          |   MB | 160 fps
GoogLeNet  |          |   MB | 190 fps
SqueezeNet | 68       | 5 MB | 615 fps

34 ResNet-50 inference on NVIDIA Titan V
[Chart: frames per second vs. batch size. Series: MATLAB GPU Coder + TensorRT 4 (int8), MATLAB GPU Coder + TensorRT 4, MATLAB GPU Coder + cuDNN, PyTorch, TensorFlow]
Testing platform: CPU: Intel Xeon CPU E GHz, GPU: NVIDIA Titan V

35 ResNet-50 inference on NVIDIA Titan V
[Chart: frames per second vs. batch size. Series: MATLAB GPU Coder + TensorRT 4 (int8), TensorFlow + TensorRT 4 (int8), MATLAB GPU Coder + TensorRT 4 (fp32), TensorFlow + TensorRT 4 (fp32)]
Testing platform: CPU: Intel Xeon CPU E GHz, GPU: NVIDIA Titan V

36 Lidar processing application design is easy in MATLAB DNN design + training Model importer Application design Application logic Train in MATLAB Trained DNN Model importer 36

38 Talk outline Introduction GPU Coder internals Automatic parallelization Memory optimization Deep learning compilation Application demo: Lidar processing in MATLAB using deep learning 38

39 Why Use Lidar and Deep Learning for Autonomous Vehicles?
Why use lidar?
    Provides accurate 3-D structure of scene
    Required sensor for autonomous driving
Why use deep learning?
    Best accuracy
    High-performance inference with GPU Coder
[Figure panels: Classifying Point Cloud Clusters (Ground, Cars), Lidar Semantic Segmentation]

40 What Does Lidar Data Look Like? 40

41 Lidar processing application design is easy in MATLAB DNN design + training Application design Standalone Deployment Application logic C++/CUDA + TensorRT Data prep, labeling Training Trained DNN C++/CUDA + cudnn 41

42 Data preparation and labeling of Lidar is a challenge DNN design + training Accessing lidar data Lidar preprocessing Labeling lidar data Training Trained DNN 42

43 Access and Visualize Lidar Data
Access stored lidar data: Velodyne file I/O (pcap), individual point clouds (.pcd, .ply)
Visualize lidar data: streaming lidar player, static point cloud display, point cloud differences

44 Lidar Preprocessing Remove Ground Fit plane using RANSAC Cluster Segment clusters using Euclidean distance 44

45 Ground Truth Labeling of Lidar Data 45

48 Organize Data for Training Raw Point Cloud Data Ground Truth Labels Transformed to Label Mask Project to 2D 48

49 Create Network Architecture Easy MATLAB API to create network 49

50 Deployment using GPU Coder C++/CUDA + TensorRT C++/CUDA + cudnn 50

51 Lidar semantic segmentation 51

52 Deep learning workflow in MATLAB DNN design + training Application design Standalone Deployment Model importer Application logic C++/CUDA + TensorRT Train in MATLAB Trained DNN Model importer C++/CUDA + cudnn 52

53 Thank you 53

54 Backup 54

55 Outline #1 (without lidar demo)
Parallelism and GPUs (why GPU Coder, use-cases etc.) [4 mins]
Compiler overview [15 mins]
    Targeting loops, memory optimizations, library replacements
    Demo: workflow demo using stereo disparity (extend to run on HW)
    Benchmarks: IPT functions, stereo disparity
Workflow and hardware targeting [10 mins]
    From algorithm to embedded
    Demo: extend stereo-disparity demo to run on Jetson or DrivePX2
Deep learning compiler overview [15 mins]
    Design philosophy, dataflow graph optimizations, library plugins
    Demo: applications using DNN like traffic-signs, age-detection (both cuDNN + TensorRT)
    Demo: quantization with TensorRT/int8 (logo detection)
    Benchmarks: AlexNet, ResNet, SegNet

56 Outline #2 (with lidar demo)
Parallelism and GPUs (why GPU Coder, use-cases etc.) [4 mins]
Compiler and workflow [15 mins]
    Loops, memory, library mapping and optimizations
    Rapid prototyping and deployment with HSP
    Demo: workflow demo using stereo disparity (extend to run on HW)
    Benchmarks: IPT functions, stereo disparity
Deep learning compiler overview [15 mins]
    Design philosophy, dataflow graph optimizations, library plugins
    Demo: applications using DNN like traffic-signs, age-detection (both cuDNN + TensorRT)
    Demo: quantization with TensorRT/int8 (logo detection)
    Benchmarks: AlexNet, ResNet, SegNet
End-to-end workflow demo (lidar) [15 mins]
    Labeling, training
    Deployment on DrivePX2

57 Demos for talk
Workflow demo (AlexNet, ResNet, traffic signs, stereo disparity)
Board demo (Jetson, DrivePX2)
TensorRT demos (traffic signs, logonet + int8)
Arvind's lidar demo (end-to-end)
Performance benchmarks
    IPT functions
    Deep learning inference (GPC (cuDNN, TensorRT) vs Caffe2, TF, MXNet): AlexNet, ResNet, GoogLeNet, SegNet

58 Localization Planning Autonomous systems Perception Controls 58

59 Consider an example: saxpy Scalarized MATLAB Vectorized MATLAB 59

60 Deep learning workflow in MATLAB Model importer Application logic C++/CUDA + TensorRT Train in MATLAB Trained DNN Model importer C++/CUDA + cudnn 60

61 Re-targetability: A target-agnostic interface layer
Compiler flow: pre-trained DL network, front-end, cross-layer optimizations, C back-end and GPU back-end
Auto-generated target-agnostic code (algorithm only): DL structure, malloc, scheduling
Hand-written interface layer code: ARM Neon interface, ARM Mali interface, MKL-DNN interface, cuDNN interface, TensorRT interface
Vendor library: ARM CL, ARM CL, MKL-DNN library, cuDNN library, TensorRT

62 Deep learning workflow in MATLAB DNN design + training Model importer Application logic C++/CUDA + TensorRT Train in MATLAB Trained DNN Model importer C++/CUDA + cudnn 62

63 Design pattern: Stencil kernel
Problem: how to auto-generate the hand-coded version from:
    for x = 1:orow
        for y = 1:ocol
            C(x,y) = f(In(x:x+window_x, y:y+window_y));
        end
    end
Key insight: the loop structure is always the same
Only need the user to specify f()
f() takes in an input window and returns a scalar output, e.g. convolution: x = sum(input(:) .* kernel(:))
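A plain-Python sketch of the same separation is shown below: a generic driver owns the two loops and the window extraction, and the caller passes in only the per-window function. It illustrates the pattern rather than any GPU Coder API. The names stencil and make_weighted_sum are invented, indices are 0-based, and the window arguments give the window size directly, which differs slightly from the inclusive MATLAB range x:x+window_x.

```python
def stencil(image, win_rows, win_cols, f):
    """Apply f to every win_rows x win_cols window of image (no padding)."""
    out_rows = len(image) - win_rows + 1
    out_cols = len(image[0]) - win_cols + 1
    out = [[0] * out_cols for _ in range(out_rows)]
    for x in range(out_rows):           # this loop nest is the fixed part
        for y in range(out_cols):
            window = [row[y:y + win_cols] for row in image[x:x + win_rows]]
            out[x][y] = f(window)       # the only user-specified part
    return out

def make_weighted_sum(kernel):
    """Build f() for the slide's example: sum(input(:) .* kernel(:))."""
    def weighted_sum(window):
        return sum(w * k
                   for w_row, k_row in zip(window, kernel)
                   for w, k in zip(w_row, k_row))
    return weighted_sum

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
filtered = stencil(image, 2, 2, make_weighted_sum([[1, 0], [0, 1]]))
window_max = stencil(image, 2, 2, lambda w: max(max(row) for row in w))
```

Swapping the function passed as f changes the operation (weighted sum, maximum, and so on) without touching the loops, which is what lets a code generator optimize the loop nest once for every such function.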

64 Design pattern: Stencil kernel n m kernel myfunc im out 64

65 GPU Coder automatically maps data to shared memory
[Figure: input image (rows x cols), convolution kernel (kh x kw), dot product, output image]
GPU Coder automatically infers data locality:
    Creates GPU shared memory
    Collaborative loading
    Re-writes the algorithm to use shared memory
Optimized Image Processing Toolbox functions: conv2, imfilter, imerode, imdilate

66 Memory Management Modes are Configurable
Standard (explicit copy):
    cudaMalloc
    Separate host and GPU memory allocations (and variables)
    CUDA 7 or earlier for desktop GPUs
    Optimal dynamic memcpy minimization
Unified memory (implicit copy):
    cudaMallocManaged
    No need for separate variables
    CUDA 8 or embedded GPUs
    Optimal dynamic device-sync

67 Design pattern: Stencil kernel
[Figure: an n x m kernel slides over image im; myfunc maps each window to one element of out]
Optimized Image Processing Toolbox functions: conv2, imfilter, imerode, imdilate

68 Example generated code: 1-D FIR filter Shared memory decl Collaborative load Sync point Algorithm with translated accesses to shared memory 68

69 Inference benchmarking on Titan Xp
[Chart panels: AlexNet, VGG. Series: TensorFlow (cuDNN-fp32)]

70 Inference benchmarking on Titan Xp
[Chart panels: AlexNet, VGG-16; images/sec vs. batch size. Series: MATLAB GPU Coder (TensorRT-int8), MATLAB GPU Coder (TensorRT-fp32), MATLAB GPU Coder (cuDNN-fp32), MXNet (cuDNN-fp32), TensorFlow (cuDNN-fp32)]

71 Deep Learning Workflow Access Data Preprocess and Label Train Network Inference Raw Point Clouds Label Data Create Deep Network Train on Multi- GPU 71

72 Access and Visualize Lidar Data (Access Data)
Access stored lidar data: Velodyne file I/O (pcap), individual point clouds (.pcd, .ply)
Visualize lidar data: streaming lidar player, static point cloud display, point cloud differences

73 Lidar Preprocessing Preprocess and Label ROI + Remove Ground Fit plane using RANSAC Cluster Segment clusters using Euclidean distance 73

74 Ground Truth Labeling of Lidar Data Preprocess and Label 74

77 Organize Data for Training Train Network Raw Point Cloud Data Ground Truth Labels Transformed to Label Mask Project to 2D 77

78 Create Network Architecture Train Network Easy API to create network structure 78

79 Compile Model to CUDA Code Inference GPU Coder 79


Deep Learning on Modern Architectures. Keren Zhou 4/17/2017 Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others

More information

OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi

OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015 Outline About nervana Optimizing deep learning at assembler level Limited precision for deep learning neon benchmarks 2 About nervana

More information

A NEW COMPUTING ERA. Shanker Trivedi Senior Vice President Enterprise Business at NVIDIA

A NEW COMPUTING ERA. Shanker Trivedi Senior Vice President Enterprise Business at NVIDIA A NEW COMPUTING ERA Shanker Trivedi Senior Vice President Enterprise Business at NVIDIA THE ERA OF AI AI CLOUD MOBILE PC 2 TWO FORCES DRIVING THE FUTURE OF COMPUTING 10 7 Transistors (thousands) 10 5 1.1X

More information

Accelerate Deep Learning Inference with openvino toolkit

Accelerate Deep Learning Inference with openvino toolkit Accelerate Deep Learning Inference with openvino toolkit Priyanka Bagade, IoT Developer Evangelist, Intel Core and Visual Computing Group Optimization Notice Intel s compilers may or may not optimize to

More information

Accelerating your Embedded Vision / Machine Learning design with the revision Stack. Giles Peckham, Xilinx

Accelerating your Embedded Vision / Machine Learning design with the revision Stack. Giles Peckham, Xilinx Accelerating your Embedded Vision / Machine Learning design with the revision Stack Giles Peckham, Xilinx Xilinx Foundation at the Edge Vision Customers Using Xilinx >80 ADAS Models From 23 Makers >80

More information

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,

More information

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS TECHNICAL OVERVIEW NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS A Guide to the Optimized Framework Containers on NVIDIA GPU Cloud Introduction Artificial intelligence is helping to solve some of the most

More information

WITH INTEL TECHNOLOGIES

WITH INTEL TECHNOLOGIES WITH INTEL TECHNOLOGIES Commitment Is to Enable The Best Democratize technologies Advance solutions Unleash innovations Intel Xeon Scalable Processor Family Delivers Ideal Enterprise Solutions NEW Intel

More information

Deep Learning Compiler

Deep Learning Compiler Deep Learning Compiler AWS AI Acknowledgement Amazon Sagemaker Neo Enables developers to train machine learning models once and run them anywhere in the cloud and at the edge Hardware targets Intel CPU,

More information

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules Toomas Remmelg Michel Steuwer Christophe Dubach The 4th South of England Regional Programming Language Seminar 27th

More information

High-Performance Data Loading and Augmentation for Deep Neural Network Training

High-Performance Data Loading and Augmentation for Deep Neural Network Training High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose

More information

Small is the New Big: Data Analytics on the Edge

Small is the New Big: Data Analytics on the Edge Small is the New Big: Data Analytics on the Edge An overview of processors and algorithms for deep learning techniques on the edge Dr. Abhay Samant VP Engineering, Hiller Measurements Adjunct Faculty,

More information

Speeding up MATLAB Applications Sean de Wolski Application Engineer

Speeding up MATLAB Applications Sean de Wolski Application Engineer Speeding up MATLAB Applications Sean de Wolski Application Engineer 2014 The MathWorks, Inc. 1 Non-rigid Displacement Vector Fields 2 Agenda Leveraging the power of vector and matrix operations Addressing

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

Fast Hardware For AI

Fast Hardware For AI Fast Hardware For AI Karl Freund karl@moorinsightsstrategy.com Sr. Analyst, AI and HPC Moor Insights & Strategy Follow my blogs covering Machine Learning Hardware on Forbes: http://www.forbes.com/sites/moorinsights

More information

Implementing Deep Learning for Video Analytics on Tegra X1.

Implementing Deep Learning for Video Analytics on Tegra X1. Implementing Deep Learning for Video Analytics on Tegra X1 research@hertasecurity.com Index Who we are, what we do Video analytics pipeline Video decoding Facial detection and preprocessing DNN: learning

More information

IBM Deep Learning Solutions

IBM Deep Learning Solutions IBM Deep Learning Solutions Reference Architecture for Deep Learning on POWER8, P100, and NVLink October, 2016 How do you teach a computer to Perceive? 2 Deep Learning: teaching Siri to recognize a bicycle

More information

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance

More information

Getting started with Caffe. Jon Barker, Solutions Architect

Getting started with Caffe. Jon Barker, Solutions Architect Getting started with Caffe Jon Barker, Solutions Architect Caffe tour Overview Agenda Example applications Setup Performance Hands-on lab preview 2 A tour of Caffe 3 What is Caffe? An open framework for

More information

Embedded GPGPU and Deep Learning for Industrial Market

Embedded GPGPU and Deep Learning for Industrial Market Embedded GPGPU and Deep Learning for Industrial Market Author: Dan Mor GPGPU and HPEC Product Line Manager September 2018 Table of Contents 1. INTRODUCTION... 3 2. DIFFICULTIES IN CURRENT EMBEDDED INDUSTRIAL

More information

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid

More information

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications TESLA P PERFORMANCE GUIDE HPC and Deep Learning Applications MAY 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

Defense Data Generation in Distributed Deep Learning System Se-Yoon Oh / ADD-IDAR

Defense Data Generation in Distributed Deep Learning System Se-Yoon Oh / ADD-IDAR Defense Data Generation in Distributed Deep Learning System Se-Yoon Oh / 2017. 10. 31 syoh@add.re.kr Page 1/36 Overview 1. Introduction 2. Data Generation Synthesis 3. Distributed Deep Learning 4. Conclusions

More information

May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND. Mark Harris, May 10, 2017

May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND. Mark Harris, May 10, 2017 May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND Mark Harris, May 10, 2017 INTRODUCING CUDA 9 BUILT FOR VOLTA FASTER LIBRARIES Tesla V100 New GPU Architecture Tensor Cores NVLink Independent Thread Scheduling

More information

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Introduction CUDA is a tool to turn your graphics card into a small computing cluster. It s not always

More information

SVM multiclass classification in 10 steps 17/32

SVM multiclass classification in 10 steps 17/32 SVM multiclass classification in 10 steps import numpy as np # load digits dataset from sklearn import datasets digits = datasets. load_digits () # define training set size n_samples = len ( digits. images

More information

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)

More information

Revolutionizing the Datacenter

Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5

More information

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018 S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific

More information

TOWARDS ACCELERATED DEEP LEARNING IN HPC AND HYPERSCALE ARCHITECTURES Environnement logiciel pour l apprentissage profond dans un contexte HPC

TOWARDS ACCELERATED DEEP LEARNING IN HPC AND HYPERSCALE ARCHITECTURES Environnement logiciel pour l apprentissage profond dans un contexte HPC TOWARDS ACCELERATED DEEP LEARNING IN HPC AND HYPERSCALE ARCHITECTURES Environnement logiciel pour l apprentissage profond dans un contexte HPC TERATECH Juin 2017 Gunter Roth, François Courteille DRAMATIC

More information

Deep Learning Processing Technologies for Embedded Systems. October 2018

Deep Learning Processing Technologies for Embedded Systems. October 2018 Deep Learning Processing Technologies for Embedded Systems October 2018 1 Neural Networks Architecture Single Neuron DNN Multi Task NN Multi-Task Vehicle Detection With Region-of-Interest Voting Popular

More information

NVIDIA PLATFORM FOR AI

NVIDIA PLATFORM FOR AI NVIDIA PLATFORM FOR AI João Paulo Navarro, Solutions Architect - Linkedin i am ai HTTPS://WWW.YOUTUBE.COM/WATCH?V=GIZ7KYRWZGQ 2 NVIDIA Gaming VR AI & HPC Self-Driving Cars GPU Computing 3 GPU COMPUTING

More information

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016

The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016 The Path to Embedded Vision & AI using a Low Power Vision DSP Yair Siegel, Director of Segment Marketing Hotchips August 2016 Presentation Outline Introduction The Need for Embedded Vision & AI Vision

More information

[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개

[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개 [Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개 정승혁과장 Senior Application Engineer MathWorks Korea 2015 The MathWorks, Inc. 1 Outline When FPGA, ASIC, or System-on-Chip (SoC) hardware is needed Hardware

More information

POINT CLOUD DEEP LEARNING

POINT CLOUD DEEP LEARNING POINT CLOUD DEEP LEARNING Innfarn Yoo, 3/29/28 / 57 Introduction AGENDA Previous Work Method Result Conclusion 2 / 57 INTRODUCTION 3 / 57 2D OBJECT CLASSIFICATION Deep Learning for 2D Object Classification

More information

NumbaPro CUDA Python. Square matrix multiplication

NumbaPro CUDA Python. Square matrix multiplication NumbaPro Enables parallel programming in Python Support various entry points: Low-level (CUDA-C like) programming language High-level array oriented interface CUDA library bindings Also support multicore

More information

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and

More information

Beyond Training The next steps of Machine Learning. Chris /in/chrisparsonsdev

Beyond Training The next steps of Machine Learning. Chris /in/chrisparsonsdev Beyond Training The next steps of Machine Learning Chris Parsons chrisparsons@uk.ibm.com @chrisparsonsdev /in/chrisparsonsdev What is this talk? Part 1 What is Machine Learning? AI Infrastructure PowerAI

More information

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed

More information

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware Florian Tramèr (joint work with Dan Boneh) Intel, Santa Clara August 30 th 2018 Trusted execution of ML: 3 motivating

More information

Scalable Distributed Training with Parameter Hub: a whirlwind tour

Scalable Distributed Training with Parameter Hub: a whirlwind tour Scalable Distributed Training with Parameter Hub: a whirlwind tour TVM Stack Optimization High-Level Differentiable IR Tensor Expression IR AutoTVM LLVM, CUDA, Metal VTA AutoVTA Edge FPGA Cloud FPGA ASIC

More information

Deep Learning Requirements for Autonomous Vehicles

Deep Learning Requirements for Autonomous Vehicles Deep Learning Requirements for Autonomous Vehicles Pierre Paulin, Director of R&D Synopsys Inc. Chipex, 1 May 2018 1 Agenda Deep Learning and Convolutional Neural Networks for Embedded Vision Automotive

More information

Visual Perception for Autonomous Driving on the NVIDIA DrivePX2 and using SYNTHIA

Visual Perception for Autonomous Driving on the NVIDIA DrivePX2 and using SYNTHIA Visual Perception for Autonomous Driving on the NVIDIA DrivePX2 and using SYNTHIA Dr. Juan C. Moure Dr. Antonio Espinosa http://grupsderecerca.uab.cat/hpca4se/en/content/gpu http://adas.cvc.uab.es/elektra/

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Motivation And Intro Programming Model Spark Data Transformation Model Construction Model Training Model Inference Execution Model Data Parallel Training

More information

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu Bigger GPUs and Bigger Nodes Carl Pearson (pearson@illinois.edu) PhD Candidate, advised by Professor Wen-Mei Hwu 1 Outline Experiences from working with domain experts to develop GPU codes on Blue Waters

More information

Designing GPU-accelerated applications with RTMaps (Real-Time Multisensor Applications) Framework and NVIDIA DriveWorks

Designing GPU-accelerated applications with RTMaps (Real-Time Multisensor Applications) Framework and NVIDIA DriveWorks MUNICH OCT 10-12, 2017 Designing GPU-accelerated applications with RTMaps (Real-Time Multisensor Applications) Framework and NVIDIA DriveWorks Xavier Rouah Lead Software Engineer Brief introduction about

More information

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

GpuWrapper: A Portable API for Heterogeneous Programming at CGG GpuWrapper: A Portable API for Heterogeneous Programming at CGG Victor Arslan, Jean-Yves Blanc, Gina Sitaraman, Marc Tchiboukdjian, Guillaume Thomas-Collignon March 2 nd, 2016 GpuWrapper: Objectives &

More information

Deep learning prevalence. first neuroscience department. Spiking Neuron Operant conditioning First 1 Billion transistor processor

Deep learning prevalence. first neuroscience department. Spiking Neuron Operant conditioning First 1 Billion transistor processor WELCOME TO Operant conditioning 1938 Spiking Neuron 1952 first neuroscience department 1964 Deep learning prevalence mid 2000s The Turing Machine 1936 Transistor 1947 First computer science department

More information

The OpenVX Computer Vision and Neural Network Inference

The OpenVX Computer Vision and Neural Network Inference The OpenVX Computer and Neural Network Inference Standard for Portable, Efficient Code Radhakrishna Giduthuri Editor, OpenVX Khronos Group radha.giduthuri@amd.com @RadhaGiduthuri Copyright 2018 Khronos

More information