EFFICIENT INFERENCE WITH TENSORRT
Han Vanholder

AI INFERENCING IS EXPLODING
Personalization: 2 trillion messages per day on LinkedIn. Speech: 500M daily active users of iFlytek. Translation: 140 billion words per day translated by Google. Video: 60 billion video frames per day uploaded to YouTube.

NETWORKS ARE GROWING: Bigger, Better, Exponential Compute
[Chart: Top-1 accuracy vs. compute for modern object detectors. Source: "Speed/accuracy trade-offs for modern convolutional object detectors," Jonathan Huang et al., April 2017]

NETWORKS ARE GROWING: Bigger, Better, Exponential Compute
[Chart: Top-1 accuracy vs. compute. Source: "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," ICLR 2017, Noam Shazeer et al.]

DL FLOW (Pivot: Research to Production)
Source dataset -> curated dataset (MODEL ZOO) -> IMPORT (format) -> PREPROCESS (clean, clip, label, normalize) -> TRAIN (score + optimize, visualization) -> DEPLOY (tune, compile + runtime) -> INFERENCE & MICROSERVICES (REST API, visualization) -> RESULT (inference, prediction).

DL FLOW (Pivot: Research to Production), continued
The same pipeline, with the requirements of the inference stage called out: latency, accuracy, efficiency, throughput.

NVIDIA TENSORRT: PROGRAMMABLE INFERENCING PLATFORM
One TensorRT stack across Tesla P4, Tesla V100, Jetson TX2, DRIVE PX 2, and NVIDIA DLA.

NVIDIA TensorRT: Programmable Inference Accelerator
Embedded (Jetson), automotive (DRIVE PX), and data center (Tesla).
Maximize throughput and minimize latency. Deploy reduced precision without retraining and without accuracy loss. Train in any framework, deploy in TensorRT without overhead.
developer.nvidia.com/tensorrt
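To make "deploy in TensorRT" concrete, here is a minimal sketch of the runtime side: deserializing a prebuilt engine and running a batch through it. It assumes a later TensorRT Python API (trt.Runtime, execute_v2) plus PyCUDA for device memory; the engine file name and the single input/output layout are illustrative, and the TensorRT 3-era API differs in detail.

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize an engine that was built and saved ahead of time.
with open("resnet50.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One host/device buffer pair per binding, sized from the engine itself.
buffers, bindings = [], []
for binding in engine:
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_mem = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(binding)), dtype)
    dev_mem = cuda.mem_alloc(host_mem.nbytes)
    buffers.append((host_mem, dev_mem))
    bindings.append(int(dev_mem))

def infer(batch):
    # Copy input in, execute the engine, copy the result back out.
    host_in, dev_in = buffers[0]
    host_in[:] = np.ascontiguousarray(batch, dtype=host_in.dtype).ravel()
    cuda.memcpy_htod(dev_in, host_in)
    context.execute_v2(bindings)
    host_out, dev_out = buffers[-1]
    cuda.memcpy_dtoh(host_out, dev_out)
    return host_out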

TENSORRT 3 INFERENCE PERFORMANCE
Up to 40x faster CNNs vs. CPU (ResNet50, under 7ms latency): CPU-only 140 images/sec; V100 + TensorFlow 305; P4 + TensorRT 1,400; V100 + TensorRT 5,700.
Up to 140x faster RNNs vs. CPU (OpenNMT neural translation): CPU-only + Torch 4 sentences/sec; V100 + Torch 25; P4 + TensorRT 65; V100 + TensorRT 550.
ResNet50 throughput (images/sec), NVIDIA measurements. V100 + TensorRT: TensorRT FP16, batch size 39, Tesla V100-SXM2-16GB, Xeon E5-2690 v4 @ 2.60GHz (3.5GHz Turbo, Broadwell, HT on). P100 + TensorRT: TensorRT FP16, batch size 10, Tesla P100-PCIE-16GB, same CPU. V100 + TensorFlow: preview of Volta-optimized TensorFlow FP16, batch size 2, Tesla V100-PCIE-16GB, same CPU. CPU-only: Intel Xeon D-1587 (Broadwell) with Intel DL SDK; score doubled to reflect Intel's stated claim of a 2x performance improvement on Skylake with AVX-512.
OpenNMT 692M throughput (sentences/sec), NVIDIA measurements. V100 + TensorRT: TensorRT FP32, batch size 64, Tesla V100-PCIE-16GB, same CPU. P100 + TensorRT: TensorRT FP32, batch size 64, Tesla P100-PCIE-16GB, same CPU. V100 + Torch: Torch FP32, batch size 4, Tesla V100-PCIE-16GB, same CPU. CPU-only: Torch FP32, batch size 1, Xeon E5-2690 v4 @ 2.60GHz (3.5GHz Turbo, Broadwell, HT on).
developer.nvidia.com/tensorrt

TENSORRT 3 INFERENCE PERFORMANCE (LATENCY)
ResNet50 latency: CPU-only 14 ms; V100 + TensorFlow 6.7 ms; P4 + TensorRT 6.4 ms; V100 + TensorRT 6.8 ms.
OpenNMT latency: CPU-only + Torch 280 ms; V100 + Torch 153 ms; P4 + TensorRT 126 ms; V100 + TensorRT 118 ms.
Benchmark configurations are the same as on the previous slide. developer.nvidia.com/tensorrt

NVIDIA TENSORRT 3
Volta Tensor Core support: 3.7x faster inference on Tesla V100 vs. Tesla P100, under 7ms real-time latency.
TensorFlow importer: optimize and deploy TensorFlow models up to 18x faster vs. the TensorFlow framework.
Python API: improved productivity with an easy-to-use Python API for data science workflows, producing a compiled and optimized model.
TensorRT 3 RC is now available as a free download to members of the NVIDIA Developer Program. developer.nvidia.com/tensorrt

TENSORRT LAYER AND TENSOR FUSION
TensorRT rewrites the network graph before execution. The un-optimized network (an Inception-style block) runs convolution, bias, and ReLU as separate layers in each branch, feeding a concatenation. The optimized network applies three transformations: vertical fusion collapses each conv + bias + ReLU sequence into a single CBR kernel (1x1 CBR, 3x3 CBR, 5x5 CBR); horizontal fusion merges parallel 1x1 CBR layers that read the same input; and concatenation layers are eliminated by writing each branch's output directly to the right offset in the output buffer. developer.nvidia.com/tensorrt
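None of these fusions is requested explicitly; they happen when the builder compiles the graph into an engine. A minimal sketch, assuming the pre-8.x TensorRT Python network-definition API and random weights purely for illustration, that builds one such branch pair; at build time the builder fuses each conv + bias + ReLU into a single CBR kernel and can fuse the parallel 1x1 CBRs horizontally:

import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
data = network.add_input("data", trt.float32, (64, 56, 56))  # CHW, implicit batch

def cbr(inp, n_out, k):
    # Convolution + bias + ReLU: TensorRT fuses these into one CBR kernel.
    c_in = inp.shape[0]
    w = np.random.randn(n_out, c_in, k, k).astype(np.float32)
    b = np.zeros(n_out, dtype=np.float32)
    conv = network.add_convolution(inp, n_out, (k, k), w, b)
    conv.padding = (k // 2, k // 2)
    relu = network.add_activation(conv.get_output(0), trt.ActivationType.RELU)
    return relu.get_output(0)

# Two 1x1 CBRs reading the same input: candidates for horizontal fusion.
branch1 = cbr(data, 32, 1)
branch2 = cbr(cbr(data, 16, 1), 32, 3)  # 1x1 CBR feeding a 3x3 CBR

concat = network.add_concatenation([branch1, branch2])  # eliminated at build time
network.mark_output(concat.get_output(0))

builder.max_batch_size = 8
engine = builder.build_cuda_engine(network)  # fusion happens during this call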

TENSORRT LAYERS
Built-in layers: Activation (ReLU, tanh, sigmoid), Concatenation, Convolution, Deconvolution, element-wise operations, Fully Connected, LRN, Pooling (max and average), RNN, Scale, SoftMax.
Custom Layer API: for layers TensorRT does not provide, the application supplies its own CUDA implementation, which TensorRT invokes through the CUDA runtime alongside the built-in layers. developer.nvidia.com/tensorrt

TENSORRT 3: TENSORFLOW IMPORTER AND PYTHON API
AI researchers and data scientists drive TensorRT through the Python API: TensorFlow models enter through the model importer, models from other frameworks through the network definition API. The TensorRT runtime then executes the optimized model on data center, embedded, and automotive targets.
Optimize and deploy TensorFlow models up to 18x faster vs. the TensorFlow framework, with improved productivity from an easy-to-use Python API for data science workflows. developer.nvidia.com/tensorrt

ACCELERATING TENSORFLOW NETWORKS
Running TensorFlow CNNs in TensorRT in 3 lines of Python:

import tensorrt  # import the TensorRT Python module

# Create the inference engine object from a frozen TensorFlow graph.
mnist_engine = tensorrt.lite.Engine(
    framework="tf",                                   # source framework
    path=data_dir + "/mnist/lenet5_mnist_frozen.pb",  # model file
    max_batch_size=10,                 # max images processed at a time
    input_nodes={"in": (1, 28, 28)},   # input layers
    output_nodes=["out"],              # output layers
    preprocessors={"in": normalize},   # preprocessing functions
    postprocessors={"out": argmax})    # postprocessing functions

results = mnist_engine.infer(data)[0]  # run the inference

WEIGHT AND ACTIVATION PRECISION CALIBRATION
Maintain accuracy without retraining (Top-1 accuracy):

Network      FP32 Top-1   INT8 Top-1   Difference
AlexNet      57.22%       56.96%       0.26%
GoogLeNet    68.87%       68.49%       0.38%
VGG          68.56%       68.45%       0.11%
ResNet-50    73.11%       72.54%       0.57%
ResNet-101   74.58%       74.14%       0.44%
ResNet-152   75.18%       74.56%       0.61%

developer.nvidia.com/tensorrt
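The calibration behind this table runs representative batches through the FP32 network so TensorRT can pick per-tensor INT8 dynamic ranges. A minimal sketch of the calibrator interface, assuming a later TensorRT Python API (IInt8EntropyCalibrator2) and a hypothetical iterable of float32 NCHW numpy batches; the TensorRT 3-era calibration API differs in detail:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    # Feeds representative batches to TensorRT during INT8 calibration.
    def __init__(self, batches, batch_size, input_nbytes, cache_file="int8.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)           # iterable of float32 NCHW arrays
        self.batch_size = batch_size
        self.dev_input = cuda.mem_alloc(input_nbytes)
        self.cache_file = cache_file

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                        # signals calibration is done
        cuda.memcpy_htod(self.dev_input, np.ascontiguousarray(batch))
        return [int(self.dev_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()                # reuse ranges from a prior run
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# With the pre-8.x builder, INT8 mode is then enabled like this:
# builder.int8_mode = True
# builder.int8_calibrator = EntropyCalibrator(batches, 8, 8 * 3 * 224 * 224 * 4)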

NVIDIA PLATFORM SAVES DATA CENTERS
Game-changing inference performance: one NVIDIA HGX node with 8 Tesla V100 GPUs (3 kW) matches the inference throughput of 160 Xeon Scalable Processor CPU servers (65 kW). Image recognition using ResNet-50; neural machine translation using OpenNMT.

INFERENCE: THROUGHPUT AND LATENCY
Real time, high throughput: Tesla P4 with TensorRT 3.
[Charts: ResNet50 inferences/sec vs. batch size (bigger is better) and latency (log ms) vs. batch size (smaller is better), batch sizes 1 to 128, comparing E5-2690v4 CPU, P4, and P4 INT8. P4 INT8 delivers about 13x the CPU's throughput; on latency, the CPU climbs past the 100ms mark while P4 INT8 stays near the 7ms real-time bound.]
Inference throughput (images/sec) on ResNet50, NVIDIA benchmarking. P4 measured with TensorRT 3.0 RC; E5-2690v4 measured with Intel DL SDK.

INFERENCE: THROUGHPUT AND LATENCY
Real time, high throughput: Tesla V100 with TensorRT 3.
[Charts: OpenNMT inferences/sec vs. batch size (bigger is better) and latency (log ms) vs. batch size (smaller is better), batch sizes 1 to 128, comparing E5-2690v4 CPU, V100 + Torch, and V100 + TensorRT. V100 + TensorRT delivers about 100x the CPU's throughput, with a 200 ms bound marked on the latency chart.]
Inference throughput (sentences/sec) on OpenNMT 692M, NVIDIA benchmarking. V100 measured with TensorRT (FP32); E5-2690v4 measured with Torch (FP32).
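Sweeps like these are straightforward to reproduce with a small timing harness; a sketch, assuming an infer(batch) callable such as the one outlined earlier, with arbitrary warm-up and trial counts:

import time
import numpy as np

def sweep(infer, shape=(3, 224, 224), batch_sizes=(1, 2, 4, 8, 16, 32, 64, 128),
          warmup=10, trials=100):
    # Report throughput (items/sec) and mean latency (ms) for each batch size.
    for bs in batch_sizes:
        batch = np.random.rand(bs, *shape).astype(np.float32)
        for _ in range(warmup):                # let clocks and caches settle
            infer(batch)
        start = time.perf_counter()
        for _ in range(trials):
            infer(batch)
        elapsed = time.perf_counter() - start
        print(f"batch {bs:4d}: {bs * trials / elapsed:9.1f} items/sec, "
              f"{1000 * elapsed / trials:8.2f} ms/batch")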

DL FLOW (Pivot: Research to Production), revisited
The same pipeline, closed into a loop: A/B testing and usage data feed from RESULT back into the source dataset. The deploy-and-inference half of the flow is automated with TensorRT: rapid deployment, high productivity.

"We measured that the new Azure instances running NVIDIA cards accelerated the inference on detection network by 3x! [...] Ultimately with Azure, ObjectStore, and NVIDIA technology we built an object detection system which can handle full Bing index worth of data under peak traffic [and] provide great experience to users." Source: Microsoft Bing Blog

"TensorRT is a real game changer. Not only does TensorRT make model deployment a snap but the resulting speed up is incredible: out of the box, BodySLAM, our human pose estimation engine, now runs over two times faster than using Caffe GPU inferencing." Paul Kruszewski, CEO, WRNCH

DATACENTER GPUS FOR DL

              V100 (SXM2)         V100 (PCIe)        V100 (FHHL)             P4
Form factor   DGX-1, HGX-1 SXM2   PCIe double-wide   PCIe FHHL single-wide   PCIe low-profile (HHHL)
Max power     300W                250W               150W                    75W / 50W
FP32          15.7 TFLOPS         13.3 TFLOPS        10 TFLOPS               5.5 TFLOPS
Tensor Core   125 TFLOPS          106 TFLOPS         80 TFLOPS               -
INT8          -                   -                  -                       22 TOPS

AI revolutionizes every industry, and inferencing demands acceleration. TensorRT is a programmable inference platform built for throughput, latency, accuracy, and efficiency. Download now: developer.nvidia.com/tensorrt