Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

Size: px

Start display at page:

Download "Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA"

Osborne McDaniel
5 years ago
Views:

1 Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA

3 Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video

4 AI Inference is exploding 1 Billion Videos Watched Per Day Facebook LIVE VIDEO 1 Billion Voice Searches Per Day Google, Bing, etc. SPEECH 1 Trillion Ads/Rankings Per Day Impressions RECOMMENDATIONS

5 Bigger and More Compute Intensive in quest for accuracy 350X Inception-v4 30X DeepSpeech 3 10X MoE GNMT AlexNet GoogleNet ResNet-50 Inception-v2 DeepSpeech DeepSpeech 2 OpenNMT Image (GOP * Bandwidth) Speech (GOP * Bandwidth) Translation (GOP * Bandwidth)

Cambrian Explosion Convolutional Networks

Networks Reinforcement Learning New Species

GRU Beam Search 3D-GAN MedGAN Conditional GAN

Collaborative Filtering Concat Dropout Pooling

6 Cambrian Explosion Convolutional Networks Recurrent Networks Generative Adversarial Networks Reinforcement Learning New Species Capsule Nets Encoder/Decoder ReLu BatchNorm LSTM GRU Beam Search 3D-GAN MedGAN Conditional GAN DQN Simulation Mixture of Experts Neural Collaborative Filtering Concat Dropout Pooling WaveNet CTC Attention Coupled GAN Speech Enhancement GAN DDPG Block Sparse LSTM

7 Inefficiency Limits Innovation Single Model Only! Single Framework Only Custom Development ASR NLP Some systems are overused while others are underutilized Recommender Solutions can only support models from one framework Developers need to reinvent the plumbing for every application

8 Speed/accuracy trade-offs for modern convolutional object detectors, Google Research More Accuracy = Time

9 Inference Performance Latency Throughput Efficiency Easy Deploy

10 Inference Optimization Lightweight Model Deep Compression Small Computation Suitable Lightweight Hardware No Network Modification Continuous the Enovation E.g; MobileNet SqueezeNet SEP-Nets E.g; Quantization and Binarization Network Pruning and Sharing Distillation / Factorization

11 TensorRT From framework to target FRAMEWORKS GPU PLATFORMS TESLA T4 TensorRT Optimizer Runtime JETSON TX2 DRIVE PX 2 NVIDIA DLA TESLA V100

12 Throughput (images/sec) Speed-up ResNet-50 V Batch Response time (ms) TensorRT INT8 TensorRT FP16 TensorRT FP32 GPU Native FP32 CPU Native FP32

13 TensorRT 5 More Layers / Plugin / APIs Inference Server Integration

14 TensorRT Workflow Step 1. Optimize trained model Model Optimize Plan

15 TensorRT Workflow Step 2. Deploy optimized plan Embedded Data center Plan Runtime Engine Automotive

16 Step-by-Step TensorRT Plan Build Model Parser C++/Python API Network TensorRT Builder Engine Network Definitions tensorrt-developer-guide/index.html#support_op

17 7 Steps to Deployment with TensorRT Step 1: Convert trained model into TensorRT format Step 2: Create a model parser Step 3: Register inputs and outputs Step 4: Optimize model and create a runtime engine Step 5: Serialize optimized engine Step 6: De-serialize engine Step 7: Perform inference

18 Custom Layer Support using Plugin Layer Model Parser C++/Python API Network TensorRT Builder Engine Network Definitions Plugin Factory Plugin A Plugin B

19 Plugin Layer Procedure icudaengine Plugin Factory Plugin Layer Phase 1. Initialize Layer? (missing) Initialize Phase 2. Inference enqueue Phase 3. Store/Load Serialize Deserialize

20 Supporting Features Feature C++ Python (x86 only) NvCaffeParser NvUffParser CNNs Yes Yes Yes Yes RNNs Yes Yes No No INT8 Calibration Yes Yes N/A N/A Asymmetric Padding Yes Yes No No

21 Define your network using C/C++ or Python TensorRT API INetworkDefinition* network = builder->createnetwork(); convertrnnweights, convertrnnbias setweightsforgate, setbiasforgate

22 TensorRT RNN Programming /usr/src/tensorrt/samples/ common/dumptfwts.py Dump Weights Load Weights Convert Weights Set Weights ckpt wts TF format TRT TF format format TensorRT RNN Layer Get started with samplecharrnn in documentation

NMT with Transformer Inference Deploy highly-optimized language translation apps in production environments Running 3 binded engines (Encoder, Generator, Suffle Engine) Modular Network Merge Support

23 NMT with Transformer Inference Deploy highly-optimized language translation apps in production environments Running 3 binded engines (Encoder, Generator, Suffle Engine) Modular Network Merge Support NMT layers such as Gather, Softmax, Batch GEMM and Top K Get started with NMT sample in documentation Encoder Input Setup Encoder RNN Runs on CPU GPU-Accelerated Decoder RNN Attention Model Projection TopK Decoder Input Output Beam Search Batch Reduction Beam Scoring Beam Shuffle * CPU-Only: TensorFlow on SKL core, FP32 GPU: V100, TensorRT 5, FP16; Sorted data, Batch=128, English to German

24 Concat Recommendation Inference Deploy multi-layer perceptron (MLP) based recommendation apps in production Predict events (click, purchase, interest) accurately based on input items (user, query, observed activity) Get started with examples on MNIST character recognition and movie recommender system in the documentation 1 2 N User Context Item Item Item... Embedding Embedding Embedding Embedding Embedding xn Fused MLP Kernel Fully Connected Bias Activation Fully Connected Bias Activation Fully Connected Bias Runs on CPU GPU-Accelerated * CPU-Only: TensorFlow on Intel Xeon E v4 CPU at 2.20 GHz; GPU: V100 with custom networks Activation TopK Output Top Scores

25 TFLOPS / TOPS Inferencing with Low precision Peak Performance Float INT8 P4 65 Float INT8 INT4 T4

26 Post Training Quantization Minimize information loss between FP32 and INT8 inference on calibration dataset 100 mini-batches are sufficient to determine quantization parameter (~500 samples)

27 Maximize throughput at low latency with mixed precision compute in production New INT8 APIs and Optimizations FP32 Training Auto Calibration FP32 weights Optimize to INT8 Apply INT8 quantization aware training or custom calibration algorithms with new APIs Control precision per-layer with new APIs Optimizations for depth wise convolution operation FP32 Training FP32 weights Custom Calibration Custom Calibration Calibration Data O(1000) Images FP32 or INT8 weights Custom Activation ranges Optimize to INT8 Getting Start with INT8 sample in the documentation Quantization Aware Training in FP32 Quantization Aware Training Custom Activation Ranges FP32 or accurate INT8 weights Optimize to INT8

28 Inference Performance Throughput Performance/WATT Higher is better!! P4 - INT8 V100 - Mixed P4 - INT8 V100 - Mixed

29 WIDELY ADOPTED 30

30 NVIDIA INFERENCE MOMENTUM Image Tagging Video Analysis Advertising Impact Video Captioning Cybersecurity Visual Search Finding Music Sports Performance Customer Service Visual Search Industrial Inspection Voice Recognition 31

31 Using GPUs made it possible to enable media understanding of the Twitter platform, by drastically reducing media deep learning model training time and by allowing us to derive real-time understanding of live videos at inference time. Nicolas Koumchatzky, Head of Twitter Cortex 32

32 At Microsoft, we are working hard to deliver the most advanced AI-powered services to our customers. Using NVIDIA GPUs in real-time inference workloads has improved Bing s advanced search offerings, enabling us to reduce object detection latency for images. We look forward to working with NVIDIA s next generation inference hardware and software to expand the way people benefit from AI products and services. -Jordi Ribas CVP, Bing and AI Products, Microsoft AI is becoming increasingly pervasive, and inference is a critical capability customers need to successfully deploy their AI models, so we re excited to support NVIDIA s Turing Tesla T4 GPUs on Google Cloud Platform soon. -Chris Kleban, product manager at Google Cloud 33

33 SEOUL NOVEMBER 7-8,2018

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60