Lecture 12: Model Serving. CSE599W: Spring 2018


Deep Learning Applications Intelligent assistant ("That drink will get you to 2,800 calories for today", "I last saw your keys in the store room", "Remind Tom of the party", "You're on page 263 of this book"), surveillance / remote assistance, input keyboard

Model Serving Constraints
Latency constraint:
- Batch size cannot be as large as possible when executing in the cloud
- Can only run a lightweight model on the device
Resource constraint:
- Battery limit for the device
- Memory limit for the device
- Cost limit for using the cloud
Accuracy constraint:
- Some loss is acceptable when using approximate models
- Multi-level QoS

Runtime Environment
- Sensor: camera, touch screen, microphone
- Processor: CPU, GPU, FPGA, ASIC
- Radio: WiFi, 4G / LTE, Bluetooth streaming
- Cloud: CPU server, GPU server, TPU / FPGA server

Resource usage for a continuous vision app
- Imager: Omnivision OV2740, 90 mW
- Processor: Tegra K1 GPU, 290 GOPS @ 10 W = 34 pJ/OP
- Radio: Qualcomm SD810 LTE > 800 mW; Atheros 802.11a/g, 15 Mbps @ 700 mW = 47 nJ/b
- Cloud: Amazon EC2 c4.large CPU, 2x400 GFLOPS, $0.1/h; g2.2xlarge GPU, 2.3 TFLOPS, $0.65/h
Workload: deep learning at 30 GFLOPs/frame, 10 fps = 300 GFLOPS
Budget: device power 30% of a 10 Wh battery over 10 h = 300 mW; cloud cost $10 per person per year
Compute power within budget: 9 GFLOPS on the device, 3.5 GFLOPS (cloud GPU) / 8 GFLOPS (cloud CPU)
Huge gap between workload and budget
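To see where those budget numbers come from, a rough back-of-the-envelope check (my own arithmetic, assuming continuous operation):
    Device:    300 mW / 34 pJ/OP ≈ 9 GOPS ≈ 9 GFLOPS
    Cloud CPU: $10 / ($0.1/h) = 100 h per year; 800 GFLOPS x (100 h / 8760 h) ≈ 9 GFLOPS amortized (the slide quotes 8 GFLOPS)
    Cloud GPU: $10 / ($0.65/h) ≈ 15 h per year; 2.3 TFLOPS x (15 h / 8760 h) ≈ 4 GFLOPS amortized (the slide quotes 3.5 GFLOPS)
Either way, the sustained 300 GFLOPS workload exceeds the budgeted compute by well over an order of magnitude.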

Outline Model compression Serving System

Model Compression Tensor decomposition Network pruning Quantization Smaller model

Matrix decomposition Fully-connected layer: approximate the M x N weight matrix W by a rank-R product W ≈ U V, with U of size M x R and V of size R x N (the figure notes the factors can be merged back into one matrix). Memory reduction: MN -> R(M + N). Computation reduction: MN -> R(M + N) multiply-adds per input vector.
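To make the factorization concrete, here is a minimal NumPy sketch (my own illustration, not from the slides); the matrix sizes and rank R are made up:

    import numpy as np

    # Hypothetical fully-connected weight matrix W (M x N) and chosen rank R
    M, N, R = 1024, 1024, 64
    W = np.random.randn(M, N).astype(np.float32)

    # Truncated SVD: W ~= U_r @ V_r, with U_r (M x R) and V_r (R x N)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :R] * s[:R]        # fold the singular values into the left factor
    V_r = Vt[:R, :]

    # One M x N layer becomes two smaller layers: x @ U_r, then (.) @ V_r
    x = np.random.randn(1, M).astype(np.float32)
    y_full = x @ W
    y_low = (x @ U_r) @ V_r

    print("params:", M * N, "->", R * (M + N))            # memory reduction
    print("max abs error:", np.abs(y_full - y_low).max())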

Tensor decomposition Convolutional layer: apply a Tucker decomposition to the 4-D convolution kernel; the memory and computation reductions are determined by the chosen ranks relative to the original channel dimensions. * Kim, Yong-Deok, et al. "Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications." ICLR (2016).

Decompose the entire model

Fine-tuning after decomposition

Accuracy & Latency after Decomposition

Network pruning: Deep compression * Song Han, et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." ICLR (2016).

Network pruning: prune the connections
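As an illustration of this step, a minimal NumPy sketch of magnitude-based connection pruning (the layer size and sparsity level are illustrative, and the retraining loop from Deep Compression is omitted):

    import numpy as np

    def prune_connections(W, sparsity=0.9):
        """Zero out the smallest-magnitude weights so that `sparsity` fraction
        of the connections are removed; return pruned weights and the mask."""
        threshold = np.quantile(np.abs(W), sparsity)
        mask = np.abs(W) > threshold
        return W * mask, mask

    W = np.random.randn(512, 512).astype(np.float32)
    W_pruned, mask = prune_connections(W, sparsity=0.9)
    print("remaining connections:", mask.mean())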

Network pruning: weight sharing 1. Use k-means clustering to identify the shared weights for each layer of a trained network, minimizing the within-cluster sum of squares between the original weights and the cluster centroids. 2. Fine-tune the neural network using the shared weights.
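A minimal sketch of the weight-sharing step, assuming scikit-learn's KMeans; the layer size and number of clusters are illustrative, and the fine-tuning step is omitted:

    import numpy as np
    from sklearn.cluster import KMeans

    def share_weights(weights, n_clusters=16):
        """Cluster a layer's weights into n_clusters shared values via k-means
        (minimizing within-cluster sum of squares); return the quantized
        weights, the codebook, and the per-weight cluster indices."""
        flat = weights.reshape(-1, 1)
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(flat)
        codebook = km.cluster_centers_.ravel()   # shared weight values
        indices = km.labels_                     # which shared value each weight uses
        shared = codebook[indices].reshape(weights.shape)
        return shared, codebook, indices

    # Example: a hypothetical 256 x 256 fully-connected layer, 16 shared weights,
    # so each weight is stored as a 4-bit index plus a small codebook.
    W = np.random.randn(256, 256).astype(np.float32)
    W_shared, codebook, idx = share_weights(W, n_clusters=16)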

Network pruning: accuracy

Network pruning: accuracy vs compression

Quantization: XNOR-Net * Mohammad Rastegari, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV (2016).

Quantization: binary weights
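A minimal NumPy sketch of the XNOR-Net-style binary-weight approximation, where each filter W_k is replaced by alpha_k * sign(W_k) with alpha_k = mean(|W_k|); the kernel shape here is made up:

    import numpy as np

    def binarize_weights(W):
        """Approximate each output filter W[k] by alpha_k * sign(W[k]),
        where alpha_k = mean(|W[k]|) minimizes the L2 approximation error."""
        flat = W.reshape(W.shape[0], -1)          # one row per output filter
        alpha = np.abs(flat).mean(axis=1)         # per-filter scaling factor
        B = np.sign(flat)
        B[B == 0] = 1                             # map sign(0) to +1 by convention
        return (alpha[:, None] * B).reshape(W.shape), alpha, B

    # Example: hypothetical conv kernel with 64 output filters, 3 input channels, 3x3
    W = np.random.randn(64, 3, 3, 3).astype(np.float32)
    W_bin, alpha, B = binarize_weights(W)
    print("relative approx error:", np.linalg.norm(W - W_bin) / np.linalg.norm(W))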

Quantization: binary input and weights

Quantization: accuracy

Quantization

Smaller model: Knowledge distillation Use a teacher model (large model) to train a student model (small model). * Romero, Adriana, et al. "FitNets: Hints for Thin Deep Nets." ICLR (2015).
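As an illustration, a minimal NumPy sketch of a soft-target distillation loss (the classic Hinton-style formulation that FitNets builds on; the temperature, weighting, and logits are made up):

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        """Weighted sum of (a) cross-entropy against the teacher's softened
        outputs and (b) ordinary cross-entropy against the hard labels."""
        p_teacher = softmax(teacher_logits, T)
        p_student = softmax(student_logits, T)
        soft_ce = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=1).mean()
        p_hard = softmax(student_logits, 1.0)
        hard_ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
        return alpha * soft_ce + (1 - alpha) * hard_ce

    # Example with hypothetical logits for a batch of 8 examples and 10 classes
    teacher = np.random.randn(8, 10)
    student = np.random.randn(8, 10)
    labels = np.random.randint(0, 10, size=8)
    print(distillation_loss(student, teacher, labels))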

Smaller model: accuracy

Discussion What are the implications of these model compression techniques for serving?
- Specialized hardware for sparse models: Song Han, et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network." ISCA (2016).
- Accuracy and resource trade-off: Han, Seungyeop, et al. "MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints." MobiSys (2016).

Serving system

Serving system
Goals:
- High flexibility for writing applications
- High efficiency on GPUs
- Satisfy the latency SLA
Challenges:
- Provide a common abstraction for different frameworks
- Achieve high efficiency: a sub-second latency SLA limits the batch size, and model optimization and multi-tenancy cause a long latency tail

Nexus: efficient neural network serving system
Architecture (figure): applications issue requests to frontends, each running the application logic on top of the Nexus runtime (load balancer, dispatch, approximation library); frontends dispatch unit queries to GPU backends, each hosting one or more models (M1..M4) under a per-backend scheduler, with a global scheduler coordinating placement.
Remarks:
- The frontend runtime library allows arbitrary app logic
- Packing models achieves higher utilization
- A GPU scheduler allows new batching primitives
- A batch-aware global scheduler allocates GPU cycles for each model

Flexibility: Application runtime

class ModelHandler:
    # Async RPC: executes the model remotely and returns an output future
    def Execute(self, input): ...

class AppBase:
    # Sends a load-model request to the global scheduler; returns a ModelHandler
    def GetModelHandler(self, framework, model, version, latency_sla): ...

    # Load models during setup time; implemented by the developer
    def Setup(self): ...

    # Process requests; implemented by the developer
    def Process(self, request): ...

Application example: Face recognition

class FaceRecApp(AppBase):
    def Setup(self):
        # Load models from different frameworks
        self.m1 = self.GetModelHandler("caffe", "vgg_face", 1, 100)
        self.m2 = self.GetModelHandler("mxnet", "age_net", 1, 100)

    def Process(self, request):
        # The two models execute concurrently on remote GPUs (results are futures)
        ret1 = self.m1.Execute(request.image)
        ret2 = self.m2.Execute(request.image)
        # Accessing the future's data forces synchronization
        return Reply(request.user_id, ret1["name"], ret2["age"])

Application example: Traffic analysis

class TrafficApp(AppBase):
    def Setup(self):
        self.det = self.GetModelHandler("darknet", "yolo9000", 1, 300)
        self.r1 = self.GetModelHandler("caffe", "vgg_face", 1, 150)
        self.r2 = self.GetModelHandler("caffe", "googlenet_car", 1, 150)

    def Process(self, request):
        persons, cars = [], []
        for obj in self.det.Execute(request.image):
            if obj["class"] == "person":
                persons.append(self.r1.Execute(request.image[obj["rect"]]))
            elif obj["class"] == "car":
                cars.append(self.r2.Execute(request.image[obj["rect"]]))
        return Reply(request.user_id, persons, cars)

High Efficiency (workload: high request rate, high latency SLA) For this workload, saturate GPU efficiency by using a large batch size; the VGG16 throughput-vs-batch-size curve enters the GPU saturation region at batch size >= 32.

High Efficiency (workload: high request rate, low latency SLA) Different ops saturate at different batch sizes (e.g., at batch size 16, VGG16 conv2_1 with Winograd behaves very differently from fc6). Suppose we can choose a different batch size for each op (layer) and allocate dedicated GPUs for each op (op 1, op 2, ..., op n each on their own GPU).

Split Batching

Split Batching Batch size 3 for the entire model vs. the optimal batch size for each segment of VGG16 under a 15 ms latency SLA (figure annotations: 40%, 13%).
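A rough sketch of the idea (my own simplification, not the actual Nexus algorithm): given profiled latency curves for a two-segment split, enumerate per-segment batch sizes and keep the highest-throughput combination that fits the latency SLA. The latency curves below are hypothetical:

    def best_split_batching(lat_prefix, lat_suffix, sla_ms, max_batch=64):
        """Enumerate prefix batch size b and suffix batch size k*b (the prefix
        runs k times to fill one suffix batch); keep the combination with the
        highest throughput whose total latency fits within the SLA.
        lat_prefix(b) / lat_suffix(b) return measured latency in ms at batch b."""
        best = None
        for b in range(1, max_batch + 1):
            for k in range(1, max_batch // b + 1):
                total = k * lat_prefix(b) + lat_suffix(k * b)
                if total <= sla_ms:
                    throughput = (k * b) / total          # images per ms
                    if best is None or throughput > best[0]:
                        best = (throughput, b, k * b)
        return best  # (throughput, prefix_batch, suffix_batch) or None

    # Example with hypothetical profiled latency curves (ms)
    prefix = lambda b: 2.0 + 1.5 * b      # conv-heavy prefix: near-linear in batch size
    suffix = lambda b: 4.0 + 0.1 * b      # FC-heavy suffix: dominated by fixed cost
    print(best_split_batching(prefix, suffix, sla_ms=15))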

High Efficiency (workload: low request rate) With a single model, the GPU is idle during the interval between finishing batch k and accumulating batch k+1 (batching time vs. execution time). Solution: execute multiple models on one GPU.

Execute multiple models on a single GPU (figure) With a single model A, requests A1..A5 are collected into batch k and executed, then the next requests form batch k+1; the worst-case latency spans collecting a batch plus executing it.

Execute multiple models on a single GPU (figure) If model B is added naively, the operations of models A and B interleave on the GPU (A1 B1 A2 B2 ...); each batch then takes longer to complete, so the worst-case latency of both models grows.

Execute multiple models on a single GPU (figure) Instead, execute each model's batch back to back (batch k of A, then batch k of B, then batch k+1 of A, ...); because latency is reduced and predictable, a larger batch size can be used.

High Efficiency (workload: high request rate, low latency SLA) The solution depends on the model: if it saturates the GPU in the temporal domain due to the low latency, allocate dedicated GPU(s); if not, use multi-batching to share GPU cycles with other models.

Prefix batching for model specialization Specialization (long-term / short-term) re-trains the last few layers, producing a new model.

Prefix batching for model specialization Specialization (long-term / short-term) re-trains the last few layers, producing a new model. Prefix batching allows batched execution of the common prefix shared by these models, while the different suffixes execute individually; see the sketch below.
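A minimal sketch of the idea, with hypothetical prefix_model and per-application suffix_models callables standing in for the shared prefix and the re-trained suffixes:

    import numpy as np

    def prefix_batched_execute(prefix_model, suffix_models, inputs_per_app):
        """Batch all apps' inputs through the shared prefix (the common,
        unspecialized layers), then run each app's re-trained suffix only
        on its own slice of the prefix output."""
        apps = list(inputs_per_app.keys())
        batch = np.concatenate([inputs_per_app[a] for a in apps], axis=0)
        features = prefix_model(batch)            # one batched prefix pass
        outputs, start = {}, 0
        for a in apps:
            n = len(inputs_per_app[a])
            outputs[a] = suffix_models[a](features[start:start + n])
            start += n
        return outputs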

Meet Latency SLA: Global scheduler
First apply split batching and prefix batching where possible; multi-batching then becomes a bin-packing problem that packs models into GPUs.
- Optimization goal: minimize resource usage (number of GPUs)
- Constraint: requests have to be served within the latency SLA
- Degrees of freedom: split workloads into smaller tasks; change the batch size

Best-fit decreasing algorithm Treat each GPU's execution cycles as a bin; sort all residual workloads in decreasing order and best-fit each one into the GPU that leaves the least slack, opening a new GPU only when none fits (sketch below).
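A minimal sketch of best-fit decreasing bin packing in this setting, treating each GPU's execution-cycle capacity as a bin (the workload list and capacity units are illustrative):

    def best_fit_decreasing(workloads, capacity):
        """Pack workloads (each a fraction of a GPU's execution cycles) into as
        few bins (GPUs) as possible: sort decreasing, then place each workload
        into the fullest bin that still has room, opening a new bin if none fits."""
        bins = []          # remaining capacity per GPU
        assignment = []    # (workload, gpu_index) pairs
        for w in sorted(workloads, reverse=True):
            candidates = [i for i, free in enumerate(bins) if free >= w]
            if candidates:
                i = min(candidates, key=lambda i: bins[i])   # tightest fit
            else:
                bins.append(capacity)
                i = len(bins) - 1
            bins[i] -= w
            assignment.append((w, i))
        return assignment, len(bins)

    # Example: residual workloads expressed as fractions of one GPU's duty cycle
    print(best_fit_decreasing([0.6, 0.5, 0.4, 0.3, 0.2, 0.2], capacity=1.0))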

Reference
- Kim, Yong-Deok, et al. "Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications." ICLR (2016).
- Han, Song, et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." ICLR (2016).
- Romero, Adriana, et al. "FitNets: Hints for Thin Deep Nets." ICLR (2015).
- Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV (2016).
- Crankshaw, Daniel, et al. "Clipper: A Low-Latency Online Prediction Serving System." NSDI (2017).