Clipper A Low-Latency Online Prediction Serving System

Size: px

Start display at page:

Download "Clipper A Low-Latency Online Prediction Serving System"

Marcia Underwood
5 years ago
Views:

1 Clipper A Low-Latency Online Prediction Serving System Daniel Crankshaw Xin Wang, Giulio Zhou, Michael Franklin, Joseph Gonzalez, Ion Stoica NSDI 2017 March 29, 2017

2 Learning Big Data Training Complex Model

3 Learning Big Data Training Complex Model

4 Learning Produces a Trained Model Query Decision CAT Model

5 Learning Serving Query Big Data Training Decision? Model Application

6 Serving Learning Big Data Training Query Decision Model Application Prediction-Serving for interactive applications Timescale: ~10s of milliseconds

7 Prediction-Serving Raises New Challenges

8 Prediction-Serving Challenges??? Create Caffe 8 VW Support low-latency, highthroughput serving workloads Large and growing ecosystem of ML models and frameworks

9 Support low-latency, high-throughput serving workloads Models getting more complex Ø 10s of GFLOPs [1] Deployed on critical path Ø Maintain SLOs under heavy load [1] Deep Residual Learning for Image Recognition. He et al. CVPR Using specialized hardware for predictions

10 Google Translate Serving 140 billion words a day 1 Invented New Hardware! Tensor Processing Unit (TPU) 82,000 GPUs running 24/7 [1]

11 Big Companies Build One-Off Systems Problems: Ø Expensive to build and maintain Ø Highly specialized and require ML and systems expertise Ø Tightly-coupled model and application Ø Difficult to change or update model Ø Only supports single ML framework

12 Prediction-Serving Challenges??? Create Caffe 12 VW Support low-latency, highthroughput serving workloads Large and growing ecosystem of ML models and frameworks

13 Large and growing ecosystem of ML models and frameworks Fraud Detection Content Rec. Personal Asst. Robotic Control Machine Translation??? Create VW Caffe 13

Difficult to deploy and Robotic Control Machine

14 Large and growing ecosystem of ML models and frameworks Fraud Detection Content Rec. Personal Asst. Difficult to deploy and Robotic Control Machine Translation brittle to manage Create Caffe??? Varying physical resource requirements VW

15 But most companies can t build new serving systems

16 Use existing systems: Offline Scoring Big Data Training Model Batch Analytics

17 Use existing systems: Offline Scoring Datastore Big Data Training Scoring X Y Model Batch Analytics

18 Use existing systems: Offline Scoring Look up decision in datastore X Y Query Decision Application Low-Latency Serving

Use existing systems: Offline Scoring Look up decision in datastore Problems: X Y Query Ø Requires full set of queries ahead of time Ø Small and bounded input

19 Use existing systems: Offline Scoring Look up decision in datastore Problems: X Y Query Ø Requires full set of queries ahead of time Ø Small and bounded input domain Ø Wasted computation and space Decision Ø Can render and store unneeded predictions Ø Costly to update Ø Re-run batch job Application Low-Latency Serving

20 Prediction-Serving Challenges??? Create Caffe 20 VW Support low-latency, highthroughput serving workloads Large and growing ecosystem of ML models and frameworks

21 How does Clipper address these challenges?

22 Clipper Solutions q Simplifies deployment through layered architecture q Serves many models across ML frameworks concurrently q Employs caching, batching, scale-out for high-performance serving

23 Clipper Decouples Applications and Models Applications Predict RPC/REST Interface Feedback Clipper RPC RPC RPC RPC Model Container (MC) MC MC MC Caffe

24 Clipper Architecture Applications Predict RPC/REST Interface Clipper Observe Improve accuracy through bandit methods and ensembles, online learning, and personalization Provide a common interface to models while bounding latency and maximizing throughput. Model Selection Layer Model Abstraction Layer RPC RPC RPC RPC Model Container (MC) MC MC MC

25 Clipper Architecture Applications Predict RPC/REST Interface Clipper Observe Selection Policy Caching Adaptive Batching Model Selection Layer Model Abstraction Layer RPC RPC RPC RPC Model Container (MC) MC MC MC

Abstraction Layer Adaptive Batching 100 lines of Python RPC 250

26 Clipper Implementation Applications Predict RPC/REST Interface Clipper Observe Core system: 5000 lines of Rust Caching RPC: Model Abstraction Layer Adaptive Batching 100 lines of Python RPC 250 lines of Rust RPC RPC RPC 200 lines of C++ Model Container (MC) ust MC MC MC

27 Caching Adaptive Batching Model Abstraction Layer RPC Model Container (MC) RPC RPC RPC MC MC MC Caffe

28 Caching Adaptive Batching Model Abstraction Layer RPC RPC RPC RPC Model Container (MC) MC MC MC Caffe Common Interface à Simplifies Deployment: Ø Evaluate models using original code & systems

29 Container-based Model Deployment Implement Model API: class ModelContainer: def init (model_data) def predict_batch(inputs)

30 Container-based Model Deployment Implement Model API: class ModelContainer: def init (model_data) def predict_batch(inputs) Ø Implemented in many languages Ø Ø Ø Python Java C/C++

31 Container-based Model Deployment Model implementation packaged in container Model Container (MC) class ModelContainer: def init (model_data) def predict_batch(inputs)

32 Container-based Model Deployment Clipper RPC RPC RPC RPC Model Container (MC) MC MC MC Caffe

33 Caching Adaptive Batching Model Abstraction Layer RPC RPC RPC RPC Model Container (MC) MC MC MC Caffe Common Interface à Simplifies Deployment: Ø Evaluate models using original code & systems Ø Models run in separate processes as Docker containers Ø Resource isolation

34 Caching Adaptive Batching Model Abstraction Layer RPC RPC RPC RPC RPC RPC l Container (MC) MC MC MC MC MC Caffe Common Interface à Simplifies Deployment: Ø Evaluate models using original code & systems Ø Models run in separate processes as Docker containers Ø Resource isolation Ø Scale-out Problem: frameworks optimized for batch processing not latency

35 Batching to Improve Throughput Ø Why batching helps: A single page load may generate many queries Ø Optimal batch depends on: Ø hardware configuration Ø model and framework Ø system load Hardware Acceleration Helps amortize system overhead

36 Adaptive Batching to Improve Throughput Ø Why batching helps: A single page load may generate many queries Ø Optimal batch depends on: Ø hardware configuration Ø model and framework Ø system load Clipper Solution: Hardware Acceleration Helps amortize system overhead Adaptively tradeoff latency and throughput Ø Inc. batch size until the latency objective is exceeded (Additive Increase) Ø If latency exceeds SLO cut batch size by a fraction (Multiplicative Decrease)

37 Tensor Flow Conv. Net (GPU) Better Throughput (Queries Per Second) Batch Size

38 Tensor Flow Conv. Net (GPU) Better Throughput (Queries Per Second) Latency (ms) Better Batch Size

39 Tensor Flow Conv. Net (GPU) Better Throughput (Queries Per Second) Optimal Batch Size Latency (ms) Better Batch Size

40 Better 1R BatFKLng Throughput (QPS) R-2S RandRP )RreVt (6KOearn) LLnear 690 (3y6SarN) LLnear 690 (6KLearn) KerneO 690 (6KLearn) 1921 LRg RegreVVLRn (6KLearn)

41 Better AdaStLve 1R BatFKLng Throughput (QPS) R-2S andRP )RreVt (6KOearn) LLnear 690 (3y6SarN) LLnear 690 (6KLearn) KerneO 690 (6KLearn) LRg 5egreVVLRn (6KLearn)

42 Better AdaStLve 1R BatFKLng Throughput (QPS) P99 Latency (ms) 20 ms is Fast Enough Better R-2S andRP )RreVt (6KOearn) 20 0 LLnear 690 (3y6SarN) 20 0 LLnear 690 (6KLearn) 28 5 KerneO 690 (6KLearn) 20 0 LRg 5egreVVLRn (6KLearn)

43 AdaStLve 1R BatFKLng Throughput (QPS) P99 Latency (ms) Batch Size R-2S andRP )RreVt (6KOearn) LLnear 690 (3y6SarN) 1318 LLnear 690 (6KLearn) 3 KerneO 690 (6KLearn) 1224 LRg 5egreVVLRn (6KLearn)

44 Overhead of decoupled architecture Applications Predict RPC/REST Interface Feedback Clipper RPC RPC RPC RPC MC MC MC MC Caffe

45 Overhead of decoupled architecture Applications Predict RPC/REST Interface Clipper RPC RPC RPC RPC MC MC MC MC Caffe Feedback Applications Predict RPC Interface TensorFlow- Serving

46 Overhead of decoupled architecture Applications Applications Predict RPC/REST Interface Feedback Predict RPC Interface Clipper RPC RPC RPC RPC Model Container MC MC MC Caffe TensorFlow- Serving

47 Overhead of decoupled architecture Model: AlexNet trained on CIFAR-10 Better P99 Latency (ms) Throughput (QPS) Better

48 Clipper Architecture Applications Predict RPC/REST Interface Clipper Observe Improve accuracy through bandit methods and ensembles, online learning, and personalization Provide a common interface to models while bounding latency and maximizing throughput. Model Selection Layer Model Abstraction Layer RPC RPC RPC RPC Model Container (MC) MC MC MC

49 Clipper Improve accuracy through bandit methods and ensembles, online learning, and personalization Model Selection Layer Version 1 Version 2 Version 3 Periodic retraining Caffe Experiment with new models and frameworks

50 Selection Policy: Estimate confidence Version 2 Version 3 CAT CAT CAT CAT Policy CAT CONFIDENT Caffe

51 Selection Policy: Estimate confidence Caffe CAT MAN CAT SPACE Policy CAT UNSURE

52 Selection Policy: Estimate confidence Better 7Rp-5 ErrRr 5ate Image1et crniident unsure ensemele 4-agree 5-agree

53 Selection Policy: Estimate confidence Better 7Rp-5 ErrRr 5ate Image1et crniident unsure ensemele 4-agree 5-agree width is percentage of query workloads

54 Clipper Selection Policy Model Selection Layer Selection policies supported by Clipper ØExploit multiple models to estimate confidence Ø Use multi-armed bandit algorithms to learn optimal model-selection online Ø Online personalization across ML frameworks *See paper for details

55 Conclusion Ø Prediction-serving is an important and challenging area for systems research Ø Support low-latency, high-throughput serving workloads Ø Serve large and growing ecosystem of ML frameworks Ø Clipper is a first step towards addressing these challenges Ø Simplifies deployment through layered architecture Ø Serves many models across ML frameworks concurrently Ø Employs caching, adaptive batching, container scale-out to meet interactive serving workload demands Ø Beyond academic prototype to build a real, open-source system crankshaw@cs.berkeley.edu

56 GPU Cluster Scaling TKrRugKput (10K Tps) Latency (Ps) Agg 10Gbps Agg 1Gbps 0ean 10Gbps 0ean 1Gbps 0ean 10Gbps 0ean 1Gbps Gbps 399 1Gbps uPber Rf 5eplLcas

Prediction Systems. Dan Crankshaw UCB RISE Lab Seminar 10/3/2015

Prediction Systems. Dan Crankshaw UCB RISE Lab Seminar 10/3/2015 Prediction Systems Dan Crankshaw UCB RISE Lab Seminar 10/3/2015 Learning Big Data Training Big Model Timescale: minutes to days Systems: offline and batch optimized Heavily studied... major focus of the