End to End Optimization Stack for Deep Learning

Size: px

Start display at page:

Download "End to End Optimization Stack for Deep Learning"

Jacob Hunt
6 years ago
Views:

1 End to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul G. Allen School of Computer Science & Engineering University of Washington

2 Collaborators University of Washington AWS AI Team Tianqi Chen Thierry Moreau Haichen Shen Ziheng Jiang ML, Software Stack Hardware Stack GPU ARM, NNVM pipeline and many more contributors in the DMLC community Carlos Guestrin Luis Ceze Arvind Krishnamurthy

3 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

4 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

5 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

6 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware Built a new accelerator

7 Deep Learning System Research is Exciting but Hard Frameworks CNTK Need entire software stack on top of it! Computational graph Layout transformation Quantization Operator kernel optimization Benchmarking. Operator Libraries cudnn, NNPack, MKL-DNN Hardware Built a new accelerator

8 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

9 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

10 Deep Learning System Research is Exciting but Hard Frameworks CNTK Data Layout Optimization Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

11 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Data Layout Optimization Operator Fusion Operator Libraries cudnn, NNPack, MKL-DNN Hardware

12 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Data Layout Optimization Operator Fusion Operator Libraries cudnn, NNPack, MKL-DNN Need optimized hardware kernel Hardware for each variant, on each hardware!

13 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Data Layout Optimization Operator Fusion Serving Operator Libraries cudnn, NNPack, MKL-DNN Need optimized hardware kernel Hardware for each variant, on each hardware!

14 The End to End System Challenge Frameworks Hardware Back-Ends

15 The End to End System Challenge Frameworks Hardware Back-Ends

16 The End to End System Challenge Frameworks Hardware Back-Ends

17 The End to End System Challenge Frameworks Hardware Back-Ends

18 The End to End System Challenge Frameworks Hardware Back-Ends

19 The End to End System Challenge Frameworks Hardware Back-Ends

20 The End to End System Challenge Frameworks Hardware Back-Ends

21 The End to End System Challenge Frameworks Intermediate representation Hardware Back-Ends

22 Computational Graph IR and Remaining Gap Examples: NGraph, XLA, NNVM, DLVM Computational Graph Auto Differentiation Memory Plan Operator Fusion Backends

23 Computational Graph IR and Remaining Gap Computational Graph Auto Differentiation Memory Plan Operator Fusion Backends

24 Computational Graph IR and Remaining Gap Computational Graph Auto Differentiation Memory Plan Operator Fusion too many possible choices: precision, layout, fused pattern, device, threading Need a low level IR to express them explicitly Backends

25 TVM: Low Level IR Concise and compact description Explicit control on codegen Framework NNVM Graph Auto Differentiation Memory Plan Ease of deployment Support new hardware backends TVM Hardware backends

26 Tensor Index Expression Declaration Compute C = dot(a, B.T) import tvm m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h') A = tvm.placeholder((m, h), name='a') Inputs B = tvm.placeholder((n, h), name= B') k = tvm.reduce_axis((0, h), name= k') C = tvm.compute((m, n), lambda i, j: tvm.sum(a[i, k] * B[j, k], axis=k)) Shape of C Computation Rule

27 Challenge: Hardware Diversities IR

28 Challenge: Hardware Diversities CPU GPU Accelerators IR

29 Challenge: Hardware Diversities CPU GPU Accelerators IR Memory subsystem L3 L2 FIFO SM SM Unified L2 L2 TX/L1 TX/L1 Buffer Acc L1D L1I L1D L1I RF RF RF RF implicitly managed mixed explicitly managed

30 Challenge: Hardware Diversities CPU GPU Accelerators IR Memory subsystem L3 L2 FIFO SM SM Unified L2 L2 TX/L1 TX/L1 Buffer Acc L1D L1I L1D L1I RF RF RF RF implicitly managed mixed explicitly managed Compute primitives scalar vector tensor

31 Challenge: Hardware Diversities CPU GPU Accelerators IR Memory subsystem L3 L2 FIFO SM SM Unified L2 L2 TX/L1 TX/L1 Buffer Acc L1D L1I L1D L1I RF RF RF RF implicitly managed mixed explicitly managed Compute primitives scalar vector tensor Data type fp32 fp16 int8

32 Unified Schedule Optimizations for Hardwares Algorithm described in IR Lowering Generated code (LLVM, CUDA, OpenCL )

33 Unified Schedule Optimizations for Hardwares Algorithm described in IR Scheduling Optimization Lowering Generated code (LLVM, CUDA, OpenCL )

34 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout Lowering Generated code (LLVM, CUDA, OpenCL )

35 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout ( ) Tiling Lowering Generated code (LLVM, CUDA, OpenCL )

36 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout ( ) Tiling Lowering ( ) Thread cooperation Generated code (LLVM, CUDA, OpenCL )

37 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout ( ) Tiling Lowering ( ) Thread cooperation ( ) Latency hiding Generated code (LLVM, CUDA, OpenCL )

38 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout ( ) Tiling Lowering ( ) Thread cooperation ( ) Latency hiding ( ) Tensorization Generated code (LLVM, CUDA, OpenCL )

39 Separation of Compilation and Deployment Compilation Stack TVM Runtimes Framework Frontends Deploy NNVM TVM TVM Graph Module Heavy optimizations Lightweight, 300 to 600 KB

40 Remote Execution and Profiling Devices with TVM Runtime Server with TVM Compiler TVM RPC

41 Performance Portable against state of art 8 6 K80, Baseline MXNet with cudnn auto tune enabled One grad student month 1.2x MXNet NNVM Compiler Raspberry Pi 3 Baseline: MXNet with OpenBLAS and NNPack Two undergrad weeks MXNet NNVM Compiler Time cost(ms) Time cost(ms) 4 1.2x x 11.5x 0 ResNet18 MobileNet 0 ResNet18 MobileNet Credit: Leyuan Wang(AWS/UCDavis), Yuwei Hu(TuSimple), Zheng Jiang(AWS/FDU)

42 Coming Soon: Target New Accelerators Tensorization Latency Hiding FPGA Example for building new hardware backend Open-source soon

43 NNVM Compiler: Open Compiler for AI Systems Caffe Keras MXNet PyTorch Caffe2 CNTK CoreML NNVM ONNX Graph Optimizations External Support Supported Work in progress TVM TVM Primitives Joint Work with AWS AI Team and DMLC community More hardware backends OpenCL LLVM Metal CUDA X86 AMDGPUs ARM Javascript/WASM

44 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

45 Deep Learning System Research is Just Exciting Frameworks CNTK NNVM My new optimizations works on all platforms! Graph Optimizations TVM TVM Primitives I can program my new accelerators from python :) Hardware

46 Deep Learning System Research is Just Exciting Frameworks CNTK NNVM My new optimizations works on all platforms! Graph Optimizations TVM TVM Primitives I can program my new accelerators from python :) Hardware You can be part of it!

Automatic Code Generation TVM Stack

Automatic Code Generation TVM Stack CSE 599W Spring TVM stack is an active project by saml.cs.washington.edu and many partners in the open source community The Gap between Framework and Hardware Frameworks