Optimizing CNN Inference on CPUs

1 Optimizing CNN Inference on CPUs. Yizhi Liu, Yao Wang, Yida Wang, with others in AWS AI

2 Agenda: deep learning inference optimization; optimization on Intel CPUs; evaluation

3 Make DL inference easier and faster

4 Deep Learning Inference [Diagram labels: Model Marketplace, Deep Learning Compiler, Inference Target]

5 Models and hardware targets are far apart! Bridging the gap: computation graph optimization; [tensor] operation optimization; machine code generation

6 TVM: end-to-end optimization stack

7 Computation Graph Optimization: represent high-level deep learning computations. Optimizations: pruning; pre-compute; memory plan; operation fusion; data layout transform

8 Operation Optimization Challenges

9 Solution: separating compute definition and scheduling
Compute definition:
    # C[i, j] = sum over k of A[i, k] * B[k, j]
    C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k))
Compute scheduling:
    s = tvm.create_schedule(C.op)
    # tile the two output axes into bn x bn blocks
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    # split the reduction axis and reorder the loop nest
    ko, ki = s[C].split(k, factor=4)
    s[C].reorder(ko, xi, ki, yi)
    s[C].unroll(ki)
    s[C].cache_read(...)
    s[C].cache_write(...)
    s[C].vectorize(yi)
    s[C].parallel(xo)
    ...
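As a rough illustration of why the two are kept apart, the pure-Python sketch below computes the same matrix product with two loop structures: a naive one, and a tiled one that mimics what tile and split do to the iteration space. It does not use TVM; the helper names (matmul_naive, matmul_tiled) and the block sizes are made up for illustration.

```python
def matmul_naive(A, B):
    """Compute definition: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    m, kk, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for k in range(kk):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, bn=2, kf=4):
    """Same computation, different schedule: i/j tiled by bn, k split by kf."""
    m, kk, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for xo in range(0, m, bn):              # outer tile over rows
        for yo in range(0, n, bn):          # outer tile over columns
            for ko in range(0, kk, kf):     # outer part of the reduction
                for xi in range(xo, min(xo + bn, m)):
                    for ki in range(ko, min(ko + kf, kk)):
                        for yi in range(yo, min(yo + bn, n)):
                            C[xi][yi] += A[xi][ki] * B[ki][yi]
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1], [2, 2, 2]]   # 5 x 3
B = [[1, 0, 2, 1], [0, 1, 3, 1], [4, 5, 6, 1]]                # 3 x 4
assert matmul_naive(A, B) == matmul_tiled(A, B)
```

The result is identical for any block size; only the order in which memory is touched changes, which is the part a schedule is meant to tune.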

10 Machine code generation. Code generators: LLVM, OpenCL, CUDA. Targets: x86, AMD GPU, ARM CPU, ARM GPU, Intel Graphics, Nvidia GPU

11 Model in, deployable module out
Compile:
    import tvm
    import nnvm.frontend
    import nnvm.compiler
    # convert the MXNet model, then build it for the chosen target
    graph, params = nnvm.frontend.from_mxnet(mx_resnet50)
    graph, lib, params = nnvm.compiler.build(graph, target)
Deploy:
    # assumption: `runtime` on the slide refers to TVM's graph runtime module
    from tvm.contrib import graph_runtime as runtime
    module = runtime.create(graph, lib, tvm.gpu(0))
    module.set_input(**params)
    module.run(data=data_array)
    output = tvm.nd.empty(out_shape, ctx=tvm.gpu(0))
    module.get_output(0, output)
[Diagram: input -> Deployable Module -> prediction ("tabby, tabby cat")]
On languages and platforms you choose

12 Optimization on CPUs

13 Intel Xeon Platinum 8000-series CPUs (Skylake). Multi-core, e.g., EC2 c5.9xlarge: 1 processor with 18 cores. AVX-512 supported: 512-bit-wide registers (ZMM). Example instruction: vfmadd231ps -1664(%rax,%r13){1to16}, %zmm0, %zmm1
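To make the instruction concrete: a 512-bit ZMM register holds 16 float32 lanes, and the {1to16} suffix broadcasts a single float32 from memory to all 16 lanes before the fused multiply-add. The Python sketch below emulates that semantics on plain lists; fma_broadcast is a hypothetical helper for illustration, not an intrinsic.

```python
LANES = 512 // 32   # 16 float32 values fit in one 512-bit ZMM register


def fma_broadcast(scalar, zmm_src, zmm_acc):
    """Emulate `vfmadd231ps mem{1to16}, zmm_src, zmm_acc`:
    every lane of the accumulator gets acc += src * scalar."""
    assert len(zmm_src) == LANES and len(zmm_acc) == LANES
    return [acc + src * scalar for src, acc in zip(zmm_src, zmm_acc)]


kernel = [float(i) for i in range(LANES)]   # plays the role of %zmm0
acc = [1.0] * LANES                         # plays the role of %zmm1
acc = fma_broadcast(2.0, kernel, acc)       # 2.0 plays the role of the memory operand
```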

14 Optimizing CNN inference: 0. Leverage the hardware-independent graph-level optimizations (fusion, precomputing, pruning). 1. CONV operation optimization. 2. Graph-level layout optimization. 3. Optimization scheme search.

15 CONV optimization: data layout is important!
[Figure labels: in_channel, in_height, in_width; kernel_height, kernel_width; out_channel (# of kernels), out_height, out_width]
Compute definition:
    conv = tvm.compute(oshape, lambda n, oc, oh, ow:
        tvm.sum(data[n, ic, oh*stride+kh, ow*stride+kw] * kernel[oc, ic, kh, kw],
                axis=[ic, kh, kw]))
Equivalent loop nest:
    for (n, 0, N):
      for (oc, 0, OC):
        for (oh, 0, OH):
          for (ow, 0, OW):
            Out[n, oc, oh, ow] = 0   // init Out
            for (ic, 0, IC):
              for (kh, 0, KH):
                for (kw, 0, KW):
                  // Out += In * Kernel
                  Out[n, oc, oh, ow] += In[n, ic, oh*stride+kh, ow*stride+kw] * Kernel[oc, ic, kh, kw]
Layout choices: NCHW -> NHWC; NCHW -> NCHW[x]c; OIHW -> OIHW[x]i[y]o
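In NCHW[x]c the channel axis is split into blocks of x, and the x channels of one block become the innermost (contiguous) dimension, so a single vector load can fetch them. The sketch below repacks a small nested-list tensor to show the index mapping; nchw_to_nchwc is a hypothetical helper, and it assumes the channel count is divisible by x.

```python
def nchw_to_nchwc(data, x):
    """Repack a nested-list NCHW tensor into NCHW[x]c.

    packed[n][co][h][w][ci] == data[n][co * x + ci][h][w]
    """
    N, C, H, W = len(data), len(data[0]), len(data[0][0]), len(data[0][0][0])
    assert C % x == 0, "sketch assumes the channel count is divisible by x"
    return [[[[[data[n][co * x + ci][h][w] for ci in range(x)]
               for w in range(W)]
              for h in range(H)]
             for co in range(C // x)]
            for n in range(N)]


# 1 x 4 x 2 x 2 tensor whose values encode their own (c, h, w) position
data = [[[[c * 100 + h * 10 + w for w in range(2)]
          for h in range(2)]
         for c in range(4)]]
packed = nchw_to_nchwc(data, x=2)   # shape 1 x 2 x 2 x 2 x 2
```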

16 CONV optimization: utilize the AVX-512 ISA well (broadcast)
Basic scheme:
    Load input to DRAM; load kernels to ZMM   // up to 16 float32
    vfmadd input, kernel, output
    Store output back to DRAM
Register-blocked scheme:
    Load 31 inputs to DRAM; load kernels to ZMM
    vfmadd input_1, kernel, output_1
    vfmadd input_2, kernel, output_2
    ...
    vfmadd input_31, kernel, output_31
    Store output_{1..31} back to DRAM
[Figure labels: inputs + kernels, outputs; ZMM_0; ZMM_1 - ZMM_{ow_inner}; DRAM; vectorized FMA; in_channel, in_height, in_width; kernel_height, kernel_width; out_channel (# of kernels), out_height, out_width, ow_inner]
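The register-blocked scheme can be sketched in plain Python, under simplifying assumptions: one output row, stride 1, a single block of output channels, and small stand-in sizes (4 lanes instead of 16, 3 accumulators instead of 31). The names conv_row_blocked and conv_row_reference are hypothetical. The point is that each kernel vector is loaded once and reused for every output position in the block, while each input scalar is broadcast across the lanes.

```python
LANES = 4        # stand-in for the 16 float32 lanes of a ZMM register
OW_INNER = 3     # stand-in for the up-to-31 output registers


def conv_row_blocked(inp, ker, out_width):
    """inp[ic][w] are scalars; ker[ic][kw] is a LANES-wide vector over output channels."""
    IC, KW = len(ker), len(ker[0])
    out = []
    for ow in range(0, out_width, OW_INNER):
        block = min(OW_INNER, out_width - ow)
        accs = [[0] * LANES for _ in range(block)]   # one "register" per output position
        for ic in range(IC):
            for kw in range(KW):
                kvec = ker[ic][kw]                   # kernel vector loaded once
                for i in range(block):               # ... and reused for every output
                    s = inp[ic][ow + i + kw]         # scalar input, broadcast to all lanes
                    accs[i] = [a + k * s for a, k in zip(accs[i], kvec)]
        out.extend(accs)                             # store the finished block
    return out


def conv_row_reference(inp, ker, out_width):
    """Direct evaluation of the same sum, used only to check the blocked version."""
    IC, KW = len(ker), len(ker[0])
    return [[sum(inp[ic][ow + kw] * ker[ic][kw][oc]
                 for ic in range(IC) for kw in range(KW))
             for oc in range(LANES)]
            for ow in range(out_width)]


inp = [[1, 2, 3, 4, 5, 6, 7], [0, 1, 0, 1, 0, 1, 0]]       # IC=2, W=7
ker = [[[1, 2, 3, 4], [0, 1, 0, 1], [2, 2, 2, 2]],
       [[1, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1]]]          # IC=2, KW=3
out = conv_row_blocked(inp, ker, out_width=5)
assert out == conv_row_reference(inp, ker, out_width=5)
```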

17 CONV optimization: use a customized thread pool. Lock-free queue; thread binding to physical cores; cache line padding

18 Graph-level layout optimization
[Figure: the same graph (Data, CONV, BATCH_NORM, RELU, POOLING, CONV, FLATTEN) before and after the AlterOpLayout pass. Before: tensors in NCHW, kernels in OIHW, BATCH_NORM mean/variance in C. After: CONV becomes CONV_NCHW16c with kernels in OIHW16i16o, intermediate tensors stay in the optimized layout NCHW16c, BATCH_NORM mean/variance become C16c, and LayoutTransform nodes appear where NCHW and NCHW16c meet.]
LayoutTransform for parameters can be pre-computed during compile time.
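A minimal sketch of the idea, assuming a linear chain of operators: CONV wants the blocked layout, some operators simply keep whatever layout they receive, and others require the default layout. A LayoutTransform is inserted only where the layout actually changes. The function plan_layouts and the "conv" / "oblivious" / "dependent" tags are assumptions made for this illustration, not names from TVM.

```python
def plan_layouts(ops, blocked="NCHW16c", default="NCHW"):
    """Return the op sequence with LayoutTransform nodes only at layout boundaries."""
    planned, current = [], default            # the graph input arrives in NCHW
    for name, kind in ops:
        if kind == "conv":
            wanted = blocked                   # CONV runs in the blocked layout
        elif kind == "oblivious":
            wanted = current                   # e.g. RELU, POOLING: keep the incoming layout
        else:
            wanted = default                   # e.g. FLATTEN: needs the default layout
        if wanted != current:
            planned.append(("LayoutTransform", current + "->" + wanted))
        planned.append((name, wanted))
        current = wanted
    return planned


graph = [("CONV", "conv"), ("BATCH_NORM", "oblivious"), ("RELU", "oblivious"),
         ("POOLING", "oblivious"), ("CONV", "conv"), ("FLATTEN", "dependent")]
plan = plan_layouts(graph)
num_transforms = sum(1 for name, _ in plan if name == "LayoutTransform")
```

With this chain only two data transforms remain, one entering the first CONV and one before FLATTEN, instead of a pair around every CONV.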

19 Operator Local Search [Figure axes: C, H, W]. For operators in a graph, different input sizes lead to different workloads. In local search, we parameterize the schedule for each workload and search for the optimal parameter combination. Reference: Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to Optimize Tensor Programs. arXiv [cs.LG]
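A local search of this kind can be sketched as an exhaustive sweep over schedule parameters for one workload. Everything below is illustrative: the parameter names (ic_bn, oc_bn, ow_inner), the candidate sets, and especially toy_cost, which is a made-up stand-in for actually timing each candidate on the target CPU.

```python
from itertools import product


def factors(n):
    """All positive divisors of n, used as candidate block sizes."""
    return [f for f in range(1, n + 1) if n % f == 0]


def local_search(workload, measure):
    """Try every parameter combination for one workload; return (best_cost, best_params)."""
    candidates = product(
        factors(workload["in_channel"]),                          # ic_bn
        factors(workload["out_channel"]),                         # oc_bn
        [w for w in range(1, 32) if w <= workload["out_width"]],  # ow_inner
    )
    best = None
    for ic_bn, oc_bn, ow_inner in candidates:
        cost = measure(workload, ic_bn, oc_bn, ow_inner)
        if best is None or cost < best[0]:
            best = (cost, {"ic_bn": ic_bn, "oc_bn": oc_bn, "ow_inner": ow_inner})
    return best


def toy_cost(workload, ic_bn, oc_bn, ow_inner):
    """Made-up cost model; a real search would run and time the generated code."""
    lane_waste = abs(16 - oc_bn)                 # prefer a full 16-lane vector
    tail = workload["out_width"] % ow_inner      # prefer no leftover output columns
    return lane_waste * 10 + tail * 3 + abs(16 - ic_bn) + (31 - ow_inner)


workload = {"in_channel": 64, "out_channel": 128, "out_width": 56}
best_cost, best_params = local_search(workload, toy_cost)
```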

20 Graph Global Search. Layout transformation: a layout transformation may be required between two workloads with different layouts.
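The slide does not spell out the search procedure. One way to account for transformation cost, sketched here under the assumption of a linear chain of CONVs, is dynamic programming over the per-operator layout choices. The function global_search, the layout names, and all the timings are hypothetical.

```python
def global_search(exec_cost, transform_cost):
    """Pick one layout per CONV in a linear chain, minimizing total time.

    exec_cost[i][layout] is the time of CONV i in that layout.
    transform_cost(a, b) is the time to convert layout a to layout b.
    Returns (total_time, chosen_layouts).
    """
    # best[layout] = (cheapest total so far ending in this layout, layouts chosen)
    best = {lay: (cost, [lay]) for lay, cost in exec_cost[0].items()}
    for costs in exec_cost[1:]:
        nxt = {}
        for lay, cost in costs.items():
            prev_total, prev_path = min(
                ((t + transform_cost(p, lay), path) for p, (t, path) in best.items()),
                key=lambda item: item[0])
            nxt[lay] = (prev_total + cost, prev_path + [lay])
        best = nxt
    return min(best.values(), key=lambda item: item[0])


exec_cost = [
    {"NCHW8c": 11, "NCHW16c": 7},
    {"NCHW8c": 5, "NCHW16c": 12},
    {"NCHW8c": 14, "NCHW16c": 8},
]
total, layouts = global_search(exec_cost, lambda a, b: 0 if a == b else 5)
```

With these made-up numbers, picking the locally fastest layout for each CONV (16c, 8c, 16c) costs 20 for execution plus 10 for two transforms, while keeping NCHW16c throughout costs 27, which is why the local optima alone are not enough.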

21 Evaluations

22 End-to-end results. Batch size = 1. Baseline: MKLDNN, 18-core Skylake. [Bar chart: time (ms), MXNet-MKLDNN vs. TVM-compiled, for ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152, VGG-11, VGG-19, MobileNet; speedup labels on the chart range from 1.2x to 2.1x]

23 Scalability [Plots: ResNet-152, VGG-19, MobileNet]

24 Conclusions: Industry needs an open-standard compiler for DL. AWS is working on the TVM stack: covers the main frameworks and hardware targets; gains performance through graph and tensor co-optimization; separation of hardware-specific schedules

25 Contact us if you're interested in trying out TVM with your models! Thank you! Q & A offline
