Deep Learning Compiler

Size: px

Start display at page:

Download "Deep Learning Compiler"

Kelly Roberts
5 years ago
Views:

1 Deep Learning Compiler AWS AI

2 Acknowledgement

3 Amazon Sagemaker Neo Enables developers to train machine learning models once and run them anywhere in the cloud and at the edge Hardware targets Intel CPU, Intel graphics ARM CPU, ARM GPU Nvidia GPU FPGA ASIC Product targets Amazon Rekognition AWS DeepLens Amazon Lex And a lot of internal/external products

5 CONV Kernel tuning

6 Intel Xeon Platinum 8000-series CPUs (Skylake) Multi-cores E.g., EC2 c5.9xlarge: 1 processor with 18 cores. AVX-512 supported 512-bit width registers (ZMM) E.g. vfmadd231ps -1664(%rax,%r13){1to16}, %zmm0, %zmm1

7 CONV optimization Data layout is important! conv = tvm.compute(oshape, lambda n, oc, oh, ow: tvm.sum( data[n, ic, oh*stride+kh, ow*stride+kw] * kernel[oc, ic, kh, kw], axis=[ic, kh, kw]), ) for (n, 0, N): for (oc, 0, OC): for (oh, 0, OH): for (ow, 0, OW): NCHW -> NHWC NCHW -> NCHW[x]c OIHW-> OIHW[x]i[y]o in_channel in_width Out[n, oc, oh, ow] = 0 // init Out for (ic, 0, IC): for (kh, 0, KH): for (kw, 0, KW): // Out += In * Kernel in_height kernel_width kernel_height out_channel (# of kernel) out_width out_height

8 CONV optimization Utilize the AVX-512 ISA well in_height (broadcast) Load input to DRAM; Load kernels to ZMM; // up to 16 float32 vfmadd input, kernel, output Store output back to DRAM in_channel in_width outputs inputs + kernels kernel_width kernel_height ZMM_0 Load 31 inputs to DRAM; Load kernels to ZMM; vfmadd input_1, kernel, output_1 vfmadd input_2, kernel, output_2 vfmadd input_31, kernel, output_31 Store output_{1 31} back to DRAM ZMM_1 - ZMM_{ow_inner} DRAM vectorized FMA out_channel (# of kernel) ow_inner out_width out_height

9 Intel Graphics on Amazon DeepLens Hardware Configs: Intel HD Graphics 500 (Intel s Gen 9) On-die integrated GPU 12 EUs, 0.55 GHz 7 physical threads per EU, bit FPUs per EU GFLOPS peak performance Work items in the same SIMD group form a subgroup sharing 4KB GRFs Intel Opencl ext: cl_intel_subgroups Shares the main memory with CPU

10 Instruction examples and corresponding TVM instructions intel_sub_group_block_read/write cache_read/write(buffer, warp, [result]) Intel_sub_group_shuffle storage_align(axis, 16) and bind it to threads Convolution: Work items work on a certain block of workloads to utilize local memory Layout transform for coalescing memory accesses Utilize cl_intel_subgroups operations

11 Graph-level optimization

12 Graph-level layout optimization Undef FLATTEN NCHW AlterOpLayout Undef FLATTEN NCHW NCHW16c LayoutTransform LayoutTransform for parameters can be pre-computed during compile time. CONV OIHW Kernel CONV_NCHW16c OIHW16i16o LayoutTransform OIHW Kernel NCHW NCHW16c RELU NCHW POOLING RELU NCHW16c POOLING NCHW BATCH_NORM C Mean / Variance optimized layout NCHW16c BATCH_NORM C16c LayoutTransform C Mean / Variance NCHW NCHW16c CONV OIHW Kernel CONV_NCHW16c OIHW16i16o LayoutTransform OIHW Kernel NCHW NCHW16c Data Data NCHW LayoutTransform

13 Graph/tensor co-optimization CONV i LayoutTransform CONV j ELEWISE_ADD LayoutTransform CONV k LayoutTransform CONV No LayoutTransform? Yes CONV schemes CONV computing time: varies along with different CONV schemes CONV l LayoutTransform N-2 N-1 N Layout Transform time: varies along with different CONV schemes Dynamic programming + necessary heuristics

14 End-to-end results Batch size = Intel CPU MXNet OpenVINO TVM ResNet-18 ResNet-34 ResNet-50 ResNet-101 ResNet-152 VGG-11 VGG-13 VGG-16 VGG-19 DenseNet-121 DenseNet-161 DenseNet-169 DenseNet-201 Inception-v3 MobileNet SSD Intel Graphics OpenVINO TVM ResNet-18 ResNet-34 ResNet-50 ResNet-101 ResNet-152 SSD

15 Other functionalities

16 Runtime multi-threading Use a customized thread pool for CPU targets Lock-free queue using C++ atomics Thread-binding to physical cores Cache line padding

17 ResNet-152 VGG-19 DenseNet-121 Inception-v3

18 Graph Annotation Annotation Copy node insertion Optimization/Compilation Runtime graph lib file GPU GPU GPU CPU lib CPU CPU TVM op CPU lib CPU GPU GPU TVM op TVM op TVM op GPU GPU lib GPU lib GPU GPU GPU GPU node CPU node Data copy node

19 Quantization on Intel CPUs Hardware support : Fast INT8 operations with INT32 accumulation INT8 conv2d kernel requires new schedule Performs reduction in groups of 4 INT8 elements to INT32 elements FP32 schedule does not require in-vector reduction 3.5 INT8 schedules speedup for varying workloads of conv2d Speedup normalized to FP32 schedules W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 W11 W12 W13 W14 W15 W16 W17 W18 W19 W20 W21 W22 W23 W24 W25 Workloads

20 ASICs AWS inferentia

21 Takeaways

22 Takeaways Industry needs an open standard compiler for DL AWS working on the TVM stack

23 Takeaways Industry needs an open standard compiler for DL AWS working on the TVM stack We are eager to collaborate with the community Talk to us, we have 10+ people here today!

24 Takeaways Industry needs an open standard compiler for DL AWS working on the TVM stack We are eager to collaborate with the community Talk to us, we have 10+ people here today!

25 Takeaways Industry needs an open standard compiler for DL AWS working on the TVM stack We are eager to collaborate with the community Talk to us, we have 10+ people here today! We are hiring! Write to Vin Sharma or Yida Wang

Optimizing CNN Inference on CPUs

Optimizing CNN Inference on CPUs Yizhi Liu, Yao Wang, Yida Wang With others in AWS AI Agenda Deep learning inference optimization Optimization on Intel CPUs Evaluation Make DL inference easier and faster