Research Faculty Summit Systems Fueling future disruptions

Size: px

Start display at page:

Download "Research Faculty Summit Systems Fueling future disruptions"

Lizbeth McBride
5 years ago
Views:

1 Research Faculty Summit 2018 Systems Fueling future disruptions

2 Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia

System Challenge in Deep Learning Innovations are

models and workload patterns RNN, CNN, GAN,

Diverse and emerging hardware accelerators, GPU,

Compiler stack is key to bridge framework and

and hardware Optimize for both local execution and

and inference Intermediate Representation (IR) x w

3 System Challenge in Deep Learning Innovations are emerging very fast in deep learning area New DNN models and workload patterns RNN, CNN, GAN, reinforcement learning, graph neural network, etc. Diverse and emerging hardware accelerators, GPU, FPGA, ASICs, edge devices, NV-Link, RDMA, etc. Compiler stack is key to bridge framework and hardware Combine information of computation graph and hardware Optimize for both local execution and distributed scalability Critical for both training and inference Intermediate Representation (IR) x w * b + y Compiler & Optimizer Infrastructure Execution backends

4 Wolong: Optimizer Stack for Deep Learning System innovation to bridge application and hardware General computation graph optimization Software and hardware co-design Just-in-time compiler Transparent optimization Communication efficiency Accelerator execution efficiency Memory efficiency Tensor Placement Optimizer: Memory layout and placement optimization Intermediate Representation (graph of operators) Global Optimizer: Optimize distributed training over RDMA Graph analyzer/ RDMA memcpy library Local Optimizer: Optimize execution with JIT compiler Operator batching/ kernel fusion Execution Runtime CPU, GPU, RDMA devices

5 Global Optimizer Fast Distributed Deep Learning Computation over RDMA

data parallelism Tensor transmission across server becomes bottlenecks Dispatch partitions Partition YY graph

6 Distributed Dataflow Graph Execution Deep learning computation is modeled as dataflow graph Achieve parallel manner through graph partitioning Model parallelism vs. data parallelism Tensor transmission across server becomes bottlenecks Dispatch partitions Partition YY graph HH * σσ * WW 11 XX WW 22 Latency (ms) Send-Recv Benchmark grpc RDMA 10x Send HH σσ WW 11 * XX Recv YY * WW 2 message size (MB) Server 0 Server 1

7 General Message Passing Library (e.g., RPC) Unavoidable memory copy overhead in RPC Generally designed for dynamic data structure Lacks knowledge of actual data placement and size Extra memory copy from data serialization Software/hardware co-design to completely remove memory copy overhead Leverage runtime application information RDMA network HH Send σσ * WW 11 XX Application memory t RPC managed buffer t t YY * Recv WW 22 Application memory t RPC managed buffer t Server 0 Server 1

8 Combine Dataflow Graph Computation with RDMA Tensor abstraction in deep learning computation Consists of a plain byte array with sufficiently large size (tens of KB to MB) Do NOT require variant data serialization/deserialization Do NOT require extra batching since access pattern is already sequential RDMA enables to manage local and distributed memory in a unified view One-side RDMA R/W : efficient memory copy between host memory GPU-Direct RDMA : efficient memory copy between host and device memory Global graph optimizer for distributed computation Has the entire view and control of memory placement among devices and servers Capable of making globally optimized strategy for tensor data placement in runtime

Optimized Communication Mechanism Transfer statically placed tensor through one-side RDMA write Phase I: graph analyzing RDMA-based zero-copy communication Phase II: graph execution Send HH σσ * WW

9 Optimized Communication Mechanism Transfer statically placed tensor through one-side RDMA write Phase I: graph analyzing RDMA-based zero-copy communication Phase II: graph execution Send HH σσ * WW 11 XX... Source Tensor 1 RDMA lib: Conduct remote memory copy One-sided RDMA write (Polling flag byte) Dest Tensor Recv YY * WW 22 Server 0 Tensor Manager: Detect the source tensor place Re-allocate as RDMA memory Tensor Manager: Pre-allocate RDMA compatible receive tensor Server 1

10 Global Optimizer: Performance Evaluation Improve training throughput, convergence speed and scalability Deep Learning Benchmarks Convergence of Seq2Seq Translation Throughput (mini-batchs/sec) x 2.3x 2.6x 4.2x 4.0x 8.1x AlexNet Inception FC LSTM GRU VGG16 Perplexity x speed up Run time (s) TensorFlow(gRPC) Wolong TensorFlow(gRPC) Wolong(RDMA) More details in our paper: RPC Considered Harmful: Fast Distributed Deep Learning on RDMA * Experiments are conducted on 8 servers 8 Nvidia GTX 1080 GPUs; The translation model uses WTM 15 datasets;

11 Local Optimizer Kernel Fusion for Deep Learning on GPU

12 Motivation Deep learning frameworks model computation as graph of primitive operators Expressivity to represent arbitrary neural network structure Flexibility to run on multi-device and multi-server through graph partitioning Significant framework overhead to schedule thousands of operators Kernel-launch overhead Cross operator communication overhead Too fine-grained to leverage vendor s library Example: 80-step LSTM model Contains 1686 operators in TensorFlow runtime (ms) LSTM 512x512 (80steps) TensorFlow CUDNN FuseKernel

13 DL Frameworks vs. Vendor Provided Library Flexibility Deep learning frameworks E.g., TensorFlow, PyTorch, CNTK Embrace flexibility and expressivity Performance inefficiency DL framework + Compiler Generate library-like code in runtime Win both of the worlds Hardware specific library E.g., cudnn, cublas, MKL Designed for extreme efficiency Impossible to handle customized or new network structure Efficiency

14 Wolong Compiler Design Computation graph level optimization Graph rewriting based on computational equivalence Common subexpression elimination, constant folding etc. Operator batching: automatically batch same type operators to better leverage batch efficiency Target and application specific runtime compilation Static shape and type inference Static memory planning Aggressive kernel fusion Computational Graph (IR) x * + y w b Graph level optimizer Target specific JIT compiler Execution runtime

15 Wolong Compiler Execution Workflow Detect optimize subgraph Code generation LSTM Relu Sigmoid Mul BiasAdd Shape inference Memory planning JIT compile FusedKernel Input graph Operator batching Cache hit in later iterations Rewrite graph Output FusedKernel Before Graph Execution Runtime Optimization Input Input

16 Graph Level Optimization: Operator Batching Automatically conduct GEMM fusion and static memory placement optimization M 0 [2,4] M 2 [4,1] M 1 [2,4] Concat(M 0, M 1 ) M 2 [4,1] Tensor M 0 (2,4) Tensor M 1 (2,4) O 0 [2,1] O 1 [2,1] Concat(O 0, O 1 ) Concat Tensor M 2 (4,4)

17 JIT Compilation: Kernel Fusion Leverage aggressive kernel fusion to completely remove scheduling overhead Element-wise (i.e., point-wise) operators No cross-element dependency between operators Better leverage cache, register locality h = ssssssssssssss(xx 1 + xx 2 ) h Sig Add xx 1 xx h xx 1 xx 2 global void kernel_0(float *x1, float *x2, float *h) { int idx = blockidx.x * blockdim.x + threadidx.x; if (idx < 1024) { float temp0 = x1[idx] + x2[idx]; float temp1 = sigmoidf(temp0); h[idx] = temp1; } }

18 JIT Compilation: Kernel Fusion Fuse arbitrary (non element-wise) operators in to single kernel Operator data dependency may introduce cross threads data dependency in kernel Need global synchronization to guarantee correctness Cross operator communication uses device memory E.g., fuse two matrix multiplications: Z = AA BB CC AA MM MM BB ZZ CC void kernel_0(float *A, float *B, float *C, float *Z) { if (idx < 1024) { buffer[idx] = _f(a, B); Global_Sync(); Z[idx] = _f(buffer, C); h[idx] = temp1; } }

19 Graph Computation in DL Frameworks Operators (kernels) are scheduled (launched) one by one CPU SM SM SM SM SM SM PCIe GPU block0 block1 block2 block3 block4 block5 Overhead of kernel launching, DAG scheduling, memory copy, etc. Conv block6 block7 block8 block0 block1 block2 block3 block4 block5 block6 block7 block8 block9 block1 0

20 Arbitrary Kernel Fusion Is Limited by GPU Architechture Hard to conduct global synchronization across all threads PCIe GPU CPU SM SM SM SM SM SM Conv barrier Fused as single kernel block0 Conv block6 Conv block1 Conv block7 Conv block2 Conv block8 Conv block3 Conv block9 Conv block4 Conv block10 Conv block5 Conv Never complete due to waiting for barrier Never be scheduled

21 Our Solution: Persistent Threads and Virtual Blocks Assign virtual block task to persistent threads PCIe GPU CPU SM SM SM SM SM SM block0 block1 block2 block3 block4 block5 Conv barrier vblock0 vblock1 vblock2 vblock3 vblock4 vblock5 vblock6 vblock7 vblock8 Fused as single kernel vblock0 vblock1 vblock2 vblock3 vblock4 vblock5 vblock6 vblock7 vblock8 vblock9 vblock1 0

22 Kernel Packing Explore graph level parallelism in static code generation GPU PCIe CPU SM SM SM SM SM SM [64, 512] [512, 128] block0 block1 block2 block3 block4 block5 vblock0 vblock1 Under utilization of GPU resource σσ AAAAAA vblock0 vblock1 vblock2 vblock3 vblock0 vblock1 vblock2 vblock3 vblock4 vblock5 [64, 128] vblock6

$Code Generation Code Generator Operator kernels: device functions A σσ Relu C B //kernel code generated by Wolong compiler global void kernel_0(float *A, float *B, float *C) { int idx = blockidx.$

23 Code Generation Code Generator Operator kernels: device functions A σσ Relu C B //kernel code generated by Wolong compiler global void kernel_0(float *A, float *B, float *C) { int idx = blockidx.x * blockdim.x + threadidx.x; if (idx < 1024) { float temp0 = sigmoidf(a[idx]); buffer0[idx] = temp0; } GlobalSync(); for (int tix = bx; tix < ey; tix += offx) { for (int tiy = by; tiy < ey; tiy += offy) { (buffer0, B, buffer1, 1024, 1024, 128); } } if (idx < 1024) { float temp2 = reluf(temp1); C[idx] = temp2; } } //device function for sigmoid operator device float sigmoidf(float in) { return 1.f / (1.f + expf(-in)); } //device function for relu operator device float reluf(float in) { return fmaxf(0.f, in); } Operator device functions //device function for operator device float (float *a, float *b, float *c, int m, int n, int k) { if (thread_ix < m && thread_iy < k) { float temp = 0.f; for (int i = 0; i < n; ++i) { temp += a[tix*n+i] * b[k*i+tiy]; } c[k * tix + tiy] = temp; } }, Add, Mul, Sub, Div, Relu, Sigmoid, Tanh, Split, Max, Min, Convolution, etc.

24 Performance of End-to-end Kernel Fusion RNN inference benchmark (LSTM-128uints-80steps) Fused 1686 operators into 1 kernel Wolong fused_kernel Avg Runtime(ms) LSTM Benchmark 10.9x TensorFlow XLA OpBatch Fusion Experiments are conducted on Nvidia GTX 1080 Ti GPUs

25 Conclusion A compiler infrastructure is critical for both cloud and edge AI Optimize for fast distributed training in cloud Optimize for efficient inference on accelerator devices System innovations to bridge applications and diverse hardware Common intermediate representation (IR) Co-design software and hardware for extreme efficiency Wolong prototype has demonstrated the initial improvements Up to 8x speedup on training workloads Up to 10x speedup on inference benchmark

26 Thank You! Systems Fueling future disruptions

28 Distributed Graph Optimizer of Wolong Transfer dynamically allocated tensor through RDMA write/read Phase I: graph analyzing Supports GPUDirect RDMA as well Phase II: graph execution Send HH σσ * WW 11 XX Tensor meta data 1... One-sided RDMA write One-sided RDMA read Tensor meta data 01 Allocate... Recv YY * WW 22 Source Tensor Dest Tensor Server 0 Server 1

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13 GPU FOR DEEP LEARNING chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 Why Deep Learning Boost Today? Nvidia SDK for Deep Learning? Agenda CUDA 8.0 cudnn TensorRT (GIE) NCCL DIGITS 2 Why Deep Learning