Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
Transcription
1 The 51st Annual IEEE/ACM International Symposium on Microarchitecture. Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture. Byungchul Hong, Yeonju Ro, John Kim (FuriosaAI, Samsung Research, KAIST).
2 Neural Network Training: Scalability matters. "Today the job of training machine learning models is limited by compute; if we had faster processors we'd run bigger models." - Greg Diamos, Senior Researcher, Baidu (EE Times, Sep 2016). Nvidia DGX-2: designed to train the previously impossible. (Images: Nvidia DGX-1, Nvidia DGX-2.)
3 Neural Network Training: Scalability matters. Data-parallel training is widely used for CNNs: it distributes the inputs and duplicates the weights, and the weight update is done through collective communication. (Figure: an input batch split across workers W0..W3; the weight update is performed with a Reduce followed by a Broadcast.)
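To make the reduce-and-broadcast weight update on the slide above concrete, here is a minimal NumPy sketch of one synchronous data-parallel step; the gradient function, learning rate, and in-memory "workers" are illustrative placeholders, not part of the original talk.

```python
import numpy as np

def data_parallel_step(weights, worker_batches, grad_fn, lr=0.01):
    """One synchronous step: every worker holds a full copy of the weights,
    computes a gradient on its own shard of the input batch, then the
    gradients are reduced and the updated weights are broadcast back."""
    grads = [grad_fn(weights, batch) for batch in worker_batches]  # per-worker compute
    reduced = sum(grads) / len(grads)                              # Reduce (average gradients)
    new_weights = weights - lr * reduced                           # single weight update
    return [new_weights.copy() for _ in worker_batches]            # Broadcast to every worker
```

Each worker contributes a gradient the size of the full weight set, which is why the communication per worker stays roughly constant as workers are added (next slide).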
4 Scalable Parallel Training of CNN. Data parallel: a constant amount of communication per worker. The common approach to making data-parallel training scalable is increasing the batch size; however, a large batch can lower convergence stability or require longer training [Masters, arXiv 18], [Hoffer, NIPS 17]. We instead assume a moderate fixed batch size (256), where communication time can limit scalability, and we explore a new parallelism dimension and interconnect architecture for scalable parallel training. (Figure: input batch across workers W0..W3 with the Reduce/Broadcast weight update.)
5 Contents: Background / Motivation; Winograd transform and intra-tile parallelism; Multi-dimensional Parallel Training; Hybrid interconnect topology and dynamic clustering; Evaluation; Conclusion.
6 Tile-based Winograd Transform for CNN. Convolution is replaced by dot products in the Winograd domain: a 2x~4x reduction in multiplications, but an increase in memory access. Supported by Nvidia cuDNN, Intel Nervana, and NNPACK (Caffe2, PyTorch, MXNet); see the Winograd Layer [Li, arXiv 17]. (Figure: spatial-domain w and x are transformed into Winograd-domain W and X; their product Y is inverse-transformed back to the spatial-domain activation y. T: tile with T x T elements.)
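To illustrate the tile-based transform sketched above, here is a minimal NumPy example of a single F(2x2, 3x3) Winograd tile, assuming the standard transform matrices from Lavin and Gray; it shows one tile of one channel only, not the full Winograd layer used in the paper.

```python
import numpy as np

# Standard F(2x2, 3x3) transform matrices (Lavin & Gray).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def winograd_tile(d, g):
    """One 4x4 input tile d and one 3x3 filter g -> one 2x2 output tile."""
    X = BT @ d @ BT.T        # input tile into the Winograd domain
    W = G @ g @ G.T          # filter into the Winograd domain
    Y = X * W                # element-wise product (across channels this becomes dot products)
    return AT @ Y @ AT.T     # inverse transform back to the spatial domain

# Sanity check against direct (valid) correlation on one tile.
d = np.random.randn(4, 4).astype(np.float32)
g = np.random.randn(3, 3).astype(np.float32)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_tile(d, g), direct, atol=1e-3)
```

Each 2x2 output tile costs 16 element-wise multiplications instead of 36, which is where the 2x~4x multiplication reduction quoted above comes from.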
7 Intra-tile Parallelism in Transformed Convolution. The computation in the Winograd domain is element-wise independent across tile elements, which exposes intra-tile parallelism. (Figure: direct convolution of x with w producing y, versus transformed convolution of X with W producing Y one tile element at a time.)
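The element-wise independence on this slide is what MPT later exploits: after the transform, every tile-element position (u, v) has its own channel-wise GEMM that can be computed without touching any other position. A minimal sketch of one worker's share follows, assuming the batched-GEMM formulation of the Winograd layer; the tensor layout and names are illustrative.

```python
def winograd_domain_compute(X_wino, W_wino, my_elements):
    """Intra-tile parallelism: this worker computes only the tile-element
    positions assigned to it.  X_wino[u, v] is (n_tiles x I channels) and
    W_wino[u, v] is (I x J channels); GEMMs at different (u, v) are independent."""
    Y_partial = {}
    for (u, v) in my_elements:
        Y_partial[(u, v)] = X_wino[u, v] @ W_wino[u, v]
    # The results for all (u, v) must be gathered before the inverse transform,
    # which is the tile transfer that MPT introduces.
    return Y_partial
```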
8 Multi-dimensional Parallel Training (MPT): data parallelism + intra-tile parallelism. Workers are arranged in a grid where a row = cluster (intra-tile parallel) and a column = group (data parallel). Compared with data-parallel training, MPT requires a tile transfer in addition to the weight communication within a group. (Figure: data-parallel training versus multi-dimensional parallel training over the worker grid, with weight W and feature-map tiles.)
9 Hybrid Interconnect Network Architecture (16 workers per group). The two communication types (weight and tile) have different traffic patterns, motivating a hybrid interconnect topology: Ring + high-radix. A high-radix flattened-butterfly (FBFLY) network serves as the intra-cluster network for all-to-all communication, and a ring serves as the intra-group network for collective communication. (Figure: worker groups gr0..gr15 attached to the host CPU.)
10 Challenges of MPT. Early layers have larger feature maps and smaller weights, so tile transfer >> weight communication (MPT is bad). Late layers have smaller feature maps and larger weights, so weight communication >> tile transfer (MPT is good). MPT helps for late layers, but the huge amount of tile transfer in a few early layers can degrade overall performance.
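A back-of-envelope sketch of why the balance flips with depth, assuming F(2x2, 3x3) tiling, FP32 data, and the 256-image batch used later in the evaluation; these are rough per-step data volumes, not the exact traffic of any particular MPT configuration.

```python
def layer_volumes_mb(h, w, c_in, c_out, batch=256, tile=4, out_tile=2, bytes_per=4):
    """Rough per-step volumes: transformed activation tiles vs. 3x3 weights."""
    n_tiles = ((h + out_tile - 1) // out_tile) * ((w + out_tile - 1) // out_tile)
    tiles_mb = batch * c_in * n_tiles * tile * tile * bytes_per / 2**20
    weights_mb = c_in * c_out * 3 * 3 * bytes_per / 2**20
    return tiles_mb, weights_mb

print(layer_volumes_mb(56, 56, 128, 128))  # Early (ResNet-34 conv2_x): ~1568 MB vs ~0.6 MB
print(layer_volumes_mb(7, 7, 512, 512))    # Late-1 (ResNet-34 conv5_x): ~128 MB vs ~9 MB
```

The tile-to-weight ratio shrinks by roughly 200x from the early to the late layer; once the weight gradients also have to be reduced across hundreds of workers every step, the late layers become the regime where MPT pays off.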
11 Dynamic Clustering to Minimize the Overhead of MPT. There are multiple possible configurations of MPT. Dynamic clustering reconfigures the MPT organization through different routing among the workers to minimize the overall communication per layer. (Figure: data parallel = MPT with 1 group; dynamic clustering selects among MPT with 1, 2, or 4 groups.)
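A minimal sketch of the per-layer decision described on this slide: among a small set of legal MPT organizations, pick the one with the lowest estimated communication for that layer. The cost functions here are placeholders; the paper's selection accounts for the actual topology and routing.

```python
def choose_groups(layer, candidate_groups, tile_cost_mb, weight_cost_mb):
    """Dynamic clustering: per layer, pick the number of data-parallel groups
    that minimizes estimated tile transfer plus weight communication."""
    return min(candidate_groups,
               key=lambda g: tile_cost_mb(layer, g) + weight_cost_mb(layer, g))

# e.g. one configuration per layer, chosen ahead of training:
# schedule = {name: choose_groups(layer, [1, 2, 4, 16], tile_cost_mb, weight_cost_mb)
#             for name, layer in conv_layers.items()}
```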
12 Interconnect Network with Dynamic Clustering (16 workers per group). Convolution layers have different layer structures, so dynamic clustering balances the weight and tile communication per layer, exploiting the connectivity through the host CPU. (Figure: configurations with # groups = 16, 4, and 1, each connected through the host CPU.)
13 Scalable Near-Data Processing Architecture (16 HMCs per group). NDP is used for each worker: 1) to utilize the higher bandwidth of 3D-stacked memory (the Winograd transform reduces computation but increases data access), and 2) to exploit the high-speed serial links for the hybrid interconnect topology. (Figure: the worker is built on the logic layer beneath the DRAM layers; compute consists of a systolic array (MAC) and a vector processor with scratchpads and input/output buffers, control consists of a task scheduler with task-graph dependency checking, and communication consists of p2p and collective communication units connected to DRAM and the I/O links through on-chip control and data networks.)
14 More Details in the Paper. Activation prediction with non-uniform quantization: quantized data is transferred to predict the activations, the transfer of tile data that transforms to non-activated neurons is skipped, and there is no accuracy loss with conservative prediction. Micro-architecture for the communication logic: sparse data transfer for tile transfer with activation prediction, and concurrent collective communication for reduction.
15 Evaluation - Workloads. Layer-wise evaluation for detailed analysis:
  Abbr.   CNN        Layer     I, J (# channels)  x(y) dim.  w dim.
  Early   ResNet-34  conv2_x   128, 128           56x56      3x3
  Mid-1   ResNet-34  conv4_x   256, 256           14x14      3x3
  Mid-2   WRN        conv3     320, 320           16x16      3x3
  Late-1  ResNet-34  conv5_x   512, 512           7x7        3x3
  Late-2  WRN        conv4     640, 640           8x8        3x3
Full CNN evaluation for overall impact and comparison, using CNNs with large parameter (weight) sizes:
  Network                            Configuration
  Wide ResNet [Zagoruyko, arXiv 16]  WRN
  ResNet [He, CVPR 16]               ResNet-34
  FractalNet [Larsson, arXiv 16]     4 blocks, 4 columns
16 Evaluation - Methodology. Evaluated system: 1 CPU and 256 NDP modules; network (Ring + FBFLY) with a 1 GHz router clock; 3D-stacked memory with 320 GB/sec bandwidth; the worker model is implemented on a cycle-accurate network simulator. System configurations:
  d_dp    Direct convolution with data parallelism (update w)
  w_dp    Winograd convolution with data parallelism (update w)   [baseline]
  w_mp    Winograd convolution with MPT (update W)
  w_mp+*  w_mp + dynamic clustering + activation prediction       [proposed]
17 Evaluation - Layer-wise Results. (Charts: forward and backward execution time and energy for d_dp, w_dp (baseline), w_mp, and w_mp+* across the Early through Late-2 layers, broken down into Compute / SRAM / DRAM / Link, with geometric means; annotated gmean values include 0.84x and 4.1x.)
18 Evaluation - Full CNN Results. Comparison with a DGX-1 system (8 Volta GPUs with 6 NVLinks); the NCCL library is used for collective communication, batch size = 256; power consumption is similar with 256 NDP workers. (Chart: images/sec on a log scale for 8gpu, 1ndp, 256ndp w_dp, 256ndp w_mp, and 256ndp w_mp+* on Wide ResNet, ResNet-34, and FractalNet; annotated speedups include 80x, 168x, and 221x.) Overall: a 2.7x performance increase (up to 6x for late layers) from the different parallelism with the same worker architecture.
19 Conclusion. Exploited the element-wise computation (intra-tile parallelism) of Winograd-transformed convolution for scalable training. Proposed multi-dimensional parallel training (MPT), which combines data parallelism and intra-tile parallelism. MPT creates a new type of communication (tile transfer) with a very different traffic pattern from weight communication. Proposed a hybrid interconnect topology (ring + high-radix). Proposed dynamic clustering and activation prediction to minimize the overhead of MPT (details in the paper). The result is a 2.7x performance increase over data parallelism with the same NDP architecture (up to 6x for late layers).
20 Thank you
21 Back-up slides
22 Background: Training of CNN. Synchronous Stochastic Gradient Descent (S-SGD) updates the weights once per mini-batch and is composed of three phases: forward propagation (fprop), backward propagation (bprop), and the weight-gradient update (updategrad). (Figure: input feature map x, weight w, and output feature map y flowing through fprop, bprop, and updategrad.)
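A minimal schematic of the three phases for a single layer in one synchronous mini-batch, with the loss gradient and the layer callbacks left abstract; the names follow the slide (fprop, bprop, updategrad).

```python
def s_sgd_minibatch(w, x, lr, fprop, loss_grad, bprop, updategrad):
    """One synchronous mini-batch for one layer: fprop, bprop, then a single update."""
    y = fprop(x, w)          # forward propagation
    dy = loss_grad(y)        # gradient of the loss w.r.t. the layer output
    dx = bprop(dy, w)        # backward propagation (flows on to earlier layers)
    dw = updategrad(dy, x)   # weight-gradient phase
    return w - lr * dw, dx   # weights updated once per mini-batch
```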
23 Winograd Transform - updategrad phase: no transformation. The Winograd-domain data X and Y are stored to DRAM during the fprop and bprop phases, and the weights are updated directly in the Winograd domain. (Figure: the old W, together with X and Y, is loaded from DRAM, and updategrad produces the new W in the Winograd domain.)
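A minimal sketch of the updategrad phase staying in the Winograd domain, reusing the batched-GEMM layout from the intra-tile sketch earlier (X_wino[(u, v)] is n_tiles x I, dY_wino[(u, v)] is n_tiles x J, W_wino[(u, v)] is I x J); this illustrates the idea, not the paper's kernel.

```python
def updategrad_winograd(W_wino, X_wino, dY_wino, lr):
    """Weights are updated directly in the Winograd domain: one independent
    weight-gradient GEMM per tile element, with no inverse transform."""
    for (u, v) in W_wino:
        dW = X_wino[(u, v)].T @ dY_wino[(u, v)]   # (I x n_tiles) @ (n_tiles x J) -> (I x J)
        W_wino[(u, v)] -= lr * dW
    return W_wino
```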
24 Winograd Transform with 4D Tensors. Input feature maps x (B batches, I channels), weights w (I channels, J channels), and output feature maps y (B batches, J channels) are transformed into the Winograd-domain tensors X (B, t tiles, I channels), W (I, J channels), and Y (B, t tiles, J channels), each indexed by tile-element position (u, v); T: tile with T x T elements.
25 Challenges of MPT. Early layer: larger feature map, smaller weights, so tile transfer > weight communication. Late layer: smaller feature map, larger weights, so weight communication > tile transfer. (Charts: communication amount (MB) versus number of workers (p) for data parallel (weights), MPT (tiles), and MPT (weights), for an early and a late layer.)
26 Activation Prediction Without Accuracy Loss. A simple idea: the inverse transform of a tile to spatial-domain neurons is not necessary if those spatial-domain neurons are all non-activated. The source worker applies non-uniform quantization; the destination worker calculates the estimated value and the maximum quantization error and skips the data transfer for non-activated tiles, using a conservative prediction (predict non-activated if est. value < max. error).
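One plausible reading of the conservative check above, for a ReLU layer: the destination skips the full tile transfer only when the estimate reconstructed from the quantized data, even after adding the worst-case quantization error, is still non-positive, so the spatial-domain neuron is guaranteed to be non-activated. This decision rule is my sketch, not the paper's exact hardware logic.

```python
def skip_tile_transfer(y_est, max_quant_err):
    """Conservative activation prediction: skip only when the true
    pre-activation value cannot possibly be positive."""
    return y_est + max_quant_err <= 0.0   # the ReLU output is then provably zero
```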
27 Activation Prediction Without Accuracy Loss. Non-uniform quantization is used so that the quantization levels follow the distribution of the tile data values; a hardware logic block implements the non-uniform quantization for single-precision floats.
28 Activation Prediction Without Accuracy Loss. (Figure: the source worker's quantizer and the destination worker's prediction datapath; the destination decodes the quantized values, applies a 1D or 2D transform, and compares the estimated value y(est.) against the maximum possible error to predict activation before tile assembly.)
29 Activation Prediction Without Accuracy Loss. The best result was obtained with 4-region quantization, which reduced the tile transfer to 53.6~60% of the original.
30 Scalable Near-Data Processing Architecture - Communication Units. The p2p communication unit is used for tile transfer: tile gathering with activation prediction requires data packing (compression), so a pointer-based, partially movable shift register is used. The collective communication unit is used for collective communication: ring-based (pipelined) reduce/broadcast, with concurrent collective communication supported by multiple reduce logic blocks and buffers. (Figure: datapath including the Winograd transform, quantizer, prediction, activation-map packing, packet generators, communication buffers, DRAM DMA and address generation, a reduce crossbar for dynamic clustering, and the links.)
31 Evaluation - Full CNN Results. Full CNN evaluation comparing against a state-of-the-art multi-GPU system: Nvidia DGX-1 (8 Volta GPUs + NVLink), NCCL collective communication, TensorFlow. (Charts: images/sec on a log scale for 1/2/4/8 GPUs, 1 NDP, 256 NDP (w_dp), and 256 NDP (w_mp++) on Wide ResNet, ResNet-34, and FractalNet with batch size fixed at 256, annotated speedups including 68x, 80x, 168x, 183x, and 221x; and normalized images/sec per Watt for 8gpu, 256 NDP (w_dp), and 256 NDP (w_mp++), with the batch size increased (2K, 4K) to show the best performance for the multi-GPU system.)
32 Results - Applying MPT to a Conventional Network (Ethernet) and a Parameter-Server System. The communication reduction has a larger impact on a low-bandwidth system. (Figure: parameter servers (PS) connected through a network of switches to the worker groups, with transform and activation prediction at the workers; chart of 1/time over 5 layers for w_dp and w_mp++ on the high-BW and low-BW systems.) Performance increase of MPT compared to data parallel: 2.8x (high-BW system) and 4.1x (low-BW system).