Scaling Neural Network Acceleration using Coarse-Grained Parallelism

1 Scaling Neural Network Acceleration using Coarse-Grained Parallelism
Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis
Stanford University
Platform Lab Review, Feb 2018

2 Neural Networks (NNs)
Unprecedented accuracy for challenging applications
  o Fully-connected, convolutional, and recurrent (LSTM) NNs
Layer-wise processing: directed acyclic graph (DAG)
[Figure: a convolutional NN (CNN) drawn as in, CONV, CONV, CONV, FC, FC, out; an LSTM cell built from FC layers and a tanh, with signals x_t, h_t, c_t]

3 Layer Computation
Convolutional (CONV) and fully-connected (FC) layers
  o FC is a CONV with 1×1 images
Layer processes a 4D tensor
  o 2D image (feature maps or fmaps), multiple channels, mini-batch
[Figure: Ifmaps * Weights = Ofmaps, with dimensions Nb (batch), Ni (input channels), No (output channels)]
  foreach b in batch Nb
    foreach ifmap i in Ni
      foreach ofmap o in No
        // 2D conv
        O(b,o) += I(b,i) * W(o,i)
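
The loop nest on this slide can be written out as a minimal pure-Python sketch. It assumes stride 1, no padding, and the cross-correlation form of "2D conv"; the helper names `conv2d_valid` and `layer_forward` are illustrative and not from the talk. The small example at the end treats an FC layer as a CONV with 1×1 fmaps.

```python
def conv2d_valid(image, kernel):
    """2D 'valid' correlation of one ifmap with one kernel (lists of lists)."""
    H, W = len(image), len(image[0])
    R, S = len(kernel), len(kernel[0])
    return [[sum(image[y + r][x + s] * kernel[r][s]
                 for r in range(R) for s in range(S))
             for x in range(W - S + 1)]
            for y in range(H - R + 1)]


def layer_forward(ifmaps, weights):
    """ifmaps[b][i] is a 2D ifmap, weights[o][i] a 2D kernel; returns ofmaps[b][o]."""
    Nb, Ni, No = len(ifmaps), len(ifmaps[0]), len(weights)
    ofmaps = [[None] * No for _ in range(Nb)]
    for b in range(Nb):              # foreach b in batch Nb
        for i in range(Ni):          # foreach ifmap i in Ni
            for o in range(No):      # foreach ofmap o in No
                partial = conv2d_valid(ifmaps[b][i], weights[o][i])
                if ofmaps[b][o] is None:
                    ofmaps[b][o] = partial
                else:                # O(b,o) += I(b,i) * W(o,i)
                    ofmaps[b][o] = [[a + p for a, p in zip(row_a, row_p)]
                                    for row_a, row_p in zip(ofmaps[b][o], partial)]
    return ofmaps


# FC as a CONV with 1x1 images: every fmap and kernel holds a single value.
x = [[[[1.0]], [[2.0]]]]                          # Nb = 1, Ni = 2
w = [[[[3.0]], [[4.0]]], [[[5.0]], [[6.0]]]]      # No = 2, Ni = 2
fc_out = layer_forward(x, w)
```

With 1×1 fmaps the inner convolution collapses to a multiply, so the loop nest reduces to a matrix-vector product per batch item.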

4 Accelerating NNs
Domain-specific NN processing engine
  o An array of specialized processing elements (PEs)
  o On-chip SRAM buffer
100x performance and energy efficiency
  o Diannao/Cambricon, Google TPU, Eyeriss, Cnvlutin, EIE, ...
[Figure labels: Main Memory, SRAM Buffer, Processing Element, Reg File, ALU]

5 Scaling Performance
Want higher performance? Use more PEs!
Monolithic engine
  o Low resource utilization
  o Long array buses
  o Far from SRAM
Tiled architecture
  o Workload parallelization?
  o Data communication?
  o Focus of this work

6 Exploiting Coarse-Grained Parallelism
Intra-layer parallelism
  o Partition/share data
  o Data reuse across engines → energy efficiency
  o Avoid on-chip data duplication → smaller buffer area
Inter-layer pipelining
  o Forward data
  o Reduce pipeline stalls → performance
  o Manage intermediate data buffering → smaller buffer area
[Figure: one layer (Ifmaps * Weights = Ofmaps, dimensions Nb, Ni, No) and a layer chain in, CONV, CONV, CONV, FC, FC, out]

7 Intra-Layer Parallelism

8 Parallelizing a Single Layer
  foreach b in batch Nb
    foreach ifmap i in Ni
      foreach ofmap o in No
        // 2D conv
        O(b,o) += I(b,i) * W(o,i)
[Figure: Ifmaps * Weights = Ofmaps (dimensions Nb, Ni, No) under Batch, Input, Output, and Fmap partitioning]

Scheme   Ifmaps         Ofmaps       Weights
Batch    Partitioned    Partitioned  Shared
Input    Partitioned    Shared       Partitioned
Output   Shared         Partitioned  Partitioned
Fmap     With overlaps  Partitioned  Shared

Goals:
  1. Reduce sharing
  2. Optimize sharing
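
The Batch, Input, and Output rows of the scheme table follow directly from which loop dimension is split across engines. The sketch below derives them from the tensor indices in `O(b,o) += I(b,i) * W(o,i)`. The names `split`, `TENSOR_DIMS`, `SPLIT_DIM`, and `sharing` are illustrative assumptions, and the Fmap scheme is left out because it splits the spatial dimensions, which needs overlapping (halo) regions rather than a clean index split.

```python
def split(n, parts):
    """Split range(n) into `parts` contiguous, near-equal ranges (one per engine)."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# Which loop indices each tensor is addressed by: I(b,i), O(b,o), W(o,i).
TENSOR_DIMS = {"ifmaps": ("b", "i"), "ofmaps": ("b", "o"), "weights": ("o", "i")}
# Which loop each scheme distributes across engines.
SPLIT_DIM = {"batch": "b", "input": "i", "output": "o"}


def sharing(scheme):
    """A tensor is partitioned if it is indexed by the split dimension, else shared."""
    dim = SPLIT_DIM[scheme]
    return {tensor: ("partitioned" if dim in dims else "shared")
            for tensor, dims in TENSOR_DIMS.items()}


batch_rows = sharing("batch")
input_rows = sharing("input")
output_rows = sharing("output")
engine_batches = split(10, 4)   # e.g. a mini-batch of 10 over 4 engines
```

In each of these three schemes exactly one tensor ends up shared, which is the data the following slides try to reduce and then to move efficiently.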

9 Reducing Sharing: Hybrid Parallelization
First CONV layers (fmaps >> weights) → batch/fmap
Last FC layers (fmaps << weights) → input/output
Middle layers (fmaps ≈ weights)
  o An optimization problem to minimize accesses to DRAM & NoC
  o Use greedy search algorithm to find a hybrid parallelization
[Table: partitioning schemes (Batch, Input, Output, Fmap), as on slide 8]
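
A greedy search of this kind might look like the sketch below. It is not the authors' algorithm or cost model: the cost function is a deliberately crude assumption (a tensor that a split dimension does not index must be delivered to every engine along that split), the engine count is assumed to be a power of two, the Fmap scheme is omitted, and `greedy_hybrid`, `limits`, and the data sizes are hypothetical.

```python
def greedy_hybrid(ifmap_size, ofmap_size, weight_size, engines, limits):
    """Greedily pick split factors (batch, input, output) whose product is `engines`.

    Toy cost (an assumption, not the talk's model): ifmaps are shared across the
    output split, ofmaps across the input split, weights across the batch split.
    """
    def cost(f):
        return (ifmap_size * (f["output"] - 1)
                + ofmap_size * (f["input"] - 1)
                + weight_size * (f["batch"] - 1))

    factors = {"batch": 1, "input": 1, "output": 1}
    remaining = engines
    while remaining > 1:
        candidates = []
        for dim in factors:
            if factors[dim] * 2 <= limits[dim]:      # cannot split past the dimension size
                trial = dict(factors, **{dim: factors[dim] * 2})
                candidates.append((cost(trial), dim))
        if not candidates:
            raise ValueError("not enough parallelism for this many engines")
        _, best = min(candidates)                    # cheapest doubling wins
        factors[best] *= 2
        remaining //= 2
    return factors


# Early CONV layer: fmaps >> weights (sizes are made-up units).
early_conv = greedy_hybrid(1_000_000, 1_000_000, 10_000, 16,
                           {"batch": 64, "input": 64, "output": 64})
# Late FC layer: fmaps << weights.
late_fc = greedy_hybrid(10_000, 10_000, 16_000_000, 16,
                        {"batch": 16, "input": 4096, "output": 4096})
```

Even with this crude cost, the search reproduces the trend on the slide: the fmap-heavy layer is split along the batch dimension, and the weight-heavy layer is split along the input and output dimensions.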

10 Optimizing Sharing: Buffer Sharing
Skew computation order of engines
  o All engines start in parallel → high throughput
  o No on-chip data duplication → low area
Rotate buffered data between engines
Optimal distributed buffer & Buffer Sharing dataflow
  o Fully reuse shared data → low energy
[Figure labels: W[0][0], W[0][1], W[1][0], W[1][1]; I[0][0], I[0][1], I[1][0], I[1][1]; O[0][0], O[0][1], O[1][0], O[1][1]]
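
The skew-and-rotate idea can be illustrated with a small schedule simulation. This is a sketch under assumptions: the shared data is cut into one chunk per engine, engines are arranged in a ring, and `rotation_schedule` is a hypothetical helper. At step t, engine e works on chunk (e + t) mod N, so every engine starts immediately on a different chunk and each chunk lives in exactly one buffer at a time.

```python
def rotation_schedule(num_engines):
    """schedule[t][e] = index of the shared-data chunk held by engine e at step t."""
    return [[(e + t) % num_engines for e in range(num_engines)]
            for t in range(num_engines)]


N = 4
schedule = rotation_schedule(N)

# No on-chip duplication: at every step each chunk sits in exactly one engine.
no_duplication = all(len(set(step)) == N for step in schedule)

# Full reuse: over N steps every engine has used every chunk once.
full_reuse = all(sorted(schedule[t][e] for t in range(N)) == list(range(N))
                 for e in range(N))

# Rotation: the chunk an engine uses next is the one its ring neighbor just used.
neighbor_only = all(schedule[t + 1][e] == schedule[t][(e + 1) % N]
                    for t in range(N - 1) for e in range(N))
```

The three checks correspond to the three claims on the slide: no duplication (area), full reuse of shared data (energy), and all engines busy from the first step (throughput).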

11 Methodology
State-of-the-art NNs
  o CNNs: AlexNet, VGGNet, GoogLeNet, ResNet
  o MLPs & LSTMs: medium and large scales
Hardware
  o Engine: Eyeriss [ISCA'16], 8×8 PEs, 32 kB buffer, 500 MHz
  o Off-chip memory: LPDDR3-1600, 4 channels

12 Buffer Sharing Results
NoCoord: same engine resources, but no coordination between engines
BufShr: 1.7x performance, 47% less energy, and 20% less area (if using 4x smaller SRAM)
With 64 engines (70 mm², 8×8 tiled)
[Figure: Performance and Energy charts comparing NoCoord, BufShr, and BufShr-buf/4]

13 Inter-Layer Pipelining

14 Pipelining Multiple Layers
Avoid off-chip access of intermediate data
  o Save DRAM bandwidth and energy
Challenges
  o Pipeline filling/draining delays: inter-layer data dependencies
  o SRAM buffer capacity: fully store intermediate data
[Figure labels: Layer 0, Layer 1, Layer 2, Layer 3]

15 Optimizing Pipelining Dataflow
Pipeline filling/draining delays
  o Batch pipelining
  o Alternate layer loop ordering (ALLO)
Buffer usage for intermediate data
Optimized pipelining dataflow scheme
  o Intra-layer buffer sharing
  o Alternate layer loop ordering (ALLO)
  o Fmap temporal partitioning (FMTP)

16 Ordering Layer Loops
(A Single Layer)
[Figure: Ifmaps * Weights = Ofmaps, with dimensions Nb, Ni, No]
Ofmap loop outermost: take all ifmaps, generate ofmaps sequentially
  foreach ofmap o in No
    foreach ifmap i in Ni
      // 2D conv
      O(o) += I(i) * W(o,i)
Ifmap loop outermost: take ifmaps sequentially, generate all ofmaps
  foreach ifmap i in Ni
    foreach ofmap o in No
      // 2D conv
      O(o) += I(i) * W(o,i)
[Figure: ifmap and ofmap activity over time for each loop order]
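
Both loop orders compute the same layer and differ only in when data is needed and when results are finished. The sketch below makes that visible by recording an access trace. To keep it short, each fmap is assumed to be a single number, and `ofmap_outer` and `ifmap_outer` are illustrative names.

```python
def ofmap_outer(I, W):
    """Take all ifmaps, generate ofmaps sequentially."""
    O, trace = [0] * len(W), []
    for o in range(len(W)):
        for i in range(len(I)):
            O[o] += I[i] * W[o][i]
            trace.append((i, o))      # (ifmap read, ofmap updated)
    return O, trace


def ifmap_outer(I, W):
    """Take ifmaps sequentially, generate all ofmaps."""
    O, trace = [0] * len(W), []
    for i in range(len(I)):
        for o in range(len(W)):
            O[o] += I[i] * W[o][i]
            trace.append((i, o))
    return O, trace


I = [1, 2, 3]                  # Ni = 3
W = [[1, 0, 2], [3, 1, 1]]     # No = 2, indexed W[o][i]
out_a, trace_a = ofmap_outer(I, W)
out_b, trace_b = ifmap_outer(I, W)
```

In `trace_a` the first ofmap is complete after Ni steps, but every ifmap has already been read by then. In `trace_b` each ifmap is used in one contiguous burst and can be discarded afterwards, but no ofmap is final until the last ifmap has been processed.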

17 Alternate Layer Loop Ordering (ALLO)
(Across Layers)
Unoptimized
  o Buffer for ALL fmaps between layers
  o Latency for ALL fmaps
Optimized
  o Buffer for ONE fmap between layers
  o Latency for ONE fmap
[Figure: layer timelines for the unoptimized and optimized orderings]
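
A two-layer sketch of the effect, using Python generators to stand in for on-chip forwarding. The assumptions are the same as for a single layer above (scalar fmaps, no activation function between layers), and all function names are hypothetical. The upstream layer emits finished ofmaps one at a time; the downstream layer either consumes them as a stream (alternated order) or has to collect all of them first (same order in both layers).

```python
def layer_emit_ofmaps(ifmaps, W):
    """Upstream layer, ofmap loop outermost: yields one finished ofmap at a time."""
    for row in W:
        yield sum(x * w for x, w in zip(ifmaps, row))


def layer_consume_stream(stream, W):
    """Downstream layer, ifmap loop outermost (the alternated order)."""
    O, peak_buffered = [0] * len(W), 0
    for i, x in enumerate(stream):
        peak_buffered = max(peak_buffered, 1)   # only the current fmap is held
        for o in range(len(W)):
            O[o] += x * W[o][i]
    return O, peak_buffered


def layer_consume_all(stream, W):
    """Downstream layer, ofmap loop outermost: needs every ifmap before finishing."""
    buffered = list(stream)                     # holds ALL intermediate fmaps
    O = [sum(x * w for x, w in zip(buffered, row)) for row in W]
    return O, len(buffered)


x = [1, 2]
W1 = [[1, 1], [2, 0], [0, 3]]    # 2 -> 3 fmaps
W2 = [[1, 0, 1], [0, 2, 1]]      # 3 -> 2 fmaps
allo_out, allo_peak = layer_consume_stream(layer_emit_ofmaps(x, W1), W2)
base_out, base_peak = layer_consume_all(layer_emit_ofmaps(x, W1), W2)
```

Both variants produce the same result, but the alternated order holds one intermediate fmap at a time while the other must hold all three, which mirrors the "ONE fmap" versus "ALL fmaps" contrast on the slide.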

18 Selecting Pipeline Depth
Deeper pipeline
  o Fewer inter-layer DRAM accesses
Shallower pipeline
  o More on-chip SRAM per layer
Auto-tuner
  o Optimize for total time and energy
  o Use beam search algorithm
  o Works for MLPs, CNNs, and RNNs
[Figure: VGGNet energy and time vs. pipeline depth (# layers)]
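
The depth trade-off can be sketched as a search over how to cut a chain of layers into pipeline segments. Everything below is an assumption for illustration, not the talk's auto-tuner: the layer chain is linear, `toy_cost` is a made-up model (segment inputs and outputs go to DRAM, intermediates stay on chip until they exceed a hypothetical SRAM capacity), and `beam_search_segments` keeps the best few partial plans per layer position.

```python
FMAP_SIZES = [8, 8, 4, 4, 2, 1]   # hypothetical data sizes at the 6 layer boundaries
SRAM = 6                          # hypothetical on-chip capacity for intermediates


def toy_cost(start, end):
    """Cost of pipelining layers start..end-1 as one segment (made-up model)."""
    dram = FMAP_SIZES[start] + FMAP_SIZES[end]       # segment input + output go off-chip
    intermediates = sum(FMAP_SIZES[start + 1:end])   # forwarded on-chip within the segment
    spill = max(0, intermediates - SRAM)             # overflow is written and read back
    return dram + 2 * spill


def beam_search_segments(num_layers, segment_cost, max_depth, beam_width=4):
    """Return (total_cost, segments), keeping `beam_width` partial plans per position."""
    beams = {0: [(0, [])]}                           # position -> [(cost, segments)]
    for pos in range(num_layers):
        for cost, segs in beams.get(pos, []):
            for depth in range(1, max_depth + 1):
                end = pos + depth
                if end > num_layers:
                    break
                bucket = beams.setdefault(end, [])
                bucket.append((cost + segment_cost(pos, end), segs + [(pos, end)]))
                bucket.sort(key=lambda plan: plan[0])
                del bucket[beam_width:]              # prune to the beam width
    return min(beams[num_layers], key=lambda plan: plan[0])


best_cost, best_segments = beam_search_segments(5, toy_cost, max_depth=3)
```

Because this toy cost is purely additive per segment, pruning by position loses nothing and the search is exact even with a narrow beam. A beam earns its keep when the cost of a segment depends on earlier choices, as it plausibly does when time and energy are optimized together.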

19 Layer Pipelining Results
NoPipe: no pipelining, only intra-layer parallelism
LayerPipe: 1.5x performance, 30% less energy
With 256 engines (200 mm², tiled)
[Figure: Performance and Energy charts comparing NoPipe and LayerPipe]

20 Overall Evaluation

21 Buffer Sharing + Layer Pipelining
With 256 engines (200 mm², tiled)

                         BufShr  LayerPipe  Both
Performance improvement  4.6x    1.6x       5.2x
Energy saving            72%     35%        78%

[Figure: Performance and Energy charts comparing Baseline, BufShr, LayerPipe, and Both]

22 Monolithic vs. Tiled
With same resources (16384 PEs and 8 MB SRAM)

                         Monolithic  Tiled
Performance improvement  3.3x        19.3x
Energy saving            <1%         43%

[Figure: Performance and Energy charts with bars labeled E, M, and T]
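
A quick arithmetic check ties these numbers to the rest of the deck. It assumes that this comparison uses the 256-engine configuration from the earlier result slides, that each engine has an 8×8 PE array, and that the per-engine buffer on the Methodology slide is 32 kilobytes; none of this is stated on this slide itself.

```python
ENGINES = 256                  # assumed: the 256-engine configuration
PES_PER_ENGINE = 8 * 8         # assumed: 8x8 PE array per engine
BUFFER_KB_PER_ENGINE = 32      # assumed: 32 kilobytes of buffer per engine

total_pes = ENGINES * PES_PER_ENGINE                      # PEs across all engines
total_sram_mb = ENGINES * BUFFER_KB_PER_ENGINE / 1024     # SRAM across all engines, in MB

tiled_speedup, monolithic_speedup = 19.3, 3.3             # from the table on this slide
tiled_over_monolithic = tiled_speedup / monolithic_speedup
```

Under these assumptions the totals match the 16384 PEs and 8 MB quoted here, and the ratio of the two speedups comes to a little under 6, in line with the "6x better than using a monolithic engine" figure in the summary.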

23 Summary
Effectively and efficiently scale NN acceleration
  o Coarse-grained parallelism on tiled architecture
  o 6x better than using a monolithic engine
Dataflow optimizations
  o Intra-layer buffer sharing
  o Inter-layer pipelining

24 Thanks! Questions?
