TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

Size: px

Start display at page:

Download "TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory"

Clement Anthony
6 years ago
Views:

1 TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017

2 Deep Neural Networks (NNs) Unprecedented accuracy for challenging applications System perspective: compute and memory intensive o Challenging to execute efficiently on conventional systems Many efforts for optimized NN hardware Classification Recognition Control Prediction Optimization Multicores GPUs FPGAs ASICs Clusters 2

Deep Neural Networks (NNs) ifmaps filters ofmaps No Dog Ni * = No foreach b in batch Nb foreach ifmap u in Ni foreach ofmap v in No // 2D conv O(v,b) += I(u,b) * W(u,v) + B(v) Ni Convolutional layer

3 Deep Neural Networks (NNs) ifmaps filters ofmaps No Dog Ni * = No foreach b in batch Nb foreach ifmap u in Ni foreach ofmap v in No // 2D conv O(v,b) += I(u,b) * W(u,v) + B(v) Ni Convolutional layer (CONV) o Convolutions between fmaps Fully-Connected layer (FC) o Matrix multiplication O = I W foreach b in batch Nb foreach neuron x in Nx foreach neuron y in Ny // Matrix multiply O(y,b) += I(x,b) x W(x,y) + B(v) 3

4 Domain-Specific NN Accelerators Spatial architectures of PEs High computational efficiency o 100x performance and energy efficiency improvement o Low-precision arithmetic, dynamic pruning, static compression, Memory system? foreach b in batch Nb foreach ifmap u in Ni foreach ofmap v in No // 2D conv O(v,b) += I(u,b) * W(u,v) + B(v) foreach b in batch Nb foreach neuron x in Nx foreach neuron y in Ny // Matrix multiply O(y,b) += I(x,b) x W(x,y) + B(v) Main Memory Global Buffer PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE Reg File ALU Processing Element 4

5 Memory Challenges for Large NNs Large memory footprints and high bandwidth requirements o Deep networks, large layers, complex neuron structures o More efficient computing requires higher data bandwidth Limit the performance scalability for future NNs Common solutions o Large on-chip buffers less area available for compute elements o Multiple DRAM channels high dynamic power, significant leakage 5

6 Power (W) Bandwidth (GBps) Memory Challenges for Large NNs Example: a state-of-the-art accelerator with 400 PEs needs o 1.5 MB SRAM buffer, > 70% of accelerator area o Four LPDDR3 x32 chips with > 25 GBps o 45% power in SRAM and DRAM even with low-power technologies Number of PEs Series1 Series2 Series3 Series4 Series

7 3D Memory + NN Acceleration Opportunities o High bandwidth at low access energy o Abundant parallelism (vaults, banks) Bank TSVs Key questions o What hardware architecture best exploits the high bandwidth and low energy? o How to best schedule and partition NN computations in software? Schedule: dataflow across memory hierarchy Partition: parallelism across vaults Vault (Channel) DRAM Die Logic Die Micron s Hybrid Memory Cube 7

8 TETRIS NN acceleration with 3D memory o Improves performance scalability by 4.1x over 2D o Improves energy efficiency by 1.5x over 2D Hardware architecture o Rebalance resources between PEs and buffers o In-memory accumulation High performance & low energy Alleviate bandwidth pressure Software techniques o Analytical dataflow scheduling o Hybrid partitioning Optimize buffer use Efficient parallel processing 8

9 TETRIS Hardware Architecture

10 TETRIS Architecture Associate one NN engine with each vault o PE array, local register files, and a shared global buffer NoC + routers for accesses to remote vaults All vaults can process NN computations in parallel PE PE PE PE To local vault To remote vault Router Memory Controller Global Buffer PE PE PE PE PE PE PE PE Vault DRAM Die PE PE PE PE Logic Die 10

11 Normalized Energy Normalized Runtime Resource Balancing Larger PE arrays with smaller SRAM buffers o Limited area with 3D integration o High data bandwidth needs more PEs o Low access energy & sequential access pattern PEs with 133 kb buffer (area 1:1) better performance and 6 energy # PEs / buffer capacity 0 Series1 Series2 Series3 Series4 Series5 11

12 In-Memory Accumulation Small buffer more swap to memory higher bandwidth Move simple accumulation logic close to DRAM banks o 2x bandwidth reduction for output data o Carefully choose logic location to avoid impact on DRAM array layout Memory Memory + W X Y Y W X ΔY PE array PE array Y += W * X ΔY = W * X 12

13 Scheduling and Partitioning for TETRIS

14 Dataflow Scheduling Critical for maximizing on-chip data reuse to save energy Ordering: loop blocking and reordering foreach b in batch Nb foreach ifmap u in Ni Locality in global buffer foreach ofmap v in No Non-convex, exhaustive search // 2D conv O(v,b) += I(u,b) * W(u,v) + B(v) Mapping: execute 2D conv on PE array Regfiles and array interconnect Row stationary [Chen et al., ISCA 16] 14

15 TETRIS Bypass Ordering Few data reuse opportunities in TETRIS small buffers IW bypass, OW bypass, IO bypass o Use buffer only for one stream for maximum benefit o Bypass buffer for the other two to sacrifice their reuse OW bypass ordering ifmaps ofmaps filters Chunk 0 Chunk 1 Off-chip Chunk 2 On-chip Global buffer Reg files 1. Read 1 ifmap chunk into gbuf 2. Stream ofmaps and filters to regf 3. Move ifmaps from gbuf to regf 4. Convolve 15

16 TETRIS Bypass Ordering Near-optimal schedules o With 2% difference from schedules derived from exhaustive search Analytically derived o Closed-form solution o No need for time-consuming exhaustive search NN Runtime Gap (w.r.t. optimal) Energy Gap (w.r.t. optimal) AlexNet 1.48 % 1.86 % ZFNet 1.55 % 1.83 % VGG % 0.20 % VGG % 0.16 % ResNet 2.91 % 0.78 % min A DRAM = 2 N b N o S o t i + N b N i S i + N o N i S w t b N b N i S s.t. t b t i S buf i 1 t b N b, 1 t i N i 16

17 NN Partitioning Process NN computations in parallel in all vaults Fmap partitioning o Divide a fmap into tiles o Each vault processes one tile Vault 0 Vault 1 Layer i Layer i+1 Minimum remote accesses o Prior work exclusively used fmap partitioning Vault 2 Vault 3 17

18 NN Partitioning Process NN computations in parallel in all vaults Output partitioning o Partition all ofmaps into groups o Each vault processes one group Vault 2 Layer i Layer i+1 Vault 3 Better filter weight reuse Fewer total memory accesses Vault 0 Vault 1 18

19 TETRIS Hybrid Partitioning Combine fmap partitioning and output partitioning Total energy = NoC energy + DRAM energy o Balance between minimizing remote accesses and total DRAM accesses Difficulties o Design space exponential to # layers Greedy algorithm reduces to be linear to # layers o Involved dataflow scheduling to determine total DRAM accesses Bypass ordering to quickly estimate total DRAM accesses 19

20 TETRIS Evaluation

21 Methodology Five state-of-the-art NNs o AlexNet, ZFNet, VGG16, VGG19, ResNet o MB total memory footprint for each NN o Up to 152 layers in ResNet, one of the largest NNs today 2D and 3D accelerators with one or more NN engines o 2D engine: 16 x 16 PEs, 576 kb buffer, one LPDDR3 channel 8.5 mm 2, 51.2 Gops/sec Bandwidth-constrained o 3D engine: 14 x 14 PEs, 133 kb buffer, one HMC vault 3.5 mm 2, 39.2 Gops/sec Area-constrained 21

22 Normalized Runtime Single-Engine Comparison Up to 37% performance improvement with TETRIS o Due to higher bandwidth despite smaller PE array 1.2 Large NNs benefit more! AlexNet ZFNet VGG16 VGG19 ResNet 22

23 Normalized Energy Single-Engine Comparison 35 40% energy reduction with TETRIS o Smaller on-chip buffer, better scheduling 1.2 From SRAM & DRAM, and static AlexNet ZFNet VGG16 VGG19 ResNet Series5 Series4 Series3 Series2 Series1 23

24 Normalized Runtime Multi-Engine Comparison 4 2D engines: 34 mm 2, pin constrained (4 LPDDR3 channels) 16 3D engines: 56 mm 2, area constrained (16 HMC vaults) 4.1x performance gain 2x compute density AlexNet ZFNet VGG16 VGG19 ResNet 24

25 Normalized Energy Multi-Engine Comparison 1.5x lower energy o 1.2x from better scheduling and partitioning 4x computation only costs 2.7x power Series5 Series4 Series3 Series2 Series AlexNet ZFNet VGG16 VGG19 ResNet 25

26 TETRIS Summary A scalable and efficient NN accelerator using 3D memory o 4.1x performance and 1.5x energy benefits over 2D baseline Hardware features o PE/buffer area rebalancing o In-memory accumulation Software features o Analytical dataflow scheduling o Hybrid partitioning 26

27 Thanks! Questions?

Scaling Neural Network Acceleration using Coarse-Grained Parallelism

Scaling Neural Network Acceleration using Coarse-Grained Parallelism Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2018 Neural Networks (NNs)