Hardware for Deep Learning


1 Hardware for Deep Learning. Bill Dally, Stanford and NVIDIA. Stanford Platform Lab Retreat, June 3, 2016.

2 HARDWARE AND DATA ENABLE DNNS

3 THE NEED FOR SPEED Larger data sets and models lead to better accuracy, but they also increase computation time, so progress in deep neural networks is limited by how fast the networks can be computed. Likewise, applying convnets to low-latency inference problems, such as pedestrian detection in self-driving-car video, is limited by how fast a small set of images, possibly a single image, can be classified. More data -> bigger models -> more need for compute. But Moore's law is no longer providing more compute. Lavin & Gray, Fast Algorithms for Convolutional Neural Networks, 2015.

4 Deep Neural Network What is frames/J and frames/s/mm² for training & inference? LeCun, Yann, et al. "Learning algorithms for classification: A comparison on handwritten digit recognition." Neural networks: the statistical mechanics perspective 261 (1995): 276.

5 4 Distinct Sub-problems

|                                                   | Training   | Inference      |
| Convolutional (B x S weight reuse; act dominated) | Train Conv | Inference Conv |
| Fully-Conn. (B weight reuse; weight dominated)    | Train FC   | Inference FC   |

Training: 32b FP, large batches; minimize training time; enables larger networks.
Inference: 8b int, small (unit) batches; meet the real-time constraint.

6 Inference

7 Precision Use the smallest representation that doesn't sacrifice accuracy (32b FP -> 4-6 bits quantized)

8 Number Representation

| Format | Bits (sign/exponent/mantissa) | Range            | Accuracy  |
| FP32   | 1 / 8 / 23                    | 10^-38 - 10^38   | 0.000006% |
| FP16   | 1 / 5 / 10                    | 6x10^-5 - 6x10^4 | 0.05%*    |
| Int32  | 1 / - / 31                    | 0 - 2x10^9       | ½         |
| Int16  | 1 / - / 15                    | 0 - 6x10^4       | ½         |
| Int8   | 1 / - / 7                     | 0 - 127          | ½         |
| Binary | 1                             | 0 - 1            | ½         |
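A quick way to sanity-check these ranges is NumPy's dtype metadata; a minimal sketch (the ½ accuracy listed for the integer formats is, presumably, the worst-case rounding error of half a unit in the last place):

```python
# Inspect floating-point and integer dtype ranges with NumPy metadata.
import numpy as np

for t in (np.float32, np.float16):
    info = np.finfo(t)
    print(t.__name__, "range:", info.tiny, "to", info.max, "| machine eps:", info.eps)

for t in (np.int32, np.int16, np.int8):
    info = np.iinfo(t)
    print(t.__name__, "range:", info.min, "to", info.max)
```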

9 Cost of Operations

| Operation           | Energy (pJ) | Area (µm²) |
| 8b Add              | 0.03        | 36         |
| 16b Add             | 0.05        | 67         |
| 32b Add             | 0.1         | 137        |
| 16b FP Add          | 0.4         | 1360       |
| 32b FP Add          | 0.9         | 4184       |
| 8b Mult             | 0.2         | 282        |
| 32b Mult            | 3.1         | 3495       |
| 16b FP Mult         | 1.1         | 1640       |
| 32b FP Mult         | 3.7         | 7700       |
| 32b SRAM Read (8KB) | 5           | N/A        |
| 32b DRAM Read       | 640         | N/A        |

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)," ISSCC 2014. Area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node; FP units used the DesignWare library.
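A few ratios computed from this table make the point of the next slide: a DRAM access costs orders of magnitude more than arithmetic, so data movement rather than math sets the energy budget. A small sketch using the table's numbers:

```python
# Energy ratios from the cost-of-operations table above (all in pJ).
dram_read, sram_read = 640.0, 5.0
add8, mul8, fpmul32 = 0.03, 0.2, 3.7

print(f"32b DRAM read vs 8b add:      {dram_read / add8:,.0f}x")
print(f"32b DRAM read vs 32b FP mult: {dram_read / fpmul32:,.0f}x")
print(f"DRAM read vs SRAM read:       {dram_read / sram_read:.0f}x")
print(f"32b FP mult vs 8b mult:       {fpmul32 / mul8:.1f}x")
```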

10 The Importance of Staying Local
- LPDDR DRAM (GB): 640 pJ/word
- On-chip SRAM (MB): 50 pJ/word
- Local SRAM (KB): 5 pJ/word

11 Mixed Precision Store weights as 4b using trained quantization; decode to 16b. Store activations as 16b. Compute a_j = Σ_i w_ij·x_i + b_j with 16b x 16b multiplies, rounding each result to 16b; accumulate at 24b or 32b to avoid saturation. Batch normalization is important to center the dynamic range.
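A minimal NumPy sketch of this datapath, with assumed toy shapes and values: 4b weights stored as codebook indices, decoded to 16b, multiplied at 16b, and accumulated at 32b:

```python
# Mixed-precision multiply-accumulate sketch (toy sizes, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(0, 0.1, size=16).astype(np.float16)  # 2^4 shared weight values
w_idx = rng.integers(0, 16, size=1024, dtype=np.uint8)     # 4b indices (packed in real HW)
x = rng.normal(0, 1, size=1024).astype(np.float16)         # 16b activations
b = np.float32(0.05)

w = codebook[w_idx]                      # decode 4b index -> 16b weight
prod = w * x                             # 16b x 16b multiply, rounded to 16b
acc = prod.astype(np.float32).sum() + b  # wide (32b) accumulation avoids saturation
a = np.float16(acc)                      # activation stored back at 16b
print(a)
```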

12 Trained Quantization Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016 (Best Paper)
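Trained quantization in Deep Compression clusters each layer's weights with k-means and shares one centroid per cluster; the centroids are then fine-tuned with gradients summed per cluster. A toy sketch of the clustering step only (retraining omitted):

```python
# 1-D k-means over a layer's weights: each weight maps to one of 2^bits centroids.
import numpy as np

def kmeans_quantize(weights, bits=4, iters=20):
    k = 2 ** bits
    # linear initialization over the weight range (reported to work well)
    centroids = np.linspace(weights.min(), weights.max(), k)
    for _ in range(iters):
        assign = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = weights[assign == c]
            if members.size:
                centroids[c] = members.mean()
    return centroids[assign], assign, centroids

w = np.random.default_rng(0).normal(0, 0.1, size=4096)
wq, idx, cb = kmeans_quantize(w, bits=4)
print("mean abs quantization error:", np.abs(w - wq).mean())
```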

13 2-6 Bits/Weight Sufficient

14 Sparsity Don't do operations that don't matter (10x-30x)

15 Sparsity
Figure: pruning synapses and then pruning neurons turns the dense network (before pruning) into a sparse one (after pruning).
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

16 Retrain to Recover Accuracy
Pipeline: Train Connectivity -> Prune Connections -> Train Weights (iterate).
Chart: accuracy loss (0.5% down to -4.5%) vs. parameters pruned (40%-100%) for L1/L2 regularization with and without retraining; L2 regularization with iterative prune-and-retrain holds accuracy loss near zero at up to ~90% of parameters pruned.
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
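A hedged sketch of that iterative prune-and-retrain loop (simple magnitude pruning; `train_step` is a stand-in for a real optimizer update):

```python
# Iterative magnitude pruning: zero the smallest weights, retrain with the
# pruned connections frozen at zero, then prune further.
import numpy as np

def prune_mask(w, sparsity):
    """Mask keeping the largest-|w| fraction (1 - sparsity) of weights."""
    thresh = np.quantile(np.abs(w), sparsity)
    return (np.abs(w) > thresh).astype(w.dtype)

def train_step(w, mask):
    # placeholder update: real code applies gradients, then re-applies the mask
    grad = np.random.default_rng(0).normal(0, 0.01, size=w.shape)
    return (w - grad) * mask   # pruned weights stay exactly zero

w = np.random.default_rng(1).normal(0, 0.1, size=10000)
for sparsity in (0.5, 0.7, 0.9):        # iterative prune-and-retrain schedule
    mask = prune_mask(w, sparsity)
    w = w * mask
    for _ in range(3):
        w = train_step(w, mask)
    print(f"{sparsity:.0%} pruned, nonzeros remaining: {int((w != 0).sum())}")
```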


18 Pruning + Trained Quantization

19 Pruning NeuralTalk and LSTM

20 Fixed-Function Hardware

21 DianNao (Electric Brain)
DianNao improved CNN computation efficiency with dedicated functional units and on-chip memory buffers optimized for the CNN workload:
- NFU pipeline: multipliers (NFU-1) + adder trees (NFU-2) + shifter and non-linear lookup (NFU-3), orchestrated by instructions from a control processor (CP)
- Split buffers for input neurons (NBin), output neurons (NBout), and synapses (SB); weights held in off-chip DRAM and moved by DMA
- 452 GOP/s in 3.02 mm² at 485 mW (65nm layout); total on-chip RAM 44KB; area and power are dominated by the buffers (~56% and ~60%), with the NFU a close second (~28%)
- Performs 496 16-bit fixed-point operations per cycle (62x more than an 8-wide SIMD core); on classifier and convolutional layers it is on average ~118x faster than the SIMD core
Chen et al., DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning, ASPLOS 2014

22 Efficient Inference Engine (EIE)
EIE runs the compressed (pruned, weight-shared) model directly. An array of processing elements (PEs), coordinated by a central control unit (CCU), each stores a slice of the sparse weight matrix in compressed sparse form (even/odd pointer SRAM banks plus relative-indexed, encoded weights). Leading-nonzero detection broadcasts only nonzero activations through an activation queue; each PE reads pointers, fetches the matching sparse-matrix entries, decodes the weight, multiplies and accumulates into its activation registers, and applies ReLU.
Implementation results for one PE: critical path 1.15 ns; SRAM dominates area (~93%) and power (~59%).
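The computational core of this approach is a sparse matrix-vector multiply that touches only the nonzero weights in the columns of nonzero activations. An illustrative sketch using plain CSC storage rather than EIE's relative-indexed, banked layout:

```python
# Sparse matvec that skips zero activations and zero weights.
import numpy as np

def csc_matvec_sparse_act(indptr, indices, values, x, m):
    """y = W @ x for an m x n matrix W stored by column (CSC)."""
    y = np.zeros(m, dtype=np.float32)
    for j in np.flatnonzero(x):                    # skip zero activations entirely
        xj = x[j]
        for p in range(indptr[j], indptr[j + 1]):  # nonzero weights in column j
            y[indices[p]] += values[p] * xj
    return y

rng = np.random.default_rng(0)
m, n = 64, 128
W = rng.normal(size=(m, n)).astype(np.float32)
W[rng.random((m, n)) < 0.9] = 0.0                  # ~90% weight sparsity (pruned)
x = np.maximum(rng.normal(size=n), 0).astype(np.float32)  # ReLU -> ~50% zero acts

cols = [np.flatnonzero(W[:, j]) for j in range(n)]  # build the CSC arrays
indptr = np.concatenate([[0], np.cumsum([c.size for c in cols])])
indices = np.concatenate(cols)
values = np.concatenate([W[c, j] for j, c in enumerate(cols)])

y = csc_matvec_sparse_act(indptr, indices, values, x, m)
print(np.allclose(y, W @ x, atol=1e-3))
```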

23 Energy Efficiency
Charts (log scale): energy efficiency and speedup of EIE versus CPU (baseline, dense), CPU compressed, GPU, GPU compressed, mGPU, and mGPU compressed, across FC layers of AlexNet (Alex-6/7/8) and VGG-16 (VGG-6/7/8), the NeuralTalk benchmarks (NT-We, NT-Wd, NT-LSTM), and the geometric mean. EIE's energy efficiency over the CPU baseline ranges from roughly 8,000x to 120,000x per benchmark.

24 Energy/Inference (mJ)
Chart: energy per inference (0-30 mJ) for the conv layers of VGG with 16b arithmetic, broken down into activations, weights, and math.

25 Energy/Inference (mJ)
Chart: energy per inference for the VGG-16 FC layers with 16b arithmetic, broken down into activations, weights, and math.

26 Inference Summary
- Prune 70-95% of static weights (3-20x reduction in size and ops) [3-20x]
- Exploit dynamic sparsity of activations (3x reduction in ops) [10-60x]
- Reduce precision to ~4 bits (8x size reduction) [80-480x]
- The 80-480x smaller model fits in smaller RAM (100x less power) [8,000-48,000x]
- Dedicated hardware to eliminate overhead, facilitate compression and sparsity, and exploit locality
Sparsity, Precision, Locality

27 Faster Training through Parallelism

28 Data Parallel - Run Multiple Inputs in Parallel
- Doesn't affect latency for one input
- Requires P-fold larger batch size
- For training, requires a coordinated weight update
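A minimal sketch of that coordinated update with P workers, simulated serially (a toy linear model stands in for the network; in a real system the gradient average is an all-reduce or a parameter server):

```python
# Data-parallel training step: each worker computes a gradient on its shard,
# then one averaged update is applied to the shared parameters.
import numpy as np

P = 4
rng = np.random.default_rng(0)
w = rng.normal(size=256)                 # shared model parameters
data = rng.normal(size=(P, 32, 256))     # P shards of the (P-fold larger) batch
target = rng.normal(size=(P, 32))

def shard_gradient(w, x, y):
    # gradient of mean squared error for a linear model (toy stand-in)
    err = x @ w - y
    return 2 * x.T @ err / len(y)

grads = [shard_gradient(w, data[p], target[p]) for p in range(P)]  # in parallel
w -= 0.01 * np.mean(grads, axis=0)       # coordinated update: average and apply
```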

29 Parameter Update
Parameter server holds parameters p and applies worker updates: p' = p + Δp. Model workers each train a replica and send Δp; data shards feed the workers.
Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012

30 Model-Parallel Convolution by output region (x,y)
Multiple 3D kernels K[u,v,k,j] applied to input maps A[x,y,k] produce output maps B[x,y,j]. 6D loop:
forall output region XY            (parallel across workers)
  for each output map j
    for each input map k
      for each pixel x,y in XY
        for each kernel element u,v
          B[x,y,j] += A[x-u,y-v,k] * K[u,v,k,j]
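The same loop nest as runnable Python, with assumed toy sizes; the outer loop over output regions is the part that parallelizes across GPUs (indexing shifted to x+u, y+v so a padded input keeps all accesses in range):

```python
# 6-deep convolution loop, tiled by output region.
import numpy as np

X, Y = 8, 8            # output size
K_, J = 3, 2           # input maps k, output maps j
U, V = 3, 3            # kernel size
A = np.random.default_rng(0).normal(size=(X + U - 1, Y + V - 1, K_)).astype(np.float32)
Kw = np.random.default_rng(1).normal(size=(U, V, K_, J)).astype(np.float32)
B = np.zeros((X, Y, J), dtype=np.float32)

for x0, y0 in [(0, 0), (0, 4), (4, 0), (4, 4)]:    # forall region XY (parallel)
    for j in range(J):                             # each output map
        for k in range(K_):                        # each input map
            for x in range(x0, x0 + 4):
                for y in range(y0, y0 + 4):        # each pixel in the region
                    for u in range(U):
                        for v in range(V):         # each kernel element
                            B[x, y, j] += A[x + u, y + v, k] * Kw[u, v, k, j]

# check against a vectorized formulation of the same computation
windows = np.lib.stride_tricks.sliding_window_view(A, (U, V), axis=(0, 1))
ref = np.einsum('xykuv,uvkj->xyj', windows, Kw)
print(np.allclose(B, ref, atol=1e-4))
```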

31 Parallel GPUs on Deep Speech
Chart: training time (seconds) vs. number of GPUs, for two model configurations (labeled 2560 and 1760). Baidu, Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015.

32 4 Distinct Sub-problems

|                                                   | Training (32b FP, batch)                             | Inference (low-precision, compressed, latency-sensitive) |
| Convolutional (B x S weight reuse; act dominated) | Activation storage; GPUs ideal; comm for parallelism | Fixed-function HW; arithmetic dominated                  |
| Fully-Conn. (B weight reuse; weight dominated)    | Weight storage; GPUs ideal; comm for parallelism     | No weight reuse; fixed-function HW; storage dominated    |

Training: 32b FP, large batches; minimize training time; enables larger networks.
Inference: 8b int, small (unit) batches; meet the real-time constraint.

33 Summary
Fixed-function hardware will dominate inference (100-10,000x gain):
- Sparse, low-precision, compressed (25-150x smaller); 3x dynamic sparsity
- All weights and activations from local memory (10-100x less energy)
- Flexible enough to track evolving algorithms
GPUs will dominate training:
- Only dynamic sparsity (3x activations, 2x dropout)
- Medium precision (FP16 for weights), stochastic rounding
- Large memory footprint (batch x retained activations)
- Communication BW scales with parallelism
