ComPEND: Computation Pruning through Early Negative Detection for ReLU in a Deep Neural Network Accelerator

1 ComPEND: Computation Pruning through Early Negative Detection for ReLU in a Deep Neural Network Accelerator
ICS 2018, June 13, 2018
Dongwoo Lee, Sungbum Kang, Kiyoung Choi
Neural Processing Research Center (NPRC)

2 Outline
Motivation
Early Negative Detection (END)
Computation Pruning thru END (ComPEND)
Evaluation
Conclusion

3 Motivation
Perceptron: $x = \sum_{i=1}^{N} A_i W_i$, with output $A^{l} = f(x)$
[Figure: perceptron with activations A and weights W feeding the sum x, next to a plot of f(x) against x]
Rectified linear unit (ReLU, $f(x) = \max(0, x)$) is widely used as an activation function for DNNs.
If we know a priori that $x \le 0$, we can skip unnecessary computations and simply set the ReLU output to zero.

6 Motivation
Distribution of negative inputs to ReLU functions in VGG-16: more than 60%

7 Early Negative Detection (END)
Two's complement number representation (4 bits)
Negative: 1111 = -8+7 = -1, 1110 = -8+6 = -2, 1101 = -8+5 = -3, 1100 = -8+4 = -4
Positive: 0001 = -0+1 = +1, 0000 = -0+0 = +0
For a B-bit number $W$: $(w_{B-1}\, w_{B-2}\, w_{B-3} \cdots w_1\, w_0)$
$W = w_{B-1}\,(-2^{B-1}) + \sum_{k=0}^{B-2} w_k\,(+2^{k})$

8 Early Negative Detection (END)
Inverted two's complement number representation (4 bits)
Positive: 1111 = +8-7 = +1, 1110 = +8-6 = +2, 1101 = +8-5 = +3, 1100 = +8-4 = +4
Negative: 0001 = +0-1 = -1, 0000 = +0-0 = -0
For a B-bit number $W$: $(w_{B-1}\, w_{B-2}\, w_{B-3} \cdots w_1\, w_0)$
$W = w_{B-1}\,(+2^{B-1}) + \sum_{k=0}^{B-2} w_k\,(-2^{k})$
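The two formulas can be checked with a minimal Python sketch. The function names (`twos_value`, `inverted_value`, `to_inverted`) are illustrative and do not come from the slides. The encoder relies on the observation that a bit pattern read as inverted two's complement is the negation of the same pattern read as two's complement.

```python
def twos_value(bits):
    """Value of an MSB-first bit list read as ordinary two's complement."""
    B = len(bits)
    low = sum(bit << (B - 2 - i) for i, bit in enumerate(bits[1:]))
    return -bits[0] * (1 << (B - 1)) + low


def inverted_value(bits):
    """Value of the same bit list read as inverted two's complement."""
    B = len(bits)
    low = sum(bit << (B - 2 - i) for i, bit in enumerate(bits[1:]))
    return bits[0] * (1 << (B - 1)) - low


def to_inverted(value, B):
    """Encode value as B inverted-two's-complement bits, MSB first."""
    if not -(2 ** (B - 1) - 1) <= value <= 2 ** (B - 1):
        raise ValueError("value does not fit in B inverted bits")
    pattern = (-value) & ((1 << B) - 1)  # two's complement pattern of -value
    return [(pattern >> k) & 1 for k in range(B - 1, -1, -1)]


print(twos_value([1, 1, 1, 1]), inverted_value([1, 1, 1, 1]))  # -1 1
print(to_inverted(-3, 4))  # [0, 0, 1, 1]
```

Note that the representable range shifts: 4 inverted bits cover -7 to +8, whereas ordinary two's complement covers -8 to +7.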

9 Early Negative Detection (END)
Inverted two's complement representation for negative detection
[Figure, built up over four slides: a worked example multiplying an activation of 5 by a weight, shown in decimal, in two's complement, and in inverted two's complement, each followed by ReLU. In the inverted form the partial result turns negative before the last step, so the remaining steps are marked "Skipped!"]

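13 Early Negative Detection (END)
Two's complement representation
$W = w_{B-1}\,(-2^{B-1}) + \sum_{k=0}^{B-2} w_k\,(+2^{k})$
[Figure: value of the partial sum against steps, for a positive sum and a negative sum]
Inverted two's complement representation
$W = w_{B-1}\,(+2^{B-1}) + \sum_{k=0}^{B-2} w_k\,(-2^{k})$
[Figure: value of the partial sum against steps, with the point where it turns negative marked "Stop here!"]
With MSB-first processing and non-negative activations, every term after the MSB is subtracted in the inverted form, so a partial sum that has become negative stays negative.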
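The stopping rule can be traced for a single input with a short Python sketch. The weight value on the slides did not survive extraction, so the weight used here (-3, encoded as the 4-bit inverted pattern 0011) is a hypothetical choice, and `end_partial_sums` is an illustrative name.

```python
def end_partial_sums(activation, weight_bits):
    """Partial sums of activation * W, one weight bit per step, MSB first.

    weight_bits is an inverted-two's-complement pattern and activation
    must be non-negative (for example, the output of an earlier ReLU).
    """
    B = len(weight_bits)
    partial, trace = 0, []
    for step, bit in enumerate(weight_bits):
        place = 2 ** (B - 1 - step)
        sign = 1 if step == 0 else -1  # only the MSB carries positive weight
        partial += sign * bit * place * activation
        trace.append(partial)
        if partial < 0:  # later terms can only subtract, so the sign is final
            break
    return trace


print(end_partial_sums(5, [0, 0, 1, 1]))  # weight -3: [0, 0, -10]
print(end_partial_sums(5, [1, 1, 0, 1]))  # weight +3: [40, 20, 20, 15]
```

For the negative weight the loop stops after three of the four steps. For the positive weight it runs to the end and the last partial sum is the exact product.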

14 Early Negative Detection (END)
For multiple inputs
\[
\begin{aligned}
x = \sum_{i=1}^{N} A_i W_i
  &= A_1\,[\,w_{1,B-1}\,2^{B-1} - w_{1,B-2}\,2^{B-2} - w_{1,B-3}\,2^{B-3} - \cdots\,] \\
  &+ A_2\,[\,w_{2,B-1}\,2^{B-1} - w_{2,B-2}\,2^{B-2} - w_{2,B-3}\,2^{B-3} - \cdots\,] \\
  &+ \cdots \\
  &+ A_N\,[\,w_{N,B-1}\,2^{B-1} - w_{N,B-2}\,2^{B-2} - w_{N,B-3}\,2^{B-3} - \cdots\,]
\end{aligned}
\]
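Summing the expansion column by column (one bit position across all inputs per step) gives a bit-serial sum of products that can stop as soon as the running total is negative. The following is a software sketch of that idea only, not the accelerator's datapath. The names `compend_relu_dot` and `to_inverted` are illustrative, and the activations are assumed to be non-negative.

```python
def to_inverted(value, B):
    """Encode value as B inverted-two's-complement bits, MSB first."""
    if not -(2 ** (B - 1) - 1) <= value <= 2 ** (B - 1):
        raise ValueError("value does not fit in B inverted bits")
    pattern = (-value) & ((1 << B) - 1)
    return [(pattern >> k) & 1 for k in range(B - 1, -1, -1)]


def compend_relu_dot(activations, weights, B=16):
    """Return (ReLU(sum of A_i * W_i), number of bit steps actually used)."""
    planes = [to_inverted(w, B) for w in weights]
    partial = 0
    for step in range(B):
        # One weight bit per input: the column of the expansion for this step.
        column = sum(a * bits[step] for a, bits in zip(activations, planes))
        place = 2 ** (B - 1 - step)
        partial += column * place if step == 0 else -column * place
        if partial < 0:
            return 0, step + 1  # sign is settled, remaining steps are pruned
    return partial, B


print(compend_relu_dot([3, 0, 2, 7], [5, -4, -6, 1], B=8))    # (10, 8)
print(compend_relu_dot([3, 0, 2, 7], [-5, 4, -6, -1], B=8))   # (0, 6)
```

In the second call the true sum is -34, and the negative sign is detected after 6 of the 8 steps.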

17 Computation Pruning thru END (ComPEND)
Bit-serial sum of products
Takes multiple steps, but the area of a bit-serial unit is much smaller
Can integrate more units, giving higher performance
Similar to Stripes (P. Judd et al., MICRO 2016)
[Figure: a conventional sum of products, where B-bit weights W_1..W_N are multiplied by activations A_1..A_N and added into S, next to a bit-serial sum of products, where one bit of each weight is consumed per step from MSB to LSB over B steps and accumulated into S with a shift (<<)]
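A shift-and-add version of the bit-serial sum of products can be sketched in Python as follows. This sketch uses ordinary two's complement weights and no pruning, to show only that B one-bit steps reproduce the conventional result. The function name and the default width of 16 bits are assumptions made for illustration.

```python
def bit_serial_dot(activations, weights, B=16):
    """Sum of A_i * W_i computed one weight bit position per step, MSB first."""
    mask = (1 << B) - 1
    acc = 0
    for pos in range(B - 1, -1, -1):
        # Add up the activations whose weight has a 1 at this bit position.
        column = sum(a for a, w in zip(activations, weights)
                     if ((w & mask) >> pos) & 1)
        # Shift the running sum, then add the column; the MSB of a
        # two's complement number carries negative weight.
        acc = (acc << 1) + (-column if pos == B - 1 else column)
    return acc


activations = [3, 0, 2, 7]
weights = [5, -4, -6, 1]
print(bit_serial_dot(activations, weights, B=8))              # 10
print(sum(a * w for a, w in zip(activations, weights)))       # 10
```

Each step needs only AND gates and an adder tree rather than full multipliers, which is the area saving the slide refers to.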

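18 Computation Pruning thru END (ComPEND)
Overall architecture of ComPEND
[Figure: DRAM and STT-RAM connected through the Memory Controller to the weight buffers (WB) and activation buffers (AB), with the Provider Network, the array of processing units, and the Global Controller; one block carries the label "6 + additional s"]
9x16 array of processing units
32 16-bit inputs per processing unit
9x16x32 inputs at a time (3x3x512 filter)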

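19 Computation Pruning thru END (ComPEND)
Data packing
Input activation block: 32 activations of the same X, Y
  $A_{1,1,1}, A_{1,1,2}, A_{1,1,3}, A_{1,1,4}, \ldots, A_{1,1,31}, A_{1,1,32}$ (16 bits each, 512 bits per block)
Weight bits block: 512 bits of weights in the same bit position (1 bit each, 512 bits per block)
  $w_{1,1,1}^{MSB}, w_{1,1,2}^{MSB}, w_{1,1,3}^{MSB}, \ldots, w_{1,1,512}^{MSB}$
  $w_{1,1,1}^{MSB-1}, w_{1,1,2}^{MSB-1}, w_{1,1,3}^{MSB-1}, \ldots, w_{1,1,512}^{MSB-1}$
  ...
  $w_{1,1,1}^{LSB}, w_{1,1,2}^{LSB}, w_{1,1,3}^{LSB}, \ldots, w_{1,1,512}^{LSB}$
  (in the case of $F_z = 512$)
[Figure: input volume ($I_x$, $I_y$, $I_z$), filter ($F_x$, $F_y$, $F_z = I_z$) and output volume ($O_x$, $O_y$, $O_z$), showing where each block is taken from]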
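The weight packing can be illustrated with a small Python sketch that regroups a list of weights into one block per bit position, MSB first. The function names, the list-of-lists layout, and the use of plain two's complement bit patterns are assumptions for illustration; the slide does not specify the encoding or the memory layout beyond the block sizes.

```python
def pack_weight_bit_blocks(weights, B=16):
    """Return B blocks, MSB first; each block holds one bit of every weight."""
    mask = (1 << B) - 1
    return [[((w & mask) >> pos) & 1 for w in weights]
            for pos in range(B - 1, -1, -1)]


def unpack_weight_bit_blocks(blocks):
    """Rebuild the signed weights from MSB-first bit blocks."""
    B = len(blocks)
    patterns = [0] * len(blocks[0])
    for block in blocks:
        patterns = [(p << 1) | bit for p, bit in zip(patterns, block)]
    # Reinterpret each B-bit pattern as a signed two's complement value.
    return [p - (1 << B) if p >> (B - 1) else p for p in patterns]


weights = [5, -4, -6, 1]
blocks = pack_weight_bit_blocks(weights, B=8)
print(blocks[0])                         # MSB block: [0, 1, 1, 0]
print(unpack_weight_bit_blocks(blocks))  # [5, -4, -6, 1]
```

With 512 weights and B=16 this yields 16 blocks of 512 bits, so a bit-serial unit can fetch exactly the bits it needs for one step in a single block.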

20 Computation Pruning thru END (ComPEND)
Processing unit
[Figure: input activations and weight bits entering the processing unit]
32 16-bit input activation registers
32-bit weight bits register
16-bit adder tree

21 Computation Pruning thru END (ComPEND)
Memory controller
Manages all kinds of memory-involved data transfers
Weight blocks
  Off-chip memory -> STT-RAM
  STT-RAM -> Weight Buffers (WBs)
  WBs -> Weight registers in processing units
Activation blocks
  Off-chip memory -> Activation Buffers (ABs)
  Off-chip memory -> Registers in processing units (FC layers: activation blocks are moved directly from off-chip memory to registers)
  ABs -> Registers in processing units
Output activation blocks
  Global controller -> Off-chip memory
[Figure: architecture diagram with DRAM, STT-RAM, WBs, ABs, Memory Controller, Provider Network and Global Controller]

22 Computation Pruning thru END (ComPEND)
Provider network
Inputs: 32 x 9 x 16 bits
Outputs: 32 x 9 x 16 bits
Activation reuse in processing units during 2D convolution with 3x3 filters
Reconfiguration with 9 types of connections for shuffling weights
[Figure: a 3x3 sliding window moving over the activation grid $A_{1,1} \ldots A_{3,4}$, with the pairing of weights $W_{1,1} \ldots W_{3,3}$ and activations shown for connection type 1 and connection type 2]

23 Computation Pruning thru END (ComPEND)
Global controller
[Figure: pipeline list of (id, pos) entries with a head pointer; entry board of (id, last pos, DATA) rows selected through a comparator (=) and a MUX; decision unit with shift (<<) and subtract (-)]
16 decision units
Decision unit
  Decides final sum of products
  Zero if DATA is negative
  DATA if last position is LSB
Pipeline list
  id: filter ID
  pos: bit position in 16-bit weights
  head: current output of adder tree
Entry board
  id: filter ID
  last pos: last position in the pipeline
  DATA: partial sum
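The decision rule can be written as a few lines of Python. This is a sketch of the rule as stated on the slide, not of the hardware. The function name, the use of `None` for "not decided yet", and the convention that the LSB is bit position 0 are assumptions.

```python
LSB_POS = 0  # assumed convention: bit positions count down from B-1 to 0


def decide(data, last_pos):
    """Return the final sum of products, or None if it is not yet decided.

    data: partial sum held in the entry board
    last_pos: last bit position processed for this filter
    """
    if data < 0:
        return 0          # negative partial sum: ReLU output is zero
    if last_pos == LSB_POS:
        return data       # all bits processed: partial sum is final
    return None           # keep the entry in the pipeline


print(decide(-20, 2))  # 0
print(decide(10, 0))   # 10
print(decide(40, 7))   # None
```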

24 Computation Pruning thru END (ComPEND)
Global controller
Filling up the pipeline
  P1: The next bit in the bit-serial computation
  P2: A new sum of products that has not yet been entered into the pipeline
  P3: The next step of a sum of products whose prior step is still in the pipeline
$F_p: (w_{i,B-1}\, w_{i,B-2}\, w_{i,B-3} \cdots w_{i,1}\, w_{i,0})$
$F_q: (w_{j,B-1}\, w_{j,B-2}\, w_{j,B-3} \cdots w_{j,1}\, w_{j,0})$
[Figure: the global controller diagram (pipeline list, entry board, MUX, decision unit) with the completed bits and the P1, P2 and P3 candidates marked on the bit strings of $F_p$ and $F_q$]

25 Computation Pruning thru END (ComPEND)
Operation pipeline
  (1) Weight buffers -> (2) Provider network
  (2) -> (3) Processing unit array
  (3) -> (4) Global controller
[Figure: architecture diagram with the stages numbered: (1) WBs, (2) Provider Network, (3) processing unit array, (4) Global Controller]

26 Evaluation
Pre-trained weights of the VGG-16 network and images from ImageNet ILSVRC-2012
In-house cycle-accurate timing simulator written in C++, with
  DRAMSim2 for off-chip memory
  CACTI 6.5 to model SRAM
  NVSim for on-chip STT-RAM
Synopsys Design Compiler with the TSMC 45nm technology library at 0.9 V to get timing/power/area parameters for the processing units and the Provider Network

27 Evaluation
VGG-16 network
We use 15 layers in the VGG-16 network as workloads, excluding layer F1.
F1 is excluded since the total size of its input activations is too big.
Inputs to C1 are raw data that can be negative, so the pruning scheme cannot be applied. C1 is implemented without ComPEND.

28 Evaluation
Configuration
Area
Peak throughput (32-input processing units, 16 in a row x 9 rows x 1 GHz = 4.6 TOPS)
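The peak-throughput figure can be reproduced with simple arithmetic. The sketch below assumes 16 processing units per row and a 1 GHz clock, values taken from the reconstructed slide text, and counts each input handled per cycle as one operation. Treat all three as assumptions rather than the authors' stated accounting.

```python
inputs_per_unit = 32   # 32-input processing unit
units_per_row = 16     # assumed from the 9x16 array
rows = 9
clock_hz = 1e9         # assumed 1 GHz clock

ops_per_second = inputs_per_unit * units_per_row * rows * clock_hz
tera_ops = ops_per_second / 1e12
print(f"{tera_ops:.3f} TOPS")  # 4.608 TOPS
```

The product 32 x 16 x 9 = 4608 is also the number of values in one 3x3x512 filter.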

29 Evaluation
Runtime
Reduced by 16.62% on average compared to that without ComPEND for 15 layers
[Figure: runtime breakdown for VGG-16 layers. Left bars: without ComPEND. Right bars: with ComPEND]
  MEM_STT: reads/writes between off-chip memory and STT-RAM
  STT_WB: reads/writes between STT-RAM and WB
  MEM_WB: reads/writes between off-chip memory and WB
  MEM_AB: reads/writes between off-chip memory and AB
  AB_: reads/writes between AB and registers in processing units
  RUN_: computation in processing units

30 Evaluation
Energy (dynamic & static) consumption
Reduced by 23.5% on average for 15 layers
[Figure: energy breakdown for VGG-16 layers. Left bars: without ComPEND. Right bars: with ComPEND]
  D/S_CTRL: global controller
  D/S_NET: provider network
  D/S_STT: STT-RAM
  D/S_AB: activation buffers
  D/S_WB: weight buffer
  D/S_: processing units

31 Evaluation
Power consumption
Average over 15 layers
  Without ComPEND:.2 Watt
  With ComPEND:.3 Watt
[Figure: power consumption for VGG-16 layers]

32 Evaluation
Energy-delay product
ComPEND reduces EDP and ED²P by 36.2% and 46.8% for the execution of the 15 layers
[Figure: EDP and ED²P for VGG-16 layers]
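For reference, the energy-delay product is energy multiplied by runtime, and the energy-delay-squared product weights runtime twice. The sketch below shows how relative reductions in both metrics follow from energy and runtime. The input numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the evaluation.

```python
def edp(energy, delay):
    """Energy-delay product."""
    return energy * delay


def ed2p(energy, delay):
    """Energy-delay-squared product."""
    return energy * delay ** 2


def reduction(before, after):
    """Relative reduction, e.g. 0.28 means 28% lower."""
    return 1 - after / before


# Hypothetical example: 20% less energy and 10% less runtime.
base_energy, base_delay = 1.0, 1.0
new_energy, new_delay = 0.8, 0.9

print(round(reduction(edp(base_energy, base_delay), edp(new_energy, new_delay)), 3))    # 0.28
print(round(reduction(ed2p(base_energy, base_delay), ed2p(new_energy, new_delay)), 3))  # 0.352
```

Because runtime enters ED²P squared, a design that saves both energy and time always shows a larger relative gain in ED²P than in EDP.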

33 Conclusion
Proposed the concept of END (early negative detection) based on inverted two's complement
Proposed an architecture that implements ComPEND
Achieved 16.62% higher speed and 23.5% less energy consumption for inference
Future work
  Combining with other zero-skipping approaches
  Handling layers (say, F1 in VGG-16) exceeding the capacity of the architecture

34 THANK YOU
