ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA
1 ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA. Song Han 1,2, Junlong Kang 2, Huizi Mao 1, Yiming Hu 3, Xin Li 2, Yubin Li 2, Dongliang Xie 2, Hong Luo 2, Song Yao 2, Yu Wang 2,3, Huazhong Yang 2,3 and Bill Dally 1,4. Stanford University 1, DeePhi 2, Tsinghua University 3, NVIDIA 4. Feb 23, 2017, FPGA '17, Monterey, CA
2 Recurrent Neural Networks and LSTM: speech recognition, image captioning, machine translation, visual question answering. Compression, Acceleration, Regularization
3 Speech Recognition
4 Machine Translation
5 Visual Storytelling / Image Captioning (Huang et al.)
6 VQA: Visual Question Answering. Which country's flag is this? What is behind him? What is the color of his hair?
7 Recurrent Neural Network. MLP, image captioning, sentiment analysis, machine translation, speech recognition (Stanford CS231n lecture notes)
8 Comparing CNN / LSTM. CNN: weights shared in space. RNN/LSTM: weights shared in time, which produces complicated data dependencies and makes parallelization difficult.
9 LSTM Structure: Input → LSTM → LSTM → FC → Softmax → Output
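The LSTM blocks in this structure can be written out as a plain cell update. Below is a generic numpy sketch of the standard LSTM equations; the stacked weight layout and the tiny sizes are illustrative assumptions, not ESE's configuration.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step computed from the concatenated [x, h_prev].

    W has shape (4*H, D+H) and b shape (4*H,), stacking the input (i),
    forget (f), candidate (g) and output (o) gate parameters.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = 1 / (1 + np.exp(-z[0:H]))        # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))      # forget gate
    g = np.tanh(z[2*H:3*H])              # candidate cell state
    o = 1 / (1 + np.exp(-z[3*H:4*H]))    # output gate
    c = f * c_prev + i * g               # new cell state
    h = o * np.tanh(c)                   # new hidden state
    return h, c

# Tiny example: input size 2, hidden size 3
rng = np.random.default_rng(0)
D, H = 2, 3
W = rng.standard_normal((4 * H, D + H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```

The four matrix-vector products hidden in `W @ [x, h]` are exactly the operations ESE turns into sparse matrix-vector multiplies after pruning.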
10 Models are Getting Larger. SPEECH RECOGNITION: Deep Speech (2014): 80 GFLOP of training ops, 7,000 hrs of data, ~8% error. Deep Speech 2: 465 GFLOP, 12,000 hrs of data, ~5% error. 10X the training ops.
11 We Need More Computation. But Moore's Law is no longer providing more compute.
12 Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design
13 Conventional Paradigm: Training → Inference
15 Proposed Paradigm. Conventional: Training → Inference (slow, power-hungry). Proposed: Training (Han et al., ICLR'17) → Model Compression (Han et al., NIPS'15; Han et al., ICLR'16, best paper award) → Accelerated Inference (Han et al., ISCA'16; Han et al., FPGA'17, best paper award): fast, power-efficient.
16 Agenda. Compression: Load-Balance-Aware Pruning, Quantization. Scheduling: Overlap Computation and Memory Reference. Accelerated Inference: Efficient Architecture for Sparse LSTM. Results.
18 Pruning Review (Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS'15)
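The pruning being reviewed is magnitude-based: zero out the smallest weights, then retrain to recover accuracy. A simplified sketch of the pruning step (the retraining loop of the NIPS'15 procedure is omitted):

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude entries of W until roughly
    `sparsity` fraction of entries are zero. In the full pipeline the
    network is retrained after pruning to recover accuracy."""
    k = int(round(sparsity * W.size))
    if k == 0:
        return W.copy()
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    mask = np.abs(W) > threshold          # keep only weights above threshold
    return W * mask

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
Wp = magnitude_prune(W, sparsity=0.9)     # ~90% of entries become zero
```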
19 Pruning Leads to Load Imbalance. [Figure: after pruning, the remaining non-zero weights (w0,0, w0,1, w0,3, w1,2, w2,1, w2,3, w4,2, w4,3, w5,0, w6,0, w6,3, w7,1) are spread unevenly over the four PEs.] Unbalanced workload: 5, 2, 4 and 1 cycles on the four PEs. Overall: 5 cycles, set by the slowest PE.
23 Load-Balance-Aware Pruning. [Figure: the same matrix pruned with an equal non-zero budget per PE, keeping e.g. w0,0, w0,3, w1,2, w2,1, w2,3, w3,2, w4,2, w5,0, w5,3, w6,0, w7,1, w7,3.] Unbalanced: 5, 2, 4, 1 cycles, overall 5 cycles. Balanced: 3 cycles on every PE, overall 3 cycles.
26 Accuracy vs. Sparsity. [Figure: load-balance-aware pruning matches the accuracy of unconstrained pruning at the same sparsity.]
28 Weight Quantization. Table 4: WER before and after compression.
Network                               WER
32-bit floating, original network     20.3%
32-bit floating, pruned network       20.7%
16-bit fixed, pruned network          20.7%
12-bit fixed, pruned network          20.7%
8-bit fixed, pruned network           84.5%
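The fixed-point rows come from quantizing the pruned weights. A generic linear fixed-point quantizer is sketched below; the fractional-bit splits chosen here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def quantize_fixed(W, bits, frac_bits):
    """Linear fixed-point quantization sketch: round to `bits`-bit
    signed values with `frac_bits` fractional bits, saturating at the
    representable range, then map back to floats for simulation."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(W * scale), lo, hi)   # integer codes
    return q / scale                           # dequantized values

W = np.array([0.30, -1.27, 0.001])
W12 = quantize_fixed(W, bits=12, frac_bits=10)  # fine steps: WER preserved
W8 = quantize_fixed(W, bits=8, frac_bits=6)     # coarse steps: accuracy at risk
```

The rounding error per weight is bounded by half a quantization step (2^-11 at 12 bits with 10 fractional bits), which is why 16- and 12-bit fixed point preserve WER while 8 bits does not leave enough precision.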
29 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results
30 FSM for LSTM
31 Scheduling. [Figure: the timeline overlaps memory fetches with the SpMM and element-wise (Elt-wise) computation of consecutive LSTM operations, so the next weight fetch hides behind the current compute.]
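The effect of the schedule can be shown with a toy timing model: if the weight fetch for the next operation overlaps the SpMM and element-wise work of the current one, the makespan drops from the sum of all stages toward the longer of the two paths. All cycle counts below are made up for illustration.

```python
# Toy timing model (hypothetical cycle counts): each of four LSTM
# matrix operations needs a weight fetch from DRAM (mem), a sparse
# matrix-vector multiply (spmm) and element-wise work (elt).
stages = [(40, 30, 10)] * 4              # (mem, spmm, elt) per operation

# Sequential: every stage waits for the previous one to finish.
sequential = sum(m + s + e for m, s, e in stages)

# Overlapped (double buffering): the fetch for operation i runs
# concurrently with the compute (spmm + elt) of operation i - 1.
mem_times = [m for m, _, _ in stages]
comp_times = [s + e for _, s, e in stages]
overlapped = mem_times[0]                # first fetch cannot be hidden
for i in range(1, len(stages)):
    overlapped += max(mem_times[i], comp_times[i - 1])
overlapped += comp_times[-1]             # drain the last compute
```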
45 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results
46 Challenges. On-line decompression while computing → special-purpose logic. Computation becomes irregular → caused by sparsity. Parallelization becomes challenging → requires load balance.
47 Dealing with Sparsity. [Figure 8: the computation pattern. The non-zero weights in each column of the sparse matrix are assigned to different PEs; when activation a3 arrives, each PE multiplies it by its non-zero weights from column 3 (w0,3, w2,3, w4,3, w7,3) and accumulates the products into its output buffer.]
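The computation pattern in the figure is column-wise sparse matrix-vector multiplication over a compressed (CSC-style) format. A minimal single-threaded sketch; in the accelerator the row-indexed accumulations for one column run in parallel on different PEs.

```python
import numpy as np

def sparse_mv_by_column(values, row_idx, col_ptr, a, n_rows):
    """CSC sparse matrix-vector product: for each non-zero activation
    a[j], walk the non-zero weights of column j via col_ptr and
    accumulate w[i, j] * a[j] into output row i."""
    y = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0.0:                    # zero activations are skipped
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * aj
    return y

# A 4x3 matrix with 4 non-zeros, stored column by column
values  = np.array([1.0, 2.0, 3.0, 4.0])
row_idx = np.array([0, 2, 1, 3])
col_ptr = np.array([0, 2, 3, 4])         # nnz ranges of columns 0, 1, 2
a = np.array([1.0, 0.0, 2.0])
y = sparse_mv_by_column(values, row_idx, col_ptr, a, n_rows=4)
```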
48 Hardware Architecture. [Figure: (a) system overview. A CPU running the software program, with its own memory, talks over PCIE to the FPGA; the FPGA holds the PCIE controller, memory controller, input/output buffers on the data bus, external memory, and the ESE accelerator, organized as N channels (Channel 0 ... Channel N-1) under an ESE controller, each channel containing multiple PEs. (b) one channel. An ActQueue feeds activations x/y(t-1); PtrRead units with pointer buffers and SpmatRead units with weight buffers walk the sparse matrix for the SpMM; activation buffers and an assemble unit produce y(t); element-wise units (adder tree, sigmoid/tanh, ElemMul) combine Wx(t)/Wy(t-1) and the cell state c(t-1) into c(t), writing the hidden state into the H(t) buffer.]
59 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results
60 Speedup vs. Sparsity. [Figure: at the same accuracy, load-balance-aware pruning delivers an extra 12% speedup over unconstrained pruning, a free lunch.]
62 Speedup and Energy Efficiency.
                 ESE       CPU (dense)  CPU (sparse)  GPU (dense)  GPU (sparse)
Latency          82.7 µs   6017 µs      3569 µs       240 µs       287 µs
Power            41 W      111 W        38 W          202 W        136 W
Compression      20×       1×           1×            1×           1×
ESE runs the compressed sparse LSTM 43× faster than the Core i7 CPU and 3× faster than the Pascal Titan X GPU, at 40× and 11.5× better energy efficiency; against the GPU on the dense model, the energy-efficiency gap is 14.3×.
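A quick cross-check of the results: energy per inference is latency times power, so, taking the ESE entry (82.7 µs, 41 W) and the GPU dense entry (assumed here to read 240 µs and 202 W), the 14.3× energy-efficiency figure falls out directly.

```python
# Energy per inference = latency x power (joules).
ese_energy = 82.7e-6 * 41               # ESE: ~3.4 mJ per inference
gpu_dense_energy = 240e-6 * 202         # GPU, dense model: ~48.5 mJ
ratio = gpu_dense_energy / ese_energy   # energy-efficiency gap, ~14.3x
```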
66 Demo
67 Thank You! Conventional: Training → Inference. Proposed: Training (Han et al., ICLR'17) → Compression: pruning, quantization (Han et al., NIPS'15; Han et al., ICLR'16, best paper award) → Accelerated Inference (Han et al., ISCA'16; Han et al., FPGA'17, best paper award).
Computation-Performance Optimization of Convolutional Neural Networks with Redundant Kernel Removal arxiv:1705.10748v3 [cs.cv] 10 Apr 2018 Chih-Ting Liu, Yi-Heng Wu, Yu-Sheng Lin, and Shao-Yi Chien Media
More informationDeep Neural Network Acceleration Framework Under Hardware Uncertainty
Deep Neural Network Acceleration Framework Under Hardware Uncertainty Mohsen Imani, Pushen Wang, and Tajana Rosing Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, USA {moimani, puw001,
More informationMachine Learning on FPGAs
Machine Learning on FPGAs Jason Cong Chancellor s Professor, UCLA Director, Center for Domain-Specific Computing cong@cs.ucla.edu http://cadlab.cs.ucla.edu/~cong 1 Impacts of deep learning for many applications
More informationMask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma
Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN Background Related Work Architecture Experiment Mask R-CNN Background Related Work Architecture Experiment Background From left
More informationPERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices
PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices Chunhua Deng + City University of New York chunhua.deng@rutgers.edu Keshab K. Parhi University of Minnesota, Twin Cities parhi@umn.edu
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationModel Compression. Girish Varma IIIT Hyderabad
Model Compression Girish Varma IIIT Hyderabad http://bit.ly/2tpy1wu Big Huge Neural Network! AlexNet - 60 Million Parameters = 240 MB & the Humble Mobile Phone 1 GB RAM 1/2 Billion FLOPs NOT SO BAD! But
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationConvolutional Neural Network Layer Reordering for Acceleration
R1-15 SASIMI 2016 Proceedings Convolutional Neural Network Layer Reordering for Acceleration Vijay Daultani Subhajit Chaudhury Kazuhisa Ishizaka System Platform Labs Value Co-creation Center System Platform
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationDeploying Deep Neural Networks in the Embedded Space
Deploying Deep Neural Networks in the Embedded Space Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis 2 nd International Workshop on Embedded and Mobile Deep Learning (EMDL) MobiSys,
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationScalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism
Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM
More informationCapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse
Accepted for publication at Design, Automation and Test in Europe (DATE 2019). Florence, Italy CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Reuse Alberto Marchisio, Muhammad Abdullah
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationMIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius
MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety
More informationDeep Learning with Tensorflow AlexNet
Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification
More informationDeep learning for object detection. Slides from Svetlana Lazebnik and many others
Deep learning for object detection Slides from Svetlana Lazebnik and many others Recent developments in object detection 80% PASCAL VOC mean0average0precision0(map) 70% 60% 50% 40% 30% 20% 10% Before deep
More informationOptimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Chen Zhang 1, Peng Li 3, Guangyu Sun 1,2, Yijin Guan 1, Bingjun Xiao 3, Jason Cong 1,2,3 1 Peking University 2 PKU/UCLA Joint
More informationCharacterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager
Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance
More informationLeveraging MLC STT-RAM for Energy-efficient CNN Training
Memory Demand (GB) Access Bandwidth(GB/s) Normalized Accesses Leveraging for Energy-efficient CNN Training Hengyu Zhao and Jishen Zhao University of California, San Diego {h6zhao, jzhao}@eng.ucsd.edu ABSTRACT
More informationA Lightweight YOLOv2:
FPGA2018 @Monterey A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA Hiroki Nakahara, Haruyoshi Yonekawa, Tomoya Fujii, Shimpei Sato Tokyo Institute of Technology,
More informationS8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer
S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer 2 100 倍以上速く 本当に可能ですか? 2 DOUGLAS ADAMS BABEL FISH Neural Machine Translation Unit 3 4 OVER 100X FASTER, IS IT REALLY POSSIBLE?
More informationMachine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,
Machine Learning 10-701, Fall 2015 Deep Learning Eric Xing (and Pengtao Xie) Lecture 8, October 6, 2015 Eric Xing @ CMU, 2015 1 A perennial challenge in computer vision: feature engineering SIFT Spin image
More information