Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network

Size: px

Start display at page:

Download "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network"

May Little
6 years ago
Views:

1 Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Chen Zhang 1, Peng Li 3, Guangyu Sun 1,2, Yijin Guan 1, Bingjun Xiao 3, Jason Cong 1,2,3 1 Peking University 2 PKU/UCLA Joint Research Institution 3 University of California, Los Angeles 1

Correctness in Large Scale Visual Recognition

2 Convolutional Neural Network Automotive Safety Image Stanford u Image Classification & Object Detection State-of-the-art Algorithm Highest Correctness in Large Scale Visual Recognition Challenge 2012 & 2013 & 2014 & 2015 Widely used in industry & academy 2

3 Convolutional Neural Network input image feature maps feature maps feature maps feature maps output category Convolution layer Max-pooling layer Convolution layer Max-pooling layer Classifier Max-pooling is optional Input feature map Output feature map 3

4 Convolutional Neural Network input image feature maps feature maps feature maps feature maps output category Convolutional layers account for over 90% computation Convolution Max-pooling Convolution Max-pooling Classifier layer layer layer layer [1] A. Krizhevsky, etc. Imagenet classification with deep convolutional neural networks. NIPS [2] J. Cong and B. Xiao. Minimizing computation in convolutional neural networks. ICANN 2014 Max-pooling is optional Input feature map Output feature map 4

5 Related Work 1. CNP: An FPGA-based processor for convolutional networks. FPL 2009; 2. A massively parallel coprocessor for convolutional neural networks. ASAP 2009; 3. A programmable parallel accelerator for learning and classification. PACT 2010; 4. A Dynamically Configurable Coprocessor for Convolutional Neural Networks. ISCA 2010; 5. Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM 2011; 6. A massively parallel digital learning processor. NIPS 2008; 7. NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision. CVPRW 2011; 8. Accelerating deep neural networks on mobile processor with embedded programmable logic. (Poster) NIPS 2013; 9. Memory-Centric Accelerator Design for Convolutional Neural Networks. ICCD

6 Hardware Computing on FPGA Computation Resource... PE-1 PE-2 PE-n Challenge: Resrc. Util. Solution: Unroll & Pipeline Interconnect On-chip memory buffer1 buffer2 Challenge: BRAM Limits Solution: Loop Tiling Off-chip Bus External Memory Challenge: BW Limits Solution: Data reuse Challenge: Co-optimization, -- match computation & communication Solution: Roofline Model Main Contribution 6

7 FPGA design in Roofline Model Computational Perf. Computation Resource... PE-1 PE-2 PE-n GFLOP/S Interconnect On-chip memory buffer1 Off-chip Bus buffer2 Computation To Communication Ratio External Memory FLOP / (Mem. Byte Access) Bandwidth Gbyte/S 7

8 Computational Performance FPGA design in Roofline Model GFLOP/S Design Bandwidth FLOP/(Memory byte access) Computation To Communication (CTC) Ratio 8

Computational Attainable Performance Attainable Performance GFLOP/S Computation Bandwidth Roof Computational Communication Roof Computation

9 Computational Attainable Performance Attainable Performance GFLOP/S Computation Bandwidth Roof Computational Communication Roof Computation 1 B 2 3 Communication A Compt. Perf. (2 & 3 < 1) Attain. Perf. (2 & 3 > 1) FLOP/(DRAM byte access) Computation To Communication (CTC) Ratio 9

10 Attainable Performance Attainable Performance GFLOPS Computational Roof 1 Bandwidth Roof 2 3 Attain. Perf. (2 & 3 > 1) Attain. Perf. (2 & 3 < 1) FLOP/(DRAM byte access) Computation To Communication (CTC) Ratio 10

11 Attainable Performance Attainable Performance GFLOPS Computational Roof 1 Bandwidth Roof Optimization Target: 2 3 Find a Design with Maximized Attainable Performance Attain. Perf. (2 & 3 > 1) Attain. Perf. (2 & 3 < 1) FLOP/(DRAM byte access) Computation To Communication (CTC) Ratio 11

12 Loop tiling 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { 5 for(trr=row; trr<min(row+tr, R); trr++) { 6 for(tcc=col; tcc<min(tcc+tc, C); tcc++) { 7 for(too=to; too<min(to+tn, M); too++) { 1 for(row=0; row<r; row++) { 8 for(tii=ti; tii<(ti+tn, N); tii++) { 2 for(col=0; col<c; col++) { 9 3 for(i=0; for(to=0; i<k; to<m; i++) to++) { { for(j=0; for(ti=0; j<k; ti<n; j++) ti++) { { output_fm[to][row][col] for(i=0; i<k; i++) { += }}}} Off-chip Data Transfer: KMemory Access Optimization }}}}}} On-chip Data: Computation Optimization 6 for(j=0; j<k; j++) { }}}}}} Input feature map Output feature map weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; output_fm[to][row][col] += weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; w ij 1l (Tile loop) (Tile loop) (Tile loop) (Tile loop) R, C, M, N, K, S are all configuration parameters of the convolutional layer 12

13 Loop tiling 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { 5 for(trr=row; trr<min(row+tr, R); trr++) { 6 for(tcc=col; tcc<min(tcc+tc, C); tcc++) { 7 for(too=to; too<min(to+tn, M); too++) { 8 for(tii=ti; tii<(ti+tn, N); tii++) { 9 for(i=0; i<k; i++) { 10 for(j=0; j<k; j++) { output_fm[to][row][col] += weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; }}}}}} }}}} Off-chip Data Transfer: Memory Access Optimization On-chip Data: Computation Optimization (Tile loop) (Tile loop) (Tile loop) (Tile loop) 13

$i<k; i++) { 10 for(j=0; j<k; j++) { output_fm[to][row][col] += }}}}}} 5 for(i=0; i<k; i++) { 6 for(j=0; j<k; j++) { 7 for(trr=row; trr<min(row+tr, R); trr++) { 8 for(tcc=col; tcc<min(tcc+tc, C);$

14 Computation Optimization 5 for(trr=row; trr<min(row+tr, R); trr++) 6 for(tcc=col; tcc<min(tcc+tc, C); tcc++) 7 for(too=to; too<min(to+tn, M); too++) 8 for(tii=ti; tii<(ti+tn, N); tii++) 9 for(i=0; i<k; i++) { 10 for(j=0; j<k; j++) { output_fm[to][row][col] += }}}}}} 5 for(i=0; i<k; i++) { 6 for(j=0; j<k; j++) { 7 for(trr=row; trr<min(row+tr, R); trr++) { 8 for(tcc=col; tcc<min(tcc+tc, C); tcc++) { #pragma HLS pipeline 9 for(too=to; too<min(to+tm, M); too++) { #pragma HLS UNROLL 10 for(tii=ti; tii<(ti+tn, N); tii++) { #pragma HLS UNROLL output_fm[to][row][col] += weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; }}}}}} (Tile loops) weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; 14

Computation Optimization Tn 5 for(i=0; i<k; i++) { (Tile loops) 5 for(trr=row; 6 for(j=0; trr<min(row+tr, j<k; j++) { R); trr++) 6 for(tcc=col; 7 for(trr=row; tcc<min(tcc+tc, trr<min(row+tr, C);

15 Computation Optimization Tn 5 for(i=0; i<k; i++) { (Tile loops) 5 for(trr=row; 6 for(j=0; trr<min(row+tr, j<k; j++) { R); trr++) 6 for(tcc=col; 7 for(trr=row; tcc<min(tcc+tc, trr<min(row+tr, C); tcc++) R); trr++) { 7 8 for(too=to; for(tcc=col; too<min(to+tn, tcc<min(tcc+tc, M); too++) C); tcc++) { 8 #pragma for(tii=ti; HLS pipeline tii<(ti+tn, N); tii++) 9 for(too=to; too<min(to+tm, M); too++) { 9 for(i=0; i<k; i++) { #pragma Tm HLS UNROLL for(j=0; for(tii=ti; j<k; tii<(ti+tn, j++) { N); tii++) { #pragma HLS output_fm[to][row][col] UNROLL += weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; output_fm[to][row][col] += }}}}}} weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; }}}}}} 15

16 Computational Performance Tn Tm Design 1 Design 2 Design 3 Design 4 Output (Tm) Input (Tn) DSPs

17 Computational Performance u Consider Memory Access! How? Design 1 Design 2 Design 3 Design 4 Output (Tm) Input (Tn) DSPs

18 Memory Access Optimization 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { } }}} load output feature map S: foo(output_fm(to, row, col)); store output feature map 5 for(trr=row; trr<min(row+tr, R); trr++) 6 for(tcc=col; tcc<min(tcc+tc, C); tcc++) 7 for(too=to; too<min(to+tn, M); too++) 8 for(tii=ti; tii<(ti+tn, N); tii++) 9 for(i=0; i<k; i++) { 10 for(j=0; j<k; j++) { output_fm[to][row][col] += weights[to][ti][i][j]* input_fm[ti][s*row+i][s*col+j]; }}}}}} 18

19 Memory Access Optimization 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { } }}} load output feature map S: foo(output_fm(to, row, col)); store output feature map 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { }}} } load output feature map S: foo(output_fm(to, row, col)); store output feature map 19

20 Computational Performance x x x x Computation To Communication Ratio x x GFLOP/S x x Computation To Communication (CTC) Ratio FLOP/(DRAM byte access) Design 1 Design 2 Design 3 Design 4 Output (Tm) Input (Tn)

Design Space Legal Solutions of Tm & Tn: N=128 N=192 2097 # of PE = 450

21 Design Space Legal Solutions of Tm & Tn: N=128 N= # of PE = 450 Communication: # of memory access methods Legal Solutions: 3 Total Legal Solutions:

22 Attainable performance (GFLOPS) Design Space Exploration 120 Bandwidth roof 100 Computational roof A D C 80 A CTC Ratio Ratio of (FLOP)/(DRAM byte access) 22

FIFO FIFO System Overview u AXI4Lite Timer AXI4 Interrupt controller Accelerator Estimated On-board run Difference Layer 1 7.32 ms 7.67 ms ~5% Layer 2 5.11 ms 5.

23 FIFO FIFO System Overview u AXI4Lite Timer AXI4 Interrupt controller Accelerator Estimated On-board run Difference Layer ms 7.67 ms ~5% Layer ms 5.35 ms ~5% Data Transfer Engine0 Data Transfer Engine1 Layer 3 MicroBlaze 3.42 ms 3.79 ms ~9% UART Layer ms 2.88 ms ~10% Layer ms 1.93 ms ~10% Overall ms ms ~7% VC707 Vivado Vivado HLS Memory Interface Controller DDR3 interface On-chip Off-chip Main Memory 23

Engine0 Data Transfer Engine1 AXI4Lite AXI4 Memory

24 FIFO FIFO Accelerator Interrupt controller Accelerator Timer UART MicroBlaze Data Transfer Engine0 Data Transfer Engine1 AXI4Lite AXI4 Memory Interface Controller DDR3 interface On-chip Off-chip Main Memory 24

25 Experimental Results: vs. CPU Energy 90x 20x 1 thread -O3 16 threads -O3 FPGA x 5x Speedup 1 thread -O3 16 threads -O3 FPGA CPU Xeon E (32nm) 16 cores 2.2 GHz FPGA Virtex7-485t (28nm) 448 PEs 100MHz gcc O3 OpenMP 3.0 Vivado Vivado HLS

26 Experimental Results: vs. Other FPGAs FPL2009 ASAP2009 PACT2010 ISCA2010 ICCD2013 Our Impl. Virtex 4 Virtex 5 Virtex 5 Virtex 5 Virtex 6 Virtex 7 125MHz 115MHz 125MHz 200MHz 150MHz 100MHz 5.25 GOPS 6.74 GOPS 7 GOPS 16 GOPS 17 GOPS 61.6 GOPS 26

27 Experimental Results: vs. Other FPGAs GFLOPS/Slice Normalized Performance ~80% ASAP2009 FPL2009 PACT2010 ISCA2010 ICCD2013 Our Impl. 27

Attainable performance (GFLOPS) Experimental Results 120 100 A Bandwidth roof Computation Communication Comm.

28 Attainable performance (GFLOPS) Experimental Results A Bandwidth roof Computation Communication Comm. Time Covered C A Computation Communication Comm. Time NOT Covered CTC Ratio Ratio of (FLOP)/(DRAM byte access) 28

29 Conclusions An accelerator for convolutional neural network Contribution: Accurate Analytical model for computation & communication Find the best solution with roofline model Result: On-board run implementation ~3.5x better performance over other FPGA implementations ~80% performance/area improvement 29

30 Thank You Q & A 30

Introduction to Neural Networks

ECE 5775 (Fall 17) High-Level Digital Design Automation Introduction to Neural Networks Ritchie Zhao, Zhiru Zhang School of Electrical and Computer Engineering Rise of the Machines Neural networks have