ECE 5775 High-Level Digital Design Automation, Fall 2018: DNN Acceleration on FPGAs


1 ECE 5775 High-Level Digital Design Automation Fall 2018 DNN Acceleration on FPGAs

2 Announcements HW 2 due Friday at 11:59pm. First student-led seminar is on Tuesday 10/16. Format: 18-min talk plus 2-min Q&A (Q = Questions & Quiz). The group winning the most audience votes gets a bonus point. In-class midterm next Thursday, 10/18: 75 mins, open book, open notes, closed Internet.

3 Outline Midterm coverage CNN acceleration on FPGAs 2

4 Midterm (20%) Topics covered Specialized computing FPGAs C-based synthesis Control flow graph and SSA Scheduling Resource sharing Pipelining 3

5 Key Topics (1) Algorithm basics Time complexity, esp. big-O notation Graphs Trees, DAGs, topological sort BDDs, timing analysis FPGAs LUTs and LUT mapping C-based synthesis Arbitrary precision and fixed-point types Basic C-to-RTL mapping Key HLS optimizations to improve design performance

6 Key Topics (2) Compiler analysis Control data flow graph Dominance relation SSA Scheduling TCS & RCS algorithms: ILP, list scheduling, SDC Operation chaining, frequency/latency/resource constraints Ability to devise a simple scheduling algorithm to optimize certain design metric 5

7 Key Topics (3) Resource sharing Conflict and compatibility graphs Left edge algorithm Ability to determine minimum resource usage in # of functional units and/or registers, given a fixed schedule Pipelining Dependence types Ability to determine minimum II given a code snippet Modulo scheduling concepts: MII, RecMII, ResMII 6

8 Example: What's the MII?
[Figure: a small data flow graph of add and subtract operations, with one loop-carried dependence edge of distance 1]
Three AddSub units available; two multipliers available. Assume 1-cycle operations, no chaining.
What is the minimum II (MII) for pipelining the above data flow graph? Write down both ResMII and RecMII.
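Recall the standard modulo-scheduling definitions (general formulas, not the answer to this exercise):

MII = max(ResMII, RecMII)
ResMII = max over resource types r of ceil( # operations using r / # available units of r )
RecMII = max over dependence cycles c of ceil( total latency around c / total dependence distance around c )

As a purely illustrative instance: four add/sub operations sharing three AddSub units would give ResMII = ceil(4/3) = 2, and a cycle of two 1-cycle operations closed by a distance-1 loop-carried edge would give RecMII = ceil(2/1) = 2.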

9 Modern DNNs are Computationally Expensive
[Figure a: number of operations (x 10^9) vs. top-5 error for DNNs in academia with and without structural optimization, including AlexNet, OverFeat, VGG-16/19, GoogLeNet, Inception, ResNet, ResNeXt, DPN, Xception, NASNet, PolyNet, MobileNet, and ShuffleNet]
[Figure b: performance density (GOP/s per mm^2) of GPUs, FPGAs, and ASICs (e.g., DianNao, DaDianNao, ShiDianNao, Cambricon, Eyeriss, EIE, TPU, Myriad 2, Stratix/Arria FPGAs, and GTX/Tesla GPUs) across process nodes from 65 nm to 12 nm, compared against the Moore's law trend]
X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, "Scaling for Edge Inference of Neural Networks," Nature Electronics, vol. 1, Apr 2018.
DNNs require an enormous amount of compute. For example, ResNet-50 performs roughly 7.7 billion operations to classify one image.

10 DNNs are Moving to Dedicated Hardware
Blue chips: Apple, Google, Intel, Microsoft
Startups: Cambricon, Cerebras, DeePhi (acquired by Xilinx), Graphcore
Academia: EIE/ESE [Han ISCA'16, FPGA'17], Eyeriss [Chen ISCA'16], FINN [Umuroglu FPGA'17], Minerva [Reagen ISCA'16]

11 A Hyper-Active Body of Research
Recent publications on "Neural Networks on Silicon" in premier computer hardware conferences (>150 papers since 2015). Data source:
This lecture focuses on FPGA-based inference accelerators for convolutional neural networks (CNNs)

12 Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
Chen Zhang 1, Peng Li 2, Guangyu Sun 1,3, Yijin Guan 1, Bingjun Xiao 2, Jason Cong 2,3,1
1 Center for Energy-Efficient Computing and Applications, Peking University; 2 Computer Science Department, University of California, Los Angeles; 3 PKU/UCLA Joint Research Institute in Science and Engineering
FPGA'15, Feb 2015

13 Main Contributions 1. Analysis of the different sources of parallelism in the convolution kernel of a CNN 2. Quantitative performance modeling of the hardware design space using the Roofline method 3. Design and implementation of a CNN accelerator for FPGA using Vivado HLS, evaluated on AlexNet 12

14 Convolutional Layer
[Figure: I input feature maps are convolved with K x K filters to produce O output feature maps of size R x C]
An output pixel is connected to its neighboring region on each input feature map.
All pixels on an output feature map use the same filter weights.

15-18 Parallelism in the Convolutional Layer
[Figure: same as slide 14]
Four main sources of parallelism:
1. Across input feature maps
2. Across output feature maps
3. Across different output pixels (i.e., filter positions)
4. Across filter pixels

19 Parallelism in the Code
[Figure: same as slide 14]
1 for (row=0; row<R; row++) {
2  for (col=0; col<C; col++) {          // loop over output pixels
3   for (to=0; to<O; to++) {            // loop over output feature maps
4    for (ti=0; ti<I; ti++) {           // loop over input feature maps
5     for (ki=0; ki<K; ki++) {
6      for (kj=0; kj<K; kj++) {         // loop over filter pixels
        output_fm[to][row][col] +=
          weights[to][ti][ki][kj] * input_fm[ti][S*row+ki][S*col+kj];
}}}}}}
S is the stride
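For reference, a minimal self-contained C version of this loop nest is sketched below; the dimensions, stride, data type, and test values are illustrative assumptions, not the accelerator's configuration.

#include <stdio.h>

#define R 8   /* output rows (illustrative) */
#define C 8   /* output cols (illustrative) */
#define O 4   /* output feature maps */
#define I 3   /* input feature maps */
#define K 3   /* filter size */
#define S 1   /* stride */

/* Naive convolutional layer: one multiply-accumulate per
   (output pixel, output map, input map, filter pixel) tuple.
   Assumes output_fm is zero-initialized by the caller. */
void conv_layer(float output_fm[O][R][C],
                float weights[O][I][K][K],
                float input_fm[I][S*(R-1)+K][S*(C-1)+K])
{
  for (int row = 0; row < R; row++)
    for (int col = 0; col < C; col++)        /* loop over output pixels       */
      for (int to = 0; to < O; to++)         /* loop over output feature maps */
        for (int ti = 0; ti < I; ti++)       /* loop over input feature maps  */
          for (int ki = 0; ki < K; ki++)
            for (int kj = 0; kj < K; kj++)   /* loop over filter pixels       */
              output_fm[to][row][col] +=
                weights[to][ti][ki][kj] *
                input_fm[ti][S*row + ki][S*col + kj];
}

int main(void)
{
  static float in[I][S*(R-1)+K][S*(C-1)+K], w[O][I][K][K], out[O][R][C];
  in[0][0][0] = 1.0f;
  w[0][0][0][0] = 2.0f;                          /* trivial test data */
  conv_layer(out, w, in);
  printf("out[0][0][0] = %f\n", out[0][0][0]);   /* expect 2.0 */
  return 0;
}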

20 Challenges for FPGA
We can't just unroll all the loops due to limited FPGA resources; we must choose the right code transformations to exploit the parallelism in a resource-efficient way.
[Figure: accelerator template with an array of PEs (PE-1 ... PE-n) and interconnect, on-chip memory buffers, and an off-chip bus to external memory]
Limited compute and routing resources → techniques: loop unrolling, pipelining, interchange
Limited on-chip storage → technique: loop tiling
Limited off-chip bandwidth → techniques: data reuse, compute-memory balance

21 Loop Tiling
[Figure: same as slide 14]
1 for (row=0; row<R; row++) {
2  for (col=0; col<C; col++) {          // loop over pixels in an output map
3   for (to=0; to<O; to++) {
4    for (ti=0; ti<I; ti++) {
5     for (ki=0; ki<K; ki++) {
6      for (kj=0; kj<K; kj++) {
        output_fm[to][row][col] +=
          weights[to][ti][ki][kj] * input_fm[ti][S*row+ki][S*col+kj];
}}}}}}

22 Loop Tiling
[Figure: the output feature maps are divided into Tr x Tc tiles]
1 for (row=0; row<R; row+=Tr) {
2  for (col=0; col<C; col+=Tc) {        // loop over different tiles
3   for (to=0; to<O; to++) {
4    for (ti=0; ti<I; ti++) {
5     for (trr=row; trr<min(row+Tr, R); trr++) {
6      for (tcc=col; tcc<min(col+Tc, C); tcc++) {   // loops over pixels in each tile
7       for (ki=0; ki<K; ki++) {
8        for (kj=0; kj<K; kj++) {
          output_fm[to][trr][tcc] +=
            weights[to][ti][ki][kj] * input_fm[ti][S*trr+ki][S*tcc+kj];
}}}}}}}}
Offloading just the inner loops requires only a small portion of the data to be stored on the FPGA chip

23 Code with Loop Tiling
1  for (row=0; row<R; row+=Tr) {
2   for (col=0; col<C; col+=Tc) {
     // CPU portion (loops 1-4): maximize memory reuse
3    for (to=0; to<O; to+=To) {
4     for (ti=0; ti<I; ti+=Ti) {
       // software: write input feature map + filters to on-chip buffers
       // FPGA portion (loops 5-10): maximize computational performance
5      for (trr=row; trr<min(row+Tr, R); trr++) {
6       for (tcc=col; tcc<min(col+Tc, C); tcc++) {
7        for (too=to; too<min(to+To, O); too++) {
8         for (tii=ti; tii<min(ti+Ti, I); tii++) {
9          for (ki=0; ki<K; ki++) {
10          for (kj=0; kj<K; kj++) {
             output_fm[too][trr][tcc] +=
               weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
      }}}}}}
     }
     // software: read output feature map from the on-chip buffer
   }}}

24 Optimizing for On-Chip Performance
5 for (trr=row; trr<min(row+Tr, R); trr++) {
6  for (tcc=col; tcc<min(col+Tc, C); tcc++) {
7   for (too=to; too<min(to+To, O); too++) {
8    for (tii=ti; tii<min(ti+Ti, I); tii++) {
9     for (ki=0; ki<K; ki++) {
10     for (kj=0; kj<K; kj++) {
        output_fm[too][trr][tcc] +=
          weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}

25 Optimizing for On-Chip Performance
Reorder the two filter loops (9 and 10) to the top level:
9  for (ki=0; ki<K; ki++) {
10  for (kj=0; kj<K; kj++) {
5    for (trr=row; trr<min(row+Tr, R); trr++) {
6     for (tcc=col; tcc<min(col+Tc, C); tcc++) {
7      for (too=to; too<min(to+To, O); too++) {
8       for (tii=ti; tii<min(ti+Ti, I); tii++) {
          output_fm[too][trr][tcc] +=
            weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}

26 Optimizing for On-Chip Performance
Pipeline the loop nest by adding a pipeline directive inside the tcc loop:
9  for (ki=0; ki<K; ki++) {
10  for (kj=0; kj<K; kj++) {
5    for (trr=row; trr<min(row+Tr, R); trr++) {
6     for (tcc=col; tcc<min(col+Tc, C); tcc++) {
#pragma HLS pipeline
7      for (too=to; too<min(to+To, O); too++) {
8       for (tii=ti; tii<min(ti+Ti, I); tii++) {
          output_fm[too][trr][tcc] +=
            weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}

27 Optimizing for On-Chip Performance
Unroll the too loop:
9  for (ki=0; ki<K; ki++) {
10  for (kj=0; kj<K; kj++) {
5    for (trr=row; trr<min(row+Tr, R); trr++) {
6     for (tcc=col; tcc<min(col+Tc, C); tcc++) {
#pragma HLS pipeline
7      for (too=to; too<min(to+To, O); too++) {
#pragma HLS unroll
8       for (tii=ti; tii<min(ti+Ti, I); tii++) {
          output_fm[too][trr][tcc] +=
            weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}

28 Optimizing for On-Chip Performance
Also unroll the tii loop:
9  for (ki=0; ki<K; ki++) {
10  for (kj=0; kj<K; kj++) {
5    for (trr=row; trr<min(row+Tr, R); trr++) {
6     for (tcc=col; tcc<min(col+Tc, C); tcc++) {
#pragma HLS pipeline
7      for (too=to; too<min(to+To, O); too++) {
#pragma HLS unroll
8       for (tii=ti; tii<min(ti+Ti, I); tii++) {
#pragma HLS unroll
          output_fm[too][trr][tcc] +=
            weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}
Number of cycles to execute the above loop nest = K × K × Tr × Tc + L ≈ Tr × Tc × K², where L is the pipeline depth (# of pipeline stages; II = 1)
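As a quick sanity check with illustrative numbers (not taken from the paper): for K = 3, Tr = Tc = 16, and a pipeline depth of L = 10, the loop nest takes about 3 × 3 × 16 × 16 + 10 = 2314 cycles, so the pipeline fill time L is negligible next to the K² × Tr × Tc term.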

29 Optimizing for On-Chip Performance
[Figure: the generated hardware, an array of multiply-accumulate processing engines fed from on-chip buffers; the loop nest is the same as on slide 28]
Performance and size of each PE are determined by the tile factors Ti and To.
Number of data transfers is determined by Tr and Tc.

30 Design Space Complexity Challenge: the available optimizations create a huge space of possible designs. What is the optimal loop order? What tile size to use for each loop? Implementing and testing each design by hand will be slow and error-prone, and some designs will exceed the on-chip compute/memory capacity. Solution: performance modeling + automated design space exploration

31 Performance Modeling We calculate the following design metrics: total number of operations (FLOP), which depends on the CNN model parameters; total external memory access (bytes), which depends on the CNN weight and activation sizes; and total execution time (sec), which depends on the hardware architecture (e.g., tile factors To and Ti). Resource constraints are ignored for now.

32 Performance Modeling
Total operations (FLOP) = 2 × O × I × R × C × K²
Execution time = number of cycles × clock period
Number of cycles ≈ (O/To) × (I/Ti) × (R/Tr) × (C/Tc) × (Tr × Tc × K²) = (O/To) × (I/Ti) × R × C × K²
External memory accesses = a_i × B_i + a_w × B_w + a_o × B_o
Size of input fmap buffer: B_i = Ti × (Tr + K - 1) × (Tc + K - 1), with stride = 1
Size of output fmap buffer: B_o = To × Tr × Tc
Size of weight buffer: B_w = To × Ti × K²
External access (trip) counts: a_o = (R/Tr) × (C/Tc) × (O/To); a_i = a_w = (I/Ti) × a_o

33 Performance Modeling
[Figure: same accelerator template as slide 20, with PEs, on-chip buffers, and an off-chip bus to external memory]
Computational throughput (GFLOP/s) = total number of operations / total execution time
Computation-to-communication (CTC) ratio (FLOP per byte accessed) = total number of operations / total external memory access
Required bandwidth (GB/s) = computational throughput / CTC ratio
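To make the model concrete, the C sketch below evaluates these formulas for candidate tile sizes and brute-forces the design space; the layer shape, clock frequency, and resource budgets are illustrative assumptions, and pipeline fill time is ignored.

#include <stdio.h>
#include <math.h>

/* Illustrative layer shape and hardware budgets (assumptions, not the paper's numbers). */
static const double R = 224, C = 224, O = 64, I = 3, K = 3;   /* conv layer        */
static const double FREQ_GHZ = 0.1;          /* 100 MHz clock (assumed)             */
static const double MAX_MACS = 448;          /* parallel multiply-add budget        */
static const double MAX_BUF_WORDS = 1.0e6;   /* on-chip buffer budget (words)       */

int main(void) {
  const double total_ops = 2 * O * I * R * C * K * K;          /* FLOP              */
  double best_gflops = 0, best_ctc = 0;
  int bTo = 0, bTi = 0, bTr = 0, bTc = 0;

  for (int To = 1; To <= O; To++)
    for (int Ti = 1; Ti <= I; Ti++)
      for (int Tr = 4; Tr <= R; Tr *= 2)
        for (int Tc = 4; Tc <= C; Tc *= 2) {
          /* On-chip buffer sizes in words (stride 1). */
          double Bi = Ti * (Tr + K - 1) * (Tc + K - 1);
          double Bo = To * Tr * Tc;
          double Bw = To * Ti * K * K;
          if (To * Ti > MAX_MACS || 2 * (Bi + Bo + Bw) > MAX_BUF_WORDS)
            continue;                       /* over budget (buffers double-buffered) */
          /* Off-chip transfer trip counts. */
          double ao = ceil(R/Tr) * ceil(C/Tc) * ceil(O/To);
          double ai = ao * ceil(I/Ti), aw = ai;
          double ext_bytes = 4 * (ai*Bi + aw*Bw + ao*Bo);      /* 32-bit words      */
          /* Compute cycles: number of tiles x (Tr*Tc*K*K) per tile. */
          double cycles = ceil(O/To) * ceil(I/Ti) * ceil(R/Tr) * ceil(C/Tc)
                          * (Tr * Tc * K * K);
          double gflops = total_ops / cycles * FREQ_GHZ;       /* GFLOP/s           */
          double ctc    = total_ops / ext_bytes;               /* FLOP per byte     */
          if (gflops > best_gflops || (gflops == best_gflops && ctc > best_ctc)) {
            best_gflops = gflops; best_ctc = ctc;
            bTo = To; bTi = Ti; bTr = Tr; bTc = Tc;
          }
        }

  printf("best tile <To,Ti,Tr,Tc> = <%d,%d,%d,%d>\n", bTo, bTi, bTr, bTc);
  printf("throughput = %.1f GFLOP/s, CTC = %.1f FLOP/byte, required BW = %.2f GB/s\n",
         best_gflops, best_ctc, best_gflops / best_ctc);
  return 0;
}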

34 Roofline Method [1]
[Figure: a design plotted as a point, with computational throughput (GFLOP/s) on the y-axis and computation-to-communication (CTC) ratio (FLOP/byte) on the x-axis; the slope of the line from the origin to the design point is its required bandwidth (GB/s)]
[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," CACM, 2009.

35 Roofline Method [1]
[Figure: the roofline plot; the horizontal line is the computational roof (max device throughput) and the sloped line through the origin is the bandwidth roof (device memory bandwidth); designs above the roofline either exceed on-chip resources or are bandwidth-bottlenecked, so the attainable throughput lies on or below the roofline]
[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," CACM, 2009.
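Stated as the usual roofline relation (a hedged summary, not a quote from the slide): attainable throughput = min(computational roof, CTC ratio × bandwidth roof). For example, under an assumed 4.5 GB/s bandwidth roof, a design with a CTC ratio of 10 FLOP/byte can sustain at most 45 GFLOP/s, no matter how many PEs it instantiates.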

36 Design Space Exploration with Roofline
[Figure: the explored design points plotted under the bandwidth roof and the computational roof, with two candidate designs A and B highlighted]

37 Hardware Implementation All values in floating-point Only handles Conv layers, not dense or pooling Virtex7-485t 100MHz 61.6 GOPS 18.6 W power 36

38 Experimental Results
[Figure: energy and speedup of a 1-thread CPU (-O3), a 16-thread CPU (-O3), and the FPGA accelerator]
CPU: Intel Xeon (32 nm), 16 cores, 2.2 GHz, gcc -O3, OpenMP 3.0
FPGA: Virtex7-485t (28 nm), 448 PEs, 100 MHz, Vivado and Vivado HLS

39 An OpenCL Deep Learning Accelerator on Arria 10
Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, Gordon R. Chiu
Intel Corporation (formerly Altera), Toronto, Canada
FPGA'17, Feb 2017

40 Main Contributions Reducing external memory bandwidth usage by: 1. Storing all intermediate feature maps in on-chip buffers 2. Image batching for dense layers Optimizing the convolution arithmetic using the Winograd Transformation A compute-bound implementation on Arria 10 whose energy efficiency matches the Titan X GPU 39

41 OpenCL Programming

Conventional C++:
void sum(float* a, float* b, float* c) {
  for (int i = 0; i < 100; i++)
    c[i] = a[i] * b[i];
}

OpenCL:
kernel void sum(global float* a, global float* b, global float* c) {
  int gid = get_global_id(0);
  c[gid] = a[gid] * b[gid];
}

The FPGA'15 paper used C++ with Xilinx Vivado HLS to generate RTL; this work uses OpenCL and the Intel FPGA SDK to generate RTL.
C++ is a sequential programming model using loops, where inter-loop-iteration parallelism is implicit (requires unrolling).
OpenCL is a parallel programming model using multithreaded kernels, where inter-iteration parallelism is explicit: each thread obtains an independent ID.

42 Data Placement
[Figure, top: layer data in off-chip storage is streamed to a PE through the FPGA's on-chip storage; bottom: layer data is held entirely in the FPGA's on-chip storage]
Previous papers stored layer data off-chip: there is insufficient on-chip storage to hold all the data for a layer (input + output).
This work uses an Arria 10 FPGA device, which has enough storage to keep the data on-chip (for the conv layers in AlexNet); double-buffering is used to store input + output.

43 Arithmetic Optimizations
On FPGAs, DSP blocks (used for fixed-point multiplies) are typically the bottleneck resource.
Consider a 1-dimensional convolution with output length 2 and filter length 3, denoted F(2,3):
  o0 = i0·f0 + i1·f1 + i2·f2
  o1 = i1·f0 + i2·f1 + i3·f2
Computed directly, F(2,3) takes 6 multiplies.
In the 70s, Shmuel Winograd proved that F(m,r) can be computed with a lower bound of only m + r - 1 multiplies [1].
[1] S. Winograd, Arithmetic Complexity of Computations, SIAM, 1980.

44 Winograd Transform

Naive approach (6 unique multiplies):
  o0 = i0·f0 + i1·f1 + i2·f2
  o1 = i1·f0 + i2·f1 + i3·f2

Winograd approach (4 unique multiplies):
  o0 = y0 + y1 + y2
  o1 = y1 - y2 - y3
  where each yi = di·gi (1 multiply per yi), with
  d0 = i0 - i2        g0 = f0
  d1 = i1 + i2        g1 = (f0 + f1 + f2) / 2
  d2 = i2 - i1        g2 = (f0 - f1 + f2) / 2
  d3 = i1 - i3        g3 = f2
  The di and gi are linearly mapped from the inputs ii and the filter weights fi.
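A small C program can check the transform above against the naive convolution; the input and filter values are arbitrary.

#include <stdio.h>

/* Check F(2,3): the Winograd form uses 4 multiplies instead of 6. */
int main(void) {
  double i[4] = {1.0, 2.0, 3.0, 4.0};      /* arbitrary input window */
  double f[3] = {0.5, -1.0, 2.0};          /* arbitrary filter       */

  /* Naive 1-D convolution: 6 multiplies. */
  double o0 = i[0]*f[0] + i[1]*f[1] + i[2]*f[2];
  double o1 = i[1]*f[0] + i[2]*f[1] + i[3]*f[2];

  /* Winograd: transform data (d) and filter (g), then 4 multiplies. */
  double d0 = i[0] - i[2], d1 = i[1] + i[2];
  double d2 = i[2] - i[1], d3 = i[1] - i[3];
  double g0 = f[0],                     g1 = (f[0] + f[1] + f[2]) / 2;
  double g2 = (f[0] - f[1] + f[2]) / 2, g3 = f[2];
  double y0 = d0*g0, y1 = d1*g1, y2 = d2*g2, y3 = d3*g3;
  double w0 = y0 + y1 + y2;
  double w1 = y1 - y2 - y3;

  printf("naive:    %.3f %.3f\n", o0, o1);   /* 4.500 6.000 */
  printf("winograd: %.3f %.3f\n", w0, w1);   /* 4.500 6.000 */
  return 0;
}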

45 Comparison to Previous Papers
[Table comparing Zhang 2015 (Virtex7 VX485t, 32-bit float), Qiu 2016 (Zynq XC7Z045, 16-bit fixed), and this paper (Arria 10, 16-bit fixed) on clock (MHz), performance (GOP/s), power (W), and energy efficiency (GOP/J); the Zhang 2015 design is the implementation from slide 37: 100 MHz, 61.6 GOP/s, 18.6 W]
Massive increase in performance due to the Winograd transform and storing all feature maps on-chip.
Among the first to break the TeraOP/s barrier on FPGA.

46 Experimental Evaluation
[Table comparing the Arria 10 DLA (20 nm), Nvidia Titan X (28 nm), and Nvidia M4 (28 nm) on throughput (img/s), power (W), and energy efficiency (img/s/W); the benchmark app is AlexNet]
Results show FPGAs can compete with GPUs in energy efficiency.
The Titan X numbers ignore communication overhead and use random data instead of real images (highly optimistic).

47 Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru Zhang 1
1 Electrical and Computer Engineering, Cornell University; 2 Electronics Engineering and Computer Science, Peking University; 3 Electrical Engineering, University of California, Los Angeles; 4 Computer Science and Engineering, University of California San Diego
FPGA'17, Feb 2017

48 CNNs with Reduced Numerical Precision Hardware architects widely apply fixed-point optimization for CNN acceleration. Motivation: both neural nets and image/video apps naturally tolerate small amounts of noise. Approach: take a trained floating-point model and apply quantization; 16- or 8-bit fixed-point has been shown to be practical. Can we go even lower by training a reduced-numerical-precision CNN from the ground up?
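As a minimal sketch of the "quantize a trained floating-point model" step, here is one simple symmetric fixed-point quantizer; the bit widths, rounding, and saturation policy are illustrative assumptions, not the scheme of any paper cited here.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Quantize x to signed fixed-point with total_bits total width and
   frac_bits fractional bits, with rounding and saturation. */
static int32_t to_fixed(float x, int total_bits, int frac_bits) {
  int32_t max_val =  (1 << (total_bits - 1)) - 1;
  int32_t min_val = -(1 << (total_bits - 1));
  long v = lroundf(x * (float)(1 << frac_bits));
  if (v > max_val) v = max_val;
  if (v < min_val) v = min_val;
  return (int32_t)v;
}

int main(void) {
  /* 8 total bits, 6 fractional bits (roughly an ap_fixed<8,2>-style format) */
  float w = 0.73f;
  int32_t q = to_fixed(w, 8, 6);
  printf("%f -> %d (dequantized %f)\n", w, (int)q, q / 64.0 /* 2^frac_bits */);
  return 0;
}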

49 Aggressively Quantized CNNs ML research papers: BinaryConnect [NIPS] Dec 2015, BNN [arXiv] Mar 2016, Ternary-Net [arXiv] May 2016, XNOR-Net [ECCV] Oct 2016, HWGQ [CVPR] Jul 2017, LR-Net [arXiv] Oct 2018, and many more! Near state-of-the-art on MNIST, CIFAR-10, and SVHN at the time of publication. [1] Matthieu Courbariaux et al., "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," arXiv preprint, Feb 2016.

50 CNN vs. BNN
CNN: Input Map * Weights = Output Map
BNN: Input Map (binary) * Weights (binary) = x_i (integer) → batch normalization → binarization → Output Map (binary)
Key differences:
1. Inputs are binarized (-1 or +1)
2. Weights are binarized (-1 or +1)
3. Results are binarized after batch normalization
Batch normalization: y_i = ((x_i - μ) / sqrt(σ² + ε)) · γ + β
Binarization: z_i = +1 if y_i ≥ 0, and -1 otherwise
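A minimal C sketch of the per-element output path described above (batch normalization followed by binarization); the parameter values in the test are illustrative.

#include <math.h>
#include <stdio.h>

/* Batch-normalize an integer accumulator value x and binarize it to +/-1.
   mu, sigma2, gamma, beta, eps are the layer's batch-norm parameters. */
static int bnn_activation(int x, float mu, float sigma2,
                          float gamma, float beta, float eps) {
  float y = ((float)x - mu) / sqrtf(sigma2 + eps) * gamma + beta;
  return (y >= 0.0f) ? +1 : -1;
}

int main(void) {
  /* x = 5 with mu=0, sigma2=1, gamma=1, beta=0 -> positive -> +1 */
  printf("%+d\n", bnn_activation(5, 0.f, 1.f, 1.f, 0.f, 1e-5f));
  return 0;
}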

51 Advantages of BNN
1. Floating point ops replaced with binary logic ops
   b1   b2   b1 · b2        b1   b2   b1 XOR b2
   +1   +1     +1            0    0       0
   +1   -1     -1            0    1       1
   -1   +1     -1            1    0       1
   -1   -1     +1            1    1       0
   Encode {+1, -1} as {0, 1} → multiplies become XORs
   Conv/dense layers do dot products → XOR and popcount
   Operations can map to the LUT fabric as opposed to DSPs
2. Binarized weights may reduce total model size
   But note that fewer bits per weight may be offset by having more weights
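A small C example of the XOR-and-popcount dot product, using the slide's encoding of +1 as 0 and -1 as 1; the bit packing and the GCC/Clang popcount builtin are implementation choices, not from the paper.

#include <stdint.h>
#include <stdio.h>

/* Dot product of two length-n vectors with elements in {+1,-1}, packed one bit
   per element (+1 -> 0, -1 -> 1). The element-wise product maps to XOR, so
   dot = n - 2 * popcount(a XOR b). */
static int bnn_dot(uint64_t a, uint64_t b, int n) {
  uint64_t mask = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
  return n - 2 * __builtin_popcountll((a ^ b) & mask);
}

int main(void) {
  uint64_t x = 0x4;   /* bits 0..3 = 0,0,1,0  ->  {+1, +1, -1, +1} */
  uint64_t w = 0x9;   /* bits 0..3 = 1,0,0,1  ->  {-1, +1, +1, -1} */
  /* expected: (+1)(-1) + (+1)(+1) + (-1)(+1) + (+1)(-1) = -2 */
  printf("dot = %d\n", bnn_dot(x, w, 4));
  return 0;
}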

52 BNN CIFAR-10 Architecture [2]
[Figure: feature map dimensions shrink from 32x32 to 16x16, 8x8, and 4x4 while the number of feature maps grows]
6 conv layers, 3 dense layers, 3 max pooling layers
All conv filters are 3x3
First conv layer takes in floating-point input
13.4 Mbits total model size (after hardware optimizations)
[2] M. Courbariaux et al., "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," arXiv preprint, Feb 2016.

53 BNN Accelerator Design Goals Target low-power embedded applications Design must be resource efficient to fit a small device Execute layers sequentially on a single module Optimize for batch size of 1 Store all feature maps on-chip Binarization makes feature maps smaller Weights are streamed from off-chip storage Synthesize RTL from C++ source 52

54 BNN Accelerator Architecture
[Figure: datapath in which input words pass through BitSel units and a variable-width line buffer into an array of convolvers; f_in input and f_out output streams are accumulated in an integer buffer and then pooled, batch-normalized, and binarized into output words; conv weights are streamed in]
Challenges:
Many diverse sources of parallelism (across/within images, feature maps, subword)
Design must handle various layer types with different-sized feature maps
Slow interface between accelerator and general-purpose memory system
Our solution:
Highly parallel and pipelined architecture with a parameterized number of convolvers
Novel variable-width line buffer ensures the pipeline is fully utilized
Careful memory layout and BitSel unit enable word-by-word data processing instead of pixel-by-pixel

55 BNN HLS Design
User writes and tests in C++; the CPU-FPGA interface is automatically synthesized (by Xilinx SDSoC).
Significant reduction in verification time: the BNN RTL takes days to simulate.

VariableLineBuffer linebuf;
ConvWeights wts;
IntegerBuffer outbuf;

for (i = 0; i < n_input_words; i++) {
#pragma HLS pipeline
  // read input word, update linebuffer
  WordType word = input_data[i];
  BitSel(linebuf, word, input_width);

  // update the weights each time we
  // begin to process a new fmap
  if (i % words_per_fmap == 0)
    wts = weights[i / words_per_fmap];

  // perform conv across linebuffer
  for (c = 0; c < LINE_BUF_COLS; c++) {
#pragma HLS unroll
    outbuf[i % words_per_fmap][c] +=
      conv(c, linebuf, wts);
  }
}

HLS code for part of the convolver unit

56 FPGA Implementation
Misc. HW optimizations: 1. Quantized the input image and batch norm params 2. Removed additive biases 3. Simplified the batch norm computation
BNN model test error: claimed in paper [2]: 11.40%; Python out-of-the-box [2]: 11.58%; C++ optimized model: 11.19%; accelerator: 11.19%
Platforms: FPGA: ZedBoard with Xilinx Zynq-7000; mGPU: Jetson TK1 embedded GPU board; CPU: Intel Xeon multicore processor; GPU: NVIDIA Tesla K40
[Table: runtime per image (ms), speedup (mGPU 1x, CPU 6x, GPU 120x, FPGA 15x), power (W), and images/sec/Watt for the four platforms]
R. Zhao et al., "Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs," International Symposium on Field-Programmable Gate Arrays, Feb 2017.

57 Additional Useful Resources Recent papers on neural networks on silicon Tutorial on hardware architectures for DNNs Landscape of neural network inference accelerators Most cited deep learning papers (since 2012) 56

58 Acknowledgements This tutorial contains/adapts materials developed by Ritchie Zhao (PhD student at Cornell) Authors of the following papers Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks (FPGA 15, PKU-UCLA) An OpenCL Deep Learning Accelerator on Arria 10 (FPGA 17, Intel) 57
