High-Throughput and High-Accuracy Classification with Convolutional Ternary Neural Networks. Frédéric Pétrot, Adrien Prost-Boucle, Alban Bourge


1 High-Throughput and High-Accuracy Classification with Convolutional Ternary Neural Networks Frédéric Pétrot, Adrien Prost-Boucle, Alban Bourge International Workshop on Highly Efficient Neural Processing October 4th 2018

2 CNN Models: Accuracy, Operations and Parameters
Source: A. Canziani, E. Culurciello, A. Paszke, An Analysis of Deep Neural Network Models for Practical Applications, 2017
Pétrot, Prost-Boucle, Bourge, HENP'18, October 4th

3 CNN Models: Accuracy, Operations and Parameters
Challenges in Embedded Neural Networks
- Limit number of parameters (weight values)
- Limit number of bits of weights and activations
- Integrate many memory cuts with processing elements
- Integrate computation into the memory itself


7 CNN Models: Accuracy, Operations and Parameters
Challenges in Embedded Neural Networks
- Limit number of parameters (weight values)
- Limit number of bits of weights and activations
- Integrate many memory cuts with processing elements
- Integrate computation into the memory itself
In other words, ... (K. Usher, The Dwarf in the Dirt, Bones, 2009)
Let's Use Ternary {-1, 0, 1} weights and activations on FPGA
- FPGA: great digital PIM
- Hardwiring ANN too risky: new and better ANN every other day
Ternarization
[Table: classification error rates on NN-64 (%) for CIFAR-10, SVHN and GTSRB, float vs. ternary; numeric values not preserved]
H. Alemdar et al., Ternary neural networks for resource-efficient AI applications, IJCNN'17

8 Why Ternary Convolutional Neural Networks?
Objectives
- Energy-efficient inference for AI tasks
- Without sacrificing too much accuracy (valid at a point in time: learning methods improve continuously)
Solution
- Ternarize {-1, 0, 1} weights and activations (a)
- Sweet spot between resource usage and accuracy
(a) "Perhaps the prettiest number system of all is the balanced ternary notation." Donald Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms.

9 Training Ternary Neural Networks: Teacher-Student Approach
NN                   Teacher                                 Student
parameters           …                                       {-1, 0, 1}
neuron input         any                                     any
activation function  with (-1, 1) range, stochastic firing   2-threshold step
neuron output        {-1, 0, 1}                              {-1, 0, 1}
Teacher
\[ \rho = \tanh(y_i), \qquad n_i = \begin{cases} -1 \text{ with prob. } -\rho & \text{if } \rho < 0 \\ 1 \text{ with prob. } \rho & \text{if } \rho > 0 \\ 0 & \text{otherwise} \end{cases} \]
Student
\[ n_i = \begin{cases} -1 & \text{if } y_i < b_{lo} \\ 1 & \text{if } y_i > b_{hi} \\ 0 & \text{otherwise} \end{cases} \]
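The two firing rules on the teacher-student slide can be sketched in a few lines of Python. This is only an illustration of the formulas as stated there, not the authors' training code: the function names `teacher_fire` and `student_fire` and the example thresholds are assumptions made for this sketch.

```python
import math
import random


def teacher_fire(y, rng=random):
    """Stochastic ternary firing of a teacher neuron (illustrative sketch).

    rho = tanh(y); the neuron outputs the sign of rho with probability |rho|
    and 0 otherwise.
    """
    rho = math.tanh(y)
    if rng.random() < abs(rho):
        return 1 if rho > 0 else -1
    return 0


def student_fire(y, b_lo, b_hi):
    """Deterministic 2-threshold step of a student neuron (illustrative sketch)."""
    if y < b_lo:
        return -1
    if y > b_hi:
        return 1
    return 0


# Hypothetical thresholds, chosen only for the demonstration.
student_outputs = [student_fire(y, b_lo=-1, b_hi=1) for y in (-3, 0, 2)]
# A fixed seed keeps the stochastic demonstration repeatable.
teacher_outputs = [teacher_fire(0.5, random.Random(0)) for _ in range(5)]
```

The student rule needs only two comparisons per neuron, which is what makes it cheap to implement in hardware once the thresholds are known.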

10 Ternary Neural Networks: Teacher-Student Individual Training

11 Experiments with Ternary Networks
Multiple networks
- VGG-like networks
- Two geometries: NN-64 and NN-128
- Multiple acceleration factors inside network (ranging from 1 to 256)
- Tradeoff area/throughput/energy
Automation (kind of...)
- Handmade generic hardware building blocks (VHDL)
- Automatically generated networks
- Customizable home-made tools (old-school C and Tcl)

12 Overview
- 12 theoretical layers
- 29 physical layers (+30 glue FIFOs)
- NN64: 1930 neurons, 3.5 mega parameters
- NN128: 3850 neurons, 14 mega parameters
Goal of ternary: have parameters fit in FPGA distributed memories
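To see why ternary weights help parameters fit in on-chip memory, the storage needed for the two networks can be estimated from the parameter counts on the overview slide. The sketch below assumes a 32-bit floating-point baseline and the naive 2-bit-per-trit encoding; both are assumptions made for illustration, and `weight_storage_mbit` is a hypothetical helper name.

```python
def weight_storage_mbit(num_params, bits_per_weight):
    """Storage for all weights, in megabits (10^6 bits)."""
    return num_params * bits_per_weight / 1e6


# Parameter counts taken from the overview slide.
networks = {"NN64": 3.5e6, "NN128": 14e6}

footprints = {
    name: {
        "float32_mbit": weight_storage_mbit(params, 32),  # assumed baseline
        "ternary_2bit_mbit": weight_storage_mbit(params, 2),  # naive trit encoding
    }
    for name, params in networks.items()
}

for name, sizes in footprints.items():
    print(name, sizes)
```

Under these assumptions the ternary encoding is 16 times smaller than the float baseline.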

13 Overview

14 FPGA Design for Ternary Convolutional Neural Networks
Neurons and Parallelism
Max NN-64 acceleration factor that fits on a VC709: 128

15 Acceleration factors
- Base implementation: at most 1 activation transferred between layers in 1 cycle
- But some layers take more time than others to compute, which causes stalls
- Use parallelism at layer level: transfer several activations in 1 cycle in/out of bottleneck layers
[Table: parallelism per layer (in/out) for each NN size and acceleration factor; columns NL1, NL2, MPL1, NL3, NL4, MPL2, NL5, NL6, MPL3, NL7, NL8, NL; cell values not recoverable]
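The stall argument above can be illustrated with a toy model of a layer pipeline in which the slowest layer sets the pace for the whole chain. The per-layer workloads and parallelism degrees below are hypothetical numbers chosen for the example, not values from the slide's table, and `frame_cycles` is a name introduced here.

```python
import math


def frame_cycles(layer_work, parallelism):
    """Cycles per frame of a layer pipeline; the slowest stage is the bottleneck.

    layer_work:  activations each layer must process per frame
    parallelism: activations each layer handles per cycle
    """
    return max(math.ceil(w / p) for w, p in zip(layer_work, parallelism))


# Hypothetical workloads for a four-layer pipeline.
work = [4096, 16384, 1024, 8192]

base = frame_cycles(work, [1, 1, 1, 1])   # every layer moves 1 activation per cycle
tuned = frame_cycles(work, [1, 4, 1, 2])  # extra parallelism only on the bottlenecks

print(base, tuned)
```

In this toy case, adding parallelism only to the two slowest layers cuts the frame time by a factor of four, while giving the other layers more parallelism would add area without improving throughput.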

16 Squeezing High-Efficiency TCNN in FPGA: Adder Trees
Ternary adders: sum of trits vs. sum of bits
With $(x, y) \in \{-1, 0, 1\}^2$, $x + y \in \{-2, -1, 0, 1, 2\}$
LUT savings with optimized ternary adder tree
[Table: number of inputs, generic 2-bit radix-2 adder tree (LUT), optimized ternary adder (LUT); only the savings row is preserved]
Savings: 33.3%, 57.1%, 52.3%, 51.1%, 50.3%, 51.6%, 51.8%, 51.0%
Overall LUT savings when using optimized ternary adder trees, per acceleration factor
Savings for NN…: …%, 1.38%, 4.61%, 10.9%, 17.3%, 24.1%, 32.0%
Savings for NN…: …%, 1.71%, 5.63%, 12.9%, 19.6%, 25.7%
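As a software analogy for the adder-tree slide, the sketch below reduces a list of trits pairwise, level by level, and enumerates the values a first-level adder can produce. It models only the arithmetic, not the LUT mapping that yields the reported savings; `ternary_adder_tree` is a name made up for this example.

```python
def ternary_adder_tree(trits):
    """Sum trits from {-1, 0, 1} with a pairwise (tree-shaped) reduction."""
    level = list(trits)
    if any(t not in (-1, 0, 1) for t in level):
        raise ValueError("inputs must be trits in {-1, 0, 1}")
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels with a neutral element
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


# Values a first-level adder can output for two trit inputs.
first_level_sums = sorted({x + y for x in (-1, 0, 1) for y in (-1, 0, 1)})

print(first_level_sums, ternary_adder_tree([1, -1, 1, 1, 0]))
```

A first-level adder only ever sees 9 input combinations and 5 output values, fewer than a generic adder on two 2-bit operands has to handle.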

17 Squeezing High-Efficiency TCNN in FPGA: Weight Compression
Trits encoding
- Naïve encoding: 2 bits to encode 1 trit, which is suboptimal
- Example: 3 trits are encoded on 6 bits, while the $3^3 = 27$ combinations can be encoded on 5 bits
- Optimal number of bits for $T$ trits: $b = \log_2(3^T) = T \log_2(3)$
- Minimal number of bits per trit: $b/T = \log_2(3) \approx 1.585$ bits
- Maximum saving: $1 - \log_2(3)/2 \approx 20.75\,\%$ (Shannon limit)
Interesting cases
- 3 trits / 5 bits: 16 % saving
- 5 trits / 8 bits: 20 % saving
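The encoding argument can be checked with a small packing sketch. It treats a group of trits as the digits of a base-3 number, which is one straightforward way to reach 3 trits in 5 bits or 5 trits in 8 bits; the slides do not say which code the hardware actually uses, so the base-3 layout and the names `pack_trits`, `unpack_trits` and `saving` are assumptions.

```python
import math


def pack_trits(trits):
    """Pack trits from {-1, 0, 1} into one integer (first trit = least significant digit)."""
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, 1 to base-3 digits 0, 1, 2
    return value


def unpack_trits(value, count):
    """Inverse of pack_trits for a group of `count` trits."""
    trits = []
    for _ in range(count):
        value, digit = divmod(value, 3)
        trits.append(digit - 1)
    return trits


def saving(trits, bits):
    """Fraction of storage saved relative to the naive 2 bits per trit."""
    return 1 - bits / (2 * trits)


code = pack_trits([1, -1, 0])
assert unpack_trits(code, 3) == [1, -1, 0]

shannon_limit = 1 - math.log2(3) / 2
print(f"3t5b: {saving(3, 5):.2%}, 5t8b: {saving(5, 8):.2%}, limit: {shannon_limit:.2%}")
```

The largest 3-trit code is 26 and the largest 5-trit code is 242, so they fit in 5 and 8 bits respectively. The 3t5b saving computes to about 16.7 %, which the slide lists as 16 %.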

18 Squeezing High-Efficiency TCNN in FPGA: Weight Compression
Compression of weights

19 Squeezing High-Efficiency TCNN in FPGA: Weight Compression
Trading off BRAM vs. logic: Resources Breakdown and Power Analysis
[Figure: two panels, NN-64 with compression 3t5b and NN-64 with compression 5t8b; y-axis: % of change w.r.t. the same degree of parallelism without compression]

20 Measured Results for NN-64 on Xilinx … MHz (VC709)
Resource usage
Acc. factor   LUT (logic)   LUTRAM       BRAM 18k       FF
…             … (39.4%)     … (21.47%)   1410 (48.0%)   … (37.1%)
256*          … (69.9%)     … (60.9%)    2920 (96.7%)   … (74.0%)
NN-64 with parallelism degree 128
- Uses half of the FPGA resources, reaches max throughput of 60.2k fps (32×32)
- (LUT+B)RAM throughput of 18.7 Tb/s (290 Gb/s for FC layers only (a))
- End-to-end latency including PCIe + RIFFA: 135 µs
- Max performance: 18.7 T(T)OP/s (9.33 T(T)MAC/s)
- Power at max performance: 11.5 W (idle FPGA: 2 W)
- Peak efficiency of 5226 fps per watt: 1.62 T(T)OP/s/W or 810 G(T)MAC/s/W
(a) VC709 DRAM throughput: 204 Gb/s
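The efficiency figures on the results slide follow from the throughput, power and MAC rate by simple division, which the sketch below reproduces. It assumes 2 operations per MAC, consistent with the 18.7 T(T)OP/s versus 9.33 T(T)MAC/s pair; `efficiency` is a hypothetical helper, and small differences from the slide's numbers come from rounding of the published inputs.

```python
def efficiency(fps, power_w, macs_per_s, ops_per_mac=2):
    """Derive per-watt and per-frame figures from measured throughput and power."""
    ops_per_s = macs_per_s * ops_per_mac
    return {
        "fps_per_watt": fps / power_w,
        "tops_per_watt": ops_per_s / power_w / 1e12,
        "gmacs_per_watt": macs_per_s / power_w / 1e9,
        "mega_ops_per_frame": ops_per_s / fps / 1e6,
    }


# Inputs taken from the slide: 60.2k fps, 11.5 W, 9.33 T(T)MAC/s.
result = efficiency(fps=60.2e3, power_w=11.5, macs_per_s=9.33e12)

for key, value in result.items():
    print(f"{key}: {value:.2f}")
```

The recomputed values land within about 1 % of the slide's 5226 fps/W, 1.62 T(T)OP/s/W and 810 G(T)MAC/s/W.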

21 Take away
- FPGAs are a very good fit for ANN if weights fit in internal memory
  - Extreme quantization needed
  - Huge weight access throughput possible
- FPGAs are reconfigurable! Who would be mad enough to hardwire a given ANN architecture anyway?
  - ASICs follow a very different (but equally useful) architectural path
- Creative low-level optimizations help squeeze in high-efficiency networks (NN-64)
  - High throughput: up to 60.2k fps
  - Low latency: … k fps
  - High power efficiency: 1.62 T(T)OP/s/W or … k fps
