Scaling Convolutional Neural Networks on Reconfigurable Logic - Michaela Blott, Principal Engineer, Xilinx Research


1 Scaling Convolutional Neural Networks on Reconfigurable Logic - Michaela Blott, Principal Engineer, Xilinx Research

2 Team: Nick Fraser (Xilinx & USydney), Yaman Umuroglu (Xilinx & NTNU), Giulio Gambardella (Xilinx). Mission: investigate & exploit novel trends in machine learning that play to the strengths of FPGAs: reduced-precision neural networks.

3 Executive Summary: FPGAs can do trillions of reduced-precision synaptic operations per second, and neural nets can put this to good use. The result is inference accelerators that classify tens of thousands to millions of images per second at < 25 W on today's hardware.

4 Background

5 Convolutional Neural Networks: CNN computation is linear algebra, originally on floating-point data types. It demands lots of computation and lots of parameters (memory); for ImageNet, AlexNet needs 244 MB & 1.5 GOPS, VGG16 552 MB & 30.8 GOPS, and GoogleNet 41.9 MB & 3.0 GOPS. This is not suitable for energy-constrained computing environments. The core operation is the accumulation Output(w,h,m) += input(w+x,h+y,d) * filter(m,x,y,d); the challenge is billions of floating-point multiply-accumulate ops & tens of megabytes of parameter data.
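
The single accumulation statement above expands into a loop nest over output pixels, output feature maps and the filter's receptive field. A minimal illustrative sketch in plain C++ (array layouts, padding handling and dimension names are assumptions made here, not the FINN implementation):

  // Naive direct convolution: W x H x M outputs, D input channels, K x K filters.
  // The input is assumed to be (H+K-1) x (W+K-1) x D, row-major, channels innermost.
  void conv_layer(const float *input, const float *filter, float *output,
                  int W, int H, int D, int M, int K) {
      for (int m = 0; m < M; m++)                  // output feature maps
          for (int h = 0; h < H; h++)              // output rows
              for (int w = 0; w < W; w++) {        // output columns
                  float acc = 0.0f;
                  for (int d = 0; d < D; d++)          // input channels
                      for (int y = 0; y < K; y++)      // filter rows
                          for (int x = 0; x < K; x++)  // filter columns
                              acc += input[((h + y) * (W + K - 1) + (w + x)) * D + d]
                                   * filter[((m * K + y) * K + x) * D + d];
                  output[(h * W + w) * M + m] = acc;   // Output(w,h,m)
              }
  }

With floating-point data, this loop nest is exactly the billions of multiply-accumulates and tens of megabytes of weights the slide describes; the rest of the deck is about shrinking both.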

6 Increasingly Reduced Precision Networks: Floating-point (FP) CNNs contain a lot of redundancy. Reducing precision is shown to work down to 2 bits without loss of accuracy (B. Dally, EMDNN 2016), with ternary networks on par with FP for AlexNet top-1 and top-5 and for ResNet-20/32/44/56. Reducing to the extreme, binary and almost-binary neural networks (BNNs) work at a small loss of accuracy for large networks. Classification error in % (lower is better), sources [4][5]:
  Quantization                          MNIST   SVHN    CIFAR-10
  Binary weights, binary activations    0.96%   2.53%   10.15%
  Binary weights, FP activations        1.29%   2.30%    9.90%
  FP weights, FP activations            0.94%   1.69%    7.62%
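
For reference, the binarization behind these numbers is conceptually just a sign function applied to weights (and, for fully binarized networks, to activations as well). A minimal sketch, not tied to any particular training framework; the ternary threshold t is an illustrative parameter:

  #include <cstdint>

  // Quantize a floating-point value to {-1, +1} (binary) or {-1, 0, +1} (ternary).
  inline int8_t binarize(float w) { return (w >= 0.0f) ? +1 : -1; }

  inline int8_t ternarize(float w, float t) {      // t: assumed ternarization threshold
      if (w >  t) return +1;
      if (w < -t) return -1;
      return 0;
  }

Training typically keeps full-precision shadow weights and applies such quantizers only in the forward pass, which is how the accuracy loss is kept small.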

7 Accuracy of Binary Networks Is Improving: published results for FP CNNs, BNNs and extreme reduced-precision NNs. [Chart: Top-5 error on ImageNet versus publication date, 07/2015 through 09/2017, for CNNs, reduced-precision networks and BNNs.] BNNs are new, and accuracy results are improving rapidly, for example through new training techniques, topological changes and other methods.

8 Potential of Reduced Precision on FPGAs: Cost per operation is greatly reduced; for a BNN, for example, a multiply-accumulate becomes an XNOR plus a bit count. Memory cost is greatly reduced: large networks can fit entirely into on-chip memory (OCM: UltraRAM, BRAM; a VU9P at 16nm offers 43 MB), giving more memory bandwidth at lower energy. Today's FPGAs (100Ks of LUTs, Ks of DSPs) have a much higher peak performance for reduced-precision operations: when applications are sufficiently parallel, FPGA performance is inversely proportional to the cost per operation, and lower cost per op plus massive parallelism means more ops every cycle. The slide's table compares, per precision, cost per op in LUTs and DSPs, MB needed for AlexNet, and peak TOps/s; the peak-throughput columns (roughly 100x separates 32b from 1b):
  Precision   TOps/s (KU115)*   TOps/s (ZU19EG)*
  1b          ~46               ~66
  4b          ~11               ~16
  8b          ~3                ~4
  16b         ~1                ~1
  32b         ~0.5              ~0.3
  *Assumptions: application can fill the device to 70% (fully parallelizable), 250 MHz.
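
To make "multiply-accumulate becomes XNOR with bit counts" concrete, here is a minimal C++ sketch of a binary dot product over 64 packed weight/activation bits (the 64-bit packing and the function name are illustrative assumptions, not FINN's actual kernel):

  #include <bit>       // std::popcount (C++20)
  #include <cstdint>

  // Dot product of two 64-element {-1,+1} vectors packed one bit per element
  // (bit = 1 encodes +1, bit = 0 encodes -1).
  // XNOR marks positions where the operands agree; popcount counts them.
  // dot = (#matches) - (#mismatches) = 2 * popcount(~(a ^ b)) - 64.
  inline int binary_dot64(uint64_t a, uint64_t b) {
      uint64_t matches = ~(a ^ b);              // bitwise XNOR
      return 2 * std::popcount(matches) - 64;
  }

One LUT-based XNOR/popcount datapath like this replaces a full multiplier and accumulator, which is where the large per-operation cost reduction in the table comes from.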

9 Potential of BNNs on FPGAs (ZU19EG): Fewer LUTs per op yields a higher peak performance (1 TOPS at 8b versus 66 TOPS at 1b), and staying on-chip achieves more of that peak (0.1 TOPS versus roughly 40 TOPS). Reduced precision allows us to scale NN performance on FPGAs to unprecedented levels. Assumption: operational intensity for 8b and 1b AlexNet, assuming 1.45 GOps/image and 61 MB (8b) / 7.6 MB (1b) of weights.
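
A rough roofline-style sanity check of that reasoning (the off-chip bandwidth figure below is my own illustrative assumption, so the printed numbers will differ from the slide's roofline; the per-image ops, weight sizes and peaks come from the slides):

  #include <algorithm>
  #include <cstdio>

  // Roofline: attainable throughput = min(compute peak, operational intensity * bandwidth).
  double attainable_tops(double ops_per_byte, double bytes_per_s, double peak_tops) {
      return std::min(peak_tops, ops_per_byte * bytes_per_s / 1e12);
  }

  int main() {
      const double ops  = 1.45e9;     // AlexNet ops per image (slide)
      const double dram = 20.0e9;     // assumed off-chip bandwidth, 20 GB/s
      // 8b weights: 61 MB/image, ~1 TOps/s peak; 1b weights: 7.6 MB/image, 66 TOps/s peak
      printf("8b, weights off-chip: %.2f TOps/s\n", attainable_tops(ops / 61.0e6, dram, 1.0));
      printf("1b, weights off-chip: %.2f TOps/s\n", attainable_tops(ops / 7.6e6, dram, 66.0));
      printf("1b, weights on-chip:  up to %.0f TOps/s peak\n", 66.0);
      return 0;
  }

Keeping all weights in on-chip memory removes the DRAM term entirely, which is how a dataflow design gets close to the compute roofline.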

10 Exploitation of Reduced Precision Neural Networks through FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

11 FINN Design Principles: Custom-tailored hardware for optimal performance and power efficiency, with customized data types and a customized dataflow architecture that matches the network topology. Exploit compile-time optimizations to simplify the generated hardware. Keep all parameters on-chip for higher energy efficiency and performance. Provide flexibility in the architecture to scale solutions. Support portability and rapid exploration through high-level design entry, in C++ and most recently in OpenCL.

12 Heterogeneous Dataflow Architecture: Not a systolic array (with a scheduling network over the processing engines and looping over the PEs), but a dataflow architecture customized to match each layer's compute requirement, so that all layers have equivalent throughput and one-size-fits-all penalties are avoided (for example, a 1 MOPS layer gets 1 PE while a 10 MOPS layer gets 10 PEs). Each layer consumes and produces data in the same order to minimize buffering and latency: FIFOs only, no ping-pong buffers. Layers are different instantiations of a C++ template class (the MVTU).
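
A minimal sketch of that balancing: pick each layer's parallelism (PE count) in proportion to its operation count so that every layer finishes a frame within the same cycle budget. The function name and the simplification that one PE retires one op per cycle are assumptions for illustration, not FINN's actual folding algorithm:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Given per-layer op counts and a per-frame cycle budget, choose a PE count
  // per layer so that ops / PEs fits within the budget for every layer.
  std::vector<int> allocate_pes(const std::vector<int64_t> &layer_ops,
                                int64_t cycles_per_frame) {
      std::vector<int> pes;
      pes.reserve(layer_ops.size());
      for (int64_t ops : layer_ops) {
          int64_t p = (ops + cycles_per_frame - 1) / cycles_per_frame;  // ceil division
          pes.push_back(static_cast<int>(std::max<int64_t>(1, p)));
      }
      return pes;
  }

  // Example: with a 1M-cycle frame budget, a 1 MOPS layer gets 1 PE and a
  // 10 MOPS layer gets 10 PEs, matching the slide's illustration.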

13 Architecture of a Matrix-Vector Threshold Unit (MVTU): Fully connected layers and convolutional layers are both mapped onto matrix-vector-threshold units (MVTUs). MVTUs support folding over output feature maps / neurons and folding over weights / synapses (OFM folding and weight folding). They are weight- and output-stationary: weights and popcounts are retained locally. Max-pool units are optionally placed behind the MVTUs.
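
At its core, an MVTU computes a binary matrix-vector product and then compares each accumulated popcount against a per-neuron threshold, producing the next layer's binary activations. A compact behavioural sketch in plain C++ (packing, container types and names are illustrative assumptions, not the streaming HLS implementation):

  #include <bit>
  #include <cstdint>
  #include <vector>

  // One output neuron: XNOR-popcount accumulation over packed 64-bit words,
  // then binarization by comparison against a precomputed threshold.
  inline bool mvtu_neuron(const std::vector<uint64_t> &weight_row,
                          const std::vector<uint64_t> &activations,
                          int threshold) {
      int acc = 0;
      for (size_t i = 0; i < weight_row.size(); ++i)
          acc += std::popcount(~(weight_row[i] ^ activations[i]));  // XNOR + bit count
      return acc >= threshold;                                      // thresholding = activation
  }

  // Matrix-vector-threshold: one packed weight row per output neuron (OFM).
  std::vector<bool> mvtu(const std::vector<std::vector<uint64_t>> &weights,
                         const std::vector<uint64_t> &activations,
                         const std::vector<int> &thresholds) {
      std::vector<bool> out(weights.size());
      for (size_t n = 0; n < weights.size(); ++n)
          out[n] = mvtu_neuron(weights[n], activations, thresholds[n]);
      return out;
  }

Folding in the real unit simply time-multiplexes this work: neuron (OFM) folding shares a PE across several output rows, and weight (synaptic) folding processes each row one SIMD word at a time.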

14 Synthesizable C++ Network Description:

  void DoCompute(ap_uint<64> *in, ap_uint<64> *out) {
  #pragma HLS DATAFLOW
      // Stream definitions
      stream<ap_uint<64> > memInStrm("memInStrm");
      stream<ap_uint<64> > inStrm("inStrm");
      // intermediate streams (declarations elided on the original slide; widths assumed)
      stream<ap_uint<16> > inter0("inter0"), inter1("inter1"), inter2("inter2");
      stream<ap_uint<16> > outStream("outStream");
      stream<ap_uint<64> > memOutStrm("memOutStrm");

      // Move image in from PS memory
      Mem2Stream<64, inBytesPadded>(in, memInStrm);
      // (width adaptation from memInStrm to inStrm is elided on the slide)

      // Layer instantiations, connected by streams
      StreamingMatrixVector<L0_SIMD, L0_PE, 16, L0_MW, L0_MH, L0_WMEM, L0_TMEM>
          (inStrm, inter0, weightMem0, thresMem0);
      StreamingMatrixVector<L1_SIMD, L1_PE, 16, L1_MW, L1_MH, L1_WMEM, L1_TMEM>
          (inter0, inter1, weightMem1, thresMem1);
      StreamingMatrixVector<L2_SIMD, L2_PE, 16, L2_MW, L2_MH, L2_WMEM, L2_TMEM>
          (inter1, inter2, weightMem2, thresMem2);
      StreamingMatrixVector<L3_SIMD, L3_PE, 16, L3_MW, L3_MH, L3_WMEM, L3_TMEM>
          (inter2, outStream, weightMem3, thresMem3);

      // Move results to PS memory
      StreamingCast<ap_uint<16>, ap_uint<64> >(outStream, memOutStrm);
      Stream2Mem<64, outBytesPadded>(memOutStrm, out);
  }

15 Work Flow for Exploration of NNs on FPGAs: First prototype integration with tiny-dnn and Theano (TensorFlow and Caffe in progress). All code is in C/C++ and can execute on CPU or FPGA; no RTL is needed. The result is a fast workflow, integrated with standard frameworks, with the flexibility to support different topologies, sizes, rates and resources across different devices (Z7045, KU115, Z7020).

16 Experimental Results: Embedded platforms (Zynq Z7045 & Z7020): ZC706 and the PYNQ open-source platform. Server-class accelerator: ADM-PCIE-8K5 in an OpenPOWER system (and x86 with SDAccel).

17 Input Data: Solitaire demo (Xilinx demo center & Embedded World); MNIST handwritten digits; Street View house numbers (SVHN); CIFAR-10 (cats, dogs, etc.); playing cards; ImageNet in progress now.

18 Test Networks: MLP (multilayer perceptron): input images are 28x28-pixel black & white handwritten digits; 3 FC layers with 1024 neurons each; compute: MOPS/frame. CNV (a VGG-16 derivative CNN): input images are 32x32-pixel RGB images (SVHN, CIFAR-10, traffic signs); layers: 2 (3x3) Conv + MaxPool + 2 (3x3) Conv + MaxPool + 2 Conv + 3 FC; compute: GOPS/frame.
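
As a back-of-the-envelope check on where the per-frame op counts come from, here is an illustrative calculation for the MLP (counting one multiply-accumulate as two operations and assuming a final 10-way classifier layer; the printed figure is my own estimate, not a number from the slide):

  #include <cstdint>
  #include <cstdio>

  // Ops for one fully connected layer: 2 * inputs * outputs (MAC counted as 2 ops).
  constexpr int64_t fc_ops(int64_t in, int64_t out) { return 2 * in * out; }

  int main() {
      // 28x28 inputs -> 1024 -> 1024 -> 1024 -> 10 classes (shapes per the slide,
      // 10-way output layer assumed for the handwritten-digit task)
      int64_t ops = fc_ops(28 * 28, 1024)
                  + fc_ops(1024, 1024)
                  + fc_ops(1024, 1024)
                  + fc_ops(1024, 10);
      printf("MLP: ~%.1f MOPS per frame\n", ops / 1e6);   // on the order of a few MOPS
      return 0;
  }

The CNV network's convolution layers dominate its count in the same way, which is why it lands in the GOPS-per-frame range.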

19 Results - Performance, Latency, Power & Resources (FPS, with resource utilization in parentheses):
  Max throughput, Z7045: FC-MNIST-S 12.3M FPS (39%); FC-MNIST-L 1.5M FPS (36%); CNV-CIFAR10-S 21.9K FPS (25%)
  Fixed ~12K FPS target, Z7045: FC-MNIST-S 12.2K FPS (2%); FC-MNIST-L 12.2K FPS (3%); CNV-CIFAR10-S 11.6K FPS (18%)
  KU115: CNV-CIFAR10-L 12.0K FPS (59%), 671 GOPS/s, < 41 W
  Takeaways: unprecedented classification rates, comparable to AlexNet; scalability to extremely small footprints; ultra-low latency (P4: ~11 ms), for robotics, AR and UAVs; roughly 3x the classification rate of the best measured GPU numbers today.

20 Status & Next Steps: Initial proofs of concept & demos are operational and demonstrate the potential; there is an open-source release on PYNQ with a Python API. We continue to progress the technology investigation: large NNs; higher (but no more than 8b!) & mixed-precision NNs; improving accuracy through novel techniques; design-space trade-offs between accuracy, performance and resources. We are also interested in understanding system-level integration better: how does ML plug into database systems? Heterogeneous at the system, node, or device level?

21 Thank You
