Direct Hardware Mapping of CNNs on FPGA-Based Smart Cameras


1 Direct Hardware Mapping of CNNs on FPGA-Based Smart Cameras
Workshop on Architecture of Smart Cameras, Cordoba, June 3, 2017
Kamel ABDELOUAHAB, Francois BERRY, Maxime PELCAT, Jocelyn SEROT, Jean-Charles QUINTON
DREAM, Institut Pascal, Clermont-Ferrand, France

2 Convolutional Neural Networks (CNNs)
- Deep CNNs are the state of the art in image classification
- Good candidates for implementation on smart camera nodes
- Computationally intensive
1 A. Canziani et al. An Analysis of Deep CNN Models for Practical Applications. arXiv, 2016
2 J. Malik et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. IEEE CVPR

3 CNN Structure
- Pipeline: (conv + act + pool) x L, then FC, then the label (e.g. "cat")
- Convolutional layers: feature extractors; computationally intensive (90% of execution time)
- Fully connected layers: classifiers; memory heavy
Figure (a), Computation Load: conv layers 2.41 GOP (97%), FC layers 0.06 GOP (3%)
Figure (b), Parameter Size: FC layers 468 Mbits (98%), conv layers 18 Mbits (2%)
Layers shown: conv1 to conv5, fc6 to fc8

4 Convolutional Layers
- Deep CNNs are deep because of their convolutional layers: from 5 to ...
- A large amount of parallelism

$$ f[n, i, j] = \mathrm{act}\left( b[n] + \sum_{c=1}^{C} \sum_{p=1}^{K} \sum_{q=1}^{K} \Phi[c, i+p, j+q] \cdot \Theta[n, c, p, q] \right) \qquad (1) $$

- C: number of input channels
- N: number of output units
- K: size of the convolution kernels
- Θ: learned parameters of a given layer
- b: learned bias offset
- act: non-linear activation function
- N·C convolutions = N·C·K² MAC operations
- Example: N = 5, C = 3
- Real-life CNNs: N = 256, C = 96, K = 5
Figure: layer graph for N = 5, C = 3. Convolutions conv00 to conv42 feed adders Σ0 to Σ4, followed by activations act0 to act4.
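The layer equation (1) can be checked with a direct software evaluation. The following is a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, indices are 0-based instead of 1-based, the output is restricted to positions where the kernel fits entirely in the input, and ReLU is assumed as the activation.

```python
def conv_layer(phi, theta, bias, act=lambda x: max(0.0, x)):
    """Evaluate eq. (1) for every output position.

    phi[c][i][j]      : input feature maps (C channels)
    theta[n][c][p][q] : kernels (N output units, K x K each)
    bias[n]           : bias offsets
    act               : activation (ReLU here, an assumption)
    """
    C = len(phi)
    H, W = len(phi[0]), len(phi[0][0])
    N = len(theta)
    K = len(theta[0][0])
    out_h, out_w = H - K + 1, W - K + 1  # kernel fully inside the input (assumption)
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(N)]
    for n in range(N):
        for i in range(out_h):
            for j in range(out_w):
                acc = bias[n]
                for c in range(C):          # sum over input channels
                    for p in range(K):      # sum over kernel rows
                        for q in range(K):  # sum over kernel columns
                            acc += phi[c][i + p][j + q] * theta[n][c][p][q]
                out[n][i][j] = act(acc)
    return out


def macs_per_output_position(N, C, K):
    """N*C convolutions of K*K taps each, as stated on the slide."""
    return N * C * K * K
```

With the slide's real-life figures (N = 256, C = 96, K = 5), `macs_per_output_position` gives 614400 MAC operations for each output position, which illustrates the amount of parallelism available in one layer.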

5 Hardware Accelerators for CNNs
- Trained on GPU clusters
- Deployed on dedicated hardware
- For smart camera nodes: FPGAs
  - Fine-grain parallelism
  - Hardware flexibility
  - Low power consumption (vs GPUs)
Figure: CPU, GPU and FPGA compared along three axes: Energy Efficiency, Throughput, dev.


6 Architectures of FPGA-Based CNN Accelerators
- Custom SIMD processors - OpenCL kernels
- Roofline model: computation-to-communication trade-off
- Fixed-point arithmetic for CNNs
1 Peemen - Memory-centric accelerator design for CNNs. ICCD 13
2 Qiu - Going Deeper with Embedded FPGA Platform for CNNs. FPGA 16
3 Suda - Throughput-Optimized OpenCL FPGA Accelerator for Large-Scale CNNs. FPGA
4 Meloni - Curbing the Roofline: a Scalable and Flexible Architecture for CNNs on FPGA
5 Zhang - Optimizing FPGA-based Accelerator Design for Deep CNNs. FPGA 15
6 Gysel - Hardware-oriented Approximation of CNNs. ICLR 16


7 Dataflow MoC for CNNs
- Purely data-driven execution model
- CNNs: a dataflow process network (DPN)
- Apply the dataflow MoC to CNNs
- Stream-based processing
- Actors: granularity?
- Communication channels: FIFO?
- Literature: NeuFlow, fpgaConvNet, ...: a limited number of processing tiles
1 Farabet - NeuFlow: A runtime reconfigurable dataflow processor for vision. CVPR 11
2 Venieris - fpgaConvNet: Automated Mapping of CNNs on FPGAs
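To make the data-driven execution model concrete, here is a small software sketch of actors connected by FIFO channels. It is only an illustration under assumptions: the `Actor` class, the `run` scheduler and the example graph are hypothetical and are not taken from NeuFlow, fpgaConvNet or the authors' tools. An actor fires as soon as every one of its input FIFOs holds a token.

```python
from collections import deque


class Actor:
    """A dataflow actor: consumes one token per input FIFO, produces one token."""

    def __init__(self, fn, inputs, output):
        self.fn = fn            # computation applied to the consumed tokens
        self.inputs = inputs    # list of input FIFOs (deque objects)
        self.output = output    # output FIFO (deque object)

    def can_fire(self):
        # Data-driven firing rule: a token must be present on every input.
        return all(len(fifo) > 0 for fifo in self.inputs)

    def fire(self):
        tokens = [fifo.popleft() for fifo in self.inputs]
        self.output.append(self.fn(*tokens))


def run(actors):
    """Fire actors until no actor can fire any more."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True


# Example graph: two weighting actors, one adder, one activation (ReLU assumed).
a_in, b_in = deque([1, 2, 3]), deque([1, 5, 2])
s0, s1, t, out = deque(), deque(), deque(), deque()
graph = [
    Actor(lambda x: 2 * x, [a_in], s0),
    Actor(lambda x: -1 * x, [b_in], s1),
    Actor(lambda x, y: x + y, [s0, s1], t),
    Actor(lambda x: max(0, x), [t], out),
]
run(graph)
```

After `run(graph)`, `out` holds one result per input sample, in arrival order. In hardware each actor would run concurrently; the sequential loop here only models the firing rule.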

8 Direct Hardware Mapping of CNNs
Our contribution: Direct Hardware Mapping (DHM) of CNN entities
- Each CNN actor is physically mapped on the device
- Each CNN connection is mapped on a signal
- Feature maps are processed on the fly: no memory bandwidth limitation
- 1 op/clk: high throughput
- Limitation: area and resource availability

9 Memory Optimization 1
- Memory footprint of the implementation = pipelined convolution engines
- Pipelined convolution = neighbourhood extractor + MAC unit
- N·C convolution engines per layer, N·C·K² MB of storage
- Solution: factorize neighbourhood extractors (NEF)
Figure: Architecture of a 3×3 neighbourhood extractor (taps p00 to p22, line buffer 0, line buffer 1). Two buffers, each the length of an image line, are required to perform a 3×3 convolution on streams of pixels p_ij.
Figure: Architecture of a parallel 3×3 MAC unit. Extracted pixels are weighted using 9 multipliers and accumulated to compute one convolution per clock cycle.
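The neighbourhood extractor and MAC unit of this slide can be modelled in software as a shift register fed by a raster-scan pixel stream. This is a behavioural sketch under assumptions, not the authors' RTL: the function names are hypothetical, pixels arrive row by row, and a single deque of length (K-1)·width + K stands in for the K-1 line buffers plus the K×K taps.

```python
from collections import deque


def neighbourhoods(pixels, width, k=3):
    """Yield k x k windows from a raster-scan pixel stream.

    The shift register holds the last (k-1)*width + k pixels, i.e. the
    content of (k-1) line buffers plus the taps of the last row.
    """
    depth = (k - 1) * width + k
    shift = deque(maxlen=depth)
    for idx, px in enumerate(pixels):
        shift.append(px)
        row, col = divmod(idx, width)
        # A full window exists once k rows and k columns have been seen.
        if row >= k - 1 and col >= k - 1:
            # shift[0] is the oldest pixel, i.e. the top-left tap.
            yield [[shift[r * width + c] for c in range(k)] for r in range(k)]


def mac(window, kernel):
    """Weight each extracted pixel and accumulate (one convolution result)."""
    return sum(w * t
               for w_row, k_row in zip(window, kernel)
               for w, t in zip(w_row, k_row))


# 4 x 4 test image streamed pixel by pixel, all-ones 3 x 3 kernel.
image_stream = range(16)
ones = [[1] * 3 for _ in range(3)]
results = [mac(win, ones) for win in neighbourhoods(image_stream, width=4)]
```

Each incoming pixel produces at most one window, which mirrors the one-convolution-per-clock-cycle behaviour described in the figure caption.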

10 Memory Optimization 2
- Factorizing neighbourhood extraction reduces the memory buffers by a factor of N
Figure: layer graph for N = 5, C = 3 with factorized neighbourhood extractors (ne) placed in front of the convolutions conv00 to conv42, followed by adders Σ0 to Σ4 and activations act0 to act4.
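The factor-of-N reduction follows from counting extractors: one per convolution engine (N·C) without factorization, one per input channel (C) with it. The sketch below makes that arithmetic explicit. The function names and the cost model are assumptions: each extractor is taken to store K-1 image lines, and the image width and pixel bit-width in the example are hypothetical values.

```python
def extractor_count(n_out, c_in, shared):
    """Number of neighbourhood extractors in one layer."""
    # With NEF, one extractor per input channel serves all N output units.
    return c_in if shared else n_out * c_in


def line_buffer_bits(n_out, c_in, k, width, pixel_bits, shared):
    """Line-buffer storage of one layer, in bits (assumed cost model)."""
    per_extractor = (k - 1) * width * pixel_bits  # k-1 image lines
    return extractor_count(n_out, c_in, shared) * per_extractor


# Slide example N = 5, C = 3; width = 32 and 8-bit pixels are hypothetical.
without_nef = line_buffer_bits(5, 3, 3, 32, 8, shared=False)
with_nef = line_buffer_bits(5, 3, 3, 32, 8, shared=True)
reduction = without_nef // with_nef
```

Under this model the ratio between the two footprints equals N regardless of the chosen width or bit-width.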

11 Memory Optimization
Figure: Ratio of memory requirements between architectures with and without NEF for the AlexNet convolutional layers (conv1 to conv5; y-axis: required memory in bits). 390% less memory is required when implementing NEF.

12 Computation: Fixed-Point Arithmetic
- N·C·K² MAC operations per layer: limited by the number of DSPs
- LeNet5 conv2 (C = 6, N = 16, K = 5) requires 2400 multipliers
- Available multipliers in the biggest Cyclone V: ...
- Multiplication with logic elements
- Fixed-point arithmetic: cost in O(nbits²)
Figure: accuracy (%) and logic elements (kLE) versus bit-width (bits) for LeNet5, CIFAR10 and SVHN.
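The multiplier count and the effect of a short fixed-point format can be reproduced with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and the signed format with 4 fractional bits is chosen only for illustration (the deck reports results for a 5-bit representation but does not state the fractional split).

```python
def multipliers_needed(n_out, c_in, k):
    """One multiplier per kernel tap when every MAC is mapped in hardware."""
    return n_out * c_in * k * k


def to_fixed_point(value, total_bits, frac_bits):
    """Quantize a real value to a signed fixed-point integer code."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(value * scale)
    return max(lowest, min(highest, code))  # saturate instead of wrapping


def from_fixed_point(code, frac_bits):
    """Real value represented by an integer code."""
    return code / (1 << frac_bits)


lenet5_conv2 = multipliers_needed(n_out=16, c_in=6, k=5)
weight_code = to_fixed_point(0.3, total_bits=5, frac_bits=4)
weight_value = from_fixed_point(weight_code, frac_bits=4)
```

`lenet5_conv2` matches the 2400 multipliers quoted on the slide. Saturation is used for out-of-range values, which is a design choice of this sketch rather than something stated in the source.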

13 Multiplication with Logic Elements

$$ f[n, i, j] = \mathrm{act}\left( b[n] + \sum_{c=1}^{C} \sum_{p=1}^{K} \sum_{q=1}^{K} \Phi[c, i+p, j+q] \cdot \Theta[n, c, p, q] \right) $$

- Θ[n, c, p, q]: pre-learned convolution kernel, considered as constant
- Hard-coded as generics that parametrize the convolution engines
- Θ[n, c, p, q] = 0: remove the multiplier and the connection
- Θ[n, c, p, q] = 1: replace with a straightforward signal
- Θ[n, c, p, q] = power of two: multiplication with shift registers
Figure: hardware resources of a MAC operation (ALM) and percentage of weights with a null / power-of-two value (series: Zeros, Ones, Pow2). Bar labels as extracted: 57%, 3%, 40% (Lenet); 33%, 25%, 16% (Cifar10); 37%, 13%, 46% (SVHN).
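The three special cases for constant weights can be expressed as a small classifier over integer fixed-point codes. This is an illustrative sketch, not the Haddoc2 code: the function names are hypothetical, and it assumes that the sign is handled separately, so a weight is classified by its magnitude.

```python
from collections import Counter


def classify_weight(code):
    """Classify a constant integer weight by the hardware it needs."""
    mag = abs(code)  # assumption: negation is handled apart from the multiply
    if mag == 0:
        return "zero"     # multiplier and connection removed
    if mag == 1:
        return "one"      # plain signal
    if mag & (mag - 1) == 0:
        return "pow2"     # shift
    return "generic"      # full constant multiplier


def multiply_by_constant(x, code):
    """Multiply x by a constant weight using the cheapest operation."""
    kind = classify_weight(code)
    if kind == "zero":
        return 0
    mag = abs(code)
    if kind == "one":
        y = x
    elif kind == "pow2":
        y = x << (mag.bit_length() - 1)  # shift by log2(mag)
    else:
        y = x * mag
    return -y if code < 0 else y


def weight_histogram(codes):
    """Count how many weights fall in each class."""
    return Counter(classify_weight(c) for c in codes)


histogram = weight_histogram([0, 0, 1, -1, 4, 6, -8, 3])
```

A histogram of this kind over the quantized kernels of a trained network gives the share of weights that need no multiplier, which is the quantity plotted in the slide's figure.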

14 Haddoc2 Framework
- From high-level CNN model to low-level hardware description
- Flow: Caffe (.prototxt, .caffemodel), then Haddoc2, then hardware (toplevel.vhd, params.vhd)
- Direct Hardware Mapping: structural RTL transcription of the CNN graph
- Constant multipliers: convolution kernels hard-coded as VHDL generics
- Platform- and device-independent VHDL
1 Jia - Caffe: Convolutional Architecture for Fast Feature Embedding. In ACM International Conference on Multimedia

15 Haddoc2 Framework

16 Implementation Results with Haddoc2

(a) Intel Cyclone V FPGA
                                    LeNet5       FaceDetect   CarType
Logic Elements (ALMs)               ... (44%)    6158 (5%)    ... (42%)
DSP Blocks                          0 (0%)       0 (0%)       0 (0%)
Block Memory Bits                   2752 (1%)    ... (1%)     ... (1%)
Frequency (MHz)                     ...          ...          ...
Processing capabilities (Gops/s)    ...          ...          ...

(b) Xilinx Kintex 7 FPGA
                                    LeNet5       FaceDetect   CarType
Slices                              ... (88%)    6221 (11%)   ... (89%)
DSP Blocks                          0 (0%)       0 (0%)       0 (0%)
LUTs as Memory                      420 (1%)     1458 (2%)    1154 (1%)
Frequency (MHz)                     ...          ...          ...
Processing capabilities (Gops/s)    ...          ...          ...

Table: Resource utilization of the Haddoc2-generated convolutional layers with 5-bit representation on (a) an Intel Cyclone V FPGA and (b) a Xilinx Kintex 7 FPGA.

17 Thanks for listening!
