Clément Farabet clement.farabet@gmail.com Computational & Biological Learning Laboratory Courant Institute, NYU Joint work with: Yann LeCun, Cyril Poulet, Jefferson Y. Han Now collaborating with Eugenio Culurciello's e-lab, Yale {Berin Martini, Polina Akselrod, Selcuk Talay} 1 / 35
1 Introduction
    Our goal: Real-Time Embedded Vision
    Our solution: the CNP (ConvNet Processor)
2 CNP: Architecture
3 Results
2 / 35
Real-Time Vision: Convolutional Neural Networks. The state of the art for object recognition/classification, ConvNets are biologically inspired hierarchical architectures that can be trained to perform a variety of detection/recognition tasks, such as... 3 / 35
Figure: Object recognition. 4 / 35
Figure: Robot navigation/obstacle avoidance. 6 / 35
Convolutional Networks ConvNets have a feed-forward architecture consisting of multiple linear convolution filters interspersed with point-wise non-linear squashing functions. [2, 3, 4, 1]
[Figure: a typical ConvNet — Input Image → C1: 6@314x234 (convolutions) → S2: 6@157x117 (subsampling) → C3: 16@151x111 (convolutions) → S4: 16@75x55 (subsampling) → C5: 80@70x50 (convolutions) → F6: 2@70x50 → F7: 1@70x50 (full connections)]
7 / 35
Figure: Ex: Features extracted from an image, by a trained ConvNet. 8 / 35
Our Goal: Achieve high performance in low-power environments. 9 / 35
Our Solution: A ConvNet Processor Our entire system fits in a single FPGA plus an external RAM module, with no extra parts. It is a programmable SIMD (Single Instruction, Multiple Data) processor with a vector instruction set that matches the elementary operations of a ConvNet. These elementary operations are highly optimized and make extensive use of the parallelism inherent in the hardware. 10 / 35
Flexibility / Efficiency Implementing a particular ConvNet consists simply of programming the software layer of our processor; it does not require reconfiguring the logic circuits in the FPGA. A network compiler was implemented, which takes a description of a trained ConvNet and compiles it into a proper program for the ConvNet Processor (CNP). 11 / 35
1 Introduction
2 CNP: Architecture
    Overview
    Hardware
    Software
3 Results
12 / 35
Our System Fits in a single FPGA + an external memory module. The FPGA: either a low-end Xilinx Spartan-3A DSP 3400 or a high-end Virtex-4 SX35; both are roughly the same size. Spartan: 126 built-in hardware fixed-point multiply-accumulate units (18x18-bit multiplication, 48-bit accumulation), 200MHz. Virtex-4: 192 units, 400MHz. 13 / 35
[Figure: CNP architecture — a 32-bit general-purpose soft CPU (Control Unit) handles pre/post-processing and drives a Vector/Stream ALU whose hardware modules (CONV, DOT, SUB, NONLIN, DIVISION, SQRT, PRODUCT) are connected by a stream router; all units access the external memory chips through a streaming multiport memory arbiter (mux/demux). Peripherals: serial terminal to a remote system, file manager backed by external flash (SW program), video manager (camera input) and display manager (screen output).]
14 / 35
The most important instruction: 2D convolution — 80 to 90% of the computations. Inputs/outputs are coded in Q10.6, intermediate results in Q36.12. [5] 19 / 35
[Figure: the 2D convolver — a grid of multipliers (x) and adders (+) streaming the input past the kernel weights W00...W22 and accumulating partial sums, followed by pooling/subsampling.] 20 / 35
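To make the fixed-point formats concrete, here is a minimal software sketch of the 2D convolution instruction — not the CNP's RTL, just an illustrative model of the arithmetic: 16-bit Q10.6 inputs/outputs, with products accumulated in a wide Q36.12 register, matching the 18x18-bit multipliers and 48-bit accumulators mentioned earlier.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative model of the CNP's 2D "valid" convolution (not the hardware
// implementation). Q6 x Q6 products give Q12; the wide accumulator models
// the 48-bit Q36.12 intermediate format; the result is truncated back to Q6.
using q10_6 = int16_t;   // 10 integer bits, 6 fractional bits
using acc_t = int64_t;   // models the 48-bit Q36.12 accumulator

std::vector<q10_6> conv2d_valid(const std::vector<q10_6>& in, int w, int h,
                                const std::vector<q10_6>& ker, int k) {
    int ow = w - k + 1, oh = h - k + 1;
    std::vector<q10_6> out(ow * oh);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
            acc_t acc = 0;                       // Q12 partial sums
            for (int j = 0; j < k; ++j)
                for (int i = 0; i < k; ++i)      // Q6 * Q6 -> Q12 product
                    acc += acc_t(in[(y + j) * w + (x + i)]) * ker[j * k + i];
            out[y * ow + x] = q10_6(acc >> 6);   // Q12 -> Q6, truncating
        }
    return out;
}
```

A 7x7 kernel makes this inner loop 49 multiply-accumulates per output pixel, which is what the hardware grid computes in a single cycle.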
The point-wise non-linearity Implemented as a piecewise linear approximation of the hyperbolic tangent g(x) = A·tanh(B·x):
    g(x) ≈ a_i x + b_i   for x ∈ [l_i, l_{i+1}]   (1)
    a_i = 1/2^m + 1/2^n,   m, n ∈ [0, 5]          (2)
Eq. (2) is used to replace all the multipliers by simple shifts and adds.
21 / 35
The point-wise non-linearity The piecewise mapping is done by a hardware mapper, which streams the input data through a cascade of linear mappers. Each linear mapper is configurable from the soft CPU.
[Figure: cascade of linear mappers — each stage compares the input against its bound (x >= l_i ?) and, when selected, outputs a_i·x + b_i.]
22 / 35
Figure: Piecewise approximations of √x [16 segments] and tanh(x) [16 segments] 23 / 35
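The scheme above can be sketched in a few lines of C++ (an illustrative model, not the CNP's mapper): each segment stores its lower bound l_i, the two slope exponents m and n, and the offset b_i; the segment table below is a made-up 3-segment example, not the CNP's actual 16-segment table.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Model of one stage of the piecewise-linear tanh mapper. In hardware the
// slope a_i = 2^-m + 2^-n means a_i * x is just (x >> m) + (x >> n):
// two shifts and an add, no multiplier. Here we model it in floating point.
struct Segment {
    float lower;   // l_i: segment active for x >= l_i (until the next bound)
    int m, n;      // slope exponents, m, n in [0, 5]
    float b;       // offset b_i
};

float piecewise_tanh(float x, const std::vector<Segment>& segs) {
    // select the last segment whose lower bound is <= x
    const Segment* s = &segs.front();
    for (const auto& seg : segs)
        if (x >= seg.lower) s = &seg;
    float a = std::ldexp(1.0f, -s->m) + std::ldexp(1.0f, -s->n);  // 2^-m + 2^-n
    return a * x + s->b;
}
```

With m = n = 1 the slope is exactly 1 (the tanh slope at the origin), and with m = n = 5 it is 1/16, the flattest slope the constraint allows — which is why 16 segments suffice for a good fit.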
Embedded software A complete library has been developed in C++ to abstract the hardware of the CNP. Its functions:
- Provide a high-level/modular/scalable user interface that allows different kinds of ConvNets to be loaded from a flash memory/shell interface (with their trained weights), then interpreted and computed in real time;
- Hide all the intricacies of the VALU behind a Matrix class that allows simple calls, such as image1.convolve(image1, kernel2);
- Manipulate the data generated by the hardware, to perform operations the hardware cannot (e.g. blob detection, calculation of their centroids, ...);
- Provide a high-level shell interface, to allow scripting/debugging/polling the system.
25 / 35
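As a rough idea of what such a Matrix wrapper might look like — the names and interface below are illustrative guesses, not the CNP library's actual API — the same call can either issue a vector instruction to the hardware or fall back to plain C++:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the Matrix abstraction (illustrative only).
// On the CNP, convolve() would issue a vector CONV instruction to the
// VALU; here we model the software fallback path.
class Matrix {
public:
    Matrix(int w, int h) : w_(w), h_(h), data_(w * h, 0.0f) {}

    Matrix convolve(const Matrix& kernel) const {
        int k = kernel.w_;
        Matrix out(w_ - k + 1, h_ - k + 1);
        for (int y = 0; y < out.h_; ++y)
            for (int x = 0; x < out.w_; ++x) {
                float acc = 0.0f;
                for (int j = 0; j < k; ++j)
                    for (int i = 0; i < k; ++i)
                        acc += data_[(y + j) * w_ + (x + i)] * kernel.data_[j * k + i];
                out.data_[y * out.w_ + x] = acc;
            }
        return out;
    }

    float& at(int x, int y) { return data_[y * w_ + x]; }
    int width() const { return w_; }
    int height() const { return h_; }

private:
    int w_, h_;
    std::vector<float> data_;
};
```

The point of the abstraction is that user code written against this interface runs unchanged whether the backend is the CNP's VALU or a plain CPU.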
External software: Training and Porting ConvNets are defined and trained on a conventional machine, using a standard machine learning library (e.g. LUSH, EBLearn); Our custom compiler then takes a Lush description of a trained ConvNet, and automatically compiles it into a compact representation describing the content of each layer: type of operation, matrix of connections, kernels. This representation can then be stored on a flash drive (e.g. SD Card) connected to the CNP, to be interpreted on the fly by our embedded program. 26 / 35
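A minimal sketch of the kind of per-layer record such a compiler could emit — the field names and layout are hypothetical, not the CNP's actual on-flash format, but they cover the three items the slide lists (type of operation, connection matrix, kernels):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical compact layer record (illustrative, not the real format).
enum class LayerType : uint8_t { Convolution, Subsampling, FullyConnected, NonLinearity };

struct LayerDescriptor {
    LayerType type;                     // type of operation
    uint16_t n_inputs, n_outputs;       // feature maps in / out
    uint8_t kernel_size;                // e.g. 7 for a 7x7 kernel
    std::vector<uint8_t> connections;   // n_inputs x n_outputs 0/1 matrix
    std::vector<int16_t> kernels;       // Q10.6 weights, one kernel per connection
};
```

The embedded interpreter would walk an array of such records, issuing one batch of vector instructions per layer.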
Unifying the software: nrgizer [joint work with e-Lab, Yale] We are currently developing a C++ library (inspired by LUSH) to unify the software: nrgizer.
- nrgizer is a training environment that allows fast prototyping of ConvNets (and similar architectures);
- All the computations in this lib are abstracted in a single Matrix class that can easily be ported to different platforms (a PC, our CNP, and also GPUs or embedded devices...);
- The whole lib can be ported to a small soft CPU (such as MicroBlaze or OpenRISC);
- The API is extremely simplified...
27 / 35
1 Introduction
2 CNP: Architecture
3 Results
    Resources Usage
    Speed
    Images
    Ongoing Work
28 / 35
Resources Usage A single Spartan-3A DSP 3400 and an external DDR-RAM module to store the kernels (~50kB) and the temporary feature maps (approx. 16MB for 640x480 input images), OR a single Virtex-4 SX35 and two external QDR-SRAM chips.

Entity        Used / Available    Occupancy
I/Os          135 / 469           28%
DCMs          2 / 8               25%
Mult/Accs     56 / 126            44%
Block RAMs    100 / 126           84%
Slices        16790 / 23872       70%

Table: Resources used in a Spartan-3A DSP 3400.
29 / 35
Speed The design can be synthesized to run at up to 200MHz, which gives a peak performance of 10 GMAC/s for a 7x7 kernel! At 200MHz, it computes 8 fps in multi-scale (constant time for more than 2 scales) with 512x384 input images — including pre-processing, maxima detection and centroid calculation — or 12 fps in mono-scale. 30 / 35
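The peak figure follows directly from the datapath: a 7x7 kernel keeps all 49 multiply-accumulate units of the convolver grid busy on every clock cycle, so at 200MHz:

```latex
49\ \tfrac{\text{MACs}}{\text{cycle}} \times 200 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}}
= 9.8 \times 10^{9}\ \tfrac{\text{MACs}}{\text{s}} \approx 10\ \text{GMAC/s}
```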
Figure: Left: time required to compute a 2D convolution, as a function of kernel size (1x1 to 17x17, log scale). Right: time required to compute a full ConvNet, as a function of input size. 31 / 35
Figure: Applications running on the CNP (video generated by the CNP) 32 / 35
Figure: The CNP, integrated in our custom hardware platform (Virtex-4 + QDRs) 33 / 35
Ongoing / Future Work Our current efforts [with e-lab, Yale]:
- Transforming the convolver into a flexible and programmable grid of elementary units, to compute several convolutions of different sizes at the same time;
- Allowing the operations in the VALU to be cascaded, to combine operations on a single stream;
- Pre-caching the kernels in the grid, and pre-fetching any stream of data required while processing... to suppress any latency;
- Making a custom (ASIC) chip that would integrate both memory and logic in a single chip, to increase the bandwidth and frequency of operation.
34 / 35
References
[1] Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Zurich, Switzerland, 1989. Elsevier. An extended version was published as a technical report of the University of Toronto.
[2] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
[3] Y. LeCun and Y. Bengio. Pattern recognition and neural networks. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Intelligent Signal Processing, pages 306-351. IEEE Press, 2001.
[5] R. G. Shoup. Parameterized convolution filtering in a field programmable gate array. In Selected papers from the Oxford 1993 international workshop on field programmable logic and applications on More FPGAs, pages 274-280, Oxford, United Kingdom, 1994. Abingdon EE&CS Books.
35 / 35