CNP: An FPGA-based Processor for Convolutional Networks


Clément Farabet (clement.farabet@gmail.com), Computational & Biological Learning Laboratory, Courant Institute, NYU. Joint work with: Yann LeCun, Cyril Poulet, Jefferson Y. Han. Now collaborating with Eugenio Culurciello's e-Lab at Yale (Berin Martini, Polina Akselrod, Selcuk Talay). 1 / 35

Outline: 1. Introduction (Our goal: Real-Time Embedded Vision; Our solution: the CNP, the ConvNet Processor); 2. CNP: Architecture; 3. Results. 2 / 35

Real-time Vision with Convolutional Neural Networks. The state of the art for object recognition/classification, ConvNets are biologically inspired hierarchical architectures that can be trained to perform a variety of detection/recognition tasks, such as... 3 / 35

Figure: Object recognition. 4 / 35

5 / 35

Figure: Robot navigation/obstacle avoidance. 6 / 35

Convolutional Networks. ConvNets have a feed-forward architecture consisting of multiple linear convolution filters interspersed with point-wise non-linear squashing functions. [2, 3, 4, 1]
Figure: Example ConvNet topology: Input Image -> C1: 6@314x234 (convolutions) -> S2: 6@157x117 (subsampling) -> C3: 16@151x111 (convolutions) -> S4: 16@75x55 (subsampling) -> C5: 80@70x50 (convolutions) -> F6: 2@70x50 -> F7: 1@70x50 (full connections). 7 / 35
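The map sizes in the figure follow from simple arithmetic. As a worked check (assuming "valid" convolutions, $n_{out} = n_{in} - k + 1$, and 2x2 subsampling; the 7x7 and 6x6 kernel sizes and the 320x240 input are inferred from the dimensions, not stated on the slide):

\begin{align*}
\text{C1 } (k{=}7)&:\; 320-7+1 = 314,\quad 240-7+1 = 234\\
\text{S2}&:\; \lfloor 314/2\rfloor = 157,\quad \lfloor 234/2\rfloor = 117\\
\text{C3 } (k{=}7)&:\; 157-7+1 = 151,\quad 117-7+1 = 111\\
\text{S4}&:\; \lfloor 151/2\rfloor = 75,\quad \lfloor 111/2\rfloor = 55\\
\text{C5 } (k{=}6)&:\; 75-6+1 = 70,\quad 55-6+1 = 50
\end{align*}

F6 and F7 then apply full connections across the 80 maps of C5 while preserving the 70x50 spatial size.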

Figure: Example: features extracted from an image by a trained ConvNet. 8 / 35

Our Goal: Achieve high performance in low-power environments. 9 / 35

Our Solution: A ConvNet Processor. Our entire system fits in a single FPGA plus an external RAM module, with no extra parts. It is a programmable SIMD (Single Instruction, Multiple Data) processor with a vector instruction set that matches the elementary operations of a ConvNet. These elementary operations are highly optimized and make extensive use of the parallelism inherent in the hardware. 10 / 35
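The operator names that appear in the block diagram a few slides later (CONV, DOT, SUB, NONLIN, DIVISION, SQRT, PRODUCT) suggest what such a vector instruction set covers. The following is only a sketch of how a host-side program might represent these operations; the encoding, operand fields, and type names are assumptions for illustration, not the CNP's actual instruction format.

```cpp
#include <cstdint>
#include <vector>

// Vector operations exposed by the CNP's Vector/Stream ALU, as named in the
// block diagram. The numeric encoding and operand layout below are
// illustrative assumptions, not the real instruction format.
enum class VectorOp : uint8_t {
    CONV,      // 2D convolution (accumulating into the output map)
    DOT,       // dot product
    SUB,       // subsampling / pooling
    NONLIN,    // piecewise-linear tanh approximation
    DIVISION,
    SQRT,
    PRODUCT,   // point-wise product
};

// One vector instruction: an operation applied to streams of 2D maps
// addressed in external memory.
struct VectorInstruction {
    VectorOp op;
    uint32_t src_a, src_b, dst;   // base addresses of operand/result maps
    uint16_t height, width;       // map dimensions
    uint8_t  kernel_size;         // used by CONV
};

using Program = std::vector<VectorInstruction>;  // executed by the Control Unit
```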

Flexibility / Efficiency. Implementing a particular ConvNet simply consists of programming the software layer of our processor, and does not require reconfiguring the logic circuits in the FPGA. A network compiler was implemented, which takes a description of a trained ConvNet and compiles it into a proper program for the ConvNet Processor (CNP). 11 / 35

Outline: 1. Introduction; 2. CNP: Architecture (Overview, Hardware, Software); 3. Results. 12 / 35

Our System. Fits in a single FPGA + an external memory module. The FPGA: a low-end Xilinx Spartan-3A DSP 3400 or a high-end Virtex-4 SX35 (both are roughly the same size). Spartan: 126 built-in hardware fixed-point multiply-accumulate units, with 18x18-bit multiplication and 48-bit accumulation, at 200 MHz. Virtex-4: 192 units, 400 MHz. 13 / 35

Figure: CNP block diagram. Inside the FPGA: a Control Unit (CU) with a 32-bit general-purpose soft CPU (pre/post-processing, SW program memory, ALU driver, memory I/O, serial link to a remote terminal/file manager); a Vector/Stream ALU (hardware modules for CONV, DOT, SUB, NONLIN, DIVISION, SQRT, PRODUCT) with a Stream Router; a Streaming Multiport Memory Arbiter (mux/demux) to the external memory chip(s); external flash; and Display/Video Managers connected to a screen and a camera. 14 / 35

Figure: CNP block diagram, detail of the on-FPGA blocks (Control Unit and soft CPU, Vector/Stream ALU with Stream Router, Streaming Multiport Memory Arbiter, Display and Video Managers). 15 / 35

Figure: CNP block diagram, including the off-chip parts: external flash and external memory chip(s). 16 / 35

Figure: Banks of write FIFOs (FIFOs WR) in the memory interface. 17 / 35

Figure: CNP block diagram (repeated). 18 / 35

The most important instruction: the 2D convolution, which accounts for 80 to 90% of the computations. Inputs/outputs are coded in Q10.6 fixed point, intermediate results in Q36.12 [5]. 19 / 35
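As an illustration of the number formats (a minimal software sketch, not the hardware description; type widths are chosen to mirror Q10.6 and the wide 48-bit accumulation of the FPGA's MAC units):

```cpp
#include <cstdint>

// Q10.6: 16-bit signed fixed point, 6 fractional bits.
using q10_6 = int16_t;
// Wide accumulator standing in for the Q36.12 intermediate format
// (48-bit accumulation in the FPGA's multiply-accumulate units).
using acc_t = int64_t;

// Multiply-accumulate over one k x k window: the product of two Q10.6 values
// carries 12 fractional bits, so sums accumulate naturally in a Q*.12 format.
acc_t mac_window(const q10_6* in, int stride, const q10_6* ker, int k) {
    acc_t acc = 0;
    for (int j = 0; j < k; ++j)
        for (int i = 0; i < k; ++i)
            acc += acc_t(in[j * stride + i]) * ker[j * k + i];
    return acc;                       // Q*.12
}

// Convert an accumulated Q*.12 result back to a Q10.6 output sample.
q10_6 to_q10_6(acc_t acc) {
    return q10_6(acc >> 6);           // drop 6 of the 12 fractional bits
}
```

Rounding and saturation are omitted here for brevity.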

Figure: Dataflow of the 2D convolver: a grid of multiply-accumulate units, one per kernel coefficient (W00 ... W22 for a 3x3 kernel), with chained adders, followed by a Pooling / Subsampling stage. 20 / 35
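The subsampling stage halves each spatial dimension (314x234 -> 157x117 in the earlier figure). The exact pooling operator is not spelled out in the slides; classic LeNet-style layers average each 2x2 neighborhood, which is what this minimal sketch does:

```cpp
#include <cstdint>
#include <vector>

using q10_6 = int16_t;

// 2x2 subsampling (average pooling), halving each spatial dimension.
std::vector<q10_6> subsample2x2(const std::vector<q10_6>& in, int h, int w) {
    std::vector<q10_6> out((h / 2) * (w / 2));
    for (int y = 0; y + 1 < h; y += 2)
        for (int x = 0; x + 1 < w; x += 2) {
            int32_t sum = in[y * w + x] + in[y * w + x + 1]
                        + in[(y + 1) * w + x] + in[(y + 1) * w + x + 1];
            out[(y / 2) * (w / 2) + x / 2] = q10_6(sum / 4);
        }
    return out;
}
```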

The point-wise non-linearity. Implemented as a piecewise approximation of the hyperbolic tangent function $g(x) = A \cdot \tanh(B \cdot x)$:

$$g(x) \approx a_i x + b_i \quad \text{for } x \in [l_i, l_{i+1}] \qquad (1)$$

$$a_i = \frac{1}{2^m} + \frac{1}{2^n}, \quad m, n \in [0, 5]. \qquad (2)$$

Eq. (2) is used to replace all the multipliers by simple shifts and adds. 21 / 35
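A minimal sketch of the shift-and-add evaluation that Eq. (2) makes possible: since $a_i = 2^{-m} + 2^{-n}$, the product $a_i x$ reduces to two arithmetic shifts and one add. The segment boundaries and coefficients below are placeholders (in the real system they are fitted offline to $A \tanh(Bx)$), not the values used on the CNP.

```cpp
#include <cstdint>
#include <array>

using q10_6 = int16_t;

// One linear segment of the piecewise tanh approximation. The slope
// a_i = 1/2^m + 1/2^n (Eq. 2) is stored as the shift pair (m, n).
struct Segment {
    q10_6 lo;      // left boundary l_i (Q10.6)
    int   m, n;    // shift amounts, 0..5
    q10_6 b;       // offset b_i (Q10.6)
};

// Hypothetical segment table; boundaries and coefficients are placeholders.
constexpr std::array<Segment, 3> kSegments = {{
    { q10_6(-128), 3, 5, q10_6(-40) },   // x in [-2.0, 0.0)
    { q10_6(   0), 0, 1, q10_6(  0) },   // x in [ 0.0, 2.0)
    { q10_6( 128), 3, 5, q10_6( 40) },   // x in [ 2.0, ...)
}};

// Evaluate the piecewise-linear approximation at x (Q10.6), multiplier-free.
q10_6 nonlin(q10_6 x) {
    const Segment* s = &kSegments[0];
    // Pick the last segment whose l_i <= x (falls back to the first segment).
    for (const auto& seg : kSegments)
        if (x >= seg.lo) s = &seg;
    return q10_6((x >> s->m) + (x >> s->n) + s->b);
}
```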

The point-wise non-linearity. The piecewise mapping is done by a hardware mapper, which streams the input data through a cascade of linear mappers; each linear mapper is configurable from the soft CPU.
Figure: Cascade of linear mapper stages, each holding a boundary l_i and coefficients (a_i, b_i); a stage applies a_i * x + b_i when the comparison against its boundary selects it, and forwards a valid flag with the result. 22 / 35

Figure: Piecewise approximations of sqrt(x) [16 segments] and tanh(x) [16 segments]. 23 / 35

Figure: CNP block diagram (repeated, including external flash and external memory chip(s)). 24 / 35

Embedded software. A complete library has been developed in C++ to abstract the CNP hardware. Its functions: provide a high-level/modular/scalable user interface that allows different kinds of ConvNets (with their trained weights) to be loaded from a flash memory/shell interface, then interpreted and computed in real time; hide all the intricacies of the VALU behind a Matrix class that allows simple calls, such as image1.convolve(image1, kernel2); manipulate the data generated by the hardware, to perform operations that cannot be done by the hardware (e.g. blob detection, calculation of their centroids, ...); provide a high-level shell interface, to allow scripting/debugging/polling the system. 25 / 35
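The slide gives only one example call, image1.convolve(image1, kernel2). The sketch below shows what such a Matrix wrapper could look like as a plain software stand-in; on the CNP the call would instead drive the VALU's CONV instruction, and every name other than convolve() is an assumption, not the library's actual API.

```cpp
#include <vector>

// Illustrative Matrix wrapper in the spirit of image1.convolve(image1, kernel2).
class Matrix {
public:
    Matrix(int h, int w) : h_(h), w_(w), data_(h * w, 0.0f) {}

    // dst.convolve(src, kernel): valid-mode 2D convolution of src with kernel,
    // result written into *this (which is resized as needed).
    void convolve(const Matrix& src_in, const Matrix& kernel) {
        Matrix src = src_in;          // copy, so dst may alias src as in the slide's call
        int k = kernel.h_;
        h_ = src.h_ - k + 1;
        w_ = src.w_ - k + 1;
        data_.assign(h_ * w_, 0.0f);
        for (int y = 0; y < h_; ++y)
            for (int x = 0; x < w_; ++x)
                for (int j = 0; j < k; ++j)
                    for (int i = 0; i < k; ++i)
                        data_[y * w_ + x] += src.data_[(y + j) * src.w_ + (x + i)]
                                           * kernel.data_[j * kernel.w_ + i];
    }

    int h_, w_;
    std::vector<float> data_;
};
```

With an interface of this shape, the higher-level routines mentioned on the slide (blob detection, centroid computation) can operate directly on the buffers filled by the hardware.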

External software: Training and Porting. ConvNets are defined and trained on a conventional machine, using a standard machine learning library (e.g. LUSH, EBLearn); our custom compiler then takes a Lush description of a trained ConvNet and automatically compiles it into a compact representation describing the content of each layer: type of operation, matrix of connections, kernels. This representation can then be stored on a flash drive (e.g. SD card) connected to the CNP, to be interpreted on the fly by our embedded program. 26 / 35
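The on-flash format is not documented in the deck; the hypothetical record below just illustrates the kind of per-layer information the slide says the compact representation carries (type of operation, matrix of connections, kernels). Field names and widths are assumptions.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-layer record of the compact network description.
enum class LayerOp : uint8_t { Convolution, Subsampling, NonLinearity, FullConnection };

struct LayerRecord {
    LayerOp op;                       // type of operation
    uint16_t in_maps, out_maps;       // feature-map counts
    uint16_t kernel_size;             // e.g. 7 for a 7x7 kernel
    std::vector<uint8_t> connections; // in_maps x out_maps connection matrix (0/1)
    std::vector<int16_t> kernels;     // trained weights, Q10.6 fixed point
};

// The whole network: interpreted layer by layer by the embedded program.
using NetworkDescription = std::vector<LayerRecord>;
```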

Unifying the software: nrgizer [joint work with e-Lab, Yale]. We are currently developing a C++ library (inspired by LUSH) to unify the software: nrgizer. nrgizer is a training environment that allows fast prototyping of ConvNet (and similar) architectures; all the computations in this library are abstracted in a single Matrix class, which can easily be ported to different platforms (a PC, our CNP, and also GPUs or embedded devices...); the whole library can be ported to a small soft CPU (such as MicroBlaze or OpenRISC); the API is extremely simplified... 27 / 35

Outline: 1. Introduction; 2. CNP: Architecture; 3. Results (Resources Usage, Speed, Images, Ongoing Work). 28 / 35

Resources Usage. One single Spartan-3A DSP 3400 and an external DDR-RAM memory to store the kernels (~50 kB) and the temporary feature maps (approx. 16 MB for 640x480 input images), OR one single Virtex-4 SX35 and two external QDR-SRAM chips.

Entity       | Used / Available  | Occupancy
I/Os         | 135 / 469         | 28%
DCMs         | 2 / 8             | 25%
Mult/Accs    | 56 / 126          | 44%
Block RAMs   | 100 / 126         | 84%
Slices       | 16790 / 23872     | 70%

Table: Resources used in a Spartan-3A DSP 3400. 29 / 35

Speed. The design can be synthesized to run at up to 200 MHz, which gives a peak performance of 10 GMAC/s for a 7x7 kernel! At 200 MHz, it computes 8 fps in multi-scale (constant time for more than 2 scales) with 512x384 input images (this includes pre-processing, maxima detection and centroid calculation), or 12 fps in mono-scale. 30 / 35
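As a back-of-the-envelope check of the peak figure (assuming one multiply-accumulate per kernel coefficient per clock cycle):

$$7 \times 7 = 49 \ \text{MACs/cycle}, \qquad 49 \times 200\,\text{MHz} = 9.8\,\text{GMAC/s} \approx 10\,\text{GMAC/s}.$$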

Figure: Left: time required to compute a 2D convolution, as a function of kernel size (1x1 to 17x17; log-scale time axis). Right: time required to compute a full ConvNet (log-scale time axis). 31 / 35

Figure: Applications running on the CNP (video generated by the CNP). 32 / 35

Figure: The CNP, integrated in our custom hardware platform (Virtex-4 + QDRs) 33 / 35

Ongoing / Future Work. Our current efforts [with e-Lab, Yale]: transforming the convolver into a flexible and programmable grid of elementary units, to compute several convolutions of different sizes at the same time; allowing the operations in the VALU to be cascaded, to combine operations on a single stream; pre-caching the kernels in the grid, and pre-fetching any stream of data required while processing, to suppress latency; making a custom (ASIC) chip that would integrate both memory and logic in a single chip, to increase the bandwidth and operating frequency. 34 / 35

References
[1] Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Zurich, Switzerland, 1989. Elsevier. An extended version was published as a technical report of the University of Toronto.
[2] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
[3] Y. LeCun and Y. Bengio. Pattern recognition and neural networks. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Intelligent Signal Processing, pages 306-351. IEEE Press, 2001.
[5] Richard G. Shoup. Parameterized convolution filtering in a field programmable gate array. In Selected papers from the Oxford 1993 international workshop on field programmable logic and applications on More FPGAs, pages 274-280, Oxford, United Kingdom, 1994. Abingdon EE&CS Books.
35 / 35