Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks


Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. Naveen Suda, Vikas Chandra*, Ganesh Dasika*, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu Cao. Arizona State University. *ARM Research. Feb 22, 2016

Outline
- Convolutional Neural Networks (CNNs)
- Challenges in hardware implementation
- End-to-end FPGA accelerator for CNNs: fixed-point operations, parameterized CNN layers in OpenCL
- Optimization framework: performance and resource utilization models
- Results: AlexNet and VGG-16 CNN models
- Conclusion

Convolutional Neural Networks (CNN)
- Feed-forward neural networks inspired by the visual cortex
- Multi-layer feature extraction and classification
- Applications: image/video classification, face detection, gesture detection, autonomous driving, speech recognition
- Tools: Caffe, Torch, Theano, TensorFlow, CNTK
Image from: http://deeplearning.net/tutorial/lenet.html

CNNs for Image Classification
- ImageNet Challenge: 1000 output classes; dataset: 1.2M training + 50K validation + 100K test images
- CNNs: winners since 2012
- AlexNet [1] CNN (2012): 8 layers, 1.46 GOP/image, top-5 accuracy 84.7%
- VGG [2] CNN (2014): 16-19 layers, 30.9 GOP/image, top-5 accuracy 92.6%
[1] A. Krizhevsky, et al. ImageNet classification with deep convolutional neural networks. NIPS 2012.
[2] K. Simonyan, et al. Very deep convolutional networks for large-scale image recognition. ICLR 2015.

CNN Accelerators
- GPUs: fast, but high power consumption (~200W)
- ASICs: energy efficient, but less flexible
- FPGA / SoC (FPGA + embedded CPU)
- Prior generic CNN accelerators optimize compute or memory [1]-[3]
- Balancing compute and memory access: AlexNet convolution layers only [4]
- Contribution: end-to-end large-scale CNN accelerator on FPGA, with design space exploration to maximize throughput
[1] C. Farabet, et al. Hardware accelerated convolutional neural networks for synthetic vision systems. ISCAS 2010.
[2] M. Peemen, et al. Memory-centric accelerator design for convolutional neural networks. ICCD 2013.
[3] V. Gokhale, et al. A 240 G-ops/s mobile coprocessor for deep neural networks. CVPR Workshops, 2014.
[4] C. Zhang, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. ISFPGA 2015.

Challenges in FPGA Implementation
- Large number of computations: AlexNet CNN (2012): 1.46 GOP/image; VGG-16 CNN (2014): 30.9 GOP/image. Solution: parallel computational resources
- Limited FPGA resources: 8 to 152 CNN layers with different dimensions. Solution: run-time scalable hardware modules
- Huge models: AlexNet ~240MB; VGG-16 ~550MB. Solution: reduced-precision weights and data reuse

OpenCL Framework
- Compile OpenCL kernels to RTL
- Stitch the RTL into the base system
- Full synthesis: map, place & route, timing analysis
- Generate the FPGA configuration file
- Features enabling acceleration: loop unrolling, SIMD factor, compute-unit replication, local memory usage

CNN Operations
- Pipeline of convolution, pooling, and normalization layers, repeated 5x to 152x, followed by classification (inner-product) layers
- Convolution: 3-D multiply-and-accumulate operations
- Pooling: max/average over a window
- Normalization: normalize based on neighboring neurons
- Classification: multiply-and-accumulate operations
- Activation function: y = max(x, 0)

Accuracy-Precision Study
Intermediate layer data (neurons): 16-bit precision

3-D Convolution as Matrix Multiplication

out(f_o, x, y) = Σ_{f_i=0}^{N_if-1} Σ_{k_x=0}^{K-1} Σ_{k_y=0}^{K-1} wt(f_o, f_i, k_x, k_y) · in(f_i, x + k_x, y + k_y)

(Figure: the 224x224 RGB input and the 11x11 convolution filter weights of AlexNet conv-1 are flattened and rearranged, so that all 96 output feature maps are produced by one 96 x (3x11x11) by (3x11x11) x (55x55) matrix product.)

AlexNet layer dimensions:

Layer    Features (N_if)   Size      Kernel (K)   Stride
Input    3                 224x224   -            -
Conv-1   96                55x55     11x11        4
Conv-2   256               27x27     5x5          1
Conv-3   384               13x13     3x3          1
Conv-4   384               13x13     3x3          1
Conv-5   256               13x13     3x3          1

K. Chellapilla, et al. High performance convolutional neural networks for document processing. IWFHR 2006.
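The flatten-and-rearrange step above can be sketched in NumPy (an illustrative model, not the authors' OpenCL implementation): each K x K x N_if receptive field becomes one column of a matrix, so the whole convolution layer reduces to a single matrix multiply.

```python
import numpy as np

def conv3d_direct(inp, wt):
    # inp: (N_if, H, W); wt: (N_of, N_if, K, K); stride 1, no padding
    n_if, h, w = inp.shape
    n_of, _, k, _ = wt.shape
    out = np.zeros((n_of, h - k + 1, w - k + 1))
    for fo in range(n_of):
        for y in range(h - k + 1):
            for x in range(w - k + 1):
                out[fo, y, x] = np.sum(wt[fo] * inp[:, y:y + k, x:x + k])
    return out

def conv3d_as_matmul(inp, wt):
    # Flatten each KxKxN_if receptive field into a column ("im2col"),
    # then one matrix multiply computes every output feature at once.
    n_if, h, w = inp.shape
    n_of, _, k, _ = wt.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((n_if * k * k, oh * ow))
    for y in range(oh):
        for x in range(ow):
            cols[:, y * ow + x] = inp[:, y:y + k, x:x + k].ravel()
    return (wt.reshape(n_of, -1) @ cols).reshape(n_of, oh, ow)
```

Both formulations compute the same triple sum; the matrix form is what the accelerator exploits for parallelism.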

Accelerating Convolution in OpenCL

The flattened weight matrix A and input matrix B are processed in blocks of width N_conv and the partial products are accumulated: C_1 = A_1·B_1 + A_2·B_2 + A_3·B_3

- N_conv MAC operations in parallel (loop unroll)
- Different feature dimensions: zero-pad to a multiple of N_conv
- SIMD vectorization factor: S_conv
- Acceleration factor: N_conv x S_conv
- Design knobs for acceleration: (N_conv, S_conv)
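The block-accumulation picture can be modeled in a few lines of NumPy; a minimal sketch, with the block width standing in for N_conv:

```python
import numpy as np

# C = A1*B1 + A2*B2 + ...: a wide matrix product computed as a sum of
# n_conv-wide blocks, mirroring how the accelerator streams one block of
# A and B at a time and accumulates into the same output tile.
def blocked_matmul(A, B, n_conv):
    m, n = A.shape
    _, p = B.shape
    assert n % n_conv == 0  # feature dims are zero-padded to a multiple of n_conv
    C = np.zeros((m, p))
    for j in range(0, n, n_conv):
        C += A[:, j:j + n_conv] @ B[j:j + n_conv, :]
    return C
```

The assertion reflects the slide's zero-padding rule: the shared dimension must already be a multiple of the block width.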

Normalization PWL Approximation

out(f_o, x, y) = in(f_o, x, y) / (1 + (1/K) · Σ_{f_i = f_o - K/2}^{f_o + K/2} in²(f_i, x, y))

- The exponent operation is resource-expensive; it is replaced by a piece-wise linear approximation
- Floating-point implementation; only the sum-of-squares operation remains
- Design variable for acceleration: N_NORM (loop unroll factor)

(Plot: normalization function vs. its piece-wise linear approximation for x_o in [0, 100], with the relative error in percent.)
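A table-lookup sketch of the idea in NumPy, with an assumed LRN-style scaling function and illustrative knot placement (the paper's exact coefficients and segment count are not reproduced here):

```python
import numpy as np

# The expensive part of normalization is raising (1 + s) to a fractional
# power, where s is the local sum of squares. A piece-wise linear table
# lookup (np.interp over precomputed knots) replaces the exponentiation.
# BETA and the knot layout below are illustrative assumptions.
BETA = 0.75

def lrn_scale(s):
    return (1.0 + s) ** -BETA

# Log-spaced knots concentrate segments where the function curves most.
knots = np.concatenate(([0.0], np.geomspace(1e-2, 100.0, 255)))
table = lrn_scale(knots)

def lrn_scale_pwl(s):
    return np.interp(s, knots, table)
```

On hardware the lookup becomes a small ROM plus one multiply-add per evaluation, instead of an exponent unit.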

Pooling & Inner Product

Pooling: out(f_o, x, y) = max/average over 0 ≤ (k_x, k_y) < K of in(f_o, x + k_x, y + k_y)
- Design variable for acceleration: N_POOL (loop unroll factor)

Classification (inner-product or fully-connected) layer: out(f_o) = Σ_{f_i=0}^{N_if-1} wt(f_o, f_i) · in(f_i)
- Design variable for acceleration: N_FC (loop unroll factor)
- Limited by external memory bandwidth
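The two operations above can be sketched in NumPy (an illustrative model; the accelerator implements them as unrolled OpenCL kernels):

```python
import numpy as np

# Max pooling: per feature map, take the max over each KxK window.
def max_pool(inp, k, stride):
    n_f, h, w = inp.shape
    oh, ow = (h - k) // stride + 1, (w - k) // stride + 1
    out = np.empty((n_f, oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[:, y, x] = inp[:, y * stride:y * stride + k,
                               x * stride:x * stride + k].max(axis=(1, 2))
    return out

# Fully-connected layer: out(f_o) = sum over f_i of wt(f_o, f_i) * in(f_i),
# i.e. a plain matrix-vector multiply-accumulate.
def fully_connected(wt, inp):
    return wt @ inp
```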

Target Optimization Framework

Find the values of (N_CONV, S_CONV, N_NORM, N_POOL, N_FC) that maximize the throughput of the FPGA accelerator.

Problem formulation:

Minimize    Σ_{i=0}^{L} runtime_i(N_CONV, S_CONV, N_NORM, N_POOL, N_FC)    (performance model)

Subject to  Σ_{j=0}^{L} DSP_j ≤ DSP_MAX
            Σ_{j=0}^{L} Memory_j ≤ Memory_MAX
            Σ_{j=0}^{L} Logic_j ≤ Logic_MAX    (resource utilization models, FPGA-specific)

Performance Modeling

Analytical models for performance. For convolution:

Execution time_i = No. of convolution ops_i / (N_CONV · S_CONV · Frequency)

No. of convolution ops_i = PAD_{N_CONV}(conv filter dimensions · no. of input features) · PAD_{N_CONV}(no. of output features) · PAD_{N_CONV}(output feature dimensions)

where PAD_{N_CONV}(·) rounds its argument up to the next multiple of N_CONV.
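The padding in the op count is what makes modeled throughput non-monotonic in N_CONV: a larger unroll factor can waste more work on zero-padded dimensions. A small Python sketch of the model, using the AlexNet conv-1 dimensions from the earlier slide (the 120 MHz frequency and S_CONV = 4 are illustrative):

```python
# Modeled convolution time: every dimension is rounded up to a multiple of
# n_conv before the op count is taken.
def pad(x, n):
    return ((x + n - 1) // n) * n

def conv_ops(n_if, k, n_of, out_dim, n_conv):
    return (pad(k * k * n_if, n_conv) * pad(n_of, n_conv)
            * pad(out_dim * out_dim, n_conv))

def exec_time(ops, n_conv, s_conv, freq_hz):
    return ops / (n_conv * s_conv * freq_hz)

# AlexNet conv-1: 3 input features, 11x11 kernel, 96 outputs, 55x55 output
times = {n: exec_time(conv_ops(3, 11, 96, 55, n), n, 4, 120e6)
         for n in (32, 48, 64)}
```

Here N_CONV = 32 is modeled as slower than N_CONV = 48 despite the identical padded filter dimension, because the padded 55x55 output plane wastes proportionally more cycles.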

Model vs. Measured Performance
- Non-monotonic behavior because of zero-padding
- Requires global search techniques

Performance Modeling

Other CNN layers (pooling, normalization, classification):

Execution time_i = No. of operations_i / (Unroll factor · Frequency)

- Classification (FC) layers: limited by memory bandwidth
- Hardware specification defines the upper limit for N_FC

FPGA Resource Utilization Models
- Optimizations in HLS tools (loop pipelining, memory coalescing) are difficult to model analytically
- Linear regression models for DSP blocks, logic elements, and on-chip memory resources

Optimization Setup
- Genetic algorithm in the Matlab global optimization toolbox
- Objective function: performance models (CNN-specific)
- Integer variables: (N_CONV, S_CONV, N_NORM, N_POOL, N_FC)
- Constraints: resource utilization models (FPGA-specific)
- Altera OpenCL compiler constraints: S_CONV ∈ {1, 2, 4, 8, 16}; N_CONV = N x S_CONV (N: integer)
- Flow: FPGA resources → models → optimization toolbox → optimized variables
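A brute-force stand-in for the genetic-algorithm search, honoring the compiler constraints above: the cost and resource models are toy placeholders, and the 256-DSP budget is the DE5-Net figure from the hardware-specification slide.

```python
# Enumerate the (N_conv, S_conv) grid allowed by the compiler constraints,
# discard points whose modeled DSP usage exceeds the budget, and keep the
# fastest design under the performance model.
def pad(x, n):
    return ((x + n - 1) // n) * n

def modeled_time(n_conv, s_conv):
    # AlexNet conv-1 op count under the padding model (arbitrary time units)
    ops = pad(363, n_conv) * pad(96, n_conv) * pad(3025, n_conv)
    return ops / (n_conv * s_conv)

def modeled_dsp(n_conv, s_conv):
    return 2 * n_conv * s_conv  # hypothetical linear resource model

best = None
for s_conv in (1, 2, 4, 8, 16):
    for mult in range(1, 9):
        n_conv = mult * s_conv          # compiler requires N_conv = N * S_conv
        if modeled_dsp(n_conv, s_conv) > 256:   # DE5-Net DSP budget
            continue
        t = modeled_time(n_conv, s_conv)
        if best is None or t < best[0]:
            best = (t, n_conv, s_conv)
```

A genetic algorithm replaces this exhaustive loop when the variable space (five knobs, per-layer models) is too large to enumerate.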

FPGA Hardware Specification

                   P395-D8           DE5-Net
FPGA               Stratix-V GSD8    Stratix-V GXA7
Logic elements     695K              622K
DSP blocks         1,963             256
M20K RAMs          2,567             2,560
External memory    4x 8GB DDR3       2x 2GB DDR3

P395-D8 board with Altera Stratix-V GSD8 FPGA

Optimization Progress (AlexNet on DE5-Net: 2.6x improvement)

Design                   A         B         C
Exec. time (model)       120.6 ms  54.3 ms   46.1 ms
Exec. time (measured)    117.7 ms  52.6 ms   45.7 ms
Logic elements           158K      152K      153K
M20K memory blocks       1,439     1,744     1,673
DSP blocks               164       234       246

Designs with similar hardware utilization show a significant performance gap.

Experimental Results

Optimized design variables:

             P395-D8 board        DE5-Net board
Parameter    AlexNet   VGG-16     AlexNet   VGG-16
N_CONV       64        64         32        64
S_CONV       8         8          4         2
N_NORM       2         -          2         -
N_POOL       1         1          1         1
N_FC         71        64         32        30

Classification performance:

           Hardware        Time/image (ms)   Throughput (GOPS)
AlexNet    P395-D8         20.1              72.4
           DE5-Net         45.7              31.8
           CPU (i5-4590)   191.9             7.6
VGG-16     P395-D8         262.9             117.8
           DE5-Net         651.2             47.5
           CPU (i5-4590)   1437.2            21.5

P395-D8 speedup over CPU: 9.5x (AlexNet), 5.5x (VGG-16)

AlexNet Utilization & Performance
(Chart: resource utilization and throughput; series: convolution, normalization with high precision, FPGA limit.)
The number of DSP blocks strongly impacts accelerator throughput.

Accuracy Comparison
- Accuracy measured with 50K ImageNet validation images
- Accuracy degradation: <2% for top-1; <1% for top-5
- Causes: limited-precision weights, fixed-point operations, approximation in the normalization function

           Full precision (Caffe)   Fixed-point FPGA implementation
Accuracy   Top-1      Top-5         Top-1      Top-5
AlexNet    56.82%     79.95%        55.41%     78.98%
VGG        68.35%     88.44%        66.58%     87.48%

Comparison

              [1] ICCD'13        [2] FPGA'15        This work          This work
FPGA          Virtex-6 VLX240T   Virtex-7 VX485T    Stratix-V GSD8     Stratix-V GSD8
Frequency     150 MHz            100 MHz            140 MHz            120 MHz
CNN           -                  AlexNet            AlexNet            VGG-16
CNN size      2.74 GMAC          1.33 GOP           1.46 GOP           30.9 GOP
Precision     fixed              float (32b)        fixed (8-16b)      fixed (8-16b)
Throughput    17 GOPS (b)        61.6 GOPS (a)      126.6 GOPS (a)     136.5 GOPS (a)
                                                    72.4 GOPS (b)      117.8 GOPS (b)

(a) convolution operations only; (b) all operations for image classification. 1 GMAC = 2 GOP.
[1] M. Peemen, et al. Memory-centric accelerator design for convolutional neural networks. ICCD 2013.
[2] C. Zhang, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. ACM ISFPGA 2015.

Summary
- OpenCL-based FPGA accelerator for CNNs: fixed-point operations on reduced-precision weights; parameterized design variables for acceleration
- Optimization strategy to maximize system throughput: genetic algorithm over performance and resource utilization models
- End-to-end ImageNet classification demonstrated: AlexNet and VGG-16 on two Altera Stratix-V FPGA boards; AlexNet: 72.4 GOPS, VGG-16: 117.8 GOPS
- Future directions: optimizing pipelined CNN stages for batch processing; running compressed networks with sparse matrix operations

Demo: AlexNet on FPGA
Running on the P395-D8 FPGA board

Thank you! Q & A