Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System


Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Chi Zhang, Viktor K. Prasanna
University of Southern California
{zhan527, prasanna}@usc.edu | fpga.usc.edu
ACM FPGA, Feb. 2017

Outline: Motivation; Target platform and problem definition; Main approach; Overall system design; Experiments and results

Motivation: Natural Language Processing, Image Processing and Computer Vision, Speech Recognition, Self-driving Cars, Playing Games

Related Work (Convolutional Neural Network Acceleration)
Algorithm level:
- Computation Reduction: Fast Fourier Transform, Overlap-and-Add, Singular Value Decomposition
- Model Compression: Shrink Model Representation, Shrink Data Precision
- Automatic Code Generation: High-level Synthesis
Hardware level:
- Optimize Computation Engine: High Throughput Convolver, High Throughput FFT Engine
- Optimize Memory System: Double Buffering, Hide Memory Latency
- Heterogeneous Platform: Concurrent Processing, CPU + FPGA Shared Memory System

Outline: Motivation; Target platform and problem definition; Main approach; Overall system design; Experiments and results

Target Platform: CPU + FPGA + Shared Memory
[Block diagram: CPU-FPGA shared memory model, e.g. the Intel QuickAssist QPI-FPGA platform. Main memory exposes a shared memory address space; a memory relay engine and coherent memory interfaces connect it over a high-speed interconnection to the FPGA, whose accelerators access data through a memory fetch unit with memory request and response buffers.]
Key attributes:
1. CPU threads and FPGA share a large virtual address space.
2. Cache lines are accessible through the high-speed interconnection.
This is a big advantage for streaming applications: data transfer and processing times can be overlapped => P-M software pipeline.
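
As an illustration of that overlap, here is a minimal software sketch of a two-buffer fetch/process pipeline. It assumes generic fetch and process callables, not the actual QPI/AAL API:

```python
import threading, queue

def stream_pipeline(tiles, fetch, process, depth=2):
    """Overlap memory fetch and compute with a bounded queue
    (a software analogue of double buffering on the FPGA)."""
    buf = queue.Queue(maxsize=depth)   # 'depth' buffers in flight
    SENTINEL = object()

    def producer():
        for t in tiles:
            buf.put(fetch(t))          # memory transfer for tile t
        buf.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (item := buf.get()) is not SENTINEL:
        results.append(process(item))  # compute overlaps the next fetch
    return results
```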

Problem Definition
Given an arbitrary trained CNN architecture and resource constraints, map the forward path (inference) onto the CPU-FPGA shared memory system such that the total execution time is minimized.
[Figure: the CNN forward path.
- Convolutional layer: N_in x N_in x D_in input feature maps are convolved with D_out kernels of size F x F x D_in (2D convolution, then sum and bias) to produce N_out x N_out x D_out output feature maps: y(i, j) = f(x_{m,n}, ..., x_{m+w-1,n+w-1}).
- ReLU layer: f(x) = max(0, x).
- Pooling layer: max/average over an N_pool x N_pool window.
- Fully-connected layer: unrolling followed by dense weights, producing class scores Pr("Cat"), Pr("Dog"), Pr("Truck"), Pr("Ship").]
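
For concreteness, a minimal numpy sketch of those layer definitions (shapes and variable names are ours, not from the slides):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # f(x) = max(0, x)

def max_pool(x, n):
    """Non-overlapping n x n max pooling over an (H, W, D) tensor."""
    H, W, D = x.shape
    return x[:H - H % n, :W - W % n].reshape(H // n, n, W // n, n, D).max(axis=(1, 3))

def conv_layer(x, k, b):
    """Direct convolution: x is (N, N, D_in), k is (F, F, D_in, D_out)."""
    N, _, D_in = x.shape
    F, _, _, D_out = k.shape
    N_out = N - F + 1
    y = np.zeros((N_out, N_out, D_out))
    for i in range(N_out):
        for j in range(N_out):
            patch = x[i:i+F, j:j+F, :, None]           # sliding window
            y[i, j] = (patch * k).sum(axis=(0, 1, 2))  # 2D conv + channel sum
    return y + b                                       # bias

x = np.random.rand(8, 8, 3)
k = np.random.rand(3, 3, 3, 4)
y = relu(conv_layer(x, k, np.zeros(4)))   # shape (6, 6, 4)
p = max_pool(y, 2)                        # shape (3, 3, 4)
```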

Outline: Motivation; Target platform and problem definition; Main approach; Overall system design; Experiments and results

State of the Art on FPGA
The convolutional layer occupies over 90% of the total computation time, so most previous work focuses on highly optimized hardware to accelerate the convolutional layer.
Direct convolution with an adder-tree based convolver: input feature maps and kernels feed a multiplier array, an adder tree, and an accumulator. Data parallelism is limited by the tree structure, so throughput is limited.

FFT-based Approach
2D convolution can be computed efficiently using the FFT. In signal processing, overlap-and-add (OaA) is an efficient way to compute the discrete convolution of a long signal with a small kernel.
Parameter definitions:
- Input feature map: N_in x N_in x D_in
- Output feature map: N_out x N_out x D_out
- Kernels: F x F x D_in x D_out
- FFT size: P, tile size: L

Convolutional Layer in Frequency Domain: Overlap-and-Add (OaA)
1. Partition the input feature maps into L x L tiles.
2. Perform a 2D FFT on each L x L tile of the input feature maps and on the kernels.
3. Perform the Hadamard (dot) product of each transformed input tile with the corresponding kernel map.
4. Sum across input channels and perform a 2D IFFT on each output feature map tile.
5. Perform overlap-and-add on contiguous tiles with stride F - 1, where F is the kernel size.
Advantages:
- Reduces the total number of operations.
- Increases the data parallelism.
- Each tile is easy to store on FPGA (ceil(N_in / L)^2 tiles in BRAM) for high-throughput processing.
[Figure: unrolled input feature map tiles after FFT, multiply-accumulate in the frequency domain, then overlap-and-add into the output feature maps.]
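
A minimal single-channel numpy sketch of steps 1-5 (variable names ours); the final assert checks it against a plain FFT-based full convolution, and scipy.signal.fftconvolve(image, kernel) would give the same result:

```python
import numpy as np

def conv2d_oaa(image, kernel, L, P):
    """2D convolution via Overlap-and-Add: split the image into L x L
    tiles, convolve each tile using P-point 2D FFTs, and overlap-add
    the partial outputs (adjacent tiles overlap by F - 1 samples)."""
    F = kernel.shape[0]
    assert P >= L + F - 1, "FFT size must cover tile plus kernel overlap"
    H, W = image.shape
    out = np.zeros((H + F - 1, W + F - 1))
    Kf = np.fft.fft2(kernel, (P, P))             # kernel spectrum, reused per tile
    for i in range(0, H, L):
        for j in range(0, W, L):
            tile = image[i:i+L, j:j+L]
            Tf = np.fft.fft2(tile, (P, P))       # zero-padded tile spectrum
            part = np.real(np.fft.ifft2(Tf * Kf))  # Hadamard product + IFFT
            h, w = tile.shape
            out[i:i+h+F-1, j:j+w+F-1] += part[:h+F-1, :w+F-1]  # overlap-add
    return out

H, W, F = 16, 16, 3
img, ker = np.random.rand(H, W), np.random.rand(F, F)
full = np.real(np.fft.ifft2(np.fft.fft2(img, (H+F-1, W+F-1))
                            * np.fft.fft2(ker, (H+F-1, W+F-1))))
assert np.allclose(conv2d_oaa(img, ker, L=6, P=8), full)
```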

Algorithmic Analysis
Theoretical computational complexity:
- Direct convolution: O(N^2 F^2)
- OaA-based convolution: O(N^2 log F)
Reduced computational complexity leads to less execution time and less energy consumption. FFT parallelization => further reduction in the overall execution time.
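
A sketch of where the O(N^2 log F) bound comes from, for one input/output channel pair, assuming the tile size is chosen as L = Θ(F) so that P = L + F - 1 = Θ(F):

\[
\underbrace{\lceil N/L\rceil^{2}}_{\text{tiles}} \cdot \underbrace{O\!\left(P^{2}\log P\right)}_{\text{2D FFT per tile}}
= O\!\left(N^{2}\,\tfrac{P^{2}}{L^{2}}\log P\right)
= O\!\left(N^{2}\log F\right),
\]

compared with \(O(N^{2}F^{2})\) for direct convolution.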

Computational Complexity Comparison
[Bar charts: GFLOP per layer (CONV1-CONV5 and total) for space convolution vs. overlap-and-add, for the AlexNet CONV layers (left, 0-3 GFLOP) and the VGG16 CONV layers (right, 0-40 GFLOP).]
GFLOP reduction: AlexNet 22.12%, VGG16 41.5%.
The FFT size that achieves the minimum computational complexity is different for each layer.

Mapping Choices on CPU and FPGA
- The convolutional layer occupies over 90% of the total computation time.
- The 2D convolution operation can be fully pipelined and parallelized when transformed into the frequency domain.
- The convolutional layer is computation bound (instead of memory-bandwidth bound).
- The data transfer overhead for the ReLU and pooling layers overwhelms any acceleration from the FPGA.
- If the kernel size were the same across all CONV layers, we could place the ReLU and pooling layers right after the CONV layer on the FPGA. However, the hardware complexity of building a generic accelerator for all layers on the FPGA is too high.

Mapping Choices on CPU and FPGA
Prepare input feature maps and kernels (CPU) -> Accelerate CONV layer (FPGA) -> Overlap-and-add (CPU) -> Depth concatenation (CPU, optional) -> Pooling (CPU, optional) -> ReLU (CPU) -> Fully-connected (CPU)

Variable-Length FFT Module
Example with a radix-4 FFT module: a 64-point FFT needs log_4(64) = 3 stages. For a 16-point FFT, we use the first 2 stages of the 64-point architecture and bypass the third stage.
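
The stage count is what the bypass exploits. Here is a recursive radix-4 software sketch (ours, not the slide's hardware design; the hardware is an iterative pipeline, but the recursion depth equals the log_4(n) stage count):

```python
import numpy as np

def fft_radix4(x):
    """Radix-4 DIT FFT for len(x) a power of 4. Recursion depth is
    log4(len(x)): a 16-point transform uses one stage fewer than a
    64-point one, which is what the hardware stage bypass exploits."""
    n = len(x)
    if n == 1:
        return x
    subs = [fft_radix4(x[i::4]) for i in range(4)]   # 4 interleaved sub-FFTs
    k = np.arange(n // 4)
    w = np.exp(-2j * np.pi * k / n)                  # twiddle factors
    t = [subs[m] * w**m for m in range(4)]
    out = np.empty(n, dtype=complex)
    for q in range(4):                               # radix-4 butterflies
        out[q*(n//4):(q+1)*(n//4)] = sum(t[m] * (-1j)**(q*m) for m in range(4))
    return out

x = np.random.rand(64) + 1j * np.random.rand(64)
assert np.allclose(fft_radix4(x), np.fft.fft(x))            # n = 64: 3 stages
assert np.allclose(fft_radix4(x[:16]), np.fft.fft(x[:16]))  # n = 16: 2 stages
```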

2D Convolver Accelerator
Architecture: 2D FFT -> Multiply-Accumulate (MAC) -> 2D IFFT, streaming P^2-point tiles.
- The 2D FFT kernel = row FFT (parallel 1D FFTs) + stream matrix transpose + column FFT; the 2D IFFT kernel mirrors it (column IFFT + stream matrix transpose + row IFFT).
- Input feature maps are streamed in, multiplied and accumulated with the transformed kernels in the MAC stage, and streamed out as output feature maps.
- The matrix transpose is achieved by direct wiring or by streaming permutation on data streamed from memory.
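
The row-column decomposition the convolver implements, as a short numpy sketch (equivalent to np.fft.fft2):

```python
import numpy as np

def fft2_row_column(x):
    """2D FFT = row FFTs, transpose, column FFTs (run as row FFTs again),
    transpose back: the streaming structure of the 2D FFT kernel."""
    rows = np.fft.fft(x, axis=1)           # row FFT stage
    return np.fft.fft(rows.T, axis=1).T    # stream transpose + column FFT

x = np.random.rand(16, 16)
assert np.allclose(fft2_row_column(x), np.fft.fft2(x))
```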

Key Idea in Our Streaming Permutation Hardware Design
Fold the Clos network into a 3-stage Streaming Permutation Network (SPN). A Clos network with N = S_1 x S_2 has a stage of S_2 crossbars of size S_1 x S_1, a middle stage of S_1 crossbars of size S_2 x S_2, and a final stage of S_2 crossbars of size S_1 x S_1. The folded SPN streams p = S_1 elements per cycle through:
- Stage 0: an S_1 x S_1 crossbar (S_1-to-S_1 connection).
- Stage 1: S_1 single-port memory blocks, each of size S_2, which permute the temporal order of the data elements (permutation in time).
- Stage 2: an S_1 x S_1 crossbar (S_1-to-S_1 connection).
This supports any given arbitrary permutation on streaming data!

Fast Fourier Transform
Parameterized architecture:
- Algorithm-mapping parameters: vertical parallelism (V_p), number of FFT processors (N_p).
- Architecture-binding parameters: type of memory, pipeline depth.
Optimizations:
- Memory activation scheduling: deactivate memories periodically when not accessed.
- Energy-efficient permutation: permute streaming data using memories, with the proposed data remapping technique.
- Clock gating: disable the floating-point unit when one of its inputs is 0, 1, j, or -j.
Comparison: 4.5x improvement over the baseline.
[Figure: 8-point FFT architecture (V_p = 2, N_p = 2). Plots: energy efficiency (GFLOPS/J) of baseline vs. optimized design for FFT problem sizes 256, 1024, 4096, 65536.]

Performance Evaluation (1)
Performance metric: Delay-Multiplier (DM) product, computed as the number of cycles to process a complete convolutional layer times the number of multipliers consumed by the design. Since both the power consumption and the area of the convolver are proportional to the number of multipliers, the DM product approximates both energy dissipation and the area-delay product.

Performance Evaluation (2)
DM ratio = (DM product of the space convolver) / (DM product of the OaA-based convolver).
Design choice: choose the FFT size such that the DM ratio is maximized for the given resource constraints.
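
In symbols (with C the cycle count for a complete convolutional layer, M the multiplier count, and P the FFT size; notation ours):

\[
\mathrm{DM} = C \cdot M,\qquad
\text{DM ratio}(P) = \frac{\mathrm{DM}_{\text{space}}}{\mathrm{DM}_{\text{OaA}}(P)},\qquad
P^{*} = \arg\max_{P}\ \text{DM ratio}(P)\ \text{subject to resource constraints.}
\]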

Outline: Motivation; Target platform and problem definition; Main approach; Overall system design; Experiments and results

Overall System Design
- CPU-FPGA concurrent processing through shared memory.
- Overlap memory access latency and data processing latency by buffering data on the FPGA.
- By using the CPU for lightweight but general tasks and the FPGA for computationally intensive but dedicated tasks, our framework applies to a wide range of CNN models.

Overall System Design
[Architecture diagram: memory responses from shared memory (input feature maps/kernels) feed 2D FFT kernels; kernel buffers (task parallelism T_k) and image buffers (task parallelism T_i) feed the MAC units, with a mux for 1x1-kernel bypassing; MAC buffers drive 2D IFFT kernels, which issue memory requests to shared memory (output feature maps).]
- FFT-processed data is buffered for reuse across different kernels.
- Image buffer task parallelism T_i and kernel buffer task parallelism T_k; the system parallelism is T_i x T_k.

Outline: Motivation; Target platform and problem definition; Main approach; Overall system design; Experiments and results

Experimental Setup
- Intel QuickAssist QPI-FPGA Platform (Intel Heterogeneous Architecture Research Platform):
  - 10-core Intel Xeon E5-2600 v2 processor
  - Altera Stratix V FPGA: 6.25 MB BRAM + 939 K registers + 234,720 ALMs + 256 DSPs
  - CPU and FPGA share a 2 GB address space.
  - The FPGA can access data through QPI from the last-level cache of the CPU's memory system.
  - The FPGA is only cacheline addressable (64 bytes).
- We implemented a generic accelerator for the convolutional layer on the FPGA; the remaining work is done on the CPU.

Experimental Results

                                   Zhang (2015)    Qiu (2016)      Our Work          Our Work
Data precision                     32-bit float    16-bit fixed    32-bit float      32-bit float
Platform                           Virtex-7        Zynq ZC706      Xeon + Stratix V  Xeon + Stratix V
Frequency (MHz)                    100             150             200               200
CNN model                          AlexNet         VGG16-SVD       AlexNet           VGG16
Multipliers                        747             780             224               224
Memory                             4.5 MB          2.13 MB         4.0 MB            4.0 MB
Throughput (GFLOPs/sec)            61.62           187.80 (CONV)   83.00 (CONV)      123.48 (CONV)
Delay (CONV) (ms)                  21.61           163.42          40.81             263.27
Delay x Multiplier                 16142           127467          9141              58972
Resource eff. (GFLOPs/sec/mult.)   0.028           0.24            0.37              0.55
Power eff. (GFLOPs/sec/W)          3.31            19.50           6.30              9.37

Conclusion
- High-throughput OaA-based 2D convolver to accelerate the convolutional layer.
- Implemented AlexNet and VGG16 on the Intel QuickAssist QPI-FPGA platform and compared with state-of-the-art designs:
  - AlexNet: 1.35x throughput improvement using 3.33x fewer multipliers and 1.1x less memory.
  - VGG16: 0.66x throughput using 3.48x fewer multipliers, with 32-bit floating point.
- Potential performance improvement with compressed data representation.
Future work:
- Use compressed data representations.
- Build an automatic code generation tool.
- Explore other algorithms, including the Winograd algorithm.

Thanks! fpga.usc.edu

Backup Slides

Deeper Understanding of Overlap-and-Add
Basic observations:
- Large kernel sizes benefit more from the frequency domain approach than small kernel sizes.
- A large stride makes the frequency domain approach less advantageous.
- Deeper input feature map channels make the frequency domain approach more advantageous, due to the linearity of the FFT.

Deeper Understanding of Overlap-and-Add (2)
We fix D_in = 64, D_out = 64, kernel size = 3, stride = 1, vary the input feature map size, and calculate the computation requirement ratio between direct convolution and OaA.
[Plot: computation requirement ratio between direct convolution and OaA (0-2.5) versus input feature map size (0-250).]
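
That ratio can be approximated in a few lines. This uses an illustrative cost model (roughly 5 FLOPs per point for a complex FFT and 8 FLOPs per complex multiply-add), not the paper's exact operation counts:

```python
import math

def direct_flops(N, F, D_in, D_out):
    """Space-domain convolution: 2 FLOPs per multiply-accumulate."""
    N_out = N - F + 1
    return 2.0 * D_in * D_out * N_out**2 * F**2

def oaa_flops(N, F, D_in, D_out, L):
    """OaA: FFT every input tile once, one complex MAC per frequency bin
    for each (input, output) channel pair, then IFFT every output tile."""
    P = L + F - 1                            # FFT size covering tile + overlap
    tiles = math.ceil(N / L) ** 2
    fft2d = 5.0 * P * P * math.log2(P * P)   # ~5 FLOPs per point, complex FFT
    cmac = 8.0 * P * P                       # complex multiply-add per bin
    return (D_in * tiles * fft2d             # forward FFTs of input tiles
            + D_in * D_out * tiles * cmac    # Hadamard products + accumulation
            + D_out * tiles * fft2d)         # inverse FFTs of output tiles

# Backup-slide setting: D_in = D_out = 64, kernel size 3, stride 1
for N in (28, 56, 112, 224):
    print(N, round(direct_flops(N, 3, 64, 64) / oaa_flops(N, 3, 64, 64, L=14), 2))
```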

Deeper Understanding of Overlap-and-Add (3)
- A large input feature map size gives OaA-based convolution more of an advantage.
- Corner case: when the input feature map size is slightly greater than the FFT sub-matrix (tile), approximately 4x of the computation is wasted.
[Figure: convolution image vs. FFT sub-matrix.]

References
[1] C. Zhang, et al. "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks." In Proc. 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15).
[2] R. Chen, H. Le, and V. K. Prasanna. "Energy Efficient Parameterized FFT Architecture." IEEE International Conference on Field Programmable Logic and Applications (FPL), August 2013.