Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang


Outline
- Project Significance
- Prior Work
- Research Objectives
- Hypotheses
- Testing Framework
- Results
- Conclusions and (Hypothetical) Future Work

Project Significance


Binarization
- Assigns weights of +1 or -1, turning multiply-accumulates into simple accumulates
- Works better when done stochastically
- More energy efficient; datacenters are power- and cooling-limited
- Faster inference phase, so lower latency in commercial applications
- Could be important for the move to low-power/mobile platforms
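
Below is a minimal NumPy sketch (not the project's Theano code) of the two binarization rules from BinaryConnect: deterministic binarization takes the sign of the weight, while the stochastic variant samples +1 with probability given by a hard sigmoid of the real-valued weight.

```python
import numpy as np

def binarize(w, stochastic=False, rng=np.random.default_rng(0)):
    """Binarize real-valued weights to {+1, -1}.

    Deterministic: sign of the weight.
    Stochastic (per BinaryConnect): +1 with probability
    p = hard_sigmoid(w) = clip((w + 1) / 2, 0, 1).
    """
    if not stochastic:
        return np.where(w >= 0, 1.0, -1.0)
    p = np.clip((w + 1.0) / 2.0, 0.0, 1.0)   # hard sigmoid
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)

w = np.array([-0.7, 0.1, 0.4])
print(binarize(w))                    # [-1.  1.  1.]
print(binarize(w, stochastic=True))   # sampled signs
```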

Prior Work

Intel Paper: Hardware Acceleration of BNNs
- Compares standard neural networks with their binarized counterparts on a variety of hardware platforms: FPGA, CPU, GPU, ASIC
- Applies two optimizations for CPU and GPU: binarizing and batching (batches of size 10)
- Focuses on the classification (inference) stage
- Batches are kept small to avoid latency in commercial applications; the GPU is limited by this

- Binarize: CPU 5x faster, GPU 11x faster
- Batching (size 10): CPU 80% faster, GPU 5.8x faster
- Binarize + batch: GPU and ASIC are fastest; ASIC and FPGA are most efficient
- CPU and GPU achieve high throughput, but at a latency cost
- Utilization is poor without batching

Courbariaux: BinaryConnect Summary
- Introduces the BinaryConnect method for training DNNs with binary weights during forward and backward propagation
- Retains full-precision weights, on which the gradient updates are applied
- Binarization acts as a regularizer
- Achieves near state-of-the-art accuracies
- Implements both deterministic and stochastic binarization
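
The training scheme described above can be sketched as follows. This is an illustration of the method, not the authors' implementation; grad_fn is a hypothetical stand-in for the network's forward/backward pass.

```python
import numpy as np

def binaryconnect_step(w_real, grad_fn, lr=0.01):
    """One BinaryConnect SGD step (sketch).

    Propagation uses binarized weights; the update is applied to
    the retained real-valued weights, which are then clipped to
    [-1, 1] so they never drift out of the binarizable range.
    """
    w_bin = np.where(w_real >= 0, 1.0, -1.0)   # binarize for forward/backward
    grad = grad_fn(w_bin)                      # gradient w.r.t. binary weights
    return np.clip(w_real - lr * grad, -1.0, 1.0)

# dummy gradient stands in for the network's backward pass
w = np.array([0.3, -0.8])
w = binaryconnect_step(w, grad_fn=lambda wb: 0.5 * wb)
print(w)  # [ 0.295 -0.795]
```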


Matthieu Courbariaux: BinaryNet
- Introduces methods for training BNNs (binary weights and activations)
- Shows that BNNs are approximately as accurate as state-of-the-art DNNs
- Shows that BNNs reduce memory usage and allow bitwise operations, with the potential to reduce power consumption
- Implements a binary matrix-multiplication GPU kernel for a ~7x speedup
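
The bitwise trick behind such a kernel can be illustrated in plain NumPy (a sketch, not the actual GPU kernel): pack the {+1, -1} vectors into bits, XOR them, and popcount, since dot(a, b) = n - 2 * popcount(a XOR b).

```python
import numpy as np

def pack(v):
    """Pack a {+1, -1} vector into bits (bit 1 encodes +1)."""
    return np.packbits(v > 0)

def binary_dot(a, b):
    """Dot product of two {+1, -1} vectors via XOR + popcount:
    matches - mismatches = n - 2 * popcount(a XOR b)."""
    n = len(a)
    xor = np.bitwise_xor(pack(a), pack(b))
    mismatches = int(np.unpackbits(xor)[:n].sum())  # popcount
    return n - 2 * mismatches

a = np.array([1, -1, 1, 1])
b = np.array([1, 1, 1, -1])
print(binary_dot(a, b), a @ b)  # 0 0
```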

Objectives and Hypotheses

Objectives
- Recreate Intel's results; test Courbariaux's work
- Run a BNN on a CPU, GPU, and FPGA
- Test three differently sized datasets: MNIST, CIFAR-10, SVHN
- Measure performance, power consumption, and resource utilization
- Test the effects of batching
- Draw conclusions for various applications of neural networks running on different hardware platforms

Hypotheses
- The FPGA will be superior in terms of power consumption and resource utilization
- At small (~1) batch sizes, the FPGA will outperform the others
- With larger batches, the GPU will perform best
- Performance will scale nonlinearly with dataset size
- Relative performance across the three datasets will be comparable

Testing Framework

Testing Framework
- Workload: BinaryConnect Python scripts, using Theano for matrix operations
- Hardware: Intel i7-5960X, Nvidia GTX Titan X, and an FPSoC (28nm XC7Z020)
- Metrics: performance, power, and resource utilization (Kill A Watt power meter)

Testing Framework
- Performance
  - Measure run time for training MNIST, SVHN, and CIFAR-10 on each system
  - If a dataset is too large to run to completion, record per-epoch training times and extrapolate
- Power consumption
  - Use a wall-outlet power meter on the computer
  - Record the total energy used in a training session (kWh)
  - Subtract off the idle-system energy (extrapolated from measured idle consumption)
  - The FPGA has its own power-measurement hardware
- Resource utilization
  - The Linux command top for the CPU, nvidia-smi for the GPU
  - Utilization reported by the FPGA tool flow
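
A small sketch of the idle-energy subtraction described above; the numbers in the usage line are illustrative, not measurements.

```python
def net_training_energy_kj(total_kwh, idle_watts, runtime_s):
    """Subtract the extrapolated idle-system energy from the
    metered wall energy to isolate the training workload."""
    total_j = total_kwh * 3.6e6          # kWh -> J
    idle_j = idle_watts * runtime_s      # idle power over the run
    return (total_j - idle_j) / 1e3      # kJ

# e.g. 0.75 kWh metered over a 6 h run with a 60 W idle draw
print(net_training_energy_kj(0.75, 60.0, 6 * 3600))  # ~1404 kJ
```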


CPU Results
- The CPU is slow for training the BNN: MNIST takes ~18 hours
- This makes CIFAR-10 and SVHN prohibitively large for our timeframe, so we take data several epochs in and extrapolate
- Power usage is higher with batching
- Utilization: 75% when running CIFAR-10 (6 out of 8 cores utilized); an i7 is more appropriate for this workload than a Xeon


GPU Results
- The GPU is faster than reported in the BinaryConnect repository; this could be a result of batching
- Power consumption is heavy
- Large batches are more energy-efficient than small ones because of their better performance
- Past a certain batch size, power consumption flattens out at its maximum
- Resource utilization is better than expected: it improves with larger batches and exceeds 70% even at batch size 1

BNN Implementation on FPGA
- Target: embedded FPSoC (28nm XC7Z020) with 53k LUTs, 106k FFs, 140 BRAMs, 220 DSPs
- Stores all feature maps on-chip (4.9 Mb of on-chip storage available)
- Uses the network trained with Theano
- Uses Xilinx HLS to generate RTL from C source

Architecture of the FPGA BNN Accelerator
- Two main components: data buffer and compute units
- f_in is the input parallelization factor; f_out is the output parallelization factor
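
As a rough illustration of what these factors mean, here is a hypothetical first-order runtime model; conv_layer_cycles and its parameters are illustrative names and numbers, not the accelerator's actual RTL parameters.

```python
import math

def conv_layer_cycles(n_in, n_out, n_pixels, f_in, f_out):
    """Hypothetical first-order cycle count: each cycle the compute
    units consume f_in input channels for f_out output channels at
    one output pixel, so runtime shrinks as f_in * f_out grows."""
    return math.ceil(n_in / f_in) * math.ceil(n_out / f_out) * n_pixels

# with f_out fixed at 1 (as in the experiments), doubling f_in
# roughly halves the per-layer cycle count
print(conv_layer_cycles(128, 128, 32 * 32, f_in=4, f_out=1))  # 4194304
print(conv_layer_cycles(128, 128, 32 * 32, f_in=8, f_out=1))  # 2097152
```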

Experimental Results
- Fixed f_out = 1 to explore the trade-off between f_in and f_out
- LUT and FF usage scales with the number of convolvers
- BRAM and DSP usage is generally insensitive to f_in
- Runtime scales weakly with the number of convolvers
- Consumes 5 W at f_in = 8

Comparing Hardware Platforms

CPU
  Batch Size   Performance (s/epoch)   Energy Usage (kJ/epoch)   Average Power (W)
  1            23212                   2699.80                   116.31
  10           3635                    409.74                    112.72
  50           1510                    206.12                    136.5

GPU
  Batch Size   Performance (s/epoch)   Energy Usage (kJ/epoch)   Average Power (W)
  1            435.8                   76.92                     176.5
  10           71.6                    18.37                     256.5
  50           44.8                    12.11                     249.81
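
As a consistency check, energy per epoch is approximately average power × time per epoch: for the GPU at batch size 1, 176.5 W × 435.8 s ≈ 76.9 kJ, and for the CPU, 116.31 W × 23212 s ≈ 2700 kJ, matching the table.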

Preliminary Results
- The GPU is clearly superior to the CPU in everything but instantaneous power draw
- Batching helps performance and energy efficiency, but sub-linearly
- Batching increases power draw and resource utilization sub-linearly
- It is not yet clear how the GPU and CPU compare to the FPGA, but we anticipate the FPGA's power usage, at a minimum, will be far superior

Conclusions and (Hypothetical) Future Work
- Hardware can improve the performance of BNNs immensely
- Future work: implement an ASIC; better power measurement; a more direct comparison with standard CNNs in inference mode

References
- Courbariaux, Matthieu, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).
- Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. "BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations." In Advances in Neural Information Processing Systems, pp. 3123-3131. 2015.
- Nurvitadhi, Eriko, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh, and Debbie Marr. "Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC." Proc. ICFPT (2016).
- Zhang, Jialiang, and Jing Li. "Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network." In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 25-34. ACM, 2017.