Bandwidth-Efficient Deep Learning


Transcription:

1 Bandwidth-Efficient Deep Learning: from Compression to Acceleration. Song Han, Assistant Professor, EECS, Massachusetts Institute of Technology

2 AI is Changing Our Lives: Self-Driving Cars, Machine Translation, AlphaGo, Smart Robots

3 Models are Getting Larger. IMAGE RECOGNITION, 16x model: 2012 AlexNet (8 layers, 1.4 GFLOP, ~16% error) => 2015 ResNet (152 layers, 22.6 GFLOP, ~3.5% error). SPEECH RECOGNITION, 10x training ops: 2014 Deep Speech 1 (80 GFLOP, 7,000 hrs of data, ~8% error) => 2015 Deep Speech 2 (465 GFLOP, 12,000 hrs of data, ~5% error). Dally, NIPS 2016 workshop on Efficient Methods for Deep Neural Networks

4 The First Challenge: Model Size. Hard to distribute large models through over-the-air updates

5 The Second Challenge: Speed. Error rate / training time: ResNet18 10.76%, 2.5 days; ResNet50 7.02%, 5 days; ResNet101 6.21%, 1 week; ResNet152 6.16%, 1.5 weeks. Such long training times limit ML researchers' productivity. Training time benchmarked with fb.resnet.torch using four M40 GPUs

6 The Third Challenge: Energy Efficiency. AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game. On mobile: drains the battery. In the data center: increases TCO

7 Where is the Energy Consumed? larger model => more memory references => more energy

8 Where is the Energy Consumed? larger model => more memory references => more energy. Energy per operation [pJ]: 32-bit int ADD 0.1; 32-bit float ADD 0.9; 32-bit register file 1; 32-bit int MULT 3.1; 32-bit float MULT 3.7; 32-bit SRAM cache 5; 32-bit DRAM memory 640. Relative energy cost (log scale, 1 to 10000): a memory access costs orders of magnitude more than an arithmetic operation

9 Where is the Energy Consumed? larger model => more memory references => more energy (same energy-per-operation table as the previous slide). How to make deep learning more efficient?
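The table above invites a quick back-of-the-envelope estimate of where inference energy goes. A minimal Python sketch, assuming illustrative layer sizes and the simplification that every 32-bit weight is fetched once per multiply-accumulate; the per-operation energies are the ones quoted on the slide:

```python
# Back-of-the-envelope energy estimate using the per-operation energies (pJ)
# quoted on the slide. The layer sizes and the assumption that every weight
# is fetched from DRAM are illustrative, not a measured model.

ENERGY_PJ = {
    "int32_add": 0.1,
    "fp32_add": 0.9,
    "register_file": 1.0,
    "int32_mult": 3.1,
    "fp32_mult": 3.7,
    "sram_32b": 5.0,
    "dram_32b": 640.0,
}

def fc_layer_energy_pj(in_features, out_features, weights_in_sram=False):
    """Rough energy for one fully-connected inference pass (fp32)."""
    macs = in_features * out_features
    compute = macs * (ENERGY_PJ["fp32_mult"] + ENERGY_PJ["fp32_add"])
    weight_fetch_cost = ENERGY_PJ["sram_32b"] if weights_in_sram else ENERGY_PJ["dram_32b"]
    memory = macs * weight_fetch_cost  # one 32-bit weight fetched per MAC
    return compute + memory

if __name__ == "__main__":
    # AlexNet-like 4096 x 4096 FC layer (hypothetical sizes for illustration)
    dram = fc_layer_energy_pj(4096, 4096, weights_in_sram=False)
    sram = fc_layer_energy_pj(4096, 4096, weights_in_sram=True)
    print(f"weights in DRAM: {dram/1e6:.1f} uJ, weights in SRAM: {sram/1e6:.1f} uJ")
    print(f"memory placement alone changes energy by {dram/sram:.1f}x")
```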

10 Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

11 Application as a Black Box: Algorithm (SPEC 2006) => Hardware (CPU)

12 Open the Box before Hardware Design: Algorithm? Hardware (?PU)? Breaks the boundary between algorithm and hardware

Proposed Paradigm

How to compress a deep learning model? Compression Acceleration Regularization 14

Learning both Weights and Connections for Efficient Neural Networks Han et al. NIPS 2015 Compression Acceleration Regularization 15

Pruning Neural Networks Pruning Trained Quantization Huffman Coding 16

Pruning Happens in the Human Brain: 50 Trillion Synapses (newborn) => 1000 Trillion Synapses (1 year old) => 500 Trillion Synapses (adolescent). Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172, 2013. Pruning Trained Quantization Huffman Coding 17

Pruning AlexNet [Han et al. NIPS 15]: CONV layers pruned 3x, FC layers pruned 10x. Pruning Trained Quantization Huffman Coding 18

Pruning Neural Networks [Han et al. NIPS 15]: 60 Million => 6M connections, 10x fewer connections. Pruning Trained Quantization Huffman Coding 19

Pruning Neural Networks [Han et al. NIPS 15]. Plot: accuracy loss (0.5% to -4.5%) vs. parameters pruned away (40% to 100%). Pruning Trained Quantization Huffman Coding 20

Pruning Neural Networks [Han et al. NIPS 15]. Plot: accuracy loss vs. parameters pruned away, curve for pruning alone. Pruning Trained Quantization Huffman Coding 21

Retrain to Recover Accuracy [Han et al. NIPS 15]. Plot: accuracy loss vs. parameters pruned away, curves for pruning and pruning+retraining. Pruning Trained Quantization Huffman Coding 22

Iteratively Retrain to Recover Accuracy [Han et al. NIPS 15]. Plot: accuracy loss vs. parameters pruned away, curves for pruning, pruning+retraining, and iterative pruning and retraining. Pruning Trained Quantization Huffman Coding 23
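The prune-retrain loop on the last few slides can be sketched as magnitude pruning plus a mask that is re-applied after every weight update. A minimal numpy sketch, assuming a caller-supplied `train_step` that returns a gradient and a toy quadratic objective; the schedule and learning rate are illustrative, not the exact recipe from the paper:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights so that `sparsity` fraction are zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def iterative_prune_and_retrain(weights, train_step, target_sparsity=0.9,
                                n_stages=5, steps_per_stage=100):
    """Gradually raise sparsity, retraining in between so accuracy can recover."""
    for stage in range(1, n_stages + 1):
        sparsity = target_sparsity * stage / n_stages
        weights, mask = magnitude_prune(weights, sparsity)
        for _ in range(steps_per_stage):
            grad = train_step(weights)          # caller computes the gradient
            weights = weights - 0.01 * grad     # plain SGD, illustrative learning rate
            weights = weights * mask            # keep pruned connections at zero
    return weights, mask

# Toy usage: "train" a random matrix toward a random target.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w, target = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
    w, mask = iterative_prune_and_retrain(w, lambda x: x - target)
    print("final sparsity:", 1.0 - mask.mean())
```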

Pruning RNN and LSTM [Han et al. NIPS 15]. *Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions". Pruning Trained Quantization Huffman Coding 24

Pruning RNN and LSTM [Han et al. NIPS 15]. Original: a basketball player in a white uniform is playing with a ball => Pruned 90%: a basketball player in a white uniform is playing with a basketball. Original: a brown dog is running through a grassy field => Pruned 90%: a brown dog is running through a grassy area. Original: a man is riding a surfboard on a wave => Pruned 90%: a man in a wetsuit is riding a wave on a beach. Original: a soccer player in red is running in the field => Pruned 95%: a man in a red shirt and black and white black shirt is running through a field. Pruning Trained Quantization Huffman Coding 25

26 Exploring the Granularity of Sparsity that is Hardware-Friendly: 4 types of pruning granularity, from irregular sparsity => regular sparsity => more regular sparsity => fully-dense model. [Han et al, NIPS 15] [Molchanov et al, ICLR 17]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Han et al. ICLR 2016 Best Paper Pruning Trained Quantization Huffman Coding 28

Trained Quantization [Han et al. ICLR 16]: weights 2.09, 2.12, 1.92, 1.87 all share the value 2.0. Pruning Trained Quantization Huffman Coding 29

Trained Quantization [Han et al. ICLR 16]: 32-bit float weights are clustered; each weight is stored as a 2-bit cluster index into a table of centroids (3: 2.00, 2: 1.50, 1: 0.00, 0: -1.00). During fine-tuning, the gradients are grouped by cluster index and reduced (summed), and the shared centroids are updated with the reduced gradients. Pruning Trained Quantization Huffman Coding 30
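The weight-sharing step can be sketched as 1-D k-means over the weights followed by a group-by-and-reduce of the gradients onto the shared centroids. A minimal numpy sketch with k = 4 (a 2-bit index per weight); the k-means initialization and the single centroid update are simplifications for illustration:

```python
import numpy as np

def kmeans_1d(values, k=4, iters=20):
    """Plain 1-D k-means: returns centroids and a per-value cluster index."""
    centroids = np.linspace(values.min(), values.max(), k)   # linear init
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = values[idx == c].mean()
    return centroids, idx

def quantize_and_finetune(weights, grads, k=4, lr=0.01):
    """Replace weights by shared centroids, then update centroids with summed grads."""
    flat = weights.ravel()
    centroids, idx = kmeans_1d(flat, k)              # 2-bit index when k = 4
    # group-by-cluster reduction of the gradient, as in the slide's "group by / reduce"
    grad_per_centroid = np.zeros(k)
    np.add.at(grad_per_centroid, idx, grads.ravel())
    centroids -= lr * grad_per_centroid              # fine-tune the shared values
    return centroids[idx].reshape(weights.shape), idx.reshape(weights.shape), centroids

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 4))
    g = rng.normal(size=(4, 4)) * 0.1
    q, idx, codebook = quantize_and_finetune(w, g)
    print("codebook:", np.round(codebook, 2))
    print("indices:\n", idx)
```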

After Trained Quantization: Discrete Weight [Han et al. ICLR 16]. Histogram: count vs. weight value; the weights collapse onto a small set of discrete values. Pruning Trained Quantization Huffman Coding 31

After Trained Quantization: Discrete Weight after Training [Han et al. ICLR 16]. Histogram: count vs. weight value after fine-tuning the shared centroids. Pruning Trained Quantization Huffman Coding 32

[Han et al. ICLR 16] How Many Bits do We Need? Pruning Trained Quantization Huffman Coding 33

How Many Bits do We Need? [Han et al. ICLR 16] Plot: top-1 accuracy (73% to 77%) vs. number of bits per weight (2, 4, 6, 8 bits) for ResNet-50 after fine-tuning, comparing non-uniform and uniform quantization; full-precision top-1 accuracy: 76.15%. Pruning Trained Quantization Huffman Coding 34

35 More Aggressive Compression: Ternary Quantization. Pipeline: full-precision weight => Normalize => normalized full-precision weight => Quantize (threshold t) => intermediate ternary weight in {-1, 0, 1} => Scale by trained factors Wn, Wp => final ternary weight in {-Wn, 0, Wp}. Feed-forward uses the ternary weights; back-propagation passes gradients both to the full-precision weights (gradient1) and to the scale factors Wn, Wp (gradient2); only the ternary weights are needed at inference time. Plot: learned ternary weight values (Wn, Wp) per layer, from res1.0/conv1 through res3.2/conv2 and the linear layer.
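A rough Python sketch of the ternary flow above: normalize the full-precision weights, threshold into {-1, 0, +1}, and scale the two signed groups by factors Wn and Wp. In trained ternary quantization the scale factors are learned during back-propagation; here they are simply initialized from group statistics, which is an assumption for illustration:

```python
import numpy as np

def ternarize(weights, t=0.05):
    """Normalize, threshold into {-1, 0, +1}, and scale by per-sign factors Wn, Wp."""
    w_norm = weights / np.max(np.abs(weights))        # normalized full-precision weights
    ternary = np.zeros_like(w_norm)
    ternary[w_norm > t] = 1.0                          # intermediate ternary weight
    ternary[w_norm < -t] = -1.0
    # In trained ternary quantization Wp and Wn are learned parameters;
    # here they are initialized to the mean magnitude of each group (illustrative).
    wp = np.abs(w_norm[ternary > 0]).mean() if np.any(ternary > 0) else 1.0
    wn = np.abs(w_norm[ternary < 0]).mean() if np.any(ternary < 0) else 1.0
    final = np.where(ternary > 0, wp, np.where(ternary < 0, -wn, 0.0))
    return final, ternary, wp, wn

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(3, 3))
    final, ternary, wp, wn = ternarize(w)
    print("ternary codes:\n", ternary)
    print("final weights use only {-Wn, 0, Wp}:", -wn, 0.0, wp)
```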

Results: Compression Ratio [Han et al. ICLR 16]. Network: original size => compressed size (ratio), original accuracy => compressed accuracy. LeNet-300: 1070KB => 27KB (40x), 98.36% => 98.42%. LeNet-5: 1720KB => 44KB (39x), 99.20% => 99.26%. AlexNet: 240MB => 6.9MB (35x), 80.27% => 80.30%. VGGNet: 550MB => 11.3MB (49x), 88.68% => 89.09%. Inception-V3: 91MB => 4.2MB (22x), 93.56% => 93.67%. ResNet-50: 97MB => 5.8MB (17x), 92.87% => 93.04%. Can we make compact models to begin with? Compression Acceleration Regularization 36

SqueezeNet vanilla Fire module: Input => 1x1 Conv (Squeeze, 16 channels) => parallel 1x1 Conv (Expand, 64) and 3x3 Conv (Expand, 64) => Concat/Eltwise (128) => Output. Iandola et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016. Compression Acceleration Regularization 37
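The Fire module on this slide maps directly onto a few convolution layers. A PyTorch sketch of the vanilla module with the slide's channel counts (16 squeeze, 64 + 64 expand, 128 out); it is a sketch of the idea, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Vanilla SqueezeNet Fire module: squeeze with 1x1, expand with 1x1 and 3x3, concat."""
    def __init__(self, in_ch, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))                      # few channels => few parameters
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)   # 64 + 64 = 128 channels

if __name__ == "__main__":
    x = torch.randn(1, 96, 55, 55)       # illustrative input shape
    y = Fire(in_ch=96)(x)
    print(y.shape)                        # torch.Size([1, 128, 55, 55])
```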

Compressing SqueezeNet (size, ratio, top-1 / top-5 accuracy): AlexNet, no compression: 240MB, 57.2% / 80.3%. AlexNet, SVD: 48MB (5x), 56.0% / 79.4%. AlexNet, Deep Compression: 6.9MB (35x), 57.2% / 80.3%. SqueezeNet, no compression: 4.8MB (50x), 57.5% / 80.3%. SqueezeNet, Deep Compression: 0.47MB (510x), 57.5% / 80.3%. Compression Acceleration Regularization 38

39 Results: Speedup. Bar chart (fps at batch = 1, 8, 32): pruning gives about a 1.6x speedup (e.g. 20fps => 33fps, 5fps => 8fps). Baseline: mAP = 59.47 / 28.48 / 45.43, FLOP = 17.5G, 6.0M parameters. Pruned: mAP = 59.30 / 28.33 / 47.72, FLOP = 8.9G, 2.5M parameters.

Deep Compression Applied to Industry. Compression Acceleration Regularization 40

41 A Facebook AR prototype backed by Deep Compression: 8x model size reduction by Deep Compression; runs on mobile. Source: Facebook F8 2017

EIE: Efficient Inference Engine on Compressed Deep Neural Network Han et al. ISCA 2016 42

43 Deep Learning Accelerators. First wave: compute (NeuFlow). Second wave: memory (DianNao family). Third wave: algorithm/hardware co-design (EIE). Google TPU: "This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs."

EIE: the First DNN Accelerator for Sparse, Compressed Models [Han et al. ISCA 16]. Rule of thumb: 0 * A = 0; W * 0 = 0; 2.09, 1.92 => 2. Sparse weights (90% static sparsity): 10x less computation, 5x less memory footprint. Sparse activations (70% dynamic sparsity): 3x less computation. Weight sharing (4-bit weights): 8x less memory footprint. Compression Acceleration Regularization 44

EIE: Parallelization on Sparsity [Han et al. ISCA 16]. Example: a sparse activation vector a (only a1 and a3 are non-zero) times a sparse weight matrix whose rows are interleaved across PE0-PE3; each PE accumulates its own output elements of b, followed by ReLU, which zeroes some outputs (e.g. b3). Compression Acceleration Regularization 45

EIE: Parallelization on Sparsity [Han et al. ISCA 16]. Hardware view: a 4x4 array of PEs with a central control unit; the same sparse matrix-vector example is distributed across PE0-PE3 as on the previous slide. Compression Acceleration Regularization 46

Dataflow [Han et al. ISCA 16]. The sparse matrix-vector example from the previous slides, processed under the rule of thumb: 0 * A = 0, W * 0 = 0. Compression Acceleration Regularization 47
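The rule of thumb boils down to a sparse matrix-vector product that skips zero activations entirely and touches only stored non-zero weights. A minimal Python sketch in a CSC-like layout, without the PE partitioning or the compressed index encoding of the real hardware:

```python
import numpy as np

def csc_from_dense(W):
    """Store only the non-zero weights of each column (values, row indices, column pointers)."""
    vals, rows, colptr = [], [], [0]
    for j in range(W.shape[1]):
        nz = np.nonzero(W[:, j])[0]
        vals.extend(W[nz, j])
        rows.extend(nz)
        colptr.append(len(vals))
    return np.array(vals), np.array(rows), np.array(colptr)

def sparse_matvec_relu(vals, rows, colptr, a, n_out):
    """b = ReLU(W a), skipping zero activations (W*0=0) and zero weights (0*A=0)."""
    b = np.zeros(n_out)
    for j, aj in enumerate(a):
        if aj == 0.0:                      # dynamic activation sparsity: skip the column
            continue
        for p in range(colptr[j], colptr[j + 1]):
            b[rows[p]] += vals[p] * aj     # only stored (non-zero) weights are touched
    return np.maximum(b, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 8)) * (rng.random((8, 8)) < 0.1)   # ~90% static weight sparsity
    a = rng.normal(size=8) * (rng.random(8) < 0.3)             # ~70% of activations are zero
    vals, rows, colptr = csc_from_dense(W)
    print(np.allclose(sparse_matvec_relu(vals, rows, colptr, a, 8),
                      np.maximum(W @ a, 0.0)))                 # matches the dense result
```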

EIE Architecture [Han et al. ISCA 16]. The compressed DNN model stores each non-zero in a sparse format as a 4-bit virtual weight plus a 4-bit relative index. Weight decode looks up the 16-bit real weight from the virtual weight; index accumulation (Accum) converts relative indices to absolute addresses for the 16-bit ALU and memory; the input image flows through the pipeline to the prediction result. Rule of thumb: 0 * A = 0, W * 0 = 0; 2.09, 1.92 => 2. Compression Acceleration Regularization 48
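A small Python sketch of the decode path described above: each stored non-zero carries a codebook index (the virtual weight) and a relative index, the codebook lookup recovers the real weight, and an accumulator turns relative indices into absolute row positions. The encoding of the example column is invented for illustration (and the toy codebook has only 4 entries rather than the 16 a 4-bit index allows):

```python
import numpy as np

def decode_column(virtual_weights, relative_indices, codebook):
    """Expand (codebook index, relative index) pairs into (absolute position, real weight)."""
    absolute = -1
    decoded = []
    for v, rel in zip(virtual_weights, relative_indices):
        absolute += rel + 1                        # accumulate relative index into an absolute row
        decoded.append((absolute, codebook[v]))    # look up the shared 16-bit real weight
    return decoded

if __name__ == "__main__":
    codebook = np.array([-1.0, 0.0, 1.5, 2.0], dtype=np.float16)  # shared real weights (toy)
    # column with non-zeros at rows 0, 3, 4, 9 -> relative indices 0, 2, 0, 4
    virtual = [3, 2, 0, 3]
    relative = [0, 2, 0, 4]
    for row, w in decode_column(virtual, relative, codebook):
        print(f"row {row}: weight {w}")
```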

Post-Layout Result of EIE [Han et al. ISCA 16]. Technology: 40 nm. # PEs: 64. On-chip SRAM: 8 MB. Max model size: 84 million parameters. Static sparsity: 10x. Dynamic sparsity: 3x. Quantization: 4-bit. ALU width: 16-bit. Area: 40.8 mm^2. MxV throughput: 81,967 layers/s. Power: 586 mW. Notes: 1. post-layout result; 2. throughput measured on AlexNet FC-7. Compression Acceleration Regularization 49

Speedup on EIE [Han et al. ISCA 16]. Figure: speedups of GPU, mobile GPU, and EIE compared with a CPU running the uncompressed DNN model (no batching in all cases); EIE achieves a geometric-mean speedup of 189x over the dense CPU baseline. Benchmarks: nine layers from compressed AlexNet, compressed VGG-16, and compressed NeuralTalk (Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, NT-LSTM). Also shown: the layout of one PE in EIE under the TSMC 45nm process and its per-PE implementation results; the critical path of EIE is 1.15 ns. Compression Acceleration Regularization 50

Energy Efficiency on EIE [Han et al. ISCA 16]. Figure: energy efficiency of GPU, mobile GPU, and EIE compared with a CPU running the uncompressed DNN model (no batching in all cases), over the same Alex/VGG/NeuralTalk benchmarks; EIE is roughly 24,000x more energy efficient than the dense CPU baseline (geometric mean). Evaluation setup: custom cycle-accurate C++ simulator; Verilog RTL synthesized with Synopsys Design Compiler and placed and routed with IC Compiler; SRAM modeled with Cacti; power estimated with Prime-Time PX; baselines are an Intel Core i7-5930K CPU (MKL), an NVIDIA GeForce GTX Titan X GPU, and a mobile GPU. Compression Acceleration Regularization 51

Comparison: Throughput [Han et al. ISCA 16]. Bar chart, throughput (layers/s, log scale from 1E+00 to 1E+06): Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), DaDianNao (28nm ASIC), TrueNorth (28nm ASIC), EIE (45nm ASIC, 64 PEs), EIE (28nm ASIC, 256 PEs). Compression Acceleration Regularization 52

Comparison: Energy Efficiency [Han et al. ISCA 16]. Bar chart, energy efficiency (layers/J, log scale from 1E+00 to 1E+06): Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), DaDianNao (28nm ASIC), TrueNorth (28nm ASIC), EIE (45nm ASIC, 64 PEs), EIE (28nm ASIC, 256 PEs). Compression Acceleration Regularization 53

Feedforward => Recurrent neural network? Compression Acceleration Regularization 54

ESE Architecture [Han et al. FPGA 17]. A software program on the CPU with external memory talks to the FPGA through the PCIe controller and memory controller over the data bus. On the FPGA, the ESE accelerator has an ESE controller, input/output buffers, and multiple channels, each containing several PEs; each PE has an activation queue (FIFOs), pointer-read units with pointer buffers, sparse-matrix-read units with weight buffers, SpMV accumulate units, and activation buffers. Per-channel logic assembles y_t and computes the LSTM element-wise path: Wx_t / Wy_t-1 through an adder tree, element-wise multiply with W_c / c_t-1, sigmoid/tanh units, and an h_t buffer. Compression Acceleration Regularization 55


58 ESE is now on AWS Marketplace with Xilinx VU9P FPGA https://aws.amazon.com/marketplace/pp/b079n2j42r

Thank you! songhan@mit.edu. Smart Training: Dense-Sparse-Dense (pruning with a sparsity constraint, then re-dense to increase model capacity), Han et al. ICLR 17. Compression: pruning and quantization, Han et al. NIPS 15 and Han et al. ICLR 16 (Best Paper Award). Accelerated Inference: Han et al. ISCA 16 and Han et al. FPGA 17 (Best Paper Award). Fast and Efficient. 59