DNN ENGINE: A 16nm Sub-µJ DNN Inference Accelerator for the Embedded Masses
Paul N. Whatmough 1,2, S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2
1 ARM Research, Boston, MA; 2 Harvard University, Cambridge, MA; 3 Princeton University, Princeton, NJ
The age of deep learning: deeper nets, more compute, bigger data
The embedded masses: interpret noisy real-world data from sensor-rich systems; keep inference at the edge for always-on sensing; large memory footprint and compute load
DNN inference at the edge: Raw Data → Pre-Processing (normalize to mean = 0, variance = 1) → Norm. Data → Feature Extraction (FFT, MFCC, Sobel, HOG, etc.) → Features → Classifier (fully-connected deep neural network (DNN) with trained weights) → Classes
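The "normalize to mean = 0, variance = 1" pre-processing step can be sketched in a few lines (a minimal Python illustration, not the chip's actual pre-processing code):

```python
# Minimal sketch of the normalization step: shift and scale a window of
# raw sensor samples to zero mean and unit variance.
from statistics import mean, pstdev

def normalize(samples):
    """Standardize samples so downstream layers see a consistent range."""
    mu = mean(samples)
    sigma = pstdev(samples)           # population std-dev of the window
    return [(v - mu) / sigma for v in samples]
```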
DNN classifier design flow [Reagen et al., ISCA 2016]: Data (training, validation, test) → Training (optimize hyper-parameters; minimize size and test error) → Quantization (fixed-point 8-bit / 16-bit) → Inference (CPU, DSP, GPU, accelerator)
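The fixed-point quantization step can be illustrated with a small sketch; the power-of-two scale (7 fractional bits for 8-bit weights) is an assumption for illustration, not necessarily the scaling used in the actual flow:

```python
# Hypothetical fixed-point quantizer: round float weights to signed 8-bit
# (or 16-bit) values with a power-of-two scale, saturating at the limits.
def quantize_fixed_point(weights, frac_bits=7, width=8):
    scale = 1 << frac_bits
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return [max(lo, min(hi, round(w * scale))) for w in weights]

def dequantize(q, frac_bits=7):
    return [v / (1 << frac_bits) for v in q]

quantize_fixed_point([0.5, -0.25, 1.0])   # -> [64, -32, 127] (1.0 saturates)
```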
DNN ENGINE accelerator architecture: inputs are Sensor Data + DNN Topology + Weights; Input Vector → Hidden Layers (weights, neurons) → Output Classes → Argmax → Classification. Key ideas: parallelism and data reuse; sparse data and small data types; algorithmic resilience
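The final Argmax stage in the diagram is simply the index of the largest output-layer score; a one-line sketch:

```python
# Argmax classification: the predicted class is the index of the largest
# output activation.
def classify(output_scores):
    return max(range(len(output_scores)), key=output_scores.__getitem__)

classify([0.1, 2.5, 0.3])   # -> class 1
```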
Outline: Background and motivation; DNN ENGINE (parallelism and data reuse; sparse data and small data types; algorithmic resilience); Measurement results; Summary
Fully-connected DNN graph: input layer (input vector), hidden layers, output layer (output classes)
Fully-connected DNN graph: each edge carries a weight; each node is a neuron
Fully-connected DNN graph: a SIMD window processes a group of neurons in parallel
Fully-connected DNN graph: activation data is re-used across the neurons in the window
Fully-connected DNN graph: add bias and apply activation function
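The dataflow animated above — an 8-wide SIMD window over the neurons, with each activation fetched once and reused across the window, then bias and activation applied — can be sketched in software (illustrative only, not the RTL):

```python
# Software sketch of one fully-connected layer with an 8-way SIMD window.
SIMD = 8

def fc_layer(x, W, b):
    """x: input activations; W[i][j]: weight from input j to neuron i; b: biases."""
    n = len(W)
    y = [0] * n
    for base in range(0, n, SIMD):              # one SIMD window of neurons
        acc = [0] * min(SIMD, n - base)
        for j, xj in enumerate(x):              # fetch each activation once...
            for k in range(len(acc)):           # ...reuse it across the window
                acc[k] += W[base + k][j] * xj
        for k in range(len(acc)):               # add bias, apply ReLU
            y[base + k] = max(0, acc[k] + b[base + k])
    return y
```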
Balancing efficiency and bandwidth [plot annotations: "No Data Reuse", "Bandwidth too high", "128b/cycle"]
Parallelism increases throughput and data reuse, but also increases memory bandwidth demands
8-way SIMD is efficient at reasonable memory bandwidth: 10x activation reuse, with a 128-bit AXI channel
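The 128b/cycle figure follows directly from the window width and the weight precision; a quick back-of-envelope check (assuming the 16-bit weight type, the wider of the two the engine supports):

```python
# 8 neurons in flight, each consuming one 16-bit weight per cycle, needs
# exactly one 128-bit AXI beat of weight data per cycle.
simd_ways = 8
weight_bits = 16
weight_bw_per_cycle = simd_ways * weight_bits   # bits/cycle of weight traffic
assert weight_bw_per_cycle == 128
```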
DNN ENGINE micro-architecture: 8-way SIMD accelerator (Host Processor; Data In / CFG / Data Out; Data Read; IPBUF; XBUF; NBUF; W-MEM; DMA Weight Load; MAC Datapath; Activation; Writeback)
Host processor loads configuration and input data
Accumulate products of activations and weights
Add bias, apply ReLU activation, and write back
IRQ to host, which retrieves output data
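The four steps above amount to a simple host/accelerator handshake. A hypothetical driver-side sketch — MockEngine and all of its method names are illustrative stand-ins, not the real register interface:

```python
# Illustrative host-side control flow for one inference.
class MockEngine:
    """Stand-in for the memory-mapped DNN ENGINE device (hypothetical API)."""
    def load_config(self, cfg): self.cfg = cfg
    def write_input(self, x):   self.x = x
    def start(self):            self.y = [max(0, v) for v in self.x]  # toy compute
    def wait_for_irq(self):     pass          # a real driver blocks on the IRQ
    def read_output(self):      return self.y

def run_inference(engine, cfg, x):
    engine.load_config(cfg)      # 1. host loads configuration
    engine.write_input(x)        # 2. host loads input data
    engine.start()               # 3. MAC datapath, bias + ReLU, writeback
    engine.wait_for_irq()        # 4. engine raises IRQ when done
    return engine.read_output()  #    host retrieves output data
```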
Outline: Background and motivation; DNN ENGINE (parallelism and data reuse; sparse data and small data types; algorithmic resilience); Measurement results; Summary
Exploiting sparse data (input vector → output classes)
Discard small activation data
Dynamically prune graph connectivity
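The pruning idea above can be sketched as a compress-then-skip loop: small activations are dropped at writeback, and the next layer iterates only over the surviving (index, value) pairs — a software illustration of the hardware skip mechanism, not the actual implementation:

```python
# Keep only activations above a threshold, with their indices.
def compress(activations, threshold=0):
    return [(j, a) for j, a in enumerate(activations) if a > threshold]

# MAC over surviving activations only; pruned connections cost nothing.
def sparse_neuron(sparse_x, w_row, bias):
    acc = sum(w_row[j] * a for j, a in sparse_x)
    return max(0, acc + bias)

sx = compress([0, 3, 0, 0, 5])   # -> [(1, 3), (4, 5)]: 2 MACs instead of 5
```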
DNN ENGINE micro-architecture: a SKIP index is added on the W-MEM path, so weights for pruned connections are never fetched
W-MEM supports 8b or 16b data types
Outline: Background and motivation; DNN ENGINE (parallelism and data reuse; sparse data and small data types; algorithmic resilience); Measurement results; Summary
DNN ENGINE micro-architecture: timing-error detection (RZ) added on the XBUF and W-MEM paths [Whatmough et al., ISSCC 17]
Timing error tolerance [plot: timing violation rate from 10⁻¹ down to 10⁻⁷ for memory, datapath, and combined, at MNIST 98.36% accuracy; > 1,000,000x spread; conditions TC SM BM, TC SM TB, TC ALL] [Whatmough et al., ISSCC 17]
Outline: Background and motivation; DNN ENGINE (parallelism and data reuse; sparse data and small data types; algorithmic resilience); Measurement results; Summary
16nm SoC for always-on applications
Measured energy for always-on applications (nominal Vdd / signoff Fmax: 667MHz / 0.8V)
Energy varies with application and grows exponentially with accuracy: 2.5x energy for the last 0.3% of accuracy
On-chip memory is critical: pulling even a few KBs from DRAM costs >1µJ, which constrains model size
Measured energy improvement: joint architecture and circuit optimizations (typical silicon at room temperature; 667MHz / 0.8V vs. 667MHz / 0.67V)
Measurements demonstrate a 10x energy reduction and a 4x throughput increase
State-of-the-art classifiers: dedicated NN accelerators at different performance points, in different technologies
Accuracy is the critical metric, and its cost is exponential: a linear classifier reaches only 88%
The next 10x improvement: algorithm innovations?
Summary
Inference is moving from the cloud to the device: always-on sensing
DNN ENGINE architecture optimizations: parallelism and data reuse; sparse data and small data types; algorithmic resilience
16nm test chip measurements: critical to store the model in on-chip memory; 10x energy and 4x throughput improvement; 1µJ per prediction for always-on applications
We are grateful for support from the DARPA CRAFT and PERFECT projects