DNN ENGINE: A 16nm Sub-µJ DNN Inference Accelerator for the Embedded Masses
Paul N. Whatmough 1,2, S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2
1 ARM Research, Boston, MA; 2 Harvard University, Cambridge, MA; 3 Princeton University, Princeton, NJ
The age of deep learning: deeper nets, more compute, bigger data
The embedded masses: interpret noisy real-world data from sensor-rich systems; keep inference at the edge for always-on sensing; large memory footprint and compute load
DNN inference at the edge: Raw Data → Pre-Processing (normalize to mean = 0, variance = 1) → Norm. Data → Feature Extraction (FFT, MFCC, Sobel, HOG, etc.) → Features → Classifier (fully-connected deep neural network (DNN) with trained weights) → Classes
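The "normalize to mean = 0, variance = 1" pre-processing step can be sketched in a few lines (a minimal Python illustration, not the chip's actual pre-processing code):

```python
# Minimal sketch of the normalization step: shift and scale a window of
# raw sensor samples to zero mean and unit variance.
from statistics import mean, pstdev

def normalize(samples):
    """Standardize samples so downstream layers see a consistent range."""
    mu = mean(samples)
    sigma = pstdev(samples)           # population std-dev of the window
    return [(v - mu) / sigma for v in samples]
```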
DNN classifier design flow [Reagen et al., ISCA 2016]: Data (training, validation, test) → Training (optimize hyper-parameters; minimize size and test error) → Quantization (fixed-point 8-bit / 16-bit) → Inference (CPU, DSP, GPU, accelerator)
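The fixed-point quantization step can be illustrated with a small sketch; the power-of-two scale (7 fractional bits for 8-bit weights) is an assumption for illustration, not necessarily the scaling used in the actual flow:

```python
# Hypothetical fixed-point quantizer: round float weights to signed 8-bit
# (or 16-bit) values with a power-of-two scale, saturating at the limits.
def quantize_fixed_point(weights, frac_bits=7, width=8):
    scale = 1 << frac_bits
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return [max(lo, min(hi, round(w * scale))) for w in weights]

def dequantize(q, frac_bits=7):
    return [v / (1 << frac_bits) for v in q]

quantize_fixed_point([0.5, -0.25, 1.0])   # -> [64, -32, 127] (1.0 saturates)
```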
DNN ENGINE accelerator architecture: inputs are Sensor Data + DNN Topology + Weights; Input Vector → Hidden Layers (weights, neurons) → Output Classes → Argmax → Classification. Key ideas: parallelism and data reuse; sparse data and small data types; algorithmic resilience
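The final Argmax stage in the diagram is simply the index of the largest output-layer score; a one-line sketch:

```python
# Argmax classification: the predicted class is the index of the largest
# output activation.
def classify(output_scores):
    return max(range(len(output_scores)), key=output_scores.__getitem__)

classify([0.1, 2.5, 0.3])   # -> class 1
```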
Outline: Background and motivation; DNN ENGINE (parallelism and data reuse; sparse data and small data types; algorithmic resilience); Measurement results; Summary
Fully-connected DNN graph: input layer (input vector), hidden layers, output layer (output classes)
Fully-connected DNN graph: each edge carries a weight; each node is a neuron
Fully-connected DNN graph: a SIMD window processes a group of neurons in parallel
Fully-connected DNN graph: activation data is re-used across the neurons in the window
Fully-connected DNN graph: add bias and apply activation function
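The dataflow animated above — an 8-wide SIMD window over the neurons, with each activation fetched once and reused across the window, then bias and activation applied — can be sketched in software (illustrative only, not the RTL):

```python
# Software sketch of one fully-connected layer with an 8-way SIMD window.
SIMD = 8

def fc_layer(x, W, b):
    """x: input activations; W[i][j]: weight from input j to neuron i; b: biases."""
    n = len(W)
    y = [0] * n
    for base in range(0, n, SIMD):              # one SIMD window of neurons
        acc = [0] * min(SIMD, n - base)
        for j, xj in enumerate(x):              # fetch each activation once...
            for k in range(len(acc)):           # ...reuse it across the window
                acc[k] += W[base + k][j] * xj
        for k in range(len(acc)):               # add bias, apply ReLU
            y[base + k] = max(0, acc[k] + b[base + k])
    return y
```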
Balancing efficiency and bandwidth [plot annotations: "No Data Reuse", "Bandwidth too high", "128b/cycle"]
Parallelism increases throughput and data reuse, but also increases memory bandwidth demands
8-way SIMD is efficient at reasonable memory bandwidth: 10x activation reuse, with a 128-bit AXI channel
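The 128b/cycle figure follows directly from the window width and the weight precision; a quick back-of-envelope check (assuming the 16-bit weight type, the wider of the two the engine supports):

```python
# 8 neurons in flight, each consuming one 16-bit weight per cycle, needs
# exactly one 128-bit AXI beat of weight data per cycle.
simd_ways = 8
weight_bits = 16
weight_bw_per_cycle = simd_ways * weight_bits   # bits/cycle of weight traffic
assert weight_bw_per_cycle == 128
```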
DNN ENGINE micro-architecture: 8-way SIMD accelerator (Host Processor; Data In / CFG / Data Out; Data Read; IPBUF; XBUF; NBUF; W-MEM; DMA Weight Load; MAC Datapath; Activation; Writeback)
Host processor loads configuration and input data
Accumulate products of activations and weights
Add bias, apply ReLU activation, and write back
IRQ to host, which retrieves output data
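The four steps above amount to a simple host/accelerator handshake. A hypothetical driver-side sketch — MockEngine and all of its method names are illustrative stand-ins, not the real register interface:

```python
# Illustrative host-side control flow for one inference.
class MockEngine:
    """Stand-in for the memory-mapped DNN ENGINE device (hypothetical API)."""
    def load_config(self, cfg): self.cfg = cfg
    def write_input(self, x):   self.x = x
    def start(self):            self.y = [max(0, v) for v in self.x]  # toy compute
    def wait_for_irq(self):     pass          # a real driver blocks on the IRQ
    def read_output(self):      return self.y

def run_inference(engine, cfg, x):
    engine.load_config(cfg)      # 1. host loads configuration
    engine.write_input(x)        # 2. host loads input data
    engine.start()               # 3. MAC datapath, bias + ReLU, writeback
    engine.wait_for_irq()        # 4. engine raises IRQ when done
    return engine.read_output()  #    host retrieves output data
```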
Outline: Background and motivation; DNN ENGINE (parallelism and data reuse; sparse data and small data types; algorithmic resilience); Measurement results; Summary
Exploiting sparse data (input vector → output classes)
Discard small activation data
Dynamically prune graph connectivity
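The pruning idea above can be sketched as a compress-then-skip loop: small activations are dropped at writeback, and the next layer iterates only over the surviving (index, value) pairs — a software illustration of the hardware skip mechanism, not the actual implementation:

```python
# Keep only activations above a threshold, with their indices.
def compress(activations, threshold=0):
    return [(j, a) for j, a in enumerate(activations) if a > threshold]

# MAC over surviving activations only; pruned connections cost nothing.
def sparse_neuron(sparse_x, w_row, bias):
    acc = sum(w_row[j] * a for j, a in sparse_x)
    return max(0, acc + bias)

sx = compress([0, 3, 0, 0, 5])   # -> [(1, 3), (4, 5)]: 2 MACs instead of 5
```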
DNN ENGINE micro-architecture: a SKIP index is added on the W-MEM path, so weights for pruned connections are never fetched
W-MEM supports 8b or 16b data types
Outline: Background and motivation; DNN ENGINE (parallelism and data reuse; sparse data and small data types; algorithmic resilience); Measurement results; Summary
DNN ENGINE micro-architecture: timing-error detection (RZ) added on the XBUF and W-MEM paths [Whatmough et al., ISSCC 17]
Timing error tolerance [plot: timing violation rate from 10⁻¹ down to 10⁻⁷ for memory, datapath, and combined, at MNIST 98.36% accuracy; > 1,000,000x spread; conditions TC SM BM, TC SM TB, TC ALL] [Whatmough et al., ISSCC 17]
Outline: Background and motivation; DNN ENGINE (parallelism and data reuse; sparse data and small data types; algorithmic resilience); Measurement results; Summary
16nm SoC for always-on applications
Measured energy for always-on applications (nominal Vdd / signoff Fmax: 667MHz / 0.8V)
Energy varies with application and grows exponentially with accuracy: 2.5x energy for the last 0.3% of accuracy
On-chip memory is critical: pulling even a few KBs from DRAM costs >1µJ, which constrains model size
Measured energy improvement: joint architecture and circuit optimizations (typical silicon at room temperature; 667MHz / 0.8V vs. 667MHz / 0.67V)
Measurements demonstrate a 10x energy reduction and a 4x throughput increase
State-of-the-art classifiers: dedicated NN accelerators at different performance points, in different technologies
Accuracy is the critical metric, and its cost is exponential: a linear classifier reaches only 88%
The next 10x improvement: algorithm innovations?
Summary
Inference is moving from the cloud to the device: always-on sensing
DNN ENGINE architecture optimizations: parallelism and data reuse; sparse data and small data types; algorithmic resilience
16nm test chip measurements: critical to store the model in on-chip memory; 10x energy and 4x throughput improvement; 1µJ per prediction for always-on applications
We are grateful for support from the DARPA CRAFT and PERFECT projects