DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses

Size: px

Start display at page:

Download "DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses"

Milo Welch
6 years ago
Views:

Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y.

1 DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough 1,2 S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2 1 ARM Research, Boston, MA 2 Harvard University, Cambridge, MA 3 Princeton University, NJ

2 The age of deep learning Deeper Nets More Compute Bigger Data

3 The embedded masses Interpret noisy real-world data from sensor-rich systems Keep inference at the edge: always-on sensing Large memory footprint and compute load 3

4 DNN inference at the edge Trained weights Raw Data Pre-Processing Norm. Data Feature Extraction Features Classifier Classes Normalize mean=0, variance=1 FFT, MFCC, Sobel, HOG, etc Fully-connected Deep neural network (DNN)

5 DNN classifier design flow Data - Training, validation, test [Reagen et al., ISCA 2016] Training - Optimize hyper-parameters - Minimize size and test error Quantization - Fixed-point 8-bit / 16-bit Inference - CPU, DSP, GPU, Accelerator 5

6 DNN ENGINE accelerator architecture Inputs DNN ENGINE Outputs Sensor Data + DNN Topology + Weights Input Vector Weight Neuron Hidden Layers Output Classes Argmax Classification Parallelism and data reuse Sparse data and small data types Algorithmic resilience

7 Outline Background and motivation DNN ENGINE - Parallelism and data reuse - Sparse data and small data types - Algorithmic resilience Measurement Results Summary 7

8 Fully-connected DNN graph Input Vector Output Classes Input Layer Output Layer Hidden Layers 8

9 Fully-connected DNN graph Weight Neuron Input Vector Output Classes 9

10 Fully-connected DNN graph SIMD Window Input Vector Output Classes Process a group of neurons in parallel 10

11 Fully-connected DNN graph Weights Input Vector Output Classes Re-use of activation data across neurons 11

12 Fully-connected DNN graph Input Vector Output Classes 12

13 Fully-connected DNN graph Input Vector Output Classes 13

14 Fully-connected DNN graph Input Vector Output Classes 14

15 Fully-connected DNN graph Input Vector Output Classes Add bias and apply activation function 15

16 Fully-connected DNN graph Input Vector Output Classes 16

17 Fully-connected DNN graph Input Vector Output Classes 17

18 Fully-connected DNN graph Input Vector Output Classes 18

19 Fully-connected DNN graph Input Vector Output Classes 19

20 Balancing efficiency and bandwidth No Data Reuse Parallelism increases throughput and data reuse 20

21 Balancing efficiency and bandwidth No Data Reuse Bandwidth too high Parallelism increases throughput and data reuse - But also increases Memory Bandwidth demands 21

22 Balancing efficiency and bandwidth No Data Reuse 128b/cycle Bandwidth too high Parallelism increases throughput and data reuse - But also increases Memory Bandwidth demands 8-Way SIMD is efficient at reasonable memory BW - 10x Activation reuse, with 128b AXI channel 22

23 DNN ENGINE micro-architecture Host Processor Data In CFG Data Out Data Read IPBUF XBUF Writeback NBUF W-MEM DMA Weight Load MAC Datapath Activation 8-Way SIMD accelerator architecture 23

24 DNN ENGINE micro-architecture Host Processor Data In CFG Data Out Data Read IPBUF XBUF Writeback NBUF W-MEM DMA Weight Load MAC Datapath Activation Host Processor loads configuration and input data 24

25 DNN ENGINE micro-architecture Host Processor Data In CFG Data Out Data Read IPBUF XBUF Writeback NBUF W-MEM DMA Weight Load MAC Datapath Activation Accumulate products of Activation and Weights 25

26 DNN ENGINE micro-architecture Host Processor Data In CFG Data Out Data Read IPBUF XBUF Writeback NBUF W-MEM DMA Weight Load MAC Datapath Activation Add bias, apply ReLU activation and writeback 26

27 DNN ENGINE micro-architecture Host Processor Data In CFG Data Out Data Read IPBUF XBUF Writeback NBUF W-MEM DMA Weight Load MAC Datapath Activation IRQ to host, which retrieves output data 27

28 Outline Background and motivation DNN ENGINE - Parallelism and data reuse - Sparse data and small data types - Algorithmic resilience Measurement Results Summary 28

29 Exploiting sparse data Input Vector Output Classes 29

30 Exploiting sparse data Input Vector Output Classes Discard small activation data 30

31 Exploiting sparse data Input Vector Output Classes Dynamically prune graph connectivity 31

32 Exploiting sparse data Input Vector Output Classes 32

33 Exploiting sparse data Input Vector Output Classes 33

34 Exploiting sparse data Input Vector Output Classes 34

35 Exploiting sparse data Input Vector Output Classes Discard small activation data 35

36 Exploiting sparse data Input Vector Output Classes Dynamically prune graph connectivity 36

37 Exploiting sparse data Input Vector Output Classes 37

38 Exploiting sparse data Input Vector Output Classes 38

39 Exploiting sparse data Input Vector Output Classes 39

40 Exploiting sparse data Input Vector Output Classes 40

41 DNN ENGINE micro-architecture Host Processor Data In CFG Data Out Data Read IPBUF XBUF Writeback NBUF W-MEM DMA Weight Load MAC Datapath Activation 41

42 DNN ENGINE micro-architecture Host Processor Data In CFG Data Out Data Read IPBUF XBUF Writeback NBUF W-MEM SKIP Index DMA Weight Load MAC Datapath Activation 42

43 DNN ENGINE micro-architecture Host Processor Data In CFG Data Out Data Read IPBUF XBUF Writeback NBUF W-MEM SKIP Index DMA Weight Load MAC Datapath Activation W-MEM supports 8b or 16b data types 43

44 Outline Background and motivation DNN ENGINE - Parallelism and data reuse - Sparse data and small data types - Algorithmic resilience Measurement Results Summary 44

45 DNN ENGINE micro-architecture Host Processor Data In CFG Data Out Data Read IPBUF XBUF RZ Writeback NBUF W-MEM RZ SKIP Index DMA Weight Load MAC Datapath Activation [Whatmough et al., ISSCC 17] 45

46 Timing error tolerance Timing Violation Rate % Accuracy Memory Datapath Combined > 1,000,000 X TC SM BM TC SM TB TC ALL [Whatmough et al., ISSCC 17] 46

47 Outline Background and motivation DNN ENGINE - Parallelism and data reuse - Sparse data and small data types - Algorithmic resilience Measurement Results Summary 47

48 16nm SoC for always-on applications 48

49 16nm SoC for always-on applications 49

50 Measured energy for always-on applications Nominal Vdd / signoff Fmax 667MHz / 0.8V Energy varies with application Exponential with accuracy 2.5x energy 0.3% accuracy On-chip memory critical A few KBs from DRAM >1uJ Constrains model size 50

51 Measured energy improvement Joint architecture and circuit optimizations Typical silicon at room temp 667MHz / 0.8V 667MHz / 0.67V 10x Measurements demonstrate 10x energy reduction 4x throughput increase 51

52 State of the art classifiers Dedicated NN accelerators Different performance points Different technologies Accuracy critical metric Linear classifier 88% Exponential cost The next 10x improvement Algorithm innovations? 52

53 Summary Inference moving from cloud to the device: always-on sensing DNN ENGINE architecture optimizations - Parallelism and data reuse - Sparse data and small data types - Algorithmic resilience 16nm test chip measurements - Critical to store the model in on-chip memory - 10x energy and 4x throughput improvement - 1uJ per prediction for always-on applications We are grateful for support from DARPA CRAFT and PERFECT projects 53

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating Michael Price*, James Glass, Anantha Chandrakasan MIT, Cambridge, MA * now at Analog Devices, Cambridge,