DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses

Paul N. Whatmough 1,2; S. K. Lee 2; N. Mulholland 2; P. Hansen 2; S. Kodali 3; D. Brooks 2; G.-Y. Wei 2
1 ARM Research, Boston, MA; 2 Harvard University, Cambridge, MA; 3 Princeton University, NJ

The age of deep learning: deeper nets, more compute, bigger data

The embedded masses - Interpret noisy real-world data from sensor-rich systems - Keep inference at the edge: always-on sensing - Large memory footprint and compute load 3

DNN inference at the edge: Raw Data -> Pre-Processing -> Norm. Data -> Feature Extraction -> Features -> Classifier -> Classes - Pre-processing normalizes to mean=0, variance=1 - Feature extraction: FFT, MFCC, Sobel, HOG, etc. - Classifier: a fully-connected deep neural network (DNN) with trained weights
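
To make the flow concrete, here is a minimal Python sketch of this pipeline (the function names and the FFT-magnitude feature are illustrative assumptions, not the actual deployed pipeline):

    import numpy as np

    def normalize(x):
        # Normalize the raw sensor frame to mean=0, variance=1
        return (x - x.mean()) / (x.std() + 1e-8)

    def extract_features(x):
        # Stand-in feature extractor; the slide lists FFT, MFCC, Sobel, HOG, etc.
        return np.abs(np.fft.rfft(x))

    def classify(features, weights, biases):
        # Fully-connected DNN: matmul + ReLU per hidden layer, argmax at the output
        a = features
        for W, b in zip(weights[:-1], biases[:-1]):
            a = np.maximum(0.0, W @ a + b)
        return int(np.argmax(weights[-1] @ a + biases[-1]))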

DNN classifier design flow [Reagen et al., ISCA 2016]: Data (training, validation, test) -> Training (optimize hyper-parameters; minimize size and test error) -> Quantization (fixed-point 8-bit / 16-bit) -> Inference (CPU, DSP, GPU, Accelerator) 5
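
As a rough illustration of the quantization step, one common fixed-point scheme looks like this (a sketch; the split between integer and fractional bits is an assumption, not the paper's exact format):

    import numpy as np

    def quantize_fixed_point(w, bits=8, frac_bits=6):
        # Round to the nearest multiple of 2^-frac_bits and saturate to the
        # signed range of `bits`-wide integers (Q1.6 for the defaults).
        scale = 2.0 ** frac_bits
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        q = np.clip(np.round(w * scale), qmin, qmax).astype(np.int16)
        return q, scale  # stored as 8b/16b integers; dequantize as q / scale

    q, scale = quantize_fixed_point(np.array([0.73, -1.2, 0.01]))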

DNN ENGINE accelerator architecture (diagram: input vector -> weights/neurons in hidden layers -> output classes -> argmax) - Inputs: sensor data + DNN topology + weights - Output: argmax classification over the output classes - Exploits parallelism and data reuse, sparse data and small data types, and algorithmic resilience

Outline Background and motivation DNN ENGINE - Parallelism and data reuse - Sparse data and small data types - Algorithmic resilience Measurement Results Summary 7

Fully-connected DNN graph Input Vector Output Classes Input Layer Output Layer Hidden Layers 8

Fully-connected DNN graph Weight Neuron Input Vector Output Classes 9

Fully-connected DNN graph SIMD Window Input Vector Output Classes Process a group of neurons in parallel 10

Fully-connected DNN graph Weights Input Vector Output Classes Re-use of activation data across neurons 11
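
In software terms, the SIMD window amounts to blocking the layer's matrix-vector product into groups of 8 neurons, so each activation is read once and reused by the whole group. A minimal numpy sketch (illustrative only, not the RTL):

    import numpy as np

    def fc_layer_simd(x, W, b, simd=8):
        # W is (n_out, n_in); process neurons in windows of `simd` lanes
        n_out = W.shape[0]
        y = np.empty(n_out)
        for base in range(0, n_out, simd):
            end = min(base + simd, n_out)
            acc = np.zeros(end - base)
            for j, xj in enumerate(x):
                # One activation read per step, shared by all lanes in the window
                acc += W[base:end, j] * xj
            y[base:end] = np.maximum(0.0, acc + b[base:end])
        return y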

Fully-connected DNN graph Input Vector Output Classes Add bias and apply activation function 15

Balancing efficiency and bandwidth (chart: energy efficiency vs. memory bandwidth; the no-data-reuse point is marked) - Parallelism increases throughput and data reuse 20

Balancing efficiency and bandwidth - Parallelism increases throughput and data reuse, but also increases memory bandwidth demands; with no data reuse, the bandwidth is too high 21

Balancing efficiency and bandwidth - 8-Way SIMD is efficient at reasonable memory BW: 10x activation reuse with a 128b/cycle AXI channel 22
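
A quick sanity check of the 128b/cycle figure (my arithmetic, assuming the 16b weight type is the widest case):

    simd_lanes = 8                  # neurons processed in parallel
    weight_bits = 16                # widest supported weight type
    bits_per_cycle = simd_lanes * weight_bits
    assert bits_per_cycle == 128    # one 128b AXI beat of weights per cycle;
                                    # the shared activation read provides the reuse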

DNN ENGINE micro-architecture (block diagram: Host Processor; Data In / CFG / Data Out; Data Read; IPBUF; XBUF; Writeback; NBUF; W-MEM; DMA Weight Load; MAC Datapath; Activation) - an 8-Way SIMD accelerator architecture 23

DNN ENGINE micro-architecture - Host Processor loads configuration and input data 24

DNN ENGINE micro-architecture - Accumulate products of Activations and Weights in the MAC Datapath 25

DNN ENGINE micro-architecture - Add bias, apply ReLU activation, and write back 26

DNN ENGINE micro-architecture - IRQ to host, which retrieves the output data 27
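
Putting slides 24-27 together, the end-to-end control flow is roughly the following (hedged pseudocode; the object and method names are invented for illustration and are not a real driver API):

    def run_inference(host, engine, layers):
        # 1) Host loads configuration (DNN topology) and input data (IPBUF)
        engine.configure(layers)
        x = host.input_vector()

        for layer in layers:
            y = []
            for neuron in layer.neurons:
                # 2) DMA streams weights from W-MEM; the MAC datapath
                #    accumulates products of activations and weights
                acc = sum(w * a for w, a in zip(neuron.weights, x))
                # 3) Add bias, apply ReLU, write back (XBUF/NBUF)
                y.append(max(0.0, acc + neuron.bias))
            x = y

        # 4) Raise IRQ; host retrieves the output classes
        host.irq()
        return x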

Outline Background and motivation DNN ENGINE - Parallelism and data reuse - Sparse data and small data types - Algorithmic resilience Measurement Results Summary 28

Exploiting sparse data Input Vector Output Classes 29

Exploiting sparse data Input Vector Output Classes Discard small activation data 30

Exploiting sparse data Input Vector Output Classes Dynamically prune graph connectivity 31
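
A software analogue of this dynamic pruning (a sketch; the threshold and index-list format are assumptions, the hardware uses a SKIP index as shown on the following micro-architecture slides):

    import numpy as np

    def fc_layer_sparse(x, W, b, threshold=0.0):
        # Discard small activations: keep only entries above the threshold
        # (ReLU already zeroes many), recording their indices as a skip list.
        idx = np.flatnonzero(np.abs(x) > threshold)
        # Only the weight columns for surviving activations are fetched and
        # multiplied, dynamically pruning the graph's connectivity per input.
        return np.maximum(0.0, W[:, idx] @ x[idx] + b)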

DNN ENGINE micro-architecture - SKIP Index block added to the weight-fetch path 42

DNN ENGINE micro-architecture - W-MEM supports 8b or 16b data types 43

Outline Background and motivation DNN ENGINE - Parallelism and data reuse - Sparse data and small data types - Algorithmic resilience Measurement Results Summary 44

DNN ENGINE micro-architecture - RZ timing-error detection added on the XBUF and W-MEM paths [Whatmough et al., ISSCC 17] 45

Timing error tolerance (plot: timing violation rates from 10^-1 down to 10^-7 for memory, datapath, and combined errors; MNIST accuracy held at 98.36% while tolerating >1,000,000x higher violation rates) [Whatmough et al., ISSCC 17] 46

Outline Background and motivation DNN ENGINE - Parallelism and data reuse - Sparse data and small data types - Algorithmic resilience Measurement Results Summary 47

16nm SoC for always-on applications 48

Measured energy for always-on applications - Nominal Vdd / signoff Fmax: 0.8V / 667MHz - Energy varies with application and grows exponentially with accuracy: the final 0.3% of accuracy costs 2.5x the energy - On-chip memory is critical: fetching even a few KBs from DRAM costs >1uJ, which constrains model size 50

Measured energy improvement - Joint architecture and circuit optimizations, typical silicon at room temperature - Operating point scaled from 667MHz / 0.8V to 667MHz / 0.67V - Measurements demonstrate a 10x energy reduction and a 4x throughput increase 51

State-of-the-art classifiers - Dedicated NN accelerators span different performance points and process technologies - Accuracy is the critical metric: a linear classifier reaches only 88%, and energy cost grows exponentially with accuracy - The next 10x improvement: algorithm innovations? 52

Summary - Inference is moving from the cloud to the device: always-on sensing - DNN ENGINE architecture optimizations: parallelism and data reuse; sparse data and small data types; algorithmic resilience - 16nm test chip measurements: critical to store the model in on-chip memory; 10x energy and 4x throughput improvement; ~1uJ per prediction for always-on applications - We are grateful for support from the DARPA CRAFT and PERFECT projects 53