A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models


A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating
Michael Price*, James Glass, Anantha Chandrakasan
MIT, Cambridge, MA
* now at Analog Devices, Cambridge, MA
1 of 31

Diversification of speech interfaces
Today: personal assistant (smartphone), search (smartphone, PC)
Future: wearables, appliances, robots
Goal: complement or replace touchscreen interfaces
2 of 31

Computational demands of ASR (ASR = automatic speech recognition)
Real-time ASR is now feasible on x86 PCs; ARM SoCs can run it with minor performance degradation. But what about everyone else?
Today's model: offload to cloud servers. Tomorrow: offload to the cloud OR to a hardware accelerator.
System power concerns: memory and I/O signaling. If processor efficiency is optimized in isolation, these can exceed core power.
Today's non-volatile memory: MLC NAND flash. At 100 pJ/bit, 100 MB/s -> 80 mW.
3 of 31
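The flash-readout figure above is simple arithmetic: energy per bit times bit rate. A minimal sketch (the function name and unit conventions are mine, not the slide's):

```python
# Memory I/O power from the slide's numbers: energy/bit (pJ) times bandwidth.
def io_power_watts(energy_per_bit_pj, bandwidth_mb_per_s):
    bits_per_second = bandwidth_mb_per_s * 1e6 * 8   # MB/s -> bit/s
    return energy_per_bit_pj * 1e-12 * bits_per_second

p = io_power_watts(100, 100)   # ~0.08 W, i.e. the slide's 80 mW
```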

Hardware-accelerated speech interface
Primary use cases:
1. Low power
2. Low system complexity
3. Low latency
4 of 31

Outline
1. Introduction
   a) Motivation and scope
   b) ASR formulation
2. Acoustic modeling with deep neural networks (DNN)
   a) Performance with limited memory bandwidth
   b) Parallel architecture
   c) Details of execution unit design
3. Search (Viterbi) architecture
4. Voice activity detection (VAD)
   a) System power model
   b) Modulation frequency (MF) algorithm and architecture
5. Test chip
   a) Circuit features
   b) Measured performance
5 of 31

ASR formulation
HMM inference: search using the Viterbi algorithm.
6 of 31
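The HMM inference step named above can be illustrated with a textbook log-domain Viterbi decode. This is only a sketch: the chip searches a much larger WFST with pruning, and every matrix below is an invented toy model.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """log_init[s], log_trans[s, s'], log_emit[s, o]: log-probabilities."""
    score = log_init + log_emit[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        cand = score[:, None] + log_trans       # cand[s, s'] = score into s' via s
        backptr.append(cand.argmax(axis=0))     # best predecessor per state
        score = cand.max(axis=0) + log_emit[:, o]
    path = [int(score.argmax())]                # trace back the best path
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return path[::-1], float(score.max())

# Toy 2-state HMM (all probabilities invented for illustration)
init = np.log([0.9, 0.1])
trans = np.log([[0.8, 0.2], [0.3, 0.7]])
emit = np.log([[0.9, 0.1], [0.2, 0.8]])
path, score = viterbi(init, trans, emit, [0, 0, 1])   # path is [0, 0, 1]
```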

ASR formulation: Acoustic model
Training clusters WFST states into tied states (senones). The acoustic model approximates p(y | i), where i is the tied-state index and y is the feature vector (typ. 10-50 dims.).
Example PDFs: MFCC features projected to 2-D.
Models have many parameters -> large memory requirement.
Size: a typical ASR model is ~50 MB and must be stored off-chip.
Bandwidth: naïve evaluation would require 5 GB/s (target for low-power ASR: ~10 MB/s).
7 of 31

Bandwidth-limited acoustic models
Comparison of frameworks considers: quantization, parallelization, accuracy.
GMM = Gaussian mixture model; SGMM = subspace GMM; DNN = deep neural network.
8 of 31

Feed-forward neural network (DNN)
A sequence of simple nonlinear transformations, used in ASR to evaluate all acoustic likelihoods jointly.
This work uses feed-forward networks only (no time dependence).
Top image from: Michael A. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
9 of 31
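The "sequence of simple nonlinear transformations" can be sketched in a few lines. The layer sizes, the sigmoid nonlinearity, and the softmax output below are illustrative assumptions (the actual network and feature dimensions vary by task):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dnn_forward(x, layers):
    """layers: list of (W, b); sigmoid hidden layers, softmax output."""
    for W, b in layers[:-1]:
        x = sigmoid(W @ x + b)
    W, b = layers[-1]
    z = W @ x + b
    z = z - z.max()                  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
feat = rng.standard_normal(40)                 # e.g. a 40-dim feature vector
layers = [(0.1 * rng.standard_normal((128, 40)), np.zeros(128)),
          (0.1 * rng.standard_normal((128, 128)), np.zeros(128)),
          (0.1 * rng.standard_normal((1000, 128)), np.zeros(1000))]  # 1000 senones
probs = dnn_forward(feat, layers)              # one likelihood score per tied state
```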

DNN evaluator architecture 10 of 31

Execution unit assignment
International Solid-State Circuits Conference
11 of 31

Execution unit design
Complexity has been minimized: the unit executes arithmetic commands from the sequencer and writes results back to local SRAM.
12 of 31

Sigmoid approximation
Piecewise (low-order) polynomial fit: less area than a plain LUT, simpler to evaluate than a high-order fit.
Chebyshev approximation: better accuracy than a Taylor series.
13 of 31

Sigmoid approximation (cont.)
The polynomial is evaluated using Horner's method; the unpipelined version is used to save area.
7-cycle latency; 5.7k gates (mostly the multiplier).
14 of 31
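As a rough illustration of the piecewise-polynomial scheme, the sketch below fits a low-order polynomial per segment and evaluates it with Horner's method. The segment count, the cubic degree, and the least-squares fit (standing in for the chip's Chebyshev fit) are my assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

EDGES = np.linspace(-8.0, 8.0, 9)              # 8 segments over [-8, 8]
COEFFS = []                                     # highest-order coefficient first
for lo, hi in zip(EDGES[:-1], EDGES[1:]):
    xs = np.linspace(lo, hi, 64)
    COEFFS.append(np.polyfit(xs, sigmoid(xs), 3))

def sigmoid_approx(x):
    if x <= EDGES[0]:
        return 0.0                              # saturate the tails
    if x >= EDGES[-1]:
        return 1.0
    seg = int(np.searchsorted(EDGES, x)) - 1    # which segment x falls in
    acc = 0.0
    for c in COEFFS[seg]:                       # Horner: ((c3*x + c2)*x + c1)*x + c0
        acc = acc * x + c
    return acc

# Worst-case error over the fitted range stays small even at cubic order
grid = np.linspace(-7.9, 7.9, 1000)
err = max(abs(sigmoid_approx(v) - sigmoid(v)) for v in grid)
```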

Search: Viterbi algorithm
Baseline architecture from: M. Price et al., "A 6mW 5kword Speech Recognizer using WFST Models," ISSCC 2014.
15 of 31

Search: Viterbi algorithm (improved architecture)
Compression and caching: memory bandwidth reduction.
Word lattice: memory bandwidth reduction.
Merged state lists: 11% area savings.
16 of 31

Search: Word lattice
From state lattice to word lattice: discard intermediate states between words.
17 of 31

Search: Word lattice (cont.)
Light workloads: no external memory writes (pruning keeps the word lattice within internal memory).
Heavy workloads: 8x bandwidth reduction (writing only word arcs).
18 of 31

ASR accuracy and speed
Similar accuracy to software (Kaldi) with a 145k-word vocabulary.
At 80 MHz, the same search speed as software on a 3.7 GHz Xeon.
19 of 31

Voice activity detection (VAD)
Use the VAD to control a power gate for the ASR (automatic wake-up).
20 of 31

VAD power impacts
Consider typical values: p_VAD < 100 µW, p_downstream > 1 mW, D < 0.05.
Downstream system contribution: if the false-alarm rate p_FA is significant, the averaged downstream power exceeds the VAD's own power.
Therefore optimize for p_M and p_FA (VAD accuracy) rather than for VAD power.
21 of 31
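The argument above can be made concrete with a simple duty-cycle power model. The talk does not show its exact formula, so the model below (downstream power weighted by true-speech duty cycle D plus false alarms on the non-speech fraction) is an assumption:

```python
# Average system power with VAD-gated wake-up.  The downstream ASR is awake
# during true speech (duty cycle D) and on false alarms during the remaining
# non-speech fraction (false-alarm rate p_fa).
def avg_system_power(p_vad, p_down, duty, p_fa):
    downstream_duty = duty + (1.0 - duty) * p_fa
    return p_vad + downstream_duty * p_down

# Slide's typical values: p_VAD = 100 uW, p_downstream = 1 mW, D = 0.05
base = avg_system_power(100e-6, 1e-3, 0.05, 0.0)     # perfect VAD: 150 uW total
noisy = avg_system_power(100e-6, 1e-3, 0.05, 0.20)   # 20% false-alarm rate
# At p_fa = 0.20 the downstream term (240 uW) already exceeds the VAD's own
# 100 uW, which is why accuracy matters more than VAD power.
```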

Modulation frequency VAD 22 of 31

Modulation frequency VAD (cont.)
MF features are fed to an NN classifier; a small network (e.g. 3x32) is sufficient, with fewer parameters than an SVM.
The NN architecture is stripped down from the ASR's: no sparse-matrix support, no parallel evaluation, still quantized, no external memory.
Parameters are supplied at startup over an SPI bus.
23 of 31

VAD performance comparison
Two tasks: Aurora2 (left) and Forklift/Switchboard (right, more difficult).
The MF algorithm outperforms energy-based (EB) and harmonicity (HM) detectors.
Performance improves with more training data.
24 of 31

ASR/VAD interfaces
The VAD runs continuously; supervisory logic holds the ASR in reset during non-speech input.
After wake-up, the ASR requests audio samples buffered by the VAD.
25 of 31

Test chip specifications 26 of 31

Multivoltage design
Memory tends to require a higher voltage than logic -> separate memory supplies.
I/O level shifters (1.8-2.5 V) cannot handle the low logic level -> intermediate supply for the top level (with supervisory logic).
The relationship between supply voltages is constrained: level shifters operate in one direction only.
27 of 31

Clock gating Explicit clock gating complements automatic clock gating 28 of 31

Test setup
On-board control of clocks and power supplies; an FPGA provides the host and memory interfaces.
Tested with live audio, real-time streaming, and batch processing.
29 of 31

Measured performance and scalability
Note: ASR model sizes and search parameters vary between tasks; ASR and VAD were tested independently.
45x range across tasks.
30 of 31

Conclusions
DNNs can improve the performance of hardware ASR, even with restricted memory bandwidth (< 10 MB/s).
DNN-based VAD can be robust and compact: model stored on-chip (< 12 kb) with low power (22.3 µW).
ASR is not only about neural networks: significant effort is required for feature extraction and search, and the NN architecture was developed with knowledge of the application.
The combination of algorithm, architecture, and circuit techniques delivers improved accuracy (fewer word errors), improved programmability (train using standard tools), and improved scalability (lower power consumption).
31 of 31