Automatic Speech Recognition (ASR)


1 Automatic Speech Recognition (ASR) February 2018 Reza Yazdani Aminabadi Universitat Politecnica de Catalunya (UPC)

2 State-of-the-art
State-of-the-art ASR system: DNN+HMM
[Figure: ASR pipeline from sound signal to speech (words): Feature Extraction, Likelihood Computation, Graph Search]
ASR bottleneck: Graph Search (Viterbi)
[Figure: execution-time breakdown: Feature Extraction 1%, Likelihood Computation 13%, Graph Search 86%]

3 Speech Graph
Weighted Finite-State Transducer (WFST)
- Acoustic Model
- Language Model
- Language Dictionary

4 Viterbi Search
[Figure: search trellis at Frame 0, likelihood 1.0]

5 Viterbi Search
[Figure: the search expands frame by frame; likelihoods: Frame 0 1.0, Frame 1 0.54, Frame 2 0.0018]

6 Viterbi Search
[Figure: low-likelihood paths are pruned at each frame ("Pruning!"); the Frame 3 likelihood is truncated in the source ("0.")]

7 Viterbi Search
[Figure: the search continues with pruning at each frame; word label on the trellis: THREE]
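
The frame-by-frame expansion and pruning shown on the Viterbi Search slides can be sketched in Python as follows. This is a minimal illustration, not the decoder used in the talk: the function name `viterbi_beam`, the arc tuple layout, the toy graph, and the beam width are all assumptions made for the example, and likelihoods are kept in the log domain to avoid underflow.

```python
import math

def viterbi_beam(arcs, start, frame_scores, beam):
    """Toy Viterbi search with beam pruning (illustrative sketch only).

    arcs: dict state -> list of (dest, label, word, weight); word may be None.
    frame_scores: one dict per frame, mapping label -> acoustic probability.
    Returns (log-likelihood, words) of the best surviving path.
    """
    tokens = {start: (0.0, ())}  # state -> (log-likelihood, words so far)
    for scores in frame_scores:
        new_tokens = {}
        for state, (logp, words) in tokens.items():
            for dest, label, word, weight in arcs.get(state, []):
                p = scores.get(label, 0.0) * weight
                if p <= 0.0:
                    continue  # impossible transition in this frame
                cand = logp + math.log(p)
                out = words + (word,) if word else words
                # Viterbi step: keep only the best path into each state.
                if dest not in new_tokens or cand > new_tokens[dest][0]:
                    new_tokens[dest] = (cand, out)
        if not new_tokens:
            return float("-inf"), ()
        # Pruning: drop paths that fall too far below the best one.
        best = max(lp for lp, _ in new_tokens.values())
        tokens = {s: t for s, t in new_tokens.items() if t[0] >= best - beam}
    return max(tokens.values(), key=lambda t: t[0])

# Hypothetical toy graph and acoustic scores, chosen only for illustration.
TOY_ARCS = {
    0: [(1, "a", None, 0.5), (2, "b", None, 0.5)],
    1: [(3, "c", "ONE", 1.0)],
    2: [(3, "c", "TWO", 1.0)],
}
TOY_SCORES = [{"a": 0.8, "b": 0.1}, {"c": 1.0}]
best_logp, best_words = viterbi_beam(TOY_ARCS, 0, TOY_SCORES, beam=1.0)
```

With this toy input the path through state 2 is pruned after the first frame, so only the hypothesis carrying the word "ONE" reaches the last frame. A wider beam keeps more tokens alive per frame, which is the accuracy versus workload trade-off that pruning controls.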

8 Viterbi Hardware Accelerator (Micro'49)

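9 Accelerated ASR System
[Figure: block diagram linking Memory, CPU, GPU, and the Viterbi Accelerator; labeled data flows: Audio Frames, Acoustic Scores, Word Lattice]
[Figure: pipeline timeline: DNN 0 | DNN 1, Vit 0 | DNN 2, Vit 1 | DNN 3, Vit 2]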

10 Accelerator's Architecture

11 Accelerator's Challenges
The pipeline's main sources of stalls:
- Misses at the Arc cache: unpredictable access pattern to the WFST's arcs (25K active arcs per frame out of 34M arcs)
- Increased hash access time: due to overflows and collisions
- High memory bandwidth: needed to fetch the states and arcs of different tokens

12 Improved Arc Cache
Decoupled access-execute pattern
- Issue the memory requests in advance
- After pruning, the addresses are known and computable
- Respect timeliness through a reorder buffer

13 Memory Bandwidth Reduction
Change the layout of the arcs' data in memory
- Sort the states' arcs according to each state's number of arcs
- Direct address computation for states with 1 to 4 arcs
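
One way to read the direct address computation above is sketched below. It is an assumption about the layout, not the accelerator's documented format: states are renumbered so that those with the same arc count (1 to 4) are contiguous, which lets the index of a state's first arc be computed from its ID alone, while all other states keep an explicit offset entry. The names `build_layout`, `locate_arcs`, and the toy arc counts are hypothetical.

```python
def build_layout(arc_counts, max_direct=4):
    """arc_counts[s] = number of outgoing arcs of original state s (toy input)."""
    def is_direct(n):
        return 1 <= n <= max_direct

    # Direct states first, grouped by arc count; all other states afterwards.
    order = sorted(range(len(arc_counts)),
                   key=lambda s: (not is_direct(arc_counts[s]), arc_counts[s]))
    groups = []      # (first new state ID, arcs per state, index of first arc)
    explicit = {}    # new state ID -> (index of first arc, arc count)
    direct_end = 0   # new IDs below this value use direct computation
    next_arc = 0
    for new_id, old_id in enumerate(order):
        n = arc_counts[old_id]
        if is_direct(n):
            if not groups or groups[-1][1] != n:
                groups.append((new_id, n, next_arc))
            direct_end = new_id + 1
        else:
            explicit[new_id] = (next_arc, n)
        next_arc += n
    return order, groups, direct_end, explicit

def locate_arcs(new_id, groups, direct_end, explicit):
    """Return (index of first arc, arc count) for a renumbered state."""
    if new_id < direct_end:
        # Pick the group this ID falls into, then compute the address directly.
        first_id, n, base = max(g for g in groups if g[0] <= new_id)
        return base + (new_id - first_id) * n, n
    return explicit[new_id]  # fallback: stored offset for all other states

# Hypothetical arc counts for seven states.
order, groups, direct_end, explicit = build_layout([3, 1, 7, 2, 1, 2, 0])
```

In this sketch only the few group boundaries need to be stored for the direct states, so fetching their arcs does not require first reading a per-state record from memory.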

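14 Experimental Results
Gains in performance (speedup) and energy savings
Area: mm
[Figure: speedup and energy-reduction charts; surviving labels, in extraction order: "x Speedup", "48.4x Reduction", "628x Reduction", "22.24x Speedup"]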

15 UNFOLD: Memory-Efficient Design (Micro'50)

16 Viterbi Accelerator's Pitfalls
The main challenges of the previous design
- Working on a huge graph: more than 1 GB
- Requiring high memory bandwidth: ~10 GB/s
- High power consumption: ~1 W

17 On-the-fly Composition
- Separate the fully-composed WFST into several WFSTs: Acoustic Model and Language Model
- Do the composition on-the-fly rather than offline

18 Viterbi+Composition
[Figure: Frame 0, token at the composed state (0,0)]

19 Viterbi+Composition
[Figure: Frame 1, arcs S1/e, S4/e, S6/e expand (0,0) to (1,0), (4,0), (6,0)]

20 Viterbi+Composition
[Figure: Frame 2, arcs S2/e, S5/TWO, S7/e expand those tokens to (2,0), (5,2), (7,0)]
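
The expansion on the Viterbi+Composition slides can be sketched as below, under one reading of the figure labels: each token holds a pair (acoustic-model state, language-model state), an arc labeled "e" emits no word and leaves the LM state unchanged, and an arc that emits a word (here TWO) also advances the LM. The function name `expand` and the dictionary layout are assumptions, and weights and back-off arcs are left out to keep the example short.

```python
def expand(pairs, am_arcs, lm_arcs):
    """Expand composed (AM state, LM state) pairs by one frame (sketch only)."""
    result = []
    for am_state, lm_state in pairs:
        for dest, word in am_arcs.get(am_state, []):
            if word is None:
                # Epsilon output: only the AM side moves.
                result.append((dest, lm_state))
            elif (lm_state, word) in lm_arcs:
                # Cross-word transition: look up the matching LM arc on the fly.
                result.append((dest, lm_arcs[(lm_state, word)]))
            # A missing LM arc would need back-off handling, omitted here.
    return result

# Toy graphs transcribed from the state pairs shown on the slides.
AM_ARCS = {
    0: [(1, None), (4, None), (6, None)],
    1: [(2, None)],
    4: [(5, "TWO")],
    6: [(7, None)],
}
LM_ARCS = {(0, "TWO"): 2}

frame1 = expand([(0, 0)], AM_ARCS, LM_ARCS)
frame2 = expand(frame1, AM_ARCS, LM_ARCS)
```

The composed states are created only when a token reaches them, so the fully-composed graph never has to be stored.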

21 Compression Techniques
Compressing based on graph characteristics
- AM and fully-composed WFSTs: most arcs have no word ID, so remove the epsilon-word transitions
- AM and fully-composed WFSTs: an arc's destination state ID is a relative distance of -1, 0, 1 to the previous state
- LM: predictable destinations from the backoff state
- Compressed representation of the word lattice
- 3.6x compression
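
The relative-distance idea for destination state IDs can be illustrated with a small encoder. This is a sketch under assumptions: "previous state" is taken to mean the arc's source state, a 2-bit code is assumed for the three common distances, and an escape code followed by the full ID is assumed for every other arc. The names `encode_dest`, `decode_dest`, and `ESCAPE` are hypothetical.

```python
ESCAPE = 3  # assumed 2-bit code meaning "the full destination ID follows"

def encode_dest(src, dest):
    """Encode a destination state ID relative to the source state."""
    delta = dest - src
    if -1 <= delta <= 1:
        return (delta + 1,)       # codes 0, 1, 2 stand for -1, 0, +1
    return (ESCAPE, dest)         # uncommon case: keep the full ID

def decode_dest(src, code):
    """Invert encode_dest for the same source state."""
    if code[0] == ESCAPE:
        return code[1]
    return src + code[0] - 1

# Hypothetical arcs as (source, destination) pairs.
ARCS = [(10, 11), (10, 10), (10, 9), (10, 250)]
ENCODED = [encode_dest(s, d) for s, d in ARCS]
DECODED = [decode_dest(s, c) for (s, _), c in zip(ARCS, ENCODED)]
```

When most arcs fall into the three short codes, the destination field shrinks from a full state ID to a couple of bits for those arcs.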

22 Accelerator's Design

23 Offset Lookup Table
- Cross-word transition: triggers a search on the LM
- Linear search: 10x slowdown
- Binary search: 3x slowdown
- Exploit locality: keep track of recent searches
[Figure: the LM state and word ID are hashed into a table of (Tag, LM Arc Offset) entries; a hit returns the offset, a miss triggers a binary search through the LM Arc Cache and updates the table]
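
The lookup flow described on this slide can be sketched in software as a small direct-mapped table placed in front of a binary search. This is an illustration under assumptions, not the hardware design: the class name `OffsetLookupTable`, the table size, and the use of Python's built-in `hash` are all hypothetical choices.

```python
from bisect import bisect_left

class OffsetLookupTable:
    """Remembers recent (LM state, word ID) -> arc offset searches (sketch)."""

    def __init__(self, n_entries):
        self.entries = [None] * n_entries   # each entry: (tag, arc offset)
        self.hits = 0
        self.misses = 0

    def find_arc(self, lm_state, word_id, arc_words):
        """arc_words: sorted word IDs on the outgoing arcs of lm_state."""
        tag = (lm_state, word_id)
        slot = hash(tag) % len(self.entries)
        entry = self.entries[slot]
        if entry is not None and entry[0] == tag:
            self.hits += 1
            return entry[1]                  # hit: no search needed
        self.misses += 1
        i = bisect_left(arc_words, word_id)  # miss: fall back to binary search
        if i < len(arc_words) and arc_words[i] == word_id:
            self.entries[slot] = (tag, i)    # update the table for next time
            return i
        return None                          # no arc for this word

# Hypothetical LM state 7 with four outgoing arcs.
table = OffsetLookupTable(n_entries=64)
WORDS = [3, 8, 15, 42]
first = table.find_arc(7, 15, WORDS)
second = table.find_arc(7, 15, WORDS)
```

The second lookup of the same pair is served from the table, which is the locality the slide proposes to exploit.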

24 Experimental Results Memory Footprint Reduction

25 Experimental Results
Power Reduction
[Figure: power dissipation (mW) of UNFOLD vs. Micro'49, broken down into Address Lookup Table, Main Memory, Pipeline, LM Cache, Graph Cache, and List of States]

26 DNN Pruning Negative Effects

27 DNN Pruning Side-effects
- Low confidence in classifying the top-1 class
- This happens in spite of accurately choosing the best class

28 Impact on Viterbi Search Viterbi Expansion under the pruned DNN

29 Workload Increase Viterbi slowdown due to the DNN pruning

30 Solution
Choose a smaller number of paths to explore
- N-best hypotheses expansion
- Challenge: sorting the tokens on each frame is expensive, O(M^2)
- Our approach: choose the loosely N-best
- Use a hash table to select the best paths
- On replacement, keep the hypotheses with high likelihood
- Increase the hash associativity to reduce the replacement rate
- Replacement is done with the worst path
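
The loosely N-best selection can be sketched as a set-associative table whose capacity bounds the number of hypotheses kept per frame. This is a minimal illustration under assumptions: the class name `LooseNBest`, the parameters `n_sets` and `ways`, and the use of Python's `hash` for set selection are hypothetical and do not describe the actual hardware table.

```python
class LooseNBest:
    """Keeps at most n_sets * ways hypotheses without sorting (sketch)."""

    def __init__(self, n_sets, ways):
        self.sets = [dict() for _ in range(n_sets)]  # state -> log-likelihood
        self.ways = ways

    def insert(self, state, logp):
        bucket = self.sets[hash(state) % len(self.sets)]
        if state in bucket:
            # Same state reached again: keep the better path.
            bucket[state] = max(bucket[state], logp)
        elif len(bucket) < self.ways:
            bucket[state] = logp
        else:
            # Set is full: replace the worst path if the new one is better.
            worst = min(bucket, key=bucket.get)
            if logp > bucket[worst]:
                del bucket[worst]
                bucket[state] = logp

    def tokens(self):
        return {s: lp for b in self.sets for s, lp in b.items()}

# Hypothetical tokens competing for a single 2-way set.
table = LooseNBest(n_sets=1, ways=2)
for state, logp in [(1, -1.0), (2, -5.0), (3, -2.0), (4, -9.0)]:
    table.insert(state, logp)
kept = table.tokens()
```

The selection is only loosely N-best because a good hypothesis can still be evicted when many good ones map to the same set; raising the associativity makes that less likely.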

31 Accuracy vs Explored Hypotheses

32 Experimental Results
ASR performance: DNN + Viterbi
[Figure: normalized decoding time, split into DNN and Viterbi, for Baseline, 70% Pruning, 80% Pruning, and 90% Pruning]

33 Experimental Results
ASR energy consumption: DNN + Viterbi
[Figure: normalized ASR energy consumption, split into DNN and Viterbi, for Baseline, 70% Pruning, 80% Pruning, and 90% Pruning]

34 Locality-Aware Scheme (LAWS) For Automatic Speech Recognition

An Ultra Low-Power Hardware Accelerator for Automatic Speech Recognition Reza Yazdani, Albert Segura, Jose-Maria Arnau, Antonio Gonzalez Computer Architecture Department, Universitat Politecnica de Catalunya