GPU Accelerated Model Combination for Robust Speech Recognition and Keyword Search


1 GPU Accelerated Model Combination for Robust Speech Recognition and Keyword Search. Wonkyum Lee, Jungsuk Kim, Ian Lane. Electrical and Computer Engineering, Carnegie Mellon University. March 26

2 Overview: Introduction; Acoustic Model; Acoustic Model Combination; GPU Accelerated Model Combination; Evaluation Results; Summary

3 Introduction

4 Introduction: ASR (Automatic Speech Recognition). [Diagram components: Feature Extraction, Acoustic Model, Lexicon, Language Model, Decoder; output: Word String (Word #1, Word #2)]

5 Introduction: KWS (Keyword Search). Keyword Search Task: Speech-To-Text (by ASR), e.g. "Welcome to GTC two thousands fourteen"; Indexer; Thousands of Hours of Indexed Audio. Keyword: GTC. Hit: GTC.

6 Introduction. Speech Recognition: Speech to Text; Evaluation Metric: Word Error Rate (WER). KWS: Spot the Keyword in Audio; Evaluation Metric: Actual Term Weighted Value (ATWV). -> Both tasks require robust ASR!
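The two metrics named on slide 6 can be sketched in a few lines. This is a minimal illustration, not the scoring tool used in the talk: `word_error_rate` and `term_weighted_value` are hypothetical helper names, the reference/hypothesis pair is made up from the slide 5 example, and the false-alarm weight `beta=999.9` is the value commonly used in NIST keyword search scoring, assumed here rather than taken from the slides. ATWV is the term-weighted value measured at the system's actual decision threshold.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def term_weighted_value(keyword_counts, speech_seconds, beta=999.9):
    """keyword_counts: (n_true, n_correct, n_false_alarm) per keyword.

    beta=999.9 is an assumed default (the usual NIST setting), not from the slides.
    """
    losses = []
    for n_true, n_correct, n_false_alarm in keyword_counts:
        if n_true == 0:
            continue  # keywords absent from the reference are not scored
        p_miss = 1.0 - n_correct / n_true
        p_false_alarm = n_false_alarm / (speech_seconds - n_true)
        losses.append(p_miss + beta * p_false_alarm)
    if not losses:
        raise ValueError("no keyword occurs in the reference")
    return 1.0 - sum(losses) / len(losses)


# Hypothetical example: one substituted word out of six reference words.
wer = word_error_rate("welcome to gtc two thousand fourteen",
                      "welcome to gtc two thousands fourteen")
```

Lower WER is better, while higher ATWV is better; a single false alarm is penalized heavily in ATWV because of the large `beta`.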

7 Acoustic Model

8 Acoustic Model: Gaussian Mixture Model (GMM)

9 Acoustic Model: GMM/HMM. GMMs trained using the EM algorithm are able to self-organize to fit a data set. The Hidden Markov Model (HMM) models sequential patterns. Technical advances over the past 10 years: Adaptation, Discriminative Training, SGMM.
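As a rough illustration of what a GMM acoustic model computes for one HMM state, the sketch below evaluates the log-likelihood of a feature vector under a diagonal-covariance mixture. The diagonal covariance, the function name `gmm_log_likelihood`, and the toy parameters are assumptions made for the example; the slides do not specify the model configuration.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) for a diagonal-covariance GMM, computed stably in the log domain."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = 0.0
        for x_d, mu_d, var_d in zip(x, mu, var):
            # log of a 1-D Gaussian density, summed over independent dimensions
            log_gauss += -0.5 * (math.log(2.0 * math.pi * var_d)
                                 + (x_d - mu_d) ** 2 / var_d)
        log_terms.append(math.log(w) + log_gauss)
    # log-sum-exp over mixture components
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


# Toy 2-component, 2-dimensional mixture (hypothetical values).
score = gmm_log_likelihood(
    x=[0.1, -0.3],
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [1.0, -1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
```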

10 Acoustic Model: Deep Neural Network (DNN). George E. Dahl et al., Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

11 Acoustic Model: DNN/HMM. Called a hybrid DNN/HMM system. Has good discrimination. Temporal aspects are dealt with by the HMM, as in left-to-right HMM models. Drawback: computation is expensive!
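In a hybrid DNN/HMM system the network outputs state posteriors, which are usually converted to scaled likelihoods before decoding by dividing by the state priors (Bayes' rule, dropping the term that does not depend on the state). The sketch below shows that conversion under this standard assumption; the function name `scaled_log_likelihoods`, the flooring constant, and the toy numbers are hypothetical and not taken from the talk.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """log p(x | s) up to a constant: log p(s | x) - log p(s) for each state s."""
    return [math.log(max(post, floor)) - math.log(max(prior, floor))
            for post, prior in zip(posteriors, priors)]


# Toy example with two HMM states (hypothetical values).
frame_scores = scaled_log_likelihoods(posteriors=[0.5, 0.5], priors=[0.25, 0.75])
```

With equal posteriors, the rarer state (prior 0.25) receives the higher score, which is the intended effect of dividing by the prior.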

12 Acoustic Model Combination. How can we improve ASR with the acoustic model? Robust acoustic model: more and more data -> better and better accuracy; robust features (bottleneck feature, noise-robust feature); acoustic model combination.

13 Acoustic Model Combination. [Diagram: log-likelihood pattern by acoustic model, for GMM1, GMM2, DNN1, and DNN2, arranged by Feature and Model Structure]

14 Acoustic Model Combination. Different acoustic models (model structure, features) have distinct speech recognition patterns -> different performance in speech recognition and keyword search. The goal is to find a way to combine different acoustic models for robust speech recognition and keyword search. Considerations: data type to be combined for AM combination; weighting criterion; total system run time (real-time factor).

15 Acoustic Model Combination. Multi-stream based AM combination: combine multiple AMs at the AM score level; weighting criterion (arithmetic, geometric, harmonic); one-pass, one-time decoding. Other combination methods: lattice combination, ROVER, CombMNZ; intermediate/output level combination.

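16 Acoustic Model Combination. [Diagram: Features o_1 ... o_N feed acoustic models AM 1 ... AM N, producing state scores s_{1,1}, s_{1,2}, s_{1,3}, ... and s_{N,1}, s_{N,2}, ...; Remapping maps each model's scores onto the states s_1, s_2, s_3, s_4; the remapped scores are weighted (w_1, w_2, ...: weight, normalization) and summed (Σ) per state. Combination: arithmetic mean, geometric mean, or harmonic mean. The combined scores go to the WFST decoder, which outputs words.]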
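The three weighting criteria can be sketched for a single HMM state as follows. This is an illustrative reading of the diagram, not the HYDRA implementation: it assumes each model supplies a log-likelihood for the same (already remapped) state, that the means are taken over likelihoods rather than log-likelihoods, and that weights are positive and normalized to sum to one. The name `combine_state_scores` is hypothetical.

```python
import math


def _logsumexp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def combine_state_scores(log_likes, weights, scheme="arithmetic"):
    """Weighted mean of per-model likelihoods for one state, returned as a log score.

    log_likes: one log-likelihood per acoustic model (assumed already remapped).
    weights:   one positive weight per model; normalized here to sum to 1.
    """
    if any(w <= 0.0 for w in weights):
        raise ValueError("weights must be positive")
    total = sum(weights)
    norm = [w / total for w in weights]
    if scheme == "arithmetic":   # log( sum_n w_n * p_n )
        return _logsumexp([math.log(w) + ll for w, ll in zip(norm, log_likes)])
    if scheme == "geometric":    # log( prod_n p_n ** w_n )
        return sum(w * ll for w, ll in zip(norm, log_likes))
    if scheme == "harmonic":     # log( 1 / sum_n (w_n / p_n) )
        return -_logsumexp([math.log(w) - ll for w, ll in zip(norm, log_likes)])
    raise ValueError("unknown scheme: " + scheme)


# Two models scoring the same state with likelihoods 0.2 and 0.8, equal weights.
toy = [math.log(0.2), math.log(0.8)]
arithmetic = math.exp(combine_state_scores(toy, [1.0, 1.0], "arithmetic"))
geometric = math.exp(combine_state_scores(toy, [1.0, 1.0], "geometric"))
harmonic = math.exp(combine_state_scores(toy, [1.0, 1.0], "harmonic"))
```

For any set of positive scores the arithmetic mean is at least the geometric mean, which is at least the harmonic mean, so the arithmetic scheme is the most tolerant of a single model assigning a low score to the correct state. Because every state is combined independently per frame, the same computation maps naturally onto one GPU thread per state.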


17 Acoustic Model Combination: GPU Accelerated Speech Recognition (talked at GTC 2013 &). Speech recognition contains many highly parallel tasks + GPU processors optimized for parallel computing = HYDRA, an ASR engine designed specifically for GPUs.

18 Acoustic Model Combination

19 Experimental evaluations. Data: IARPA BABEL Program, Vietnamese language collection babel107b-v0.7 [1], Limited language pack (10 hrs training, 20 hrs test).
Features:
LMEL: Log Mel filter bank coefficients
MFCC: Mel Frequency Cepstral Coefficients
BNF: Bottleneck features
FFV: Fundamental Frequency Variation feature
Pitch: Pitch tracking feature

Features   Dim.   Source feature            Input frames
BNF               th lmel + FFV             11
BNF               th lmel + FFV + Pitch     11
BNF               th MFCC + FFV             11

[1] IARPA, IARPA Babel Program,

20 Experimental evaluations. Baseline system performance. Trained 6 acoustic models (3 GMMs, 3 DNNs) with 3 different features.

Model    Feature / Tree    WER (%)    ATWV
GMM 1    BNF
GMM 2    BNF 2 Tree
GMM 3    BNF
DNN 1    BNF
DNN 2    BNF 2 Tree
DNN 3    BNF

21 Experimental evaluations. WER and ATWV for different combination schemes:

Combination scheme            WER (%)        ATWV
Best single system (DNN 1)
Arithmetic mean               63.6 (-3.7)    (+30.3%)
Geometric mean                65.4 (-2.9)    (+23.9%)
Harmonic mean                 66.2 (-1.1)    (+21.0%)

Combined 6 acoustic models (GMM, DNN). Arithmetic mean showed the largest improvement: 3.7% absolute WER improvement and 30.3% relative ATWV improvement.
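The headline figures can be cross-checked with simple arithmetic. The sketch below assumes the best single system corresponds to the "1 model" entry reported on slide 23 (67.3% WER, 0.14 ATWV); that link is an assumption, since the baseline values did not survive on this slide, and the helper names are hypothetical.

```python
def absolute_reduction(baseline, combined):
    """Absolute improvement in a lower-is-better metric such as WER."""
    return baseline - combined


def apply_relative_gain(baseline, percent):
    """Value obtained after a relative improvement of `percent` percent."""
    return baseline * (1.0 + percent / 100.0)


# Assumed baseline (slide 23, "1 model" row); not stated on this slide.
baseline_wer = 67.3
baseline_atwv = 0.14

arithmetic_wer_gain = absolute_reduction(baseline_wer, 63.6)
harmonic_wer_gain = absolute_reduction(baseline_wer, 66.2)
geometric_wer_gain = absolute_reduction(baseline_wer, 65.4)
implied_arithmetic_atwv = apply_relative_gain(baseline_atwv, 30.3)
```

Under this assumption the arithmetic-mean and harmonic-mean rows reproduce (3.7 and 1.1 points), and the implied arithmetic-mean ATWV of about 0.18 matches slide 23. The geometric-mean row does not reproduce (67.3 - 65.4 gives 1.9, against the 2.9 shown), so one of those two figures may have been altered somewhere between the slide and this transcript.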

22 Experimental evaluations. [Figure: ATWV and Real-Time Factor (RTF) by number of models (1-Model, 1-Model, 3-Models, 6-Models) and configuration (CPU, GPU-search, GPU-based AM computation)]
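The real-time factor plotted on this slide is the ratio of processing time to audio duration, so values below 1.0 mean faster than real time. A minimal sketch follows; the function name and the timing values are hypothetical, not measurements from the talk.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the audio decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timing: 30 s of compute for a 60 s utterance.
rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
```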

23 Experimental evaluations. State-level combination obtains the best WER compared with lattice combination and ROVER (note: the same phone-states are used across all models). CombMNZ obtains better ATWV when combining more than 2 models; however, it is 5x - 10x slower. At comparable RTF: Multi-stream = 0.23 > CombMNZ = 0.20.

WER (ATWV) by number of models:
Models                    Multistream      CombMNZ
1 model                   67.3% (0.14)
models                    64.7% (0.17)
models                    63.6% (0.18)
models                    63.6% (0.18)
models (large lattice)    64.7% (0.23)
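For comparison with the multi-stream approach, CombMNZ fuses the outputs of separately decoded systems: each candidate hit's scores are summed across systems and multiplied by the number of systems that returned it. The sketch below assumes hits from different systems have already been aligned to a shared identifier and that scores are normalized to a common range; `comb_mnz` and the hit identifiers are hypothetical.

```python
from collections import defaultdict


def comb_mnz(system_hits):
    """system_hits: one dict per system, mapping hit id -> normalized score.

    Returns hit id -> (number of systems with a nonzero score) * (sum of scores).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hits in system_hits:
        for hit_id, score in hits.items():
            if score > 0.0:
                totals[hit_id] += score
                counts[hit_id] += 1
    return {hit_id: counts[hit_id] * totals[hit_id] for hit_id in totals}


# Two systems; "hit_a" is found by both, "hit_b" by only one (hypothetical scores).
fused = comb_mnz([{"hit_a": 0.5, "hit_b": 0.2}, {"hit_a": 0.25}])
```

Because every system must be decoded on its own before fusion, the cost grows with the number of models, which is consistent with the slowdown noted on the slide; the multi-stream approach instead combines scores inside a single decoding pass.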

24 Conclusion. Proposed multi-stream model combination in a GPU-accelerated speech recognition framework. Multi-stream combination gives comparable performance with efficient runtime. Future work: more combination schemes, namely weighted model combination (model-level and HMM-state-level weights) and DNN-based combination; faster decoding speed through use of the CUDA multi-stream technique.

25 Q&A. Thank you for your attention.
