Open-Source Speech Recognition for Hand-held and Embedded Devices

Slide 1: PocketSphinx: Open-Source Speech Recognition for Hand-held and Embedded Devices
David Huggins-Daines, Mohit Kumar, Arthur Chan, Alan W Black (awb@cs.cmu.edu), Mosur Ravishankar (rkm@cs.cmu.edu), Alexander I. Rudnicky (air@cs.cmu.edu)
Language Technologies Institute, Carnegie Mellon University, 05/18/06

Slide 2: What is PocketSphinx?
- Based on Sphinx-II
  - Open source code under MIT-style license
  - Widely used at CMU and elsewhere
  - Mature and stable API
- Design goals
  - Statistical Language Model support
  - Finite-State Grammars also available
  - Medium-Large Vocabulary (1-10k words)
  - Make it go faster

Slide 3: Why do we need it?
- Typical desktop/workstation of
  - bit memory bus (6-10GB/sec)
  - 1.8-3GHz processor (5000 MIPS)
  - ATA, SATA, or SCSI storage ( MB/sec)
- Typical PDA/SOC/smartphone of
  - or 32-bit memory bus ( MB/sec)
  - MHz processor ( MIPS)
  - SD/MMC or CF storage (1-16MB/sec)
  - no FPU or vector unit (sometimes a DSP...)

Slide 4: ASR bottlenecks
- Wait, you say: My cell phone is pretty darn fast! At least as fast as that DEC we had a real-time 20k system on back in 1996!
- However: ASR is system bandwidth limited
  - Sphinx benchmarks (shown to the right) favor large caches and high memory bandwidth (Intel)
  - Search, LM, and dictionary lookup are highly memory-intensive
  - We will have to deal with them
- (Source: techreport.com)

Slide 5: Scaling: Hand-held vs Desktop
[Figure: speed (xRT) versus number of words in vocabulary, comparing hand-held and desktop]

Slide 6: How to make it go faster
- Low-hanging fruit
  - Front-end optimizations (fixed-point, logarithm)
  - Speeding up GMM computation
  - Old-fashioned beam tuning
- Non-speech-related work
  - Memory optimization (+ model compression)
  - Machine-level optimization (assembly code)
- What's left?
  - Search optimization: dynamic beam tuning
  - Language model compression and optimization

Slide 7: Front-End Optimizations
- Fixed-point calculations
  - 32-bit, or format
  - Using 64-bit multiply (SMULL) on ARM, multiply-accumulate on DSP
  - MFCC calculated in log domain, using a lookup of log2 w/conversion to log
- Audio downsampling
  - Allows smaller order FFT and MFCC
  - Not as useful for large-vocabulary systems
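
The fixed-point and log-lookup ideas on this slide can be sketched as follows. This is an illustration only, not PocketSphinx code: the Q16.16 layout, the 256-entry table size, and all function names are assumptions made for the example (the exact formats named on the slide are not recoverable from this transcript). Python's unbounded integers stand in for the 64-bit product that an instruction such as SMULL would produce on ARM.

```python
import math

FRAC_BITS = 16             # assumed Q16.16 layout (hypothetical choice)
ONE = 1 << FRAC_BITS       # fixed-point representation of 1.0
TABLE_BITS = 8             # assumed table resolution: 256 entries

def to_fixed(x):
    """Convert a float to Q16.16."""
    return int(round(x * ONE))

def to_float(q):
    """Convert Q16.16 back to a float (for inspection only)."""
    return q / ONE

def fixed_mul(a, b):
    """Multiply two Q16.16 values.

    The full product has 32 fractional bits; shifting right by
    FRAC_BITS brings it back to Q16.16.
    """
    return (a * b) >> FRAC_BITS

# Table of log2(1 + i/256) in Q16.16, indexed by the top mantissa bits.
LOG2_TABLE = [to_fixed(math.log2(1.0 + i / (1 << TABLE_BITS)))
              for i in range(1 << TABLE_BITS)]

def fixed_log2(q):
    """Approximate log2 of a positive Q16.16 value, result in Q16.16."""
    if q <= 0:
        raise ValueError("log2 is only defined for positive values")
    msb = q.bit_length() - 1          # position of the leading one bit
    exponent = msb - FRAC_BITS        # integer part of log2
    # Take the TABLE_BITS bits just below the leading one as the index.
    if msb >= TABLE_BITS:
        index = (q >> (msb - TABLE_BITS)) & ((1 << TABLE_BITS) - 1)
    else:
        index = (q << (TABLE_BITS - msb)) & ((1 << TABLE_BITS) - 1)
    return exponent * ONE + LOG2_TABLE[index]

LN2 = to_fixed(math.log(2.0))

def fixed_ln(q):
    """Change of base: one extra fixed-point multiply by a constant."""
    return fixed_mul(fixed_log2(q), LN2)
```

Converting from log2 to any other base costs a single `fixed_mul` by a precomputed constant, which is why one table of log2 values is enough in a scheme like this.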

Slide 8: GMM Optimizations
- Top-N based Gaussian selection (Mosur 96)
  - Use previous frame's top codewords to select current frame
  - Standard Sphinx-II technique
- Partial frame-based downsampling (Woszczyna 98)
  - Only update top-N every Mth frame
  - Can significantly affect accuracy
- kd-tree based Gaussian selection (Fritsch 96)
  - Approximate nearest neighbor search in k dimensions using stable partition trees
  - 10% speedup, little or no effect on accuracy
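
One plausible reading of top-N selection seeded by the previous frame is sketched below. It is not taken from the Sphinx-II source; the function names, the diagonal-covariance distance, and the early-abort strategy are assumptions for illustration. The previous frame's winners are evaluated in full first, and their worst distance then serves as a bound that lets most other codewords be abandoned after a few dimensions.

```python
def bounded_distance(x, mean, inv_var, bound):
    """Variance-weighted squared distance between x and mean.

    Returns None as soon as the running sum exceeds `bound`, so
    hopeless codewords are abandoned without touching every dimension.
    """
    total = 0.0
    for xi, mi, vi in zip(x, mean, inv_var):
        diff = xi - mi
        total += diff * diff * vi
        if total > bound:
            return None
    return total

def top_n_codewords(x, means, inv_vars, prev_top, n):
    """Return the indices of the n closest codewords, nearest first.

    `prev_top` holds the previous frame's winning indices. They only
    affect how much work is done, not which codewords are returned.
    """
    seeded = set(prev_top)
    best = []
    for idx in seeded:
        dist = bounded_distance(x, means[idx], inv_vars[idx], float("inf"))
        best.append((dist, idx))
    best.sort()
    best = best[:n]

    for idx in range(len(means)):
        if idx in seeded:
            continue
        # The current n-th best distance is the bound to beat.
        bound = best[-1][0] if len(best) >= n else float("inf")
        dist = bounded_distance(x, means[idx], inv_vars[idx], bound)
        if dist is None:
            continue
        best.append((dist, idx))
        best.sort()
        best = best[:n]
    return [idx for _, idx in best]
```

Because consecutive frames usually lie close together in feature space, the seeded bound tends to be tight from the start. Updating the top-N only every Mth frame, as in the second bullet, would skip this call entirely on the intermediate frames.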

Slide 9: Search Optimizations
- Absolute pruning
  - Approximations in the front end and GMM increase the effective beam width, paradoxically decreasing performance
  - We would like to enforce a hard limit on the number of states or word exits evaluated per frame - how?
- Histogram pruning (Ney 1996)
  - Partition the beam width into bins
  - Dynamically recompute beam based on bin occupancy counts
  - 30% speedup with 10% relative degradation in WER
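
A minimal sketch of histogram pruning, under assumptions: scores are log-domain values where higher is better, and the names `histogram_threshold`, `max_active`, and `n_bins` are hypothetical, not PocketSphinx parameters. The beam is split into equal-width bins, occupancy is counted in one pass, and the threshold is tightened to the last bin boundary that keeps the survivor count within the limit, so no sort of the active list is needed.

```python
def histogram_threshold(scores, beam, max_active, n_bins=16):
    """Return a pruning threshold; hypotheses with score > threshold survive.

    `beam` is the normal beam width below the best score. If more than
    `max_active` hypotheses fall inside it, the threshold is raised to a
    bin boundary so that at most `max_active` remain (unless the first
    bin alone already exceeds the limit, in which case that bin is kept).
    """
    best = max(scores)
    bin_width = beam / n_bins
    counts = [0] * n_bins
    for s in scores:
        gap = best - s
        if gap < beam:
            # min() guards against floating-point rounding at the edge.
            counts[min(int(gap / bin_width), n_bins - 1)] += 1

    kept = 0
    for b, count in enumerate(counts):
        if kept + count > max_active:
            return best - max(b, 1) * bin_width
        kept += count
    return best - beam

def prune(scores, threshold):
    """Keep only the hypotheses that beat the threshold."""
    return [s for s in scores if s > threshold]
```

The cost is linear in the number of active hypotheses per frame, and the bin count sets how precisely the hard limit is honored: with coarse bins the survivor count can fall well below `max_active`.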

Slide 10: Memory Optimizations
- Read-only model files
  - mmap(2)able, shareable between processes
  - Leverage OS-level caching (virtual memory)
- Precompiled (binary) LM
  - Inherited from Sphinx-II
  - Adapted for memory-mapping
  - vocabulary in <32M of RAM
- Read-only binary model definition file
  - Pre-built radix tree of triphones->senones
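
The benefit of mmap-able read-only model files can be illustrated with a small sketch. The file layout here (a little-endian 32-bit count followed by 32-bit floats) and the names `write_model` and `MappedModel` are invented for the example and do not describe the actual PocketSphinx model formats. The point is that values are read in place from the mapping, so pages are loaded on demand by the operating system and can be shared between processes that map the same file.

```python
import mmap
import struct

def write_model(path, values):
    """Write a toy binary model: uint32 count, then float32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack("<%df" % len(values), *values))

class MappedModel:
    """Read-only view of a toy model file, backed by mmap."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Length 0 maps the whole file; ACCESS_READ makes it read-only.
        self._map = mmap.mmap(self._file.fileno(), 0,
                              access=mmap.ACCESS_READ)
        (self.count,) = struct.unpack_from("<I", self._map, 0)

    def __getitem__(self, i):
        if not 0 <= i < self.count:
            raise IndexError(i)
        # Decode directly from the mapping; nothing is copied up front.
        (value,) = struct.unpack_from("<f", self._map, 4 + 4 * i)
        return value

    def close(self):
        self._map.close()
        self._file.close()
```

Opening a model this way costs almost nothing up front regardless of file size, since only the pages that are actually touched during recognition are brought into memory.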

Slide 11: Performance

Task         Vocabulary  Perplexity  xreal-time  Word Error
TIDIGITS                                         %
RM                                               %
WSJ devel5k                                      %

- Test platform: iPAQ MHz StrongARM running Linux (FPU emulation in kernel)
- Also running on:
  - Other embedded Linux platforms
  - Analog Devices Blackfin, uClinux
  - WinCE using GNU toolchain (untested)

Slide 12: How to get it
- Web Site:
- Compiles with GCC for i386, ARM, PowerPC, and Blackfin
- Cross-compiles using an arm-wince-pe toolchain (available in various Linux distributions) for Windows CE
- Compatible with Sphinx2 fbs.h interface
- Good (fast) acoustic models forthcoming

Slide 13: Future work
- Improve accuracy
- Remove Sphinx-II codebook limitations
- Optimize the language model and dictionary
- Statistical profiling of LM access patterns
- Investigate dynamic search strategies
- Remove various legacy code
- Fast speaker and channel adaptation

Slide 14: Thank you
Any questions?
This work was supported by DARPA grant NB CH D. The content of the information in this publication does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.

POCKETSPHINX: A FREE, REAL-TIME CONTINUOUS SPEECH RECOGNITION SYSTEM FOR HAND-HELD DEVICES David Huggins-Daines, Mohit Kumar, Arthur Chan, Alan W Black, Mosur Ravishankar, and Alex I. Rudnicky Carnegie