Confidence Measures: how much we can trust our speech recognizers


1 Confidence Measures: how much we can trust our speech recognizers Prof. Hui Jiang Department of Computer Science York University, Toronto, Ontario, Canada

2 Outline
- Speech recognition in multimedia communications
- Confidence measures: overview
- Confidence measure as utterance verification
  - The conventional method: anti-models
  - Our proposed approaches:
    - Approach 1: use accurate competing models
    - Approach 2: nested neighborhoods in HMM model space
- Conclusions

3 Speech Recognition in Multimedia Communications (I): server side
[Figure: network diagram. Desktops, laptops, and PDAs on the Internet (IP); telephones and mobile phones on telephone networks (3G); multimedia servers and communication servers on the server side.]
Multimedia content analysis:
- Audio/video segmentation
- Speech transcription
- Text understanding
- Speaker/topic identification
- Speech/audio/video indexing
- Audio mining

4 Multimedia Content Analysis for Information Retrieval
[Figure: block diagram. A multimedia stream (documents) goes through media separation into video, audio, and text. Blocks: Audio Segmentation, Speech Recognition, Speaker Identification, Language Understanding, Video Processing, Video Synchronization and Indexing. Intermediate outputs: speech, text, speaker ID, semantic content. Final output: index for information retrieval.]

5 Speech Recognition in Multimedia Communications (II): client side
[Figure: network diagram. Multimedia servers and communication servers reach desktops, laptops, and PDAs over the Internet (IP), and telephones and mobile phones over telephone networks (3G); an intelligent agent sits on the end-user side (client side).]
Natural user interface:
- Speech input/output
- Multi-modal interface
- Dialog systems
- Intelligent agent

6 Intelligent Agent
[Figure: block diagram of a dialogue system between users and multimedia servers. Inputs: speech input, text input. Blocks: Speech Recognition, Language Understanding, Context Tracking, Dialogue Manager, Domain Knowledge, Response Generation, Text-to-Speech Synthesis. Outputs: text output, speech output.]
Key issues:
- Robust speech recognition
- Spoken language understanding
- Dialogue modeling

7 Speech Recognition: state of the art
The technology allows us to easily build a state-of-the-art speech system for a variety of applications.
Building a system is not an issue, given:
- A large training corpus (in-house, LDC for public data, etc.)
- Software tools (HTK, etc.)
Many demo systems exist, but few are used in the field.
Most speech recognition systems are not usable in practice. Reasons:
- Vulnerable to various interferences (noise, accents, etc.)
- Not stable
- Not reliable
- Not friendly to non-expert users

8 A Few Examples from the DARPA Communicator system
User: do they have a flight that leaves earlier in the day
Recognizer: in have flight that is earlier mid-day

User: i cannot continue this conversation
Recognizer: i not continue this comfort-inn

User: my starting location is Fort-Wayne Indiana
Recognizer: my starting location is four morning Indiana

9 Improve Usability of Speech Recognition Systems
Robust speech recognition:
- Make the system more robust in practical applications
Confidence measures:
- Attach to each recognized word a score that indicates how reliably it was recognized
- Focus on words with high scores; discard low-score words
- Detect and correct recognition errors
- Reject out-of-vocabulary words used by non-expert users
- Selectively interpret recognition results

10 Previous Examples with confidence measure attached
User: do they have a flight that leaves earlier in the day
Recognizer: in (0.43) have (0.65) flight (0.83) that (0.34) is (0.27) earlier (0.76) mid-day (0.12)

User: i cannot continue this conversation
Recognizer: i (0.91) not (0.69) continue (0.87) this (0.79) comfort-inn (0.16)

User: my starting location is Fort-Wayne Indiana
Recognizer: my (0.74) starting (0.92) location (0.83) is (0.78) four (0.31) morning (0.06) Indiana (0.90)
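A minimal sketch of how the per-word scores above could be used to keep reliable words and flag the rest. The function name `split_by_confidence` and the threshold of 0.5 are assumptions made for illustration; the slides do not specify a threshold.

```python
from typing import List, Tuple

ScoredWord = Tuple[str, float]  # (recognized word, confidence score)


def split_by_confidence(
    hypothesis: List[ScoredWord], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split a scored hypothesis into accepted and rejected words.

    The threshold value is an assumption; in practice it would be tuned
    on held-out data.
    """
    accepted = [word for word, score in hypothesis if score >= threshold]
    rejected = [word for word, score in hypothesis if score < threshold]
    return accepted, rejected


# Second example from the slide.
hyp = [
    ("i", 0.91),
    ("not", 0.69),
    ("continue", 0.87),
    ("this", 0.79),
    ("comfort-inn", 0.16),
]
accepted, rejected = split_by_confidence(hyp)
print("accepted:", accepted)
print("rejected:", rejected)
```

With this threshold the misrecognized "comfort-inn" is rejected while the correctly recognized words survive. The deleted word ("cannot" became "not") is not caught, since a per-word score can only judge words that the recognizer actually produced.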

11 How to calculate confidence measures (CM) for speech recognition
- CM as a combination of predictor features
- CM as a posterior probability
- CM as utterance verification

12 Speech & Audio Processing
[Figure: processing pipeline. A sliding window over the waveform feeds feature extraction (linear prediction, filter bank), which yields feature vectors. The feature vectors feed audio/speech coding (bit stream for transmission), audio segmentation (speech/music/noise), and speech recognition (words).]

13 Automatic Speech Recognition
[Figure: speech feature vectors X pass through acoustic matching (acoustic models: hidden Markov models, HMMs) to a phoneme sequence P, then through language matching (language models: Markov chain, N-gram) to the word sequence W, e.g. "How are you?".]
Decoding rule:
$$\hat{W} = \arg\max_{W} \Pr(W \mid X) = \arg\max_{W} \frac{\Pr(X \mid W)\,\Pr(W)}{\Pr(X)} = \arg\max_{W} \Pr(X \mid W)\,\Pr(W)$$
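The decoding rule on this slide picks the word sequence that maximizes the product of the acoustic score Pr(X | W) and the language model score Pr(W). A toy sketch in the log domain follows; the candidate list and all score values are invented for illustration, and a real decoder searches a far larger hypothesis space.

```python
import math
from typing import Dict, Tuple

# Hypothetical candidates: word sequence -> (log Pr(X | W), log Pr(W)).
candidates: Dict[str, Tuple[float, float]] = {
    "how are you": (-120.0, math.log(0.02)),
    "how or you": (-119.5, math.log(0.0001)),
}


def map_decode(cands: Dict[str, Tuple[float, float]]) -> str:
    """Return the candidate maximizing log Pr(X | W) + log Pr(W).

    Pr(X) is the same for every candidate, so it can be dropped
    from the maximization.
    """
    return max(cands, key=lambda w: cands[w][0] + cands[w][1])


best = map_decode(candidates)
print(best)
```

Here the second candidate has a slightly better acoustic score, but the language model prior makes "how are you" the winner.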

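14 CM as Utterance Verification
[Figure: the waveform of "How are you?" with its recognition result as a phoneme sequence; each recognized phoneme segment is labeled with its HMM ($\Lambda_H$, $\Lambda_W$, $\Lambda_@$, $\Lambda_r$, $\Lambda_y$, $\Lambda_U$).]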

15 Verify every phoneme segment
Assume one speech segment $X_i$ is recognized as the phoneme @, represented by the HMM $\Lambda_@$.
Verify the decision under the framework of statistical hypothesis testing:
  H0: $X_i$ is truly from $\Lambda_@$
  vs.
  H1: $X_i$ is not from $\Lambda_@$
The solution is likelihood ratio (LR) testing:
$$LR = \frac{\Pr(X_i \mid H_0)}{\Pr(X_i \mid H_1)} > \eta$$
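In practice the likelihood ratio test is usually evaluated in the log domain to avoid numerical underflow. A minimal sketch, assuming the two log-likelihoods have already been computed by some model; the function names and the default threshold are illustrative, not taken from the slides.

```python
def log_likelihood_ratio(log_p_h0: float, log_p_h1: float) -> float:
    """log LR = log Pr(X_i | H0) - log Pr(X_i | H1)."""
    return log_p_h0 - log_p_h1


def accept_segment(log_p_h0: float, log_p_h1: float, log_eta: float = 0.0) -> bool:
    """Accept H0 (the recognized phoneme is correct) when LR > eta.

    log_eta = 0.0 corresponds to eta = 1; this default is an assumption.
    """
    return log_likelihood_ratio(log_p_h0, log_p_h1) > log_eta


print(accept_segment(-42.0, -45.5))  # H0 explains the segment better
print(accept_segment(-50.0, -45.5))  # H1 explains the segment better
```

Raising `log_eta` rejects more segments (fewer false acceptances, more false rejections); lowering it does the opposite.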

16 Problem: how to calculate the denominator
For the numerator in LR testing:
$$\Pr(X_i \mid H_0) = l(X_i \mid \Lambda_@)$$
What about the denominator? H1 is not easy to model. It is a composite hypothesis that includes many possibilities:
a) $X_i$ is from another phoneme rather than @
b) $X_i$ is a speech sound not in the phoneme set
c) $X_i$ is a non-speech noise
d) None of the above

17 The conventional method: using anti-models
For each phoneme a, estimate an anti-model $\bar{\Lambda}_a$ using all training data that are not from the phoneme. The anti-model is an HMM with the same structure as $\Lambda_a$.
Every phoneme a has two models:
- A regular recognition model $\Lambda_a$
- An anti-model $\bar{\Lambda}_a$
In the previous example:
$$LR_i = \frac{l(X_i \mid \Lambda_@)}{l(X_i \mid \bar{\Lambda}_@)} > \eta$$
$LR_i$ can be used as a confidence measure for this speech segment.
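To make the anti-model idea concrete, here is a toy sketch that replaces HMMs with one-dimensional Gaussians. That substitution is a simplifying assumption for illustration only, and the training values are invented. The anti-model for phoneme "a" is fit on all training data that does not belong to "a".

```python
import math
from statistics import fmean, pvariance
from typing import Dict, List, Tuple

Gaussian = Tuple[float, float]  # (mean, variance); stand-in for an HMM


def fit_gaussian(samples: List[float]) -> Gaussian:
    """Maximum-likelihood mean and variance of the samples."""
    return fmean(samples), pvariance(samples)


def log_likelihood(frames: List[float], model: Gaussian) -> float:
    """Sum of per-frame Gaussian log densities."""
    mean, var = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )


# Invented toy training data, keyed by phoneme label.
training: Dict[str, List[float]] = {
    "a": [0.9, 1.1, 1.0, 1.2, 0.8],
    "b": [3.0, 3.2, 2.8],
    "c": [-1.0, -1.2, -0.8],
}

model_a = fit_gaussian(training["a"])
# Anti-model: trained on everything that is NOT phoneme "a".
anti_a = fit_gaussian([x for ph, xs in training.items() if ph != "a" for x in xs])


def log_lr(frames: List[float]) -> float:
    """Log likelihood ratio of the regular model against its anti-model."""
    return log_likelihood(frames, model_a) - log_likelihood(frames, anti_a)


print(log_lr([1.0, 1.05]))  # segment close to phoneme "a"
print(log_lr([3.1, 2.9]))   # segment close to phoneme "b"
```

The anti-model pools very different sounds into one broad distribution, which hints at why it is a coarse approximation of H1.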

18 New verification method based on accurate competing models (I)
If we cannot enumerate and model all situations in H1, approximate it with only the most significant one.
a. $X_i$ is from another phoneme rather than the recognized one
   1) $X_i$ is from competing models
   2) $X_i$ is from non-competing models
b. $X_i$ is a speech sound not in the phoneme set
c. $X_i$ is a non-speech noise
d. None of the above
In most cases, especially when $X_i$ was misrecognized, the probability of case a.1) dominates the total probability under H1, i.e., $\Pr(X_i \mid H_1)$.

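19 How to formulate the competing models
The competing tokens of a model $\Lambda$ are defined as all data segments (in the training set) that are mis-recognized as $\Lambda$ during recognition.
For any HMM $\Lambda$, we have two sets:
- The true token set: $S_t(\Lambda)$
- The competing token set: $S_c(\Lambda)$
Given the decision that data $X$ is recognized as $\Lambda$:
  H0: $X$ is truly from model $\Lambda$  vs.  H1: $X$ is not from model $\Lambda$
becomes
  H0: $X \in S_t(\Lambda)$  vs.  H1: $X \in S_c(\Lambda)$
$$LR = \frac{\Pr(X \mid H_0)}{\Pr(X \mid H_1)} = \frac{\Pr(X \mid S_t(\Lambda))}{\Pr(X \mid S_c(\Lambda))} = \frac{p(X \mid \lambda_t)}{p(X \mid \lambda_c)}$$
where $\lambda_c$ is the competing model.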
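A minimal sketch of how the two token sets might be collected from recognition output on the training set. It assumes each training segment comes with its reference label and its recognized label, and that a segment counts as a competing token of the model it was wrongly recognized as. The tuple layout and names are hypothetical; the slides describe an in-search selection procedure that is not reproduced here.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple

# (reference phoneme, recognized phoneme, segment identifier)
Token = Tuple[str, str, str]


def build_token_sets(
    tokens: Iterable[Token],
) -> Tuple[DefaultDict[str, List[str]], DefaultDict[str, List[str]]]:
    """Group segments into true and competing token sets per model."""
    true_sets: DefaultDict[str, List[str]] = defaultdict(list)
    competing_sets: DefaultDict[str, List[str]] = defaultdict(list)
    for reference, recognized, segment in tokens:
        if reference == recognized:
            # Correctly recognized: a true token of this model.
            true_sets[recognized].append(segment)
        else:
            # Mis-recognized: a competing token of the model it was
            # wrongly assigned to (assumption stated in the lead-in).
            competing_sets[recognized].append(segment)
    return true_sets, competing_sets


# Invented example alignment.
alignment = [
    ("a", "a", "seg1"),
    ("b", "a", "seg2"),
    ("a", "a", "seg3"),
    ("a", "b", "seg4"),
]
true_sets, competing_sets = build_token_sets(alignment)
print(dict(true_sets))
print(dict(competing_sets))
```

One model would then be trained on each set, giving the pair $\lambda_t$ and $\lambda_c$ used in the likelihood ratio.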

20 In-Search Data Selection method
[Figure: a triphone sequence along the reference segmentation, with the state search beam plotted over time; at word-ending active paths, token selection assigns phone segments (context a-b+c) to the true token sets or the competing token sets.]

21 New verification method based on nested neighborhood in model space
[Figure: HMM model space showing the original model, the competing model, and other models, enclosed by three nested neighborhoods: tight (1), medium (2), and large (3).]

22 Verification based on nested neighborhood in model space (II)
Given the decision that data $X$ is recognized as $\Lambda$:
  H0: $X$ is truly from model $\Lambda$  vs.  H1: $X$ is not from model $\Lambda$
becomes
  H0: the true model of $X$ lies in the small neighborhood $\Lambda_s$  vs.  H1: the true model of $X$ lies in the region $\Lambda_b \setminus \Lambda_s$
[Figure: small neighborhood $\Lambda_s$ nested inside big neighborhood $\Lambda_b$.]

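23 Verification based on nested neighborhood in model space (III)
Bayes factors: the tool to implement the verification
$$BF = \frac{\int_{\Lambda_s} l(X \mid \lambda)\,\rho_s(\lambda)\,d\lambda}{\int_{\Lambda_b \setminus \Lambda_s} l(X \mid \lambda)\,\rho_b(\lambda)\,d\lambda} > \eta$$
- The model $\lambda$ is viewed as a random variable in model space
- $l(X \mid \lambda)$ is the likelihood function of any speech data $X$
- $\rho_s(\lambda)$ is the prior distribution of $\lambda$ inside the neighborhood $\Lambda_s$
- $\rho_b(\lambda)$ is the prior distribution of $\lambda$ in the region $\Lambda_b \setminus \Lambda_s$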

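24 Specify Neighborhoods and Priors (I)
A parametric neighborhood: the $(C, \rho)$ neighborhood $\Lambda_{(C,\rho)}(\lambda^*)$ of the original model $\lambda^*$ contains every model $\lambda$ whose Gaussian mean components $m_{ikd}$ lie in a symmetric interval centered on the original means $m^*_{ikd}$, with a half-width determined by $C$, $\rho$, and the dimension index $d$.
- Use a small $(C_1, \rho_1)$ for the small neighborhood $\Lambda_s$ (same for all Gaussian components in all HMMs)
- Use a large $(C_2, \rho_2)$ for the big neighborhood $\Lambda_b$ (same for all Gaussian components in all HMMs)
- Priors: constrained uniform pdf in the neighborhood
- Tune $C_1, \rho_1, C_2, \rho_2$ (with $C_1 < C_2$, $\rho_1 < \rho_2$) for best performance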

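25 Specify Neighborhoods and Priors (II)
A non-parametric neighborhood:
[Figure: a set of models $\lambda_1, \ldots, \lambda_5$ scattered around the original model $\lambda^*$.]
Define $\delta$-function priors over the $N$ models in a neighborhood $\Lambda$:
$$\rho(\lambda) = \frac{1}{N} \sum_{\lambda_i \in \Lambda} \delta(\lambda - \lambda_i)$$
The BF calculation is simplified:
$$BF = \frac{\sum_{\lambda_i \in \Lambda_s} l(X \mid \lambda_i) / N_s}{\sum_{\lambda_i \in \Lambda_b} l(X \mid \lambda_i) / N_b}$$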
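With delta-function priors the Bayes factor reduces to a ratio of average likelihoods over two finite sets of models. A minimal sketch in the log domain follows, assuming the per-model log-likelihoods of the segment have already been computed; the function names and input values are illustrative. The log-sum-exp trick keeps the averages numerically stable for the very small likelihoods typical of speech segments.

```python
import math
from typing import Sequence


def log_mean_exp(log_values: Sequence[float]) -> float:
    """Stable log of the arithmetic mean of exp(log_values)."""
    peak = max(log_values)
    total = sum(math.exp(v - peak) for v in log_values)
    return peak + math.log(total) - math.log(len(log_values))


def log_bayes_factor(
    loglik_small: Sequence[float], loglik_big: Sequence[float]
) -> float:
    """log BF = log mean likelihood over the small-neighborhood models
    minus log mean likelihood over the big-neighborhood models."""
    return log_mean_exp(loglik_small) - log_mean_exp(loglik_big)


# Invented log-likelihoods l(X | lambda_i) for the models in each set.
small = [-1000.0, -1001.5, -999.0]
big = [-1010.0, -1008.0, -1012.0, -1009.0]
print(log_bayes_factor(small, big))
```

The segment is accepted when the result exceeds the log of the threshold $\eta$.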

26 Experiments
- Tested on the Bell Labs Communicator system for the DARPA Communicator project (a travel reservation task).
- Confidence measures are used to detect recognition errors in the results given by our best recognizer.
- The best recognition performance: 15.8% word error rate.
- Task: verify correctly recognized words vs. mis-recognized words.
- For each utterance, calculate a confidence measure for every recognized phoneme, then combine them for each word.
- Baseline: likelihood ratio testing based on anti-models
- New approach 1: likelihood ratio testing based on competing models
- New approach 2: Bayes factors based on nested neighborhoods
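The slide states that phoneme-level confidence measures are combined into a word-level score but does not say how. The sketch below uses a plain arithmetic mean of the phoneme scores, which is an assumption chosen for simplicity and not necessarily the combination used in the experiments.

```python
from typing import Sequence


def word_confidence(phoneme_scores: Sequence[float]) -> float:
    """Combine phoneme-level scores (e.g. log likelihood ratios) into
    one word-level score by arithmetic averaging (assumed rule)."""
    if not phoneme_scores:
        raise ValueError("a word needs at least one phoneme score")
    return sum(phoneme_scores) / len(phoneme_scores)


# Invented phoneme scores for one recognized word.
print(word_confidence([2.0, -1.0, 0.5]))
```

Averaging normalizes for word length, so long and short words can share one threshold; other choices, such as the minimum phoneme score, would be more sensitive to a single bad segment.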

27 Experimental results of new approach 1: ROC curves
[Figure: ROC curves on test set Eva00 for three verification models (std anti mono, new mono, new tri). Axes: False Alarm Rate vs. False Rejection Rate.]

28 Experimental results of new approach 1: equal error rate (EER)

Method                               Equal Error Rate (EER)
Baseline                             40.0%
New Approach 1 (mono-phone model)    27.3%
New Approach 1 (tri-phone model)     24.6%
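The equal error rate is the operating point at which the false alarm rate (mis-recognized words accepted) equals the false rejection rate (correct words rejected). A minimal sketch of how it could be estimated from two lists of confidence scores by sweeping the threshold over the observed scores; the scores below are invented, and with finite data the two rates may not match exactly, so the closest point is reported.

```python
from typing import Sequence, Tuple


def error_rates(
    correct_scores: Sequence[float],
    error_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (false alarm rate, false rejection rate) at a threshold.

    A word is accepted when its score is >= threshold.
    """
    false_rejections = sum(s < threshold for s in correct_scores)
    false_alarms = sum(s >= threshold for s in error_scores)
    return (
        false_alarms / len(error_scores),
        false_rejections / len(correct_scores),
    )


def equal_error_rate(
    correct_scores: Sequence[float], error_scores: Sequence[float]
) -> float:
    """Approximate EER: the average of the two rates at the threshold
    where they are closest."""
    best_gap, best_rate = None, None
    for threshold in sorted(set(correct_scores) | set(error_scores)):
        fa, fr = error_rates(correct_scores, error_scores, threshold)
        gap = abs(fa - fr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fa + fr) / 2
    return best_rate


# Invented scores: correctly recognized words vs. mis-recognized words.
correct = [0.9, 0.8, 0.7, 0.3]
errors = [0.6, 0.2, 0.1, 0.4]
eer = equal_error_rate(correct, errors)
print(eer)
```

A lower EER means the confidence measure separates correct and incorrect words better, which is how the rows of the table are compared.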

29 Experimental results of new approach 2: ROC curves
[Figure: comparison of ROC curves for the baseline (anti-models), the (C, ρ) neighborhood, the delta neighborhood (N=1500), and the Dist2C neighborhood. Axes: False Acceptance vs. False Rejection.]

30 Experimental results of new approach 2: equal error rate (EER)

Method                                       Equal Error Rate (EER)
Baseline                                     40.0%
Parametric neighborhood (global)             36.7%
Parametric neighborhood (state-dependent)    32.4%
Non-parametric neighborhood (case A)         31.5%
Non-parametric neighborhood (case B)         31.0%

31 Conclusions
- Reliable confidence measures (CM) play a key role in building a successful speech recognition system.
- Utterance verification is a good framework for calculating confidence measures.
- We proposed two new approaches for CM calculation:
  - Using accurate competing models
  - Using neighborhood information in HMM space and Bayes factors
- Both new approaches outperformed the conventional anti-model method in our experiments.
- Future work:
  - How to interpret H1 in other ways
  - In approach 2, how to define a good neighborhood quantitatively in the high-dimensional HMM model space
