Automatic Speech Recognition on Mobile Devices and over Communication Networks

Zheng-Hua Tan and Børge Lindberg. Automatic Speech Recognition on Mobile Devices and over Communication Networks. Springer.

Contents

Preface
Contributors

1. Network, Distributed and Embedded Speech Recognition: An Overview
Zheng-Hua Tan and Imre Varga
    Introduction
    ASR and Its Deployment in Devices and Networks
    Automatic Speech Recognition
    Resources and Constraints of Mobile Devices
    Resources and Constraints of Communication Networks
    Architectural Solutions for ASR in Devices and Networks
    Network Speech Recognition
    Distributed Speech Recognition
    Feature Extraction
    Source Coding
    Channel Coding and Packetisation
    Error Concealment
    DSR Standards
    A Configurable DSR System
    Embedded Speech Recognition
    ESR Scenario
    Applications and Platforms
    Fixed-Point Arithmetic
    Optimisation
    Robustness
    Discussion
    References

Part I. Network Speech Recognition

2. Speech Coding and Packet Loss Effects on Speech and Speaker Recognition
Laurent Besacier
    Introduction
    Sources of Degradation in Network Speech Recognition
    Speech and Audio Coding Standards
    Packet Loss
    Effects on the Automatic Speech Recognition Task
    Experimental Setup
    Degradation Due to Simulated Packet Loss
    Degradation with Real Transmissions
    Degradation Due to Speech and Audio Codecs
    Effect for the Automatic Speaker Verification Task
    Speaker Verification Experiments Over Compressed Speech and Packet Loss
    Speaker Verification Experiments Over GSM Compressed Speech
    Conclusion
    Acknowledgments
    References

3. Speech Recognition Over Mobile Networks
Hong Kook Kim and Richard C. Rose
    Introduction
    Techniques for Improving ASR Performance Over Mobile Networks
    Bitstream-Based Approach
    Feature Transform
    Mel-Scaled LPCC
    LPC-Based MFCC (LP-MFCC)
    Pseudo-Cepstrum (PCEP) and Its Mel-Scaled Variant (MPCEP)
    Enhancement of ASR Performance Over Mobile Networks
    Compensation for the Effect of Mobile Systems
    Compensation for Speech Coding Distortion in LSP Domain
    Compensation for Channel Errors
    Conclusion
    References

4. Speech Recognition Over IP Networks
Hong Kook Kim
    Introduction
    Speech Recognition and IP Networks
    Relationship Between ASR Performance and Speech Quality
    Impact of Speech Coding Distortion
    Impact of Network Channel Distortion

    Robustness Against Packet Loss
    Rate Control
    Forward Error Correction
    Interleaving
    Error Concealment and ASR-Decoder-Based Concealment
    Speech Coder for Speech Recognition Over IP Networks
    MFCC-Based Speech Coder
    Efficient Vector Quantization of MFCCs
    Speech Quality Comparison
    ASR Performance Comparison
    Conclusion
    References

Part II. Distributed Speech Recognition

5. Distributed Speech Recognition Standards
David Pearce
    Introduction
    Overview of the Set of DSR Standards
    Scope of the Standards
    Electro-Acoustics
    Speech Detection or External Control Signal
    Pre-Processing
    Parameterisation
    Compression and Error Protection
    Formatting
    Error Detection and Mitigation
    Decompression
    Server Side Post Processing
    Feature Derivatives
    DSR Basic Front-End ES
    Feature Extraction
    Compression
    Error Detection and Mitigation
    DSR Advanced Front-End ES
    Feature Extraction
    VAD
    Compression
    Recognition Performance of the DSR Front-Ends
    Aurora Speech Databases and ETSI Performance Testing
    Aurora 3: Multilingual SpeechDat-Car Digits Small Vocabulary Evaluation
    3GPP Evaluations and Comparisons to AMR Coded Speech
    ETSI DSR Extended Front-End Standards ES and ES
    Transport Protocols: The IETF RTP Payload Formats for DSR
    Conclusion
    Acknowledgements
    References

6. Speech Feature Extraction and Reconstruction
Ben Milner
    Introduction
    Feature Extraction
    Basic Terminal-Side Feature Extraction
    Advanced Terminal-Side Feature Extraction
    Quantisation and Packetisation
    Server-Side Processing
    Speech Reconstruction
    Analysis of Received Speech Information
    Speech Reconstruction
    Prediction of Voicing and Fundamental Frequency
    Fundamental Frequency Prediction from MFCC Vectors
    Voicing Prediction from MFCC Vectors
    Speech Reconstruction from Predicted Fundamental Frequency and Voicing
    Conclusion
    References

7. Quantization of Speech Features: Source Coding
Stephen So and Kuldip K. Paliwal
    Introduction
    Quantization Schemes
    Brief Introduction to Quantization Theory
    Distortion Measures for Quantization in Speech Processing
    Scalar Quantization
    Block Quantization
    Vector Quantization
    GMM-Based Block Quantization
    Quantization of ASR Feature Vectors
    Introduction and Literature Review
    Statistical Properties of MFCCs
    Use of Cepstral Liftering for MFCC Variance Normalization
    Relationship Between the Distortion Measure and Recognition Performance
    Improving Noise Robustness: Perceptual Weighting of Filterbank Energies
    Experimental Results
    ETSI Aurora-2 Distributed Speech Recognition Task
    Experimental Setup
    Non-Uniform Scalar Quantization Using HRO Bit Allocation
    Unconstrained Vector Quantization
    GMM-Based Block Quantization
    Multi-frame GMM-Based Block Quantization
    Perceptually-Weighted Vector Quantization of Logarithmic Filterbank Energies
    Conclusion
    References

8. Error Recovery: Channel Coding and Packetization
Bengt J. Borgström, Alexis Bernard, and Abeer Alwan
    Distributed Speech Recognition Systems
    Characterization and Modeling of Communication Channels
    Signal Degradation Over Wireless Communication Channels
    Signal Degradation Over IP Networks
    Modeling Bursty Communication Channels
    Media-Specific FEC
    Media-Independent FEC
    Combining FEC with Error Concealment Methods
    Linear Block Codes
    Cyclic Codes
    Convolutional Codes
    Unequal Error Protections
    Frame Interleaving
    Optimal Spread Block Interleavers
    Convolutional Interleavers
    Decorrelated Block Interleavers
    Examples of Modern Error Recovery Standards
    ETSI DSR Standard (ETSI 2000)
    ETSI GSM/EFR Standard (ETSI 1998)
    Summary
    Acknowledgements
    References

9. Error Concealment
Reinhold Haeb-Umbach and Valentin Ion
    Introduction
    Speech Recognition in the Presence of Corrupted Features
    Modified Observation Probability
    Gaussian Approximation
    Feature Posterior Estimation in a DSR Framework
    ETSI DSR Standards
    Source Coder Redundancy
    Channel Models
    Estimation of Feature Posterior
    Related Work
    Performance Evaluations
    Experimental Setup
    Results on GSM Data Channel
    Results on Packet Erasure Channel
    Conclusion
    Acknowledgments
    References

Part III. Embedded Speech Recognition

10. Algorithm Optimizations: Low Computational Complexity
Miroslav Novak
    Introduction
    Common Limitations of Embedded Platforms
    Memory Limitations
    CPU Limitations
    Overview of an ASR System
    Front End
    Observation Model
    Model Organization
    Efficient Computation Strategies
    Search
    Viterbi Search Implementation
    Search Graph Construction
    Fast Match
    Alternative Decoding Schemes
    Conclusion
    Acknowledgments
    References

11. Algorithm Optimizations: Low Memory Footprint
Marcel Vasilache
    Introduction
    Notations and Problem Statement
    Model Complexity Control
    Akaike's Information Criterion
    Bayesian Information Criterion
    Second Order Approximation
    Other Measures
    Parameter Tying
    Model Level
    State Level
    Density Level
    Subspaces Clustering
    Parameter Representations
    Floating Point Representation
    Fixed Point Representation
    Quantization
    Quantized Parameters HMMs
    Scalar Quantization
    Vector Quantization
    Subspace Distribution Clustering HMM
    Subspace Partitioning
    Density Clustering

    Computational Complexity Implications
    Practicalities and Conclusion
    References

12. Fixed-Point Arithmetic
Enrico Bocchieri
    Introduction
    Fixed-Point Arithmetic
    Programming with Fixed-Point Numbers
    Fixed-Point Representation and Quantization
    LVCSR MAP Recognizer
    HMM State Likelihoods
    State Duration Model
    Language Model
    Viterbi Decoder
    Acoustic Front-End
    Fixed-Point Implementation of the Recognizer
    Log-Likelihoods
    Viterbi Frame-Synchronous Search
    Gaussian Parameters
    MFCC Front-End
    Experiments
    Real-Time on the Device
    Conclusion
    Acknowledgements
    References

Part IV. Systems and Applications

13. Software Architectures for Networked Mobile Speech Applications
James C. Ferrans and Jonathan Engelsma
    Introduction
    Embedded and Distributed Speech Engines
    The Voice Web
    Multimodal User Interfaces
    Distributed Speech Recognition
    Multimodal Architectures
    Simultaneous and Sequential Multimodality
    Mode Composition
    Classes of Multimodal Architectures
    Fully Embedded or "Fat Client" (a)
    Distributed Processing Engines (b)
    Thin Client (d)
    Remote Visual Interface (e)
    "Pudgy" Client (c)
    Discussion
    The "Plus V" Distributed Multimodal Architecture
    Other Distributed Multimodal Architectures

    Video Interactive Services with VoiceXML
    Multimodal for Set-Top Boxes
    Bare Minimum Mobile Voice Search
    A Transcription-Based Architecture
    Towards a Commercial Ecosystem
    Conclusion
    References

14. Speech Recognition in Mobile Phones
Imre Varga and Imre Kiss
    Introduction
    Applications of Speech Recognition for Mobile Phones
    Multilinguality and Language Support
    Multilingual Speaker Independent Name Dialing
    Multilinguality in Other ASR Applications
    Language Resources
    Noise Robustness
    Robust HMM Models
    Feature Extraction
    Noise Reduction
    Footprint and Complexity Reduction
    Footprint Reduction of Acoustic Models
    Footprint Reduction of Language Models
    Footprint Reduction of Pronunciation Lexicon
    Reduction of Computational Complexity in Embedded ASR Systems
    Low Memory, Fast Decoding
    Platforms and an Example Application
    Example Application: Large Vocabulary Isolated Word Dictation
    Conclusion and Outlook
    References

15. Handheld Speech to Speech Translation System
Yuqing Gao, Bowen Zhou, Weizhong Zhu and Wei Zhang
    Introduction
    System Overview
    System Architecture
    Hardware and OS Specifications
    Interface
    System Components and Optimization
    LVCSR on Handheld Devices
    Natural Language Understanding and Generation Based Translation
    Weighted Finite State Transducer Based Translation
    Embedded Speech Synthesis

    Experiments and Discussions
    Speech Recognition Experiments
    Translation Experiments
    Conclusion
    References

16. Automotive Speech Recognition
Harald Höge, Sascha Hohenner, Bernhard Kämmerer, Niels Kunstmann, Stefanie Schachtl, Martin Schönle, and Panji Setiawan
    Introduction
    Siemens Speech Processing From Research to Products
    Development for Performance and Quality
    High-Performance Recognizer
    Ultra-Compact Text-to-Speech Synthesizer
    Natural Voice Dialog
    Speaker Characterization and Recognition
    Example Automotive Voice Applications: Infotainment, Navigation, Manuals, and Internet
    Radio Station Selection
    MP3 Title Selection
    Navigation Destination Entry
    Manuals and Help Systems
    Access to Structured Web Content
    Access to Web Services
    Automotive Platform Issues and Challenges
    Hardware Constraints
    Software Constraints
    User Constraints
    Acoustic Channel
    Noise Robust Recognition Technology
    ASR Front-End
    Minimum Mean Square Weighting Rules
    Recursive Least Squares Weighting Rules
    Implementation of RLS Weighting Rules
    Recognition Results
    Methodology for Evaluation of Automotive Recognizers
    Quality Measurement Using SNR Curves
    Common Evaluation Procedures
    Proposed SNR-Approach
    Data Recording
    Evaluation
    Best Practice
    Conclusion
    References

17. Energy Aware Speech Recognition for Mobile Devices
Brian Delaney
    Introduction
    Battery Technology
    Energy Aware Design Principles
    Related Work
    Case Study of Distributed Speech Recognition Using the HP Labs Smartbadge System
    Signal Processing Front-End
    Energy Consumption of DSR with IEEE Wireless Networks
    Energy Consumption of DSR Using Bluetooth Networks
    Comparison of and Bluetooth in DSR
    Conclusion
    References

Index

Advances in Pattern Recognition

Advances in Pattern Recognition is a series of books which brings together current developments in all areas of this multi-disciplinary topic. It covers both theoretical