A long, deep and wide artificial neural net for robust speech recognition in unknown noise

Feipeng Li, Phani S. Nidadavolu, and Hynek Hermansky
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, 21218

Abstract

A long, deep and wide artificial neural net (LDWNN) with multiple neural nets for individual frequency subbands is proposed for robust speech recognition in unknown noise. It is assumed that the effect of arbitrary additive noise on speech recognition can be approximated by white noise (or speech-shaped noise) of similar level across multiple frequency subbands. The neural nets are trained in clean speech and in speech-shaped noise at 20, 10, and 5 dB SNR to accommodate noise of different levels, followed by a neural net trained to select the most suitable neural net for optimum information extraction within a frequency subband. The posteriors from multiple frequency subbands are fused by another neural net to give a more reliable estimation. Experimental results show that the subband ensemble net adapts well to unknown noise.

Index Terms: structured deep neural networks, multistream recognition of speech, ensemble neural net, noise robustness

This work was supported by the Defense Advanced Research Projects Agency (DARPA) RATS project D10PC0015 and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research Laboratory contract number W911NF-12-C-0013. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL, or the U.S. Government.

1. Introduction

How to deal with unknown noise in realistic environments is a long-standing problem for automatic speech recognition (ASR). Today most ASR systems use an artificial neural net (ANN) for front-end processing. It works well when the test condition matches the training condition, but it deteriorates quickly when the speech is corrupted by noise. A common approach to improve system robustness is to train the neural net in all expected conditions [8, 4], which generally gives a system of better generalization ability but poorer performance in matched conditions. For example, a neural net trained on both clean speech and noisy speech at 0 dB SNR white noise may show better performance for an unseen condition of noisy speech at 10 dB SNR white noise than two neural nets trained in only one of the two conditions, but it usually performs worse on the matched conditions. In many realistic environments, both the noise level and the spectral shape may change with time. It is impractical to train a single neural net and expect it to show good performance for all noises.

The human auditory system, in contrast, shows superb performance as well as adaptation ability in almost all noisy conditions. According to [1], the cochlear frequency range spans about 20 articulatory bands from 0.3 to 8 kHz that act as processing channels for speech recognition. Speech events are extracted by hundreds of inner hair cells in every articulatory band, and this information is then assembled in the central auditory system. The capacity of each processing channel is determined by the signal-to-noise ratio within the channel.
Inspired by the parallel processing in human speech perception, a multi-stream speech recognition system has been proposed, in which the full band speech is divided into multiple subbands [9, 10, 16]. Motivations and history of multi-stream processing can be found in [18]. In the current study, a neural net is trained within each subband to classify the speech sound given the partial acoustic evidence. The speech information from multiple subbands is integrated by a set of fusion neural nets. To maximize the information being extracted from individual subbands in both clean and noisy conditions, we have conducted several experiments to optimize the training of subband neural nets. It is shown that a neural net trained in the clean condition shows poor generalization in noise, while neural nets trained in noise generally show poor performance on clean speech. A multi-style trained neural net maintains a balance between the two by sacrificing performance for generalization ability. In this study we develop an LDWNN for robust speech recognition by replacing the single subband neural net of the multi-stream system with an ensemble of neural nets that use relatively long spans of the speech signal as their inputs and are trained on the clean signal and on the signal with various levels of added noise. The subband ensemble nets appear to generalize well in unknown noise.
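To make the multi-stream data flow above concrete, here is a minimal shape-level sketch: a critical-band feature matrix is split into subband streams, each stream is scored by its own small classifier, and a fusion classifier operates on the concatenated subband posteriors. The band grouping, layer sizes, and randomly initialized weights are illustrative placeholders only, not the trained networks or configuration of the actual system.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mlp(x, w1, b1, w2, b2):
    # One hidden layer with sigmoid units, softmax output layer.
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    return softmax(h @ w2 + b2)

n_frames, n_bands, n_phones = 200, 21, 40          # placeholder sizes
feats = rng.standard_normal((n_frames, n_bands))   # critical-band features

# Divide the full band into 7 subband streams of 3 critical bands each.
streams = [feats[:, i:i + 3] for i in range(0, n_bands, 3)]

# One classifier per subband (random weights as stand-ins for trained ANNs).
subband_nets = [(rng.standard_normal((3, 50)), np.zeros(50),
                 rng.standard_normal((50, n_phones)), np.zeros(n_phones))
                for _ in streams]
posteriors = [mlp(s, *net) for s, net in zip(streams, subband_nets)]

# Fusion classifier sees the concatenated subband posteriors.
fusion_in = np.concatenate(posteriors, axis=1)      # (n_frames, 7 * n_phones)
fusion_net = (rng.standard_normal((fusion_in.shape[1], 100)), np.zeros(100),
              rng.standard_normal((100, n_phones)), np.zeros(n_phones))
fused_posteriors = mlp(fusion_in, *fusion_net)      # (n_frames, n_phones)
print(fused_posteriors.shape)                       # -> (200, 40)
```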

Figure 1: Approximation of unknown noise (solid blue curve) by narrow-band white noise (short horizontal red lines) across multiple frequency subbands.

Figure 2: LDWNN with multiple frequency subbands for robust speech recognition: the speech signal passes through subband feature extraction into ensemble-nets 1 through N, followed by fusion and a decoder that outputs the phone sequence.

2. Approach

The LDWNN is developed based on the multi-stream framework for parallel processing. It is assumed that the effect of non-stationary additive noise on neural net training or testing can be simulated by an artificial noise of similar spectral shape. Fig. 1 shows how the arbitrary spectral shape of unknown noise is approximated by using white noise of the same level across multiple frequency subbands.

In speech recognition a neural net is trained to estimate the a posteriori probability of sound classes given the acoustic evidence [15]. It performs best when the distribution of the testing data matches that of the training data. When the speech sound is corrupted by noise, the mismatch between the training and testing data can cause the performance to drop quickly. Adding noise to the training data helps to improve the generalization of the neural net, but it reduces performance in matched conditions. The higher the noise level, the less a neural net can learn. In multi-style training the speech features are mixed with different types of noise at various levels. It generally produces a system of mediocre performance, especially when the training noises differ dramatically in level and spectral shape. A more practical solution is to train an ensemble of neural nets at various noise levels from low to high signal-to-noise ratios. For any unknown noise, the posteriors of the neural net that gives the best performance are selected for each frequency subband.

2.1. LDWNN

Figure 2 depicts the block diagram of the LDWNN for robust speech recognition. The full band speech with unknown additive noise is decomposed into multiple (7 or 5 for wide/narrow-band speech) frequency subbands, each covering about 3 Bark along the auditory frequency axis. Within each subband the information is derived from 300 ms long spans of the speech signal using Frequency Domain Linear Prediction [17, 13, 12], which characterizes the long-term temporal modulation as well as the short-term spectral modulation. A subband ensemble net is trained to estimate the posterior probability of a phoneme in clean speech or in noise given the partial acoustic evidence. A neural net is trained to fuse the speech information from multiple subbands to give a more reliable estimation. The system is long as it takes 300 ms segments of the input signal as the evidence for its initial subband classification and concatenates outputs from the first classification stage over 200 ms, thus effectively using 500 ms signal spans for deriving the estimated posterior probability of each speech sound; it is deep as it includes three stages of neural nets for feature extraction, stream selection, and fusion in sequence; and it is wide because the full band speech is divided into multiple subbands, each with an ensemble net for various levels of noise.
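As a rough numerical illustration of the Fig. 1 assumption and the roughly 3-Bark subband layout described in Sec. 2.1, the sketch below computes Bark-spaced band edges for 0-8 kHz wide-band speech and replaces an arbitrary noise power spectrum by its average (flat, white-noise-like) level inside each subband. The Bark conversion is the common Zwicker-Terhardt approximation, and the synthetic noise spectrum and band count are assumptions for illustration, not the exact values used in the system.

```python
import numpy as np

def hz_to_bark(f_hz):
    # Zwicker & Terhardt style approximation of the Bark scale.
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_to_hz(z, f_grid=np.linspace(1.0, 8000.0, 8000)):
    # Invert numerically on a dense grid (good enough for band edges).
    return np.interp(z, hz_to_bark(f_grid), f_grid)

# 7 subbands of roughly 3 Bark each, covering 0-8 kHz wide-band speech.
n_subbands = 7
edges_bark = np.linspace(0.0, hz_to_bark(8000.0), n_subbands + 1)
edges_hz = bark_to_hz(edges_bark)

# Illustrative "unknown" noise power spectrum on a linear frequency grid.
freqs = np.linspace(0.0, 8000.0, 513)
noise_psd = 1e-3 + np.exp(-freqs / 1500.0)        # arbitrary low-pass shape

# Approximate the noise by a flat (white) level inside each subband,
# as sketched in Fig. 1.
flat_levels = []
for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
    band = (freqs >= lo) & (freqs < hi)
    flat_levels.append(noise_psd[band].mean())

for (lo, hi), lvl in zip(zip(edges_hz[:-1], edges_hz[1:]), flat_levels):
    print(f"{lo:7.0f}-{hi:7.0f} Hz: {10 * np.log10(lvl):6.1f} dB")
```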
Figure 3: Subband ensemble net for optimal information extraction. The subband speech feature is passed to ANN-CL, ANN-L, ANN-M, and ANN-H, whose outputs are concatenated with hierarchical temporal context (HTC, ±5 and ±10 frames) and fed to an ANN selector that outputs posterior probabilities. ANN-CL, ANN-L, ANN-M, and ANN-H are trained in clean, low, middle, and high levels of noise to accommodate time-varying unknown noise.

The proposed speech recognition system uses the Artificial Neural Network - Hidden Markov Model (ANN-HMM) technique [3]. The dimensionality of the pre-sigmoid outputs of the subband ensemble nets is reduced to 25 by applying the Karhunen-Loeve Transform (KLT). Reduced vectors from all subband classifiers at the current time and at times ±50 ms and ±100 ms around this time are then concatenated [14] to form an input for the fusion neural net, which estimates the final posterior probability of phonemes given the acoustic evidence. For simplicity, phoneme classes are single-state monophones without using any context information. The Viterbi algorithm is applied to decode the phoneme sequence. The state transition matrix uses equal probabilities for self-transitions and next-state transitions.
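A minimal sketch of the reduction and context-stacking step just described, assuming 10 ms frames: the pre-sigmoid outputs of one subband ensemble net are projected to 25 dimensions with a KLT (eigenvectors of the sample covariance), and the reduced vectors at the current frame and at ±5 and ±10 frames (±50 and ±100 ms) are concatenated to form that subband's contribution to the fusion-net input. The array sizes and the random stand-in data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def klt_basis(x, k):
    # Karhunen-Loeve Transform: top-k eigenvectors of the sample covariance.
    xc = x - x.mean(axis=0)
    cov = np.cov(xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]            # sort by decreasing eigenvalue
    return x.mean(axis=0), evecs[:, order[:k]]

def add_context(x, offsets):
    # Stack frames at the given offsets; edge frames are clamped.
    n = x.shape[0]
    idx = np.arange(n)
    cols = [x[np.clip(idx + o, 0, n - 1)] for o in offsets]
    return np.concatenate(cols, axis=1)

n_frames, net_out_dim, klt_dim = 500, 120, 25
offsets = (-10, -5, 0, 5, 10)                  # +/-100 ms and +/-50 ms at 10 ms frames

# Pre-sigmoid outputs of one subband ensemble net (random stand-in data).
pre_sigmoid = rng.standard_normal((n_frames, net_out_dim))

mean, basis = klt_basis(pre_sigmoid, klt_dim)
reduced = (pre_sigmoid - mean) @ basis         # (n_frames, 25)

fusion_input_one_band = add_context(reduced, offsets)   # (n_frames, 125)
print(fusion_input_one_band.shape)

# The full fusion-net input is the concatenation of such blocks over all subbands.
```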

Figure 4: Phoneme Error Rate (PER) of the subband ensemble net (denoted as ensemble) for the LDWNN system vs. single neural nets trained in clean or in both clean and noise (denoted as clean and mixed respectively) for the MS system, for (a) subband 2 (CF 0.5 kHz), (b) subband 4 (CF 2 kHz), and (c) subband 6 (CF 4 kHz).

2.2. Subband Ensemble Net

Figure 3 depicts the block diagram of the subband ensemble net for optimal information extraction from a frequency subband. A set of N neural nets, including ANN-CL trained in clean speech and several others (ANN-L/M/H) trained in various levels of noise, are employed to cover all possible conditions. In this study, we use the training speech signal in clean and at 20, 10, and 5 dB SNR, because a neural net trained in noise seems to generalize well within ±5 dB SNR. Instead of using white noise, which has a much bigger masking effect at high frequencies than at low frequencies, babble noise is used because it has the same long-term spectral shape as the speech signal and therefore provides balanced masking across all frequency subbands. The fusion ANN is trained on all noise levels. It is critical that outputs from the subband stages are presented during training at different noise levels in random order. For that purpose, we created a training set that contains speech of the N possible conditions in randomized order. To make sure that the prior probability is equal for all noises, a sequence of pseudo-random numbers with uniform distribution between 1 and N is used as the index of the noise type to be applied at any given time during the training.

3. Experiments

Three experiments are conducted to test the proposed approach using the TIMIT speech corpus. The first experiment evaluates the maximum performance and generalization ability of the subband ensemble net as the noise level increases from clean through 20, 15, 10, and 5 to 0 dB SNR in babble noise. The second experiment evaluates the performance of the overall system after fusion. In the third experiment the LDWNN system is assessed in unknown noise. The results are compared with those of the single-stream (SS) baseline using MFCC, PNCC, and FDLP features and a multi-stream (MS) baseline system [12] without and with a performance monitor, denoted as MS and MS-PM respectively. Both MS systems have only one neural net in every subband. The subband neural nets are trained in either clean (denoted as clean) or both clean and noisy conditions (denoted as mixed). For simplicity the SS, MS, MS-PM, and LDWNN systems are built as single-state monophone systems without using any context information. All neural nets are three-layer MLPs with a hidden layer of 1000 units. The TIMIT corpus has 5040 sentences, of which 3696 are used for training and the remaining 1344 for testing. The target phoneme set contains 40 phonemes. The speech sounds are sampled at 16 kHz.

Figure 5: Performance of the LDWNN system with the subband ensemble net (denoted as ensemble) and the MS system with a single subband neural net trained in clean or with multi-style training (denoted as clean or mixed respectively).
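The randomized multi-condition training list described in Sec. 2.2 can be assembled along these lines: each training utterance is assigned one of the N noise conditions by a uniform pseudo-random index, so all conditions have equal prior probability and appear in random order during training. The condition labels and utterance identifiers below are illustrative placeholders, not the actual file lists or mixing tools used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

# N possible training conditions: clean plus babble noise at three SNRs.
conditions = ["clean", "babble_20dB", "babble_10dB", "babble_5dB"]
N = len(conditions)

# Hypothetical list of training utterances (placeholder identifiers).
utterances = [f"utt_{i:04d}" for i in range(3696)]

# Uniform pseudo-random index in {0, ..., N-1} for each utterance, so the
# prior probability of every condition is equal and the order is randomized.
cond_idx = rng.integers(low=0, high=N, size=len(utterances))

training_list = [(utt, conditions[c]) for utt, c in zip(utterances, cond_idx)]
rng.shuffle(training_list)                     # randomized presentation order

# Sanity check: empirical priors should be close to 1/N.
counts = np.bincount(cond_idx, minlength=N)
for name, cnt in zip(conditions, counts):
    print(f"{name:12s} {cnt / len(utterances):.3f}")
```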

Table 1: Phoneme Error Rate (PER) of the SS and MS system variants in unknown noise.

Noise (dB SNR)      SS                        MS                 MS-PM              LDWNN
                    MFCC    PNCC    FDLP      clean    mixed     clean    mixed
clean               33.50   33.51   31.35     30.04    31.       29.45    30.99     32.29
babble (15)           .14   58.93   57.10     47.05    39.15     45.16    38.48     36.03
subway (15)         57.42   49.48   46.62     38.28    35.27     36.28    34.35     33.93
factory1 (10)       74.     72.05   68.10     63.74    56.46     61.56      .26     45.91
restaurant (10)     72.48   67.14   63.14     58.19    49.         .69    48.       42.96
street (5)            .98     .26   67.26     62.56    54.72     59.37    53.92     44.41
exhall (5)          79.90   74.97     .67     69.47    58.04     64.46    57.76     48.30
f16 (0)             77.32   84.31   86.10     86.54    89.31     83.83      .07     71.27
car (0)             90.24   52.58   54.32     40.79    35.78     35.48    34.56     34.41

SS: single-stream baseline trained in clean; MS: multi-stream system with a single neural net in every subband; clean and mixed refer to the training conditions of the subband neural nets; MS-PM: multi-stream system with a performance monitor; LDWNN: the proposed long, deep, and wide neural net (a multi-stream system with subband ensemble nets).

4. Results

Results of the first experiment indicate that the subband ensemble net works well over a wide range of SNRs across multiple frequency subbands. Fig. 4 depicts the phoneme error rate of the subband ensemble net for the 2nd, 4th, and 6th subbands, centered around 0.5, 2, and 4 kHz, in clean speech and in various levels of babble noise. For comparison, the phoneme error rates of single neural nets trained in clean, in babble noise at 20, 15, 10, 5, or 0 dB SNR, or with multi-style training (i.e., all the conditions mixed) are also plotted in the same figure. The neural net trained in clean works very well in clean, but its performance drops quickly as the noise level increases. The neural nets trained in various levels of babble noise show better generalization ability, but their maximum performance decreases noticeably even in mild noise at 20 dB SNR. Multi-style training helps improve the performance in the range of 10 to 20 dB SNR, but it still performs worse in clean conditions and in heavy noise (below 10 dB SNR) than the neural nets of matched conditions. The subband ensemble net significantly outperforms all single nets on matched conditions in every frequency subband.

Results of the second experiment indicate that the LDWNN shows superior performance and generalization ability in both clean and unknown noise. Fig. 5 compares the LDWNN system with the MS system with and without a performance monitor. The MS systems have a single subband neural net trained in the clean or the multi-style condition for every subband. The three curves start at about the same level of roughly 30% in the clean condition. As the noise level increases, the MS system with clean-trained subband neural nets climbs quickly to about % at 10 dB SNR. The MS system with multi-style trained subband neural nets shows a much slower decrease in system performance, but it still falls far behind the LDWNN system with subband ensemble nets, which has a phoneme error rate of about 40% at 10 dB SNR.

Results of the third experiment (refer to Tab. 1) show that the LDWNN system (i.e., a multi-stream system with subband ensemble nets) substantially outperforms the MS system (both MS and MS-PM) with subband neural nets trained in the clean or multi-style condition for all types of unknown noise at various dB SNR. It is noted that the MS system with multi-style training substantially outperforms the MS system trained in the clean condition. Stream selection with the performance monitor reduces the phoneme error rate by about 5% relative for the MS system with clean-trained subband neural nets, but the gain diminishes for the MS system with multi-style training. The average phoneme error rate (PER) of the proposed LDWNN system is about 20% and 14% relative lower than the MS baseline with subband neural nets trained in the clean and multi-style condition respectively.
It is on average about 40%, 30%, and 28% relative lower than the single-stream baseline using MFCC, PNCC, and FDLP respectively.

5. Conclusion

In this study, we propose a long, deep and wide system named LDWNN for robust speech recognition based on the multi-stream framework, in which the full band speech is divided into multiple frequency subbands, each acting as an independent channel for speech recognition. It is assumed that any unknown noise can be approximated by white noise (or speech-shaped noise) of similar level across multiple frequency subbands. Within each subband, an ensemble net consisting of multiple neural nets trained for different noise levels is employed, followed by a neural net that selects the most suitable net for any unknown noise. Experimental results show that the subband ensemble net optimizes information extraction for every frequency subband in both clean and noisy conditions. The LDWNN based on subband ensemble nets shows superior performance and generalization ability in both clean and unknown noise.

6. References

[1] J. B. Allen, "How do humans process and recognize speech?" IEEE Trans. Speech and Audio Processing, 2(4):567-577, 1994.
[2] M. Athineos and D. P. W. Ellis, "Autoregressive modeling of temporal envelopes," IEEE Trans. Signal Processing, (11):5237-5245, 2007.
[3] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Springer, 1994.
[4] L. Deng et al., "Large-vocabulary speech recognition under adverse acoustic environments," in Proc. INTERSPEECH, 2000.
[5] H. Hermansky, E. Variani, and V. Peddinti, "Method for ASR performance prediction based on temporal characteristics of speech sounds," in Proc. IEEE ICASSP, 2013.
[6] H. Hermansky, "Speech recognition from spectral dynamics," invited paper, Proc. Indian Academy of Sciences, 36(5):729-744, 2011.
[7] F. Li, H. Mallidi, and H. Hermansky, "Phone recognition in critical bands using subband temporal modulations," in Proc. INTERSPEECH, 2012, p. P7.C06.
[8] R. Lippmann, E. Martin, and D. B. Paul, "Multi-style training for robust isolated-word speech recognition," in Proc. IEEE ICASSP, vol. 12, 1987.
[9] H. Hermansky, S. Tibrewala, and M. Pavel, "Towards ASR on partially corrupted speech," in Proc. ICSLP '96, vol. 1, pp. 4624.
[10] H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in Proc. Int. Conf. Spoken Language Processing, 1996.
[11] F. Li and H. Hermansky, "Effect of filter bandwidth and spectral sampling rate on phoneme recognition," in Proc. IEEE ICASSP, 2013.
[12] F. Li, "Subband hybrid feature for multi-stream speech recognition," in Proc. IEEE ICASSP, 2014.
[13] S. Ganapathy, S. Thomas, and H. Hermansky, "Temporal envelope compensation for robust phoneme recognition using modulation spectrum," J. Acoust. Soc. Amer., 128(6):3769-37, 2010.
[14] J. Pinto, B. Yegnanarayana, H. Hermansky, and M. Magimai-Doss, "Exploiting contextual information for improved phoneme recognition," in Proc. IEEE ICASSP, 2008.
[15] M. D. Richard and R. P. Lippmann, "Neural network classifiers estimate Bayesian a posteriori probabilities," Neural Computation, 1991.
[16] S. Sharma, "Multi-stream approach to robust speech recognition," Ph.D. thesis, Oregon Graduate Institute of Science and Technology, Portland, 1999.
[17] M. Athineos, H. Hermansky, and D. P. Ellis, "LP-TRAP: Linear predictive temporal patterns," in Proc. INTERSPEECH, 2004.
[18] H. Hermansky, "Multistream recognition of speech: Dealing with unknown unknowns," invited paper, Proceedings of the IEEE, 101(5), 2013.