Feature Enhancement with a Reservoir-based Denoising Auto Encoder

Size: px

Start display at page:

Download "Feature Enhancement with a Reservoir-based Denoising Auto Encoder"

Dylan Mitchell
5 years ago
Views:

1 Featue Enhancement with a Resevoi-based Denoising Auto Encode Azaakhsh Jalalvand, Kis Demuynck, Jean-Piee Matens Multimedia Lab, ELIS, Ghent Univesity/iMinds, Sint-Pietesnieuwstaat 41, B-9000, Ghent, Belgium Azaakhsh.Jalalvand@ugent.be Abstact Recently, automatic speech ecognition has advanced significantly by the intoduction of deep neual netwoks fo acoustic modeling. Howeve, thee is no clea evidence yet that this does not come at the pice of less genealization to conditions that wee not pesent duing taining. On the othe hand, acoustic modeling with Resevoi Computing (RC) did not offe impoved clean speech ecognition but it leads to good obustness against noise and channel distotions. In this pape, the aim is to establish whethe adding featue denoising in the font-end can futhe impove the obustness of an RC-based ecognize, and if so, whethe one can devise an RC-based Denoising Auto Encode that outpefoms a taditional denoise like the ETSI Advanced Font-End. In ode to answe these questions, expeiments ae conducted on the Auoa-2 benchmak. Keywods ecuent neual netwoks, esevoi computing, denoising auto encode, obust speech ecognition I. INTRODUCTION If one wants to apply automatic speech ecognition (ASR) on mobile devices, an ASR system is needed that is accuate and obust against the pesence of noise and channel distotions. Despite many yeas of wok, achieving obust ecognition emains a big challenge. Since an ASR is composed of a font-end (fo the extacting acoustic featue vectos) and a back-end (fo decoding these featue vectos) one can envision two types of techniques to tackle the poblem. One is to enhance (denoise) the featue vectos in the font-end [1], [2], the othe is to take account of the noise duing featue decoding in the back-end [3], [4]. On top of that, one can develop acoustic model types that ae intinsicaally moe esistant to noise [5] [8]. Typical featue enhancement adopts signal pocessing techniques such as Wiene filteing (single-channel) and beamfoming (multi-channel). They gave ise to an Advanced Fontend (AFE) [1] and the SPLICE [9] algoithms, espectively. Less typical is to adopt a neual netwok to convet the noisy featue vectos to clean vectos [10]. Such a netwok is called a Denoising Auto Encode (DAE). Back-end techniques ae usually esticted to GMM-based acoustic modeling (GMM = Gaussian Mixtue Model) because fo this type of model it is easy to undestand how to include the vaiations due to the noise into the decode. In pevious wok we investigated Resevoi Computing (RC) [11], [12] as a neual-based appoach to acoustic modeling. The agument was that RC-netwoks ae dynamical systems which ae able to focus on meaningful speech dynamics which ae bound to diffe fom the dynamics induced by the noise. We achieved quite competitive esults fo noiseobust continuous digit ecognition (CDR) on Auoa-2 [7], [13]. A deep RC-HMM hybid (HMM = Hidden Makov Model) can appaently compete with a taditional GMM- HMM system in clean conditions and clealy outpefom it in noisy envionments. Moe ecently, we also discoveed that supplying AFE featues to the RC-HMM system can futhe inceases its noise obustness [8]. In this wok we futhe investigate the impact of the fontend in an RC-HMM system. In paticula, we popose to apply an RC-netwok fo featue denoising befoe ecognition. We ague that such complex nonlinea dynamical system is bound to captue bette the complex elationships between noisy and clean utteances than the taditional signal pocessing methods that have only a vey shot memoy. The est of the pape is oganized as follows: Section II povides a concise outline of the basic pinciples of RC, Section III descibes ways of integating esevoi netwoks in an RC-HMM hybid speech ecognize, Section IV eviews the RC-based denoising of the acoustic featues we popose and Sections V and VI summaize the expeimental famewok (Auoa-2) and the esults obtained within this famewok. The pape ends with conclusions and futue wok. II. RESERVOIR COMPUTING (RC) The basic pinciple of RC is that one can etieve infomation fom sequential inputs by means of a two-laye RNN with the following chaacteistics (see Fig. 1). The inputs ae U t Input laye W in R t W ec Resevoi W out Y t Readout laye U t W in R t Y t andom input weights andom ecuent weights tained output weights input featue vecto input weight matix esevoi state vecto W ec ecuent weight matix W out output weight matix eadout vecto Fig. 1. Speech A basic RC Font-end system consists of a esevoi and RC a eadout laye. The esevoi is composed of inteconnected Mean & non-linea neuons with andomly fixed weights. The Featue eadout laye Resevoi Vaiance consists of linea neuons Resevoi with tained weights. extaction Nomalization spasely Intemediate-pocessing connected to a pool of ecuently Classical GMM-HMM inteconnected nonlinea neuons foming so-called esevoi and the outputs Decoelation GMMs Input laye Decode Readouts Digits Speech Font-end

2 ae obtained as a linea combinations of the esevoi outputs. The esevoi neuons constitute a hidden laye that, at time t, is diven by inputs U t and delayed hidden laye outputs R t 1. Impotant is that (1) the weights of the hidden neuons ae fixed, (2) the output neuons ae linea, and consequently, (3) the output weights can be optimized by means of a least squaed linea egession. The entie system is called a esevoi netwok. Its outputs Y t ae usually called eadouts [11] so as to diffeentiate them unambiguously fom the esevoi outputs R t. In ode to become moe esistant to andom intefame changes in the inputs (e.g. changes due to the spectal analysis o the ambient noise), one can ceate a esevoi of Leaky Integato Neuons [14]). A netwok containing such a esevoi is govened by the following equations: R t = (1 λ)r t 1 + λ f es (W in U t + W ec R t 1 ) (1) Y t = W out R t (2) with a leak ate λ between 0 and 1, with W in, W ec containing the input and ecuent weights to the esevoi neuons, and with W out containing the output weights. The weights of the hidden neuons ae fixed by means of a andom pocess chaacteized by fou contol paametes (see [15] fo moe details): (1) α U, the maximal absolute eigenvalue of the input weight matix W in, (2) ρ, the maximal absolute eigenvalue of the ecuent weight matix W ec, (3) K in, the numbe of inputs diving each esevoi neuon and (4) K ec, the numbe of delayed esevoi outputs diving each esevoi neuon. The fist two paametes contol the stengths of the input and the ecuent stimulations of a esevoi neuon, wheeas the latte two contol the spasity of the input and ecuent weight matices. Togethe with λ they constitute the esevoi contol paametes which have to be set popely in ode to assue the esevoi is well behaved. Any effective esevoi should at least have the so-called echo state popety, stating that with time, the esevoi should foget about the initial state it was in. That is also why a esevoi netwok was oiginally called an Echo State Netwok [11]. It was shown in [11] that the echo state popety holds if ρ called the spectal adius is smalle than 1. The esevoi can be envisioned as a pedefined but complex non-linea dynamical system that pefoms a tempoal analysis of the input steam. Ou claim is that such a system can extact featues which ae esistant to the pesence of noise whose dynamics diffe fom the speech dynamics. The output weights ae detemined so that they minimize the mean squaed eo between the eadouts Y t and the desied eadouts D t ove the taining examples [7]. As a consequence, they follow fom a set of linea equations. If a esevoi netwok is tained fo ecognition, the desied output D t is a unit vecto with one non-zeo enty encoding the desied HMM-state at time t. If it is tained fo denoising, D t is the desied clean speech featue vecto at time t. III. A HYBRID RC-HMM FOR SPEECH RECOGNITION An RC-HMM hybid woks with an HMM that epesents the task and a neual netwok that is supposed to convet the inputs U t into HMM state likelihoods. The seach fo the best path though the HMM is found using a Vitebi seach. In the case of an RC-HMM hybid, the eadouts y t,i (with i indexing the eadouts) ae assumed to esemble the posteio pobabilities P (q t = i U1). t This means that z t,i = y t,i /P (q t = i) is a scaled likelihood and consequently, that the best state sequence follows fom T ˆq = ag max P (q, y) = ag max z t,qt P (q t q t 1 ), q q t=1 Fig. 2 shows the achitectue fo the case of continuous digit ecognition (CDR) and a multi-stage RC netwok in which each netwok output is supplied to the next stage [8]. The tansition pobability P 0 which is added to the digit loop contols the balance between deletions and insetions. Since y t,i is not confined to [0,1] it is fist mapped to that inteval befoe computing z t,i. The mapping is achieved by a simple clip-and-scale appoach, as descibed in [8]. The diffeent stages of the RC-netwok ae tained independently, one afte the othe. As in [8], [16] we use a bi-diectional RC-netwok with one esevoi pocessing the fames fom left-to-ight and anothe one pocessing them fom ight-to-left. The eadouts at time t ae then computed as a linea function of the two esevoi outputs at time t. IV. AN RC-BASED DENOISING AUTOENCODER A system that aims at econstucting clean featue vectos fom noisy featue vectos is called a denoising autoencode (DAE) [17]. A taditional example of a DAE is spectal subtaction: it is able to emove pat of the noise by woking fame-by-fame (= memoy-less). One can ague that a complex dynamical system with memoy should be able to do a bette job. Theefoe, we popose to ceate a DEA by means of an RC-netwok incopoating one unidiectional esevoi because such a system has memoy and it pemits to pefom a non-linea tansfomation of the noisy featues to clean featues. The back-side of this appoach is of couse that the DAE has to be tained and that duing this taining one needs the clean vesion of the noisy speech featue vectos. Since we use MFCCs (Mels-scale Fequency Cepstal Coefiificients) as the featue vectos, and as the DAE has to denoise these MFCCs, one can adopt diffeent stategies fo achieving this. In fact, the MFCC-computation includes a chain of pocessing stages (See Fig. 3), and the denoising RCnetwok could be inseted at diffeent positions in this chain. Taditional speech enhancement fo instance often woks in the Discete-time Fouie Tansfom (DFT) domain [18], [19] (at position P1), wheeas most wok on obust ASR applies denoising in the log mel-fequency domain [20] (at position P2) and in the MFCC-domain [21], [22] (at position P3). Note that although the high dimensionality of the input nomally leads to a highe computational cost [23], this may not necessaily be the case hee due to the spase inteconnection between inputs and esevoi neuons, and also due to employing linea nodes as the eadouts. Theefoe, we conside a RC-netwok at all positions. Futhemoe, each RC-netwok is supposed to have access to static as well as the dynamic featues, but the netwok is only expected to poduce denoised static featues. The dynamic paametes of the input to the ASR ae then deived fom these denoised static featues.

3 Featue extaction Mean & Vaiance Nomalization ut laye Resevoi Resevoi eadouts Intemediate-pocessing Inputs Laye 1 Decoelation GMMs U Y 1 Classical GMM-HMM Digits Intemediate layes Decode Y k Laye L Y L P(U q) I1 D 1 D Speech Featue extaction Font-end Mean & Vaiance Nomalization GMMs Posteio pobabilities I2 # Hybid RC-HMM F Readouts to Digits Resevoi Fig. 2. Achitectue Resevoi of an RC-HMM Log-Likelihood hybid compising Decode a multi-laye esevoi netwok fo CDR. The HMM has two initial states (I1 and I2), one final state (F) and it compises 11 multi-state digit convete models (D1... D11) and a single state silence model (#) Input laye Readouts P o Speech U t± n / Windowing Deivatives Calculato DFT P1 Log Melfilte P2 P3 DCT bank Input laye 3n n Resevoi / Resevoi / Û t Readouts TABLE I. NOISE TYPES AND FILTERS USED IN AURORA-2 DATASET Tain & Test A Test B Test C N1: subway N1: estauant N1: subway Noise N2: babble N2: steet N2: steet types N3: ca noise N3: aipot N4: exhibition hall N4: tain station Filte G712 G712 MIRS Fig. 3. Possible points to apply the denoising system (top) and thei stuctue in moe details (down) V. EXPERIMENTAL SETUP In this section we pesent the expeimental famewok that was adopted to investigate the potential of the diffeent system configuations pesented in the pevious sections. A. Speech copus: Auoa-2 The Auoa-2 copus consists of clean and noise coupted digit sequences counting 1 to 7 digits pe utteance. Each utteance is passed though a G712 o a MIRS filte [24], and then sampled at 8 khz. Since thee ae two vaiants of 0 in Ameican English, namely zeo and oh, the vocabulay is composed of 11 digits. The data is divided into a taining pat and an evaluation pat. The famewok suppots two types of expeiments: clean taining expeiments in which systems ae developed on 8440 clean taining utteances fom 110 adults and multi-style taining expeiments in which systems ae developed on 8440 noise coupted vesions of the same utteances. The couption is andomly chosen out of fou noise types and five SNRs ( (clean), 20, 15, 10 and 5 db). The evaluation utteances come fom speakes that ae not pesent in the taining data. They ae divided into thee tests. Tests A and B each contain 28,028 utteances coveing 4004 diffeent digit sequences, 4 noise types and 7 SNRs ( (clean), 20, 15, 10, 5, 0, and -5 db). The noise types occuing in Test B do not occu in the multistyle taining data, while those of Test A do. Test C contains 14,014 utteances coveing 2002 diffeent digit sequences, 2 noise types (one matched and one mismatched) and 7 SNRs. Unlike all othe utteances they passed though a MIRS instead of a G712 filte (see Table I). B. Evaluation esults We epot aveage Wod Eo Rates (WERs) on tests A-C fo all SNRs, and we conside both clean speech taining and multi-style taining. In multi-style taining, clean utteances and utteances with SNRs between 20 and 5 db ae used fo taining. In the final evaluation phase both the acoustic models and the DAE ae tained on the complete taining set, but using the contol paametes that wee found optimal in a development phase duing which two thids of the taining set ae used fo taining and the emaining thid fo contol paamete optimization (e.g. the Vitebi decode penalty, P 0 of the ecognize). In this pape, we only epot the esults of the final evaluation phase fo each expeiment. C. Refeence systems In ode to set a efeence, we fist epot some state-ofthe-at system pefomances (see Table II). In paticula, we conside the ML-based GMM systems using AFE-featues poposed in [25], the ML-based and MCE-based GMM systems poposed in [2] and [26], two GMM systems embedding moe sophisticated back-ends based on joint uncetainty decoding (JUD) and Vecto Tylo Seies (VTS) espectively [4] and the tandem system embedding deep belief netwoks and GMMs (T-DBN-GMM), epoted in [5]. The figues show that advanced back-end techniques (JUD and VTS) lead to a lage gain in noise obustness than advanced font-end techniques, but it is not clea fom the papes how much degadation they induce fo clean speech ecognition. D. Font-end setups We will investigate thee diffeent acoustic featue sets: MFCCs (log enegy and c 1... c 12 ), Mel filtebank featues (MelFB) (24 channel log enegies), and the AFE featues (denoised c 0... c 12 without dopping non-speech fames) [1]. In all cases, the analysis is pefomed on 30 ms Hammingwindowed fames and the hop size between fames is τ f =

4 TABLE II. COMPARING AVERAGE WERS (IN %) PER CONDITION FOR TEST SETS A-C OF AURORA-2 USING A 3-LAYER HYBRID RC-HMM FOR BOTH CLEAN AND MULTI-STYLE TRAINING. Clean Multi System Clean dB Clean dB GMM (AFE) [25] GMM (MFCC) [2] GMM (SPLICE) [27] GMM (MFCC-MCE) [26] GMM (VTS) [4] GMM (JUD) [4] T-DBN-GMM [5] RC (MelFB) RC (MFCC) RC (AFE) Seen Unseen Ref P1 P2 P3 P3 P Coss coelation Fig. 4. The coelation between the output of the DAE and the clean nomalized featues, by denoising the data steam in diffeent points. 10 ms. Each featue set is supplemented with and featues. Befoe supplying the featue vectos to the ASR, an utteance-wise nomalization that ceates zeo-mean and unitvaiance inputs pe featue is pefomed. E. RC-HMM hybid setup Following ou pevious wok, the esevoi contol paametes of each esevoi ae detemined in the same way.. Defining τ λ = τf / ln(1 λ) as the leaky integation time constant and T as the expected state duation, we select (ρ, τ λ, K in, K ec ) = (0.8, T, 10, 10) and we chose α U so that the aveage vaiance of the eseevoi outputs eaches a cetain level [15]. In this wok we utilize a 3-laye RC-HMM system and since layes 2 and 3 see basically the same inputs, α U is taken the same fo both layes. The size of the esevoi defined as the numbe of neuons it contains (N es ) is consideed to be an independent vaiable. Note that since K in and K ec ae kept fixed to 10, the CPU-time needed fo calculating the eadouts scales linealy with the size of the esevoi. The esevoi netwoks ae tained by means of a Tikhonov egession [28] and each digit is modeled by a 7-state left-toight HMM whilst the silence is modeled by a single state. The taget outputs encode the visited HMM state at time t. F. RC-based DAE setup In line with [15], we contemplate that ρ and λ ae the only task-dependent contol paametes of a esevoi. Consequently, even though denoising is a completely diffeent task than digit state ecognition, we keep Kin = K ec = 10 and we maintain the same method as befoe fo computing α U. Futhemoe, since leaky integation is a fom of smoothing coesponding to a some low-pass filte, we ague that no leaky integation should be applied hee as the spectum of the denoised inputs is not expected to be naowe than that of the noisy inputs. Consequently, thee is only one paamete left to optimiza, namely ρ. In all expeiments, iespective of the chosen featue set, we have used a 2-laye RC-netwok with one unidiectional esevoi of 1K neuons pe laye. VI. EXPERIMENTAL RESULTS In this section we eview the expeiments we conducted to assess the capacity of an RC-netwok to denoise the featues and the capacity it has to futhe aise the obustness of the CDR system. A. Denoising capacity of an RC-netwok Duing assessment of the denoising capacity of the RCnetwoks we adopt the aveage coelations between the 39 denoised and the clean MFCCs featues as the quality citeion. Howeve, since the ASR will always wok with utteance based mean and vaiance nomalized featues, the coelations ae computed afte this nomalization step. We conducted two expeiments: one in which the test samples come fom Test A and epesent conditions that ae pesent duing taining (= Seen ) and one in which the test samples come fom Test C and epesent anothe channel and unseen SNRs (= Unseen ). In each expeiment, the DAE is imputed at all thee positions P1, P2 and P3 (see Section IV). The esults ae depicted in Fig. 4. Ref denotes to the coelation existing between the aw (noisy) and the clean featues. The data show that denoising in the DFT-domain is not woking well but denoising in the two othe domains does. In fact, iespective of the expeiment, denoising the MelFB and the MFCC featues seem to be equally effective. We will theefoe evaluate them both in combination with CDR. In an additional expeiment we constucted RC-netwoks that wee tained to denoise the dynamic as well as the static MFCCs instead of just the static featues. This appoach lead to the aveage coelation maked by P3. Thee seems to be no benefit in woking this way. Finally, we wanted to assess the limits of the RC-based DAE by aising the esevoi size fom 1K to 4K neuons. The esults obtained with the lage esevois (maked as P3 ) ae only slightly bette than the ones obtained with the smalle esevois. In ode to illustate the effect of the DAE, we have depicted on Fig. 5 the MelFB spectogams fo a noisy speech sample (SNR = 5dB) befoe and afte denoising togethe with the clean speech spectogam. It is especially notewothy that the DAE does an excellent job in the silence pats. This is patly due to the lage numbe of non-speech (silent) fames in the taining data.

DENOISED FEATURES, AND (3) THE RETRAINED RECOGNIZER TESTED ON DENOISED FEATURES. (a) Noisy MelFB featues Baseline Clean Multi Clean 0-20 -5dB Clean 0-20 -5dB MFCC 0.93 10.7 59.8 1.46 6.2 46.5 AFE 0.

6 (b) Clean copy (c) Denoised MelFB Fig. 5. Denoising MelFB featues of a sample with steet noise of SNR 5 db fom Test C. B.

5 TABLE III. COMPARING AVERAGE WERS (IN %) ON TEST SETS A-C PER CONFIGURATION: (1) THE BASELINE RC-HMM RECOGNIZER TRAINED ON MFCC AND AFE, (2) THE BASELINE RECOGNIZER TRAINED ON MFCC AND AFE BUT TESTED ON DENOISED FEATURES, AND (3) THE RETRAINED RECOGNIZER TESTED ON DENOISED FEATURES. (a) Noisy MelFB featues Baseline Clean Multi Clean dB Clean dB MFCC AFE Test on MFCC Baseline AFE Reain & MFCC Test AFE (b) Clean copy (c) Denoised MelFB Fig. 5. Denoising MelFB featues of a sample with steet noise of SNR 5 db fom Test C. B. Need fo a denoising font-end in an RC-based CDR In a fist ecognition expeiment we test the thee acoustic featue sets descibed in the pevious section in combination with a thee-laye bi-diectional RC-HMM hybid with 8K neuons pe laye. The esults of this expeiment ae listed in the bottom section of Table II. Appaently, in the clean speech taining expeiment thee is only a little diffeence between the featue sets in the matched condition (clean). In the mismatched conditions howeve, the AFE does lead to moe obustness, be it that the gain is vey much less impessive than it is fo GMM-based systems. This finding seems to confim the hypothesis that a esevoi can filte out a lage pat of the noise without having been confonted with noise duing taining. In the multi-style taining expeiment, whee the mismatch between the test and the taining emains much smalle, the positive effect of the AFE is also much smalle, but nevetheless emaining to some extent. Consequently, if an RC-based DAE could outpefom the AFE it could lead to inceased the obustness. C. Can RC-based DAE outpefom the AFE? In ode to answe the question, we conducted expeiments with a font-end incopoating an RC-based DAE at position P3. We consideed thee configuations: (1) the baseline featues and acoustic models ae used (= Baseline), (2) the denoised featues ae used but the acoustic models ae left unchanged (= Test on Baseline) and (3) the denoised featues ae used in combination with acoustic models that ae etained on these featues (= Retain & Test). In the case of etaining, we maintained the distinction between clean speech taining and multi-style taining, but of couse, the clean speech taining esults have to be intepeted with cae because afte all, the DAE has aleady seen the noisy taining examples that belong to the multi-style taining copus. The esults listed in Table III show that the RC-based denoise does not outpefom the AFE as the pefomance of the baseline system with AFE featues is slightly bette than that of a system woking with plain MFCCs and an RC-based DAE on all conditions. In spite of this, compaing the AFE ows of the baseline and the new system, demonstates that the RC-based DAE does lead to a small exta gain in stongly mismatched conditions (clean speech taining and noisy test samples) but that this comes at the expense of a significant loss of the pefomance fo matched conditions (clean test samples). In the multi-style taining expeiment whee the mismatch emains modeate, the RC-based denoise even has a (small) detimental effect in all conditions. This can only mean that the DAE emoves infomation fom the featue steam that is othewise exploitable by the RC back-end. As could be expected, the Test on Baseline configuation is chaacteized by a stong detimental effect in matched conditions because the featues used duing taining and test have become diffeent even in these conditions. In a contol expeiment, we also tained and tested the RCbased denoising of MFCCs by inseting a DAE at position P2. In line with the coelations measued befoe, the emeging systems achieved vey simila esult showing no pefeence fo positions P2 and P3. VII. CONCLUSION AND FUTURE WORK Recently, we wee able to show that esevoi computing (RC) is a viable acoustic modeling technique that can lead to moe obust automatic speech ecognition (ASR). Expeiments on the Auoa-2 benchmak demonstated that ou RC-based systems aleady outpefom the most complex GMM-HMM based systems on the task of continuous digit ecognition. The main objectives of the pesent pape wee: (1) to establish whethe esevoi netwoks can be tained to denoise the featue vectos, (2) to investigate how much benefit can be attained by applying denoising in the font-end of the RC-ecognize, (3) to establish whethe RC-based denoising can outpefom taditional signal pocessing based techniques (e.g. like in the ETSI advanced font-end (AFE)), and (4)

6 to find out whethe the two denoising techniques ae maybe complementay. Fist of all, we could show that an RC-based denoising auto encode (DAE) imputed at a suitable position in the MFCC font-end can lead to inceasing mean coelations between the featues of the noisy and the clean speech utteances. Next, we could demonstate that adopting featue denoising in the font-end of an RC-based ASR is beneficial, but not as much as it is in the case of a taditional GMM-based ASR. A somewhat disappointing esult is that an RC-based DAE fails to induce as much impovement as the AFE. Howeve, by combining the two tecniques, a small gain is possible to achieve in seveely mismatched conditions, be it at the expense of a small degadation in the matched condition. In the nea futue we will compae the RC-based DAE with the AFE in situations whee the noise is moe non-stationay. Futhemoe, we hope to apply some of the ideas which lead to uncetainty decoding in an RC-backend. Finally, we intend to extend ou eseach to noise obust lage vocabulay speech ecognition (e.g. Auoa-4). ACKNOWLEDGEMENT The eseach leading to the esults pesented hee has eceived funding fom Flemish Science Foundation (FWO) unde gant ageement G N (RECAP). REFERENCES [1] ETSI, Speech pocessing, tansmission and quality aspects STQ; distibuted speech ecognition; advanced font-end featue extaction algoithm; compession algoithms, ES , Tech. Rep., [2] C.-P. Chen and J. A. Bilmes, MVA pocessing of speech featues, IEEE Tans. Audio, Speech and Language Pocessing, vol. 15, no. 1, pp , jan [3] M. Van Segboeck and H. Van Hamme, Advances in missing featue techniques fo obust lage vocabulay continuous speech ecognition, IEEE Tans. Audio, Speech and Language Pocessing, vol. 19, no. 1, pp , [4] H. Xu, M. Gales, and K. Chin, Joint uncetainty decoding with pedictive methods fo noise obust speech ecognition, IEEE Tans. Audio, Speech and Language Pocessing, vol. 19, no. 6, pp , [5] O. Vinyals and S. Ravui, Compaing multilaye pecepton to deep belief netwok tandem featues fo obust ASR, in Poc. ICASSP, 2011, pp [6] S.-X. Zhang and M. Gales, Stuctued suppot vecto machines fo noise obust continuous speech ecognition, in Poc. Intespeech, 2011, pp [7] A. Jalalvand, F. Tiefenbach, and J.-P. Matens, Continuous digit ecognition in noise: Resevois can do an excellent job! in Poc. Intespeech, 2012, p. ID:644. [8] A. Jalalvand, K. Demuynck, and J.-P. Matens, Noise obust continuous digit ecognition with esevoi-based acoustic models, in Poc. ISPACS, 2013, p. ID:99. [9] L. Deng, A. Aceo, L. Jiang, J. Doppo, and X. Huang, Highpefomance obust speech ecognition using steeo taining data, in Poc. ICASSP, vol. 1, 2001, pp [10] S. Tamua and A. Waibel, Noise eduction using connectionist models, in Poc. ICASSP, ap 1988, pp [11] H. Jaege, The echo state appoach to analysing and taining ecuent neual netwoks - with an eatum note, GMD Repot 148, Geman National Reseach Cente fo Infomation Technology, Tech. Rep., [12] D. Vestaeten, B. Schauwen, M. D Haene, and D. Stoobandt, An expeimental unification of esevoi computing methods, IEEE Tans. Neual Netwoks, vol. 20, pp , [13] A. Jalalvand, F. Tiefenbach, D. Vestaeten, and J.-P. Matens, Connected digit ecognition by means of esevoi computing, in Poc. Intespeech, 2011, pp [14] H. Jaege, M. Lukoševičius, D. Popovici, and U. Siewet, Optimization and applications of echo state netwoks with leaky-integato neuons, Neual Netwoks, vol. 20, no. 3, pp , Ap [15] M. Lukoševičius, A pactical guide to applying echo state netwoks, in Neual Netwoks: Ticks of the Tade, se. Lectue Notes in Compute Science. Spinge Belin Heidelbeg, 2012, vol. 7700, pp [16] F. Tiefenbach, A. Jalalvand, K. Demuynck, and J.-P. Matens, Acoustic modeling with hieachical esevois, IEEE Tans. Audio, Speech and Language Pocessing, vol. PP, no. 99, [17] P. Vincent, H. Laochelle, Y. Bengio, and P.-A. Manzagol, Extacting and composing obust featues with denoising autoencodes, in Poc. ICML, 2008, pp [18] Y. Ephaim and D. Malah, Speech enhancement using a minimummean squae eo shot-time spectal amplitude estimato, IEEE Tans. Acoustics,Speech and Signal Pocessing, vol. 32, no. 6, pp , [19] P. J. Wolfe and S. J. Godsill, Efficient altenatives to the ephaim and malah suppession ule fo audio signal enhancement, in special issue) EURASIP JASP on Digital Audio fo Multimedia Communications, 2003, pp [20] D. Yu, L. Deng, J. Doppo, J. Wu, Y. Gong, and A. Aceo, Robust speech ecognition using a cepstal minimum-mean-squae-eomotivated noise suppesso, IEEE Tans. Audio, Speech and Language Pocessing, vol. 16, no. 5, pp , [21] K. Indebo, R. Povinelli, and M. Johnson, Minimum mean-squaed eo estimation of mel-fequency cepstal coefficients using a novel distotion model, IEEE Tans. Audio, Speech and Language Pocessing, vol. 16, no. 8, pp , [22] L. Deng, J. Doppo, and A. Aceo, Estimating cepstum of speech unde the pesence of noise using a joint pio of static and dynamic featues. IEEE Tans. Speech Audio Pocessing, vol. 12, no. 3, pp , [23] R. Rotili, E. Pincipi, S. Cifani, F. Piazza, and S. Squatini, Multichannel featue enhancement fo obust speech ecognition, Speech technologies. InTech, ISBN, pp , [24] H.-G. Hisch and D. Peace, The AURORA expeimental famewok fo the pefomance evaluation of speech ecognition systems unde noise conditions, in Automatic Speech Recognition: Challenges fo the Next Millennium. ISCA ITRW, 2000, pp [25] H. G. Hisch and D. Peace, Applying the advanced ETSI fontend to the Auoa-2 task, vesion 1.1, Tech. Rep., [26] X. Xiao, J. Li, E.-S. Chng, H. Li, and C.-H. Lee, A study on the genealization capability of acoustic models fo obust speech ecognition, IEEE Tans. Audio, Speech and Language Pocessing, vol. 18, no. 6, pp , [27] T. Kai, M. Suzuki, and K. Chijiiwa, Combination of SPLICE and featue nomalization fo noise obust speech ecognition, Intl. Wokshop on Nonlinea Cicuits, Communications and Signal Pocessing, vol. 16, no. 4, pp , [28] C. M. Bishop, Taining with noise is equivalent to Tikhonov egulaization, Neual Computation, vol. 7, pp , 1994.

Controlled Information Maximization for SOM Knowledge Induced Learning

Controlled Information Maximization for SOM Knowledge Induced Learning 3 Int'l Conf. Atificial Intelligence ICAI'5 Contolled Infomation Maximization fo SOM Knowledge Induced Leaning Ryotao Kamimua IT Education Cente and Gaduate School of Science and Technology, Tokai Univeisity