Andi Buzo, Horia Cucu, Mihai Safta and Corneliu Burileanu
Speech & Dialogue (SpeeD) Research Laboratory, University Politehnica of Bucharest (UPB)
The MediaEval 2012 SWS task
- A multilingual, query-by-example Spoken Term Detection (STD) task.
- It involves searching for a spoken term within audio content using a spoken query.
- Similar to keyword spotting, except that the keyword is spoken!
- For high-resourced languages the solution is straightforward: just use a robust LVCSR system.
- The task's purpose is to build STD systems for under-resourced languages (languages with very few resources).
- Development data: 3 hours of phone-level annotated speech in several African languages: isiNdebele, SiSwati, Tshivenda and Xitsonga.
Our proposed STD solution
- A phone recognition system based on the architecture suggested in the NIST 2006 STD campaign.
- The audio content is indexed offline and the queries are searched online.
- Indexing stage: all the audio content is transformed into a sequence of phonemes (by means of ASR).
- Searching stage: the spoken terms are transformed into sequences of phonemes (by means of ASR) and searched for in the indexed content.
- The advantage: the searching stage is very fast compared with searching for the audio term directly in the audio content.
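The offline-indexing / online-search split can be sketched as below. This is a minimal illustration, not the authors' implementation: `recognize_phones` is a hypothetical stand-in for the ASR phone recognizer (it only looks up a made-up transcript), and the search is an exact subsequence match, which would be enough only if recognition were error-free.

```python
# Hypothetical recognizer output: audio id -> phone string (illustrative data).
FAKE_ASR_OUTPUT = {
    "content_1": "e u p l e c l a m",
    "content_2": "a m p l e c a t",
    "query_1": "p l e c",
}

def recognize_phones(audio_id):
    """Stand-in for the ASR phone recognizer (an assumption, not a real API)."""
    return FAKE_ASR_OUTPUT[audio_id].split()

def build_index(content_ids):
    """Offline stage: convert every content item to a phone sequence once."""
    return {cid: recognize_phones(cid) for cid in content_ids}

def search(index, query_id):
    """Online stage: decode the query, then look for it in the phone index."""
    query = recognize_phones(query_id)
    hits = []
    for cid, phones in index.items():
        for start in range(len(phones) - len(query) + 1):
            if phones[start:start + len(query)] == query:
                hits.append((cid, start))
    return hits

index = build_index(["content_1", "content_2"])
hits = search(index, "query_1")
```

Only `search` runs at query time; the costly recognition of the content happens once, in `build_index`.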
The indexing component
- Based on the Romanian ASR system for continuous speech:
  - HMM acoustic model trained with 70 hours of read speech
  - N-gram language model trained with 170 million words
  - Average performance: 18% WER on clean, read speech
- Fine-tuning the Romanian ASR system for phone error rate (PhER):
  - Created a phoneme N-gram LM
  - Tuned the parameters related to the relative beam width: PhER reduction from 36.8% to 31.4%
  - Tuned the language weight and phone insertion penalty: PhER reduction from 31.4% to 25.3%
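PhER is taken here to be the usual phone error rate: the edit distance (substitutions, deletions, insertions) between the recognized and the reference phone strings, divided by the reference length. The sketch below assumes that standard definition; the function name and the example strings are illustrative.

```python
def phone_error_rate(reference, hypothesis):
    """Levenshtein distance between two phone lists, divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_phone in enumerate(hypothesis, start=1):
            cur.append(min(
                prev[j - 1] + (ref_phone != hyp_phone),  # match or substitution
                prev[j] + 1,                             # deletion (reference phone missing)
                cur[j - 1] + 1,                          # insertion (extra recognized phone)
            ))
        prev = cur
    return prev[-1] / len(reference)

reference = "p l e c l a m".split()
hypothesis = "p l a c l m".split()  # one substitution and one deletion
pher = phone_error_rate(reference, hypothesis)
```

Here two errors over seven reference phones give a PhER of 2/7, roughly 28.6%.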
The indexing component - adaptation
- Phone mapping: 77 African phones -> 28 Romanian phones
  1) directly, using the IPA classification (if the phones are the same)
  2) to the closest phone according to the IPA classification (if any)
  3) based on a speech-recognition confusion matrix
- Adapting (MAP) the acoustic model using the African development speech data set: PhER reduction from 61.2% to 48.1%

ASR system                                         Evaluated on      PhER [%]
Romanian ASR, baseline for continuous speech       Romanian speech   36.8
Romanian ASR, after beam width tuning              Romanian speech   31.4
Romanian ASR, after language params tuning         Romanian speech   25.3
Romanian ASR, after language params tuning         African speech    61.2
Romanian ASR, after adaptation with African data   African speech    48.1
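The third mapping rule can be sketched as follows: for a source phone with no IPA counterpart, pick the target phone that the recognizer outputs most often for it. The phone labels and counts are hypothetical placeholders, not values from the development data, and the helper names are illustrative.

```python
def map_by_confusion(confusions):
    """Map each source phone to the target phone it is most often recognized as."""
    return {src: max(counts, key=counts.get) for src, counts in confusions.items()}

def apply_mapping(phones, mapping):
    """Rewrite a phone sequence; phones without a mapping entry are kept as they are."""
    return [mapping.get(p, p) for p in phones]

# Hypothetical confusion counts: source phone -> {recognized target phone: count}.
confusions = {
    "src_A": {"a": 40, "e": 7, "o": 3},
    "src_B": {"t": 12, "k": 25},
}
mapping = map_by_confusion(confusions)
mapped = apply_mapping(["src_A", "m", "src_B"], mapping)
```

In the full scheme this rule would apply only to phones left unmapped by the two IPA-based rules.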
The searching component: the DTWSS method
- If the ASR system's PhER were zero, the searching component could be a simple string search.
- The ASR system's PhER is not zero, so we propose DTWSS:
  - Aligns the query phone string to the content phone string
  - Uses a sliding window whose length is proportional to the query length (1.5x)
- DTWSS key features:
  - Short queries are penalized (amount controlled by alpha)
  - Spread-out DTW matches are penalized (amount controlled by beta)
- Alignment score: s is the product of three factors: (1 - PhER), a query-length factor in L_Q, L_QM and L_Qm (weighted by alpha), and a match-spread factor in L_W, L_S and L_Q (weighted by beta)
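A rough sketch of a sliding-window search in the spirit of DTWSS is given below. It is not the authors' scoring formula: the alignment is a plain edit-distance match with free start and end inside the window, and the two penalty factors (`shortness`, `spread`) are simplified assumptions chosen only to show how alpha and beta could scale the standard score 1 - PhER. All function and parameter names are illustrative.

```python
from math import ceil

def best_substring_alignment(query, window):
    """Return (edit distance, matched span length) for the best match of
    query against any contiguous part of window."""
    # Each cell is (cost, start): cost of aligning query[:i] to window[start:j].
    prev = [(0, j) for j in range(len(window) + 1)]  # a match may start anywhere
    for i, q in enumerate(query, start=1):
        cur = [(i, 0)]
        for j, w in enumerate(window, start=1):
            cur.append(min(
                (prev[j - 1][0] + (q != w), prev[j - 1][1]),  # match / substitution
                (prev[j][0] + 1, prev[j][1]),                 # query phone missing in content
                (cur[j - 1][0] + 1, cur[j - 1][1]),           # extra content phone
            ))
        prev = cur
    end = min(range(len(window) + 1), key=lambda j: (prev[j][0], j - prev[j][1]))
    cost, start = prev[end]
    return cost, end - start

def dtwss_score(query, window, alpha, beta, min_len, max_len):
    """Standard score (1 - PhER) scaled by two assumed penalty factors."""
    dist, span = best_substring_alignment(query, window)
    lq = len(query)
    pher = min(1.0, dist / lq)
    # Assumed form: 0 for the longest query, 1 for the shortest one.
    shortness = (max_len - lq) / (max_len - min_len) if max_len > min_len else 0.0
    # Assumed form: how much longer the matched span is than the query.
    spread = max(0, span - lq) / lq
    return (1 - pher) * (1 - alpha * shortness) * max(0.0, 1 - beta * spread)

def dtwss_search(query, content, alpha, beta, min_len, max_len):
    """Slide a window of 1.5x the query length; return (best score, window start)."""
    win = ceil(1.5 * len(query))
    best = (0.0, 0)
    for start in range(max(1, len(content) - win + 1)):
        score = dtwss_score(query, content[start:start + win], alpha, beta, min_len, max_len)
        if score > best[0]:
            best = (score, start)
    return best

content = "e u p l e c l a m".split()
exact = dtwss_search("p l e c".split(), content, alpha=0.0, beta=0.0, min_len=2, max_len=6)
short = dtwss_search("p l".split(), content, alpha=0.5, beta=0.4, min_len=2, max_len=6)
```

With alpha = beta = 0 the score reduces to 1 - PhER, the standard DTW baseline. With alpha = 0.5, the two-phone query scores lower than the four-phone query even though both match exactly.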
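Why is standard DTW not good?
- The standard DTW score would be: s = 1 - PhER
- Four query alignments against the same content C1 (* marks a phone that does not match):

C1: e u p l e c l a m

Query   Alignment           Score
Q1      p l e c * *         0.66
Q2      p * e * l a         0.66
Q3      p l *               0.66
Q4      e u p l e c * * *   0.66

- Every alignment has one error in three phones, so all four queries get the same score, regardless of the query length or of how spread out the match is.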
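The equal scores can be checked with a few lines of code, assuming that each `*` in an alignment marks one error and that PhER is the fraction of erroneous positions. The function name is illustrative.

```python
def standard_score(alignment):
    """Standard DTW score s = 1 - PhER, where '*' marks a mismatched position."""
    symbols = alignment.split()
    pher = symbols.count("*") / len(symbols)
    return 1 - pher

alignments = {
    "Q1": "p l e c * *",
    "Q2": "p * e * l a",
    "Q3": "p l *",
    "Q4": "e u p l e c * * *",
}
scores = {name: standard_score(a) for name, a in alignments.items()}
```

All four scores equal 2/3, shown as 0.66 on the slide, although Q3 is half the length of Q1 and the matches in Q2 are scattered.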
The searching component: tuning alpha and beta
- The DTWSS penalization parameters are tuned on the development STD database.
- The standard evaluation metric for STD is the Actual Term Weighted Value (ATWV) proposed by NIST (max value is 1).
- The standard DTW method (baseline) has alpha = beta = 0.

ATWV     α=0    α=0.1  α=0.2  α=0.4  α=0.6  α=0.8  α=1.0
β=0.0    0.21   0.21   0.22   0.22   0.23   0.22   0.22
β=0.2    0.29   0.29   0.31   0.30   0.32   0.28   0.25
β=0.4    0.31   0.33   0.33   0.33   0.33   0.34   0.30
β=0.6    0.31   0.32   0.32   0.32   0.33   0.31   0.31
β=0.8    0.28   0.31   0.30   0.32   0.30   0.28   0.26
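Picking the best setting from the grid is a simple arg-max, sketched below with the ATWV values copied from the grid above. The variable and function names are illustrative.

```python
ALPHAS = [0.0, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
ATWV = {  # beta -> ATWV for each alpha, in the order of ALPHAS
    0.0: [0.21, 0.21, 0.22, 0.22, 0.23, 0.22, 0.22],
    0.2: [0.29, 0.29, 0.31, 0.30, 0.32, 0.28, 0.25],
    0.4: [0.31, 0.33, 0.33, 0.33, 0.33, 0.34, 0.30],
    0.6: [0.31, 0.32, 0.32, 0.32, 0.33, 0.31, 0.31],
    0.8: [0.28, 0.31, 0.30, 0.32, 0.30, 0.28, 0.26],
}

def best_setting(table, alphas):
    """Return (atwv, alpha, beta) for the highest ATWV in the grid."""
    return max(
        (atwv, alpha, beta)
        for beta, row in table.items()
        for alpha, atwv in zip(alphas, row)
    )

best_atwv, best_alpha, best_beta = best_setting(ATWV, ALPHAS)
baseline = ATWV[0.0][0]  # alpha = beta = 0, i.e. standard DTW
```

On this grid the maximum is 0.34 at alpha = 0.8, beta = 0.4, against 0.21 for the standard DTW baseline.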
SWS task results

ATWV                   evalq-evalc   devq-devc
DTWSS (α=0.8 β=0.4)    0.31          0.49
DTWSS (α=0.6 β=0.6)    0.31          0.47
DTWSS (α=0.1 β=0.4)    0.27          0.47

25.10.2013 speed.pub.ro
Comparison with other methods

Team-Method            evalq-evalc   devq-devc
BUT-AKWS               0.49          0.49
JHU_HLTCOE-RAILS       0.36          0.38
TID-IRDTW              0.33          0.38
DTWSS (α=0.8 β=0.4)    0.31          0.49
TUM-CDTW               0.29          0.26
GTTS-Phone_Lattice     0.08          0.09
TUKE-DTWSVM            0             0

- Our STD system ranked in the middle.
- Our method performs similarly to those that treat STD as a pattern recognition problem by aligning speech features.
- The most accurate method has the highest computational cost (it does not perform offline indexing)!
Conclusions
- The Romanian ASR system was adapted to recognize African phones.
  - This is the indexing component of the STD system.
- A novel DTW method was proposed to address the search for imperfectly recognized queries in imperfectly recognized content.
  - This is the searching component of the STD system.
- Penalizing long DTW matches and short queries helped increase the ATWV.
- The STD system ranked in the middle of the MediaEval 2012 SWS competition (fourth of the seven compared methods on evalq-evalc).