Andi Buzo, Horia Cucu, Mihai Safta and Corneliu Burileanu. Speech & Dialogue (SpeeD) Research Laboratory University Politehnica of Bucharest (UPB)


The MediaEval 2012 SWS task

A multilingual, query-by-example Spoken Term Detection (STD) task. It involves searching for a spoken term within audio content using a spoken query; similar to keyword spotting, except that the keyword itself is spoken. The solution is straightforward for high-resourced languages: just use a robust LVCSR system. The task's purpose is to build STD systems for under-resourced languages (with very few resources). Development data: 3 hours of phone-level annotated speech in several African languages: isiNdebele, Siswati, Tshivenda and Xitsonga.

Our proposed STD solution

A phone recognition system based on the architecture suggested in the NIST 2006 STD campaign. The audio content is indexed offline, and the queries are searched online. In the indexing stage, all the audio content is transformed into a sequence of phonemes by means of ASR. In the searching stage, the query terms are likewise transformed into phoneme sequences and searched in the indexed content. The advantage: the searching stage is very fast, as opposed to searching for the audio term directly in the audio content.
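The offline-index / online-search split described above can be sketched as follows, using toy phone strings in place of real ASR output (function and variable names are illustrative, not the authors' implementation):

```python
def index_content(utterances):
    """Offline stage: map each audio file ID to its recognized phone sequence.
    A real system would produce these phone strings with an ASR decoder."""
    return {uid: phones.split() for uid, phones in utterances.items()}

def search_query(index, query_phones):
    """Online stage: substring search of the query phone sequence
    over the indexed content phone sequences."""
    q = query_phones.split()
    hits = []
    for uid, phones in index.items():
        for i in range(len(phones) - len(q) + 1):
            if phones[i:i + len(q)] == q:
                hits.append((uid, i))
    return hits

index = index_content({"utt1": "e u p l e c l a m"})
print(search_query(index, "p l e c"))  # [('utt1', 2)]
```

Because the index is just symbol sequences, the online stage is a cheap discrete search rather than an acoustic comparison, which is exactly why it is fast.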

The indexing component

Based on the Romanian ASR system for continuous speech: an HMM acoustic model trained on 70 hours of read speech and an N-gram language model trained on 170 million words; average performance: 18% WER on clean, read speech. Fine-tuning the Romanian ASR system for phone error rate (PhER): created a phoneme N-gram LM; tuned the beam-width-related parameters (PhER reduced from 36.8% to 31.4%); tuned the language weight and phone insertion penalty (PhER reduced from 31.4% to 25.3%).

The indexing component: adaptation

Phone mapping, 77 African phones -> 28 Romanian phones, done:
1) directly, using the IPA classification (if the same phone exists);
2) to the closest phone according to the IPA classification (if any);
3) based on a speech-recognition confusion matrix.
Adapting (MAP) the acoustic model with the African development speech data set reduced the PhER from 61.2% to 48.1%.

ASR system                                         Evaluated on      PhER [%]
Romanian ASR, baseline for continuous speech       Romanian speech   36.8
Romanian ASR, after beam-width tuning              Romanian speech   31.4
Romanian ASR, after language-parameter tuning      Romanian speech   25.3
Romanian ASR, after language-parameter tuning      African speech    61.2
Romanian ASR, after adaptation with African data   African speech    48.1
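Step 3 of the phone mapping can be sketched as picking, for each foreign phone, the target phone it is most often recognized as. The confusion counts below are hypothetical, purely for illustration:

```python
def map_phones(confusions):
    """For each foreign phone, choose the target phone with the highest
    confusion count, i.e. the phone the recognizer most often outputs for it.
    confusions: {foreign_phone: {target_phone: count}}"""
    return {src: max(counts, key=counts.get) for src, counts in confusions.items()}

# Hypothetical confusion counts for two African phones vs. Romanian phones
confusions = {
    "hl": {"l": 40, "z": 12, "r": 3},
    "tsh": {"t": 25, "s": 20},
}
print(map_phones(confusions))  # {'hl': 'l', 'tsh': 't'}
```

The same idea extends to ties or sparse counts by falling back to the IPA-based mapping of steps 1 and 2.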

The searching component: the DTWSS method

If the ASR system's PhER were zero, the searching component could be a simple string search. Since the PhER is not zero, we propose DTWSS, which aligns the query phone string to the content phone string using a sliding window whose length is proportional (1.5x) to the query length. DTWSS key features: short queries are penalized (amount controlled by alpha) and spread-out DTW matches are penalized (amount controlled by beta). Alignment score:

s = (1 - PhER) * (1 - alpha * (L_QM - L_Q) / L_QM) * (1 - beta * (L_S - L_Q) / L_W)

where L_Q is the query length, L_QM the maximum query length, L_S the length of the matched content segment, and L_W the sliding-window length.
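A minimal sketch of such a penalized score (the function and argument names are illustrative, and the exact normalization in the paper's formula may differ):

```python
def dtwss_score(pher, l_q, l_qmax, l_s, l_w, alpha, beta):
    """Penalized alignment score: base accuracy (1 - PhER), discounted
    when the query is short (alpha term) and when the DTW match is spread
    over a content segment longer than the query (beta term)."""
    base = 1.0 - pher
    short_penalty = 1.0 - alpha * (l_qmax - l_q) / l_qmax
    spread_penalty = 1.0 - beta * (l_s - l_q) / l_w
    return base * short_penalty * spread_penalty

# With alpha = beta = 0 this reduces to the standard DTW score 1 - PhER.
print(dtwss_score(0.34, l_q=4, l_qmax=8, l_s=6, l_w=6, alpha=0.0, beta=0.0))  # 0.66
print(dtwss_score(0.34, l_q=4, l_qmax=8, l_s=6, l_w=6, alpha=0.6, beta=0.4))
```

Setting alpha and beta to zero recovers the baseline, which is why the baseline appears as the alpha = beta = 0 cell in the tuning grid on the next slide.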

Why standard DTW is not good

The standard DTW score would be s = 1 - PhER. Very different matches of queries against the content C1 = "e u p l e c l a m" then receive the same score ("*" marks a query phone with no counterpart in the aligned content):

Query                     Score
Q1: p l e c * *           0.66
Q2: p * e * l a           0.66
Q3: p l *                 0.66
Q4: e u p l e c * * *     0.66
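The flaw is easy to reproduce: the standard score depends only on the fraction of matched query phones, so the four alignments above are indistinguishable (toy computation, not the authors' code):

```python
def standard_dtw_score(n_matched, query_len):
    """Standard score s = 1 - PhER: the fraction of query phones matched."""
    return n_matched / query_len

# The four slide examples as (matched phones, query length) pairs:
matches = [(4, 6), (4, 6), (2, 3), (6, 9)]
print([round(standard_dtw_score(m, n), 2) for m, n in matches])
# -> [0.67, 0.67, 0.67, 0.67]: all four tie at ~0.66, despite Q3 matching
#    only two phones and Q4 matching six.
```

DTWSS breaks such ties by explicitly penalizing short queries and spread-out matches.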

The searching component: tuning alpha and beta

The DTWSS penalization parameters are tuned on the development STD database. The standard evaluation metric for STD is the Actual Term Weighted Value (ATWV) proposed by NIST (maximum value 1). The standard DTW method (baseline) has alpha = beta = 0.

ATWV     α=0    α=0.1  α=0.2  α=0.4  α=0.6  α=0.8  α=1.0
β=0.0    0.21   0.21   0.22   0.22   0.23   0.22   0.22
β=0.2    0.29   0.29   0.31   0.30   0.32   0.28   0.25
β=0.4    0.31   0.33   0.33   0.33   0.33   0.34   0.30
β=0.6    0.31   0.32   0.32   0.32   0.33   0.31   0.31
β=0.8    0.28   0.31   0.30   0.32   0.30   0.28   0.26
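Selecting the operating point from such a grid is a simple argmax over development ATWV; a sketch (the grid values below are a subset of the table):

```python
def tune(grid_scores):
    """Pick the (alpha, beta) pair with the best development ATWV."""
    return max(grid_scores, key=grid_scores.get)

# A few (alpha, beta) -> ATWV cells from the development grid
grid = {(0.8, 0.4): 0.34, (0.6, 0.4): 0.33, (0.0, 0.0): 0.21}
print(tune(grid))  # (0.8, 0.4)
```

The baseline cell (0, 0) scores 0.21, so the penalties account for most of the ATWV gain on development data.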

SWS task results

System                   ATWV (eval queries / eval content)   ATWV (dev queries / dev content)
DTWSS (α=0.8, β=0.4)     0.31                                 0.49
DTWSS (α=0.6, β=0.6)     0.31                                 0.47
DTWSS (α=0.1, β=0.4)     0.27                                 0.47

speed.pub.ro, 25.10.2013

Comparison with other methods

Team - Method            ATWV (eval q. / eval c.)   ATWV (dev q. / dev c.)
BUT-AKWS                 0.49                       0.49
JHU_HLTCOE-RAILS         0.36                       0.38
TID-IRDTW                0.33                       0.38
DTWSS (α=0.8, β=0.4)     0.31                       0.49
TUM-CDTW                 0.29                       0.26
GTTS-Phone_Lattice       0.08                       0.09
TUKE-DTWSVM              0                          0

Our STD system ranked in the middle. Our method performs similarly to those that treat STD as a pattern recognition problem by aligning speech features. The most accurate method has the highest computational cost (it does not perform offline indexing)!

Conclusions

The Romanian ASR was adapted to recognize African phones; this is the indexing component of the STD system. A novel DTW method was proposed to address the search of imperfectly recognized queries in imperfectly recognized content; this is the searching component of the STD system. Penalizing long DTW matches and short queries helped increase the ATWV. The STD system ranked well in the MediaEval 2012 SWS competition.