Machine Learning in Speech Synthesis. Alan W Black Language Technologies Institute Carnegie Mellon University Sept 2009
2 Overview
- Speech Synthesis History and Overview
  - From hand-crafted to data-driven techniques
- Text to Speech Processes
- Waveform synthesis
  - Unit selection and Statistical Parametric Synthesis
- Evaluation
- Voice conversion
- Project ideas
3 Physical Models
- Blowing air through tubes: von Kempelen's synthesizer, 1791
- Synthesis by physical models: Homer Dudley's Voder, 1939
4 More Computation More Data
- Formant synthesis (60s-80s)
  - Waveform construction from components
- Diphone synthesis (80s-90s)
  - Waveform by concatenation of small number of instances of speech
- Unit selection (90s-00s)
  - Waveform by concatenation of very large number of instances of speech
- Statistical Parametric Synthesis (00s-..)
  - Waveform construction from parametric models
5 Waveform Generation
- Formant synthesis
- Random word/phrase concatenation
- Phone concatenation
- Diphone concatenation
- Sub-word unit selection
- Cluster based unit selection
- Statistical Parametric Synthesis
6 Speech Synthesis
- Text Analysis
  - Chunking, tokenization, token expansion
- Linguistic Analysis
  - Pronunciations
  - Prosody
- Waveform generation
  - From phones and prosody to waveforms
7 Text processing
- Find the words
  - Splitting tokens too, e.g. 04/11/2009
  - Removing punctuation
- Identifying word types
  - Numbers: years, quantities, ordinals
  - 1996 sheep were stolen on 25 Nov 1996
- Identifying words/abbreviations
  - CIA, 10m, 12sf, WeH7200
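The token-splitting step can be sketched as below; the date pattern, the day-first reading, and the `expand_token`/`ordinal` names are illustrative only, not Festival's actual text-normalization API:

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def ordinal(n):
    # "4" -> "4th"; a full system would expand this to the word "fourth".
    if n % 100 in (11, 12, 13):
        return f"{n}th"
    return f"{n}{ {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th') }"

def expand_token(tok):
    # Expand one non-standard token into words; only a DD/MM/YYYY date
    # pattern is handled here, as an illustration of the general idea.
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", tok)
    if m:
        day, month, year = (int(g) for g in m.groups())
        return [MONTHS[month - 1], ordinal(day), str(year)]
    return [tok]
```

Real systems also need type disambiguation first (is "1996" a year or a quantity?), which is exactly the classification problem the slide describes.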
8 Pronunciations
- Giving pronunciation for each word
  - A phoneme string (plus tone, stress)
- A constructed lexicon
  - ("pencil" n (p eh1 n s ih l))
  - ("two" n (t uw1))
- Letter to sound rules
  - Pronunciation of out of vocabulary words
  - Machine learning prediction from letters
9 Pronunciation of Unknown Words
- How do you pronounce new words?
- 4% of tokens (in news) are new
- You can't synthesize them without pronunciations
- You can't recognize them without pronunciations
- Letter-to-sound rules
- Grapheme-to-phoneme rules
10 LTS: Hand written
- Hand written rules
  - [LeftContext] X [RightContext] -> Y
  - e.g.
    - c [h r] -> k
    - c [h] -> ch
    - c [i] -> s
    - c -> k
11 LTS: Machine Learning Techniques
- Need an existing lexicon
  - Pronunciations: words and phones
  - But different number of letters and phones
- Need an alignment
  - Between letters and phones
  - checked -> ch eh k t
12 LTS: alignment
checked -> ch eh k t

  c   h   e   c   k   e   d
  ch  _   eh  k   _   _   t

- Some letters go to nothing
- Some letters go to two phones
  - box -> b aa k-s
  - table -> t ey b ax-l
13 Find alignment automatically
- Epsilon scattering
  - Find all possible alignments
  - Estimate p(l,p) on each alignment
  - Find most probable alignment
- Hand seed
  - Hand specify allowable pairs
  - Estimate p(l,p) on each possible alignment
  - Find most probable alignment
- Statistical Machine Translation (IBM model 1)
  - Estimate p(l,p) on each possible alignment
  - Find most probable alignment
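The enumerate-and-score idea behind these methods can be sketched as follows, with each letter consuming 0, 1, or 2 phones; the log-probability table passed in is a hand-made stand-in for the p(l,p) estimates:

```python
def alignments(letters, phones):
    # Enumerate all alignments where each letter maps to 0, 1, or 2 phones;
    # "_" marks epsilon, a two-phone mapping is written "k-s" style.
    if not letters:
        if not phones:
            yield []
        return
    for n in (0, 1, 2):
        if n <= len(phones):
            pair = (letters[0], "_" if n == 0 else "-".join(phones[:n]))
            for tail in alignments(letters[1:], phones[n:]):
                yield [pair] + tail

def best_alignment(letters, phones, logp, floor=-10.0):
    # Score each alignment by summed per-pair log probabilities and keep
    # the most probable one.
    return max(alignments(letters, phones),
               key=lambda a: sum(logp.get(p, floor) for p in a))
```

In a real system the p(l,p) table is not given but re-estimated from the alignments themselves (an EM loop), which is the difference between the three bullets above.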
14 Not everything aligns
- 0, 1, and 2 letter cases
  - e -> epsilon (moved)
  - x -> k-s, g-z (box, example)
  - e -> y-uw (askew)
- Some alignments aren't sensible
  - dept -> d ih p aa r t m ax n t
  - cmu -> s iy eh m y uw
15 Training LTS models
- Use CART trees
  - One model for each letter
- Predict phone (epsilon, phone, dual phone)
  - From letter 3-context (and POS)
- # # # c h e c -> ch
- # # c h e c k -> _
- # c h e c k e -> eh
- c h e c k e d -> k
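The training instances above can be generated mechanically from an aligned lexicon entry; a small sketch (the CART training itself, e.g. with Wagon, is not shown):

```python
def letter_windows(word, aligned_phones, width=3):
    # One training row per letter: `width` letters of context on each side
    # ("#" pads the word edges) and the aligned phone as the target class.
    pad = ["#"] * width
    letters = pad + list(word) + pad
    return [(letters[i:i + 2 * width + 1], aligned_phones[i])
            for i in range(len(word))]
```

Running this on the "checked" alignment reproduces the four example windows on the slide.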
16 LTS results
Split lexicon into train/test 90%/10% (i.e. every tenth entry is extracted for testing)

  Lexicon     OALD     CMUDICT   BRULEX   DE-CELEX   Thai
  Letter Acc  95.80%   91.99%    99.00%   98.79%     95.60%
  Word Acc    75.56%   57.80%    93.03%   89.38%     68.76%
17 Example Tree
18 But we need more than phones
- What about lexical stress?
  - p r aa1 j eh k t -> p r aa j eh1 k t
- Two possibilities:
  - A separate prediction model
  - Joint model: introduce eh/eh1 (BETTER)

           L no S   Letter   W no S   Word
  LTP+S    96.36%            63.68%
  LTPS     96.27%   95.80%   74.69%   74.56%
19 Does it really work?
- 40K words from Time Magazine
- 1775 (4.6%) not in OALD
- LTS gets 70% correct (test set was 74%)
- (table: occurrence % by error type — Names, Unknown, US Spelling, Typos)
20 Prosody Modeling
- Phrasing
  - Where to take breaths
- Intonation
  - Where (and what size) are accents
  - F0 realization
- Duration
  - What is the length of each phoneme?
21 Intonation Contour
22 Unit Selection vs Parametric
- Unit Selection: the standard method
  - Select appropriate sub-word units from large databases of natural speech
- Parametric Synthesis [NITECH: Tokuda et al]
  - HMM-generation based synthesis
  - Cluster units to form models
  - Generate from the models
  - Take average of units
23 Unit Selection
- Target cost and Join cost [Hunt and Black 96]
  - Target cost is distance from desired unit to actual unit in the database
    - Based on phonetic, prosodic, metrical context
  - Join cost is how well the selected units join
24 Hunt and Black Costs
25 HB Unit Selection
- Use Viterbi to find best set of units
- Note: finding longest units is typically not optimal
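A toy version of the Hunt and Black search; the two cost functions here are placeholders for the real phonetic/prosodic target cost and spectral join cost:

```python
def select_units(candidates, target_cost, join_cost):
    # candidates[t] lists the database units available for target position t.
    # Viterbi search minimizing summed target + join cost (Hunt & Black 96):
    # keep, for each candidate at position t, the cheapest path reaching it.
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for t in range(1, len(candidates)):
        new = []
        for u in candidates[t]:
            cost, path = min((c + join_cost(p[-1], u), p) for c, p in best)
            new.append((cost + target_cost(t, u), path + [u]))
        best = new
    return min(best)[1]
```

This is exhaustive over candidates at each position; real engines prune the candidate lists and often work on HMM-state sized units rather than whole phones.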
26 Clustering Units
- Cluster units [Donovan et al 96, Black et al 97]
- Moves calculation to compile time
27 Unit Selection Issues
- Cost metrics
  - Finding best weights, best techniques etc
- Database design
  - Best database coverage
- Automatic labeling accuracy
  - Finding errors/confidence
- Limited domain:
  - Target the databases to a particular application
  - Talking clocks
  - Targeted domain synthesis
28 Old vs New
- Unit Selection:
  - large, carefully labelled database
  - quality good when good examples available
  - quality will sometimes be bad
  - no control of prosody
- Parametric Synthesis:
  - smaller, less carefully labelled database
  - quality consistent
  - resynthesis requires vocoder (buzzy)
  - can (must) control prosody
  - model size much smaller than Unit DB
29 Parametric Synthesis
- Probabilistic models
- Simplification
- Generative model
  - Predict acoustic frames from text
30 Trajectories
- Frame (state) based prediction
  - Ignores dynamics
- Various solutions
  - MLPG (maximum likelihood parameter generation)
  - Trajectory HMMs
  - Global Variance
  - MGE (minimal generation error)
31 SPSS Systems
- HTS (NITECH)
  - Based on HTK
  - Predicts HMM-states
  - (Default) uses MCEP and MLSA filter
  - Supported in Festival
- Clustergen (CMU)
  - No use of HTK
  - Predicts frames
  - (Default) uses MCEP and MLSA filter
  - More tightly coupled with Festival
32 FestVox CLUSTERGEN Synthesizer
Requires:
- Prompt transcriptions (txt.done.data)
- Waveform files (well recorded)
- FestVox Labelling
  - EHMM (Kishore)
  - Context Independent models and forced alignment
  - (Have used Janus labels too)
- Parameter extraction:
  - (HTS's) melcep/MLSA filter for resynthesis
  - F0 extraction
- Clustering
  - Wagon vector clustering for each HMM-state name
33 Clustering by CART
- Update to Wagon (Edinburgh Speech Tools)
  - Tight coupling of features with FestVox utts
  - Support for arbitrary vectors
  - Define impurity on clusters of N vectors
- Clustering F0 and MCEP
  - Tested jointly and separately
- Features for clustering (51):
  - phonetic, syllable, phrasal context
34 Training Output
Three models:
- Spectral (MCEP) CART tree
- F0 CART tree
- Duration CART tree
F0 model:
- Smoothed extracted F0 through all speech (i.e. unvoiced regions get F0 values)
- Choose voicing at runtime phonetically
35 CLUSTERGEN Synthesis
- Generate phoneme strings (as before)
- For each phone:
  - Find HMM-state names: ah_1, ah_2, ah_3
  - Predict duration of each
  - Create empty MCEP vector to fill duration
  - Predict MCEP values from cluster tree
  - Predict F0 value from cluster tree
- Use MLSA filter to regenerate speech
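The per-phone loop above can be sketched as follows; `duration`, `predict_mcep`, and `predict_f0` stand in for lookups in the three CART trees, and the 5 ms frame shift is an assumption, not a CLUSTERGEN requirement:

```python
def synthesize_frames(phones, duration, predict_mcep, predict_f0,
                      frame_shift=0.005):
    # For each phone, expand its three HMM-state names, predict a duration
    # per state, then fill that many frames from the cluster trees.
    track = []
    for ph in phones:
        for state in (f"{ph}_1", f"{ph}_2", f"{ph}_3"):
            n_frames = max(1, round(duration(state) / frame_shift))
            for _ in range(n_frames):
                track.append((predict_f0(state), predict_mcep(state)))
    return track  # one (F0, MCEP vector) pair per frame
```

The resulting frame track is what the MLSA filter consumes to regenerate speech.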
36 Objective Score CLUSTERGEN
- Mean Mel Cepstral Distortion over test set
- MCD: Voice Conversion ranges
- MCD: CG scores
- Smaller is better
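MCD between reference and resynthesized MCEP frames is conventionally computed as (10/ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²), averaged over aligned frames; a minimal version, assuming the frame sequences are already time-aligned and that index 0 is the excluded energy term c0:

```python
import math

def mcd(ref, syn):
    # Mean Mel Cepstral Distortion in dB between two aligned sequences of
    # MCEP vectors; c0 (the energy term) is excluded by convention.
    k = 10.0 / math.log(10.0)
    total = 0.0
    for r, s in zip(ref, syn):
        total += k * math.sqrt(2.0 * sum((a - b) ** 2
                                         for a, b in zip(r[1:], s[1:])))
    return total / len(ref)
```

Identical frames give 0 dB; each unit of cepstral difference adds about 6.14 dB for a single coefficient.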
37 Example CG Voices
- 7 Arctic databases: 1200 utterances, 43K segs, 1hr speech
- awb bdl clb ksp slt jmk rms
38 Database size vs Quality
(table: slt_arctic data size — Utts, Clusters, RMS F0, MCD)
39 Making it Better
- Label data, build model
- But maybe there are better labels
- So find labels that maximize model accuracy
40 Move Labels
(figure: cluster trees for states a_1 and a_2 — predicted Gaussians vs actual values)
41 Move Labels
- Use EHMM to label segments/HMM states
- Build Clustergen model
- Iterate:
  - Predict cluster (mean/std) for each frame
  - For each label boundary:
    - If dist(actual_after, pred_before) < dist(actual_after, pred_after): move label forward
    - If dist(actual_before, pred_after) < dist(actual_before, pred_before): move label backward
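One boundary update from the loop above, sketched with a generic frame distance; `dist` stands in for the static-MCEP Euclidean distance discussed on the next slide:

```python
def move_boundary(boundary, frames, pred_before, pred_after, dist):
    # One application of the move-labels rule: shift the boundary one frame
    # when the adjacent frame is closer to the other segment's prediction.
    actual_after = frames[boundary]        # first frame after the boundary
    actual_before = frames[boundary - 1]   # last frame before the boundary
    if dist(actual_after, pred_before) < dist(actual_after, pred_after):
        return boundary + 1   # frame really belongs to the earlier segment
    if dist(actual_before, pred_after) < dist(actual_before, pred_before):
        return boundary - 1   # frame really belongs to the later segment
    return boundary
```

Iterating this over all boundaries, then rebuilding the model, gives the relabel-retrain loop of the slide.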
42 Distance Metric
- Distance from predicted to actual
  - Euclidean
  - F0, static, deltas, voicing
  - With/without standard deviation normalization
  - Weighting
- Best choice
  - Static without stddev normalization
  - (This is closest to MCD)
43 ML with 10 iterations
- rms voice (66 minutes of speech)
- train 1019 utts, test 113 utts (every tenth)
(table: Pass, Move, +ve, -ve, MCD, stddev, F0)
44 Move Labels
(table: per-voice results for base, 2008, and ml builds — ahw awb bdl clb jmk ksp rms rxr slt)
- Average improvement (excluding awb)
45 Does it sound better?
- rms
  - abtest (10 utterances): ml 7, base 1, = 2
- slt
  - abtest: ml 7, base 2, = 1
46 Arctic MLSB improvements
(chart: Move Label Segment Boundaries — delta MCD, spectrum (MCD 1-24) vs power (MCD cep0), for voices sbb ksp slt rms tta jmk rxr ahw clb jdb awb)
47 Grapheme Based Synthesis
- Synthesis without a phoneme set
- Use the letters as phonemes
  - ("alan" nil (a l a n))
  - ("black" nil (b l a c k))
- Spanish (easier?)
  - 419 utterances
  - HMM training to label databases
  - Simple pronunciation rules
  - Policía -> p o l i c i a
  - Cuatro -> c u a t r o
48 Spanish Grapheme Synthesis
49 English Grapheme Synthesis
- Use letters as phones
- 26 phonemes
- ("alan" n (a l a n))
- ("black" n (b l a c k))
- Build HMM acoustic models for labeling
- For English:
  - This is a pen
  - We went to the church at Christmas
  - Festival intro
  - do eight meat
- Requires method to fix errors
  - Letter to letter mapping
50 Common Data Sets
- Data-driven techniques need data
- Diphone Databases
  - CSTR and CMU US English Diphone sets (kal and ked)
- CMU ARCTIC Databases
  - 1200 phonetically balanced utterances (about 1 hour)
  - 7 different speakers (2 male, 2 female, 3 accented)
  - EGG, phonetically labeled
  - Utterances chosen from out-of-copyright text
  - Easy to say
  - Freely distributable
  - Tools to build your own in your own language
51 Blizzard Challenge
- Realistic evaluation
  - Under the same conditions
- Blizzard Challenge [Black and Tokuda]
  - Participants build voice from common dataset
  - Synthesize test sentences
  - Large set of listening experiments
  - Since 2005, now in 4th year
  - 18 groups in 2008
52 How to test synthesis
- Blizzard tests:
  - Do you like it? (MOS scores)
  - Can you understand it?
    - SUS sentence: The unsure steaks overcame the zippy rudder
- Can't this be done automatically?
  - Not yet (at least not reliably enough)
  - But we now have lots of data for training techniques
- Why does it still sound like a robot?
  - Need better (appropriate) testing
53 SUS Sentences
- sus_00022
- sus_00012
- sus_00005
- sus_00017
54 SUS Sentences
- The serene adjustments foresaw the acceptable acquisition
- The temperamental gateways forgave the weatherbeaten finalist
- The sorrowful premieres sang the ostentatious gymnast
- The disruptive billboards blew the sugary endorsement
55 Voice Identity
- What makes a voice identity?
  - Lexical choice: "Woo-hoo", "I pity the fool"
  - Phonetic choice
  - Intonation and duration
  - Spectral qualities (vocal tract shape)
  - Excitation
56 Voice Conversion techniques
- Full ASR and TTS
  - Much too hard to do reliably
- Codebook transformation
  - ASR HMM state to HMM state transformation
- GMM based transformation
  - Build a mapping function between frames
57 Learning VC models
- First need to get parallel speech
  - Source and target say same thing
  - Use DTW to align (in the spectral domain)
  - Trying to learn a functional mapping from parallel utterances
- Text-independent VC
  - Means no parallel speech available
  - Use some form of synthesis to generate it
58 VC Training process
- Extract F0, power and MFCC from source and target utterances
- DTW align source and target
- Loop until convergence:
  - Build GMM to map between source/target
  - DTW source/target using GMM mapping
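The DTW alignment step is a standard dynamic program over frame pairs; a self-contained sketch (real VC training runs this on MFCC vectors with a spectral distance, not scalars):

```python
def dtw(src, tgt, dist):
    # Dynamic time warping: minimum-cost monotonic alignment of two frame
    # sequences; returns the matched (source, target) index pairs.
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    back = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev, cell = min(
                (cost[i - 1][j - 1], (i - 1, j - 1)),  # match both frames
                (cost[i - 1][j], (i - 1, j)),          # repeat target frame
                (cost[i][j - 1], (i, j - 1)))          # repeat source frame
            cost[i][j] = dist(src[i - 1], tgt[j - 1]) + prev
            back[(i, j)] = cell
    path, ij = [], (n, m)
    while ij in back:
        path.append((ij[0] - 1, ij[1] - 1))
        ij = back[ij]
    return path[::-1]
```

The aligned frame pairs are what the GMM mapping is trained on; re-running DTW with the mapped source (as in the convergence loop above) tightens the alignment.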
59 VC Training process
60 VC Run-time
61 Voice Transformation
- Festvox GMM transformation suite (Toda)
- (audio example matrix: conversions among awb, bdl, jmk, slt)
62 VC in Synthesis
- Can be used as a post filter in synthesis
  - Build kal_diphone to target VC
  - Use on all output of kal_diphone
- Can be used to convert a full DB
  - Convert a full DB and rebuild a voice
63 Style/Emotion Conversion
- Unit Selection (or SPS)
  - Requires lots of data in desired style/emotion
- VC technique
  - Use as filter on main voice (same speaker)
  - Convert neutral to angry, sad, happy
64 Can you say that again?
- Voice conversion for speaking in noise
- Different quality when you repeat things
- Different quality when you speak in noise
  - Lombard effect (when very loud)
  - Speech-in-noise (in regular noise)
65 Speaking in Noise (Langner)
- Collect data
  - Randomly play noise in person's ears
  - Normal
  - In Noise
- Collect 500 of each type
- Build VC model
  - Normal -> in-noise
- Actually
  - Spectral, duration, F0 and power differences
66 Synthesis in Noise
- For bus information task
- Play different synthesis information utts
  - With SIN synthesizer
  - With SWN synthesizer
  - With VC (SWN->SIN) synthesizer
- Measure their understanding
  - SIN synthesizer better (in noise)
  - SIN synthesizer better (without noise, for elderly)
67 Transterpolation
- Incrementally transform a voice X%
  - BDL-SLT by 10%
  - SLT-BDL by 10%
- Count when you think it changes from M to F
- Fun, but what are the uses?
68 De-identification
- Remove speaker identity
  - But keep it still human-like
- Health Records
  - HIPAA laws require this
  - Not just removing names and SSNs
- Remove identifiable properties
  - Use voice conversion to remove spectral properties
  - Use F0/duration mapping to remove prosodic properties
  - Use ASR/MT techniques to remove lexical properties
69 Summary
- Data-driven speech synthesis
  - Text processing
  - Prosody and pronunciation
  - Waveform synthesis
- Finding the right optimization
  - Find an objective metric that correlates with human perception
70 Potential Projects
- Find F0 in story telling
  - F0 is easy to find in isolated sentences
  - What about full paragraphs?
  - Storytellers use much wider range
- Find F0 shapes/accent types
  - Use HMM to recognize types of accents
  - (trajectory modeling)
  - Following Tilt and Moeller models
71 Parametric Synthesis
- Better parametric representation of speech
  - Particularly excitation parameterization
- Better acoustic measures of quality
  - Use Blizzard answers to build/check objective measure
- Statistical Klatt parametric synthesis
  - Using knowledge-based parameters
  - F0, aspiration, nasality, formants
  - Automatically derive Klatt parameters for db
  - Use them for statistical parametric synthesis
More informationInternational Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015
International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Communication media for Blinds Based on Voice Mrs.K.M.Sanghavi 1, Radhika Maru
More informationFP SIMPLE4ALL deliverable D6.5. Deliverable D6.5. Initial Public Release of Open Source Tools
Deliverable D6.5 Initial Public Release of Open Source Tools The research leading to these results has received funding from the European Community s Seventh Framework Programme (FP7/2007-2013) under grant
More informationText analysis. From a string of characters to a list of words , LTI, Carnegie Mellon
Text analysis From a string of characters to a list of words. This is a pen. Text analysis My cat who lives dangerously has nine lives. He stole $100 from the bank. He stole 1996 cattle on 25 Nov 1996.
More informationON-LINE SIMULATION MODULES FOR TEACHING SPEECH AND AUDIO COMPRESSION TECHNIQUES
ON-LINE SIMULATION MODULES FOR TEACHING SPEECH AND AUDIO COMPRESSION TECHNIQUES Venkatraman Atti 1 and Andreas Spanias 1 Abstract In this paper, we present a collection of software educational tools for
More informationWHO WANTS TO BE A MILLIONAIRE?
IDIAP COMMUNICATION REPORT WHO WANTS TO BE A MILLIONAIRE? Huseyn Gasimov a Aleksei Triastcyn Hervé Bourlard Idiap-Com-03-2012 JULY 2012 a EPFL Centre du Parc, Rue Marconi 19, PO Box 592, CH - 1920 Martigny
More informationCOMP 551 Applied Machine Learning Lecture 13: Unsupervised learning
COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551
More informationTHE IMPLEMENTATION OF CUVOICEBROWSER, A VOICE WEB NAVIGATION TOOL FOR THE DISABLED THAIS
THE IMPLEMENTATION OF CUVOICEBROWSER, A VOICE WEB NAVIGATION TOOL FOR THE DISABLED THAIS Proadpran Punyabukkana, Jirasak Chirathivat, Chanin Chanma, Juthasit Maekwongtrakarn, Atiwong Suchato Spoken Language
More informationAn Experimental Evaluation of Keyword-Filler Hidden Markov Models
An Experimental Evaluation of Keyword-Filler Hidden Markov Models A. Jansen and P. Niyogi April 13, 2009 Abstract We present the results of a small study involving the use of keyword-filler hidden Markov
More informationAUDIOVISUAL SYNTHESIS OF EXAGGERATED SPEECH FOR CORRECTIVE FEEDBACK IN COMPUTER-ASSISTED PRONUNCIATION TRAINING.
AUDIOVISUAL SYNTHESIS OF EXAGGERATED SPEECH FOR CORRECTIVE FEEDBACK IN COMPUTER-ASSISTED PRONUNCIATION TRAINING Junhong Zhao 1,2, Hua Yuan 3, Wai-Kim Leung 4, Helen Meng 4, Jia Liu 3 and Shanhong Xia 1
More informationa child-friendly word processor for children to write documents
Table of Contents Get Started... 1 Quick Start... 2 Classes and Users... 3 Clicker Explorer... 4 Ribbon... 6 Write Documents... 7 Document Tools... 8 Type with a Keyboard... 12 Write with a Clicker Set...
More informationSpeech Recognition, The process of taking spoken word as an input to a computer
Speech Recognition, The process of taking spoken word as an input to a computer program (Baumann) Have you ever found yourself yelling at your computer, wishing you could make it understand what you want
More informationIJETST- Vol. 03 Issue 05 Pages May ISSN
International Journal of Emerging Trends in Science and Technology Implementation of MFCC Extraction Architecture and DTW Technique in Speech Recognition System R.M.Sneha 1, K.L.Hemalatha 2 1 PG Student,
More informationAssignment 2. Unsupervised & Probabilistic Learning. Maneesh Sahani Due: Monday Nov 5, 2018
Assignment 2 Unsupervised & Probabilistic Learning Maneesh Sahani Due: Monday Nov 5, 2018 Note: Assignments are due at 11:00 AM (the start of lecture) on the date above. he usual College late assignments
More informationAUDIO. Henning Schulzrinne Dept. of Computer Science Columbia University Spring 2015
AUDIO Henning Schulzrinne Dept. of Computer Science Columbia University Spring 2015 Key objectives How do humans generate and process sound? How does digital sound work? How fast do I have to sample audio?
More informationModeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 10, OCTOBER 2013 2129 Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical
More informationHuman Computer Interaction Using Speech Recognition Technology
International Bulletin of Mathematical Research Volume 2, Issue 1, March 2015 Pages 231-235, ISSN: 2394-7802 Human Computer Interaction Using Recognition Technology Madhu Joshi 1 and Saurabh Ranjan Srivastava
More informationStarting-Up Fast with Speech-Over Professional
Starting-Up Fast with Speech-Over Professional Contents #1 Getting Ready... 2 Starting Up... 2 Initial Preferences Settings... 3 Adding a Narration Clip... 3 On-Line Tutorials... 3 #2: Creating a Synchronized
More informationLearning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis
Interspeech 2018 2-6 September 2018, Hyderabad Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Synthesis Xiao Zhou, Zhen-Hua Ling, Zhi-Ping Zhou, Li-Rong Dai National Engineering
More informationRECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS
RECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS Dipti D. Joshi, M.B. Zalte (EXTC Department, K.J. Somaiya College of Engineering, University of Mumbai, India) Diptijoshi3@gmail.com
More informationEditing Pronunciation in Clicker 5
Editing Pronunciation in Clicker 5 Depending on which computer you use for Clicker 5, you may notice that some words, especially proper names and technical terms, are mispronounced when you click a word
More informationFACE ANALYSIS AND SYNTHESIS FOR INTERACTIVE ENTERTAINMENT
FACE ANALYSIS AND SYNTHESIS FOR INTERACTIVE ENTERTAINMENT Shoichiro IWASAWA*I, Tatsuo YOTSUKURA*2, Shigeo MORISHIMA*2 */ Telecommunication Advancement Organization *2Facu!ty of Engineering, Seikei University
More informationCreate Swift mobile apps with IBM Watson services IBM Corporation
Create Swift mobile apps with IBM Watson services Create a Watson sentiment analysis app with Swift Learning objectives In this section, you ll learn how to write a mobile app in Swift for ios and add
More informationVocal Joystick User Guide
Vocal Joystick User Guide For Windows XP/Vista/Linux Version 0.5 beta The Vocal Joystick team includes: Jon Malkin Xiao Li Susumu Harada Jeff Bilmes Table of Contents Disclaimer... 3 About the Vocal Joystick...
More informationModel TS-04 -W. Wireless Speech Keyboard System 2.0. Users Guide
Model TS-04 -W Wireless Speech Keyboard System 2.0 Users Guide Overview TextSpeak TS-04 text-to-speech speaker and wireless keyboard system instantly converts typed text to a natural sounding voice. The
More informationTowards Audiovisual TTS
Towards Audiovisual TTS in Estonian Einar MEISTER a, SaschaFAGEL b and RainerMETSVAHI a a Institute of Cybernetics at Tallinn University of Technology, Estonia b zoobemessageentertainmentgmbh, Berlin,
More informationCMU Sphinx: the recognizer library
CMU Sphinx: the recognizer library Authors: Massimo Basile Mario Fabrizi Supervisor: Prof. Paola Velardi 01/02/2013 Contents 1 Introduction 2 2 Sphinx download and installation 4 2.1 Download..........................................
More informationMARATHI TEXT-TO-SPEECH SYNTHESISYSTEM FOR ANDROID PHONES
International Journal of Advances in Applied Science and Engineering (IJAEAS) ISSN (P): 2348-1811; ISSN (E): 2348-182X Vol. 3, Issue 2, May 2016, 34-38 IIST MARATHI TEXT-TO-SPEECH SYNTHESISYSTEM FOR ANDROID
More informationLanguage Live! Levels 1 and 2 correlated to the Arizona English Language Proficiency (ELP) Standards, Grades 5 12
correlated to the Standards, Grades 5 12 1 correlated to the Standards, Grades 5 12 ELL Stage III: Grades 3 5 Listening and Speaking Standard 1: The student will listen actively to the ideas of others
More informationEM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition
EM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition Yan Han and Lou Boves Department of Language and Speech, Radboud University Nijmegen, The Netherlands {Y.Han,
More informationHow accurate is AGNITIO KIVOX Voice ID?
How accurate is AGNITIO KIVOX Voice ID? Overview Using natural speech, KIVOX can work with error rates below 1%. When optimized for short utterances, where the same phrase is used for enrolment and authentication,
More informationUnsupervised Learning. Clustering and the EM Algorithm. Unsupervised Learning is Model Learning
Unsupervised Learning Clustering and the EM Algorithm Susanna Ricco Supervised Learning Given data in the form < x, y >, y is the target to learn. Good news: Easy to tell if our algorithm is giving the
More information