Machine Learning in Speech Synthesis. Alan W Black Language Technologies Institute Carnegie Mellon University Sept 2009
2 Overview
- Speech Synthesis History and Overview
  - From hand-crafted to data-driven techniques
- Text to Speech Processes
- Waveform synthesis
  - Unit selection and Statistical Parametric Synthesis
- Evaluation
- Voice conversion
- Project ideas
3 Physical Models
- Blowing air through tubes: von Kempelen's synthesizer, 1791
- Synthesis by physical models: Homer Dudley's Voder, 1939
4 More Computation More Data
- Formant synthesis (60s-80s)
  - Waveform construction from components
- Diphone synthesis (80s-90s)
  - Waveform by concatenation of small number of instances of speech
- Unit selection (90s-00s)
  - Waveform by concatenation of very large number of instances of speech
- Statistical Parametric Synthesis (00s-..)
  - Waveform construction from parametric models
5 Waveform Generation
- Formant synthesis
- Random word/phrase concatenation
- Phone concatenation
- Diphone concatenation
- Sub-word unit selection
- Cluster based unit selection
- Statistical Parametric Synthesis
6 Speech Synthesis
- Text Analysis
  - Chunking, tokenization, token expansion
- Linguistic Analysis
  - Pronunciations
  - Prosody
- Waveform generation
  - From phones and prosody to waveforms
7 Text processing
- Find the words
  - Splitting tokens too, e.g. 04/11/2009
  - Removing punctuation
- Identifying word types
  - Numbers: years, quantities, ordinals
  - 1996 sheep were stolen on 25 Nov 1996
- Identifying words/abbreviations
  - CIA, 10m, 12sf, WeH7200
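The token-splitting step can be sketched as below; the date pattern, the day-first reading, and the `expand_token`/`ordinal` names are illustrative only, not Festival's actual text-normalization API:

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def ordinal(n):
    # "4" -> "4th"; a full system would expand this to the word "fourth".
    if n % 100 in (11, 12, 13):
        return f"{n}th"
    return f"{n}{ {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th') }"

def expand_token(tok):
    # Expand one non-standard token into words; only a DD/MM/YYYY date
    # pattern is handled here, as an illustration of the general idea.
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", tok)
    if m:
        day, month, year = (int(g) for g in m.groups())
        return [MONTHS[month - 1], ordinal(day), str(year)]
    return [tok]
```

Real systems also need type disambiguation first (is "1996" a year or a quantity?), which is exactly the classification problem the slide describes.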
8 Pronunciations
- Giving pronunciation for each word
  - A phoneme string (plus tone, stress)
- A constructed lexicon
  - ("pencil" n (p eh1 n s ih l))
  - ("two" n (t uw1))
- Letter to sound rules
  - Pronunciation of out of vocabulary words
  - Machine learning prediction from letters
9 Pronunciation of Unknown Words
- How do you pronounce new words?
- 4% of tokens (in news) are new
- You can't synthesize them without pronunciations
- You can't recognize them without pronunciations
- Letter-to-sound rules
- Grapheme-to-phoneme rules
10 LTS: Hand written
- Hand written rules
  - [LeftContext] X [RightContext] -> Y
  - e.g.
    - c [h r] -> k
    - c [h] -> ch
    - c [i] -> s
    - c -> k
11 LTS: Machine Learning Techniques
- Need an existing lexicon
  - Pronunciations: words and phones
  - But different number of letters and phones
- Need an alignment
  - Between letters and phones
  - checked -> ch eh k t
12 LTS: alignment
checked -> ch eh k t

  c   h   e   c   k   e   d
  ch  _   eh  k   _   _   t

- Some letters go to nothing
- Some letters go to two phones
  - box -> b aa k-s
  - table -> t ey b ax-l
13 Find alignment automatically
- Epsilon scattering
  - Find all possible alignments
  - Estimate p(l,p) on each alignment
  - Find most probable alignment
- Hand seed
  - Hand specify allowable pairs
  - Estimate p(l,p) on each possible alignment
  - Find most probable alignment
- Statistical Machine Translation (IBM model 1)
  - Estimate p(l,p) on each possible alignment
  - Find most probable alignment
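The enumerate-and-score idea behind these methods can be sketched as follows, with each letter consuming 0, 1, or 2 phones; the log-probability table passed in is a hand-made stand-in for the p(l,p) estimates:

```python
def alignments(letters, phones):
    # Enumerate all alignments where each letter maps to 0, 1, or 2 phones;
    # "_" marks epsilon, a two-phone mapping is written "k-s" style.
    if not letters:
        if not phones:
            yield []
        return
    for n in (0, 1, 2):
        if n <= len(phones):
            pair = (letters[0], "_" if n == 0 else "-".join(phones[:n]))
            for tail in alignments(letters[1:], phones[n:]):
                yield [pair] + tail

def best_alignment(letters, phones, logp, floor=-10.0):
    # Score each alignment by summed per-pair log probabilities and keep
    # the most probable one.
    return max(alignments(letters, phones),
               key=lambda a: sum(logp.get(p, floor) for p in a))
```

In a real system the p(l,p) table is not given but re-estimated from the alignments themselves (an EM loop), which is the difference between the three bullets above.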
14 Not everything aligns
- 0, 1, and 2 letter cases
  - e -> epsilon (moved)
  - x -> k-s, g-z (box, example)
  - e -> y-uw (askew)
- Some alignments aren't sensible
  - dept -> d ih p aa r t m ax n t
  - cmu -> s iy eh m y uw
15 Training LTS models
- Use CART trees
  - One model for each letter
- Predict phone (epsilon, phone, dual phone)
  - From letter 3-context (and POS)
- # # # c h e c -> ch
- # # c h e c k -> _
- # c h e c k e -> eh
- c h e c k e d -> k
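The training instances above can be generated mechanically from an aligned lexicon entry; a small sketch (the CART training itself, e.g. with Wagon, is not shown):

```python
def letter_windows(word, aligned_phones, width=3):
    # One training row per letter: `width` letters of context on each side
    # ("#" pads the word edges) and the aligned phone as the target class.
    pad = ["#"] * width
    letters = pad + list(word) + pad
    return [(letters[i:i + 2 * width + 1], aligned_phones[i])
            for i in range(len(word))]
```

Running this on the "checked" alignment reproduces the four example windows on the slide.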
16 LTS results
Split lexicon into train/test 90%/10% (i.e. every tenth entry is extracted for testing)

  Lexicon     OALD     CMUDICT   BRULEX   DE-CELEX   Thai
  Letter Acc  95.80%   91.99%    99.00%   98.79%     95.60%
  Word Acc    75.56%   57.80%    93.03%   89.38%     68.76%
17 Example Tree
18 But we need more than phones
- What about lexical stress?
  - p r aa1 j eh k t -> p r aa j eh1 k t
- Two possibilities:
  - A separate prediction model
  - Joint model: introduce eh/eh1 (BETTER)

           L no S   Letter   W no S   Word
  LTP+S    96.36%            63.68%
  LTPS     96.27%   95.80%   74.69%   74.56%
19 Does it really work?
- 40K words from Time Magazine
- 1775 (4.6%) not in OALD
- LTS gets 70% correct (test set was 74%)
- (table: occurrence % by error type — Names, Unknown, US Spelling, Typos)
20 Prosody Modeling
- Phrasing
  - Where to take breaths
- Intonation
  - Where (and what size) are accents
  - F0 realization
- Duration
  - What is the length of each phoneme?
21 Intonation Contour
22 Unit Selection vs Parametric
- Unit Selection: the standard method
  - Select appropriate sub-word units from large databases of natural speech
- Parametric Synthesis [NITECH: Tokuda et al]
  - HMM-generation based synthesis
  - Cluster units to form models
  - Generate from the models
  - Take average of units
23 Unit Selection
- Target cost and Join cost [Hunt and Black 96]
  - Target cost is distance from desired unit to actual unit in the database
    - Based on phonetic, prosodic, metrical context
  - Join cost is how well the selected units join
24 Hunt and Black Costs
25 HB Unit Selection
- Use Viterbi to find best set of units
- Note: finding longest units is typically not optimal
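A toy version of the Hunt and Black search; the two cost functions here are placeholders for the real phonetic/prosodic target cost and spectral join cost:

```python
def select_units(candidates, target_cost, join_cost):
    # candidates[t] lists the database units available for target position t.
    # Viterbi search minimizing summed target + join cost (Hunt & Black 96):
    # keep, for each candidate at position t, the cheapest path reaching it.
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for t in range(1, len(candidates)):
        new = []
        for u in candidates[t]:
            cost, path = min((c + join_cost(p[-1], u), p) for c, p in best)
            new.append((cost + target_cost(t, u), path + [u]))
        best = new
    return min(best)[1]
```

This is exhaustive over candidates at each position; real engines prune the candidate lists and often work on HMM-state sized units rather than whole phones.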
26 Clustering Units
- Cluster units [Donovan et al 96, Black et al 97]
- Moves calculation to compile time
27 Unit Selection Issues
- Cost metrics
  - Finding best weights, best techniques etc
- Database design
  - Best database coverage
- Automatic labeling accuracy
  - Finding errors/confidence
- Limited domain:
  - Target the databases to a particular application
  - Talking clocks
  - Targeted domain synthesis
28 Old vs New
- Unit Selection:
  - large, carefully labelled database
  - quality good when good examples available
  - quality will sometimes be bad
  - no control of prosody
- Parametric Synthesis:
  - smaller, less carefully labelled database
  - quality consistent
  - resynthesis requires vocoder (buzzy)
  - can (must) control prosody
  - model size much smaller than Unit DB
29 Parametric Synthesis
- Probabilistic models
- Simplification
- Generative model
  - Predict acoustic frames from text
30 Trajectories
- Frame (state) based prediction
  - Ignores dynamics
- Various solutions
  - MLPG (maximum likelihood parameter generation)
  - Trajectory HMMs
  - Global Variance
  - MGE (minimal generation error)
31 SPSS Systems
- HTS (NITECH)
  - Based on HTK
  - Predicts HMM-states
  - (Default) uses MCEP and MLSA filter
  - Supported in Festival
- Clustergen (CMU)
  - No use of HTK
  - Predicts frames
  - (Default) uses MCEP and MLSA filter
  - More tightly coupled with Festival
32 FestVox CLUSTERGEN Synthesizer
Requires:
- Prompt transcriptions (txt.done.data)
- Waveform files (well recorded)
- FestVox Labelling
  - EHMM (Kishore)
  - Context Independent models and forced alignment
  - (Have used Janus labels too)
- Parameter extraction:
  - (HTS's) melcep/MLSA filter for resynthesis
  - F0 extraction
- Clustering
  - Wagon vector clustering for each HMM-state name
33 Clustering by CART
- Update to Wagon (Edinburgh Speech Tools)
  - Tight coupling of features with FestVox utts
  - Support for arbitrary vectors
  - Define impurity on clusters of N vectors
- Clustering F0 and MCEP
  - Tested jointly and separately
- Features for clustering (51):
  - phonetic, syllable, phrasal context
34 Training Output
Three models:
- Spectral (MCEP) CART tree
- F0 CART tree
- Duration CART tree
F0 model:
- Smoothed extracted F0 through all speech (i.e. unvoiced regions get F0 values)
- Choose voicing at runtime phonetically
35 CLUSTERGEN Synthesis
- Generate phoneme strings (as before)
- For each phone:
  - Find HMM-state names: ah_1, ah_2, ah_3
  - Predict duration of each
  - Create empty MCEP vector to fill duration
  - Predict MCEP values from cluster tree
  - Predict F0 value from cluster tree
- Use MLSA filter to regenerate speech
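The per-phone loop above can be sketched as follows; `duration`, `predict_mcep`, and `predict_f0` stand in for lookups in the three CART trees, and the 5 ms frame shift is an assumption, not a CLUSTERGEN requirement:

```python
def synthesize_frames(phones, duration, predict_mcep, predict_f0,
                      frame_shift=0.005):
    # For each phone, expand its three HMM-state names, predict a duration
    # per state, then fill that many frames from the cluster trees.
    track = []
    for ph in phones:
        for state in (f"{ph}_1", f"{ph}_2", f"{ph}_3"):
            n_frames = max(1, round(duration(state) / frame_shift))
            for _ in range(n_frames):
                track.append((predict_f0(state), predict_mcep(state)))
    return track  # one (F0, MCEP vector) pair per frame
```

The resulting frame track is what the MLSA filter consumes to regenerate speech.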
36 Objective Score CLUSTERGEN
- Mean Mel Cepstral Distortion over test set
- MCD: Voice Conversion ranges
- MCD: CG scores
- Smaller is better
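MCD between reference and resynthesized MCEP frames is conventionally computed as (10/ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²), averaged over aligned frames; a minimal version, assuming the frame sequences are already time-aligned and that index 0 is the excluded energy term c0:

```python
import math

def mcd(ref, syn):
    # Mean Mel Cepstral Distortion in dB between two aligned sequences of
    # MCEP vectors; c0 (the energy term) is excluded by convention.
    k = 10.0 / math.log(10.0)
    total = 0.0
    for r, s in zip(ref, syn):
        total += k * math.sqrt(2.0 * sum((a - b) ** 2
                                         for a, b in zip(r[1:], s[1:])))
    return total / len(ref)
```

Identical frames give 0 dB; each unit of cepstral difference adds about 6.14 dB for a single coefficient.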
37 Example CG Voices
- 7 Arctic databases: 1200 utterances, 43K segs, 1hr speech
- awb bdl clb ksp slt jmk rms
38 Database size vs Quality
(table: slt_arctic data size — Utts, Clusters, RMS F0, MCD)
39 Making it Better
- Label data, build model
- But maybe there are better labels
- So find labels that maximize model accuracy
40 Move Labels
(figure: cluster trees for states a_1 and a_2 — predicted Gaussians vs actual values)
41 Move Labels
- Use EHMM to label segments/HMM states
- Build Clustergen model
- Iterate:
  - Predict cluster (mean/std) for each frame
  - For each label boundary:
    - If dist(actual_after, pred_before) < dist(actual_after, pred_after): move label forward
    - If dist(actual_before, pred_after) < dist(actual_before, pred_before): move label backward
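One boundary update from the loop above, sketched with a generic frame distance; `dist` stands in for the static-MCEP Euclidean distance discussed on the next slide:

```python
def move_boundary(boundary, frames, pred_before, pred_after, dist):
    # One application of the move-labels rule: shift the boundary one frame
    # when the adjacent frame is closer to the other segment's prediction.
    actual_after = frames[boundary]        # first frame after the boundary
    actual_before = frames[boundary - 1]   # last frame before the boundary
    if dist(actual_after, pred_before) < dist(actual_after, pred_after):
        return boundary + 1   # frame really belongs to the earlier segment
    if dist(actual_before, pred_after) < dist(actual_before, pred_before):
        return boundary - 1   # frame really belongs to the later segment
    return boundary
```

Iterating this over all boundaries, then rebuilding the model, gives the relabel-retrain loop of the slide.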
42 Distance Metric
- Distance from predicted to actual
  - Euclidean
  - F0, static, deltas, voicing
  - With/without standard deviation normalization
  - Weighting
- Best choice
  - Static without stddev normalization
  - (This is closest to MCD)
43 ML with 10 iterations
- rms voice (66 minutes of speech)
- train 1019 utts, test 113 utts (every tenth)
(table: Pass, Move, +ve, -ve, MCD, stddev, F0)
44 Move Labels
(table: per-voice results for base, 2008, and ml builds — ahw awb bdl clb jmk ksp rms rxr slt)
- Average improvement (excluding awb)
45 Does it sound better?
- rms
  - abtest (10 utterances): ml 7, base 1, = 2
- slt
  - abtest: ml 7, base 2, = 1
46 Arctic MLSB improvements
(chart: Move Label Segment Boundaries — delta MCD, spectrum (MCD 1-24) vs power (MCD cep0), for voices sbb ksp slt rms tta jmk rxr ahw clb jdb awb)
47 Grapheme Based Synthesis
- Synthesis without a phoneme set
- Use the letters as phonemes
  - ("alan" nil (a l a n))
  - ("black" nil (b l a c k))
- Spanish (easier?)
  - 419 utterances
  - HMM training to label databases
  - Simple pronunciation rules
  - Policía -> p o l i c i a
  - Cuatro -> c u a t r o
48 Spanish Grapheme Synthesis
49 English Grapheme Synthesis
- Use letters as phones
- 26 phonemes
- ("alan" n (a l a n))
- ("black" n (b l a c k))
- Build HMM acoustic models for labeling
- For English:
  - This is a pen
  - We went to the church at Christmas
  - Festival intro
  - do eight meat
- Requires method to fix errors
  - Letter to letter mapping
50 Common Data Sets
- Data-driven techniques need data
- Diphone Databases
  - CSTR and CMU US English Diphone sets (kal and ked)
- CMU ARCTIC Databases
  - 1200 phonetically balanced utterances (about 1 hour)
  - 7 different speakers (2 male, 2 female, 3 accented)
  - EGG, phonetically labeled
  - Utterances chosen from out-of-copyright text
  - Easy to say
  - Freely distributable
  - Tools to build your own in your own language
51 Blizzard Challenge
- Realistic evaluation
  - Under the same conditions
- Blizzard Challenge [Black and Tokuda]
  - Participants build voice from common dataset
  - Synthesize test sentences
  - Large set of listening experiments
  - Since 2005, now in 4th year
  - 18 groups in 2008
52 How to test synthesis
- Blizzard tests:
  - Do you like it? (MOS scores)
  - Can you understand it?
    - SUS sentence: The unsure steaks overcame the zippy rudder
- Can't this be done automatically?
  - Not yet (at least not reliably enough)
  - But we now have lots of data for training techniques
- Why does it still sound like a robot?
  - Need better (appropriate) testing
53 SUS Sentences
- sus_00022
- sus_00012
- sus_00005
- sus_00017
54 SUS Sentences
- The serene adjustments foresaw the acceptable acquisition
- The temperamental gateways forgave the weatherbeaten finalist
- The sorrowful premieres sang the ostentatious gymnast
- The disruptive billboards blew the sugary endorsement
55 Voice Identity
- What makes a voice identity?
  - Lexical choice: "Woo-hoo", "I pity the fool"
  - Phonetic choice
  - Intonation and duration
  - Spectral qualities (vocal tract shape)
  - Excitation
56 Voice Conversion techniques
- Full ASR and TTS
  - Much too hard to do reliably
- Codebook transformation
  - ASR HMM state to HMM state transformation
- GMM based transformation
  - Build a mapping function between frames
57 Learning VC models
- First need to get parallel speech
  - Source and target say same thing
  - Use DTW to align (in the spectral domain)
  - Trying to learn a functional mapping from parallel utterances
- Text-independent VC
  - Means no parallel speech available
  - Use some form of synthesis to generate it
58 VC Training process
- Extract F0, power and MFCC from source and target utterances
- DTW align source and target
- Loop until convergence:
  - Build GMM to map between source/target
  - DTW source/target using GMM mapping
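The DTW alignment step is a standard dynamic program over frame pairs; a self-contained sketch (real VC training runs this on MFCC vectors with a spectral distance, not scalars):

```python
def dtw(src, tgt, dist):
    # Dynamic time warping: minimum-cost monotonic alignment of two frame
    # sequences; returns the matched (source, target) index pairs.
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    back = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev, cell = min(
                (cost[i - 1][j - 1], (i - 1, j - 1)),  # match both frames
                (cost[i - 1][j], (i - 1, j)),          # repeat target frame
                (cost[i][j - 1], (i, j - 1)))          # repeat source frame
            cost[i][j] = dist(src[i - 1], tgt[j - 1]) + prev
            back[(i, j)] = cell
    path, ij = [], (n, m)
    while ij in back:
        path.append((ij[0] - 1, ij[1] - 1))
        ij = back[ij]
    return path[::-1]
```

The aligned frame pairs are what the GMM mapping is trained on; re-running DTW with the mapped source (as in the convergence loop above) tightens the alignment.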
59 VC Training process
60 VC Run-time
61 Voice Transformation
- Festvox GMM transformation suite (Toda)
- (audio example matrix: conversions among awb, bdl, jmk, slt)
62 VC in Synthesis
- Can be used as a post filter in synthesis
  - Build kal_diphone to target VC
  - Use on all output of kal_diphone
- Can be used to convert a full DB
  - Convert a full DB and rebuild a voice
63 Style/Emotion Conversion
- Unit Selection (or SPS)
  - Requires lots of data in desired style/emotion
- VC technique
  - Use as filter on main voice (same speaker)
  - Convert neutral to angry, sad, happy
64 Can you say that again?
- Voice conversion for speaking in noise
- Different quality when you repeat things
- Different quality when you speak in noise
  - Lombard effect (when very loud)
  - Speech-in-noise (in regular noise)
65 Speaking in Noise (Langner)
- Collect data
  - Randomly play noise in person's ears
  - Normal
  - In Noise
- Collect 500 of each type
- Build VC model
  - Normal -> in-noise
- Actually
  - Spectral, duration, F0 and power differences
66 Synthesis in Noise
- For bus information task
- Play different synthesis information utts
  - With SIN synthesizer
  - With SWN synthesizer
  - With VC (SWN->SIN) synthesizer
- Measure their understanding
  - SIN synthesizer better (in noise)
  - SIN synthesizer better (without noise, for elderly)
67 Transterpolation
- Incrementally transform a voice X%
  - BDL-SLT by 10%
  - SLT-BDL by 10%
- Count when you think it changes from M to F
- Fun, but what are the uses?
68 De-identification
- Remove speaker identity
  - But keep it still human-like
- Health Records
  - HIPAA laws require this
  - Not just removing names and SSNs
- Remove identifiable properties
  - Use voice conversion to remove spectral properties
  - Use F0/duration mapping to remove prosodic properties
  - Use ASR/MT techniques to remove lexical properties
69 Summary
- Data-driven speech synthesis
  - Text processing
  - Prosody and pronunciation
  - Waveform synthesis
- Finding the right optimization
  - Find an objective metric that correlates with human perception
70 Potential Projects
- Find F0 in story telling
  - F0 is easy to find in isolated sentences
  - What about full paragraphs?
  - Storytellers use much wider range
- Find F0 shapes/accent types
  - Use HMM to recognize types of accents
  - (trajectory modeling)
  - Following Tilt and Moeller models
71 Parametric Synthesis
- Better parametric representation of speech
  - Particularly excitation parameterization
- Better acoustic measures of quality
  - Use Blizzard answers to build/check objective measure
- Statistical Klatt parametric synthesis
  - Using knowledge-based parameters
  - F0, aspiration, nasality, formants
  - Automatically derive Klatt parameters for db
  - Use them for statistical parametric synthesis
More informationInternational Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015
International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Communication media for Blinds Based on Voice Mrs.K.M.Sanghavi 1, Radhika Maru
More informationFP SIMPLE4ALL deliverable D6.5. Deliverable D6.5. Initial Public Release of Open Source Tools
Deliverable D6.5 Initial Public Release of Open Source Tools The research leading to these results has received funding from the European Community s Seventh Framework Programme (FP7/2007-2013) under grant
More informationText analysis. From a string of characters to a list of words , LTI, Carnegie Mellon
Text analysis From a string of characters to a list of words. This is a pen. Text analysis My cat who lives dangerously has nine lives. He stole $100 from the bank. He stole 1996 cattle on 25 Nov 1996.
More informationON-LINE SIMULATION MODULES FOR TEACHING SPEECH AND AUDIO COMPRESSION TECHNIQUES
ON-LINE SIMULATION MODULES FOR TEACHING SPEECH AND AUDIO COMPRESSION TECHNIQUES Venkatraman Atti 1 and Andreas Spanias 1 Abstract In this paper, we present a collection of software educational tools for
More informationWHO WANTS TO BE A MILLIONAIRE?
IDIAP COMMUNICATION REPORT WHO WANTS TO BE A MILLIONAIRE? Huseyn Gasimov a Aleksei Triastcyn Hervé Bourlard Idiap-Com-03-2012 JULY 2012 a EPFL Centre du Parc, Rue Marconi 19, PO Box 592, CH - 1920 Martigny
More informationCOMP 551 Applied Machine Learning Lecture 13: Unsupervised learning
COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551
More informationTHE IMPLEMENTATION OF CUVOICEBROWSER, A VOICE WEB NAVIGATION TOOL FOR THE DISABLED THAIS
THE IMPLEMENTATION OF CUVOICEBROWSER, A VOICE WEB NAVIGATION TOOL FOR THE DISABLED THAIS Proadpran Punyabukkana, Jirasak Chirathivat, Chanin Chanma, Juthasit Maekwongtrakarn, Atiwong Suchato Spoken Language
More informationAn Experimental Evaluation of Keyword-Filler Hidden Markov Models
An Experimental Evaluation of Keyword-Filler Hidden Markov Models A. Jansen and P. Niyogi April 13, 2009 Abstract We present the results of a small study involving the use of keyword-filler hidden Markov
More informationAUDIOVISUAL SYNTHESIS OF EXAGGERATED SPEECH FOR CORRECTIVE FEEDBACK IN COMPUTER-ASSISTED PRONUNCIATION TRAINING.
AUDIOVISUAL SYNTHESIS OF EXAGGERATED SPEECH FOR CORRECTIVE FEEDBACK IN COMPUTER-ASSISTED PRONUNCIATION TRAINING Junhong Zhao 1,2, Hua Yuan 3, Wai-Kim Leung 4, Helen Meng 4, Jia Liu 3 and Shanhong Xia 1
More informationa child-friendly word processor for children to write documents
Table of Contents Get Started... 1 Quick Start... 2 Classes and Users... 3 Clicker Explorer... 4 Ribbon... 6 Write Documents... 7 Document Tools... 8 Type with a Keyboard... 12 Write with a Clicker Set...
More informationSpeech Recognition, The process of taking spoken word as an input to a computer
Speech Recognition, The process of taking spoken word as an input to a computer program (Baumann) Have you ever found yourself yelling at your computer, wishing you could make it understand what you want
More informationIJETST- Vol. 03 Issue 05 Pages May ISSN
International Journal of Emerging Trends in Science and Technology Implementation of MFCC Extraction Architecture and DTW Technique in Speech Recognition System R.M.Sneha 1, K.L.Hemalatha 2 1 PG Student,
More informationAssignment 2. Unsupervised & Probabilistic Learning. Maneesh Sahani Due: Monday Nov 5, 2018
Assignment 2 Unsupervised & Probabilistic Learning Maneesh Sahani Due: Monday Nov 5, 2018 Note: Assignments are due at 11:00 AM (the start of lecture) on the date above. he usual College late assignments
More informationAUDIO. Henning Schulzrinne Dept. of Computer Science Columbia University Spring 2015
AUDIO Henning Schulzrinne Dept. of Computer Science Columbia University Spring 2015 Key objectives How do humans generate and process sound? How does digital sound work? How fast do I have to sample audio?
More informationModeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 10, OCTOBER 2013 2129 Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical
More informationHuman Computer Interaction Using Speech Recognition Technology
International Bulletin of Mathematical Research Volume 2, Issue 1, March 2015 Pages 231-235, ISSN: 2394-7802 Human Computer Interaction Using Recognition Technology Madhu Joshi 1 and Saurabh Ranjan Srivastava
More informationStarting-Up Fast with Speech-Over Professional
Starting-Up Fast with Speech-Over Professional Contents #1 Getting Ready... 2 Starting Up... 2 Initial Preferences Settings... 3 Adding a Narration Clip... 3 On-Line Tutorials... 3 #2: Creating a Synchronized
More informationLearning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis
Interspeech 2018 2-6 September 2018, Hyderabad Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Synthesis Xiao Zhou, Zhen-Hua Ling, Zhi-Ping Zhou, Li-Rong Dai National Engineering
More informationRECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS
RECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS Dipti D. Joshi, M.B. Zalte (EXTC Department, K.J. Somaiya College of Engineering, University of Mumbai, India) Diptijoshi3@gmail.com
More informationEditing Pronunciation in Clicker 5
Editing Pronunciation in Clicker 5 Depending on which computer you use for Clicker 5, you may notice that some words, especially proper names and technical terms, are mispronounced when you click a word
More informationFACE ANALYSIS AND SYNTHESIS FOR INTERACTIVE ENTERTAINMENT
FACE ANALYSIS AND SYNTHESIS FOR INTERACTIVE ENTERTAINMENT Shoichiro IWASAWA*I, Tatsuo YOTSUKURA*2, Shigeo MORISHIMA*2 */ Telecommunication Advancement Organization *2Facu!ty of Engineering, Seikei University
More informationCreate Swift mobile apps with IBM Watson services IBM Corporation
Create Swift mobile apps with IBM Watson services Create a Watson sentiment analysis app with Swift Learning objectives In this section, you ll learn how to write a mobile app in Swift for ios and add
More informationVocal Joystick User Guide
Vocal Joystick User Guide For Windows XP/Vista/Linux Version 0.5 beta The Vocal Joystick team includes: Jon Malkin Xiao Li Susumu Harada Jeff Bilmes Table of Contents Disclaimer... 3 About the Vocal Joystick...
More informationModel TS-04 -W. Wireless Speech Keyboard System 2.0. Users Guide
Model TS-04 -W Wireless Speech Keyboard System 2.0 Users Guide Overview TextSpeak TS-04 text-to-speech speaker and wireless keyboard system instantly converts typed text to a natural sounding voice. The
More informationTowards Audiovisual TTS
Towards Audiovisual TTS in Estonian Einar MEISTER a, SaschaFAGEL b and RainerMETSVAHI a a Institute of Cybernetics at Tallinn University of Technology, Estonia b zoobemessageentertainmentgmbh, Berlin,
More informationCMU Sphinx: the recognizer library
CMU Sphinx: the recognizer library Authors: Massimo Basile Mario Fabrizi Supervisor: Prof. Paola Velardi 01/02/2013 Contents 1 Introduction 2 2 Sphinx download and installation 4 2.1 Download..........................................
More informationMARATHI TEXT-TO-SPEECH SYNTHESISYSTEM FOR ANDROID PHONES
International Journal of Advances in Applied Science and Engineering (IJAEAS) ISSN (P): 2348-1811; ISSN (E): 2348-182X Vol. 3, Issue 2, May 2016, 34-38 IIST MARATHI TEXT-TO-SPEECH SYNTHESISYSTEM FOR ANDROID
More informationLanguage Live! Levels 1 and 2 correlated to the Arizona English Language Proficiency (ELP) Standards, Grades 5 12
correlated to the Standards, Grades 5 12 1 correlated to the Standards, Grades 5 12 ELL Stage III: Grades 3 5 Listening and Speaking Standard 1: The student will listen actively to the ideas of others
More informationEM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition
EM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition Yan Han and Lou Boves Department of Language and Speech, Radboud University Nijmegen, The Netherlands {Y.Han,
More informationHow accurate is AGNITIO KIVOX Voice ID?
How accurate is AGNITIO KIVOX Voice ID? Overview Using natural speech, KIVOX can work with error rates below 1%. When optimized for short utterances, where the same phrase is used for enrolment and authentication,
More informationUnsupervised Learning. Clustering and the EM Algorithm. Unsupervised Learning is Model Learning
Unsupervised Learning Clustering and the EM Algorithm Susanna Ricco Supervised Learning Given data in the form < x, y >, y is the target to learn. Good news: Easy to tell if our algorithm is giving the
More information