
1 Machine Learning in Speech Synthesis Alan W Black Language Technologies Institute Carnegie Mellon University Sept 2009

2 Overview
- Speech Synthesis History and Overview
  - From hand-crafted to data-driven techniques
- Text to Speech Processes
- Waveform synthesis
  - Unit selection and Statistical Parametric Synthesis
- Evaluation
- Voice conversion
- Project ideas

3 Physical Models
- Blowing air through tubes: von Kempelen's synthesizer, 1791
- Synthesis by physical models: Homer Dudley's Voder, 1939

4 More Computation, More Data
- Formant synthesis (60s-80s)
  - Waveform construction from components
- Diphone synthesis (80s-90s)
  - Waveform by concatenation of a small number of instances of speech
- Unit selection (90s-00s)
  - Waveform by concatenation of a very large number of instances of speech
- Statistical Parametric Synthesis (00s-...)
  - Waveform construction from parametric models

5 Waveform Generation
- Formant synthesis
- Random word/phrase concatenation
- Phone concatenation
- Diphone concatenation
- Sub-word unit selection
- Cluster-based unit selection
- Statistical Parametric Synthesis

6 Speech Synthesis
- Text Analysis
  - Chunking, tokenization, token expansion
- Linguistic Analysis
  - Pronunciations
  - Prosody
- Waveform generation
  - From phones and prosody to waveforms

7 Text processing
- Find the words
  - Splitting tokens, too, e.g. 04/11/2009
  - Removing punctuation
- Identifying word types
  - Numbers: years, quantities, ordinals
  - "1996 sheep were stolen on 25 Nov 1996"
- Identifying words/abbreviations
  - CIA, 10m, 12sf, WeH7200
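As a toy illustration of the token-splitting step (Python; the helper name split_token and the date pattern are illustrative, not Festival's actual rules):

    import re

    def split_token(tok):
        # Toy version of the "find the words" step: split a date token
        # like 04/11/2009 into number tokens, otherwise strip trailing
        # punctuation. Real front ends use much richer rules.
        m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", tok)
        if m:
            return list(m.groups())
        return [tok.rstrip(".,;:!?")]

    print(split_token("04/11/2009"))   # ['04', '11', '2009']
    print(split_token("stolen."))      # ['stolen']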

8 Pronunciations
- Giving a pronunciation for each word
  - A phoneme string (plus tone, stress, ...)
- A constructed lexicon
  - ("pencil" n (p eh1 n s ih l))
  - ("two" n (t uw1))
- Letter-to-sound rules
  - Pronunciation of out-of-vocabulary words
  - Machine learning prediction from letters

9 Pronunciation of Unknown Words
- How do you pronounce new words?
- 4% of tokens (in news) are new
- You can't synthesize them without pronunciations
- You can't recognize them without pronunciations
- Letter-to-sound rules
- Grapheme-to-phoneme rules

10 LTS: Hand-written
- Hand-written rules
  - [LeftContext] X [RightContext] -> Y
  - e.g.
    - c [h r] -> k
    - c [h] -> ch
    - c [i] -> s
    - c -> k
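The sketch below is a minimal interpreter for ordered rules of this form, hard-coding the 'c' rules from the slide; it illustrates the rule semantics only (context letters are matched but not consumed) and is not Festival's actual rule engine:

    # Ordered rewrite rules "X [RightContext] -> Y".
    RULES = [
        ("c", "hr", "k"),   # c [h r] -> k
        ("c", "h",  "ch"),  # c [h]   -> ch
        ("c", "i",  "s"),   # c [i]   -> s
        ("c", "",   "k"),   # c       -> k (default)
    ]

    def apply_rules(word):
        out = []
        for i, ch in enumerate(word):
            for x, right, y in RULES:
                if ch == x and word.startswith(right, i + 1):
                    out.append(y)
                    break
            else:
                out.append(ch)      # letters without a rule pass through
        return out

    print(apply_rules("city"))   # ['s', 'i', 't', 'y']
    print(apply_rules("cat"))    # ['k', 'a', 't']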

11 LTS: Machine Learning Techniques
- Need an existing lexicon
  - Pronunciations: words and phones
  - But different numbers of letters and phones
- Need an alignment
  - Between letters and phones
  - checked -> ch eh k t

12 LTS: alignment
checked -> ch eh k t

  c   h   e   c   k   e   d
  ch  _   eh  k   _   _   t

- Some letters go to nothing
- Some letters go to two phones
  - box -> b aa k-s
  - table -> t ey b ax-l

13 Find alignment automatically
- Epsilon scattering
  - Find all possible alignments
  - Estimate p(l,p) on each alignment
  - Find most probable alignment
- Hand seed
  - Hand-specify allowable pairs
  - Estimate p(l,p) on each possible alignment
  - Find most probable alignment
- Statistical Machine Translation (IBM Model 1)
  - Estimate p(l,p) on each possible alignment
  - Find most probable alignment
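A minimal sketch of the shared core of these schemes: a dynamic program that finds the most probable monotone alignment under the current p(letter, phone) table (here a log-probability dict; the seed values below are illustrative). Re-estimating the table from the winning alignments and repeating gives the EM-style loop:

    import math
    from collections import defaultdict

    def best_alignment(letters, phones, logp):
        # Most probable monotone alignment, each letter mapping to one
        # phone or to epsilon '_' (dual phones like k-s are treated as
        # single symbols in the phone string).
        L, P = len(letters), len(phones)
        best = {(0, 0): (0.0, None)}
        for i in range(L):
            for j in range(P + 1):
                if (i, j) not in best:
                    continue
                s = best[(i, j)][0]
                moves = [(0, "_")] + ([(1, phones[j])] if j < P else [])
                for dj, ph in moves:
                    cand = s + logp[(letters[i], ph)]
                    if cand > best.get((i + 1, j + dj), (-1e30, None))[0]:
                        best[(i + 1, j + dj)] = (cand, (i, j, ph))
        pairs, ij = [], (L, P)          # trace back letter->phone pairs
        while best[ij][1] is not None:
            i, j, ph = best[ij][1]
            pairs.append((letters[i], ph))
            ij = (i, j)
        return list(reversed(pairs))

    logp = defaultdict(lambda: math.log(1e-6))   # seeded/estimated p(l,p)
    for lp in [("c", "ch"), ("h", "_"), ("e", "eh"), ("c", "k"),
               ("k", "_"), ("e", "_"), ("d", "t")]:
        logp[lp] = 0.0
    print(best_alignment("checked", ["ch", "eh", "k", "t"], logp))
    # [('c','ch'), ('h','_'), ('e','eh'), ('c','k'), ('k','_'), ('e','_'), ('d','t')]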

14 Not everything aligns
- 0, 1, and 2 letter cases
  - e -> epsilon ("moved")
  - x -> k-s, g-z ("box", "example")
  - e -> y-uw ("askew")
- Some alignments aren't sensible
  - dept -> d ih p aa r t m ax n t
  - cmu -> s iy eh m y uw

15 Training LTS models
- Use CART trees
  - One model for each letter
- Predict phone (epsilon, phone, dual phone)
  - From letter 3-context (and POS)
- # # # c h e c -> ch
- # # c h e c k -> _
- # c h e c k e -> eh
- c h e c k e d -> k
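A small sketch of how one aligned lexicon entry becomes per-letter training instances for the trees (the helper name lts_instances is illustrative):

    def lts_instances(word, aligned_phones, ctx=3):
        # Each letter predicts its phone (or '_') from a window of
        # 3 letters either side, '#' padding at the edges; one CART
        # tree per letter is then trained on these instances.
        padded = "#" * ctx + word + "#" * ctx
        insts = []
        for i, target in enumerate(aligned_phones):
            window = list(padded[i:i + 2 * ctx + 1])
            insts.append((window, target))
        return insts

    for feats, phone in lts_instances("checked",
                                      ["ch", "_", "eh", "k", "_", "_", "t"]):
        print("".join(feats), "->", phone)
    # ###chec -> ch, ##check -> _, #checke -> eh, checked -> k, ...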

16 LTS results
Split lexicon into train/test 90%/10%, i.e. every tenth entry is extracted for testing.

Lexicon    | OALD   | CMUDICT | BRULEX | DE-CELEX | Thai
Letter Acc | 95.80% | 91.99%  | 99.00% | 98.79%   | 95.60%
Word Acc   | 75.56% | 57.80%  | 93.03% | 89.38%   | 68.76%

17 Example Tree

18 But we need more than phones
- What about lexical stress?
  - p r aa1 j eh k t -> p r aa j eh1 k t
- Two possibilities
  - A separate prediction model
  - Joint model: introduce eh/eh1 (BETTER)

Model | L no S | Letter | W no S | Word
LTP+S | 96.36% | n/a    | 63.68% | n/a
LTPS  | 96.27% | 95.80% | 74.69% | 74.56%

19 Does it really work?
- 40K words from Time Magazine
- 1775 (4.6%) not in OALD
- LTS gets 70% correct (test set was 74%)
- (Table: error breakdown into names, unknown words, US spellings, typos, with occurrence counts and %)

20 Prosody Modeling
- Phrasing
  - Where to take breaths
- Intonation
  - Where (and what size) are accents
  - F0 realization
- Duration
  - What is the length of each phoneme?

21 Intonation Contour

22 Unit Selection vs Parametric
- Unit Selection: the standard method
  - Select appropriate sub-word units from large databases of natural speech
- Parametric Synthesis [NITECH: Tokuda et al]
  - HMM-generation based synthesis
  - Cluster units to form models
  - Generate from the models
  - Take average of units

23 Unit Selection
- Target cost and Join cost [Hunt and Black 96]
  - Target cost is the distance from the desired unit to an actual unit in the database
  - Based on phonetic, prosodic, metrical context
  - Join cost is how well the selected units join

24 Hunt and Black Costs

25 HB Unit Selection
- Use Viterbi to find the best set of units
- Note: finding the longest units is typically not optimal
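A compact sketch of that Viterbi search, assuming hashable unit identifiers and user-supplied target_cost and join_cost functions (real systems add pruning and cost caching):

    def select_units(targets, candidates, target_cost, join_cost):
        # Viterbi search for the unit sequence minimising summed
        # target and join costs (the Hunt & Black formulation);
        # candidates[i] lists the database units matching target i.
        trellis = [{u: (target_cost(targets[0], u), None)
                    for u in candidates[0]}]
        for t in range(1, len(targets)):
            col = {}
            for u in candidates[t]:
                prev, cost = min(
                    ((p, c + join_cost(p, u))
                     for p, (c, _) in trellis[t - 1].items()),
                    key=lambda pc: pc[1])
                col[u] = (cost + target_cost(targets[t], u), prev)
            trellis.append(col)
        u = min(trellis[-1], key=lambda x: trellis[-1][x][0])
        path = []                       # trace back the best path
        for t in range(len(trellis) - 1, -1, -1):
            path.append(u)
            u = trellis[t][u][1]
        return list(reversed(path))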

26 Clustering Units
- Cluster units [Donovan et al 96, Black et al 97]
- Moves calculation to compile time

27 Unit Selection Issues
- Cost metrics
  - Finding best weights, best techniques, etc.
- Database design
  - Best database coverage
- Automatic labeling accuracy
  - Finding errors/confidence
- Limited domain
  - Target the database to a particular application
  - Talking clocks
  - Targeted domain synthesis

28 Old vs New
- Unit Selection:
  - large, carefully labelled database
  - quality good when good examples available
  - quality will sometimes be bad
  - no control of prosody
- Parametric Synthesis:
  - smaller, less carefully labelled database
  - quality consistent
  - resynthesis requires a vocoder (buzzy)
  - can (must) control prosody
  - model size much smaller than unit DB

29 Parametric Synthesis
- Probabilistic models
- Simplification
- Generative model
- Predict acoustic frames from text

30 Trajectories
- Frame (state) based prediction
  - Ignores dynamics
- Various solutions
  - MLPG (maximum likelihood parameter generation)
  - Trajectory HMMs
  - Global Variance
  - MGE, minimal generation error
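For intuition, here is a toy 1-D version of the MLPG idea: given per-frame means and variances for a static coefficient and its delta, solve the usual normal equations (W' U^-1 W) c = W' U^-1 mu for the static trajectory. A real implementation works per dimension with full delta and delta-delta windows:

    import numpy as np

    def mlpg_1d(mu, var):
        # mu, var: (T, 2) arrays; column 0 = static, column 1 = delta.
        # Delta constraint: d[t] = (c[t+1] - c[t-1]) / 2.
        T = len(mu)
        W = np.zeros((2 * T, T))
        for t in range(T):
            W[2 * t, t] = 1.0                      # static row
            W[2 * t + 1, max(t - 1, 0)] -= 0.5     # delta row
            W[2 * t + 1, min(t + 1, T - 1)] += 0.5
        Uinv = np.diag(1.0 / var.reshape(-1))
        m = mu.reshape(-1)
        A = W.T @ Uinv @ W
        b = W.T @ Uinv @ m
        return np.linalg.solve(A, b)               # static trajectory c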

31 SPSS Systems
- HTS (NITECH)
  - Based on HTK
  - Predicts HMM states
  - (Default) uses MCEP and MLSA filter
  - Supported in Festival
- Clustergen (CMU)
  - No use of HTK
  - Predicts frames
  - (Default) uses MCEP and MLSA filter
  - More tightly coupled with Festival

32 FestVox CLUSTERGEN Synthesizer
- Requires:
  - Prompt transcriptions (txt.done.data)
  - Waveform files (well recorded)
- FestVox labelling
  - EHMM (Kishore)
  - Context-independent models and forced alignment
  - (Have used Janus labels too)
- Parameter extraction:
  - (HTS's) melcep/MLSA filter for resynthesis
  - F0 extraction
- Clustering
  - Wagon vector clustering for each HMM-state name

33 Clustering by CART
- Update to Wagon (Edinburgh Speech Tools):
  - Tight coupling of features with FestVox utts
  - Support for arbitrary vectors
  - Define impurity on clusters of N vectors
- Clustering F0 and MCEP
  - Tested jointly and separately
- Features for clustering (51):
  - phonetic, syllable, phrasal context

34 Training Output
- Three models:
  - Spectral (MCEP) CART tree
  - F0 CART tree
  - Duration CART tree
- F0 model:
  - Smoothed extracted F0 through all speech (i.e. unvoiced regions get F0 values)
  - Choose voicing at runtime phonetically

35 CLUSTERGEN Synthesis
- Generate phoneme strings (as before)
- For each phone:
  - Find HMM-state names: ah_1, ah_2, ah_3
  - Predict duration of each
  - Create empty MCEP vectors to fill the duration
  - Predict MCEP values from the cluster tree
  - Predict F0 values from the cluster tree
- Use the MLSA filter to regenerate speech
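A sketch of that run-time loop; dur_tree, f0_tree and mcep_tree stand in for the trained CART trees, and their predict calls are assumed interfaces, not Festival's actual API:

    def clustergen_synthesize(phones, dur_tree, f0_tree, mcep_tree,
                              frame_shift=0.005):
        # For each HMM state of each phone, predict a duration, then
        # fill that many frames with F0 + MCEP values predicted from
        # the cluster trees; an MLSA filter then renders the waveform.
        track = []  # one (f0, mcep_vector) per frame
        for phone in phones:
            for state in (1, 2, 3):             # e.g. ah_1, ah_2, ah_3
                name = f"{phone}_{state}"
                dur = dur_tree.predict(name)    # seconds for this state
                n_frames = max(1, int(round(dur / frame_shift)))
                for i in range(n_frames):
                    ctx = (name, i / n_frames)  # position within state
                    track.append((f0_tree.predict(ctx),
                                  mcep_tree.predict(ctx)))
        return track  # pass to an MLSA filter for waveform generation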

36 Objective Score CLUSTERGEN
- Mean Mel Cepstral Distortion over a test set
- MCD: voice conversion ranges
- MCD: CG scores
- Smaller is better
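For reference, the usual MCD definition over mel-cepstral coefficients (excluding c0), as commonly used to score voice conversion and CG output:

    import math

    def mcd(mc_ref, mc_syn):
        # MCD = (10 / ln 10) * sqrt(2 * sum_i (mc_ref_i - mc_syn_i)^2) dB,
        # computed per aligned frame pair and averaged over the test set.
        s = sum((a - b) ** 2 for a, b in zip(mc_ref, mc_syn))
        return (10.0 / math.log(10)) * math.sqrt(2.0 * s)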

37 Example CG Voices
- 7 ARCTIC databases: 1200 utterances, 43K segments, 1 hr speech
- awb bdl clb ksp slt jmk rms

38 Database size vs Quality
(Table: slt_arctic data size; utterances vs clusters, RMS F0, MCD)

39 Making it Better
- Label data, build model
- But maybe there are better labels
- So find labels that maximize model accuracy

40 Move Labels
(Figure: cluster trees for states a_1 and a_2; predicted Gaussians vs actual values)

41 Move Labels
- Use EHMM to label segments/HMM states
- Build Clustergen model
- Iterate:
  - Predict cluster (mean/std) for each frame
  - For each label boundary:
    - If dist(actual_after, pred_before) < dist(actual_after, pred_after): move label forward
    - If dist(actual_before, pred_after) < dist(actual_before, pred_before): move label backward
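A sketch of one pass of that boundary-moving rule, assuming interior boundaries, per-frame actual feature vectors, per-frame predictions under the current labels, and a user-supplied distance function:

    def move_boundaries(bounds, actual, pred, dist):
        # bounds[k] is the frame index where segment k+1 starts;
        # pred[t] is the cluster prediction for the model currently
        # assigned to frame t. Nudge each boundary by one frame when
        # the frame on one side is better explained by the other side.
        new = list(bounds)
        for k, b in enumerate(bounds):
            after, before = actual[b], actual[b - 1]
            if dist(after, pred[b - 1]) < dist(after, pred[b]):
                new[k] = b + 1   # frame after fits previous model: forward
            elif dist(before, pred[b]) < dist(before, pred[b - 1]):
                new[k] = b - 1   # frame before fits next model: backward
        return new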

42 Distance Metric
- Distance from predicted to actual
  - Euclidean
  - F0, static, deltas, voicing
  - With/without standard deviation normalization
  - Weighting
- Best choice
  - Static without stddev normalization
  - (This is closest to MCD)

43 ML with 10 iterations
- rms voice (66 minutes of speech)
- train 1019 utts, test 113 utts (every tenth)
(Table: per-pass label moves (+ve/-ve), MCD, stddev, F0)

44 Move Labels
(Table: base vs 2008 vs ML MCD per voice: ahw awb bdl clb jmk ksp rms rxr slt)
- Average improvement (excluding awb)

45 Does it sound better?
- rms
  - abtest (10 utterances): ml 7, base 1, equal 2
- slt
  - abtest: ml 7, base 2, equal 1

46 Arctic MLSB improvements
(Figure: per-voice Move Label Segment Boundaries gains, spectrum ΔMCD (MCD 1-24) vs power (MCD cep0), for sbb, ksp, slt, rms, tta, jmk, rxr, ahw, clb, jdb, awb)

47 Grapheme Based Synthesis
- Synthesis without a phoneme set
- Use the letters as phonemes
  - ("alan" nil (a l a n))
  - ("black" nil (b l a c k))
- Spanish (easier?)
  - 419 utterances
  - HMM training to label databases
  - Simple pronunciation rules
  - Policia -> p o l i c i a
  - Cuatro -> c u a t r o

48 Spanish Grapheme Synthesis

49 English Grapheme Synthesis
- Use letters as phones
  - 26 phonemes
  - ("alan" n (a l a n))
  - ("black" n (b l a c k))
- Build HMM acoustic models for labeling
- For English:
  - "This is a pen"
  - "We went to the church at Christmas"
  - Festival intro
  - "do", "eight", "meat"
- Requires method to fix errors
  - Letter-to-letter mapping

50 Common Data Sets
- Data-driven techniques need data
- Diphone databases
  - CSTR and CMU US English diphone sets (kal and ked)
- CMU ARCTIC databases
  - 1200 phonetically balanced utterances (about 1 hour)
  - 7 different speakers (2 male, 2 female, 3 accented)
  - EGG, phonetically labeled
  - Utterances chosen from out-of-copyright text
  - Easy to say
  - Freely distributable
  - Tools to build your own in your own language

51 Blizzard Challenge
- Realistic evaluation
  - Under the same conditions
- Blizzard Challenge [Black and Tokuda]
  - Participants build a voice from a common dataset
  - Synthesize test sentences
  - Large set of listening experiments
  - Since 2005, now in its 4th year
  - 18 groups in 2008

52 How to test synthesis
- Blizzard tests:
  - Do you like it? (MOS scores)
  - Can you understand it? (SUS sentences: "The unsure steaks overcame the zippy rudder")
- Can't this be done automatically?
  - Not yet (at least not reliably enough)
  - But we now have lots of data for training techniques
- Why does it still sound like a robot?
  - Need better (appropriate) testing

53 SUS Sentences
- sus_00022
- sus_00012
- sus_00005
- sus_00017

54 SUS Sentences
- The serene adjustments foresaw the acceptable acquisition
- The temperamental gateways forgave the weatherbeaten finalist
- The sorrowful premieres sang the ostentatious gymnast
- The disruptive billboards blew the sugary endorsement

55 Voice Identity
- What makes a voice identity?
  - Lexical choice: "Woo-hoo", "I pity the fool"
  - Phonetic choice
  - Intonation and duration
  - Spectral qualities (vocal tract shape)
  - Excitation

56 Voice Conversion techniques
- Full ASR and TTS
  - Much too hard to do reliably
- Codebook transformation
  - ASR HMM state to HMM state transformation
- GMM-based transformation
  - Build a mapping function between frames

57 Learning VC models
- First need to get parallel speech
  - Source and target say the same thing
  - Use DTW to align (in the spectral domain)
  - Trying to learn a functional mapping from parallel utterances
- Text-independent VC
  - Means no parallel speech available
  - Use some form of synthesis to generate it

58 VC Training process
- Extract F0, power and MFCC from source and target utterances
- DTW align source and target
- Loop until convergence:
  - Build GMM to map between source/target
  - DTW source/target using GMM mapping
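A sketch of the mapping step under common assumptions (joint-density GMM over DTW-aligned frames, conversion by posterior-weighted conditional means); this follows the general GMM-mapping recipe rather than the exact Festvox/Toda code:

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    def train_joint_gmm(src, tgt, n_mix=8):
        # src, tgt: (n_frames, d) DTW-aligned feature matrices.
        joint = np.hstack([src, tgt])
        return GaussianMixture(n_components=n_mix,
                               covariance_type="full").fit(joint)

    def convert_frame(gmm, x, d):
        # Map one d-dim source frame to the posterior-weighted
        # conditional mean of the target dimensions.
        mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
        Sxx = gmm.covariances_[:, :d, :d]
        Syx = gmm.covariances_[:, d:, :d]
        resp = np.array([w * multivariate_normal.pdf(x, mu_x[m], Sxx[m])
                         for m, w in enumerate(gmm.weights_)])
        resp /= resp.sum()
        y = np.zeros(d)
        for m in range(len(gmm.weights_)):
            cond = mu_y[m] + Syx[m] @ np.linalg.solve(Sxx[m], x - mu_x[m])
            y += resp[m] * cond
        return y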

59 VC Training process

60 VC Run-time

61 Voice Transformation
- Festvox GMM transformation suite (Toda)
(Table: conversion pairs among awb, bdl, jmk, slt)

62 VC in Synthesis
- Can be used as a post-filter in synthesis
  - Build kal_diphone to target VC
  - Use on all output of kal_diphone
- Can be used to convert a full DB
  - Convert a full DB and rebuild a voice

63 Style/Emotion Conversion
- Unit Selection (or SPS)
  - Requires lots of data in the desired style/emotion
- VC technique
  - Use as a filter on the main voice (same speaker)
  - Convert neutral to angry, sad, happy

64 Can you say that again?
- Voice conversion for speaking in noise
- Different quality when you repeat things
- Different quality when you speak in noise
  - Lombard effect (when very loud)
  - Speech-in-noise (in regular noise)

65 Speaking in Noise (Langner)
- Collect data
  - Randomly play noise in the person's ears
  - Normal
  - In noise
- Collect 500 of each type
- Build VC model
  - Normal -> in-noise
- Actually:
  - Spectral, duration, F0 and power differences

66 Synthesis in Noise
- For a bus information task
- Play different synthesis information utts
  - With SIN synthesizer
  - With SWN synthesizer
  - With VC (SWN->SIN) synthesizer
- Measure their understanding
  - SIN synthesizer better (in noise)
  - SIN synthesizer better (without noise, for the elderly)

67 Transterpolation
- Incrementally transform a voice X%
  - BDL-SLT by 10%
  - SLT-BDL by 10%
- Count when you think it changes from M to F
- Fun, but what are the uses?

68 De-identification
- Remove speaker identity
  - But keep it still human-like
- Health records
  - HIPAA laws require this
  - Not just removing names and SSNs
- Remove identifiable properties
  - Use voice conversion to remove spectral characteristics
  - Use F0/duration mapping to remove prosodic characteristics
  - Use ASR/MT techniques to remove lexical characteristics

69 Summary
- Data-driven speech synthesis
  - Text processing
  - Prosody and pronunciation
  - Waveform synthesis
- Finding the right optimization
  - Find an objective metric that correlates with human perception

70 Potential Projects
- Find F0 in storytelling
  - F0 is easy to find in isolated sentences
  - What about full paragraphs?
  - Storytellers use a much wider range
- Find F0 shapes/accent types
  - Use HMMs to recognize types of accents
  - (trajectory modeling)
  - Following the Tilt and Moeller models

71 Parametric Synthesis
- Better parametric representation of speech
  - Particularly excitation parameterization
- Better acoustic measures of quality
  - Use Blizzard answers to build/check an objective measure
- Statistical Klatt parametric synthesis
  - Using knowledge-based parameters
  - F0, aspiration, nasality, formants
  - Automatically derive Klatt parameters for a db
  - Use them for statistical parametric synthesis
