Modeling Coarticulation in Continuous Speech


Oregon Health & Science University, Center for Spoken Language Understanding, December 16, 2013

Outline

Coarticulation
Coarticulation is the influence of one phoneme on another.
Figure: coarticulation in cnv (left) and clr (right) speech.
The degree of coarticulation depends on the phonetic context; for example, /w/ has a stronger effect on the following vowel than /z/.
Motivation: to date there is no comprehensive, data-driven model that explains the timing and degree of the coarticulation effect of one phoneme on its neighbors.

Speaking Styles: Defined
Clear speech and conversational speech are defined as:
Clear (clr): speech spoken clearly, as when talking to a hearing-impaired listener.
Conversational (cnv): speech spoken as when talking with a colleague.

Applications
There are several application areas for coarticulation modeling:
Converting conversational-style speech to clear style, to increase intelligibility.
Dysarthria diagnosis: assessing the presence or severity of the disorder.
Formant tracking: automatically correcting tracking errors.
Text-to-speech: varying the degree of coarticulation from conversational to clear.
Index of intelligibility: inferring the intelligibility of speech from its coarticulation.

Broad and Clermont (1987)
Broad and Clermont (1987) produced several models of formant transitions in vowels in CV and CV/d/ contexts.
The most detailed model used a linear combination of coarticulation functions and target values.
Coarticulation functions were modeled with exponential functions.
Consonants were limited to the voiced stops /b, d, g/.

Niu and van Santen (2003)
Niu and van Santen (2003) applied Broad and Clermont's CV/d/ model to dysarthric speech to measure coarticulation.
They expanded the model's application to generic CVC tokens.
Modeling was limited to vowel centers.
Result: effects for a dysarthric speaker were higher than for a normal speaker.

Amano and Hosom (2010)
Amano and Hosom (2010) expanded upon Niu and van Santen by modeling the entire vowel region of a CVC.
The region of evaluation was extended to the consonant center if the consonant was an approximant (/w, y, l, r/).
The coarticulation function was changed from an exponential to a sigmoid.
Result: the model was applied to formant tracking error detection and correction.

Bush and Kain (2013)
Expanded model trajectories over the CVC region.
Relaxed synchronicity; previous work modeled all formants (F1-F4) synchronously.
Validated the model with an intelligibility test, showing that the model captures important features of the spectral domain.
Resynthesis results: 74.6% for observed formants, 70.8% for modeled formants.

Proposed Model
Uses triphone models as local models of coarticulation during analysis.
Creates continuous speech by cross-fading triphone models during synthesis.
Includes formant bandwidths.
Handles the two primary components of plosives (/b, d, g, p, t, k/) separately.
Optimization is performed jointly over identical triphone types.

Triphone Trajectory Model
An individual formant trajectory X(t) of a triphone is modeled as
X̂(t; Λ) = f_L(t) T_L + f_C(t) T_C + f_R(t) T_R,
a convex linear combination of T_L, T_C, and T_R, the global formant target values of three consecutive acoustic events (a phone comprises one or more distinct acoustic events).
f_L(t), f_C(t), and f_R(t) are coarticulation functions.

Triphone Trajectory Coarticulation Functions
The coarticulation functions are based on a sigmoid:
f(t; s, p) = (1 + e^(−s·(t−p)))^(−1)
f_L(t; s_L, p_L) = f(t; s_L, p_L)
f_R(t; s_R, p_R) = f(t; s_R, p_R)
f_C(t) = 1 − f_L(t) − f_R(t)
where s is the sigmoid slope and p the sigmoid midpoint position.
The parameters Λ = {T_L, T_C, T_R, s_L, p_L, s_R, p_R} are specific to a single formant trajectory (asynchronous model).
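
As a concrete illustration, here is a minimal Python sketch of the trajectory model under one stated assumption: the left coarticulation function is a falling sigmoid while the right one rises, consistent with the coarticulation-function figures later in the deck. All names and numeric values are illustrative, not the authors' implementation.

```python
import numpy as np

def sigmoid(t, s, p):
    """Rising sigmoid f(t; s, p) = 1 / (1 + exp(-s * (t - p)))."""
    return 1.0 / (1.0 + np.exp(-s * (t - p)))

def triphone_trajectory(t, T_L, T_C, T_R, s_L, p_L, s_R, p_R):
    """Convex combination of three acoustic-event targets.

    Assumption: f_L falls (the left target fades out) and f_R rises
    (the right target fades in); f_C takes up the remainder.
    """
    f_L = 1.0 - sigmoid(t, s_L, p_L)   # assumed falling left function
    f_R = sigmoid(t, s_R, p_R)         # rising right function
    f_C = 1.0 - f_L - f_R              # center weight (remainder)
    return f_L * T_L + f_C * T_C + f_R * T_R

# Illustrative F2 trajectory for a w-iy-l-like triphone (all values made up)
t = np.linspace(0.0, 0.3, 300)   # 300 ms span
x_hat = triphone_trajectory(t, T_L=800.0, T_C=2300.0, T_R=1100.0,
                            s_L=60.0, p_L=0.08, s_R=60.0, p_R=0.22)
```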

Triphone Model (clr)
Figure: F2 model fit on the triphone w-iy-l in clr speech (formant frequency and bandwidth panels; max f_C(t) = 0.98 and 1.00).

Triphone Model (cnv)
Figure: F2 model fit on the triphone w-iy-l in cnv speech (formant frequency and bandwidth panels; max f_C(t) = 0.69 and 0.99).

Triphone Model Error
We define the per-token model error as
E(X, Λ) = ( Σ_{t=t_L}^{t_R} (X(t) − X̂(t; Λ))² ) / (t_R − t_L),
where X(t) and X̂(t; Λ) are the observed and estimated individual formant trajectories. The error is evaluated from t_L to t_R, where t_L is the center of the first phone and t_R is the center of the final phone.
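
For sampled trajectories, the time-normalized sum above reduces to a per-frame mean squared error over the evaluation interval. A minimal sketch, assuming the trajectories are NumPy arrays sampled on a common time axis (names illustrative):

```python
import numpy as np

def token_error(x_obs, x_hat, t, t_L, t_R):
    """Per-token model error: mean squared difference between observed
    and modeled trajectories, evaluated between the first and final
    phone centers t_L and t_R."""
    mask = (t >= t_L) & (t <= t_R)
    diff = x_obs[mask] - x_hat[mask]
    return float(np.mean(diff ** 2))
```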

Estimating Parameters
The parameter estimation strategy is nested hill-climbing with restarts. The parameter set consists of:
Targets (global)
Coarticulation functions (local to each triphone token)

Estimating Targets
Initialize targets to the median formant frequency and bandwidth at phoneme centers, for all phonemes.
Use hill-climbing to optimize the targets; at each iteration, find the optimal coarticulation parameters (s and p).
Search intervals:
F1 = 200, 250, ..., 1000 Hz
F2 = 400, 450, ..., 3000 Hz
F3 = 900, 950, ..., 4000 Hz
F4 = 3000, 3050, ..., 6000 Hz
B1-B4 = 10, 60, ..., 460 Hz
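
For concreteness, the target search grids and the median initialization can be written out as below; the variable names are assumptions, but the ranges and step sizes follow the slide.

```python
import numpy as np

# Target search grids (Hz); all use a 50 Hz step, as listed above
target_grid = {
    "F1": np.arange(200, 1001, 50),
    "F2": np.arange(400, 3001, 50),
    "F3": np.arange(900, 4001, 50),
    "F4": np.arange(3000, 6001, 50),
}
bandwidth_grid = np.arange(10, 461, 50)   # shared by B1-B4

def initial_target(values_at_centers):
    """Initialize a phoneme target to the median value measured at that
    phoneme's centers across all tokens (frequency or bandwidth)."""
    return float(np.median(np.asarray(values_at_centers)))
```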

Estimating Coarticulation Parameters
For each existing triphone type, we consider its target parameters (T_L, T_C, T_R) and identify all trajectories belonging to that triphone.
We then jointly estimate the s_L and s_R values by a secondary hill-climbing method.
Finally, for each s value we use a tertiary hill-climbing method to estimate the optimal p_L and p_R parameters.
Search intervals:
s = 10, 20, ..., 150
p = −80, −70, ..., 80 ms, relative to the phoneme boundary
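
The nested estimation can be sketched as a simple exhaustive grid search (a stand-in for the hill-climbing with restarts described above, not the authors' exact procedure); error_fn and its signature are assumptions for illustration.

```python
import itertools
import numpy as np

S_GRID = np.arange(10, 151, 10)            # sigmoid slopes from the slide
P_GRID = np.arange(-0.080, 0.081, 0.010)   # midpoints (s) relative to the boundary

def fit_coarticulation(error_fn):
    """Nested search: joint (s_L, s_R) in the outer loop, (p_L, p_R) inside.

    error_fn(s_L, p_L, s_R, p_R) should return the summed model error over
    all tokens of one triphone type (illustrative signature).
    """
    best_err, best_params = np.inf, None
    for s_L, s_R in itertools.product(S_GRID, S_GRID):
        for p_L, p_R in itertools.product(P_GRID, P_GRID):
            err = error_fn(s_L, p_L, s_R, p_R)
            if err < best_err:
                best_err, best_params = err, (s_L, p_L, s_R, p_R)
    return best_params, best_err
```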

Aligned Local Triphone Models
Figure: F2 coarticulation functions for the sequence pau-sh-aa-pcl-p-pau.

Overlaid Coarticulation Functions
Figure: F2 coarticulation functions for the sequence pau-sh-aa-pcl-p-pau, overlaid.

Coarticulation Functions F
Figure: Cross-faded coarticulation functions.

Original and Final Spectrum
Figure: X_{N×4} = F_{N×P} T_{P×4}, shown over the sequence /pau/ /sh/ /aa/ /pcl/ /p/ /pau/.
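
The caption's matrix form can be sketched directly: stack the cross-faded coarticulation functions into F (N frames × P acoustic events) and multiply by the target matrix T (P events × 4 formants) to obtain the trajectories X. The shapes follow the caption; the example data are made up.

```python
import numpy as np

def synthesize_trajectories(F, T):
    """Formant trajectories as the matrix product X = F @ T.

    F: (N, P) cross-faded coarticulation functions; each row is a set of
       convex weights over the P acoustic-event targets.
    T: (P, 4) global targets (e.g. F1-F4 in Hz) per acoustic event.
    Returns X: (N, 4) frame-by-frame trajectories.
    """
    assert np.allclose(F.sum(axis=1), 1.0), "rows of F should sum to 1"
    return F @ T

# Tiny illustrative example: N=3 frames, P=2 acoustic events
F = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
T = np.array([[300.0,  800.0, 2500.0, 3500.0],
              [600.0, 1200.0, 2400.0, 3400.0]])
X = synthesize_trajectories(F, T)   # shape (3, 4)
```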

Parallel Style Corpus
One male, native speaker of American English.
Sentences contain a neutral carrier phrase (5 total) followed by a keyword (242 total) in sentence-final context, e.g. "I know the meaning of the word will."
Keywords are common English CVC words covering 23 initial and final consonants and 8 monophthongs.
All sentences were spoken in both clear and conversational styles, with two recordings per style of each sentence.
Total number of keyword tokens: 242 × 2 × 2 = 968.
A typical token generates four triphones.
Diphthongs are not represented.

Global Targets: Vowels
Figure: Estimated vowel targets (aa, iy, ae, eh, ah, ih, uw, uh) in the F1-F2 plane.

Global Targets: Approximants
Figure: Estimated approximant targets (l, w, r, y) in the F1-F2 plane.

Global Targets: Nasals
Figure: Estimated nasal targets (n, m, ng) in the F1-F2 plane.

Global Targets: Unvoiced Fricatives
Figure: Estimated unvoiced fricative targets (sh, th, f, h, s) in the F1-F2 plane.

Global Targets: Voiced Fricatives
Figure: Estimated voiced fricative targets (dh, v, z) in the F1-F2 plane.

Global Targets: Unvoiced Stops
Figure: Estimated unvoiced stop targets (ch, k, p, t) in the F1-F2 plane.

Global Targets: Voiced Stops
Figure: Estimated voiced stop targets (jh, d, b, g) in the F1-F2 plane.

Global Targets: Closures
Figure: Estimated closure targets (dcl, chcl, jhcl, tcl, bcl, gcl, pcl, kcl) in the F1-F2 plane.

Global Targets: Vowel Comparison
Figure: Vowel targets (circles) compared with Hillenbrand et al. (x) for iy, ih, eh, ae, ah, aa, uh, uw; F1 (red), F2 (green), and F3 (blue).

Global Targets: Consonant Comparison
Figure: Consonant targets (circles) compared with Allen et al. (x); F1 (red), F2 (green), and F3 (blue).

Asynchronous Formant Movement (clr)
Figure: Histogram of the standard deviation of p_L and p_R values over F1-F4, clr speech.

Asynchronous Formant Movement (cnv)
Figure: Histogram of the standard deviation of p_L and p_R values over F1-F4, cnv speech.

Audio Demo
Resynthesis uses linear predictive coding and formant analysis.
Energy and pitch trajectories are preserved.
Samples of both vocoded speech (left) and vocoded speech with model trajectories replacing the observed trajectories (right).
Figure: Spectrograms of the two resynthesized samples.

Validation
Goal: test whether resynthesis using model parameters and global targets produces intelligible speech.
Two vocoded stimulus conditions: observed and modeled formant trajectories.
Two styles: clr and cnv.
Two speakers: male and female.
Use 20% of the parallel corpus as testing material; the 80% training portion is used to determine targets.
Stimuli are loudness-normalized, with 12-talker babble noise added at +3 dB SNR.
AMT listeners hear the speech samples and must choose the term heard from a list of five terms, four of which are decoys.
Decoy terms are selected by closest phonetic similarity to the target term, using common CVC words (e.g. "fan", "van", "than", "pan", and "ban").

Clear/Conv Targets
Goal: are clr targets sufficient to model cnv-style speech? Does the opposite hold?
Four stimulus conditions (train/test): clr/clr, clr/cnv, cnv/clr, and cnv/cnv.

           Matched    Mismatched   All
Test cnv:  cnv/cnv    clr/cnv      cnv+clr/cnv
Test clr:  clr/clr    cnv/clr      cnv+clr/clr

Conclusions
Developed a new data-driven methodology to estimate style- and context-independent vowel and consonant formant targets.
Developed a joint optimization technique that robustly estimates coarticulation parameters.
Demonstrated analysis and synthesis of speech using the new continuous coarticulation model.
Outlined experiments to be conducted to validate the model and study style-mismatched targets.

Questions
Thank you!