Developing the Maltese Speech Synthesis Engine


Crimsonwing Research Team

The Maltese Text to Speech Synthesiser

- Crimsonwing (Malta) p.l.c. was awarded the tender to develop the Maltese Text to Speech Synthesiser by the Foundation for Information Technology Accessibility (FITA)
- Project co-financed by the EU's ERDF (European Regional Development Fund) (85%) and national funds (15%)
- Operational Programme I, Cohesion Policy: Investing in Competitiveness for a Better Quality of Life

The Maltese Text to Speech Synthesiser

Features:
- 3 different voices: male, female, child
- High quality: studio recorded (44 kHz, 16-bit sound quality)
- Neutral discourse
- Windows SAPI (Speech Application Programming Interface) compliant
- Interoperable with any SAPI-compliant application (e.g. Window-Eyes)
- Freely available for download
- Available in 2012

Text to Speech (TTS) Synthesis

1st generation (1960s to mid-1980s):
- Formant synthesis
- Articulatory synthesis (based on vocal tract models)
- Robotic sounding

2nd generation (mid-1980s to mid-1990s):
- Concatenative synthesis
- Single instance of each recorded unit
- Heavy DSP (digital signal processing)
- Can suffer from audible glitches at concatenation points
- First work in Maltese TTS falls here (P. Micallef, PhD 1998)

3rd generation (mid-1990s onwards):
- Concatenative synthesis with unit selection
- Multiple instances of each recorded unit
- Chooses the best chain of candidate units
- Less DSP
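"Choosing the best chain of candidate units" is commonly framed as a shortest-path search over the recorded candidates. The sketch below is a minimal illustration of that idea, not the engine's actual implementation: the `Unit` fields, the two cost functions, and the use of a single pitch and energy value per unit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded candidate for a diphone (hypothetical structure)."""
    diphone: str
    pitch_hz: float  # assumed: one representative pitch value per unit
    energy: float    # assumed: one representative energy value per unit


def target_cost(unit, target_pitch):
    # How far the candidate is from the pitch requested by the prosodic model.
    return abs(unit.pitch_hz - target_pitch)


def join_cost(prev, nxt):
    # How audible the join between two consecutive candidates is likely to be.
    return abs(prev.pitch_hz - nxt.pitch_hz) + abs(prev.energy - nxt.energy)


def select_units(candidates, target_pitches):
    """Viterbi-style search; candidates[i] lists the units for position i."""
    # best[i][k] = (cost of the cheapest chain ending in candidates[i][k], backpointer)
    best = [[(target_cost(u, target_pitches[0]), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][j][0] + join_cost(prev, unit), j)
                for j, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(unit, target_pitches[i]), back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    chain = []
    for i in range(len(candidates) - 1, -1, -1):
        chain.append(candidates[i][k])
        k = best[i][k][1]
    chain.reverse()
    return chain, total


# Made-up candidates: two recordings for each of three diphone positions.
candidates = [
    [Unit("_j", 110.0, 0.5), Unit("_j", 140.0, 0.6)],
    [Unit("jɪː", 150.0, 0.6), Unit("jɪː", 112.0, 0.5)],
    [Unit("ɪːn", 115.0, 0.5), Unit("ɪːn", 160.0, 0.9)],
]
chain, total = select_units(candidates, target_pitches=[120.0, 120.0, 120.0])
print([u.pitch_hz for u in chain], total)
```

With only one instance per unit (the 2nd generation case) there is nothing to search, which is why that approach has to rely on heavier signal processing to fit the units together.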

Evolution of the MSE

Prototype 1:
- Second-generation engine based on the diphones created by Prof. Paul Micallef
- SAPI compliant

Prototype 2:
- Third-generation engine
- One voice (male)
- Limited diphones and lexicon
- No intonation model

Prototype 3:
- Three voices
- Intonation model implemented
- Diphones (100K) and lexicon (30K)

Concatenative Speech Synthesis

What type of units to use for TTS?
- Diphones chosen for the Maltese TTS engine
- Compromise between the number of units and co-articulation effects
- Easier to do concatenation at the stationary parts of speech signals

[Figure: a diphone formed from two adjacent phones]


Simple Example

- The word "jiena" is converted to the phonetic form /jɪːnɐ/ via lexicon or rules
- It consists of the 4 phones /j/, /ɪː/, /n/, and /ɐ/
- These are grouped into 5 phone pairs (diphones): [_j], [jɪː], [ɪːn], [nɐ], and [ɐ_], where _ marks the silence at the word boundary
- We need to find the best sequence of diphones, taking into consideration pitch and energy

Concatenative Speech Synthesis

Example utterance: "Dan mhux xogħol ħafif, imma jrid isir." /dɐn mʊʃ ʃɔːl hɐfɪf ɪmmɐ jrɪt ɪsɪr/

Given some utterance to be synthesised:
- A phonemic transcription is generated
- The required prosodic model is generated
- A database holds the recorded speech, segmented into audio segments (units)
- The given utterance is divided into segments (units) and the best-matching units from the database are selected
- The units are concatenated
- Some DSP is applied to smooth the joins between the units
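The grouping of phones into diphones in the simple example can be sketched in a few lines. This is an illustrative sketch only: the function name, the `_` silence marker, and the IPA labels for "jiena" are assumptions about how the slide's example would be encoded, not the engine's own data format.

```python
SILENCE = "_"  # assumed marker for the silence before and after the word


def phones_to_diphones(phones):
    """Pad the phone sequence with silence and pair up adjacent phones."""
    padded = [SILENCE, *phones, SILENCE]
    return list(zip(padded, padded[1:]))


# The four phones of "jiena" (assumed IPA reading of the slide's transcription).
diphones = phones_to_diphones(["j", "ɪː", "n", "ɐ"])
print(diphones)
```

A sequence of n phones always yields n + 1 diphones once the boundary silences are included, which is why the 4 phones of the example give 5 diphones.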

Diphone Database

[Figure: sample Maltese text; labels: recorded speech corpus, diphone database, TTS]

- Quality of synthesised speech is highly dependent on the corpus of recorded speech used to create the diphone database
- Large database required for sufficiently natural-sounding speech (spanning several to tens of hours)
- Large number of diphones needed for unit selection TTS

Diphone Database Creation

[Figure: sample Maltese text; labels: recorded speech corpus, diphone database, TTS; diphone example /b/ + /ɐ/]

Diphone Coverage

- How many of the potential diphones occur in Maltese?
- Which are the most frequent diphones?
- Need statistics on diphone frequency and variation

Research paper: "Preparation of a Free-Running Text Corpus for Maltese Concatenative Speech Synthesis", presented at the 3rd International Conference on Maltese Linguistics, 8 April 2011
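The coverage questions above come down to counting diphones over a phonemically transcribed corpus. The following is a minimal sketch of such a count, assuming the corpus is already available as lists of phone labels; the function names, the `_` silence marker, and the tiny two-word corpus are hypothetical and are not taken from the cited paper.

```python
from collections import Counter
from itertools import product


def diphone_counts(utterances, silence="_"):
    """Count every adjacent phone pair, including utterance-boundary silences."""
    counts = Counter()
    for phones in utterances:
        padded = [silence, *phones, silence]
        counts.update(zip(padded, padded[1:]))
    return counts


def coverage(counts, inventory, silence="_"):
    """Fraction of all phone pairs (an upper bound on real diphones) observed."""
    symbols = [silence, *inventory]
    possible = set(product(symbols, repeat=2)) - {(silence, silence)}
    return len(set(counts) & possible) / len(possible)


# Toy corpus: "dan" twice and "mhux" once, as assumed phone sequences.
utterances = [["d", "ɐ", "n"], ["m", "ʊ", "ʃ"], ["d", "ɐ", "n"]]
inventory = sorted({p for phones in utterances for p in phones})

counts = diphone_counts(utterances)
cov = coverage(counts, inventory)
print(counts.most_common(3))
print(f"coverage: {cov:.3f}")
```

Treating every pair of symbols as a potential diphone overstates the denominator, since the phonotactics of a language rule out many combinations; a real study would restrict the set of candidates accordingly.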

Diphone Database Creation

[Figure: sample Maltese text; labels: recorded speech corpus, diphone database, TTS]

Diphone cutting:
- Manual process
- Performance of automatic diphone segmentation methods is currently limited
- Semi-automatic methods still require manual intervention
- Labour and time intensive

Lexicon:
- Phonemic transcription database
- Tool constructed to manage the database

Applications

- Spelli client application packaged with MSE
- MSE as a web service
- Online demo on fitamalta.eu
- ispeakmaltese (iPad / iPhone / iPod / Android / Windows Mobile 7)
