
An Open Source Speech Synthesis Frontend for HTS

Markus Toman and Michael Pucher
FTW Telecommunications Research Center Vienna
Donau-City-Straße 1, A-1220 Vienna, Austria
http://www.ftw.at
{toman,pucher}@ftw.at

Abstract. This paper describes a software framework for HMM-based speech synthesis that we have developed and released to the public. The framework is compatible with the well-known HTS toolkit by incorporating hts_engine and Flite. It enables HTS voices to be used as Microsoft Windows system voices and to be integrated into Android and iOS apps. Non-English languages are supported through the capability to load Festival-format pronunciation dictionaries and letter-to-sound rules. The release also includes an Austrian German voice model of a male, professional speaker recorded in studio quality, as well as a pronunciation dictionary, letter-to-sound rules and basic text preprocessing procedures for Austrian German. The framework is available under an MIT-style license.

Keywords: speech synthesis, HTS, hidden Markov model, frontend, software

1 Introduction

Hidden Markov Model (HMM)-based speech synthesis provides a methodology for flexible speech synthesis while keeping a low memory footprint [1]. It also enables speaker adaptation from average voice models, allowing the creation of new voice models from sparse voice data [2], as well as techniques like interpolation [3][4] and transformation [5][6] of voices. A well-known toolkit for creating HMM-based voice models is HTS [7]. Separate software toolkits are available to actually synthesize speech waveforms from HTS models; a popular, freely available framework is hts_engine [8].

Speech synthesis frontends, on the other hand, provide means for analyzing and processing text, producing the necessary input for the synthesizer. In HTS this input is a set of labels, where usually each label represents a single phone together with contextual information: surrounding phones, position in the utterance, prosodic information and so on. While not exclusively being frontends and not specifically targeted at HTS, popular choices are Festival [9] and Flite [10]. Festival is a comprehensive software framework for building speech synthesis systems, focusing on Unix-based operating systems. Flite was built as a lightweight alternative to Festival with a low memory footprint and fast runtime in mind.

Our main goal when creating the presented frontend framework was to easily allow HTS voices to be used with the Speech Application Programming Interface 5 (SAPI5). This allows the framework to be installed as a speech synthesis engine on different versions of the Microsoft Windows operating system, making HTS voice models available as system voices to applications like screen readers and e-book creators. The second goal was simple integration of new languages and phone sets. The third goal was portability to mobile devices. The framework is available under an MIT-style license at http://m-toman.github.io/salb/.

Flite has been adapted for HTS in the Flite+hts_engine software [8], and due to its small and portable nature it seemed like a good fit for our requirements. The structure of Flite, however, makes integrating new languages rather cumbersome (we have published instructions on adding a new language to Flite: http://sourceforge.net/p/atflite/wiki/addingnewlanguage/). Our framework therefore integrates Flite for text analysis of English while additionally providing a second text analysis module that can utilize Festival-style pronunciation dictionaries and letter-to-sound trees. Text preprocessing tasks (e.g. number and date conversion) can be added to this module in C++. Adding a completely new text processing module is also possible. The framework includes hts_engine for speech waveform synthesis and can be extended by other synthesizers. The framework also includes a free voice model of a male, professional speaker for Standard Austrian German.

2 Voice model Leo

With the framework we provide a free voice model of a male, professional speaker for Standard Austrian German, called Leo. The model is built from 3,700 utterances recorded in studio quality using a phonetically balanced corpus. The phone set used in the voice is shown in Table 1. A pronunciation dictionary with 14,000 entries, letter-to-sound rules and procedures for number conversion are also included.

Table 1. Phone set used for the Austrian German voice Leo.

  Category                      Phones (IPA)
  Vowels (monoph.)              A A: 6 6: 5 e E e: i I i: O o o: ø: æ: 5 œ œ: @ u U u: Y y y:
  Nasalized vowels (monoph.)    6: Õ: æ: œ:
  Diphthongs                    ai 6:5 A:5 6I AU E5 E:5 I5 i5 i:5 O5 O:5 o:5 OY ø:5 œ5 U5 u:5 U:5 Y5 y:5
  Plosives (stops)              g k P p t b d
  Nasal stops                   m n N
  Fricatives                    ç x f h s S v z Z
  Approximants                  j
  Trill                         r
  Lateral approx.               l

3 Framework architecture

The general architecture of the SALB framework is shown in Figure 1. Frontend modules provide means to communicate with the user or other applications through different channels: for example, the user can directly trigger speech synthesis tasks via the command-line interface (CLI), while other applications can use the SAPI5 interface to access the framework in a uniform manner alongside other synthesis engines that implement SAPI5. The frontend modules use the C++ Application Programming Interface (API) of the core manager module, which in turn coordinates the backend modules that perform text processing and the actual speech synthesis. The C++ API can also be used directly to embed the framework in other applications.

Fig. 1. General framework architecture (the frontend modules SAPI, CLI and JNI/Android issue API calls to the manager, which coordinates the text analysis backends, Flite and the internal text analysis, and the hts_engine synthesis engine)

3.1 Manager module and API

The core of the framework is the manager module, which provides a uniform API for frontend modules or other applications and abstracts the different elements of the speech synthesis process. This API is provided by the TTSManager class. A TextFragment is a piece of text in a given language. Each TextFragment has associated FragmentProperties, which control synthesis parameters (e.g. voice, speaking rate) for this text fragment. Multiple TextFragment objects can form a Text object; this can be used to synthesize a text consisting of fragments with different synthesis parameters (e.g. a text read by different voices).
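To make the API shape concrete, the following minimal sketch shows how an embedding application might drive it. The class names (TTSManager, TextFragment, FragmentProperties, Text, TTSResult) are taken from the description above; the namespace, method names, property keys and signatures are assumptions for illustration and will differ from the actual headers.

```cpp
// Hypothetical usage sketch of the SALB C++ API. Class names follow the
// paper; everything else (namespace, methods, keys) is assumed.
#include <string>

int main() {
    salb::TTSManager manager;                      // assumed namespace "salb"

    salb::FragmentProperties props;
    props.set("language", "de-at");                // assumed key/value interface
    props.set("voice", "leo");

    salb::Text text;                               // a Text groups TextFragments
    text.add(salb::TextFragment("Guten Tag!", props));

    // The manager selects a text analyzer and synthesizer, runs the pipeline
    // and returns a TTSResult holding the samples plus meta information.
    salb::TTSResult result = manager.synthesize(text);  // assumed method name

    // result would then typically be written out as a RIFF WAV file.
    return 0;
}
```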

A Text or TextFragment object with associated FragmentProperties can be passed to a TTSManager object, which executes the speech synthesis process and returns a TTSResult object. This process is depicted in Figure 2. The TTSManager object first selects an adequate TextAnalyzer object, based either on a value in FragmentProperties that explicitly names the text analyzer to use or on a value specifying the language of the text. The TextFragment is then passed to the TextAnalyzer object, which returns a series of Label objects in a container called Labels. A Label represents the basic unit for synthesis, usually a single phone with contextual information. The TTSManager then selects an adequate Synthesizer implementation, again based on FragmentProperties, and passes the Labels to it. The Label class is responsible for providing the desired format; currently the only available format is the HTS label format, which can easily be stored as a special-character-delimited string. The Synthesizer returns a TTSResult object containing the synthesized waveform as a vector of discrete samples, as well as meta information.

Fig. 2. Data flow in the synthesis process (TextFragment → TTSManager → TextAnalyzer → Labels → Synthesizer → TTSResult)
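For reference, an HTS full-context label encodes one phone and its context in a single delimited string, beginning with the quinphone context p1^p2-p3+p4=p5 followed by positional and prosodic features. The truncated line below is purely illustrative and follows the conventional HTS demo format; the exact feature set used by this framework's labels is not reproduced here.

```
sil^g-u+t=n@1_2/A:0_0_0/B:1-1-2@1-2&...
```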

3.2 Frontend interfaces

The following sections describe the frontend interfaces currently included in the framework.

Speech Application Programming Interface (SAPI). The framework provides a frontend implementing SAPI5. This allows the framework to be registered as a speech synthesis engine on Microsoft Windows platforms, thereby enabling HTS voices to be registered as system voices. We provide a Microsoft Visual Studio project [11] to compile a SAPI-enabled dynamic link library for 32-bit and 64-bit systems that can be registered with the operating system. Subsequently, voices can be added using the register voice tool that comes with the framework. Lastly, a script for creating installer packages that install or uninstall the engine and its associated voice models is bundled with the framework.

Command Line Interface. The distribution also contains a simple command-line tool which, given all necessary input for the text analysis and synthesis modules, produces RIFF WAV output files from textual input.

Android. Integration into Android apps is possible through the Java Native Interface (JNI) and the Android Native Development Kit (NDK) [12]. The framework comes with Android make files, a JNI wrapper and a Java class for demonstration purposes.

3.3 Text analysis modules

The following sections describe the text analysis modules currently included in the framework.

Flite. For converting English text to a series of labels for synthesis, we integrated Flite. The class FliteTextAnalyzer (derived from TextAnalyzer) is a wrapper converting input and output data to and from Flite.

Internal text analyzer. The distribution also comes with an internal text analysis module, InternalTextAnalyzer (derived from TextAnalyzer). This module reads a rules file consisting of a pronunciation dictionary and letter-to-sound rules in Festival-style format. Preprocessing of text (e.g. for numbers and dates) can be added by extending the Normalizer class, which delegates to different implementations based on the chosen language. We provide a simple Normalizer for Austrian German (AustrianGermanNormalizer) as well as a comprehensive rules file which can be used as an example for integrating new languages. This module uses an utterance structure consisting of the classes Phrase, Word, Syllable and Phone; the PhraseIterator class is used to navigate this structure when building the resulting Label objects.

3.4 Synthesis modules

The following section describes the synthesis module currently available in the framework.

hts_engine. This module provides a wrapper around hts_engine and is implemented by the class HTSEngineSynthesizer (derived from Synthesizer). The Label objects provide strings in HTS label format, which are the input to hts_engine. The resulting waveform is converted, encapsulated in a TTSResult object and returned to the TTSManager and subsequently to the caller. Based on the results in [4], we changed the speaking-rate modification algorithm to linear scaling.

4 Adding new languages

One main goal when developing the framework was to make it easy to integrate voice models for languages other than English. As literature aiding this process is scarce, we present some basic guidelines in the following sections.

4.1 Gathering data

Before creating a recording script, a defined phone set is needed. Such phone sets already exist for many languages; if not, the inventory of all relevant phones of a language should be defined in cooperation with phoneticians. One possibility is to gather conversational speech data in the target language and then produce manual transcriptions. The granularity of the transcription is very important and has a direct impact on the quality of the final voice models. For example, diphthongs can be modeled as separate phone symbols or split into two symbols.

Once a phone set is defined, recording scripts can be generated either through manual transcription or by using orthographic text (e.g. from newspapers) and a letter-to-sound system. The recording script should contain all phones in as many triphone (or better, quinphone) contexts as possible. In any case, each phone should occur multiple times (preferably at least 10 times, considering that a test set will also be split off from the training data). Given a data set of phonetically transcribed sentences, algorithms for the set cover problem can be employed to produce a final recording script; a greedy sketch is given below. From this data set, a test set can be selected by the same procedure (e.g. select the smallest set that contains each phone at least 2 times). Speakers should be recorded in a studio setting and with neutral prosody. The output of this step is a phone set, recording scripts and a corpus of recorded utterances as waveforms and transcriptions.
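The following self-contained C++ sketch illustrates the greedy set-cover heuristic mentioned above: repeatedly pick the sentence that covers the most not-yet-covered units. It is an illustration, not part of the framework; in practice the units would be triphones or quinphones and minimum occurrence counts would also be enforced.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// A candidate sentence together with the context units (e.g. triphones)
// it contains.
struct Sentence {
    std::string text;
    std::set<std::string> units;
};

// Greedy set cover: keep selecting the sentence that adds the most
// still-uncovered units until no sentence adds anything new.
std::vector<Sentence> greedyCover(std::vector<Sentence> pool) {
    std::vector<Sentence> script;
    std::set<std::string> covered;
    while (!pool.empty()) {
        size_t best = 0, bestGain = 0;
        for (size_t i = 0; i < pool.size(); ++i) {
            size_t gain = 0;
            for (const auto& u : pool[i].units)
                if (!covered.count(u)) ++gain;
            if (gain > bestGain) { bestGain = gain; best = i; }
        }
        if (bestGain == 0) break;  // nothing new left to cover
        covered.insert(pool[best].units.begin(), pool[best].units.end());
        script.push_back(pool[best]);
        pool.erase(pool.begin() + best);
    }
    return script;
}

int main() {
    // Toy pool with invented triphone annotations.
    std::vector<Sentence> pool = {
        {"Guten Tag.",     {"sil-g+u", "g-u+t", "u-t+n"}},
        {"Tag und Nacht.", {"g-u+t", "u-n+d", "n-a+x"}},
        {"Nacht.",         {"n-a+x"}},
    };
    for (const auto& s : greedyCover(pool))
        std::cout << s.text << "\n";  // the selected recording script
    return 0;
}
```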

4.2 Integration

The internal text analyzer reads rules for text processing from an input stream. These rules consist of a lexicon and a letter-to-sound tree: a word is first looked up in the lexicon, and if it cannot be found there, the letter-to-sound rules are used to create the phonetic transcription. A voice model and the text processing rules are sufficient for basic speech synthesis and for building SAPI5 voice packages with the framework. Extending the source code is only necessary if more sophisticated text processing is required; the C++ class AustrianGermanNormalizer can be used as an example for this. The method InternalTextAnalyzer::TextFragmentToPhrase can be adapted to implement alternatives to the Festival lexica and letter-to-sound rules.

Lexicon. The lexicon (or pronunciation dictionary) part of the text rules is a set of mappings from orthography to phonetics. The framework uses Festival-style lexica in Scheme syntax (see the Festival lexicon definition: http://www.cstr.ed.ac.uk/projects/festival/manual/festival_13.html); an illustrative entry is shown at the end of this section. A lexicon can be derived from the data corpus used for the recording or from publicly available pronunciation dictionaries.

Letter-to-sound rules. The internal text analyzer is able to read Festival-style letter-to-sound rules (see http://www.cstr.ed.ac.uk/projects/festival/manual/festival_13.html#SEC43). A method for building letter-to-sound rules from an existing lexicon can be found in [13]. For languages with an orthography very close to the phonetics, the letter-to-sound rules can also be hand-crafted.
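For orientation, a Festival-style lexicon entry maps a word and an optional part-of-speech tag to a list of syllables, each being a list of phones plus a stress marker. The entry below is invented for illustration (using phones from Table 1) and is not taken from the Leo lexicon:

```
("tag" nil (((t A: k) 1)))
```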

4.3 Training voice models

Voice models to be used with the SALB framework can be trained using the HTS toolkit; the available demonstration scripts for speaker-dependent voices can be adapted for this purpose. When using the demonstration scripts, it is necessary to replace the raw sound files, the labels and the question files. Label files can be produced with the SALB framework once a lexicon containing at least all words of the training set is available. The question files contain the questions used in decision-tree-based clustering [14]; an illustrative example of phone questions in a decision tree is shown in Figure 3. A minimal question file should contain at least all phone identity questions; e.g., for a phone oh and quinphone models, at least the questions LL-oh, L-oh, C-oh, R-oh and RR-oh should be defined (see the sketch below). When all parameters (e.g. sampling rate) have been set, the training process can be started to produce an htsvoice model file that can be used with the SALB framework.

Fig. 3. Questions in decision-tree-based clustering (an example tree splitting on C-Vowel, C-oh, L-oh and C-t, with leaf distributions mcep_s2_21 through mcep_s2_25)
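As a concrete sketch, phone identity questions in the usual HTS question file format are pattern matches against the full-context label string (p1^p2-p3+p4=p5@...). For the phone oh, the five quinphone identity questions would look roughly as follows; the patterns follow the HTS demo conventions and should be checked against the label format actually used.

```
QS "LL-oh"  {oh^*}
QS "L-oh"   {*^oh-*}
QS "C-oh"   {*-oh+*}
QS "R-oh"   {*+oh=*}
QS "RR-oh"  {*=oh@*}
```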

5 Conclusion

We have presented an open source speech synthesis framework: software that bridges existing tools for HTS-based synthesis, namely hts_engine and Flite, with SAPI5 to enable HTS voices to be used as Windows system voices. It can load and use Festival-style lexica and letter-to-sound rules for easy integration of new languages. The C++ API allows embedding in other applications as well as in Android and iOS apps. Also included is a voice model of a male, professional Standard Austrian German speaker, together with a lexicon and letter-to-sound rules. We have also given a brief tutorial on how to train voice models for further languages and use them with the SALB framework.

Acknowledgements

This work was supported by the Austrian Science Fund (FWF): P22890-N23. The Competence Center FTW Forschungszentrum Telekommunikation Wien GmbH is funded within the program COMET (Competence Centers for Excellent Technologies) by BMVIT, BMWA, and the City of Vienna. The COMET program is managed by the FFG.

References

1. Zen, H., Tokuda, K., Black, A.W.: Statistical parametric speech synthesis. Speech Communication 51(11) (2009) 1039-1064
2. Yamagishi, J., Kobayashi, T.: Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Transactions on Information and Systems E90-D(2) (February 2007) 533-543
3. Pucher, M., Schabus, D., Yamagishi, J., Neubarth, F., Strom, V.: Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis. Speech Communication 52(2) (February 2010) 164-179
4. Valentini-Botinhao, C., Toman, M., Pucher, M., Schabus, D., Yamagishi, J.: Intelligibility analysis of fast synthesized speech. In: Proc. Interspeech, Singapore (September 2014) 2922-2926
5. Wu, Y.J., Nankaku, Y., Tokuda, K.: State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis. In: Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH), Brighton, United Kingdom (2009) 528-531
6. Toman, M., Pucher, M., Schabus, D.: Cross-variety speaker transformation in HSMM-based speech synthesis. In: Proceedings of the 8th ISCA Workshop on Speech Synthesis (SSW), Barcelona, Spain (August 2013) 77-81
7. Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A.W., Tokuda, K.: The HMM-based speech synthesis system (HTS) version 2.0. In: Proceedings of the 6th ISCA Workshop on Speech Synthesis (SSW), Bonn, Germany (August 2007) 294-299
8. HTS working group: hts_engine. http://hts-engine.sourceforge.net/
9. University of Edinburgh: Festival. http://www.cstr.ed.ac.uk/projects/festival/
10. Carnegie Mellon University: Flite. http://www.festvox.org/flite/
11. Microsoft Corporation: Visual Studio. https://www.visualstudio.com/
12. Google Inc.: Android NDK. https://developer.android.com/tools/sdk/ndk/
13. Black, A.W., Lenzo, K., Pagel, V.: Issues in building general letter to sound rules. In: The Third ESCA Workshop on Speech Synthesis (1998) 77-80
14. Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T.: Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In: Proceedings of the 6th European Conference on Speech Communication and Technology (EUROSPEECH), Budapest, Hungary (September 1999) 2347-2350