Information Technology - Coding of Audiovisual Objects Part 3: Audio

Similar documents
ISO/IEC TR TECHNICAL REPORT. Information technology Coding of audio-visual objects Part 24: Audio and systems interaction

This is a preview - click here to buy the full publication INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia content description interface Part 1: Systems

B C ISO/IEC 9595 INTERNATIONAL STANDARD. Information technology Open Systems Interconnection Common management information service

ISO/IEC INTERNATIONAL STANDARD. Information technology JPEG 2000 image coding system: Motion JPEG 2000

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology JPEG 2000 image coding system Part 3: Motion JPEG 2000

ISO INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Cloud computing Reference architecture

ISO/IEC INTERNATIONAL STANDARD. Information technology Coding of audio-visual objects Part 18: Font compression and streaming

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia application format (MPEG-A) Part 13: Augmented reality application format

ISO/IEC 8822 INTERNATIONAL STANDARD. Information technology - Open Systems Interconnection - Presentation service definition

INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology ASN.1 encoding rules: XML Encoding Rules (XER)

INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology EAN/UCC Application Identifiers and Fact Data Identifiers and Maintenance

B C ISO/IEC INTERNATIONAL STANDARD

ISO/IEC Information technology Coding of audio-visual objects Part 15: Advanced Video Coding (AVC) file format

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Cloud computing Overview and vocabulary

ISO/IEC INTERNATIONAL STANDARD. Information technology - Digital compression and coding of continuous-tone still images: Compliance testing

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia service platform technologies Part 2: MPEG extensible middleware (MXM) API

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Dynamic adaptive streaming over HTTP (DASH) Part 2: Conformance and reference software

ISO IEC. INTERNATIONAL ISO/IEC STANDARD Information technology Fibre Distributed Data Interface (FDDI) Part 5: Hybrid Ring Control (HRC)

B C ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology CDIF transfer format Part 3: Encoding ENCODING.1

INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia application format (MPEG-A) Part 4: Musical slide show application format

INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology JPEG 2000 image coding system Part 14: XML representation and reference

This document is a preview generated by EVS

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia service platform technologies Part 5: Service aggregation

INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology MPEG video technologies Part 4: Video tool library

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology MPEG audio technologies Part 3: Unified speech and audio coding

ISO 3901 INTERNATIONAL STANDARD. Information and documentation International Standard Recording Code (ISRC)

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Coding of audio-visual objects Part 16: Animation Framework extension (AFX)

ISO/IEC INTERNATIONAL STANDARD. Information technology JPEG 2000 image coding system: Reference software

ISO/IEC INTERNATIONAL STANDARD. Information technology ASN.1 encoding rules: Specification of Octet Encoding Rules (OER)

ISO/IEC INTERNATIONAL STANDARD. Information technology ASN.1 encoding rules: Specification of Encoding Control Notation (ECN)

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia framework (MPEG-21) Part 21: Media Contract Ontology

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia content description interface Part 5: Multimedia description schemes

ISO/IEC INTERNATIONAL STANDARD. Information technology Coding of audio-

ISO/IEC INTERNATIONAL STANDARD. Information technology MPEG extensible middleware (MXM) Part 3: MXM reference software

ISO/IEC INTERNATIONAL STANDARD

ISO INTERNATIONAL STANDARD. Technical product documentation Lettering Part 4: Diacritical and particular marks for the Latin alphabet

ISO/IEC INTERNATIONAL STANDARD. Information technology Open Systems Interconnection The Directory Part 5: Protocol specifications

ISO/IEC INTERNATIONAL STANDARD. Information technology MPEG systems technologies Part 5: Bitstream Syntax Description Language (BSDL)

ISO/IEC INTERNATIONAL STANDARD. Information technology Security techniques Entity authentication

ISO/IEC TR TECHNICAL REPORT. Information technology Telecommunications and information exchange between systems Managed P2P: Framework

ISO/IEC INTERNATIONAL STANDARD. Information technology Open Distributed Processing Interface references and binding

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia service platform technologies Part 3: Conformance and reference software

ISO/IEC TR This is a preview - click here to buy the full publication TECHNICAL REPORT. First edition

This document is a preview generated by EVS

ISO/IEC Information technology Open Systems Interconnection The Directory: Protocol specifications

This is a preview - click here to buy the full publication INTERNATIONAL STANDARD

ISOJIEC INTERNATIONAL STANDARD

ISOJIEC I INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD

INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia application format (MPEG-A) Part 11: Stereoscopic video application format

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Software engineering Product evaluation Part 3: Process for developers

ISOAEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Abstract Syntax Notation One (ASN.1): Specification of basic notation

ISO/IEC Information technology Multimedia content description interface Part 7: Conformance testing

INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology - Open Distributed Processing - Reference Model: Foundations

ISO. International Organization for Standardization. ISO/IEC JTC 1/SC 32 Data Management and Interchange WG4 SQL/MM. Secretariat: USA (ANSI)

ISO/IEC INTERNATIONAL STANDARD. Information technology Open systems interconnection Part 1: Object identifier resolution system

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia content description interface Part 2: Description definition language

INTERNATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia Middleware Part 6: Fault management

ISO/IEC INTERNATIONAL STANDARD. Information technology Open distributed processing Reference model: Foundations

ISO/IEC INTERNATIONAL STANDARD. Information technology JPEG 2000 image coding system: An entry level JPEG 2000 encoder

lso/iec INTERNATIONAL STANDARD Information technology - Remote Operations: Concepts, model and notation

AMERICAN NATIONAL STANDARD

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia content description interface Part 4: Audio

This document is a preview generated by EVS

ISO/IEC INTERNATIONAL STANDARD. Software engineering Software measurement process. Ingénierie du logiciel Méthode de mesure des logiciels

INTERNATIONAL ISO/IEC STANDARD

INTERNATIONAL STANDARD

This is a preview - click here to buy the full publication INTERNATIONAL STANDARD

Transcription:

ISO/IEC CD 14496-3TTS ÃISO/IEC ISO/IEC JTC 1/SC 29 N 2203 Date:1997-10-31 ISO/IEC CD 14496-3 Subpart 6 ISO/IEC JTC 1/SC 29/WG 11 Secretariat: Information Technology - Coding of Audiovisual Objects Part 3: Audio Subpart 6 : Text-to-Speech Document type:international Standard Document subtype:not applicable Document stage:(30) Committee Document language:e C:\MPEG-CD\1903TTS.DOCISOSTDVersion 1.8 1996-10-30 II

ÃISO/IEC ISO/IEC CD 14496-3TTS Contents 1 SCOPE...1 2 NORMATIVE REFERENCES...1 3 DEFINITIONS...1 4 SYMBOLS AND ABBREVIATIONS...2 5 METHOD OF DESCRIBING SYNTAX...3 6 MPEG-4 AUDIO TEXT-TO-SPEECH BITSTREAM SYNTAX...3 6.1 MPEG-4 AUDIO TTSSPECIFICCONFIG...3 6.2 MPEG-4 AUDIO TEXT-TO-SPEECH PAYLOAD...3 6.3 MPEG-4 AUDIO TEXT-TO-SPEECH SENTENCE...3 7 SEMANTICS FOR MPEG-4 AUDIO TEXT-TO-SPEECH BITSTREAM...5 7.1 GENERAL...5 7.2 MPEG-4 AUDIO TTSSPECIFICCONFIG...5 7.3 MPEG-4 AUDIO TEXT-TO-SPEECH PAYLOAD...5 7.4 MPEG-4 AUDIO TEXT-TO-SPEECH SENTENCE...6 8 MSDL-S REPRESENTATION OF THE BITSTREAM...7 8.1 TTSSPECIFICCONFIG...7 8.2 ALPDUPAYLOAD...8 8.3 TTSSENTENCE...8 8.4 TTSPROSODY...9 9 MPEG-4 AUDIO TEXT-TO-SPEECH DECODING PROCESS...10 9.1 INTERFACE BETWEEN DEMUX AND SYNTACTIC DECODER...11 9.2 INTERFACE BETWEEN SYNTACTIC DECODER AND SPEECH SYNTHESIZER...11 9.3 INTERFACE FROM SPEECH SYNTHESIZER TO COMPOSITOR...11 9.4 INTERFACE FROM COMPOSITOR TO SPEECH SYNTHESIZER...12 9.5 INTERFACE BETWEEN SPEECH SYNTHESIZER AND PHONEME-TO-FAP CONVERTER...12 ANNEX A (INFORMATIVE) APPLICATIONS OF MPEG-4 AUDIO TEXT-TO-SPEECH DECODER...14 III

ISO/IEC CD 14496-3TTS ÃISO/IEC Foreword Boilerplate text for ISO standards: ISO (the International Organisation for Standardisation) is a world-wide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organisations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardisation. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote. Boilerplate text for ISO/IEC standards: ISO (the International Organisation for Standardisation) and IEC (the International Electrotechnical Commission) form the specialised system for world-wide standardisation. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organisation to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organisations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1. Draft International Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International Standard requires approval by at least 75 % of the national bodies casting a vote. IV

ÃISO/IEC ISO/IEC CD 14496-3TTS Introduction This document is the 6 th subpart of the MPEG-4 Audio phase 1 CD (ISO Document Number: CD14496-3). The MPEG-4 Audio phase 1 CD is divided into six sections: section 1 (Main Document), section 2 (MPEG-4 Audio Parametric Coding), section 3 (MPEG-4 Audio Code Excited Linear Prediction (CELP) Coding), section 4 (MPEG-4 Audio Transform Coding), section 5 (MPEG-4 Audio Structured Audio), and section 6 (MPEG-4 Audio Text-to- Speech Coding, this file). V

ÃISO/IEC ISO/IEC CD 14496-3TTS Information Technology - Coding of Audiovisual Objects Part 3 : Audio Subpart 6 : Text-to-Speech 1 Scope This subpart of ISO/IEC CD 14496-3 specifies the coded representation of MPEG-4 Audio Text-to-Speech (M-TTS) and its decoder for high quality synthesized speech and for enabling various applications. The exact synthesis method is not a standardization issue partly because there are already various speech synthesis techniques. This subpart of ISO/IEC CD 14496-3 is intended for application to M-TTS functionalities such as those for facial animation (FA) and moving picture (MP) interoperability with a coded bitstream. The M-TTS functionalities include a capability of utilizing prosodic information extracted from natural speech. They also include the applications to the speaking device for FA tools and a dubbing device for moving pictures by utilizing lip shape and input text information. The text-to-speech (TTS) synthesis technology is recently becoming a rather common interface tool and begins to play an important role in various multimedia application areas. For instance, by using TTS synthesis functionality, multimedia contents with narration can be easily composed without recording natural speech sound. Moreover, TTS synthesis with facial animation (FA) / moving picture (MP) functionalities would possibly make the contents much richer. In other words, TTS technology can be used as a speech output device for FA tools and can also be used for MP dubbing with lip shape information. In MPEG-4, common interfaces only for the TTS synthesizer and for FA/MP interoperability are proposed. The proposed MPEG-4 Audio TTS functionalities can be considered as a superset of the conventional TTS framework. This TTS synthesizer can also utilize prosodic information of natural speech in addition to input text and can generate much higher quality synthetic speech. The interface bitstream format is strongly user-friendly: if some parameters of the prosodic information are not available, the missed parameters are generated by utilizing pre- established rules. The functionalities of the M-TTS thus range from conventional TTS synthesis function to natural speech coding and its application areas, i.e., from a simple TTS synthesis function to those for FA and MP. 2 Normative References The following standards contain provisions which, through reference in this text, constitute provisions of this International Standard. At the time of publication, the editions indicated were valid. All standards are subject to revision, and parties to agreements based on this International Standard are encouraged to investigate the possibility of applying the most recent editions of the standards indicated below. Members of IEC and ISO maintain registers of currently valid International Standards. ISO/IEC 10646-1:1993 Information Technology - Universal Multiple-Octet Coded Character Set (UCS) - Prt 1: Architecture and Basic Multilingual Plane. The Unicode Consortium: 1996 The Unicode Standard Version 2.0 (Published by Addison-Wesley Developers Press) 3 Definitions International Phonetic Alphabet; IPA The worldwide agreed symbol set to represent various phonemes appearing in human speech. lip shape pattern A number that specifies a particular pattern of the preclassified lip shape. 1

ISO/IEC CD 14496-3TTS ÃISO/IEC lip synchronization A functionality that synchronizes speech with corresponding lip shapes. MPEG-4 Audio Text-to-Speech Decoder A device that produces synthesized speech by utilizing the M-TTS bitstream while supporting all the M-TTS functionalities such as speech synthesis for FA and MP dubbing. moving picture dubbing A functionality that assigns synthetic speech to the corresponding moving picture while utilizing lip shape pattern information for synchronization. M-TTS sentence This defines the information such as prosody, gender, and age for only the corresponding sentence to be synthesized. M-TTS sequence This sequence can contain multiple M-TTS sentences and additional information that affects all M-TTS sentences in the sequence. phoneme-to-fap converter A device that converts phoneme information to FAPs. text-to-speech synthesizer A device producing synthesized speech according to the input sentence character strings. trick mode A set of functions that enables stop, play, forward, and backward operations for users. 4 Symbols and Abbreviations The following symbols and the abbreviations are defined to describe this International Standard. F0: fundamental frequency (pitch frequency) CELP: code excited linear prediction DEMUX: demultiplexer FA: facial animation FAP: facial animation parameter ID: identifier IPA: International Phonetic Alphabet MP: moving picture M-TTS: MPEG-4 Audio TTS STOD: story teller on demand TTS: text-to-speech 2

ÃISO/IEC ISO/IEC CD 14496-3TTS 5 Method of Describing Syntax The bitstream retrieved by the decoder is described in 6. Each data item in the bitstream is in bold type. It is described by its name, its length in bits, and a mnemonic for its type and order of transmission. 6 MPEG-4 Audio Text-to-Speech Bitstream Syntax 6.1 MPEG-4 Audio TTSSpecificConfig Syntax No. of bits Mnemonic TTSSpecificConfig() { Language_Code 18 uimsbf Video_Enable 1 bslbf Lip_Shape_Enable 1 bslbf Trick_Mode_Enable 1 bslbf 6.2 MPEG-4 Audio Text-to-Speech Payload Syntax No. of bits Mnemonic AlPduPayload { TTS_Sequence() TTS_Sequence() { TTS_Sequence_Start_Code sc+8=32 bslbf TTS_Sequence_ID 5 uimsbf Language_Code 18 uimsbf Gender_Enable 1 bslbf Age_Enable 1 bslbf Speech_Rate_Enable 1 bslbf Prosody_Enable 1 bslbf Video_Enable 1 bslbf Lip_Shape_Enable 1 bslbf Trick_Mode_Enable 1 bslbf do { TTS_Sentence() while (next_bits()==tts_sentence_start_code) 6.3 MPEG-4 Audio Text-to-Speech Sentence Syntax No. of bits Mnemonic TTS_Sentence() { TTS_Sentence_Start_Code sc+8=32 bslbf TTS_Sentence_ID 10 uimsbf 3

ISO/IEC CD 14496-3TTS ÃISO/IEC Silence 1 bslbf if (Silence) { Silence_Duration 12 uimsbf else { if (Gender_Enable) { Gender 1 bslbf if (Age_Enable) { Age 3 uimsbf if (!Video_Enable && Speech_Rate_Enable) { Speech_Rate 4 uimsbf Length_of_Text 12 uimsbf for (j=0; j<length_of_text; j++) { TTS_Text 8 bslbf if (Prosody_Enable) { Dur_Enable 1 bslbf F0_Contour_Enable 1 bslbf Energy_Contour_Enable 1 bslbf Number_of_Phonemes 10 uimsbf Phoneme_Symbols_Length 13 uimsbf for (j=0 ; j<phoneme_symbols_length ; j++) { Phoneme_Symbols 8 bslbf for (j=0 ; j<number_of_phonemes ; j++) { if(dur_enable) { Dur_each_Phoneme 12 uimsbf if (F0_Contour_Enable) { num_f0 5 uimsbf for (j=0; j<num_f0;j++) { F0_Contour_each_Phoneme 8 uimsbf F0_Contour_each_Phoneme_Time 12 uimsbf if (Energy_Contour_Enable) { Energy_Contour_each_Phoneme 8*3=24 uimsbf 4

ÃISO/IEC ISO/IEC CD 14496-3TTS if (Video_Enable) { Sentence_Duration 16 uimsbf Position_in_Sentence 16 uimsbf Offset 10 uimsbf if (Lip_Shape_Enable) { Number_of_Lip_Shape 10 uimsbf for (j=0 ; j<number_of_lip_shape ; j++) { Lip_Shape_in_Sentence 16 uimsbf Lip_Shape 8 uimsbf 7 Semantics for MPEG-4 Audio Text-to-Speech Bitstream 7.1 General The M-TTS bitstream is described in three parts; M-TTS specific configuration in 7.2, M-TTS sequence in 7.3, and M-TTS sentence in 7.4. M-TTS system interface contains information that specific M-TTS decoder(s) should be used. M-TTS sequence can contain multiple M-TTS sentences. The information in the M-TTS sequence affects all M-TTS sentences while the information in the M-TTS sentence affects only the corresponding M-TTS sentence. 7.2 MPEG-4 Audio TTSSpecificConfig Language_Code - When this is 00 (00110000 00110000 in binary), the IPA is to be sent. In all other languages, this is the ISO 639 Language Code. In addition to this 16 bits, two bits that represent dialects of each language is added is added at the end (user defined). Video_Enable - This is a one-bit flag which is set to 1 when the MPEG-4 Audio TTS decoder works with MP. Lip_Shape_Enable - This is a one-bit flag that is set to 1 when the coded input bitstream has lip shape information. Trick_Mode_Enable - This is a one-bit flag which is set to 1 when the coded input bitstream permits trick mode functions such as stop, play, forward, and backward. 7.3 MPEG-4 Audio Text-to-Speech Payload TTS_Sequence_Start_Code - This is a bit string 000000XX in hexadecimal that identifies the start of the M-TTS coded sequence. TTS_Sequence_ID - This is a five-bit ID to uniquely identify each object appearing in one scene. Language_Code - When this is 00 (00110000 00110000 in binary), the IPA is to be sent. In all other languages, this is the ISO 639 Language Code. In addition to this 16 bits, two bits that represent dialects of each language is added is added at the end (user defined). 5

ISO/IEC CD 14496-3TTS ÃISO/IEC Gender_Enable - This is a one-bit flag which is set to 1 when the gender information exists. Age_Enable - This is a one-bit flag which is set to 1 when the age information exists. Speech_Rate_Enable - This is a one-bit flag which is set to 1 when the speech rate information exists. Prosody_Enable - This is a one-bit flag which is set to 1 when the prosody information exists. VideoEnable This is a one-bit flag which is set to 1 when the M-TTS decoder works with MP. In this case, M-TTS should synchronize synthetic speech to MP and accommodate the functionality of ttsforward and ttsbackward. When VideoEnable flag is set, M-TTS decoder uses system clock to select adequate TTS-Sentence frame and fetches Sentence_Duration, Position_in_Sentence, Offset data. TTS synthesizer assigns appropriate duration for each phoneme to meet Sentence_Duration. The starting point of speech in a sentence is decided by Position_in_Sentence. If Position_in_Sentence equals 0 (the starting point is the initial of sentence), TTS uses Offset as a delay time to synchronize synthetic speech to MP. LipshapeEnable This is a one-bit flag which is set to 1 when the coded input bitstream has lip shape information. With lip shape information, M-TTS can request FA tool to change lip shape according to timing information (Lip_Shape_in_Sentence) and predifined lip shape pattern. Trick_Mode_Enable - This is a one-bit flag which is set to 1 when the coded input bitstream permits trick mode functions such as stop, play, forward, and backward. **Ed : Right position of this syntax element should be considered later with system group. Object_Descriptor or node definition in BIFS? 7.4 MPEG-4 Audio Text-to-Speech Sentence TTS_Sentence_Start_Code - This is a bit string 000000XX in hexadecimal which identifies the start of the sentence to be processed. TTS_Sentence_ID - This is a ten-bit ID to uniquely identify a sentence in the M-TTS text data sequence for indexing purpose. Silence - This is a one-bit flag which is set to 1 when the current position is silence. Silence_Duration - This defines the time duration of the current silence segment in milliseconds. It has a value from 1 to 1023. The value 0 is prohibited. Gender - This is a one-bit which is set to 1 if the gender of the synthetic speech producer is male and 0, if female. Age - This represents the age of the speaker for synthetic speech. The meaning of age is defined in the table below. Age age of the speaker 000 below 6 001 6-12 010 13-18 011 19-25 100 26-34 101 35-45 110 45-60 111 over 60 6

ÃISO/IEC ISO/IEC CD 14496-3TTS SpeechRate This defines the synthetic speech rate in 16 levels. This is user-defined since the average speaking speed of each language differs. Length_of_Text - This identifies the length of the TTS_Text data in bytes. TTS_Text - This is a character string containing the input text. Dur_Enable - This is a one bit flag which is set to 1when the duration information for each phoneme exists. F0_Contour_Enable - This is a one bit flag which is set to 1 when the pitch contour information for each phoneme exists. Energy_Contour_Enable - This is a one bit flag which is set to 1 when the energy contour information for each phoneme exists. Number_of_Phonemes - This defines the number of phonemes needed for speech synthesis of the input text. Phonemes_Symbols_Length This identifies the length of Phonemes_Symbols (IPA code) data in bytes since the IPA code has optional modifiers and dialect codes. Phoneme_Symbols - This defines the indexing number for the current phoneme by using the Unicode 2.0 numbering system. Each phoneme symbol is represented as a number for the corresponding IPA. Three two-byte numbers can be used for each IPA representation including a two-byte integer for the character, and an optional two-byte integer for the spacing modifier, and another optional two-byte integer for the diacritical mark. Dur_each_Phoneme - This defines the duration of each phoneme in msec. num_f0 This defines the number of F0 values specified for the current phoneme. F0_Contour_each_Phoneme This defines F0 value in Hz at time instant F0_Contour_each_Phoneme_Time. F0_Contour_each_Phoneme_Time This defines the integer number of the time in ms for the position of the F0_Contour_each_Phoneme. Energy_Contour_each_Phoneme - This defines the energy level of current phoneme in integer db for three points (0%, 50%, 100% positions of the phoneme are for three points). Sentence_Duration - This defines the duration of the sentence in msec. Position_in_Sentence - This defines the position of the current stop in a sentence as an elapsed time in msec. Offset - This defines the duration of a very short pause before the start of synthesized speech output in msec. Number_of_Lip_Shape - This defines the number of lip-shape patterns to be processed. Lip_Shape_in_Sentence - This defines the position of each lip shape from the beginning of the sentence in msec. Lip_Shape - This defines the indexing number for the current lip-shape pattern to be processed that is defined in the table 12-5 in CD14496-2. 8 MSDL-S Representation of the Bitstream 8.1 ttsspecificconfig class ttsspecificconfig { 7

ISO/IEC CD 14496-3TTS ÃISO/IEC unsigned int(18) languagecode; bit(1) videoenable; bit(1) lipshapeenable; bit(1) trickmodeenable; 8.2 alpdupayload class alpdupayload { ttssequence sequence; class ttssequence : aligned bit(32) tts_sequence_start_code { unsigned int(5) ttssequenceid; unsigned int(18) languagecode; bit(1) genderenable; bit(1) ageenable; bit(1) speechrateenable; bit(1) prosodyenable; bit(1) videoenable; bit(1) lipshapeenable; bit(1) trickmodeenable; ttssentence sentence; 8.3 ttssentence class ttssentence : aligned bit(32) tts_sentence_start_code { unsigned int(10) ttssentenceid; bit(1) silence; if (silence) { unsigned int(12) silenceduration; else { if (genderenable) { unsigned int(1) gender; 8

ÃISO/IEC ISO/IEC CD 14496-3TTS if (ageenable) { unsigned int(3) age; if (!videoenable && speechrateenable) { unsigned int(4) speechrate; unsigned int(12) lengthtext; unsigned int(8) ttstext[lengthtext]; if (prosodyenable) { ttsprosody prosody; if (videoenable) { unsigned int(16) sentenceduration; unsigned int(16) positioninsentence; unsigned int(10) offset; if (lipshapeenable) { unsigned int(10) numberlipshape unsigned int(16) LipShapeInSentence[numberLipShape] unsigned int(8) lipshape[numberlipshape]; 8.4 ttsprosody class ttsprosody { bit(1) durenable; bit(1) f0contourenable; bit(1) energycontourenable; unsigned int(10) numberphonemes; unsigned int(13) phonemesymbolslength; 9

ISO/IEC CD 14496-3TTS ÃISO/IEC unsigned int(8) phonemesymbols[phonemesymbolslength] ; for (j=0 ; j<numberofphonemes ; j++) { if (durenable) { unsigned int(12) dureachphoneme ; if (f0contourenable) { unsigned int(5) numf0 ; unsigned int(8) f0contoureachphoneme[numf0] ; unsigned int(12) f0contoureachphonemetime[numf0] ; if (energycontourenable) { unsigned int(8) energycontoureachphoneme[3]; 9 MPEG-4 Audio Text-to-Speech Decoding Process The architecture of the M-TTS decoder is described below and only the interfaces relevant to the M-TTS decoder are the subjects of standardization. The number above each arrow indicates the section describing each interface. M-TTS Decoder 9.4 D E M U X Syntactic Speech 9.1 Decoder 9.2 Synthesizer 9.3 9.5 Phoneme-to-FAP Converter Face Decoder C o m p o s I t o r 10

ÃISO/IEC ISO/IEC CD 14496-3TTS Figure 1 MPEG-4 Audio TTS decoder architecture. In this architecture the following types of interfaces can be distinguished: Interface between DEMUX and the syntactic decoder Interface between the syntactic decoder and the speech synthesizer Interface from the speech synthesizer to the compositor Interface from the compositor to the speech synthesizer Interface between the speech synthesizer and the phoneme-to-fap converter 9.1 Interface between DEMUX and Syntactic Decoder Receiving a bitstream, DEMUX passes coded M-TTS bitstreams to the syntactic decoder. 9.2 Interface between Syntactic Decoder and Speech Synthesizer Receiving a coded M-TTS bitstream, the syntactic decoder passes some of the following bitstreams to the speech synthesizer. Input type of the M-TTS data: specifies synchronized operation with FA or MP Control commands stream: Control command sequence Input text: character string(s) for the text to be synthesized Auxiliary information: Prosodic parameters including phoneme symbols Lip shape patterns Information for trick mode operation The MSDL-S representations of the classes for this interface are defined in 8. 9.3 Interface from Speech Synthesizer to Compositor This interface is identical to the interface for digitized natural speech to the compositor. The class ttsaudiointerface defines the data structure for the interface of encoded speech from the speech synthesizer to the compositor. class ttsaudiointerface { DigitalSpeech *dgspeech; 11

ISO/IEC CD 14496-3TTS ÃISO/IEC 9.4 Interface from Compositor to Speech Synthesizer This interface is defined to allow the local control of the synthesized speech by users. This user interface can support trick mode of the synthesized speech in synchronization with MP and can change some prosodic properties of the synthesized speech by using the class ttscontrol defined as follows: class ttscontrol { void ttsplay(): void ttsforward(): void ttsbackward(): void ttsstopsyllable(): void ttsstopword(): void ttsstopphrase(): void ttschangespeechrate(unsigned int(4) speed): void ttschangepitchdynamicrange(unsigned int(4) level): void ttschangepitchheight(unsigned int(4) height): void ttschangegender(unsigned int(1) gender): void ttschangeage(unsigned int(3) age): The member function ttsplay allows a user to start speech synthesis in the forward direction while ttsforward and ttsbackward enable the user to change the starting play position in forward and backward direction, respectively. The ttsstopsyllable, ttsstopword, and ttsstopphrase functions define the interface for users to stop speech synthesis at the specified boundary such as syllable, word, and phrase. The member function ttschangespeechrate is an interface to change the synthesized speech rate. The argument speed can have the numbers from 1 to 16. The member function ttschangepitchdynamicrange is an interface to change the dynamic range of the pitch of synthesized speech. By using the argument of this function, level, a user can change the dynamic range from 1 to 16. Also a user can change the pitch height from 1 to 16 by using the argument height in the member function ttschangepitchheight. The member functions ttschangegender and ttschangeage allow a user to change the gender and the age of the synthetic speech producer by assigning numbers, as defined in 7.3, to their arguments, gender and age, respectively. 9.5 Interface between Speech Synthesizer and Phoneme-to-FAP Converter In the MPEG-4 framework, the speech synthesizer and the face decoder are driven synchronously. The speech synthesizer generates synthetic speech according to the phonemeduration. At the same time, TTS gives phonemesymbol and phonemeduration to Phoneme-to-FAP converter. The face decoder generates relevant facial animation according to the phonemesymbol and the phonemeduration. In this way, the synthesized speech and facial animation have relative synchronization except the absolute composition time. The synchronization of the absolute composition time comes from the same composition time stamp of TTS bitstream and facial animation bitstream. When the lipshapeenable is set, the lipshapeinsentence can be used to generate the phonemeduration. The class ttsfapinterface defines the data structure for the interface between the speech synthesizer and the phoneme-to-fap converter. 12

ÃISO/IEC ISO/IEC CD 14496-3TTS class ttsfapinterface { unsigned int(8) phonemesymbol; unsigned int(12) phonemeduration; unsigned int(8) f0average; unisgned int(1) stress; unsigned int(1) word_begin ; 13

ISO/IEC CD 14496-3TTS ÃISO/IEC Annex A (informative) Applications of MPEG-4 Audio Text-to-Speech Decoder A.1 General This annex part describes application scenarios for the M-TTS decoder. A.2 Application Scenario: MPEG-4 Story Teller on Demand (STOD) In the STOD application, users can select a story from a huge database of story libraries which are stored in hard disks or compact disks. The STOD system reads aloud the story via the M-TTS decoder with the MPEG-4 facial animation tool or with appropriately selected images. The user can stop and resume speaking at any moment he wants through user interfaces of the local machine (for example, mouse or keyboard).the user can also select the gender, age, and the speech rate of the electronic story teller. The synchronization between the M-TTS decoder with the MPEG-4 facial animation tool is realized by using the same composition time of the M-TTS decoder for the MPEG-4 facial animation tool. A.3 Application Scenario : MPEG-4 Audio Text-to-Speech with Moving Picture In this application, synchronized playback of the M-TTS decoder and encoded moving picture is the most important issue. The architecture of the M-TTS decoder can provide several granularities of synchronization. Aligning the composition time of each TTS_Sentence(), coarse granularity of synchronzation and trick mode functionality can be easily achieved. To get finer granularity of synchronization, the information about the lip shape events would be utilized. The finest granularity of syncronization can be achieved by using the prosody information and the videorelated information such as Sentence_Duration, Position_in_Sentence, and Offset. With this synchronization capability, the M-TTS decoder can be used for moving picture dubbing by utilizing the lip shape pattern information. 14