EMMA: Extensible MultiModal Annotation


1 EMMA: Extensible MultiModal Annotation
W3C Workshop on Rich Multimodal Application Development
Michael Johnston, AT&T Labs Research
2013 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Knowledge Ventures.

2 Multimodal Interactive Systems
Support user input and/or system output over multiple modes such as speech, pen, and gesture
- Graphical interfaces augmented with speech input
- Multimodal integration: "directions from here <touch> to there <touch>"
Enable more natural and effective interaction
- Human-human communication is multimodal
- Certain kinds of input/output are best suited to particular modes
- Multimodal error recovery

3 Authoring Multimodal Systems
Many research prototypes have shown the utility of multimodality for interactive systems
Authoring remains a complex and specialized task
- The graphical interface must work in concert with a variety of input and output processing components
- Ad hoc or proprietary protocols limit plug and play and complicate rapid prototyping

4 EMMA: Extensible MultiModal Annotation
The EMMA standard provides a common XML language for representing the interpretation of inputs to spoken and multimodal systems
- XML markup for capturing and annotating the various processing stages of user inputs
- Container elements
- Annotation elements and attributes
W3C Recommendation, February 2009
- Implementations: AT&T, Microsoft, Nuance, Loquendo, OpenStream
- AT&T Developer Program Speech APIs
Current: EMMA 1.1 Working Draft

5 Multimodal Architecture
Modality components and the interaction manager are core components of the W3C MMI Architecture
EMMA is used for input passed between modules
[Diagram: modality input components (interpretation, integration) send EMMA to the interaction manager, which draws on input history, data, context, session state, the dialog manager, and the application backend, and drives modality output components (output generation, presentation manager) through a modality component API]

6 EMMA Example: "flights from Boston to Denver"

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1"
      emma:start=" " emma:end=" "
      emma:source="smm:platform=iphone h11"
      emma:signal="smm:file=audio.amr" emma:signal-size="4902"
      emma:process="smm:type=asr&amp;version=asr_eng2.4"
      emma:media-type="audio/amr; rate=8000"
      emma:lang="en-us"
      emma:grammar-ref="gram1" emma:model-ref="model1">
    <emma:interpretation id="int1" emma:confidence="0.75"
        emma:tokens="flights from boston to denver">
      <flt><orig>boston</orig><dest>denver</dest></flt>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68"
        emma:tokens="flights from austin to denver">
      <flt><orig>austin</orig><dest>denver</dest></flt>
    </emma:interpretation>
  </emma:one-of>
  <emma:info><session>e50dae19-79b5-44ba-892d</session></emma:info>
  <emma:grammar id="gram1" ref="smm:grammar=flights"/>
  <emma:model id="model1" ref="smm:file=flights.xsd"/>
</emma:emma>

7 EMMA Example: "flights from Boston to Denver"
<emma:emma>: root element of all EMMA documents; carries the namespace, schema, and version

8 Container element tree
<emma:one-of>
<emma:group>
<emma:sequence>
Terminating in <emma:interpretation>

9 <emma:interpretation>
Contains the application-specific markup (the semantic representation is not standardized)
Carries annotations specific to that interpretation: emma:confidence, emma:tokens

10 Annotation Scope
Annotations on <emma:one-of> are assumed to apply to the contained interpretations
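This scoping rule can be implemented in a consumer by merging the <emma:one-of> annotations into each contained interpretation, letting local annotations win. A minimal sketch with Python's standard ElementTree, assuming the EMMA 1.0 namespace http://www.w3.org/2003/04/emma; the tiny inline document is illustrative, not from the slides:

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace

# Hypothetical fragment: annotations on <emma:one-of> scope over
# the two contained interpretations.
doc = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"/>
    <emma:interpretation id="int2" emma:confidence="0.68"/>
  </emma:one-of>
</emma:emma>
"""

def effective_annotations(one_of):
    """Propagate emma:* annotations from <emma:one-of> down to each
    contained interpretation; annotations set locally take precedence."""
    inherited = {k: v for k, v in one_of.attrib.items()
                 if k.startswith("{%s}" % EMMA_NS)}
    result = {}
    for interp in one_of.findall("{%s}interpretation" % EMMA_NS):
        merged = dict(inherited)
        merged.update(interp.attrib)  # local annotations win
        result[interp.get("id")] = merged
    return result

root = ET.fromstring(doc)
anns = effective_annotations(root.find("{%s}one-of" % EMMA_NS))
```

With this, both int1 and int2 carry emma:medium="acoustic" and emma:mode="voice" alongside their own confidence values.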

11 Annotations: Classification of input
emma:medium: acoustic, visual, tactile
emma:mode: voice, pen, gui
emma:function: dialog, verification, recording, transcription
emma:verbal: true/false

12 Annotations: Timestamps
emma:start, emma:end
Absolute times, in milliseconds
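Because emma:start and emma:end are absolute millisecond timestamps, a consumer can derive the input's duration by subtraction. A small sketch (the timestamp values are made up; the slides leave them blank):

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # EMMA 1.0 namespace

# Hypothetical interpretation with absolute millisecond timestamps.
doc = ('<emma:emma version="1.0" '
       'xmlns:emma="http://www.w3.org/2003/04/emma">'
       '<emma:interpretation id="int1" '
       'emma:start="1241035886246" emma:end="1241035889306"/>'
       '</emma:emma>')

root = ET.fromstring(doc)
interp = root.find(EMMA + "interpretation")
# Duration of the input signal in milliseconds.
duration_ms = int(interp.get(EMMA + "end")) - int(interp.get(EMMA + "start"))
```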

13 Annotations
emma:source: URI describing the input device
emma:signal: references the signal, e.g. an audio file
emma:signal-size: size in bytes
emma:process: URI describing the process that generated these interpretations

14 Annotations
emma:media-type: MIME type of the signal; used to indicate the codec (amr) and sampling rate (8000)
emma:lang: language of the user input (cf. xml:lang)

15 Annotations
emma:grammar-ref: references an <emma:grammar> element under <emma:emma>, a URI reference to the name or location of the grammar

16 Annotations
emma:model-ref: references an <emma:model> element under <emma:emma>, a URI reference to the name or location of the data model for the application semantic markup

17 Annotations
emma:confidence: confidence score assigned to the interpretation
emma:tokens: the token sequence

18 Annotations
<emma:info>: extensibility point for application- and vendor-specific annotations

19 Representing multiple stages of processing of input
<emma:derived-from>: provides a pointer to the EMMA interpretation this interpretation results from

<emma:emma>
  <emma:interpretation id="int2"
      emma:tokens="comedy movies directed by woody allen and starring diane keaton"
      emma:confidence="0.7"
      emma:process="smm:type=fusion&amp;version=mmfst1.0">
    <query>
      <genre>comedy</genre>
      <dir>woody_allen</dir>
      <cast>diane_keaton</cast>
    </query>
    <emma:derived-from resource="#int1"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:interpretation id="int1"
        emma:start=" " emma:end=" "
        emma:confidence="0.8" emma:lang="en-us"
        emma:process="smm:type=asr&amp;version=asr_eng2.4"
        emma:media-type="audio/amr; rate=8000">
      <emma:literal>comedy movies directed by woody allen and starring diane keaton</emma:literal>
    </emma:interpretation>
  </emma:derivation>
</emma:emma>

20 <emma:derivation>
Container element for holding previous processing stages inline

21 <emma:one-of>

<emma:emma>
  <emma:one-of id="one-of1"
      emma:lang="en-us" emma:start=" " emma:end=" "
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&amp;version=watson6">
    <emma:interpretation id="nbest1" emma:confidence="1.00"
        emma:tokens="jon smith">
      <fn>jon</fn><ln>smith</ln>
    </emma:interpretation>
    <emma:interpretation id="nbest2" emma:confidence="0.99"
        emma:tokens="john smith">
      <fn>john</fn><ln>smith</ln>
    </emma:interpretation>
    <emma:interpretation id="nbest3" emma:confidence="0.99"
        emma:tokens="joann smith">
      <fn>joann</fn><ln>smith</ln>
    </emma:interpretation>
    <emma:interpretation id="nbest4" emma:confidence="0.98"
        emma:tokens="joan smith">
      <fn>joan</fn><ln>smith</ln>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
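A consumer of such an n-best list typically selects the interpretation with the highest emma:confidence. A minimal sketch with ElementTree, assuming the EMMA 1.0 namespace (the inline document is a trimmed-down version of the n-best example):

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # EMMA 1.0 namespace

doc = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="one-of1">
    <emma:interpretation id="nbest1" emma:confidence="1.00"
        emma:tokens="jon smith"/>
    <emma:interpretation id="nbest2" emma:confidence="0.99"
        emma:tokens="john smith"/>
  </emma:one-of>
</emma:emma>
"""

root = ET.fromstring(doc)
# Pick the interpretation with the highest confidence score.
best = max(root.find(EMMA + "one-of").findall(EMMA + "interpretation"),
           key=lambda i: float(i.get(EMMA + "confidence", "0")))
```

Here `best` is the nbest1 interpretation, whose tokens are "jon smith".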

22 Multimodal Integration Example
Multimodal dynamic map
- Cohen et al. 1998 (QuickSet); Gustafson et al. 2000 (AdApt); Johnston et al. 2002 (MATCH); Gruenstein et al. 2009
Local search application
- Map locations / search for restaurants / initiate calls
- Results in list or map view
Multimodal commands: "chinese restaurants near here" <touch>
Integration, e.g. finite-state integration (Bangalore and Johnston 2009, Computational Linguistics), using <emma:group>, <emma:lattice>, <emma:derivation>

23 Speech input
Speech result in EMMA:

<emma:emma>
  <emma:interpretation id="speech1"
      emma:confidence="0.9" emma:verbal="true"
      emma:start=" " emma:end=" "
      emma:lang="en-us"
      emma:process="smm:type=asr&amp;version=asr_eng2.4"
      emma:media-type="audio/amr; rate=8000">
    <emma:literal>french restaurants near here</emma:literal>
  </emma:interpretation>
</emma:emma>

24 Touch input
Client collects touch events on the map and represents the gesture using <emma:lattice>:

<emma:interpretation id="touch1"
    emma:confidence="0.8"
    emma:medium="tactile" emma:mode="touch"
    emma:start=" " emma:end=" ">
  <emma:lattice initial="0" final="4">
    <emma:arc from="0" to="1">g</emma:arc>
    <emma:arc from="1" to="2">point</emma:arc>
    <emma:arc from="2" to="3">coords</emma:arc>
    <emma:arc from="3" to="4">SEM([ , ])</emma:arc>
  </emma:lattice>
</emma:interpretation>
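A fusion server can recover the gesture symbol sequence by walking the lattice from its initial to its final state. A sketch assuming a linear lattice (one outgoing arc per state, as in this example) and the EMMA 1.0 namespace:

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # EMMA 1.0 namespace

doc = """
<emma:interpretation xmlns:emma="http://www.w3.org/2003/04/emma"
    emma:medium="tactile" emma:mode="touch">
  <emma:lattice initial="0" final="4">
    <emma:arc from="0" to="1">g</emma:arc>
    <emma:arc from="1" to="2">point</emma:arc>
    <emma:arc from="2" to="3">coords</emma:arc>
    <emma:arc from="3" to="4">SEM([ , ])</emma:arc>
  </emma:lattice>
</emma:interpretation>
"""

def walk_lattice(lattice):
    """Follow arcs from the initial to the final state, collecting labels.
    Assumes a linear lattice: exactly one outgoing arc per state."""
    arcs = {a.get("from"): a for a in lattice.findall(EMMA + "arc")}
    state, labels = lattice.get("initial"), []
    while state != lattice.get("final"):
        arc = arcs[state]
        labels.append(arc.text)
        state = arc.get("to")
    return labels

root = ET.fromstring(doc)
labels = walk_lattice(root.find(EMMA + "lattice"))
```

A general lattice would need a shortest/best-path search instead of this single pass, but the single pass is enough for deterministic gesture streams like this one.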

25 Package of multimodal input
Based on temporal constraints, the client groups the speech and gesture inputs using the <emma:group> container element and posts the multimodal input to a multimodal fusion server; <emma:group-info> indicates the nature of the grouping.

<emma:emma>
  <emma:group
      emma:medium="acoustic,tactile" emma:mode="voice,touch"
      emma:function="dialog">
    <emma:interpretation id="speech1"
        emma:confidence="0.9" emma:verbal="true"
        emma:start=" " emma:end=" "
        emma:medium="acoustic" emma:mode="voice"
        emma:lang="en-us"
        emma:process="smm:type=asr&amp;version=asr_eng2.4"
        emma:media-type="audio/amr; rate=8000">
      <emma:literal>french restaurants near here</emma:literal>
    </emma:interpretation>
    <emma:interpretation id="touch1"
        emma:confidence="0.8"
        emma:medium="tactile" emma:mode="touch"
        emma:start=" " emma:end=" ">
      <emma:lattice initial="0" final="4">
        <emma:arc from="0" to="1">g</emma:arc>
        <emma:arc from="1" to="2">point</emma:arc>
        <emma:arc from="2" to="3">coords</emma:arc>
        <emma:arc from="3" to="4">SEM([ , ])</emma:arc>
      </emma:lattice>
    </emma:interpretation>
    <emma:group-info>temporal</emma:group-info>
  </emma:group>
</emma:emma>

26 Multimodal EMMA Result
Multimodal interpretation results from finite-state multimodal understanding
- emma:medium and emma:mode have multiple values
- The timestamp is the union of the speech and gesture timestamps
- One <emma:derived-from> element for each mode
- The combining inputs can be contained inline in <emma:derivation>

<emma:emma>
  <emma:interpretation
      emma:medium="acoustic,tactile" emma:mode="voice,touch"
      emma:function="dialog"
      emma:process="smm:type=fusion&amp;version=watson6"
      emma:start=" " emma:end=" ">
    <query>
      <cuisine>french</cuisine>
      <location>[ , ]</location>
    </query>
    <emma:derived-from resource="#speech1"/>
    <emma:derived-from resource="#touch1"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:interpretation id="speech1"
        emma:confidence="0.9"
        emma:start=" " emma:end=" "
        emma:verbal="true" emma:lang="en-us"
        emma:process="smm:type=asr&amp;version=asr_eng2.4"
        emma:media-type="audio/amr; rate=8000">
      <emma:literal>french restaurants near here</emma:literal>
    </emma:interpretation>
    <emma:interpretation id="touch1"
        emma:confidence="0.8"
        emma:medium="tactile" emma:mode="touch"
        emma:start=" " emma:end=" ">
      <emma:lattice initial="0" final="4">
        <emma:arc from="0" to="1">g</emma:arc>
        <emma:arc from="1" to="2">point</emma:arc>
        <emma:arc from="2" to="3">coords</emma:arc>
        <emma:arc from="3" to="4">SEM([ , ])</emma:arc>
      </emma:lattice>
    </emma:interpretation>
  </emma:derivation>
</emma:emma>
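A dialog manager receiving such a result can trace each fused interpretation back to its per-mode inputs by resolving the <emma:derived-from> pointers against the interpretations held inline in <emma:derivation>. A sketch assuming the EMMA 1.0 namespace (the stripped-down inline document keeps only ids and pointers):

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # EMMA 1.0 namespace

doc = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="mm1">
    <emma:derived-from resource="#speech1"/>
    <emma:derived-from resource="#touch1"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:interpretation id="speech1"/>
    <emma:interpretation id="touch1"/>
  </emma:derivation>
</emma:emma>
"""

root = ET.fromstring(doc)
# Index every interpretation in the document by its id, including
# the earlier-stage ones nested inside <emma:derivation>.
by_id = {e.get("id"): e for e in root.iter(EMMA + "interpretation")}

def inputs_of(interp):
    """Resolve <emma:derived-from> fragment references (#id) to the
    interpretations they point at."""
    return [by_id[d.get("resource").lstrip("#")]
            for d in interp.findall(EMMA + "derived-from")]

sources = inputs_of(by_id["mm1"])
```

With one <emma:derived-from> per mode, `sources` yields the speech and touch interpretations that were fused.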

27 Conclusion
The W3C EMMA standard (1.0, with 1.1 in draft) provides an XML representation language for containing and annotating inputs to multimodal systems
Some key features:
- Standard representations for common metadata/annotations on inputs to interactive systems
- Representation of uncertainty: <emma:one-of>, emma:confidence, <emma:lattice>
- <emma:group> for packages of multimodal input for fusion
- <emma:derivation> and <emma:derived-from> for representation of multiple processing stages
Tomorrow: EMMA 1.1 (emma:annotation, emma:ref, ...)
Future EMMA use cases (emma:presentation)

28 EMMA Resources
Specifications
- The EMMA specification
- Use Cases for EMMA
Other information
- Building Multimodal Applications with EMMA, Michael Johnston, Proceedings of ICMI-MLMI 2009, Boston, MA, 2009
- Improving Dialogs with EMMA, Deborah Dahl, SpeechTEK 2009
- Extensible Multimodal Annotation for Intelligent Virtual Agents, Deborah Dahl, 10th International Conference on Intelligent Virtual Agents, September 20-22, 2010
- Introducing EMMA, Speech Technology Magazine, March

29 Acknowledgements
Debbie Dahl, Paolo Baggia, Dan Burnett, Ingmar Kliche, Kazuyuki Ashimura, Michael Bodell, Dave Raggett, Roberto Pieraccini, Max Froumentin, Massimo Romanelli, Gerry McCobb, Andrew Wahbe, Patricio Bergallo, Jerry Carter, Wu Chou, Yuan Shao, Jin Liu, Katriina Halonen, T. V. Raman, Stephen Potter


More information

Interacting with the Ambience: Multimodal Interaction and Ambient Intelligence

Interacting with the Ambience: Multimodal Interaction and Ambient Intelligence Interacting with the Ambience: Multimodal Interaction and Ambient Intelligence W3C Workshop on Multimodal Interaction 19/20 July 2004, Sophia Antipolis, France Kai Richter (ZGDV) & Michael Hellenschmidt

More information

Form. Settings, page 2 Element Data, page 7 Exit States, page 8 Audio Groups, page 9 Folder and Class Information, page 9 Events, page 10

Form. Settings, page 2 Element Data, page 7 Exit States, page 8 Audio Groups, page 9 Folder and Class Information, page 9 Events, page 10 The voice element is used to capture any input from the caller, based on application designer-specified grammars. The valid caller inputs can be specified either directly in the voice element settings

More information

Human Robot Interaction

Human Robot Interaction Human Robot Interaction Emanuele Bastianelli, Daniele Nardi bastianelli@dis.uniroma1.it Department of Computer, Control, and Management Engineering Sapienza University of Rome, Italy Introduction Robots

More information

Disconnecting the application from the interaction model

Disconnecting the application from the interaction model Disconnecting the application from the interaction model Ing-Marie Jonsson, Neil Scott, Judy Jackson Project Archimedes, CSLI Stanford University {ingmarie,ngscott,jackson}@csli.stanford.edu Abstract:

More information

Position Statement for Multi-Modal Access

Position Statement for Multi-Modal Access Information and Communication Mobile Position Statement for Multi-Modal Access 26.11.2001 Authors: Nathalie Amann, SRIT (E-Mail: Nathalie.Amann@SRIT.siemens.fr) Laurent Hue, SRIT (E-Mail: Laurent.Hue@SRIT.siemens.fr)

More information

A Scripting Language for Multimodal Presentation on Mobile Phones

A Scripting Language for Multimodal Presentation on Mobile Phones A Scripting Language for Multimodal Presentation on Mobile Phones Santi Saeyor 1, Suman Mukherjee 2, Koki Uchiyama 2, Ishizuka Mitsuru 1 1 Dept. of Information and Communication Engineering, University

More information

Special Lecture (406) Spoken Language Dialog Systems Introduction to VoiceXML

Special Lecture (406) Spoken Language Dialog Systems Introduction to VoiceXML Special Lecture (406) Spoken Language Dialog Systems Introduction to VoiceXML Rolf Schwitter schwitt@ics.mq.edu.au Macquarie University 2004 1 Today s Program Developing speech interfaces Brief history

More information

A NOVEL MECHANISM FOR MEDIA RESOURCE CONTROL IN SIP MOBILE NETWORKS

A NOVEL MECHANISM FOR MEDIA RESOURCE CONTROL IN SIP MOBILE NETWORKS A NOVEL MECHANISM FOR MEDIA RESOURCE CONTROL IN SIP MOBILE NETWORKS Noël CRESPI, Youssef CHADLI, Institut National des Telecommunications 9, rue Charles Fourier 91011 EVRY Cedex FRANCE Authors: N.Crespi,

More information

Smart Documents Timely Information Access For All T. V. Raman Senior Computer Scientist Advanced Technology Group Adobe Systems

Smart Documents Timely Information Access For All T. V. Raman Senior Computer Scientist Advanced Technology Group Adobe Systems Smart Documents Timely Information Access For All T. V. Raman Senior Computer Scientist Advanced Technology Group Adobe Systems 1 Outline The Information REvolution. Information is not just for viewing!

More information

WFSTDM Builder Network-based Spoken Dialogue System Builder for Easy Prototyping

WFSTDM Builder Network-based Spoken Dialogue System Builder for Easy Prototyping WFSTDM Builder Network-based Spoken Dialogue System Builder for Easy Prototyping Etsuo Mizukami and Chiori Hori Abstract This paper introduces a network-based spoken dialog system development tool kit:

More information

Chapter 9 Conceptual and Practical Framework for the Integration of Multimodal Interaction in 3D Worlds

Chapter 9 Conceptual and Practical Framework for the Integration of Multimodal Interaction in 3D Worlds 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 Chapter 9 Conceptual and Practical Framework for the Integration of

More information

MPEG-21 SESSION MOBILITY FOR HETEROGENEOUS DEVICES

MPEG-21 SESSION MOBILITY FOR HETEROGENEOUS DEVICES MPEG-21 SESSION MOBILITY FOR HETEROGENEOUS DEVICES Frederik De Keukelaere Davy De Schrijver Saar De Zutter Rik Van de Walle Multimedia Lab Department of Electronics and Information Systems Ghent University

More information

Speech Tuner. and Chief Scientist at EIG

Speech Tuner. and Chief Scientist at EIG Speech Tuner LumenVox's Speech Tuner is a complete maintenance tool for end-users, valueadded resellers, and platform providers. It s designed to perform tuning and transcription, as well as parameter,

More information

SOAP Specification. 3 major parts. SOAP envelope specification. Data encoding rules. RPC conventions

SOAP Specification. 3 major parts. SOAP envelope specification. Data encoding rules. RPC conventions SOAP, UDDI and WSDL SOAP SOAP Specification 3 major parts SOAP envelope specification Defines rules for encapsulating data Method name to invoke Method parameters Return values How to encode error messages

More information

INTRODUCTION TO VOICEXML FOR DISTRIBUTED WEB-BASED APPLICATIONS

INTRODUCTION TO VOICEXML FOR DISTRIBUTED WEB-BASED APPLICATIONS ιατµηµατικό Μεταπτυχιακό Πρόγραµµα Σπουδών : Οικονοµική & ιοίκηση Τηλεπικοινωνιακών ικτύων (Νέες υπηρεσίες και τεχνολογίες δικτύων) INTRODUCTION TO VOICEXML FOR DISTRIBUTED WEB-BASED APPLICATIONS Π.Κ Κίκιραs

More information

Streaming-Archival InkML Conversion

Streaming-Archival InkML Conversion Streaming-Archival InkML Conversion Birendra Keshari and Stephen M. Watt Dept. of Computer Science University of Western Ontario London, Ontario, Canada N6A 5B7 {bkeshari,watt}@csd.uwo.ca Abstract Ink

More information

Interaction Design and Implementation for Multimodal Mobile Semantic Web Interfaces

Interaction Design and Implementation for Multimodal Mobile Semantic Web Interfaces HCI International, Beijing, China, 27th July 2007 Interaction Design and Implementation for Multimodal Mobile Semantic Web Interfaces Daniel Sonntag German Research Center for Artificial Intelligence 66123

More information

Cache Operation. Version 31-Jul Wireless Application Protocol WAP-175-CacheOp a

Cache Operation. Version 31-Jul Wireless Application Protocol WAP-175-CacheOp a Cache Operation Version 31-Jul-2001 Wireless Application Protocol WAP-175-CacheOp-20010731-a A list of errata and updates to this document is available from the WAP Forum Web site, http://www.wapforum.org/,

More information

Menu Support for 2_Option_Menu Through 10_Option_Menu

Menu Support for 2_Option_Menu Through 10_Option_Menu Menu Support for 2_Option_Menu Through 10_Option_Menu These voice elements define menus that support from 2 to 10 options. The Menu voice elements are similar to the Form voice element, however the number

More information

Automatic Enhancement of Correspondence Detection in an Object Tracking System

Automatic Enhancement of Correspondence Detection in an Object Tracking System Automatic Enhancement of Correspondence Detection in an Object Tracking System Denis Schulze 1, Sven Wachsmuth 1 and Katharina J. Rohlfing 2 1- University of Bielefeld - Applied Informatics Universitätsstr.

More information

Multi-modal Web IBM Position

Multi-modal Web IBM Position Human Language Technologies Multi-modal Web IBM Position W3C / WAP Workshop Mobile Speech Solutions & Conversational AdTech Stéphane H. Maes smaes@us.ibm.com TV Raman 1 Definitions by example: evolution

More information

MRCP. Google SR Plugin. Usage Guide. Powered by Universal Speech Solutions LLC

MRCP. Google SR Plugin. Usage Guide. Powered by Universal Speech Solutions LLC Powered by Universal Speech Solutions LLC MRCP Google SR Plugin Usage Guide Revision: 6 Created: May 17, 2017 Last updated: January 22, 2018 Author: Arsen Chaloyan Universal Speech Solutions LLC Overview

More information

A Convedia White Paper. Controlling Media Servers with SIP

A Convedia White Paper. Controlling Media Servers with SIP Version 1.2 June, 2004 Contents: Introduction page 3 Media Server Overview page 3 Dimensions of Interaction page 5 Types of Interaction page 6 SIP Standards for Media Server Control page 7 Introduction

More information

Holly5 VoiceXML Developer Guide Holly Voice Platform 5.1. Document number: hvp-vxml-0009 Version: 1-0 Issue date: December

Holly5 VoiceXML Developer Guide Holly Voice Platform 5.1. Document number: hvp-vxml-0009 Version: 1-0 Issue date: December Holly5 VoiceXML Developer Guide Holly Voice Platform 5.1 Document number: hvp-vxml-0009 Version: 1-0 Issue date: December 22 2009 Copyright Copyright 2013 West Corporation. These documents are confidential

More information

[MS-TTML]: Internet Explorer Timed Text Markup Language (TTML) 1.0 Standards Support Documentation

[MS-TTML]: Internet Explorer Timed Text Markup Language (TTML) 1.0 Standards Support Documentation [MS-TTML]: Internet Explorer Timed Text Markup Language (TTML) 1.0 Standards Support Documentation Intellectual Property Rights Notice for Open Specifications Documentation Technical Documentation. Microsoft

More information

Application Notes for Yandex Speechkit Speech Recognition 1.6 with Avaya Aura Experience Portal Issue 1.0

Application Notes for Yandex Speechkit Speech Recognition 1.6 with Avaya Aura Experience Portal Issue 1.0 Avaya Solution & Interoperability Test Lab Application Notes for Yandex Speechkit Speech Recognition 1.6 with Avaya Aura Experience Portal 7.0.1 - Issue 1.0 Abstract These application notes describe the

More information

ETSI TS V1.1.1 ( ) Technical Specification

ETSI TS V1.1.1 ( ) Technical Specification TS 102 632 V1.1.1 (2008-11) Technical Specification Digital Audio Broadcasting (DAB); Voice Applications European Broadcasting Union Union Européenne de Radio-Télévision EBU UER 2 TS 102 632 V1.1.1 (2008-11)

More information

A Technical Overview: Voiyager Dynamic Application Discovery

A Technical Overview: Voiyager Dynamic Application Discovery A Technical Overview: Voiyager Dynamic Application Discovery A brief look at the Voiyager architecture and how it provides the most comprehensive VoiceXML application testing and validation method available.

More information

SmartKom: Towards Multimodal Dialogues with Anthropomorphic Interface Agents

SmartKom: Towards Multimodal Dialogues with Anthropomorphic Interface Agents SmartKom: Towards Multimodal Dialogues with Anthropomorphic Interface Agents Wolfgang Wahlster Norbert Reithinger Anselm Blocher DFKI GmbH, D-66123 Saarbrücken, Germany {wahlster,reithinger,blocher}@dfki.de

More information

Key differentiating technologies for mobile search

Key differentiating technologies for mobile search Key differentiating technologies for mobile search Orange Labs Michel PLU, ORANGE Labs - Research & Development Exploring the Future of Mobile Search Workshop, GHENT Some key differentiating technologies

More information

3GPP TS V ( )

3GPP TS V ( ) TS 23.333 V12.5.0 (2015-03) Technical Specification 3rd Generation Partnership Project; Technical Specification Group Core Network and Terminals; Multimedia Resource Function Controller (MRFC) - Multimedia

More information

SALT. Speech Application Language Tags 0.9 Specification (Draft)

SALT. Speech Application Language Tags 0.9 Specification (Draft) SALT Speech Application Language Tags 0.9 Specification (Draft) Document SALT.0.9.pdf 19 Feb 2002 Cisco Systems Inc., Comverse Inc., Intel Corporation, Microsoft Corporation, Philips Electronics N.V.,

More information

PRACTICAL SPEECH USER INTERFACE DESIGN

PRACTICAL SPEECH USER INTERFACE DESIGN ; ; : : : : ; : ; PRACTICAL SPEECH USER INTERFACE DESIGN й fail James R. Lewis. CRC Press Taylor &. Francis Group Boca Raton London New York CRC Press is an imprint of the Taylor & Francis Group, an informa

More information

A MULTILINGUAL DIALOGUE SYSTEM FOR ACCESSING THE WEB

A MULTILINGUAL DIALOGUE SYSTEM FOR ACCESSING THE WEB A MULTILINGUAL DIALOGUE SYSTEM FOR ACCESSING THE WEB Marta Gatius, Meritxell González and Elisabet Comelles Technical University of Catalonia, Software Department, Campus Nord UPC, Jordi Girona, 1-3 08034

More information

VClarity Voice Platform

VClarity Voice Platform VClarity Voice Platform VClarity L.L.C. Voice Platform Snap-in Functional Overview White Paper Technical Pre-release Version 2.0 for VClarity Voice Platform Updated February 12, 2007 Table of Contents

More information

[MS-XMLSS]: Microsoft XML Schema (Part 1: Structures) Standards Support Document

[MS-XMLSS]: Microsoft XML Schema (Part 1: Structures) Standards Support Document [MS-XMLSS]: Microsoft XML Schema (Part 1: Structures) Standards Support Document Intellectual Property Rights Notice for Open Specifications Documentation Technical Documentation. Microsoft publishes Open

More information

Say-it: Design of a Multimodal Game Interface for Children Based on CMU Sphinx 4 Framework

Say-it: Design of a Multimodal Game Interface for Children Based on CMU Sphinx 4 Framework Grand Valley State University ScholarWorks@GVSU Technical Library School of Computing and Information Systems 2014 Say-it: Design of a Multimodal Game Interface for Children Based on CMU Sphinx 4 Framework

More information

Network Working Group Request for Comments: 4424 February 2006 Updates: 4348 Category: Standards Track

Network Working Group Request for Comments: 4424 February 2006 Updates: 4348 Category: Standards Track Network Working Group S. Ahmadi Request for Comments: 4424 February 2006 Updates: 4348 Category: Standards Track Real-Time Transport Protocol (RTP) Payload Format for the Variable-Rate Multimode Wideband

More information

Linked Data in Translation-Kits

Linked Data in Translation-Kits Linked Data in Translation-Kits FEISGILTT Dublin June 2014 Yves Savourel ENLASO Corporation Slides and code are available at: https://copy.com/ngbco13l9yls This presentation was made possible by The main

More information

XML Information Set. Working Draft of May 17, 1999

XML Information Set. Working Draft of May 17, 1999 XML Information Set Working Draft of May 17, 1999 This version: http://www.w3.org/tr/1999/wd-xml-infoset-19990517 Latest version: http://www.w3.org/tr/xml-infoset Editors: John Cowan David Megginson Copyright

More information

ARI Technical Working Group Report Meetings held Nov 2014 Report 30 January 2015

ARI Technical Working Group Report Meetings held Nov 2014 Report 30 January 2015 Introduction ARI Technical Working Group Report Meetings held 12-13 Nov 2014 Report 30 January 2015 The Accessible Rendered Item (ARI) format is a new concept for describing computer-administered assessment

More information

The Atom Project. Tim Bray, Sun Microsystems Paul Hoffman, IMC

The Atom Project. Tim Bray, Sun Microsystems Paul Hoffman, IMC The Atom Project Tim Bray, Sun Microsystems Paul Hoffman, IMC Recent Numbers On June 23, 2004 (according to Technorati.com): There were 2.8 million feeds tracked 14,000 new blogs were created 270,000 new

More information

Confidence Measures: how much we can trust our speech recognizers

Confidence Measures: how much we can trust our speech recognizers Confidence Measures: how much we can trust our speech recognizers Prof. Hui Jiang Department of Computer Science York University, Toronto, Ontario, Canada Email: hj@cs.yorku.ca Outline Speech recognition

More information

Task Completion Platform: A self-serve multi-domain goal oriented dialogue platform

Task Completion Platform: A self-serve multi-domain goal oriented dialogue platform Task Completion Platform: A self-serve multi-domain goal oriented dialogue platform P. A. Crook, A. Marin, V. Agarwal, K. Aggarwal, T. Anastasakos, R. Bikkula, D. Boies, A. Celikyilmaz, S. Chandramohan,

More information

The Dublin Core Metadata Element Set

The Dublin Core Metadata Element Set ISSN: 1041-5635 The Dublin Core Metadata Element Set Abstract: Defines fifteen metadata elements for resource description in a crossdisciplinary information environment. A proposed American National Standard

More information

Speech Applications. How do they work?

Speech Applications. How do they work? Speech Applications How do they work? What is a VUI? What the user interacts with when using a speech application VUI Elements Prompts or System Messages Prerecorded or Synthesized Grammars Define the

More information

Web & Automotive. Paris, April Dave Raggett

Web & Automotive. Paris, April Dave Raggett Web & Automotive Paris, April 2012 Dave Raggett 1 Aims To discuss potential for Web Apps in cars Identify what kinds of Web standards are needed Discuss plans for W3C Web & Automotive Workshop

More information

Network Working Group Internet-Draft October 27, 2007 Intended status: Experimental Expires: April 29, 2008

Network Working Group Internet-Draft October 27, 2007 Intended status: Experimental Expires: April 29, 2008 Network Working Group J. Snell Internet-Draft October 27, 2007 Intended status: Experimental Expires: April 29, 2008 Status of this Memo Atom Publishing Protocol Feature Discovery draft-snell-atompub-feature-12.txt

More information

An exchange format for multimodal annotations

An exchange format for multimodal annotations An exchange format for multimodal annotations Thomas Schmidt, Susan Duncan, Oliver Ehmer, Jeffrey Hoyt, Michael Kipp, Dan Loehr, Magnus Magnusson, Travis Rose, Han Sloetjes Background International Society

More information

2004 NASCIO Recognition Awards. Nomination Form

2004 NASCIO Recognition Awards. Nomination Form 2004 NASCIO Recognition Awards Nomination Form Title of Nomination: Access Delaware Project Project/System Manager: Mark J. Headd Job Title: Assistant to the Chief Information Officer Agency: Office of

More information

SCXML. Michael Bodell.

SCXML. Michael Bodell. SCXML Michael Bodell bodell@tellme.com Prologue (VXML 2.0/2.1) VoiceXML 2.0/2.1 is a standard out of the Voice Browser Working Group of the W3C VXML is to networked phone browsers as HTML is to internet

More information

Emacspeak Direct Speech Access. T. V. Raman Senior Computer Scientist Adobe Systems. c1996 Adobe Systems Incorporated.All rights reserved.

Emacspeak Direct Speech Access. T. V. Raman Senior Computer Scientist Adobe Systems. c1996 Adobe Systems Incorporated.All rights reserved. Emacspeak Direct Speech Access T. V. Raman Senior Computer Scientist Adobe Systems 1 Outline Overview of speech applications. Emacspeak Architecture. Emacspeak The user experience. 2 Screen Access User

More information

Realisation of SOA using Web Services. Adomas Svirskas Vilnius University December 2005

Realisation of SOA using Web Services. Adomas Svirskas Vilnius University December 2005 Realisation of SOA using Web Services Adomas Svirskas Vilnius University December 2005 Agenda SOA Realisation Web Services Web Services Core Technologies SOA and Web Services [1] SOA is a way of organising

More information

Lesson 11. Media Retrieval. Information Retrieval. Image Retrieval. Video Retrieval. Audio Retrieval

Lesson 11. Media Retrieval. Information Retrieval. Image Retrieval. Video Retrieval. Audio Retrieval Lesson 11 Media Retrieval Information Retrieval Image Retrieval Video Retrieval Audio Retrieval Information Retrieval Retrieval = Query + Search Informational Retrieval: Get required information from database/web

More information

Voice Extensible Markup Language (VoiceXML)

Voice Extensible Markup Language (VoiceXML) Voice Extensible Markup Language (VoiceXML) Version 2.0 W3C Working Draft 24 April 2002 This Version: http://www.w3.org/tr/2002/wd-voicexml20-20020424/ Latest Version: http://www.w3.org/tr/voicexml20 Previous

More information

Genesys App Automation Platform Deployment Guide. Hardware and Software Specifications

Genesys App Automation Platform Deployment Guide. Hardware and Software Specifications Genesys App Automation Platform Deployment Guide Hardware and Software Specifications 6/28/2018 Contents 1 Hardware and Software Specifications 1.1 Hardware 1.2 Software 1.3 IVR technologies and platforms

More information

Contents. G52IWS: The Semantic Web. The Semantic Web. Semantic web elements. Semantic Web technologies. Semantic Web Services

Contents. G52IWS: The Semantic Web. The Semantic Web. Semantic web elements. Semantic Web technologies. Semantic Web Services Contents G52IWS: The Semantic Web Chris Greenhalgh 2007-11-10 Introduction to the Semantic Web Semantic Web technologies Overview RDF OWL Semantic Web Services Concluding comments 1 See Developing Semantic

More information

On the Way to the Semantic Web

On the Way to the Semantic Web On the Way to the Semantic Web Presented on 1 Fórum W3C Brasil, by Klaus Birkenbihl, Coordinator World Offices, W3C based on a slide set mostly created by Ivan Herman, Semantic Web Activity Lead, W3C Sept.

More information

Institutional Repository - Research Portal Dépôt Institutionnel - Portail de la Recherche

Institutional Repository - Research Portal Dépôt Institutionnel - Portail de la Recherche Institutional Repository - Research Portal Dépôt Institutionnel - Portail de la Recherche researchportal.unamur.be RESEARCH OUTPUTS / RÉSULTATS DE RECHERCHE Prototyping Multimodal Interfaces with the SMUIML

More information

Integrate Speech Technology for Hands-free Operation

Integrate Speech Technology for Hands-free Operation Integrate Speech Technology for Hands-free Operation Copyright 2011 Chant Inc. All rights reserved. Chant, SpeechKit, Getting the World Talking with Technology, talking man, and headset are trademarks

More information

Data Synchronization in Mobile Computing Systems Lesson 12 Synchronized Multimedia Markup Language (SMIL)

Data Synchronization in Mobile Computing Systems Lesson 12 Synchronized Multimedia Markup Language (SMIL) Data Synchronization in Mobile Computing Systems Lesson 12 Synchronized Multimedia Markup Language (SMIL) Oxford University Press 2007. All rights reserved. 1 Language required to specify the multimodal

More information

Abstract. Avaya Solution & Interoperability Test Lab

Abstract. Avaya Solution & Interoperability Test Lab Avaya Solution & Interoperability Test Lab Application Notes for Jabra Link 33 EHS Adapter, Jabra Engage 65 and Jabra Engage 75 Convertible USB/DECT headsets with Avaya 96x1 Series IP Deskphone (H.323

More information

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia content description interface Part 5: Multimedia description schemes

ISO/IEC INTERNATIONAL STANDARD. Information technology Multimedia content description interface Part 5: Multimedia description schemes INTERNATIONAL STANDARD ISO/IEC 15938-5 First edition 2003-05-15 Information technology Multimedia content description interface Part 5: Multimedia description schemes Technologies de l'information Interface

More information

Temporal Aspects of CARE-based Multimodal Fusion: From a Fusion Mechanism to Composition Components and WoZ Components

Temporal Aspects of CARE-based Multimodal Fusion: From a Fusion Mechanism to Composition Components and WoZ Components Temporal Aspects of CARE-based Multimodal Fusion: From a Fusion Mechanism to Composition Components and WoZ Components Marcos Serrano and Laurence Nigay University of Grenoble, CNRS, LIG B.P. 53, 38041,

More information

Application Notes for Configuring Nuance Speech Attendant with Avaya Aura Session Manager R6.3 and Avaya Communication Server 1000 R7.6 Issue 1.

Application Notes for Configuring Nuance Speech Attendant with Avaya Aura Session Manager R6.3 and Avaya Communication Server 1000 R7.6 Issue 1. Avaya Solution & Interoperability Test Lab Application Notes for Configuring Nuance Speech Attendant with Avaya Aura Session Manager R6.3 and Avaya Communication Server 1000 R7.6 Issue 1.0 Abstract These

More information