Interdepartmental Postgraduate Programme: Economics & Management of Telecommunication Networks (New network services and technologies)

INTRODUCTION TO VOICEXML FOR DISTRIBUTED WEB-BASED APPLICATIONS

Π.Κ. Κίκιρας (1), Α. Σταυρόπουλος (2)
(1) Postgraduate Programme "Management & Economics of Telecommunication Networks", e-mail: kikirasp@ieee.org
(2) Postgraduate Programme "Management & Economics of Telecommunication Networks", e-mail: astaur@odt.uoa.gr

ABSTRACT

The promise of the information revolution has been fulfilled through the extensive use of Internet and telecom resources, wireless or wired: one can access information anytime, anywhere, and with any device. Voice applications are a significant part of the pervasive computing vision. Since so many people worldwide have access to advanced telephone devices, companies can use voice applications to reach the huge base of customers who, because of time or location, do not have access to computer systems. This paper deals with VoiceXML, a widely adopted W3C specification for developing voice applications.

Introduction

Mobile access to the Internet seems to be on everyone's mind these days. The holy grail of anytime, anywhere access to the Internet has created a lot of interest and experimentation in thin-client, mobility-supporting technologies such as VoiceXML and WML (Wireless Markup Language). Although adoption of small-screen browser access technologies such as WML is lower than anticipated, the continuing quest for mobile Internet access solutions is generating even more interest in voice-enabled solutions, and more specifically in VoiceXML and voice portals. VoiceXML is an attempt to bring the advantages of Web-based development and content delivery to interactive voice response applications. By aiming to bring the world's estimated two billion fixed-line and mobile phones together with application developers, the advances in VoiceXML mark a significant milestone in the convergence of telecom technologies and the Web.

1 VOICEXML OVERVIEW
The history of voice markup languages is not very old. Their development started in 1995, when AT&T Bell Laboratories launched a project called PhoneWeb. The aim was to develop a markup language like HTML for defining voice markup for voice-based devices such as telephones and voice browsers. When AT&T Bell Labs split into two companies, both continued work on the development of a voice markup language, and both came up with their own versions of a Phone Markup Language (PML). Later, in 1998, Motorola came up with a new voice markup language called VoxML. This XML-based language enables the user to define voice markup for voice-based applications. Soon IBM also joined the race with the launch of SpeechML, the IBM version of a Voice Markup
Language. Other companies such as Vocalis, HP and Sun Microsystems developed their own versions of such a language. The VoiceXML Forum is an organization founded by Motorola, IBM, AT&T, and Lucent to promote voice-based development. The Forum introduced a new language, VoiceXML, based on the legacy of the languages already promoted by these four companies. In August 1999, the VoiceXML Forum released the first specification, VoiceXML version 0.9. In March 2000, version 1.0 was released; in October 2001, the first working draft of VoiceXML 2.0 was published. Starting in late 2001, the Forum's VoiceXML initiative was adopted by the W3C as an integral part of its Speech Interface Framework, which led, on March 16, 2004, to the publication of the VoiceXML 2.0 specification as a W3C Recommendation, establishing it as a Web standard for industry and the Web community.

1.1 VoiceXML and HTML

Though VoiceXML has adopted many concepts and designs from HTML, it differs in several ways. HTML was designed for visual Web pages and lacks the control over user-application interaction that a speech-based interface needs. The main difference between VoiceXML and HTML lies in the sequential structure of VoiceXML documents. An HTML document is a single unit residing on a web server, identified by a unique Uniform Resource Identifier (URI); whenever a client accesses it, it is downloaded as a whole to the client's browser. In contrast, a VoiceXML document contains a number of dialog units (menus or forms), delimited by markup tags and presented sequentially. This difference is due to the visual medium's ability to display a number of items in parallel, while the voice medium is inherently sequential.

1.2 Architecture - Key Concepts of VoiceXML

The architecture of VoiceXML is very similar to that of standard web applications.
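The sequential, dialog-based structure discussed above can be made concrete with a minimal sketch. The dialog ids, prompts, and the yes/no field below are our own, purely illustrative; the point is that the interpreter visits one dialog at a time and moves on only through an explicit transition:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- First dialog: entered automatically when the document loads -->
  <form id="welcome">
    <block>
      <prompt>Welcome.</prompt>
      <!-- Explicit transition to the next dialog in the same document -->
      <goto next="#ask"/>
    </block>
  </form>

  <!-- Second dialog: reached only through the transition above -->
  <form id="ask">
    <field name="answer" type="boolean">
      <prompt>Shall I continue, yes or no?</prompt>
      <filled>
        <prompt>Thank you. Goodbye.</prompt>
        <!-- No successor dialog is specified, so execution ends here -->
      </filled>
    </field>
  </form>
</vxml>
```

An HTML page would render both pieces of content at once; here the second dialog cannot even be reached except through the first one's <goto>.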
When a user requires a document from the server, he sends a request to the server using software called a browser. Upon receiving the request, the server processes the required document and sends the result to the browser as its response; the browser then presents this response to the user. In VoiceXML applications, just as in web applications, documents are hosted on a web server. In addition to the web server, a VoiceXML architecture houses another server, the voice server, which handles all interaction between the user and the web server. The voice server plays the role of the browser in a voice application, interpreting the spoken input received from the user and providing audio prompts as responses. In the case of voice applications, the end user needs no high-end computer or sophisticated browsing software; he can access the voice application through a simple telephony device connected to a wired or wireless telephone network. Figure 1 shows the architecture of a voice application.
Figure 1. High-level architecture of a VoiceXML-enabled network

The standard implementation architecture of a voice application includes the following components:
Web server: The server hosting the VoiceXML application in its network. It is important to note that VoiceXML documents can be delivered from any common web server.
VoiceXML gateway (interpreter): Hardware and software that bridge the PSTN and the Internet. The gateway consists of a VoiceXML browser and resources for ASR (automatic speech recognition), TTS (text-to-speech), and DTMF; these resources may be implemented in hardware and/or software.
PSTN/wireless telephone network: The Public Switched Telephone Network or a wireless telephone network, i.e. the telephone service most people have. It carries the speech and DTMF interactions, such as the prompts played by the VoiceXML gateway and the responses the caller speaks.
Client device: The device the caller uses to access the VoiceXML application.

2 VOICEXML KEY CONCEPTS

A VoiceXML document (or a set of documents called an application) forms a conversational finite state machine. The user is always in exactly one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which identify the next document and dialog to use. If a URI does not refer to a document, the current document is assumed; if it does not refer to a dialog, the first dialog in the document is assumed. Execution terminates when a dialog does not specify a successor, or when an element explicitly exits the conversation. The key concepts of VoiceXML are:
Session: A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.
Document: A VoiceXML document is primarily composed of top-level elements called dialogs. There are two types of dialogs: forms and menus. A document
may also have <meta> elements, <var> and <script> elements, <property> elements, <catch> elements, and <link> elements.
Dialog: A dialog is a top-level element. When interacting with a VoiceXML application, the user is always in exactly one dialog state. There are two types of dialogs: forms and menus.
Subdialog: A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction and returning to the original form. Local data, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query, to create a set of components shared among the documents of a single application, or to create a reusable library of dialogs shared among many applications.
Menu: A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
Grammar: Each dialog has one or more speech and/or DTMF grammars associated with it. A grammar specifies the permissible vocabulary from which the user may select in order to interact with the VoiceXML application.
Form: Forms are the key component of VoiceXML documents. A form contains:
a set of form items, elements that are visited in the main loop of the form interpretation algorithm; form items are subdivided into field items, which define the form's field item variables, and control items, which help control the gathering of the form's fields;
declarations of non-field item variables;
event handlers;
filled actions, blocks of procedural logic that execute when certain combinations of field items are filled in.
The form attributes are:
id: The name of the form.
scope: The default scope of the form's grammars. If it is dialog, then the form grammars are active only in the form. If the scope is document, then the form grammars are active during any dialog in the same document.
If the scope is document and the document is an application root document, then the form grammars are active during any dialog in any document of the application.
Application: An application is a collection of VoiceXML documents. All the documents in an application share the application root document.

2.1 A Sample VoiceXML Application

Because VoiceXML is an extension of XML, it follows the basic rules of XML. In the following sample we provide an informative voice application for ODT students, exhibiting the main features and capabilities of VoiceXML by implementing the following scenario: an ODT student accesses the department secretariat's voice application, which provides him with information concerning class schedules and general announcements. The application consists of two vxml files (the exact code can be found in Appendix A), and its architecture is described in Figure 2. The environment we use for developing and testing the application is the Motorola Wireless IDE with Mobile ADK 2.0 (http://www.motorola.com/msp/products/developer/index.html).
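The application root mechanism described above can be sketched as follows. The file name root.vxml, the "operator" vocabulary, and the dialog ids are our own illustrative assumptions, not part of the sample application; the sketch shows a <link> whose grammar, because it is declared in the root document, stays active while any document of the application executes:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- root.vxml (hypothetical): the application root document.
     The link's "operator" grammar is active in every document
     that names this file in its application attribute. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <link next="#operator">
    <grammar mode="voice" version="1.0" root="op" xml:lang="en-US">
      <rule id="op">operator</rule>
    </grammar>
  </link>
  <form id="operator">
    <block>
      <prompt>Transferring you to an operator.</prompt>
    </block>
  </form>
</vxml>
```

A leaf document joins the application by declaring <vxml version="2.0" application="root.vxml">; while that leaf executes, the caller can say "operator" at any point and the root's link fires, even though the leaf itself declares no such grammar.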
Figure 2. Application architecture (Main.vxml as the root document, transitioning to Schedule.vxml)

3 CONCLUSIONS

VoiceXML is an approach toward enhanced and more user-friendly man-machine interfaces. It bridges the gap between web and voice applications by enabling web content delivery to the latter. It also creates business opportunities for web content providers by giving them access to the market of the world's estimated two billion fixed-line and mobile phone owners. Further development of VoiceXML will focus on implementing suitable features to support natural dialogues between machines and humans.
References

Bruce Lucas, "VoiceXML for Web-based Distributed Conversational Applications", Communications of the ACM, Vol. 43, No. 9, September 2000.
Dave Raggett, "Getting Started with VoiceXML", W3C tutorial, 2001.
Vivek Malhotra, "Developing VoiceXML Applications", 2000.
Hansra Srivatsa, "Deep into VoiceXML", Parts 1 & 2, 2001.
Charul Shukla, Avnish Dass & Vikas Gupta, VoiceXML 2.0 Developer's Guide: Building Professional Voice-Enabled Applications with JSP, ASP, & ColdFusion, McGraw-Hill/Osborne, 2002.
Rick Beasley & Mike Farley, Voice Application Development with VoiceXML, Sams Publishing, August 2001.
Appendix A
1. Main.vxml Source Code

<?xml version="1.0"?>
<vxml version="1.0">
  <!-- the user hears the welcome message the first time -->
  <form id="intro">
    <block>
      <prompt>Welcome to ODT's Secretariat Voice Application</prompt>
      <goto next="#choice"/>
    </block>
  </form>

  <menu id="choice" dtmf="true">
    <prompt>
      Say schedule, announcements, or quit.
    </prompt>
    <!-- the next attribute takes you to the appropriate document
         or anchor within the current document -->
    <choice next="http://147.102.15.69/schedule.vxml">schedule</choice>
    <choice next="#announcements">announcements</choice>
    <choice next="#quit_app">quit</choice>
  </menu>

  <!-- give the announcements -->
  <form id="announcements">
    <block>
      <prompt>Next Monday's lesson will be postponed</prompt>
      <goto next="#choice"/>
    </block>
  </form>

  <!-- quit the application -->
  <form id="quit_app">
    <block>
      <prompt>Goodbye!</prompt>
    </block>
  </form>
</vxml>
2. Schedule.vxml Source Code

<?xml version="1.0"?>
<vxml version="1.0">
  <form id="chooseday">
    <field name="userselection">
      <grammar>
        <![CDATA[
          [
            [one monday dtmf-1] {<userselection "monday">}
            [two tuesday dtmf-2] {<userselection "tuesday">}
          ]
        ]]>
      </grammar>
      <!-- prompt the user what to do -->
      <prompt>
        Welcome to the weekly schedule of ODT's classes.
        You can say Monday or Tuesday, or you can press one for Monday
        and two for Tuesday. Please make your selection.
      </prompt>
      <!-- if the input does not match the active grammar,
           the user is prompted again -->
      <nomatch>
        What did you say?
        <reprompt/>
      </nomatch>
      <!-- executed if no input is provided by the user -->
      <noinput>
        Please input something!
        <reprompt/>
      </noinput>
      <filled>
        <if cond="userselection == 'monday'">
          <prompt> Monday's schedule is as follows </prompt>
          <goto next="http://147.102.15.69/main.vxml"/>
        </if>
        <if cond="userselection == 'tuesday'">
          <prompt> Tuesday's schedule is as follows </prompt>
          <goto next="http://147.102.15.69/main.vxml"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>