Speech Applications: How Do They Work?
What is a VUI?
  What the user interacts with when using a speech application

VUI Elements
  - Prompts or System Messages: prerecorded or synthesized
  - Grammars: define the possible user responses to the prompts
  - Dialog Logic: defines the system's actions in response to the user's input

Technologies Involved
  - Speech Recognition: recognition of the user's input, based on a specific grammar
  - Speech Synthesis: on-demand rendering of system prompts
  - Dialog Control: implementation of the dialog logic

Handling Speech Input
  [Diagram: processing stages for one dialog turn]
  User: "I want to go to Dallas"
  Speech -> Endpointing -> Feature Extraction -> Recognition -> Natural Language Understanding -> Dialog Management -> Speech Synthesis
  System: "OK, what time do you want to leave?"

Endpointing
  Detection of the beginning and end of speech
  - Find where the utterance starts
  - Wait for a sufficiently long pause to indicate the end
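
The two endpointing steps above can be sketched with a simple energy threshold. This is a minimal illustration, not the method of any particular recognizer: the function name, the threshold value, and the 30-frame (300 ms at 10 ms per frame) pause length are all assumptions chosen for the example.

```python
def find_endpoints(energies, threshold=0.1, min_pause_frames=30):
    """Return (start, end) frame indices of the utterance, or None if no speech.

    energies: one energy value per frame (assumed 10 ms frames).
    threshold: energy above which a frame counts as speech (assumed value).
    min_pause_frames: silent frames required to declare the end (assumed value).
    The returned end index is exclusive.
    """
    start = None
    last_speech = None
    for i, energy in enumerate(energies):
        if energy > threshold:
            if start is None:
                start = i          # first speech frame: the utterance starts here
            last_speech = i
        elif start is not None and i - last_speech >= min_pause_frames:
            break                  # pause is long enough: the utterance has ended
    if start is None:
        return None
    return start, last_speech + 1
```

A short pause inside the utterance (shorter than min_pause_frames) does not end it; only a sufficiently long one does.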

Feature Extraction
  Input: endpointed utterance
  Output: sequence of feature vectors
  - Transform the utterance into a sequence of feature vectors
  - Feature vectors represent measurable characteristics of speech
  - Most typical: energy at different frequencies
  - Each vector is extracted from a 10 ms frame
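
As a rough sketch of "energy at different frequencies", the function below cuts a signal into 10 ms frames and sums the power spectrum of each frame into a few frequency bands. The 8 kHz sample rate, the four equal-width bands, and the function name are assumptions for illustration; production front ends typically use overlapping windows and perceptually motivated features rather than this naive DFT.

```python
import cmath
import math

def extract_features(samples, sample_rate=8000, frame_ms=10, num_bands=4):
    """Return one feature vector (band energies) per non-overlapping frame."""
    frame_len = sample_rate * frame_ms // 1000      # 80 samples at 8 kHz
    half = frame_len // 2                           # bins below the Nyquist frequency
    band_size = half // num_bands
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Naive DFT: power in each frequency bin of this frame.
        power = []
        for k in range(half):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            power.append(abs(coeff) ** 2)
        # Sum neighbouring bins into a few coarse frequency bands.
        vector = [sum(power[b * band_size:(b + 1) * band_size])
                  for b in range(num_bands)]
        features.append(vector)
    return features
```

With these assumed settings, each bin is 100 Hz wide and each band covers 1 kHz, so a pure low-frequency tone puts nearly all of its energy in the first band.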

Recognizer
  Input: feature vectors
  Output: word string, e.g. "I want to go to Dallas"
  Determine the words that make up the utterance
  - Features to phonemes: Acoustic Model
  - Phonemes to words: Pronunciation Dictionary, Grammars
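
The phonemes-to-words step can be illustrated with a toy pronunciation dictionary and a toy grammar. This sketch skips the acoustic model entirely and assumes the phoneme sequence is already known; the phoneme symbols, the dictionary entries, and the single-sentence grammar are all made up for the example. Real recognizers search over acoustic model, dictionary, and grammar jointly using probabilities.

```python
# Hypothetical pronunciation dictionary (ARPAbet-like symbols, illustrative only).
PRONUNCIATIONS = {
    "i": ["AY"],
    "want": ["W", "AA", "N", "T"],
    "to": ["T", "UW"],
    "go": ["G", "OW"],
    "dallas": ["D", "AE", "L", "AH", "S"],
    "boston": ["B", "AA", "S", "T", "AH", "N"],
}
CITIES = {"dallas", "boston"}

def in_grammar(words):
    """Toy grammar: accepts only 'i want to go to <city>'."""
    return (len(words) == 6
            and words[:5] == ["i", "want", "to", "go", "to"]
            and words[5] in CITIES)

def phonemes_to_words(phonemes):
    """Return every word sequence whose pronunciations concatenate to phonemes."""
    if not phonemes:
        return [[]]
    results = []
    for word, pron in PRONUNCIATIONS.items():
        if phonemes[:len(pron)] == pron:
            for rest in phonemes_to_words(phonemes[len(pron):]):
                results.append([word] + rest)
    return results

def recognize(phonemes):
    """Return the first dictionary segmentation the grammar accepts, else None."""
    for words in phonemes_to_words(phonemes):
        if in_grammar(words):
            return " ".join(words)
    return None
```

The grammar is what rejects word sequences the dictionary alone would allow, which is why the slides list both under "phonemes to words".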

Natural Language Processing
  Input: word string, e.g. "I want to go to Dallas"
  Output: meaning, e.g. destination: Dallas
  - Assign meaning to the words that were spoken
  - Name-value pairs called slots
  - A slot is defined for each item relevant to the application
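
A minimal sketch of slot filling, assuming a tiny city list and a keyword rule ("to <city>" fills the destination). The city table, the "origin" slot, and the function name are hypothetical; real systems usually derive slots from the grammar or from a trained language understanding component.

```python
# Hypothetical list of cities the application knows about.
KNOWN_CITIES = {"dallas": "Dallas", "boston": "Boston"}

def extract_slots(word_string):
    """Map a recognized word string to name-value pairs (slots)."""
    words = word_string.lower().split()
    slots = {}
    for i, word in enumerate(words):
        if i > 0 and word in KNOWN_CITIES:
            if words[i - 1] == "to":
                slots["destination"] = KNOWN_CITIES[word]
            elif words[i - 1] == "from":
                slots["origin"] = KNOWN_CITIES[word]   # assumed extra slot
    return slots
```

For the slide's example, extract_slots("I want to go to Dallas") yields a single slot, destination: Dallas.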

Dialog Management
  Input: meaning, e.g. destination: Dallas
  Output: actions/prompts, e.g. "OK, what day do you want to leave?"
  Control the actions of the system
  - Access a database
  - Play back information to the user
  - Perform a transaction
  - Play a prompt requesting additional information
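
One simple way to picture this control step is a slot-filling loop: keep asking for whatever is still missing, then act. The sketch below assumes two required slots and uses the slide's follow-up prompt for the second one; the slot list, the first prompt text, and the "lookup" action name are illustrative assumptions.

```python
REQUIRED_SLOTS = ["destination", "day"]            # assumed slots for this example
PROMPTS = {
    "destination": "Where do you want to go?",      # hypothetical prompt text
    "day": "OK, what day do you want to leave?",    # prompt from the slide
}

def next_action(state, new_slots):
    """Merge newly understood slots into the dialog state and choose an action.

    Returns ("prompt", text) while information is missing, otherwise
    ("lookup", slots), standing in for a database access or transaction.
    """
    state.update(new_slots)
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("prompt", PROMPTS[slot])
    return ("lookup", dict(state))
```

Starting from an empty state, the meaning destination: Dallas produces the prompt asking for the day; once the day is supplied, the action switches to the lookup.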

Dialog Manager
  The heart of a Speech Application
  - Coordinates all subsystems
  - Passes information between them
  - Keeps track of application state
  [Diagram: Dialog Manager connected to Capture Voice, ASR, DTMF, Replay Audio, TTS]

Implementation Issues
  Why do we need a Speech Application?
  - Provide information to the user
  - Make it easier to access a service anywhere, anytime
  Objectives
  - Reusability of existing infrastructure
  - Reusability of existing data
  - Reusability of existing business logic

Bad Practice Example
  "At first, Barpoint programmed nine variations of its site, but within a few months the company's staff was frantically trying to maintain 90 versions. That effort involved every available employee, pulling many away from their regular duties and causing the company's core business to suffer."
  Source: The Industry Standard, 4/2/2001

The Answer: VoiceXML
  A special-purpose language for describing interactive voice dialog
  - Simplifies application development
  - Minimizes Internet traffic interaction
  - Separates user interaction code from application logic code
  - Provides portability
  - Provides simplicity
  - Supports rapid prototyping and iterative refinement
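
To give a feel for what such a dialog description looks like, the sketch below generates a one-field VoiceXML form from Python. The element names (vxml, form, field, prompt, grammar, filled, submit) are standard VoiceXML; the choice of version 2.0, the form id, the prompt text, and the grammar and submit URLs passed in are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

def build_destination_dialog(grammar_url, submit_url):
    """Return a minimal VoiceXML document that asks for a destination."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "trip"})
    field = ET.SubElement(form, "field", {"name": "destination"})
    # Prompt: what the system says to the caller.
    ET.SubElement(field, "prompt").text = "Where do you want to go?"
    # Grammar: the responses the recognizer will accept for this field.
    ET.SubElement(field, "grammar",
                  {"src": grammar_url, "type": "application/srgs+xml"})
    # Dialog logic: once the field is filled, hand the value to the server.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(filled, "submit",
                  {"next": submit_url, "namelist": "destination"})
    return ET.tostring(vxml, encoding="unicode")
```

For example, build_destination_dialog("cities.grxml", "http://example.com/book") returns a document in which the prompt, the grammar, and the dialog logic each appear as separate elements, while the booking itself stays on the server behind the submit URL (both arguments here are placeholder values).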

Web Application
  [Diagram: Web application architecture]
  Components shown: Database Server (DB), Application Server (Business Logic), Web Server (HTML, Scripts), Multimedia files, Internet, HTML Browser (PC)

Speech Application
  [Diagram: Speech application architecture]
  Components shown: Database Server (DB), Application Server (Business Logic), Web Server (VoiceXML, Scripts), Grammars, Audio files, Internet, Gateway, VoiceXML Browser (Capture Voice, ASR, DTMF, Replay Audio, TTS), Telephone Network

Combining Both UIs
  [Diagram: one application, two presentations]
  - Application Layer: Data Base, App Logic
  - Presentation Layer: VoiceXML and HTML
  - Clients: VoiceXML Browser, HTML Browser
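
The layering above can be sketched as one application-layer function feeding two presentation-layer renderers. Everything here is hypothetical: the function names, the hard-coded flight times standing in for a database, and the exact markup produced.

```python
from xml.sax.saxutils import escape

def find_flights(destination):
    """Application layer stand-in: hard-coded data instead of a real database."""
    flights = {"Dallas": ["8:00 AM", "1:30 PM"]}   # made-up example data
    return flights.get(destination, [])

def render_html(destination, times):
    """Presentation for an HTML browser."""
    items = "".join(f"<li>{escape(t)}</li>" for t in times)
    return (f"<html><body><h1>Flights to {escape(destination)}</h1>"
            f"<ul>{items}</ul></body></html>")

def render_vxml(destination, times):
    """Presentation for a VoiceXML browser: the same data, spoken as a prompt."""
    spoken = ", ".join(times)
    return ('<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">'
            "<form><block>"
            f"<prompt>Flights to {escape(destination)} leave at {escape(spoken)}.</prompt>"
            "</block></form></vxml>")
```

Both renderers take the output of the same find_flights call, so the data and business logic are written once and only the presentation differs.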

Key Concepts
  - Callers access commerce, content, and communications services via voice
  - VoiceXML is a language for developing voice-enabled Web sites
  - VoiceXML supports verbal menus and forms
  - While HTML browsers execute on the user's PC, VoiceXML browsers execute on a speech server
  - VUIs are very different from GUIs