EVALUATING DIALOGUE STRATEGIES IN A SPOKEN DIALOGUE SYSTEM FOR E-MAIL


Fernando Farfán, Heriberto Cuayáhuitl and Alberto Portilla
Intelligent Systems Research Group, Department of Engineering and Technology
Universidad Autónoma de Tlaxcala
Apartado Postal No. 140, Apizaco, Tlaxcala, México
{farfan, hcuayahu, aportilla}@ingenieria.uatx.mx

ABSTRACT

This paper presents an evaluation of directed dialogue (DD) and mixed initiative (MI) strategies in a spoken language system for e-mail. We compare the DD strategy, in which the system controls the dialogue, to the MI strategy, in which users can flexibly control the dialogue. To evaluate both strategies we used the PARADISE framework, which supports comparisons among dialogue strategies. Our experimental results show that the MI strategy outperforms the DD strategy in efficiency and user satisfaction, but needs more work on qualitative measures and task success.

KEY WORDS

Dialogue strategies, spoken dialogue systems, e-mail by phone, speech recognition, and word spotting

1 Introduction

In Mexico, automated telephone services are few, and most of them are IVR-DTMF systems based on touch-tone interfaces. These services are increasingly being replaced by Spoken Language Systems (SLS), which are more sophisticated than IVR-DTMF systems, accept a broader range of input, reduce call hold time, and generate cost savings. An SLS supports human-computer interaction based on human speech, using speech technology such as automatic speech recognition (ASR) and speech synthesis, or Text To Speech (TTS).

An SLS can interact with the user by employing a dialogue strategy such as Directed Dialogue (DD) or Mixed Initiative (MI). DD guides the user towards the achievement of goals, with the system always taking the initiative in the interaction, while in MI either the system or the user can take the initiative.
For example, consider the two dialogue strategies exemplified in D1 and D2, where S is the system and U is the user. The dialogues were conducted in Spanish; only their English translation is given here.

Directed Dialogue Strategy (D1)

S: Welcome to the e-mail system by phone!
S: Since this is your first call, I will provide you with some instructions.
S: Your messages are presented as a sorted list, so you can say: first, next, previous, or last. If you want to erase a message, say delete, otherwise, if you want to hear a message, say read. To end the call, say good bye.
S: Now you will hear your list of messages, remember that if you don't know what to say, you can say instructions. Let's begin!
S: Message sent by Armando Rocha, subject my new cell phone number.
U: (silence)
S: Message sent by Alberto Morales, subject Company meeting.
U: Read
S: Please, hold on while I get the message.
S: The message is:
S: Hi, I just want to let you know that the meeting will be next Friday at 5:30. Don't be late.
S: That was the message. Please say one of the following choices: repeat, reply, delete or continue.
U: Good bye
S: Thanks for calling the e-mail system by phone. Good bye.

Mixed Initiative Dialogue Strategy (D2)

S: Welcome to the e-mail system by phone!
S: Since this is your first call, I will provide you with some instructions.
S: If you want to hear a message, you can say "I would like to hear the messages of Ricardo Pérez"; if you want to reply to a message, you can say "I want to reply to the message of Ricardo Pérez".
S: What would you like to do?
U: Hi, I would like to hear Alberto Morales's messages.
S: Please, hold on while I get the messages.
S: There is just one message and the message is:
S: Hi, I just want to let you know that the meeting will be next Friday at 5:30. Don't be late.
S: That was the message. What would you like to do?
U: Good bye
S: Thanks for calling the e-mail system by phone. Good bye.
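The difference between D1 and D2 in how user input is interpreted can be illustrated with a toy, text-only sketch. It is not the recognizer-based implementation used in the systems evaluated here: the vocabulary is taken from the prompts in D1, while the keyword lists, the names and the functions `interpret_dd` and `interpret_mi` are hypothetical and exist only for illustration.

```python
import re

# Command vocabulary for the DD strategy, collected from the prompts in D1.
DD_VOCABULARY = {"first", "next", "previous", "last", "read", "delete",
                 "repeat", "reply", "continue", "instructions", "good bye"}

# Hypothetical keyword lists for the MI strategy (illustrative only).
MI_ACTIONS = {"hear": "read", "read": "read", "reply": "reply", "delete": "delete"}
MI_NAMES = ["alberto morales", "armando rocha", "ricardo pérez"]


def interpret_dd(utterance):
    """DD: the whole utterance must be exactly one entry of the vocabulary."""
    text = utterance.strip().lower()
    return text if text in DD_VOCABULARY else None


def interpret_mi(utterance):
    """MI: spot one action keyword and one name; all other words are filler."""
    text = utterance.lower()
    action = None
    for word in re.findall(r"\w+", text):
        if word in MI_ACTIONS:
            action = MI_ACTIONS[word]
            break
    name = next((n for n in MI_NAMES if n in text), None)
    return action, name
```

With these assumptions, the user turn "Read" from D1 is accepted by `interpret_dd` only because it matches a vocabulary entry exactly, whereas the full sentence from D2 is reduced by `interpret_mi` to an action and a name in a single turn.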

At first glance, MI may seem a better strategy than DD, but previous work suggests that the MI strategy faces several difficulties [1-3]. First, MI requires more complex grammars, which can result in high automatic speech recognition (ASR) error rates. Second, MI may require users to learn what the system can understand, because the system does not prompt them with the valid vocabulary. Our motivation in this work is therefore to evaluate both dialogue strategies (DD and MI) in order to help address these problems.

This paper presents a comparative analysis of the performance of two dialogue strategies (DD and MI) in a spoken language system for accessing e-mail by phone. We used PARADISE [4], the most sophisticated evaluation framework currently available for spoken language systems. Our experimental framework uses a state-of-the-art speech recognizer, which can handle out-of-vocabulary words [5].

The following section describes the system design. Section 3 describes our experimental design and gives details of the evaluation methodology. Section 4 describes our experimental results. Finally, Section 5 provides our conclusions and comments on future directions of this work.

2 System Design

Spoken language systems can be classified by dialogue complexity and by the type of speech input provided by the user. An SLS classification is shown in Figure 1.

Figure 1. Spoken Language Systems Classification [6]. (Input axis: discrete words, word spotting, continuous natural language, spontaneous speech. Dialogue complexity axis: menu, directed dialogue, mixed initiative, free flowing. Systems shown: IVR-DTMF, IVR-Speech, Natural Language System, Natural Dialogue System, human operator.)
In order to evaluate both dialogue strategies, two systems were developed: a Speech-IVR system and a Natural Language System (NLS). The Speech-IVR system implements the DD strategy, with discrete words as speech input, while the NLS implements the MI strategy, with word spotting capabilities and continuous natural language as speech input. Our systems provide the following capabilities:

- User authentication
- Reporting the number of new and unread e-mails
- Header and body-part consultation
- Deleting and replying to messages

A high-level call flow diagram for our systems is illustrated in Figure 2. To help users interact with the system, the system gives them instructions in their first call, after user authentication.

Figure 2. High-level diagram for DD and MI strategies. (States shown: Wait For Call, Welcome Prompt, Request NIP Digits, NIP OK?, Invalid User, Instructions, New and Unread, Action to perform, Header, Body-Part, Reply, Delete, Hang Up, Thanks For Calling.)

Our dialogue manager uses a state machine to implement both dialogue strategies. Most of the states include a DialogModule(TM) provided by the recognizer, which takes care of the conversation. A DialogModule includes an initial prompt, timeout prompts, retry prompts, confirmation prompts, a help prompt, and a vocabulary or grammar. We enabled a set of global commands that can be reached at any time in the call flow: cancel, help, instructions, and call termination.

The system implementing the DD strategy uses only vocabularies. The system implementing the MI strategy uses grammars, which include a set of filler models to the left and right of the keywords for modeling out-of-vocabulary words. In this way we were able to recognize any phrase with at least one keyword and at most two keywords (action and name).

The communication architecture components provide critical support for our systems. Figure 3 shows the system architecture. The following hardware and software components were employed in the system communication architecture:

- ASR: SpeechWorks Recognizer 6.5 Second Edition for Mexican Spanish (barge-in enabled)
- Speech Synthesizer: Eloquent 5.0, Mexican male voice
- Card Telephony Interface: Dialogic D21H
- Programming Language: C++
- Telephone Line Simulator: Skutch AS-26
- Mail Server: VisNetic MailServer
- Operating System: Windows NT 4.0
- Telephone: Telmex Regulated, BE-408 Type 1
- PC: Pentium 600 MHz, 256 MB RAM

Figure 3. System communication architecture. (Components shown: PSTN, card telephony interface, VRU, automatic speech recognizer, speech synthesizer, Internet/intranet, mail server, database, storage resources, device access.)

3 Experimental Design

The experimental setting was similar to that described in [2]; we also applied the PARADISE evaluation framework for comparing dialogue strategies. The overall structure of objectives in PARADISE, which provides the basis for estimating a performance function, is shown in Figure 4, with user satisfaction at the top level. In this framework, user satisfaction is correlated with task success, and that success comes at a price expressed by the cost measures. The cost measures can be classified as either efficiency measures or qualitative measures. Efficiency measures describe the normal course of the interaction between user and system: the user speaks, the system listens and understands what was said, and time passes. Not everything works perfectly, however. The system can fail in automatic speech recognition, and retries or timeouts occur when, for example, the user does not know exactly what to say. These events are captured by the qualitative measures.

Figure 4. PARADISE's structure of objectives for dialogue performance [4]. (Maximize user satisfaction by maximizing task success, measured by Kappa, and minimizing cost measures. Cost measures are split into efficiency metrics, such as elapsed time and user and system turns, and qualitative metrics, such as retry, barge-in and cancel.)

To measure task success, the Kappa coefficient is calculated from a confusion matrix that summarizes how well a user or system achieves the information goals of a particular task for a set of dialogues instantiating a set of scenarios.
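As a minimal sketch of this calculation, the function below computes Kappa from a square confusion matrix, assuming the usual PARADISE convention [4]: agreement P(A) is the proportion of counts on the diagonal, and chance agreement P(E) is derived from the column totals. The function name `kappa` and the example matrices are hypothetical and are not data from this study.

```python
def kappa(confusion):
    """Kappa coefficient from a square confusion matrix.

    confusion[i][j] counts how often the dialogue produced value i while
    the scenario key held value j (rows: dialogue data, columns: key).
    The matrix must contain more than one category with non-zero counts,
    otherwise chance agreement is 1 and Kappa is undefined.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # P(A): proportion of counts where dialogue and key agree (diagonal).
    p_agree = sum(confusion[i][i] for i in range(n)) / total
    # P(E): chance agreement from the column totals t_i.
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    p_chance = sum((t / total) ** 2 for t in col_totals)
    return (p_agree - p_chance) / (1 - p_chance)


perfect = [[10, 0], [0, 10]]   # every value exchanged correctly
noisy = [[8, 2], [2, 8]]       # some values confused
chance = [[5, 5], [5, 5]]      # agreement no better than chance
```

Under these assumptions, `kappa(perfect)` is 1 and `kappa(chance)` is 0, matching the two boundary cases of the coefficient, while `kappa(noisy)` falls in between.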
Kappa is defined by Equation 1, and it is described in [7]. P(A) is the proportion of times that the Attribute-Value Matrix (AVM) for a dialogue agrees with the AVM for the scenario key, and P(E) is the proportion of times we would expect the AVMs for the dialogues and keys to agree by chance. When agreement is perfect (all task information items are successfully exchanged), Kappa is 1. When agreement is only at chance level, Kappa is 0.

\kappa = \frac{P(A) - P(E)}{1 - P(E)}    (1)

where

P(E) = \sum_{i=1}^{n} \left( \frac{t_i}{T} \right)^2    (2)

and

P(A) = \frac{\sum_{i=1}^{n} M(i, i)}{T}    (3)

Here M is the confusion matrix, t_i is the sum of the frequencies in column i of M, and T is the sum of all frequencies in M.

As shown in Figure 4, performance also includes a function that combines cost measures. The PARADISE framework represents each cost measure as a function c_i that must be minimized. These cost measures and the Kappa coefficient are the basis for formulating the equation for system performance, which is calculated as follows:

\mathrm{performance} = \alpha \cdot N(\kappa) - \sum_{i=1}^{n} w_i \cdot N(c_i)    (4)

where \alpha is a weight on \kappa, the cost functions c_i are weighted by w_i, and N is a Z score normalization function, defined by

N(x) = \frac{x - \bar{x}}{\sigma_x}    (5)

with \bar{x} the mean and \sigma_x the standard deviation of x.

To evaluate both dialogue strategies, 50 undergraduate students, with an average age of 21 and receiving on average four e-mails per day, tested our systems. Their only previous experience in consulting e-mail was through the UNIX pine tool or through a graphical user interface such as HotMail or Yahoo Mail Service. Each student was asked to perform three tasks comprising five goals, giving a total of 125 goals per strategy. The tasks are listed in Table 1. Our testers had no previous experience with an SLS, so our experiments were performed with novice users.

Table 1. Tasks to be performed by each user.

Task | Description | Goals
1 | Armando Rocha has prepared a meeting, so he is expecting you to call him. He sent an e-mail with his cellular telephone number. | Get Armando's cellular telephone number.
2 | Fernando Mata Terrazas has invited you to his wedding. | Get the wedding's day and hour. Delete the message.
3 | Alberto Morales sent an e-mail about a meeting next Friday and he wanted you to confirm your attendance. | Get the meeting's day and hour. Confirm your attendance at the meeting.

Before our testers started a call, they were given the following tips for using the systems, as recommended by [8]:

- If you do not know what to say, you can say help anytime you want.
- If the system didn't understand you correctly and is performing the wrong task, you can say cancel to go back to the previous step.
- If you keep silent, the system will tell you what to say.
- You can barge in on a system prompt, so you do not have to wait until it finishes a word or phrase.
- When you finish your tasks, you can say good bye to end your call.

The data collection for the evaluation consisted of:

- The recording of dialogues as .ulaw audio files.
- The recording of metrics for each user session.
- A survey that gathered users' opinions about the system usage.

In this way, we determined a set of objective and subjective metrics: Task Success, Barge-in, Timeout, Retry, Elapsed Time, Help, Cancel, System Turns, User Turns, and Mean Recognition Score (MRS).

To measure task success, users provided us with the information required by each task. We used an Attribute Value Matrix (AVM) to represent the information retrieved from users' e-mails. An example AVM for task 2 is shown in Table 2.

Table 2. Attribute-value matrix (AVM) for task 2.

Attribute | Possible values
Day | Any day from Monday to Sunday
Hour | 5:30, any other hour
Delete | Yes, No

In order to compute user satisfaction with subjective metrics, we conducted a survey with the following questions:

- TTS quality: Was the system easy to understand?
- ASR quality: In this conversation, did the system understand what you said?
- Task ease-of-use: In this conversation, was it easy to find the message you were looking for?
- Interaction pace: Was the pace of interaction with the system appropriate in this conversation?
- User expertise: Did you know what you could say at each point of the dialogue?
- System response: How often was the system sluggish and slow to reply to you in this conversation?
- Expected behavior: Did the system work the way you expected?
- Future use: Based on your current experience with the system, would you use it regularly as another medium for accessing e-mail?

The user satisfaction survey was a web page with 8 questions and multiple choice answers. The possible answers to most of the questions were: almost never, rarely, sometimes, often, and almost always. Some questions had responses such as yes, no or maybe. Every response was mapped to an integer from 1 to 5, with 5 representing the highest score. The survey also included a text field where users were encouraged to enter comments about our systems. All responses were summed, resulting in a User Satisfaction measure for each dialogue ranging from 8 to 40, where 8 is the worst case for all questions and 40 is the best case.

4 Experimental Results

To compute the system performance with Equation 4, we ran a Multiple Linear Regression (MLR) over all objective metrics, which produced a set of coefficients (weights) describing the relative contribution of each factor. We used user satisfaction (US) as the dependent variable, that is, the factor to be predicted. The results of the regression showed that Elapsed Time (ET), Barge-in and Retry were the most significant factors (p < 0.028). Task success (κ) was computed using Equation 1, and ET, Barge-in and Retry were taken from the log files of our systems. Using these metrics we computed the performance function with a second MLR to obtain α and w_i, resulting in the following performance function:

performance = 4.950(κ) (ET) (Barge-in) 0.325(Retry)

A summary of the metrics collected from our systems by applying the PARADISE framework is listed in Table 3.¹

Table 3. Metrics performance for DD and MI strategies. (Columns: Metrics, Directed Dialogue, Mixed Initiative. Rows: efficiency measures: System Turns, User Turns, Elapsed Time (secs); qualitative measures: MRS, Timeout, Retry, Help, Cancel, Barge-in; Task Success (κ); User Satisfaction.)

This table shows that MI is more efficient: it is in fact better than DD in three measures (System Turns, User Turns and Elapsed Time). However, according to the qualitative measures, DD is better in Mean Recognition Score (MRS), Timeout, Retry and Barge-in. Task Success in the DD strategy was also better than in the MI strategy. It is worth noting that the MRS for MI was good (0.82) despite the difficulty of the recognition tasks in this strategy. We attribute the good speech recognition performance in MI to the use of a word spotting technique based on explicit garbage models [5].

Table 4¹ shows that users prefer the MI strategy to the DD strategy. In this table we can observe some relevant positive results for the MI strategy: task ease-of-use, user expertise, expected behavior and future use. We find these results interesting because we have 25 different opinions for each dialogue strategy. We therefore assume that users may find MI easy to use because they can say the action they want in just one utterance.
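To make Equations 4 and 5 concrete, the following sketch computes a per-dialogue performance value from task success and a set of cost measures. It is an illustration under stated assumptions, not the regression procedure used in this study: the function names, the choice of the population standard deviation in the Z score, and all numeric values in the usage example are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Z score normalization N(x) = (x - mean) / standard deviation.

    Assumption: the population standard deviation is used; a sample
    standard deviation would only rescale the result.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing to normalize against.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def performance(kappas, costs, alpha, weights):
    """Per-dialogue performance = alpha * N(kappa) - sum_i w_i * N(c_i).

    kappas  : one task-success value per dialogue
    costs   : dict mapping a cost name to a list with one value per dialogue
    weights : dict mapping the same cost names to their weights w_i
    """
    result = [alpha * k for k in z_scores(kappas)]
    for name, values in costs.items():
        for idx, z in enumerate(z_scores(values)):
            result[idx] -= weights[name] * z
    return result


# Hypothetical two-dialogue example: the first dialogue succeeds and is fast.
example = performance(
    kappas=[1.0, 0.0],
    costs={"elapsed_time": [10, 20]},
    alpha=1.0,
    weights={"elapsed_time": 1.0},
)
```

In this toy case the successful, shorter dialogue receives the higher performance value, which is the behavior Equation 4 is designed to capture: task success raises performance and normalized costs lower it.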
Even though task success is lower in MI, users feel confident about what to say and what to expect from the system. It also seems that users liked MI more, since they say they would use it in the future as another medium for reading e-mail. These results suggest that in the near future MI could become the most widely used dialogue strategy in spoken dialogue systems. We consider that the key factors are MRS and User Interface (UI) design. Our hypothesis is therefore that, with a friendly and easy-to-use UI and a robust speech recognizer, the MI strategy can surpass the DD strategy. Finally, our experiments were performed with novice users, and we suspect that with expert users MI could reach the same task success scores as the DD strategy. Timeouts and retries would also be reduced dramatically, because users would know what to say, and as a consequence MRS may be more accurate.

Table 4. User Satisfaction Survey Results. (Columns: Criteria, Directed Dialogue, Mixed Initiative. Rows: TTS quality, ASR quality, Task ease, Interaction pace, User expertise, System response, Expected behavior, Future use, User Satisfaction.)

¹ The numbers in bold font represent better results compared with the other strategy.

5 Conclusion

We presented our experiments comparing two dialogue strategies, directed dialogue (DD) and mixed initiative (MI), in the context of a spoken dialogue system for accessing e-mail by phone. Our results showed that the MI dialogue strategy is more efficient: it scores better than the DD strategy on System Turns, User Turns and Elapsed Time. Our results also showed that the DD strategy is better in Task Success and in qualitative metrics such as MRS, Timeout and Retry. At this point, we conclude that two key components may improve the performance of dialogue strategies: a robust speech recognizer and a well-designed user interface. With these two components in place, better results can be obtained for efficiency and qualitative metrics in both dialogue strategies.
Furthermore, our results showed that user satisfaction is better in the MI strategy. Thus, combining high performance in efficiency and qualitative metrics with user satisfaction could make MI the most widely used dialogue strategy in spoken dialogue systems. To achieve high MRS scores, it is necessary to deal robustly with out-of-vocabulary speech.

Our immediate future direction is a detailed study of out-of-vocabulary speech, which according to our results is a key component for dealing with the MI strategy. Other future directions include a revision of the UIs, considering the addition of interaction demonstrations and a careful design of timeouts and retries, which will take into account our experimental results, user comments and a speech data analysis. Finally, another important issue is user experience. Our results are for novice users; for expert users, the preference for MI and its performance may be even higher.

6 Acknowledgement

This research was partially supported by SpeechWorks International Inc. with equipment and software licenses. We would like to thank Ben Serridge for revising the writing of this paper.

References

[1] Danieli, M., and Gerbino, E., Metrics for Evaluating Dialogue Strategies in a Spoken Language System, In Proceedings of AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, California, USA, 1995.
[2] Walker, M. A., Fromer, J., Fabbrizio, D. G., Mestel C., and Hindle, D., What Can I Say? Evaluating a Spoken Language Interface to Email, In Proceedings of CHI, California, USA, 1998.
[3] Walker, M. A., Kamm, C. A, and Litman, D., Towards Developing General Models of Usability with PARADISE, Natural Language Engineering: Special Issue on Best Practice in Spoken Dialogue Systems.
[4] Walker, M. A., Litman, D., Kamm, C. A, and Abella, A., PARADISE: A framework for Evaluating Spoken Dialogue Agents, In Proceedings of ACL/EACL, 1997.
[5] Cuayáhuitl, H. and Serridge, Out-Of-Vocabulary Word Modeling and Rejection for Spanish Keyword Spotting Systems, In Proceedings of the Mexican International Conference on Artificial Intelligence, LNAI 2313, Mérida, Mexico, 2002.
[6] Thomas, B. S. and Joy J., Saying what Comes Naturally, Speech Technology Magazine, March-April.
[7] Carletta, J. C., Assessing the Reliability of Subjective Coding, Computational Linguistics, 22(2), 1996.
[8] Telephone Speech Standards Committee, Universal Commands for Telephony-Based Spoken Language Systems, CHI Bulletin, Volume 32, Number 2, April 2000.
