A comparison of usability methods for testing interactive health technologies: Methodological aspects and empirical evidence


international journal of medical informatics 78 (2009) 340-353
journal homepage: www.intl.elsevierhealth.com/journals/ijmi

A comparison of usability methods for testing interactive health technologies: Methodological aspects and empirical evidence

Monique W.M. Jaspers
Department of Medical Informatics, Jb-114-2, Academic Medical Center - University of Amsterdam, PO Box 22700, Amsterdam, The Netherlands
Tel.: +31 20 5665178. E-mail address: m.w.jaspers@amc.uva.nl

Article history: Received 3 July 2007; received in revised form 19 September 2008; accepted 15 October 2008

Keywords: Evaluation; Usability; Medical systems; User interface design; Human-computer interaction

Abstract

Objective: Usability evaluation is now widely recognized as critical to the success of interactive health care applications. However, the broad range of usability inspection and testing methods available may make it difficult to decide on a usability assessment plan. To guide novices in the human-computer interaction field, we provide an overview of the methodological and empirical research available on the three usability inspection and testing methods most often used.

Methods: We describe two expert-based and one user-based usability method: (1) the heuristic evaluation, (2) the cognitive walkthrough, and (3) the think aloud.

Results: All three usability evaluation methods are applied in laboratory settings. Heuristic evaluation is a relatively efficient usability evaluation method with a high benefit-cost ratio, but it requires considerable skill and usability experience of the evaluators to produce reliable results. The cognitive walkthrough is a more structured approach than the heuristic evaluation, with a stronger focus on the learnability of a computer application. A major drawback of the cognitive walkthrough is the level of detail required in the task and user background descriptions for an adequate application of the latest version of the technique. The think aloud is a very direct method to gain deep insight into the problems end users encounter in interaction with a system, but data analysis is extensive and requires a high level of expertise in both cognitive ergonomics and the computer system's application domain.

Discussion and conclusions: Each of the three usability evaluation methods has shown its usefulness and has its own advantages and disadvantages; no single method has been shown to be singularly effective in all circumstances. A combination of techniques that complement one another should preferably be used, as their collective application is more powerful than any single method applied in isolation. Innovative mobile and automated solutions to support end-user testing have emerged, making combined approaches of laboratory, field and remote usability evaluations of new health care applications more feasible.

© 2008 Elsevier Ireland Ltd. All rights reserved.
1386-5056/$ - see front matter. doi:10.1016/j.ijmedinf.2008.10.002

1. Introduction

Users' adoption and usage of interactive health care applications have often been hampered by poor design, making these systems difficult to learn and complicated to use. System implementation failures point to recurrent problems in the design process. It is clear that interactive computer systems designed without reference to health care professionals' information processing are likely to dissatisfy their users. Poorly designed systems may even lead to disaster if critical information is not presented in an effective manner. Both quantitative and qualitative studies have highlighted examples of health care applications whose design flaws indeed led to errors [1-8]. Consequently, usability assessment is now widely recognized as critical to the success of interactive clinical information systems, and over the years usability evaluations of these kinds of systems have been undertaken.

The benefits of usable systems are increased productivity, reduced errors, reduced need for user training and user support, and improved acceptance by their users. A system designed for usability and tailored to the users' preferred way of working will allow them to operate the system effectively rather than struggle with the computer's functions and user interface, enhancing users' productivity. Avoiding interface design faults will reduce user error and reinforce learning, thus reducing the time needed for user training and user support. Improved user acceptance is often an indirect outcome of the design of a usable system [9]. Within health care, these benefits cannot be overestimated, and health care organizations are expending increasing resources on performing usability evaluations of their interactive computer applications or on having these evaluations performed by usability laboratories.

Usability evaluators now have a variety of empirical and analytic methods at their disposal to guide them in assessing and improving the usability of interactive computer applications, most of which stem from the usability engineering field. The acknowledged need for these kinds of studies and the widespread use of usability evaluation methods have resulted in growing attention within the usability engineering field for proper implementation of these methods and for key issues such as what kind of expertise is needed to produce reliable results, which method to use in a certain context, and to what extent different usability evaluation methods bring forth different results. If our usability studies in the health informatics domain are to yield results for making informed decisions with regard to (re)designing interactive computer applications, usability researchers active in this domain need to be knowledgeable about the methodological and empirical insights coming from the usability engineering field.

In this article, we give an overview of the methodological research available on the three usability inspection and testing methods most often used to evaluate an interactive system's design against user requirements: the heuristic evaluation, the cognitive walkthrough and the think aloud method. These methods are applied to identify usability flaws in an early system design as part of system development. That is, we do not include methods that focus on generating design solutions or methods for assessing a system's usability after its release.
Our goal is to help newcomers to usability research of health care applications decide on the methods to be used in evaluating the usability of a (prototype) system design, the kinds of experts and participants to recruit, the other resources required, and the way to analyse and report on the results. We end with a general discussion of the value and limits of the heuristic evaluation, the cognitive walkthrough and the think aloud method, key issues to consider in conducting usability evaluation studies, and new areas of usability research.

2. Expert-based versus user-based usability methods

Interactive systems are usually designed through an iterative process of initial design, (prototype) evaluation, and redesign. Usability inspections and tests of prototype systems focus on system flaws that need to be repaired before a final design is released. Different methods can be used to evaluate a first system design for usability, and a variety of expert-based inspection and user-based testing methods have evolved to facilitate this process. Expert-based inspection methods include guideline review, heuristic evaluation, consistency inspection, usability inspection and walkthroughs [9-13]. In general, expert-based methods aim to uncover potential usability problems by having evaluators inspect a user interface with a set of guidelines, heuristics or questions in mind, or by performing a step-wise approach derived from general knowledge about how humans proceed through tasks.

User-based testing methods include user performance measurements, log-file and keystroke analyses, cognitive workload assessments, satisfaction questionnaires, interviews and participatory evaluation [9-13]. With the first four methods, the measures are easily obtained and can be used to infer problem areas in a system's design. These methods, however, do not give any clear indication of why a certain interface aspect poses a problem to a user or how to improve the interface. Participatory evaluation methods require actual end users to employ a user interface as they work through task scenarios and explain what they are doing, by talking or thinking aloud or afterwards in a retrospective interview. These methods do provide insight into the underlying causes of the usability problems encountered by users, and participatory evaluation has therefore led to a high level of confidence in the results produced.

To evaluate an interactive system's design against user requirements, the heuristic evaluation and the cognitive walkthrough are the two most widely adopted expert-based methods, whereas the most widely applied user-based method is the think aloud [12]. In this contribution, we therefore focus on these three usability evaluation methods. Each of these three methods has its strengths and weaknesses, requires certain expertise levels of reviewers and certain resources, generates specific types of data and requires specific analysis methods. All these aspects may affect the outcomes and the usefulness of usability evaluation efforts. We will hence go more specifically into the assumptions underlying each of these three usability methods, how to apply each method, what is (empirically) known about these usability evaluation methods, and how these usability methods may differ in their results.

3. Expert-based inspection methods

3.1. Heuristic evaluation

Among the usability inspection methods, heuristic evaluation is the most common and most popular [12,13]. In a heuristic evaluation, a small set of evaluators inspects a system and evaluates its interface against a list of recognized usability principles, the heuristics. Typically, these heuristics are general principles that refer to common properties of usable systems. Heuristic evaluation is in its most common form based on the following set of usability principles: (1) use simple and natural dialogue, (2) speak the user's language, (3) minimize memory load, (4) be consistent, (5) provide feedback, (6) provide clearly marked exits, (7) provide shortcuts, (8) provide good error messages, (9) prevent errors, and (10) provide help and documentation [14,15].

In performing a heuristic evaluation, each evaluator steps through the interface twice: first, to get an idea of the general scope of the system and its navigation structure; second, to focus on the screen layout and interaction structure in more detail, evaluating their design and implementation against the predefined heuristics. Each heuristic evaluation results in a list of usability flaws with reference to the heuristic violated. After the problems are found, each evaluator preferably estimates the severity of each problem independently [13]. Once all evaluations have been conducted, the outcomes of the different evaluators are compared and compiled in a report summarizing the findings. This report describes the usability flaws in the context of the heuristics violated and assists system designers in revising the design in accordance with what is prescribed by the heuristics [16].

Heuristic evaluation is an efficient usability evaluation method [17] with a high benefit-cost ratio [18]. It is of particular value in circumstances where time and resources are limited, since skilled experts can yield high quality results in a limited amount of time without the need to involve end users in the evaluation.
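As an illustration of how the findings of several independent evaluators can be compiled into the severity-ordered report described above, the following minimal Python sketch groups reported flaws by the heuristic violated. The findings, severity scale and evaluator labels are hypothetical examples, not data or tooling taken from the article.

```python
from collections import defaultdict
from dataclasses import dataclass

# Nielsen's ten heuristics as listed above (abbreviated labels).
HEURISTICS = [
    "simple and natural dialogue", "speak the user's language",
    "minimize memory load", "be consistent", "provide feedback",
    "clearly marked exits", "provide shortcuts", "good error messages",
    "prevent errors", "help and documentation",
]

@dataclass
class Finding:
    evaluator: str      # who reported the flaw
    description: str    # the usability flaw observed
    heuristic: str      # which heuristic it violates
    severity: int       # hypothetical scale: 0 = not a problem ... 4 = catastrophe

def compile_report(findings: list[Finding]) -> None:
    """Group independently reported flaws by heuristic and print them
    ordered by the highest severity assigned by the evaluators."""
    assert all(f.heuristic in HEURISTICS for f in findings)
    by_heuristic: dict[str, list[Finding]] = defaultdict(list)
    for f in findings:
        by_heuristic[f.heuristic].append(f)
    for heuristic, flaws in sorted(
        by_heuristic.items(),
        key=lambda kv: -max(f.severity for f in kv[1]),
    ):
        print(f"Heuristic violated: {heuristic} ({len(flaws)} finding(s))")
        for f in sorted(flaws, key=lambda f: -f.severity):
            print(f"  [severity {f.severity}] {f.description} ({f.evaluator})")

# Example: two evaluators' independent findings on a hypothetical order-entry screen.
compile_report([
    Finding("evaluator A", "No confirmation before deleting a medication order",
            "prevent errors", 4),
    Finding("evaluator B", "Error dialog shows an internal code instead of advice",
            "good error messages", 3),
    Finding("evaluator B", "Dose units differ between entry form and overview",
            "be consistent", 2),
])
```

Ordering the grouped findings by severity mirrors the recommendation that evaluators rate severity independently before the outcomes are compared and compiled.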
Heuristic evaluation has, however, some drawbacks too. First, different sets of heterogeneous heuristics and guidelines have proliferated through the years to assist reviewers in heuristic evaluation of a multitude of user interfaces [19-22]. This proliferation of heuristics and guidelines might in itself present a problem for less-experienced evaluators, who must become knowledgeable of and proficient in applying disparate guidelines for a wide variety of contexts. Heuristics and guidelines for one context may not be applicable in a different context, or may even contradict a guideline to be used in another context, complicating heuristic evaluation practices even more. Up to now, a simple, unified set of heuristics that can be applied in different contexts, assures reliable results and provides additional design guidance during system development has been lacking, but new initiatives to develop such a set of unified guidelines are underway [23]. Another drawback of the heuristic evaluation lies in the fact that heuristics or guidelines are often described very generally, and so evaluators may interpret them in different ways, producing different outcomes.

More particularly, if heuristic evaluations are performed by less-experienced evaluators, they may consider every heuristic violation a problem, even when this is not the case. A common heuristic is, for example, to provide help and documentation [14,15]. Some less-experienced evaluators may interpret the lack of a help function and documentation as a usability problem regardless of the simplicity of the computer application. Conversely, a less-experienced evaluator may miss a usability problem because the specific problem is not directly linked to one of the predefined heuristics [22]. A third problem to be noted is that experts are often renowned for their strong views and preferences. In performing a heuristic evaluation, they may concentrate on certain features while ignoring others. Different evaluators indeed find different problems, and expertise greatly influences the outcomes of a heuristic evaluation. Usability experts are almost twice as good as novice evaluators, and application domain experts with usability experience ('double experts') are almost three times as good as novices. Two to three double specialists (with usability and system domain expertise) or three to five evaluators with usability expertise suffice to detect 75% of the usability problems in a system, but about 14 novice evaluators are needed to find an equal number of problems [19]. Since double experts are scarce, three to five skilled usability experts per heuristic evaluation are commonly needed to achieve optimal effectiveness and to defend against threats to the validity of the findings [24-27] (Table 1).

Table 1. Characteristics of the heuristic evaluation method

When:     Early system design stage (examples: [48-65])
How:      Experts evaluate an interface to determine conformance to a list of heuristics; 3-5 usability experts or 2-3 double experts
Input:    List of predefined heuristics
Output:   List of heuristics violated; severity rating per usability flaw
Benefits: Quick and cheap; promotes comparability between alternative system designs
Limits:   Unstructured approach; heuristics are generally defined; hard to decide which guidelines hold in a certain context; results affected by the number of evaluators and the evaluators' skills

In circumstances where usability evaluators are not familiar with the system domain, a work domain expert can assist the evaluator in tackling domain-specific problems. In these so-called participatory heuristic evaluations, work domain experts may help usability evaluators consider how the system contributes to reaching certain goals in that particular area of expertise. Another way to help usability inspectors analyse the user interface of an unfamiliar domain is to provide a scenario listing the steps that a typical user would take to accomplish a particular task with the system [28]. Such a heuristic step-wise evaluation comes close to the cognitive walkthrough evaluation described in the next section.

Several studies have examined the number and severity of usability problems detected with the heuristic evaluation and have shown that large numbers of major and minor usability problems can be found [19,22,29-31]. One comparison study of particular interest showed that a heuristic evaluation identified many more low- and high-severity problems than the cognitive walkthrough and the think aloud [29], and at the lowest cost in terms of person-hours. Though at first sight these results seem to favor the heuristic evaluation, the results were produced by four highly skilled usability experts with advanced degrees and years of experience in evaluating interfaces. Moreover, none of these individual evaluators found more than one third of the problems, indicating first that inter-reviewer reliability was low and second that the overall results of a heuristic evaluation indeed improve with the number of evaluators. Another limitation was the large number of specific, one-time, low-priority or cosmetic problems found by the heuristic evaluation compared to the think aloud. Indeed, problems identified in a heuristic evaluation are often not found in user testing; in one study, more than half of the problems revealed by the heuristic evaluation were not discovered in think aloud sessions with end users [30]. Overall, these findings suggest that although the heuristic evaluation method seems quite effective in detecting both minor and major usability problems, it is less effective than the think aloud in detecting recurring problems. Moreover, the relatively large number of specific, low-priority problems found by a heuristic evaluation may make it hard to decide on a system redesign plan when time or budgets are limited.

In conclusion, the major disadvantages of a heuristic evaluation are the low overlap of usability problems found by different evaluators and the high dependence of the overall results on the skills and experience of the evaluators. Such experts often are not available, limiting the options for performing a heuristic evaluation.
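The evaluator numbers cited above (three to five usability experts versus roughly 14 novices for the same 75% coverage) are often summarized with a cumulative problem-discovery model of the form 1 - (1 - p)^n, where p is the proportion of problems a single evaluator finds. The article does not present this model; the sketch below is illustrative only, and the per-evaluator detection rates are assumptions chosen to mirror the reported ratios.

```python
def proportion_found(n_evaluators: int, detection_rate: float) -> float:
    """Expected proportion of usability problems found by n independent
    evaluators, each detecting a given problem with probability
    `detection_rate` (cumulative discovery model: 1 - (1 - p)^n)."""
    return 1.0 - (1.0 - detection_rate) ** n_evaluators

# Assumed per-evaluator detection rates (not figures from the article),
# chosen so experts are a few times as effective as novices.
for label, rate in [("usability expert", 0.35), ("novice evaluator", 0.10)]:
    needed = next(n for n in range(1, 50) if proportion_found(n, rate) >= 0.75)
    print(f"{label}: ~{needed} evaluators to reach 75% of the problems")
```

With these assumed rates the sketch reproduces the reported pattern: about four experts, or about fourteen novices, are needed to reach three quarters of the problems.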
3.2. The cognitive walkthrough

The cognitive walkthrough method is a usability evaluation technique that focuses on evaluating an (early) system design for learnability by exploration [32,33]. The cognitive walkthrough differs from the heuristic evaluation in that it is highly structured and explicitly guided by the user's tasks. In a cognitive walkthrough, an evaluator, preferably a usability expert, evaluates a user interface by analysing the cognitive processes required to accomplish tasks that users would typically carry out supported by the computer. The cognitive walkthrough helps an evaluator examine the interplay between a user's intentions and the feedback provided by the system's interface. As a cognitive walkthrough focuses on the ease of learning of a computer application by novice users, the evaluator is supposed to explore the interface without any guidance, simulating a novice user. The cognitive walkthrough relies on a cognitive model of a novice user executing four steps in the context of a task the user is to perform: (1) the user sets a goal to be accomplished, (2) the user inspects the available actions on the user screen (in the form of menu items, buttons, etc.), (3) the user selects one of the available actions that seems to make progress toward the goal, and (4) the user performs the action and evaluates the system's feedback for evidence that progress is being made toward the current goal.

In executing the cognitive walkthrough, for each action needed to accomplish a certain task, the evaluator tries to answer four questions: (1) will the user try to achieve the correct effect, (2) will the user notice that the correct action is available, (3) will the user associate the correct action with the desired effect, and, if the user performed the right action, (4) will the user notice that progress is being made toward accomplishment of the goal. If there are positive answers to all four questions, the execution of the specified action is found to be without usability flaws; if there is a negative answer to any of the four questions, the specified action is not free of usability problems. A cognitive walkthrough requires a preparation phase in which the user background is defined and the sample tasks are selected or constructed. By evaluating each step required to perform a task, a cognitive walkthrough can detect potential mismatches between designers' and users' conceptualizations of a task, potential problems a user would have with interpreting certain verbal labels and menus, and potential problems with the system feedback about the consequences of a specific action.

Experiences with the earlier version of the cognitive walkthrough raised several concerns, including the need for a background in cognitive psychology, the tedious nature of the method, and the time required to apply the technique. The most recent version of the cognitive walkthrough addresses these issues [34], describing step by step how to conduct the technique, but requiring detailed descriptions of how users accomplish tasks as input: for each task, a complete description of the correct action sequences and the interface states along these sequences has to be determined before a cognitive walkthrough can be conducted (Table 2).

Table 2. Characteristics of the cognitive walkthrough method

When:     Early system design stage (examples: [7,66-68])
How:      Experts simulate new users walking through the interface step by step, carrying out typical tasks; minimum of two evaluators
Input:    Representative tasks; complete list of action sequences needed to accomplish the tasks; user background description (computer knowledge, domain knowledge)
Output:   List of (potential) usability problems at a certain stage in the user-interaction cycle
Benefits: Structured approach; detailed analysis of potential problems
Limits:   Tedious; discourages exploration; results affected by task description details and user background details

Several studies have examined the number and severity of usability problems found by the cognitive walkthrough method. The cognitive walkthrough appears to be more effective in finding severe than less severe problems [22], and detailed task descriptions rather than shorter ones can significantly increase the number of usability problems related to the feedback on performed action sequences found by a cognitive walkthrough [35]. Another comparison study showed that the cognitive walkthrough revealed only about one third of the usability problems detected by a heuristic evaluation, but again proportionally more severe than less severe problems [29]. This difference in outcomes of the cognitive walkthrough and heuristic evaluation may be explained by the fact that the cognitive walkthrough addresses problems through a very structured process, whereas the general description of usability principles leaves more room to the evaluator in conducting a heuristic evaluation. But though heuristic evaluations do not appear to provide enough structure, cognitive walkthroughs may provide too much. Particularly with the latest version, which for each task requires detailed descriptions of correct action sequences, evaluators may simply analyse predefined solutions instead of discovering solutions through exploration. This may limit the evaluator's ability to find certain types of problems that are not directly related to these action sequences but would have been detected by a heuristic evaluation. On the other hand, the high ratio of severe problems found by a cognitive walkthrough favors this method in situations where, because of time or financial constraints, the focus is on detecting the more severe usability problems in a system's design.

Differences in user background settings likewise affect the results of a cognitive walkthrough. In a standard cognitive walkthrough, the defined user background is limited to a user's knowledge of a specific interface; other user characteristics are not taken into account. However, when more factors are included in the user background setting, for example cognitive factors, a wider range of usability problems can be found [36]. Especially since the theoretical model underlying the cognitive walkthrough assumes a series of mental process stages, cognitive factors such as a user's memory load may influence the execution of these stages. One study showed that inclusion of users' cognitive characteristics in the background setting increased the detection scope of the cognitive walkthrough for usability problems related to limits in users' memory capacity [37]. As far as we know, no other studies have been performed to analyse the effect of differences in user background settings on the outcomes of a cognitive walkthrough. Despite these methodological concerns, the cognitive walkthrough has proven to be useful in an early stage of system design. Complete versions of systems are not required, and a cognitive walkthrough can even be used to evaluate system designers' preliminary ideas.
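The question-per-action procedure lends itself to a simple recording format. The sketch below is a hypothetical illustration only: the task, actions and notes are invented, and the article prescribes no particular notation. Any action for which one of the four questions is answered "no" is flagged as not free of usability problems.

```python
from dataclasses import dataclass

# The four questions asked for every action in the correct action sequence.
QUESTIONS = (
    "Will the user try to achieve the correct effect?",
    "Will the user notice that the correct action is available?",
    "Will the user associate the correct action with the desired effect?",
    "Will the user notice that progress is being made toward the goal?",
)

@dataclass
class ActionReview:
    action: str                               # one step of the correct action sequence
    answers: tuple[bool, bool, bool, bool]    # evaluator's yes/no per question
    notes: str = ""

    def problems(self) -> list[str]:
        """A 'no' to any question marks the action as not free of usability problems."""
        return [q for q, ok in zip(QUESTIONS, self.answers) if not ok]

# Hypothetical walkthrough of one task: 'renew a repeat prescription'.
walkthrough = [
    ActionReview("Open the patient's medication list", (True, True, True, True)),
    ActionReview("Select 'renew' from the row's context menu",
                 (True, False, True, True),
                 notes="Renew option hidden behind a right-click menu."),
    ActionReview("Confirm the new end date", (True, True, True, False),
                 notes="No feedback that the renewal was stored."),
]
for step in walkthrough:
    for q in step.problems():
        print(f"Potential problem at '{step.action}': {q}"
              + (f" ({step.notes})" if step.notes else ""))
```

Such a record also makes explicit which part of the interaction cycle (goal formation, action availability, action-effect mapping or feedback) each potential problem concerns.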
4. User-based testing method

4.1. The think aloud method

The think aloud method formally belongs to the verbal report methods and stems from the field of cognitive psychology. It was specifically developed to gather information on the cognitive behavior of humans performing tasks. The think aloud method is viewed as particularly useful for understanding the processes of cognition because it assesses humans' cognitions concurrently with their occurrence. It is therefore a unique source of information on these cognitive processes and a very direct method to gain insight into the way humans solve problems. Overall, the method consists of two stages: (1) collecting think aloud protocols in a systematic way, and (2) analysing these protocols to obtain a model of the cognitive processes that take place while a person tackles a problem [38]. These protocols are collected by instructing subjects to solve a problem while "thinking aloud"; that is, stating directly what they think. A basic assumption underlying the think aloud is that the only information that can be verbalized is the contents of a subject's working memory: the information that is actively being processed. As the think aloud method does not require a subject to retrieve long-term memory constructs or to report retrospectively on his thought processes, censoring and distortion of these processes is minimized. Constraints on the verbal data are imposed later through strategies for analysing the protocols' content, according to the researcher's interest.

The think aloud method can be of high value in evaluating a system's design for usability flaws and is therefore frequently used to gather information about a system's usability in testing computer systems with potential end users. During recorded usability sessions, users interact with a (prototype) system or interface according to a predetermined set of scenarios while verbalizing their thoughts. Analyses of these verbal reports provide detailed insight into the usability problems actually encountered by end users, but also into the causes underlying these problems. However, an often-expressed concern with the think aloud method is that the information provided by the users is subjective. The identification and selection of a representative sample of (potential) end users is therefore crucial for generating valid usability data with a think aloud test. The subject sample should consist of persons representing those end users who will actually use the system in the future. This requires a clearly defined user profile which describes the range of relevant skills of the future system users. Computer expertise, the roles of subjects in the workplace, and a person's expertise in the domain of work the computer system will support are useful dimensions in this respect. A questionnaire may be given to test subjects either before or after the session to obtain this information. As the think aloud method provides a rich source of data, a small sample of subjects (approximately 8) suffices to gain a thorough understanding of task behavior [38] or to identify the main usability problems with a computer system [39]. However, in situations where there are considerable numbers of different types of users, there need to be sufficient numbers of each type in the think aloud test sessions.

A representative sample of tasks of the work domain to be performed by users in the think aloud study is likewise essential. The primary aim of task construction is to provide examples of future system use as an aid to understanding and clarifying user-computer interaction behavior and strategies in approaching the computer-supported tasks. Task cases should therefore be realistic and representative of daily life situations and of those tasks that end users are expected to perform while using the (future) computer system. Instructions to the subjects about the task at hand should be given before the actual usability study starts. The instruction on the thinking aloud procedure is straightforward. The essence is that the subject performs the task at hand supported by the computer system and says out loud what comes to mind. A typical instruction would be: "I will give you a task. Please keep talking out loud while performing the task." Because thinking aloud in this manner is an unusual task, subjects should be given an opportunity to practice talking aloud while performing an example task.
A practice session, with an example task not too different from the target task, should therefore precede the actual experiment to orient participants to the procedure of talking out loud as well as to allow screening for compliance with the task. Virtually no participants are non-responsive, and the vast majority of people think aloud in the desired way after instructions are given and thinking aloud has been practiced [38,40]. As soon as the subject is working on the computer-supported task, the evaluator intervenes only when the subject stops talking. The instructor should then prompt the subject with the following instruction: "Keep on talking." The objective is to obtain the maximum from the user while maintaining as realistic a real-life environment as possible.

Full audio taping and/or video recording of the subject and his concurrent utterances during task performance and, if relevant, video recording of the computer screens is required to capture all the verbal data and user-computer interactions in detail. Subjects' articulated thoughts are later transcribed for content analysis by coders. Typing out complete verbal protocols is therefore inevitable in order to analyse the data in detail. Video recordings may be informally reviewed or formally analysed to fully understand the way the subject performed the task or to detect the type and number of user-computer interaction problems.

Prior to analysing the audio and/or video data, it is usually necessary to develop a coding scheme to identify step by step how the subject tackled the task and/or to identify specific user-computer interaction problems in detail. Coding schemes may be developed bottom-up or top-down. In a bottom-up procedure, one would use part of the protocols to generate codes by taking every new occurrence of a cognitive sub-process as a new code. One could, for example, assign the code "guessing" to verbal statements such as "Could it be X?" or "Let's try X." The remaining protocols would then be analysed using this coding scheme. Alternatively, categories in the coding scheme may be developed top-down, for example by examining categories of interactions from the human-computer interaction literature. To prevent evaluator bias, it is best to leave the actual coding of the protocols to (at least two) independent coders. By separating the collection and analysis of verbal protocols, it is possible to analyse the same protocol several times and to compute intercoder reliability, for which the kappa statistic is mostly used [41].
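As a minimal illustration of that reliability check, the snippet below computes Cohen's kappa for two coders who each assigned one code per protocol segment. The coders, codes and segments are hypothetical; the article only notes that kappa is the measure most commonly used [41].

```python
from collections import Counter

def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Cohen's kappa for two coders assigning one code per protocol segment:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    chance = sum(freq_a[c] * freq_b[c] for c in set(coder_a) | set(coder_b)) / (n * n)
    return (observed - chance) / (1 - chance)

# Hypothetical codes assigned to the same ten think-aloud segments.
coder_1 = ["guessing", "navigation", "reading", "guessing", "error",
           "navigation", "reading", "reading", "error", "guessing"]
coder_2 = ["guessing", "navigation", "reading", "navigation", "error",
           "navigation", "reading", "guessing", "error", "guessing"]
print(f"Intercoder reliability (kappa): {cohens_kappa(coder_1, coder_2):.2f}")
```

Values close to 1 indicate that the coders applied the scheme consistently; low values suggest the coding scheme is ambiguous and needs refinement before the full set of protocols is analysed.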
The coded protocols and/or videos can be compiled and summarized in various ways, depending on the goal of the study. If the aim is to evaluate the usability of a (prototype) computer system, the results may summarize the type and number of usability problems revealed and the causes underlying these problems. If the computer system under study is still under development, these insights may then be used to redesign the system. Although application of the think aloud method is rather straightforward, the verbal protocols require substantial analysis and interpretation to gain deep insight into the way subjects perform tasks. Because the think aloud generates a report of unstructured responses, a virtually unlimited number of different coding schemes can be developed or used to analyse the same data set. Therefore, high expertise in both cognitive ergonomics and the computer system's application domain is demanded for a thorough analysis and interpretation of the protocols. This also requires that studies are well planned to avoid wasting time and to preclude ambiguous coding schemes (Table 3).

Table 3. Characteristics of the think aloud method

When:     System design/implementation stage (examples: [7,58-75])
How:      End users perform a series of tasks in interaction with the system while verbalizing their thoughts; the protocols are analysed afterwards by experts (human factors and system domain professionals); 5-8 end users
Input:    Clearly defined user profile(s); user instructions; representative tasks; audio/video recordings (coding scheme)
Output:   Verbal protocols; usability problems at a certain stage in the user-interaction cycle
Benefits: Rich source of data; detailed insight into the WHY of usability problems
Limits:   Time consuming; verbal protocols do not tap all thinking; results affected by evaluator skills, representativeness of the user group and task selection

The think aloud method has been criticized, particularly with respect to the validity and reliability of the verbal reports it generates, since the experimenter may disturb the subject's cognitive processes [42,43]. To assure the validity and reliability of the verbal data, a prerequisite of the think aloud method is that usability experts focus on collecting hard data only: what the subject attends to and in what order, not a subject's introspection, inference or opinion in evaluating a system's usability. Most evaluator interruptions aimed at eliciting these types of information during subject task behavior should therefore be considered a source of error, potentially leading to distorted reports [42]. Evaluator instructions requiring a subject to explain his thoughts may redirect a subject's attention to a self-evaluation of his procedures and thus indeed change the thought process itself. Redirecting the user's attention and changing the thought process can be thought of as a kind of experimenter bias, which can greatly influence the reliability and validity of the data. Interventions can, however, be broken down into two types: reminders to keep talking and probes used to elicit additional information from subjects. It has been shown that, as long as the experimenter minimizes interventions in the process of verbalizing and merely reminds the subject to keep talking when the subject stops verbalizing his thoughts, the ongoing cognitive processes are not disturbed. Other, more intrusive types of probes to gather even more useful information should not be used. If researchers seek additional information from subjects, they should collect it in the form of retrospective reports after task completion to avoid any interruptions of task flow [38]. However, since subjects may not be aware of what they actually were doing while performing a task or interacting with a computer, the usefulness of evaluation measures that rely on retrospective self-reports only is limited.

The advantage of thinking aloud as a data-eliciting method is that the resulting reports provide a detailed account of the whole process of a subject executing a task as it unfolds. This is why the concurrent think aloud method, which requires subjects to verbalize their thoughts while performing a computer-supported task, is more popular than the retrospective think aloud method, which asks subjects to report on their thoughts after performing the computer-assisted tasks silently. A study comparing the concurrent with the retrospective think aloud condition indeed found that the latter results in considerably fewer and different verbalizations than the concurrent think aloud condition [46].
Concurrent think aloud provides more complete and detailed descriptions of the subject's cognitions in the computer interaction process than the retrospective think aloud. Moreover, the concurrent think aloud condition seems to generate more usability problems than the retrospective think aloud condition [47]. As with retrospective self-reports, many of the concerns about the retrospective think aloud center on forgetting and the distortion or fabrication of the subjects' cognitions during task performance. A case can nevertheless be made for a retrospective think aloud session after a concurrent think aloud session, to give participants more opportunity to report on usability problems that are not directly task-related [46].

As described, the methodology of the concurrent think aloud requires that the usability tester does not interfere with test users' thought processes, apart from reminders to keep the subject talking. However, because of the nature of the usability testing environment, with complex interfaces and incomplete or error-prone prototypes, usability evaluators are often forced to intervene, for example in order to advance the user test. But even then, usability experts still need to agree on how to intervene in these circumstances in a way that reduces the threat to the verbal data's reliability and validity. To deal with these kinds of situations, a modification to the think aloud method has been proposed by Boren and Ramey [42]. This modified think aloud method takes into account the social nature of speech but works towards the same goal as the formal think aloud procedure proposed by Ericsson and Simon, that is: eliciting a verbal report that is as undirected, undisturbed and constant as possible. This new approach suggests that acknowledgement tokens can be used by usability practitioners to provide subjects with the more natural but still non-intrusive response expected from an engaged listener (the experimenter) while still promoting the participant's speakership. A comparison of Ericsson and Simon's approach with Boren and Ramey's approach in a usability test of a website showed that the process of thinking aloud while carrying out tasks was not affected by the type of approach, but that users' task performance differed. More tasks were completed in the Boren and Ramey condition and subjects were less lost. Nevertheless, the think aloud approaches did not differ in the number of different usability flaws revealed [46]. From these first results it can be concluded that the modified think aloud approach is a valid alternative for eliciting test users' thought processes in a non-intrusive way.

The think aloud has proven to identify about one third of the usability problems detected by a heuristic evaluation, but proportionally more severe and recurring problems than specific and low-priority problems [29], assuring the reliability of its findings. In comparison to a cognitive walkthrough, the think aloud likewise revealed significantly more problems of a severe and recurring nature [29,44,45]. These differences in results might be explained by the fact that the analysis of a think aloud is based on fairly large bodies of data from about 8 subjects, whereas a heuristic evaluation is normally conducted by no more than 4 reviewers and a cognitive walkthrough by no more than 2. Moreover, the three techniques differ in defining the user background setting. In a heuristic evaluation, users' background characteristics are normally not taken into account, and, as described, the cognitive walkthrough provides no guidance on how to define user backgrounds and how these descriptions should be taken into account in the usability evaluation. In a think aloud, the users themselves provide this input to the technique through a questionnaire and by verbalizing their thoughts while they interact with the system. These questionnaires, the subjects' verbalizations and the user interactions with the system all provide a useful context for identifying factual system deficiencies and the reasons why these deficiencies pose difficulties to users. In conclusion, the think aloud method has proven its worth and reliability in detecting severe and recurring system usability problems and their underlying causes, but, in comparison to the heuristic evaluation and cognitive walkthrough, at a rather high cost.

5. Expert-based and user-based usability tests in the health care domain

5.1. Examples of the heuristic evaluation

Heuristic evaluations have been performed to assess the usability of electronic health record systems [48], web-based interfaces for telemedicine applications [49,50], medical library websites and online educational websites [51], infusion pumps [52], pacemaker programmers [53], web-based clinical information systems [54] and management systems for dental practice [55]. It seems feasible to train clinicians to perform heuristic evaluations, and they may detect domain-specific usability problems residing in health information systems that may not be found by usability experts [56], which underlines the value of including double experts as evaluators in a heuristic evaluation.
In a study assessing four different management systems for dental practice for usability with a heuristic evaluation, it was likewise found that the three double experts (both usability and domain experts) not only revealed violations of the set of heuristics used but also detected usability problems that were related to the domain of system application [55], confirming the role double experts can fulfill in tackling domain-specific problems. Modifications to the traditional heuristic evaluation have been proposed so that it can also be applied to medical devices and used to evaluate the patient safety of those devices efficiently and at low cost [57].

Heuristic evaluations have often been combined with think aloud end user testing, for example in usability tests of medical websites [58-62], patient self-care management systems [63,64] and clinical trial management systems [65]. A usability assessment of a website for indexing and cataloguing other medical websites, comprising both a heuristic evaluation and think aloud sessions, revealed that the added value of the think aloud sessions was in providing hints about the causes and severity of problems detected by the heuristic evaluation [59]. In two other studies, the heuristic evaluation was used to design a first version of the user interface for a patient information website, after which patients evaluated its usability by thinking aloud, revealing that end user testing was still irreplaceable for uncovering remaining problems in the design of the website [60,61]. Two studies likewise showed that think aloud sessions after a heuristic evaluation can still reveal usability problems in the design of a computer application, in these cases a patient-tailored system for self-care management of HIV/AIDS [63] and an online cancer information website [64]. In the evaluation of the self-care HIV/AIDS management application, the expert heuristic evaluators seemingly focused more on problems related to information design, whereas the patients had more concerns related to system navigation, access and computer operating functions [63]. A somewhat similar result was found in the evaluation of the online cancer information website; the think aloud sessions with end users revealed problems with navigation through the website that were not revealed by the preceding heuristic evaluations [64]. The think aloud sessions were thus helpful in clarifying problems related to the way users had to interact with these two applications and in revealing mismatches between designers' ideas and users' actual strategies in approaching the computer-supported tasks.

5.2. Examples of the cognitive walkthrough

To characterize the way a clinical knowledge system designed to support therapeutic decision making mediated clinicians' reasoning, Cohen et al. [66] combined the cognitive walkthrough with end user think aloud sessions. This study revealed that all clinicians were able to use the system effectively, but that expert clinicians outperformed the other clinicians in the construction of coherent patient representations. These results point to the value of combining the cognitive walkthrough with the think aloud method: the cognitive walkthrough can be used to construct a general cognitive model of how users would approach a task, and the think aloud can add to this understanding by analysing differences in the actual cognitive information processing of computer users who vary in their domain expertise. Two other studies on Computerized Physician Order Entry systems (CPOEs) likewise combined the cognitive walkthrough with think aloud end user sessions to analyse end users' cognitive task behavior and their interactive strategies; both studies showed that mismatches between clinicians' mental models of ordering and a certain CPOE system's interaction structure can lead to faulty orders, missed orders and delayed orders [7,67]. To support the choice and acquisition process of all sorts of clinical information systems, Beuscart-Zéphir et al. [68] have proposed a multi-dimensional assessment framework, including the cognitive walkthrough and think aloud user test sessions, which seems to provide a highly reliable basis for deciding on a clinical application.

5.3. Examples of the think aloud method

Of the three methods described in this contribution, the think aloud procedure has been used most often in the health care domain. Besides in combination with the heuristic evaluation [58-65] or the cognitive walkthrough [7,66-68], the think aloud method has been used individually to assess the usability of CPOEs [69], Internet-based care management tools [70], and mobile devices [71,72]. In investigating how CPOEs could be made more usable for writing admission orders, Johnson et al. [69] used the think aloud procedure first to gain an understanding of physicians' mental models of order writing, secondly to analyse how CPOEs' conceptual models conflict with these mental models, and thirdly to provide recommendations for CPOE (re)design. Two other studies applied the think aloud procedure to detect usability problems with handheld applications: a mobile electronic medical record system [71] and a prescription writing application [72]. Both studies revealed results similar to those of [7,63,64,67]: physicians experienced difficulties in interacting with the applications, and their actual errors in entering (prescription) data were closely associated with usability flaws in the applications. Qiu and Yu [73] used think aloud sessions not to evaluate a computer application's usability but to assess novice system users' training needs in learning to use a nursing information system, and identified three types of knowledge deficiencies to be remediated by the computer system training. Besides health care professionals, patients have been involved in usability testing of web-based patient education information [74]. Other studies have applied the think aloud procedure not to test computer applications for usability but to analyse the requirements for a user interface of computerized clinical applications (see for example [75]). An experimental comparison study of the original think aloud approach [38] and the adapted approach [42], focused on usability problems of a website, indeed showed that the number of different navigation problems found did not differ between these two methods, but that subjects were less lost in the adapted approach [76].

6. Discussion

Evaluation of the usability of interactive health care computer applications is an essential step in human-centered design, and several usability inspection and testing methods, both expert-based and user-based, have evolved from human-computer interaction research.
In comparison to user-based testing, expert-based usability inspection methods are less expensive, can be used to test early system mock-ups or prototypes, and are therefore more readily applied and integrated into a system development process. However, usability experts apparently seldom have enough knowledge of the users' work domain to evaluate, for example, whether the system interaction structure follows users' task flow. The validity of the outcomes of expert-based inspections thus depends not only on the usability experience of the experts conducting these evaluations, but also on their knowledge of the specific work domain in which the system is to be implemented.

As an expert-based usability method, the heuristic evaluation is easy to understand and to apply and produces relatively good results at rather low cost. A heuristic evaluation process is unstructured and there is no specific focus on user characteristics or tasks: evaluators are only guided by the list of heuristics. These two aspects are the less advantageous characteristics of the heuristic evaluation and may explain the broad range of low-priority ('cosmetic') and specific, non-recurring problems found. The large numbers of low-priority and non-recurring problems revealed by a heuristic evaluation run the risk of overshadowing the more severe flaws and at times may have to be filtered out before the most critical usability problems can be identified and repaired. These are the main reasons for including several skilled usability professionals in heuristic system evaluations. But even such experts may not show large overlap in the usability problems detected and might produce more reliable results if they were also familiar with the domain of system implementation. In most circumstances, however, these types of professionals, so-called double experts, are scarce, difficult to recruit or simply not available, which may confine the application of the heuristic evaluation.

As an alternative, the cognitive walkthrough provides a more structured expert-based approach that is guided by a set of questions and a list of user tasks. Early versions of the cognitive walkthrough required a thorough understanding of the cognitive theory underlying the method and therefore a background in cognitive psychology. The most recent version provides evaluators with detailed instructions on how to conduct the technique [34], broadening the scope of a cognitive walkthrough beyond application by cognitive psychologists. Although less-skilled evaluators are better guided in the current version, it requires detailed descriptions of the correct action sequences necessary to accomplish each user task as input. These detailed action sequences at the same time appear to discourage exploration of the system, limiting the evaluator's ability to find problems not directly related to the user tasks being performed by the evaluator. The task description likewise affects the results of a cognitive walkthrough test. A method that less-experienced reviewers could use to define these tasks based on the model underlying the cognitive walkthrough would therefore be a useful addition to the technique. Moreover, even the recent version of the cognitive walkthrough method does not prescribe in detail how the user background is to be defined. How certain user characteristics