The Reliability of Usability Tests: Black Art or Science?
Paul Naddaff
Bentley University


1.1 INTRODUCTION

The field of user-centered design research can be broadly divided into three main categories: testing, inspection, and inquiry. Each category employs various methods that can be used to evaluate how well a given design supports a user in completing tasks, to assess the general usability of a system, or to explore user needs and understandings of that system. In this paper, we will focus primarily on the category of testing and discuss the reliability of its core method, the usability test.

1.2 USABILITY TESTING OVERVIEW

Usability testing (UT) has long been viewed as a core method of user-centered design research. In a usability test, a testing team selects representative users to interact with a system or design, either to accomplish a set list of tasks or to explore it openly (Dumas & Redish, 1999). While the tasks are being attempted, the practitioners observe and record the specific experiences, both positive and negative, each user has in completing a given task. This data is later analyzed, and feedback is delivered to the development team and other stakeholders.

Broadly speaking, usability testing as a data collection method is not fully reliable on its own. As Wilson (2006) points out, usability testing is one of many methods used in user research. He emphasizes that using multiple methods can provide different perspectives on the same problems, and that the convergence of those methods can lead to true, actionable, and reliable insight. Relying too heavily on any one method, particularly usability testing, is unsound because many aspects of a given system may not be evaluated thoroughly enough, if at all, and problems will go undetected.
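
To make the triangulation idea concrete, the sketch below cross-checks issue lists from several methods and keeps the issues that more than one method surfaced. The method names, issue labels, and the two-method threshold are illustrative assumptions, not something prescribed by Wilson.

    from collections import Counter

    # Hypothetical findings from three evaluation methods (illustrative labels only).
    findings = {
        "usability_test": {"confusing checkout", "hidden search", "slow page load"},
        "heuristic_evaluation": {"confusing checkout", "inconsistent labels", "hidden search"},
        "user_survey": {"slow page load", "confusing checkout"},
    }

    # Count how many methods surfaced each issue.
    votes = Counter(issue for issues in findings.values() for issue in issues)

    # Issues surfaced by two or more methods are treated as converged, higher-confidence findings.
    converged = sorted(issue for issue, n in votes.items() if n >= 2)
    print(converged)  # ['confusing checkout', 'hidden search', 'slow page load']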

Based on the literature, there are many reasons why a usability test may not be as reliable as practitioners once thought. Designing something that is simple to use is extremely difficult, and there is no formula for it. Usability testing attempts to perfect the design of something that will be used by imperfect beings; it is no wonder that it is not perfectly reliable. That being said, we will examine specific reasons for the lack of reliability, and potential methods to make usability testing more reliable. The main areas that we will discuss are:

1. The evaluator effect
2. Synchronization between client and testing team
3. Task selection and formulation
4. Who to test?
5. Problem severity
6. Making change happen

2.1 THE EVALUATOR EFFECT

The evaluator effect refers to the way multiple evaluators record differing findings when analyzing the same usability test sessions or when evaluating the same system (in the case of expert evaluation). Humans, no matter how well trained, make errors and vary in their judgments. Many studies have examined this effect scientifically (Jacobsen, 1998; Kessner, 2001; Lindgaard, 2007; Molich, 1998-2011). The validation of the evaluator effect sent shockwaves through the foundations of the usability testing community because it suggests that the method as a whole is not as repeatable as once thought. For a method to be considered reliable in the scientific community, it must generate reproducible results; this is an imperative property of an established methodology (Kessner, 2001).

Molich has led a series of studies targeted specifically at analyzing the different methods used by various professional usability testing labs. The Comparative Usability Evaluation (CUE) studies have consistently shown striking differences in approach, reporting, and findings between the labs (Molich, 1998), amounting to a surprising amount of variability in the results of usability tests. For example, in CUE-4, only 9 of the 340 distinct usability issues detected by the 17 participating teams were reported by more than half of the teams (Molich, 2006). 205 of the issues (60%) were reported by only a single team; no other team independently found them. Statistics like these run throughout the CUE studies as well as the similar studies cited earlier.
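
Overlap statistics of the kind reported for CUE-4 are simple to compute once every team's findings have been coded against a shared issue list. The sketch below uses invented team names and issue codes (not the actual CUE-4 data) to show the two figures cited above: issues reported by more than half of the teams and issues reported by exactly one team.

    from collections import Counter

    # Issue codes reported by each team (invented, not the real CUE-4 data).
    reports = {
        "Team A": {"I-01", "I-02", "I-07"},
        "Team B": {"I-01", "I-03"},
        "Team C": {"I-01", "I-04", "I-05"},
        "Team D": {"I-06"},
    }

    counts = Counter(issue for issues in reports.values() for issue in issues)
    n_teams = len(reports)

    majority = [i for i, n in counts.items() if n > n_teams / 2]  # found by more than half the teams
    unique = [i for i, n in counts.items() if n == 1]             # found by exactly one team

    print(len(counts), "distinct issues reported")   # 7
    print(len(majority), "reported by a majority")   # 1
    print(len(unique), "reported by only one team")  # 6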

The most important aspect of the evaluator effect is to know that it exists and to act accordingly. As mentioned earlier, Wilson highlights the importance of using multiple evaluation methods to triangulate, or converge upon, product requirements and issues. Each method a user research team uses has its strengths and weaknesses; it is up to the team to advocate for the resources needed to run the appropriate methods for a given design.

2.2 SYNCHRONIZATION BETWEEN CLIENT AND TESTING TEAM

While close integration between client and testing team is not critical for every evaluation (for instance, an expert evaluation of a competitor's website), specific and focused requests from the client lead to more overlap in the findings of testing teams (Kessner, 2001). Usability testing is not a perfect science, this much is clear, but when dealing with humans, imperfect beings, the more insight a testing team can gather before building a test plan, the better. The client may not always have the deep understanding of their customers that they think they do, but the knowledge they do have should not be discounted. The earlier in the design process this synchronization happens, the better.

2.3 TASK SELECTION AND FORMULATION

One of the most widely discussed, researched, and debated aspects of usability testing is the question of how many test participants are necessary to reveal the maximum number of usability problems in a system over a series of usability tests (Virzi, 1992). The topic has been put to rest and resurrected, with highly regarded experts in the field still not in perfect agreement. The specific number of test participants required, however important it may be, is only part of the equation when designing a reliable usability test. If a research lab had all the time and resources in the world to test millions of users, the test would still not necessarily be perfect.
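
The participant-count question is commonly framed with the cumulative problem-discovery model associated with Virzi's work: if each problem is detected by any single participant with probability p, the expected share of problems found after n participants is 1 - (1 - p)^n. A minimal sketch of that calculation follows; the value p = 0.31 is purely illustrative.

    def expected_discovery(p: float, n: int) -> float:
        """Expected share of problems found after n participants, assuming each
        problem is detected by any one participant with independent probability p
        (the classic 1 - (1 - p)^n problem-discovery model)."""
        return 1.0 - (1.0 - p) ** n

    p = 0.31  # illustrative detection probability; real values vary by product and task set
    for n in (1, 3, 5, 10, 15):
        print(f"{n:2d} participants -> about {expected_discovery(p, n):.0%} of problems")

Even under this optimistic model, the curve says nothing about whether the chosen tasks exercise the parts of the system where the problems actually live, which is what the rest of this section takes up.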

Task selection and formulation is perhaps the most important (and most difficult) aspect of designing a usability test, and evidence of this importance is not hard to find. For example, Cockton and Woolrych (2001) ran concurrent tests of the same system to determine whether multiple passes would uncover new problems, and they did. A task is merely a different way of looking at a system: each task pokes and prods a design in a certain place and in a certain way, revealing strengths and weaknesses. Tasks should be designed to simulate a user's real-world experience with the product. Giving many sets of tasks to a small number of users, rather than giving many users the same limited set of tasks, reliably reveals more issues (Lindgaard, 2001); the simulation sketch following section 2.4 illustrates this tradeoff. The preceding statement assumes the tasks are well designed. A well-designed task focuses on probing a system's usefulness, efficiency, effectiveness, satisfaction, and accessibility (Rubin & Chisnell, 2008).

[Figures omitted from this transcription: "% Problems vs. Task Coverage" (percentage of problems found against the number of user tasks) and "% Problems Found vs. Number of Users" (percentage of problems found against the number of users).]

2.4 WHO TO TEST?

To increase the reliability of a usability test, it is important to test various types of potential users. Novice and experienced users, for example, will each interact with a system in a different way, and each user will bring a different mental model to it; collecting that data helps the development team craft the design around those models. Deliberately selecting non-target users can pay off as well: observing elderly consumers with arthritis while designing kitchen utensils led the design firm Smart Design to an extremely successful collection of utensils for OXO (designcouncil.org).
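
As claimed in section 2.3, spreading a broader set of tasks across fewer users tends to surface more problems than running many users through the same narrow task set. The Monte Carlo sketch below illustrates that claim under stated assumptions: a fixed pool of hidden problems, each problem only detectable when a task that exposes it is attempted, and a fixed per-exposure detection probability. All numbers are invented for illustration and are not taken from Lindgaard's data or from the figures above.

    import random

    random.seed(1)

    N_PROBLEMS = 50   # hidden usability problems in the design
    N_TASKS = 15      # distinct tasks the team has written
    P_EXPOSE = 0.2    # chance that a given task exercises a given problem
    P_DETECT = 0.3    # chance a user trips over a problem when a task exposes it

    # Which problems each task can expose (fixed for the whole comparison).
    exposes = {t: {p for p in range(N_PROBLEMS) if random.random() < P_EXPOSE}
               for t in range(N_TASKS)}

    def run_study(task_sets):
        """task_sets holds one list of task ids per participant; returns problems found."""
        found = set()
        for tasks in task_sets:
            for t in tasks:
                found.update(p for p in exposes[t] if random.random() < P_DETECT)
        return found

    # Study A: 5 users, each with a different draw of 6 tasks (broad task coverage).
    study_a = [random.sample(range(N_TASKS), 6) for _ in range(5)]
    # Study B: 15 users, all given the same 2 tasks (narrow coverage, equal total effort).
    fixed_tasks = random.sample(range(N_TASKS), 2)
    study_b = [fixed_tasks for _ in range(15)]

    print("Broad tasks, few users  :", len(run_study(study_a)), "of", N_PROBLEMS)
    print("Narrow tasks, many users:", len(run_study(study_b)), "of", N_PROBLEMS)

With these particular assumptions, the broad-coverage study typically uncovers roughly twice as many distinct problems for the same thirty task runs, because the narrow study can never find problems that its two tasks simply do not touch.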

2.5 PROBLEM SEVERITY

Even a poor usability test will uncover some problems. However, a testing method that only uncovers minor problems (assuming there are severe problems to be discovered) is not of much use (Lindgaard, 2007). A study by Hertzum et al. (2002) used multiple expert evaluators to examine, in part, whether the evaluator effect applied as much to severe problems as to minor ones. Evaluators in that study were slightly more likely to find severe problems than problems overall (24% vs. 16%, respectively). Another study, by Kessner (2001), tells a slightly more favorable story for the reliability of expert evaluators: severe problems showed more overlap between evaluators than minor problems did. This is important because it illustrates the ability of usability testing to reliably detect severe problems; as mentioned earlier, no single method is going to detect every problem.

Another aspect of problem severity is the question of when a problem is severe enough to count as a problem. This depends strongly on what stage of development the system is in. A relatively early iteration of a system will be more plagued by problems, large and small, than a near-final iteration. Rubin and Chisnell (2008) advise that in early iterations, a task that fewer than 70% of participants complete successfully should be treated as problematic, while in later iterations the testing team should aim for a 95% success rate.
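
Success-rate criteria such as the 70% and 95% thresholds are easy to apply mechanically once task results are tabulated. A minimal sketch with invented task names and counts, flagging tasks that fall below whichever threshold applies to the current iteration:

    # (completed, attempted) per task; names and counts are invented for illustration.
    results = {
        "find a hotel room": (6, 10),
        "change a booking": (9, 10),
        "print a receipt": (10, 10),
    }

    EARLY_THRESHOLD = 0.70  # early design iterations (Rubin & Chisnell, 2008)
    LATE_THRESHOLD = 0.95   # near-final iterations

    def problematic_tasks(results, threshold):
        """Return the tasks whose success rate falls below the given threshold."""
        return [task for task, (done, tried) in results.items() if done / tried < threshold]

    print(problematic_tasks(results, EARLY_THRESHOLD))  # ['find a hotel room']
    print(problematic_tasks(results, LATE_THRESHOLD))   # ['find a hotel room', 'change a booking']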

This focus on defining severe problems helps increase reliability by providing a firm framework for testing teams to work within. The issue relates closely to section 2.6 below.

2.6 MAKING CHANGE HAPPEN

In Molich's CUE-4, 340 usability problems were found on the hotel website: 340 different issues that a development team is being asked to address. It is possible that the practitioners in this case were being too sensitive (Molich & Dumas, 2006). At a certain point, a testing team has to decide which problems it wants to press and which are not worth fighting for. If somebody told you they had 340 unique issues with something you created, you might not be open to hearing what they have to say. The ultimate goal of usability testing is to make usability changes happen, not just to test and report. While usability testing is part art and part science, delivering the results of a test productively requires tact. Making user-centric change does not happen by means of any one test, but it is the raison d'être of the field, and so the ability of practitioners to reliably make change happen cannot be overlooked. It is important to build partnerships with internal teams in marketing, engineering, and management early in the design process, and it is advisable to apply usability methods to those internal teams in order to learn more about their goals, priorities, customer contacts, and customer data; doing so can aid the strategic penetration of usability within organizations (Rosenbaum, 2000).

3.1 CONCLUSION

This paper has presented numerous facts and pieces of evidence that paint a relatively bleak picture of the reliability of usability testing. Some endearingly call it a black art (Lindgaard, 2000), while others continue to revere it as the gold standard of user-centered design research. Regardless of how reliable it is or is not, we keep using it. Most advanced fields are at first considered a concoction of magic and science until they are fully understood.

While usability testing is not the silver bullet of the design process, it provides a large amount of information about a design, much of it useful. In the cause of usability, doing something is almost always better than doing nothing (Gray & Salzman, 1998). Today, many companies still appear to do no user research at all before launching a product; usability testing has proven itself to be extremely valuable despite its drawbacks. By taking the considerations above into account, usability testing teams will be more likely to produce reliable and repeatable results.

REFERENCES

Cockton, G. & Woolrych, A. (2001). Understanding inspection methods: Lessons from an assessment of heuristic evaluation. In A. Blandford & J. Vanderdonckt (Eds.), People & Computers XV (pp. 171-191). Springer-Verlag.

Dumas, J.S. & Redish, J.C. (1999). A practical guide to usability testing (Rev. ed.). Exeter, UK: Intellect.

Hertzum, M. & Jacobsen, N.E. (2003). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 15(1), 183-204. Lawrence Erlbaum Associates.

Jacobsen, N.E., Hertzum, M., & John, B.E. (1998). The evaluator effect in usability studies: Problem detection and severity judgments. Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting, 5-9 October 1998 (Santa Monica: Human Factors and Ergonomics Society), pp. 1336-1340.

Kessner, M., Wood, J., Dillon, R.F., & West, R.L. (2001). On the reliability of usability testing. Conference on Human Factors in Computing Systems: CHI 2001, 31 March - 5 April 2001 (extended abstracts) (Seattle: ACM Press), pp. 97-98.

Molich, R., Bevan, N., Butler, S., Curson, I., Kindlund, E., Kirakowski, J., & Miller, D. (1998). Comparative evaluation of usability tests. Usability Professionals' Association 1998 Conference, 22-26 June 1998 (Washington, DC: Usability Professionals' Association), pp. 189-200.

Molich, R. & Dumas, J.S. (in press). Comparative Usability Evaluation (CUE-4). Behaviour & Information Technology. Taylor & Francis.

OXO (2011). Retrieved 10/30/11 from http://www.designcouncil.org.uk/case-studies/oxo-Good-Grips/User-research/

Rosenbaum, S., Rohn, J., & Humburg, J. (2000). A toolkit for strategic usability: Results from workshops, panels, and surveys. Conference on Human Factors in Computing Systems: CHI 2000 (New York: ACM Press), pp. 337-344.

Rubin, J. & Chisnell, D. (2008). Handbook of usability testing: How to plan, design and conduct effective tests (2nd ed.). Indianapolis, IN: Wiley Publishing, Inc.

Virzi, R.A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-468.

Wilson, C. (2006). Triangulation: The explicit use of multiple methods, measures, and approaches for determining core issues in product development. Interactions (Nov.-Dec. 2006).