The Reliability of Usability Tests: Black Art or Science?
Paul Naddaff
Bentley University


1.1 INTRODUCTION

The field of user-centered design research can be broadly divided into three main categories: testing, inspection, and inquiry. Each category employs various methods that can be used to evaluate how well a given design supports a user in completing tasks, to assess the general usability of a system, or to explore user needs and understandings of that system. In this paper, we will focus primarily on the category of testing and discuss the reliability of its core method, the usability test.

1.2 USABILITY TESTING OVERVIEW

Usability testing (UT) has long been viewed as a core method of user-centered design research. In a usability test, a testing team selects representative users to interact with a system or design, either to accomplish a set list of tasks or to explore it openly (Dumas & Redish, 1999). While the tasks are being attempted, the practitioners observe and record the specific experiences, both positive and negative, each user has in completing a given task. This data is later analyzed, and feedback is delivered to the development team and other stakeholders.

Broadly speaking, usability testing as a data collection method is not fully reliable on its own. As Wilson (2006) points out, usability testing is one of many methods used in user research. He emphasizes that using multiple methods can provide different perspectives on the same problems, and that the convergence of those methods can lead to true, actionable, and reliable insight. Relying too heavily on any one method, particularly usability testing, is unsound because many aspects of a given system may not be evaluated thoroughly enough, if at all, and problems will go undetected.
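
To make the triangulation idea concrete, the sketch below cross-checks issue lists from several methods and keeps the issues that more than one method surfaced. The method names, issue labels, and the two-method threshold are illustrative assumptions, not something prescribed by Wilson.

    from collections import Counter

    # Hypothetical findings from three evaluation methods (illustrative labels only).
    findings = {
        "usability_test": {"confusing checkout", "hidden search", "slow page load"},
        "heuristic_evaluation": {"confusing checkout", "inconsistent labels", "hidden search"},
        "user_survey": {"slow page load", "confusing checkout"},
    }

    # Count how many methods surfaced each issue.
    votes = Counter(issue for issues in findings.values() for issue in issues)

    # Issues surfaced by two or more methods are treated as converged, higher-confidence findings.
    converged = sorted(issue for issue, n in votes.items() if n >= 2)
    print(converged)  # ['confusing checkout', 'hidden search', 'slow page load']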

Based on the literature, there are many reasons why a usability test may not be as reliable as practitioners once thought. Designing something that is simple to use is extremely difficult, and there is no formula for it. Usability testing attempts to perfect the design of something that will be used by imperfect beings; it is no wonder that it is not perfectly reliable. That being said, we will examine specific reasons for the lack of reliability, and potential methods to make usability testing more reliable. The main areas that we will discuss are:

1. The evaluator effect
2. Synchronization between client and testing team
3. Task selection and formulation
4. Who to test?
5. Problem severity
6. Making change happen

2.1 THE EVALUATOR EFFECT

The evaluator effect refers to the way multiple evaluators record differing findings when analyzing the same usability test sessions or when evaluating the same system (in the case of expert evaluation). Humans, no matter how well trained, make errors and vary in their judgments. Many studies have examined this effect scientifically (Jacobsen, 1998; Kessner, 2001; Lindgaard, 2007; Molich, 1998-2011). The validation of the evaluator effect sent shockwaves through the foundations of the usability testing community because it suggests that the method as a whole is not as repeatable as once thought. For a method to be considered reliable in the scientific community, it must generate reproducible results; this is an imperative property of an established methodology (Kessner, 2001).

Molich has led a series of studies targeted specifically at analyzing the different methods used by various professional usability testing labs. The Comparative Usability Evaluation (CUE) studies have consistently shown striking differences in approach, reporting, and findings between the labs (Molich, 1998), amounting to a surprising amount of variability in the results of usability tests. For example, in CUE-4, only 9 of the 340 distinct usability issues detected by the 17 participating teams were reported by more than half of the teams (Molich, 2006). 205 of the issues (60%) were reported by only a single team; no other team independently found them. Statistics like these run throughout the CUE studies as well as the similar studies cited earlier.
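
Overlap statistics of the kind reported for CUE-4 are simple to compute once every team's findings have been coded against a shared issue list. The sketch below uses invented team names and issue codes (not the actual CUE-4 data) to show the two figures cited above: issues reported by more than half of the teams and issues reported by exactly one team.

    from collections import Counter

    # Issue codes reported by each team (invented, not the real CUE-4 data).
    reports = {
        "Team A": {"I-01", "I-02", "I-07"},
        "Team B": {"I-01", "I-03"},
        "Team C": {"I-01", "I-04", "I-05"},
        "Team D": {"I-06"},
    }

    counts = Counter(issue for issues in reports.values() for issue in issues)
    n_teams = len(reports)

    majority = [i for i, n in counts.items() if n > n_teams / 2]  # found by more than half the teams
    unique = [i for i, n in counts.items() if n == 1]             # found by exactly one team

    print(len(counts), "distinct issues reported")   # 7
    print(len(majority), "reported by a majority")   # 1
    print(len(unique), "reported by only one team")  # 6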

The most important aspect of the evaluator effect is to know that it exists and to act accordingly. As mentioned earlier, Wilson highlights the importance of using multiple evaluation methods to triangulate, or converge upon, product requirements and issues. Each method a user research team uses has its strengths and weaknesses; it is up to the team to advocate for the resources needed to run the appropriate methods for a given design.

2.2 SYNCHRONIZATION BETWEEN CLIENT AND TESTING TEAM

While close integration between client and testing team is not critical for every evaluation (for instance, an expert evaluation of a competitor's website), specific and focused requests from the client lead to more overlap in the findings of testing teams (Kessner, 2001). Usability testing is not a perfect science, this much is clear, but when dealing with humans, imperfect beings, the more insight a testing team can gather before building a test plan, the better. The client may not always have the deep understanding of their customers that they think they do, but the knowledge they do have should not be discounted. The earlier in the design process this synchronization happens, the better.

2.3 TASK SELECTION AND FORMULATION

One of the most widely discussed, researched, and debated aspects of usability testing is the question of how many test participants are necessary to reveal the maximum number of usability problems in a system over a series of usability tests (Virzi, 1992). The topic has been put to rest and resurrected, with highly regarded experts in the field still not in perfect agreement. The specific number of test participants required, however important it may be, is only part of the equation when designing a reliable usability test. If a research lab had all the time and resources in the world to test millions of users, the test would still not necessarily be perfect.
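
The participant-count question is commonly framed with the cumulative problem-discovery model associated with Virzi's work: if each problem is detected by any single participant with probability p, the expected share of problems found after n participants is 1 - (1 - p)^n. A minimal sketch of that calculation follows; the value p = 0.31 is purely illustrative.

    def expected_discovery(p: float, n: int) -> float:
        """Expected share of problems found after n participants, assuming each
        problem is detected by any one participant with independent probability p
        (the classic 1 - (1 - p)^n problem-discovery model)."""
        return 1.0 - (1.0 - p) ** n

    p = 0.31  # illustrative detection probability; real values vary by product and task set
    for n in (1, 3, 5, 10, 15):
        print(f"{n:2d} participants -> about {expected_discovery(p, n):.0%} of problems")

Even under this optimistic model, the curve says nothing about whether the chosen tasks exercise the parts of the system where the problems actually live, which is what the rest of this section takes up.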

Task selection and formulation is perhaps the most important (and most difficult) aspect of designing a usability test, and evidence of this importance is not hard to find. For example, Cockton and Woolrych (2001) ran concurrent tests of the same system to determine whether multiple passes would uncover new problems, and they did. A task is merely a different way of looking at a system: each task pokes and prods a design in a certain place and in a certain way, revealing strengths and weaknesses. Tasks should be designed to simulate a user's real-world experience with the product. Giving many sets of tasks to a small number of users, rather than giving many users the same limited set of tasks, reliably reveals more issues (Lindgaard, 2001); the simulation sketch following section 2.4 illustrates this tradeoff. The preceding statement assumes the tasks are well designed. A well-designed task focuses on probing a system's usefulness, efficiency, effectiveness, satisfaction, and accessibility (Rubin & Chisnell, 2008).

[Figures omitted from this transcription: "% Problems vs. Task Coverage" (percentage of problems found against the number of user tasks) and "% Problems Found vs. Number of Users" (percentage of problems found against the number of users).]

2.4 WHO TO TEST?

To increase the reliability of a usability test, it is important to test various types of potential users. Novice and experienced users, for example, will each interact with a system in a different way, and each user will bring a different mental model to it; collecting that data helps the development team craft the design around those models. Deliberately selecting non-target users can pay off as well: observing elderly consumers with arthritis while designing kitchen utensils led the design firm Smart Design to an extremely successful collection of utensils for OXO (designcouncil.org).
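
As claimed in section 2.3, spreading a broader set of tasks across fewer users tends to surface more problems than running many users through the same narrow task set. The Monte Carlo sketch below illustrates that claim under stated assumptions: a fixed pool of hidden problems, each problem only detectable when a task that exposes it is attempted, and a fixed per-exposure detection probability. All numbers are invented for illustration and are not taken from Lindgaard's data or from the figures above.

    import random

    random.seed(1)

    N_PROBLEMS = 50   # hidden usability problems in the design
    N_TASKS = 15      # distinct tasks the team has written
    P_EXPOSE = 0.2    # chance that a given task exercises a given problem
    P_DETECT = 0.3    # chance a user trips over a problem when a task exposes it

    # Which problems each task can expose (fixed for the whole comparison).
    exposes = {t: {p for p in range(N_PROBLEMS) if random.random() < P_EXPOSE}
               for t in range(N_TASKS)}

    def run_study(task_sets):
        """task_sets holds one list of task ids per participant; returns problems found."""
        found = set()
        for tasks in task_sets:
            for t in tasks:
                found.update(p for p in exposes[t] if random.random() < P_DETECT)
        return found

    # Study A: 5 users, each with a different draw of 6 tasks (broad task coverage).
    study_a = [random.sample(range(N_TASKS), 6) for _ in range(5)]
    # Study B: 15 users, all given the same 2 tasks (narrow coverage, equal total effort).
    fixed_tasks = random.sample(range(N_TASKS), 2)
    study_b = [fixed_tasks for _ in range(15)]

    print("Broad tasks, few users  :", len(run_study(study_a)), "of", N_PROBLEMS)
    print("Narrow tasks, many users:", len(run_study(study_b)), "of", N_PROBLEMS)

With these particular assumptions, the broad-coverage study typically uncovers roughly twice as many distinct problems for the same thirty task runs, because the narrow study can never find problems that its two tasks simply do not touch.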

2.5 PROBLEM SEVERITY

Even a poor usability test will uncover some problems. However, a testing method that only uncovers minor problems (assuming there are severe problems to be discovered) is not of much use (Lindgaard, 2007). A study by Hertzum et al. (2002) used multiple expert evaluators to examine, in part, whether the evaluator effect applied as much to severe problems as to minor ones. Evaluators in that study were slightly more likely to find severe problems than problems overall (24% vs. 16%, respectively). Another study, by Kessner (2001), tells a slightly more favorable story for the reliability of expert evaluators: severe problems showed more overlap between evaluators than minor problems did. This is important because it illustrates the ability of usability testing to reliably detect severe problems; as mentioned earlier, no single method is going to detect every problem.

Another aspect of problem severity is the question of when a problem is severe enough to count as a problem. This depends strongly on what stage of development the system is in. A relatively early iteration of a system will be more plagued by problems, large and small, than a near-final iteration. Rubin and Chisnell (2008) advise that in early iterations, a task that fewer than 70% of participants complete successfully should be treated as problematic, while in later iterations the testing team should aim for a 95% success rate.
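
Success-rate criteria such as the 70% and 95% thresholds are easy to apply mechanically once task results are tabulated. A minimal sketch with invented task names and counts, flagging tasks that fall below whichever threshold applies to the current iteration:

    # (completed, attempted) per task; names and counts are invented for illustration.
    results = {
        "find a hotel room": (6, 10),
        "change a booking": (9, 10),
        "print a receipt": (10, 10),
    }

    EARLY_THRESHOLD = 0.70  # early design iterations (Rubin & Chisnell, 2008)
    LATE_THRESHOLD = 0.95   # near-final iterations

    def problematic_tasks(results, threshold):
        """Return the tasks whose success rate falls below the given threshold."""
        return [task for task, (done, tried) in results.items() if done / tried < threshold]

    print(problematic_tasks(results, EARLY_THRESHOLD))  # ['find a hotel room']
    print(problematic_tasks(results, LATE_THRESHOLD))   # ['find a hotel room', 'change a booking']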

This focus on defining severe problems helps increase reliability by providing a firm framework for testing teams to work within. The issue relates closely to section 2.6 below.

2.6 MAKING CHANGE HAPPEN

In Molich's CUE-4, 340 usability problems were found on the hotel website: 340 different issues that a development team is being asked to address. It is possible that the practitioners in this case were being too sensitive (Molich & Dumas, 2006). At a certain point, a testing team has to decide which problems it wants to press and which are not worth fighting for. If somebody told you they had 340 unique issues with something you created, you might not be open to hearing what they have to say. The ultimate goal of usability testing is to make usability changes happen, not just to test and report. While usability testing is part art and part science, delivering the results of a test productively requires tact. Making user-centric change does not happen by means of any one test, but it is the raison d'être of the field, and so the ability of practitioners to reliably make change happen cannot be overlooked. It is important to build partnerships with internal teams in marketing, engineering, and management early in the design process, and it is advisable to apply usability methods to those internal teams in order to learn more about their goals, priorities, customer contacts, and customer data; doing so can aid the strategic penetration of usability within organizations (Rosenbaum, 2000).

3.1 CONCLUSION

This paper has presented numerous facts and pieces of evidence that paint a relatively bleak picture of the reliability of usability testing. Some endearingly call it a black art (Lindgaard, 2000), while others continue to revere it as the gold standard of user-centered design research. Regardless of how reliable it is or is not, we keep using it. Most advanced fields are at first considered a concoction of magic and science until they are fully understood.

While usability testing is not the silver bullet of the design process, it provides a large amount of information about a design, much of it useful. In the cause of usability, doing something is almost always better than doing nothing (Gray & Salzman, 1998). Today, many companies still appear to do no user research at all before launching a product; usability testing has proven itself to be extremely valuable despite its drawbacks. By taking the considerations above into account, usability testing teams will be more likely to produce reliable and repeatable results.

REFERENCES

Cockton, G. & Woolrych, A. (2001). Understanding inspection methods: Lessons from an assessment of heuristic evaluation. In A. Blandford & J. Vanderdonckt (Eds.), People & Computers XV (pp. 171-191). Springer-Verlag.

Dumas, J.S. & Redish, J.C. (1999). A practical guide to usability testing (Rev. ed.). Exeter, UK: Intellect.

Hertzum, M. & Jacobsen, N.E. (2003). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 15(1), 183-204. Lawrence Erlbaum Associates.

Jacobsen, N.E., Hertzum, M., & John, B.E. (1998). The evaluator effect in usability studies: Problem detection and severity judgments. Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting, 5-9 October 1998 (Santa Monica: Human Factors and Ergonomics Society), pp. 1336-1340.

Kessner, M., Wood, J., Dillon, R.F., & West, R.L. (2001). On the reliability of usability testing. Conference on Human Factors in Computing Systems: CHI 2001, 31 March - 5 April 2001 (extended abstracts) (Seattle: ACM Press), pp. 97-98.

Molich, R., Bevan, N., Butler, S., Curson, I., Kindlund, E., Kirakowski, J., & Miller, D. (1998). Comparative evaluation of usability tests. Usability Professionals' Association 1998 Conference, 22-26 June 1998 (Washington, DC: Usability Professionals' Association), pp. 189-200.

Molich, R. & Dumas, J.S. (in press). Comparative Usability Evaluation (CUE-4). Behaviour & Information Technology. Taylor & Francis.

OXO (2011). Retrieved 10/30/11 from http://www.designcouncil.org.uk/case-studies/oxo-Good-Grips/User-research/

Rosenbaum, S., Rohn, J., & Humburg, J. (2000). A toolkit for strategic usability: Results from workshops, panels, and surveys. Conference on Human Factors in Computing Systems: CHI 2000 (New York: ACM Press), pp. 337-344.

Rubin, J. & Chisnell, D. (2008). Handbook of usability testing: How to plan, design and conduct effective tests (2nd ed.). Indianapolis, IN: Wiley Publishing, Inc.

Virzi, R.A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-468.

Wilson, C. (2006). Triangulation: The explicit use of multiple methods, measures, and approaches for determining core issues in product development. Interactions (Nov.-Dec. 2006).