CogTool-Explorer: A Model of Goal-Directed User Exploration that Considers Information Layout


Leong-Hwee Teo, DSO National Laboratories, 20 Science Park Drive, Singapore 118230, leonghwee.teo@alumni.cmu.edu
Bonnie E. John, IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, bejohn@us.ibm.com
Marilyn Hughes Blackmon, Institute of Cognitive Science, University of Colorado, Boulder, CO 80309-0344, blackmon@colorado.edu

ABSTRACT
CogTool-Explorer 1.2 (CTE1.2) predicts novice exploration behavior and how it varies with different user-interface (UI) layouts. CTE1.2 improves upon previous models of information foraging by adding a model of hierarchical visual search to guide foraging behavior. Built within CogTool so it is easy to represent UI layouts, run the model, and present results, CTE1.2's vision is to assess many design ideas at the storyboard stage, before implementation and without the cost of running human participants. This paper evaluates CTE1.2's predictions against observed human behavior on 108 tasks (36 tasks × 3 distinct website layouts). CTE1.2's predictions accounted for 63-82% of the variance in the percentage of participants succeeding on each task, the number of clicks to success, and the percentage of participants succeeding without error. We demonstrate how these predictions can be used to identify areas of the UI in need of redesign.

Author Keywords
ACT-R; CogTool; Information Foraging; human performance modeling

ACM Classification Keywords
H5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous.

General Terms
Human Factors.

INTRODUCTION
Iterative design and testing is a fundamental process in user interface (UI) design. UI designers may generate dozens of ideas, but a typical project only has the resources to empirically test a handful with appropriate users. A vision for human performance modeling has been to provide a method for running psychologically valid tests on many design ideas and obtaining (a) quantitative measures of usability comparable to empirical testing with humans, and (b) an understanding of why the quantitative results came out as they did.

Modeling has successfully realized this vision for predictions of the efficiency of the UI for skilled users (e.g., [2, 18]). For predictions of novice exploration behavior, Information Foraging Theory [21] and the Linked Model of Comprehension-Based Action Planning and Instruction Taking (LICAI, [17]), based on Kintsch's construction-integration theory [14], provide promising underlying psychological theories, and several tools for website design have grown out of that work, notably Bloodhound [7] and Automatic Cognitive Walkthrough for the Web (AutoCWW, [4]). These models of novice behavior take into account the information scent of links (i.e., the semantic relatedness between the link's label and the user's goal, hereafter, infoscent), but do not consider the 2-D layout of the information on the page. Layout is an important factor, however, in determining a user's success [24].
Budiu and Pirolli [5] take 2-D layout into account, but in the context of a model created for a specific UI (a Degree-of-Interest tree). Likewise, the grouping of links on a page can also influence a user's behavior; AutoCWW [4] takes grouping into account when analyzing infoscent, but does not consider the 2-D layout of the groups when making its predictions.

We present CogTool-Explorer 1.2 (CTE1.2), a computational embodiment of information foraging, implemented in ACT-R [1] within the CogTool prototyping and analysis tool [13], that takes both 2-D layout and grouping into account when making predictions of novice exploration behavior. The next section gives an overview of CTE1.2, followed by implementation details, an evaluation of CTE1.2 on three different layouts of the same information, and suggestions for future work.

OVERVIEW OF COGTOOL-EXPLORER 1.2 (CTE1.2)
CTE1.2 is a research extension of CogTool that connects a computational model of eye movements, visual perception, cognition, and motor actions (i.e., a simulated user) with a UI storyboard. The original CogTool [13] had only a simulation of a skilled user (i.e., a Keystroke-Level Model [6]). The first version of CogTool-Explorer [24] added to CogTool a model of novice behavior that considered 2-D layout, so a designer can make predictions of novice exploration on the same tasks and UIs that he or she analyzes for skilled execution time. After iterating on CogTool-Explorer many times to add consideration of grouping, improve its mechanisms, and set parameters [25, 26], CTE1.2 makes predictions of novice exploration behavior on new text-based website layouts, explaining 63-82% of the variance on measures of interest to UI designers.

Figure 1 shows an example run of CTE1.2's simulated user on an encyclopedia look-up task from an experiment in [3]. In that experiment, the participant was presented with a webpage (shown in the background of Figure 1) with the instructions "Find encyclopedia article about" at the top and a paragraph of text just below it that constituted the participant's search goal. Below the goal, participants were presented with 93 links organized in 9 groups. Each link was an encyclopedia topic; selecting a link transitioned to its lower-level webpage, which presented an alphabetical list of article titles. Participants could check that they had succeeded in the task by finding the target article title from the exploration goal in the A-Z list of article titles. If the target was not in the list, the participant had selected an incorrect link, and he or she would go back to the top-level webpage and continue exploration. For each exploration goal, there is only one correct link that leads to a lower-level webpage containing the target article title. The participants were given 130 seconds to complete each task.

Kitajima, Blackmon, and Polson [15, 16], Miller and Remington [20], and Hornof [11] all argued that in such a layout with groups, users would first evaluate the groups in the webpage, focus attention on a group, and then evaluate the links in that group. If the user decides to go back from a group, he or she will reevaluate the groups in the webpage, focus attention on another group, and then evaluate the links in the new group. We implemented this group-based hierarchical exploration process in CTE1.2 to make it consider group relationships during exploration. We will present the implementation details in the next section, but refer to the numbers in Figure 1 to convey a sense of what the underlying model is doing.

In the left half of Figure 1, CTE1.2's simulated eye starts at the upper left of the screen where the goal is presented (1), and then moves to the nearest group (2). It determines the infoscent of the group and uses it to decide between two actions: continuing to look at other groups, or looking at the links inside the best group seen so far. The model is stochastic in its judgment of nearest and of infoscent, making each run different, simulating the variability in human performance. This run of CTE1.2 looks at seven of the nine groups before deciding (at 3) to focus on the best group so far. It moves its eye to that group (4).

The right half of Figure 1 shows the continuation of the run. Having decided to focus on a single group, CTE1.2 looks at the first link it sees (5), determines its infoscent, and uses this to decide between three actions: continuing to look at other links within this group, clicking on the best link seen so far, or abandoning this group and popping back up to explore other groups. In this example run, CTE1.2 looks at links until it decides to click on the best link so far (6). This link does not lead to the correct encyclopedia page (not shown), so CTE1.2 continues to look in this group until it decides to abandon it and go back to looking at other groups (7).
CTE1.2 continues this cycle of perceiving groups (8) or links and deciding what to do next, until it either finds the correct encyclopedia page or runs out of time.

Figure 1. Sample run of CogTool-Explorer 1.2 (CTE1.2) on a layout of 93 links in 9 groups on a 3×3 grid.

IMPLEMENTATION DETAILS
As shown in Figure 2, CTE1.2 is built inside CogTool [13], which provides the UI designer with a graphical UI (GUI) for representing UI designs and tasks, running the model to generate predictions, and displaying its results. CTE1.2 is comprised of a device model and a user model, both implemented in the ACT-R cognitive architecture [1].

Figure 2. Structure of CogTool-Explorer 1.2 (CTE1.2) and the inputs from the AutoCWW Project at the University of Colorado used to evaluate it in this paper.

The Device Model
CTE1.2 uses CogTool's UI storyboarding tool to allow an interactive system designer to create a storyboard of a GUI either by hand or automatically from existing HTML. The storyboard contains frames (e.g., a web page) with interactive widgets like links and buttons, and transitions between those frames that represent actions on widgets the user can perform (e.g., clicking on a link). Transitions include a representation of the system response time after the user's action (which is zero in the experiment and models in this paper). The widgets have three attributes that are used by the user model: their x-y position in the frame, their size, and the textual labels that are displayed to the user. These objects and their attributes are represented in an ACT-R device model with which CTE1.2 can interact.

The UI designer can group widgets much like groups are made in drawing or presentation applications, i.e., select all the widgets (or previously-created groups) you want to group and invoke the Group command. Groups can have a textual label, just as widgets can. The CTE1.2 device model includes data structures to represent hierarchical grouping relationships in a frame, so that the user model can see and consider groups during exploration.

In more detail, as an ACT-R model runs, it extracts information from the device model to create visual-location objects representing what is visible on the screen. These objects have attributes for x-y coordinates and other basic visual features such as size and color (as yet unused in CTE1.2). CTE1.2 also includes a member-of-group attribute in each visual-location object. When the UI designer groups a set of widgets, the member-of-group slot of each visual-location object that belongs to that group will have a value that references the group's visual-location object. Since a group's visual-location object also has a member-of-group slot, nested groups can be represented in this scheme. The member-of-group slot of a visual-location object that does not belong to any group has a slot value equal to nil. We can interpret the nil value in the member-of-group slot as membership in an implicit top-level group on the webpage.

The User Model
The user model has three main components implemented in ACT-R: eyes, hands, and cognition. In addition to these components, the user model is comprised of (1) a representation of the exploration goal, (2) a representation of a user's semantic knowledge, and (3) a serial evaluation process adapted from the SNIF-ACT 2.0 model [9] and expanded to consider 2-D layout and grouping.
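Before walking through the user model's components, here is a minimal illustrative sketch, in Python, of the member-of-group representation described under The Device Model above. CTE1.2 itself is implemented in ACT-R/Lisp inside CogTool; the class and field names below are invented stand-ins, not the actual data structures.

```python
# Illustrative sketch only: invented stand-ins for ACT-R visual-location
# objects and the member-of-group slot used by the CTE1.2 device model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VisualLocation:
    """Stand-in for an ACT-R visual-location object (widget or group)."""
    x: float
    y: float
    width: float
    height: float
    label: str                                           # text displayed to the user
    member_of_group: Optional["VisualLocation"] = None   # None plays the role of nil

def is_top_level(element: VisualLocation) -> bool:
    """A nil (here None) member-of-group means membership in the implicit
    top-level group on the webpage."""
    return element.member_of_group is None

# Groups are themselves visual locations, so groups can nest arbitrarily.
life_science = VisualLocation(40, 120, 300, 200, "Life Science")
plants_link = VisualLocation(60, 150, 80, 20, "Plants", member_of_group=life_science)

assert is_top_level(life_science) and not is_top_level(plants_link)
```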

Exploration Goal
The entire process is driven by giving the user model an exploration goal, a paragraph of text that describes the target webpage in the website. This goal is given directly to the user model by typing the paragraph into a Task textbox in CogTool's GUI. This text is encoded in an ACT-R chunk representing the exploration goal. The model then uses its eyes to look around the device model in search of a way to get to its goal.

ACT-R Eyes: Visual Search, with Knowledge of Grouping
As previously outlined, the user model searches hierarchically, first looking at the groups and, only after deciding to focus on one group, looking through the links in that group. This hierarchical search process was derived from prior work by Halverson and Hornof [10, 11], but differs from the prior work in that infoscent, rather than exact word matching, determines whether CTE1.2 focuses on or leaves a group.

To implement this hierarchical search, the exploration goal chunk contains a group-in-focus slot whose value is initially set to nil (recall that nil represents an implicit top-level group on the webpage). The user model constrains its visual search to visual-location objects with member-of-group slots that match the group-in-focus slot. When the model decides to select a group to focus on, it pushes the current value of the group-in-focus slot onto the exploration history and updates the group-in-focus slot to reference the new group. When the model decides to select a link and transition to a new page, it pushes the current value of the group-in-focus slot onto the exploration history and updates the group-in-focus slot to nil. Finally, when the model decides to go back, it updates the group-in-focus slot to the most recent entry in the exploration history, and then deletes that most recent entry from the exploration history. With this representation and mechanism in place, CTE1.2's visual search is equipped to navigate through any hierarchically grouped layout of widgets on any number of web pages: pages with one flat level of links, pages like the one in Figure 1 with a regular grid of groups and links, and pages with arbitrarily arranged, nested, and even overlapping groups.

ACT-R's eyes serially look at widgets and groups of widgets (we will refer to both individual widgets and groups as elements) in the device model, guided by a visual search process adapted from the Minimal Model of Visual Search [10]. This process is implemented in the ACT-R vision module, augmented with the EMMA model of visual preparation, execution and encoding [23]; moving the eyes and extracting information from a UI element takes from 50 to 250 msec. The process starts in the upper-left corner of the frame and proceeds to look at the closest element with respect to the model's current point of visual attention. To make progress, the visual search marks this element as having been attended. The model will not look at an attended element again on this visit to its containing frame or group, but might do so in a subsequent visit, because the elements revert to being unattended when the eyes leave a group or frame. Since distance between elements determines which elements are looked at and in which order (moderated by a noise function, so each run of the CTE1.2 model may be different), the influence of the 2-D layout of a UI emerges from CTE1.2's performance.
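The group-in-focus bookkeeping and the noisy nearest-element search described above could be sketched as follows, continuing the illustrative VisualLocation stand-ins from the previous sketch. This is not the actual ACT-R production system, and the noise magnitude is an arbitrary placeholder.

```python
# Illustrative sketch of hierarchical, group-constrained visual search with a
# noisy "closest unattended element" rule; all names and values are invented.
import math
import random

class HierarchicalSearchState:
    def __init__(self):
        self.group_in_focus = None   # None plays the role of nil (top-level group)
        self.history = []            # exploration history, used like a stack
        self.attended = set()        # elements already attended on this visit

    def candidates(self, elements):
        """Visual search is constrained to elements whose member-of-group slot
        matches the group in focus and that have not yet been attended."""
        return [e for e in elements
                if e.member_of_group is self.group_in_focus
                and id(e) not in self.attended]

    def look_at_nearest(self, elements, gaze_xy, noise_sd=10.0):
        """Look at the closest candidate to the current point of visual
        attention; Gaussian noise makes each run potentially different."""
        cands = self.candidates(elements)
        if not cands:
            return None
        def noisy_distance(e):
            return math.hypot(e.x - gaze_xy[0], e.y - gaze_xy[1]) + random.gauss(0, noise_sd)
        target = min(cands, key=noisy_distance)
        self.attended.add(id(target))     # not re-attended on this visit
        return target

    def focus_on_group(self, group):
        self.history.append(self.group_in_focus)
        self.group_in_focus = group
        self.attended.clear()             # elements revert to unattended

    def select_link(self):                # a link was clicked; a new page appears
        self.history.append(self.group_in_focus)
        self.group_in_focus = None
        self.attended.clear()

    def go_back(self):
        self.group_in_focus = self.history.pop() if self.history else None
        self.attended.clear()
```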
When the user model looks at an element, it extracts the text label from the device model and uses the representation of its semantic knowledge to decide what to do next. If the user model chooses to click on a widget that results in a transition to a different frame, or chooses to go back to the previous frame (e.g., clicks on the back button in a web browser), the user model's visual field will be updated with the elements of the next frame. If the user model chooses to focus on a group or go back from a group, the user model will continue exploration among the member elements of the new group. Before discussing the decision process in more detail, we will present the representation of semantic knowledge.

Representation of Semantic Knowledge
CTE1.2 creates a representation of the semantic knowledge of the user from the text of the exploration goal, the labels of the elements in the device model, and a large English-language semantic space housed at the AutoCWW [4] website at the University of Colorado (http://autocww.colorado.edu/homepage.html): the first-year-college-level TASA corpus from Touchstone Applied Science Associates, Inc. (CTE1.2 can connect to other sources of information scent scores, but a discussion of alternative sources is beyond the scope of this paper.) The AutoCWW tools use Latent Semantic Analysis (LSA) [19] to calculate the semantic relatedness of the goal to the text of elements, based on the cosine value between two vectors, one representing the text in the goal and the other representing the text in the element.

CTE1.2 creates a dictionary of these cosines, or infoscent scores, by calling out to a particular composite tool on the AutoCWW website (http://autocww.colorado.edu/elaborate.html) that first (a) simulates human comprehension processes by elaborating each link text with semantically similar, familiar words in the TASA college-level corpus, and then (b) uses one-to-many comparison to compute the cosine between the goal text and each of the elaborated link texts. This dictionary is built after the device model and exploration goal have been defined but before the user model is run, and is stored in a look-up table for access by the user model during model runs. When the model retrieves infoscent from this dictionary, noise is added to emulate a human's variable judgment about the similarity of concepts.
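A minimal sketch of this look-up-table idea follows: cosines are computed once (CTE1.2 obtains them from the AutoCWW service), cached before the model runs, and read back with noise at run time. The function get_lsa_cosine and the noise standard deviation are placeholders, not the AutoCWW API or CTE1.2's actual parameter values.

```python
# Sketch of an infoscent dictionary: compute-once, cache, read back with noise.
import random

class InfoscentDictionary:
    def __init__(self, goal_text, element_labels, get_lsa_cosine):
        # Built after the storyboard and goal are defined, before any model run.
        self.scores = {label: get_lsa_cosine(goal_text, label)
                       for label in element_labels}

    def infoscent(self, label, noise_sd=0.1):
        """Return the stored cosine perturbed by noise, emulating a person's
        variable judgment of how similar two concepts are."""
        return self.scores[label] + random.gauss(0.0, noise_sd)

# Hypothetical usage:
#   scent = InfoscentDictionary(goal_paragraph, [w.label for w in widgets],
#                               get_lsa_cosine=query_autocww_elaborate)
#   value = scent.infoscent("Plants")
```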

The Serial Evaluation Decision Process
Each time the user model evaluates the infoscent of an element, the model decides to either (1) continue to look at and evaluate another element, (2) click on the best widget or focus on the best group seen so far, or (3) go back to the previous group or frame. The model selects the action with the highest utility as computed by three utility update functions. The first two utility update functions (Eq. 1 and Eq. 2) are from SNIF-ACT 2.0 [9] and remain unchanged in CTE1.2. The third utility update function is a major contribution of CTE1.2 over SNIF-ACT 2.0 and will be described in detail after reviewing the first two functions.

U_LookAt is the model's estimate of how beneficial it will be to continue to look at more elements. A high value of U_LookAt means that the elements looked at so far have been highly related to the goal, so this is a good information patch and looking more may find something even better. Mathematically,

U_LookAt = (U'_LookAt + IS_Current) / (N' + 1)   [Eq. 1]

where U'_LookAt is the previously computed utility, IS_Current is the infoscent of the currently attended element, and N' is the number of elements already assessed in the current group before the currently attended element.

U_Choose is the model's estimate of how beneficial it would be to stop looking in this group and act on the best element seen so far. One way to interpret U_Choose is that it tracks how the attractiveness of the best element seen so far in the group compares to that of all the elements the model has seen in the group, as captured in U_LookAt. Thus, if U_LookAt > U_Choose, the model keeps looking at unattended elements, but when U_Choose > U_LookAt, the model stops looking and chooses to click on (if a link) or focus on (if a group) the best element seen so far. The k > 1 parameter in U_Choose biases the model to prefer U_LookAt when the model starts exploring in a new group. Mathematically,

U_Choose = (U'_Choose + IS_Best) / (N' + k + 1)   [Eq. 2]

where U'_Choose is the previously computed utility, IS_Best is the highest infoscent attended in the current group, N' is the number of elements already assessed in the current group before the currently attended element, and k is a scaling parameter.

Both Eq. 1 and Eq. 2 were derived from a Bayesian analysis by Fu and Pirolli in SNIF-ACT 2.0 and thus have strong theoretical support. These two utility update functions also have empirical support from several modeling studies [9, 24] in which models using these update functions had good fits to participant data on link selections. However, those models did not emphasize or use go-back behavior, and we found that our initial CogTool-Explorer model, which used the original U_GoBack update equation from SNIF-ACT 2.0, did not match participants' go-back behavior [25]. Therefore, CTE1.2 uses Eq. 3, which was developed in [26]:

U_GoBack = MIS(elements assessed in the previous group, excluding the element selected from the previous group)
         − MIS(elements assessed in the current group, including the element selected from the previous group and incorrect elements in the current group)
         − GoBackCost   [Eq. 3]

where MIS is mean infoscent, GoBackCost is a parameter that represents the fixed cost incurred from interacting with the UI to go back, and incorrect elements are assigned zero infoscent.

The first term in Eq. 3 represents how attractive the model finds the links on the previous page that have not yet been explored; if this term is large, the model is likely to go back. The second term represents how confident the model is that the current page is on the correct path to its goal; if this term is large, the model is likely to keep exploring the current page.
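The three utility updates and the highest-utility decision could be sketched as follows. The values of k and go_back_cost shown here are placeholders, not CTE1.2's calibrated settings.

```python
# Sketch of the utility updates in Eqs. 1-3 and the pick-the-max decision.
def u_look_at(prev_u_look_at, is_current, n_assessed):
    """Eq. 1: utility of continuing to look, updated with the infoscent of the
    currently attended element (n_assessed is N', the elements already assessed
    in the current group)."""
    return (prev_u_look_at + is_current) / (n_assessed + 1)

def u_choose(prev_u_choose, is_best, n_assessed, k=5.0):
    """Eq. 2: utility of acting on the best element seen so far; k > 1 biases
    the model toward looking when it first enters a group."""
    return (prev_u_choose + is_best) / (n_assessed + k + 1)

def u_go_back(prev_group_scents, current_group_scents, go_back_cost=1.0):
    """Eq. 3: mean infoscent left behind in the previous group minus mean
    infoscent found here, minus the fixed cost of going back."""
    mis = lambda scents: sum(scents) / len(scents) if scents else 0.0
    return mis(prev_group_scents) - mis(current_group_scents) - go_back_cost

def decide(utilities):
    """Select the action (e.g., 'look_at', 'choose_best', 'go_back') with the
    highest utility."""
    return max(utilities, key=utilities.get)

# e.g., decide({"look_at": 0.42, "choose_best": 0.37, "go_back": -0.60}) -> "look_at"
```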
ACT-R Hands: Clicking on Links and Buttons
When the user model decides to click on a link or the back button, it does so using ACT-R's standard motor module, as used throughout CogTool. The time taken to move the mouse is determined by the Welford formulation [28] of Fitts's Law [8] as implemented in ACT-R, with a = 0, b = 100 ms, and a minimum time of 100 ms. The time to click the mouse button is 150 msec, which emerges from ACT-R's standard motor constants. This click is sent to the device model, which feeds a new frame to ACT-R's visual module if the action changes what is visible on the device's display.

ACT-R Cognition: Time to Access Infoscent
The time the model takes to perform its operations is important because if the model runs faster than humans, it will be able to explore more links in the allotted time than people can. If it performs more slowly than humans, it will run out of time before exploring as many links. Neither situation will predict human performance well, making the duration of operations a critical aspect of CTE1.2. The durations of CTE1.2's motor operators presented in the previous paragraph are familiar to HCI researchers. However, the time to access infoscent is a relatively new concept.

When the model decides to look at an element, a sequence of three ACT-R production rules is responsible for (1) looking at the element, (2) encoding the text, and (3) assessing the infoscent of the encoded text. The duration for looking at an element is determined by EMMA [23], and the duration for encoding the text is the default 50 ms for a single production in ACT-R; both are well established in the literature. The third production, which assesses the infoscent of an element, is unique to the SNIF-ACT 2.0 and CTE models. This production takes the encoded text of the element and the text from the exploration goal, and approximates the cognitive operation of assessing semantic similarity by invoking a LISP function to retrieve the infoscent of the element from the model's look-up table that represents its semantic knowledge. This LISP function is an example of using a computationally efficient implementation of a complex cognitive process like assessing infoscent, in place of more native ACT-R mechanisms such as spreading activation between declarative memory chunks. This is a common expediency among cognitive modelers who wish to match patterns of human behavior but are not trying to match the time course of behavior as well. Since the time course of behavior in this model affects its success (i.e., failure occurs because the model, and the humans, ran over the 130-second time limit), we have to care more about time. This meant that the default 50 ms for the single production that invokes the LISP function would be shorter than the duration of a more native ACT-R implementation that requires multiple production rules with latencies from declarative memory retrievals.

The above reasoning motivated us to perform iterative tests of setting longer durations for the third production on a separate data set [25] from the one used in the evaluation in the next section. When set to 275 ms, the average duration between link selections by the model matched the 7.4 s observed in participants' behavior. This duration will be used on another page layout and data set in the next section.

EVALUATION OF COGTOOL-EXPLORER 1.2 (CTE1.2)
We compared CTE1.2's performance on 36 tasks on three different layouts of the same information, to demonstrate its ability to predict the effects of layout and grouping. The first layout (multi-page, Figure 3) was used to set the duration of the infoscent assessment production in CTE1.1, so results for the multi-page layout should be considered explanations of the data rather than predictions. However, the results on the second and third layouts (half-flattened, Figure 4, and multi-group, Figure 1) are true predictions, using only the task and UI descriptions, not looking at any human performance data.

The tasks were selected from three experiments previously reported by Blackmon et al. [3, 4] and Toldy [27] to provide a set of encyclopedia look-up tasks of varying difficulty that had been tested on all three UI layouts. All three experiments from which the data were drawn used the same procedure and the same tasks. The participants were presented with a webpage with a target paragraph at the top and links below (Figures 1, 3 and 4). They were asked to click links until they either successfully found the correct webpage or were shown a webpage announcing that time had run out. The tasks alternated between hard and easy tasks, counterbalanced to prevent order effects. For example, an easy task (i.e., well supported by the UI design) was to look up "Fern"; 100% of the participants found its link ("Plants") in the Life Science category.
In contrast, a hard task was to look up "Lifesaving", with only 25% of the participants finding its link ("Organizations") in the Social Science category. Thirty-six to 60 undergraduate participants completed each task, earning course credit or $15. Participants had 130 seconds to complete each task. Logging software recorded each link clicked, the group heading under which the click was made, and the time elapsed since the previous click. From this log, covering a total of 4979 completed tasks, we could extract many metrics of interest, for example, the percentage of participants who succeeded in each task, the number of clicks each participant made in each successful trial, and the percentage of participants who succeeded without error on each task.

The Layouts
The multi-page layout (Figure 3) starts with a top-level page comprised of the goal statement and a list of nine category links below it. When a link is clicked on this start page, a 2nd-level page appears with a list of links that are more specific aspects of the category link clicked on the top level. When a link is clicked on the 2nd-level page, a 3rd-level page appears with an alphabetical list of terms. If the correct path is followed, the goal term appears on the 3rd-level page. The participants use the browser's back button to go back from a lower-level page to a higher-level page.

The half-flattened layout (Figure 4) works like an accordion widget (see, for example, http://www.welie.com/patterns/showpattern.php?patternid=accordion). It starts with the same top-level page as the multi-page layout, but when a category link is clicked, the links below it move down and its more specific links (those that would have been on a 2nd-level page in the multi-page layout) appear indented, just below the category link. Clicking on one of these links leads to the same 3rd-level pages as in the multi-page layout. Clicking on another category link at this point collapses the currently expanded category link and expands the one just clicked.

The multi-group layout (background of Figure 1) puts all 93 links on a single page organized in 9 groups on a 3×3 grid. Each group has a heading that contains the same words as the category links in the multi-page and half-flattened layouts. Clicking on any link in this layout brings up the same 3rd-level pages as in the multi-page layout.
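For readers unfamiliar with the accordion pattern, a toy sketch of the half-flattened layout's expand/collapse behavior follows; it is purely illustrative and not CogTool storyboard code.

```python
# Toy sketch: clicking a category expands its subordinate links in place and
# collapses whichever category was expanded before.
class HalfFlattenedLayout:
    def __init__(self, categories):
        self.categories = categories   # dict: category label -> list of subordinate links
        self.expanded = None           # at most one category is expanded at a time

    def click_category(self, label):
        self.expanded = label          # expanding one category collapses the previous one

    def visible_links(self):
        """All category links, plus the subordinate links of the expanded category."""
        links = []
        for label, sublinks in self.categories.items():
            links.append(label)
            if label == self.expanded:
                links.extend(sublinks)  # shown indented just below their category
        return links
```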

Figure 3. Multi-page layout.
Figure 4. Half-flattened layout. Clicking on subordinate links brings up the 3rd-level pages shown in Figure 3.

Metrics and Modeling Process
We compared the model runs by CTE1.2 to participant data on five task performance measures [26]. Due to space limitations, this paper reports only the following three metrics, which are both indicative of model goodness-of-fit and important to UI designers.

1. Correlation between model and participants on the percentage of trials succeeding on each task (R² %Success). Percent success is common in user testing to inform UI designers about how successful their users will be with their design, so a high correlation between model and data will allow modeling to provide similar information.

2. Correlation between model and participants on the number of clicks on links to accomplish each task (R² ClicksToSuccess). This metric eliminates unsuccessful trials because some participants click two or three links and then do nothing until time runs out, whereas others continued to click (as did the model), so successful trials may be a better test of how well the model fits motivated users.

3. Correlation between model and participants on the percentage of trials succeeding without error on each task (R² %ErrorFreeSuccess). This measure indicates the model's power to predict which tasks need no improvement and therefore no further design effort.

To obtain stable values for the above metrics, we ran many sets of model runs until convergence, where each set comprises the same number of model runs as participant trials for each of the 36 tasks. We first ran two sets and checked whether %Success for all 36 tasks on the first set was within 1% of the %Success on the combination of both sets (it was not). We then ran a third set and compared the %Success for all 36 tasks for the combined runs of the first two sets to the combined runs of all three sets. We continued to run sets of model runs until %Success for all 36 tasks in the new combined sets was within 1% of the combined previous sets. All tasks converged within 16 sets of model runs (over 20,000 model runs).

Results
Table 1 shows the results of running CTE1.2 to convergence on the three layouts and comparing them to both human data and CTE1.1's predictions on the same tasks.

Table 1. Correlations of CTE1.1's and CTE1.2's predictions with human data.

CTE1.2 is identical to CTE1.1 on the multi-page layout (because there are no groups in that layout, so the models perform the same), is statistically indistinguishable from CTE1.1 on the half-flattened layout, but improves substantially on the multi-group layout. As we speculated in [26], the half-flattened layout has at most 24 links visible at one time, which may not be sufficiently difficult for participants to adopt a hierarchical visual search strategy. The benefit of hierarchical visual search is revealed in the multi-group layout, however, with its 9 categories and 93 links visible on its page. CTE1.2 accounts for 63-82% of the variance in novice behavior without using human data to set any parameters. In contrast, prior models like DOI-ACT [5] and SNIF-ACT 2.0 [9] accounted for 56% (ClicksToSuccess) and 94% (%Success) of the variance in their human data, respectively, but fit model parameters to the same human data used to evaluate the models.
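The run-to-convergence procedure described under Metrics and Modeling Process amounts to adding whole sets of runs until every task's %Success stabilizes. A minimal sketch follows; run_one_set is a placeholder for running the model once per participant trial on each of the 36 tasks, and the 1% tolerance is taken from the text above.

```python
# Sketch of the set-based convergence check on %Success.
def percent_success(runs_per_task):
    return {task: 100.0 * sum(flags) / len(flags)
            for task, flags in runs_per_task.items()}

def run_until_converged(run_one_set, tolerance=1.0):
    combined = run_one_set()                    # {task: [1, 0, 1, ...] success flags}
    while True:
        previous = percent_success(combined)
        for task, flags in run_one_set().items():
            combined[task].extend(flags)        # add one more full set of runs
        current = percent_success(combined)
        if all(abs(current[t] - previous[t]) < tolerance for t in current):
            return combined                     # %Success stable within 1% for all tasks
```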

CTE1.2 is the first process model of information foraging that has been applied to different layouts without peeking at any human data to set parameters. This avoids over-fitting of the model to data and increases the model's ability to generalize.

Correlation alone does not tell the entire story, however. For example, Figure 5 shows that CTE1.2 underpredicts human performance on the hardest tasks, having %Success for many tasks at less than 20%, whereas the people never performed that badly. The next section demonstrates how, despite predictions that are not perfect, CTE1.2 can identify those tasks most in need of UI design attention and those where the UI already supports the user.

Figure 5. CTE1.2's predicted %Success versus observed %Success. If CTE1.2 perfectly matched participant data, all data points would lie on the green diagonal line. The red line is the best-fitting line for the data points.

DISCUSSION
Just as user testing can identify which tasks are not well supported by the current UI design, CTE1.2's predictions can be used by UI designers for the same purpose. For example, the leftmost two columns of Table 2 display each task and the %Success attained by the human participants using the multi-group layout. The three rightmost columns show how successful CTE1.2 would be at identifying easy tasks (that require no additional design effort) and hard tasks (that should receive additional attention and redesign) under different definitions of easy and hard.

Table 2. Identification of easy and hard tasks under several definitions of easy and hard. Shading indicates hits, misses and false alarms as summarized in the last four rows.

We present several definitions of easy and hard because the criteria for defining these categories are usually dependent on business considerations. For example, an e-business project may have very stringent criteria because their customers may flee to a competitor as soon as they lose their way in the site, whereas a site providing information about disease treatment and prognosis may be able to depend on the persistence of motivated users and have a less stringent definition. Whatever the criteria, the project team could concentrate their design effort on the hard tasks, ignore the easy tasks, and move on to considering the moderate tasks (categorized as neither easy nor hard) if there was time before the site had to be released.

For example, if the product team decided that an easy task was one where 95% of the people could succeed in 2 minutes (approximately the time limit in the experiments) and that a hard task was one that 75% or fewer people could succeed in that time, then CTE1.2 could correctly identify 87% of the easy tasks and 93% of the hard tasks using the same criteria for the model. However, it would miss one hard task and would identify four moderate tasks as hard, so design effort would not be expended exactly as it would were user testing data available.
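The hit, miss, and false-alarm bookkeeping behind Table 2 can be sketched as simple thresholding over per-task %Success. The thresholds below correspond to the 95%/75% example above; the observed and predicted rates would come from user testing and from CTE1.2, and all names are illustrative.

```python
# Sketch: classify tasks as easy/moderate/hard and score "hard" identification.
def classify(pct_success, easy_at=95.0, hard_at=75.0):
    if pct_success >= easy_at:
        return "easy"
    if pct_success <= hard_at:
        return "hard"
    return "moderate"

def hard_task_accuracy(observed, predicted, **thresholds):
    """observed, predicted: dicts mapping task -> %Success (0-100)."""
    hits = misses = false_alarms = 0
    for task in observed:
        actually_hard = classify(observed[task], **thresholds) == "hard"
        predicted_hard = classify(predicted[task], **thresholds) == "hard"
        if actually_hard and predicted_hard:
            hits += 1
        elif actually_hard:
            misses += 1
        elif predicted_hard:
            false_alarms += 1
    return hits, misses, false_alarms
```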

If the product team had a less stringent definition of easy (i.e., 90% success) and hard (50% failure), then CTE1.2 would not miss any of the really hard tasks, but its false alarm rate would be higher (35% of the tasks it identifies as hard would not actually be hard, and one would actually be easy), so design effort would be expended needlessly.

However, there is no necessity to use the same criteria for both the model and the human data. Knowing that CTE1.2 underpredicts human performance, as shown in Figure 5, the criteria for CTE1.2 could be set to 50% in order to identify tasks that would be successful for 75% or fewer users. Likewise, CTE1.2's criterion for an easy task could be set at 90%, resulting in the rightmost column in Table 2. This results in 93% of both easy and hard tasks being correctly identified, with only 1 miss (7%) and 3 false alarms (14%).

Another interesting point is that CTE1.2's predictions of which group heading users will click in first are even more promising than its predictions of our main metrics. Blackmon has shown in laboratory studies [3], and Wolfson and Bailey have shown in practice [29], that a user's first click is highly predictive of eventual success. We analyzed the correspondence between CTE1.2's predictions of which group heading contained the link first clicked in each task in the multi-group layout and the observed first clicks of the participants. CTE1.2's predictions accounted for 71% of the variance, with no bias toward under- or over-predicting. Thus, CTE1.2 could provide targeted guidance for heading label choices in UI designs.

CONCLUSION AND FUTURE WORK
We have developed CogTool-Explorer 1.2 (CTE1.2), a model of goal-directed user exploration that considers both layout position and grouping of on-screen options, and test results show that the model accounted for 63-82% of the variance of human performance on three measures of interest to HCI. The model's parameters were set using a multi-page layout of information, and it attained this level of fit to participant data on a half-flattened (accordion-style) layout and a multi-group layout, suggesting that the model is not overfitted to a particular layout. We showed how CTE1.2 might be used to identify tasks where the UI needs to be redesigned to support human exploration and tasks where no more design effort is required, attaining over a 90% hit rate for easy and hard tasks, missing less than 10% of the hard tasks, with less than 20% false alarms. CTE1.2 is part of the CogTool open source project and, like all of CogTool, can be downloaded from http://cogtool.hcii.cs.cmu.edu/.

Our eventual goal is for CTE1.2 to work for a wide range of UIs, so that it can be used as a predictive modeling tool for design. We must further test and likely refine the model on many other UIs before we can rely on its predictions in general, but our results so far are encouraging. Several avenues of future work may increase its accuracy as a predictive model and its usefulness as a tool for design. For example, CTE1.2 uses only infoscent to evaluate links; it is likely that humans use logical reasoning mechanisms, especially when information foraging fails, as suggested by CTE1.2's under-prediction of success on hard tasks. Using the categorical relationships of words as well as a statistical model of semantic similarity, as in [5], especially when UIs are arranged in groups, may be a path to improvement. In addition, AutoCWW [4] uses familiarity of words as well as infoscent to make its predictions; including familiarity in CTE1.2's decision process may also increase its predictive power.
CTE1.2 currently assumes that all information is equally visible, ignoring contrast, color, size, etc.; thus, it should be considered a test of only the textual labels, grouping, and positioning at this point. Adding a model of saliency (e.g., [12] and its successors) would fit well within the ACT-R framework and could allow future versions of CTE to be applied to more realistic UIs. Further, CTE1.2 does not model the psychological processes of how visual groups are formed and recognized. Rather, group relationships are provided as input to the model by the human modeler (as do AutoCWW [4] and DOI-ACT [5]). Future work can explore the use of other computational models of visual grouping, for example [22], as input to CTE. This has the potential both to increase the accuracy of CTE's predictions and to decrease the work for a UI designer using CTE.

In sum, CTE1.2 began with SNIF-ACT 2.0 [9], embodied it with the eyes and hands of ACT-R [1], guided its visual search with the Minimal Model of Visual Search [10] and knowledge of grouping [26], improved its Go-Back utility update function [25, 26], and was aligned with the time course of human behavior [26]. Built within CogTool [13] so it is easy to represent UI layouts, run the model, and present results, CTE1.2 contributes to human performance modeling in HCI and to our set of research and design tools.

ACKNOWLEDGMENTS
We thank the amazing CogTool team. This research was supported in part by funds from IBM, NASA, Boeing, NEC, PARC, DSO, and ONR, N00014-03-1-0086. The views and conclusions in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of IBM, NASA, Boeing, NEC, PARC, DSO, ONR, or the U.S. Government.

REFERENCES
1. Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., and Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036-1060.
2. Bellamy, R., John, B. E., and Kogan, S. (2011). Deploying CogTool: Integrating quantitative usability assessment into real-world software development. In Proceedings of the 33rd International Conference on Software Engineering (ICSE '11), ACM, New York, NY, USA, 691-700.

3. Blackmon, M. H. (2012). Information scent determines attention allocation and link selection among multiple information patches on a webpage. Behaviour & Information Technology, 31(1), 3-15.
4. Blackmon, M. H., Kitajima, M., and Polson, P. G. (2005). Tool for accurately predicting website navigation problems, non-problems, problem severity, and effectiveness of repairs. In Proc. CHI 2005, ACM Press, 31-40.
5. Budiu, R., and Pirolli, P. L. (2007). Modeling navigation in degree of interest trees. In Proc. of the 29th Annual Conference of the Cognitive Science Society, Cognitive Science Society.
6. Card, S. K., Moran, T. P., and Newell, A. (1980). The keystroke-level model for user performance time with interactive systems. Communications of the ACM, 23(7), 396-410.
7. Chi, E. H., Rosien, A., Supattanasiri, G., Williams, A., Royer, C., Chow, C., Robles, E., Dalal, B., Chen, J., and Cousins, S. (2003). The Bloodhound project: Automating discovery of web usability issues using the InfoScent simulator. In Proc. CHI 2003, ACM Press, 505-512.
8. Fitts, P. M. (1954). The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47, 381-391.
9. Fu, W.-T., and Pirolli, P. (2007). SNIF-ACT: A cognitive model of user navigation on the World Wide Web. Human-Computer Interaction, 22, 355-412.
10. Halverson, T., and Hornof, A. J. (2007). A minimal model for predicting visual search in human-computer interaction. In Proc. CHI 2007, ACM Press, 431-434.
11. Hornof, A. J. (2004). Cognitive strategies for the visual search of hierarchical computer displays. Human-Computer Interaction, 19, 183-223.
12. Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254-1259.
13. John, B. E., Prevas, K., Salvucci, D. D., and Koedinger, K. (2004). Predictive human performance modeling made easy. In Proc. CHI 2004, ACM Press, 455-462.
14. Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95, 163-182.
15. Kitajima, M., Blackmon, M. H., and Polson, P. G. (2000). A comprehension-based model of Web navigation and its application to Web usability analysis. In S. McDonald, Y. Waern and G. Cockton (Eds.), People and Computers XIV - Usability or Else! (Proceedings of HCI 2000, 357-373). Springer-Verlag.
16. Kitajima, M., Blackmon, M. H., and Polson, P. G. (2005). Cognitive architecture for website design and usability evaluation: Comprehension and information scent in performing by exploration. HCI International 2005.
17. Kitajima, M., and Polson, P. (1997). A comprehension-based model of exploration. Human-Computer Interaction, 12(4), 345-389.
18. Knight, A., Pyrzak, G., and Green, C. (2007). When two methods are better than one: Combining user study with cognitive modeling. Ext. Abstracts CHI 2007, ACM Press, 1783-1788.
19. Landauer, T. K., McNamara, D. S., Dennis, S., and Kintsch, W. (Eds.) (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Lawrence Erlbaum Associates.
20. Miller, C. S., and Remington, R. W. (2004). Modeling information navigation: Implications for information architecture. Human-Computer Interaction, 19, 225-271.
21. Pirolli, P., and Card, S. K. (1999). Information foraging. Psychological Review, 106, 643-675.
22. Rosenholtz, R., Twarog, N. R., Schinkel-Bielefeld, N., and Wattenberg, M. (2009). An intuitive model of perceptual grouping for HCI design. In Proc. CHI 2009, ACM Press, 1331-1340.
23. Salvucci, D. D. (2001). An integrated model of eye movements and visual encoding. Cognitive Systems Research, 1(4), 201-220.
24. Teo, L., and John, B. E. (2008). Towards a tool for predicting goal-directed exploratory behavior. In Proc. of the HFES 52nd Annual Meeting, 950-954.
25. Teo, L., and John, B. E. (2011). The evolution of a goal-directed exploration model: Effects of information scent and GoBack utility on successful exploration. Topics in Cognitive Science, 3, 154-165.
26. Teo, L. (2011). Modeling Goal-Directed User Exploration in Human-Computer Interaction. Unpublished doctoral dissertation, Carnegie Mellon University.
27. Toldy, M. E. (2009). The Impact of Working Memory Limitations and Distributed Cognition on Solving Search Problems on Complex Informational Websites. Unpublished doctoral dissertation, University of Colorado Boulder, Department of Psychology.
28. Welford, A. T. (1960). The measurement of sensory-motor performance: Survey and reappraisal of twelve years' progress. Ergonomics, 3, 189-230.
29. Wolfson, C. A., Bailey, R. W., Nall, J., and Koyani, S. (2008). Contextual card sorting (or FirstClick testing): A new methodology for validating information architectures. Proceedings of the UPA.