Enhancing Web Search Result Access with Automatic Categorization


Mika Käki

Enhancing Web Search Result Access with Automatic Categorization

ACADEMIC DISSERTATION

To be presented, with the permission of the Faculty of Information Sciences of the University of Tampere, for public discussion in the Pinni auditorium B1097 on December 2nd, 2005, at 13:00 o'clock.

Department of Computer Sciences
University of Tampere
Dissertations in Interactive Technology, Number 2
Tampere 2005

ACADEMIC DISSERTATION IN INTERACTIVE TECHNOLOGY

Supervisor: Professor Kari-Jouko Räihä, PhD, Department of Computer Sciences, University of Tampere, Finland

Opponent: Dr. Polle T. Zellweger, Bellevue, Washington, United States of America

Reviewers: Dr. Steve Jones, Department of Computer Science, University of Waikato, New Zealand; Professor Samuel Kaski, PhD, Laboratory of Computer and Information Science, Helsinki University of Technology, Finland

Dissertations in Interactive Technology, Number 2
Department of Computer Sciences
FIN University of Tampere
FINLAND

ISBN
ISSN

Tampereen yliopistopaino Oy
Tampere 2005

Abstract

Information in the Web is typically found with the help of a Web search engine. For instance, Google has been reported to index over eight billion Web pages and to process over 200 million queries a day. Information is thus available, but users express their information needs with very few query words, typically one or two. Finding relevant information from a set of eight billion documents with a cue of just two words is a tremendous challenge. Search engines perform remarkably well with sophisticated result ranking methods, but there are cases where the result ranking is not appropriate. Undirected informational searches, where a broad understanding of a topic is sought, and queries with ambiguous terms are such cases.

Our approach is to enhance users' result access process with automatically formed filtering categories. Categories provide an understandable overview of the results and make accessing relevant results easy. The concept is implemented in a search user interface called Findex. Two different categorization methods have been developed, both promoting simplicity so that the functionality remains understandable to the users.

We evaluated our approach in controlled experiments, in a longitudinal study, and with a theoretical test. In experiments with 20 and 36 participants we tested the usefulness of the proposed user interface and the categorization schemes. The results showed that finding relevant results is about 30-40% faster with the proposed user interface than with the de facto standard, the ranked results user interface. Users' attitudes also favored the new user interface. In an experiment with 27 participants we found that it is better to show only a small number of categories (around 10-15) than to maximize result coverage by displaying more categories.

The results of the experiments were complemented with a longitudinal (two-month) study of real use with 16 participants. The results indicated that the categorization user interface becomes part of the users' search habits and is beneficial, although the benefit is not as clear as the experiments indicated. In real use, the categories are needed and used in about every fourth search. The usage patterns indicate that categories help when result ranking does not bring relevant results to the top of the result list.

Acknowledgements

The most important support for my work has come from my research group. My closest colleague (literally, sitting in the same room), Anne Aula, had a major role in this work, producing ideas and contributing to the design of the system. Without her, the system would not be what it is today. She also had a major role in carrying out pilot tests when we explored the experimental settings for the studies, as well as in everyday decisions concerning the research. Working together made the whole process a lot of fun; I cannot imagine a better way of making a PhD.

In addition to Anne, other members of the research group have been of great importance to this work. Tomi Heimonen, Harri Siirtola and Natalie Jhaveri have all commented, discussed and helped me with problems, writing papers and evaluating ideas. Their role in the final product is considerable and I am grateful for the collaboration. Johanna Höysniemi is also worth mentioning, although she is not an official member of the group.

Dr. Scott MacKenzie's course on Methods, Models, and Measures was revolutionary for my research. The sound methodology presented and the description of the research process sharpened my understanding of the matter to an extent that made the implementation of the studies clear and strict. In fact, these insights shaped the form of the whole research.

Kari-Jouko Räihä has been the enabling factor in the whole process. Experience from short-term projects gives me a perspective according to which this work would not have been possible without proper funding from the graduate school (UCIT). It enabled persistent and systematic work towards the goal. Kari-Jouko has been a founding member of the graduate school, thus essentially making this all possible.

Stina Boedeker has provided much inspiration and confidence for my work. Poika Isokoski's attitude and dedication to research has provided me with a model that definitely had an effect that was needed for accomplishing the thesis. I would also like to thank Mari, who has pushed me to try harder in our discussions about the research and ways of doing it. She also had a key role in starting my final writing process.

Tampere, 1st of November, 2005

Mika Käki

Contents

1 INTRODUCTION
  1.1 Objective
  1.2 Context
  1.3 Previous Work
  1.4 Method
  1.5 Results
  1.6 Structure of the Thesis
2 ACCESSING WEB INFORMATION
  2.1 Web Searching
  2.2 Search Process
  2.3 Information Foraging Theory
  2.4 Web Search Engine User Interfaces
3 ENHANCING SEARCH RESULT ACCESS
  3.1 Overview
  3.2 Keyword-in-Context Index
  3.3 Visualizing Search Results
  3.4 Query Refinements
  3.5 Categorizing Web Search User Interfaces
4 ENHANCING SEARCH RESULT ACCESS WITH CATEGORIZATION
  4.1 Cluster Hypothesis
  4.2 Categorization Techniques
  4.3 Central Search Result Categorization Systems
  4.4 Related Clustering Systems
  4.5 The Findex System
5 METHODOLOGY
  5.1 Constructive Approach
  5.2 Measuring the Use of Search Interfaces
  5.3 Contributed Measures
  5.4 Experimental Design
  5.5 Tasks
6 STUDIES
  6.1 Overview of the Studies
  6.2 Study I: Experiment of Statistical Categories
  6.3 Study II: Search User Interface Evaluation Measures
  6.4 Study III: The Effect of the Number of Categories
  6.5 Study IV: Longitudinal Study of Findex
  6.6 Study V: Experiment with Context Categories
  6.7 Study VI: Evaluation of the Categorization Algorithms
  6.8 Division of Labor
7 CONCLUSIONS
REFERENCES
APPENDIX

List of publications

This thesis consists of a summary and the following original publications, reproduced here by permission of the publishers.

I. Mika Käki and Anne Aula (2005). Findex: improving search result use through automatic filtering categories. Interacting with Computers, Elsevier, Volume 17, Issue 2.

II. Mika Käki (2004). Proportional search interface usability measures. In Proceedings of NordiCHI 2004 (Tampere, Finland, October 2004). ACM Press.

III. Mika Käki (2005). Optimizing the number of search result categories. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2005 (Portland, USA, April 2005). ACM Press.

IV. Mika Käki (2005). Findex: search result categories help users when document ranking fails. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2005 (Portland, USA, April 2005). ACM Press.

V. Mika Käki (forthcoming). fKWIC: frequency based keyword-in-context index for filtering web search results. Accepted for publication in Journal of the American Society for Information Science and Technology. Wiley.

VI. Mika Käki (2005). Findex: properties of two web search result categorizing algorithms. Accepted for publication in Proceedings of the IADIS International Conference on World Wide Web/Internet (Lisbon, Portugal, October 2005). IADIS Press.

1 Introduction

1.1 OBJECTIVE

The motivation for this study is to enhance end users' opportunities to find meaningful results among Web search engine results. We intend to achieve this by automatically categorizing the search results and by presenting an overview of the results to the user. Two alternative systems have been implemented and found to enhance users' ability to access search results.

Web searching is a ubiquitous, basic way of finding information in the ever-expanding World Wide Web (WWW). Jakob Nielsen (2004) has stated that 88% of navigation sessions are initiated by the use of a search engine. Google (www.google.com) is currently the most popular search engine, indexing over 8 billion Web documents and handling over 200 million queries a day (Google Timeline, 2005). The user interface of Google, and of other popular search engines, still resembles the solutions first introduced in the beginning of the 1990s, when the Web was young and much smaller (e.g., the first release of Lycos in 1994 indexed about 54,000 documents (Mauldin, 1997)).

Although the size of the Web and of Web search engine databases is rapidly increasing, the skills of the users have not changed all that much. Extensive studies of Web searchers' behavior with search engines show that the topics of the searches have changed with the evolution of the Web, but the query formulation skills of the users have not (Spink et al., 2002; Jansen & Spink, 2006). In particular, users routinely submit short queries containing just a few words (on average about 2.5 words).

When we combine these two facts, the motivation for our study becomes evident. Although the search engines use sophisticated techniques for

ranking the search results, it is virtually impossible to return the most relevant document out of 8 billion if the cue consists of only two words. This is especially true when the query words are ambiguous or when the user wants to learn about different sides of a topic. Inspired by this, we ask the following research question: can we enhance users' search performance with new user interface solutions? If so, how can they be achieved and how important are such advances?

In accordance with our research question, we designed user interface prototypes based on the idea of categorizing search results. The solutions were tested mostly in human-computer interaction (HCI) driven studies.

1.2 CONTEXT

In addition to the obvious context of Web searching and current Web search engine user interfaces, this research is connected to various research fields, discussed briefly below.

The primary field of research relevant to the current thesis is human-computer interaction. Human-computer interaction emphasizes the end user's role in the success of a system. Methods from HCI research can be applied in many domains, but software user interfaces were the original focus. In HCI, a solution is considered successful if we can observe measurable improvements in the end user's performance with the system in a particular task. This study is strongly HCI driven and aims to contribute to the knowledge in HCI.

The second important field is information retrieval (IR) or, more broadly, information studies. The roots of IR are in the early days of storing textual data in computer systems. Textual data, as opposed to structured data in databases, poses particular challenges in retrieving information from storage. Exact matching is not desirable in the same sense as with structured data: it would be frustrating to have to type in the title of a book in exactly the correct form in order to find it. The example illustrates the different user needs that are associated with IR systems. The main results from IR studies are at the core of every (Web) search engine today. As the availability and importance of electronically stored information has increased, the information retrieval community has focused on a wider context of using information. For example, information seeking (IS) considers the wider context of information use, including the retrieval of information from databases. This study resembles many IR and IS studies and can provide a contribution for these communities.

The third, partially related field is data mining or knowledge discovery. This is a field that studies ways of automatically extracting useful

information from large databases. In our view, the size of the database is relative to the user's task and resources: if utilizing the information is difficult or too time consuming for the user in a given task, we consider the database large. This implies that we can use automatic knowledge discovery or summarization methods to help the user understand the information. Data mining and knowledge discovery are fields where techniques similar to our automatic categorization are developed and studied. However, we do not aim to contribute to data mining technologies in our study.

Fourthly, natural language processing (NLP) is a field of research whose results we use in our work. NLP can be regarded as a set of computing techniques that aim, in the extreme, to achieve human-like language understanding with computer software. Such techniques include, for example, word stemming and part-of-speech analysis. When categories are computed for textual data, it is common to utilize techniques used or developed in the NLP field.

1.3 PREVIOUS WORK

An early pioneer of automatic overviews for accessing information was the SuperBook prototype (Remde et al., 1987). It was among the first systems where the (meta)data contained in a document was used to automatically create a meaningful overview of it. Although the text was nicely structured (it had headings and sub-headings clearly marked), the idea of automatically producing an overview of the text was important, especially since it was found to be beneficial in a subsequent user study (Egan et al., 1989).

Scatter/Gather (Cutting et al., 1992; Cutting et al., 1993), developed at Xerox PARC, took the idea a step further. The tool enabled browsing a large document collection without explicit search functionality. The solution was based on an organization created by automatic clustering. One of the contributions was a new clustering technique that made it feasible to handle substantial numbers of documents, which had not been possible previously. Later on, Scatter/Gather was used for accessing search results with the same clustering idea (Hearst et al., 1995; Hearst & Pedersen, 1996).

Another pioneering system was presented by Allen, Obry, and Littman (1993), who focused on introducing structure to the result documents of a search. The user interface was built around an interactive dendrogram (a type of tree structure) built by a hierarchical clustering algorithm. The user could select branches from the tree and see the corresponding article titles (search results) in the user interface.

One of the most influential systems in organizing Web search results is Grouper (Zamir & Etzioni, 1999). Grouper uses a clustering algorithm specifically built for organizing Web search results. The system was successful and is extensively cited in the literature. Our approach is close to Grouper, but the categorization algorithm and the user interface are different. In addition, we evaluated our solution in laboratory settings, which gives us more information about the use of such systems.

Whereas the above-mentioned systems use clustering techniques, where similar documents are brought together to form groups, there are also systems that use classification methods. DynaCat (Pratt & Fagan, 2000) applied the Medical Subject Headings (MeSH) classification in retrieving medical documents. Chen and Dumais (2000) used a similar technique in the classification of Web search results. Their evaluation method is partly used in our studies.

1.4 METHOD

This work is based on constructive and empirical research. We have constructed a software program that implements a user interface with automatic categorization facilities. Two different categorization techniques are available through the same user interface. The evaluation is based on 1) laboratory experiments and 2) a longitudinal study.

We conducted three controlled experiments with participants for testing the effects of the proposed user interface on user performance. The controlled environment enabled us to measure accurately the users' interaction with the system. A longitudinal study was used to compensate for the limitations of the controlled experiments. The usefulness of the system in a real situation cannot be fully understood by relying only on laboratory tests. Thus, information on real use was collected in a longitudinal study with 16 participants over a period of two months. Mathematical measures were also employed to characterize the properties of the categorization algorithms. Specifically, the last study followed the example of many information retrieval studies on systems similar to ours: the system was empirically tested, but without human participants.

1.5 RESULTS

The research produced five main results that address different sides of our research questions and confirm findings in previous research:

1. Automatic categorization can be used to enhance user performance in search tasks. We describe two methods for categorizing Web search results and a filtering user interface concept for enhancing the users' task of evaluating the search results.

2. User performance is significantly improved by our techniques. The benefit is about 30-40% faster speed in finding relevant answers, while the number of irrelevant answers is reduced. In addition, we found evidence that users prefer the suggested search user interfaces. These results are based on observations in laboratory studies conducted with both categorization systems.

3. New user interface techniques can have an impact on users' ways of searching. In the longitudinal study, users adopted the new categorizing technology as a part of their search habits and reported benefiting from it. They also reported having changed the way they formulate search queries. Log files showed that the use of categories stayed at a constant level, the categories being used in roughly every fourth search.

4. The two presented categorization algorithms work acceptably and complement each other. In a theoretical test without user participation, one method produced higher coverage and overlap results whereas the other produced higher quality category names. The computational performance of the algorithms was on an acceptable level, although they are not optimized for top performance.

5. The cluster hypothesis is confirmed. Jardine and van Rijsbergen (1971) stated in their so-called cluster hypothesis that relevant documents for a query tend to be similar. According to our results, this seems to hold in the context of clustering Web search results.

1.6 STRUCTURE OF THE THESIS

This thesis consists of a summary and six original articles published in international conferences and journals. The summary will first introduce the reader to the phenomenon of finding information in the Web. This is followed by a discussion and review of related work on helping users find relevant information in the Web environment. Having surveyed various methods for enhancing access to Web search results, we then focus on methods that are based on result categorization. Our approach is based on categorization, and thus this chapter contains the closest references. At the end of the chapter, we introduce our own Findex tool.

Whereas the beginning of the thesis mostly concerns the related work, the latter part of the summary (from the methodology chapter on) describes our approach. The methodology chapter discusses our research approach

and the options we had in selecting the evaluation methods. In the Studies chapter we connect each of the separate publications to the context of this thesis and explain their significance for it. The thesis ends with conclusions drawn from the results. But before going to the conclusions, let us first begin with basic information about searching for information on the Web.

2 Accessing Web Information

2.1 WEB SEARCHING

The World Wide Web has become a widespread and extensive source of information. The actual number of documents available on the Web is impossible to count due to the distributed nature of the Web, but Google reported indexing over 8 billion pages by the end of 2004 (Google Timeline, 2005). Clearly the amount of information is well beyond the comprehension of any information user. Because the number of Web documents and Web sites is so extensive, using a Web search engine is one of the basic activities on the World Wide Web. Almost 90% of Web navigation sessions start with the aid of a search engine (Nielsen, 2004). Web search engines have brought information retrieval systems into the everyday lives of millions of people.

However, Web searching differs from the use of conventional information retrieval systems. The most prevalent difference is the user population. According to a study by Jansen and Pooch (2000), the users of traditional information retrieval systems enter fairly sophisticated and long search queries. Such systems are typically used by professionals (like librarians) who are formally educated and who understand Boolean logic. In contrast, with OPACs (online public access catalogs) and Web search engines, the use of advanced query features or comprehensive terminology is rare. Users make short queries, typically consisting of only one or two words. Both of these latter systems (OPACs and Web search engines) are used by laypersons with varying backgrounds. The users are not experts in information retrieval and may not even know much about the topic they are exploring. Formal categorizations or classification terms are not known to such users and thus, the selection of the query words may be

suboptimal (especially with OPACs). With Web search engines, the user population is even more varied than with OPAC systems. Practically every Web user is a search engine user, and the variety in skills, background knowledge, and interests is enormous.

Spink, Jansen, and their colleagues (Jansen et al., 1998; Jansen et al., 2000; Spink et al., 2001; Spink et al., 2002; Jansen & Spink, 2006) have conducted comprehensive studies of the use of Web search engines by analyzing the query logs of the Lycos Web search engine. The data collections are impressive, covering over one million search sessions from a large population of searchers in multiple samples over the years. The main results of the log analysis are clear: Web users formulate short queries consisting of only a couple of terms, typically one or two. Advanced operators, Boolean operators in particular, are seldom used (on average in less than 10% of the queries). Among the operators, phrase search is the most popular. The topics that are being searched have changed over time, but the query formulation skills and habits remain the same. This is notable, as at the same time the amount of available information has exploded.

The user's next task after formulating and submitting a query is to evaluate and exploit the results. Evaluation refers here to the process of scanning the result listing and deciding on the relevance of the individual results. If a result seems relevant, it is opened for closer inspection. The Lycos search engine logs show that users evaluate only the results contained in the first two result list pages. From these result list pages only a few actual result documents are opened: one query typically ends in opening one or two result documents for further consideration.

These observations have a major impact on the requirements for search engine functionality and user interfaces. They mean that the ranking of the results is a crucial feature. A query may result in hundreds of thousands or even millions of result documents while the user is willing to evaluate only the first ten or twenty of them. In these circumstances, it seems evident that the ranking method will fail from time to time. The observations can also be read differently: the user interfaces of the search engines face a serious challenge in delivering the relevant results to the users.

2.2 SEARCH PROCESS

The everyday experience of Web searching may seem confusing and somewhat chaotic. The actions the user takes may vary from time to time

and depend on the situation. However, we need a model of the search process at an appropriate level in order to understand the meaning of the different actions and artifacts involved in the process. Because Web searching is a special case of general information searching (retrieval), the obvious source for such a model is information retrieval (IR) research.

At the core of traditional IR studies is the process of matching a query against the documents. For example, according to Robertson (1977), the Swets model identifies two steps in the matching process. In the first step, the value of a matching function is computed between the query and each document in the data set. The second step selects the documents with the highest values from the first step. This model emphasizes the role of the information retrieval system in the search process. The matching problem is difficult and thus an understandable and important focus area. However, being so focused on the matching process, this view almost completely eliminates the user. Our research question involves the behavior of the user and thus, this model is not appropriate for our purposes.

During the 1980s, more and more studies started to shift towards the users and the user interaction with information retrieval systems (Saracevic et al., 1988). With the focus on the user of the system, the searching model also changed. A general model of information seeking and retrieving (Saracevic et al., 1988) identifies seven phases in the search process, characterized by the following events: 1) the user has a problem to be solved, 2) the user seeks to resolve the problem by formulating a question, 3) presearch interaction with a searcher, 4) search formulation, 5) search activity and interaction, 6) delivery of the results, and 7) evaluation of the results. This model brings the user into the process and emphasizes the actions taken before the actual use of the information retrieval system. Vakkari (1999), on the other hand, has stressed that information seeking happens in a context where information is needed for a certain purpose (a specific task).

The information seeking approach has also been applied to Web search studies. Choo, Detlor, and Turnbull (2000) developed a new model for Web information seeking that describes different types of search behaviors in terms of scanning modes and search moves. This work does not, however, provide us with more detailed information about the focus of this study: the nature of the interaction with the search engine.

Sutcliffe and Ennis (1998) proposed a cognitive model of users' information searching behavior. According to the model, the four activities in the information searching process are: 1) problem identification, 2) need articulation, 3) query formulation, and 4) results evaluation. The model includes iteration so that unsuccessful queries can

lead to a new problem identification. This is an important point in the Web environment, as about half of the query sessions have been shown to contain more than one query (Spink et al., 2002). However, the emphasis on cognitive processes in the model reduces its utility in our case, where the focus is on the system-user interaction.

Marcia Bates (1989) has proposed a model for online search interfaces that contains, and actually emphasizes, the iterative or progressive nature of the search process. Her notions of browsing and berrypicking reflect the fact that the information need is rarely constant even over one search session. The query results gathered may affect the user's search strategies, query formulations and even the information need. Her model describes information search as an evolving process where the information need is continually shifting and is fulfilled not by one query but by the set of queries produced in this process.

Shneiderman, Byrd, and Croft (1997; 1998) have proposed a search user interface framework that is based on a four-phase model of the search process. This model emphasizes the interaction between the user and the system and is thus consistent with our research question. This is why we have used this model in our study. According to the model, the search process consists of the following phases:

1. Query formulation: the initial phase where the user formulates the information need in terms of a query. The phase also includes the decisions about the information source and the fields of the documents to search.

2. Action: starting the actual system-performed search operation. This may happen implicitly or explicitly depending on the search system. For instance, Microsoft Windows help indices start the search implicitly as the user types in text, while a typical Web search engine requires users to press enter or click a button in order to start the search.

3. Result evaluation: when the search is performed, the results are presented to the user, who needs to evaluate them in order to find the relevant documents. In current search engines, result evaluation is typically facilitated by presenting the results as a ranked list with short document summaries.

4. Query refinement: search is typically an iterative process where the results of one query affect the next queries. In this respect, it is important to make it possible for the users to easily edit and modify the current query.

Our contribution is targeted at facilitating the user's performance in the result evaluation phase, while the other phases are more or less excluded

from the studies. However, all the phases are certainly interconnected, and our research at large (within our research group) has contributed to the other phases as well. For example, we have designed aids for facilitating query refinement and studied users' query formulation skills. These studies are nevertheless beyond the scope of this work.

2.3 INFORMATION FORAGING THEORY

The above search process models evolved from a strictly system-oriented model towards models where the human operator has a bigger role. We can take a further step and look into models of human behavior in the search process. What can be said about the searcher? Which factors and motivations guide the actions in an information searching task? Information foraging theory (Pirolli & Card, 1995, 1999) answers these questions by analyzing the human activities associated with the search process. As the name suggests, the information searching process is compared to food foraging, and the analogies are used widely in the theory.

Information foraging theory consists of three models. The information patch model describes how information is scattered around the environment (physical or virtual) and how information seekers allocate time and effort in order to find relevant information. The information scent model is concerned with the process of identifying valuable information from cues that are available in the environment or information space. Lastly, the information diet model addresses the selection of the actual information items.

For the current work, the most interesting part of information foraging theory is the patch model. The central idea is that the information seeking process starts by first locating a patch of information, an area in the (physical or virtual) space that has a high information concentration. Next, information is gathered within this patch as long as doing so appears to be more efficient than locating a new patch. Note that the theory implies modification of the strategy employed in order to maximize the rate of gaining valuable information. In other words, the information searching process is a constant calculation of cost and benefit. Certainly, this calculation may happen without conscious effort, but it affects the behavior.
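The cost-benefit logic of the patch model can be made concrete with a small sketch. The following Python fragment is our illustrative reading of the patch-leaving rule underlying the theory (the marginal value theorem); it is not code from the thesis or from Pirolli and Card, and all names are hypothetical:

```python
# Illustrative patch-leaving rule (our simplification of the theory): stay in
# the current patch (result list, category) while its instantaneous gain rate
# beats the average rate achievable elsewhere, switching cost included.
def should_leave_patch(marginal_gain_rate, avg_gain_per_patch,
                       avg_time_per_patch, switch_cost):
    # Average environment rate: what a patch yields on average, divided by
    # the time to exploit it plus the cost of reaching it (e.g., formulating
    # a new query or clicking another category).
    environment_rate = avg_gain_per_patch / (avg_time_per_patch + switch_cost)
    return marginal_gain_rate < environment_rate
```

Note how lowering switch_cost raises the environment rate: when new patches are cheap to reach, the forager is predicted to abandon sparse patches sooner, which is exactly the behavioral change discussed below for categorized results.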

Let us consider a simple Web search session as an example. The query formulation can be seen as an initial effort to find the first information patch. As we come across the patch (get the query results), we start to evaluate it. We look for relevant pieces of information and use all available cues to find them. For example, keywords in bold face and the context in which they appear in the result summary can provide the user with a scent of relevant information. If the patch, in our case the list of results, appears to be too sparse in relation to our information need, we start to look for a new information patch. In our example this means formulating a new query. The decision on when this happens is affected by the personal characteristics of the searcher: some searchers find it easy to look for new patches (formulate queries) and some do not. The available tools also affect the decision. If the search space can provide the searcher with easy and efficient tools for finding a new patch, or for finding new relevant results within the current one, it affects the decision on when to switch to a new patch.

In terms of information foraging, our research aims at providing the users with new means of finding new information patches. The patch of information in our context gets a somewhat different meaning than in the previous example. Because the result set is divided into easily accessible clusters, one such cluster can be seen as an information patch. In effect, the result categorization approach provides a user with multiple easily accessible information patches with one query. This is likely to reduce the cost of changing patches, and thus the users can become more demanding in the evaluation of any one patch. Because patch switching is easier, the searchers can concentrate on patches whose information density is high. The authors of information foraging theory have proposed similar categorization-based methods for enhancing (Web) information access: first automatic clustering in the Scatter/Gather system (Pirolli & Card, 1995) and later automatically extracted structures from Web link graphs and users' navigation actions (Pirolli, Pitkow, et al., 1996; Chi et al., 2001; Chen, LaPaugh & Singh, 2002).

2.4 WEB SEARCH ENGINE USER INTERFACES

So far, we have seen that Web searching is a frequent and important part of modern information management. The search process models agree that there is an information need that is transformed into a query string; the information retrieval system compares the query to the documents and presents the best matches to the user. This is the high-level principle of how present search engines work. Let us now take a look at the state of the art of commercial Web search user interfaces and the tools they provide to users to overcome the challenges of Web searching.

We include in our review three major Web search engine user interface types: 1) ranked result list (e.g., Google Search Engine), 2) directory service enhanced result list (e.g., Yahoo! Search Engine), and 3) query refinement suggestions (e.g., Teoma

Search Engine). The list could be complemented with search result visualizations and automatically categorizing search engines; we consider these user interface solutions as emerging and thus they are covered in the next chapter.

Ranked List User Interface

Perhaps the most ubiquitous and most popular type of search engine user interface is the ranked list user interface. It is simple and easy to understand, which probably explains its popularity. The Google search engine is a well-known example of such a user interface.

Figure 1. Google search engine user interface.

Google (Figure 1) is fairly conservative in its user interface design. It relies mostly on its unique PageRank mechanism to bring the most relevant results to the top of the result list, and it works remarkably well. However, it cannot deliver alternative results for queries where the search terms have multiple meanings. Another useful feature in Google is the spell checking of the queries, which brings up suggestions for alternative ways of spelling the query words. This reduces errors in query formulation, but does not solve the problem of ambiguous queries. Another exception to the simple basic user interface is the sponsored links (on the right in Figure 1), which bring up advertisements related to the user's query.

Directory Enhanced User Interface

Yahoo! is a good example of a user interface utilizing a human-moderated directory (Figure 2). The Yahoo! directory consists of a hierarchical categorization of terms and topics and of Web sites and pages assigned to those categories. Because the directory was created by humans, the quality of the categorization is good. One problem with such categories is that the categorization may be unfamiliar to the user, and thus browsing may be difficult at first. The search functionality helps to get started, and the categories associated with the relevant results allow the users to find other relevant documents as well.

Figure 2. Yahoo! search engine user interface.

Yahoo! categories are utilized in the user interface to give further contextual information about the results. First, a few relevant categories are listed, through which the user can start browsing their contents. Second, each individual result item is accompanied by a category link that describes the category to which the result belongs.

User Interfaces with Query Refinement Aids

Query refinement aids aim to help users express their information need in a more precise form. Typically, the initial query must be entered without assistance, but the subsequent queries can be affected by the refinement aids. Teoma is a good example of a user interface with query refinement aids (Figure 3). Teoma puts the query refinement suggestions into a significant role in the user interface.

Figure 3. Teoma search engine user interface.

Teoma clusters Web pages into so-called communities, meaning Web pages that are about, or are closely related to, the same subject. The user interface presents these communities under a Refine functionality that allows users to rephrase their query with a particular topic covered by a community. For the user, the selection of such a refinement appears as a new search with more focused results. In addition to the refinements, communities are used to look for experts in the area. These results are presented to the user in a separate area of the user interface, giving the functionality a central role in the user experience.

Yahoo! is another example of this kind of functionality. In regular searches, Yahoo! has a so-called Also try feature. This feature presents a set of queries entered by other searchers that are similar to the current query. Typically this feature can help users narrow down their search queries and become aware of the other meanings of the query words (in ambiguous queries). When the user selects one of the proposed query formulations, the current query is replaced with the new one and the new query is executed.

To conclude the present state of Web information access, we can see that Web searching is a frequent but challenging activity for a huge user population and thus well worth studying. Multiple user interface solutions have been proposed for the application domain, most of which are decidedly simple. The solutions work relatively well, given the vast variation in the user population. The sound, simple and well-tested

solutions of current user interfaces are something that we wish to utilize in our solution as well. Users are accustomed to simple ranked result listings, and thus we must consider their benefits thoroughly in new solutions.

The theoretical work on searching sets a framework for our new solutions. The search process models imply a separate result evaluation phase, which is the focus of the current research. The model has also influenced the test setups used in our studies. In addition, information foraging theory provides us with factors affecting the actions of the searchers. These views have inspired us to look for ways of helping the users to find new and meaningful information patches.

3 Enhancing Search Result Access

3.1 OVERVIEW

There is a large number of user interface techniques that can be used to enhance users' access to search results. Initial query formulation is difficult, but refining queries and evaluating query results have both received a lot of attention. The actual techniques for enhancing users' performance vary widely, ranging from simple text layout to complex visualizations.

Work on enhancing search result access started as early as the 1950s. Back then, the output devices were rather limited (printers and simple character-based displays), which set strict limitations on the solutions available. With the development of display technologies, the original problems have changed, but the fundamental questions of how the search results should be presented and what kind of tools the users need in order to access them remain.

In the following, we will take a look at early work on keyword-in-context (KWIC) indices, newer work on result visualization techniques, query refinement suggestions, and current categorizing Web search user interfaces. All of these techniques are related to our solution and have inspired our work. A more thorough survey of the techniques and different approaches can be found in Marti Hearst's (1999) chapter User Interfaces and Visualization in the book Modern Information Retrieval.

3.2 KEYWORD-IN-CONTEXT INDEX

The keyword-in-context (KWIC) index is a type of concordance (word index) designed for presenting the results of a search query. The keyword-in-context index dates back to 1959, when Hans Peter Luhn published an article about it. This discussion is based on Salton's description of the technique (Salton, 1989).

The idea in KWIC is to provide users with a meaningful and easy-to-scan query result listing in a limited environment such as those available in the late 1950s. The system is character based and it displays one result per line of text, typically representing the result by its title (such as a book title in a library system). The keyword (query term) used in the query is central in displaying the results. The titles are printed in the KWIC index so that the instances of the keyword are aligned (see Figure 4). Scanning the list of hits is assumed to be fast and easy, as the user can easily see the context in which the keyword appears in each of the result items.

     graphic scheme based on abstract and index cards
       tic information using abstract and index publications
                             abstract archive of alcohol lireratu
           publishing modern abstract bulletins
      company pharmaceutical abstract bulletin
              a punched card abstract file on solid state and tra
                         the abstract of the technical report
              relation of an abstract to its original
     from journal article to abstract

Figure 4. An example of a keyword-in-context (KWIC) index (picture after Salton (1989)).

The typical way of presenting results in modern Web search engines can be seen as one type of KWIC. Web search results are typically represented with a short summary text that contains the title of the document and a so-called query-biased text summary. Such a summary is built so that short parts of the text (snippets) containing the query keywords are selected from the document (see Figure 5). The approach has been shown to be advantageous for the searchers (Tombros & Sanderson, 1998; White et al., 2001).

    Advantages of query biased summaries in information retrieval
    Advantages of query biased summaries in information retrieval ...
    Anastasios Tombros, Mark Sanderson ...
    portal.acm.org/citation.cfm?id=290947&coll=portal&dl=acm - More from this site

Figure 5. An example of a query-biased search result summary from the Yahoo! search engine. The query was "query biased".
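The alignment idea of Figure 4 is simple to implement. The sketch below is our minimal, hypothetical reconstruction (not Luhn's or Salton's actual system): it right-aligns the text preceding the first occurrence of the keyword so that all keyword instances start in the same column.

```python
# Minimal KWIC-style listing (illustrative): align result titles on the
# first occurrence of the keyword, as in Figure 4.
def kwic_index(titles, keyword, width=28):
    lines = []
    for title in titles:
        words = title.split()
        lowered = [w.lower() for w in words]
        if keyword in lowered:
            pos = lowered.index(keyword)
            left = " ".join(words[:pos])
            right = " ".join(words[pos:])
            # Right-align the left context so every keyword instance starts
            # in the same column, making the hit list fast to scan.
            lines.append(f"{left[-width:]:>{width}} {right}")
    return "\n".join(lines)

print(kwic_index(
    ["The abstract of the technical report",
     "Relation of an abstract to its original",
     "From journal article to abstract"],
    "abstract"))
```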

Typically the keywords are highlighted (e.g., with a bold typeface) in these query-biased result listings. The purpose and effect of the bolding are comparable to the term alignment in KWIC: bolded keywords direct the visual scanning process so that keywords are easily found, and thus the user is able to determine the context in which they appear in the results.

3.3 VISUALIZING SEARCH RESULTS

One possible way of improving the user interfaces of search engines is to visualize the results. Visualization of an extensive amount of information sounds appealing, and visualizations are assumed to have the power of delivering a clear insight into a problem. However, the actual status of search result visualization is different. Unfortunately, visualizing unstructured textual information is anything but a clear case. We can find a considerable set of result visualization techniques in the literature, but none of them has met the great expectations people have had of them. We can distinguish two major approaches to visualizing search results (Zamir, 1998):

1. visualizations based on properties of individual documents, and
2. visualizations of the inter-document relationships.

When document properties are visualized, the options are to utilize the query terms and visualize their distribution, or to use known attributes of the documents such as publication date, author, and type of document (e.g., book, article, magazine). Envision (Fox et al., 1993; Heath et al., 1995; Nowell et al., 1996) is a user interface for a library system that employs the latter approach. In Envision, the results for a query are displayed as icons in a matrix (Figure 6). The user can control the attributes visualized by different visual variables; for example, the year of publication can determine the position of the icon on the X-axis and the icon can represent the type of the publication. Note that Envision relies heavily on the structured data that is available in library systems; a similar approach is much harder to implement in the Web environment. The GRIDL prototype employs a similar approach, but adds so-called hieraxes, categorical and hierarchical axes, to the visualization (Shneiderman et al., 1999).

Figure 6. Envision user interface for search results.

Perhaps the most widely known example of visualizations based on query term distribution is TileBars by Marti Hearst (1995). TileBars represents each document as a rectangle whose length is proportional to the length of the document (Figure 7). The rectangle is broken into a number of bins whose darkness represents the density of particular query facet occurrences within the corresponding section of the document, and it is divided into rows that stand for each of the query facets.

Figure 7. A screenshot of the TileBars user interface.

Veerasamy and Belkin (1996) proposed a system where documents are represented by vertical columns and query terms are located in rows. In the intersections there are bars whose length visualizes the weight of that term in the corresponding document. In a user study, the system was found to be beneficial, but no convincing evidence for its utility, or for the visualization approach in general, was found.
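The bin computation behind a TileBars-style display is straightforward. The following sketch is our illustration of the idea (a hypothetical function, not Hearst's implementation):

```python
# Illustrative TileBars-style row: split a document into n_bins segments and
# count occurrences of one query facet per segment; counts map to darkness.
def tilebar_row(document_words, facet_terms, n_bins=10):
    facet = {t.lower() for t in facet_terms}
    bins = [0] * n_bins
    segment = max(1, len(document_words) // n_bins)
    for i, word in enumerate(document_words):
        if word.lower() in facet:
            bins[min(i // segment, n_bins - 1)] += 1
    return bins  # one such row per query facet, stacked into the bar
```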

InfoCrystal (Spoerri 1994a, 1994b) is a way of visualizing the query term distribution between documents rather than within them (Figure 8). The idea in InfoCrystal is to take the query terms and display the number of matches that each Boolean (e.g., "and") combination of them corresponds to. The matches are represented with icons whose form and position indicate what kind of combination is under consideration. For example, a rectangle has two ends and thus indicates the number of matches shared by the two facing concepts under the given Boolean operator.

Figure 8. Original form of InfoCrystal. The icons have the following meanings: solid circles = query terms, hollow circles = instances of terms alone, rectangles = instances of two-term intersections, hollow triangle = instance of three-term intersections (picture after Spoerri 1994a).

InfoCrystal aims to provide an understanding of which parts of the query are potentially too restrictive or not discriminatory enough. However, the resulting visualizations can get rather complicated and hard to understand. The visualization idea of InfoCrystal was applied to meta searching in the MetaCrystal prototype (Spoerri 2004a, 2004b). MetaCrystal visualizes the overlap of the search results from multiple search engines: the different icons in the crystal visualization represent results from different search engines, and the user is able to focus on documents found by certain search engines.

Visualizing the whole search result collection and the relationships between individual items has been another popular approach. Kartoo (Kartoo Search Engine) is a publicly available tool that performs Web searches and visualizes the results as a concept map (Figure 9). The idea of seeing the whole result set at one glance is appealing, but in practice the map is difficult to interpret.

Figure 9. Kartoo search engine user interface displaying the results for the query "jaguar".

Cat-a-Cone (Hearst & Karadi, 1997) is a research prototype that visualizes the search results in a three-dimensional cone tree. The structure for the tree comes from a predefined classification system (MeSH) that is also used to classify the result documents. The tree shows all the topics in

which the results belong and aims to provide easy access to them. The essential feature of the Cat-a-Cone system is its ability to display multiple selected categories of the category hierarchy simultaneously; the three-dimensional representation of the category tree makes this possible.

The Self-Organizing Map (SOM) is a general technique for mapping multidimensional data into a two-dimensional space. Self-Organizing Maps are implemented using artificial neural networks, and the map generation phase can be seen as the training phase of the network. The resulting map places closely related data items (such as Web documents or search results) next to each other. Lin, Soergel, and Marchionini (1991) were among the first to apply SOMs in information retrieval and to handling document sets. They proposed a user interface where a document collection was presented as a map with the most dominant concepts visible and with borders between the major regions. Later, the applicability of SOMs to large databases was demonstrated with the WEBSOM system (Figure 10) (Kohonen, 1997; Kaski et al., 1998). WEBSOM was intended to enable interacting with and comprehending a large document collection (such as static Web pages or Usenet news articles). The approach is expected to reduce the problem of selecting appropriate query terms, as the query formulation step is essentially eliminated from the search process. Zamir (1998) suggested that the approach could be applied to search results in addition to static document collections.

Figure 10. A screenshot of the WEBSOM demonstration.
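To make the technique concrete, here is a minimal SOM training loop. It is our illustrative sketch under simplifying assumptions (documents arrive as fixed-length term vectors, the map is a small square grid); it is not the WEBSOM implementation:

```python
# Minimal self-organizing map sketch (illustrative, not WEBSOM). Training
# pulls the prototype of the best-matching cell and its grid neighbors toward
# each input vector, so similar documents end up in nearby cells.
import numpy as np

def train_som(doc_vecs, grid=(10, 10), epochs=20, lr0=0.5, radius0=5.0):
    rng = np.random.default_rng(0)
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    weights = rng.random((len(coords), doc_vecs.shape[1]))
    steps, t = epochs * len(doc_vecs), 0
    for _ in range(epochs):
        for x in doc_vecs:
            decay = 1.0 - t / steps
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)     # grid distance to BMU
            h = np.exp(-d2 / (2 * (1.0 + radius0 * decay) ** 2))  # neighborhood
            weights += (lr0 * decay) * h[:, None] * (x - weights)
            t += 1
    return weights, coords

def cell_of(doc_vec, weights, coords):
    """Grid cell where a document lands; nearby cells hold similar documents."""
    return tuple(coords[np.argmin(((weights - doc_vec) ** 2).sum(axis=1))])
```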

This brief overview of (Web) search result visualization techniques is not intended to be comprehensive; result visualization remains a field of active research.

3.4 QUERY REFINEMENTS

The goal of result visualization is to help users understand the retrieved result set. The approach may not help in a case where the result set does not contain any relevant documents, although such an understanding is an important step. Aids for refining the query formulation can be helpful in such a situation.

In the information retrieval community, most work relating to query refinement has been done on automatic query expansion. Automatic query expansion refers to a process where the search system automatically adds synonyms or other closely related terms to the query in order to improve the recall or the precision of the query. Our approach emphasizes the active role of the user, and thus automatic query refinement suggestions are of greater interest here.

Vélez, Weiss, Sheldon, and Gifford (1997) proposed a fast query refinement algorithm. The system, called RMAP, uses a pre-computed corpus in order to speed up the process of computing the refinement suggestions. This is important in Internet searches, where the number of searches is vast. In the evaluation of the system, it was found to be an

attractive approach, especially when processing time is critical. However, the evaluation did not include end users, and thus it is not known whether the approach increases users' search effectiveness.

Belkin and his colleagues (Belkin et al., 1999) followed a different path in their evaluation by presenting two term-suggestion systems to 36 volunteer searchers. The first system (RF, relevance feedback) was based on explicit feedback from the users. The second system (LCA, local context analysis) produced term suggestions automatically. The comparison produced the first statistically significant difference measured in the TREC interactive track: the LCA system was considered to demand less user effort.

Both of these first two systems can be considered traditional information retrieval systems that may have a somewhat limited user population. However, a similar automatic term suggestion approach has been employed in the Web environment as well. Bruza, McArthur, and Dennis (Dennis et al., 1998; Bruza et al., 2000) have proposed a query refinement system called Hyper-index (Bruza & Dennis, 1997) and a Web search user interface called the Hyper-index Browser (HiB). The Hyper-index Browser requires the users to always refine their queries before they are presented with the search results. In practice, the users are presented with a list of possible query refinements after query submission rather than a list of results. When one of the refinement suggestions is selected, the query is actually executed and the results are presented to the users. The Hyper-index Browser was evaluated in user studies where it was compared first to Excite and then to Yahoo! and Google. The studies produced evidence that the approach is beneficial for ambiguous queries. It was also noted that with HiB, the users spent the least amount of time (relatively) in evaluating the actual result documents. This indicates that the time spent in refining the query (making a selection from the lists of suggestions) can be gained back in the result evaluation phase.

Query term suggestions are also used in commercial search engines, and Peter Anick has studied their actual use. The research contains a system prototype, Paraphrase Search Assistant (Anick & Tipirneni, 1999), and a log-based study of the use of the AltaVista query term suggestion system called Prisma. This fairly extensive (over 15,000 search sessions) study concluded that the query suggestions were as effective as manually reformulated queries when they were used. However, the vast majority of the query reformulations were made manually (Anick, 2003).
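None of the cited systems publish their exact algorithms, but a simple co-occurrence heuristic conveys the flavor of automatic term suggestion. The sketch below is entirely our own illustration (not RMAP, LCA, or Prisma): it proposes the terms that occur most often in the result snippets alongside the query terms.

```python
# Illustrative refinement-suggestion heuristic: rank non-query terms by how
# often they occur in the result snippets; the top terms become suggestions.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for"}  # toy list

def refinement_suggestions(snippets, query_terms, k=5):
    query = {t.lower() for t in query_terms}
    counts = Counter()
    for snippet in snippets:
        for word in snippet.lower().split():
            word = word.strip(".,;:!?()'\"")
            if word and word not in query and word not in STOPWORDS:
                counts[word] += 1
    return [term for term, _ in counts.most_common(k)]
```

For an ambiguous query such as jaguar, the top co-occurring terms (e.g., car, cat) can each be offered as an additional query term, narrowing the next search in the way Teoma's Refine and Yahoo!'s Also try features do.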

3.5 CATEGORIZING WEB SEARCH USER INTERFACES

Another group of search engines utilizes clustering methods. These search engines include Vivísimo (Figure 11, Vivísimo Search Engine), WiseNut (Figure 12, WiseNut Search Engine), and iBoogie (Figure 13, iBoogie Search Engine). All of these engines take a set of results and compute categories for them online. The resulting categorization is hierarchical in all of these search engines. The categories are presented in the user interface as a list (either at the top of the page or to the left of the results). The exact details of the categorization algorithms have not been published, but their functionality appears similar to the user. The user selects one of the categories and the user interface displays the documents that belong to that category. The major difference from the previous query refinement aids is the fact that selecting a category does not execute a new search, but alters the way the result listing is displayed.

From the research perspective, there are a few problems with these commercial systems. First, the actual categorization algorithms are not public, and thus the understanding cannot be shared in the research community. Second, there are no published evaluations of the usefulness of the categorizations. Addressing this shortcoming is one of our main contributions.

Figure 11. Vivísimo search engine user interface with the categories for the query "jaguar".

Figure 12. WiseNut search engine user interface with the categories on the top for query jaguar.

Figure 13. iBoogie search engine user interface with the categories for query jaguar.

All of these commercial systems employ a hierarchical categorization, although WiseNut uses it only in certain categories. We acknowledge the benefits of hierarchical categorization schemes and appreciate them in certain situations. However, we assume that hierarchical categorizations could be too elaborate for everyday searching, where the search topics, and thus the concept hierarchies, could be hard to understand or could require too much attention. Search result evaluation is a tedious task and we believe that users may be impatient in going through the results. In that case, it could be better not to present them with a comprehensive hierarchy but rather with a more compact overview. Based on these assumptions, we have selected a flat categorization approach for our prototype.


4 Enhancing Search Result Access with Categorization

In the previous chapter, we presented multiple ways to improve search result access, including ways to enhance result summaries, result visualizations, query reformulation aids, and result categorization. In this work, we have chosen the categorization approach. Next, we will discuss the relevant theories and the previous research in the field. We will first discuss the cluster hypothesis, which is a theory describing the rationale behind the categorization approach. It will be followed by a brief historical review of how the technology has evolved, what the most prevalent categorization technologies are, and how they work. Next, we will go through various research prototypes that have employed categorization techniques in accessing search results. Finally, our approach and our research prototype, Findex, are presented.

4.1 CLUSTER HYPOTHESIS

In 1971, Jardine and van Rijsbergen published the so-called cluster hypothesis in the context of information retrieval. The cluster hypothesis states that relevant documents for a query tend to be similar to each other. This is an important foundation for our approach, as it is the underlying motivation for clustering the results. In the 1970s the clustering approach was typically employed not for accessing the query results but in the search process itself. If the document database can be divided into similar clusters, the searching efficiency can be improved and thus more elaborate methods can be used in the process. In practice, this means that the search process would

consist of two phases. The first phase would locate the relevant clusters and the second would retrieve the actual documents. The second phase needs to consider only the documents in the identified clusters, and thus the number of documents is smaller than in the whole database. This makes the use of computationally intensive retrieval (or matching) algorithms feasible.

Voorhees (1985) proposed a new method for testing the cluster hypothesis on a given document collection and noted that cluster-based searches performed better than sequential searches in smaller collections. This is a good reason for clustering search result lists, which are typically much smaller than full document collections. This is supported by the results from a study by Hearst and Pedersen (1996) where the hypothesis was tested with the Scatter/Gather system. They added a new assumption that clustering should honor the different retrieval contexts and thus the clustering should be done in the context of a query. Both of the conditions (small collection and query-dependent clustering) can be met in result set clustering.

When focusing on the performance of the end user in evaluating the search results, the basic idea of the cluster hypothesis still holds. When similar search results are clustered together in the user interface, the workload of the user is reduced. Ideally, the user needs to first locate the relevant clusters and then evaluate the documents within those clusters. Again, the number of documents to be evaluated is reduced, as the number of documents in the relevant cluster is smaller than in the whole result set.

4.2 CATEGORIZATION TECHNIQUES

The literature describes a vast number of technologies used in making meaningful categories of textual documents. Before we present an overview of the techniques, we define the terminology used in the discussion, as the terminology used in the literature is not always consistent.

Terminology

The terminology used in discussing document or text categorization varies considerably from source to source. Many terms, such as categorization, clustering, grouping, and classifying, can all be used to refer to the same high-level concept of organizing a set of textual documents into a number of smaller groups. In this work, the term categorization refers to both the process of making such an organization and the outcome of the process. There are multiple automatic techniques for computing categories. The techniques can be divided into two groups, and we use the following terms for them:

a) clustering techniques bring similar documents together, and

b) classification techniques assign documents to predefined classes.

Multiple implementation techniques have been proposed for both approaches. In addition to clustering and classification, keyword extraction is a technology that can be used in document categorization. Keyword extraction is a process where descriptive words or phrases are found in a textual document. When a categorization has been built with a given technique, the resulting categorization is said to contain categories. The terms class and cluster refer to categories in the context of classification and clustering techniques, respectively. Group can also be used to refer to categories.

Document Representation and Similarity Measures

Many categorization techniques, especially clustering techniques, are based on similarity measures between the documents. Document similarity can be measured in a number of ways, but the most common ones are based on representing documents as weighted term vectors, the so-called vector space model (Salton, 1989). In the vector space model, each document is represented by a vector of words where the importance of each word is represented by a number, e.g., its frequency. As some words like a, an, the, or and are frequent, such words are often removed from the vector. This can be accomplished by using so-called stopword lists that enumerate the words to be discarded.

A more sophisticated method is to use more elaborate weight measures than simple word frequency. TFIDF (Term Frequency - Inverse Document Frequency) is a statistical technique widely used for this (e.g., Salton, 1989, p. 280). There are multiple formulas for computing the measure, but the basic principle is the same. The measure compares the frequency of a term in the current document (tf) to the number of documents containing the term in the whole collection (df). Because the document collection frequency is used as an inverted multiplier, terms that are common in the collection receive a small weight regardless of their frequency in the current document. In contrast, words that appear frequently in the current document, but are rare in the collection, receive a higher score and are considered descriptive.

As the documents are represented by weighted vectors, their similarity (or the distance between them) can be computed using methods from vector algebra. Again, there are multiple ways of applying them, but the so-called cosine measure is perhaps the most widely used. The cosine measure is defined as the dot product of two document vectors (d1, d2) divided by the product of the lengths of the vectors:

cosine(d1, d2) = (d1 · d2) / (|d1| |d2|).
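To make these definitions concrete, the following minimal sketch computes TFIDF-weighted vectors for a toy collection of tokenized result texts and measures their cosine similarity. It illustrates the general technique only, not the exact weighting formula of any particular system, and the toy documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TFIDF-weighted term vectors for a list of tokenized documents."""
    n = len(docs)
    # df: the number of documents each term occurs in
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # tf * log(n/df): terms frequent in this document but rare in the
        # collection receive the highest weights
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(v1, v2):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

docs = [["jaguar", "car", "engine"],
        ["jaguar", "cat", "habitat"],
        ["car", "engine", "review"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```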

The length of the document vectors is typically normalized to 1 (unit vector), and thus the cosine measure becomes simply the dot product of the vectors.

Clustering Techniques

An overwhelming selection of clustering methods can be found in the literature. Not all of them are relevant for clustering textual documents, but even those that are, are too numerous to present here. Berkhin (2002) has presented a comprehensive survey of the clustering methods. We will concentrate here on the basic clustering techniques that are most closely related to the problem of categorizing search results.

Clustering techniques can be divided into those that produce a hierarchical clustering (hierarchical methods) and those that produce an un-nested partitioning of the documents (partitioning methods). There are several techniques for implementing both. The following partial classification (after Berkhin, 2002) illustrates the relationships of the methods we will describe. Our description of the methods follows the order of the classification.

Partial classification of clustering methods:
  Hierarchical methods
    Agglomerative clustering
    Divisive clustering
  Partitioning methods
    K-means

Agglomerative clustering is an iterative bottom-up process where the two most similar clusters are merged together at each step. The process starts with individual data points as clusters, and in the end the whole collection is contained in one root cluster. The resulting data structure is a tree called a dendrogram. As the building process suggests, each node in the dendrogram branches into two until the leaf nodes are reached. The quality of the clusters depends on the similarity measures between the documents (and clusters), and several variations of the method described above have been employed.

Divisive clustering is the other way of building hierarchical clusters. It uses a top-down process where the document collection is first considered as one cluster that is then divided into smaller sub-clusters until they contain only individual data points. The factors that affect the resulting clusters include the algorithm that decides which cluster to split and the criterion for assigning the documents to the new clusters. In the process, document similarity measures are needed, for example, in determining which cluster has the most variation in it. Of the two, agglomerative algorithms are the more common.
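As an illustration of the bottom-up process, the following sketch performs average-link agglomerative clustering over term vectors like those above. For brevity it stops at a target number of clusters instead of building the full dendrogram, and the merge criterion is a simplified illustration rather than any specific published variant.

```python
import math

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def agglomerate(vectors, target):
    """Merge the two most similar clusters until `target` clusters remain.
    Clusters are lists of document indices; cluster similarity is the
    average pairwise cosine similarity (average link)."""
    clusters = [[i] for i in range(len(vectors))]
    def sim(a, b):
        pairs = [(i, j) for i in a for j in b]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
    while len(clusters) > target:
        # find the most similar pair of clusters and merge them
        x, y = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[x] += clusters.pop(y)
    return clusters

vecs = [{"jaguar": 1.0, "car": 0.8}, {"jaguar": 0.9, "cat": 1.2},
        {"car": 1.1, "engine": 0.7}, {"cat": 1.0, "habitat": 0.9}]
print(agglomerate(vecs, 2))  # -> [[0, 2], [1, 3]]
```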

There are also a number of clustering techniques that produce a flat categorization. We will describe the best-known and most widely used technique in document clustering, the K-means algorithm. The basic solution in K-means is to first select a target number, K, of cluster centroids (central points in the document space) around which the documents are then clustered. The selection of a fixed number of centroids is also the source of the name of the technique. In principle, each document is assigned to the closest cluster (represented by a centroid). The first interesting issue in the algorithm is the selection of the centroids. Multiple approaches have been tried, including random selection. The second issue is the selection of the documents to be associated with a given centroid. This reduces to calculating distances between vectors, because both the centroids and the documents are typically represented as vectors. As stated, documents are usually associated with the closest centroid. Once the initial clustering has been obtained with this procedure, the result can be refined in an iterative process. This can be done by recalculating the centroids based on the documents contained in the clusters and then reassigning the documents to the new centroids. Another optimization option is to find an alternative centroid and to analyze the effects it would have on the clustering. Note that the K-means algorithm can be used to build a hierarchical clustering by applying the algorithm recursively to the clusters once computed.

All the techniques discussed above may produce good quality clusters if all the parameters and factors are adjusted successfully. However, naming the clusters is a major problem. As the clusters are generated on the fly from a set of documents, describing them automatically has proved to be extremely difficult (Popescul and Ungar, 2000). Typical solutions employ representative words (frequently occurring or strongly weighted). However, the resulting word lists are typically hard to understand. This naming problem is a major shortcoming of these classical clustering methods, and it is one of the motivations for our approach.

Classification Techniques

The main difference between clustering and classification methods is the source of the resulting structure. Where clustering creates the structure in the process, classification methods rely on predefined, typically man-made topic structures, which are usually hierarchies (such as MeSH for medical documents or the Yahoo! directory for Web content). The classification process is based on a set of target classes that are characterized by a set of features. The items to be classified are characterized by similar features, and the classification algorithm must decide which class a data point belongs to.
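As a toy illustration of this decision step, the sketch below implements one of the simplest options, a nearest-centroid classifier: each predefined class is represented by the mean vector of its training documents, and a new document is assigned to the class with the most similar centroid. The class names and training vectors are invented for the example.

```python
import math

def centroid(vectors):
    """Mean of a set of term-weight vectors."""
    terms = {t for v in vectors for t in v}
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(doc_vec, class_centroids):
    """Assign a document to the predefined class with the nearest centroid."""
    return max(class_centroids, key=lambda c: cosine(doc_vec, class_centroids[c]))

# Hypothetical training vectors for two predefined classes
training = {
    "Computers": [{"cpu": 1.0, "software": 0.8}, {"software": 1.1, "linux": 0.9}],
    "Automobiles": [{"engine": 1.0, "wheel": 0.7}, {"engine": 0.9, "diesel": 1.2}],
}
centroids = {name: centroid(docs) for name, docs in training.items()}
print(classify({"software": 1.0, "cpu": 0.5}, centroids))  # -> Computers
```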

The literature offers a number of techniques for this purpose, ranging from simple nearest-neighbor and multivariate regression models to various Bayesian models and neural networks. The bottom line is that the algorithm must place the data item in one of the available classes.

Machine learning techniques are often applied in classification. Learning algorithms are used to optimize the classification process by training the system with a correctly classified training set. Such a set could come, for example, from a Web directory service (such as Yahoo!). Because the correct classification of each data point is known in the training set, the parameters affecting the classification can be adjusted in the process.

One typical feature of the classification methods is that they do not directly support hierarchical classifications, although the target classification scheme is often hierarchical. If nothing is done about this, the structure is flattened by the classification technology. There are, however, systems where the complete hierarchy of the target classification scheme is employed. One can build a hierarchy of classifiers so that at the first level a coarse decision is made (e.g., distinguishing computer articles from automobile articles). The next classifier depends on the decision made by the previous one as the process proceeds from one level to the next. Such an approach is used, for example, by Dumais and Chen (2000) and by Koller and Sahami (1997).

For the end user, the most notable consequence of using a classification technique is the quality of the category names. Because the documents are classified into an existing taxonomy, the class names are also predefined and can be carefully selected to optimally convey the intended meaning. Thus the naming problem associated with the clustering methods is avoided. However, the classification scheme may be too coarse for the given data set, resulting in a categorization where all data items are placed in one or two classes. Such a categorization does not reduce the number of evaluated documents enough to realize the promise of the cluster hypothesis.

Keyword Extraction

Keyword extraction is an area of research that is closely related to clustering. It is especially important for our categorization method, which is based on word and phrase frequencies. In contrast to our solution, keyword extraction typically aims at automatically extracting keyphrases for describing the contents of a document. Such keywords or keyphrases are often required by academic publications and their automatic extraction would be useful in many ways. Another popular application area of keyword extraction is in query refinement suggestion systems, where the extracted keywords are used to give the user more options for reformulating the query.

According to Jones and Paynter (2002), Turney (2000) was the first to apply learning methods to the keyword extraction task. Barker and Cornacchia (2000) developed a system that utilized the extractor component developed by Turney, but added a new way of selecting the final keywords from the extracted candidates. The selection was based on noun phrases and their frequency. The extraction process scanned the text word by word and looked for sequences of nouns and adjectives ending with a noun. For identifying the part of speech, an online dictionary was used rather than a part-of-speech tagger.

The digital library project in New Zealand has produced a keyword extraction algorithm called KEA (Jones & Paynter, 2002). In addition to simply extracting descriptive words for documents, it has also been applied to facilitate search result access in a Web-based library system. Another example of its application is a document clustering system with a special stress on naming the clusters. In both tasks, the automatically extracted keywords were shown to be effective (Jones & Mahoui, 2000). Thus, KEA is closely related to our research.

KEA is based on a supervised learning approach. The system is trained with a set of documents whose accurate keywords are known (e.g., provided by the authors). Each document is transformed into text, and stemmed candidate phrases with a length of one to four words are formed from the text. For each candidate, two measures are computed: 1) document distance, which describes how far into the document the first instance of the phrase occurs, and 2) the TFIDF measure of its frequency. Based on these measures, a Naïve Bayes classifier is constructed. When the classifier is ready, keyword candidates and their measures are computed in the same way for new documents, and the classifier is used to select the most promising candidates.

KEA is used in multiple user interfaces in different roles. In the simplest case, it can be used in library search to describe the retrieved documents with automatically extracted keywords if author-provided keywords are not available. In KeyPhind (Gutwin et al., 1999) keywords, or keyphrases, are used for refining a query. When the user enters a query, keyphrases that contain the query term(s) are listed. Upon selection of a keyphrase, the related keyphrases and the documents containing the keyphrase are displayed in the user interface. Phrasier (Jones, 1999) utilizes the automatic keyphrases in a complete browsing, querying, and reading environment for a digital library. Keyphrases are the basis for automatically creating hyperlinks between the documents that share a keyphrase. The user interface also displays the related documents. Kniles (Jones & Paynter, 1999) is a simpler version of Phrasier for the Web environment with essentially the same features.
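To recap the feature computation described above, the following sketch computes the two candidate measures for every one-to-four-word phrase of a tokenized document. Stemming, stopword handling, and the Naïve Bayes classifier that KEA trains on top of these features are omitted, so this is only an illustration of the feature step, not of KEA itself.

```python
import math

def kea_style_features(doc, collection):
    """For each 1-4 word candidate phrase in `doc`, compute
    1) document distance: relative position of the first occurrence, and
    2) TFIDF: phrase frequency weighted by rarity in the collection."""
    features = {}
    for length in range(1, 5):
        for i in range(len(doc) - length + 1):
            phrase = tuple(doc[i:i + length])
            if phrase in features:
                continue  # already seen: keep the first occurrence
            distance = i / len(doc)
            tf = sum(1 for j in range(len(doc) - length + 1)
                     if tuple(doc[j:j + length]) == phrase)
            df = sum(1 for d in collection
                     if any(tuple(d[k:k + length]) == phrase
                            for k in range(len(d) - length + 1)))
            tfidf = tf * math.log(len(collection) / df)
            features[" ".join(phrase)] = (distance, tfidf)
    return features

doc = "automatic keyphrase extraction for digital library documents".split()
other = "searching documents in a digital library".split()
best = sorted(kea_style_features(doc, [doc, other]).items(),
              key=lambda kv: -kv[1][1])[:3]
print(best)
```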

In addition to the digital library environment, KEA has also been tested in the Web environment. Jones, Jones and Deo (2004) presented a system for PDA devices that used KEA-produced keyphrases as search result surrogates on a small screen. The solution was compared to displaying document titles, but no performance differences were observed in the study.

4.3 CENTRAL SEARCH RESULT CATEGORIZATION SYSTEMS

Now that we know the basics of clustering textual documents, we can direct our attention to actual systems where the techniques are utilized. We will first take a look at the search result categorizing systems that have had a notable impact on the research in the field. Later, we will make a more extensive survey of the related systems.

Systems where categorization was an explicit part of the end user's experience started to emerge at the beginning of the 1990s. Scatter/Gather (Cutting et al., 1992) was one of the first systems where automatic clustering was tightly integrated into the user interface. In the late 1990s, Grouper (Zamir & Etzioni, 1999) was introduced with a clear focus on categorizing search results. Around the same time, classification-based systems were also introduced. These include the SWISH prototype by Chen and Dumais (2000) and the DynaCat system by Pratt and Fagan (2000). These four systems have been discussed widely in the HCI community and are closely related to our work. In addition, they exemplify the two major approaches: clustering and classification. The prototypes use two types of data sources: digital libraries (with structured data and complete documents) and Web searches (data limited to summary texts only). This separation is important because the data type affects the techniques used. Figure 14 summarizes the technical approaches and the data sources of the four central systems.

Figure 14. Techniques and data sources of the central prototypes.

                  Digital Library    Web Search Results
Clustering        Scatter/Gather     Grouper
Classification    DynaCat            SWISH

Scatter/Gather

The Scatter/Gather user interface is based on an interactively and iteratively built document structure. In the beginning, the whole database

would be divided into a number of clusters (scatter). These clusters were presented to the user by showing a list of representative words and a short list of sample document titles contained in the cluster. The user then selects one or more of these clusters to focus on the interesting topics. The selected clusters form a new base set (gather), which is then divided (scatter) into clusters again. The user controls the clustering process by selecting the clusters of interest and thus forms a tailored hierarchical categorization of the document set.

The original idea of Scatter/Gather was to function as a browsing tool for large document collections, but quite soon the idea was employed in accessing search results. From the user's perspective, this does not change the situation much. The difference is that the initial document collection is formed by a search query (the result set), but from there on, the interaction with the system is the same. However, the performance issue is reduced as the initial document collection is considerably smaller than in the original case.

In addition to being among the first systems to actually try and demonstrate search result clustering in practice, research on Scatter/Gather also produced one of the first user studies on such systems (Pirolli, Schank, et al., 1996). As with many experimental technologies, the initial results from the user studies were not a major success. In fact, Scatter/Gather appeared to be both slower and less accurate when compared to a standard information retrieval system based on similarity search. Despite the slightly disappointing results in simple document retrieval, Scatter/Gather was seen to effectively communicate the topical structure of the document collection.

A follow-up study focused on the usefulness of the clustering approach (Hearst & Pedersen, 1996). The overall performance of the system would not be as important as the success of using the categories for a given task. This approach produced results: the study concluded that the users found and selected the most relevant clusters. This is crucial for the cluster hypothesis, and the results provided confirmation that the clustering approach may be beneficial in search result access.

Grouper

After Scatter/Gather, the idea of categorizing search results seemed to fade and it was not a popular research topic, but the area was rediscovered in the late 1990s. At that time, the Web had grown to vast proportions and finding information in it had become harder. Web searching was an important motivation for the beginning of the next wave of search result categorization systems.

Zamir and Etzioni (1998) were the first to demonstrate the feasibility of this approach in the Web environment. They compared multiple

clustering techniques (Zamir et al., 1997) and finally presented their own clustering method, Suffix Tree Clustering (STC). STC is based on shared words and phrases in the documents and the technique was designed especially for Web searching. The authors call the method clustering, but it is also close to term extraction. For example, the algorithm does not use the classical document similarity measures that are distinctive for clustering algorithms.

Zamir and Etzioni (1999) developed a Web search engine user interface based on the idea. The clustering search engine user interface was called Grouper. Grouper presents the search results in five categories. Each category shows representative words and a few sample document titles, much as the Scatter/Gather system did. The interaction, however, is simpler compared to Scatter/Gather. In Grouper, the user simply selects one of the categories and the system will display the documents, whereas in Scatter/Gather cluster selection presents the user with new clusters. However, Grouper forces the user to make one category selection, because initially only the clusters are displayed. Thus, clusters are emphasized and the design introduces an additional interaction step to the search process, namely the selection of the category. This may not be necessary, because the top results in the search engine rank order could satisfy the user's need. The design may be good for evaluating the use of the categories, but it may decrease user performance in real situations.

Grouper has not been formally tested in an experiment, but it was evaluated in a log study. The results showed that the users followed more documents in a session and that the time needed to access multiple documents was shorter than when using a conventional user interface. These are positive results and indicate that result clustering is worth exploring further.

SWISH

The SWISH prototype employs another approach, as Dumais and Chen (2000) implemented a hierarchical classifier based on Support Vector Machines (SVM). The classifier was trained with LookSmart Web Directory documents that are organized into a hierarchy of categories by professional human editors. After the training phase, the classifier assigns new documents to the best matching categories.

The original user interface of SWISH organized the list of documents by category names, using them as headings. The user could collapse or expand the categories, and each document was presented by a one-line title underneath the category. A short document summary was available as a hover text on demand. The document title was a link to the actual document and the user interface contained separate buttons for opening

subcategories and displaying more documents within a category (Chen & Dumais, 2000).

SWISH was evaluated with 18 users, comparing it to the typical rank order list user interface (Chen & Dumais, 2000). The test setup was one of the sources of inspiration for our own studies, as the authors used predefined queries. The study concluded that the category approach is faster and that there are fewer give-up situations compared to the ranked list user interface. In addition, the users showed positive attitudes towards the proposed system.

In a later study (Dumais et al., 2001), seven user interface designs were compared. The conditions included three ranked list layouts and four automatic category-based layouts. The user interfaces varied in how they showed the result summaries and category names. The results indicate that the category-based user interfaces are faster than the list-based ones, and the best performance was achieved in the condition where the document titles were displayed in the context of the categories. This means that a proper context is needed in order to understand the meaning of a category.

DynaCat

DynaCat is a search system in the medical domain intended for patients searching for information about various medical issues (Pratt & Fagan, 2000). Like SWISH, DynaCat uses the classification approach, but it utilizes multiple models in the process. It models the user's query according to predefined query types and uses a large domain-specific terminology model (Medical Subject Headings, MeSH) to classify the retrieved documents. Thus, the category selection is influenced by both the user-defined query and the retrieved documents.

The user interface of DynaCat resembles our solution. It lists selectable categories on the left side of the user interface. In contrast to ours, the categorization is hierarchical. DynaCat was evaluated in a user study with 15 participants where it was compared to the ranked results user interface. The results of the study show that the participants found more answers in the given time and that they were more satisfied with the results when using DynaCat.

In summary, we can see similar results in the user studies of these four prototypes (Scatter/Gather, Grouper, SWISH, and DynaCat). All except Scatter/Gather demonstrated faster and more enjoyable user performance in search tasks. The test setups in the studies were similar: the proposed categorization system was compared to a ranked list of results, the de facto standard. These results are in line with the cluster hypothesis in the context of search result access. This means that result categorization is able to bring together relevant documents and help the user in finding the needed information.

4.4 RELATED CLUSTERING SYSTEMS

In addition to the previously discussed research prototypes, there is a large number of systems that are closely related to the current topic. Table 2 lists a selection of such systems, including the most influential systems presented above. The list is not comprehensive, but it gives an overview of the systems, their evolution in the research, the technologies used, and the user interface solutions employed.

The Technology column gives the main technique used in the corresponding prototype to create the categorization. The most common techniques include variants of clustering and classification as well as term extraction methods. In addition to these, this sample contains a few systems that employ Web link structure analysis.

The Type of the categorization or organization refers to the structure that is produced in the organization process and displayed to the user. We assume that the resulting structure may have an important role in the understandability and usefulness of the system for the end users. The structures are either hierarchical (H) or flat (F).

The type of user interface (UI) is of great interest to us, because it has such a central position in the end user experience. We speculate that the utility of even the most brilliant categorization system may be damaged by a suboptimal user interface design. Categorizing user interfaces is not a simple task, but we attempt it. Table 1 summarizes our categorization principles. We distinguish three types of user interfaces (plus their combination) and two target environments: the Web and graphical user interfaces (GUI). Each combination of a user interface type and a target environment is represented by one of the letter codes listed in the table.

Finally, the Data source column tells what kind of data source is used in the prototype. The most important ones include search results from a search engine, complete (full text) search result documents (search docs), and rich data from a digital library (DL).

Table 1. Legend of the user interface types used in Table 2.

UI Type: Overview+Detail UI
Description: Displays simultaneously an overview of the data and details of the selected item.
Codes: O-W (Web), O-G (GUI)

UI Type: Browsing UI
Description: A structure is used to navigate in the data collection. The whole structure and/or the data items are not simultaneously visible to the user.
Codes: B-W (Web), B-G (GUI)

UI Type: Visualizing UI
Description: A visual representation (as opposed to textual) is used to select interesting data items.
Codes: V-W (Web), V-G (GUI)

UI Type: Multiple techniques
Description: A combination of two or more of the above.
Codes: M-W (Web), M-G (GUI)

Table 2. Research prototypes using categorization in accessing search results.

No. | System name | Reference | Technology | Type | UI | Data source

Systems discussed above:
1. Scatter/Gather | Cutting et al. 1992 | clustering | H | OB-G | DL
2. Grouper | Zamir and Etzioni 1999 | clustering | F | B-W | search results
3. SWISH | Chen and Dumais 2000 | classification | H | B-W | search results
4. DynaCat | Pratt and Fagan 2000 | classification | H | O-G | DL (MEDLINE)

Closely related systems:
5. Adaptive Search | Roussinov and Chen 2001 | clustering | F | B-W | search results
6. AMIT | Wittenburg and Sigman 1997 | link structure | H | V-G | web walker
7. Carrot | Weiss and Stefanowski 2003 | clustering | F | B-W | search results
8. Cat-a-Cone | Hearst and Karadi 1997 | classification | H | V-G | DL (MEDLINE)
9. (CGRU) | Chekuri et al. | classification | F | B-W? | search docs
10. Cha-Cha | Chen et al. | link structure | H | O-W | intranet search
11. CI / Meta Spider | Chau et al. | extraction | F | M-G | search results
12. Dart | Cho and Myaeng 2000 | clustering | F | V-W | search results
13. DisCover | Kummamuru et al. | clustering | H | O-W | search results
14. HighLight | Wu et al. | extraction | H | O-W | search results
15. HuddleSearch | Osdin et al. | clustering | H | OB-W | search results
16. Info Navigator | Carey et al. | clustering + extraction | F/H | V-G | search docs
17. Interactive Dendrogram | Allen et al. | clustering | H | V-G | DL
18. isearch | Chen and Chue 2005 | clustering + link structure | H | O-W | search docs
19. J-Walker | Cui and Zaïane 2001 | classification | H | O-W | search results
20. (KS) | Kules and Shneiderman 2005 | classification + clustering | H | O-W | search results
21. (LC) | Leouski and Croft 1996 | clustering | H | B-G | search docs
22. PHIND | Edgar et al. | extraction | H | B-W | DL
23. Retriever | Jiang et al. | clustering | F | B-W | search results
24. SONIA | Sahami et al. | clustering + classification + extraction | F | N/A | DL / search results
25. WebACE | Boley et al. | clustering | H | O-W | browse history
26. WebCutter | Maarek et al. | link structure | H | V-G | Guru / Lotus Domino
27. WebRat | Granitzer et al. | clustering | F | V-W | search results
28. (ZHCMM) | Zeng et al. | extraction | F | O-W | search results

Figure 15. Distribution of the categorization techniques in Table 2 (cluster, classify, extract, hybrid, other).

Figure 16. Distribution of the categorization types in Table 2 (hierarchic, flat, other).

To summarize Table 2, we can see that clustering (Figure 15) is the most popular technique in this sample (we assume that this gives a good picture of the overall situation). In addition, the resulting structure is typically hierarchical (Figure 16).

4.5 THE FINDEX SYSTEM

To address our research question on how to enhance search result access, we have implemented two categorization algorithms for Web search results and designed a filtering user interface for the task. The main idea is to present an overview of the results with automatically computed categories so that the different topics contained in the results become visible and easily accessible. Result access is enhanced by the filtering user interface that allows users to select items in the category overview and see the results belonging to the selected category.

Categorization Methods

We have designed and implemented two categorization algorithms. The first, which we call the statistical method, aimed at simplicity, while the second is a redesign aiming at better descriptiveness of the category names. The second design was inspired by the experiences gained from the first one and is called (keyword) context categories, or fkwic for short.

Both categorization systems are based on the word and phrase frequencies found in the search results. In principle, the most frequent words and phrases are used as the categories. The category computation is based on the textual data available in search engine result listings, i.e., result titles and summaries (snippets). The number of results used in the computation can be adjusted, but we have found about 150 results to be a good compromise between thoroughness, simplicity, and computational efficiency.

The categorization process starts with a computation of so-called category candidates. In the statistical method, the candidates include all individual words and up to one sentence long multi-word phrases found in the result text. Each candidate is associated with a frequency figure, which describes

the number of results the candidate is found in, not the actual word or phrase count. Separately listed stopwords (e.g., articles, pronouns, and the like) are excluded in the candidate extraction process so that they do not appear as candidates or inside candidate phrases. In the candidate computation, the word order of the phrases is meaningful and only phrases with the same word order are treated as equal.

The context categorization computes the candidates slightly differently. Context categories are required to contain at least one query term, and thus all the candidates are phrases (at least two words long). The requirement to contain a query term reduces the number of valid candidates significantly. Otherwise, the candidate computation is similar to the statistical method.

In the early versions of the algorithms we employed a word stemmer for discarding the word endings that cause unwanted variation in words: without stemming, simple inflections such as the singular and plural forms of a word (e.g., car and cars) make otherwise equal words different. However, the stemmer used (by Martin Porter) caused confusion for the end users in some cases, and thus we switched to a simpler non-exact string matching algorithm (for details, see Paper VI). The effect of the algorithm is similar to that of stemming algorithms.

After the candidate extraction, the actual categories are selected. This phase contributes substantially to the quality of the categories. The process includes merging the candidates that are considered to be the same and removing the candidates that are sub-phrases of one another. The selection process is slightly different in the two categorization methods, and the details can be found in Paper IV. The main point in the selection process is to select highly descriptive (understandable for humans) categories while ensuring appropriate coverage of the results. In the end, the n most frequent candidates are selected to be displayed to the user. Our study (Paper III) indicates that this number should be between 10 and 20; in our experiments, we used 15 categories.

The final categories are carefully selected words or phrases from the search results. These categories contain all the results where the word or phrase occurs. Due to the merging of the candidates, the words in a phrase category may appear in a different order in the results, or the words may not be strictly sequential but may have stopwords in between. Apart from such exceptions, the mapping between the category name and the associated results is straightforward. In fact, the categories can be seen as ready-made free text search queries for the result set.
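To make the process concrete, the following much-simplified sketch extracts word and phrase candidates from result snippets, counts the number of results each occurs in, and keeps the most frequent ones. It omits the non-exact string matching and most of the merging and selection rules (see Papers IV and VI), caps phrase length at four words instead of one sentence, and uses an invented stopword list.

```python
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is"}

def candidates(results, max_len=4):
    """Extract word and phrase candidates from result titles/snippets and
    count the number of results each candidate occurs in."""
    counts = Counter()
    for text in results:
        words = [w for w in text.lower().split() if w not in STOPWORDS]
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)  # frequency = number of results, not raw count
    return counts

def select_categories(counts, n=15):
    """Pick the n most frequent candidates, dropping sub-phrases of an
    equally frequent longer phrase (a simplification of the real rules)."""
    ranked = sorted(counts, key=lambda c: (-counts[c], -len(c)))
    chosen = []
    for cand in ranked:
        # crude substring test for brevity
        if any(cand in longer and counts[cand] == counts[longer]
               for longer in chosen):
            continue
        chosen.append(cand)
        if len(chosen) == n:
            break
    return chosen

snippets = ["Jaguar cars for sale", "The jaguar is a big cat",
            "Jaguar cars dealer network", "Big cat habitat of the jaguar"]
print(select_categories(candidates(snippets), n=5))
```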

User Interface

The user interface design follows the popular overview and details model (Card et al., 1999) and is divided into two panels. The left panel contains the list of categories (overview) and the right panel shows the actual results. The user interface has been implemented both as a standalone graphical user interface application (Figure 17) and as a Web service (Figure 18). In both cases, the basic structure and functionality of the user interface are the same. The graphical application was used extensively in our experiments (it allows comprehensive logging) and the Web user interface was targeted for our longitudinal study to make the service easily accessible.

Figure 17. Findex standalone user interface with the built-in All results category selected.

The selected user interface model was derived from our design approach, where the aim is to provide new ways of accessing search results. This means that the new features are added to the current user interfaces so that the users can take advantage of their existing knowledge of them. This also allows users to ignore the new features when desired. To enable this, our interface has a built-in All results category, which is automatically selected after each search. When the All results category is selected, the conventional list of ranked results is displayed. This makes the user interface appear like any other Web search engine.

When a category is selected, the result listing is filtered to show only those results belonging to the category. Our categorization schemes are straightforward: a result belongs to a category if it contains the name of the category in its result summary text. This is illustrated to the user by highlighting the corresponding text in the result listing (Figure 18).
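In essence, category selection is thus a filtering operation over the already retrieved results. A minimal sketch, ignoring the word-order and stopword tolerances mentioned above (the data layout is invented for the example):

```python
def filter_results(results, category):
    """Return the results belonging to the selected category; the
    built-in 'All results' category restores the ranked listing."""
    if category == "All results":
        return results
    return [r for r in results if category.lower() in r["summary"].lower()]

def highlight(summary, category):
    """Bracket the category phrase in the summary text to show why
    the result belongs to the selected category."""
    i = summary.lower().find(category.lower())
    if i < 0:
        return summary
    end = i + len(category)
    return summary[:i] + "[" + summary[i:end] + "]" + summary[end:]

results = [{"title": "Jaguar cars", "summary": "New Jaguar cars for sale"},
           {"title": "Big cats", "summary": "The jaguar is a big cat"}]
for r in filter_results(results, "jaguar cars"):
    print(highlight(r["summary"], "jaguar cars"))  # New [Jaguar cars] for sale
```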

Figure 18. Findex web user interface. The larger image shows statistical categories, the smaller context categories. Highlighting shows the relationship between a result and the selected category.

In the latest Web user interface, the two categorization methods are visible and controllable by the user (this was not the case in our longitudinal study). At the top of the category box on the left (Figure 18), there are two tabs labeled Categories and Contexts for statistical and context categories, respectively. By selecting a tab, the user can control the type of categories displayed in the overview.

Differences from Related Systems

Because enhancing search result access by categorization has been under extensive research, the obvious question arises: what is the contribution of the present research? The differences, and thus the contributions, of this study are threefold. One aspect is the actual algorithms used to categorize the results, another is the combination of the algorithms and the user interface, and the third is the evaluation approach. The following summarizes our contribution in relation to the other systems:

1. The two categorizing algorithms we present are novel and designed especially for Web search engine results consisting of short text summaries. The algorithms are based on a term (phrase) extraction technique similar to that used in the STC algorithm by Zamir and Etzioni

(1998), and no document similarity measures are used. In contrast to Zamir and Etzioni, we do not merge clusters based on the documents they contain, but based on the similarity of the extracted phrases. This appears to produce understandable results.

2. The filtering user interface in combination with the type of categorizing algorithms we use is new. The Grouper user interface forced the users first to select a category, and the results were displayed only on the next page. Our user interface treats categories as an added convenience that is provided in addition to the results. This allows the users to take advantage of result ordering when it works, but gives them additional means of exploring the results when needed. DynaCat was similar in this respect, but its categorization method and data source were different.

3. The selection of the evaluation methods is unique, giving new insight into how search result categorization is used. Our approach combined experiments and longitudinal studies with the same system. In related research, theoretical or mathematical evaluations are common, but our methods involve end users closely in the evaluation process.

In summary, our algorithms are unique, but not radically different from previous work. The user interface idea has also been presented by others, but the combination of the two and the thorough evaluation with end users constitute the contribution of this research.

5 Methodology

5.1 CONSTRUCTIVE APPROACH

Studying human-computer interaction often involves the construction of a software artifact that implements an interesting design idea. The artifact demonstrates the potential of the idea and makes its evaluation possible. The evaluation enables us to gather valuable information about the solution. Building better ways of accessing Web search results is an activity where such constructive research is valuable. It is impossible to evaluate the importance and the functionality of design ideas without a working prototype. For example, it is easy to imagine a system with a perfect categorization scheme for intuitive representation of the information. However, it is difficult to build such a system, which is why such systems do not exist. A constructive approach makes the elimination of infeasible ideas clear and concrete.

The implementation of this study contained multiple stages of constructive work. The construction and the experiences from the evaluations taught us valuable lessons that were incorporated into subsequent implementation phases. In our methodology, each major constructive phase was followed by an evaluation to enable such feedback.

5.2 MEASURING THE USE OF SEARCH INTERFACES

The selection of evaluation techniques for search interfaces is not a straightforward matter, as one can follow at least the examples of HCI and IR research. The choice of methods depends on the research question. We

will now discuss the properties of the methods found in those fields and justify our selection.

The core measures in HCI are stated in the ISO 9241-11 (1998) standard. They are effectiveness, efficiency, and subjective satisfaction. Effectiveness measures the completeness and thoroughness of task completion. In the case of information search it means, for example, the number of found (relevant) documents and their coverage in relation to the given task. Efficiency, on the other hand, describes the value of the results achieved in relation to the resources used (such as time or money). In information search tasks, this typically means the time used for accomplishing the task or the number of result documents opened for evaluation. Subjective satisfaction is usually evaluated with questionnaires eliciting users' opinions about the system.

The HCI measures are well suited to our situation, where we are interested in the users' performance, but they are so general that they cannot be measured directly. There is a lot of room for interpretation as to what the measures actually mean, and the evaluator must decide what the individual measures (effectiveness, efficiency, and subjective satisfaction) mean in the application domain being studied.

The approach for evaluating search systems in the information retrieval community is different. The most fundamental measures are recall and precision. Recall describes the thoroughness of a search: it states the proportion of relevant documents retrieved out of all the relevant documents in the collection. Precision, on the other hand, denotes the proportion of relevant documents within the result set. The greater the precision, the fewer irrelevant documents there are to distract the user in the result evaluation. Both measures are stated as percentages.

These measures of recall and precision do not depend on the user's interaction with the system: they are calculated based solely on the result set the search system returns. This is appropriate when the properties of the retrieval engine are studied, but if we are interested in how the user can evaluate the result listing, we need a different approach. Another issue with the recall measure is that it is hard to calculate in the Web environment. For computing recall for a query, the total number of relevant documents in the collection (the Web) should be known. This is feasible only in limited collections such as those provided in TREC (Text Retrieval Conference).
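In symbols, if R is the set of relevant documents in the collection and A is the set of retrieved (or, below, user-selected) documents, the two measures are:

recall = |R ∩ A| / |R|
precision = |R ∩ A| / |A|.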

Veerasamy and his colleagues (Veerasamy & Belkin, 1996; Veerasamy & Heikes, 1997) used slightly modified measures in a study on a graphical user interface for accessing search results. The measures are related to recall and precision, but they are based on the document selections made by the users. The measures are called interactive recall and interactive precision. Interactive recall indicates the percentage of the relevant documents in the result set that were selected by the user. Interactive precision indicates the proportion of relevant documents within the user-selected documents.

We adopted these measures in our experiments. In the studies, we refer to them simply as recall and precision, as the meaning of the measures is obvious in the context. In addition, the word interactive seems inappropriate in the context of HCI studies: interactivity is such a central concept in HCI that using it to describe a measure seems confusing.

5.3 CONTRIBUTED MEASURES

In addition to these well-established measures, we developed a few measures of our own for the studies. In the first study, it became apparent that measures typically used in HCI studies, like time and success, may not be enough in studying search user interfaces. To alleviate the problem, we developed three new measures for evaluating the interaction with, and the usability of, search user interfaces. The measures are specially targeted at studying the result evaluation phase of the search process and they are presented in Paper II.

The first suggested measure is search speed, which is measured in answers per minute. The measure is analogous to physical speed, like kilometers per hour. The second measure is closely related and adds a quality dimension. Qualified search speed states how fast the user can find results of a certain relevance level, for example, how many relevant documents the user is able to gather in a minute. One important property of these measures is that they are proportional, which makes comparing results slightly easier.

The third new measure is immediate accuracy, which captures the success in typical Web search tasks. Web searchers commonly select only one or two results for each query (Spink et al., 2002). In such a situation, the limiting resource is not time, but rather the number of result selections. What matters is how many result selections (clicks) the user needs in order to find the first relevant document for the information need. This is exactly what immediate accuracy measures: it states the percentage of cases where at least one relevant document is found by the nth document selection.

These three new measures were utilized in appropriate places throughout the individual studies of the thesis. They address the problem noted earlier about the lack of concreteness in HCI measures.
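As an illustration, the following sketch computes the three measures from logged data; the data layout (lists of per-selection relevance judgments) is invented for the example.

```python
def search_speed(answers_found, minutes):
    """Search speed: answers per minute."""
    return answers_found / minutes

def qualified_search_speed(selections, minutes, relevance):
    """Qualified search speed: per-minute rate of answers at a given
    relevance level; `selections` is a list of relevance labels."""
    return sum(1 for r in selections if r == relevance) / minutes

def immediate_accuracy(sessions, n):
    """Percentage of sessions in which at least one relevant document
    was found by the n-th result selection; a session is a list of
    booleans marking each selection as relevant or not."""
    hits = sum(1 for s in sessions if any(s[:n]))
    return 100.0 * hits / len(sessions)

sessions = [[False, True, False], [True], [False, False, False]]
print(immediate_accuracy(sessions, 2))  # 66.7: two of three sessions
```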

5.4 EXPERIMENTAL DESIGN

The measures for evaluating the success of a design are of great importance, but the experimental design of the evaluation is not self-evident either. In our case, the independent variable is clear: the user interface. Although we had two categorization algorithms available, each experiment provided the participants with only one. Thus the user interface and the categorization algorithm were treated as one experimental variable. In each experiment the independent variable had two values: the suggested user interface and the baseline user interface. As the baseline user interface, we used an imitation of the Google interface that displayed Google results in the original order, ten results per result page.

In addition to the independent variable, the actual experimental situation and its constraints play a major role. To obtain reliable results we tested multiple experimental settings. First, we aimed to maximize external validity by emphasizing the naturalness of the situation. We simply provided the participants with search tasks and let them do the searches as they wished. We controlled the tasks, user interfaces, and topical knowledge (using students from a particular class and tasks related to the topics of the class), but not the search behavior. We treated task completion times as dependent measures and logged the selected results. Such a test setup did not yield any interesting information about the phenomenon that we were interested in (accessing the search results). By looking at the collected data, we saw that the participants did not utilize the new user interface features (categories) and that most of the time was spent evaluating the actual documents accessed through the search result list. This was undesirable for our purposes and compromised the validity of the results by introducing a lot of noise into the data.

In the second step, we added more control to the setup. We addressed the problems by not allowing the participants to open the result documents and by requiring them to use the categories in the category condition. The latter was achieved simply by disabling the automatic selection of the built-in All results category. Normally, the selection of this category makes the category system appear almost like the normal ranked result list, because all the results are immediately shown to the user. However, the exciting situation of being in an experiment (although participants were explicitly told that the user interface was the target of the study, not the participants) was likely to cause the participants of our first test to follow the familiar way of accomplishing the task, that is, using the ranked results. An exciting or stressful situation may impair human performance and cause so-called tunnel vision, referring to the narrowing of the useful field of view (UFOV) (Matthews et al., 2000; Ware, 2004, p. 147).

With these refinements we conducted another pilot study. Again we saw that the level of control was still insufficient. The collected data contained such wide variation that it was not possible to measure the effects created by the different user interfaces in the result evaluation phase of the search process. By looking at the data we concluded that the variation was caused by the differing query formulation skills of the participants.

The third approach was adopted from the literature: the queries for each task were predefined by the experimenters (Chen & Dumais, 2000). In information search tasks, this is a fairly radical solution, but the focus of our research allowed it. Because the focus was on understanding the effects of the user interface on the result evaluation phase, controlling the query formulation phase did not invalidate the measurements. This setup allowed us to measure the effects of the variation in the user interface properly. The actual tasks and the associated predefined queries can be seen in Appendix 1.

We considered this issue from the point of view of internal and external validity. Increasing the control in the test situation increases the internal validity of the setup at the expense of external validity. Because we increased control only in the phases of the process not included in the phenomena of interest, we concluded that the external validity was not compromised too much.

5.5 TASKS

The early pilot tests were based on fact-finding tasks where it was enough to find one document that contained the answer to the question. In the course of pilot testing it became evident that such a task type may be a source of misleading results: it is rare that result categorization helps users in fact-finding tasks. As the cluster hypothesis suggests, categories bring similar documents together and thus provide the users with a more comprehensive set of results on the desired topic. Because this is the main area of contribution of our proposed solution, the type of tasks should reflect this fact.

To alleviate the issue, we reformulated the search tasks by requiring users to collect as many documents as possible for a given task. This aims to mimic a certain type of search that users engage in regularly on the Web. Web search types have been classified by Broder (2002) and by Rose and Levinson (2004). According to their taxonomies, multiple results are often helpful in informational (in particular, undirected informational) searches. Rose and Levinson give an example of an undirected informational query: color blindness. Such a query aims to cover a broad topic (a phenomenon) and multiple result documents can help users in achieving an understanding of it.

In normal settings, Web searchers are not simply searching for as many documents as they can for a task. There is a fine balance between thoroughness and the time spent in searching the documents. There are many factors affecting this balance, which we cannot properly control. To simulate the balance, we asked the participants in a pilot study to carry out the tasks as fast as possible and with the thoroughness they felt appropriate. The inclusion of subjective judgment turned out to be a mistake: participants favored thoroughness excessively over time. In practice, they would evaluate all 150 results that we use for categorization when searching for the relevant answers. This is clearly not normal behavior, as other studies report that users consider fewer than 30 results per search in about 80% of the searches (Jansen et al., 2000; Hoelscher, 1998). We concluded that the somewhat artificial experiment situation affected the participants' performance and encouraged them to carry out the tasks with exceptional thoroughness.

To compensate for this, we chose to impose a time limit on the tasks to simulate normal behavior. A one-minute limit was pilot tested and found appropriate: it allows moderate thoroughness while being short enough to keep the participants from being overly thorough in the task. Our one-minute time limit is supported by the figures reported by Aula and Nordhausen (forthcoming). Their figures indicate that Web searchers use about 1.5 minutes for evaluating the result listing of a query. Completing a whole search task took 5.5 minutes in their study, and it consisted of multiple queries and the evaluation of the actual result pages (57% of the time). Our time limit is shorter, but our experimental setting, focusing only on the result evaluation, compensates for this. In a normal situation (such as that in Aula and Nordhausen's study), the user's attention must shift from evaluating the result listing to evaluating the result documents and back, but in our tests this did not happen. This reduces the time required for evaluating the results in our experiments. After the fact, we can see that our participants saw on average 3.6 result pages (with ten results per page) while using the reference user interface. This is consistent with the 30 evaluated results reported earlier (Jansen et al., 2000; Hoelscher, 1998).

6 Studies

6.1 OVERVIEW OF THE STUDIES

Figure 19 illustrates the research process and displays the temporal relationships between the phases. The starting point for the studies is the search framework that enables us to formulate queries, execute them, and categorize the results. Implementing such a framework was the first major constructive part of the research and produced the Findex search user interface. The development of the first categorization algorithm was a part of this work.

The first experimental study was designed to evaluate the effectiveness and usefulness of the statistical categorization approach. Because the existing measuring practices for evaluating search user interfaces were somewhat limited, we developed new measures. The results of the first experiment were used in testing the new measures, along with results found in the literature.

As the first experiment indicated the utility of our categorization approach, we looked deeper into the system. In the next phase, we studied the effect of the number of categories presented to the user and learned that relatively few categories yield better performance.

The next step was to address the question of the external validity of the studies. The initial studies were carried out in a laboratory, and little was known about the use of the system in real settings. This was addressed by a longitudinal study in which 16 participants were allowed to use the system for an extended period of time. Before the study, a new Web-based interface for Findex was implemented.

The experiences from the work with the first categorization scheme gave us valuable insights. We noted that good-quality categories (category names) tend to contain a query word. This observation was then implemented in a working prototype, with inspiration from keyword-in-context (KWIC) indices. After this construction phase, the new solution was integrated into the Findex user interface and evaluated in an experiment.

The final study was concerned with the properties of the two categorization algorithms. It described the details of the algorithms for future development and presented various results on their performance.

[Figure 19. The main phases of the research process. Ovals denote construction and rectangles evaluation. The phases: Construction of Findex framework; Evaluation of statistical categories (I); Evaluation measures (II); Number of categories (III); Construction of Findex Web UI; Longitudinal study (IV); Construction of context categories; Evaluation of context categories (V); Evaluation of the algorithms (VI).]

6.2 STUDY I: EXPERIMENT OF STATISTICAL CATEGORIES

Reference

Mika Käki and Anne Aula (2005). Findex: improving search result use through automatic filtering categories. Interacting with Computers. Elsevier, Volume 17, Issue 2. (Paper I, page 85)

Objective

The aim of the first study was to contribute directly to the main issue of the thesis: enhancing search result access. The testing phase was preceded by the design and implementation of the first categorization scheme, which aimed to help users access the search results. The first categorization scheme (the statistical algorithm) was initially based simply on word frequencies. This simple approach was strongly motivated by observations that previous clustering systems had appeared incomprehensible to end users, who are unaware of the underlying technology. The first approach of using single words only was found to be too restrictive: although single words may be descriptive, the inclusion of multi-word category names (phrases) seemed appropriate. Because the selection logic is the same for words and phrases, the addition did not complicate the system much.
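To make the statistical scheme concrete, the following sketch shows the core idea in Python. It is a minimal illustration under stated assumptions, not the actual Findex implementation (whose details are given in Paper VI): count the words and phrases occurring in the result snippets, discard candidates built around stop words, and offer the most frequent ones as categories. The stop-word list and snippet format are assumptions made for the example.

```python
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on", "is"}

def candidate_phrases(snippet, max_len=3):
    """Yield word n-grams (n = 1..max_len) from one result snippet."""
    words = re.findall(r"\w+", snippet.lower())
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Drop candidates that start or end with a stop word.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            yield " ".join(gram)

def statistical_categories(snippets, k=15):
    """Return the k most frequent words/phrases as category names.

    Each candidate is counted at most once per snippet, so a single
    verbose result cannot dominate the category list.
    """
    counts = Counter()
    for snippet in snippets:
        counts.update(set(candidate_phrases(snippet)))
    return [phrase for phrase, _ in counts.most_common(k)]
```

Because the same frequency logic covers both single words and phrases, extending the scheme from words to phrases indeed adds little complexity, as noted above.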

One important design decision was the selection of the user interface model. A two-piece user interface that shows an overview on the left and the contents of the selected item on the right is a widely used solution. Overview-and-details user interfaces have been popular in the research literature, and they have been shown to be beneficial for users. In addition, this type of user interface allows users to take advantage of the ranked result listing when it is profitable. Our aim was to provide extra tools for interacting with the search results, so this solution was a good match with the objectives.

The question for the first experiment was whether the automatic categories are beneficial for the end user. It played an important role for the entire research project: the result was an important indication that our approach worked and that it was worth exploring further. The answer was sought via an experiment in which the new category solution was compared to the ranked list (de facto standard) approach. We recruited 20 participants for the controlled study, which was carried out in a laboratory environment.

Results and Discussion

The results indicate the success of the selected user interface and categorization scheme. The participants were able to locate the relevant results up to 40% faster with the new user interface. In addition, participants were 21% more accurate (in terms of relevant results), and they showed positive attitudes towards the proposed system.

The results were positive for our research. They showed that we were on the right track in the pursuit of making searching easier, so the first and most important conclusion was to carry on studying the techniques. In addition, the performance benefit was fairly high in our experimental setting. The results can also be seen as being in line with the cluster hypothesis and with the assumptions about the benefits of an overview-and-details type of user interface.

The study also raised a number of questions, such as the number of categories to present to the user and the generalizability of the results to other situations. These issues were addressed in the subsequent studies.

6.3 STUDY II: SEARCH USER INTERFACE EVALUATION MEASURES

Reference

Mika Käki (2004). Proportional search interface usability measures. In Proceedings of NordiCHI 2004 (Tampere, Finland, October 2004). ACM Press. (Paper II, page 107)

Objective

In the course of conducting and analyzing the results of the first study, we discovered a lack of descriptive measures for our needs. In some well-defined areas of HCI there are commonly established measures for evaluating the success of user interface solutions; for example, in text entry studies, measures like keystrokes per character (KSPC) and error rates are routinely used (MacKenzie, 2002; Soukoreff & MacKenzie, 2003). The same cannot be said about evaluating search user interfaces, and the aim of the second study was to provide new, useful measures. The data from the first study were available for experimenting with the new measures.

From the literature review it became apparent that presenting raw numbers on the amount of time spent and on the number of results gathered was popular in search user interface studies. Such measures capture important properties of the measured systems, but the results are hard to interpret and compare. Based on the literature review and the experiences from our first study, we set two goals for the new measures. First, they must make the results easier to compare and understand. Second, they must capture the special characteristics of Web searching; in particular, it is common that Web searchers stop the search process when one or two good enough answers are found. None of the previously used measures captures success in such a situation.

Results and Discussion

The results of the study include three new measures for evaluating search user interfaces, which were evaluated by applying them to the results of previous studies (our own and those found in the literature). The first two are designed to make the results easier to understand and compare. Search speed and qualified speed are proportional measures describing how fast the search user interface is. Search speed is the simpler of the two, a raw measure that does not consider the quality of the results, whereas qualified speed employs accuracy information by counting, for example, only the relevant answers. Both measures are stated in answers per minute (APM). The third measure addresses the special characteristics of Web search behavior: immediate accuracy is the proportion of cases where at least one relevant answer has been found by the n-th result selection. It aims to capture success in the typical, rather impatient Web search behavior.

The evaluation of the measures was based on applying them to the data of the first study as well as to the data reported in one of the Scatter/Gather studies. In the comparison, we showed that these measures can separate systems and that they are easier to compare than the old ones.

The conclusion was that the new measures are useful additions to the toolbox of the search user interface evaluator, and we employed them in the later studies where appropriate.

There are, however, a few problematic issues in the results of this study. First, the evaluation method for the new measures is not clear, because there are no widely accepted ways of demonstrating their utility. Conducting a conventional experiment is problematic, because the measures largely constitute the experiment: the result of an experiment is expressed in terms of them. Thus, the evaluation of the measures must be largely grounded on intuition about their descriptiveness. This does not mean that an experiment is futile in evaluating new measures; it plays an important role in forming an impression of the descriptiveness and utility of a measure. Second, the applicability of the measures can be limited. Our own needs imply an emphasis on the result evaluation phase, and the measures reflect that fact. For example, it is not clear whether they can be used in evaluating the utility of query reformulation aids. However, we think that the applicability is not seriously compromised. For example, a system with novel query refinement aids could be evaluated with the speed measures in a different test setup where the user has the opportunity to make multiple queries; if the query refinement aids work, the effect should be measurable in the users' ability to find meaningful results, for example in qualified speed.
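To make the proposed measures concrete, the following sketch shows how they could be computed from logged result selections. The session format is an assumption made for this illustration, not the logging format used in the studies.

```python
def search_speed(n_answers, minutes):
    """Search speed: answers per minute (APM), ignoring result quality."""
    return n_answers / minutes

def qualified_speed(n_relevant, minutes):
    """Qualified speed: relevant answers per minute."""
    return n_relevant / minutes

def immediate_accuracy(sessions, n):
    """Immediate accuracy: proportion of sessions in which at least one
    relevant answer occurs within the first n result selections.

    Each session is a list of booleans, one per selection in order,
    True when the selected result was relevant.
    """
    hits = sum(1 for selections in sessions if any(selections[:n]))
    return hits / len(sessions)

# Example: a relevant result is found by the second selection in two of
# three sessions, so immediate accuracy at n = 2 is 2/3.
sessions = [[False, True, True], [True], [False, False, True]]
print(immediate_accuracy(sessions, 2))
```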

6.4 STUDY III: THE EFFECT OF THE NUMBER OF CATEGORIES

Reference

Mika Käki (2005). Optimizing the number of search result categories. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2005 (Portland, USA, April 2005). ACM Press. (Paper III, page 117)

Objective

The success of the first evaluation of the Findex search user interface encouraged us to look deeper into the phenomenon of using categories as the result list overview. Because the overall objective is to enhance the user's performance, an obvious question is how the number of categories presented to the user affects it. In other words, what is the optimal number of categories? In the first experiment, the number of categories was somewhat arbitrarily chosen (fifteen), based on our intuitive conception of the performance of the system. We first tried to find the answer from previous menu selection studies, but they were not quite on target. The automatically computed categories are more of a moving target, and thus the users' process of evaluating them may differ considerably from searching through menu items. Notably, meaningful ordering and grouping are not possible with automatically formed categories, and this changing nature complicates the situation. Thus, we decided to investigate the issue in a new study.

The experiment compared three conditions, 10, 20, and 40 categories, while the rest of the user interface was kept constant. The controlled study was carried out with 27 participants in a laboratory. The test setup was similar to that of the first experiment, as it was seen to be robust and appropriate for the current problem as well.

Results and Discussion

The results of the experiment showed that fewer categories are better, but the measured differences between the conditions were relatively small. The subjective opinions were clearly against many categories, and the participants found 40 categories to be clearly too many. Although 20 categories also received negative subjective feedback, the objective measures revealed only small or no differences in performance compared to 10 categories. In the end, our original estimate of 15 categories turned out to be fairly good.

As the main result indicates that fewer categories yield better performance, the question of fewer than 10 categories readily arises. Unfortunately, a condition with fewer than 10 categories had to be excluded from the study for practical reasons: increasing the number of conditions increases the need for participants and thus for time, which we simply did not have. However, we do know from the first study that zero categories results in poor performance. In addition, with fewer than 10 categories, it is probable that the categories would not support the user's task.

One practical conclusion from this study was that users may need a way to control the number of categories presented in the list. A number around 10 or 15 seems appropriate for the default setting, but it makes sense to let users control the number of categories to a certain extent. Indeed, such functionality is implemented, for example, in the Vivísimo search engine, where the user can get more categories on demand.

6.5 STUDY IV: LONGITUDINAL STUDY OF FINDEX

Reference

Mika Käki (2005). Findex: search result categories help users when document ranking fails. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2005 (Portland, USA, April 2005). ACM Press. (Paper IV, page 123)

Objective

In the third phase of the studies, we turned our attention to the issue of internal and external validity. As the previous experimental settings were strictly controlled, some questions remained unanswered. First, we knew from the first experiment that categories are not beneficial in all situations, but the frequency of such cases was unknown: because we controlled the query formulations, and thus the distribution of the tasks, the query formulations in the experiments do not necessarily correspond to real use. Second, the users' actual use habits cannot be learned from laboratory experiments. The experimental setting forced the participants to use the categories at least once for each task, but in a real situation there is no such constraint.

To address these issues, we conducted a longitudinal study. We implemented a Web-based version of our search user interface and recruited 16 participants from Finnish universities. Universities were used as the recruitment source because we wanted to involve fairly active Web searchers, and university personnel were assumed to need information frequently in their work. The study was carried out during the summer, and we collected usage information on two months of use. To compensate for vacations, the system was available to the participants for three months. All interaction with the system was logged, and the behavior of the participants was not restricted in any way. In fact, they were encouraged to use the system in any way they saw appropriate.

Results and Discussion

The results of the study showed that the utility of the categories is a more complex matter than the first experiment suggested. In the controlled setup, the participants were required to use the categories at least once per task, but in the real situation the categories were used on average in every fourth query. Although this may seem little, we find it encouraging: the categories were used regularly over a long period of time, indicating their consistent ability to help users in certain situations.

By examining the log files we concluded that the categories are most likely used in situations where the result ranking does not support the user's task. When categories are used, the time required to select the first result is about twice as long as when they are not. This means that the users have time to first read and evaluate a screen full of results. If this does not produce results, they scan the short category list, select a category, and evaluate a few results in the category before opening a result page.
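The kind of log analysis behind these observations can be sketched as follows. The record fields are invented for the illustration; the actual log format is not described in this summary.

```python
def category_usage_rate(queries):
    """Fraction of queries in which at least one category was selected.

    Each query record is a dict with an assumed boolean field
    'used_category'.
    """
    used = sum(1 for q in queries if q["used_category"])
    return used / len(queries)

def mean_time_to_first_selection(queries, with_categories):
    """Mean seconds from query submission to the first result selection,
    for queries with or without category use ('first_click_s' assumed)."""
    times = [q["first_click_s"] for q in queries
             if q["used_category"] == with_categories]
    return sum(times) / len(times)
```

In the study, the first figure came out at roughly one query in four, and the second roughly doubled when categories were used.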

Although this user behavior could be seen as a disappointment regarding the categories, it can also be seen in a positive light. It means that users can retain their old search habits when working with the new user interface: they can exploit the success of the rank ordering when possible, while the categories help them in problem situations. As the categories are used regularly, we believe that there is a need for them and that they become part of search habits. Thus it seems that categories are beneficial in real settings.

We consider the results of the study fairly strong despite the shortcomings of the test setup. Longitudinal studies are often loosely controlled, which was also the case here. In our case, however, we can combine the results with those from the experiments with the same system. We know from the experiments that the use of categories enhances performance, which means, among other things, that users tend to select meaningful categories. From the longitudinal study we know that categories are used regularly. Because the use of categories does not diminish over time, the category selections are likely to be beneficial in real settings as well.

One interesting question that we were not able to address in the given time frame concerns the usage patterns. In particular, it would be interesting to know in what kinds of situations the categories are used. For example, one might assume that they are used in the query refinement phase, when the user is experiencing difficulties in formulating the query. Our data could provide insight into this question, and it is obviously an interesting topic for future studies.

6.6 STUDY V: EXPERIMENT WITH CONTEXT CATEGORIES

Reference

Mika Käki (forthcoming). fkwic: frequency based keyword-in-context index for filtering web search results. Accepted for publication in Journal of the American Society for Information Science and Technology (JASIST). Wiley. (Paper V, page 135)

Objective

In the course of using and testing the first version of Findex, we noted that the most meaningful categories tended to contain query terms in their names. This observation gave rise to an association with keyword-in-context (KWIC) indices and led to the idea of displaying the most frequent keyword contexts as an index to the results. The implementation of this fkwic indexing system proved to be notably different from the initial categorization scheme. The development was an iterative process in which design and implementation were followed by an informal evaluation.

The requirement of having query terms in the category names posed new challenges: the number of category candidates was reduced, and the algorithm for removing and merging similar candidates was changed considerably compared to the initial categorization algorithm. Because the categorization algorithm changed so much, it was not clear whether the new approach would still benefit the user. Thus, the objective of this study was to ascertain whether the new categorization algorithm enhances user performance. We conducted a controlled experiment with 36 participants in a usability laboratory. The new system was compared to the ranked list user interface (the baseline solution). The setup was largely the same as in the first experiment, because the research question was basically the same and the setup was seen to be sound.

Results and Discussion

The results confirmed the utility of the new approach: the speed of finding relevant results increased by 29%, and the proportion of relevant results among the selected results increased by 19%. In addition to these objective measures, we obtained evidence of positive attitudes towards the proposed user interface. These facts support the hypothesis that the proposed system enhances the users' performance in accessing search results.

Due to slight changes in the test setup and in the demographics of the participants, the results are not exactly comparable with the first experiment. However, it is fairly safe to say that the performance of the two systems is at about the same level. Based on this study, we cannot say which of the systems is better, and even if we could, the difference would probably be fairly small. Although a direct comparison of the systems would be interesting, we abandoned this approach, considering it too radical for these prototype systems: the results of such a study could be too heavily influenced by small design or implementation flaws, and small differences in system performance could be exaggerated in a comparison setup, leading to false conclusions. Instead, we judged that the most important point was to establish a relation between the new system and the currently dominant one.
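A minimal sketch of the context-category idea is given below. The actual fkwic algorithm (Papers V and VI) prunes and merges similar candidates much more carefully; the point here is only that every category name contains a query term by construction, because candidates are short word contexts around query-term occurrences.

```python
from collections import Counter
import re

def context_categories(snippets, query_terms, window=2, k=15):
    """Return the k most frequent keyword-in-context phrases.

    A candidate is a span of up to `window` words on either side of an
    occurrence of a query term in a result snippet.
    """
    terms = {t.lower() for t in query_terms}
    counts = Counter()
    for snippet in snippets:
        words = re.findall(r"\w+", snippet.lower())
        for i, word in enumerate(words):
            if word not in terms:
                continue
            # Count every left/right context span around the query term.
            for left in range(window + 1):
                for right in range(window + 1):
                    lo, hi = max(0, i - left), i + right + 1
                    counts[" ".join(words[lo:hi])] += 1
    return [phrase for phrase, _ in counts.most_common(k)]
```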

6.7 STUDY VI: EVALUATION OF THE CATEGORIZATION ALGORITHMS

Reference

Mika Käki (2005). Findex: properties of two web search result categorizing algorithms. Accepted for publication in Proceedings of the IADIS International Conference on World Wide Web/Internet (Lisbon, Portugal, October 2005). IADIS Press. (Paper VI, page 153)

Objective

The first five papers had a strong human-computer interaction bias in their research methodology and approach. Because the topic of the research is situated at the intersection of HCI and IR, we adopted a more IR-oriented method for this study. In our previous publications, the fine details of the categorization algorithms were left slightly fuzzy, and their computational performance was largely unexamined. In addition, the relationship between the two categorization systems was unclear: both are beneficial for the users, but intuitive experience indicated that each algorithm may have situations in which it performs better than the other.

To address these needs, we performed a study on the algorithms. The study involved mathematical measures such as the coverage, overlap, recall, and precision of the category algorithms. A heuristic evaluation was included to identify criteria for the situations in which each algorithm works best. In addition, the algorithm descriptions were published to ensure that the acquired information can be utilized in future research.

Results and Discussion

The results of the study revealed benefits and downsides in both algorithms. Both were seen to deliver acceptable computational performance, given that the current implementations are not highly optimized. The first categorization algorithm performed better with respect to the coverage and overlap of the categories. Context categories (fkwic), in contrast, were strong on measures involving the quality dimension, but were not able to cover as large a part of the results. This supports our hypothesis that there are differences between the methods.

The heuristic assessment revealed situations where the categories were successful and unsuccessful for both algorithms. Typically, a situation that is hard for one algorithm is not as difficult for the other. Thus, it could be possible to compensate for the flaws of one algorithm by a sensible choice of the categorization method. Such work is left for the future.
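As an illustration of two of the measures, coverage and overlap can be defined over the sets of results that each category matches. The definitions below are one plausible reading, not necessarily the exact formulations used in Paper VI.

```python
from collections import Counter

def coverage(categories, n_results):
    """Fraction of results that fall into at least one category.

    `categories` maps a category name to the set of result indices it
    matches; `n_results` is the size of the whole result list.
    """
    covered = set().union(*categories.values()) if categories else set()
    return len(covered) / n_results

def overlap(categories):
    """Fraction of covered results that appear in more than one category."""
    member_counts = Counter(i for members in categories.values()
                            for i in members)
    if not member_counts:
        return 0.0
    shared = sum(1 for count in member_counts.values() if count > 1)
    return shared / len(member_counts)
```

Under definitions of this kind, the statistical algorithm's strength shows up as higher coverage, while fkwic's narrower, query-anchored candidates cover a smaller part of the result set.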

6.8 DIVISION OF LABOR

One of the publications mentions a co-author. The first paper was done in collaboration with Anne Aula, whose contribution to the whole system is important. The central ideas behind the categorization approach were developed together with her. In addition, Ms. Aula had a central role in the design of the experimental setting and in conducting the pilot studies in which the settings were tried out. The experiment reported in the paper was carried out by myself. The paper was mostly written by me, and Ms. Aula had an important role in commenting on it.

Although the other papers do not mention co-authors, this does not mean that they were produced in isolation. Colleagues have contributed countless ideas and comments to each of the papers. However, all the experiments, software artifacts, and original text for the papers were produced by the present author.


7 Conclusions

We have presented the Findex Web search user interface concept, consisting of user interface functionality and two novel result categorization schemes. The categorization approach for accessing search results was evaluated in four user studies and one theoretical study. In addition, we presented three new measures to be used in the evaluation of search user interfaces in user studies. The contribution of the work is twofold: 1) a search user interface concept (user interface functionality and the categorization schemes) and 2) new information for the scientific community about the usefulness of categorizing search user interfaces. The latter is the main contribution of the work.

The following lists the conclusions from each of the studies:

1. The first study (Paper I) evaluated the basic categorization approach, and the statistical categorization algorithm in particular. The study shows that the approach is beneficial: users were 40% faster in finding relevant results, and the relevance of their initial selections was higher. We can conclude that the approach increases users' performance in certain conditions (such as those used in the experiment). However, the experimental setup leaves us in the dark as to how useful the system would be in a normal use situation. Given that the search queries were formulated for the participants and that the search tasks were not initiated by the searchers themselves, the generalizability of the results cannot be taken for granted. This issue was addressed in the longitudinal study (Paper IV).

2. The second experiment (Paper III) studied the effect of the number of categories on user performance. The main conclusion is that fewer categories result in slightly better user performance. However, the performance penalty with more categories is not great, leaving some room for new designs. In practice, a number around 10-15 appears to be acceptable. Note that the categorization algorithm has an effect on the quality and coverage of the categories, and thus on user performance. We assume that our results can be used as guidelines, but new categorization algorithms may require new studies.

3. The third study (Paper IV) addressed the issue of using the categorization system in a normal situation. This longitudinal study shows that categories become a part of users' search habits and that they are used roughly in every fourth query. We conclude that users can see the benefit of the categories in normal situations and can take advantage of them. Unfortunately, due to time limits, we were not able to analyze the use situations and usage patterns related to category use. This would be an interesting analysis and is left for future studies.

4. The experiment on context categories (Paper V) compared our second categorization algorithm to the de facto standard ranked results list user interface. The results show that the context categories increase users' speed of finding relevant results and their accuracy in selecting meaningful results. We conclude that this alternative categorization algorithm is a viable solution and enhances the users' performance. Being a laboratory experiment, the study faces the challenges of external validity. However, the use of context categories is much like the use of statistical categories, which were seen to perform well in the longitudinal study, and we assume that this is also the case with context categories.

5. The theoretical evaluation of the categorization algorithms (Paper VI) studied their properties. Based on the evaluation, the computational performance of the algorithms is acceptable. The quality of the categories depends on the underlying result set, and both algorithms have strengths and weaknesses. We conclude that the algorithms are a good starting point and provide benefits as they are. However, there is room for improvement, and the algorithms should not be considered finalized products.

6. Three new search user interface performance measures were proposed in Paper II. The measures were used in multiple experiments during the studies, and they reveal interesting details and differences in the user interfaces. We conclude that the measures are applicable in studying search user interfaces. However, some assumptions the measures make may limit their applicability. For example, the application of immediate accuracy assumes multiple result selections, which is not always achievable in a test setup.

In summary, result categorization enhances users' performance. The degree of advantage depends on the query, the user's information need, and the results returned by the underlying search engine. Categories are not needed when the result ranking supports the user's information need; if the top of the result list does not provide relevant results, the users cope with the situation using the categories.

Although we saw that our categorization algorithms perform acceptably according to multiple measures in the computational evaluation, large-scale application is not simple. If we consider a commercial search engine such as Google, which processes hundreds of millions of searches a day, the performance requirements are enormous. Assuming 200 million queries a day and an extra load of 200 milliseconds per query for the categorization, we are facing over a year of computation each day (200,000,000 × 0.2 s = 40,000,000 s, or about 463 days). The cost of implementing such a system is obviously high. Although this simple calculation suggests problems in scaling the system up, we do not have all the information needed to draw firm conclusions. Our studies did not contain in-depth performance examinations in terms of processor and memory resources, and it is possible that these problems can be solved or reduced easily. Large-scale Web search, however, is only one application domain. The techniques can surely be applied in other environments, such as intranet searches or other search facilities that utilize the processing power of the local computer, and we expect the solution to be easily applicable in such cases.

Since the work for this study commenced, new methods have been published that aim to increase the quality of the categories. The results of Zeng and colleagues (Zeng et al., 2004) are especially promising. Their technology is based on features assigned to the candidate categories and on learning methods for selecting appropriate weights for the features. Although this complicates the selection process of the categories, it does not expose the complexity to the end users, because categories are still simply words or phrases appearing in the results. This kind of approach is desirable from our premises, where comprehensibility for the end user is vital.

Improving the quality of the cluster names is the most important area of future work for our system. The complete removal of stop words from the final category names may not be the optimal solution, although it is efficient in certain situations. Another issue concerns uninformative words that are not stop words, such as "information" or "world": in some contexts they can be meaningful, but not generally. Perhaps feature-based measures of word significance (such as TF-IDF) could solve some of these problems, as Zeng and colleagues have demonstrated.
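As a sketch of that direction, candidate category words could be weighted by a TF-IDF-style score, so that words frequent in the current result set but common everywhere (such as "information") rank low. The corpus statistics are assumed inputs; this illustrates the idea and is not a tested component of Findex.

```python
import math

def tfidf_weight(result_count, n_results, doc_freq, n_corpus_docs):
    """TF-IDF-style weight for a candidate category word.

    result_count:  results in the current result set containing the word
    n_results:     size of the current result set
    doc_freq:      background-corpus documents containing the word
    n_corpus_docs: size of the background corpus
    """
    tf = result_count / n_results
    idf = math.log((1 + n_corpus_docs) / (1 + doc_freq))
    return tf * idf
```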


8 References

Allen, Obry, & Littman (1993). An Interface for Navigating Clustered Document Sets Returned by Queries. In Proceedings of the Conference on Organizational Computing Systems, COOCS '93 (Milpitas, USA). ACM Press.
Anick, P. (2003). Using Terminological Feedback for Web Search Refinement - A Log-based Study. In Proceedings of the Annual International ACM/SIGIR '03 Conference (Toronto, Canada). ACM Press.
Anick, P. & Tipirneni, S. (1999). The Paraphrase Search Assistant: Terminological Feedback for Iterative Information Seeking. In Proceedings of the Annual International ACM/SIGIR '99 Conference (Berkeley, USA). ACM Press.
Aula, A. & Nordhausen, K. (forthcoming). Modeling Successful Performance in Web Search. To appear in Journal of the American Society for Information Science and Technology (JASIST). Wiley.
Barker, K. & Cornacchia, N. (2000). Using Noun Phrase Heads to Extract Document Keyphrases. In Proceedings of the Thirteenth Canadian Conference on Artificial Intelligence, LNAI 1822 (Montreal, Canada).
Bates, M. (1989). The Design of Browsing and Berrypicking Techniques for the Online Search Interface. Online Review, Vol. 13, No. 5.
Belkin, N., Cool, C., Head, J., Jeng, J., Kelly, D., Lin, S., Lobash, L., Park, S., Savage-Knepshield, P., & Sikora, C. (1999). Relevance Feedback versus Local Context Analysis as Term Suggestion Devices: Rutgers TREC-8 Interactive Track Experience. In TREC-8, Proceedings of the Eighth Text Retrieval Conference (Washington, D.C., USA).
Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. Accrue Software, San Jose, California.
Boley, D., Gini, M., Hastings, K., Mobasher, B., & Moore, J. (1998). A Client-Side Web Agent for Document Categorization. Journal of Internet Research, Vol. 8, No. 5.
Broder, A. (2002). A Taxonomy of Web Search. SIGIR Forum. ACM Press, Vol. 36, No. 2.
Bruza, P. & Dennis, S. (1997). Query Reformulation on the Internet: Empirical Data and the Hyperindex Search Engine. In Proceedings of RIAO '97 (Montreal, Canada).
Bruza, P., McArthur, R., & Dennis, S. (2000). Interactive Internet Search: Keyword, Directory and Query Reformulation Mechanisms Compared. In Proceedings of the Annual International ACM/SIGIR 2000 Conference (Athens, Greece). ACM Press.
Card, S., Mackinlay, J., & Shneiderman, B. (1999). Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers, San Francisco.
Carey, M., Heesch, D., & Rüger, S. (2003). Info Navigator: A Visualization Tool for Document Searching and Browsing. In Proceedings of the International Conference on Distributed Multimedia Systems (DMS 2003, Florida, September 2003).
Carey, M., Kriwaczek, F., & Rüger, S. (2000). A Visualization Interface for Document Searching and Browsing. In Proceedings of NPIVM 2000 (Washington, D.C., USA). ACM Press.
Chau, M., Zeng, D., & Chen, H. (2001). Personalized Spiders for Web Search and Analysis. In Proceedings of JCDL '01 (Roanoke, USA). ACM Press.
Chekuri, C., Goldwasser, M., Raghavan, P., & Upfal, E. (1997). Web Search Using Automatic Classification. In Proceedings of the 6th International World Wide Web Conference, WWW6 (Santa Clara, USA).
Chen, H. & Dumais, S. (2000). Bringing Order to the Web: Automatically Categorizing Search Results. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI 2000 (The Hague, Netherlands). ACM Press.

Chen, L. & Chue, W. (2005). Using Web Structure and Summarization Techniques for Web Content Mining. Information Processing and Management. Elsevier, Vol. 41, No. 5.
Chen, M., Hearst, M., Hong, J., & Lin, J. (1999). Cha-Cha: A System for Organizing Intranet Search Results. In Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS).
Chen, M., LaPaugh, A., & Singh, J. (2002). Predicting Category Accesses from a User in a Structured Information Space. In Proceedings of the Annual International ACM/SIGIR '02 Conference (Tampere, Finland). ACM Press.
Chi, E., Pirolli, P., Chen, K., & Pitkow, J. (2001). Using Information Scent to Model User Information Needs and Actions on the Web. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI 2001 (Seattle, USA). ACM Press.
Cho, E. & Myaeng, S. (2000). Visualization of Retrieval Results Using DART. In Proceedings of the International Conference RIAO (Paris, France).
Choo, C., Detlor, B., & Turnbull, D. (2000). Information Seeking on the Web - An Integrated Model of Browsing and Searching. First Monday. University Library at the University of Illinois at Chicago, Vol. 5, No. 2.
Cui, H. & Zaïane, O. (2001). Hierarchical Structural Approach to Improving the Browsability of Web Search Engine Results. In Proceedings of the 12th International Workshop on Database and Expert Systems Applications (DEXA '01). IEEE Computer Society.
Cutting, D., Karger, D., & Pedersen, J. (1993). Constant Interaction-Time Scatter/Gather Browsing of Large Document Collections. In Proceedings of the Annual International ACM/SIGIR '93 Conference (Pittsburgh, USA). ACM Press.
Cutting, D., Karger, D., Pedersen, J., & Tukey, J. (1992). Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the Annual International ACM/SIGIR '92 Conference (Copenhagen, Denmark). ACM Press.
Dennis, S., McArthur, R., & Bruza, P. (1998). Searching the World Wide Web Made Easy? The Cognitive Load Imposed by Query Refinement Mechanisms. In Proceedings of the Third Australian Document Computing Symposium (ADCS '98). Department of Computer Science, University of Sydney, TR-518.
Dumais, S. & Chen, H. (2000). Hierarchical Classification of Web Content. In Proceedings of the Annual International ACM/SIGIR 2000 Conference (Athens, Greece). ACM Press.
Dumais, S., Cutrell, E., & Chen, H. (2001). Optimizing Search by Showing Results in Context. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI 2001 (Seattle, USA). ACM Press.
Edgar, K., Nichols, D., Paynter, G., Thomson, K., & Witten, I. (2003). A User Evaluation of Hierarchical Phrase Browsing. In Proceedings of the European Conference on Digital Libraries, ECDL 2003 (Trondheim, Norway).
Egan, D., Remde, J., Gomez, L., Landauer, T., Eberhardt, J., & Lochbaum, C. (1989). Formative Design-Evaluation of SuperBook. ACM Transactions on Information Systems. ACM Press, Vol. 7, No. 1.
Fox, E., Hix, D., Nowell, L., Brueni, D., Wake, W., Heath, L., & Rao, D. (1993). Users, User Interfaces, and Objects: Envision, a Digital Library. Journal of the American Society for Information Science (JASIS). Wiley, Vol. 44, No. 3.
Google Search Engine.
Google Timeline (2005).
Granitzer, M., Kienreich, W., Sabol, V., & Dösinger, G. (2003). WebRat: Supporting Agile Knowledge Retrieval through Dynamic, Incremental Clustering and Automatic Labelling of Web Search Result Sets. In Proceedings of the Twelfth IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE '03). IEEE Computer Society.
Gutwin, C., Paynter, G.W., Witten, I.H., Nevill-Manning, C., & Frank, E. (1999). Improving Browsing in Digital Libraries with Keyphrase Indexes. Journal of Decision Support Systems. Elsevier, Vol. 27, No. 1-2.
Hearst, M. (1995). TileBars: Visualization of Term Distribution Information in Full Text Information Access. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI '95 (Denver, USA). ACM Press.

Hearst, M. (1999). User Interfaces and Visualization. Chapter in Baeza-Yates, R. & Ribeiro-Neto, B. (eds.), Modern Information Retrieval. Addison Wesley, Edinburgh Gate, England.
Hearst, M. & Karadi, C. (1997). Cat-a-Cone: An Interactive Interface for Specifying Searches and Viewing Retrieval Results Using a Large Category Hierarchy. In Proceedings of the Annual International ACM/SIGIR '97 Conference (Philadelphia, USA). ACM Press.
Hearst, M., Karger, D., & Pedersen, J. (1995). Scatter/Gather as a Tool for the Navigation of Retrieval Results. In Proceedings of the AAAI Fall Symposium on AI Applications in Knowledge Navigation & Retrieval (Cambridge, MA).
Hearst, M. & Pedersen, J. (1996). Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of the Annual International ACM/SIGIR '96 Conference (Zurich, Switzerland). ACM Press.
Heath, L., Hix, D., Nowell, L., Wake, W., Averboch, G., Labow, E., Guyer, S., Brueni, D., France, R., Dalai, K., & Fox, E. (1995). Envision: A User-Centered Database of Computer Science Literature. Communications of the ACM. ACM Press, Vol. 38, No. 4.
Hoelscher, C. (1998). How Internet Experts Search for Information on the Web. In Proceedings of the World Conference of the World Wide Web, Internet, and Intranet (Orlando, USA).
iBoogie Search Engine.
ISO/IEC (1998). Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs) - Part 11: Guidance on Usability. ISO/IEC 9241-11:1998 (E).
Jansen, B. & Pooch, U. (2000). A Review of Web Searching Studies and a Framework for Future Research. Journal of the American Society for Information Science and Technology (JASIST). Wiley, Vol. 52, No. 3.
Jansen, B. & Spink, A. (2006). How Are We Searching the World Wide Web? A Comparison of Nine Search Engine Transaction Logs. Information Processing & Management. Elsevier, Vol. 42, No. 1.
Jansen, B., Spink, A., & Saracevic, T. (1998). Searchers, the Subjects They Search, and Sufficiency: A Study of a Large Sample of Excite Searches. In Proceedings of WebNet-98, World Conference on the WWW, Internet and Intranet (Orlando, USA).
Jansen, B., Spink, A., & Saracevic, T. (2000). Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web. Information Processing and Management, Vol. 36, No. 2.
Jardine, N. & van Rijsbergen, C. (1971). The Use of Hierarchic Clustering in Information Retrieval. Information Storage and Retrieval. Pergamon Press, Vol. 7, No. 5.
Jiang, Z., Joshi, A., Krishnapuram, R., & Yi, L. (2000). Retriever: Improving Web Search Engine Results Using Clustering. University of Maryland, Technical Report, October 2000.
Jones, S. (1999). Design and Evaluation of Phrasier, an Interactive System for Linking Documents Using Keyphrases. In Proceedings of Human-Computer Interaction INTERACT '99 (Edinburgh, UK). IOS Press.
Jones, S. & Mahoui, M. (2000). Hierarchical Document Clustering Using Automatically Extracted Keyphrases. In Proceedings of the Third Asian Conference on Digital Libraries (Seoul, Korea).
Jones, S. & Paynter, G. (1999). Topic-based Browsing within a Digital Library Using Keyphrases. In Proceedings of the Fourth ACM Conference on Digital Libraries, DL '99 (Berkeley, USA). ACM Press.
Jones, S. & Paynter, G. (2002). Automatic Extraction of Document Keyphrases for Use in Digital Libraries: Evaluation and Applications. Journal of the American Society for Information Science and Technology (JASIST). Wiley, Vol. 53, No. 8.
Jones, S., Jones, M., & Deo, S. (2004). Using Keyphrases as Search Result Surrogates on Small Screen Devices. Personal and Ubiquitous Computing. Springer, Vol. 8, No. 1.
Kartoo Search Engine.
Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1998). WEBSOM - Self-organizing Maps of Document Collections. Neurocomputing. Elsevier, Vol. 21.
Kohonen, T. (1997). Exploration of Very Large Databases by Self-organizing Maps. In Proceedings of the IEEE International Conference on Neural Networks, Vol. 1.

Koller, D. & Sahami, M. (1997). Hierarchically Classifying Documents Using Very Few Words. In Proceedings of the 14th International Conference on Machine Learning, ICML (Nashville, USA).
Kules, B. & Shneiderman, B. (2005). Categorized Graphical Overviews for Web Search Results: An Exploratory Study Using U.S. Government Agencies as a Meaningful and Stable Structure. In Proceedings of the Third Annual Workshop on HCI Research in MIS. Technical report HCIL, CS-TR-4715, UMIACS-TR, ISR-TR.
Kummamuru, K., Lotlikar, R., Roy, S., Singal, K., & Krishnapuram, R. (2004). A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results. In Proceedings of the Thirteenth International World Wide Web Conference (New York, USA).
Leouski, A. & Croft, B. (1996). An Evaluation of Techniques for Clustering Search Results. Department of Computer Science, University of Massachusetts, Amherst, Technical Report IR-76.
Lin, X., Soergel, D., & Marchionini, G. (1991). A Self-organizing Semantic Map for Information Retrieval. In Proceedings of the Annual International ACM/SIGIR '91 Conference. ACM Press.
Maarek, Y., Jacovi, M., Shtalhaim, M., Ur, S., Zernik, D., & Shaul, I. (1997). WebCutter: A System for Dynamic and Tailorable Site Mapping. In Proceedings of the 6th International World Wide Web Conference, WWW6 (Santa Clara, USA).
MacKenzie, S. (2002). KSPC (Keystrokes Per Character) as a Characteristic of Text Entry Techniques. In Proceedings of the Fourth International Symposium on Human-Computer Interaction with Mobile Devices (Heidelberg, Germany). Springer-Verlag.
Matthews, G., Davies, R., Westerman, S., & Stammers, R. (2000). Human Performance: Cognition, Stress and Individual Differences. Psychology Press, Hove, UK.
Mauldin, M. (1997). Lycos: Design Choices in an Internet Search Service. IEEE Expert, Vol. 12, No. 1.
MeSH.
Nielsen, J. (2004). When Search Engines Become Answer Engines.
Nowell, L., France, R., Hix, D., Heath, L., & Fox, E. (1996). Visualizing Search Results: Some Alternatives to Query-Document Similarity. In Proceedings of the Annual International ACM/SIGIR '96 Conference (Zurich, Switzerland). ACM Press.
Osdin, R., Ounis, I., & White, R. (2002). Using Hierarchical Clustering and Summarization Approaches for Web Retrieval: Glasgow at the TREC 2002 Interactive Track. In The Eleventh Text Retrieval Conference (TREC 2002).
Pirolli, P. & Card, S. (1995). Information Foraging in Information Access Environments. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI '95 (Denver, USA). ACM Press.
Pirolli, P. & Card, S.K. (1999). Information Foraging. Psychological Review. APA, Vol. 106, No. 4.
Pirolli, P., Pitkow, J., & Rao, R. (1996). Silk from a Sow's Ear: Extracting Usable Structures from the Web. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI '96 (Vancouver, Canada). ACM Press.
Pirolli, P., Schank, P., Hearst, M., & Diehl, C. (1996). Scatter/Gather Browsing Communicates the Topic Structure of a Very Large Text Collection. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI '96 (Vancouver, Canada). ACM Press.
Popescul, A. & Ungar, L. (2000). Automatic Labeling of Document Clusters. Unpublished manuscript.
Pratt, W. & Fagan, L. (2000). The Usefulness of Dynamically Categorizing Search Results. Journal of the American Medical Informatics Association. Elsevier, Vol. 7, No. 6.
Remde, J., Gomez, L., & Landauer, T. (1987). SuperBook: An Automatic Tool for Information Exploration - Hypertext? In Proceedings of the ACM Conference on Hypertext and Hypermedia, Hypertext '87 (Chapel Hill, USA). ACM Press.
Robertson, S. (1977). Theories and Models in Information Retrieval. Journal of Documentation. Emerald, Vol. 33, No. 2.
Rose, D. & Levinson, D. (2004). Understanding User Goals in Web Search. In Proceedings of the Thirteenth International World Wide Web Conference, WWW2004 (New York, USA). ACM Press.

Roussinov, D. & Chen, H. (2001). Information Navigation on the Web by Clustering and Summarizing Query Results. Information Processing & Management. Elsevier, Vol. 37.
Sahami, M., Yusufali, S., & Baldonado, M. (1998). SONIA: A Service for Organizing Networked Information Autonomously. In Proceedings of the ACM Conference on Digital Libraries, DL '98 (Pittsburgh, USA). ACM Press.
Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, Reading, Massachusetts.
Saracevic, T., Kantor, P., Chamis, A., & Trivison, D. (1988). A Study of Information Seeking and Retrieving. I. Background and Methodology. Journal of the American Society for Information Science. Wiley, Vol. 39, No. 3.
Shneiderman, B., Byrd, D., & Croft, B. (1997). Clarifying Search: A User-Interface Framework for Text Searches. D-Lib Magazine, January 1997.
Shneiderman, B., Byrd, D., & Croft, B. (1998). Sorting Out Search: A User-Interface Framework for Text Searches. Communications of the ACM. ACM Press, Vol. 41, No. 4.
Shneiderman, B., Feldman, D., & Rose, A. (1999). Visualizing Digital Library Search Results with Categorical and Hierarchical Axes. CS-TR-3993, UMIACS-TR. ftp://ftp.cs.umd.edu/pub/hcil/reports-Abstracts-Bibliography/99-03html/99-03.html
Soukoreff, W. & MacKenzie, S. (2003). Metrics for Text Entry Research: An Evaluation of MSD and KSPC, and a New Unified Error Metric. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI 2003 (Fort Lauderdale, USA). ACM Press.
Spink, A., Jansen, B., Wolfram, D., & Saracevic, T. (2002). From E-Sex to E-Commerce: Web Search Changes. IEEE Computer. IEEE Computer Society, Vol. 35, No. 3.
Spink, A., Wolfram, D., Jansen, B., & Saracevic, T. (2001). Searching the Web: The Public and Their Queries. Journal of the American Society for Information Science and Technology (JASIST). Wiley, Vol. 52, No. 3.
Spoerri, A. (1994a). InfoCrystal: A Visual Tool for Information Retrieval & Management. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI '94 (Boston, USA). ACM Press.
Spoerri, A. (1994b). InfoCrystal: A Visual Tool for Information Retrieval and Management. In Proceedings of Information and Knowledge Management (CIKM '93). ACM Press.
Spoerri, A. (2004a). Visual Search Editor for Composing Meta Searches. In Proceedings of ASIST 2004 (Providence, USA).
Spoerri, A. (2004b). MetaCrystal: Visualizing the Degree of Overlap between Different Search Engines. In Proceedings of the Thirteenth International World Wide Web Conference, WWW2004 (New York, USA). ACM Press.
Sutcliffe, A. & Ennis, M. (1998). Towards a Cognitive Theory of Information Retrieval. Interacting with Computers. Elsevier, Vol. 10, No. 3.
Teoma Search Engine.
Tombros, A. & Sanderson, M. (1998). Advantages of Query Biased Summaries in Information Retrieval. In Proceedings of the Annual International ACM/SIGIR '98 Conference (Melbourne, Australia). ACM Press.
Turney, P. (2000). Learning Algorithms for Keyphrase Extraction. Information Retrieval. Kluwer Academic Publishers, Vol. 2, No. 4.
Vakkari, P. (1999). Task Complexity, Problem Structure and Information Actions: Integrating Studies on Information Seeking and Retrieval. Information Processing and Management. Elsevier, Vol. 36, No. 6.
Veerasamy, A. & Belkin, N. (1996). Evaluation of a Tool for Visualization of Information Retrieval Results. In Proceedings of the Annual International ACM/SIGIR '96 Conference (Zurich, Switzerland). ACM Press.
Veerasamy, A. & Heikes, R. (1997). Effectiveness of a Graphical Display of Retrieval Results. In Proceedings of the Annual International ACM/SIGIR '97 Conference (Philadelphia, USA). ACM Press.
Vélez, B., Weiss, R., Sheldon, M., & Gifford, D. (1997). Fast and Effective Query Refinement. In Proceedings of the Annual International ACM/SIGIR '97 Conference (Philadelphia, USA). ACM Press.

Vivísimo Search Engine.
Voorhees, E. (1985). The Cluster Hypothesis Revisited. In Proceedings of the Annual International ACM/SIGIR '85 Conference (Montreal, Canada). ACM Press.
Ware, C. (2004). Information Visualization: Perception for Design (second edition). Morgan Kaufmann Publishers, San Francisco.
Weiss, D. & Stefanowski, J. (2003). Web Search Results Clustering in Polish: Experimental Evaluation of Carrot. In Advances in Soft Computing, Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM '03 Conference (Zakopane, Poland), Vol. 578 (XIV).
White, R., Jose, J., & Ruthven, I. (2001). Query-Biased Web Page Summarization: A Task-Oriented Evaluation. In Proceedings of the Annual International ACM/SIGIR 2001 Conference (New Orleans, USA). ACM Press.
WiseNut Search Engine.
Wittenburg, K. & Sigman, E. (1997). Integration of Browsing, Searching, and Filtering in an Applet for Web Information Access. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI '97 (Atlanta, USA). ACM Press.
Wu, Y., Shankar, L., & Chen, X. (2003). Finding More Useful Information Faster from Web Search Results. In Proceedings of Information and Knowledge Management, CIKM '03 (New Orleans, USA). ACM Press.
Yahoo! Search Engine.
Zamir, O. (1998). Visualization of Search Results in Document Retrieval Systems - General Examination. University of Washington. SIGTRS Bulletin, Vol. 7, No. 2.
Zamir, O. & Etzioni, O. (1998). Web Document Clustering: A Feasibility Demonstration. In Proceedings of the Annual International ACM/SIGIR '98 Conference (Melbourne, Australia). ACM Press.
Zamir, O. & Etzioni, O. (1999). Grouper: A Dynamic Clustering Interface to Web Search Results. In Proceedings of the International WWW Conference, WWW8 (Toronto, Canada). Elsevier Science.
Zamir, O., Etzioni, O., Madani, O., & Karp, R. (1997). Fast and Intuitive Clustering of Web Documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Newport Beach, USA). ACM Press.
Zeng, H., He, Q., Chen, Z., Ma, W., & Ma, J. (2004). Learning to Cluster Web Search Results. In Proceedings of the Annual International ACM/SIGIR '04 Conference (Sheffield, UK). ACM Press.

Appendix 1

The tasks and queries used in the studies. All tasks were presented to the participants in Finnish; for this table, they have been translated into English. Queries are reported as sent to the search engine, with translations in parentheses. Each line gives the task, then the query after the "|" separator.

Experiment of Statistical Categories (Paper I)

Find information about the space shuttle Challenger accident | challenger
Find a picture of the volcano Pinatubo | pinatubo
Find information about the terrorist attack on the World Trade Center | world trade center
Find information sources about growing tulips | tulppaani (tulip)
Find pages that deal generally with the city of Oulu | oulu
Find information about the things that should be considered when buying a used car in Finland | käytetty auto (used car)
Find information about the Finnish national opera (kansallisooppera) | kansallisooppera (national opera)
Find information about the sinking of the Titanic | titanic
Find pages concerned with the Kobe earthquake | kobe
Find information about the terrorist attack on the Pentagon | pentagon
Find information about growing crocus | krookus (crocus)
Find pages that concern the University of Oulu in general | oulu
Find pictures of the planet Jupiter | jupiter
Find reviews of the soundtrack of the film Pahat Pojat | pahat pojat
Find sources from where you could get a free e-mail address | sähköposti (e-mail)
Find as many Finlandia Prize winners as you can (avoid collecting the same author many times) | finlandia palkinto (Finlandia Prize)

Experiment on the Effect of the Number of Categories (Paper III)

Find information about the space shuttle Challenger accident | challenger
Find a picture of the volcano Pinatubo | pinatubo
Find information about the terrorist attack on the World Trade Center | world trade center
Find recipes for American apple pie | apple pie
Find opportunities to get a summer job as a salesperson | kesätyö (summer job)
Find pictures of the planet Mars | mars
Find information about the things that should be considered when buying a used car in Finland | käytetty auto (used car)
Find information about the sinking of the Titanic | titanic
Find pages concerned with the Kobe earthquake | kobe
Find information about the terrorist attack on the Pentagon | pentagon
Find recipes for minestrone soup | minestrone
Find information about the new Miss Finland (2004) | miss suomi (Miss Finland)
Find pictures of the planet Jupiter | jupiter
Find sources from where you could get a free e-mail address | sähköposti (e-mail)
Find information about the flight accident that happened over Lockerbie in Scotland | lockerbie
Find pages concerning the volcano eruption that happened in the mid-1990s in Iceland | Iceland eruption
Find reasons for climate warming | climate warming
Find recipes for making tiramisu | tiramisu
Find pages concerning the composing of a will | testamentti (will)
Find pictures of the Moon | moon
Find instructions for wool washing | wool washing

Experiment on the Context Categories (Paper V)

Find information about the space shuttle Challenger accident | challenger
Find a picture of the volcano Pinatubo | Pinatubo
Find ideas (instructions, recipes) about what can be made from chocolate | chocolate
Find information about what the World Health Organization (WHO) is doing to cure river blindness | river blindness
Find pages that deal generally with the city of Oulu | oulu
Find pictures of the planet Venus | venus
Find pages where you get information about preventing influenza | influenza
Let's imagine that you want to buy a mobile phone with a camera. Find pages where you find the prices of such products. | kamerapuhelin (camera phone)
You think you have seen a barnacle goose. Find pages with which you can confirm your observation (a picture, identification information). | valkoposkihanhi (barnacle goose)
Find information about the sinking of the Titanic | titanic
Find information about the hurricanes that appeared this autumn (2004) in the United States and the Caribbean | hurricane
Find ideas about what else can be made from tea leaves besides normal tea | tea
Find information about the actions the World Health Organization (WHO) takes against tuberculosis | tuberculosis +who
Find pages that concern the University of Oulu in general | oulu
Find information about colored contact lenses | contact lenses
Let's imagine that you want to buy a DVD player. Find price information on various products. | dvd-soitin (dvd-player)
You think you have seen a goldeneye. Find pages with which you can confirm your observation (pictures, identification information). | telkkä (goldeneye)

Paper I

Mika Käki and Anne Aula (2005). Findex: improving search result use through automatic filtering categories. Interacting with Computers, Volume 17, Issue 2, pages 187-206. © 2005 Elsevier B.V., reprinted with permission.


Interacting with Computers 17 (2005) 187-206

Findex: improving search result use through automatic filtering categories

Mika Käki*, Anne Aula
Department of Computer Sciences, University of Tampere, Kehruukoulunkatu 1, Tampere, Finland

Received 24 June 2004; revised 6 August 2004; accepted 10 January 2005

Abstract

Long result lists from web search engines can be tedious to use. We designed a text categorization algorithm and a filtering user interface to address the problem. The Findex system provides an overview of the results by presenting a list of the most frequent words and phrases as result categories next to the actual results. Selecting a category (word or phrase) filters the result list to show only the results containing it. An experiment with 20 participants was conducted to compare the category design to the de facto standard solution (a Google-type ranked list interface). Results show that the users were 25% faster and 21% more accurate with our system. In particular, the participants' speed of finding relevant results was 40% higher with the proposed system. Subjective ratings revealed significantly more positive attitudes towards the new system. Results indicate that the proposed design is feasible and beneficial. © 2005 Elsevier B.V. All rights reserved.

Keywords: Web search; Search user interface; Categorization; Clustering; Information access

1. Introduction

Web search engines are one of the most popular means of finding information from the World Wide Web. The huge amount of documents requires the users to describe their information need very precisely in order to avoid too long result lists. Indeed, formulating the information need accurately in the search query is known to be hard for typical web users.

* Corresponding author. E-mail addresses: mika.kaki@cs.uta.fi (M. Käki), anne.aula@cs.uta.fi (A. Aula).

Studies show that people enter very few search terms, typically one or two (Jansen et al., 1998). Such queries result in huge result sets which are hard to understand and slow to browse through. Very long result lists are clearly a major usability problem and a challenge for search engine user interfaces. To solve this problem, information scientists look for better retrieval algorithms to get better results in the first place, and human-computer interaction practitioners work on improved user interfaces for result handling. We follow the latter path and state the first question of the study: how to present the search results so that users are able to find the needed information efficiently?

We propose a new filtering user interface based on automatic result categories for accessing the results. The system is called Findex. The user interface presents an overview of the results to help the users identify and access the interesting results quickly. The overview is constructed by computing the most frequent words and phrases in the results and presenting them to the user as a list of categories next to the result list. The user can then select an interesting category from the list, and the user interface filters the result list to show the corresponding search results.

In order to test the feasibility and the usefulness of the design, we implemented it and conducted an experiment with 20 participants. The experiment compared our solution to a currently widely accepted search user interface model (ranked list). The experiment aimed to answer the second question of this study: can users understand the user interface and is it beneficial in accessing the search results? The results show that the new user interface was faster and more accurate compared to the conventional one, and users expressed positive attitudes towards it.

In the following, we will take a brief look at related work in the field. After that, the algorithm and the user interface of the proposed system are thoroughly described, followed by a description of the experiment and its results. Finally, the findings are summed up in the conclusions.

2. Related work

Three areas of research are relevant for this study. Firstly, research in categorization of textual documents (web or otherwise) sets the background for our categorization mechanism. Secondly, work in understanding and improving web search user interfaces provides us with information on search interface usability. Thirdly, studies on usability evaluation of web search engines give examples of how to study this phenomenon. We concentrate on papers that cover several of these areas, as they are the closest references.

Document categorization has long traditions in the information retrieval (IR) community. The incompleteness and impreciseness of simple lexical text matching methods have been identified, and document categorization is regarded as one option to overcome these difficulties. We can identify two commonly used categorization techniques: document classification means putting documents into predefined categories, whereas document clustering refers to dividing a set of documents into groups based on their similarity (Maarek et al., 2000). Various implementation techniques have been proposed for both.

Fig. 1. Categorizing web search user interface: Grouper (picture after Zamir and Etzioni, 1998).

Dumais et al. (1988) were among the first in the HCI community to suggest the use of clustering techniques (in their case based on Latent Semantic Indexing, LSI) for improving the access to textual information. Later, Scatter/Gather (Cutting et al., 1992) was one of the first real systems where the clustering approach was usability tested. The first results were not promising, as the clustering interface seemed both slower and less precise when searching for relevant articles on a given topic (Pirolli et al., 1996). However, in a follow-up study, Scatter/Gather was found to have potential because users were able to identify and use the most relevant categories in information gathering tasks (Hearst and Pedersen, 1996).

Later on, Zamir and Etzioni (1998) demonstrated the technical feasibility of clustering techniques in the web environment using their own algorithm. They have also proposed a search engine user interface, Grouper (Zamir and Etzioni, 1999) (Fig. 1), based on the clustering algorithm. The interface presents the titles of sample result pages grouped in clusters and resembles the user interface of Scatter/Gather. In addition, Grouper lets the user refine the query by selecting keywords from a category. Unfortunately, controlled usability tests on Grouper have not been published; evaluation has been based on log studies and measuring the properties of the algorithm.

The DART (Cho and Myaeng, 2000) system provides another type of user interface for a clustering system. DART uses the same clustering algorithm as Grouper. The contribution of DART is a dartboard-like user interface (Fig. 2) which visualizes the result set in relation to the clusters. The system has been usability tested, but the results are hard to interpret. For example, the number of participants was not reported.

Fig. 2. User interface of DART (picture after Cho and Myaeng, 2000).

One problem with clustering is the naming of the clusters. Typically, the most frequent or most distinctive word(s) found in the documents of the cluster are used as the name. This, however, may lead to rather long, uninformative, and incomprehensible names. A solution to this problem is to use document classification instead. When document categories are predefined, the names can also be predefined to correctly express the intended meaning. Chekuri et al. (1997) used this classification approach in a document classifier for web searches. They envisioned that the user could choose one or multiple categories along with the search terms when submitting a query. Unfortunately, neither the user interface nor a usability evaluation is reported for this solution.

A similar solution has been the focus of the recent SWISH prototype (Chen and Dumais, 2000; Dumais and Chen, 2001; Dumais et al., 2001). SWISH has a classifier for web search engine results and a special user interface for it (Fig. 3). The proposed solutions have also been thoroughly evaluated (both algorithmically and for usability). The user studies conclude that categories are indeed faster and more efficient for particular types of tasks. The studies compared multiple user interface solutions and found that categories are most effective when presented with some sample results. The examples seem to help the users understand the meaning of a category.

In addition to text-based clustering and classification, other categorization methods have also been identified. For instance, Cha Cha (Chen et al., 1999) and AMIT (Wittenburg and Sigman, 1997) use hypertext link structure as the basis for the categorization of the documents. The usability of Cha Cha has been under investigation, and answers to a questionnaire showed positive attitudes towards the system.

Fig. 3. Result classifying SWISH web search interface (picture after Dumais et al., 2001).

However, without more objective measures, the results are rather speculative. Another example of a different categorization scheme is DynaCat (Pratt and Fagan, 2000), which uses domain knowledge (man-made taxonomies) in its classification process. DynaCat has an overview-based user interface (Fig. 4), and a user test showed positive results about its usefulness.

The digital library project in New Zealand adopted yet another way of providing summaries of textual material. The project produced multiple examples of user interfaces that are based on key phrase extraction. The technique has been used to find related documents (Jones and Paynter, 1999) as well as to categorize search results on small devices (Jones et al., 2004).

A commercial product called Vivísimo uses categorization, but neither a description of the categorization algorithm nor usability test results have been published. The user interface of Vivísimo closely resembles ours, as it initially displays 10 categories beside the actual results. The biggest difference is that Vivísimo utilizes a hierarchical categorization scheme whereas ours is based on a simpler list. In the Vivísimo user interface, the initial top-level categories can be further explored by looking at their subcategories. The actual categorization scheme of Vivísimo is unknown, but it seems to utilize frequently occurring words, words that occur frequently together in the same result, and frequently occurring word strings (phrases). The hierarchy is presumably achieved by applying the same categorization scheme recursively to the top-level categories.

Fig. 4. DynaCat system that uses domain knowledge for result categorization (picture after Pratt and Fagan, 2000).

In summary, the problems in the past studies and systems that are relevant here are:
- lack of thorough user experiments: we have only limited knowledge of the usability of result categorization systems, and
- utilization of overly complex categorization techniques that are arguably not understood by the users.
Our work aims to solve these problems.

3. System description

Technically, our solution is something in between Grouper and the user interface developed by Chen and Dumais (2000). It does not use predefined categories like Chen and Dumais' system, but it does not use classical clustering techniques, either. Instead, we simply seek the most frequent words or phrases among the results and use them as the categories. The categories are shown in a separate list beside the results. Selecting a category displays the corresponding results in the result list, that is, filters the result set. The actual searches are done through the Google Web API. We shall first describe the categorization algorithm.

3.1. Result categorization

One of the restrictions in the web environment is that the whole document text body is not available for the categorization process. As others have demonstrated (Dumais et al., 2001; Zamir and Etzioni, 1998), clustering and classification methods can be used to categorize web search results based solely on the short text summaries (snippets) returned by the search engine. However, we believe that the naming problem associated with clustering, the limitations of classification, and the complexity of both can be avoided with a different approach. In order to make the system understandable and to make users feel in control, a simpler solution is desirable.

Our categorization is based solely on word frequencies in the result listing, i.e. in the titles and short text summaries (snippets). We basically select the n most frequent words and use them as the categories. Such a category then contains all the results where the word appears. It is commonly known that simply selecting the most frequent words does not work, because articles and other very frequent words (like "and") do not carry much meaning on their own. We use a stop word list to exclude such words from the category list. The second problem in simple lexical word matching is that simple inflections make words different (e.g. "car" and "cars" would be two different words). To reduce this problem, we use a word stemmer (the Snowball stemmer by Martin Porter, snowball.tartarus.org/). The stemmer removes word endings so that the simple inflections of a word map to the same word stem.

Both of the previous techniques are language-sensitive. Our software has been built in a way that enables us to easily add more languages as desired. For a new language, we need: (1) a stemmer, (2) a word frequency list of the language (corpus), and (3) a stop word list. The corpus is used for language detection, and it could also be used to approximate the stop word list automatically, but a human-made list is preferred for better accuracy. For testing purposes we have implemented the needed functions for English and Finnish. Language detection of the results is automatic and is based on word frequencies in the corpora.

With this simple logic, we get a list of words capturing the major topics of the results fairly well. However, some words acquire considerably more meaning when presented in context. For example, the word "states" does not convey as much meaning as the phrase "united states". To present the user with more meaningful categories, we search for the most frequent phrases in the results as well. A phrase is defined to be any string of words inside a sentence (between periods).

Categories are computed right after the search engine has returned the results. Each word in the results, except the stop words, is stemmed and stored with information on the result item that contained the word. For the phrases the procedure is similar. As each sentence is broken into phrases, each word comprising a phrase is stemmed and the resulting stemmed phrase is stored with information about which result contains it. As the category candidates are processed, unique words and phrases and so-called sub phrases are removed. Sub phrases are parts of a longer phrase (super phrase). For example, if we have the super phrase "united states", there will be the corresponding sub phrases "united" and "states" among the candidates. All sub phrases that are part of a super phrase in the same result are removed. Candidates are sorted according to frequency, and the n (currently 15) first candidates are selected as the categories.
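To make the procedure concrete, the following sketch reimplements it in outline form. It is a minimal illustration, not the Findex code itself: the stop word list is truncated, a trivial suffix stripper stands in for the Snowball stemmer, phrase length is capped at three words, and the sub-phrase rule is simplified to dropping candidates that never occur outside a longer phrase.

import re
from collections import defaultdict

STOP_WORDS = {"and", "or", "the", "a", "of", "is", "in", "to"}  # truncated for the sketch

def stem(word):
    # Stand-in for the Snowball stemmer: strip a few common English suffixes.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def categorize(snippets, n=15, max_len=3):
    """Pick the n most frequent stemmed words/phrases as filtering categories."""
    occurs = defaultdict(set)  # stemmed word or phrase -> ids of results containing it
    for rid, text in enumerate(snippets):
        for sentence in re.split(r"[.!?]", text.lower()):
            stems = [stem(w) for w in re.findall(r"\w+", sentence)
                     if w not in STOP_WORDS]
            # Any string of words inside a sentence is a candidate; single
            # words are handled as phrases of length one.
            for size in range(1, max_len + 1):
                for i in range(len(stems) - size + 1):
                    occurs[" ".join(stems[i:i + size])].add(rid)
    # Remove unique candidates, then sub-phrases subsumed by a super phrase
    # (simplified: drop a sub-phrase that never occurs outside a super phrase).
    cands = {p: ids for p, ids in occurs.items() if len(ids) > 1}
    for phrase in list(cands):
        for other, other_ids in occurs.items():
            if phrase != other and f" {phrase} " in f" {other} " and cands[phrase] <= other_ids:
                del cands[phrase]
                break
    # Frequency here is counted as the number of results containing the candidate.
    ranked = sorted(cands.items(), key=lambda kv: len(kv[1]), reverse=True)
    return ranked[:n]

A categorizer of this shape would, for the challenger query of Table 1, surface candidates such as "space shuttle challenger" and "challenger disaster", provided they recur across several snippets; the real system additionally maps each stem back to its most frequent surface form for display, which is omitted here.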

We are currently preparing another paper describing the algorithm in more detail. Some examples of the resulting categories for a few queries can be seen in Table 1.

Table 1. Categories calculated for a few popular queries

Query: challenger. Categories: Space shuttle challenger, Challenger disaster, Mission, Challenger learning center, January, Crew, Nasa, Tragedy, Challenger accident, Information, Science, Reagan, History, Description, Dodge challenger.

Query: sars. Categories: Health, Information, Global, China, Outbreak, Public, World, Cdc, Latest, Sars virus, Asia, April, Diseases, Sars epidemic, Government.

Query: jaguar. Categories: Club, Jaguar cars, Information, Jaguar panthera onca, Atari jaguar, Mac jaguar, Reviews, Performance, Wildlife, Virtual, Largest cat, First, Apple, Powerful, Homepage.

3.2. Properties of the categorization technique

The calculation of categories is computationally intensive if the number of candidate phrases is large. The dominant factor is the number of results to be categorized. By experimentation we found that the first 150 results seem to capture the most frequent categories. Increasing the number of results beyond that makes practically no difference in the categories. The current implementation needs about 2 seconds to form categories for these 150 search results, making it a feasible solution. Part of the calculation could be done in parallel while waiting for the search results.

As the categories are computed from the first 150 results returned by a search engine, the underlying ranking method has a considerable effect on the outcome. It determines (1) which results are categorized (the 150 first ones), (2) the order of results within the calculated categories (rank order), and (3) which categories are selected in a tie situation (the category with higher ranks in the original result list).

There are a few properties of the frequency-based categorization technique that require further attention. First, the selected categories are not exclusive, meaning that one result can sometimes be accessed through multiple categories. We consider such overlap in the categories to be a desirable feature, because the meaning of information depends on the context, and the overlapping categorization may help users realize some of those meanings. For example, let us consider the challenger query in Table 1. The following two results seem to discuss roughly the same topic (the space shuttle Challenger disaster).

  Online Ethics Center: Roger Boisjoly and the Challenger Disaster
  The space shuttle Challenger disaster recounted by Roger Boisjoly who attempted to get the mission cancelled... Roger Boisjoly and the Challenger Disaster...

  The Space Shuttle Challenger Disaster a NASA Tragedy
  The Space Shuttle Challenger Disaster, a NASA Tragedy. When the space shuttle... Related Resources to Space Shuttle Challenger Disaster...

The categorization algorithm places both of these results into two categories: challenger disaster and space shuttle challenger. For users looking for information about the accident, the former is more relevant, while the latter will attract those generally interested in the space shuttle. The listed results are relevant for both.

The second feature of the technique is that not all results are guaranteed to belong to any category. This could be very undesirable, as some relevant results could not be accessed at all. To overcome this problem, we provide a special built-in category for viewing all the results in the normal rank order list. According to our experience, this solution works fine.

The third property of the technique is that it may produce categories that are out of context or hard for the user to understand. Because the technique is based solely on statistical analysis, it does not consider the words as concepts having a well-defined meaning. As a result, the usefulness of the categories is not guaranteed, and it can vary between different result sets. We believe that users will understand this and that they are able to discard the possible low-quality categories.

3.3. User interface

The current prototype is implemented in Java as a standalone application. This approach enabled us to easily experiment with the user interface mechanics and made the implementation of the experiment easy and robust. However, we have also implemented the same functionality as a standard web service to be used through a standard web browser. This implementation makes the solution attractive to a wider user population and is seen to be feasible.

The user interface follows the basic idea used in most graphical e-mail clients and in Windows Explorer. These programs display a set of collections on the left of the window, while the right side shows the contents of the selected collection (like files in a folder). The same holds true here as well: the left side of the user interface shows the list of categories (words and phrases) and the right side displays the corresponding results (Fig. 5). This familiar design was assumed to be easy to understand and adopt. The category list and the result display are tightly coupled in the user interface so that changing the category selection immediately changes the contents of the result view. In the result view, the selected keyword is highlighted in pale yellow to make the connection between a selected keyword and a result evident.

Fig. 5. The proposed category user interface. The category list is on the left and the corresponding results are shown on the right.

If no category is selected, the result view is empty. However, this is not likely to happen, as the special built-in All results category is automatically selected after the query has been completed. As the name of the built-in category suggests, by selecting that category the user sees all the results retrieved for the query (by default, the first 150 results). It is also possible to select many categories. In this case, the results are required to belong to all the selected categories (intersection of the categories).

The order of the results is determined by the search engine. When all the results are shown, the result listing is the same as that returned by the search engine. When a category is selected, the relative order of the results is determined by the search engine, although the results are not likely to be sequential in the original list. The results have an ordinal that shows the position of the result in the rank order of the search engine.

Other parts of the user interface are fairly obvious. On the top, there is a field for entering the query and Search and Cancel buttons for controlling the search engine. The status bar on the bottom shows additional information like the number of documents found in total and the state of the system (on/off-line).
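The selection logic itself boils down to a single set operation. The sketch below is only an illustration with assumed data structures, mirroring the description above: category names map to sets of result indices, and a multi-category selection is resolved as their intersection while the original rank order is preserved.

def filter_results(results, categories, selected):
    """Resolve a category selection into the filtered, rank-ordered result list.

    results:    result items in the search engine's rank order
    categories: dict of category name -> set of result indices;
                the built-in "All results" entry maps to all indices
    selected:   names of the currently selected categories
    """
    if not selected:
        return []  # the result view is empty when nothing is selected
    # Multiple selections intersect: a result must belong to every category.
    allowed = set.intersection(*(categories[name] for name in selected))
    # Results keep their original ordinals, so rank order is preserved.
    return [(i + 1, results[i]) for i in sorted(allowed)]

Selecting, say, both "challenger disaster" and "nasa" would then show only the results indexed under both categories, still numbered by their original rank ordinals.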

4. Experiment

We conducted an experiment to evaluate the categorization algorithm and the user interface described in the previous sections. The experiment tested whether the new solution differs from a widely accepted solution, in this case a search engine user interface displaying results in a ranked list, in terms of speed, accuracy, and subjective satisfaction.

4.1. Participants

There were 20 volunteer participants (8 male, 12 female) in the study. The participants' average age was 35 years, varying from 19 to 57 years. The participants were recruited from the local university. They were students and personnel from seven organizational units. The participants had relatively long histories of general computer use (on average 11.5 years) as well as web use (on average 6 years). Almost all the participants can be regarded as experienced computer users.

4.2. Apparatus

Fig. 6. Desktop set-up in the test. Task window on the left and search window on the right. The screen size was 1200 × 1024 pixels and the search window size was 800 × 900 pixels.

Two user interfaces were used to access the search results:

1. The category interface (category UI, Fig. 6, right window) was the Findex user interface described above with two modifications: (1) multiple selection of keywords was not allowed, and (2) automatic selection of the All results category was disabled.

Fig. 7. Reference user interface showing 10 results on a page and radio buttons to navigate between the pages. Note the checkboxes for selecting results.

The latter modification means that the result list was initially empty in each task. Both modifications were made after pilot tests to make the experiment set-up more robust and focused. Category computation produced 15 categories and was based on the first 150 results.

2. The reference interface (reference UI, Fig. 7) was a Google web search engine imitation showing results in a ranked list on separate pages, 10 results per page. At the bottom of the window, there were controls to browse the pages in order (Previous and Next buttons) or in random order (radio buttons). The participant had access to 15 pages (the first 150 results).

Both user interfaces showed the results in the same visual format (Fig. 8). The format closely resembles that of Google, omitting the size, category, cached pages, and similar pages features found in Google.

Fig. 8. Visual format of the individual result elements in the experiment.

The reason why we did not use the publicly available Google interface was the controllability of the experiment variables. By using Google, we would have been faced with possible network problems and the changing content of the Google database. The instrumentation of different systems could also have caused errors in timings, for instance.

The experiment procedure was automated. During the experiment, the computer desktop contained two windows: one displayed the test tasks in textual format (task window), while the other was the user interface studied (search window), as shown in Fig. 6. The size and location of both windows were predetermined and fixed.

4.3. Design

The experiment had the search user interface as the only independent variable with two values: category UI and reference UI. The values of the independent variable were varied within the subjects and thus the analysis was done using repeated measures tools. As dependent variables we measured: (1) time to accomplish a task in seconds, (2) number of results selected for a task, (3) relevance of each selected result on a three-step scale (relevant, related, not relevant), and (4) subjective attitudes towards the systems.

4.4. Procedure

The experiments were carried out in a usability laboratory where participants were invited one at a time. Before the experiment, the whole procedure was explained to the participants and any questions regarding the set-up were answered. One experiment lasted about 45 min and contained 18 information seeking tasks and three claim rating tasks (one for each condition and one comparing the two). The experiment was divided into two parts, each consisting of nine information seeking tasks and a claim answering task. One part was carried out with the category interface and the other using the reference interface. The order of the parts and the tasks was counterbalanced between the participants. Before each part, the participants were told how the user interface worked, and they were allowed to try it by themselves.

The search tasks were based on predefined queries. We adopted this approach from an experiment by Dumais et al. (2001) where the participants did not formulate the queries themselves either. This approach makes perfect sense, because it removes the vast variability caused by the different search capabilities of the participants and lets us measure the performance in the result evaluation phase, which we aim to improve. The tasks were selected to cover multiple interests (e.g. astronomy, cooking, movies, cars, gardening, etc.). The queries were balanced in terms of (1) the number of obviously relevant categories, and (2) their position in the category list.

In addition, a few queries did not have any obviously relevant categories.

For each task, the participants were instructed to first read the task description, then push the Start button on the task window and promptly proceed to accomplish the task using the search window. While the participant read the task, the test apparatus fetched the results of the corresponding predefined query, but the user interface was hidden. When the participant pushed the Start button, the search window was enabled and the task execution could start immediately. The actual queries were executed before the experiment and saved locally for fast and equal access. Upon task completion, the participants were instructed to push the Done button on the task window. The time between the Start and Done button presses was measured as the total time for the task. The participants were told about this timing scheme. After each task, the participant used the Next button in the task window to see the next one. Between the tasks, the participants were in control of the situation, being able to take a short break if desired.

The actual task of the participant was to collect as many relevant results for the information seeking task as possible "as fast as you can". The task has two competing goals (speed and accuracy) to simulate realistic settings. In a real situation, users balance between these two goals intuitively based on various factors (time, importance of the task, available resources, etc.), but in a test situation such limiting factors have to be created artificially. In the pilot tests we observed that even if the task had these two competing goals, users tended to favor thoroughness, using an extended amount of time for a task. To enforce more realistic (faster) behavior, the time for each task was limited to 1 min. The participants were encouraged to utilize their own personal habits in selecting the results. When the 1 min time limit passed, the search window was automatically disabled and the clock was automatically stopped. The participants were able to proceed to the next task before this time limit if they thought that they had found enough results.

Collecting the results was implemented by adding a check box beside each result item (see Figs. 6 and 7). Participants were instructed to check the corresponding check box when a result seemed relevant. If a mistake happened, it was possible to clear the mark by clicking the checkbox again.

After the two sets of tasks and ratings, comparison ratings and demographic information were elicited by online forms in the task window. After this the experiment was over.

5. Results

5.1. Speed measures

As the total time reserved for each task was limited to 1 min, the plain task times do not tell the whole story about the speed. In both conditions (category and reference user interface), it was very common that the participant used all the time reserved for a task. Thus, the mean task times were very close to 1 min, being 56.6 s (sd = 5.5) for the category interface and 58.3 s (sd = 3.5) for the reference interface. Due to the experiment set-up, it is understandable that we did not observe a statistically significant difference in task time between the two conditions.

Fig. 9. There is a statistically significant difference in speed of use between the compared user interfaces, in favor of the category UI. Note also the greater proportion of relevant results found with it.

Repeated measures analysis of variance (ANOVA) gave F(1,19) = 3.65, n.s.

The number of results that the participants collected revealed an interesting finding, as Fig. 9 shows. On average, the participants were able to find 5.5 results per minute (sd = 2.0) using the category interface, contrasted with 4.1 per minute (sd = 1.1) found with the reference interface. This difference in speed is statistically significant (F(1,19) = 12.13, p < .01). However, the raw speed of result acquisition may not be the best measure for evaluating the efficiency of a search engine user interface. It may be that the selected results do not really answer the user's initial information need. In order to estimate the usefulness of the obtained results, we need to consider their accuracy.

5.2. Accuracy measures

For measuring the quality of the results the participants collected in the experiment, we judged the relevancy of the selected results for each task. Each selected result was assigned one of three relevance values: relevant, related, and not relevant. The rating was based only on the snippets, as were the participants' selections, because we did not want to include the snippet-document relationship as a factor in the study. A result was rated relevant if the snippet clearly indicated that the corresponding page had the desired content. In practice, for a result to be relevant we required the existence of multiple concepts of the task in the snippet. In contrast, a related snippet was required to refer to the same overall topic, but not to the other aspects of the task description. Finally, the not relevant snippets differed from the overall topic of the task. For example, in a task to find pictures of the planet Jupiter, a snippet referring to images of Jupiter was rated as relevant, a snippet referring generally to the planet Jupiter was rated as related, and a snippet referring to the Jupiter Research company was rated as not relevant.

Relevance was measured in four ways. Firstly, recall states which portion of all relevant results was found. In this study the recall measure was calculated from all the relevant results found by all the participants, not from all the relevant results in the result sets, which would be the conventional way. Secondly, precision tells the proportion of relevant results among the selected results. The third and fourth relevance measures concern accuracy in relation to speed.
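Stated compactly, the two accuracy formulas as applied here could be computed as in the sketch below. This is only an illustration with assumed inputs (sets of result identifiers), not code used in the study; the pooling of relevant results over all participants is what distinguishes this recall from the conventional one.

def precision(selected, relevant):
    """Proportion of relevant results among one participant's selections."""
    return len(selected & relevant) / len(selected) if selected else 0.0

def pooled_recall(selected, relevant, all_selections):
    """Recall against the pool of relevant results that any participant
    found for the task, rather than against the full result set."""
    pool = relevant & set().union(*all_selections)
    return len(selected & pool) / len(pool) if pool else 0.0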

Recall and precision measures show a difference between the user interfaces. When using the category interface, 62% of the results (sd = 13), on average, were relevant, whereas the precision with the reference user interface was 49% (sd = 15). The difference is statistically significant: F(1,19) = 14.49, p < .01. The recall measure revealed a similar difference: with the category user interface the participants found on average 33% (sd = 4) and with the reference user interface 19% (sd = 7) of the relevant results for each task (F(1,19) = 29.88, p < .01).

The breakdown of the speed measures according to relevance also shows significant differences (see Fig. 9). Using the category user interface, the participants were able to find 3.5 relevant results per minute on average (sd = 1.5), whereas the use of the reference user interface yielded 2.0 relevant results per minute (sd = 0.8; F(1,19) = 8.20, p < .01). The speed of acquiring related results (on average 1.4 per minute) is the same for both systems, but the category user interface reduces the number of not relevant results (0.4 vs. 0.7 not relevant results per minute, sd = 0.3 for both cases; F(1,19) = 11.24, p < .01).

5.3. Immediate success measures

In the real world, web users are typically not looking for as many results as possible; in many cases they are looking for the first good enough answer. To measure this kind of behavior, we analyzed the success of the first few selections.

Fig. 10. Cumulative proportion of cases where at least one relevant result has been found with the nth selection. The participants found the first relevant result sooner with the category UI than with the reference UI.

Fig. 10 shows the cumulative percentage of the cases where users have found at least one relevant result with the nth selection. This measure is called immediate accuracy (Käki, 2004).

The most interesting difference is produced already with the first selection, where in 56% of the cases users find a relevant result with the category UI, whereas for the reference UI the figure is 40%. The difference is statistically significant, F(1,19) = 12.5, p < .01. Note that the difference stays virtually the same after the first selection.

Fig. 11. The precision (proportion of relevant results) of the first selections was higher when using the category user interface.

The same effect can be seen in the precision of the first selections in Fig. 11. With the category UI, 59% of the first selections and 70% of the second selections are relevant. The corresponding numbers for the reference UI are 42 and 46%. In both cases the difference is significant, as ANOVA gives F(1,19) = 13.1, p < .01 and F(1,19) = 25.6, p < .001, respectively.

The comparison of the corresponding times, however, does not show notable differences. The average time used by the participants to find the first relevant result was about 21 s with both user interfaces. The second relevant result was found in 27 and 35 s (sd = 6 and 8), while finding the third relevant result took 30 and 36 s for the category and reference user interfaces, respectively. The difference in acquiring the second relevant result is significant (F(1,19) = 17.70, p < .01). It is notable, however, that when using the reference UI there were more cases where not a single relevant result was found for a task. Thus the average times for finding the first relevant result are not completely comparable.

5.4. Subjective measures

The alternative user interfaces were also evaluated using subjective measures. To achieve this, the participants were presented with a set of claims (e.g. "It was easy to find the results" and "The user interface was confusing") after using each user interface.

Fig. 12. Distribution of the subjective ratings shows more positive attitudes towards the category UI than towards the reference UI.

The participants responded to the claims on a six-point scale from agree (0) to disagree (5). At the end of the experiment, there was another set of claims where the participants had to compare the two user interfaces against each other (e.g. "Functionality was easier to understand" and "Task execution was harder"). The responses were again collected on a six-point scale, but here the range was from the category interface (0) to the reference interface (5).

Fig. 12 shows the results of the claim answers. In the picture, the scales have been normalized to have positive answers on the left and negative ones on the right. As Fig. 12 suggests, there is a difference in the subjective ratings of the systems. The analysis of the responses shows a statistically significant difference in the attitudes toward the systems. The median score (median = 1) for the category interface indicates more positive attitudes towards it than towards the reference interface (median = 2), and a Wilcoxon matched-pairs signed-ranks test gives Z = -2.51, p < .02 (see Fig. 12 for variability). Similarly, when comparing the two systems against each other, we see a statistically significant bias for the category interface. On a six-point scale where 0 stands for the reference UI and 5 for the category UI, the median score was 3.5. A one-sample Wilcoxon signed-ranks test gives V = 188.

6. Discussion and conclusions

In the beginning, we had two questions in mind. The first one was finding ways to present search results. The answer was Findex, a system with automatically formed categories that provide an overview of the results, in association with a filtering user interface. The second question was whether the proposed solution would work in practice or not. A range of measures collected in the experiment gives us good reason to believe that Findex does, indeed, perform better than a conventional ranked list, 10-results-per-page user interface. The belief is supported by four measurements collected in the experiment.

First, with the category user interface it is possible to browse through more results than with the conventional user interface because the searching speed is higher (5.5 vs. 4.1 results per minute). This is important, since many times the web search results are rather unreliable and it is thus desirable to be able to access alternative results quickly.

Second, and more importantly, the proposed interface not only gives the users more options but gives them more relevant options. Results showed that the increase in the number of results was due to an increased number of relevant results. The measured speed of finding relevant results was about 40% higher with the proposed system compared to the reference solution.

Third, for users in real situations, the immediate success of the search is also very important. The results show that when using our system, the users found the first relevant result earlier (with fewer selections) than with the reference user interface. The result is very important, although we did not measure the time difference in finding the first relevant result. The selection with which the first relevant result is found is crucial, since people tend to visit very few result pages in the web environment (Spink et al., 2001), typically one or two. According to the results, the proposed system performs better in this kind of search tactic.

Fourth, the results showed that users had positive attitudes towards the system. Although subjective ratings are rather soft measures, they do grasp a very critical issue in a user interface. Even the most sophisticated and efficient user interface is useless if people do not like it. Based on the questionnaire and informal discussions with the participants, we have good reason to believe that the proposed interface could find its audience.

Although the results are very promising, we must bear in mind that the improvements come at a certain price. In order to compute the categories, a large number of results must be fetched from a search engine. This takes time in addition to the actual computation of the categories. However, the system is built in a way that it can display the first 10 results immediately and then compute the categories in the background while the user evaluates the first results. This reduces the perceived cost of the operations, but delays can still have an effect on subjective ratings.

7. Future work

The basic functionality of the category user interface is promising, and we are planning to continue to investigate it. First, the number of categories presented to the user is an interesting question, and it was left completely untouched in the present study. Even the number of categories (15) was somewhat arbitrarily chosen. We plan to study the effect of the number of categories on the users' performance.

Second, we are executing a longitudinal study of the usability of the category user interface. In the study we aim to find out if the categories are beneficial in the long run when their use is truly voluntary and happens in natural settings. In addition, it is interesting to see if the category user interface alters the users' behavior in some way; for example, could it stimulate the users to make better reformulations of their queries? For the study, an implementation of the user interface has been done for the web environment.

Acknowledgements

This work was supported by the Graduate School in User-Centered Information Technology (UCIT). We would like to thank Poika Isokoski, Johanna Höysniemi and Kari-Jouko Räihä for comments and discussions that contributed to this study, as well as Tomi Heimonen, Natalie Jhaveri, and Harri Siirtola.

References

Chekuri, C., Goldwasser, M., Raghavan, P., Upfal, E., 1997. Web search using automatic classification. Proceedings of the Sixth International World Wide Web Conference WWW6, Santa Clara, USA.
Chen, H., Dumais, S., 2000. Bringing order to the web: automatically categorizing search results. Proceedings of CHI 2000, The Hague, Netherlands. ACM Press, New York.
Chen, M., Hearst, M., Hong, J., Lin, J., 1999. Cha Cha: a system for organizing Intranet search results. Proceedings of the Second USENIX Symposium on Internet Technologies and Systems (USITS).
Cho, E., Myaeng, S., 2000. Visualization of retrieval results using DART. Proceedings of RIAO 2000, Paris, France.
Cutting, D., Karger, D., Pedersen, J., Tukey, J., 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. Proceedings of SIGIR 1992, Copenhagen, Denmark. ACM Press, New York.
Dumais, S., Chen, H., 2000. Hierarchical classification of web content. Proceedings of SIGIR 2000, Athens, Greece. ACM Press, New York.
Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R., 1988. Using latent semantic analysis to improve access to textual information. Proceedings of CHI 88, Washington DC, USA. ACM Press, New York.
Dumais, S., Cutrell, E., Chen, H., 2001. Optimizing search by showing results in context. Proceedings of CHI 2001, Seattle, USA. ACM Press, New York.
Hearst, M., Pedersen, J., 1996. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. Proceedings of ACM SIGIR 96, Zürich, Switzerland. ACM Press, New York.
Jansen, B., Spink, A., Bateman, J., Saracevic, T., 1998. Searchers, the subjects they search, and sufficiency: a study of a large sample of Excite searchers. 1998 World Conference on the WWW and Internet, Orlando, USA.
Jones, S., Paynter, G., 1999. Topic-based browsing within a digital library using keyphrases. Proceedings of the ACM Conference on Digital Libraries, Berkeley, USA. ACM Press, New York.
Jones, S., Jones, M., Deo, S., 2004. Using keyphrases as search result surrogates on small screen devices. Personal and Ubiquitous Computing 8 (1).
Käki, M., 2004. Proportional search interface usability measures. Proceedings of NordiCHI 2004, Tampere, Finland. ACM Press, New York.
Maarek, Y., Fagin, R., Ben-Shaul, I., Pelleg, D., 2000. Ephemeral document clustering for web applications. IBM Research Report RJ, April 2000.
Pirolli, P., Schank, P., Hearst, M., Diehl, C., 1996. Scatter/Gather browsing communicates the topic structure of a very large text collection. Proceedings of CHI 96, Vancouver, Canada. ACM Press, New York.
Pratt, W., Fagan, L., 2000. The usefulness of dynamically categorizing search results. Journal of the American Medical Informatics Association 7 (6).
Spink, A., Wolfram, D., Jansen, M., Saracevic, T., 2001. Searching the web: the public and their queries. Journal of the American Society for Information Science and Technology 52 (3).
Wittenburg, K., Sigman, E., 1997. Integration of browsing, searching, and filtering in an applet for web information access. CHI 97 Electronic Publications: Late Breaking/Short Talks, Atlanta, USA.
Zamir, O., Etzioni, O., 1998. Web document clustering: a feasibility demonstration. Proceedings of the 19th International SIGIR Conference on Research and Development in Information Retrieval (SIGIR 98). ACM Press, New York.
Zamir, O., Etzioni, O., 1999. Grouper: a dynamic clustering interface to web search results. Proceedings of the Eighth International World Wide Web Conference WWW8, Toronto, Canada. Elsevier, Amsterdam.

Paper II

Mika Käki (2005). Proportional search interface usability measures. In Proceedings of NordiCHI 2004 (Tampere, Finland), October 23-27, 2004. ACM Press. © 2004 ACM, reprinted with permission.


Proportional Search Interface Usability Measures

Mika Käki
Department of Computer Sciences, University of Tampere, Finland

ABSTRACT

Speed, accuracy, and subjective satisfaction are the most common measures for evaluating the usability of search user interfaces. However, these measures do not facilitate comparisons optimally and they leave some important aspects of search user interfaces uncovered. We propose new, proportional measures to supplement the current ones. Search speed is a normalized measure for the speed of a search user interface expressed in answers per minute. Qualified search speed reveals the trade-off between speed and accuracy, while immediate search accuracy addresses the need to measure success in typical web search behavior where only the first few results are interesting. The proposed measures are evaluated by applying them to raw data from two studies and comparing them to earlier measures. The evaluations indicate that they have desirable features.

Author Keywords
Search user interface, usability evaluation, usability measure, speed, accuracy.

ACM Classification Keywords
H5.2. Information interfaces and presentation (e.g., HCI): User Interfaces: Evaluation/methodology. H3.3. Information Storage and Retrieval: Information Search and Retrieval: Search Process.

INTRODUCTION

In order to study the usability of search user interfaces we need proper measures. In the literature, speed, accuracy and subjective satisfaction measures are common and reveal interesting details. They have, however, a few shortcomings that call for additional measures.

First, comparing results even within one experiment, let alone between different experiments, is hard because the measures are typically not normalized in the research reports; instead, multiple raw numbers (like answers found and time used) are reported. Of course, unbiased comparison between studies will always be difficult as the test setup has a big effect on the results, but the problem is compounded by the presentation of multiple task-dependent measures. A good measure would be as simple as possible, yet it must not discard relevant information.

Second, the current measures do not reveal the sources of speed differences. In particular, the relation between speed and accuracy may be hard to understand since the current measures for those dimensions are completely separate. For example, it is essential to know if an increase in speed is due to careless behavior or better success.

Third, in the web environment, a typical goal for a search is to find just a few good enough answers to a question. This is demonstrated by studies showing that about half of the users view only one or two result pages per query [11]. Current search user interface usability measures do not capture the success of such behavior very well.

In order to address these problems, we present three new proportional, normalized usability measures. The new measures are designed for the result evaluation phase of the search process [10] where real users are involved.
- Search speed is a normalized speed measure expressed in answers per minute. It makes within-study comparisons simple and between-study comparisons a bit more feasible.
- Qualified search speed is a combination of speed and accuracy measures that reveals the trade-off between speed and accuracy. It shows the source of speed differences in terms of accuracy and is also measured in answers per minute.
- Immediate search accuracy is a measure that captures the success of result evaluation when only the first few hits are interesting.

These new measures are evaluated by applying them to data from real experiments and comparing them to conventional measures.

RELATED WORK

In usability evaluations, the measurements are typically based on the three major components of usability: effectiveness, efficiency, and satisfaction [3, 4]. The international ISO standard [4] defines effectiveness as the accuracy and completeness with which the users achieve specified goals, and efficiency as the resources expended in relation to the accuracy and completeness with which users achieve goals.

According to the standard, the efficiency measure divides the effectiveness (achieved results) by the resources used (e.g. time, human effort, or cost). In this work, we will leave satisfaction measures out of the discussion and concentrate on objective quantitative measures.

Usability measurements are strongly domain dependent. In the search user interface domain, effectiveness is typically measured in terms of accuracy (which is recognized as an example measure in the ISO standard as well). Time (speed of use) is typically used as the critical resource when calculating the efficiency. In the following we will discuss measuring practices in typical studies evaluating search user interfaces. Note that although almost every study in the information retrieval community deals with searching, they tend to focus on system performance [8] and thus only a few studies are mentioned here.

Speed Measures

The basic approach for measuring speed is simply to measure the time required for performing a task, but the actual implementation differs from study to study. In early evaluations of the Scatter/Gather system by Pirolli et al. [6], times were recorded simply on a task basis. In the results, they reported how many minutes it took, on average, to complete a task. In the study by Dumais et al. [2], roughly the same method was used, except that the times were divided into categories according to the difficulty of the task. Sebrechts et al. [9] used a different categorization method where task execution times were divided into categories according to the subject's computer experience.

Time measurements can also be recorded in a somewhat reversed manner, as Pratt and Fagan [7] did. They reported how many results users found in four minutes. This is close to measuring speed (achievement / time), but this normalization to four minutes is arbitrary and does not facilitate comparisons optimally. In a study by Dennis et al. [1], the time to bookmark a result page was measured and only one page was bookmarked per task. This setup makes the comparison fairly easy, since the reported time tells how much time it takes to find a result with the given user interface. However, this desirable feature was caused by the setup where only one result was chosen, and other types of tasks were not considered.

Accuracy Measures

Accuracy measures are based on the notion of relevance, which is typically determined by independent judges in relation to a task. In information retrieval studies, accuracy is typically a combination of two measures: recall and precision. Recall describes the amount of relevant results found in a search in relation to all the relevant results in the collection. As a perfect query in terms of recall could return all the entries in the collection, it is counterbalanced with the precision measure. Precision describes how clean the result set is by describing the density of relevant results in it. Precision, like recall, is expressed by a percentage number which states the proportion of relevant targets in the result set.

Recall and precision measures are designed for measuring the success of a query. In contrast, when the success of the result evaluation process is studied, the users need to complete the process by selecting the interesting results. Measures are then based on analyzing the true success of the selections. Recall and precision measures are used here too, but the calculation is different.
In these cases, recall describes the amount of relevant results selected in relation to the amount of them in the result set. Precision, on the other hand, describes the density of relevant results among the selected results. Veerasamy and Heikes [13] used such measures (called interactive recall and interactive precision) in their study of a graphical display of retrieval results. They asked participants to judge the relevance of the results in order to get the users' idea of the document relevance. Pirolli et al. [6] used only the precision measure in their test of the Scatter/Gather system. The selection of the results was implemented by a save functionality. Dennis et al. [1] used an approach where they reported the average relevance of the results found with a given user interface. Relevant results were indicated by bookmarking them. Further variations of the measures where user interaction is taken into account in accuracy evaluation were proposed and used by Veerasamy and Belkin [12].

Information Foraging Theory

Stuart Card, Peter Pirolli, and colleagues have done extensive research on information foraging theory [5] at Xerox PARC, and the results are relevant here as well. In its conventional form, information foraging theory states that the rate of gain of valuable information (R) can be calculated using the formula:

R = G / (T_B + T_W)    (1)

In the formula, G is the amount of gained information, T_B is the total time spent between information patches, and T_W is the total time spent within an information patch [5]. An information patch is understood to mean a collection of information, such as a document collection, a search result collection, or even a single document, that requires some actions for digesting the information. In the information foraging process, the forager navigates first between patches and then finds the actual meaningful information within a patch. The process is then started over by seeking a new patch.

If we discard the separation of the two different types of activities (between and within patches) for simplicity, equation 1 states the information gain rate per time unit. This matches with common practices in the field and is the basis for our proposed measurements as well.
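As an illustration of that basis, the sketch below computes the three proposed measures from logged task data in the spirit of equation 1, with the number of answers standing in for the gain G. It is a minimal reading of the informal definitions given in the introduction; the data layout (relevance labels per selection, per-task booleans) is an assumption, not something prescribed by the paper.

def search_speed(n_selected, task_time_s):
    # Search speed: answers per minute (equation 1 with G = answer count
    # and the between/within-patch times collapsed into one task time).
    return n_selected / (task_time_s / 60.0)

def qualified_search_speed(relevance_of_selected, task_time_s):
    # Qualified search speed: answers per minute broken down by the
    # relevance category of each selected result.
    rates = {}
    for level in ("relevant", "related", "not relevant"):
        count = relevance_of_selected.count(level)
        rates[level] = count / (task_time_s / 60.0)
    return rates

def immediate_accuracy(tasks_first_selections, n):
    # Proportion of tasks where a relevant result appears among the first
    # n selections; each task is a list of booleans in selection order
    # (True = that selection was relevant). Cumulative over n.
    hits = sum(1 for flags in tasks_first_selections if any(flags[:n]))
    return hits / len(tasks_first_selections)

Note how the qualified rates for the three relevance levels sum to the plain search speed, which is what lets the measure expose whether a speed difference comes from relevant or from careless selections.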

Figure 1. Compared user interfaces in our experiment. Category user interface on the left, reference user interface on the right.

The gap that is left in the information foraging theory in relation to making concrete measurements is the definition of information gain. The gap is well justified, as a fixed definition would unnecessarily reduce the scope of the theory. On the other hand, when we deal with concrete problems, we can be more specific and thus gain precision. This is our approach here: we apply the basic relationships stated in the information foraging theory and provide meaningful ways of measuring the gain. All this is done in the context of evaluating search user interfaces in the search result evaluation phase. We will return to this topic in the discussions of the new measures to see their relationship to the information foraging theory in more detail.

EXPERIMENT

We will evaluate the proposed measures using data from an experiment of ours. This experiment was conducted to evaluate a new search user interface idea by comparing it to the de facto standard solution. Our proposed user interface used automatically calculated categories for facilitating result access (Figure 1, left). As the categories we used the most common words and phrases found within the result titles and text summaries (snippets). A stop word list and a simple stemmer were used for improving the quality of the categories (e.g. discarding very common words such as "and" or "is"). As the category word (or phrase) selection was based solely on the word frequencies, the categories were neither exclusive nor exhaustive. There was a special built-in category for accessing all the results as one long list.

The hypothesis behind the category user interface was that it would allow users to identify and locate interesting results more easily and quickly than the conventional solution. The calculated categories were presented to the user as a list beside the actual result list. When a category was selected from the list, the result listing was filtered to display only those result items that contained the selected word or phrase. There were a total of 150 results that the user could access and from which the categories were computed.

Participants

There were 20 volunteer participants (8 male, 12 female) in the experiment. Their average age was 35 years, ranging from 19 to 57 years, and they were recruited from the local university. Almost all of the participants can be regarded as experienced computer users, but none of them was an information technology professional.

Apparatus

There were two user interfaces to access the search results:

1. The category interface (category UI, Figure 1, left) presented the users with a list of 15 automatically generated categories on the left side of the user interface. When the user selected a category, the corresponding results were shown on the right side of the user interface, much like in popular e-mail clients.

2. The reference interface (reference UI, Figure 1, right) was a Google web search engine imitation showing results in separate pages, ten results per page. The order of the results was defined by the search engine (Google). At the bottom of the window, there were controls to browse the pages in order (Previous and Next buttons) or in random order (a radio button for each page). There were 15 pages, so the participants could access a total of 150 results.

Design and Procedure

The experiment had the search user interface as the only independent variable with two values: category UI and reference UI. The values of the independent variable were varied within subjects, and thus the analysis was done using repeated measures tools. As dependent variables we measured: 1) time to accomplish a task in seconds, 2) number of results selected for a task, 3) relevance of the selected results on a three step scale (relevant, related, not relevant), and 4) subjective attitudes towards the systems.

The experiments were carried out in a usability laboratory. One experiment lasted approximately 45 minutes and contained 18 (9+9) information seeking tasks in two blocks: one carried out with the category interface and the other using the reference interface. The order of the blocks and the tasks was counterbalanced between the participants. For each task, there was a ready-made query and users did not (re)formulate the queries themselves. This restriction in the setup was necessary to properly focus on measuring the success in the result evaluation phase of the search. The actual task of the participants was to collect as many relevant results for the information seeking task as possible, as fast as they could. The participants collected results by using check boxes that were available beside each result item (see Figure 1).

In the test situation there were two windows on the computer desktop. The task window displayed the information seeking tasks for the participants, who were instructed to first read the task description, then push the Start button in the task window, and promptly proceed to accomplish the task in the search window. Upon task completion (participant's own decision or time-out), the participants were instructed to push the Done button in the task window. The time between the Start and Done button presses was measured as the total time for the task. This timing scheme was explained to the participants. Time for each task was limited to one minute.

Accuracy measures are based on ratings done by the experimenter (one person). The rating judgments were made based solely on the task description and the very same result title and summary texts that the participants saw in the experiment. Actual result pages were not used because that would have added an extra variable into the design (result summary vs. page relation), which we did not want. All the tasks had at least two defining concepts, as in "Find pictures of planet Mars". For relevant results, all of the concepts were required to be present in some form (different wording was of course allowed). Related results were those where only the most dominant concept was present (e.g. planet Mars). The rest of the results were considered not relevant.

RESULTS

For comparing the proposed measures, we present here the results of our experiment using the conventional measures: time, number of results, and precision. The time measure did not reveal very interesting results, because the test setup limited the total time for one task to one minute. Thus the mean times for the conditions were close to each other: 56.6 seconds (sd = 5.5) for the category UI and 58.3 seconds (sd = 3.5) for the reference UI. The difference is not statistically significant, as repeated measures analysis of variance (ANOVA) gives F(1,19) = 3.65, ns. In contrast, the number of results revealed a difference.
When using the category UI the participants were able to find on average 5.1 (sd = 2.1) results per task, whereas using the reference UI yielded on average 3.9 (sd = 1.2) selections. The difference is significant, as ANOVA gives F(1,19) = 9.24, p < .01. The precision measure also showed a statistically significant difference. When using the category UI, on average 65% (sd = 13) of the participants' selections were relevant in relation to the task. The corresponding number for the reference UI was 49% (sd = 15). ANOVA gave F(1,19) = 14.49, p < .01.

The results are compatible with other studies done with similar categorizing search user interfaces. For example, Pratt and Fagan [7] have reported similar results in favor of a categorizing user interface. When categories work, they enhance the result evaluation process by reducing the number of items that need to be evaluated. Users find interesting looking categories and evaluate only the results within those categories. The concentration of relevant documents in the interesting categories is higher than in the whole result set.

SEARCH SPEED

Definition

In order to make the comparison of speed measures easier, we suggest a proportional measure. When the search time and the number of results are combined into one measure, just as physical speed is measured in kilometers or miles per hour, we get a search user interface speed measure expressed in answers per minute (APM). It is calculated by dividing the number of answers found by the time it took to find them:

    search speed = answers found / minutes searched                      (2)

In relation to the ISO standard this is an efficiency measure, whereas the plain number of answers is a (simple) effectiveness measure. In terms of information foraging theory, we replace the G term in equation 1 with the number of results found, and the time is normalized to minutes. This concretizes the rate (R) in equation 1 to be answers per minute. The structure of equations 1 and 2 is essentially the same.
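As a concrete illustration, search speed can be computed directly from the raw task measurements. The sketch below is our own; the function is not code from the study, but the example values are the mean times and result counts reported above.

    def search_speed(answers_found, seconds_searched):
        """Search speed in answers per minute (APM), equation 2."""
        return answers_found / (seconds_searched / 60.0)

    # Mean values from our experiment (see Results above):
    print(round(search_speed(5.1, 56.6), 1))  # category UI:  5.4 APM
    print(round(search_speed(3.9, 58.3), 1))  # reference UI: 4.0 APM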

Whenever two (or more) measures are reduced into one, there is a risk of losing relevant information. This is the case here as well. The proposed measure does not make a distinction between a situation where one answer is found in 10 seconds and a situation where four answers are found in 40 seconds. In both cases the speed is 6 answers per minute and the details of the situation are lost in the measurement. However, we feel that the speed measure is nevertheless correct in this case as well. The situation can be compared to driving 50 km/h for 10 or 40 minutes. The traveled distance is different, but the speed is the same. This means that the proposed speed measure does not apply in every situation, and attention must be paid to measure selection.

The problem of reducing two measures into one has also been thoroughly discussed by Shumin Zhai [14] in the context of input devices. He points out that reducing the two Fitts' law variables (a and b) into one throughput figure for an input device leads to a measure that is dependent on the task. The same problem does not apply here, as our situation is not related to Fitts' law. Our measure is dependent on the task, but unlike the previous measures it is not dependent on the time used or on the number of results collected.

Evaluation

In order to evaluate the suggested measure, it was applied to the results of the Scatter/Gather evaluation by Pirolli et al. [6]. In their experiment the task was to find relevant documents for a given topic. The table below summarizes the results (SS = similarity search, SG = scatter/gather):

    Measurement                        SS       SG
    Original:
      Time used in minutes             (values as reported in [6])
      Number of answers                (values as reported in [6])
    Search speed:
      Answers per minute               (derived from the above)

The first two rows show the actual numbers reported in the paper, while the third row shows the same results in answers per minute. It is arguably easier to understand the relationship between the two user interfaces from the normalized search speed measure. It communicates that the SS condition was roughly four times faster than the SG condition. The relation is hard to see from the original results. In addition, measurements can be easily related to one's own experiences with similar user interfaces because of the normalization.

In the second table below, the search speed measure is applied to the data from our own experiment. Here the difference between the raw numbers and the normalized measure is not as large as in the previous example, because the time used for the tasks is roughly the same in both cases due to the test setup. Nevertheless, the suggested measure makes the comparison easier. Note also that the fairly large difference from the speeds in the experiment by Pirolli et al. is presumably due to the experimental setup (tasks, conditions, equipment, etc.).

    Measurement                        Category UI    Reference UI
    Raw numbers:
      Time used in minutes             0.94           0.97
      Number of answers                5.1            3.9
    Search speed:
      Answers per minute               5.4            4.0

When an analysis of variance is calculated on the answers per minute measure, we see a slightly stronger result compared to the conventional measures, where just the number of results revealed a significant difference. Here ANOVA gives F(1,19) = 11.3, p < .01. The slight increase in the F statistic is due to the combination of two measures that both have a difference in the same direction. In summary, search speed measures the same phenomenon as the previously used measures (it is calculated from the same numbers) and it can make distinctions between the measured objects.
QUALIFIED SEARCH SPEED

Definition

The previously used recall and precision measures do not directly tell where possible speed differences come from, or what the relation between speed and accuracy is. The suggested qualified search speed measure refines the search speed measure with categories of relevance to address this shortcoming. To keep the measure understandable and robust, we use only two or three categories of relevance. Like the previous measure, qualified search speed is also measured in answers per minute, with the distinction that the speed is calculated separately for each relevance category according to equation 3, where RC_i stands for relevance category i (typical categories are, e.g., relevant and irrelevant):

    qualified search speed_RC_i = answers found in RC_i / minutes searched    (3)

Note that the sum over all relevance categories equals the normal search speed. When qualified search speed is described in information foraging terminology, we can see that the gain is now defined more precisely than with search speed. While search speed takes into account only the number of results, qualified search speed adds the quality of the results into the equation. In essence, this gives us a more accurate estimate of the gain of information, and thus a more accurate rate of information gain. Note that this also shows in the rate magnitude: the rate is now stated in, for example, relevant results per minute.

Evaluation

When the qualified search speed measure is applied to the data of our experiment and compared to the simple measure of precision, a few observations can be made.
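The per-category calculation can be sketched as follows. The relevance labels and the example values below are illustrative only, not data from the experiment.

    from collections import Counter

    def qualified_search_speed(selections, minutes_searched):
        """Answers per minute split by relevance category (equation 3).

        selections is the relevance label of each selected result, e.g.
        ["relevant", "related", "not relevant", ...]. The returned values
        sum to the plain search speed of equation 2.
        """
        counts = Counter(selections)
        return {category: n / minutes_searched for category, n in counts.items()}

    # Hypothetical task: 4 relevant and 2 not relevant selections in one minute.
    print(qualified_search_speed(["relevant"] * 4 + ["not relevant"] * 2, 1.0))
    # {'relevant': 4.0, 'not relevant': 2.0} -> plain search speed 6.0 APM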

Figure 2. Qualified search speed measure compared to the precision measure of data gathered in our own study.

First, the proposed measure preserves the statistically significant difference that was observed with the conventional precision measure. ANOVA for the speed of acquiring relevant results gives F(1,19) = 32.4, p < .01. Second, both measures (Figure 2) convey roughly the same information about the precision of the user interfaces, including: 1) with the category UI more than half of the selected results were relevant, whereas with the reference UI about half of the results were relevant, and 2) using the category UI, participants were more successful in terms of precision. However, with the suggested qualified search speed measure, the magnitude of the difference in precision is not obvious, and thus the new measure cannot replace the old one. Third, in addition to what can be seen in the precision chart, the qualified search speed chart (Figure 2) reveals some interesting data. It shows that the improvement in speed is due to the fact that participants were able to select more relevant results while the proportion of not relevant results decreased a bit. The same information could surely be acquired by combining conventional speed and precision measures, but when the information is visible in one figure, it is arguably easier to see such a relationship. Note also that although the new measure is mainly concerned with the accuracy of use, it simultaneously informs the reader about the speed of use as well.

Figure 3 makes a comparison between the new measure and the original precision measure using the data collected in the Scatter/Gather experiment [6]. Here it is worthwhile to note that even though the precision measures are close to those in the previous example, the qualified search speed measure reveals large differences between the conditions. Qualified search speed seems to reveal the tradeoff between accuracy and speed convincingly in this case. We can also notice that both conditions here are much slower than those in Figure 2, as the qualified search speed is normalized just like the simpler search speed.

Figure 3. Qualified search speed measure compared to the precision measure in the Scatter/Gather study [6].

It is notable that qualified search speed does not measure the same phenomenon as precision, and thus they are not replaceable. We can imagine a situation where high qualified speed is associated with low precision and vice versa. In reality this could happen when users try to be very precise in one condition and very fast in another. On the other hand, we saw that qualified search speed can make clear distinctions between user interfaces, which is a necessary quality for a useful measure.

IMMEDIATE ACCURACY

Definition

The last suggested measure captures the success of typical web search behavior. In such a task, the user wants to find a piece of information that is good enough for the information need, and overall speed and accuracy are not as important as quick success. The measure is called immediate accuracy and it is expressed as a success rate. The success rate states the proportion of cases where at least one relevant result is found by the nth selection. To apply the measure, the order of each result selection must be stored and the relevance of the selections must be judged against the task. The selections of each task and participant are then gone through in the order they were made, and the frequency of finding the first relevant result is calculated for each selection ordinal (first, second, and so on).
When this figure is divided by the total number of observations (number of participants × number of tasks), we get the proportion of cases where the first relevant result is found at each selection. Equation 4 shows the calculation more formally, where n stands for the nth selection:

    immediate accuracy_n = number of first relevant results_n / total number of observations    (4)

When the figures calculated with equation 4 are plotted as a cumulative line chart (Figure 4), we can see when at least one relevant result is found, on average. For example, in Figure 4, after the second selection a relevant result has been found in 79% of the cases when using the category user interface. Notice also that the lines do not reach the 100% value. This means that in some of the cases the users were not able to find any relevant results.
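Computing the measure from logged selections can be sketched as follows. The input format (one ordered list of relevance judgments per participant-task observation) is our own assumption, not the actual logging format of the experiment.

    def immediate_accuracy(observations):
        """Cumulative success rate by selection ordinal (equation 4).

        observations: one list per (participant, task) pair giving the
        relevance (True/False) of each selection in the order made.
        Element n of the returned list is the proportion of observations
        in which at least one relevant result was found by selection n+1.
        """
        total = len(observations)
        longest = max(len(obs) for obs in observations)
        # Count, for each ordinal, the observations whose FIRST relevant
        # result occurs exactly there.
        firsts = [0] * longest
        for obs in observations:
            for i, relevant in enumerate(obs):
                if relevant:
                    firsts[i] += 1
                    break  # subsequent relevant selections are discarded
        # Cumulate and normalize to a success rate.
        rates, found = [], 0
        for f in firsts:
            found += f
            rates.append(found / total)
        return rates

    # Three hypothetical observations:
    print(immediate_accuracy([[True, False], [False, True], [False, False]]))
    # [0.33..., 0.66...]; as in Figure 4, the curve need not reach 100%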

Figure 4. Immediate accuracy of the category UI and the reference UI. The measure shows the proportion of cases where a relevant result has been found by the nth selection.

Looking back at information foraging theory, this measure takes a different approach compared to the previous ones. It abandons time as the limiting resource against which the gain is compared and replaces it with the selection ordinal (remember that the ISO standard leaves the choice of resource up to the domain). As this new resource is discrete in nature, expressing the measure as a single figure (rate) becomes hard, and thus, for example, a cumulative chart is preferred for easily interpretable figures. From another perspective of information foraging theory, we can say that immediate accuracy is a measure for estimating the beginning of the within-patch gain slope. Note that it is only an estimate of the beginning of the slope, as all subsequent relevant selections are discarded in this measure. In this view, we define an information patch to be a search result set.

Evaluation

The evaluation is based only on our own data, because the measure requires information that is typically not reported in publications. Figure 4 shows that users orient themselves faster when using the category UI, as the first selection already produces a relevant result in 56% of the cases. In contrast, the reference UI produces a relevant result in 40% of the first selections. By the second selection, the difference is a bit greater, since in 79% of the cases the users have found at least one relevant result with the category UI, while the corresponding number for the reference UI is 62%.

In the analysis of cumulative data, the most interesting points are those where the difference between the compared measurements changes. Change points are the most important because cumulative figures preserve the difference if no further changes happen. In our case the difference is made at the first selection and remains virtually the same afterwards. This difference is statistically significant, as ANOVA gives F(1,19) = 12.5, p < .01, and it is preserved throughout the selections (F(1,19) ≥ 10.4, p < .01 for all subsequent selections).

The findings of Spink et al. [11] state that users only select one or two results per query. Immediate accuracy allows us to see the success of the studied user interface in such a case. We can focus on a given selection and quickly see the success rate at that point. Note that this kind of information is not available using conventional accuracy measures or straightforward speed measures.

Immediate Success Speed

Another fairly simple and obvious way of measuring immediate success would be to record the time to the first relevant result. We tried this measure as well, but found a problem. In our experiment, the average time to find the first relevant result was practically the same in both cases (20 and 21 seconds for the category and the reference UI, respectively) and there was no statistically significant difference. This could, of course, be the true situation, but the number of relevant results suggested the opposite. The problem comes from the fact that the first relevant result is not always found. With the category UI, users were not able to find a single relevant result for a task in 10% of the cases, whereas the same number for the reference UI was 21%. We felt that this is a big difference and that it should be visible in the measurement as well.
However, we were not able to come up with a reasonable way of normalizing the time measurement in this respect, and thus we do not promote the measure as such. In addition, the results of Spink et al. [11] suggest that the time to the first relevant result is not very important for the search process. Since searchers tend to open only one or two results, time does not seem to be the limiting factor, but the number of result selections is. This also supports the choice of immediate accuracy over the time to the first relevant result.

DISCUSSION

Our goal was to provide search user interface designers, researchers, and evaluators with additional measures that complement the current ones. The first problem with the current measures is that comparing results is hard, even within one experiment. Proportional measures make within-study comparisons easy and, in addition, they let readers relate their previous experience better to the presented results. We proposed the normalized search speed measure, which is expressed in answers per minute. As the measure combines two figures (number of answers and time searched) into one proportional number, it makes comparisons within an experiment easy and comparisons between experiments a bit more feasible.

The second shortcoming of the current measures is that it is difficult to see the tradeoff between speed and accuracy. To address this problem, we proposed the qualified search speed measure, which divides the search speed measure into relevance categories. The measure allows readers to see what the source of the speed is in terms of accuracy. In the evaluation we showed that conventional measures may tell only half of the story. For instance, in the case of the Scatter/Gather experiment, the precision measure showed only a moderate difference between the systems, whereas qualified search speed revealed a vast difference in the gain of relevant results. Combining speed and accuracy measures is particularly effective in such a case, as it eliminates the need to mentally combine the two.

The third weakness of the current measures is their inability to capture users' success in typical web search behavior, where the first good enough result is looked for. We proposed the immediate accuracy measure to address this flaw. Immediate accuracy shows the proportion of cases where the users are able to find at least one relevant result by the nth result selection. It allows readers to see how well and how fast users can orient themselves to the task with the given user interface. As the measurements are based on finding the first relevant result, the reader can compare how well different solutions support the users' goal of finding the first relevant answer (and presumably a few others as well) to the search task.

The proposed measures are not intended to replace the old measures, but rather to complement them. They lessen the mental burden posed on the reader, as important information of different types (e.g. speed, accuracy) is combined into one proportional measure. In summary, the proposed measures capture important characteristics of search user interface usability and communicate them effectively.

The issue of making comparisons between experiments is not completely solved by these new measures. We feel that the problem is not in the properties of the new measures but in the nature of the phenomena to be measured. In the context of search user interfaces, the test settings have a huge effect on the results, and this cannot be solved simply with new measures. One solution to the problem could be test setup standardization. In the TREC interactive track such an effort has been taken, but it seems that the wide variety of research questions connected to searching cannot be addressed with a single standard test setup.

ACKNOWLEDGMENTS

This work was supported by the Graduate School in User Centered Information Technology (UCIT). I would like to thank Scott MacKenzie, Kari-Jouko Räihä, Poika Isokoski, and Natalie Jhaveri for invaluable comments and encouragement.

REFERENCES

1. Dennis, S., Bruza, P., McArthur, R. Web Searching: A Process-Oriented Experimental Study of Three Interactive Search Paradigms. Journal of the American Society for Information Science and Technology, Vol. 53, No. 2, 2002.
2. Dumais, S., Cutrell, E., Chen, H. Optimizing Search by Showing Results in Context. Proceedings of ACM CHI 2001 (Seattle, USA), ACM Press, 2001.
3. Frøkjær, E., Hertzum, M., Hornbæk, K. Measuring Usability: Are Effectiveness, Efficiency, and Satisfaction Really Correlated? Proceedings of ACM CHI 2000 (The Hague, Netherlands), ACM Press, 2000.
4. ISO 9241-11: Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs) - Part 11: Guidance on Usability. International Organization for Standardization, March 1998.
5. Pirolli, P., Card, S. Information Foraging. Psychological Review, Vol. 106, No. 4, 1999.
6. Pirolli, P., Schank, P., Hearst, M., Diehl, C. Scatter/Gather Browsing Communicates the Topic Structure of a Very Large Text Collection. Proceedings of ACM CHI 96 (Vancouver, Canada), ACM Press, 1996.
7. Pratt, W., Fagan, L. The Usefulness of Dynamically Categorizing Search Results. Journal of the American Medical Informatics Association, Vol. 7, No. 6, Nov/Dec 2000.
8. Saracevic, T. Evaluation of Evaluation in Information Retrieval. Proceedings of ACM SIGIR 95 (Seattle, USA), ACM Press, 1995.
9. Sebrechts, M., Vasilakis, J., Miller, M., Cugini, J., Laskowski, S. Visualization of Search Results: A Comparative Evaluation of Text, 2D, and 3D Interfaces. Proceedings of ACM SIGIR 99 (Berkeley, USA), ACM Press, 1999.
10. Shneiderman, B., Byrd, D., Croft, B. Clarifying Search: A User-Interface Framework for Text Searches. D-Lib Magazine, January 1997.
11. Spink, A., Wolfram, D., Jansen, M., Saracevic, T. Searching the Web: The Public and Their Queries. Journal of the American Society for Information Science and Technology, Vol. 52, No. 6, 2001.
12. Veerasamy, A., Belkin, N. Evaluation of a Tool for Visualization of Information Retrieval Results. Proceedings of ACM SIGIR 96 (Zurich, Switzerland), ACM Press, 1996.
13. Veerasamy, A., Heikes, R. Effectiveness of a Graphical Display of Retrieval Results. Proceedings of ACM SIGIR 97 (Philadelphia, USA), ACM Press, 1997.
14. Zhai, S. On the Validity of Throughput as a Characteristic of Computer Input. IBM Research Report RJ 10253, IBM Research Division, August 2002.

Paper III

Mika Käki (2005). Optimizing the number of search result categories. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2005 (Portland, USA), April 2-7, 2005. ACM Press. © ACM. Reprinted with permission.

Paper IV

Mika Käki (2005). Findex: search result categories help users when document ranking fails. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2005 (Portland, USA), April 2-7, 2005. ACM Press. © ACM. Reprinted with permission.

Paper V

Mika Käki (forthcoming). fKWIC: frequency based keyword-in-context index for filtering web search results. Accepted for publication in the Journal of the American Society for Information Science and Technology, Wiley. This is a minor revision differing in layout. Reprinted with permission.


fKWIC: Frequency Based Keyword-in-Context Index for Filtering Web Search Results

Mika Käki
Department of Computer Sciences, University of Tampere, Finland

ABSTRACT

Enormous Web search engine databases combined with short search queries result in large result sets that are often difficult to access. Result ranking works fairly well, but users need help when it fails. For these situations, we propose a filtering interface that is inspired by keyword-in-context (KWIC) indices. The user interface lists the most frequent keyword contexts (fKWIC). When a context is selected, the corresponding results are displayed in the result list, allowing users to concentrate on one specific context at a time. We compared the keyword context index user interface to the rank order result listing in an experiment with 36 participants. The results show that the proposed user interface was 29% faster in finding relevant results and that the precision of the selected results was 19% higher. In addition, participants showed positive attitudes towards the system.

Keywords

Web search, search user interface, keyword-in-context index, result filtering, information access.

1 INTRODUCTION

Web search engines are one of the most important ways of finding information in the World Wide Web (Web); Jakob Nielsen recently stated that in 88% of the cases, users start Web navigation by using a search engine (Nielsen, 2004). Log studies have shown that the topics that users search for have changed over time, but the search skills and habits have remained largely the same: only a few words are used in a query (Spink, Jansen, Wolfram, & Saracevic, 2002). In addition, Boolean operators are rarely used (Jansen & Pooch, 2001). When these kinds of queries are combined with the enormous databases hosted by Web search engines (Google indexes over 8 billion Web pages, Google Search Engine), it is to be expected that search queries will fail from time to time.

Certainly the problem is not newly discovered. It has been studied in the areas of information retrieval (Jansen, 2005), result ranking (Brin & Page, 1998), and result visualization (Kartoo Search Engine). Query refinement aids (Anick, 2003) try to reduce the problem in the query formulation phase, whereas result categorization aims to ease the understanding and accessing of the result list. Studies (Zamir & Etzioni, 1998; Zamir & Etzioni, 1999; Vivísimo Search Engine; Dumais & Chen, 2000; Chen & Dumais, 2000; Käki & Aula, 2005) have demonstrated the usefulness of the result categorization approach.

Our current work is based on these ideas. We propose the frequency based keyword context index (fKWIC), a new solution inspired by keyword-in-context (KWIC) indices. fKWIC extracts the most frequent keyword contexts with a new algorithm.
The contexts are employed in a filtering user interface that provides easy access to interesting results. The user interface is built around a conventional ranked result listing, and the keyword context index is provided as an additional tool for coping in situations where result ranking does not support the user's information need. The solution is expected to be most useful when the initial query formulation is ambiguous or when the user is engaged in an undirected informational search task where a broad understanding of a topic is required. In both cases the intended meaning of the query is not entirely obvious, and thus the ordering of the results may not support the user's need. With the proposed index, the user can access meaningful results easily even if they are located far down in the rank order.

To find out whether the keyword context index is beneficial in actual use, we conducted an experiment with 36 participants. The experiment compared the fKWIC user interface to the ranked result list solution (the de facto standard) with 18 (9 + 9) search tasks. The results show that the new user interface is advantageous, especially when multiple results are needed. Users were 29% faster and 19% more accurate (in terms of precision) with the new system compared to the control condition (ranked list). In addition, the first relevant result is found with fewer selections. This is important in typical Web searches, where the first relevant result may be the only one the user retrieves.

The paper first presents a review of relevant literature in the field. Next, the algorithm for building the keyword context index is explained and the user interface design decisions are discussed. This is followed by the description of the experiment and the presentation of its results. We end with conclusions.

2 RELATED WORK

Two areas of research are most relevant for this work. First, the work done in developing keyword-in-context indices has been a major inspiration for our design. Second, the way in which the new user interface solution is applied is largely based on the work done in Web search result categorization.

2.1 Keyword-in-Context (KWIC) Indices

A keyword-in-context (KWIC) index is a form of concordance (word index) where each occurrence of the keyword is displayed together with the surrounding words in a list of strings. An example index could list a set of book titles as a response to a library database query, as shown in Figure 1 for the query "abstract". The keyword used in the query is the basis for the index, while the remaining words in the title form the context for it (Figure 1). In these indices, the keyword is typically printed aligned in the middle to make visual scanning of the list easier. Hans Peter Luhn published a paper on KWICs in 1960.

    graphic scheme based on       abstract  and index cards
    tic information using         abstract  and index publications
                                  abstract  archive of alcohol literatu
    publishing modern             abstract  bulletins company
    pharmaceutical                abstract  bulletin
    a punched card                abstract  file on solid state and tra
    the                           abstract  of the technical report
    relation of an                abstract  to its original
    from journal article to       abstract

Figure 1. An example of a keyword-in-context (KWIC) index where the keyword is "abstract" (after Salton (1989)).

In a recent application of the idea, similar word listings were used to display key phrases in a Web site user interface (Paynter, Witten, Cunningham, & Buchanan, 2000).
Their user interface shows the search results in a list where the keyword is first accompanied by one context word (preceding or subsequent). When the user selects one of these contexts, a second list displays three words for each hit: the two selected keywords and one additional context word. This interaction forms a hierarchical index where users browse to the desired content by sequential, narrowing context selections. In another prototype called Keyphind, the results of a digital library query were represented by extracted keyphrases, and selecting them allowed users to find the interesting results (Gutwin, Paynter, Witten, Nevill-Manning, & Frank, 1999).

The common form of Web search engine result summaries (snippets) can be seen as a KWIC index. The summaries typically consist of phrases selected from the source text so that the query keyword is shown in a context (Figure 2). The keywords are not aligned spatially, but they are highlighted with a bold face type. This design saves a considerable amount of screen space and delivers rich context information.

    Statistical Abstract of the US
    Description of contents of Statistical Abstract of the United States
    and links to its supplements and additional features. Census Bureau
    - Cached - Similar pages

Figure 2. An example of a modern counterpart of the KWIC index from the Google search engine (Google Search Engine), keyword "abstract".

2.2 Web Search Result Categorization

The so-called cluster hypothesis (Jardine & Rijsbergen, 1971) states that the relevant documents for a query tend to be similar to each other. If such clusters are available, the searcher's task is first to locate the relevant cluster(s) and then evaluate only the documents within them. This reduces the number of documents that need to be evaluated and makes the search process more effective. This great promise has inspired a substantial amount of research in the area, and the interesting question is whether, and how, such clustering can be done automatically.

In this paper, the concept of clustering is used to describe a technique for automatically bringing similar documents together. Classification, on the other hand, refers to a technique for putting documents into predefined categories. Both of them can be used to categorize documents (or search results) in order to realize the promise of the cluster hypothesis.

Scatter/Gather was one of the first systems where the cluster hypothesis was tested with end users, and it showed that the clustering approach is technically feasible (Cutting, Karger, Pedersen, & Tukey, 1992). A later user study provided more evidence for the cluster hypothesis, as the users were able to select the most relevant clusters during the search process (Hearst & Pedersen, 1996). This meant that the users found the most valuable information sources in that user interface and that clustering could enhance user performance.

Recently, a lot of research effort has been put into categorizing Web search results. Zamir and Etzioni (1998) demonstrated the benefits of a special type of clustering technique in the Web environment. They also implemented a Web search engine user interface, Grouper, based on the idea (Zamir & Etzioni, 1999). Our Findex system is closely related, but aims to be simpler and more transparent for the users. It has been shown to be beneficial in a laboratory study and in a longitudinal study (Käki & Aula, 2005; Käki, 2005). Zhen et al. (2004) have improved the quality of the categories by utilizing learning techniques in their clustering algorithm.
In contrast to all the previous prototypes, which were based on clustering, the Swish and Dynacat prototypes use the classification approach. This approach solves the problem of naming the categories that is sometimes associated with clustering techniques. The Swish prototype works in the Web environment by using a pre-taught classification scheme (Dumais, Cutrell, & Chen, 2001).
Dynacat, in contrast, uses a predefined classification that is based on the rich metadata attached to medical documents (Pratt & Fagan, 2000).

In addition to the categorization techniques, it is possible to focus on the structure of the resulting categories. There are two main types: hierarchical (like folders in the file system) and flat (a list of categories). Many of the above-mentioned systems employ a flat category list, but Scatter/Gather and Dynacat present hierarchies to the users. Swish uses a hierarchical classification system, but the result categories form a flat list. In addition to these, Ferragina and Gulli (2005) and Kummamuru (2004) have proposed systems that present an explicit hierarchy of categories in the form of a folder structure, as in modern file systems.

A few commercial Web search engines have been introduced that use categorization techniques. These include Vivísimo (Vivísimo Search Engine), iBoogie (iBoogie Search Engine), and WiseNut (WiseNut Search Engine). These are closely related to our system, but the actual categorization methods, or studies about their usefulness, have not been published.

3 SYSTEM DESCRIPTION

3.1 Building the Frequency Based Keyword Context Index

We took the result pages returned by the search engine as the starting point for our design. This means that the full text of the result documents is not available, only short summaries (snippets) of them. Although this can be seen as a restriction, it also makes it possible to employ the system on any search engine that produces this kind of result list. In addition, the snippets were shown to be a feasible basis for categorization by Zamir and Etzioni (1998).

Although the original layout of the KWIC index is simple and easy to understand, it has some weaknesses. First, the original form considers only document titles, whereas Web searches are based on full text search. This means that the matching document titles may not contain any of the query keywords, and thus displaying the keyword contexts in the titles is impossible. Second, short text summaries that display the keyword contexts in the document body improve users' speed and accuracy in search tasks (Tombros & Sanderson, 1998), but such summaries are not available in original KWIC indices. The third issue concerns screen real estate. KWIC indices were designed at a time when displays were character based. Modern graphical environments are built from a different perspective, and thus the original layout of the KWIC indices may not be optimal for Web search results.

To address these issues and to utilize the idea in a modern environment, we altered the original form of the KWIC index by emphasizing the index (table of contents) side of the idea. To make the index compact enough, we present the users only with the most frequently occurring keyword contexts. This solution was expected to 1) save valuable screen real estate, 2) let users take advantage of the text summaries, and 3) let users access interesting results more easily. All these points are vital in making result access more efficient.

The computation of the most frequent keyword contexts (fKWIC) starts by listing all phrases within result sentences that contain a keyword. If a snippet does not contain a full sentence (from period to period), then the partial sentence is used. A phrase, on the other hand, is a string of words within a sentence. The length of these phrases is two or more words, the maximum being the length of the sentence containing the keyword (the context sentence). As the phrases are created, stop words (we use publicly available stop word lists) are excluded simply by stripping them from the phrases. For each phrase, information about the associated result is stored.
When the list of candidate phrases is complete, instances that are associated with two results or fewer are removed. The next phase reduces repetition and redundancy in the keyword contexts and aims to ensure proper coverage. First, similar candidate phrases are merged together. Similar phrases are those that are composed of the same words (possibly in a different order). Queries composed of proper names often produce phrases where the names appear in various orders, which makes them redundant and distracting. In addition to ignoring the order of the words, we ignore simple inflections of the words. For this, we use a non-exact string matching algorithm. The algorithm allows strings to differ three characters in length and requires only 80% of the first characters (of the shorter string) to match. For example, the words "house", "houses", and "housing" are considered to be the same (the four first characters are the same, and the lengths differ by two characters at most). We acknowledge the danger of oversimplification in this approach, but it has been found to work satisfactorily in Finnish, English, French, and German.

Second, sub- and super-phrases are processed. A sub-phrase is a phrase that appears inside a longer phrase; for example, "dog breed" is a sub-phrase of "dog breed information" (two possible keyword contexts for the keyword "dog"). Sub-phrases are removed if they are associated only with the same results (overlapping) as the super-phrase. If the number of non-overlapping results associated with the sub-phrase is less than half of the number of results associated with the super-phrase, the sub-phrase is removed from the candidate list. Thus, the sub-phrase must contribute enough to the coverage in order to get selected. For example, if "dog breed information" appears in six results, "dog breed" must appear in at least nine results to be kept in the candidate list. Otherwise, the overlap of the keyword contexts is considered to be too great.

Finally, the candidate phrases are sorted according to the number of results that are associated only with the given phrase. This is done to maximize the coverage of the phrases in relation to the results. The top n (by default 15) phrases of this final list are selected and shown to the user as the keyword context index.

The items in the final keyword context index can be overlapping, meaning that one result can be associated with more than one keyword context. This occurs simply because one result may contain many keyword contexts. This may be confusing if fKWIC is thought of strictly as an index that covers the whole contents of the indexed result set. On the other hand, this functionality makes the information accessible in various contexts and thus makes the recognition of relevant information easier. The technique does not guarantee that all results can be found from the index. In this respect, we favored understandability over coverage in our design. This shortcoming is addressed in the user interface with a functionality that allows users to see the whole result list.

The performance of the algorithm is of great importance in the Web environment. Our algorithm performs acceptably when the server load is relatively low. It takes about 250 milliseconds (sd = 182, n = 74 queries) to compute the keyword context index for 150 results. In our experience, 150 results is a reasonable tradeoff between speed and coverage. The final keyword contexts tend to remain roughly the same even if the number of results is increased.
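The selection logic described above can be summarized in code. The following is a simplified sketch of our description, not the actual implementation; only the stated rules (a length difference of at most three characters, 80% of the shorter word's leading characters, the half-coverage pruning rule, and the top 15 cut) are taken from the text, while all names and the demo data are illustrative.

    def words_match(a, b):
        """Non-exact word match: "house", "houses" and "housing" all match."""
        a, b = a.lower(), b.lower()
        if abs(len(a) - len(b)) > 3:
            return False
        prefix = max(1, int(0.8 * min(len(a), len(b))))
        return a[:prefix] == b[:prefix]

    def select_contexts(candidates, top_n=15):
        """Prune and rank candidate phrases.

        candidates maps a phrase to the set of result ids containing it,
        assumed to be already merged for word order and inflection and
        filtered to phrases occurring in at least three results.
        """
        phrases = dict(candidates)
        # Remove sub-phrases that contribute too little beyond a super-phrase.
        for sub in sorted(candidates, key=len):
            for sup in candidates:
                if sub != sup and sub in sup and sub in phrases and sup in phrases:
                    extra = phrases[sub] - phrases[sup]  # non-overlapping results
                    if len(extra) < len(phrases[sup]) / 2:
                        del phrases[sub]
                        break
        # Sort by the number of results associated only with the given phrase.
        def unique_coverage(phrase):
            elsewhere = set()
            for other, results in phrases.items():
                if other != phrase:
                    elsewhere |= results
            return len(phrases[phrase] - elsewhere)
        return sorted(phrases, key=unique_coverage, reverse=True)[:top_n]

    demo = {
        "dog breed information": {1, 2, 3, 4, 5, 6},
        "dog breed": {1, 2, 3, 4, 5, 6, 7, 8},  # only two extra results, pruned
        "dog training": {9, 10, 11},
    }
    print(select_contexts(demo))       # ['dog breed information', 'dog training']
    print(words_match("house", "housing"))  # True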
3.2 User Interface

The idea of adding new features on top of existing search services is followed in the user interface. The current ranked list presentation of the search results works well in many cases. Thus, we wanted to let people utilize their existing knowledge of and experience with it. As a result, the conventional result list is the core of our user interface, and it is enhanced by the automatically computed keyword context index on the side (Figure 3).
The user interface closely resembles that of Vivísimo, except that we use a flat categorization where Vivísimo uses a hierarchical one. The keyword context index is used to filter the result listing: when the user selects a keyword context, the result listing is updated to show only those items that contain the selected context (a minimal sketch of this filtering step is given at the end of this section). The relationship between the context and the result item is explained by highlighting the context instances in the result texts. In addition to the automatically computed keyword contexts, there is a special built-in item in the index for accessing all the retrieved results. This "All results" item is automatically selected after each search, so that initially the user always sees both the keyword context index and all of the results. This makes the default interaction with the system close to that of a traditional search engine, while the new advanced features are readily available.

Figure 3. Screenshot of the keyword context (fKWIC) user interface: the list of the most frequent keyword contexts on the left, the result listing on the right.

The visual format of the keyword context index was altered from the original KWIC design. Although the ability to easily scan the index along the aligned keywords is central to the idea, it would have required too much screen space in our layout. Thus, we simply list the most common keyword contexts left aligned. The order of the results is always determined by the underlying search engine. When one of the keyword contexts is selected, the relative order of the results follows the original order of the search engine, although the results are typically not sequential in the original listing.

One important issue in search user interfaces is the perceived speed of the system. If the system is slow, users may start to hesitate in making queries. The result categorization could decrease the responsiveness, because many results have to be downloaded before the context computation can be completed. To address this issue, we built the user interface so that the user sees the first ten results as soon as the underlying search engine returns them. Result retrieval continues in the background, and the keyword context index is displayed automatically when the computation finishes.
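The filtering step itself is straightforward. A minimal sketch of it follows, under our own assumptions about the data (each result is a title-snippet pair); a production version would reuse the non-exact word matching of the previous section instead of plain substring search.

    def filter_results(results, context):
        """Show only the result items that contain the selected context.

        results: list of (title, snippet) pairs in the search engine's order
        context: the selected keyword context, or None for "All results"
        """
        if context is None:          # the built-in "All results" item
            return results
        needle = context.lower()
        # The list comprehension preserves the engine's original ranking.
        return [(title, snippet) for title, snippet in results
                if needle in title.lower() or needle in snippet.lower()]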

This does not enhance the actual speed of the system, but the user can use the waiting time effectively, which makes the system appear faster.

4 EXPERIMENT

To evaluate the performance of the fKWIC algorithm and the user interface, we conducted an experiment. The experiment was designed to compare the new solution to a reference solution. As the reference user interface we used the ranked search result list that is popular in commercial Web search engines.

4.1 Participants

We had 36 participants in the study (28 male, 8 female). The participants were recruited from an introductory level human-computer interaction class where the students were required to participate in a study. The average age of the participants was 24.5 years (sd = 4.2), ranging from 20 to 35 years. All participants can be regarded as experienced computer users. They reported having used computers, on average, for 11.5 years (sd = 4.1) and the Web for 6.9 years (sd = 2.2). They reported engaging in computer, Web, and Web search engine use daily (mode).

4.2 Design

The user interface (UI) was the only independent variable in the experiment. It had two values: 1) keyword context UI and 2) reference UI. The value of the independent variable was manipulated within each participant. This makes the measurements related and, in consequence, the statistical analyses were done using repeated measures analysis tools. The order of presentation of the user interfaces was counterbalanced between the participants.

Because search tasks are contaminated upon execution (one task cannot be carried out twice by the same participant), there were two task sets. Because we did not want the task sets to have an impact on the results, we treated them as a variable whose use was counterbalanced along with the user interface variable. As dependent variables we measured search speed, accuracy, and subjective satisfaction. Speed measures were based on result selection times, and accuracy on judgments about the result items made by the experimenter. Subjective opinions were elicited with questionnaires.

4.3 Apparatus and Materials

The experiment setup was partly automated. We used our own special purpose software application that presented the tasks and collected an event log of the user interactions. During the experiment there were two windows on the computer screen (Figure 4, picture A). The task window was located on the left side, and it presented the tasks and questionnaires to the participants. It also controlled the search window (on the right) that displayed the tested user interface with the search results.

Figure 4. Computer desktop during the test (the gray window is the task window). Picture A shows the keyword context UI condition and picture B the reference UI condition.

The experiment software collected an event log containing time stamped data about the actions the participants carried out. Such actions included result selections, keyword context selections, result page selections, etc. Time and accuracy measurements are based on this log data.

We constructed two sets of nine search tasks based on our experiences from previous studies. Each task required participants to collect links to relevant result pages. The task descriptions contained at least two central concepts, as in "Find pictures (1) of Mount Pinatubo (2)". The two task sets were designed to be roughly equally demanding and to contain similar topics. Getting two task sets to be exactly equally demanding is almost impossible. As noted earlier, the need to do so was eliminated by counterbalancing. After the fact, we can see that task set 1 was easier in both conditions, which is a shortcoming in the setup. However, because the bias of the task sets was in the same direction for both conditions and the task sets were used equally often in them, the task set bias has no effect on our final results.

For each task there was a predefined query, and the participants were not allowed to formulate or reformulate the queries themselves (see examples in Table 1). This restriction was necessary to eliminate the variance caused by the different search skills of the participants. This increases the internal validity of the experiment by making the measurements more robust. In addition, the exclusion of the query formulation phase from the test does not compromise the external validity of the test too much, because our contribution (and the main interest in the study) is in the result evaluation phase. The same approach was employed previously, for the same reasons, by Dumais et al. (2001) and Käki and Aula (2004).

    Task                                                      Query
    Find information about the space shuttle                  challenger
    Challenger accident
    Find pictures of Mount Pinatubo                           pinatubo
    Find information about what the World Health              river blindness +who
    Organization (WHO) does to cure river blindness

Table 1. Example tasks and associated predefined queries used in the experiment.

The search results were accessed off-line during the experiment. We executed each predefined query prior to the experiment and saved the first 150 results on the local hard disk.


Automated Cognitive Walkthrough for the Web (AutoCWW) CHI 2002 Workshop: Automatically Evaluating the Usability of Web Sites Workshop Date: April 21-22, 2002 Automated Cognitive Walkthrough for the Web (AutoCWW) Position Paper by Marilyn Hughes Blackmon Marilyn

More information

Student Usability Project Recommendations Define Information Architecture for Library Technology

Student Usability Project Recommendations Define Information Architecture for Library Technology Student Usability Project Recommendations Define Information Architecture for Library Technology Erika Rogers, Director, Honors Program, California Polytechnic State University, San Luis Obispo, CA. erogers@calpoly.edu

More information

January- March,2016 ISSN NO

January- March,2016 ISSN NO USER INTERFACES FOR INFORMATION RETRIEVAL ON THE WWW: A PERSPECTIVE OF INDIAN WOMEN. Sunil Kumar Research Scholar Bhagwant University,Ajmer sunilvats1981@gmail.com Dr. S.B.L. Tripathy Abstract Information

More information

Supporting Exploratory Search Through User Modeling

Supporting Exploratory Search Through User Modeling Supporting Exploratory Search Through User Modeling Kumaripaba Athukorala, Antti Oulasvirta, Dorota Glowacka, Jilles Vreeken, Giulio Jacucci Helsinki Institute for Information Technology HIIT Department

More information

TOWARD ENABLING USERS TO VISUALLY EVALUATE

TOWARD ENABLING USERS TO VISUALLY EVALUATE Journal of Web Engineering, Vol. 3, No.3&4 (2004) 297-313 Rinton Press TOWARD ENABLING USERS TO VISUALLY EVALUATE THE EFFECTIVENESS OF DIFFERENT SEARCH METHODS ANSELM SPOERRI Rutgers University, New Brunswick

More information

A User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces

A User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces A user study on features supporting subjective relevance for information retrieval interfaces Lee, S.S., Theng, Y.L, Goh, H.L.D., & Foo, S. (2006). Proc. 9th International Conference of Asian Digital Libraries

More information

Evaluating User Behavior on Data Collections in a Digital Library

Evaluating User Behavior on Data Collections in a Digital Library Evaluating User Behavior on Data Collections in a Digital Library Michalis Sfakakis 1 and Sarantos Kapidakis 2 1 National Documentation Centre / National Hellenic Research Foundation 48 Vas. Constantinou,

More information

Enhancing E-Journal Access In A Digital Work Environment

Enhancing E-Journal Access In A Digital Work Environment Enhancing e-journal access in a digital work environment Foo, S. (2006). Singapore Journal of Library & Information Management, 34, 31-40. Enhancing E-Journal Access In A Digital Work Environment Schubert

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

Metadata for Digital Collections: A How-to-Do-It Manual

Metadata for Digital Collections: A How-to-Do-It Manual Chapter 4 Supplement Resource Content and Relationship Elements Questions for Review, Study, or Discussion 1. This chapter explores information and metadata elements having to do with what aspects of digital

More information

VISUAL RERANKING USING MULTIPLE SEARCH ENGINES

VISUAL RERANKING USING MULTIPLE SEARCH ENGINES VISUAL RERANKING USING MULTIPLE SEARCH ENGINES By Dennis Lim Thye Loon A REPORT SUBMITTED TO Universiti Tunku Abdul Rahman in partial fulfillment of the requirements for the degree of Faculty of Information

More information

Ovid Technologies, Inc. Databases

Ovid Technologies, Inc. Databases Physical Therapy Workshop. August 10, 2001, 10:00 a.m. 12:30 p.m. Guide No. 1. Search terms: Diabetes Mellitus and Skin. Ovid Technologies, Inc. Databases ACCESS TO THE OVID DATABASES You must first go

More information

Usability Test Report: Bento results interface 1

Usability Test Report: Bento results interface 1 Usability Test Report: Bento results interface 1 Summary Emily Daly and Ian Sloat conducted usability testing on the functionality of the Bento results interface. The test was conducted at the temporary

More information

Guide to SciVal Experts

Guide to SciVal Experts Guide to SciVal Experts Contents What is SciVal Experts and How Can I Benefit From It?....... 3 How is My Profile Created?... 4 The SciVal Experts Interface.... 5-6 Organization Home Page Unit Individual

More information

Information Retrieval CSCI

Information Retrieval CSCI Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1

More information

Federated Searching: User Perceptions, System Design, and Library Instruction

Federated Searching: User Perceptions, System Design, and Library Instruction Federated Searching: User Perceptions, System Design, and Library Instruction Rong Tang (Organizer & Presenter) Graduate School of Library and Information Science, Simmons College, 300 The Fenway, Boston,

More information

21. Search Models and UIs for IR

21. Search Models and UIs for IR 21. Search Models and UIs for IR INFO 202-10 November 2008 Bob Glushko Plan for Today's Lecture The "Classical" Model of Search and the "Classical" UI for IR Web-based Search Best practices for UIs in

More information

Repeat Visits to Vivisimo.com: Implications for Successive Web Searching

Repeat Visits to Vivisimo.com: Implications for Successive Web Searching Repeat Visits to Vivisimo.com: Implications for Successive Web Searching Bernard J. Jansen School of Information Sciences and Technology, The Pennsylvania State University, 329F IST Building, University

More information

Speed and Accuracy using Four Boolean Query Systems

Speed and Accuracy using Four Boolean Query Systems From:MAICS-99 Proceedings. Copyright 1999, AAAI (www.aaai.org). All rights reserved. Speed and Accuracy using Four Boolean Query Systems Michael Chui Computer Science Department and Cognitive Science Program

More information

Digital Forensic Text String Searching: Improving Information Retrieval Effectiveness by Thematically Clustering Search Results

Digital Forensic Text String Searching: Improving Information Retrieval Effectiveness by Thematically Clustering Search Results Digital Forensic Text String Searching: Improving Information Retrieval Effectiveness by Thematically Clustering Search Results DFRWS 2007 Department of Information Systems & Technology Management The

More information

Interaction Model to Predict Subjective Specificity of Search Results

Interaction Model to Predict Subjective Specificity of Search Results Interaction Model to Predict Subjective Specificity of Search Results Kumaripaba Athukorala, Antti Oulasvirta, Dorota Glowacka, Jilles Vreeken, Giulio Jacucci Helsinki Institute for Information Technology

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ - 1 - ITERATIVE SEARCHING IN AN ONLINE DATABASE Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ 07962-1910 ABSTRACT An experiment examined how people use

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

Subjective Relevance: Implications on Interface Design for Information Retrieval Systems

Subjective Relevance: Implications on Interface Design for Information Retrieval Systems Subjective : Implications on interface design for information retrieval systems Lee, S.S., Theng, Y.L, Goh, H.L.D., & Foo, S (2005). Proc. 8th International Conference of Asian Digital Libraries (ICADL2005),

More information

arxiv: v3 [cs.dl] 23 Sep 2017

arxiv: v3 [cs.dl] 23 Sep 2017 A Complete Year of User Retrieval Sessions in a Social Sciences Academic Search Engine Philipp Mayr, Ameni Kacem arxiv:1706.00816v3 [cs.dl] 23 Sep 2017 GESIS Leibniz Institute for the Social Sciences,

More information

ArticlesPlus Launch Survey

ArticlesPlus Launch Survey University of Michigan Deep Blue deepblue.lib.umich.edu 2011-07-25 ArticlesPlus Launch Survey Chapman, Suzanne http://hdl.handle.net/2027.42/106781 Project ArticlesPlus Launch Survey Report Info Report

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Cookies, fake news and single search boxes: the role of A&I services in a changing research landscape

Cookies, fake news and single search boxes: the role of A&I services in a changing research landscape IET White Paper Cookies, fake news and single search boxes: the role of A&I services in a changing research landscape November 2017 www.theiet.org/inspec 1 Introduction Searching for information on the

More information

Overview On Methods Of Searching The Web

Overview On Methods Of Searching The Web Overview On Methods Of Searching The Web Introduction World Wide Web (WWW) is the ultimate source of information. It has taken over the books, newspaper, and any other paper based material. It has become

More information

This session will provide an overview of the research resources and strategies that can be used when conducting business research.

This session will provide an overview of the research resources and strategies that can be used when conducting business research. Welcome! This session will provide an overview of the research resources and strategies that can be used when conducting business research. Many of these research tips will also be applicable to courses

More information

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University CS6200 Information Retrieval Jesse Anderton College of Computer and Information Science Northeastern University Major Contributors Gerard Salton! Vector Space Model Indexing Relevance Feedback SMART Karen

More information

Emerging Technologies in Knowledge Management By Ramana Rao, CTO of Inxight Software, Inc.

Emerging Technologies in Knowledge Management By Ramana Rao, CTO of Inxight Software, Inc. Emerging Technologies in Knowledge Management By Ramana Rao, CTO of Inxight Software, Inc. This paper provides an overview of a presentation at the Internet Librarian International conference in London

More information

turning data into dollars

turning data into dollars turning data into dollars Tom s Ten Data Tips November 2008 Neural Networks Neural Networks (NNs) are sometimes considered the epitome of data mining algorithms. Loosely modeled after the human brain (hence

More information

Overview of Web Mining Techniques and its Application towards Web

Overview of Web Mining Techniques and its Application towards Web Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous

More information

Mining User - Aware Rare Sequential Topic Pattern in Document Streams

Mining User - Aware Rare Sequential Topic Pattern in Document Streams Mining User - Aware Rare Sequential Topic Pattern in Document Streams A.Mary Assistant Professor, Department of Computer Science And Engineering Alpha College Of Engineering, Thirumazhisai, Tamil Nadu,

More information

CE4031 and CZ4031 Database System Principles

CE4031 and CZ4031 Database System Principles CE431 and CZ431 Database System Principles Course CE/CZ431 Course Database System Principles CE/CZ21 Algorithms; CZ27 Introduction to Databases CZ433 Advanced Data Management (not offered currently) Lectures

More information

Mathematics and Computing: Level 2 M253 Team working in distributed environments

Mathematics and Computing: Level 2 M253 Team working in distributed environments Mathematics and Computing: Level 2 M253 Team working in distributed environments SR M253 Resource Sheet Specifying requirements 1 Overview Having spent some time identifying the context and scope of our

More information

Query Modifications Patterns During Web Searching

Query Modifications Patterns During Web Searching Bernard J. Jansen The Pennsylvania State University jjansen@ist.psu.edu Query Modifications Patterns During Web Searching Amanda Spink Queensland University of Technology ah.spink@qut.edu.au Bhuva Narayan

More information

Towards Systematic Usability Verification

Towards Systematic Usability Verification Towards Systematic Usability Verification Max Möllers RWTH Aachen University 52056 Aachen, Germany max@cs.rwth-aachen.de Jonathan Diehl RWTH Aachen University 52056 Aachen, Germany diehl@cs.rwth-aachen.de

More information

An Intelligent Method for Searching Metadata Spaces

An Intelligent Method for Searching Metadata Spaces An Intelligent Method for Searching Metadata Spaces Introduction This paper proposes a manner by which databases containing IEEE P1484.12 Learning Object Metadata can be effectively searched. (The methods

More information

EVALUATION OF SEARCHER PERFORMANCE IN DIGITAL LIBRARIES

EVALUATION OF SEARCHER PERFORMANCE IN DIGITAL LIBRARIES DEFINING SEARCH SUCCESS: EVALUATION OF SEARCHER PERFORMANCE IN DIGITAL LIBRARIES by Barbara M. Wildemuth Associate Professor, School of Information and Library Science University of North Carolina at Chapel

More information

Guide to Google Analytics: Admin Settings. Campaigns - Written by Sarah Stemen Account Manager. 5 things you need to know hanapinmarketing.

Guide to Google Analytics: Admin Settings. Campaigns - Written by Sarah Stemen Account Manager. 5 things you need to know hanapinmarketing. Guide to Google Analytics: Google s Enhanced Admin Settings Written by Sarah Stemen Account Manager Campaigns - 5 things you need to know INTRODUCTION Google Analytics is vital to gaining business insights

More information

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE YING DING 1 Digital Enterprise Research Institute Leopold-Franzens Universität Innsbruck Austria DIETER FENSEL Digital Enterprise Research Institute National

More information

A New Measure of the Cluster Hypothesis

A New Measure of the Cluster Hypothesis A New Measure of the Cluster Hypothesis Mark D. Smucker 1 and James Allan 2 1 Department of Management Sciences University of Waterloo 2 Center for Intelligent Information Retrieval Department of Computer

More information

Finding Nutrition Information on the Web: Coverage vs. Authority

Finding Nutrition Information on the Web: Coverage vs. Authority Finding Nutrition Information on the Web: Coverage vs. Authority Susan G. Doran Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208.Sue_doran@yahoo.com Samuel

More information

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW 6 S FOR A COMPLETE MARKETING WORKFLOW 01 6 S FOR A COMPLETE MARKETING WORKFLOW FROM ALEXA DIFFICULTY DIFFICULTY MATRIX OVERLAP 6 S FOR A COMPLETE MARKETING WORKFLOW 02 INTRODUCTION Marketers use countless

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information

Introduction to and calibration of a conceptual LUTI model based on neural networks

Introduction to and calibration of a conceptual LUTI model based on neural networks Urban Transport 591 Introduction to and calibration of a conceptual LUTI model based on neural networks F. Tillema & M. F. A. M. van Maarseveen Centre for transport studies, Civil Engineering, University

More information

An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database

An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database Toru Fukumoto Canon Inc., JAPAN fukumoto.toru@canon.co.jp Abstract: A large number of digital images are stored on the

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

New Zealand information on the Internet: the power to find the knowledge

New Zealand information on the Internet: the power to find the knowledge New Zealand information on the Internet: the power to find the knowledge Smith, Alastair G. School of Information Management, Victoria University of Wellington, Wellington, New Zealand Paper for presentation

More information

Using the Web in Your Teaching

Using the Web in Your Teaching Using the Web in Your Teaching November 16, 2001 Dirk Morrison Extension Division, University of Saskatchewan Workshop Outline What will we cover? Why use the Web for teaching and learning? Planning to

More information

How to Conduct a Heuristic Evaluation

How to Conduct a Heuristic Evaluation Page 1 of 9 useit.com Papers and Essays Heuristic Evaluation How to conduct a heuristic evaluation How to Conduct a Heuristic Evaluation by Jakob Nielsen Heuristic evaluation (Nielsen and Molich, 1990;

More information

Writing for the web and SEO. University of Manchester Humanities T4 Guides Writing for the web and SEO Page 1

Writing for the web and SEO. University of Manchester Humanities T4 Guides Writing for the web and SEO Page 1 Writing for the web and SEO University of Manchester Humanities T4 Guides Writing for the web and SEO Page 1 Writing for the web and SEO Writing for the web and SEO... 2 Writing for the web... 3 Change

More information

Federated Search: Results Clustering. JR Jenkins, MLIS Group Product Manager Resource Discovery

Federated Search: Results Clustering. JR Jenkins, MLIS Group Product Manager Resource Discovery Federated Search: Results Clustering JR Jenkins, MLIS Group Product Manager Resource Discovery Why Federated Search? The Web has changed how we deliver and consume information The paradigm shift from physical

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied

Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied Information Processing and Management 43 (2007) 1044 1058 www.elsevier.com/locate/infoproman Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied Anselm Spoerri

More information

A Model for Interactive Web Information Retrieval

A Model for Interactive Web Information Retrieval A Model for Interactive Web Information Retrieval Orland Hoeber and Xue Dong Yang University of Regina, Regina, SK S4S 0A2, Canada {hoeber, yang}@uregina.ca Abstract. The interaction model supported by

More information

UX Research in the Product Lifecycle

UX Research in the Product Lifecycle UX Research in the Product Lifecycle I incorporate how users work into the product early, frequently and iteratively throughout the development lifecycle. This means selecting from a suite of methods and

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Customer Clustering using RFM analysis

Customer Clustering using RFM analysis Customer Clustering using RFM analysis VASILIS AGGELIS WINBANK PIRAEUS BANK Athens GREECE AggelisV@winbank.gr DIMITRIS CHRISTODOULAKIS Computer Engineering and Informatics Department University of Patras

More information

Image resizing and image quality

Image resizing and image quality Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 2001 Image resizing and image quality Michael Godlewski Follow this and additional works at: http://scholarworks.rit.edu/theses

More information

Identifying user behavior in domain-specific repositories

Identifying user behavior in domain-specific repositories Information Services & Use 34 (2014) 249 258 249 DOI 10.3233/ISU-140745 IOS Press Identifying user behavior in domain-specific repositories Wilko van Hoek, Wei Shen and Philipp Mayr GESIS Leibniz Institute

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites An Empirical Evaluation of User Interfaces for Topic Management of Web Sites Brian Amento AT&T Labs - Research 180 Park Avenue, P.O. Box 971 Florham Park, NJ 07932 USA brian@research.att.com ABSTRACT Topic

More information

Knowing something about how to create this optimization to harness the best benefits will definitely be advantageous.

Knowing something about how to create this optimization to harness the best benefits will definitely be advantageous. Blog Post Optimizer Contents Intro... 3 Page Rank Basics... 3 Using Articles And Blog Posts... 4 Using Backlinks... 4 Using Directories... 5 Using Social Media And Site Maps... 6 The Downfall Of Not Using

More information

Featured Archive. Saturday, February 28, :50:18 PM RSS. Home Interviews Reports Essays Upcoming Transcripts About Black and White Contact

Featured Archive. Saturday, February 28, :50:18 PM RSS. Home Interviews Reports Essays Upcoming Transcripts About Black and White Contact Saturday, February 28, 2009 03:50:18 PM To search, type and hit ente SEARCH RSS Home Interviews Reports Essays Upcoming Transcripts About Black and White Contact SUBSCRIBE TO OUR MAILING LIST First Name:

More information

Headings: Academic Libraries. Database Management. Database Searching. Electronic Information Resource Searching Evaluation. Web Portals.

Headings: Academic Libraries. Database Management. Database Searching. Electronic Information Resource Searching Evaluation. Web Portals. Erin R. Holmes. Reimagining the E-Research by Discipline Portal. A Master s Project for the M.S. in IS degree. April, 2014. 20 pages. Advisor: Emily King This project presents recommendations and wireframes

More information

Recording end-users security events: A step towards increasing usability

Recording end-users security events: A step towards increasing usability Section 1 Network Systems Engineering Recording end-users security events: A step towards increasing usability Abstract D.Chatziapostolou and S.M.Furnell Network Research Group, University of Plymouth,

More information

THEORY AND PRACTICE OF CLASSIFICATION

THEORY AND PRACTICE OF CLASSIFICATION THEORY AND PRACTICE OF CLASSIFICATION Ms. Patience Emefa Dzandza pedzandza@ug.edu.gh College of Education: School of Information and Communication Department of Information Studies ICT and Library Classification

More information

FACETs. Technical Report 05/19/2010

FACETs. Technical Report 05/19/2010 F3 FACETs Technical Report 05/19/2010 PROJECT OVERVIEW... 4 BASIC REQUIREMENTS... 4 CONSTRAINTS... 5 DEVELOPMENT PROCESS... 5 PLANNED/ACTUAL SCHEDULE... 6 SYSTEM DESIGN... 6 PRODUCT AND PROCESS METRICS...

More information

Human Performance on Clustering Web Pages: A Preliminary Study

Human Performance on Clustering Web Pages: A Preliminary Study Appears in the Fourth International Conference on Knowledge Discovery and Data Mining, 1998 (KDD-98). 1 Human Performance on Clustering Web Pages: A Preliminary Study Sofus A. Macskassy, Arunava Banerjee,

More information

EVALUATION OF THE USABILITY OF EDUCATIONAL WEB MEDIA: A CASE STUDY OF GROU.PS

EVALUATION OF THE USABILITY OF EDUCATIONAL WEB MEDIA: A CASE STUDY OF GROU.PS EVALUATION OF THE USABILITY OF EDUCATIONAL WEB MEDIA: A CASE STUDY OF GROU.PS Turgay Baş, Hakan Tüzün Hacettepe University (TURKEY) turgaybas@hacettepe.edu.tr, htuzun@hacettepe.edu.tr Abstract In this

More information

CS506/606 - Topics in Information Retrieval

CS506/606 - Topics in Information Retrieval CS506/606 - Topics in Information Retrieval Instructors: Class time: Steven Bedrick, Brian Roark, Emily Prud hommeaux Tu/Th 11:00 a.m. - 12:30 p.m. September 25 - December 6, 2012 Class location: WCC 403

More information

VIDEO SEARCHING AND BROWSING USING VIEWFINDER

VIDEO SEARCHING AND BROWSING USING VIEWFINDER VIDEO SEARCHING AND BROWSING USING VIEWFINDER By Dan E. Albertson Dr. Javed Mostafa John Fieber Ph. D. Student Associate Professor Ph. D. Candidate Information Science Information Science Information Science

More information

Network protocols and. network systems INTRODUCTION CHAPTER

Network protocols and. network systems INTRODUCTION CHAPTER CHAPTER Network protocols and 2 network systems INTRODUCTION The technical area of telecommunications and networking is a mature area of engineering that has experienced significant contributions for more

More information

COMMON ISSUES AFFECTING SECURITY USABILITY

COMMON ISSUES AFFECTING SECURITY USABILITY Evaluating the usability impacts of security interface adjustments in Word 2007 M. Helala 1, S.M.Furnell 1,2 and M.Papadaki 1 1 Centre for Information Security & Network Research, University of Plymouth,

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Narrowing It Down: Information Retrieval, Supporting Effective Visual Browsing, Semantic Networks

Narrowing It Down: Information Retrieval, Supporting Effective Visual Browsing, Semantic Networks Clarence Chan: clarence@cs.ubc.ca #63765051 CPSC 533 Proposal Memoplex++: An augmentation for Memoplex Browser Introduction Perusal of text documents and articles is a central process of research in many

More information

Lo-Fidelity Prototype Report

Lo-Fidelity Prototype Report Lo-Fidelity Prototype Report Introduction A room scheduling system, at the core, is very simple. However, features and expansions that make it more appealing to users greatly increase the possibility for

More information

A Task-Based Evaluation of an Aggregated Search Interface

A Task-Based Evaluation of an Aggregated Search Interface A Task-Based Evaluation of an Aggregated Search Interface No Author Given No Institute Given Abstract. This paper presents a user study that evaluated the effectiveness of an aggregated search interface

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

CE4031 and CZ4031 Database System Principles

CE4031 and CZ4031 Database System Principles CE4031 and CZ4031 Database System Principles Academic AY1819 Semester 1 CE/CZ4031 Database System Principles s CE/CZ2001 Algorithms; CZ2007 Introduction to Databases CZ4033 Advanced Data Management (not

More information

Chapter 9. Software Testing

Chapter 9. Software Testing Chapter 9. Software Testing Table of Contents Objectives... 1 Introduction to software testing... 1 The testers... 2 The developers... 2 An independent testing team... 2 The customer... 2 Principles of

More information

How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments?

How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments? How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments? A. Hossein Farajpahlou Professor, Dept. Lib. and Info. Sci., Shahid Chamran

More information