Using Text Elements by Context to Display Search Results in Information Retrieval Systems: Model and Research Results

Offer Drori
SHAAM Information Systems
The Hebrew University of Jerusalem
offerd@ {shaam.gov.il, cs.huji.ac.il}
Tel. +972-2-5688439 Fax +972-2-5688681

Abstract

Information retrieval systems display search results by various methods. This paper presents a model for displaying a list of search results by means of textual elements, built around a new information unit that replaces the information unit currently in use. The paper also includes a short description of several studies that support the model.

1. Introduction

Because of the growth in the number and scope of global databases, locating information requires a special approach from the perspective of the user interface. The Internet, as it exists today, is an outstanding example of a broad-based, unfocused database. Most Internet search engines display their results as a serially ordered list (with a partial attempt at ranking the results). In most cases, this list includes the document title, the URL and, at times, the first few lines of the document. The information, as currently displayed to the user, is incomplete and insufficiently focused on the search query. The user therefore has to actually read the documents in the list, without being able to discriminate among them beforehand. With today's search engines, most search transactions yield a list of hundreds and even thousands of documents, while studies show that the average user only looks at the first 1-2 results (Kirsch 1998). Finding a solution to this paradox presents a serious challenge to researchers in the field. This paper suggests a way to locate the relevant document without having to read the listed documents.

2. A model for displaying textual search results

To deal with the challenge presented above, this section defines a hierarchical structure containing three levels for displaying search results (see Figure No. 1).
Search results from textual databases can be displayed by relying on two basic principles: visualization of the results, and the use of textual components to design the list of results. This article focuses solely on the use of textual components to display search results, where the textual components fall into two categories: internal document information and external document information.

2.1 Results based on internal document information

In this category, a number of techniques are used, most of which include information components related to the search topic. A description of the various methods follows.

Significant sentences
Significant sentences can be descriptive sentences taken from defined parts of the document, for example the Abstract, Introduction, or Conclusion. Alternatively, sentences relevant to the search query can be used, that is, sentences that include the terms that caused the document to be chosen (Luhn 1960), (Drori 1998).

Significant words
Significant words in the document are intrinsic descriptives, such as keywords or frequently repeated words. Keywords are determined by the document's author, or they can be produced automatically. Frequently repeated words that are computer generated (including a stop-list operation) can yield results that are similar but less exact (Baldonado & Winograd 1997).

Information from HTML tags
The language tags can provide information about the document. For example, paragraph or subtitle headings can be located by using the <H> tags, and can even be used to generate a table of contents. <META> tags contain information about the document as recorded by the document author, such as an abstract, keywords, and others. A certain amount of noise must be taken into account with these tags because of commercial rating considerations. The following studies utilized tables of contents: Egan et al 1989, Chimera & Shneiderman 1994, and Hertzum & Frokjaer 1996.

Additional information
Additional information can be generated from within the actual document. For example, when a document includes citations from other documents, the titles of the cited documents can be used, assuming that they have a subject in common (Pitkow & Pirolli 1997).

Displaying Search Results
    Textual Techniques
        Internal Document Information
            Significant Sentences
            Significant Words
            Information from HTML Tags
            Additional Information
        External Document Information
            Document Classification
            Cited Documents
            Information from the Data Base
    Graphic Techniques

Figure No. 1 - Model for displaying search results

2.2 Results based on external document information

This category utilizes a number of methods whose information components are based on the document's subject field and are not contained within the actual document. A description of these methods follows.
Document classification
This method displays the category with which the document is associated. Search engines that manually define document categories (such as Yahoo) can be used for this purpose. It is also possible to create categories with the aid of computerized algorithms, and the subject association of the document can be established by clustering all the search results (Allen et al 1993), (Zamir & Etzioni 1999).
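As a rough illustration of how a category label might be derived automatically from the result list itself, the following Python sketch groups result titles under the term they share most widely. It is a deliberately naive stand-in and not the clustering algorithm of any of the cited studies; the function names, the small stop list, and the rule of ignoring the query terms themselves are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, minimal stop list; a real system would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "on", "the", "to"})


def tokens(text):
    """Lower-case alphabetic words of `text`, without stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def group_by_shared_term(titles, query_terms=()):
    """Group titles under the term that the largest number of titles share.

    Query terms are ignored, since every result is expected to contain them.
    Titles that share no term with any other title go to the group "other".
    """
    skip = {t.lower() for t in query_terms}

    # Count in how many titles each term occurs (document frequency).
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(tokens(title)) - skip)

    groups = defaultdict(list)
    for title in titles:
        # Most widely shared term first; alphabetical order breaks ties.
        candidates = sorted(set(tokens(title)) - skip,
                            key=lambda w: (-doc_freq[w], w))
        if candidates and doc_freq[candidates[0]] > 1:
            label = candidates[0]
        else:
            label = "other"
        groups[label].append(title)
    return dict(groups)
```

For a query such as "jaguar", titles about cars and titles about rainforests would end up under the labels "car" and "rainforest", which could then be shown next to each result as its category.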

Citing documents
This refers to a situation in which one document cites another, where both have a subject in common. The citing documents can be located directly via the Internet, or by using a subject-oriented database such as the Science Citation Index. When the citing documents are located, either their titles or, alternatively, their cited paragraphs can be used (Amento et al 1999).

Information from the database
The database in which the document is located can provide an indication of the document's subject in a number of ways. Subject-oriented databases usually specify the database subject field. An attempt can also be made to determine the database subject field from the titles of additional documents contained in the database.

3. Research into displaying search results

The research objective was to locate those information components in the search result display that are most relevant to the user, in order to make the task of locating information both more efficient and more effective. The research questions were:
1. What are the most important information details for display in the search results?
2. When comparing the various methods of displaying search result information, which method is preferable in terms of accomplishing the user's task?

The research agenda included the performance of various search tasks by a number of user groups. The tasks were carried out using various interfaces. For research purposes, a database was created that included response documents (in English) for defined search queries. In all the studies, 3-4 different interfaces were used. The effectiveness of the search and user satisfaction were checked along two dimensions:
Objective data: response time and accuracy.
Subjective data: convenience, sense of confidence, satisfaction, and the relevancy of information components.
The participants in the research came from several groups and included students from the School of Business Administration (MBA), technical support personnel from the computer field, and information specialists and librarians from the information field. Statistical analyses included the ANOVA test for examining the significance of the differences between the methods, the P value for determining the significance of the results and, naturally, standard statistical analyses of averages, standard deviations, etc.

In the initial study, 128 participants worked with 3 different interfaces. The subject examined was the contribution made by displaying lines of information from the document in addition to the title. The interfaces were:
T - titles only;
TFL - titles + first lines from the document head (refers to internal document information/significant sentences/descriptive sentences of the model in Figure 1);
TLC - titles + lines by search context (refers to internal document information/significant sentences/sentences relevant to the search query of the model).
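The difference between the TFL and TLC displays can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the code used in the studies: the function names, the whole-word, case-insensitive matching, and the limit of two lines per result are all hypothetical choices made for this example.

```python
import re


def first_lines(lines, count=2):
    """TFL-style snippet: the first `count` non-empty lines of the document."""
    return [ln for ln in lines if ln.strip()][:count]


def lines_in_context(lines, query_terms, count=2):
    """TLC-style snippet: the first `count` lines containing a query term."""
    if not query_terms:
        return []
    # Match any query term as a whole word, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in query_terms) + r")\b",
        re.IGNORECASE,
    )
    return [ln for ln in lines if pattern.search(ln)][:count]


if __name__ == "__main__":
    document = [
        "Annual report of the city library.",
        "Opening hours changed in May.",
        "The solar panels on the roof cut energy costs.",
        "Staff numbers are stable.",
        "A second solar array is planned.",
    ]
    print(first_lines(document))                  # lines from the document head
    print(lines_in_context(document, ["solar"]))  # lines that mention the query
```

For a query on "solar", the first-lines snippet says nothing about the query, while the lines-in-context snippet shows the two sentences that caused the document to be retrieved.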

[Figure: Research 1 - The difference between the methods. Two charts comparing the T, TFL and TLC methods: subjective ratings (Comfort, Confidence, Relevancy) and search time (sec.) for difficult and simple tasks.]

A significant difference was found between the methods (P<.1). The TLC method (displaying lines by search context) was preferable in all aspects of the subjective dimension (search convenience, feeling of confidence during use, and relevancy of information). For the objective dimension of search time, the TLC method had an advantage (31%) in the case of complicated tasks, while the other methods had an advantage when handling simple or moderately complex tasks.

[Figure: Research 1 - Snapshot of the TLC interface (titles in blue, lines in context in black, search terms in blue).]

In the second study, 51 participants worked with 3 different interfaces. The subject checked was the contribution made by displaying keywords in addition to the information items that were displayed in the first study. Keywords refer to internal document information/significant words/intrinsic descriptives of the model in Figure 1. The interfaces were:
TK - titles + keywords;
TFLK - titles + first lines from the document head + keywords;
TLCK - titles + lines by search context + keywords.

[Figure: Research 2 - Snapshot of the TLCK interface (titles in blue, keywords in green, lines in context in black, search terms in red).]

A significant difference was found between the methods (P<.1). The TLCK method (lines by search context + keywords) was preferable in the subjective dimension. The TLCK method also had an advantage (33%) over the other methods in search times for moderate and difficult tasks.

[Figure: Research 2 - The differences between the methods. Two charts comparing the TK, TFLK and TLCK methods: subjective ratings (Comfort, Confidence, Relevancy) and search time (sec.) for difficult and simple tasks.]

In the third study, 75 participants worked with 4 interfaces. The subject checked was the contribution made by displaying the document category in addition to lines from the document (category refers to external document information/document classification of the model in Figure 1). The interfaces were:
TFL - titles + first lines;
TFLC - titles + first lines + categories;
TLC - titles + lines by search context;
TLCC - titles + lines by search context + categories.

[Figure: Research 3 - The differences between the methods. Two charts comparing the TFL, TFLC, TLC and TLCC methods: subjective ratings (Comfort, Confidence, Relevancy) and search time (sec.) for difficult and simple tasks.]

A significant difference was found between the methods (P<.1). The TLCC method had an advantage in the subjective dimension and also in search times (67%) across all task difficulty levels.

In this study, we also examined which search results display parameters are important to the user. The findings showed that confidence in the answer is the most important parameter (78%), followed by search time (73%) and then the ability to find the answer without reading all the documents (54%). User convenience was found to be a less important parameter for the search process (44%).

[Figure: Research 3 - Snapshot of the TLCC interface (titles in blue, categories in red, lines in context in black).]

In the fourth study, 61 participants worked with 4 interfaces. The subject checked was the contribution made by displaying the document category, the document address, common words, or the organization that published the paper, in addition to lines from the document. In terms of the model in Figure 1, category refers to external document information/document classification; document address refers to external document information/information from the database; common words refer to internal document information/significant words; and the organization that published the paper refers to external document information/information from the database. The interfaces were:
TLCC - titles + lines by search context + categories;
TLCA - titles + lines by search context + Internet address;
TLCCW - titles + lines by search context + common words;
TLCO - titles + lines by search context + organization name.
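The common words shown in the TLCCW interface could be produced in several ways, and the paper does not specify the exact procedure. As a rough sketch, assume they are simply the most frequent words that remain after a small stop list has been removed. The stop list, the function name, and the choice of three words per document below are illustrative assumptions only.

```python
import re
from collections import Counter

# Hypothetical, minimal stop list; a production list would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "are", "in", "is", "of", "the", "to"})


def common_words(text, top_n=3):
    """Return the `top_n` most frequent non-stop words in `text`.

    Ties are broken alphabetically so that the output is deterministic.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


if __name__ == "__main__":
    sample = ("The library lends books. The library also lends music, "
              "and the books are free.")
    print(common_words(sample))
```

Words produced this way are cheap to compute for every document in a result list, which is one reason frequency counts are a common first approximation of author-assigned keywords.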

[Figure: Research 4 - The differences between the methods. Two charts comparing the TLCC, TLCA, TLCCW and TLCO methods: subjective ratings in % (Comfort, Confidence, Relevancy) and search time (min.) for difficult and simple tasks.]

A significant difference was found between some of the methods (P<.1). The TLCCW method had an advantage in the subjective dimension and also in search times across all task difficulty levels.

In this study, we also examined which search results display parameters are important to the user. The findings showed that the ability to find the answer without reading all the documents is the most important parameter (87%), followed by confidence (77%) and then search time (67%). User convenience was found to be a less important parameter for the search process (51%).

[Figure: Research 4 - Snapshot of the TLCCW interface (titles in blue, common words in red, lines in context in black, search terms in bold).]

4. Conclusion and findings

The objective of the studies was to examine some of the components of the search results display model. The studies that were performed enable the definition of a new information unit that can replace the unit currently used. We found that, in addition to the title, the alternative information unit must include lines by search context, keywords, and an indication of the document category. The document category can be conveyed by common words. Authors of articles and database administrators can benefit by including the suggested information components in each document using standardized means (such as XML).

An interesting finding was revealed in the feedback on the parameters that the users considered important when using data retrieval systems. The feeling of confidence when using a system is perceived as having a higher priority than speed and than locating the answer without having to read the list of documents. On the other hand, users assigned ease of use a low priority.

A planned follow-up study will include an in-depth evaluation of additional information components of the model.

Acknowledgements

I wish to thank Eliezer Lozinski for his excellent suggestions in the course of the research. My thanks also to Nir Alon and Aliza Weisberg for their assistance in gathering the data in Research 3, and to Liora Halevi and Yifat Betser-Nahum for their assistance in gathering the data in Research 4.

References

1. Allen, R., Obry, P., Littman, M., An interface for navigating clustered document sets returned by queries, Proceedings of the ACM Conference on Organizational Computing Systems (COOCS93), 1993, 166-171.
2. Amento, B., Hill, W., Terveen, L., Hix, D., Ju, P., An Empirical Evaluation of User Interfaces for Topic Management of Web Sites, Proceedings of CHI '99, ACM Press, Pittsburgh PA, May 1999, 552-559.
3. Baldonado, M., Winograd, T., SenseMaker: an information-exploration interface supporting the contextual evolution of a user's interests, Proceedings of CHI '97, 1997, 11-18.
4. Chimera, R., Shneiderman, B., An Exploratory Evaluation of Three Interfaces for Browsing Large Hierarchical Tables of Contents, ACM Transactions on Information Systems, New York: ACM, 12 (4), October 1994, 383-406.
5. Drori, O., The User Interface in Text Retrieval Systems, SIGCHI Bulletin, New York: ACM, 30 (3), July 1998, 26-29.
6. Egan, D., Remde, J., Landauer, T., Lochbaum, C., Gomez, L., Behavioral Evaluation and Analysis of a Hypertext Browser, Proceedings of CHI '89, New York: ACM, 1989, 205-210.
7. Hertzum, M., Frokjaer, E., Browsing and Querying in Online Documentation: A Study of User Interface and the Interaction Process, ACM Transactions on Computer-Human Interaction, New York: ACM, 3 (2), June 1996, 136-161.
8. Kirsch, S., Infoseek's experiences searching the Internet, SIGIR Forum, New York: ACM, Vol. 32, 1998, Num. 2, 3.
9. Luhn, H., Keyword in Context Index for Technical Literature, American Documentation, XI (4), 1960, 288-295.
10. Pitkow, J., Pirolli, P., Life, Death, and Lawfulness on the Electronic Frontier, CHI '97 Proceedings, New York: ACM, April 1996, 213-22.
11. Zamir, O., Etzioni, O., Grouper: A Dynamic Clustering Interface to Web Search Results, WWW8 Proceedings, Toronto: WWW, 1999.