
Using Implicit Relevance Feedback in a Web Search Assistant

Maria Fasli and Udo Kruschwitz
Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, United Kingdom
{mfasli|udo}@essex.ac.uk

Abstract. The explosive growth of information on the World Wide Web demands effective intelligent search and filtering methods. Consequently, techniques have been developed that extract conceptual information from documents to build domain models automatically. The model we build is a taxonomy of conceptual terms that is used in a search assistant to help the user navigate to the right set of required documents. We monitor the dialogue steps performed by users to get feedback about the quality of choices proposed by the system and to adjust the model without manual intervention. Thus, we employ implicit relevance feedback to improve the domain model. Unlike in traditional relevance feedback and collaborative filtering tasks, we do not need explicitly expressed user opinions. Moreover, we aim at improving the domain model as a whole rather than trying to build individual user profiles.

1 Introduction

In recent years there has been an explosive growth of the sheer volume of information available on the World Wide Web. This information is free and fairly unstructured. Search engines employing standard information retrieval techniques can help to get to some particular piece of information quickly. However, a common phenomenon is that users find it difficult to express their actual information need as a query. Smaller domains like local Web sites face the same problems. For example, a query frequently found in the log files of our sample domain, the University of Essex Web site, is "languages". Someone submitting this request might have a clear idea about what sort of documents should be retrieved by the search engine, e.g. information about the Modern Languages Unit (which is the best match Google¹ could find in our domain).
But there are far more than 1,000 documents which contain the query term, despite the fact that the domain consists of fewer than 30,000 indexable pages. Other top ranked documents retrieved by Google contain information about natural, controlled, and Pidgin languages. In addition to that, there is a large number of documents related to various types of computer languages like Java.

¹ http://www.google.com

One way to help the user get to the best matching documents is to apply some automatically acquired representation of the actual data sources (a "domain model"), something that is feasible for limited domains. We build such a domain model by exploiting markup found in the documents. The result is a set of hierarchies of related terms. These relations are used to initiate simple dialogue steps by displaying candidate terms for query refinement alongside the most highly ranked documents retrieved for a user query. The user's choice to pick a query refinement term proposed by the dialogue system or to select some option considered relevant can be interpreted as implicit relevance feedback. We propose to learn from one user in order to help the next user with a similar request, as in collaborative filtering. But, unlike in classical collaborative filtering, we do not distinguish a number of user groups. We basically have one large group of users, those who submit queries to the search engine of the particular site. Thus, we aim at improving the domain model of that site rather than user profiles.

2 Related Work

Relevance feedback is a method used to enhance information retrieval results [8, 2]. A user initially submits a query, and the system returns a small number of documents. The user then indicates which of the returned documents are relevant to the query. However, judging the relevance of documents may become time consuming and users would prefer another solution. By observing the users' actions rather than expecting explicit user feedback on results, we introduce the idea of implicit relevance feedback. Actions the user performs, in our case dialogue steps, are judged to be relevant; everything else is judged as irrelevant. Our solution can be seen as a particular application of collaborative filtering. Collaborative filtering is based on identifying the opinions and preferences of similar users in order to predict the preferences and to recommend items to others.
These techniques are used in a variety of recommender systems ranging from recommending news (e.g. GroupLens [7]) to recommending movies (e.g. Video Recommender [4]).

The Community Search Assistant as described in [3] is a software agent which can be used to augment any kind of search engine. The agent works in parallel with the search engine itself and builds a graph of related queries which can be included in addition to the engine's results. The user can then traverse the graph of related queries in an ordered way. Determining relatedness of documents depends on the documents returned by the various queries and not on the terms used for the queries themselves. Furthermore, the use of the search assistant agent enables a form of collaborative search by allowing the users to draw on the knowledge base of queries submitted by others.

Internet search engines have also started incorporating simple collaborative filtering techniques in order to improve search. Such efforts include the popularity engine built by DirectHit², which operates using a simple voting mechanism. The popularity engine works by simply tracking the queries input by users and the links that the users follow. Users vote by following a link, and therefore a search in such a search engine will return the most popular results for that query.

² http://www.directhit.com

3 Improving the Domain Model

The search system we apply relies on a sophisticated indexing process that extracts a taxonomy of related concepts from the raw documents. The indexing process distinguishes whether an index term extracted from a document is conceptual information by evaluating the number and nature of various markup environments it is found in. Co-occurrence of different conceptual index terms in the same document defines a notion of related concepts. This was explained in detail in [6].

This taxonomy is mainly used in a query refinement task, i.e. if the user query returns a large number of matching documents. In that case the dialogue component determines a set of conceptual terms related to the query. Those terms are selected based on their ability to describe only a subset of documents defined by the original user query. The user is asked to choose one. To use the introductory example, a query for "languages" would trigger the dialogue system to offer the following conceptual terms as possible constraints: second language, language department, idl, linguistic, spanish, java etc.

The strategy applied to determine good discriminating terms (like java in the above example) is to check all concepts related to any of the input terms. This is computationally fairly cheap since there are far fewer concepts than keywords, and much of the calculation can be performed offline [5]. The three important factors in deciding whether a term is a good discriminator are: (1) the number of related concepts, (2) the frequency of each of those concepts, and (3) the weights of each of the related concept relations. The frequency of a concept is initially determined by the number of documents for which it was selected as a conceptual index term as opposed to just a normal keyword index.
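
To make the selection step above concrete, the following is a minimal sketch in Python, not the authors' implementation. The `Concept` structure, the `candidate_refinements` function, the `max_frequency` cut-off, and every frequency and weight in the toy taxonomy are hypothetical. The text does not state how the three factors are combined, so the sketch simply assumes that overly frequent concepts are filtered out and that the remaining related concepts are ranked by relation weight.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    """One conceptual index term in the domain model (hypothetical structure)."""
    name: str
    frequency: int  # documents for which the term was selected as a concept
    related: Dict[str, float] = field(default_factory=dict)  # related concept -> weight


def candidate_refinements(taxonomy: Dict[str, Concept],
                          query_terms: List[str],
                          max_frequency: int,
                          top_n: int = 3) -> List[str]:
    """Rank concepts related to any query term as refinement candidates.

    Assumption: concepts above max_frequency are skipped, the rest are
    ordered by their strongest relation weight to a query term.
    """
    scores: Dict[str, float] = {}
    for term in query_terms:
        concept = taxonomy.get(term)
        if concept is None:
            continue  # plain keyword, not a concept in the model
        for name, weight in concept.related.items():
            target = taxonomy.get(name)
            if name in query_terms or target is None:
                continue
            if target.frequency > max_frequency:
                continue  # too frequent to discriminate well
            scores[name] = max(scores.get(name, 0.0), weight)
    # Highest weight first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda n: (-scores[n], n))[:top_n]


# Toy taxonomy with invented frequencies and weights.
taxonomy = {
    "language": Concept("language", 1200, {
        "second_language": 0.4,
        "language_department": 0.3,
        "idl": 0.2,
        "english_language": 0.1,
    }),
    "second_language": Concept("second_language", 40),
    "language_department": Concept("language_department", 25),
    "idl": Concept("idl", 15),
    "english_language": Concept("english_language", 500),
}

print(candidate_refinements(taxonomy, ["language"], max_frequency=100))
```

With these invented numbers, english_language is dropped by the frequency cut-off and the remaining three concepts are returned in order of weight. Because only concepts are inspected, never the full keyword index, the loop stays small, which matches the observation that the check is computationally cheap.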

For every concept in the taxonomy, the weights associated with each of its identified related concepts are initially equal and sum to 1. These weights change if: (1) a concept is offered and selected by the user (increase), or (2) a concept is offered and not selected (decrease). This will only change weights of relations already in place. The result is that the good parts of the taxonomy will gain importance, while the rest will become less and less relevant. But that does not allow the creation of new links overlooked in the automatic construction of the model. We are currently experimenting with that. For example, a user may decide not to choose any of the offered terms but instead enter "query languages". This will implicitly introduce a new pair of related concepts which may become more important over time. Since we keep track of the dialogue history, we only increase weights associated with the links between any new input and the most recent input. That ensures that we do not run into computational explosion.

The document ranking function we implemented is essentially based on the vector space model. In addition to that, different weights are given to index terms found in particular markup contexts (e.g. keywords in titles are more important than in free text). This is not new. Search engines like Google use similar ranking functions [1]. However, our function goes beyond that in a number of ways. First of all, conceptual terms which were extracted during indexing are of higher weight than other terms. Moreover, every term has a weight which increases with the relative frequency of this term in the pool of all queries submitted to the search system so far. Finally, every concept term has a weight increasing with the frequency of this term being selected in a dialogue step within the collection of all options offered by the system so far. None of the weights in the ranking function has a particularly strong impact on the overall weight of a document.

Finally, a word about the heterogeneous nature of our methods, which allow explicit relevance feedback. If documents are displayed, they come with a box next to them, where a user can judge a document to be relevant or not. Since we keep track of the dialogue history, we can again adjust the weights accordingly. This is not implemented yet, but fits into the framework since it is just another parameter in the equation.

4 Example

[Figure 1: concept tree rooted at language; node labels: second_language, research_teaching, sla, vivian_cook, language_department, idl, language_processing, linguistic_university, mphil, corba, computer_network, odgm]
Fig. 1. Partial concept tree for example query "languages"

In the example we reduce words to their base forms but apply no stemming. We use the introductory query ("languages"). For the calculation of index terms related to the query term we apply some fairly strict thresholds, i.e. frequent terms are not considered by the system. This is the reason why the compound term english language does not seem to be related to language. Our experience shows that better discriminating terms can be found by applying stricter thresholds.

Figure 1 displays part of the originally constructed hierarchy for the conceptual term language. Only the three most relevant related concept terms are presented on the top two levels. It must be interpreted as follows: the system determined the most important concepts that would constrain the original query in order to get to a smaller set of relevant documents. If the user decides to choose second language, then the new query to be evaluated against the database would contain languages as well as second language as query terms. Again a large number of matching documents exists for this new query, and one option would be to select a new term offered by the search system, e.g. sla (which stands for 'second language acquisition'). Alternatively, the user could ignore the proposed options completely and enter some input like "english" to continue. The order in which the terms are presented to the user represents their relative importance with respect to the current query applied to the domain model.

Following a trial period, the example structure in Figure 1 has changed significantly. Apparently, users querying our system for "languages" were mainly interested in the linguistic sense of the query term. The relation between language and idl ('interface definition language') has disappeared from the list of most relevant related concepts. The fact that efl ('English as a foreign language') has become the most relevant potential refinement term for the "languages" query does not reflect a new relation between the terms language and efl but an increased importance of a relation which existed before but initially had a much lower weight assigned to it. These changes reflect that only observing real users' behaviour can help to arrive at a more appropriate domain model.

References

1. Brin, S., and Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International World Wide Web Conference (WWW7) (Brisbane, 1998).
2. Buckley, C., Salton, G., and Allan, J.
   The effect of adding relevance information in a relevance feedback environment. In Proceedings of the 17th Annual International ACM SIGIR Conference (1994), pp. 292–301.
3. Glance, N. Community Search Assistant. In Proceedings of the AAAI-2000 Workshop on Artificial Intelligence for Web Search (Austin, TX, 2000), Technical Report WS-00-01, AAAI Press.
4. Hill, W., Stead, L., Rosenstein, M., and Furnas, G. Recommending and evaluating choices in a virtual community of use. In Proceedings of CHI'95 (New York, 1995), ACM.
5. Kruschwitz, U. A Rapidly Acquired Domain Model Derived from Markup Structure. In ESSLLI Workshop on Semantic Knowledge Acquisition and Categorisation (Helsinki, 2001). To appear.
6. Kruschwitz, U. Exploiting Structure for Intelligent Web Search. In Proceedings of the 34th Hawaii International Conference on System Sciences (HICSS) (Maui, Hawaii, 2001), IEEE.
7. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of ACM CSCW'94 (1994), pp. 175–186.
8. van Rijsbergen, C. J. Information Retrieval. Butterworths, 1979.