Feature-Guided Automated Collaborative Filtering

Yezdi Lashkari

Abstract

Information filtering systems have traditionally relied on some form of content analysis of documents to represent a profile of user interests. Such content filtering is generally ineffective in domains with diverse media types such as audio, video, and images, because machine analysis of such media is hard. Recently, information filtering systems relying primarily on human evaluations of documents have been built. Such automated collaborative filtering systems work by discovering correlations in evaluations of documents amongst users, and by using these correlations to recommend new documents. However, such systems rely on the implicit assumption that all the features of a document are equally important to a user's evaluation of that document. This assumption breaks down in broad domains (such as all documents in the World Wide Web), where users correlate well only for some features of a document they evaluated similarly. This thesis claims that combining easily extractable features of documents with subjective human evaluations for automated collaborative filtering is a powerful information filtering technique for complex information spaces. To verify this claim I propose building an information filtering system for the World Wide Web that relies primarily on a combination of simple feature extraction and human evaluations of documents to make effective personalized recommendations of documents to users.

1 Introduction

1.1 Motivation

Automated Collaborative Filtering (ACF) [1] (see footnote 1) is a technique for locating items of potential interest to users in almost any domain using human evaluations of items. It relies on a deceptively simple idea: if person A correlates strongly with person B in rating a set of items, then it is possible to predict the rating of a new item for A, given B's rating for that item. Since ACF does not rely on computer analysis of items, it is especially useful for domains where it is too hard (or too computationally expensive) to analyze items by computer, such as information spaces containing images, movies, audio, text, etc.

The World Wide Web (WWW) is such a domain. The exponential growth of the web has exacerbated the problem of personal information overload faced by most networked computer users. While almost any information a user may wish to find probably exists somewhere on the web, most users find it almost impossible to locate such information on their own, or to keep track of new, related documents.

Most current solutions to this problem attempt to create some form of index of web documents, which users may then query. Such solutions have numerous drawbacks from the viewpoint of an ordinary user. The basic assumption behind all indexing schemes is that users will somehow learn of the existence of the index and can then query it effectively. With the growing number of indices, it is no longer possible, even for expert users, to keep track of all the useful indices. Further, web indices vary widely in quality, indexing mechanisms, and coverage. Hence, simply locating an index isn't enough: a user must know the correct set of keywords to locate relevant documents from that index (see footnote 2). Furthermore, most web indices do not attempt to index non-text web documents.

Footnote 1: Also referred to as Social Information Filtering.
Footnote 2: This set may differ from index to index, depending on what information was used to construct the index.
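The core ACF idea stated at the start of this section (predict A's rating of a new item from the ratings of users who correlate with A) can be illustrated with a minimal sketch. The formulation below, Pearson correlation combined with a correlation-weighted average of deviations from each user's mean rating, is one common choice and is an assumption here; the proposal does not commit to a specific formula. The names `pearson`, `predict`, and the `ratings` layout (user -> {item: rating}) are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two users over the items both have rated."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0  # too little overlap to say anything
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0  # a user who rates everything the same carries no signal
    return cov / sqrt(var_a * var_b)


def predict(ratings, user, item):
    """Predict `user`'s rating of `item` from correlated users' ratings.

    Assumes `user` has rated at least one item. Falls back to the user's
    own mean rating when no correlated user has rated `item`.
    """
    mine = ratings[user]
    my_mean = sum(mine.values()) / len(mine)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(mine, theirs)
        their_mean = sum(theirs.values()) / len(theirs)
        num += w * (theirs[item] - their_mean)
        den += abs(w)
    return my_mean if den == 0 else my_mean + num / den


# Hypothetical data: A and B agree on d1..d3, and only B has rated d4.
ratings = {
    "A": {"d1": 5, "d2": 1, "d3": 4},
    "B": {"d1": 5, "d2": 1, "d3": 4, "d4": 5},
}
print(predict(ratings, "A", "d4"))
```

Because B rated d4 above B's own average and A correlates positively with B, the predicted rating for A lands above A's average.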

Several characteristics of the web suggest ACF as a potential solution to the information overload problem. By leveraging the opinions of millions of users who browse different parts of the web daily, ACF avoids the problem of computer analysis of a continuously growing set of documents in multiple rich media formats. By using correlations amongst human evaluations, ACF-based recommendations of a document contain an implicit (subjective) evaluation of its quality as perceived by a user, an extremely valuable notion in a domain containing thousands of documents that are related in content but vary widely in quality. Another potential benefit of applying ACF to the web is the automatic identification of communities of users with similar interests, who may currently be unaware of each other's existence. Another fascinating possibility is applying ACF across different domains: for example, if an ACF system knows that users A and B correlate in their music interests, it may try recommending movies to A based on B's movie evaluations.

To be applied effectively, ACF assumes that the domain of items is sufficiently narrow (for example, only documents about information filtering). When the domain is broad, an item may be given similar ratings by two users because the users correlate in their evaluations of some specific features of the item, rather than for every feature of the item (see footnote 3). In other words, two users may both give a high evaluation to a particular document for completely different reasons: one because it was authored by Marvin Minsky and Seymour Papert, the other because it is about neural networks, Marvin Minsky is one of the authors, and it is published by MIT Press. Such simple feature information is available for most domains, and could be used to make much better recommendations. In the example above, the ACF system could determine that the two users correlate only on documents for which Marvin Minsky is an author.
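The Minsky example can be made concrete with a small sketch that computes the correlation between two users separately for each feature value, rather than over all co-rated documents at once. This is only an illustration under assumed data structures: `features` maps each document to a set of (feature name, value) pairs, and the function names are hypothetical, not part of the proposed system.

```python
from math import sqrt


def correlation(a, b, docs):
    """Pearson correlation of two users' ratings over the given documents."""
    n = len(docs)
    if n < 2:
        return 0.0
    mean_a = sum(a[d] for d in docs) / n
    mean_b = sum(b[d] for d in docs) / n
    cov = sum((a[d] - mean_a) * (b[d] - mean_b) for d in docs)
    var_a = sum((a[d] - mean_a) ** 2 for d in docs)
    var_b = sum((b[d] - mean_b) ** 2 for d in docs)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def feature_correlations(a, b, features):
    """Map each (feature, value) pair to the two users' correlation on the
    co-rated documents that carry that feature value."""
    groups = {}
    for doc, pairs in features.items():
        if doc in a and doc in b:  # only co-rated documents count
            for pair in pairs:
                groups.setdefault(pair, []).append(doc)
    return {
        pair: correlation(a, b, docs)
        for pair, docs in groups.items()
        if len(docs) >= 2
    }


# Hypothetical data: the users agree on Minsky-authored documents
# and disagree on the other neural-network documents.
features = {
    "d1": {("author", "Minsky")},
    "d2": {("author", "Minsky")},
    "d3": {("author", "Minsky")},
    "d4": {("topic", "neural networks")},
    "d5": {("topic", "neural networks")},
    "d6": {("topic", "neural networks")},
}
user_a = {"d1": 5, "d2": 2, "d3": 4, "d4": 5, "d5": 1, "d6": 3}
user_b = {"d1": 5, "d2": 1, "d3": 4, "d4": 1, "d5": 5, "d6": 3}
for pair, r in feature_correlations(user_a, user_b, features).items():
    print(pair, round(r, 2))
```

A correlation computed over all six documents would blur these two groups together; computing it per feature value exposes that the users agree only where Minsky is an author.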
To apply ACF effectively to the web, I propose using easily extractable features of documents along with human evaluations of documents to allow ACF to be applied relative to a set of features. This will reduce the number of items the ACF algorithm needs to consider in making a good prediction, as well as yield better recommendations, since only items having features on which a particular user correlates need be considered while calculating a prediction.

This thesis proposes the implementation of a personalized WWW information agent per user. The agent observes its user's browsing patterns and attempts to learn its user's interests in WWW document space. The agent continually attempts to locate new documents similar to documents that the user has shown interest in in the past. The burden of locating interesting or new documents is thus shouldered by the agent, freeing the user to do more important work. The mechanism that I will explore is Feature-Guided Automated Collaborative Filtering (FGACF).

Footnote 3: This is the implicit assumption behind correlating at the item level.

1.2 Related Work

1.2.1 Automated Collaborative Filtering

Information filtering refers to the filtering of a dynamic information stream based on a long-term profile of the user's interests created and maintained by the system. Most personalized filtering systems automatically create and maintain a user interest profile using machine learning techniques [2, 3]. Content filtering refers to the filtering of information based on its content. Keyword-based filtering is an example of content filtering [4, 2]. Collaborative filtering techniques select documents based on correlations between people and on their subjective judgements of documents.

In the Tapestry system [5], users annotate documents by hand, actively decide whose opinions they are interested in, and program arbitrarily complex filters that are then run continuously over a document store. For example, a typical Tapestry filter may be a query of the form: find all articles in the newsgroup comp.unix-wizards with the keywords UNIX and BSD in the subject that John Doe replied to. Tapestry places the burden of identifying users with similar interests and programming the appropriate filters entirely on the user. Such a solution only works well in a small group of computer-literate users.
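To show why hand-written filters of this kind demand a computer-literate user, the example query above can be sketched as an ordinary predicate. This is not Tapestry's own query language; the `Article` record and its fields are hypothetical stand-ins for whatever annotations the document store keeps.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical record for one article in the document store."""
    newsgroup: str
    subject: str
    replied_by: set = field(default_factory=set)  # names of users who replied


def matches(article):
    """Articles in comp.unix-wizards with UNIX and BSD in the subject
    that John Doe replied to."""
    words = set(article.subject.upper().split())
    return (
        article.newsgroup == "comp.unix-wizards"
        and {"UNIX", "BSD"} <= words
        and "John Doe" in article.replied_by
    )


store = [
    Article("comp.unix-wizards", "BSD vs UNIX sockets", {"John Doe"}),
    Article("comp.unix-wizards", "BSD vs UNIX sockets", {"Jane Roe"}),
    Article("rec.music", "UNIX BSD", {"John Doe"}),
]
selected = [a for a in store if matches(a)]
print(len(selected))
```

Every condition, including whose replies matter, is chosen and written by the user. An automated approach replaces exactly this step with correlations computed from ratings.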

An Automated Collaborative Filtering (ACF) system, by contrast, automatically determines correlations amongst users in their evaluations of items, and uses these correlations to recommend interesting items. The GroupLens system [6] uses this approach in the domain of USENET netnews. GroupLens partitions the document space by applying ACF separately within each newsgroup. Users evaluate articles using modified newsreader client programs. GroupLens is still undergoing testing; initial results with a very limited set of users are encouraging. The RINGO system [7], built at the MIT Media Lab, uses ACF to make personalized music recommendations. RINGO currently consists of a single central server and does not partition the item space in any way. RINGO has been evaluated on a large user population and currently has a growing population of over 3000 users, who can add new items to the database as well as submit reviews of groups and albums.

While systems applying either content or collaborative filtering techniques exist, to date no system has effectively combined the two. One system that attempts to, currently being implemented, is the NewsWeeder system [3] for USENET netnews. However, the emphasis in NewsWeeder is on combining human evaluations with content representations so as to discover new machine representations for text, and on how evaluations can guide the learning of these representations.

1.2.2 WWW Indexing Mechanisms

Most previous attempts to tackle the information overload problem for the WWW have tried to construct a web index of some sort, using a variety of approaches such as individual web robots [8], collaborative swarms of ants [9], collaboratively maintained hierarchical indices [11], generalized resource discovery architectures [12], or meta-indices [13]. A few systems take a more user-centric approach. The Infobot hotlist database collects the contents of various hotlists mailed to it, and periodically forms a list of the most popular documents [14].
The SIMON system [15] distributed a package that allowed users to associate keywords with various resources in their WWW browser hotlist and to retrieve their documents by simply specifying keywords. In return for using this software, users were requested to send their hotlists (and associated keywords) periodically to a central site, where the data is summarized. The hope is that each user's hotlist can be used to construct local maps of WWW space, which can then all be linked somehow (it is not clear exactly how) into a unified global map of the WWW. The Fish-search robot [16], integrated with Mosaic, allows users to specify keywords to search for in documents reachable from a particular document. A robot is then invoked which searches outward from the start document, looking for documents containing those keywords.

2 System Design

The system consists of two main components: a personalized information agent per user that continually monitors the user's interests, recommends documents of potential interest, and collects and propagates the user's evaluations of documents; and a feature-guided automated collaborative filtering server that collects and collates evaluations from multiple users' agents so as to be able to recommend new documents to these users. User agents and FGACF servers communicate using a simple protocol. Figure 1 shows the various parts of the system as well as the various communication paths. The following sections describe each part in detail.

[Figure 1: A single user's agent interacting with an FGACF server. Components: Agent Interface; WWW Browser (XMosaic); Automated Feature Guided Collaborative Filtering Server. Agent functions: collect ratings; manage user-created category hierarchy; automatically classify documents; recommend new documents; collect feedback.
Key to communication paths:
1. Instruct browser to retrieve document. Highlight recommended documents.
2. Retrieve URL of current document (for ratings and analysis).
3. Propagate URLs, ratings, and extra features to server.
4. On-demand recommendations: recommend docs like this one; how would my user rate this doc?; recommend N docs.
5. Server recommendations (documents, computed ratings, confidence levels).]

2.1 User Agents

User agents help structure one's set of favorite documents. They learn users' interests, collect document evaluations, make document recommendations, and can help locate particular kinds of documents. Figure 2 shows a single user's agent. Note that every agent must interoperate with a WWW browser: we have chosen XMosaic as the browser for our implementation because it is popular and readily available. The agent communicates with the user's browser to retrieve the document pointer of the current document so as to pair an evaluation with that document. In addition, the agent can instruct the browser to go to a specified document, and can retrieve the current contents of the user's hotlist.

[Figure 2: Schematic of a single user's agent. Components: Agent Interface; WWW Browser (XMosaic), which supplies the current document and hotlist contents and accepts "go to specified document" requests; Document Categorization and Hotlist Structuring Module (simple feature extraction); Document Evaluation Interface and Feedback Module; Directed Search Module; Recommendation Selector Module; Document Recommendation Modules 1 to N. Labeled flows: recommend documents; propagate feedback.]

An agent ideally consists of an extensible collection of document recommendation modules, each of which periodically makes a series of recommendations to a recommendation selector module. The selector decides which documents to propose to the user based on the confidence and past performance of the proposing document recommendation module (see footnote 4). The heterogeneous nature of the WWW implies that no single method is going to prove satisfactory in all cases; a modular design allows extensibility as new document discovery methods become available. This thesis will only implement a document recommendation module that combines simple feature analysis (titles, keywords, servers, whether the document is an index, etc.) with user evaluations. Hence the recommendation selector module implemented will be extremely simple.

Footnote 4: Typically, different document recommendation modules will be good at recommending certain types of documents, and bad at recommending other types. The hotlist structuring facility automatically provides the recommendation selector module with personalized partitions of WWW document space.
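The selector's role described above, ranking proposals by the proposing module's confidence and past performance, can be sketched as follows. The proposal gives no scoring rule, so everything concrete here is an assumption: the score is confidence times a per-module weight, and the weight is an exponential moving average of user feedback. The names `Proposal`, `RecommendationSelector`, and `learning_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    module: str        # name of the proposing recommendation module
    url: str           # document being proposed
    confidence: float  # module's own confidence, assumed to lie in [0, 1]


class RecommendationSelector:
    """Ranks proposals by confidence weighted by each module's track record."""

    def __init__(self, learning_rate=0.2):
        self.weights = {}  # module name -> weight in [0, 1]
        self.learning_rate = learning_rate

    def score(self, proposal):
        # Unknown modules start from a neutral weight of 0.5 (an assumption).
        return proposal.confidence * self.weights.get(proposal.module, 0.5)

    def select(self, proposals, n):
        """Return the n highest-scoring proposals."""
        return sorted(proposals, key=self.score, reverse=True)[:n]

    def feedback(self, module, liked):
        """Move the module's weight toward 1 after a hit, toward 0 after a miss."""
        old = self.weights.get(module, 0.5)
        target = 1.0 if liked else 0.0
        self.weights[module] = old + self.learning_rate * (target - old)


selector = RecommendationSelector()
proposals = [
    Proposal("acf", "http://example.org/a", 0.6),
    Proposal("keyword", "http://example.org/b", 0.7),
]
first_before = selector.select(proposals, 1)[0].module
for _ in range(3):
    selector.feedback("keyword", liked=False)  # user disliked keyword picks
first_after = selector.select(proposals, 1)[0].module
print(first_before, first_after)
```

With a single recommendation module, as planned for this thesis, the selector reduces to sorting by confidence, which is consistent with the remark that it will be extremely simple.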

2.1.1 Hotlist structuring

Ferret is a Tk interface for hotlist structuring that communicates with XMosaic. Ferret allows a user to structure her browser hotlist hierarchically, and to associate keywords with sets of documents (or specific documents). The user can then retrieve documents either by specifying a set of keywords or by traversing the hierarchy she has created.

2.1.2 Evaluations for Documents

The interface allows the user to enter ratings for the document she is currently reading. The agent interface collects these ratings and propagates them, using the server-agent protocol, to a FGACF server. Recommended documents may be presented in a distinguished font in the appropriate categories of the user's hotlist structure (either periodically or on demand). Feedback is provided by evaluating a recommended document. User feedback can be used in a variety of ways: with multiple document recommendation modules, the recommendation selector module can adjust its weighting of the suggestions given to it by the various modules; analogously, an ACF document recommendation module communicating with multiple ACF servers can adjust the weight it places on each ACF server's recommendations in the future. Feedback may also be propagated back to ACF servers as evaluations, so that the server can make corrections to its database and parameters.

2.1.3 Agent Document Recommendation Module

The agent's document recommendation module collects its user's ratings and propagates them, along with user-provided information such as special keywords or comments, to a FGACF server. It uses a standard protocol to query FGACF servers for evaluations for certain classes of documents (or documents similar to a given document).

2.2 Feature-Guided Automated Collaborative Filtering Server

A feature-guided automated collaborative filtering (FGACF) server collects evaluations (and any additional information such as keywords) from user agents. In addition, each FGACF server contains a document processing module that can extract simple features from documents. These features will be used to determine useful partitions of the document space so that the ACF algorithm can be applied effectively. The features I initially plan on using are: document title, keywords (for text documents) as well as user-supplied keywords, document type, the server it originated from, whether it is an index, and author information (if available).

The ACF algorithm can be guided by features of the documents in two ways: either automatic clusters formed by bands of correlations between similar users are analyzed to find commonality between the documents in these clusters, or features of the documents are used to partition the space and the ACF algorithm is then applied within the partitions. I suspect both forms of partitioning will be useful: the former to locate important features for certain classes of documents, which may help to reduce computation in the future; the latter to recommend documents "similar" to a given document.

The FGACF server supports two forms of interaction between agents and the server: a subscription-based interaction, wherein it periodically sends document recommendations to registered agents; and a demand-based interaction, wherein agents can make specific queries to the server. The forms of demand-driven queries supported by a FGACF server (for a particular agent's user) are:

- Recommend new documents similar to a particular document.
- Compute a probable user evaluation for a specific document.
- Recommend the best (in terms of computed probable evaluations) N new documents.
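The second form of feature guidance (partition the document space by a feature, then apply ACF within the partition) and the "recommend documents similar to a particular document" query can be sketched together. This is a simplified illustration, not the proposed server: similarity between users is replaced by a plain agreement score, 1 / (1 + mean absolute rating difference), computed only inside the partition, and all function and parameter names are hypothetical.

```python
def partition(features, key, value):
    """All documents whose feature `key` has the given value."""
    return {doc for doc, f in features.items() if f.get(key) == value}


def agreement(a, b, docs):
    """Agreement of two users on co-rated documents within `docs`.

    A stand-in for a correlation measure: 1.0 means identical ratings,
    values near 0 mean large disagreement, 0.0 means no overlap.
    """
    common = [d for d in docs if d in a and d in b]
    if not common:
        return 0.0
    mad = sum(abs(a[d] - b[d]) for d in common) / len(common)
    return 1.0 / (1.0 + mad)


def recommend_similar(ratings, features, user, doc, key, n):
    """Best n unseen documents sharing `doc`'s value for feature `key`,
    with computed ratings, using only agreement inside that partition."""
    docs = partition(features, key, features[doc][key])
    mine = ratings[user]
    scores = {}
    for cand in docs:
        if cand in mine:
            continue  # only recommend documents the user has not rated
        num = den = 0.0
        for other, theirs in ratings.items():
            if other == user or cand not in theirs:
                continue
            w = agreement(mine, theirs, docs)
            num += w * theirs[cand]
            den += w
        if den > 0:
            scores[cand] = num / den
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical data: partition by originating server.
features = {
    "d1": {"server": "mit"}, "d2": {"server": "mit"}, "d3": {"server": "mit"},
    "d4": {"server": "cmu"}, "d5": {"server": "mit"},
}
ratings = {
    "A": {"d1": 5, "d2": 4, "d4": 1},
    "B": {"d1": 5, "d2": 4, "d3": 5, "d4": 5, "d5": 2},
    "C": {"d1": 1, "d2": 1, "d3": 1, "d4": 1, "d5": 5},
}
result = recommend_similar(ratings, features, "A", "d1", "server", 2)
print(result)
```

Note that A and B disagree on d4, but d4 lies outside the partition and so does not dilute their agreement on documents from the same server as d1.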
Together, these forms of interaction give individual agents the ability to control the flow of new documents coming in from FGACF servers, as well as the ability to make directed queries. Note that the FGACF server learns of a new document only if some user sees it and evaluates it; the growth of the server's database is thus continuous.

2.3 Evaluation Criteria

There is no real way to evaluate the qualitative advantage that a user with a personalized agent has over one without. As a rough measure of quantitative advantage, user feedback to recommendations will take the form of actually rating recommended articles. These ratings will be compared against the calculated ratings to generate a measure of effectiveness. A "pure" ACF algorithm (no feature guidance) will be treated as the base case against which to compare the FGACF algorithm.

3 Timetable

The research will be carried out in two phases. In Phase I, I will implement the agent interface that allows users to evaluate documents and receive recommendations for documents. A simple ACF and feature-extraction module will be implemented. I will concentrate on setting up one ACF server that does not use any content information. This will serve as a testbed for the protocol as well as the agent module. Further, the performance of this method will serve as a benchmark for the evaluation of FGACF later. In Phase II, I will implement the FGACF server using the experience gained from Phase I, and evaluate it against the results from Phase I. This stage will also consist of rigorous testing with users and the deployment of multiple FGACF servers. The table below presents the various milestones along with the expected dates of completion.

Milestone                                               Expected completion
Phase I
  Implement hotlist facility                            Oct 30
  Implement agent interface                             Nov 20
  Implement simple ACF document recommendation module   Nov 30
  Implement and deploy pure ACF server                  Dec 15
  Begin initial user testing
Phase II
  Implement and deploy FGACF server                     March 10
  Improve agent modules and FGACF server                April 2
  Begin user testing with FGACF server
  Deploy multiple FGACF servers                         April 15
  Correlate results from users                          May 2
  Preparation of report                                 May 7

4 Contributions

I expect this thesis work to result in the following research contributions:

- A general framework for applying FGACF to any domain. Since ACF is domain independent, and simple item features can be extracted in almost any domain, the FGACF techniques developed here should carry over to other domains.
- An extensible personalized agent architecture for WWW users. As new personalized document recommendation modules are built, it should be easy to simply "plug" them into a user's agent.

5 Deliverables

At the beginning of April I hope to have a personalized agent system that can hook into a WWW browser. In addition, I hope to have developed and deployed a feature-guided automated collaborative filtering server, and designed an architecture for distributed coordination amongst multiple such servers.
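The comparison described under Evaluation Criteria (user ratings of recommended documents versus the ratings the system calculated) needs some concrete error measure. The proposal does not name one; mean absolute error is assumed here purely for illustration, and the function name and data layout (document -> rating) are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between calculated and user-supplied ratings,
    over the documents that have both."""
    common = [d for d in predicted if d in actual]
    if not common:
        raise ValueError("no documents were both predicted and rated")
    return sum(abs(predicted[d] - actual[d]) for d in common) / len(common)


# Hypothetical numbers: the same user feedback scored against two algorithms.
actual = {"d1": 5, "d2": 2, "d3": 4}
pure_acf = {"d1": 3.0, "d2": 3.5, "d3": 3.0}  # baseline, no feature guidance
fgacf = {"d1": 4.5, "d2": 2.5, "d3": 4.0}     # feature-guided
print(mean_absolute_error(pure_acf, actual))
print(mean_absolute_error(fgacf, actual))
```

A lower error for the feature-guided predictions than for the pure ACF baseline, on the same user feedback, would be evidence for the thesis claim.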

6 Equipment and Resources Required

The browser used for the implementation will be XMosaic for UNIX platforms. The agent interface will be implemented in C++ and Tk on a Silicon Graphics UNIX workstation at the Media Lab. The various modules will be implemented in either C or Perl. Modules will communicate via shared files. The FGACF server will be built in C (for speed and portability). I hope to use the generic ACF server being built by Max Metral [10] at the Media Lab as a starting point.

References

[1] Feynman, C., Nearest neighbor and maximum likelihood methods for social information filtering, Internal Document, MIT Media Lab, Fall
[2] Sheth, B. D., A Learning Approach to Personalized Information Filtering, SM Thesis, Department of EECS, MIT, Feb
[3] Lang, K., NewsWeeder: An Adaptive Multi-User Text Filter, Research Summary, Aug
[4] Salton, G., and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill,
[5] Goldberg, D., Nichols, D., Oki, B., and Terry, D., Using Collaborative Filtering to Weave an Information Tapestry, CACM, 35 (12), Dec 1992, pp
[6] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J., GroupLens: An Open Architecture for Collaborative Filtering of Netnews, in Proc CSCW-94.
[7] Shardanand, U., Social Information Filtering for Music Recommendation, SM Thesis, Dept of EECS, MIT, Sept
[8] McBryan, O., GENVL and WWWW: Tools for Taming the Web, in Proc of the First Int'l World Wide Web Conference, CERN, Geneva, May
[9] Mauldin, M. L., and Leavitt, J. R., Web Agent Related Research at the Center for Machine Translation, in Proc SIGNIDR-94, Aug 1994, McLean, Virginia.
[10] Metral, M. E., SM Thesis Proposal, MIT Media Laboratory, Oct
[11] YAHOO - A guide to WWW, available online at
[12] Bowman, C. M., Danzig, P. B., Hardy, D. R., Manber, U., and Schwartz, M. F., The Harvest Information Discovery and Access System, in Proc of the Second Int'l World Wide Web Conference, Chicago, IL, Oct
[13] CUI W3 Catalog, available online at
[14] Mueller, P., Infobot Hotlist Database, available online at ftp://ftp.netcom.com/pub/ksedgwic/hotlist/hotlist.html
[15] Johnson, M., SIMON - System of Internet Mapping for Organized Navigation, available online at
[16] DeBra, P., Houben, G-J., and Kornatzky, Y., Navigational Search in the World-Wide Web, available online at
