[8] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, 35(12), December.
[9] Christopher Fox. A stop list for general text. SIGIR Forum, 24(1/2):19.
[10] Gerard Salton. A blueprint for automatic indexing. SIGIR Forum, 16(2):22-38, Fall.
[11] Charles Goldfarb. The SGML Handbook. Oxford University Press.
[12] Donna Harman. A failure analysis on the limitations of suffixing in an online environment. In SIGIR '87, page 102.
[13] Ed Krol. The Whole Internet User's Guide & Catalog. O'Reilly & Associates, Inc.
[14] Larry Wall and Randal L. Schwartz. Programming Perl. O'Reilly & Associates, Inc.
[15] M. Horton and R. Adams. Standard for Interchange of USENET Messages. RFC 1036, December. Available from URL ftp://nis.nsf.net/documents/rfc/rfc1036.txt.
[16] T. W. Malone, K. R. Grant, F. A. Turbak, S. A. Brobst, and M. D. Cohen. Intelligent information-sharing systems. Communications of the ACM, 30(5), May.
[17] M. McCahill. The Internet Gopher protocol: A distributed server information system. In Connexions - The Interoperability Report, volume 6, no. 7, pages 10-14, July.
[18] National Center for Supercomputing Applications. NCSA Mosaic for X Documentation. Available from URL.
[19] newsstats@uunet.uu.net. Total traffic through uunet for the last 2 weeks. USENET newsgroup news.lists, Sept. 11, 1994. URL news:350907i$fji@kong.uu.net.
[20] G. Salton and C. Buckley. Global text matching for information retrieval. Science, 253:1012-1015.
[21] Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[22] T. Berners-Lee et al. World-Wide Web: The information universe. In Electronic Networking: Research, Applications, and Policy, volume 1, no. 2, pages 52-58, Spring.
[23] Tak W. Yan and Hector Garcia-Molina. SIFT - A tool for wide-area information dissemination. Available from URL.
[24] Larry Wall. Manual page for PERL - Practical Extraction and Report Language.

SURF is also currently based on a single-user, stand-alone model, resulting in inefficient use of network resources: articles may be retrieved from the NNTP server multiple times, once for each user. We are currently developing a version of SURF based on a client-server architecture, which would allow us to optimize access to the news server by retrieving each article only once for multiple users. With the addition of client software providing a user interface, the prototype will also be much easier to use.

In addition to a more complete user interface, we are investigating the addition of an e-mail request interface to the server-based version of SURF. This would allow the large number of Internet users who have only basic Internet access (e-mail and FTP capability) to use SURF as well. Users would forward their profiles to the server through the e-mail interface, and receive the results via FTP. Any one of the increasing number of commercially available SGML viewers could be used to view the results.

SURF could also benefit from a mechanism for relevance feedback [21]. This method has been shown to improve retrieval effectiveness by allowing the user to specify which documents are of interest, and which are not. Terms from the relevant documents are added to the query vector, and terms from the non-relevant documents are removed. This has the effect of "moving" the reformulated vector closer (in vector space) to that of relevant documents, providing more accurate retrieval results.

6 Conclusion

This paper has described SURF, a prototype Netnews information filter, which uses well-known information retrieval techniques and disseminates information as a set of hypertext-linked documents. In particular, we described the techniques used to organize the filtered information, and methods by which the filtering process can be made to produce better results. Through extensive experience with the prototype, we have found SURF to complement the current suite of Internet resource discovery tools. Though still in the prototype stage, it has already become an important facility for a number of group members. Current uses range from locating information for the maintenance of Internet mailing lists, to monitoring the performance of professional sports teams. With the addition of a number of usability enhancements, including a client-side user interface, we plan to release SURF to the general Internet community.

References

[1] A. Emtage and P. Deutsch. Archie: An electronic directory service for the Internet. In Proc. Winter 1992 Usenix Conf., pages 93-110, Sunset Beach, Calif. Usenix.
[2] Nicholas J. Belkin and Bruce Croft. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12), December.
[3] C. Mic Bowman, Peter B. Danzig, Udi Manber, and Michael F. Schwartz. Scalable Internet resource discovery: Research problems and approaches. Communications of the ACM, 37(8), August.
[4] Brian Kantor and Phil Lapsley. Network News Transfer Protocol. RFC 977, February. Available from URL ftp://nis.nsf.net/documents/rfc/rfc0977.txt.
[5] CERN. UR* and The Names and Addresses of WWW Objects. Available from URL.
[6] E. M. Housman and E. D. Kaskela. State of the art in selective dissemination of information. IEEE Transactions on Engineering Writing and Speech, EWS-13(2):78-83, Sept.
[7] Eric W. Mackie. Waterloo Text Database, System Overview. Technical Report 94-12, University of Waterloo, Waterloo, Ontario, March.

expressions specified in the query vector. A relevance value is then computed for each article by measuring the similarity between the query and article vectors using the cosine similarity measure (Section 4.2.3). If the relevance value is greater than the user-specified threshold, an index page is generated, and the article is added to the HTML table of contents. After all of the vectors have been processed, SURF waits a specified amount of time, and then resumes processing.

The current prototype is written entirely in PERL [24], which provides powerful facilities for reading, manipulating, and writing large amounts of information. PERL also offers an extensive pattern matching capability, recognizing most of the common syntactic notations for regular expressions (e.g., a superset of grep, sed, and awk) [14]. The primary building block for document vectors is PERL's associative array mechanism. As their underlying implementation is based on hash tables, associative arrays provide an efficient means of adding, removing, or updating terms and their associated weights.

4.5.1 Configuration

By modifying settings in a configuration file, users may individually tailor most aspects of the filtering process to their specifications. The following is a list of some of the more useful customizations:

- Removal of text quotations prior to filtering. As previously discussed, this may or may not be necessary depending on the newsgroup. Users may wish to experiment with this setting to determine what impact it has on filtering effectiveness.
- Relevance threshold. The user may specify a minimum relevance value for an article to be included in the results.
- Local frequency threshold. To help further limit the amount of information returned, the user may specify a local term frequency threshold: the minimum number of times each expression must be matched within the article.
- Conversion of articles to HTML. Most Netnews articles have a short lifespan, usually ranging from one week to one month. The user can direct SURF to store a local copy of the article (in HTML) so that it will always be accessible. If this feature is turned off, the article can still be accessed from the results (provided it has not expired), as SURF replaces the link to the local copy with the URL of the actual article.
- The time delay between successive runs.
- An additional, or replacement, term stop list. A stop list of common English-language terms is provided; however, an additional list may be specified to eliminate unwanted topic-specific terms, or to supply common terms from another language.

5 Further Work

The current SURF prototype relies heavily on a user's familiarity with Netnews to direct it to the appropriate newsgroups. While this is usually not a problem for seasoned users, Netnews novices will likely derive little benefit from SURF. We are currently investigating techniques to make our prototype easier for inexperienced users to use. The most promising method involves transforming the Internet FAQs (lists of "frequently asked questions", with answers, posted to the newsgroups at regular intervals) into their VSM equivalent, and computing their degree of similarity with the user query. Because most FAQs point back to the newsgroup they originated from, the vectors with the greatest similarity to the query will yield a list of the most promising newsgroup starting points.
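To make the FAQ-matching idea concrete, the following is a minimal sketch in PERL, the prototype's implementation language. The newsgroup names, FAQ vectors, and weights are invented for illustration; the paper does not specify these data structures beyond the use of associative arrays.

    use strict;
    use warnings;

    # Sketch of the proposed FAQ matching: FAQs reduced to term => weight
    # hashes (made-up weights), ranked by their similarity to the user query.
    my %query = ( climbing => 0.7, gear => 0.7 );
    my %faq_vector = (
        'rec.climbing'    => { climbing => 0.9, rope => 0.4 },
        'rec.backcountry' => { hiking   => 0.8, gear => 0.5 },
    );

    # Inner product over the query's terms; missing terms count as zero.
    sub sim {
        my ($doc, $qry) = @_;
        my $s = 0;
        $s += ($doc->{$_} // 0) * $qry->{$_} for keys %$qry;
        return $s;
    }

    # Newsgroups whose FAQ vectors score highest are suggested first.
    for my $group (sort { sim($faq_vector{$b}, \%query)
                      <=> sim($faq_vector{$a}, \%query) } keys %faq_vector) {
        printf "%s (%.2f)\n", $group, sim($faq_vector{$group}, \%query);
    }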

embedded cross-references, neither of which provides an accurate indication of article content. Hence, in the study some non-relevant articles may have been incorrectly assumed to be relevant.

We have found through empirical observation that in many cases relevant, "high-quality" articles tend to generate many responses of lesser relevance. This increases the likelihood that inclusion of quoted text in further processing will contribute to "diluted" retrieval effectiveness by propagating relevant terms throughout many followup articles, resulting in unrepresentative relevance values. A more reliable relevance assessment based solely on current content is possible if quotations are removed from further consideration. In this way, textual content contributes only to the analysis of the article it originally appears in, and has no additional impact on future articles.

The actual impact of quotation inclusion on retrieval performance depends to a large extent on the individual newsgroup. Although there has been no formal measurement of the percentage of article content composed of quoted text, experience shows that it varies from newsgroup to newsgroup. Individual Netnews users are therefore probably the best judges of whether quoted text affects filtering effectiveness, either positively or negatively. By default, SURF removes quoted text from articles before they are processed; however, this behavior can be changed through a configuration setting.

Along with quotations, URLs and embedded HTML anchors are also removed from the article body. Although URLs provide the locations of other relevant resources, their filename-like syntax does not provide an adequate conceptual description of the resources they point to. Consequently, they are removed from the text of the article, but retained for the HTML index page in case the article is later determined to be relevant. With the advent and increasing popularity of Netnews-capable WWW clients, many Netnews users have begun embedding HTML mark-up tags directly into articles. Typically these tags represent anchors, or hypertext links, to the user's personal WWW "homepage". When viewed using an HTML browser, the anchors appear as active hypertext links. Like URLs, anchors provide superfluous information and contribute no additional content discrimination power.

In addition to an unstructured body, Netnews articles have a number of predefined header fields which do provide additional content information. The subject field is typically a one-line description of the article. Two optional fields, keywords and summary, provide a list of the important article keywords and a short summarization, respectively. The text from these three fields is concatenated with the main body before further processing occurs.

After all of the previous text transformations have been applied, the remaining text is compared with the regular expressions provided in the user profile. If any of the query expressions match portions of the article, we transform it into an initial vector representation (with local term frequencies) as follows:

1. Each matching regular expression is added to the document vector, along with its local weighting (the number of times it matches within the article).
2. The article text matching the regular expressions is removed, so that individual terms in the matched portion (an expression may match large text structures) will not be included in further processing.
3. The remainder of the article is decomposed into individual terms (whitespace-delimited strings of characters stripped of leading and trailing punctuation), which are compared against a stop list of high-frequency, common function words (and, the, but, their, etc.) [9]. Article terms which appear in the stop list are removed.
4. The remaining terms are added to the document vector along with their occurrence frequencies.
5. The vector is retained in a disk file for later use.

When the supply of new articles is exhausted, we compute term weights for each saved document vector. Combined local and global weights (tf·idf) are assigned to each term, using the current count of articles read up to this point as the value for N (Section 4.2.2). Next, idf weights are computed for the regular
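As a minimal sketch of the body transformations and of steps 3 and 4 above, in PERL: the stop list is abbreviated, the sample article is invented, and the patterns are simplifications of what real Netnews quoting and HTML require.

    use strict;
    use warnings;

    # Abbreviated stop list of common function words.
    my %stop = map { $_ => 1 } qw(and the but their a an of to in is);

    sub clean_body {
        my ($body) = @_;
        $body =~ s/^\s*>.*$//mg;                       # quoted lines (">" prefix)
        $body =~ s{</?[Aa](?:\s[^>]*)?>}{}g;           # embedded HTML anchors
        $body =~ s{\b(?:ftp|http|gopher|news):\S+}{}g; # URLs
        return $body;
    }

    sub add_terms {
        my ($vector, $text) = @_;
        for my $term (split ' ', lc $text) {
            $term =~ s/^[[:punct:]]+//;   # strip leading punctuation
            $term =~ s/[[:punct:]]+$//;   # strip trailing punctuation
            next if $term eq '' or $stop{$term};
            $vector->{$term}++;           # local term frequency
        }
    }

    my $article = '> Anyone know a good climbing archive?' . "\n"
                . 'Try ftp://ftp.example.org/pub/climbing -- rappel guides and' . "\n"
                . '<a href="http://www.example.org/~me/">my homepage</a> too.' . "\n";

    my %vector;
    add_terms(\%vector, clean_body($article));
    print "$_ ($vector{$_})\n" for sort keys %vector;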

followed by the name of the information archive. Since it is unlikely that an article without an announcement will contain this pattern, specifying it as a query term will result in the retrieval of only those articles which announce new information archives. Finally, when used without any of their special syntax, regular expressions are equivalent to single terms or phrases. This is an important consideration for users who are unfamiliar with, or unwilling to learn, the additional complexity of a new syntax.

4.4 Stemming

Stemming, or suffix stripping, refers to the process of automatically reducing related terms to their common word stems [21]. With this technique, words such as stemmed, stemming, and stemless are all reduced to the common root, stem. While stemming can help limit the length of document and query vectors, the technique is not well suited to a global environment such as Netnews, and may actually be redundant when used in conjunction with regular-expression-based queries.

Traditionally, suffix strippers have dealt exclusively with the terms and word stems of the English language. Typical IR domains include research papers, abstracts, or library catalogs - all predominantly in English. However, with growing Internet participation from countries outside of North America, it is becoming increasingly important to consider other languages (many newsgroups already carry on discussion in languages such as French, Spanish, and Portuguese). A multilingual information filtering service must therefore either determine which language a document is written in and apply a language-specific stemming routine (requiring an online dictionary for each language), or make no language assumptions at all. Furthermore, Harman [12] has shown that stemming has only a minimal influence on the effectiveness of a retrieval system.

SURF does not automatically stem words; instead, it relies upon the user to recognize and express any common word stems, if so desired. For example, a user searching for analysis, analyzer, or analytic can specify the expression analy.*. Although this assumes at least a passing familiarity with regular expressions, it has the advantage of being both language-neutral and less computationally intensive than an automatic suffix stripping routine.

4.5 Implementation

SURF begins by establishing a client-level connection to a Netnews server [4], and retrieving new (unprocessed) articles. With the exception of well-defined fields in the header, Netnews articles have no readily exploitable structure (the contents are entirely free-form text) [15]. Users frequently augment articles with text devices intended to aid readability or supply additional information, such as ASCII-text diagrams, signatures (author name and information), or references to related documents. While it may prove beneficial to human readers, some of this additional information may degrade filtering effectiveness, and should be removed prior to analysis. As SURF retrieves new articles, three text transformations are performed on the body of the article:

- text quotations are removed
- Uniform Resource Locators are removed
- any embedded HTML anchors are removed

It has become standard Netnews practice, when replying to an article, to include portions of that article to provide a context for the response. A measurement of the impact of text quotations on retrieval effectiveness was made by Salton et al. [20]. Working with a collection of 1984 articles, they found that for a given level of recall, removing quotations from the articles caused a decrease in the average precision of the system. However, as stressed by the authors, article relevance was determined based on subject line comparisons and
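A minimal PERL sketch of the user-side stemming from Section 4.4 above; the word list is illustrative, and the pattern is the one given in the text with a leading word boundary added.

    use strict;
    use warnings;

    # The user-supplied expression stands in for a family of related terms,
    # with no language assumptions and no suffix-stripping dictionary.
    my $analy = qr/\banaly.*/i;

    for my $word (qw(analysis analyzer analytic catalog)) {
        printf "%-10s %s\n", $word, $word =~ $analy ? "matches" : "no match";
    }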

Figure 4: HTML Netnews article

4.3 Model Enhancements

The SURF system differs from other information retrieval systems by generalizing the concept of terms to include regular expressions. Regular expressions have a decided advantage over single terms or phrases in allowing considerably more expressive query formulations. Allowing more flexibility in the specification of a query can potentially improve the overall effectiveness of the system by increasing both recall, which is the proportion of relevant documents that are retrieved, and precision, which is the proportion of retrieved documents that are actually relevant. This is an especially important consideration in a "noisy" environment like Netnews, where grammatical errors, misspellings, and non-standard term usage are commonplace.

To help illustrate the potential impact of spelling errors and non-standard term usage upon retrieval effectiveness, we manually examined 843 articles from an active newsgroup, looking for incorrect usage of a common domain-specific term. For our experiment we chose rec.climbing (articles dated Aug. 31 through Sept. 27) - a newsgroup for the discussion of rock climbing, mountaineering, and hiking. We found that out of 62 uses of the word rappel or its derivatives, only 35 were spelled correctly, while 14 occurrences were misspelled, and 13 represented colloquial or slang usage. By not taking misspellings or slang into account, only 26 of the 39 relevant articles would be retrieved (66%). Since all of the misspellings were the result of one too few p's, or one too many l's, a single regular expression accounting for this fact was able to match all of the correctly and incorrectly spelled terms. This increased the number of retrieved articles to 33 (85%). Furthermore, one additional regular expression was sufficient to represent all of the slang terms, and retrieve all 39 articles.

While using regular expressions to account for misspellings may help improve system recall, they can also be used to increase precision, by matching larger, more descriptive text structures such as sentences or paragraphs. For example, articles announcing the availability of Internet resources typically contain a text line of the form: Archive-name:
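The paper does not give the expressions themselves; the following PERL sketch shows plausible reconstructions of the two kinds discussed above - one absorbing the observed misspellings, one matching a larger announcement structure.

    use strict;
    use warnings;

    # One pattern covers the correct spelling plus "one too few p's" and
    # "one too many l's" (rappel, rapel, rappell, rapell, ...).
    my $rappel = qr/\brap{1,2}el{1,2}/i;
    print "$_ matches\n" for grep { /$rappel/ } qw(rappel rapel rappell rappeling);

    # A whole-line pattern that only announcement articles are likely to
    # contain, trading some recall for precision.
    my $announce = qr/^Archive-name:\s*\S+/m;
    print "announcement found\n" if "Archive-name: climbing/faq\n" =~ $announce;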

Figure 3: SURF Index Page

chance of being retrieved. For an article with k terms, the weight for a term i is computed as:

    w_i = \frac{tf_i \cdot idf_i}{\sqrt{\sum_{j=1}^{k} tf_j^2 \cdot idf_j^2}}

The user-specified query terms are assigned normalized idf weights [10].

4.2.3 Similarity Measure

We determine the relevance of an article by computing the degree of similarity between the query and the article. Since both are represented by vectors, we use the cosine measure of similarity, defined to be the inner product of the vectors normalized for length. With normalized article and query weights (as in the prototype), the similarity between an article A and a query Q is computed as:

    sim(A, Q) = \sum_{i=1}^{k} a_i q_i

where a_i and q_i are the ith corresponding terms of A and Q, respectively. This measure yields the cosine of the angle between the article and the query vector: a value of 0 is computed when the vectors have no terms in common, and 1 when the vectors are identical.
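A minimal PERL sketch of both computations above, operating on hashes of term => weight as the prototype does; the frequencies and document counts are invented for illustration.

    use strict;
    use warnings;
    use List::Util qw(sum);

    sub weight {
        my ($tf, $df, $N) = @_;   # local freqs, document freqs, articles seen
        my %w;
        $w{$_} = $tf->{$_} * log($N / $df->{$_}) for keys %$tf;
        my $len = sqrt(sum(map { $_ ** 2 } values %w) || 1);
        $w{$_} /= $len for keys %w;   # normalize for document length
        return \%w;
    }

    sub cosine_sim {                  # inner product of unit-length vectors
        my ($art, $qry) = @_;
        return sum(0, map { ($art->{$_} // 0) * $qry->{$_} } keys %$qry);
    }

    my $article = weight({ internet => 3, access => 2 },
                         { internet => 40, access => 25 }, 1000);
    my $query   = weight({ internet => 1, access => 1, providers => 1 },
                         { internet => 40, access => 25, providers => 10 }, 1000);
    printf "sim(A, Q) = %.2f\n", cosine_sim($article, $query);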

Figure 2: SURF Table of Contents

which associate extra information with each term - typically, an "importance" ranking or frequency count. Within the information retrieval literature, these alternate methods are referred to as document models. Readers wishing a more thorough treatment of information retrieval and filtering are invited to refer to [2, 8, 21].

4.2.1 Vector Space Model

The model employed by SURF is the well-known Vector Space Model (VSM). In the VSM, both documents and queries are represented as vectors of terms with associated weights. For example, a document D consisting of k distinct terms is represented as the k-dimensional vector D = (w_1, w_2, ..., w_k), where w_i is the weight assigned to term i. Terms not in D are implicitly weighted with the value zero. Besides providing a parallel representation for both documents and queries, the VSM readily supports a number of similarity measurements. This provides the capability to rank filtered documents in decreasing order of similarity to the query vector.

4.2.2 Term Weights

Weights for the individual article terms are computed based on a combination of local and global term frequencies. The local term frequency (tf_i) of a term i is defined to be the number of times term i appears in the article. The global quantity, or inverse document frequency (idf_i), of a term i is defined as \log(N/n_i), where N is the total number of articles processed, and n_i is the number of articles containing term i. The inverse document frequency is intended to provide a higher ranking for terms that appear few times throughout the entire collection, i.e., terms that are better able to distinguish individual documents from the rest of the collection. It is also important to normalize the weights for document length so that short documents have an equal
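As a worked instance of these definitions, with invented counts (and taking log as the natural logarithm; the paper does not fix the base): a term occurring tf_i = 3 times in an article and appearing in n_i = 10 of N = 1000 processed articles receives

    idf_i = \log(1000/10) = \log 100 \approx 4.61, \qquad tf_i \cdot idf_i \approx 13.8

while a term appearing in 500 of the same 1000 articles receives idf_i = \log 2 \approx 0.69, reflecting its weaker power to discriminate individual documents from the rest of the collection.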

Figure 1: The SURF architecture

The results of the filter can then be browsed by invoking an HTML browser (in this example we use Mosaic) on the main "table of contents" file (Figure 2). This document provides a brief summary of the articles which match the user query by listing their corresponding subject lines in relevance order. Each entry in the list has two hypertext links to further information (underlined text in Figure 2). If the subject of the article appears interesting, the user can obtain more detailed information by clicking on its associated relevance value. This selects a hypertext link to an "index page" (Figure 3). If further information is not required, the user may proceed directly to the article itself (Figure 4) by clicking on the subject line.

The index pages contain information regarding the relevance measure computed for the article. As not all of the query terms need appear in every relevant article, a list of the terms present, along with their occurrence frequencies within the article (in parentheses), is included. To provide an additional indication of content, the index page also includes the set of terms found in the article, ranked by their importance. The top-ranked terms shown in Figure 3, for example, indicate that the article contains information about a directory of access providers. During the filtering process, any URLs located within the article are syntactically recognized, converted to active WWW hypertext links, and listed under the Resource Reference heading. If a WWW client is used for browsing, the resources indicated by the URLs can be viewed simply by selecting the links.

Finally, the original article may be viewed by selecting the "view article" link from the index page. This takes the user to an HTML version of the article (Figure 4). Query terms are tagged so that they appear in a different font from the rest of the article (in this case, boldface). As well, all URLs in the body of the article are converted to active WWW links. A user is therefore free to browse the referenced resources as the article is scanned.

The following sections discuss in greater detail the methods and models used in the implementation of the prototype. We also present details of some of the enhancements necessary to provide more effective information filtering in an environment like Netnews.

4.2 Document Model

Most information retrieval and filtering research focuses on text-based documents (throughout the remainder of the paper we will use the terms articles and documents synonymously), which in essence are simply ordered sequences of terms (usually words). Alternative methods of representing documents exist
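Returning to the article conversion described above: a minimal PERL sketch of the two tagging passes. The body and term list are illustrative, and a real implementation must keep the two passes from mangling each other (e.g., a query term occurring inside a URL).

    use strict;
    use warnings;

    # Tag query terms so they render in a distinct font (here, boldface),
    # then convert URLs in the body into active hypertext links.
    sub to_html {
        my ($body, @terms) = @_;
        for my $t (@terms) {
            $body =~ s{\b(\Q$t\E)\b}{<b>$1</b>}gi;
        }
        $body =~ s{\b((?:ftp|http|gopher|news):\S+)}{<a href="$1">$1</a>}g;
        return $body;
    }

    print to_html("New provider directory at ftp://ftp.example.org/pub/list\n",
                  "provider");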

3.1 HyperText Markup Language

Rather than developing a new mark-up tag scheme for filtered information, we make use of the popular HyperText Markup Language, or HTML [22]. Most Internet users are already familiar with HTML, an SGML-defined tag set for describing the structure of WWW documents. Besides a rich set of simple structural elements, HTML defines tags to embed "anchors", or hypertext links, into documents. Using HTML as the markup tag set means that:

- most Internet users will be able to take full advantage of the many widely available, high-quality WWW browsers (called clients) to view the results;
- the search results can be incorporated directly into a WWW server.

As momentum behind the WWW and HTML browsers has grown, it has become standard Netnews practice to publish the locations of new resources in a "universal" format. Uniform Resource Locators, or URLs, are strings which uniquely encode the network address of a given resource [5]. (URLs are defined by the Internet Engineering Task Force (IETF); the WWW defines a similar standard, known as a Universal Resource Identifier, or URI. As URL is the more common usage, we use it to avoid confusion.) For example, the URL of this paper is: ftp://csg.uwaterloo.ca/pub/kjl/surf.ps. WWW browsers integrate anchors and URLs to provide hyperlink access to information anywhere on the Internet.

While URLs provide a short, unique address for a given resource, the flood of interest in the Internet has spawned the generation of myriad new resource addresses. Extracting, indexing, and following up on URLs published on Netnews has become a tedious, time-consuming endeavor. As part of the filtering process, we extract and convert any URLs found in the text of relevant articles to active WWW hypertext links, and present them both in summary-list form and in situ (in an HTML version of the article). Provided that a WWW browser (such as the National Center for Supercomputing Applications' (NCSA) Mosaic [18]) is used to view the filtered information, any resources announced in the article can be examined simply by selecting their hypertext links.

4 The SURF System

The primary interface to the SURF system is the user profile. It is through the profile that the user specifies one or more "searches", representing any type of interest, ranging from long-term to "one-shot". Each search is identified by three elements: a directory in which the results are to be placed, a list of newsgroups, and a collection of query terms. Figure 1 presents an overview of the SURF architecture. At fixed (user-specified) intervals, SURF queries the Netnews server for new articles from the list of newsgroups specified in the profile. The articles are then compared against the query terms, and relevance rankings are computed. Finally, the results of the search are converted into a collection of hypertext-linked HTML documents which the user can browse.

4.1 Example

We illustrate the use of SURF through an example session. Suppose a user currently looks for information regarding Internet access providers by monitoring the comp.infosystems.announce newsgroup. He first creates a profile with the search terms "internet", "access", and "providers", the name of the newsgroup, and the location for SURF to place the results (a directory within a local file system). SURF is then started from the command line (it currently runs only in stand-alone mode; future versions will be implemented as a server) and runs until all new (unseen) articles from the newsgroup have been processed.
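The paper does not show the profile syntax itself, so the following PERL sketch renders the example search as a data structure with the three required elements; the paths and names are purely illustrative.

    use strict;
    use warnings;

    # One "search": where to put results, which newsgroups to scan, and the
    # query terms to match. A profile may contain any number of these.
    my @profile = (
        {
            directory  => "$ENV{HOME}/surf/providers",
            newsgroups => [ "comp.infosystems.announce" ],
            terms      => [ "internet", "access", "providers" ],
        },
    );

    for my $search (@profile) {
        printf "%s: %s\n", $search->{directory},
               join(", ", @{ $search->{terms} });
    }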

2 Related Work

The Stanford Information Filtering Tool (SIFT) [23] is a project aimed at providing an efficient wide-area information dissemination service. One application of the SIFT filtering engine is the dissemination of Netnews. Although from the users' perspective our prototype and SIFT provide similar functionality, namely a fine-grained filter for Netnews, the two approaches differ in both intent and scale. SIFT addresses the efficiency aspects of filtering large amounts of information for a large number of users. While capable of gathering and disseminating the bulk of Netnews, it routes articles to users via an e-mail interface, and does not address any information organization issues.

The Information Lens [16] is a system intended to increase the amount of useful information exchanged within a group by providing intelligent information filtering and organization. Users of the system specify a set of rules to filter or classify incoming messages automatically. While it provides a powerful mechanism for filtering shared information, it is based exclusively on a set of semi-structured message types, which limits its applicability to unstructured text domains such as Netnews.

3 Descriptive Markup

Descriptive markup is a technique developed to annotate plain-text documents with structural information through the use of special tags. For instance, markup tags might be used to denote which parts of a text document are chapter headings, and which are paragraphs. Structural information might also be used to place "links" between related portions of text. Since different document types exhibit different structures, each requires its own set of markup tags. The Standard Generalized Markup Language (SGML) [11] is a "metalanguage" used to define tag sets for different document types. Through a number of portability and standardization efforts (ISO 8879), use of SGML is growing in popularity.

Although intended primarily for electronic publishing and data interchange, descriptive markup techniques provide the means to organize information so it can be easily browsed and located when needed. In an information dissemination environment where potentially large amounts of data may be returned, it is important that users be able to determine quickly what information is, or is not, relevant, without being forced to read each article. This is an especially important concern in an active, "noisy" domain like Netnews, where high volume and low quality make it difficult to filter out all irrelevant information.

Applying markup techniques to filtered information makes it possible to exploit existing SGML browser technology. The hypertext capability provided by most SGML-based browsers offers an efficient mechanism for traversing document collections. Placing links between related information lets users quickly navigate to other documents in the collection. Hypertext capability permits information to be organized for browsing in ways that would otherwise be difficult to achieve. For example, to prevent inundating users with too much detailed information at once, a "summary" of the filter results could be shown first, containing links to progressively more detailed views. Ideally, an early indication of relevance could be provided, so that interesting information could be viewed in more detail while non-relevant information is ignored.

While SGML browsers offer enhanced information-viewing capability, SGML text database research can be applied to the problem of storing the disseminated information efficiently. By extracting embedded structural information, SGML-based text databases partition document text into database records to improve storage and retrieval efficiency [7]. In addition to the efficiency aspects, the database acts as a centralized repository, making it easier to locate needed information once it has been discovered. A further benefit is that the full range of search and extraction facilities offered by the database can be used to locate and view even more focused subsets of information. This is particularly useful if a user elects to filter a broad range of information.

Most current discovery tools also typically employ a passive approach to locating new information, searching only when initiated by the user. A preferred approach is to have the tools selectively route new information to users based upon a description of their interests (e.g., a user profile). This type of active search mechanism, known as information filtering [2] or selective dissemination of information [6], is usually based upon well-known information retrieval (IR) techniques [21]. An active filtering approach relieves users of the burden of frequent and time-consuming searches through large amounts of data. In a dynamic environment, where information may be removed or expire without notice, an information filter is less likely to overlook important resources than an occasional human-guided search.

USENET News, or Netnews [13], is the Internet's primary mechanism for information exchange, consisting of a collection of electronic bulletin boards, or newsgroups, coarsely organized by topic. With its topical arrangement, Netnews qualifies as a primitive, self-administered type of information filtering service. Users receive information by subscribing to the newsgroups that they find interesting. Netnews provides a medium for, among other things, public discussion, information sharing, and announcements, making it the most comprehensive information repository on the Internet (though not a true repository in the strictest sense: Netnews is more like a moving window, with new information added while aged information expires and is removed). Despite its obvious importance to resource discovery, a number of factors underscore the need for a finer level of filtering granularity within Netnews:

- Typical Netnews users follow a subset of newsgroups based on long-term personal or professional interests, or occasionally search for new resource announcements. Newsgroup organization is currently determined by topic; however, with the wide appeal of Netnews, newsgroups tend to have a much broader focus than most users' interests. Consequently, users will be interested in only a subset of the articles.
- Rising traffic levels have also created problems for users. A bi-weekly measurement of total USENET traffic, as of September 1994, placed the average number of new articles per day at nearly 72,000, for an average volume of over 150 MB per day [19]. The same study also shows these figures to be growing steadily.
- Linked to rising traffic levels, another concern is the amount of "low-quality" news articles. An unfortunate consequence of an unregulated conferencing medium (with the exception of "moderated" newsgroups, where each article must be approved by a moderator before it is made public), without enforced publishing guidelines or review processes, is that a large percentage of traffic is devoid of proper grammar, style, or useful content.

We propose that users of Netnews can benefit from a fine-grained filtering facility. Furthermore, applying descriptive markup techniques to the filtered information as it arrives can reduce organizational problems and facilitate an efficient browsing mechanism. In this paper we describe the Selective USENET Retrieval Facility (SURF), a prototype Netnews information filter. SURF is intended to provide a fine-grained information filter for Netnews using well-known information retrieval models, and to disseminate information as structured, easy-to-browse collections of documents. Rather than attempt to filter all Netnews articles, the majority of which are of no interest, SURF takes a more personalized approach, filtering articles only from a user-selected set of newsgroups. While this approach does require at least some familiarity with the Netnews system, it enables users to control fully the scope of the filtering process and, consequently, to place an upper bound on the amount of information likely to be selected. As a result, our prototype may be more useful to experienced Internet users than to novices. We are currently working on techniques to increase its appeal to novices (see Section 5). Providing this type of personalized information filter allows us to focus on maximizing the benefit of Netnews, rather than on the large-scale indexing techniques necessary to filter efficiently tens of thousands of new articles every day. We focus on the techniques employed in the organization of the filtered information, and on methods of increasing filtering effectiveness.

SURF - An Information Filtering Facility for USENET News

Kurt Lichtner and D.D. Cowan
Computer Science Department & Computer Systems Group
University of Waterloo
Waterloo, Ont. N2L 3G1

Abstract

Sustained growth in the Internet has led to a dramatic increase in the amount of electronic information available to users. Tools designed to help users discover new information are the key factors in deriving the maximum benefit from a large-scale information sharing system. However, many tools currently available lack the features necessary for coping with a burgeoning volume of information, such as selectively filtering interesting information and automatically organizing it for efficient browsing. In this paper we describe the Selective USENET Retrieval Facility (SURF), a prototype Netnews information filter. We discuss methods of applying descriptive markup techniques to provide a consistent organization for the filtered information. We also describe the information retrieval models used in the filter, as well as domain-specific methods of improving filtering effectiveness.

1 Introduction

Since its initial deployment 25 years ago, the Internet has become an indispensable tool for conducting research, communication, and information (or resource) sharing. The recent explosion in popularity observed on the Internet has dramatically increased the amount and type of information available. Unfortunately, most of this information is of no use to the majority of users [3]. Consequently, network resource discovery has increasingly become a "needle-in-the-haystack" activity.

As more electronic information is made available, users are increasingly faced with information overload. The key to making effective use of a large-scale information system, while remaining shielded from that overload, lies in the set of tools used to locate and manage information. The past several years have witnessed the emergence of a small number of tools directed at locating information; Archie [1], Gopher [17], and the World-Wide Web (WWW) [22] are all examples which have achieved widespread acceptance within the Internet community. Although they have become an integral part of discovering new information on the Internet, these tools have not addressed information management issues, which deal with automatically organizing (indexing) discovered information so that it can be easily found and viewed when needed. Relying on individual users to manage data cataloging and storage details creates an additional discovery problem, one level "above" that of networked resource discovery: it is often time-consuming or difficult to locate previously discovered information within a private collection, especially if it is large or haphazardly organized.

Recent standardization efforts in document portability and research in text databases can be of direct benefit to users faced with an information management problem. Descriptive markup languages allow information to be organized into a standard document structure, while text database browsers provide the means to explore collections of documents quickly and efficiently.


More information

UNIT V SYSTEM SOFTWARE TOOLS

UNIT V SYSTEM SOFTWARE TOOLS 5.1 Text editors UNIT V SYSTEM SOFTWARE TOOLS A text editor is a type of program used for editing plain text files. Text editors are often provided with operating systems or software development packages,

More information

Activity Report at SYSTRAN S.A.

Activity Report at SYSTRAN S.A. Activity Report at SYSTRAN S.A. Pierre Senellart September 2003 September 2004 1 Introduction I present here work I have done as a software engineer with SYSTRAN. SYSTRAN is a leading company in machine

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about:

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about: WWW and Web Browser 6.0 Introduction WWW stands for World Wide Web. WWW is a collection of interlinked hypertext pages on the Internet. Hypertext is text that references some other information that can

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Global Servers. The new masters

Global Servers. The new masters Global Servers The new masters Course so far General OS principles processes, threads, memory management OS support for networking Protocol stacks TCP/IP, Novell Netware Socket programming RPC - (NFS),

More information

CHAPTER-26 Mining Text Databases

CHAPTER-26 Mining Text Databases CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other

More information

A Model and a Visual Query Language for Structured Text. handle structure. language. These indices have been studied in literature and their

A Model and a Visual Query Language for Structured Text. handle structure. language. These indices have been studied in literature and their A Model and a Visual Query Language for Structured Text Ricardo Baeza-Yates Gonzalo Navarro Depto. de Ciencias de la Computacion, Universidad de Chile frbaeza,gnavarrog@dcc.uchile.cl Jesus Vegas Pablo

More information

Text Mining. Representation of Text Documents

Text Mining. Representation of Text Documents Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,

More information

Recognition. Clark F. Olson. Cornell University. work on separate feature sets can be performed in

Recognition. Clark F. Olson. Cornell University. work on separate feature sets can be performed in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 907-912, 1996. Connectionist Networks for Feature Indexing and Object Recognition Clark F. Olson Department of Computer

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

Developing a Basic Web Page

Developing a Basic Web Page Developing a Basic Web Page Creating a Web Page for Stephen Dubé s Chemistry Classes 1 Objectives Review the history of the Web, the Internet, and HTML Describe different HTML standards and specifications

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

For our sample application we have realized a wrapper WWWSEARCH which is able to retrieve HTML-pages from a web server and extract pieces of informati

For our sample application we have realized a wrapper WWWSEARCH which is able to retrieve HTML-pages from a web server and extract pieces of informati Meta Web Search with KOMET Jacques Calmet and Peter Kullmann Institut fur Algorithmen und Kognitive Systeme (IAKS) Fakultat fur Informatik, Universitat Karlsruhe Am Fasanengarten 5, D-76131 Karlsruhe,

More information

Chapter 1 Introduction to HTML, XHTML, and CSS

Chapter 1 Introduction to HTML, XHTML, and CSS Chapter 1 Introduction to HTML, XHTML, and CSS MULTIPLE CHOICE 1. The world s largest network is. a. the Internet c. Newsnet b. the World Wide Web d. both A and B A PTS: 1 REF: HTML 2 2. ISPs utilize data

More information

\Classical" RSVP and IP over ATM. Steven Berson. April 10, Abstract

\Classical RSVP and IP over ATM. Steven Berson. April 10, Abstract \Classical" RSVP and IP over ATM Steven Berson USC Information Sciences Institute April 10, 1996 Abstract Integrated Services in the Internet is rapidly becoming a reality. Meanwhile, ATM technology is

More information

R&D White Paper WHP 018. The DVB MHP Internet Access profile. Research & Development BRITISH BROADCASTING CORPORATION. January J.C.

R&D White Paper WHP 018. The DVB MHP Internet Access profile. Research & Development BRITISH BROADCASTING CORPORATION. January J.C. R&D White Paper WHP 018 January 2002 The DVB MHP Internet Access profile J.C. Newell Research & Development BRITISH BROADCASTING CORPORATION BBC Research & Development White Paper WHP 018 Title J.C. Newell

More information

THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE. Colleen DeJong CS851--Information Retrieval December 13, 1996

THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE. Colleen DeJong CS851--Information Retrieval December 13, 1996 THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE Colleen DeJong CS851--Information Retrieval December 13, 1996 Table of Contents 1 Introduction.........................................................

More information

easily extended to accommodate additional languages. The multilingual design presented is reusable: most of its components do not depend on DIENST and

easily extended to accommodate additional languages. The multilingual design presented is reusable: most of its components do not depend on DIENST and Multilingual Extensions to DIENST Sarantos Kapidakis Iakovos Mavroidis y Hariklia Tsalapata z April 19, 1999 Abstract Digital libraries enable on-line access of information and provide advanced methods

More information

Block Addressing Indices for Approximate Text Retrieval. University of Chile. Blanco Encalada Santiago - Chile.

Block Addressing Indices for Approximate Text Retrieval. University of Chile. Blanco Encalada Santiago - Chile. Block Addressing Indices for Approximate Text Retrieval Ricardo Baeza-Yates Gonzalo Navarro Department of Computer Science University of Chile Blanco Encalada 212 - Santiago - Chile frbaeza,gnavarrog@dcc.uchile.cl

More information

New Perspectives on Creating Web Pages with HTML. Adding Hypertext Links to a Web Page

New Perspectives on Creating Web Pages with HTML. Adding Hypertext Links to a Web Page New Perspectives on Creating Web Pages with HTML Adding Hypertext Links to a Web Page 1 Objectives Create hypertext links between elements within a Web page Create hypertext links between Web pages Review

More information

KM COLUMN. How to evaluate a content management system. Ask yourself: what are your business goals and needs? JANUARY What this article isn t

KM COLUMN. How to evaluate a content management system. Ask yourself: what are your business goals and needs? JANUARY What this article isn t KM COLUMN JANUARY 2002 How to evaluate a content management system Selecting and implementing a content management system (CMS) will be one of the largest IT projects tackled by many organisations. With

More information

3 Publishing Technique

3 Publishing Technique Publishing Tool 32 3 Publishing Technique As discussed in Chapter 2, annotations can be extracted from audio, text, and visual features. The extraction of text features from the audio layer is the approach

More information

[19] G. Salton and C. Buckley. On the automatic generation of content links in

[19] G. Salton and C. Buckley. On the automatic generation of content links in [18] G. Salton. Automatic Text Processing. Addison-Wesley, USA, 1989. [19] G. Salton and C. Buckley. On the automatic generation of content links in hypertext. Technical Report 89-993, Cornell University,

More information

Obsoletes: 2070, 1980, 1942, 1867, 1866 Category: Informational June 2000

Obsoletes: 2070, 1980, 1942, 1867, 1866 Category: Informational June 2000 Network Working Group Request for Comments: 2854 Obsoletes: 2070, 1980, 1942, 1867, 1866 Category: Informational D. Connolly World Wide Web Consortium (W3C) L. Masinter AT&T June 2000 The text/html Media

More information

Background of HTML and the Internet

Background of HTML and the Internet Background of HTML and the Internet World Wide Web in Plain English http://www.youtube.com/watch?v=akvva2flkbk Structure of the World Wide Web A network is a structure linking computers together for the

More information

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream Agent Roles in Snapshot Assembly Delbert Hart Dept. of Computer Science Washington University in St. Louis St. Louis, MO 63130 hart@cs.wustl.edu Eileen Kraemer Dept. of Computer Science University of Georgia

More information

Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence

Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence Ratko Orlandic Department of Computer Science and Applied Math Illinois Institute of Technology

More information

Spemmet - A Tool for Modeling Software Processes with SPEM

Spemmet - A Tool for Modeling Software Processes with SPEM Spemmet - A Tool for Modeling Software Processes with SPEM Tuomas Mäkilä tuomas.makila@it.utu.fi Antero Järvi antero.jarvi@it.utu.fi Abstract: The software development process has many unique attributes

More information

EBSCOhost Web 6.0. User s Guide EBS 2065

EBSCOhost Web 6.0. User s Guide EBS 2065 EBSCOhost Web 6.0 User s Guide EBS 2065 6/26/2002 2 Table Of Contents Objectives:...4 What is EBSCOhost...5 System Requirements... 5 Choosing Databases to Search...5 Using the Toolbar...6 Using the Utility

More information

A Novel PAT-Tree Approach to Chinese Document Clustering

A Novel PAT-Tree Approach to Chinese Document Clustering A Novel PAT-Tree Approach to Chinese Document Clustering Kenny Kwok, Michael R. Lyu, Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

More information

SCAM: A Copy Detection Mechanism for Digital Documents. Stanford University. fshiva,

SCAM: A Copy Detection Mechanism for Digital Documents. Stanford University. fshiva, SCAM: A Copy Detection Mechanism for Digital Documents Narayanan Shivakumar, Hector Garcia-Molina Department of Computer Science Stanford University Stanford, CA 94305-240 fshiva, hectorg@cs.stanford.edu

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t

2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t Data Reduction - an Adaptation Technique for Mobile Environments A. Heuer, A. Lubinski Computer Science Dept., University of Rostock, Germany Keywords. Reduction. Mobile Database Systems, Data Abstract.

More information

Brouillon d'article pour les Cahiers GUTenberg n?? February 5, xndy A Flexible Indexing System Roger Kehr Institut fur Theoretische Informatik

Brouillon d'article pour les Cahiers GUTenberg n?? February 5, xndy A Flexible Indexing System Roger Kehr Institut fur Theoretische Informatik Brouillon d'article pour les Cahiers GUTenberg n?? February 5, 1998 1 xndy A Flexible Indexing System Roger Kehr Institut fur Theoretische Informatik Darmstadt University of Technology Wilhelminenstrae

More information

Collaborative Refinery: A Collaborative Information Workspace for the World Wide Web

Collaborative Refinery: A Collaborative Information Workspace for the World Wide Web Collaborative Refinery: A Collaborative Information Workspace for the World Wide Web Technical Report 97-03 David W. McDonald Mark S. Ackerman Department of Information and Computer Science University

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

BUILDING A CONCEPTUAL MODEL OF THE WORLD WIDE WEB FOR VISUALLY IMPAIRED USERS

BUILDING A CONCEPTUAL MODEL OF THE WORLD WIDE WEB FOR VISUALLY IMPAIRED USERS 1 of 7 17/01/2007 10:39 BUILDING A CONCEPTUAL MODEL OF THE WORLD WIDE WEB FOR VISUALLY IMPAIRED USERS Mary Zajicek and Chris Powell School of Computing and Mathematical Sciences Oxford Brookes University,

More information

TagFS Tag Semantics for Hierarchical File Systems

TagFS Tag Semantics for Hierarchical File Systems TagFS Tag Semantics for Hierarchical File Systems Stephan Bloehdorn, Olaf Görlitz, Simon Schenk, Max Völkel Institute AIFB, University of Karlsruhe, Germany {bloehdorn}@aifb.uni-karlsruhe.de ISWeb, University

More information

Technical Writing. Professional Communications

Technical Writing. Professional Communications Technical Writing Professional Communications Overview Plan the document Write a draft Have someone review the draft Improve the document based on the review Plan, conduct, and evaluate a usability test

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Automatic Bangla Corpus Creation

Automatic Bangla Corpus Creation Automatic Bangla Corpus Creation Asif Iqbal Sarkar, Dewan Shahriar Hossain Pavel and Mumit Khan BRAC University, Dhaka, Bangladesh asif@bracuniversity.net, pavel@bracuniversity.net, mumit@bracuniversity.net

More information

World Wide Web has specific challenges and opportunities

World Wide Web has specific challenges and opportunities 6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has

More information

Implementing Web Content

Implementing Web Content Implementing Web Content Tonia M. Bartz Dr. David Robins Individual Investigation SLIS Site Redesign 6 August 2006 Appealing Web Content When writing content for a web site, it is best to think of it more

More information

iscreen Usability INTRODUCTION

iscreen Usability INTRODUCTION INTRODUCTION Context and motivation The College of IST recently installed an interactive kiosk called iscreen, designed to serve as an information resource for student/visitors to the College of IST. The

More information

Lisa Biagini & Eugenio Picchi, Istituto di Linguistica CNR, Pisa

Lisa Biagini & Eugenio Picchi, Istituto di Linguistica CNR, Pisa Lisa Biagini & Eugenio Picchi, Istituto di Linguistica CNR, Pisa Computazionale, INTERNET and DBT Abstract The advent of Internet has had enormous impact on working patterns and development in many scientific

More information