[8] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, 35(12), December.
[9] Christopher Fox. A stop list for general text. SIGIR Forum, 24(1/2):19.
[10] Gerard Salton. A blueprint for automatic indexing. SIGIR Forum, 16(2):22-38, Fall.
[11] Charles Goldfarb. The SGML Handbook. Oxford University Press.
[12] Donna Harman. A failure analysis on the limitations of suffixing in an online environment. In SIGIR '87, page 102.
[13] Ed Krol. The Whole Internet User's Guide & Catalog. O'Reilly & Associates, Inc.
[14] Larry Wall and Randal L. Schwartz. Programming Perl. O'Reilly & Associates, Inc.
[15] M. Horton and R. Adams. Standard for Interchange of USENET Messages. RFC 1036, December. Available from URL ftp://nis.nsf.net/documents/rfc/rfc1036.txt.
[16] T. W. Malone, K. R. Grant, F. A. Turbak, S. A. Brobst, and M. D. Cohen. Intelligent information-sharing systems. Communications of the ACM, 30(5), May.
[17] M. McCahill. The Internet Gopher protocol: A distributed server information system. In Connexions - The Interoperability Report, volume 6, no. 7, pages 10-14, July.
[18] National Center for Supercomputing Applications. NCSA Mosaic for X Documentation. Available from URL.
[19] newsstats@uunet.uu.net. Total traffic through uunet for the last 2 weeks. USENET newsgroup news.lists, Sept. 11, 1994. URL news:350907i$fji@kong.uu.net.
[20] G. Salton and C. Buckley. Global text matching for information retrieval. Science, 253:1012-1015.
[21] Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[22] T. Berners-Lee et al. World-Wide Web: The information universe. In Electronic Networking: Research, Applications, and Policy, volume 1, no. 2, pages 52-58, Spring.
[23] Tak W. Yan and Hector Garcia-Molina. SIFT - A tool for wide-area information dissemination. Available from URL.
[24] Larry Wall. Manual page for PERL - Practical Extraction and Report Language.

SURF is also currently based on a single-user, stand-alone model, resulting in inefficient use of network resources: articles may be retrieved from the NNTP server multiple times, once for each user. We are currently developing a version of SURF based on a client-server architecture, which would allow us to optimize access to the news server by retrieving each article only once for multiple users. With the addition of client software providing a user interface, the prototype will also be much easier to use.

In addition to a more complete user interface, we are investigating the addition of an e-mail request interface to the server-based version of SURF. This would allow the large number of Internet users who have only basic Internet access (e-mail and FTP capability) to use SURF as well. Users would forward their profiles to the server through the e-mail interface, and receive the results via FTP. Any one of the increasing number of commercially available SGML viewers could be used to view the results.

SURF could also benefit from a mechanism for relevance feedback [21]. This method has been shown to improve retrieval effectiveness by allowing the user to specify which documents are of interest, and which are not. Terms from the relevant documents are added to the query vector, and terms from the non-relevant documents are removed. This has the effect of "moving" the reformulated vector closer (in vector space) to that of relevant documents, providing more accurate retrieval results.

6 Conclusion

This paper has described SURF, a prototype Netnews information filter, which uses well-known information retrieval techniques and disseminates information as a set of hypertext-linked documents. In particular, we described the techniques used to organize the filtered information, and methods by which the filtering process can be made to produce better results. Through extensive experience with the prototype, we have found SURF to complement the current suite of Internet resource discovery tools. Though still in the prototype stage, it has already become an important facility for a number of group members. Current uses range from locating information for the maintenance of Internet mailing lists, to monitoring the performance of professional sports teams. With the addition of a number of usability enhancements, including a client-side user interface, we plan to release SURF to the general Internet community.

References

[1] A. Emtage and P. Deutsch. Archie: An electronic directory service for the Internet. In Proc. Winter 1992 Usenix Conf., pages 93-110, Sunset Beach, Calif. Usenix.
[2] Nicholas J. Belkin and Bruce Croft. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12), December.
[3] C. Mic Bowman, Peter B. Danzig, Udi Manber, and Michael F. Schwartz. Scalable Internet resource discovery: Research problems and approaches. Communications of the ACM, 37(8), August.
[4] Brian Kantor and Phil Lapsley. Network News Transfer Protocol. RFC 977, February. Available from URL ftp://nis.nsf.net/documents/rfc/rfc0977.txt.
[5] CERN. UR* and The Names and Addresses of WWW Objects. Available from URL.
[6] E. M. Housman and E. D. Kaskela. State of the art in selective dissemination of information. IEEE Transactions on Engineering Writing and Speech, EWS-13(2):78-83, Sept.
[7] Eric W. Mackie. Waterloo Text Database, System Overview. Technical Report 94-12, University of Waterloo, Waterloo, Ontario, March.

expressions specified in the query vector. A relevance value is then computed for each article by measuring the similarity between the query and article vectors using the cosine similarity measure (Section 4.2.3). If the relevance value is greater than the user-specified threshold, an index page is generated, and the article is added to the HTML table of contents. After all of the vectors have been processed, SURF waits a specified amount of time, and then resumes processing.

The current prototype is written entirely in PERL [24], which provides powerful facilities for reading, manipulating, and writing large amounts of information. PERL also offers an extensive pattern matching capability, recognizing most of the common syntactic notations for regular expressions (e.g., a superset of grep, sed, and awk) [14]. The primary building block for document vectors is PERL's associative array mechanism. As their underlying implementation is based on hash tables, associative arrays provide an efficient means of adding, removing, or updating terms and their associated weights.

4.5.1 Configuration

By modifying settings in a configuration file, users may individually tailor most aspects of the filtering process to their specifications. The following is a list of some of the more useful customizations:

- Removal of text quotations prior to filtering. As previously discussed, this may or may not be necessary depending on the newsgroup. Users may wish to experiment with this setting to determine what impact it has on filtering effectiveness.
- Relevance threshold. The user may specify a minimum relevance value for an article to be included in the results.
- Local frequency threshold. To help further limit the amount of information returned, the user may specify a local term frequency threshold: the minimum number of times each expression must be matched within the article.
- Conversion of articles to HTML. Most Netnews articles have a short lifespan, usually ranging from one week to one month. The user can direct SURF to store a local copy of the article (in HTML) so that it will always be accessible. If this feature is turned off, the article can still be accessed from the results (provided it has not expired), as SURF replaces the link to the local copy with the URL of the actual article.
- The time delay between successive runs.
- An additional, or replacement, term stop list. A stop list of common English-language terms is provided; however, an additional list may be specified to eliminate unwanted topic-specific terms, or to supply common terms from another language.

5 Further Work

The current SURF prototype relies heavily on a user's familiarity with Netnews to direct it to the appropriate newsgroups. While this is usually not a problem for seasoned users, Netnews novices will likely derive little benefit from SURF. We are currently investigating techniques to make our prototype easier for inexperienced users to use. The most promising method involves transforming the Internet FAQs (lists of "frequently asked questions", with answers, posted to the newsgroups at regular intervals) into their VSM equivalent, and computing their degree of similarity with the user query. Because most FAQs point back to the newsgroup they originated from, the vectors with the greatest similarity to the query will yield a list of the most promising newsgroup starting points.
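To make the FAQ-matching idea concrete, the following is a minimal sketch in PERL, the prototype's implementation language. The newsgroup names, FAQ vectors, and weights are invented for illustration; the paper does not specify these data structures beyond the use of associative arrays.

    use strict;
    use warnings;

    # Sketch of the proposed FAQ matching: FAQs reduced to term => weight
    # hashes (made-up weights), ranked by their similarity to the user query.
    my %query = ( climbing => 0.7, gear => 0.7 );
    my %faq_vector = (
        'rec.climbing'    => { climbing => 0.9, rope => 0.4 },
        'rec.backcountry' => { hiking   => 0.8, gear => 0.5 },
    );

    # Inner product over the query's terms; missing terms count as zero.
    sub sim {
        my ($doc, $qry) = @_;
        my $s = 0;
        $s += ($doc->{$_} // 0) * $qry->{$_} for keys %$qry;
        return $s;
    }

    # Newsgroups whose FAQ vectors score highest are suggested first.
    for my $group (sort { sim($faq_vector{$b}, \%query)
                      <=> sim($faq_vector{$a}, \%query) } keys %faq_vector) {
        printf "%s (%.2f)\n", $group, sim($faq_vector{$group}, \%query);
    }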

embedded cross-references, neither of which provides an accurate indication of article content. Hence, in the study some non-relevant articles may have been incorrectly assumed to be relevant.

We have found through empirical observation that in many cases relevant, "high-quality" articles tend to generate many responses of lesser relevance. This increases the likelihood that inclusion of quoted text in further processing will contribute to "diluted" retrieval effectiveness by propagating relevant terms throughout many followup articles, resulting in unrepresentative relevance values. A more reliable relevance assessment based solely on current content is possible if quotations are removed from further consideration. In this way, textual content contributes only to the analysis of the article it originally appears in, and has no additional impact on future articles.

The actual impact of quotation inclusion on retrieval performance depends to a large extent on the individual newsgroup. Although there has been no formal measurement of the percentage of article content composed of quoted text, experience shows that it varies from newsgroup to newsgroup. Individual Netnews users are therefore probably the best judges of whether quoted text affects filtering effectiveness, either positively or negatively. By default, SURF removes quoted text from articles before they are processed; however, this behavior can be changed through a configuration setting.

Along with quotations, URLs and embedded HTML anchors are also removed from the article body. Although URLs provide the locations of other relevant resources, their filename-like syntax does not provide an adequate conceptual description of the resources they point to. Consequently, they are removed from the text of the article, but retained for the HTML index page in case the article is later determined to be relevant. With the advent and increasing popularity of Netnews-capable WWW clients, many Netnews users have begun embedding HTML mark-up tags directly into articles. Typically these tags represent anchors, or hypertext links, to the user's personal WWW "homepage". When viewed using an HTML browser, the anchors appear as active hypertext links. Like URLs, anchors provide superfluous information and contribute no additional content discrimination power.

In addition to an unstructured body, Netnews articles have a number of predefined header fields which do provide additional content information. The subject field is typically a one-line description of the article. Two optional fields, keywords and summary, provide a list of the important article keywords and a short summarization, respectively. The text from these three fields is concatenated with the main body before further processing occurs.

After all of the previous text transformations have been applied, the remaining text is compared with the regular expressions provided in the user profile. If any of the query expressions match portions of the article, we transform it into an initial vector representation (with local term frequencies) as follows:

1. Each matching regular expression is added to the document vector, along with its local weighting (the number of times it matches within the article).
2. The article text matching the regular expressions is removed, so that individual terms in the matched portion (an expression may match large text structures) will not be included in further processing.
3. The remainder of the article is decomposed into individual terms (whitespace-delimited strings of characters stripped of leading and trailing punctuation), which are compared against a stop list of high-frequency, common function words (and, the, but, their, etc.) [9]. Article terms which appear in the stop list are removed.
4. The remaining terms are added to the document vector along with their occurrence frequencies.
5. The vector is retained in a disk file for later use.

When the supply of new articles is exhausted, we compute term weights for each saved document vector. Combined local and global weights (tf·idf) are assigned to each term, using the current count of articles read up to this point as the value for N (Section 4.2.2). Next, idf weights are computed for the regular
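As a minimal sketch of the body transformations and of steps 3 and 4 above, in PERL: the stop list is abbreviated, the sample article is invented, and the patterns are simplifications of what real Netnews quoting and HTML require.

    use strict;
    use warnings;

    # Abbreviated stop list of common function words.
    my %stop = map { $_ => 1 } qw(and the but their a an of to in is);

    sub clean_body {
        my ($body) = @_;
        $body =~ s/^\s*>.*$//mg;                       # quoted lines (">" prefix)
        $body =~ s{</?[Aa](?:\s[^>]*)?>}{}g;           # embedded HTML anchors
        $body =~ s{\b(?:ftp|http|gopher|news):\S+}{}g; # URLs
        return $body;
    }

    sub add_terms {
        my ($vector, $text) = @_;
        for my $term (split ' ', lc $text) {
            $term =~ s/^[[:punct:]]+//;   # strip leading punctuation
            $term =~ s/[[:punct:]]+$//;   # strip trailing punctuation
            next if $term eq '' or $stop{$term};
            $vector->{$term}++;           # local term frequency
        }
    }

    my $article = '> Anyone know a good climbing archive?' . "\n"
                . 'Try ftp://ftp.example.org/pub/climbing -- rappel guides and' . "\n"
                . '<a href="http://www.example.org/~me/">my homepage</a> too.' . "\n";

    my %vector;
    add_terms(\%vector, clean_body($article));
    print "$_ ($vector{$_})\n" for sort keys %vector;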

followed by the name of the information archive. Since it is unlikely that an article without an announcement will contain this pattern, specifying it as a query term will result in the retrieval of only those articles which announce new information archives. Finally, when used without any of their special syntax, regular expressions are equivalent to single terms or phrases. This is an important consideration for users who are unfamiliar with, or unwilling to learn, the additional complexity of a new syntax.

4.4 Stemming

Stemming, or suffix stripping, refers to the process of automatically reducing related terms to their common word stems [21]. With this technique, words such as stemmed, stemming, and stemless are all reduced to the common root, stem. While stemming can help limit the length of document and query vectors, the technique is not well suited to a global environment such as Netnews, and may actually be redundant when used in conjunction with regular-expression-based queries.

Traditionally, suffix strippers have dealt exclusively with the terms and word stems of the English language. Typical IR domains include research papers, abstracts, or library catalogs - all predominantly in English. However, with growing Internet participation from countries outside of North America, it is becoming increasingly important to consider other languages (many newsgroups already carry on discussion in languages such as French, Spanish, and Portuguese). A multilingual information filtering service must therefore either determine which language a document is written in and apply a language-specific stemming routine (requiring an online dictionary for each language), or make no language assumptions at all. Furthermore, Harman [12] has shown that stemming has only a minimal influence on the effectiveness of a retrieval system.

SURF does not automatically stem words; instead, it relies upon the user to recognize and express any common word stems, if so desired. For example, a user searching for analysis, analyzer, or analytic can specify the expression analy.*. Although this assumes at least a passing familiarity with regular expressions, it has the advantage of being both language-neutral and less computationally intensive than an automatic suffix stripping routine.

4.5 Implementation

SURF begins by establishing a client-level connection to a Netnews server [4], and retrieving new (unprocessed) articles. With the exception of well-defined fields in the header, Netnews articles have no readily exploitable structure (the contents are entirely free-form text) [15]. Users frequently augment articles with text devices intended to aid readability or supply additional information, such as ASCII-text diagrams, signatures (author name and information), or references to related documents. While it may prove beneficial to human readers, some of this additional information may degrade filtering effectiveness, and should be removed prior to analysis. As SURF retrieves new articles, three text transformations are performed on the body of the article:

- text quotations are removed
- Uniform Resource Locators are removed
- any embedded HTML anchors are removed

It has become standard Netnews practice, when replying to an article, to include portions of that article to provide a context for the response. A measurement of the impact of text quotations on retrieval effectiveness was made by Salton et al. [20]. Working with a collection of 1984 articles, they found that for a given level of recall, removing quotations from the articles caused a decrease in the average precision of the system. However, as stressed by the authors, article relevance was determined based on subject line comparisons and
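A minimal PERL sketch of the user-side stemming from Section 4.4 above; the word list is illustrative, and the pattern is the one given in the text with a leading word boundary added.

    use strict;
    use warnings;

    # The user-supplied expression stands in for a family of related terms,
    # with no language assumptions and no suffix-stripping dictionary.
    my $analy = qr/\banaly.*/i;

    for my $word (qw(analysis analyzer analytic catalog)) {
        printf "%-10s %s\n", $word, $word =~ $analy ? "matches" : "no match";
    }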

Figure 4: HTML Netnews article

4.3 Model Enhancements

The SURF system differs from other information retrieval systems by generalizing the concept of terms to include regular expressions. Regular expressions have a decided advantage over single terms or phrases in allowing considerably more expressive query formulations. Allowing more flexibility in the specification of a query can potentially improve the overall effectiveness of the system by increasing both recall, which is the proportion of relevant documents that are retrieved, and precision, which is the proportion of retrieved documents that are actually relevant. This is an especially important consideration in a "noisy" environment like Netnews, where grammatical errors, misspellings, and non-standard term usage are commonplace.

To help illustrate the potential impact of spelling errors and non-standard term usage upon retrieval effectiveness, we manually examined 843 articles from an active newsgroup, looking for incorrect usage of a common domain-specific term. For our experiment we chose rec.climbing (articles dated Aug. 31 through Sept. 27) - a newsgroup for the discussion of rock climbing, mountaineering, and hiking. We found that out of 62 uses of the word rappel or its derivatives, only 35 were spelled correctly, while 14 occurrences were misspelled, and 13 represented colloquial or slang usage. By not taking misspellings or slang into account, only 26 of the 39 relevant articles would be retrieved (66%). Since all of the misspellings were the result of one too few p's, or one too many l's, a single regular expression accounting for this fact was able to match all of the correctly and incorrectly spelled terms. This increased the number of retrieved articles to 33 (85%). Furthermore, one additional regular expression was sufficient to represent all of the slang terms, and retrieve all 39 articles.

While using regular expressions to account for misspellings may help improve system recall, they can also be used to increase precision, by matching larger, more descriptive text structures such as sentences or paragraphs. For example, articles announcing the availability of Internet resources typically contain a text line of the form: Archive-name:
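The paper does not give the expressions themselves; the following PERL sketch shows plausible reconstructions of the two kinds discussed above - one absorbing the observed misspellings, one matching a larger announcement structure.

    use strict;
    use warnings;

    # One pattern covers the correct spelling plus "one too few p's" and
    # "one too many l's" (rappel, rapel, rappell, rapell, ...).
    my $rappel = qr/\brap{1,2}el{1,2}/i;
    print "$_ matches\n" for grep { /$rappel/ } qw(rappel rapel rappell rappeling);

    # A whole-line pattern that only announcement articles are likely to
    # contain, trading some recall for precision.
    my $announce = qr/^Archive-name:\s*\S+/m;
    print "announcement found\n" if "Archive-name: climbing/faq\n" =~ $announce;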

Figure 3: SURF Index Page

chance of being retrieved. For an article with k terms, the weight for a term i is computed as:

    w_i = \frac{tf_i \cdot idf_i}{\sqrt{\sum_{j=1}^{k} tf_j^2 \cdot idf_j^2}}

The user-specified query terms are assigned normalized idf weights [10].

4.2.3 Similarity Measure

We determine the relevance of an article by computing the degree of similarity between the query and the article. Since both are represented by vectors, we use the cosine measure of similarity, defined to be the inner product of the vectors normalized for length. With normalized article and query weights (as in the prototype), the similarity between an article A and a query Q is computed as:

    sim(A, Q) = \sum_{i=1}^{k} a_i q_i

where a_i and q_i are the ith corresponding terms of A and Q, respectively. This measure yields the cosine of the angle between the article and the query vector: a value of 0 is computed when the vectors have no terms in common, and 1 when the vectors are identical.
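A minimal PERL sketch of both computations above, operating on hashes of term => weight as the prototype does; the frequencies and document counts are invented for illustration.

    use strict;
    use warnings;
    use List::Util qw(sum);

    sub weight {
        my ($tf, $df, $N) = @_;   # local freqs, document freqs, articles seen
        my %w;
        $w{$_} = $tf->{$_} * log($N / $df->{$_}) for keys %$tf;
        my $len = sqrt(sum(map { $_ ** 2 } values %w) || 1);
        $w{$_} /= $len for keys %w;   # normalize for document length
        return \%w;
    }

    sub cosine_sim {                  # inner product of unit-length vectors
        my ($art, $qry) = @_;
        return sum(0, map { ($art->{$_} // 0) * $qry->{$_} } keys %$qry);
    }

    my $article = weight({ internet => 3, access => 2 },
                         { internet => 40, access => 25 }, 1000);
    my $query   = weight({ internet => 1, access => 1, providers => 1 },
                         { internet => 40, access => 25, providers => 10 }, 1000);
    printf "sim(A, Q) = %.2f\n", cosine_sim($article, $query);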

Figure 2: SURF Table of Contents

which associate extra information with each term - typically, an "importance" ranking or frequency count. Within the information retrieval literature, these alternate methods are referred to as document models. Readers wishing a more thorough treatment of information retrieval and filtering are invited to refer to [2, 8, 21].

4.2.1 Vector Space Model

The model employed by SURF is the well-known Vector Space Model (VSM). In the VSM, both documents and queries are represented as vectors of terms with associated weights. For example, a document D consisting of k distinct terms is represented as the k-dimensional vector D = (w_1, w_2, ..., w_k), where w_i is the weight assigned to term i. Terms not in D are implicitly weighted with the value zero. Besides providing a parallel representation for both documents and queries, the VSM readily supports a number of similarity measurements. This provides the capability to rank filtered documents in decreasing order of similarity to the query vector.

4.2.2 Term Weights

Weights for the individual article terms are computed based on a combination of local and global term frequencies. The local term frequency (tf_i) of a term i is defined to be the number of times term i appears in the article. The global quantity, or inverse document frequency (idf_i), of a term i is defined as \log(N/n_i), where N is the total number of articles processed, and n_i is the number of articles containing term i. The inverse document frequency is intended to provide a higher ranking for terms that appear few times throughout the entire collection, i.e., terms that are better able to distinguish individual documents from the rest of the collection. It is also important to normalize the weights for document length so that short documents have an equal
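As a worked instance of these definitions, with invented counts (and taking log as the natural logarithm; the paper does not fix the base): a term occurring tf_i = 3 times in an article and appearing in n_i = 10 of N = 1000 processed articles receives

    idf_i = \log(1000/10) = \log 100 \approx 4.61, \qquad tf_i \cdot idf_i \approx 13.8

while a term appearing in 500 of the same 1000 articles receives idf_i = \log 2 \approx 0.69, reflecting its weaker power to discriminate individual documents from the rest of the collection.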

Figure 1: The SURF architecture

The results of the filter can then be browsed by invoking an HTML browser (in this example we use Mosaic) on the main "table of contents" file (Figure 2). This document provides a brief summary of the articles which match the user query by listing their corresponding subject lines in relevance order. Each entry in the list has two hypertext links to further information (underlined text in Figure 2). If the subject of the article appears interesting, the user can obtain more detailed information by clicking on its associated relevance value. This selects a hypertext link to an "index page" (Figure 3). If further information is not required, the user may proceed directly to the article itself (Figure 4) by clicking on the subject line.

The index pages contain information regarding the relevance measure computed for the article. As not all of the query terms need appear in every relevant article, a list of the terms present, along with their occurrence frequencies within the article (in parentheses), is included. To provide an additional indication of content, the index page also includes the set of terms found in the article, ranked by their importance. The top-ranked terms shown in Figure 3, for example, indicate that the article contains information about a directory of access providers. During the filtering process, any URLs located within the article are syntactically recognized, converted to active WWW hypertext links, and listed under the Resource Reference heading. If a WWW client is used for browsing, the resources indicated by the URLs can be viewed simply by selecting the links.

Finally, the original article may be viewed by selecting the "view article" link from the index page. This takes the user to an HTML version of the article (Figure 4). Query terms are tagged so that they appear in a different font from the rest of the article (in this case, boldface). As well, all URLs in the body of the article are converted to active WWW links. A user is therefore free to browse the referenced resources as the article is scanned.

The following sections discuss in greater detail the methods and models used in the implementation of the prototype. We also present details of some of the enhancements necessary to provide more effective information filtering in an environment like Netnews.

4.2 Document Model

Most information retrieval and filtering research focuses on text-based documents (throughout the remainder of the paper we will use the terms articles and documents synonymously), which in essence are simply ordered sequences of terms (usually words). Alternative methods of representing documents exist
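Returning to the article conversion described above: a minimal PERL sketch of the two tagging passes. The body and term list are illustrative, and a real implementation must keep the two passes from mangling each other (e.g., a query term occurring inside a URL).

    use strict;
    use warnings;

    # Tag query terms so they render in a distinct font (here, boldface),
    # then convert URLs in the body into active hypertext links.
    sub to_html {
        my ($body, @terms) = @_;
        for my $t (@terms) {
            $body =~ s{\b(\Q$t\E)\b}{<b>$1</b>}gi;
        }
        $body =~ s{\b((?:ftp|http|gopher|news):\S+)}{<a href="$1">$1</a>}g;
        return $body;
    }

    print to_html("New provider directory at ftp://ftp.example.org/pub/list\n",
                  "provider");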

3.1 HyperText Markup Language

Rather than developing a new mark-up tag scheme for filtered information, we make use of the popular HyperText Markup Language, or HTML [22]. Most Internet users are already familiar with HTML, an SGML-defined tag set for describing the structure of WWW documents. Besides a rich set of simple structural elements, HTML defines tags to embed "anchors", or hypertext links, into documents. Using HTML as the markup tag set means that:

- most Internet users will be able to take full advantage of the many widely available, high-quality WWW browsers (called clients) to view the results;
- the search results can be incorporated directly into a WWW server.

As momentum behind the WWW and HTML browsers has grown, it has become standard Netnews practice to publish the locations of new resources in a "universal" format. Uniform Resource Locators, or URLs, are strings which uniquely encode the network address of a given resource [5]. (URLs are defined by the Internet Engineering Task Force (IETF); the WWW defines a similar standard, known as a Universal Resource Identifier, or URI. As URL is the more common usage, we use it to avoid confusion.) For example, the URL of this paper is: ftp://csg.uwaterloo.ca/pub/kjl/surf.ps. WWW browsers integrate anchors and URLs to provide hyperlink access to information anywhere on the Internet.

While URLs provide a short, unique address for a given resource, the flood of interest in the Internet has spawned the generation of myriad new resource addresses. Extracting, indexing, and following up on URLs published on Netnews has become a tedious, time-consuming endeavor. As part of the filtering process, we extract and convert any URLs found in the text of relevant articles to active WWW hypertext links, and present them both in summary-list form and in situ (in an HTML version of the article). Provided that a WWW browser (such as the National Center for Supercomputing Applications' (NCSA) Mosaic [18]) is used to view the filtered information, any resources announced in the article can be examined simply by selecting their hypertext links.

4 The SURF System

The primary interface to the SURF system is the user profile. It is through the profile that the user specifies one or more "searches", representing any type of interest, ranging from long-term to "one-shot". Each search is identified by three elements: a directory in which the results are to be placed, a list of newsgroups, and a collection of query terms. Figure 1 presents an overview of the SURF architecture. At fixed (user-specified) intervals, SURF queries the Netnews server for new articles from the list of newsgroups specified in the profile. The articles are then compared against the query terms, and relevance rankings are computed. Finally, the results of the search are converted into a collection of hypertext-linked HTML documents which the user can browse.

4.1 Example

We illustrate the use of SURF through an example session. Suppose a user currently looks for information regarding Internet access providers by monitoring the comp.infosystems.announce newsgroup. He first creates a profile with the search terms "internet", "access", and "providers", the name of the newsgroup, and the location for SURF to place the results (a directory within a local file system). SURF is then started from the command line (it currently runs only in stand-alone mode; future versions will be implemented as a server) and runs until all new (unseen) articles from the newsgroup have been processed.
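The paper does not show the profile syntax itself, so the following PERL sketch renders the example search as a data structure with the three required elements; the paths and names are purely illustrative.

    use strict;
    use warnings;

    # One "search": where to put results, which newsgroups to scan, and the
    # query terms to match. A profile may contain any number of these.
    my @profile = (
        {
            directory  => "$ENV{HOME}/surf/providers",
            newsgroups => [ "comp.infosystems.announce" ],
            terms      => [ "internet", "access", "providers" ],
        },
    );

    for my $search (@profile) {
        printf "%s: %s\n", $search->{directory},
               join(", ", @{ $search->{terms} });
    }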

2 Related Work

The Stanford Information Filtering Tool (SIFT) [23] is a project aimed at providing an efficient wide-area information dissemination service. One application of the SIFT filtering engine is the dissemination of Netnews. Although from the users' perspective our prototype and SIFT provide similar functionality, namely a fine-grained filter for Netnews, the two approaches differ in both intent and scale. SIFT addresses the efficiency aspects of filtering large amounts of information for a large number of users. While capable of gathering and disseminating the bulk of Netnews, it routes articles to users via an e-mail interface, and does not address any information organization issues.

The Information Lens [16] is a system intended to increase the amount of useful information exchanged within a group by providing intelligent information filtering and organization. Users of the system specify a set of rules to filter or classify incoming messages automatically. While it provides a powerful mechanism for filtering shared information, it is based exclusively on a set of semi-structured message types, which limits its applicability to unstructured text domains such as Netnews.

3 Descriptive Markup

Descriptive markup is a technique developed to annotate plain-text documents with structural information through the use of special tags. For instance, markup tags might be used to denote which parts of a text document are chapter headings, and which are paragraphs. Structural information might also be used to place "links" between related portions of text. Since different document types exhibit different structures, each requires its own set of markup tags. The Standard Generalized Markup Language (SGML) [11] is a "metalanguage" used to define tag sets for different document types. Through a number of portability and standardization efforts (ISO 8879), use of SGML is growing in popularity.

Although intended primarily for electronic publishing and data interchange, descriptive markup techniques provide the means to organize information so it can be easily browsed and located when needed. In an information dissemination environment where potentially large amounts of data may be returned, it is important that users be able to determine quickly what information is, or is not, relevant, without being forced to read each article. This is an especially important concern in an active, "noisy" domain like Netnews, where high volume and low quality make it difficult to filter out all irrelevant information.

Applying markup techniques to filtered information makes it possible to exploit existing SGML browser technology. The hypertext capability provided by most SGML-based browsers offers an efficient mechanism for traversing document collections. Placing links between related information lets users quickly navigate to other documents in the collection. Hypertext capability permits information to be organized for browsing in ways that would otherwise be difficult to achieve. For example, to prevent inundating users with too much detailed information at once, a "summary" of the filter results could be shown first, containing links to progressively more detailed views. Ideally, an early indication of relevance could be provided, so that interesting information could be viewed in more detail while non-relevant information is ignored.

While SGML browsers offer enhanced information-viewing capability, SGML text database research can be applied to the problem of storing the disseminated information efficiently. By extracting embedded structural information, SGML-based text databases partition document text into database records to improve storage and retrieval efficiency [7]. In addition to the efficiency aspects, the database acts as a centralized repository, making it easier to locate needed information once it has been discovered. A further benefit is that the full range of search and extraction facilities offered by the database can be used to locate and view even more focused subsets of information. This is particularly useful if a user elects to filter a broad range of information.

Most current discovery tools also typically employ a passive approach to locating new information, searching only when initiated by the user. A preferred approach is to have the tools selectively route new information to users based upon a description of their interests (e.g., a user profile). This type of active search mechanism, known as information filtering [2] or selective dissemination of information [6], is usually based upon well-known information retrieval (IR) techniques [21]. An active filtering approach relieves users of the burden of frequent and time-consuming searches through large amounts of data. In a dynamic environment, where information may be removed or expire without notice, an information filter is less likely to overlook important resources than an occasional human-guided search.

USENET News, or Netnews [13], is the Internet's primary mechanism for information exchange, consisting of a collection of electronic bulletin boards, or newsgroups, coarsely organized by topic. With its topical arrangement, Netnews qualifies as a primitive, self-administered type of information filtering service. Users receive information by subscribing to the newsgroups that they find interesting. Netnews provides a medium for, among other things, public discussion, information sharing, and announcements, making it the most comprehensive information repository on the Internet (though not a true repository in the strictest sense: Netnews is more like a moving window, with new information added while aged information expires and is removed). Despite its obvious importance to resource discovery, a number of factors underscore the need for a finer level of filtering granularity within Netnews:

- Typical Netnews users follow a subset of newsgroups based on long-term personal or professional interests, or occasionally search for new resource announcements. Newsgroup organization is currently determined by topic; however, with the wide appeal of Netnews, newsgroups tend to have a much broader focus than most users' interests. Consequently, users will be interested in only a subset of the articles.
- Rising traffic levels have also created problems for users. A bi-weekly measurement of total USENET traffic, as of September 1994, placed the average number of new articles per day at nearly 72,000, for an average volume of over 150 MB per day [19]. The same study also shows these figures to be growing steadily.
- Linked to rising traffic levels, another concern is the amount of "low-quality" news articles. An unfortunate consequence of an unregulated conferencing medium (with the exception of "moderated" newsgroups, where each article must be approved by a moderator before it is made public), without enforced publishing guidelines or review processes, is that a large percentage of traffic is devoid of proper grammar, style, or useful content.

We propose that users of Netnews can benefit from a fine-grained filtering facility. Furthermore, applying descriptive markup techniques to the filtered information as it arrives can reduce organizational problems and facilitate an efficient browsing mechanism. In this paper we describe the Selective USENET Retrieval Facility (SURF), a prototype Netnews information filter. SURF is intended to provide a fine-grained information filter for Netnews using well-known information retrieval models, and to disseminate information as structured, easy-to-browse collections of documents. Rather than attempt to filter all Netnews articles, the majority of which are of no interest, SURF takes a more personalized approach, filtering articles only from a user-selected set of newsgroups. While this approach does require at least some familiarity with the Netnews system, it enables users to control fully the scope of the filtering process and, consequently, to place an upper bound on the amount of information likely to be selected. As a result, our prototype may be more useful to experienced Internet users than to novices. We are currently working on techniques to increase its appeal to novices (see Section 5). Providing this type of personalized information filter allows us to focus on maximizing the benefit of Netnews, rather than on the large-scale indexing techniques necessary to filter efficiently tens of thousands of new articles every day. We focus on the techniques employed in the organization of the filtered information, and on methods of increasing filtering effectiveness.

SURF - An Information Filtering Facility for USENET News

Kurt Lichtner and D.D. Cowan
Computer Science Department & Computer Systems Group
University of Waterloo
Waterloo, Ont. N2L 3G1

Abstract

Sustained growth in the Internet has led to a dramatic increase in the amount of electronic information available to users. Tools designed to help users discover new information are the key factors in deriving the maximum benefit from a large-scale information sharing system. However, many tools currently available lack the features necessary for coping with a burgeoning volume of information, such as selectively filtering interesting information and automatically organizing it for efficient browsing. In this paper we describe the Selective USENET Retrieval Facility (SURF), a prototype Netnews information filter. We discuss methods of applying descriptive markup techniques to provide a consistent organization for the filtered information. We also describe the information retrieval models used in the filter, as well as domain-specific methods of improving filtering effectiveness.

1 Introduction

Since its initial deployment 25 years ago, the Internet has become an indispensable tool for conducting research, communication, and information (or resource) sharing. The recent explosion in popularity observed on the Internet has dramatically increased the amount and type of information available. Unfortunately, most of this information is of no use to the majority of users [3]. Consequently, network resource discovery has increasingly become a "needle-in-the-haystack" activity.

As more electronic information is made available, users are increasingly faced with information overload. The key to making effective use of a large-scale information system, while remaining shielded from that overload, lies in the set of tools used to locate and manage information. The past several years have witnessed the emergence of a small number of tools directed at locating information; Archie [1], Gopher [17], and the World-Wide Web (WWW) [22] are all examples which have achieved widespread acceptance within the Internet community. Although they have become an integral part of discovering new information on the Internet, these tools have not addressed information management issues, which deal with automatically organizing (indexing) discovered information so that it can be easily found and viewed when needed. Relying on individual users to manage data cataloging and storage details creates an additional discovery problem, one level "above" that of networked resource discovery: it is often time-consuming or difficult to locate previously discovered information within a private collection, especially if it is large or haphazardly organized.

Recent standardization efforts in document portability and research in text databases can be of direct benefit to users faced with an information management problem. Descriptive markup languages allow information to be organized into a standard document structure, while text database browsers provide the means to explore collections of documents quickly and efficiently.


More information

UNIT V SYSTEM SOFTWARE TOOLS

UNIT V SYSTEM SOFTWARE TOOLS 5.1 Text editors UNIT V SYSTEM SOFTWARE TOOLS A text editor is a type of program used for editing plain text files. Text editors are often provided with operating systems or software development packages,

More information

Activity Report at SYSTRAN S.A.

Activity Report at SYSTRAN S.A. Activity Report at SYSTRAN S.A. Pierre Senellart September 2003 September 2004 1 Introduction I present here work I have done as a software engineer with SYSTRAN. SYSTRAN is a leading company in machine

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about:

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about: WWW and Web Browser 6.0 Introduction WWW stands for World Wide Web. WWW is a collection of interlinked hypertext pages on the Internet. Hypertext is text that references some other information that can

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Global Servers. The new masters

Global Servers. The new masters Global Servers The new masters Course so far General OS principles processes, threads, memory management OS support for networking Protocol stacks TCP/IP, Novell Netware Socket programming RPC - (NFS),

More information

CHAPTER-26 Mining Text Databases

CHAPTER-26 Mining Text Databases CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other

More information

A Model and a Visual Query Language for Structured Text. handle structure. language. These indices have been studied in literature and their

A Model and a Visual Query Language for Structured Text. handle structure. language. These indices have been studied in literature and their A Model and a Visual Query Language for Structured Text Ricardo Baeza-Yates Gonzalo Navarro Depto. de Ciencias de la Computacion, Universidad de Chile frbaeza,gnavarrog@dcc.uchile.cl Jesus Vegas Pablo

More information

Text Mining. Representation of Text Documents

Text Mining. Representation of Text Documents Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,

More information

Recognition. Clark F. Olson. Cornell University. work on separate feature sets can be performed in

Recognition. Clark F. Olson. Cornell University. work on separate feature sets can be performed in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 907-912, 1996. Connectionist Networks for Feature Indexing and Object Recognition Clark F. Olson Department of Computer

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

Developing a Basic Web Page

Developing a Basic Web Page Developing a Basic Web Page Creating a Web Page for Stephen Dubé s Chemistry Classes 1 Objectives Review the history of the Web, the Internet, and HTML Describe different HTML standards and specifications

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

For our sample application we have realized a wrapper WWWSEARCH which is able to retrieve HTML-pages from a web server and extract pieces of informati

For our sample application we have realized a wrapper WWWSEARCH which is able to retrieve HTML-pages from a web server and extract pieces of informati Meta Web Search with KOMET Jacques Calmet and Peter Kullmann Institut fur Algorithmen und Kognitive Systeme (IAKS) Fakultat fur Informatik, Universitat Karlsruhe Am Fasanengarten 5, D-76131 Karlsruhe,

More information

Chapter 1 Introduction to HTML, XHTML, and CSS

Chapter 1 Introduction to HTML, XHTML, and CSS Chapter 1 Introduction to HTML, XHTML, and CSS MULTIPLE CHOICE 1. The world s largest network is. a. the Internet c. Newsnet b. the World Wide Web d. both A and B A PTS: 1 REF: HTML 2 2. ISPs utilize data

More information

\Classical" RSVP and IP over ATM. Steven Berson. April 10, Abstract

\Classical RSVP and IP over ATM. Steven Berson. April 10, Abstract \Classical" RSVP and IP over ATM Steven Berson USC Information Sciences Institute April 10, 1996 Abstract Integrated Services in the Internet is rapidly becoming a reality. Meanwhile, ATM technology is

More information

R&D White Paper WHP 018. The DVB MHP Internet Access profile. Research & Development BRITISH BROADCASTING CORPORATION. January J.C.

R&D White Paper WHP 018. The DVB MHP Internet Access profile. Research & Development BRITISH BROADCASTING CORPORATION. January J.C. R&D White Paper WHP 018 January 2002 The DVB MHP Internet Access profile J.C. Newell Research & Development BRITISH BROADCASTING CORPORATION BBC Research & Development White Paper WHP 018 Title J.C. Newell

More information

THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE. Colleen DeJong CS851--Information Retrieval December 13, 1996

THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE. Colleen DeJong CS851--Information Retrieval December 13, 1996 THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE Colleen DeJong CS851--Information Retrieval December 13, 1996 Table of Contents 1 Introduction.........................................................

More information

easily extended to accommodate additional languages. The multilingual design presented is reusable: most of its components do not depend on DIENST and

easily extended to accommodate additional languages. The multilingual design presented is reusable: most of its components do not depend on DIENST and Multilingual Extensions to DIENST Sarantos Kapidakis Iakovos Mavroidis y Hariklia Tsalapata z April 19, 1999 Abstract Digital libraries enable on-line access of information and provide advanced methods

More information

Block Addressing Indices for Approximate Text Retrieval. University of Chile. Blanco Encalada Santiago - Chile.

Block Addressing Indices for Approximate Text Retrieval. University of Chile. Blanco Encalada Santiago - Chile. Block Addressing Indices for Approximate Text Retrieval Ricardo Baeza-Yates Gonzalo Navarro Department of Computer Science University of Chile Blanco Encalada 212 - Santiago - Chile frbaeza,gnavarrog@dcc.uchile.cl

More information

New Perspectives on Creating Web Pages with HTML. Adding Hypertext Links to a Web Page

New Perspectives on Creating Web Pages with HTML. Adding Hypertext Links to a Web Page New Perspectives on Creating Web Pages with HTML Adding Hypertext Links to a Web Page 1 Objectives Create hypertext links between elements within a Web page Create hypertext links between Web pages Review

More information

KM COLUMN. How to evaluate a content management system. Ask yourself: what are your business goals and needs? JANUARY What this article isn t

KM COLUMN. How to evaluate a content management system. Ask yourself: what are your business goals and needs? JANUARY What this article isn t KM COLUMN JANUARY 2002 How to evaluate a content management system Selecting and implementing a content management system (CMS) will be one of the largest IT projects tackled by many organisations. With

More information

3 Publishing Technique

3 Publishing Technique Publishing Tool 32 3 Publishing Technique As discussed in Chapter 2, annotations can be extracted from audio, text, and visual features. The extraction of text features from the audio layer is the approach

More information

[19] G. Salton and C. Buckley. On the automatic generation of content links in

[19] G. Salton and C. Buckley. On the automatic generation of content links in [18] G. Salton. Automatic Text Processing. Addison-Wesley, USA, 1989. [19] G. Salton and C. Buckley. On the automatic generation of content links in hypertext. Technical Report 89-993, Cornell University,

More information

Obsoletes: 2070, 1980, 1942, 1867, 1866 Category: Informational June 2000

Obsoletes: 2070, 1980, 1942, 1867, 1866 Category: Informational June 2000 Network Working Group Request for Comments: 2854 Obsoletes: 2070, 1980, 1942, 1867, 1866 Category: Informational D. Connolly World Wide Web Consortium (W3C) L. Masinter AT&T June 2000 The text/html Media

More information

Background of HTML and the Internet

Background of HTML and the Internet Background of HTML and the Internet World Wide Web in Plain English http://www.youtube.com/watch?v=akvva2flkbk Structure of the World Wide Web A network is a structure linking computers together for the

More information

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream Agent Roles in Snapshot Assembly Delbert Hart Dept. of Computer Science Washington University in St. Louis St. Louis, MO 63130 hart@cs.wustl.edu Eileen Kraemer Dept. of Computer Science University of Georgia

More information

Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence

Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence Ratko Orlandic Department of Computer Science and Applied Math Illinois Institute of Technology

More information

Spemmet - A Tool for Modeling Software Processes with SPEM

Spemmet - A Tool for Modeling Software Processes with SPEM Spemmet - A Tool for Modeling Software Processes with SPEM Tuomas Mäkilä tuomas.makila@it.utu.fi Antero Järvi antero.jarvi@it.utu.fi Abstract: The software development process has many unique attributes

More information

EBSCOhost Web 6.0. User s Guide EBS 2065

EBSCOhost Web 6.0. User s Guide EBS 2065 EBSCOhost Web 6.0 User s Guide EBS 2065 6/26/2002 2 Table Of Contents Objectives:...4 What is EBSCOhost...5 System Requirements... 5 Choosing Databases to Search...5 Using the Toolbar...6 Using the Utility

More information

A Novel PAT-Tree Approach to Chinese Document Clustering

A Novel PAT-Tree Approach to Chinese Document Clustering A Novel PAT-Tree Approach to Chinese Document Clustering Kenny Kwok, Michael R. Lyu, Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

More information

SCAM: A Copy Detection Mechanism for Digital Documents. Stanford University. fshiva,

SCAM: A Copy Detection Mechanism for Digital Documents. Stanford University. fshiva, SCAM: A Copy Detection Mechanism for Digital Documents Narayanan Shivakumar, Hector Garcia-Molina Department of Computer Science Stanford University Stanford, CA 94305-240 fshiva, hectorg@cs.stanford.edu

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t

2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t Data Reduction - an Adaptation Technique for Mobile Environments A. Heuer, A. Lubinski Computer Science Dept., University of Rostock, Germany Keywords. Reduction. Mobile Database Systems, Data Abstract.

More information

Brouillon d'article pour les Cahiers GUTenberg n?? February 5, xndy A Flexible Indexing System Roger Kehr Institut fur Theoretische Informatik

Brouillon d'article pour les Cahiers GUTenberg n?? February 5, xndy A Flexible Indexing System Roger Kehr Institut fur Theoretische Informatik Brouillon d'article pour les Cahiers GUTenberg n?? February 5, 1998 1 xndy A Flexible Indexing System Roger Kehr Institut fur Theoretische Informatik Darmstadt University of Technology Wilhelminenstrae

More information

Collaborative Refinery: A Collaborative Information Workspace for the World Wide Web

Collaborative Refinery: A Collaborative Information Workspace for the World Wide Web Collaborative Refinery: A Collaborative Information Workspace for the World Wide Web Technical Report 97-03 David W. McDonald Mark S. Ackerman Department of Information and Computer Science University

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

BUILDING A CONCEPTUAL MODEL OF THE WORLD WIDE WEB FOR VISUALLY IMPAIRED USERS

BUILDING A CONCEPTUAL MODEL OF THE WORLD WIDE WEB FOR VISUALLY IMPAIRED USERS 1 of 7 17/01/2007 10:39 BUILDING A CONCEPTUAL MODEL OF THE WORLD WIDE WEB FOR VISUALLY IMPAIRED USERS Mary Zajicek and Chris Powell School of Computing and Mathematical Sciences Oxford Brookes University,

More information

TagFS Tag Semantics for Hierarchical File Systems

TagFS Tag Semantics for Hierarchical File Systems TagFS Tag Semantics for Hierarchical File Systems Stephan Bloehdorn, Olaf Görlitz, Simon Schenk, Max Völkel Institute AIFB, University of Karlsruhe, Germany {bloehdorn}@aifb.uni-karlsruhe.de ISWeb, University

More information

Technical Writing. Professional Communications

Technical Writing. Professional Communications Technical Writing Professional Communications Overview Plan the document Write a draft Have someone review the draft Improve the document based on the review Plan, conduct, and evaluate a usability test

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Automatic Bangla Corpus Creation

Automatic Bangla Corpus Creation Automatic Bangla Corpus Creation Asif Iqbal Sarkar, Dewan Shahriar Hossain Pavel and Mumit Khan BRAC University, Dhaka, Bangladesh asif@bracuniversity.net, pavel@bracuniversity.net, mumit@bracuniversity.net

More information

World Wide Web has specific challenges and opportunities

World Wide Web has specific challenges and opportunities 6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has

More information

Implementing Web Content

Implementing Web Content Implementing Web Content Tonia M. Bartz Dr. David Robins Individual Investigation SLIS Site Redesign 6 August 2006 Appealing Web Content When writing content for a web site, it is best to think of it more

More information

iscreen Usability INTRODUCTION

iscreen Usability INTRODUCTION INTRODUCTION Context and motivation The College of IST recently installed an interactive kiosk called iscreen, designed to serve as an information resource for student/visitors to the College of IST. The

More information

Lisa Biagini & Eugenio Picchi, Istituto di Linguistica CNR, Pisa

Lisa Biagini & Eugenio Picchi, Istituto di Linguistica CNR, Pisa Lisa Biagini & Eugenio Picchi, Istituto di Linguistica CNR, Pisa Computazionale, INTERNET and DBT Abstract The advent of Internet has had enormous impact on working patterns and development in many scientific

More information