Organizing Topic-Specific Web Information

Size: px

Start display at page:

Download "Organizing Topic-Specific Web Information"

Kristin Phillips
6 years ago
Views:

1 Organizing Topic-Specific Web Information Sougata Mukherjea, C&C Research Laboratories, NEC USA Inc, San Jose, Ca, USA ABSTRACT With the explosive growth of the World-Wide Web, it is becoming increasingly difficult for users to collect and organize Web pages that are relevant to a particular topic To address this problem we are developing WTMS, a system for Web Topic Management In this paper we explain how WTMS collects Web pages for a topic and organizes them at various levels of abstraction We also introduce the user interface of the system that smoothly integrates querying and browsing Moreover, we present the various views of the interface that allow the user to navigate through the information space KEYWORDS: World-Wide Web, Topic Management, Information Visualization, Abstraction Hierarchy, Graph Algorithms 1 INTRODUCTION The World-Wide Web is undoubtedly the best source for getting information on any topic Therefore, more and more people use the Web for topic management [1], the task of gathering, evaluating and organizing information resources on the Web Users may investigate topics both for professional or personal interests Generally the popular portals or search engines like Yahoo and Alta Vista are used for gathering information on the WWW However, with the explosive growth of the Web, topic management is becoming an increasingly difficult task On the one hand this leads to a large number of documents being retrieved for most queries The results are presented as pages of scrolled lists Going through these pages to retrieve the relevant information is tedious Moreover, the Web has over 350 million pages and continues to grow rapidly at a million pages per day [4] Such growth and flux pose basic limits of scale for today s generic search engines Thus many relevant information may not have been gathered and some information may not be up-to-date Because of these problems, recently there is much awareness that for serious Web users, focussed portholes are more useful than generic portals [10] Therefore, systems that allow the user to collect and organize the information related to a particular topic and allow easy navigation through this information space is becoming essential We believe that such Web Topic Management Systems should have several essential features to be really useful: Focussed crawler: A crawler that allows the collection of resources from the WWW that relates to a user-specified topic is an essential requirement of the system Organize the information at various levels of abstraction: A search engine generally shows the relevant Web pages as the result of a user s query However, sometimes authors organize the information as a collection of pages; in these cases presenting the collection may be more useful for the reader In fact, if a Website has many relevant pages, presenting the Website itself may be better Furthermore, if there are many Websites for a particular topic, grouping similar Websites may be more convenient Therefore, an effective Topic Management system should be able to present the information at various levels of abstraction depending on the user s focus Integrate querying and browsing: The two major ways to access information in the Web are querying and browsing Querying is appropriate when the user has a well-defined understanding of what information is needed and how to formulate a query However, in many cases the user is not certain of exactly what information is desired and needs to learn more about the content of the information space In these cases browsing is an ideal navigational strategy Browsing can also be combined with querying when the result of the query is too large for the user to comprehend (by letting the user browse through the results) or too small (by showing the user other related information) An effective Topic Management system should allow the user to smoothly integrate querying and browsing to retrieve the relevant information Allow easy sharing of the gathered information: Once the collection for a particular topic has been gathered and organized, different users with interest in the topic should be able to share the information 133

2 We are building WTMS, a Web Topic Management System to allow the collection and organization of information on the Web related to a particular topic This paper discusses the architecture and various features of the system The next section cites related work Section 3 explains how we collect the information from the WWW and organize them at various levels of abstraction Section 4 presents the user interface of the system that allows the integration of querying and browsing The various views that allow the user to navigate through the information are also introduced Section 5 highlights some of our experiences with the system Finally, section 6 is the conclusion 2 RELATED WORK 21 Clustering in Hypertext The idea of grouping related nodes to reduce the complexity of hypertext networks has been used in the early hypertext systems In some cases (like [12], [13]) the hypertext is considered to be a set of documents and clustering algorithms for documents based on textual analysis are used In other cases (like [6], [14]), the similarity measure for clustering is based on the graph structure and graph algorithms are used for the clustering For example, nodes that have links connecting them are considered to be similar in [6] Modifications of the traditional clustering techniques are being applied to the WWW For example, [22] and [23] describe research at Xerox Parc that uses clustering to extract useful structures from the Web The clusters are determined by various criteria like co-citation analysis Similarly, in this paper we use various criteria to group Web pages for a particular topic at various levels of abstraction 22 Visualizing the World-Wide Web Several systems for visualizing WWW sites have been developed Examples includes Navigational View Builder [20], Harmony Internet Browser [2] and Narcissus [15] Visualization techniques for World-Wide Web search results are also being developed For example, the WebQuery system [8] visualizes the results of a search query along with all pages that link to or are linked to by any page in the original result set Another example is WebBook [7], which potentially allows the results of the search to be organized and manipulated in various ways in a 3D space In this paper our emphasis is to develop views that allow navigation through information about a particular topic organized at various levels of abstraction World-Wide Web Topic Management In recent times there has been much interest in collecting Web pages related to a particular topic and determining the important pages in the collection A focussed crawler for collecting topic-specific Web pages is presented in [10] Kleinberg [17], defines the HITS algorithm to identify authority and hub pages in a collection Authority pages are authorities on a topic and hubs point to many pages relevant to the topic Thus pages with many in-links, specially from hubs, are considered to be good authorities On the other hand, pages with many out-links, specially from authorities, are considered to be good hubs The algorithm has been refined in CLEVER [9] and Topic Distillation [5] We believe that this algorithm is very important for a Web Topic Management system However, we determine the hub and authority sites for a topic (instead of hub and authority pages) Mapuccino (formerly WebCutter) [18], [16], [3] and Topic- Shop [25], [1] are two systems that have been developed for WWW Topic Management Both systems use a crawler for collecting Web pages related to a topic and use various types of visualization to allow the user to navigate through the resultant information space While Mapuccino presents the information as a collection of Webpages, TopicShop presents the information as a collection of Websites As emphasized in the introduction, we believe that it is more effective to present the information at various levels of abstraction depending on the user s focus Moreover, a topic management system should allow the user to integrate querying and browsing to retrieve the relevant information 3 COLLECTING AND ORGANIZING INFORMATION The Web Topic Management System (WTMS) consists of two main components: Gatherer: The Gatherer uses a crawler to collect Web pages related to a user-specified topic A full-text search engine is used to index these pages and an organizer groups these pages at various levels of abstraction User Interface: The User Interface is a Java applet for navigating through the gathered information space In this section we will discuss how the gatherer collects and organizes the information 31 Crawling For collecting WWW pages related to a particular topic the user has to first specify some seed urls relevant to the topic Alternatively, the user can specify the keywords for the topic and the crawler issues a query with the specified keywords to a popular Web search engine and uses the results as the seed urls The crawler downloads the seed urls and creates a representative document vector (RDV) based on the frequently occurring keywords in these urls The crawler then downloads the pages that are referenced from the seed urls If the similarity of a downloaded page to the RDV is above a threshold, the page is indexed by the text search engine and the links from the page are added to a queue The crawler continues to follow the out-links until the queue is empty or a user-specified limit is reached (We use several heuristics to avoid downloading all the linked pages from the pages that are found to be relevant for the collection The heuristics are explained in [19])

3 Africa and Europe respectively Therefore, we use the directory structure of a particular Website to organize the pages at the lowest level At the end of this process the pages are grouped into their respective physical domains like wwwcnn com Figure 1: Abstraction Hierarchy of a Collection The disconnected connected components are at the top and the Web pages are the leaves 2 Logical Websites: Many large corporations use several Web servers based on functionality or geographic location For example, wwwneccom and wwwneccojp are the Websites of NEC Corporation in US and Japan respectively Similarly, shoppingyahoocom is the shopping component of the Yahoo wwwyahoocom portal Therefore, we group together related physical domains into logical Websites by analyzing the urls at the next level In other cases we may need to break down the domains into smaller logical Websites For example, for corporations like Geocities and Tripod who provide free homepages, the Websites wwwgeocitiescom and memberstripodcom are actually a collection of homepages with minimal relationship among each other (The technique of breaking physical domains to logical sites has not yet been incorporated into WTMS) The crawler also determines the pages pointing to the seed urls (A query link:u to search engines like AltaVista and Google returns all pages pointing to url u) These pages are downloaded and if their similarity to the RDV is greater than a threshold, they are indexed and the urls pointing to these pages are added to the queue The crawler continues to follow the in-links until the queue is empty or an user-specified limit is reached After crawling, the collection consists of the seed urls as well as all pages similar to these seed urls that have paths to or from the seeds We believe that this collection is a good source of information available on the WWW for the user-specified topic Unlike the results of a search engine query, this collection not only contains pages having the user specified keywords, but other pages linked from these pages which are relevant to the topic It should be noted that the crawler has a stop url list to avoid downloading some popular pages (like Yahoo s and Netscape s Homepages) as part of the collection 32 Organizing the Information For most topics the crawler will be downloading thousands of Web pages To allow the user to understand the information space better, the organizer groups the pages together at various levels of abstraction Figure 1 shows the abstraction hierarchy We use the following techniques to form the hierarchy: 1 Website Directory Structure: The Web pages are considered to be the leaves of the hierarchy Most Websites organize the pages into a meaningful directory hierarchy Thus wwwcnncom/world contains World news and wwwcnn com/us contains US news Further, wwwcnncom/world/ africa and wwwcnncom/world/europe contains news about 3 Strongly Connected Components: The above two steps create a collection of Websites for a particular topic For many topics there will be hundreds of Websites; therefore further abstraction will be useful To determine the related Websites, we create a site graph This is a directed graph with the Websites as the nodes If a page in Website a has a link to a page in Website b, then an edge is created from node a to b in the site graph We then calculate the strongly connected components in the site graph 1 Since each pair of nodes of a strongly connected component are reachable from each other, these nodes can be considered to be related So they can be grouped together 4 Connected Components: The procedure to determine the strongly connected components creates a component graph The nodes of the graph are the strongly connected components that were discovered If there was an edge between any two Websites belonging to different strongly connected components, then there will be an edge between the corresponding nodes in the component graph At the final stage of forming the abstraction hierarchy we consider the component graph as an undirected graph and determine the connected components 2 Nodes in the same connected component form a cluster 135 Figure 1 explains how the logical Websites are grouped into strongly connected and connected components Websites in the same strongly connected component have bi-directional paths among each other; for example sites a and b Onthe other hand, sites in the same connected component have unidirectional paths only; for example in Figure 1 there is a path 1 A strongly connected component is a maximal subgraph of a directed graph such that for every pair of vertices u, v in the subgraph, there is a directed path from u to v and a directed path from v to u [11] 2 A connected component is a maximal subgraph of an undirected graph such that for every pair of vertices u, v in the subgraph, there is a path between u and v [11]

4 Figure 2: The System Architecture of WTMS from b to c but not c to b Thus, sites in the same strongly connected component can be considered to be more similar than sites in the same connected component Obviously, sites belonging to different connected components have no links between them It should be emphasized that the procedure to form the abstraction hierarchy is linear and thus applicable to collections with a large number of Web pages The directory structure and the logical Websites can be determined by an analysis of the urls On the other hand, the algorithms for finding the strongly connected and connected components are O(n + e) where n and e are the number of nodes and edges in the graph 33 Sharing the Gathered Information As shown in Figure 2, WTMS uses a client-server architecture to allow easy sharing of the collection Any user can specify keywords or seed urls to gather collections for topics of interest After the information about one or more topics have been crawled, indexed and organized, a WTMS server is initialized An user with interest in a topic can navigate through the collection for the topic, by accessing the WTMS server from any Web browser Like any search engine the server allows the user to query the collection An unique feature of the system is the WTMS user interface which provides an overview of the collection for navigation This interface is opened in the client browser as a Java applet Since most users won t require all the collected information, the applet initially loads only a minimal amount of information Based on user interactions, the applet can request further information from the WTMS server XML is used for information exchange between the WTMS server and clients Note that to keep the collections up-todate, the crawling, indexing and organizing process has to be repeated periodically 4 INTERFACES FOR TOPIC MANAGEMENT The WTMS interface provides various types of views to allow the user to navigate through the information space This section gives a brief description of the WTMS interface applet Tabular View Figure 3 shows the initial view of the WTMS interface for a collection on World Cup Soccer On the left is a table showing the logical Websites that were discovered for the topic Various information about the sites are shown Besides the url and the number of pages, it also shows the hub and authority scores These scores are calculated using the algorithm introduced in [17] However, instead of applying the algorithm to individual Web pages, it is applied to the site graph (defined in section 32) Thus, sites with many inlinks in the site graph become the authorities and sites with many out-links become the hubs We believe that it is more useful to show hub and authority at the site level than at the page level since most HTML authors organize the information into a series of HTML pages for ease of navigation and readers generally want to know what is available on a given site, not on a single page The table also shows the title of the main page of the sites The main page is determined by the connectivity and the depth of the page in the site hierarchy as discussed in [21] Note that generally while calculating the hub and authority scores, intra-site links are ignored [5] Assuming that all Web pages within a site are by the same author, this removes the author s judgement while determining the global importance of a page within the overall collection However, while determining the local importance of a page within a site, the author s judgement is also important So the intra-site links are not ignored while determining the main page of a site The table gives a good overview about the collection Clicking on the labels on the top the user can sort the table by that particular statistics Thus, in Figure 3, it is sorted by the number of pages The first site wwwsportslinecom is also the authority (having the maximum possible authority score of 1:0) 42 Abstraction Hierarchy View The right of Figure 3 shows the initial abstraction hierarchy It shows the connected components that were discovered for the collection If some connected component contained only one site, the site itself is shown (for example, wwwiraniancom) Each component is represented by a glyph (graphical element) The size of the glyph is proportional to the number of Web pages the group contains and the brightness is proportional to last modification time of the freshest page in the group Thus, bright red glyphs indicate groups which contain very fresh pages and a black glyph indicates old pages 3 The glyphs are ordered by the maximum authority score among the sites in the group Thus the group containing wwwsportslinecom is at the top The label of a connected component is determined by the site with the highest authority score it contains The user can navigate through the abstraction hierarchy by 3 Of course, this is not very clear in the black and white figure

Figure 3: Initial WTMS Interface for a collection on World Cup Soccer The table on the left shows the logical Websites sorted by the number of Web pages The initial Abstraction Hierarchy showing the

5 Figure 3: Initial WTMS Interface for a collection on World Cup Soccer The table on the left shows the logical Websites sorted by the number of Web pages The initial Abstraction Hierarchy showing the connected components is on the right ordered by authority scores (a) The logical Website is selected (b) A directory (www2sportslinecom/u/soccer/worldcup98) is selected Figure 4: Abstraction Hierarchy of wwwsportslinecom 137

43 Page Level Views Clicking on a glyph (using the right mouse button) shows more information about the cluster or Web page For example, Figure 5 shows information like the title and the last

6 43 Page Level Views Clicking on a glyph (using the right mouse button) shows more information about the cluster or Web page For example, Figure 5 shows information like the title and the last modification time for the page cfphtml, a page in the collection on Information Visualiza- 138 tion collection This dialog box allows the user to visit the page or see the links to and from the page Thus, Figure 6(a) shows the links for the Infovis99 call for paper The page only has in-links Figure 5: Information about the Web page clicking on a glyph (with the left mouse button) to see the details of the corresponding cluster For example, Figure 4 shows the details of the site wwwsportslinecom In Figure 4(a), the logical Website is selected We can see the ancestors of the site (the connected and strongly connected component the site belongs to) as well as the siblings and children of the site The logical Website consists of many physical domains like www3sportslinecom andcbssportslinecom For domains containing a single page or directory, the page or directory itself is shown (for example, www2sportslinecom/u/ soccer) In Figure 4(b) the user has navigated further down the hierarchy and selected a directory wwwsportslinecom/u/ soccer/worldcup98 Notice that the glyph is an circle for a Web page and rectangle for the clusters Further, the label of the currently selected node is highlighted and the rectangles representing the currently selected path is not filled The view also shows the links to and from the currently selected site If another site has a link to the site, it has an arrow pointing out Similarly, if a site has a link from the selected site, it has an arrow pointing in The thickness of the arrow is proportional to the number of links Thus, Figure 4(a) shows that the site wwwsportscom has both inlinks and out-links to wwwsportslinecom The arrows to and from the children of the selected site, give an indication of their connectivity For example, Figure 4(b) shows that wwwsportslinecom/u/soccer/worldcup98/dm62598htm has several outgoing links Anotheruseful visualizationis thescatterplot view which allows the user to see related pages to a selected page as shown in Figure 6(b) In this view, the semantic similarity of a page to a selected page is determined by the vector space model [24] and is mapped in the Y dimension The structural similarity is mapped to the X dimension The structural similar pages for a selected page A is calculated as follows: 1 For each direct link to and from any page B to A, thestructural similarity score of B is increased by 1:0 2 For each indirect link between pages B and A, the structural similarity score is increased by 0:5 There can be 3 types of indirected links: (a) Transitive: B links to C and C links to A or vice versa (b) Social Filtering: Both A and B pointto C (c) Co-citation: C points to both A and B 3 The scores are normalized to determine the structural similarity The scatterplot visualization is very useful to see the related pages of the selected url (which is shown at the top right corner with high values for both semantic and structural similarity) The user can click on any glyph to see details of the corresponding page like the url and title as well as its links and related pages 44 Integrating Querying The WTMS interface allows the user to filter the information space based on various criteria For example, the user can see only pages modified before or after a particular date Moreover, in the Scatterplot visualization, the user can see only pages that are co-cited or directly linked to the selected page or in the same/different domain The tabular view and the abstraction hierarchy view are integrated Selecting a site in the abstraction hierarchy view highlights the site in the table Similarly, the user can select a site in the table to view its abstraction hierarchy These simple views can effectively show large collections containing thousands of Web pages and allow the user to gain an understanding about the topic Another useful technique is to specify keywords to determine relevant Webpages or sites For example, Figure 7 shows the results after a query to a collection on Titanic with the keywords Celine Dion (the lead singer in the movie) Figure 7(a) shows the Java Applet The relevant Websites are highlighted in the tabular view and only the relevant groups are visible in the abstraction hierarchy view It shows that out of the 20 connected components in the collection (see Table 1), only 4 are relevant to the query The user can navigate through the filtered space to determine the relevant information Of course, the user can also see the retrieved pages in a HTML window as shown in Figure 7(b) Thus, the results are viewable at 3 levels of abstraction (connected components, sites and pages) In this way WTMS smoothly integrates querying and browsing

(a) Showing the Links (b) Showing the Related Urls Figure 6:

The Java Applet (b) The HTML Interface Figure 7: Query Result

Information Visualization World Cup Soccer Titanic Web Pages

107 556 Stronly Connected Components 210 105 479 Connected

7 (a) Showing the Links (b) Showing the Related Urls Figure 6: Page Level Views for (a) The Java Applet (b) The HTML Interface Figure 7: Query Result of Celine Dion to the Titanic Collection Collections Information Visualization World Cup Soccer Titanic Web Pages Physical Domains Logical Sites Stronly Connected Components Connected Components Table 1: Statistics for the Abstraction Hierarchy Formation Process 139

8 It should be noted that for keyword queries as well as for determining the similar nodes to a selected node (used in the Scatterplot visualization), the WTMS interface sends a query to the WTMS server The results from the server are used in the visualizations 5 EXPERIENCES We have collected and organized information from the WWW relating to several topic The user selected topic keywords were used by WTMS to query the search engine Google and the top 100 retrieved pages were downloaded as the seed urls These pages were indexed and a representative document vector (RDV) was created from the 25 most frequently occurring words among these urls (excluding common words like and and the present in a stoplist) The crawler then followed paths of length 2 to and from the seed urls and indexed pages similar to the RDV A maximum of 100 in-links and out-links were followed for each page Note that the user specified limits like the number of pages to be retrieved by the search engine or the length of the path to follow could be changed by the user to gather collections of different size After the crawling, the pages were organized as described in the section 32 Table 1 shows some statistics of the organization process for 3 collections Titanic, World Cup Soccer and Information Visualization It shows the number of Web pages indexed as well as the number of physical domains, logical Websites, strongly connected components and connected components that were discovered 6 CONCLUSION In this paper, we have presented WTMS, a Web Topic Management system WTMS uses a crawler to gather information related to a particular topic from the WWW It then uses url and link analysis to organize the Web pages in a abstraction hierarchy The WTMS interface allows the user to integrate searching and browsing The applet provides various visualizations for various units of information like a tabular view of the Websites of the collection as well as a scatterplot visualization showing pages related to a selected page Moreover, the abstraction hierarchy view facilitates navigation through the information space at different levels of de- 140 tail The example scenarios presented in the paper indicates the usefulness of the system for Web Topic Management Future work is planned along various directions: Textual analysis during organization: At present we organize pages within a site by the directory hierarchy Since the Website administrator is the best judge of how the pages within a site should be organized, this grouping is effective However, for grouping related sites, we rely only on the links between the sites Although it leads to a meaningful abstraction hierarchy in most cases, for better results we have to augment the graph algorithms with textual analysis Allowing Collaboration: We plan to allow distributed users to collaborate over the internet while navigating through the same information space For example, colleagues can collaborate while performing literature survey on a particular topic Our client-server architecture will facilitate such a collaborative environment Usability Studies: Although the initial reactions to the system has been positive, more formal user studies are required to determine the effectiveness of the abstraction hierarchy and the WTMS interface for retrieving information about a particular topic Such feedbacks will help us to improve the system We believe that as the WWW grows bigger, systems like WTMS will become essential for retrieving useful information One interesting observation is that the number of strongly connected components is not much smaller than the number of logical sites for all three collections It seems to indicate that there are not many cyclical referencing among the Websites; therefore they could not be grouped into strongly connected components Table 1 also indicates that the number of connected components are small Thus our organization technique groups many linked (and therefore related) sites into a small number of meaningful clusters Thus it is useful to get an overview about a collection As discussed in [19], the WTMS interface also allows the user draw some useful insights about the topic REFERENCES 1 B Amento, W Hill, L Terveen, D Hix, and P Ju An Empirical Evaluation of User Interfaces for Topic Management of Web Sites In Proceedings of the ACM SIGCHI 99 Conference on Human Factors in Computing Systems, pages , Pittsburgh, Pa, May K Andrews Visualizing Cyberspace: Information Visualization in the Harmony Internet Browser In Proceedings of the 1995 Information Visualization Symposium, pages , Atlanta, GA, I Ben-Shaul, M Hersovici, M Jacovi, Y Maarek, D Pelleg, M Shtalheim,, V Soroka, and S Ur Adding Support for Dynamic and Focussed Search with Fetuccino In Proceedings of the Eight International World-Wide Web Conference, pages , Toronto, Canada, May K Bharat and A Broder A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines Computer Networks and ISDN Systems Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April K Bharat and M Henzinger Improved Algorithms for Topic Distillation in a Hyperlinked Environment

9 In Proceedings of the ACM SIGIR 98 Conference on Research and Development in Information Retrieval, pages , Melbourne, Australia, August R Botafogo and B Shneiderman Identifying Aggregates in Hypertext Structures In Proceedings of ACM Hypertext 91 Conference, pages 63 74, San Antonio, TX, December S Card, G Robertson, and W York The WebBook and the Web Forager: An Information Workspace for the World-Wide Web In Proceedings of the ACM SIGCHI 96 Conference on Human Factors in Computing Systems, pages , Vancouver, Canada, April J Carriere and R Kazman Searching and Visualizing the Web through Connectivity In Proceedings of the Sixth International World-Wide Web Conference, pages , Santa Clara, CA, April S Chakrabarti, B Dom, D Gibson, J Kleinberg, P Raghavan, and S Rajagopalan Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text Computer Networks and ISDN Systems Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April S Chakrabarti, M van den Berg, and B Dom Focussed Crawling: a New Approach to Topic-specific Web Resource Discovery In Proceedings of the Eight International World-Wide Web Conference, pages , Toronto, Canada, May T Cormen, C Leiserson, and R Rivest Introduction to Algorithms The MIT Press, D Crouch, C Crouch, and G Andreas The Use of Cluster Hierarchies in Hypertext Information Retrieval In Proceedings of ACM Hypertext 89 Conference, pages , Pittsburgh, Pa, November P Gloor Cybermap: Yet Another Way of Navigating in Hyperspace In Proceedings of ACM Hypertext 91 Conference, pages , San Antonio, TX, December Y Hara, A Keller, and G Wiederhold Implementing Hypertext Database Relationships Through Aggregations and Exceptions In Proceedings of ACM Hypertext 91 Conference, pages 75 90, San Antonio, TX, December R Hendley, N Drew, A Wood, and R Beale Narcissus: Visualizing Information In Proceedings of the 1995 Information Visualization Symposium, pages 90 96, Atlanta, GA, an Application: Tailored Web Site Mapping Computer Networks and ISDN Systems Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April J Kleinberg Authorative Sources in a Hyperlinked Environment In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, May Y Maarek and I Shaul WebCutter: A System for Dynamic and Tailorable Site Mapping In Proceedings of the Sixth International World-Wide Web Conference, pages , Santa Clara, CA, April S Mukherjea WTMS: A System for Collecting and Analyzing Topic-Specific Web Information In Proceedings of the Ninth International World-Wide Web Conference, Amsterdam, Netherlands, May S Mukherjea and J Foley Visualizing the World-Wide Web with the Navigational View Builder Computer Networks and ISDN Systems Special Issue on the Third International World-Wide Web Conference, Darmstadt, Germany, 27(6): , April S Mukherjea and Y Hara Focus+Context Views of World-Wide Web Nodes In Proceedings of the Eight ACM Conference on Hypertext, pages , Southampton, England, April P Pirolli, J Pitkow, and R Rao Silk from a Sow s Ear: Extracting Usable Structures from the Web In Proceedings of the ACM SIGCHI 96 Conference on Human Factors in Computing Systems, pages , Vancouver, Canada, April J Pitkow and P Pirolli Life, Death and Lawfulness on the Electronic Frontier In Proceedings of the ACM SIGCHI 97 Conference on Human Factors in Computing Systems, pages , Atlanta, Ga, March G Salton and M McGill Introduction to Modern Information Retrieval McGraw-Hill, L Terveen and H Will Finding and Visualizing Intersite Clan Graphs In Proceedings of the ACM SIGCHI 98 Conference on Human Factors in Computing Systems, pages , Los Angeles, Ca, April 1998 Author s Current Affiliation: Icarian Inc, 555 North Mathilda Avenue, Sunnyvale, Ca 94086, USA sougata@icariancom 16 M Hersovici, M Jacovi, Y Maarek, D Pelleg, M Shtalheim, and S Ur The Shark-Search Algorithm 141

Abstract. 1. Introduction

Abstract. 1. Introduction A Visualization System using Data Mining Techniques for Identifying Information Sources on the Web Richard H. Fowler, Tarkan Karadayi, Zhixiang Chen, Xiaodong Meng, Wendy A. L. Fowler Department of Computer