Organizing Topic-Specific Web Information

Size: px
Start display at page:

Download "Organizing Topic-Specific Web Information"

Transcription

1 Organizing Topic-Specific Web Information Sougata Mukherjea, C&C Research Laboratories, NEC USA Inc, San Jose, Ca, USA ABSTRACT With the explosive growth of the World-Wide Web, it is becoming increasingly difficult for users to collect and organize Web pages that are relevant to a particular topic To address this problem we are developing WTMS, a system for Web Topic Management In this paper we explain how WTMS collects Web pages for a topic and organizes them at various levels of abstraction We also introduce the user interface of the system that smoothly integrates querying and browsing Moreover, we present the various views of the interface that allow the user to navigate through the information space KEYWORDS: World-Wide Web, Topic Management, Information Visualization, Abstraction Hierarchy, Graph Algorithms 1 INTRODUCTION The World-Wide Web is undoubtedly the best source for getting information on any topic Therefore, more and more people use the Web for topic management [1], the task of gathering, evaluating and organizing information resources on the Web Users may investigate topics both for professional or personal interests Generally the popular portals or search engines like Yahoo and Alta Vista are used for gathering information on the WWW However, with the explosive growth of the Web, topic management is becoming an increasingly difficult task On the one hand this leads to a large number of documents being retrieved for most queries The results are presented as pages of scrolled lists Going through these pages to retrieve the relevant information is tedious Moreover, the Web has over 350 million pages and continues to grow rapidly at a million pages per day [4] Such growth and flux pose basic limits of scale for today s generic search engines Thus many relevant information may not have been gathered and some information may not be up-to-date Because of these problems, recently there is much awareness that for serious Web users, focussed portholes are more useful than generic portals [10] Therefore, systems that allow the user to collect and organize the information related to a particular topic and allow easy navigation through this information space is becoming essential We believe that such Web Topic Management Systems should have several essential features to be really useful: Focussed crawler: A crawler that allows the collection of resources from the WWW that relates to a user-specified topic is an essential requirement of the system Organize the information at various levels of abstraction: A search engine generally shows the relevant Web pages as the result of a user s query However, sometimes authors organize the information as a collection of pages; in these cases presenting the collection may be more useful for the reader In fact, if a Website has many relevant pages, presenting the Website itself may be better Furthermore, if there are many Websites for a particular topic, grouping similar Websites may be more convenient Therefore, an effective Topic Management system should be able to present the information at various levels of abstraction depending on the user s focus Integrate querying and browsing: The two major ways to access information in the Web are querying and browsing Querying is appropriate when the user has a well-defined understanding of what information is needed and how to formulate a query However, in many cases the user is not certain of exactly what information is desired and needs to learn more about the content of the information space In these cases browsing is an ideal navigational strategy Browsing can also be combined with querying when the result of the query is too large for the user to comprehend (by letting the user browse through the results) or too small (by showing the user other related information) An effective Topic Management system should allow the user to smoothly integrate querying and browsing to retrieve the relevant information Allow easy sharing of the gathered information: Once the collection for a particular topic has been gathered and organized, different users with interest in the topic should be able to share the information 133

2 We are building WTMS, a Web Topic Management System to allow the collection and organization of information on the Web related to a particular topic This paper discusses the architecture and various features of the system The next section cites related work Section 3 explains how we collect the information from the WWW and organize them at various levels of abstraction Section 4 presents the user interface of the system that allows the integration of querying and browsing The various views that allow the user to navigate through the information are also introduced Section 5 highlights some of our experiences with the system Finally, section 6 is the conclusion 2 RELATED WORK 21 Clustering in Hypertext The idea of grouping related nodes to reduce the complexity of hypertext networks has been used in the early hypertext systems In some cases (like [12], [13]) the hypertext is considered to be a set of documents and clustering algorithms for documents based on textual analysis are used In other cases (like [6], [14]), the similarity measure for clustering is based on the graph structure and graph algorithms are used for the clustering For example, nodes that have links connecting them are considered to be similar in [6] Modifications of the traditional clustering techniques are being applied to the WWW For example, [22] and [23] describe research at Xerox Parc that uses clustering to extract useful structures from the Web The clusters are determined by various criteria like co-citation analysis Similarly, in this paper we use various criteria to group Web pages for a particular topic at various levels of abstraction 22 Visualizing the World-Wide Web Several systems for visualizing WWW sites have been developed Examples includes Navigational View Builder [20], Harmony Internet Browser [2] and Narcissus [15] Visualization techniques for World-Wide Web search results are also being developed For example, the WebQuery system [8] visualizes the results of a search query along with all pages that link to or are linked to by any page in the original result set Another example is WebBook [7], which potentially allows the results of the search to be organized and manipulated in various ways in a 3D space In this paper our emphasis is to develop views that allow navigation through information about a particular topic organized at various levels of abstraction World-Wide Web Topic Management In recent times there has been much interest in collecting Web pages related to a particular topic and determining the important pages in the collection A focussed crawler for collecting topic-specific Web pages is presented in [10] Kleinberg [17], defines the HITS algorithm to identify authority and hub pages in a collection Authority pages are authorities on a topic and hubs point to many pages relevant to the topic Thus pages with many in-links, specially from hubs, are considered to be good authorities On the other hand, pages with many out-links, specially from authorities, are considered to be good hubs The algorithm has been refined in CLEVER [9] and Topic Distillation [5] We believe that this algorithm is very important for a Web Topic Management system However, we determine the hub and authority sites for a topic (instead of hub and authority pages) Mapuccino (formerly WebCutter) [18], [16], [3] and Topic- Shop [25], [1] are two systems that have been developed for WWW Topic Management Both systems use a crawler for collecting Web pages related to a topic and use various types of visualization to allow the user to navigate through the resultant information space While Mapuccino presents the information as a collection of Webpages, TopicShop presents the information as a collection of Websites As emphasized in the introduction, we believe that it is more effective to present the information at various levels of abstraction depending on the user s focus Moreover, a topic management system should allow the user to integrate querying and browsing to retrieve the relevant information 3 COLLECTING AND ORGANIZING INFORMATION The Web Topic Management System (WTMS) consists of two main components: Gatherer: The Gatherer uses a crawler to collect Web pages related to a user-specified topic A full-text search engine is used to index these pages and an organizer groups these pages at various levels of abstraction User Interface: The User Interface is a Java applet for navigating through the gathered information space In this section we will discuss how the gatherer collects and organizes the information 31 Crawling For collecting WWW pages related to a particular topic the user has to first specify some seed urls relevant to the topic Alternatively, the user can specify the keywords for the topic and the crawler issues a query with the specified keywords to a popular Web search engine and uses the results as the seed urls The crawler downloads the seed urls and creates a representative document vector (RDV) based on the frequently occurring keywords in these urls The crawler then downloads the pages that are referenced from the seed urls If the similarity of a downloaded page to the RDV is above a threshold, the page is indexed by the text search engine and the links from the page are added to a queue The crawler continues to follow the out-links until the queue is empty or a user-specified limit is reached (We use several heuristics to avoid downloading all the linked pages from the pages that are found to be relevant for the collection The heuristics are explained in [19])

3 Africa and Europe respectively Therefore, we use the directory structure of a particular Website to organize the pages at the lowest level At the end of this process the pages are grouped into their respective physical domains like wwwcnn com Figure 1: Abstraction Hierarchy of a Collection The disconnected connected components are at the top and the Web pages are the leaves 2 Logical Websites: Many large corporations use several Web servers based on functionality or geographic location For example, wwwneccom and wwwneccojp are the Websites of NEC Corporation in US and Japan respectively Similarly, shoppingyahoocom is the shopping component of the Yahoo wwwyahoocom portal Therefore, we group together related physical domains into logical Websites by analyzing the urls at the next level In other cases we may need to break down the domains into smaller logical Websites For example, for corporations like Geocities and Tripod who provide free homepages, the Websites wwwgeocitiescom and memberstripodcom are actually a collection of homepages with minimal relationship among each other (The technique of breaking physical domains to logical sites has not yet been incorporated into WTMS) The crawler also determines the pages pointing to the seed urls (A query link:u to search engines like AltaVista and Google returns all pages pointing to url u) These pages are downloaded and if their similarity to the RDV is greater than a threshold, they are indexed and the urls pointing to these pages are added to the queue The crawler continues to follow the in-links until the queue is empty or an user-specified limit is reached After crawling, the collection consists of the seed urls as well as all pages similar to these seed urls that have paths to or from the seeds We believe that this collection is a good source of information available on the WWW for the user-specified topic Unlike the results of a search engine query, this collection not only contains pages having the user specified keywords, but other pages linked from these pages which are relevant to the topic It should be noted that the crawler has a stop url list to avoid downloading some popular pages (like Yahoo s and Netscape s Homepages) as part of the collection 32 Organizing the Information For most topics the crawler will be downloading thousands of Web pages To allow the user to understand the information space better, the organizer groups the pages together at various levels of abstraction Figure 1 shows the abstraction hierarchy We use the following techniques to form the hierarchy: 1 Website Directory Structure: The Web pages are considered to be the leaves of the hierarchy Most Websites organize the pages into a meaningful directory hierarchy Thus wwwcnncom/world contains World news and wwwcnn com/us contains US news Further, wwwcnncom/world/ africa and wwwcnncom/world/europe contains news about 3 Strongly Connected Components: The above two steps create a collection of Websites for a particular topic For many topics there will be hundreds of Websites; therefore further abstraction will be useful To determine the related Websites, we create a site graph This is a directed graph with the Websites as the nodes If a page in Website a has a link to a page in Website b, then an edge is created from node a to b in the site graph We then calculate the strongly connected components in the site graph 1 Since each pair of nodes of a strongly connected component are reachable from each other, these nodes can be considered to be related So they can be grouped together 4 Connected Components: The procedure to determine the strongly connected components creates a component graph The nodes of the graph are the strongly connected components that were discovered If there was an edge between any two Websites belonging to different strongly connected components, then there will be an edge between the corresponding nodes in the component graph At the final stage of forming the abstraction hierarchy we consider the component graph as an undirected graph and determine the connected components 2 Nodes in the same connected component form a cluster 135 Figure 1 explains how the logical Websites are grouped into strongly connected and connected components Websites in the same strongly connected component have bi-directional paths among each other; for example sites a and b Onthe other hand, sites in the same connected component have unidirectional paths only; for example in Figure 1 there is a path 1 A strongly connected component is a maximal subgraph of a directed graph such that for every pair of vertices u, v in the subgraph, there is a directed path from u to v and a directed path from v to u [11] 2 A connected component is a maximal subgraph of an undirected graph such that for every pair of vertices u, v in the subgraph, there is a path between u and v [11]

4 Figure 2: The System Architecture of WTMS from b to c but not c to b Thus, sites in the same strongly connected component can be considered to be more similar than sites in the same connected component Obviously, sites belonging to different connected components have no links between them It should be emphasized that the procedure to form the abstraction hierarchy is linear and thus applicable to collections with a large number of Web pages The directory structure and the logical Websites can be determined by an analysis of the urls On the other hand, the algorithms for finding the strongly connected and connected components are O(n + e) where n and e are the number of nodes and edges in the graph 33 Sharing the Gathered Information As shown in Figure 2, WTMS uses a client-server architecture to allow easy sharing of the collection Any user can specify keywords or seed urls to gather collections for topics of interest After the information about one or more topics have been crawled, indexed and organized, a WTMS server is initialized An user with interest in a topic can navigate through the collection for the topic, by accessing the WTMS server from any Web browser Like any search engine the server allows the user to query the collection An unique feature of the system is the WTMS user interface which provides an overview of the collection for navigation This interface is opened in the client browser as a Java applet Since most users won t require all the collected information, the applet initially loads only a minimal amount of information Based on user interactions, the applet can request further information from the WTMS server XML is used for information exchange between the WTMS server and clients Note that to keep the collections up-todate, the crawling, indexing and organizing process has to be repeated periodically 4 INTERFACES FOR TOPIC MANAGEMENT The WTMS interface provides various types of views to allow the user to navigate through the information space This section gives a brief description of the WTMS interface applet Tabular View Figure 3 shows the initial view of the WTMS interface for a collection on World Cup Soccer On the left is a table showing the logical Websites that were discovered for the topic Various information about the sites are shown Besides the url and the number of pages, it also shows the hub and authority scores These scores are calculated using the algorithm introduced in [17] However, instead of applying the algorithm to individual Web pages, it is applied to the site graph (defined in section 32) Thus, sites with many inlinks in the site graph become the authorities and sites with many out-links become the hubs We believe that it is more useful to show hub and authority at the site level than at the page level since most HTML authors organize the information into a series of HTML pages for ease of navigation and readers generally want to know what is available on a given site, not on a single page The table also shows the title of the main page of the sites The main page is determined by the connectivity and the depth of the page in the site hierarchy as discussed in [21] Note that generally while calculating the hub and authority scores, intra-site links are ignored [5] Assuming that all Web pages within a site are by the same author, this removes the author s judgement while determining the global importance of a page within the overall collection However, while determining the local importance of a page within a site, the author s judgement is also important So the intra-site links are not ignored while determining the main page of a site The table gives a good overview about the collection Clicking on the labels on the top the user can sort the table by that particular statistics Thus, in Figure 3, it is sorted by the number of pages The first site wwwsportslinecom is also the authority (having the maximum possible authority score of 1:0) 42 Abstraction Hierarchy View The right of Figure 3 shows the initial abstraction hierarchy It shows the connected components that were discovered for the collection If some connected component contained only one site, the site itself is shown (for example, wwwiraniancom) Each component is represented by a glyph (graphical element) The size of the glyph is proportional to the number of Web pages the group contains and the brightness is proportional to last modification time of the freshest page in the group Thus, bright red glyphs indicate groups which contain very fresh pages and a black glyph indicates old pages 3 The glyphs are ordered by the maximum authority score among the sites in the group Thus the group containing wwwsportslinecom is at the top The label of a connected component is determined by the site with the highest authority score it contains The user can navigate through the abstraction hierarchy by 3 Of course, this is not very clear in the black and white figure

5 Figure 3: Initial WTMS Interface for a collection on World Cup Soccer The table on the left shows the logical Websites sorted by the number of Web pages The initial Abstraction Hierarchy showing the connected components is on the right ordered by authority scores (a) The logical Website is selected (b) A directory (www2sportslinecom/u/soccer/worldcup98) is selected Figure 4: Abstraction Hierarchy of wwwsportslinecom 137

6 43 Page Level Views Clicking on a glyph (using the right mouse button) shows more information about the cluster or Web page For example, Figure 5 shows information like the title and the last modification time for the page cfphtml, a page in the collection on Information Visualiza- 138 tion collection This dialog box allows the user to visit the page or see the links to and from the page Thus, Figure 6(a) shows the links for the Infovis99 call for paper The page only has in-links Figure 5: Information about the Web page clicking on a glyph (with the left mouse button) to see the details of the corresponding cluster For example, Figure 4 shows the details of the site wwwsportslinecom In Figure 4(a), the logical Website is selected We can see the ancestors of the site (the connected and strongly connected component the site belongs to) as well as the siblings and children of the site The logical Website consists of many physical domains like www3sportslinecom andcbssportslinecom For domains containing a single page or directory, the page or directory itself is shown (for example, www2sportslinecom/u/ soccer) In Figure 4(b) the user has navigated further down the hierarchy and selected a directory wwwsportslinecom/u/ soccer/worldcup98 Notice that the glyph is an circle for a Web page and rectangle for the clusters Further, the label of the currently selected node is highlighted and the rectangles representing the currently selected path is not filled The view also shows the links to and from the currently selected site If another site has a link to the site, it has an arrow pointing out Similarly, if a site has a link from the selected site, it has an arrow pointing in The thickness of the arrow is proportional to the number of links Thus, Figure 4(a) shows that the site wwwsportscom has both inlinks and out-links to wwwsportslinecom The arrows to and from the children of the selected site, give an indication of their connectivity For example, Figure 4(b) shows that wwwsportslinecom/u/soccer/worldcup98/dm62598htm has several outgoing links Anotheruseful visualizationis thescatterplot view which allows the user to see related pages to a selected page as shown in Figure 6(b) In this view, the semantic similarity of a page to a selected page is determined by the vector space model [24] and is mapped in the Y dimension The structural similarity is mapped to the X dimension The structural similar pages for a selected page A is calculated as follows: 1 For each direct link to and from any page B to A, thestructural similarity score of B is increased by 1:0 2 For each indirect link between pages B and A, the structural similarity score is increased by 0:5 There can be 3 types of indirected links: (a) Transitive: B links to C and C links to A or vice versa (b) Social Filtering: Both A and B pointto C (c) Co-citation: C points to both A and B 3 The scores are normalized to determine the structural similarity The scatterplot visualization is very useful to see the related pages of the selected url (which is shown at the top right corner with high values for both semantic and structural similarity) The user can click on any glyph to see details of the corresponding page like the url and title as well as its links and related pages 44 Integrating Querying The WTMS interface allows the user to filter the information space based on various criteria For example, the user can see only pages modified before or after a particular date Moreover, in the Scatterplot visualization, the user can see only pages that are co-cited or directly linked to the selected page or in the same/different domain The tabular view and the abstraction hierarchy view are integrated Selecting a site in the abstraction hierarchy view highlights the site in the table Similarly, the user can select a site in the table to view its abstraction hierarchy These simple views can effectively show large collections containing thousands of Web pages and allow the user to gain an understanding about the topic Another useful technique is to specify keywords to determine relevant Webpages or sites For example, Figure 7 shows the results after a query to a collection on Titanic with the keywords Celine Dion (the lead singer in the movie) Figure 7(a) shows the Java Applet The relevant Websites are highlighted in the tabular view and only the relevant groups are visible in the abstraction hierarchy view It shows that out of the 20 connected components in the collection (see Table 1), only 4 are relevant to the query The user can navigate through the filtered space to determine the relevant information Of course, the user can also see the retrieved pages in a HTML window as shown in Figure 7(b) Thus, the results are viewable at 3 levels of abstraction (connected components, sites and pages) In this way WTMS smoothly integrates querying and browsing

7 (a) Showing the Links (b) Showing the Related Urls Figure 6: Page Level Views for (a) The Java Applet (b) The HTML Interface Figure 7: Query Result of Celine Dion to the Titanic Collection Collections Information Visualization World Cup Soccer Titanic Web Pages Physical Domains Logical Sites Stronly Connected Components Connected Components Table 1: Statistics for the Abstraction Hierarchy Formation Process 139

8 It should be noted that for keyword queries as well as for determining the similar nodes to a selected node (used in the Scatterplot visualization), the WTMS interface sends a query to the WTMS server The results from the server are used in the visualizations 5 EXPERIENCES We have collected and organized information from the WWW relating to several topic The user selected topic keywords were used by WTMS to query the search engine Google and the top 100 retrieved pages were downloaded as the seed urls These pages were indexed and a representative document vector (RDV) was created from the 25 most frequently occurring words among these urls (excluding common words like and and the present in a stoplist) The crawler then followed paths of length 2 to and from the seed urls and indexed pages similar to the RDV A maximum of 100 in-links and out-links were followed for each page Note that the user specified limits like the number of pages to be retrieved by the search engine or the length of the path to follow could be changed by the user to gather collections of different size After the crawling, the pages were organized as described in the section 32 Table 1 shows some statistics of the organization process for 3 collections Titanic, World Cup Soccer and Information Visualization It shows the number of Web pages indexed as well as the number of physical domains, logical Websites, strongly connected components and connected components that were discovered 6 CONCLUSION In this paper, we have presented WTMS, a Web Topic Management system WTMS uses a crawler to gather information related to a particular topic from the WWW It then uses url and link analysis to organize the Web pages in a abstraction hierarchy The WTMS interface allows the user to integrate searching and browsing The applet provides various visualizations for various units of information like a tabular view of the Websites of the collection as well as a scatterplot visualization showing pages related to a selected page Moreover, the abstraction hierarchy view facilitates navigation through the information space at different levels of de- 140 tail The example scenarios presented in the paper indicates the usefulness of the system for Web Topic Management Future work is planned along various directions: Textual analysis during organization: At present we organize pages within a site by the directory hierarchy Since the Website administrator is the best judge of how the pages within a site should be organized, this grouping is effective However, for grouping related sites, we rely only on the links between the sites Although it leads to a meaningful abstraction hierarchy in most cases, for better results we have to augment the graph algorithms with textual analysis Allowing Collaboration: We plan to allow distributed users to collaborate over the internet while navigating through the same information space For example, colleagues can collaborate while performing literature survey on a particular topic Our client-server architecture will facilitate such a collaborative environment Usability Studies: Although the initial reactions to the system has been positive, more formal user studies are required to determine the effectiveness of the abstraction hierarchy and the WTMS interface for retrieving information about a particular topic Such feedbacks will help us to improve the system We believe that as the WWW grows bigger, systems like WTMS will become essential for retrieving useful information One interesting observation is that the number of strongly connected components is not much smaller than the number of logical sites for all three collections It seems to indicate that there are not many cyclical referencing among the Websites; therefore they could not be grouped into strongly connected components Table 1 also indicates that the number of connected components are small Thus our organization technique groups many linked (and therefore related) sites into a small number of meaningful clusters Thus it is useful to get an overview about a collection As discussed in [19], the WTMS interface also allows the user draw some useful insights about the topic REFERENCES 1 B Amento, W Hill, L Terveen, D Hix, and P Ju An Empirical Evaluation of User Interfaces for Topic Management of Web Sites In Proceedings of the ACM SIGCHI 99 Conference on Human Factors in Computing Systems, pages , Pittsburgh, Pa, May K Andrews Visualizing Cyberspace: Information Visualization in the Harmony Internet Browser In Proceedings of the 1995 Information Visualization Symposium, pages , Atlanta, GA, I Ben-Shaul, M Hersovici, M Jacovi, Y Maarek, D Pelleg, M Shtalheim,, V Soroka, and S Ur Adding Support for Dynamic and Focussed Search with Fetuccino In Proceedings of the Eight International World-Wide Web Conference, pages , Toronto, Canada, May K Bharat and A Broder A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines Computer Networks and ISDN Systems Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April K Bharat and M Henzinger Improved Algorithms for Topic Distillation in a Hyperlinked Environment

9 In Proceedings of the ACM SIGIR 98 Conference on Research and Development in Information Retrieval, pages , Melbourne, Australia, August R Botafogo and B Shneiderman Identifying Aggregates in Hypertext Structures In Proceedings of ACM Hypertext 91 Conference, pages 63 74, San Antonio, TX, December S Card, G Robertson, and W York The WebBook and the Web Forager: An Information Workspace for the World-Wide Web In Proceedings of the ACM SIGCHI 96 Conference on Human Factors in Computing Systems, pages , Vancouver, Canada, April J Carriere and R Kazman Searching and Visualizing the Web through Connectivity In Proceedings of the Sixth International World-Wide Web Conference, pages , Santa Clara, CA, April S Chakrabarti, B Dom, D Gibson, J Kleinberg, P Raghavan, and S Rajagopalan Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text Computer Networks and ISDN Systems Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April S Chakrabarti, M van den Berg, and B Dom Focussed Crawling: a New Approach to Topic-specific Web Resource Discovery In Proceedings of the Eight International World-Wide Web Conference, pages , Toronto, Canada, May T Cormen, C Leiserson, and R Rivest Introduction to Algorithms The MIT Press, D Crouch, C Crouch, and G Andreas The Use of Cluster Hierarchies in Hypertext Information Retrieval In Proceedings of ACM Hypertext 89 Conference, pages , Pittsburgh, Pa, November P Gloor Cybermap: Yet Another Way of Navigating in Hyperspace In Proceedings of ACM Hypertext 91 Conference, pages , San Antonio, TX, December Y Hara, A Keller, and G Wiederhold Implementing Hypertext Database Relationships Through Aggregations and Exceptions In Proceedings of ACM Hypertext 91 Conference, pages 75 90, San Antonio, TX, December R Hendley, N Drew, A Wood, and R Beale Narcissus: Visualizing Information In Proceedings of the 1995 Information Visualization Symposium, pages 90 96, Atlanta, GA, an Application: Tailored Web Site Mapping Computer Networks and ISDN Systems Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April J Kleinberg Authorative Sources in a Hyperlinked Environment In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, May Y Maarek and I Shaul WebCutter: A System for Dynamic and Tailorable Site Mapping In Proceedings of the Sixth International World-Wide Web Conference, pages , Santa Clara, CA, April S Mukherjea WTMS: A System for Collecting and Analyzing Topic-Specific Web Information In Proceedings of the Ninth International World-Wide Web Conference, Amsterdam, Netherlands, May S Mukherjea and J Foley Visualizing the World-Wide Web with the Navigational View Builder Computer Networks and ISDN Systems Special Issue on the Third International World-Wide Web Conference, Darmstadt, Germany, 27(6): , April S Mukherjea and Y Hara Focus+Context Views of World-Wide Web Nodes In Proceedings of the Eight ACM Conference on Hypertext, pages , Southampton, England, April P Pirolli, J Pitkow, and R Rao Silk from a Sow s Ear: Extracting Usable Structures from the Web In Proceedings of the ACM SIGCHI 96 Conference on Human Factors in Computing Systems, pages , Vancouver, Canada, April J Pitkow and P Pirolli Life, Death and Lawfulness on the Electronic Frontier In Proceedings of the ACM SIGCHI 97 Conference on Human Factors in Computing Systems, pages , Atlanta, Ga, March G Salton and M McGill Introduction to Modern Information Retrieval McGraw-Hill, L Terveen and H Will Finding and Visualizing Intersite Clan Graphs In Proceedings of the ACM SIGCHI 98 Conference on Human Factors in Computing Systems, pages , Los Angeles, Ca, April 1998 Author s Current Affiliation: Icarian Inc, 555 North Mathilda Avenue, Sunnyvale, Ca 94086, USA sougata@icariancom 16 M Hersovici, M Jacovi, Y Maarek, D Pelleg, M Shtalheim, and S Ur The Shark-Search Algorithm 141

Abstract. 1. Introduction

Abstract. 1. Introduction A Visualization System using Data Mining Techniques for Identifying Information Sources on the Web Richard H. Fowler, Tarkan Karadayi, Zhixiang Chen, Xiaodong Meng, Wendy A. L. Fowler Department of Computer

More information

Information Visualization for Hypermedia Systems

Information Visualization for Hypermedia Systems Information Visualization for Hypermedia Systems Sougata Mukherjea C&C Research Labs, NEC USA, 110 Rio Robles, San Jose, Ca 95134, USA Email: sougata@ccrl.sj.nec.com Abstract: Information Visualization

More information

Using Text Elements by Context to Display Search Results in Information Retrieval Systems Model and Research results

Using Text Elements by Context to Display Search Results in Information Retrieval Systems Model and Research results Using Text Elements by Context to Display Search Results in Information Retrieval Systems Model and Research results Offer Drori SHAAM Information Systems The Hebrew University of Jerusalem offerd@ {shaam.gov.il,

More information

Dynamic Visualization of Hubs and Authorities during Web Search

Dynamic Visualization of Hubs and Authorities during Web Search Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American

More information

Finding Neighbor Communities in the Web using Inter-Site Graph

Finding Neighbor Communities in the Web using Inter-Site Graph Finding Neighbor Communities in the Web using Inter-Site Graph Yasuhito Asano 1, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 1 Graduate School of Information Sciences, Tohoku University

More information

Evaluation Methods for Focused Crawling

Evaluation Methods for Focused Crawling Evaluation Methods for Focused Crawling Andrea Passerini, Paolo Frasconi, and Giovanni Soda DSI, University of Florence, ITALY {passerini,paolo,giovanni}@dsi.ing.unifi.it Abstract. The exponential growth

More information

Link Analysis in Web Information Retrieval

Link Analysis in Web Information Retrieval Link Analysis in Web Information Retrieval Monika Henzinger Google Incorporated Mountain View, California monika@google.com Abstract The analysis of the hyperlink structure of the web has led to significant

More information

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites An Empirical Evaluation of User Interfaces for Topic Management of Web Sites Brian Amento 1,2, Will Hill 1, Loren Terveen 1, Deborah Hix 2, and Peter Ju 1 1 AT&T Labs - Research 180 Park Avenue, P.O. Box

More information

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM Masahito Yamamoto, Hidenori Kawamura and Azuma Ohuchi Graduate School of Information Science and Technology, Hokkaido University, Japan

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

An Improved Computation of the PageRank Algorithm 1

An Improved Computation of the PageRank Algorithm 1 An Improved Computation of the PageRank Algorithm Sung Jin Kim, Sang Ho Lee School of Computing, Soongsil University, Korea ace@nowuri.net, shlee@computing.ssu.ac.kr http://orion.soongsil.ac.kr/ Abstract.

More information

TIC: A Topic-based Intelligent Crawler

TIC: A Topic-based Intelligent Crawler 2011 International Conference on Information and Intelligent Computing IPCSIT vol.18 (2011) (2011) IACSIT Press, Singapore TIC: A Topic-based Intelligent Crawler Hossein Shahsavand Baghdadi and Bali Ranaivo-Malançon

More information

Competitive Intelligence and Web Mining:

Competitive Intelligence and Web Mining: Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction

More information

Integrating Content Search with Structure Analysis for Hypermedia Retrieval and Management

Integrating Content Search with Structure Analysis for Hypermedia Retrieval and Management Integrating Content Search with Structure Analysis for Hypermedia Retrieval and Management Wen-Syan Li and K. Selçuk Candan C&C Research Laboratories,, NEC USA Inc. 110 Rio Robles, M/S SJ100, San Jose,

More information

arxiv:cs/ v1 [cs.ir] 26 Apr 2002

arxiv:cs/ v1 [cs.ir] 26 Apr 2002 Navigating the Small World Web by Textual Cues arxiv:cs/0204054v1 [cs.ir] 26 Apr 2002 Filippo Menczer Department of Management Sciences The University of Iowa Iowa City, IA 52242 Phone: (319) 335-0884

More information

Providing Interactive Site Ma ps for Web Navigation

Providing Interactive Site Ma ps for Web Navigation Providing Interactive Site Ma ps for Web Navigation Wei Lai Department of Mathematics and Computing University of Southern Queensland Toowoomba, QLD 4350, Australia Jiro Tanaka Institute of Information

More information

Web Crawling As Nonlinear Dynamics

Web Crawling As Nonlinear Dynamics Progress in Nonlinear Dynamics and Chaos Vol. 1, 2013, 1-7 ISSN: 2321 9238 (online) Published on 28 April 2013 www.researchmathsci.org Progress in Web Crawling As Nonlinear Dynamics Chaitanya Raveendra

More information

Visualizing Search Results using SQWID

Visualizing Search Results using SQWID Visualizing Search Results using SQWID D. Scott McCrickard & Colleen M. Kehoe Graphics, Visualization, and Usability Center College of Computing Georgia Institute of Technology Atlanta, GA 30332 {mccricks,colleen}@cc.gatech.edu

More information

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Focused crawling: a new approach to topic-specific Web resource discovery. Authors Focused crawling: a new approach to topic-specific Web resource discovery Authors Soumen Chakrabarti Martin van den Berg Byron Dom Presented By: Mohamed Ali Soliman m2ali@cs.uwaterloo.ca Outline Why Focused

More information

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites An Empirical Evaluation of User Interfaces for Topic Management of Web Sites Brian Amento AT&T Labs - Research 180 Park Avenue, P.O. Box 971 Florham Park, NJ 07932 USA brian@research.att.com ABSTRACT Topic

More information

Topical Locality in the Web

Topical Locality in the Web Topical Locality in the Web Brian D. Davison Department of Computer Science Rutgers, The State University of New Jersey New Brunswick, NJ 893 USA davison@cs.rutgers.edu Abstract Most web pages are linked

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

Visualizing Complex Hypermedia Networks through Multiple Hierarchical Views

Visualizing Complex Hypermedia Networks through Multiple Hierarchical Views Visualizing Complex Hypermedia Networks through Multiple Hierarchical Views Sougata Mukherjea, James D. Foley, Scott Hudson Graphics, Visualization & Usability Center College of Computing Georgia Institute

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

WebKIV: Visualizing Structure and Navigation for Web Mining Applications

WebKIV: Visualizing Structure and Navigation for Web Mining Applications WebKIV: Visualizing Structure and Navigation for Web Mining Applications Yonghe Niu, Tong Zheng, Jiyang Chen, Randy Goebel Department of Computing Science University of Alberta Edmonton, Alberta, Canada

More information

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh

More information

Navigating Large Hierarchical Space Using Invisible Links

Navigating Large Hierarchical Space Using Invisible Links Navigating Large Hierarchical Space Using Invisible Links Ming C. Hao, Meichun Hsu, Umesh Dayal, Adrian Krug* Software Technology Laboratory HP Laboratories Palo Alto HPL-2000-8 January, 2000 E-mail:(mhao,

More information

TopicShop: Enhanced Support for Evaluating and Organizing Collections of Web Sites

TopicShop: Enhanced Support for Evaluating and Organizing Collections of Web Sites TopicShop: Enhanced Support for Evaluating and Organizing Collections of Web Sites Brian Amento 1,2, Loren Terveen 1, Will Hill 1, and Deborah Hix 2 AT&T Labs - Research 1 180 Park Avenue, P.O. Box 971

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

DESIGN OF CATEGORY-WISE FOCUSED WEB CRAWLER

DESIGN OF CATEGORY-WISE FOCUSED WEB CRAWLER DESIGN OF CATEGORY-WISE FOCUSED WEB CRAWLER Monika 1, Dr. Jyoti Pruthi 2 1 M.tech Scholar, 2 Assistant Professor, Department of Computer Science & Engineering, MRCE, Faridabad, (India) ABSTRACT The exponential

More information

Web. The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm

Web. The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm Markov Cluster Algorithm Web Web Web Kleinberg HITS Web Web HITS Web Markov Cluster Algorithm ( ) Web The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm Kazutami KATO and Hiroshi

More information

Automatically Determining Semantics for World Wide Web Multimedia Information Retrieval

Automatically Determining Semantics for World Wide Web Multimedia Information Retrieval Journal of Visual Languages and Computing (1999) 10, 585}606 Article No. jvlc.1999.0147, available online at http://www.idealibrary.com on Automatically Determining Semantics for World Wide Web Multimedia

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Survey on Web Structure Mining

Survey on Web Structure Mining Survey on Web Structure Mining Hiep T. Nguyen Tri, Nam Hoai Nguyen Department of Electronics and Computer Engineering Chonnam National University Republic of Korea Email: tuanhiep1232@gmail.com Abstract

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

An Introduction to Search Engines and Web Navigation

An Introduction to Search Engines and Web Navigation An Introduction to Search Engines and Web Navigation MARK LEVENE ADDISON-WESLEY Ал imprint of Pearson Education Harlow, England London New York Boston San Francisco Toronto Sydney Tokyo Singapore Hong

More information

Information Retrieval. Lecture 9 - Web search basics

Information Retrieval. Lecture 9 - Web search basics Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

BIBL NEEDS REVISION INTRODUCTION

BIBL NEEDS REVISION INTRODUCTION BIBL NEEDS REVISION FCSM Statistical Policy Seminar Session Title: Integrating Electronic Systems for Disseminating Statistics Paper Title: Interacting with Tabular Data Through the World Wide Web Paper

More information

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about:

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about: WWW and Web Browser 6.0 Introduction WWW stands for World Wide Web. WWW is a collection of interlinked hypertext pages on the Internet. Hypertext is text that references some other information that can

More information

2 Approaches to worldwide web information retrieval

2 Approaches to worldwide web information retrieval The WEBFIND tool for finding scientific papers over the worldwide web. Alvaro E. Monge and Charles P. Elkan Department of Computer Science and Engineering University of California, San Diego La Jolla,

More information

From Paragraph Networks to Document Networks

From Paragraph Networks to Document Networks Proceedings of the International Conference on Information Technology (ITCC 24), Las Vegas, April 5-7, 24. From Paragraph Networks to Document Networks Jinghao Miao and Daniel Berleant Dept. of Electrical

More information

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Mukesh Kumar and Renu Vig University Institute of Engineering and Technology, Panjab University, Chandigarh,

More information

Link Based Clustering of Web Search Results

Link Based Clustering of Web Search Results Link Based Clustering of Web Search Results Yitong Wang and Masaru Kitsuregawa stitute of dustrial Science, The University of Tokyo {ytwang, kitsure}@tkl.iis.u-tokyo.ac.jp Abstract. With information proliferation

More information

Authoritative Sources in a Hyperlinked Environment

Authoritative Sources in a Hyperlinked Environment Authoritative Sources in a Hyperlinked Environment Journal of the ACM 46(1999) Jon Kleinberg, Dept. of Computer Science, Cornell University Introduction Searching on the web is defined as the process of

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS Satwinder Kaur 1 & Alisha Gupta 2 1 Research Scholar (M.tech

More information

Using Clusters on the Vivisimo Web Search Engine

Using Clusters on the Vivisimo Web Search Engine Using Clusters on the Vivisimo Web Search Engine Sherry Koshman and Amanda Spink School of Information Sciences University of Pittsburgh 135 N. Bellefield Ave., Pittsburgh, PA 15237 skoshman@sis.pitt.edu,

More information

second_language research_teaching sla vivian_cook language_department idl

second_language research_teaching sla vivian_cook language_department idl Using Implicit Relevance Feedback in a Web Search Assistant Maria Fasli and Udo Kruschwitz Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, United Kingdom fmfasli

More information

Interactive Campaign Planning for Marketing Analysts

Interactive Campaign Planning for Marketing Analysts Interactive Campaign Planning for Marketing Analysts Fan Du University of Maryland College Park, MD, USA fan@cs.umd.edu Sana Malik Adobe Research San Jose, CA, USA sana.malik@adobe.com Eunyee Koh Adobe

More information

Data Mining and Visualization Engine Based on Link Structure of WWW

Data Mining and Visualization Engine Based on Link Structure of WWW Data Mining and Visualization Engine Based on Link Structure of WWW Technical Report April 17, 2000 Dr. Richard H. Fowler and Tarkan Karadayi Computer Science Department University of Texas - Pan American

More information

E-Business s Page Ranking with Ant Colony Algorithm

E-Business s Page Ranking with Ant Colony Algorithm E-Business s Page Ranking with Ant Colony Algorithm Asst. Prof. Chonawat Srisa-an, Ph.D. Faculty of Information Technology, Rangsit University 52/347 Phaholyothin Rd. Lakok Pathumthani, 12000 chonawat@rangsit.rsu.ac.th,

More information

Conclusions. Chapter Summary of our contributions

Conclusions. Chapter Summary of our contributions Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web

More information

Lecture 17 November 7

Lecture 17 November 7 CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has

More information

Daniel Berleant Dept. of Electrical and Computer Engineering Iowa State University Ames, IA 50011, USA

Daniel Berleant Dept. of Electrical and Computer Engineering Iowa State University Ames, IA 50011, USA A GRAPH BASED APPROACH TO COMPARING SEARCH ENGINES Jinghao Miao Dept. of Electrical and Computer Engineering Iowa State University Ames, IA 511, USA jhmiao@iastate.edu Daniel Berleant Dept. of Electrical

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Exploration versus Exploitation in Topic Driven Crawlers

Exploration versus Exploitation in Topic Driven Crawlers Exploration versus Exploitation in Topic Driven Crawlers Gautam Pant @ Filippo Menczer @ Padmini Srinivasan @! @ Department of Management Sciences! School of Library and Information Science The University

More information

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW ISSN: 9 694 (ONLINE) ICTACT JOURNAL ON COMMUNICATION TECHNOLOGY, MARCH, VOL:, ISSUE: WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW V Lakshmi Praba and T Vasantha Department of Computer

More information

Dynamic Aggregation to Support Pattern Discovery: A case study with web logs

Dynamic Aggregation to Support Pattern Discovery: A case study with web logs Dynamic Aggregation to Support Pattern Discovery: A case study with web logs Lida Tang and Ben Shneiderman Department of Computer Science University of Maryland College Park, MD 20720 {ltang, ben}@cs.umd.edu

More information

Using a Painting Metaphor to Rate Large Numbers

Using a Painting Metaphor to Rate Large Numbers 1 Using a Painting Metaphor to Rate Large Numbers of Objects Patrick Baudisch Integrated Publication and Information Systems Institute (IPSI) German National Research Center for Information Technology

More information

Creating a Classifier for a Focused Web Crawler

Creating a Classifier for a Focused Web Crawler Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.

More information

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES Fidel Cacheda, Alberto Pan, Lucía Ardao, Angel Viña Department of Tecnoloxías da Información e as Comunicacións, Facultad

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

World Wide Web has specific challenges and opportunities

World Wide Web has specific challenges and opportunities 6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has

More information

FILTERING OF URLS USING WEBCRAWLER

FILTERING OF URLS USING WEBCRAWLER FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,

More information

A World Wide Web-based HCI-library Designed for Interaction Studies

A World Wide Web-based HCI-library Designed for Interaction Studies A World Wide Web-based HCI-library Designed for Interaction Studies Ketil Perstrup, Erik Frøkjær, Maria Konstantinovitz, Thorbjørn Konstantinovitz, Flemming S. Sørensen, Jytte Varming Department of Computing,

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

SCHULICH MEDICINE & DENTISTRY Website Updates August 30, Administrative Web Editor Guide v6

SCHULICH MEDICINE & DENTISTRY Website Updates August 30, Administrative Web Editor Guide v6 SCHULICH MEDICINE & DENTISTRY Website Updates August 30, 2012 Administrative Web Editor Guide v6 Table of Contents Chapter 1 Web Anatomy... 1 1.1 What You Need To Know First... 1 1.2 Anatomy of a Home

More information

Finding Context Paths for Web Pages

Finding Context Paths for Web Pages Finding Context Paths for Web Pages Yoshiaki Mizuuchi Keishi Tajima Department of Computer and Systems Engineering Kobe University, Japan ( Currently at NTT Data Corporation) Background (1/3) Aceess to

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

Objectives. Object-Oriented Analysis and Design with the Unified Process 2

Objectives. Object-Oriented Analysis and Design with the Unified Process 2 Objectives Understand the differences between user interfaces and system interfaces Explain why the user interface is the system to the users Discuss the importance of the three principles of user-centered

More information

Visualising File-Systems Using ENCCON Model

Visualising File-Systems Using ENCCON Model Visualising File-Systems Using ENCCON Model Quang V. Nguyen and Mao L. Huang Faculty of Information Technology University of Technology, Sydney, Australia quvnguye@it.uts.edu.au, maolin@it.uts.edu.au Abstract

More information

A Hierarchical Web Page Crawler for Crawling the Internet Faster

A Hierarchical Web Page Crawler for Crawling the Internet Faster A Hierarchical Web Page Crawler for Crawling the Internet Faster Anirban Kundu, Ruma Dutta, Debajyoti Mukhopadhyay and Young-Chon Kim Web Intelligence & Distributed Computing Research Lab, Techno India

More information

Information Gathering Support Interface by the Overview Presentation of Web Search Results

Information Gathering Support Interface by the Overview Presentation of Web Search Results Information Gathering Support Interface by the Overview Presentation of Web Search Results Takumi Kobayashi Kazuo Misue Buntarou Shizuki Jiro Tanaka Graduate School of Systems and Information Engineering

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

A Parallel Computing Architecture for Information Processing Over the Internet

A Parallel Computing Architecture for Information Processing Over the Internet A Parallel Computing Architecture for Information Processing Over the Internet Wendy A. Lawrence-Fowler, Xiannong Meng, Richard H. Fowler, Zhixiang Chen Department of Computer Science, University of Texas

More information

An Improved PageRank Method based on Genetic Algorithm for Web Search

An Improved PageRank Method based on Genetic Algorithm for Web Search Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 2983 2987 Advanced in Control Engineeringand Information Science An Improved PageRank Method based on Genetic Algorithm for Web

More information

Word Disambiguation in Web Search

Word Disambiguation in Web Search Word Disambiguation in Web Search Rekha Jain Computer Science, Banasthali University, Rajasthan, India Email: rekha_leo2003@rediffmail.com G.N. Purohit Computer Science, Banasthali University, Rajasthan,

More information

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 1, January 2014,

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures Anatomy of a search engine Design criteria of a search engine Architecture Data structures Step-1: Crawling the web Google has a fast distributed crawling system Each crawler keeps roughly 300 connection

More information

Next-Generation Standards Management with IHS Engineering Workbench

Next-Generation Standards Management with IHS Engineering Workbench ENGINEERING & PRODUCT DESIGN Next-Generation Standards Management with IHS Engineering Workbench The addition of standards management capabilities in IHS Engineering Workbench provides IHS Standards Expert

More information

Review: Searching the Web [Arasu 2001]

Review: Searching the Web [Arasu 2001] Review: Searching the Web [Arasu 2001] Gareth Cronin University of Auckland gareth@cronin.co.nz The authors of Searching the Web present an overview of the state of current technologies employed in the

More information

A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET

A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET Shervin Daneshpajouh, Mojtaba Mohammadi Nasiri¹ Computer Engineering Department, Sharif University of Technology, Tehran, Iran daneshpajouh@ce.sharif.edu,

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

IHS Engineering Workbench 1.3

IHS Engineering Workbench 1.3 IHS Engineering Workbench 1.3 Important enhancements and updates in the latest release of the IHS Markit solution for technical knowledge discovery and management Introducing expanded capabilities for

More information

Basic Internet Skills

Basic Internet Skills The Internet might seem intimidating at first - a vast global communications network with billions of webpages. But in this lesson, we simplify and explain the basics about the Internet using a conversational

More information

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 398 Web Usage Mining has Pattern Discovery DR.A.Venumadhav : venumadhavaka@yahoo.in/ akavenu17@rediffmail.com

More information

Appendix A: Scenarios

Appendix A: Scenarios Appendix A: Scenarios Snap-Together Visualization has been used with a variety of data and visualizations that demonstrate its breadth and usefulness. Example applications include: WestGroup case law,

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information