Finding Context Paths for Web Pages

Similar documents
Cut as a Querying Unit for WWW, Netnews, and

Title: Artificial Intelligence: an illustration of one approach.

Module 6: Pinhole camera model Lecture 32: Coordinate system conversion, Changing the image/world coordinate system

Discovery and Retrieval of Logical Information Units in Web

Query Evaluation in Wireless Sensor Networks

New Perspectives on Creating Web Pages with HTML. Tutorial Objectives

PatternRank: A Software-Pattern Search System Based on Mutual Reference Importance

Developing a Basic Web Site

MURDOCH RESEARCH REPOSITORY

Full-Text and Structural XML Indexing on B + -Tree

Extracting Representative Image from Web Page

Extracting Representative Image from Web Page

New Perspectives on Creating Web Pages with HTML. Adding Hypertext Links to a Web Page

QoS-Aware Hierarchical Multicast Routing on Next Generation Internetworks

Trees. Tree Structure Binary Tree Tree Traversals

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about:

Oregon State University Practice problems 2 Winster Figure 1: The original B+ tree. Figure 2: The B+ tree after the changes in (1).

Routability-Driven Bump Assignment for Chip-Package Co-Design

Authoritative Sources in a Hyperlinked Environment

STRUCTURE-BASED QUERY EXPANSION FOR XML SEARCH ENGINE

HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery

Chapter 10: Trees. A tree is a connected simple undirected graph with no simple circuits.

Program generation. Creation date: June 2015 Used Version: 12.0; Standard CAM Installation mit einer CNC Weeke Venture Maschine

Information Discovery, Extraction and Integration for the Hidden Web

Chapter 27 Introduction to Information Retrieval and Web Search

Finding Neighbor Communities in the Web using Inter-Site Graph

2) SQL includes a data definition language, a data manipulation language, and SQL/Persistent stored modules. Answer: TRUE Diff: 2 Page Ref: 36

Physical Level of Databases: B+-Trees

Improving Relevance Prediction for Focused Web Crawlers

MURDOCH RESEARCH REPOSITORY

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach

Extraction of Semantic Text Portion Related to Anchor Link

1. The SIZE attribute can be used in which of the following elements? a. FONT b. IFRAME c. SPAN d. STYLE Answer: a Page 86

Crawler with Search Engine based Simple Web Application System for Forum Mining

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

Identification of Navigational Paths of Users Routed through Proxy Servers for Web Usage Mining

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

A New Technique to Optimize User s Browsing Session using Data Mining

Mining Web Data. Lijun Zhang

Alignment Results of SOBOM for OAEI 2009

CS 240 Fall Mike Lam, Professor. Priority Queues and Heaps

Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach +

NOTES ON OBJECT-ORIENTED MODELING AND DESIGN

Configuring Traffic Policies

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

Web Mechanisms. Draft: 2/23/13 6:54 PM 2013 Christopher Vickery

More about HTML. Digging in a little deeper

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining

COMPUTER NETWORK PERFORMANCE. Gaia Maselli Room: 319

Measuring Similarity to Detect

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Web Page Recommender System based on Folksonomy Mining for ITNG 06 Submissions

Test Bank for Database Processing Fundamentals Design and Implementation 13th Edition by Kroenke

B-Trees. Disk Storage. What is a multiway tree? What is a B-tree? Why B-trees? Insertion in a B-tree. Deletion in a B-tree

5. MULTIPLE LEVELS AND CROSS LEVELS ASSOCIATION RULES UNDER CONSTRAINTS

quiz heapsort intuition overview Is an algorithm with a worst-case time complexity in O(n) data structures and algorithms lecture 3

Lowering the Error Floors of Irregular High-Rate LDPC Codes by Graph Conditioning

Analytical survey of Web Page Rank Algorithm

An undirected graph is a tree if and only of there is a unique simple path between any 2 of its vertices.

PATH FINDING AND GRAPH TRAVERSAL

Web page recommendation using a stochastic process model

A Component Retrieval Tree Matching Algorithm Based on a Faceted Classification Scheme

Similarity Image Retrieval System Using Hierarchical Classification

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Building Intelligent Learning Database Systems

Proxy Server Systems Improvement Using Frequent Itemset Pattern-Based Techniques

The Results of Falcon-AO in the OAEI 2006 Campaign

doi: / _32

International ejournals

IMPROVED A* ALGORITHM FOR QUERY OPTIMIZATION

Automatic Identification of User Goals in Web Search [WWW 05]

Enhancing Wrapper Usability through Ontology Sharing and Large Scale Cooperation

Distributed minimum spanning tree problem

Whitepaper US SEO Ranking Factors 2012

Customer Group Catalog for Magento 2

Chapter 3 Style Sheets: CSS

Binary Trees. Reading: Lewis & Chase 12.1, 12.3 Eck Programming Course CL I

Principles of Algorithm Design

1/27/2013. Outline. Basic Links. Links and Navigations INTRODUCTION TO WEB DEVELOPMENT AND HTML

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

University of Waterloo Midterm Examination Solution

visualstate Reference Guide

Understanding this structure is pretty straightforward, but nonetheless crucial to working with HTML, CSS, and JavaScript.

A Firewall Application Using Binary Decision Diagram

EMV Contactless Specifications for Payment Systems

= a hypertext system which is accessible via internet

Deploying the BIG-IP LTM system and Microsoft Windows Server 2003 Terminal Services

Depth-first search in undirected graphs What parts of the graph are reachable from a given vertex?

EECS 583 Class 3 Region Formation, Predicated Execution

Heading-aware Snippet Generation for Web Search

Skill Area 323: Design and Develop Website. Multimedia and Web Design (MWD)

GPRS Tunneling Protocol V2 Support

Best Practices for Content Development

Searching non-text information objects

Automatic Generation of Wrapper for Data Extraction from the Web

Data Structures and Algorithms (DSA) Course 12 Trees & sort. Iulian Năstac

Cascade V8.4 Website Content Management for the Site Manager UMSL

Answer All Questions. All Questions Carry Equal Marks. Time: 20 Min. Marks: 10.

Transcription:

Finding Context Paths for Web Pages Yoshiaki Mizuuchi Keishi Tajima Department of Computer and Systems Engineering Kobe University, Japan ( Currently at NTT Data Corporation)

Background (1/3) Aceess to Web Pages There are three ways to access Web pages: 1. direct access by known URLs, 2. navigation from other pages, and 3. content-based access via search engines. currently, very important! However, many page authors assume only 1 and 2. 1

Background (2/3) Structure of Web Sites Many page authors set up standard routes to each page. Most Web sites are consisting of: entrance pages for direct access, other pages, hierarchical paths from entrance pages to other pages, and other links. Most Web sites have hierarchical standard routes. 2

Background (3/3) Omission of Already Known Information A Page author sometimes: assumes every reader comes through the same path, assumes they already know information written on that path, and omit such information in the current page. E.g.: The phrase Database Group is omitted in this page. 3

Problem Sometimes Web pages are not self-contained. Indexes only with the words in each page itself are sometimes incomplete. Queries with multiple keywords sometimes fail to find appropriate pages. E.g.: Query { Database Group, Publication List } cannot retreive the page in the last example. 4

Goal of This Research In order to complement incomplete indexes: we detect paths assumed by page authors, and we extract appropriate keywords from the pages on the paths. We call such a path entrance path or context path. The main goal of this research is: to develop a method of automatically detecting an entrance path for each page. 5

Related Work (1/3) Subgraph-based Retrieval of Web Pages Our previous research [ACM Hypertext 98] We detect documents consisting of series of connected Web pages, and "hypertext" use those documents as a unit of query. We can retrieve documents including given multiple keywords even if they appear in different pages. There are other kind of cases where query with multiple keywords fails. "query" 6

Related Work (2/3) Indexing with keywords in anchors [Li 98] They use words in an anchor of a link for indexing its destination page. We extract words from both direct and indirect ancestor pages while they consider only direct parent pages 7

Related Work (3/3) Structual Queries Many researches have proposed structural queries like: select p where appear( Database Group, p), appear( Publication List, p ), p p They may retrieve pairs of unrelated pages connected through very long links. 8

Our Basic Strategy To detect entrance paths, we use following information: directory structure encoded in URL file names (currently, only index.* ) graph structure We determine entrance paths by using heuristics based on those information. 9

Preliminaries (1/3) Link Patterns We classify links into 5 link patterns: intradirectory links: between pages in the same directory downward links: from pages in a directory to pages in its direct/indirect subdirectory upward links: from pages in a directory to pages in its ancestor directory sibling links: between pages in incomparable directories on the same site intersite links: between pages on different sites 10

Preliminaries (2/3) Link Roles Based on link patterns and other information, we also classify links into 3 link roles: entrance links: constituting the standard routes to their destination pages. back links: going back to pages on the entrance paths jump links: the others 11

Preliminaries (3/3) Basic Assumptions We adopt the following assumptions for simplicity: 1. Every page has at most one entrance path. We cannot perfectly determine entrance paths. Instead, we select one most likely path for each page. 2. Entrance paths are concatenation of entrance links. This will greatly reduce the computation cost. Discovery of entrance paths can be reduced to discovery of an entrance link. 12

Discovery of Entrance Links (1/6) Overview We determine the entrance link for each page in the following steps: 1. collect all incoming links, 2. exclude those that cannot be the entrance link, 3. select the most likely one, and 4. examine whether that is really the entrance link or not. 13

Discovery of Entrance Links (2/6) Step 2: Excluded Cases Those are never entrance links, and are excluded in Step 2: upward links, and intradirectory links to a page named index.*. 14

Discovery of Entrance Linss (3/6) Step 3: Priority for Link Selection In step 3, first we use the following priority orders: 1. downward links have the highest priority, among them, links from upper directory have higher priority 2. intradirectory links next, among them, links from index.* have higher priority 3. sibling links next, and 4. intersite links have the lowest priority 15

Discovery of Entrance Links (4/6) Step 3: Priority for Link Selection (cont d) For candidates with equal priority, we select one so that there would be the largest number of back links over that link. However, this produces mutual dependency among the decision of entrance links of pages. 16

Discovery of Entrance Links (5/6) Step 3: Priority for Link Selection (cont d) To avoid that mutual dependency, we use downward links instead of entrance links. This method: can be computed easily because downward links can be determined independently, and can approximate the previous method 17

Discovery of Entrance Links (6/6) Step 4: Examining the Selected Candidate We examine whether the selected one is really the entrance link by the following criteria: if the selected one is not a intersile link, = YES if the selected one is an intersite link, if the number of (pseudo) back links is greater than a threshold = YES if it is less than a threshold = NO (But we need some better method.) 18

Experimental Results (1/2) Precision Ratio We applied our method to our research group Web site with: 840 pages, 1400 links within the site, and 2200 links to other Web sites, and find correct entrance paths for 78% pages. However, the precision ratio greatly changes depending on the style of page organization of each Web site. 19

Experimental Results (2/2) Failing Cases directories with no index.* and no incoming downard links. (about 60 pages) 3 documents each of which is organized into a series of pages in the same directory and connected via next and previous links. We cannot distinguish next links and previous links. (about 120 pages) entrance pages, for which our method deterimne some links as their entrance links. (7 pages) 20

Conclusion We show the first step for discovery of entrance paths for Web pages. However, discovery of entrance paths heavily depends on the way of page organization, and there can be no unique general solution. Therefore, we need to determine the organization style of each Web site, determine the organization style of each group of Web pages, and adopt a method appropriate to that style. Currently, we are studying this approach. 21