Searching the Web What is this Page Known for? Luis De Alba

Similar documents
Searching the Web [Arasu 01]

Mining Web Data. Lijun Zhang

Review: Searching the Web [Arasu 2001]

Information Retrieval Spring Web retrieval

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

Mining Web Data. Lijun Zhang

Link Analysis and Web Search

Address: Computer Science Department, Stanford University, Stanford, CA

COMP5331: Knowledge Discovery and Data Mining

Information Retrieval May 15. Web retrieval

Effective Page Refresh Policies for Web Crawlers

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

COMP 4601 Hubs and Authorities

World Wide Web has specific challenges and opportunities

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Information Retrieval. hussein suleman uct cs

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Web Structure Mining using Link Analysis Algorithms

Information Retrieval. Lecture 11 - Link analysis

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Link Analysis in Web Mining

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho,

Social Networks 2015 Lecture 10: The structure of the web and link analysis

SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES

Lecture 8: Linkage algorithms and web search

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University

A Survey on Web Information Retrieval Technologies

DATA MINING - 1DL105, 1DL111

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

DATA MINING II - 1DL460. Spring 2014"

Information Networks: PageRank

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Chapter 6: Information Retrieval and Web Search. An introduction

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Link Analysis in Web Information Retrieval

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Bruno Martins. 1 st Semester 2012/2013

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

Information Retrieval and Web Search

LIST OF ACRONYMS & ABBREVIATIONS

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

CS47300: Web Information Search and Management

Information Retrieval. Lecture 10 - Web crawling

Title: Artificial Intelligence: an illustration of one approach.

Information Retrieval

CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA

Architecture of A Scalable Dynamic Parallel WebCrawler with High Speed Downloadable Capability for a Web Search Engine

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Advanced Crawling Techniques. Outline. Web Crawler. Chapter 6. Selective Crawling Focused Crawling Distributed Crawling Web Dynamics

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Chapter 2: Literature Review

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

How to organize the Web?

Parallel Crawlers. Junghoo Cho University of California, Los Angeles. Hector Garcia-Molina Stanford University.

CS6200 Information Retreival. The WebGraph. July 13, 2015

An Introduction to Search Engines and Web Navigation

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Searching the Web for Information

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Exploring both Content and Link Quality for Anti-Spamming

Lecture 17 November 7

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Local Methods for Estimating PageRank Values

Crawling CE-324: Modern Information Retrieval Sharif University of Technology

Recent Researches on Web Page Ranking

Estimating Page Importance based on Page Accessing Frequency

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016

Web Crawling As Nonlinear Dynamics

Information Retrieval

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

COMP Page Rank

Chapter 27 Introduction to Information Retrieval and Web Search

A P2P-based Incremental Web Ranking Algorithm

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Brief (non-technical) history

INTRODUCTION. Chapter GENERAL

THE WEB SEARCH ENGINE

Experimental study of Web Page Ranking Algorithms

What is this Page Known for? Computing Web Page. Abstract. The textual content of the Web enriched with the hyperlink structure surrounding

Part 1: Link Analysis & Page Rank

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Search Engines. Information Retrieval in Practice

CS47300: Web Information Search and Management

PROJECT REPORT (Final Year Project ) Project Supervisor Mrs. Shikha Mehta

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

SEARCH ENGINE INSIDE OUT

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

An Adaptive Approach in Web Search Algorithm

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Transcription:

Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi

Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University

Introduction People browse the Web using entry Points or using a Search Engine (many) The Web is Massive, No Coherent, Changes rapidly and it its geographically distributed. Over 8 billion pages. In.com domain 40% pages expected to change daily.

Introduction Studies aim to Web s linkage structure and how it can be modeled. Web is somewhat like a bow tie. 28% Core 22% Reach from Core 22% Reach Core

Search Engine How a Web search engine is typically put together.

Crawler Crawlers are programs that browse the Web on the search engine s behalf. Crawl Control module: to keep crawlers working and in which way.

Indexer & Repository Indexer: Extracts words from each page and records URLs. Repository: Collection (temporary) of retrieved pages.

Query Engine & Ranking Query E: Receives and fills Search Request from Users. Ranking: Due to Web size results are very large, hence the ranking will sort them.

Modules Crawling Storage Indexing Ranking

Crawling Start with an initial Set of URL s URL -X- -Y- -Z- Go for Page X Web URL -Y- -X1- -Z- -X3- -X5- URL -Z- -X3- -Y1- -X5- -X- Page Selection and Refreshing methods

Crawling What pages to Download? Not all, only important ones, prioritizing the Queue. Refreshing pages. Download pages then revisit to update if changed. Impact on freshness. Load on the visited Web sites. Consuming resources belonging to others.

Crawling Page Selection Importance Metrics: good pages to visit. Interest Driven: Similar words in Page and Query. Relationship between how many times the Word appear in the Web and in the Page. (Web size). Popularity Driven: Links that point to Page P from any other Page P (Web size). Location Driven: URL, fewer slashes,.com

Crawling Page Selection Crawler Models: visiting mainly highimportance pages. Crawl & Stop: Start at Page P and stop after K Pages. Some may be of high Importance. Crawl & Stop + Threshold: T is Importance target. Only accept above/equal T. Ordering Metrics: order URLs in queue due to importance.

Crawling Refresh Pages are maintained up-to-date Freshness Metric: Local page vs. real world counterpart. Collection of Pages calculations: Freshness: how fresh the collection is. Age: how old the collection is.

Crawling Refresh Refresh Strategy Uniform or Proportional refresh policy. Available resources. What page to refresh? Poisson process.

Storage Key challenges Scalability Storage Storage Dual Access: Random Page Collection of Pages Obsolete Updates

Storage Design Page distribution: to which node to assign. Uniform Distribution: all nodes are treated identically, page can go to any node. Hash Distribution: page allocation depends on page identifiers.

Storage Design Physical Page Organization: operations to be executed, addition / streaming / random page access. Hashed organization based on identifiers. Log structures with B-tree index of locations Hash-Log

Storage Design Update Strategies: dependant to crawler characteristics. Batch Mode Crawler, some day some time. Steady Crawler, runs with no pause. Partial/Complete Crawls, specific set of pages or sites. Shadowing: cache and then update

Indexing Process: Memory Loading Processing (inverter) Flushing Storage Page Stream

Indexing Indexer module builds two indexes: Link Index: portion of the Web is modeled as a graph. Edge A to B represent a hyperlink. Given page P get incoming and outward links. (Web size) Text (content) Index: Primary method to identify pages relevant to a query. Inverted indexes, index structure choice of the Web.

Indexing Inverted Index Inverted list for a term is a sorted list of locations where the term appears. Location: Page Identifier & Position in the Page.

Indexing Partitioning How to add the inverted list? Local Inverted File, different nodes with different subsets of pages. Queries are broadcasted to all nodes. Global Inverted File, each server stores only a subset of terms. [a-m] => Node 1 [n-z] => Node 2

Indexing Threads Experiments showed that sequential index-builder is 30%-40% slower than pipelined one.

Indexing Statistics Statistics are often used to rank search results. Statistics can be computed by the indexing system. IDF inverse document frequency log(n/dfw) N pages in collection dfw pages where w occurs

Ranking Pages that contain the search terms may be of poor quality or not relevant. Web pages are not sufficiently selfdescriptive. Can be manipulated. Link Structure: If A links to B then author of A recommends B. At Global Level it is robust against spamming.

Ranking Page Rank Importance of a page. Importance of pages that point to A and Importance of pages that A points to. Recursive, Depends and Influences other pages. depends Page A influences

Ranking Page Rank Simple Page Rank: Assume that Web pages form a strongly connected graph. N(i) denotes number of outgoing links i B(i) denotes the set of pages that point to i r(i) denotes Page Rank of page i

Ranking Page Rank Practical Page Rank Web is far from strongly connected. Rank Sink, no links point outwards. Rank Leak, page with no links. Random Surfer will get stuck or lost. Remove the Leak nodes and add a decay factor d. Leak nodes will point back. Random Surfer jumping randomly (decay factor)

Ranking Page Rank Computational Issues. Important value is the Page Order given by the Page Rank no the Values of the Page Rank. Is not necessary to finish the iterations. Algorithm can be stopped when values start to converge.

Ranking HITS HITS, Hypertext Induced Topic Search Instead of global rank it is Query- Dependant. Produces two scores, Authority and Hubs. Authority pages are most likely to be relevant. Hub pages point to several authority pages.

Ranking HITS Algorithm Using the Query String Identify a small subgraph of the Web and search for Authorities and Hubs. Form a root set R and expand it to the pages in the neighborhood. Link Analysis, Authority Value = number of Hubs pointing to it. Hubs Value = number of Links pointing to Authorities.

Ranking HITS Algorithm Resulting set shall be rich in Authorities and Hubs. Authorities usually do not point to Authorities Toyota -> Honda

Conclusion Searching the Web is the basis for many tasks. Search Engines are being relied in extracting the required information with one or two input keywords. Audio, Video, Images, new challenges for search engines.

What is this Page Known for? Rafiei, Mendelzon 2000. University of Toronto

Introduction Objective: Given a Page/Site on what topics is this page considered an authority by the Web community? Page classification. What is a Page/Site about? How is a Page/Site perceived? What is a Person known for?

Related Work Methods: Page Rank. HITS, Authority and Hubs. Random Surfer. Difference Ranking respect to a topic instead of computing a universal rank.

Random Walks One-Level Influence Propagation: Jumps of the Random Surfer are forward. Pages with relatively high reputations on a topic are more likely to be visited by the RS searching for that topic. The number of visits of the RS depends on the pages on the same topic pointing to this one page and the reputation of those pages.

Random Walks Two-Level Influence Propagation: The Surfer has two choices in page p Transition out of page p or, randomly pick any page q that has a link to page p and make a transition out of page q Surfer can go Forward or Backward backward q p forward

Reputation of Pages Is not enough to use the terms and phrases that appear in a page. Some terms may not be explicitly on the page. How to: Start in page p Collect all terms that appear in it. Look at incoming links and collect terms. Stop when incoming links have small effect.

Experiments Known Authoritative Pages java.sun.com <Search> <Microsoft>

Experiments Personal Home Pages Don Knuth <Dilbert>

Experiments Computer Science Departments www.cs.helsinki.fi <Linux> <Linus> www.cs.toronto.edu <Russia><Hockey>

Conclusion Algorithms are working as expected but still work to do improving their TOPIC prototype.