Algoritmi per "Information Retrieval": Web Search
Prof. Paolo Ferragina


Goal of a Search Engine

Retrieve docs that are relevant for the user query.
  Doc: Word or PDF file, web page, email, blog, e-book, ...
  Query: the bag-of-words paradigm
Relevant?!?
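To make the bag-of-words paradigm concrete, here is a minimal Python sketch (illustrative, not from the course: the whitespace tokenizer and the overlap score are assumptions). Both the document and the query become multisets of terms, and relevance is approximated by how many query terms the document covers.

    from collections import Counter

    def bag_of_words(text):
        # Word order and grammar are discarded; only term multiplicities survive.
        return Counter(text.lower().split())

    doc   = bag_of_words("The crawler walks the Web and the indexer indexes pages")
    query = bag_of_words("web crawler")

    # Crude relevance proxy: count the query occurrences the document covers.
    overlap = sum(min(doc[t], c) for t, c in query.items())
    print(overlap)  # 2 -> both query terms appear in the document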

Two main difficulties

The Web: extracting significant data is difficult!!
  Size: more than tens of billions of pages
  Language and encodings: hundreds
  Distributed authorship: SPAM, format-less, ...
  Dynamic: in one year 35% survive, 20% untouched

The User: matching user needs is difficult!!
  Query composition: short (2.5 terms on average) and imprecise
  Query results: 85% of users look at just one result page
  Several needs: informational, navigational, transactional

Evolution of Search Engines

First generation: use only on-page, web-text data
  Word frequency and language
  1995-1997: AltaVista, Excite, Lycos, etc.

Second generation: use off-page, web-graph data
  Link (or connectivity) analysis
  Anchor text (how people refer to a page)
  1998: Google

Third generation: answer the need behind the query
  Focus on user need, rather than on the query
  Integrate multiple data sources
  Click-through data
  Google, Yahoo, MSN, ASK, ...

Fourth generation: Information Supply
  [Andrei Broder, VP emerging search tech, Yahoo! Research]


This is a search engine!!!

The structure of a Search Engine

The structure?

[Architecture diagram: the Crawler feeds the Page archive; the Page analyzer and Indexer build the text and auxiliary data structures; at query time, the Query resolver and Ranker answer the query; a Control module coordinates the components.]

Crawling (spidering): 24h, 7 days, walking over a graph.

What about the graph? The BowTie.
Directed graph G = (N, E)
  N changes (insert, delete): more than 50·10^9 nodes
  E changes (insert, delete): more than 10 links per node
  10 × 50·10^9 = 500·10^9 one-entries in the adjacency matrix
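A back-of-the-envelope check of these numbers (a sketch using the slide's estimates, not measurements of today's Web):

    # Scale of the web graph, using the slide's estimates.
    nodes = 50e9               # |N| > 50 * 10^9 pages
    out_degree = 10            # > 10 links per node on average
    edges = nodes * out_degree
    print(f"{edges:.0e} one-entries")         # 5e+11

    # A dense boolean adjacency matrix would need |N|^2 bits:
    print(f"{nodes**2:.1e} bits if dense")    # 2.5e+21 bits
    print(f"density {edges / nodes**2:.0e}")  # ~2e-10

At a density around 2·10^-10, the matrix view is hopeless: the graph must be stored as (possibly compressed) adjacency lists.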

Crawling Issues

How to crawl?
  Quality: best pages first
  Efficiency: avoid duplication (or near-duplication)
  Etiquette: robots.txt, server-load concerns (minimize load)
How much to crawl? How much to index?
  Coverage: how big is the Web? How much do we cover?
  Relative coverage: how much do competitors have?
How often to crawl?
  Freshness: how much has changed?
How to parallelize the process?

Crawler cycle of life

[Diagram: the Link Extractor feeds a Priority Queue (PQ); the Crawler Manager moves URLs from the PQ to the Assigned Repository (AR); the Downloaders fetch pages from the AR into the Page Repository (PR).]

Link Extractor:
  while <Page Repository is not empty>:
      <take a page p (check if it is new)>
      <extract links contained in p within href attributes>
      <extract links contained in JavaScript>
      <extract ...>
      <insert these links into the Priority Queue>

Downloaders:
  while <Assigned Repository is not empty>:
      <extract URL u>
      <download page(u)>
      <send page(u) to the Page Repository>
      <store page(u) in a proper archive, possibly compressed>

Crawler Manager:
  while <Priority Queue is not empty>:
      <extract some URLs u having the highest priority>
      for each u extracted:
          if (u ∉ Already Seen Pages) or
             (u ∈ Already Seen Pages and <u's version on the Web is more recent>):
              <resolve u with respect to DNS>
              <send u to the Assigned Repository>
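A minimal single-process sketch of this cycle in Python (illustrative, not the slides' implementation: the in-memory lists, the priority function, and the use of urllib are all assumptions; a real crawler adds politeness delays, robots.txt checks, and many parallel downloaders):

    import heapq, re, urllib.request
    from urllib.parse import urljoin

    seen = set()     # Already Seen Pages (a real crawler uses a Bloom filter or disk)
    frontier = []    # Priority Queue (PQ) of (priority, url)
    assigned = []    # Assigned Repository (AR): URLs handed to the downloaders
    pages = []       # Page Repository (PR): (url, html) pairs awaiting extraction

    HREF = re.compile(r'href="(https?://[^"]+)"')

    def priority(url):
        # Placeholder policy: shorter URLs first (roughly BFS-like).
        # PageRank-driven or freshness-driven policies would plug in here.
        return len(url)

    def link_extractor():
        while pages:
            url, html = pages.pop()
            for link in HREF.findall(html):
                heapq.heappush(frontier, (priority(link), urljoin(url, link)))

    def crawler_manager(budget):
        while frontier and budget > 0:
            _, u = heapq.heappop(frontier)
            if u not in seen:       # (or: seen, but a fresher copy exists on the Web)
                seen.add(u)         # DNS resolution would happen here
                assigned.append(u)
                budget -= 1

    def downloaders():
        while assigned:
            u = assigned.pop()
            try:
                with urllib.request.urlopen(u, timeout=5) as r:
                    pages.append((u, r.read().decode("utf-8", "replace")))
            except OSError:
                pass                # dead links are simply skipped in this sketch

    heapq.heappush(frontier, (0, "http://example.com/"))
    for _ in range(3):              # a few rounds of the cycle
        crawler_manager(budget=5)
        downloaders()
        link_extractor()
    print(len(seen), "URLs seen")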

Page selection

Given a page P, define how good P is. Several metrics:
  BFS, DFS, Random
  Popularity-driven (PageRank, full vs. partial)
  Topic-driven, or focused crawling
  Combined

BFS

BFS order discovers the highest-quality pages during the early stages of the crawl
(328 million URLs in the testbed [Najork 01])
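These strategies differ only in how the frontier is managed, as this toy sketch shows (the five-page graph is hypothetical): BFS pops from the front of a queue, DFS from the back; popularity-driven crawling would instead use a priority queue keyed by a score such as (partial) PageRank.

    from collections import deque

    graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}

    def crawl_order(start, strategy):
        frontier, seen, order = deque([start]), {start}, []
        while frontier:
            # The only difference between BFS and DFS is which end we pop.
            u = frontier.popleft() if strategy == "BFS" else frontier.pop()
            order.append(u)
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        return order

    print(crawl_order("A", "BFS"))  # ['A', 'B', 'C', 'D', 'E']
    print(crawl_order("A", "DFS"))  # ['A', 'C', 'E', 'B', 'D']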

This page is a new one?

Check if the file has been parsed or downloaded before:
  after 20 million pages, we have seen over 200 million URLs
  each URL is at least 100 bytes on average
  overall, we have about 20 GB of URLs

Options: compress URLs in main memory, or use disk
  Bloom filter (Archive); see the sketch after this slide
  Disk access with caching (Mercator, AltaVista)

Parallel Crawlers

The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication.
  Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers; extracted links are sent back to the central coordinator
  Static assignment: the Web is statically partitioned and assigned to crawlers; each crawler crawls only its part of the Web
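Returning to the URL-seen test: a Bloom filter answers "seen before?" in a few bits per URL, accepting rare false positives (a genuinely new URL occasionally reported as seen, hence skipped) in exchange for never missing a duplicate. A minimal sketch; the filter size, the k = 7 hash functions, and the double-hashing trick are illustrative choices, not the Archive's actual parameters:

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=8_000_000, k=7):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8 + 1)

        def _positions(self, url):
            # Double hashing: derive k bit positions from two digest halves.
            h = hashlib.sha256(url.encode()).digest()
            h1 = int.from_bytes(h[:8], "big")
            h2 = int.from_bytes(h[8:16], "big")
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, url):
            for p in self._positions(url):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, url):
            # False positives possible, false negatives impossible.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(url))

    seen = BloomFilter()
    seen.add("http://www.di.unipi.it/")
    print("http://www.di.unipi.it/" in seen)     # True
    print("http://www.unseen.example/" in seen)  # almost surely False

With about 10 bits per key and k = 7, the false-positive rate is roughly 1%, so the 200 million URLs above would fit in about 250 MB of memory instead of 20 GB of raw strings.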

Two problems

Load balancing the number of URLs assigned to downloaders:
  Static schemes based on hosts may fail: www.geocities.com/... vs. www.di.unipi.it/... (hosts differ wildly in size)
  Dynamic relocation schemes may be complicated

Let D be the number of downloaders. hash(URL) maps a URL to {0, ..., D-1}; downloader x fetches the URLs u such that hash(u) = x.

Managing fault tolerance:
  What about the death of downloaders? D -> D-1, new hash!!!
  What about new downloaders? D -> D+1, new hash!!!

A nice technique: Consistent Hashing

A tool for: spidering, web caches, P2P, routers' load balance, distributed FS.
  Items and servers are mapped to the unit circle
  Item K is assigned to the first server N such that ID(N) ≥ ID(K)
  What if a downloader goes down? What if a new downloader appears?
  Each server gets replicated log S times
  [monotone] adding a new server moves items only from one old server to the new one
  [balance] the probability that an item goes to a given server is O(1/S)
  [load] any server gets O((I/S) log S) items w.h.p.
  [scale] you can replicate each server more times
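A compact sketch of consistent hashing (illustrative assumptions: MD5 as the circle hash, log2(S)+1 replicas per server): each downloader is placed at several points of the circle, a URL goes to the first downloader clockwise from its hash, and adding or removing one downloader only moves the URLs on the arcs adjacent to its points.

    import bisect, hashlib, math

    def h(key, space=2**32):
        # Map a string to a point on the (discretized) unit circle.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % space

    class ConsistentHash:
        def __init__(self, servers, replicas=None):
            # Each server appears ~log2(S)+1 times on the circle ("virtual nodes").
            self.replicas = replicas or max(1, int(math.log2(len(servers))) + 1)
            self.ring = sorted((h(f"{s}#{i}"), s)
                               for s in servers for i in range(self.replicas))
            self.keys = [p for p, _ in self.ring]

        def lookup(self, url):
            # First server clockwise from hash(url), wrapping around the circle.
            i = bisect.bisect_left(self.keys, h(url)) % len(self.ring)
            return self.ring[i][1]

    ch = ConsistentHash(["d0", "d1", "d2", "d3"])
    print(ch.lookup("http://www.di.unipi.it/"))
    # If d2 dies, only URLs on d2's arcs move; everything else stays put.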

Examples: Open Source

Nutch (also used by WikiSearch): http://www.nutch.org
Heritrix (used by Archive.org): http://archive-crawler.sourceforge.net/index.html
Consistent Hashing: Amazon's Dynamo