Algoritmi per IR
Web Search

Goal of a Search Engine
Retrieve docs that are relevant for the user query
  Doc: Word or PDF file, web page, email, blog, e-book, ...
  Query: bag-of-words paradigm
  Relevant?!?
Two main difficulties
The Web: extracting significant data is difficult!!
  Size: more than tens of billions of pages
  Language and encodings: hundreds
  Distributed authorship: SPAM, format-less
  Dynamic: in one year 35% survive, 20% untouched
The User: matching user needs is difficult!!
  Query composition: short (2.5 terms on average) and imprecise
  Query results: 85% of users look at just one result page
  Several needs: informational, navigational, transactional

Evolution of Search Engines
First generation -- use only on-page, web-text data
  Word frequency and language
  1995-1997: AltaVista, Excite, Lycos, etc.
Second generation -- use off-page, web-graph data
  Link (or connectivity) analysis
  Anchor text (how people refer to a page)
  1998: Google
Third generation -- answer the need behind the query
  Focus on user need, rather than on query
  Integrate multiple data sources
  Click-through data
  Google, Yahoo, MSN, ASK
Fourth generation -- Information Supply
  [Andrei Broder, VP emerging search tech, Yahoo! Research]
This is a search engine!!!

Algoritmi per IR
The structure of a Search Engine
The structure?
[Architecture diagram: the components of a search engine -- Crawler, Page archive, Page Analyzer, Indexer (text and auxiliary structures), Query resolver, Ranker, Control module; the user Query enters at the Query resolver.]
Information Retrieval: Crawling

Spidering
  24h, 7 days, walking over a Graph
  What about the Graph? BowTie structure
  Directed graph G = (N, E)
    N changes (insert, delete): >> 50 * 10^9 nodes
    E changes (insert, delete): > 10 links per node
    10 * 50*10^9 = 500*10^9 1-entries in the adjacency matrix
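A quick back-of-the-envelope check of those figures, as a minimal Python sketch (the node count and average out-degree are the ones quoted above; everything else is illustrative):

    # ~50 billion nodes, each with > 10 out-links on average (figures above)
    nodes = 50 * 10**9
    avg_out_degree = 10

    edges = nodes * avg_out_degree        # 1-entries in the adjacency matrix
    print(edges)                          # 500,000,000,000 = 500 * 10^9

    # The full boolean adjacency matrix would have nodes**2 cells, so the
    # fraction of 1-entries is tiny: the Web graph is extremely sparse,
    # which is why it is stored as adjacency (link) lists, not as a matrix.
    print(edges / nodes**2)               # 2e-10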
Crawling Issues
How to crawl?
  Quality: best pages first
  Efficiency: avoid duplication (or near-duplication)
  Etiquette: robots.txt, server load concerns (minimize load)
How much to crawl? How much to index?
  Coverage: how big is the Web? How much do we cover?
  Relative coverage: how much do competitors have?
How often to crawl?
  Freshness: how much has changed?
How to parallelize the process?

Crawler cycle of life
[Diagram: the Downloaders fill the Page Repository (PR); the Link Extractor reads pages from PR and inserts their links into the Priority Queue (PQ); the Crawler Manager pops URLs from PQ and places them in the Assigned Repository (AR) for the Downloaders.]

Link Extractor:
  while (<Page Repository is not empty>) {
    <take a page p (check if it is new)>
    <extract links contained in p within href tags>
    <extract links contained in JavaScript>
    <extract ...>
    <insert these links into the Priority Queue>
  }

Downloaders:
  while (<Assigned Repository is not empty>) {
    <extract url u>
    <download page(u)>
    <send page(u) to the Page Repository>
    <store page(u) in a proper archive, possibly compressed>
  }

Crawler Manager:
  while (<Priority Queue is not empty>) {
    <extract some URLs u having the highest priority>
    foreach u extracted {
      if ( (u is not in the Already Seen Pages) ||
           (u is in the Already Seen Pages && <u's version on the Web is more recent>) ) {
        <resolve u wrt DNS>
        <send u to the Assigned Repository>
      }
    }
  }
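A minimal single-process Python sketch of this cycle, with the three roles collapsed into one loop; download_page and extract_links are hypothetical stand-ins for a real HTTP fetcher and an href/JavaScript link extractor, and the priority is simply the crawl depth. The re-crawl branch (a URL already seen whose Web version is more recent) is omitted for brevity.

    import heapq

    def crawl(seed_urls, download_page, extract_links, max_pages=1000):
        # Priority Queue of (priority, url); lower value = higher priority.
        priority_queue = [(0, url) for url in seed_urls]
        heapq.heapify(priority_queue)
        already_seen = set(seed_urls)     # the "Already Seen Pages" set
        page_repository = {}              # url -> page content

        while priority_queue and len(page_repository) < max_pages:
            # Crawler Manager: pick the URL with the highest priority.
            depth, url = heapq.heappop(priority_queue)

            # Downloader: fetch the page and store it in the Page Repository.
            page = download_page(url)
            if page is None:
                continue
            page_repository[url] = page   # possibly compressed, in a real archive

            # Link Extractor: parse out the links and enqueue the new ones.
            for link in extract_links(page):
                if link not in already_seen:
                    already_seen.add(link)
                    heapq.heappush(priority_queue, (depth + 1, link))

        return page_repository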
Page selection
Given a page P, define how good P is.
Several metrics:
  BFS, DFS, Random
  Popularity driven (PageRank, full vs partial)
  Topic driven or focused crawling
  Combined

BFS
BFS order discovers the highest-quality pages during the early stages of the crawl
  328 million URLs in the testbed [Najork 01]
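These metrics differ mainly in how the crawl frontier is ordered. A hedged Python sketch of the two extremes: BFS order is a plain FIFO queue, while popularity- or topic-driven crawling keeps the same loop but orders the frontier by a goodness score (score_of is a hypothetical placeholder for a partial PageRank, topic similarity, etc.):

    import heapq
    from collections import deque

    # BFS order: FIFO queue, pages are fetched in discovery order.
    frontier = deque(["http://seed.example/"])        # hypothetical seed
    url = frontier.popleft()                          # next page to fetch
    frontier.append("http://seed.example/next")       # newly discovered link

    # Popularity/topic driven: a max-heap keyed by the goodness score.
    def push(heap, url, score_of):
        heapq.heappush(heap, (-score_of(url), url))   # negate: max-heap

    def pop(heap):
        return heapq.heappop(heap)[1]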
Is this page a new one?
Check if the file has been parsed or downloaded before
  after 20 million pages, we have seen over 200 million URLs
  each URL is at least 100 bytes on average
  overall, about 20 GB of URLs
Options: compress URLs in main memory, or use disk
  Bloom Filter (Archive)
  Disk access with caching (Mercator, AltaVista)

Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided avoiding duplication
Dynamic assignment
  A central coordinator dynamically assigns URLs to crawlers
  Links are given to the central coordinator
Static assignment
  The Web is statically partitioned and assigned to crawlers
  Each crawler only crawls its part of the Web
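A minimal Python sketch of the Bloom-filter option for the URL-seen test (sizes and hash choices are illustrative assumptions, not those of any specific crawler). At roughly 10 bits per URL, 200 million URLs fit in about 250 MB of main memory, versus ~20 GB for the raw URL strings, at the price of a small false-positive rate.

    import hashlib

    class BloomFilter:
        # Probabilistic "have I seen this URL?" set: no false negatives,
        # a small, tunable rate of false positives.
        def __init__(self, num_bits, num_hashes):
            self.m = num_bits
            self.k = num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, url):
            for i in range(self.k):
                digest = hashlib.sha1(f"{i}:{url}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, url):
            for p in self._positions(url):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, url):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(url))

    seen = BloomFilter(num_bits=10_000_000, num_hashes=7)   # ~10 bits/URL for 1M URLs
    if "http://www.di.unipi.it/" not in seen:
        seen.add("http://www.di.unipi.it/")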
Two problems
Load balancing the #URLs assigned to downloaders:
  Static schemes based on hosts may fail
    www.geocities.com/ vs. www.di.unipi.it/
  Dynamic relocation schemes may be complicated
Let D be the number of downloaders. hash(url) maps a URL to {0,...,D-1}.
Downloader x fetches the URLs u s.t. hash(u) = x
Managing fault tolerance:
  What about the death of downloaders? D -> D-1, new hash!!!
  What about new downloaders? D -> D+1, new hash!!!

A nice technique: Consistent Hashing
A tool for: spidering, Web cache, P2P, routers, load balancing, distributed FS
Items and servers are mapped to the unit circle
Item K is assigned to the first server N such that ID(N) >= ID(K)
  What if a downloader goes down? What if a new downloader appears?
Each server gets replicated log S times
  [monotone] adding a new server moves items only from one old server to the new one
  [balance] the probability that an item goes to a given server is O(1/S)
  [load] any server gets O((I/S) log S) items w.h.p.
  [scale] you can replicate each server more times...
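A compact Python sketch of consistent hashing used to assign URLs to downloaders; SHA-1 plays the role of the mapping onto the circle, and each downloader is placed at several points of the ring (the "replicated log S times" above). Downloader names and the replica count are illustrative.

    import bisect
    import hashlib

    def point(key):
        # Map a string onto the "circle" (here: the space of SHA-1 values).
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    class ConsistentHash:
        def __init__(self, downloaders, replicas=8):
            self.replicas = replicas
            self.ring = []                  # sorted list of (point, downloader)
            for d in downloaders:
                self.add(d)

        def add(self, downloader):          # D -> D+1
            for i in range(self.replicas):
                bisect.insort(self.ring, (point(f"{downloader}#{i}"), downloader))

        def remove(self, downloader):       # D -> D-1
            self.ring = [(p, d) for (p, d) in self.ring if d != downloader]

        def assign(self, url):
            # The URL goes to the first downloader point clockwise from hash(url).
            i = bisect.bisect(self.ring, (point(url), "")) % len(self.ring)
            return self.ring[i][1]

    ring = ConsistentHash(["d0", "d1", "d2"])
    owner = ring.assign("http://www.di.unipi.it/")

When a downloader dies or a new one joins, only the URLs whose points fall between the affected ring positions change owner; all other assignments stay put, which is exactly the [monotone] property above.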
Examples: Open Source
  Nutch, also used by WikiSearch
    http://www.nutch.org
  Heritrix, used by Archive.org
    http://archive-crawler.sourceforge.net/index.html
Consistent Hashing
  Amazon's Dynamo