Multimedia Retrieval. Chapter 3: Web Retrieval. Dr. Roger Weber, Computer Science / CS342 / Fall 2014


Chapter 3: Web Retrieval (Dr. Roger Weber, roger.weber@credit-suisse.com)

Contents:
3.1 Motivation: The Problem of Web Retrieval
3.2 Size of the Internet and Coverage by Search Engines
3.3 Ordering of Documents
3.4 Context based Retrieval
3.5 Architecture of a Search Engine

3.1 Motivation: The Problem of Web Retrieval
Classical Retrieval vs. Web Retrieval:
- Collection: controlled set vs. uncontrolled, incomplete
- Size: small to large (1 MB - 20 GB [TREC]) vs. extremely large (the text documents alone comprise more than 200 GB)
- Documents, multimedia: homogeneous vs. heterogeneous (HTML, PDF, ASCII)
- Structure of documents: homogeneous vs. heterogeneous
- Links between documents: seldom (citations of other documents) vs. lots of links in documents
- Quality of documents: good to excellent vs. broad range of quality (poor grammar, wrong content, incorrect information, spamming, misspellings)
- Queries: precise, with more terms vs. short and imprecise, similarity search
- Results: small number of hits (<100) with good quality vs. large number of hits (>100,000)
Page 3-2

The Internet grows rapidly
The Internet is growing at a rate of 14% a year; every 5.32 years, the number of domains doubles.
Google's index contains more than 35 billion pages. In 2008, Google software engineers announced that they had discovered one trillion unique URLs (http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html).
But the web is even larger! Many web sites address their content with an infinite number of URIs. Example Google: if we assume a dictionary of 1 million terms, all two-term queries against Google already yield a trillion unique URIs. Adding a third, a fourth, a fifth, and so on, term multiplies this number by 1 million each time.
Page 3-3

The Problem of Ordering
Lots of documents fulfill the typically small queries (2-5 terms); often, result sets contain more than 100,000 documents with an RSV > 0. But not all result documents are relevant.
Example: the query "ford" returns 911,000,000 hits with Google, and the first rank is the homepage of the car manufacturer Ford. How is that possible? Google is based on Boolean retrieval!
Search engines do not sort based on the RSV value alone. Depending on the RSV function, only pages would appear at the top which contain terms with the same frequencies as the query, contain the query terms most often, or contain all the query terms.
Classical retrieval lacks a defense mechanism against spamming!
Page 3-4

3.2 Size of the Internet and Coverage by Search Engines
How to compute the number of web servers connected to the Internet [Giles99]:
Assumption: IP addresses of web servers are evenly distributed in the 32-bit address space.
Approach: choose N random IP addresses and try to access the root web page of each server. Let M be the number of responding web servers; then M/N is the density of coverage in the IP address space.
Giles [1999]: M/N ≈ 1/269. This leads to 2^32 * M/N ≈ 16.0 million web servers [Date: July 1999].
Problem: this estimation also counts devices that are managed via a web front end, e.g., routers.
Page 3-5
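A minimal sketch of this sampling estimate, assuming a hypothetical `responds_to_http` probe (here a bare TCP connect to port 80); the function names and sample size are illustrative and not from the slide:

```python
import random
import socket

def responds_to_http(ip, timeout=1.0):
    """Very rough probe: True if a TCP connection to port 80 succeeds."""
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

def estimate_web_servers(n_samples=10_000):
    """Giles/Lawrence-style estimate: probe N random IPv4 addresses,
    let M be the number that answer, and extrapolate M/N to the
    full 32-bit address space."""
    hits = 0
    for _ in range(n_samples):
        ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
        if responds_to_http(ip):
            hits += 1
    return (hits / n_samples) * 2**32

# With the reported density M/N ~ 1/269, the estimate is 2^32 / 269 ~ 16 million servers.
```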

Estimating the number of web pages: overlap analysis [Bharat98]
Assumptions: search engines index a subspace of the web independently of each other, and each indexed subspace is a random sample of the web. Then p(A ∩ B) = p(A) * p(B).
Approach: |A| and |B| are known (reported) for some search engines, with |A| = N*p(A), |B| = N*p(B), and |A ∩ B| = N*p(A ∩ B). Estimate the ratio |B| : |A ∩ B| with selected queries; then N = |A| * |B| / |A ∩ B|.
Note: the independence assumption does not generally hold because search engines often start crawling at the same starting points (e.g., yahoo.com). Hence, the above estimation yields a lower bound for the real number of web pages.
Page 3-6

Estimating the number of web pages (2)
Procedure (|A| and |B| are known for many search engines):
- Determine the document frequency of terms in a sufficiently large sample of web pages.
- Let l = 0; repeat k times:
  - Query search engine B with a random query and select an arbitrary page from the result.
  - Build a query from terms appearing in the selected document that have the smallest document frequencies (select several terms to steer the size of the query result).
  - Query the other search engine (A) with these terms. Due to the selected terms with small document frequencies, the result set is small. Increase l if the page is also found with search engine A.
- Estimate the ratio |B| : |A ∩ B| with k : l.
- Determine N_AB = |A| * |B| / |A ∩ B| = |A| * k / l.
- Compute N_AB for different combinations of search engines and estimate the total number of web pages as the average over all N_AB values.
Page 3-7
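The estimate above reduces to simple arithmetic; a sketch in the slide's notation (|A| is the reported index size of engine A, k the number of pages sampled from B, l how many of them were also found in A):

```python
def estimate_web_size(size_a, k, l):
    """Capture-recapture estimate [Bharat98]:
    |A ∩ B| ≈ |B| * l / k, hence N = |A| * |B| / |A ∩ B| = |A| * k / l
    (|B| cancels out)."""
    if l == 0:
        raise ValueError("no overlap observed; sample more pages")
    return size_a * k / l

# Illustrative numbers (not from the slide): engine A reports 800M pages,
# 200 pages sampled from B, 30 of them also found in A -> ~5.3B pages (a lower bound).
print(estimate_web_size(800e6, 200, 30))
```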

Some key figures
Dec 1997: > 320M pages. Some search engines indexed about 1/3 of the web; the biggest 6 engines together indexed around 60% of the web.
Feb 1999: 800M pages. Some engines indexed 16% of the web; the largest 11 engines together indexed around 42% of the web. 2.8M public web servers, 16M web servers in total. A page had an average size of 18.7 KB (excluding embedded images); 14 TB of data.
Jan 2000: > 1B pages. Coverage of individual search engines between 10% and 15%. 6.4M public web servers (with 2.2M mirrored servers); about 60% Apache, 25% MS-IIS. Number of links on www.yahoo.com: 751,974.
End of 2000: 3-5B pages. Coverage of search engines still between 10% and 15%; Google outperforms all others in terms of coverage: 30% (this also includes pages that Google's crawlers only know from references). 30-50 TB of data (HTML format). BrightPlanet distinguishes between the surface and the deep web: the "surface web" subsumes all public pages; the deep web also subsumes private pages and dynamic pages (phone books, e-banking, etc.). 800B pages in the deep web, more than 8,000 TB of data.
2005: 8B pages. The deep web has grown further; accurate estimates are no longer available.
2010: Google has >35B pages and reports to have seen more than 1 trillion unique URIs.
Page 3-8

Last reported index sizes of search engines [Source: SearchEngineWatch, 2005]:
  Search Engine   Reported Size   Page Depth
  Google          8.1 billion     101 KB
  MSN             5.0 billion     150 KB
  Yahoo           4.2 billion     500 KB
  Ask Jeeves      2.5 billion     101 KB+
Page depth: maximum length of the indexed part of a found document. Google, for instance, only indexes the first 101 kilobytes of a page.
More recent figures: Google's index had 8B entries at the end of 2005 (according to its homepage); today, no index sizes are published any more (-> end of the search index size war with Yahoo).
Basic idea for more current figures: use a keyword like "the" with a known probability of appearing on a web page (67%).
2010: 23 billion hits (2006: 14.4 billion) -> estimated index size is 34 billion pages (2006: 22 billion).
Page 3-9

3.3 Ordering of Documents
Ranking of Google (as far as documented): even though it is based on Boolean retrieval, it achieves good precision. Today, most search engines use similar methods, but the details are kept secret.
Ranking already starts with extracting the right information from documents. Google extracts the positions of terms in a document, the relative font size, the visual attributes (bold, italic), and the context of the page (terms in URL, title, meta tags, text of references). The text of references (e.g., between <A>...</A>) is also assigned to the referenced document.
The ranking consists of the following factors:
- Proximity of terms, i.e., distance between occurrences of distinct query terms
- Positions in the document (URL, text of references, title, meta tag, body), size of font, and visual attributes
- PageRank
- Further criteria (advertisements, pushed content)
Page 3-10

3.3.1 Proximity of Terms
Query: White House
Document 1: "the white car stands in front of the house" (-> not relevant)
Document 2: "the president entered the White House" (-> relevant)
The closer the query terms are, the more relevant the text is.
Implementation in the Google prototype: for each position pair, a proximity value is assigned (one of 10 values). The frequencies of these proximity values form the proximity vector; multiplying this vector with a weighting vector (inner product) yields the overall proximity value for the document.
Page 3-11

Example:
hit list["white"] = {1, 81, 156}, hit list["house"] = {2, 82, 115, 157}
Position pairs: {(1,2), (1,82), (1,115), (1,157), (81,2), (81,82), (81,115), ...}
Each pair is mapped to a proximity class from 1 (adjacent) over 5 (nearby) and 8 (distant) to 10 (far away). Counting the frequencies per class over all pairs (e.g., (1,2) and (81,82) are adjacent, (1,157) is far away) yields the proximity vector
  p = [3, 0, 0, 1, 1, 0, 0, 1, 2, 3]
Overall proximity of the document with w = [1.0, 0.9, 0.8, 0.7, ..., 0.1]:
  p^T w = 5.6
Page 3-12
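A sketch of this proximity scoring; the mapping of a position distance to one of the 10 classes is not documented on the slide, so the bucketing below is an assumption and will not reproduce the value 5.6 exactly:

```python
def proximity_score(positions_a, positions_b, weights):
    """Form all position pairs between the hit lists of two query terms,
    map each pair to one of 10 proximity classes (1 = adjacent ... 10 = far away),
    count class frequencies, and return the inner product with the weight vector."""
    classes = [0] * 10
    for pa in positions_a:
        for pb in positions_b:
            distance = abs(pa - pb)
            bucket = min(distance, 10)      # assumed: distance d maps to class min(d, 10)
            classes[bucket - 1] += 1
    return sum(f * w for f, w in zip(classes, weights))

w = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
print(proximity_score([1, 81, 156], [2, 82, 115, 157], w))
```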

3.3.2 Positions in the Document, Size of Font, and Visual Attributes
Queries typically aim at the title (heading) of a document, e.g., "White House" instead of "Central Executive Place". Often users look for brands, persons, or firms.
Text references in external documents describe the contents very accurately. E.g., the query "eth lausanne" is answered by Google with the home page of EPFL, although that page does not contain the term "ETH".
Conclusion: documents containing the query terms in the title, with special visual attributes (large size, heading, bold), or in reference texts linking to that document appear more relevant than documents that contain the terms just in the body ("I work for ETH Lausanne").
Google counts the occurrences of terms along the dimensions described above, multiplies the frequencies with well-chosen weights, and sums these values to obtain a second relevance value for the document. Further, it contains mechanisms to cut off spamming.
Page 3-13

Implementation in the Google prototype:
  Position    Freq   lim. Freq   Weight
  <TITLE>     1      1           5.00
  <META>      1      1           4.50
  <P>         672    100         0.05
  <B>         14     12          3.00
  <I>         7      7           2.70
  <H1>        1      1           4.00
  <H2>        12     12          3.60
  ...
  reference   892    100         7.50
  Weighted sum: 866.6
Impact: Google is able to find pages given a brand, name, or firm that are highly relevant to the user.
Spamming: if a page contains a term too often, the page gets ignored (e.g., if a term contributes more than 10% of the text, it is considered spam).
Page 3-14
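A minimal sketch of this weighting scheme using the caps ("lim. Freq") and weights from the table; where raw and limited frequency coincide in the table (<TITLE>, <META>, <I>, <H1>, <H2>), the exact cap is not visible and is an assumption:

```python
# (cap, weight) per context; caps marked above as assumptions
CONTEXT_WEIGHTS = {
    "title":     (1,   5.00),
    "meta":      (1,   4.50),
    "body":      (100, 0.05),   # <P>
    "bold":      (12,  3.00),   # <B>
    "italic":    (7,   2.70),   # <I>
    "h1":        (1,   4.00),
    "h2":        (12,  3.60),
    "reference": (100, 7.50),   # anchor text of incoming links
}

def position_score(frequencies):
    """Sum of capped term frequencies times hand-tuned weights per context.
    The cap doubles as a crude anti-spam cut-off."""
    score = 0.0
    for context, freq in frequencies.items():
        cap, weight = CONTEXT_WEIGHTS[context]
        score += min(freq, cap) * weight
    return score

# The slide's example reproduces the value 866.6:
print(position_score({"title": 1, "meta": 1, "body": 672, "bold": 14,
                      "italic": 7, "h1": 1, "h2": 12, "reference": 892}))
```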

3.3.3 PageRank Which page better suits the query "Uni Basel" and why? Page 3-15

A Preliminary Model (not yet PageRank)
Idea: count the number of incoming links; the more incoming links, the more important the page, because it is more likely that a surfer lands on that page.
Example (in-link counts): A: 1, B: 1, C: 6, D: 3. C is the most important page, ahead of D, A, and B.
Problems: not every page is equally important, and hence neither is every link. Spamming.
Page 3-16

Computing the PageRank of a Page
Improved idea: a random surfer follows with probability p an outgoing link of the current page; with probability (1-p), the surfer jumps to an arbitrary page (bookmark, URL). The PageRank of a page is the probability that the random surfer lands on that page (after a number of steps).
Notation:
- A: an arbitrary page
- L(A): set of pages which have a reference to A
- N(A): number of outgoing links of page A
- PR(A): PageRank of page A
- p ∈ [0,1]: probability that a surfer follows an outgoing link
Definition of PageRank:
  PR(A) = (1 - p) + p * Σ_{B ∈ L(A)} PR(B) / N(B)
Page 3-17

Intuitive Explanation of the Formula
The value of a link is given by the PageRank of the source document divided by the number of outgoing links on that page. This simulates the freedom of the random surfer to follow any link on that page.
The term (1-p) + p*... reflects the freedom of the surfer to follow a link with probability p or to jump to an arbitrary page with probability 1-p.
Example (figure with a small link graph): A and C have the same PageRank although A has only one incoming link and C has two.
Page 3-18

Computation
The formula is recursive! The PR values can be computed by a fixed-point iteration; experiments showed that the computational effort is minimal compared to the crawling effort (only a few iterations are required).
Approach:
1. Assign arbitrary initial values PR(A) to all documents A.
2. Compute PR'(A) (the left-hand side of the equation) according to the formula above for all documents A.
3. If |PR'(A) - PR(A)| becomes sufficiently small, then PR(A) = PR'(A) is the solution; otherwise let PR(A) = PR'(A) and repeat from step 2.
Solving the fixed-point iteration takes only a few iterations (<100) and the computational effort is modest (several hours).
Page 3-19
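A minimal sketch of this fixed-point iteration for the PageRank formula above; the adjacency-dictionary representation of the graph is an assumption for illustration, not Google's data structure:

```python
def pagerank(links, p=0.85, eps=1e-8, max_iter=100):
    """Iterate PR(A) = (1-p) + p * sum over B in L(A) of PR(B)/N(B)
    until the values change by less than eps.
    links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    pr = {page: 1.0 for page in pages}                    # arbitrary initial values
    for _ in range(max_iter):
        new_pr = {}
        for page in pages:
            incoming = 0.0
            for source, targets in links.items():
                if page in targets:
                    incoming += pr[source] / len(targets)  # PR(B) / N(B)
            new_pr[page] = (1 - p) + p * incoming
        if max(abs(new_pr[q] - pr[q]) for q in pages) < eps:
            break
        pr = new_pr
    return new_pr

# Tiny example: A -> B, B -> A and C, C -> A
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))
```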

Application
PageRank derives a total ordering of web pages independent of the current query and its terms. Google uses PageRank in combination with other criteria.
PageRank is robust against spamming, i.e., against manipulations to push the PageRank of a page. Even if many links point to a page, this does not necessarily imply its importance and a high PageRank.
Ordering documents based on PageRank alone would be fatal: let A be the document with the highest PageRank; all queries with terms contained in A would be answered with A as the best document, even though more relevant documents exist.
Page 3-20

3.3.4 Other Criteria to Order Documents
Bought ranking positions: a number of search engines receive money for placing pages at top positions (advertisements) for certain query terms. Examples: advertisements, RealName.
Page 3-21

Length of URL: a query such as "ford" may be answered with the following documents:
  http://www.ford.com/
  http://www.ford.com/helpdesk/
  http://www.careers.ford.com/main.asp
  http://www.ford.com/servlet/ecmcs/ford/index.jsp?section=ourservices&level2=rentalsfromdealers
Search engines rank documents with short URLs at higher positions as they are more likely to be homepages/entry pages for the information need.
User feedback: DirectHit used feedback of users to rank documents. Internally, the feedback-based relevance is stored similarly to the PageRank: if a document in the result list is clicked, its relevance value is increased, otherwise it is decreased. Google experimented with feedback as well.
Page 3-22

3.3.5 Overall Ranking
All search engines use and combine different criteria. In Google, the most prominent ones are:
- Proximity of terms
- Relevance values based on position in the document, font size, and visual attributes
- PageRank
The total relevance of a document results from summing up the relevance values of the individual criteria (with appropriate weighting). How to obtain those weights and which criteria to apply, however, remains a secret of the search engine providers.
Page 3-23

3.4 Context based Retrieval
Observations: the web contains many pages addressing a specialized topic (e.g., Star Wars).
E.g., http://creativesoftwareinc.com/starwars/ lists many web sites devoted to the "Star Wars" movies (all sites cover the same topic).
E.g., http://www.carlynx.com/ lists web sites of different car brands and car manufacturers (all sites cover similar/related topics).
Two approaches exploit such context information: "What's Related" and hubs and authorities.
Consequently: improve search results by explicitly taking context information into account (as in the examples above).
Similarly: determine the context of the query, possibly by asking the user for more information (Teoma, AskJeeves, Gigablast).
Page 3-24

3.4.1 Hubs and Authorities
A page is a so-called hub for a query Q if it contains many links to pages that are relevant to the query. A page is a so-called authority for a query Q if it is relevant for Q, i.e., provides the information needed to answer the information need.
Typically, one can identify and distinguish hubs and authorities based on their link structure (figure: a hub points to many pages relevant to Q; an authority is pointed to by many pages relevant to Q).
Page 3-25

Additionally, we observe that a good hub points to many good authorities, and a good authority is referenced by many good hubs.
Based on hub-authority relationships, a search engine becomes able to identify relevant documents which do not contain the query terms. Example: a query such as "looking for car manufacturers" will not lead a user to the homepages of Honda, VW, or Ford. With a hubs/authorities analysis, it becomes possible to answer even such queries directly.
Idea of Kleinberg [1997]: HITS Algorithm
The web is a graph G = (V,E) with the vertices V denoting the set of documents and the edges E denoting the links (from source to destination). Let (p,q) ∈ E; then document p references document q.
Step 1: for a query Q, determine the first t (e.g., t=200) documents with the help of a search engine. The set obtained in this step is called the root set. We observe that it contains many relevant documents but not all the good hubs and authorities.
Page 3-26

Step 2: Extend the root set with documents referenced by a document in the root set and with documents pointing to a document in the root set. The resulting set of documents is the so-called base set. To limit the size of the base set, one can restrict the number of documents added to d (e.g., 50) per element of the root set. Links within the same domain are removed as they frequently only serve as navigation aids.
(Figure: the base set surrounds the root set.)
Page 3-27

Step 3: Compute the hub value h(p) and the authority value a(p) for each document p. The numbers of incoming and outgoing links play the central role in this computation.
A simple solution would be to count links:
  a(p) = Σ_{(q,p) ∈ E} 1        h(p) = Σ_{(p,q) ∈ E} 1
A better idea: a good hub references many good authorities, and a good authority is linked by many good hubs. a(p) and h(p) are always normalized:
  Σ_{p ∈ V} a(p)^2 = 1          Σ_{p ∈ V} h(p)^2 = 1
Initialization: all pages start with the same values for a(p) and h(p).
Iteration: the new values are computed based on the old ones:
  a(p) = Σ_{(q,p) ∈ E} h(q)     h(p) = Σ_{(p,q) ∈ E} a(q)
Repeat the iteration (including normalization) until convergence is reached. Note: the normalization of the new values must be applied after each iteration step.
Page 3-28
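A sketch of this plain HITS iteration on the base set (the edge weights of the later extensions are omitted); representing the graph as a set of edges is an assumption for illustration:

```python
import math

def hits(edges, iterations=20):
    """edges: set of (source, destination) pairs of the base-set graph.
    Iterates a(p) = sum of h(q) over edges (q, p) and
             h(p) = sum of a(q) over edges (p, q),
    normalising both vectors to unit L2 norm after every step."""
    nodes = {n for edge in edges for n in edge}
    a = {n: 1.0 for n in nodes}                 # identical start values
    h = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        a_new = {n: sum(h[q] for (q, p) in edges if p == n) for n in nodes}
        h_new = {n: sum(a_new[q] for (p, q) in edges if p == n) for n in nodes}
        norm_a = math.sqrt(sum(v * v for v in a_new.values())) or 1.0
        norm_h = math.sqrt(sum(v * v for v in h_new.values())) or 1.0
        a = {n: v / norm_a for n, v in a_new.items()}
        h = {n: v / norm_h for n, v in h_new.items()}
    return a, h

# Hubs: documents with the largest h values; authorities: largest a values.
```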

Step 4: Compute the result.
- If the user asks for overview pages, return the k documents having the largest hub values h(p).
- If the user asks for content pages, return the k documents having the largest authority values a(p).
Notes: the user is empowered to choose between hubs and authorities. The iterative algorithm takes only a few steps (10-20) to determine the values a(p) and h(p).
Implementation: a simple procedure is based on the availability of a search engine:
- Evaluate the query with the search engine to determine the root set.
- Determine referenced documents by downloading and parsing all documents in the root set.
- Determine incoming links and their sources with the help of queries of the type "link:u", with u denoting the URL of a document in the root set.
Page 3-29

Extensions of HITS (Henzinger, 1998)
The HITS algorithm suffers from three fundamental problems:
1. If all pages in a domain reference the same external page, that page becomes too strong an authority. Similarly, if a page links to many different pages in the same domain, that page becomes too strong a hub.
2. Automatically established links, e.g., advertisements, banners, or links to the provider/host/designer of a web site, become wrong authorities.
3. Queries such as "jaguar car" tend to lead to pages about cars in general and to hubs containing links to different manufacturers. More precisely, the more frequent term "car" dominates the less frequent term "jaguar".
Page 3-30

Improvements:
Problem 1: the same author (= same domain) has only one "vote" for an external page; similarly, a document has only one "vote" when referencing documents in the same domain.
- If k pages p_i of the same domain reference a document q, we weight the links with aw(p_i, q) = 1/k for each edge (p_i, q).
- If a page p references l pages q_i in the same domain, we weight the links with hw(p, q_i) = 1/l for each edge (p, q_i).
Adjust the iteration step with these weights:
  a(p) = Σ_{(q,p) ∈ E} aw(q,p) * h(q)       h(p) = Σ_{(p,q) ∈ E} hw(p,q) * a(q)
Page 3-31
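A sketch of how these edge weights could be derived from URLs, using the host part as the "domain"; the helper names are illustrative, not from the source:

```python
from collections import defaultdict
from urllib.parse import urlparse

def edge_weights(edges):
    """If k pages of one domain point to the same external page q, each edge
    gets authority weight aw = 1/k; if one page p points to l pages inside
    the same target domain, each edge gets hub weight hw = 1/l."""
    domain = lambda url: urlparse(url).netloc
    by_source_domain = defaultdict(list)   # (source domain, target) -> edges
    by_target_domain = defaultdict(list)   # (source, target domain) -> edges
    for (p, q) in edges:
        by_source_domain[(domain(p), q)].append((p, q))
        by_target_domain[(p, domain(q))].append((p, q))
    aw = {e: 1.0 / len(group) for group in by_source_domain.values() for e in group}
    hw = {e: 1.0 / len(group) for group in by_target_domain.values() for e in group}
    return aw, hw
```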

Problems 2 and 3: to deal with these problems, we eliminate nodes from the graph that obviously do not match the query or its relevant documents. To this end, we perform an artificial query against the base set:
- The query is described by the first few terms of the documents in the root set (e.g., the first 1000 terms per document).
- Query and documents are mapped to vectors according to the tf-idf weighting scheme of vector space retrieval.
- The similarity s(p) between the query and document p is given by the cosine measure.
- For a given threshold t, eliminate all nodes/documents in the graph with s(p) < t. A good threshold is obtained by one of the following choices: t = median of all s(p) values; t = median of the s(p) values of documents in the root set; t = 1/10 * max s(p).
The similarities also enter the iteration:
  a(p) = Σ_{(q,p) ∈ E} aw(q,p) * s(q) * h(q)       h(p) = Σ_{(p,q) ∈ E} hw(p,q) * s(q) * a(q)
Page 3-32

Discussion: the HITS algorithm outperforms conventional web search results; the extensions of the HITS algorithm improve precision further, by more than 45%. The main issues with the HITS algorithm are the large evaluation costs and long retrieval times (30 seconds up to several minutes). In contrast to PageRank, the ranking produced by HITS depends on the query; PageRank gives a total order of the documents for all queries.
Page 3-33

3.4.2 What's Related
The basic idea of Alexa's "What's Related" was to identify related documents for an existing document. The definition of "related", however, was not based on similarity between the documents but on similarity between the topics addressed in these documents, potentially from different angles. Related pages for www.ford.com would be www.toyota.com, www.vw.com, ...
Analogously to "What's Related", Google provides "Similar Pages" for its result entries. The approaches to compute the relationships differ significantly between the two systems:
- Alexa used crawlers and data mining tools to determine related pages. Moreover, it spied on the surf patterns of its users (which pages the user visited, which search results the user investigated in more detail, ...).
- Google relies entirely on the analysis of the link structure of web pages to derive related documents. Two approaches were published by Google experts.
Page 3-34

Companion Algorithm (Dean, Henzinger, 1999)
The more complex approach is based on the extended HITS algorithm: given a URL u, the algorithm finds documents related to the page u.
Notation: if page w references page v, we call w a parent page of v and v a child page of w.
Step 1: Build a directed graph in the "neighborhood" of u. The graph contains the following nodes:
- u
- at most b parent pages of u, and for each parent page at most bf of its child pages
- at most f child pages of u, and for each child page at most fb of its parent pages
Step 2: Merge duplicates or "near-duplicates". Two documents are "near-duplicates" if they contain more than 10 links and 95% of the contained links appear in both documents.
Page 3-35

Step 3: Assign weights to the edges of the graph, similar to the extension of the HITS algorithm: if k edges from documents within the same domain point to the same external page, these edges obtain a weight of 1/k; and if a document contains l edges to pages within the same domain, each of these edges obtains a weight of 1/l.
Step 4: Determine hubs and authorities for the nodes of the graph according to the extension of the HITS algorithm (but without similarity weighting), i.e.:
  a(p) = Σ_{(q,p) ∈ E} aw(q,p) * h(q)       h(p) = Σ_{(p,q) ∈ E} hw(p,q) * a(q)
Step 5: Determine the result: the pages with the highest authority values (except for u) are the so-called related pages of u.
Page 3-36

Co-citation Algorithm (Dean, Henzinger, 1999)
This simpler approach counts how often a page u is referenced together with a page q. The page with the most frequent co-citations is the so-called most related page to u.
Step 1: Determine at most b parent pages of u.
Step 2: Determine for each parent page at most bf child pages whose links are close to the link to u. All these pages are siblings of u.
Step 3: Determine the pages q_i that are referenced most frequently together with u.
Step 4: If steps 1-3 result in fewer than 15 co-citations with a frequency of 2 or higher, repeat the search with prefixes of the URL of u, i.e.:
  u    = http://my.com/x/y/z.html
  u'   = http://my.com/x/y/
  u''  = http://my.com/x/
  u''' = http://my.com/
Page 3-37
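A sketch of the co-citation counting in steps 1-3; `parents_of` and `children_of` stand for lookups against a link index and are assumptions, and the selection of links "close to" the link to u is simplified here:

```python
from collections import Counter

def co_citation_related(u, parents_of, children_of, b=50, bf=8):
    """Count how often other pages are referenced together with u:
    inspect up to b parent pages of u and up to bf sibling links per parent,
    then rank the siblings by their co-citation frequency."""
    counts = Counter()
    for parent in parents_of(u)[:b]:
        siblings = [child for child in children_of(parent) if child != u][:bf]
        counts.update(siblings)
    return counts.most_common()          # most related page first
```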

Discussion: the approaches of Dean and Henzinger work much better than Alexa on average. Since there is no public information about how Alexa works, no qualitative or quantitative assessment is possible. Although Henzinger worked for Google, it is not clear which algorithms Google uses to find related pages.
Page 3-38

3.5 Architecture of a Search Engine
A search engine consists of the following main components:
- Crawler/Repository
- Feature Extractor
- Indexer
- Sorter
- Feedback Component
- User Interface
Page 3-39

Google architecture (Brin/Page): URL Server, Crawler, Store Server, Repository, Anchors, URL Resolver, Links, Indexer, Doc Index, Barrels, Lexicon, Sorter, PageRank, Searcher.
Page 3-40

3.5.1 Main Problem: Scalability
The data problem: search engines must deal with enormous data sizes.
Assumption: the size of a web page is 10 KB; extraction returns 1 KB of data to index. Google's search index contains at least 35B pages (2010): the cache for documents requires 350 TB, the inverted lists about 35 TB.
Google uses >150,000 (commodity) PCs, each with a hard disk of 40-200 GB; total disk space: 15,000 TB. But the space per PC is only about 100 GB.
Problems: how to search through 35 TB if a single machine only stores 100 GB? How to organize such a cluster given the frequent updates and the enormous search frequencies? How do you assign identifiers to 35B pages?
Page 3-41

The retrieval problem: there is no service window to update the software or the data; users search all the time.
Google: 250M queries per day = 10M queries per hour = about 3,000 queries per second! Daily peaks may be much higher.
With 35 TB of index data, a single entry (= term) in the inverted lists can consume more than 5 GB (e.g., the term "house" returns 3B hits with Google). A typical IDE disk delivers about 50 MB/s: how long does it take to scan 5 GB of data? How do you reduce search times given the high I/O load (it is not feasible to cache all data in memory)?
The query "house" returns 3B hits. Hits have to be sorted according to their relevance; with one term this can be pre-computed, but with several terms? An average PC needs quite some time to sort 3B hits even if the scores are already computed, and computing the scores takes yet another "eternity". How can we decrease search times? By parallelization? You don't need the entire list, only the top hundreds: does that help?
Page 3-42

The crawling problem: how do you fetch 35B pages in a reasonable time?
DNS lookups are expensive; a central DNS server would not scale (the maximum number of downloads per second would be limited).
Google exchanged its index once every month. Upper limit on time: downloading 35B pages within a month means roughly 13,500 pages per second! With about 10 KB per page this is some 135 MB/s; you most probably need several gigabit connections to the Internet. And: servers on the Internet have very different response times.
Incremental crawling: important pages such as newspapers have to be read daily to support queries on "hot topics"; news with an age of 1-2 months is not interesting any more. An incremental crawler has to update the index of important pages that change frequently; Google's incremental crawler is called freshbot. How do you select important pages, and how can you update the inverted lists while 3,000 concurrent queries are running? Do you require ACID properties?
Page 3-43

Google's success is due to addressing the scalability problem from the beginning. The following considerations are based on papers and presentations by Google as well as on speculations in web search forums.
Page 3-44

3.5.2 Crawling: tricks and musts
DNS lookup problem: keep a local DNS cache on each crawl server.
Be nice to web servers! Follow these rules:
- Only a few requests per minute to the same server!
- Do not follow cgi-links, as executing cgi scripts usually is expensive; moreover, you may easily activate/interact with/change the page (order goods, games, forums, ...).
- Read and obey robots.txt (e.g., "Disallow: /cgi-bin/").
- Filter "critical" URIs.
Note: there are far more pages than you can ever crawl, so it does not matter if you miss some of them; but do not miss the important ones!
A single machine is not sufficient for crawling! A single connection to the Internet is not sufficient!
Page 3-45
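A minimal sketch of the politeness rules above, assuming a per-host robots.txt cache and a per-host delay; the class name and the 30-second default are illustrative choices, not from the slide:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteCrawler:
    """Cache robots.txt per host, honour its disallow rules, and keep a
    per-host delay between requests (only a few requests per minute)."""

    def __init__(self, delay_seconds=30.0):
        self.delay = delay_seconds
        self.robots = {}        # host -> parsed robots.txt
        self.last_hit = {}      # host -> timestamp of the last request

    def allowed(self, url):
        host = urlparse(url).netloc
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            rp.read()                            # fetch and parse robots.txt once
            self.robots[host] = rp
        return self.robots[host].can_fetch("*", url)

    def wait_for_turn(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_hit.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)     # throttle requests per server
        self.last_hit[host] = time.time()
```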

3.5.3 Distributed Data Management
Google File System (GFS): maintains file systems with more than 1 TB; very large files (much larger than a single disk) can be managed; fault tolerance against crashes of machines and hard disks; partitioning of files allows for massive parallelization.
Google implemented a two-dimensional data management:
- Data are partitioned along groups of documents, the so-called shards.
- Each shard is stored on an arbitrary number of machines: replication increases fault tolerance (important when running on commodity PCs) and distributes the load among different machines.
Page 3-46

3.5.4 Execute as Parallel as Possible
Partitioning and replication provide parallelization inside a single query evaluation and across all concurrent query evaluations. The scalability of Google thus comes in two dimensions: partitioning supports growth in the number of documents, replication supports growth in the number of concurrent queries.
Page 3-47
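A sketch of the resulting scatter-gather query evaluation over document shards; `shard.search` is an assumed interface returning (score, doc_id) tuples, not Google's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def search_sharded(query, shards, k=10):
    """Fan the query out to all shards in parallel (partitioning), let every
    shard rank its own slice of the documents, and merge the partial top-k
    lists on the front end. Picking one replica per shard (replication) is
    omitted here."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = list(pool.map(lambda shard: shard.search(query, k), shards))
    merged = [hit for hits in partial_results for hit in hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)   # sort by score
    return merged[:k]
```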

Query evaluation with Google (example query: "cats and dogs"): DNS resolution distributes the load geographically and directs the client to a nearby data center (e.g., 64.233.161.99); a Google entry system / Google Web Server accepts the query; spell checker and ad server are consulted; the index servers determine the matching documents in parallel; the document servers produce titles and snippets for the result page.
Page 3-48

3.5.5 Cluster Management
Google has >1,000,000 PCs deployed in numerous data centers. A PC (or one of its components) fails at least once a year; with 1,000,000 PCs that means about 3,000 failures per day.
Data (index, document cache) has to be refreshed at regular intervals. Google, for instance, refreshed the index once every month; freshbot refreshes the index once a day. More than 100 TB of data cannot be distributed instantly; the data has to be rolled out step by step, and Google must still answer queries, using either the old or the new index data.
Google's software is constantly improved. Software distribution must be fully automated and must not result in any downtime.
Page 3-49

Literature and Links
Google Inc.: http://www.google.com
S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7, 1998. http://www-db.stanford.edu/~backrub/google.html
L. Page et al.: The PageRank Citation Ranking: Bringing Order to the Web. Work in progress. http://dbpubs.stanford.edu:8090/pub/showdoc.fulltext?lang=en&doc=1999-66&format=pdf&compression=&name=1999-66.pdf
Luiz Andre Barroso, Jeffrey Dean, Urs Hölzle: Web Search for a Planet: The Google Cluster Architecture. April 2003 (IEEE Computer Society).
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System. In: Proceedings of SOSP '03, October 19-22, 2003, Bolton Landing, New York, USA.
What's Related: Alexa Research ("What's Related"): http://www.alexaresearch.com/clientdir/products/top_websites.php
Jeffrey Dean, Monika R. Henzinger: Finding Related Web Pages in the World Wide Web. Proceedings of the 8th International World Wide Web Conference (WWW8), 1999, pp. 389-401.
Page 3-50

Literature and Links (2)
Size of the Internet:
[Bharat98] Krishna Bharat, Andrei Broder: A technique for measuring the relative size and overlap of public Web search engines. WWW7, 1998. http://www7.scu.edu.au/programme/fullpapers/1937/com1937.htm
[Giles99] Steve Lawrence, Lee Giles: Accessibility of information on the web. Nature, Vol. 400, pp. 107-109, 1999.
[SEW] SearchEngineWatch: http://www.searchenginewatch.com
[BP] BrightPlanet study: http://www.brightplanet.com/resources/details/deepweb.html
Internet Domain Survey: http://www.isc.org/ds/
Page 3-51