Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Problem 1:

A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to unit length?

Answer: To determine the similarity of two pages you want to consider the frequency (or, more generally, the significance) of different words in the documents, not the overall lengths of the documents. That is: suppose that U and V are two documents, that W has the same distribution of words as U but is twice as long, and that X has the same distribution of words as V but is twice as long. Then what you want is that Sim(U,V) = Sim(U,X) = Sim(W,V) = Sim(W,X); and this is achieved if you normalize the vectors and use either the dot product or the distance between the vectors. But W · X = 4 * (U · V) and distance(W,X) = 2 * distance(U,V), so using either the dot product or the distance without normalizing gives the wrong results.

B. Let D and E be the normalized vectors corresponding to documents D and E. What is the minimum possible value of D · E? If D · E is equal to this minimum, what can you say about D and E?

Answer: The minimum value is 0. This is achieved if D and E have no words in common (other than stop words).

C. What is the maximum possible value of D · E? If D · E is equal to this maximum, what can you say about D and E?

Answer: The maximum value is 1. This is achieved if D and E have identical word distributions.
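
As a concrete illustration of part (A), here is a minimal sketch in Python; the four-word vocabulary and the counts are invented for the example:

    import math

    def normalize(v):
        # Scale a term-frequency vector to unit Euclidean length.
        length = math.sqrt(sum(x * x for x in v))
        return [x / length for x in v]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # U and W have the same word distribution, but W is twice as long;
    # likewise V and X.
    U = [3, 1, 0, 2]
    W = [2 * x for x in U]
    V = [1, 0, 4, 1]
    X = [2 * x for x in V]

    print(dot(U, V), dot(W, X))               # 5 and 20: off by a factor of 4
    print(dot(normalize(U), normalize(V)),
          dot(normalize(W), normalize(X)))    # equal after normalization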

Problem 2:

A. In Google, the inverted index is divided into separate parallel databases handled by separate servers. How is the index divided?

Answer: The index is divided by document. The server for a given document is determined by hashing the document ID to a server index.

B. Describe the high-level steps in using the parallel database to answer queries.

Answer: The query is sent to all the servers for the inverted index. Each such server:

- looks up each query word in the inverted index and gets a list of documents, with the relevance of each document to that query word;
- calculates an overall list of candidate documents, using the Boolean (hard) constraints in the query;
- calculates the relevance of each candidate document to the query, using the relevance of the document to each query word, the proximity of the query words in the document, the PageRank of the document, etc.;
- returns to the main server a list of documents relevant to the query, with a relevance score, sorted in decreasing order of relevance.

The main query server merges these lists to obtain an overall list in decreasing order of relevance. (A sketch of this fan-out and merge appears after part (C).)

C. Besides the parallelism involved in (A), there are two other types of parallelism involved in the query engine (not including parallelism in the crawler). Describe them briefly.

Answer: What I had in mind here was, first, that Google maintains multiple mirror web clusters at various geographic locations, to achieve robustness and efficient, fast use of the Internet; and second, that there are multiple main query servers at each location, to handle different queries in parallel. Some students pointed out that, in addition, there is parallelism with the advertisement server and the spelling checker, which is also true.
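
The fan-out and merge described in part (B) can be simulated in a few lines. This is a toy single-process sketch, not Google's implementation; the shard layout, the additive scoring rule, and the postings data are all invented for the example:

    from heapq import merge

    # Part (A): a document's shard is chosen by hashing its document ID.
    def shard_for(doc_id, num_shards):
        return hash(doc_id) % num_shards

    def query_shard(postings, query_words):
        # Hard Boolean constraint: a candidate must contain every query word.
        candidates = set.intersection(
            *(set(postings.get(w, {})) for w in query_words))
        # Toy relevance: the sum of the per-word scores. A real engine would
        # also use word proximity, PageRank, and so on.
        scored = [(sum(postings[w][d] for w in query_words), d)
                  for d in candidates]
        return sorted(scored, reverse=True)   # decreasing relevance

    def answer_query(shards, query_words):
        # Fan the query out to every shard, then merge the sorted lists.
        per_shard = [query_shard(p, query_words) for p in shards]
        return list(merge(*per_shard, reverse=True))

    # Postings on each shard map word -> {doc_id: per-word relevance score}.
    shards = [
        {"web": {"doc1": 0.9, "doc3": 0.2}, "search": {"doc1": 0.5}},
        {"web": {"doc2": 0.4}, "search": {"doc2": 0.8}},
    ]
    print(answer_query(shards, ["web", "search"]))
    # [(1.4, 'doc1'), (1.2, 'doc2')] -- doc3 fails the Boolean constraint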

Problem 3: Pages in the Web graph are divided into the categories SCC, IN, OUT, TENDRILS, and DCC. Suppose that page P is in OUT, page Q is in IN, and page W is in TENDRILS. Suppose that you now add a link from P to Q and a link from P to W. For each of the following statements, state whether it is definitely true, possibly true, or definitely false:

A. Page P is now in IN. Definitely false.
B. Page P is now in SCC. Definitely true.
C. Page P remains in OUT. Definitely false.
D. Page Q remains in IN. Definitely false.
E. Page Q is now in SCC. Definitely true.
F. Page Q is now in TENDRILS. Definitely false.
G. Page W is now in OUT. Possibly true. It could instead now be in SCC, if previously there had been a path from W to P.
H. Page W remains in TENDRILS. Definitely false.
I. Some pages other than P, Q, or W have been moved into SCC. Possibly true.
J. The diameter of SCC has increased. Possibly true.

Problem 4: In the standard PageRank algorithm, every page starts with the same inherent value, and then acquires more value through its inlinks.

A. It has been suggested that a link from page P to page Q should be considered more significant if P and Q are from different domains than if they are from the same domain. Give two arguments in favor of this.

Answer: The obvious argument is that if A and B are different people, then B's endorsement of A's page is, on the whole, less biased than A's endorsement of his own page. The less obvious answer is this: if P and Q are in different web sites, then a link from P to Q is definitely a cross-reference; the author of P is saying that, if you're interested in P, you might want to look at Q. If P and Q are in the same web site, then a link from P to Q might just be a navigational tool; the author is saying that, if you're browsing the site, here's a way to get to Q. An extreme case of this is where Q is a page that must for some reason be included in the site but is rarely of interest to anyone, such as a privacy notice, a log of raw data, a long dull proof, and so on. Here the link is there so that, if necessary, you can get to the page, but even the author is not saying that anyone would actually want to go to the page.

B. In Brin and Page's original PageRank paper, they suggest that one could define alternative notions of PageRank by assigning different inherent values to different pages. At an extreme, one could assign all the inherent value to a single starting page START, and have all other pages derive their values via chains of inlinks from START. In terms of the stochastic model, this would correspond to a modified random walk where, if you flip heads, you follow a random outlink, and if you flip tails, you always jump to START, rather than jumping to a random page in the web. (They tried the experiment with START = John McCarthy's home page.)

Now suppose that you combine this model with the link-weighting scheme in part (A), interpreting "more significant" as "twice as significant". Using this new model, set up the system of linear equations for PageRank in the graph shown below. Assume that the probability of flipping heads is E = 0.6.

[Figure: a graph of six pages in three domains: A:START, A:U, A:X; B:T, B:V; and C:W. Its links, as reflected in the equations below, are START → T, U; T → START, V, W; U → START, W, X; V → W; X → W; and W → START.]

Answer: The equations differ from the standard formulation in that (a) every cross-domain outlink counts as two outlinks (e.g., B:T contributes 0.6 * 2/5 of its importance to A:START, 0.6 * 2/5 to C:W, and 0.6 * 1/5 to B:V), and (b) in the importance formulation, all the inherent importance lies in START, while in the Markov chain formulation, all flips of tails go to START.

In the importance formulation, the equations are:

START = 0.4 + 0.24 T + 0.15 U + 0.6 W
T = 0.4 START
U = 0.2 START
V = 0.12 T
W = 0.24 T + 0.3 U + 0.6 V + 0.6 X
X = 0.15 U

The constant term 0.4 in the first equation serves only to make sure that the solutions add to 1.0; any other positive value will give the same ordering of PageRanks.

In the Markov chain formulation, the first equation becomes

START = 0.4 START + 0.64 T + 0.55 U + 0.4 V + W + 0.4 X

the remaining equations remain unchanged, and the normalizing constraint

START + T + U + V + W + X = 1.0

is added.
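
As a check on the importance formulation in part (B), here is a small sketch that solves the system by fixed-point iteration; the starting vector and iteration count are arbitrary choices for the example:

    # Fixed-point iteration on the importance-formulation equations above.
    def iterate(ranks, steps=100):
        for _ in range(steps):
            s, t, u, v, w, x = (ranks[k] for k in
                                ("START", "T", "U", "V", "W", "X"))
            ranks = {
                "START": 0.4 + 0.24 * t + 0.15 * u + 0.6 * w,
                "T": 0.4 * s,
                "U": 0.2 * s,
                "V": 0.12 * t,
                "W": 0.24 * t + 0.3 * u + 0.6 * v + 0.6 * x,
                "X": 0.15 * u,
            }
        return ranks

    pages = ("START", "T", "U", "V", "W", "X")
    ranks = iterate({p: 1 / 6 for p in pages})
    print({p: round(ranks[p], 4) for p in pages})
    print("sum:", round(sum(ranks.values()), 4))   # 1.0 at the fixed point

Since every page passes on only a 0.6 fraction of its importance, the iteration converges, and at the fixed point the values sum to exactly 1.0, as noted above.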

Problem 5: Consider an experiment that is doing statistical studies over a sample of 1 million web pages. For any page P, let I(P) be the number of inlinks to P. Let P_N be the page with the Nth largest value of I(P) (ties broken arbitrarily). Suppose that I(P) follows the inverse power-law distribution

I(P_N) = floor(200,000,000 / (N+20)^2.1),

where floor(X) denotes the largest integer less than or equal to X. Thus, the page P_1 with the greatest number of inlinks has floor(200,000,000 / 21^2.1) = 334,479 inlinks.

A. How many pages have no inlinks? What is the median value of I(P)?

Answer: I(P_N) first reaches 0 when (N+20)^2.1 exceeds 200,000,000, that is, when N+20 = 8972, so at N = 8952. Thus only 8951 pages have any inlinks, and 991,049 pages have no inlinks. The median value of I(P) is therefore 0. (The sketch after part (D) recomputes these numbers.)

B. How many links are there in total? What is the average number of links per page?

Answer: The total number of links is

numLinks = Σ_{J=1}^{1,000,000} I(P_J) = 6,543,246.

The average number of links per page is numLinks / 1,000,000 = 6.543.

C. For any link L, let S(L) (the siblings of L) be the set of links whose head is the same as the head of L; consider L itself to be an element of S(L). For instance, in the graph in Problem 4, if L is the link A:U → C:W, then S(L) is the set { A:U → C:W, A:X → C:W, B:T → C:W, B:V → C:W }. What is the average size of S(L), averaged over all links in the experimental sample?

Answer: We want to evaluate

avgSibs = (1/numLinks) Σ_{L ∈ links} |S(L)|.

Note that each page P has I(P) inlinks, each of which has a sibling set of size I(P). Therefore, in the above sum, the page P contributes I(P) terms, one per inlink, each of size |S(L)| = I(P). Therefore

avgSibs = (1/numLinks) Σ_{L ∈ links} |S(L)| = (1/numLinks) Σ_{P ∈ pages} I(P)^2 = 121,038.4.

Note: The large difference between the median and the average number of links per page is characteristic of the power-law distribution.

D. Explain, in general terms, how the use of link-based evaluation strategies in popular search engines such as Google tends to exacerbate the very uneven distribution of inlinks among web pages.

Answer: Using link-based strategies in search engines means that people using search engines to find pages (which is how a large fraction of pages are found) will tend to find pages that already have many inlinks. Since those are the pages that get found, those are the pages that get linked to. Thus, pages with many inlinks tend to acquire more inlinks, and pages with few inlinks tend not to, regardless of inherent quality.
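
The numbers in parts (A) through (C) can be recomputed by brute force directly from the distribution; here is a minimal sketch (the variable names are mine):

    import math

    N_PAGES = 1_000_000

    def inlinks(n):
        # I(P_n) = floor(200,000,000 / (n + 20)^2.1)
        return math.floor(200_000_000 / (n + 20) ** 2.1)

    counts = [inlinks(n) for n in range(1, N_PAGES + 1)]
    num_links = sum(counts)

    print("pages with inlinks:", sum(1 for c in counts if c > 0))
    print("total links:", num_links)
    print("average links per page:", num_links / N_PAGES)
    # Each page with I(P) inlinks contributes I(P) sibling sets of size I(P).
    print("average |S(L)|:", sum(c * c for c in counts) / num_links)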

Problem 6: It is important that when a crawler downloads a page, it can quickly check whether it has seen the content before.

A. How can this check be implemented efficiently?

Answer: Hash the contents of the page to a large signature (64 bits is commonly used; the number of possible values must be much larger than the total number of pages to be downloaded). Then store these signatures in a hash table. When a new page comes in, compute its signature and check in the hash table whether that signature has been seen before.

B. It is reasonably frequent that two pages P and Q differ only in their HTML tags and white space. Describe how the method in (A) can be modified to check whether two pages are identical in this sense.

Answer: In each page, replace every gap between text (that is, every subsection that contains only white space and HTML tags) by a single blank character. Then proceed as in (A). (A sketch of this appears after part (C).)

Describe an application in which P and Q can be treated as identical (that is, if it has dealt with P, it can ignore Q).

Answer: Any application in which you are interested only in the text, e.g., text-based web mining.

Describe an application in which P and Q should not be treated as identical.

Answer: The key point here is that hyperlinks are HTML tags (though their anchor texts are text). Therefore any application that uses links, e.g., the crawler itself, must treat P and Q as different.

C. Suppose that the crawler for a general-purpose search engine has downloaded URL P and has discovered that its content is exactly identical, including HTML tags and white space, to URL Q, which has already been processed. What does the crawler now do with P?

Answer: This is a more ambiguous question than I intended. It is not clear, for example, how Google computes PageRank in such a case: does it merge P and Q in the PageRank computation, or does it treat them as separate pages? The two give different answers. Certainly some notation is made in the Q entry of the document table that URL P is an identical page, and certainly repeated pages are weeded out of the results page for a query. But whether P gets its own document ID or its own entries in the inverted file is not clear.
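
Here is a minimal sketch of the signature check from part (A) combined with the normalization from part (B); the regular expression and the particular hash (SHA-1 truncated to 64 bits) are my own choices for the example:

    import hashlib
    import re

    def normalize(html):
        # Part (B): replace every run of HTML tags and white space by one
        # blank, so pages differing only in markup and spacing hash alike.
        return re.sub(r"(\s|<[^>]*>)+", " ", html).strip()

    def signature(text):
        # Part (A): hash the content down to a 64-bit (8-byte) signature.
        return hashlib.sha1(text.encode("utf-8")).digest()[:8]

    seen = set()   # the "hash table" of signatures

    def is_duplicate(html):
        sig = signature(normalize(html))
        if sig in seen:
            return True
        seen.add(sig)
        return False

    print(is_duplicate("<p>Hello,   world</p>"))    # False: first time seen
    print(is_duplicate("<div>Hello, world</div>"))  # True: same text content

As the answer to (B) notes, an application that cares about links would hash the raw page instead of the normalized one, since normalization discards the HTML tags that carry the hyperlinks.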