Web Search Engines: Solutions to Final Exam, Part I
December 13, 2004

Problem 1:

A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to unit length?

Answer: To determine the similarity of two pages you want to consider the frequency (or, more generally, significance) of different words in the documents, not the overall length of the documents. That is: suppose that U and V are two documents, W has the same distribution of words as U but is twice as long, and X has the same distribution of words as V but is twice as long. Then what you want is that Sim(U,V) = Sim(U,X) = Sim(W,V) = Sim(W,X); and this is achieved if you normalize the vectors and use either the dot product or the distance between the vectors. But W · X = 4 (U · V) and distance(W,X) = 2 distance(U,V), so using either the dot product or the distance without normalizing gives wrong results.

B. Let D and E be the normalized vectors corresponding to documents D and E. What is the minimum possible value of D · E? If D · E is equal to this minimum, what can you say about D and E?

Answer: The minimum value is 0. This is achieved if D and E have no words in common (other than stop words).

C. What is the maximum possible value of D · E? If D · E is equal to this maximum, what can you say about D and E?

Answer: The maximum value is 1. This is achieved if D and E have identical word distributions.
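To make the normalization point in part (A) concrete, here is a minimal Python sketch (the example documents are invented for illustration). Because the term-frequency vectors are scaled to unit length, a document and a doubled copy of it produce the same vector, so they score identically against any other document.

    import math
    from collections import Counter

    def unit_vector(doc):
        # term-frequency vector for the document, normalized to unit length
        tf = Counter(doc.split())
        norm = math.sqrt(sum(c * c for c in tf.values()))
        return {w: c / norm for w, c in tf.items()}

    def sim(a, b):
        # cosine similarity: dot product of the two unit vectors
        u, v = unit_vector(a), unit_vector(b)
        return sum(u[w] * v.get(w, 0.0) for w in u)

    U = "search engines index the web"
    V = "spiders crawl the web"
    W = U + " " + U        # same word distribution as U, twice as long

    # equal up to rounding: document length does not affect the score
    print(sim(U, V), sim(W, V))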
Problem 2:

A. In Google, the inverted index is divided into separate parallel databases handled by separate servers. How is the index divided?

Answer: The index is divided by document. The server for a given document is calculated by hashing the document ID to a server index.

B. Describe the high-level steps in using the parallel database to answer queries.

Answer: The query is sent to all the servers for the inverted index. Each such server:
- looks up each query word in the inverted index and gets a list of documents, with the relevance of each document to the query word;
- calculates an overall list of candidate documents, using the Boolean (hard) constraints in the query;
- calculates the relevance of each candidate document to the query, using the relevance of the document to each query word, the proximity of the query words in the document, the PageRank of the document, etc.;
- returns to the main query server a list of documents relevant to the query, with a relevance score, sorted in decreasing order of relevance.

The main query server merges these lists to obtain an overall list in decreasing order of relevance.

C. Besides the parallelism involved in (A), there are two other types of parallelism involved in the query engine (not including parallelism in the crawler). Describe them briefly.

Answer: What I had in mind here was, first, that Google maintains multiple mirror web clusters at various geographic locations, to achieve robustness and efficient, fast use of the Internet; second, that there are multiple main query servers at each location, to handle different queries in parallel. Some students pointed out that, in addition, there is parallelism with the advertisement server and the spelling checker, which is also true.
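The following is a small Python sketch of the scheme in parts (A) and (B): a document-sharded index, a query fanned out to every shard, per-shard Boolean filtering and scoring, and a merge of the sorted partial lists. The class names, the number of shards, and the crude term-frequency score are invented for the example; a real engine's relevance score (word proximity, PageRank, and so on) is far more elaborate.

    import heapq
    from concurrent.futures import ThreadPoolExecutor

    NUM_SHARDS = 4   # toy value; a document's shard is hash(doc_id) % NUM_SHARDS

    class IndexShard:
        def __init__(self):
            self.postings = {}                 # word -> {doc_id: per-word score}

        def add(self, doc_id, words):
            for w in set(words):
                self.postings.setdefault(w, {})[doc_id] = words.count(w)

        def search(self, query_words, k):
            # hard Boolean AND over the query words ...
            candidates = None
            for w in query_words:
                docs = set(self.postings.get(w, {}))
                candidates = docs if candidates is None else candidates & docs
            # ... then a crude relevance score (sum of per-word scores)
            scored = [(sum(self.postings[w][d] for w in query_words), d)
                      for d in (candidates or set())]
            return heapq.nlargest(k, scored)   # sorted, best first

    shards = [IndexShard() for _ in range(NUM_SHARDS)]

    def index_document(doc_id, text):
        shards[hash(doc_id) % NUM_SHARDS].add(doc_id, text.split())

    def run_query(text, k=10):
        words = text.split()
        # fan the query out to every shard in parallel, then merge the partial lists
        with ThreadPoolExecutor() as pool:
            partials = pool.map(lambda s: s.search(words, k), shards)
        return heapq.nlargest(k, (hit for part in partials for hit in part))

    index_document("doc1", "web search engines rank pages")
    index_document("doc2", "spiders crawl the web")
    print(run_query("web search"))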
Problem 3: Pages in the Web graph are divided into the categories SCC, IN, OUT, TENDRILS, and DCC. Suppose that page P is in OUT, page Q is in IN, and that page W is in TENDRILS. Suppose that you now add a link from P to Q and a link from P to W. For each of the following statements, state whether it is definitely true, possibly true, or definitely false:

A. Page P is now in IN. Definitely false.
B. Page P is now in SCC. Definitely true.
C. Page P remains in OUT. Definitely false.
D. Page Q remains in IN. Definitely false.
E. Page Q is now in SCC. Definitely true.
F. Page Q is now in TENDRILS. Definitely false.
G. Page W is now in OUT. Possibly true. It could also now be in SCC, if previously there had been a path from W to P.
H. Page W remains in TENDRILS. Definitely false.
I. Some pages other than P, Q, or W have been moved into SCC. Possibly true.
J. The diameter of SCC has increased. Possibly true.

Problem 4: In the standard PageRank algorithm, every page starts with the same inherent value, and then acquires more value through its inlinks.

A. It has been suggested that a link from page P to page Q should be considered more significant if P and Q are from different domains than if they are from the same domain. Give two arguments in favor of this.

Answer: The obvious argument is that, if A and B are different people, then B's endorsement of A's page is, on the whole, less biased than A's endorsement of his own page. The less obvious answer is this: if P and Q are in different web sites, then a link from P to Q is definitely a cross-reference; the author of P is saying that, if you're interested in P, you might want to look at Q. If P and Q are in the same web site, then a link from P to Q might just be a navigational tool; the author is saying that, if you're browsing the site, here's a way to get to Q. An extreme case of this is where Q is a page that must for some reason be included in the site but is rarely of interest to anyone, such as a privacy notice, a log of raw data, a long dull proof, and so on. Here the link is there so that, if necessary, you can get to the page, but even the author is not saying that anyone would actually want to go to the page.

B. In Brin and Page's original PageRank paper, they suggest that one could define alternative notions of PageRank by assigning different inherent values to different pages. At an extreme, one could assign all the inherent value to a single starting page START, and have all other pages derive their values via chains of inlinks from START. In terms of the stochastic model, this would correspond to a modified version where, if you flip heads, you follow a random outlink; if you flip tails, you always jump to START, rather than jumping at random in the web. (They tried the experiment with START = John McCarthy's home page.) Now, suppose that you combine this model with the link-weighting scheme in part (A), interpreting "more significant" as "twice as significant". Using this new model, set up the system of linear equations for PageRank in the graph shown below. Assume that the probability of flipping heads is E = 0.6.

[Figure: a graph over the six pages A:START, A:U, A:X, B:T, B:V, and C:W, where the prefix gives each page's domain (A, B, or C).]

Answer: The equations differ from the standard formulation in that (a) every cross-domain outlink counts as two outlinks (e.g. B:T contributes 0.6 * 2/5 of its importance to A:START, 0.6 * 2/5 to C:W, and 0.6 * 1/5 to B:V), and (b) in the importance formulation, all the inherent importance lies in START, while in the Markov chain formulation, all flips of tails go to START.

In the importance formulation, the equations are

    START = 0.4 + 0.24 T + 0.15 U + 0.6 W
    T = 0.4 START
    U = 0.2 START
    V = 0.12 T
    W = 0.24 T + 0.3 U + 0.6 V + 0.6 X
    X = 0.15 U

The constant term 0.4 in the first equation only serves to make sure that the solutions add to 1.0. Any other positive value will give the same order of PageRanks.

In the Markov chain formulation, the first equation becomes

    START = 0.4 START + 0.64 T + 0.55 U + 0.4 V + W + 0.4 X

the remaining equations remain unchanged, and the normalizing constraint

    START + T + U + V + W + X = 1.0

is added.
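As a sanity check on these equations, here is a small Python sketch of the Markov chain formulation. The outlink structure of the graph is not spelled out in the text, so the edge list below is reconstructed from the coefficients in the answer and should be treated as an assumption; with E = 0.6 and cross-domain links counted double, power iteration should converge to a solution of the Markov chain equations above.

    E = 0.6  # probability of "heads" (follow a link)

    # page -> domain (the "A:", "B:", "C:" prefixes from the figure)
    domain = {"START": "A", "U": "A", "X": "A", "T": "B", "V": "B", "W": "C"}

    # outlinks reconstructed from the coefficients in the answer (an assumption)
    links = {
        "START": ["T", "U"],
        "T": ["START", "V", "W"],
        "U": ["START", "W", "X"],
        "V": ["W"],
        "X": ["W"],
        "W": ["START"],
    }

    def weight(p, q):
        # a cross-domain link counts as two outlinks, a same-domain link as one
        return 2 if domain[p] != domain[q] else 1

    # transition probabilities used when the coin comes up heads
    trans = {}
    for p, outs in links.items():
        total = sum(weight(p, q) for q in outs)
        trans[p] = {q: weight(p, q) / total for q in outs}

    # power iteration: tails always jump to START
    rank = {p: 1.0 / len(links) for p in links}
    for _ in range(1000):
        new = {p: 0.0 for p in links}
        for p in links:
            new["START"] += (1 - E) * rank[p]      # tails -> START
            for q, w in trans[p].items():
                new[q] += E * w * rank[p]          # heads -> weighted outlink
        rank = new

    for p, r in sorted(rank.items(), key=lambda kv: -kv[1]):
        print(p, round(r, 4))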
Problem 5: Consider an experiment that is doing statistical studies over a sample of 1 million web pages. For any page P, let I(P) be the number of inlinks to P. Let P_N be the page with the N-th largest value of I(P) (ties broken arbitrarily). Suppose that I(P) follows the inverse power-law distribution

    I(P_N) = ⌊200,000,000 / (N+20)^2.1⌋

where ⌊X⌋ is the largest integer less than or equal to X. Thus the page P_1 with the greatest number of inlinks has ⌊200,000,000 / 21^2.1⌋ = 334,479 inlinks.

A. How many pages have no inlinks? What is the median value of I(P)?

Answer: I(P_N) = 0 as soon as (N+20)^2.1 > 200,000,000, that is, for N + 20 ≥ 8972, i.e. N ≥ 8952. Thus only 8,951 pages have any inlinks; the remaining 991,049 pages have no inlinks. The median value of I(P) is therefore 0.

B. How many links are there in total? What is the average number of links per page?

Answer: The total number of links is

    numlinks = Σ_{J=1}^{1,000,000} I(P_J) = 6,543,246.

The average number of links per page is numlinks / 1,000,000 = 6.543.

C. For any link L, let S(L) (the siblings of L) be the set of links whose head is the same as the head of L. Consider L itself to be an element of S(L). For instance, in the graph in Problem 4, if L is the link A:U → C:W, then S(L) is the set { A:U → C:W, A:X → C:W, B:T → C:W, B:V → C:W }. What is the average size of S(L), averaged over all links in the experimental sample?

Answer: We want to evaluate

    avgsibs = (1/numlinks) Σ_{L ∈ links} |S(L)|.

Note that each page P has I(P) inlinks, each of which has a sibling set of size I(P). Therefore, in the above sum, the page P contributes I(P) terms, one for each of its inlinks, each of size |S(L)| = I(P). Therefore

    avgsibs = (1/numlinks) Σ_{L ∈ links} |S(L)| = (1/numlinks) Σ_{P ∈ pages} I(P)^2 = 121,038.4.

Note: The large difference between the median and the average number of links per page is characteristic of the power-law distribution.

D. Explain, in general terms, how the use of link-based evaluation strategies in popular search engines such as Google tends to exacerbate the very uneven distribution of inlinks among web pages.

Answer: Using link-based strategies in search engines means that people using search engines to find pages (which is the way a large fraction of pages are found) will tend to find pages with many inlinks. Since those are the pages that get found, those are the pages that will be linked to. Thus, pages with many inlinks tend to acquire more inlinks, and pages with few inlinks tend not to, regardless of inherent quality.
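A direct way to check the arithmetic in parts (A) through (C) is to generate the distribution and sum it; a short Python sketch follows. It counts the pages with at least one inlink, computes the total, average, and median number of links, and the average sibling-set size; the totals should agree with the figures quoted above.

    from math import floor

    N_PAGES = 1_000_000

    def inlinks(n):
        # I(P_n) = floor(200,000,000 / (n + 20)^2.1)
        return floor(200_000_000 / (n + 20) ** 2.1)

    I = [inlinks(n) for n in range(1, N_PAGES + 1)]

    pages_with_inlinks = sum(1 for x in I if x > 0)      # part A
    num_links = sum(I)                                   # part B
    avg_links = num_links / N_PAGES
    median_links = sorted(I)[N_PAGES // 2]               # 0: most pages have no inlinks
    avg_sibs = sum(x * x for x in I) / num_links         # part C

    print(pages_with_inlinks, N_PAGES - pages_with_inlinks)
    print(num_links, avg_links, median_links, avg_sibs)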
Problem 6: It is important that, when a crawler downloads a page, it can quickly check whether it has seen the content before.

A. How can this check be implemented efficiently?

Answer: Hash the contents of the page to a large signature (64 bits is commonly used; the number of possible values must be much larger than the total number of pages to be downloaded). Then store these signatures in a hash table. When a new page comes in, compute its signature, and check in the hash table whether this signature has been seen before.

B. It is reasonably frequent that two pages P and Q differ only in their HTML tags and white space. Describe how the method in (A) can be modified to check whether two pages are identical in this sense.

Answer: In each page, replace every gap between text (that is, every subsection that contains only white space and HTML tags) by a single blank character. Then proceed as in (A).

Describe an application in which P and Q can be treated as identical (that is, if it has dealt with P, it can ignore Q).

Answer: Any application where you are only interested in the text, e.g. text-based web mining.

Describe an application in which P and Q should not be treated as identical.

Answer: The key point here is that hyperlinks are HTML tags (though anchor text is text). Therefore any application that uses links, e.g. the crawler itself, must treat P and Q as different.

C. Suppose that the crawler for a general-purpose search engine has downloaded URL P and has discovered that its content is exactly identical, including HTML tags and white space, to URL Q, which has already been processed. What does the crawler now do with P?

Answer: This is a more ambiguous question than I intended. It is not clear, for example, how Google computes PageRank in such a case: does it merge P and Q in the PageRank computation, or does it treat them as separate pages? The two give different answers. Certainly some notation is made in the Q entry in the document table that URL P is an identical page. Certainly repeated pages are weeded out in the results page for a query. But whether P gets its own document ID or its own entries in the inverted file is not clear.
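Here is a minimal Python sketch of parts (A) and (B). The whitespace-and-tag normalization follows the description above; the signature is a SHA-1 digest truncated to 64 bits, standing in for the large fingerprint (such as a 64-bit Rabin fingerprint) that a production crawler would more likely use.

    import hashlib
    import re

    def text_signature(html):
        # Replace every run of HTML tags and white space by a single blank,
        # then hash the result; 16 hex digits = 64 bits of signature.
        normalized = re.sub(r"(<[^>]*>|\s)+", " ", html).strip()
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:16]

    seen = set()   # hash table of signatures already encountered

    def is_duplicate(html):
        sig = text_signature(html)
        if sig in seen:
            return True
        seen.add(sig)
        return False

    # Two pages differing only in tags and white space get the same signature:
    p = "<html><body><p>Hello,   world</p></body></html>"
    q = "<HTML>\n <BODY>Hello, world </BODY>\n</HTML>"
    print(is_duplicate(p), is_duplicate(q))   # False, then True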