Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30
Introduction Up to now: techniques for general Information Retrieval (building and compressing an index, querying, expressing relevance feedback). Today: a specific type of Information Retrieval system: web search engines. The World Wide Web differs from standard IR collections (e.g. newswires), and users of web search engines differ from users of standard IR systems 2/ 30
Overview Background and history Anatomy of the world wide web Web search users Web search systems Estimating the size of the index Duplicates and near-duplicates 3/ 30
Background and history Background and history The complexity of web search comes from: its scale (about 20 billion pages currently) its lack of coordination (decentralized content publishing) the heterogeneity of its contributors (motives and backgrounds) The success of the WWW comes from: an easy-to-learn markup language (HTML) robust browsers (unknown code is ignored) access to the source code (learning by example) 4/ 30
Background and history Early web search engines and web collections 2 families of engines: (1) full-text index-based search engines (Altavista, Excite, Infoseek) (2) taxonomy-based search engines (Yahoo!) (2) relies on a classification of documents that may be unintuitive, and has a high maintenance cost. Early collections: tens of millions of pages (larger than any prior collection); indexing and fast querying were performed successfully, but without the expected retrieval quality; new techniques were needed to rank the retrieved pages and deal with spam 5/ 30
Background and history Indexing the web Questions arising when one wants to index the web: Which pages can one trust? How can a search engine assign a measure of trust to a webpage? How to deal with the expansion of the collection? How to deal with redundancy? (*) By the end of 1995, Altavista had crawled 30 million static webpages (the size of the index doubled every few months) 6/ 30
Overview Anatomy of the world wide web Background and history Anatomy of the world wide web Web search users Web search systems Estimating the size of the index Duplicates and near-duplicates 7/ 30
Anatomy of the world wide web Web as a graph The web can be represented as a graph: webpage → node, hyperlink → directed edge, # in-links → in-degree of a node, # out-links → out-degree of a node. In-links follow a power law: the number of webpages with in-degree i is proportional to 1/i^α, with α ≈ 2.1 8/ 30
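A quick numeric sketch of this power-law distribution (the truncation point N is an assumption, used only to approximate the normalizing constant):

```python
# P(in-degree = i) is proportional to 1 / i**alpha, with alpha ≈ 2.1.
alpha = 2.1
N = 100_000  # truncation point for the normalizing sum (assumption)
norm = sum(1 / i ** alpha for i in range(1, N + 1))

def p_in_degree(i):
    """Approximate probability that a random webpage has in-degree i."""
    return (1 / i ** alpha) / norm

# Heavy tail: in-degree 10 is exactly 10**2.1 ≈ 126 times rarer than
# in-degree 1, so most pages have very few in-links.
print(p_in_degree(1), p_in_degree(10))
```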
Anatomy of the world wide web Web as a graph (continued) [Figure: a small example web graph with nodes A, B, C, D, E, F, G, J, K linked by directed edges] 9/ 30
Anatomy of the world wide web Web as a graph (continued) The web has a bowtie shape, where webpages belong to one of these 3 main categories: SCC (a giant strongly connected component) IN (pages with a path to the SCC, but not reachable from it) OUT (pages reachable from the SCC, but with no path back to it) There are no hyperlinks from OUT to SCC, nor from SCC to IN 10/ 30
Anatomy of the world wide web Web as a graph (continued) Figure from (Manning et al., 2008) 11/ 30
Anatomy of the world wide web Dealing with spam The web is considered a medium connecting advertisers to prospective buyers. Sponsored search engines (e.g. Goto, where advertisers bid on queries). 1st generation of spam: building documents stuffed with specific high-frequency terms, in order to appear first in the retrieval results for some queries. Cloaking: the server checks whether the client is a crawler; a crawler is served a misleading document, while a browser is served the spam. A doorway document is used to get highly ranked, but when accessed by a browser, it redirects the user to a spam page. Current solution: link analysis (more later) 12/ 30
Overview Web search users Background and history Anatomy of the world wide web Web search users Web search systems Estimating the size of the index Duplicates and near-duplicates 13/ 30
Web search users Web search users Improving retrieval results requires a better understanding of how the search engine is used: users do not know (or care) about the heterogeneity of the content; users do not know (or care) about the querying syntax; users use on average between 2 and 3 keywords. Attracting a bigger audience also requires understanding how the search engine is used (cf. revenue from sponsored search). The Google example: (1) focus on relevance and precision (rather than recall) (2) lightweight user experience (clean input and output graphical interface) 14/ 30
Web search users User query needs 3 categories of common search queries: informational (general information about a broad topic) navigational (a particular webpage; precision at 1 matters) transactional (a prelude to a transaction, such as a purchase or download) The category of the query should influence the search algorithm 15/ 30
Web search systems Web search systems From (Manning et al., 2008) 16/ 30
Overview Estimating the size of the index Background and history Anatomy of the world wide web Web search users Web search systems Estimating the size of the index Duplicates and near-duplicates 17/ 30
Estimating the size of the index Estimating the size of the index Question: given 2 web search engines, what are the relative sizes of their indexes? What about documents retrieved while not fully indexed? What about a partitioned index (only a proportion of the total index is used for retrieval)? There is no simple measure of the size of the index 18/ 30
Estimating the size of the index Estimating the size of the index (continued) Capture-recapture method: is a page indexed by E1 also indexed by E2, and vice versa? x: proportion of pages in E1 that are in E2; y: proportion of pages in E2 that are in E1. Thus x·|E1| ≈ y·|E2|, i.e. |E1|/|E2| ≈ y/x. How to select the sample of indexed webpages? Queries should not come from a specific group of users; webpages should not be hosted by a specific machine (IP sharing) 19/ 30
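A toy illustration of the capture-recapture estimate, where two sets of page ids stand in for the indexes of engines E1 and E2 (an assumption: in reality x and y are estimated from random samples, as the next slide describes):

```python
# Two toy "indexes" with a known overlap.
E1 = set(range(0, 500))      # engine 1 indexes pages 0..499
E2 = set(range(250, 1000))   # engine 2 indexes pages 250..999

x = len(E1 & E2) / len(E1)   # fraction of E1's pages also indexed by E2
y = len(E1 & E2) / len(E2)   # fraction of E2's pages also indexed by E1

# From x·|E1| = y·|E2| we get the size ratio |E1|/|E2| = y/x.
print(y / x)  # → 0.666..., exactly len(E1) / len(E2)
```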
Estimating the size of the index Selecting random queries Idea: pick a page at random from a search engine's index by posing a random query to it (keywords chosen at random). In practice: a) a random conjunctive query is sent to E1 → top 100 documents b) from these 100 documents, a page p is selected at random c) is p indexed by E2? 6-8 low-frequency terms of p are selected to query E2. Nonetheless, biases arise from the length of documents and from the ranking algorithm used. Solution: statistical sampling (evaluates the magnitude of the bias) 20/ 30
Overview Duplicates and near-duplicates Background and history Anatomy of the world wide web Web search users Web search systems Estimating the size of the index Duplicates and near-duplicates 21/ 30
Duplicates and near-duplicates Duplicates and near-duplicates An estimated 40% of the web consists of duplicates (e.g. mirroring for access reliability). To find duplicates: compute a fingerprint (a digest of the term sequence); when fingerprints match, the documents are compared (and in case of a duplicate, one document is removed from the index). To find near-duplicates: use shingling; the k-shingles of a document d are the set of all consecutive sequences of k terms in d; documents are near-duplicates when they share many shingles 22/ 30
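The fingerprint step can be sketched as follows; a minimal illustration, where the digest function (SHA-1) and the whitespace/case normalization are assumptions, not what a production engine necessarily uses:

```python
import hashlib

def fingerprint(doc: str) -> str:
    """Digest of the term sequence of a document."""
    terms = doc.lower().split()  # crude term extraction (assumption)
    return hashlib.sha1(" ".join(terms).encode()).hexdigest()

docs = {
    "d1": "the quick brown fox",
    "d2": "The  quick brown  fox",         # same term sequence: a duplicate
    "d3": "a completely different page",
}

seen = {}    # fingerprint -> id of the first document seen with it
index = []   # documents actually kept in the index
for doc_id, text in docs.items():
    fp = fingerprint(text)
    # On a fingerprint match, compare the documents before discarding.
    if fp in seen and docs[seen[fp]].lower().split() == text.lower().split():
        continue                           # confirmed duplicate: not indexed
    seen[fp] = doc_id
    index.append(doc_id)

print(index)  # → ['d1', 'd3']
```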
Duplicates and near-duplicates Finding near-duplicates Let S(d_j) be the set of shingles for document d_j. J(S(d_i), S(d_j)) = |S(d_i) ∩ S(d_j)| / |S(d_i) ∪ S(d_j)|. If J(S(d_i), S(d_j)) ≥ 0.9, d_j is not indexed. The cost of computing all pairwise comparisons is too high. How to estimate this shingle-sharing (i.e. reduce the comparison cost)? Estimation using a hash function producing a 64-bit integer, applied to the shingles 23/ 30
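A minimal sketch of k-shingling and the Jaccard coefficient above (k = 4 and the two example sentences are illustrative choices):

```python
def shingles(doc: str, k: int = 4) -> set:
    """Set of all consecutive k-term sequences in doc."""
    terms = doc.split()
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)

d1 = "a rose is a rose is a rose"
d2 = "a rose is a flower which is a rose"

s1, s2 = shingles(d1), shingles(d2)   # 3 and 6 distinct shingles
print(jaccard(s1, s2))                # → 0.125 (1 shared, 8 in the union)
```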
Duplicates and near-duplicates Finding near-duplicates (continued) Let us define H(d_j) = {hash(x) | x ∈ S(d_j)}. We look for pairs (d_i, d_j) such that H(d_i) and H(d_j) have a large overlap. Let π be a random bit permutation within a 64-bit integer. Let Π(d_j) be the set of permuted hash values in H(d_j). Let x_j^π be the smallest integer in Π(d_j). Theorem: J(S(d_i), S(d_j)) = P[x_i^π = x_j^π] 24/ 30
Duplicates and near-duplicates Finding near-duplicates (continued) (Figure from Manning et al., 2008) 25/ 30
Duplicates and near-duplicates Finding near-duplicates (continued) Intuition: consider the m × n matrix A where rows are elements i (e.g. hash values) and columns are sets S_j (e.g. H(d_x)), with a_ij = 1 iff element i is in S_j. Π is a random permutation of the rows of A; Π(S_j) is the permuted column j; x_j^π is the index of the first row in which Π(S_j) has a 1. For two columns m, n: P[x_m^π = x_n^π] = J(S_m, S_n) 26/ 30
Duplicates and near-duplicates Finding near-duplicates (continued) Consider two columns S_m and S_n, filled with 0s and 1s. Let C_11 be the number of rows with a 1 in both columns, and C_10 and C_01 the number of rows with a 1 in only one of the two columns. Then J(S_m, S_n) = C_11 / (C_01 + C_10 + C_11). This fraction is also P[x_m^π = x_n^π] (the probability of finding a 1-1 row first during a top-down row scan) 27/ 30
Duplicates and near-duplicates Finding near-duplicates (continued) To sum up: a test for Jaccard overlap based on the permutation Π. The values x_i^π are computed for the different documents; if x_i^π = x_j^π, then documents i and j are likely near-duplicates. Not all permutations are computed: only a sketch ψ_i of 200 permutations is kept; |ψ_i ∩ ψ_j| / 200 ≥ threshold means documents i and j are near-duplicates 28/ 30
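The whole pipeline can be sketched as follows. Instead of the random bit permutations of 64-bit integers described above, this uses random linear hash functions modulo a large prime, a common stand-in with the same min-wise behaviour (an assumption relative to the slides); the two shingle-hash sets are toy data:

```python
import random

P = (1 << 61) - 1  # a large Mersenne prime for the linear hash functions

def make_sketch(shingle_hashes, n_perm=200, seed=42):
    """Sketch ψ: one minimum per 'permutation' h(x) = (a*x + b) mod P."""
    rng = random.Random(seed)  # same seed → same functions for every doc
    funcs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(n_perm)]
    return [min((a * h + b) % P for h in shingle_hashes) for a, b in funcs]

def sketch_similarity(psi_i, psi_j):
    """Fraction of agreeing components: estimates the Jaccard coefficient."""
    return sum(u == v for u, v in zip(psi_i, psi_j)) / len(psi_i)

# Two heavily overlapping sets of (already hashed) shingles.
S1 = set(range(100))
S2 = set(range(20, 120))   # shares 80 of the 120 distinct shingles

true_j = len(S1 & S2) / len(S1 | S2)   # 80/120 ≈ 0.667
est = sketch_similarity(make_sketch(S1), make_sketch(S2))
print(true_j, est)  # est lands close to true_j
```

With 200 components the standard error of the estimate is about √(J(1−J)/200) ≈ 0.03 here, which is why a fixed-size sketch suffices in place of full pairwise shingle comparisons.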
Conclusion Introduction to the WWW (history, shape, users) Estimating the size of the index of a web search engine Techniques to remove duplicates (fingerprints) and near-duplicates (shingles) To come: web crawling, link analysis (Google PageRank) 29/ 30
Duplicates and near-duplicates References
C. Manning, P. Raghavan and H. Schütze. Introduction to Information Retrieval. http://nlp.stanford.edu/ir-book/pdf/chapter19-webchar.pdf
Ziv Bar-Yossef and Maxim Gurevich. Random Sampling from a Search Engine's Index (2006). http://www2006.org/programme/item.php?id=3047
Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener. Graph structure in the web (2000). http://www9.org/w9cdrom/160/160.html
30/ 30