Information Retrieval. Lecture 9 - Web search basics

Similar documents
Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 10 - Web crawling

Social and Technological Network Data Analytics. Lecture 5: Structure of the Web, Search and Power Laws. Prof Cecilia Mascolo

Lecture 17 November 7

Topic: Duplicate Detection and Similarity Computing

Cloak of Visibility. -Detecting When Machines Browse A Different Web. Zhe Zhao

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Brief (non-technical) history

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Lec 8: Adaptive Information Retrieval 2

US Patent 6,658,423. William Pugh

A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007

Information Retrieval

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Do TREC Web Collections Look Like the Web?

Information Retrieval and Search Engines. Introduction to Information Retrieval GESC1007

Finding Similar Sets

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

World Wide Web has specific challenges and opportunities

Ranking of ads. Sponsored Search

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

CS 345A Data Mining Lecture 1. Introduction to Web Mining

Co-clustering or Biclustering

Connected Components, and Pagerank

Introduction to Information Retrieval

Search Engines. Information Retrieval in Practice

CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB

Finding dense clusters in web graph

Today s lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

21. Search Models and UIs for IR

CS506/606 - Topics in Information Retrieval

5. search engine marketing

On Compressing Social Networks. Ravi Kumar. Yahoo! Research, Sunnyvale, CA. Jun 30, 2009 KDD 1

: Semantic Web (2013 Fall)

Automatic Identification of User Goals in Web Search [WWW 05]

CS47300 Web Information Search and Management

THE HISTORY & EVOLUTION OF SEARCH

Finding Similar Items:Nearest Neighbor Search

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

A Taxonomy of Web Search

Information Retrieval. Lecture 5 - The vector space model. Introduction. Overview. Term weighting. Wintersemester 2007

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Searching the Web [Arasu 01]

The Anatomy of a Large-Scale Hypertextual Web Search Engine

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

CS290N Summary Tao Yang

Link Analysis and Web Search

COMP Page Rank

Information Retrieval and Web Search

Review: Searching the Web [Arasu 2001]

Big Data Analytics CSCI 4030

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Information Retrieval

Combinatorial Algorithms for Web Search Engines - Three Success Stories

Link Analysis in Web Mining

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Lecture 8: Linkage algorithms and web search

Information Retrieval Spring Web retrieval

New Issues in Near-duplicate Detection

SEARCH ENGINE INSIDE OUT

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

arxiv:cs/ v1 [cs.ir] 26 Apr 2002

Introduction to Data Mining

Lecture 5: Data Streaming Algorithms

CSE 5243 INTRO. TO DATA MINING

A STUDY ON THE EVOLUTION OF THE WEB

Today we show how a search engine works

Some Characteristics of Web Data and their Reflection on Our Society: an Empirical Approach *

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Big Data Analytics CSCI 4030

Introduction to Information Retrieval

Text Technologies for Data Science INFR11145 Web Search Walid Magdy Lecture Objectives

Breadth-First Search Crawling Yields High-Quality Pages

An introduction to Web Mining part II

Lecture #3: PageRank Algorithm The Mathematics of Google Search

DATA MINING - 1DL105, 1DL111

Recent Researches on Web Page Ranking

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

SEO 1 8 O C T O B E R 1 7

Introduction to Information Retrieval

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

DATA MINING II - 1DL460. Spring 2014"

Module 1: Internet Basics for Web Development (II)

Random Sampling from a Search Engine s Index

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Mathematical Analysis of Google PageRank

Transcription:

Information Retrieval, Lecture 9 - Web search basics. Seminar für Sprachwissenschaft, International Studies in Computational Linguistics, Wintersemester 2007

Introduction. Up to now: techniques for general Information Retrieval (building and compressing an index, querying, expressing relevance feedback). Today: a specific type of Information Retrieval system, the web search engine. The World Wide Web differs from standard IR collections (e.g. newswires), and the users of web search engines differ from the users of standard IR systems.

Overview. Background and history. Anatomy of the world wide web. Web search users. Web search systems. Estimating the size of the index. Duplicates and near-duplicates.

Background and history. The complexity of web search comes from: its scale (about 20 billion pages at the time of writing), its lack of coordination (decentralized content publishing), and the heterogeneity of its contributors (motives and backgrounds). The success of the WWW comes from: an easy-to-learn markup language (HTML), robust browsers (unknown code is ignored), and access to the source code (learning by example).

Background and history. Early web search engines and web collections. Two families of engines: (1) full-text index-based search engines (AltaVista, Excite, Infoseek) and (2) taxonomy-based search engines (Yahoo!). Family (2) relies on a classification of documents that may be unintuitive and has a high maintenance cost. Early collections held tens of millions of pages (larger than any prior collection); indexing and fast querying were performed successfully, but without the expected retrieval quality: new techniques were needed to rank the retrieved pages and to deal with spam.

Background and history. Indexing the web. Questions arising when one wants to index the web: Which pages can one trust? How can a search engine assign a measure of trust to a webpage? How to deal with the expansion of the collection? How to deal with redundancy? By the end of 1995, AltaVista had crawled 30 million static webpages, and the size of its index doubled every few months.

Overview. Anatomy of the world wide web. Background and history. Anatomy of the world wide web. Web search users. Web search systems. Estimating the size of the index. Duplicates and near-duplicates.

Anatomy of the world wide web. Web as a graph. The web can be represented as a graph: a webpage corresponds to a node and a hyperlink to a directed edge; the number of in-links to a page is the in-degree of its node, and the number of out-links its out-degree. In-degrees follow a power law: the number of webpages with in-degree i is proportional to 1/i^α, with α ≈ 2.1.
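As a rough numerical illustration of the power law (a sketch with illustrative numbers, not measurements of the real web), the proportionality 1/i^α implies that pages with small in-degree vastly outnumber pages with large in-degree:

```python
# Illustrative sketch of the in-degree power law: the number of pages
# with in-degree i is proportional to 1 / i**alpha, with alpha ~ 2.1.
ALPHA = 2.1

def indegree_ratio(i: int, j: int, alpha: float = ALPHA) -> float:
    """How many times more pages have in-degree i than in-degree j."""
    return (j / i) ** alpha

# Pages with a single in-link vs. pages with ten in-links:
print(round(indegree_ratio(1, 10), 1))  # -> 125.9
```

So under this model, pages with a single in-link are roughly 126 times more common than pages with ten.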

Anatomy of the world wide web. Web as a graph (continued). [Figure: a small example web graph with nodes A-K connected by directed hyperlink edges]

Anatomy of the world wide web. Web as a graph (continued). The web has a bowtie shape in which webpages belong to one of three categories: IN (pages with a path into the SCC but none back from it), OUT (pages reachable from the SCC but with no path back to it), and SCC (the strongly connected core, within which every page can reach every other). There are no hyperlinks from OUT to SCC, nor from SCC to IN. [Figure: bowtie diagram with IN feeding into SCC and SCC feeding into OUT]

Anatomy of the world wide web. Web as a graph (continued). Figure from (Manning et al., 2008).

Anatomy of the world wide web. Dealing with spam. The web came to be seen as a medium connecting advertisers to prospective buyers, hence sponsored search engines (e.g. GoTo, which let advertisers bid on queries). First generation of spam: building documents stuffed with specific high-frequency terms in order to appear first in the retrieval results for some queries. Cloaking: the server asks, is the client a crawler? If yes, it serves a misleading document; if no, it serves the spam page. A doorway document is used to get highly ranked, but when accessed by a browser it redirects the user to a spam page. Current solution: link analysis (more later).

Overview. Web search users. Background and history. Anatomy of the world wide web. Web search users. Web search systems. Estimating the size of the index. Duplicates and near-duplicates.

Web search users. Improving retrieval results requires a better understanding of how the search engine is used: users do not know (or care) about the heterogeneity of content; users do not know (or care) about the query syntax; users type on average between 2 and 3 keywords. Attracting a bigger audience likewise requires understanding how the search engine is used (cf. revenue from sponsored search). The Google example: (1) focus on relevance and precision (rather than recall); (2) lightweight user experience (clean input and output interface).

Web search users. User query needs. Three categories of common search queries: informational (general information about a broad topic), navigational (a particular webpage; precision at 1 matters), transactional (prelude to a transaction, such as a purchase or download). The category of the query should influence the search algorithm.

Web search systems. From (Manning et al., 2008).

Overview. Estimating the size of the index. Background and history. Anatomy of the world wide web. Web search users. Web search systems. Estimating the size of the index. Duplicates and near-duplicates.

Estimating the size of the index. Question: given two web search engines, what are the relative sizes of their indices? What about documents retrieved while not fully indexed? What about partitioned indices (where only a proportion of the total index is used for retrieval)? There is no simple measure of the size of the index.

Estimating the size of the index (continued). Capture-recapture method: is a page indexed by E1 also indexed by E2, and vice versa? Let x be the proportion of pages in E1 that are in E2, and y the proportion of pages in E2 that are in E1; then x · |E1| ≈ y · |E2|, and thus |E1| / |E2| ≈ y / x. How should the sample of indexed webpages be selected? The queries should not come from a specific group of users, and the webpages should not be hosted by a specific machine (IP sharing).
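The arithmetic of the capture-recapture estimate can be sketched in a few lines (the sample proportions below are hypothetical):

```python
def relative_index_size(x: float, y: float) -> float:
    """Capture-recapture estimate of |E1| / |E2|.

    x: proportion of sampled E1 pages also indexed by E2
    y: proportion of sampled E2 pages also indexed by E1
    From x * |E1| ~ y * |E2| it follows that |E1| / |E2| ~ y / x.
    """
    if x == 0:
        raise ValueError("no overlap observed from E1's side; ratio undefined")
    return y / x

# Hypothetical sample results: E2 covers half of E1's pages (x = 0.5),
# while E1 covers only a quarter of E2's pages (y = 0.25):
print(relative_index_size(0.5, 0.25))  # -> 0.5, i.e. E1 is about half the size of E2
```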

Estimating the size of the index. Selecting random queries. Idea: pick a page at random from a search engine's index by posing a random query to it (keywords chosen at random). In practice: a) a random conjunctive query is submitted to E1, yielding the top 100 documents; b) from these 100 documents, a page p is selected at random; c) is p indexed by E2? Select 6-8 low-frequency terms in p to query E2. There remains a bias from the length of documents and from the ranking algorithm used. Solution: statistical sampling (to evaluate the magnitude of the bias).

Overview. Duplicates and near-duplicates. Background and history. Anatomy of the world wide web. Web search users. Web search systems. Estimating the size of the index. Duplicates and near-duplicates.

Duplicates and near-duplicates. An estimated 40% of the web consists of duplicates (e.g. mirroring for access reliability). To find duplicates: compute a fingerprint (a digest of the term sequence); when fingerprints match, the documents are compared (and, in case of a duplicate, one document is removed from the index). To find near-duplicates: use shingling. The set of k-shingles of a document d is the set of all consecutive sequences of k terms in d; documents are near-duplicates when they share many shingles.
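A minimal sketch of shingling in Python (the function name and the naive whitespace tokenization are my assumptions; real systems normalize the text first):

```python
def k_shingles(text: str, k: int = 4) -> set:
    """Return the set of k-shingles (sequences of k consecutive terms) of a text."""
    terms = text.split()  # naive whitespace tokenization, for illustration only
    return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}

# Repetition collapses in the shingle *set*: this 8-term document
# yields only 3 distinct 4-shingles:
# 'a rose is a', 'rose is a rose', 'is a rose is'
print(k_shingles("a rose is a rose is a rose"))
```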

Duplicates and near-duplicates. Finding near-duplicates. Let S(d_j) be the set of shingles of document d_j, and J(S(d_i), S(d_j)) = |S(d_i) ∩ S(d_j)| / |S(d_i) ∪ S(d_j)| their Jaccard coefficient. If J(S(d_i), S(d_j)) ≥ 0.9, d_j is not indexed. The cost of computing all pairwise comparisons is too high; how can this shingle overlap be estimated (i.e. the comparison cost reduced)? The estimation uses a hash function that maps each shingle to a 64-bit integer.
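The Jaccard coefficient itself is easy to compute exactly on small sets; it is the all-pairs computation over billions of documents that is infeasible, which is what the permutation trick on the following slides avoids. A minimal version:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two shingle sets."""
    if not a and not b:
        return 1.0  # convention for two empty documents
    return len(a & b) / len(a | b)

# Two toy shingle sets sharing 1 of 3 distinct shingles:
print(jaccard({"a rose is a", "rose is a rose"},
              {"a rose is a", "a new shingle"}))  # -> 0.3333333333333333
```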

Duplicates and near-duplicates. Finding near-duplicates (continued). Define H(d_j) = {hash(x) | x ∈ S(d_j)}. We look for pairs (d_i, d_j) such that H(d_i) and H(d_j) have a large overlap. Let π be a random permutation of the 64-bit integers, let Π(d_j) be the set of permuted hash values in H(d_j), and let x_j^π be the smallest integer in Π(d_j). Theorem: J(S(d_i), S(d_j)) = P[x_i^π = x_j^π].

Duplicates and near-duplicates. Finding near-duplicates (continued). (Figure from Manning et al., 2008.)

Duplicates and near-duplicates. Finding near-duplicates (continued). Intuition: consider the matrix A in which rows correspond to elements i (e.g. hash values) and columns to sets S_j (e.g. H(d_x)), with a_ij = 1 iff element i is in S_j. Π is a random permutation of the rows of A, and Π(S_j) is the permuted column j. x_j^π is the index of the first row in which the permuted column j contains a 1. For two columns j_m, j_n: P[x_{j_m}^π = x_{j_n}^π] = J(S_{j_m}, S_{j_n}).

Duplicates and near-duplicates. Finding near-duplicates (continued). Consider two columns S_{j_m} and S_{j_n}, filled with 1s and 0s:

S_{j_m}  S_{j_n}
   0        1
   1        1
   1        0
   0        1
   0        0

Let C_11 count the rows of type (1,1), and C_01 and C_10 the rows of type (0,1) and (1,0). Then J(S_{j_m}, S_{j_n}) = C_11 / (C_01 + C_10 + C_11). This fraction is also P[x_{j_m}^π = x_{j_n}^π], the probability of finding a (1,1) row first during a top-down row scan.

Duplicates and near-duplicates. Finding near-duplicates (continued). To sum up: the test for Jaccard overlap is based on the permutation Π. The values x_i^π are computed for different documents; if x_i^π = x_j^π, documents i and j are candidate near-duplicates. Not all permutations are computed: only a sketch ψ_i of 200 permutations is kept per document, and documents i and j are declared near-duplicates when |ψ_i ∩ ψ_j| / 200 ≥ threshold.
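The pipeline above can be sketched end to end. This is an illustrative implementation under stated assumptions, not the production scheme: shingle hashes are 64-bit blake2b digests, and each "random permutation" is simulated by XOR with a random 64-bit mask (XOR with a fixed mask is a bijection on 64-bit integers, though only an approximation of a uniformly random permutation):

```python
import hashlib
import random

def h64(shingle: str) -> int:
    """Hash a shingle to a 64-bit integer."""
    return int.from_bytes(hashlib.blake2b(shingle.encode(), digest_size=8).digest(), "big")

def sketch(shingle_hashes: set, n_perm: int = 200, seed: int = 42) -> list:
    """Minhash sketch: the smallest permuted hash value under each of
    n_perm simulated permutations (XOR masks, an assumption for illustration)."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(n_perm)]
    return [min(h ^ m for h in shingle_hashes) for m in masks]

def estimated_jaccard(psi_i: list, psi_j: list) -> float:
    """Fraction of permutations on which the two minima agree."""
    return sum(a == b for a, b in zip(psi_i, psi_j)) / len(psi_i)

d1 = {h64(s) for s in ["a rose is a", "rose is a rose", "is a rose is"]}
d2 = d1 | {h64("a completely new shingle")}  # true Jaccard is 3/4

print(estimated_jaccard(sketch(d1), sketch(d2)))  # typically close to 0.75
```

With the threshold test from the slide, `estimated_jaccard(sketch(d1), sketch(d2)) >= threshold` declares the two documents near-duplicates.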

Conclusion. Introduction to the WWW (history, shape, users). Estimating the size of the index of a web search engine. Techniques to remove duplicates (fingerprints) and near-duplicates (shingles). To come: web crawling, and link analysis (Google PageRank).

References. C. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval. http://nlp.stanford.edu/ir-book/pdf/chapter19-webchar.pdf
Ziv Bar-Yossef and Maxim Gurevich, Random Sampling from a Search Engine's Index (2006). http://www2006.org/programme/item.php?id=3047
Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins and Janet Wiener, Graph structure in the web (2000). http://www9.org/w9cdrom/160/160.html