World Wide Web has specific challenges and opportunities


Percival Walters
5 years ago

6. Web Search

Motivation

Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today.

The World Wide Web has specific challenges and opportunities:
- it is decentralized, i.e., there is no central repository from which we can obtain all available web pages
- web pages are connected via hyperlinks, which provide additional information, potentially including anchor texts
- search engines can monitor the behavior of their users (e.g., which results they click on) and use this to improve the effectiveness of search results

A Brief History of Web Search

- 1990s: Early web search engines (e.g., Excite and AltaVista) mostly relied on traditional methods from Information Retrieval (e.g., tf.idf-based models that only looked at the text in web pages)
- 1998: Google, a project out of Stanford University, makes clever use of the Web's link structure with its PageRank algorithm to estimate the importance of web pages
- since the mid-2000s: search engines increasingly rely on learning-to-rank methods, which use machine learning to rank result documents based on observed user behavior, taking into account a large number of signals

Agenda

6.1 The Web as a Graph
6.2 Crawling the Web
6.3 Link Analysis
6.4 Web Spam
6.5 Learning-to-Rank

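6.1 The Web as a Graph

The World Wide Web can be seen as a directed graph G(V, E) with
- web pages (URLs) as vertices V (nodes)
- hyperlinks as directed edges E

Adjacency matrix: A, with $A_{uv} = 1$ if $(u, v) \in E$ and $A_{uv} = 0$ otherwise.

[Figure: example graph and its adjacency matrix; entries not recoverable]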
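The adjacency matrix of a small web graph can be built as in the following minimal sketch. The page names and hyperlinks are invented for illustration, and `adjacency_matrix` is a hypothetical helper, not something taken from the slides.

```python
def adjacency_matrix(nodes, edges):
    """Return (index, A) where A[i][j] == 1 iff there is a hyperlink i -> j."""
    index = {node: i for i, node in enumerate(nodes)}
    A = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        A[index[u]][index[v]] = 1
    return index, A


# Hypothetical pages and hyperlinks, invented for illustration.
pages = ["a.example", "b.example", "c.example"]
links = [
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("c.example", "a.example"),
]

index, A = adjacency_matrix(pages, links)
outdegree = [sum(row) for row in A]        # row sums
indegree = [sum(col) for col in zip(*A)]   # column sums

for page, row in zip(pages, A):
    print(page, row)
```

Row sums give outdegrees and column sums give indegrees, which are the quantities whose distributions are discussed next.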


Degree Distributions

Researchers have investigated different structural properties of the web graph, including its degree distributions.

Distributions of indegrees and outdegrees follow power laws

$$P[\text{degree} = k] \propto k^{-\alpha}$$

with $\alpha \approx 2.1$ (indegrees) and $\alpha \approx 2.7$ (outdegrees).

Bow-Tie Structure

The connectivity structure of the web graph resembles a bow tie:
- a large strongly connected component (SCC) of web pages that are reachable from each other
- smaller sets IN and OUT: from the pages in IN, the web pages in the SCC can be reached; the pages in OUT are reachable from the SCC

Small Diameter

The diameter of a graph is defined as the maximal length of any shortest path between two nodes; a small diameter means that we can get from any node to any other node in few steps.

The web graph has been observed to have a small diameter within its SCC, i.e., we can get from any web page to any other web page by following few hyperlinks.

Small diameters were first observed in the analysis of social networks and are known as the small-world phenomenon.
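The definition of the diameter can be made concrete with a breadth-first search from every node. This is a minimal sketch that assumes an unweighted, strongly connected directed graph given as an adjacency list; the example graph is invented for illustration, and the all-pairs approach is only practical for small graphs.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(graph):
    """Maximal shortest-path length; assumes the graph is strongly connected."""
    return max(max(bfs_distances(graph, s).values()) for s in graph)


# Hypothetical strongly connected graph, invented for illustration.
example = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}
print(diameter(example))
```

Here the longest shortest path runs from A to D (A, B, C, D), so the diameter is 3.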

Small-World Phenomenon

The small-world phenomenon was first described in a famous experiment conducted by Stanley Milgram in 1967:
- participants of the experiment had to forward a package to a target person located in Boston and were only allowed to pass on the package to a person known to them
- on average, it took about six steps (i.e., passes) to route the package to the target person

This result is famously known as six degrees of separation, alluding to the fact that everybody seems connected to anybody else via a few steps.

Random Graph Models

- The Erdős-Rényi model G(n, p) randomly generates a graph having n nodes, in which each possible edge exists with probability p
- The Barabási-Albert model randomly generates a graph by iteratively adding nodes, each of which has m outgoing edges; the target node of such an edge is chosen with probability depending on its current indegree (i.e., "the rich get richer")

The Barabási-Albert model yields graphs in which the distribution of indegrees follows a power law (i.e., similar to the web graph).
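Both models can be sketched in a few lines. The code below is an illustration under stated assumptions rather than a reference implementation: edges are directed, self-loops are excluded, and in the Barabási-Albert sketch a target is chosen with probability proportional to its indegree plus one, so that nodes without incoming links can still be selected. The function names are hypothetical.

```python
import random


def erdos_renyi(n, p, rng):
    """G(n, p): each possible directed edge (no self-loops) exists with probability p."""
    return [(u, v) for u in range(n) for v in range(n)
            if u != v and rng.random() < p]


def barabasi_albert(n, m, rng):
    """Add nodes one by one; each new node links to m distinct earlier nodes,
    chosen with probability proportional to (indegree + 1). Requires n > m."""
    edges = []
    # pool holds every node once, plus once more per incoming edge
    pool = list(range(m))
    for new in range(m, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(pool))
        for v in sorted(targets):
            edges.append((new, v))
        pool.extend(targets)
        pool.append(new)
    return edges


rng = random.Random(42)
ba_edges = barabasi_albert(1000, 2, rng)
indegree = [0] * 1000
for _, v in ba_edges:
    indegree[v] += 1
print("largest indegrees:", sorted(indegree, reverse=True)[:5])
```

Sampling from `pool` is a simple way to realize preferential attachment: a node with many incoming edges appears many times in the list and is therefore more likely to be drawn.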

6.2 Crawling the Web

The World Wide Web is inherently decentralized, i.e., web pages are stored on geographically distributed web servers, and there is no central repository of all available URLs.

When building a search engine, one thus first has to discover and download the web pages to be indexed.

Search engines employ so-called crawlers (also: spiders) to discover and download web pages; a crawler starts from a set of seed pages and follows the hyperlinks therein to discover other web pages and traverse the Web.

Crawling Illustrated

[Figure: example web graph with pages A to G; Frontier: B, C]

    Frontier = new PriorityQueue()
    Frontier.add(Seeds)                  // insert seeds
    while !Frontier.isEmpty()
        URL = Frontier.remove()          // get next URL
        URL.fetch()                      // retrieve URL's contents
        Frontier.add(URL.getLinks())     // enqueue links
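The crawling loop can be sketched in Python as follows. This is a simplified illustration, not the slides' implementation: it swaps the priority queue for a plain FIFO queue (breadth-first order), adds a `seen` set so that pages are not enqueued twice, and takes a hypothetical `fetch_links` callback in place of real HTTP requests, so that it can run on an in-memory toy web.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl; returns the URLs in the order they were fetched."""
    frontier = deque(seeds)      # URLs waiting to be fetched
    seen = set(seeds)            # URLs ever added to the frontier
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()             # get next URL
        visited.append(url)
        for link in fetch_links(url):        # retrieve contents, extract links
            if link not in seen:
                seen.add(link)
                frontier.append(link)        # enqueue newly discovered links
    return visited


# Toy web, invented for illustration: page -> pages it links to.
toy_web = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["A"], "E": []}
print(crawl(["A"], lambda url: toy_web.get(url, [])))
```

Without the `seen` set, the cycle A, B, D, A in the toy web would keep the crawler running until `max_pages` is reached. A real crawler would additionally have to respect the politeness and robustness constraints listed next.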

Challenges

Politeness
- a crawler may only request permitted resources (as specified in robots.txt)
- requests to the same web server have to be issued at a moderate rate (e.g., 1 request/second) to avoid overloading it

Robustness
- crawler traps (e.g., dynamically generated calendars)
- incomplete/invalid HTML documents
- network problems (e.g., high latency, low bandwidth, temporary unavailability of web servers)

Robots Exclusion Standard (robots.txt)

The Robots Exclusion Standard demands that crawlers first request the file robots.txt, in which the content provider can specify which crawlers can access which contents at which rate.

    User-agent: WebReaper
    Disallow: /

    User-agent: Slurp
    Crawl-delay: 18

    User-agent: *
    Disallow: /active/
    Disallow: /artikelversand/
    Disallow: /cgi-bin/
    Disallow: /staticgen/mobil/
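A crawler written in Python could evaluate such a file with the standard library's `urllib.robotparser`. The sketch below parses the example rules from a string instead of downloading them; the host `example.org` and the user agent `SomeBot` are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: WebReaper
Disallow: /

User-agent: Slurp
Crawl-delay: 18

User-agent: *
Disallow: /active/
Disallow: /artikelversand/
Disallow: /cgi-bin/
Disallow: /staticgen/mobil/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# WebReaper is excluded from the whole site.
print(parser.can_fetch("WebReaper", "https://example.org/index.html"))
# Any other crawler falls back to the "*" rules.
print(parser.can_fetch("SomeBot", "https://example.org/cgi-bin/search"))
print(parser.can_fetch("SomeBot", "https://example.org/news/today.html"))
# Slurp is asked to wait 18 seconds between requests.
print(parser.crawl_delay("Slurp"))
```

In a live crawler one would instead call `set_url(...)` and `read()` to fetch the file from the web server before issuing any other request to that host.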

6.3 Link Analysis

The link structure of the World Wide Web can be analyzed to obtain information about the importance of web pages.

PageRank, as used in the original Google search engine, is based on the intuition that a web page is important if other important web pages contain hyperlinks to it.

PageRank is based on a random walk on the web graph, which can be formally described and analyzed as a so-called Markov chain.

PageRank

PageRank is based on a random walk on the web graph: in each step, a random surfer chooses between two options:
- with probability $\varepsilon$, the random surfer performs a random jump to any of the $|V|$ nodes (web pages) in the web graph
- with probability $(1 - \varepsilon)$, the random surfer follows one of the outgoing edges (hyperlinks) of the currently visited node

The PageRank $p(v)$ of the node $v$ is recursively defined as

$$p(v) = (1 - \varepsilon) \sum_{(u,v) \in E} \frac{p(u)}{out(u)} + \frac{\varepsilon}{|V|}$$

and reflects the importance of the node $v$.

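PageRank

PageRank values can be computed based on a transition probability matrix $P$, whose entries reflect the probability with which the random surfer moves from one node to another node:

$$P_{(u,v)} = \begin{cases} (1 - \varepsilon)/out(u) + \varepsilon/|V| & : (u, v) \in E \\ \varepsilon/|V| & : \text{otherwise} \end{cases}$$

[Figure: example graph and its transition matrix P for $\varepsilon = 0.2$; entries not recoverable]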
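The definition of $P$ translates directly into code. The sketch below is an illustration on a small invented graph, not the slides' example; it assumes that every node has at least one outgoing edge and that the edge list contains no duplicates, since the formula above does not cover nodes without outgoing edges.

```python
def transition_matrix(n, edges, eps=0.2):
    """Build P with P[u][v] = (1 - eps)/out(u) + eps/n if (u, v) is an edge,
    and eps/n otherwise. Assumes out(u) >= 1 for every node u."""
    out = [0] * n
    for u, _ in edges:
        out[u] += 1
    P = [[eps / n] * n for _ in range(n)]    # random-jump part
    for u, v in edges:
        P[u][v] += (1 - eps) / out[u]        # link-following part
    return P


# Hypothetical graph with 3 nodes, invented for illustration.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
P = transition_matrix(3, edges, eps=0.2)
for row in P:
    print([round(x, 4) for x in row])
```

Each row of `P` sums to 1, as required for the transition matrix of a Markov chain.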

PageRank

PageRank values are computed using the power iteration method:
- start with a vector of initial state probabilities $\pi^{(0)} = \left[\, 1/|V| \;\dots\; 1/|V| \,\right]$
- compute the state probabilities after one step of the random surfer as $\pi^{(1)} = \pi^{(0)} P$
- compute the state probabilities after $i$ steps of the random surfer as $\pi^{(i)} = \pi^{(i-1)} P$
- terminate the computation once the state probabilities have converged
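The power iteration steps above can be sketched as follows. The transition matrix is written out by hand for a hypothetical 3-node graph with edges 0→1, 0→2, 1→2, and 2→0 and ε = 0.2; the convergence test (L1 distance below a tolerance `tol`) and the iteration cap `max_iter` are assumptions, since the slides do not specify a stopping criterion.

```python
def power_iteration(P, tol=1e-10, max_iter=1000):
    """Repeatedly compute pi <- pi * P, starting from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[u] * P[u][v] for u in range(n)) for v in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Transition matrix for a hypothetical graph 0->1, 0->2, 1->2, 2->0.
eps, n = 0.2, 3
jump = eps / n
P = [
    [jump, (1 - eps) / 2 + jump, (1 - eps) / 2 + jump],
    [jump, jump, (1 - eps) + jump],
    [(1 - eps) + jump, jump, jump],
]

pagerank = power_iteration(P)
print([round(x, 4) for x in pagerank])
```

In this toy graph, node 2 receives links from both other nodes and ends up with the largest PageRank, followed by node 0 and then node 1.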

PageRank

Power iteration method applied to our example graph.

[Figure: state probability vectors $\pi^{(0)}$, $\pi^{(1)}$, $\pi^{(2)}$, $\pi^{(10)}$ and transition matrix $P$; values not recoverable]

6.4 Web Spam

Many web sites have commercial interests and hope to attract more visitors from search engines.

Search Engine Optimization (SEO) seeks to make web pages more findable (e.g., by optimizing their metadata).

Web spam refers to a family of techniques that seek to bring up web pages in search results for specific queries (e.g., by manipulating their contents).

Web Spam

Gyöngyi and Garcia-Molina distinguish three kinds of web spam:
- term spam manipulates the contents of web pages
- link spam manipulates the link structure of the web graph
- content hiding hides the actual content of web pages

Web spam techniques evolved in parallel to search engines in a kind of arms race:
- initially: mostly content manipulation, targeting tf.idf-style methods
- then: manipulation of the web graph, targeting link analysis methods
- more recently: manipulation of social media (e.g., comments, likes) to affect relevant signals in learning-to-rank methods

Term Spam

Assumption: Search engines rely on a retrieval model that considers the term frequency.

Idea: Augment the content of a web page with occurrences of terms for which it should be returned in search results.

Original page:
    Our charming hotel close to Oslo (Norway)

Spammed page:
    Our charming hotel close to Oslo (Norway) hotel hotel hotel norway norway norway fjord fjord cheap cheap cheap

To avoid irritating users, the additional term occurrences are added so that they are not visible (e.g., in the same color as the background).

Link Spam

Assumption: Search engines rely on a link analysis method (e.g., PageRank) to estimate the importance of web pages.

Idea: Manipulate other important web pages (e.g., DMOZ or Wikipedia) so that they point to one's own web pages.

Link Spam

- Honey pots are collections of actually useful web pages (e.g., a copy of Wikipedia) modified to point to one's own web pages
- Link directories (e.g., DMOZ) can be manipulated to include hyperlinks to one's own web pages
- Comments in forums or on social media sites can be created to include hyperlinks to one's own web pages
- Spam farms are collections of different web sites constructed solely to make one's own web pages appear more important

6.5 Learning-to-Rank

Nowadays, search engines rely on a multitude of different signals to rank documents in response to a query, e.g.:
- textual relevance (e.g., based on a retrieval model)
- link-based importance (e.g., based on PageRank)
- spam probability (e.g., estimated using a classifier)
- user popularity (e.g., estimated from observed clicks)
- textual quality (e.g., based on typos in the content)
- readability (e.g., based on sentence length)
- age (e.g., based on the date of last modification)

Learning-to-Rank

Search engines need to combine these different signals in a meaningful manner, so that search results are effective.

Machine learning methods can learn how to combine these signals in an effective manner based on observed user behavior.

This can be done, for example, by casting it as a classification problem and trying to predict whether a user will click or not click on a specific document for a specific query.

Learning-to-Rank

[Figure: training data consisting of feature vectors labeled with the classes non-relevant (N) and relevant (R) is used to train a classifier; at query time, the class R/N is determined for a previously unseen document and query]

Learning-to-Rank

Training data for the classifier is obtained, e.g.:
- based on relevance judgments provided by assessors
- based on observed user behavior (e.g., does the user click on a document for a specific query or not)

When processing a query, a common approach today is to identify the top-1000 documents using a base ranker (e.g., a retrieval model combined with link analysis), which are then re-ranked using the learned classifier, taking all available signals into account.
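The classify-then-re-rank idea can be illustrated with a small logistic regression model. Everything specific in this sketch is an assumption made for illustration: the three signals (textual relevance, link-based importance, spam probability, each scaled to [0, 1]), the toy training data, the choice of logistic regression trained by stochastic gradient descent, and the function names. Production systems use far more signals and more elaborate learners.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, lr=0.5, epochs=1000):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - label
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
            b -= lr * err
    return w, b


def rerank(candidates, w, b):
    """Sort candidate documents by predicted probability of being relevant."""
    def score(x):
        return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return sorted(candidates, key=lambda doc: score(candidates[doc]), reverse=True)


# Toy training data, invented for illustration.
# Features: [textual relevance, link-based importance, spam probability]
X = [[0.9, 0.8, 0.1], [0.8, 0.3, 0.0], [0.6, 0.7, 0.2],   # relevant (R)
     [0.7, 0.9, 0.9], [0.2, 0.4, 0.1], [0.1, 0.1, 0.8]]   # non-relevant (N)
y = [1, 1, 1, 0, 0, 0]
w, b = train(X, y)

# Hypothetical candidates returned by a base ranker for one query.
candidates = {
    "doc_a": [0.9, 0.5, 0.05],   # relevant text, little spam
    "doc_b": [0.9, 0.9, 0.95],   # relevant text, but very likely spam
    "doc_c": [0.15, 0.2, 0.1],   # barely matches the query
}
print(rerank(candidates, w, b))
```

The base ranker only has to produce a candidate set cheaply; the learned model is then applied to this small set, which keeps the cost of evaluating many signals per document manageable.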

Summary

- The World Wide Web can be seen as a directed graph, which has a small diameter and degrees whose distributions follow power laws
- The decentralized nature of the World Wide Web requires that documents first be collected using a crawler
- PageRank is a link-analysis method that estimates the importance of web pages based on the web graph
- Learning-to-rank is a common approach to combine multiple signals to yield effective search results

