Web Document Ranking

1 Web Document Ranking Sérgio Nunes DEI, Faculdade de Engenharia Universidade do Porto SSIIM, MIEIC, 2015/ /15

2 Overview of concepts and techniques for ranking web documents

3 The World Wide Web

4 The Web The World Wide Web is a distributed information system unprecedented in many ways: in size, in lack of central coordination, and in the diversity of users' backgrounds. The first published vision of a large-scale distributed hypertext system can be traced back to Vannevar Bush's seminal article "As We May Think" (1945).

5 Web Growth Web pages >> web hosts. Altavista reported an index of 30 million web pages in 1995. There were at least 11.5 billion indexable web pages in 2005 [Gulli et al.]. How can we estimate the size of the web?

6 Authority Problem Several factors have led to the mass adoption of the web as a publishing medium by everyone from anonymous individuals to professional organizations. The lack of a central authority or coordination, the simplicity of the underlying technology, and the easy access to free web publishing tools mean that anybody can publish anything. How can we assess the reliability of content found on the web? Which pages can we trust?

7 Web Directories A web directory is a hierarchical structure, organized by topics, containing selected web sites, e.g. dmoz.org. In the early days of the web, these directories were very popular: human editors selected the highest quality pages for each category. This approach quickly became unfeasible at web scale. Additionally, these approaches implied a strong semantic agreement between the directory's editors and the users.

8 Search Engines First-generation search engines were based on classic keyword matching techniques developed for text search. The main challenge was dealing with the size of the web. While classic text search techniques provided sufficient results, the overall quality was questionable due to the nature of web content. Most notably, the web has no central editorial control, there is a complete lack of publishing standards, there is a high degree of content duplication, and some content is published with malicious intent (i.e. spam).

9 Web's Size Estimating the size of the web is not a trivial problem, e.g. the number of dynamic web pages is technically infinite. The deep web is estimated to be several orders of magnitude bigger than the surface web: the surface web was estimated at 170 TB, the deep web at approximately 90,000 TB ("How Much Information?").

10 SPAM On the web, spam is an issue of major importance. At its root, spam exists due to commercial motivations, e.g. achieving better rankings in search engines. There is a wide range of techniques for web spam, from simple to highly sophisticated. Keyword stuffing: repetition of high-value keywords in content. Cloaking (masking): showing different content to search engines. Link spam: artificial links created using hidden links, link farms, etc. Web search engines operate in an adversarial information retrieval environment (research topic).

11 SPAM Example 1. Scrape content from real web documents: blogs, Wikipedia, news sites, etc. 2. Mix and generate synthetic content to avoid duplicate detection. 3. Insert keywords and phrases. 4. Replace or insert links to sites being promoted. 5. Publish content on the web using free publishing platforms (e.g. WordPress, Blogspot, comments, etc.).

12 The Web Graph The web is usually modeled as a directed graph, where each web page is a node and each link is a directed edge. [Figure: example graph with nodes A, B and C.] The hyperlinks that point to a page are called in-links and those originating in the page are called out-links. The number of in-links to a page is called its in-degree.
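
The definitions above can be made concrete with a small sketch. The three-page graph below is a hypothetical example (it is not the graph drawn on the slide), stored as adjacency lists of out-links; in-links and degrees are derived from it.

```python
from collections import defaultdict

# Hypothetical toy graph: each key is a page, each value lists its out-links.
out_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def in_links_of(graph):
    """Invert the adjacency lists: map each page to the pages linking to it."""
    incoming = defaultdict(list)
    for page in graph:
        incoming[page]  # touching the key keeps pages without in-links in the result
    for source, targets in graph.items():
        for target in targets:
            incoming[target].append(source)
    return dict(incoming)

in_links = in_links_of(out_links)
in_degree = {page: len(sources) for page, sources in in_links.items()}
out_degree = {page: len(targets) for page, targets in out_links.items()}

print("in-degree:", in_degree)
print("out-degree:", out_degree)
```

Storing only out-links and deriving in-links on demand mirrors how a crawler sees the web: a page's out-links are visible in its HTML, while its in-links can only be known after other pages have been fetched.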

13 The Bowtie Model [Figure: bow-tie diagram with the regions IN, SCC, OUT, TENDRILS, TUBE and DC.] A web surfer can pass from any page in IN to any page in SCC by following hyperlinks. Likewise, from any page in SCC to any page in OUT. SCC is a strongly connected core. Graph structure in the Web (2000)
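
As a rough illustration of the bow-tie decomposition, the sketch below classifies the nodes of a small, hypothetical graph into SCC, IN and OUT using plain reachability. The function names and the graph are assumptions made for this example; tendrils, tubes and disconnected components are lumped together as "OTHER", and the quadratic search for the largest strongly connected component is only suitable for toy inputs.

```python
from collections import deque

def reachable(graph, start):
    """Return the set of nodes reachable from start (start included) via BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Build the graph with every edge flipped; also registers every node as a key."""
    rev = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(source)
    return rev

def bow_tie(graph):
    """Split a non-empty graph into SCC, IN, OUT and OTHER node sets."""
    rev = reverse(graph)
    nodes = set(rev)
    core = set()
    for node in nodes:
        if node in core:
            continue
        # The SCC of a node is everything it can reach that can also reach it.
        scc = reachable(graph, node) & reachable(rev, node)
        if len(scc) > len(core):
            core = scc
    seed = next(iter(core))
    forward = reachable(graph, seed)   # SCC plus OUT
    backward = reachable(rev, seed)    # SCC plus IN
    return {
        "SCC": core,
        "IN": backward - core,
        "OUT": forward - core,
        "OTHER": nodes - forward - backward,
    }

# Hypothetical graph: in1 -> s1 <-> s2 -> out1, plus t1 -> out1.
graph = {
    "in1": ["s1"],
    "s1": ["s2"],
    "s2": ["s1", "out1"],
    "out1": [],
    "t1": ["out1"],
}
regions = bow_tie(graph)
print(regions)
```

In this toy graph, t1 links into OUT without being reachable from the core, which is the kind of page the model places in a tendril.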

14 Web Ranking

15 Web Document Ranking Web documents can be ranked in a static, absolute way or ranked in a given context. The static ranking of a document is typically called query-independent, i.e. documents have a weight regardless of a query or a context. E.g.: most important document on the World Wide Web. In query-dependent ranking, each document has a different weight depending on the query or context being analyzed. E.g.: best document for learning how to cook.

16 Signals Documents are scored (i.e. ranked) using various sources of information, usually called features or, more generically, signals. A multitude of signals can be identified.
Query-independent signals: length of document; age of document; number of incoming links; number of outgoing links; document's host domain; document's language.
Query-dependent signals: number of query terms; time of query; query terms in document; query terms in collection; query terms in document title; query's language.
Google reportedly uses more than 200 signals in their ranking.

17 Types of Signals The signals available in a collection of web documents can be divided into two groups depending on their origin. The signals obtained directly from the document are named document-based signals. E.g.: term frequency, doc length, etc. Signals obtained from the Web are named web-based signals. E.g.: number of citations, anchor text, etc. Web search engines have access to other sources of signals: click data, external collections, etc.

18 Document-based Signals

19 Term Frequency The number of occurrences of a term in a document is a signal typically used in text retrieval. However, the web is an adversarial information retrieval environment. [Figure: three versions of a filler-text page in which the term "flowers" replaces progressively more words: TF("flowers") = 3, TF("flowers") = 10, TF("flowers") = (value missing).]
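
A minimal sketch of how term frequency could be computed, and of why it is easy to inflate. The tokenizer (lowercasing and splitting on non-alphanumeric characters) and both sample texts are assumptions made for this example.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase alphanumeric token occurs in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Hypothetical page texts: an ordinary one and a keyword-stuffed one.
honest = "Fresh flowers delivered daily. Order flowers online."
stuffed = "flowers flowers cheap flowers buy flowers flowers best flowers"

tf_honest = term_frequencies(honest)["flowers"]
tf_stuffed = term_frequencies(stuffed)["flowers"]

print("TF in honest page:", tf_honest)
print("TF in stuffed page:", tf_stuffed)
```

Because the page author fully controls this signal, raw term frequency on its own rewards keyword stuffing, which is why it is usually combined with other signals.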

20 Inverse Document Frequency Terms that appear in fewer documents of a collection have more discriminative power, and thus are given a higher weight.
$\mathrm{IDF}(\text{term}) = \dfrac{\text{documents in collection}}{\text{documents containing term}}$
Measures the general importance of a term. Combined with term frequency, it results in the classic tf.idf measure.
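
The sketch below combines term frequency with inverse document frequency on a tiny, hypothetical collection. One assumption goes beyond the slide: the ratio is passed through a logarithm, a common variant that dampens the weight of very rare terms; the slide itself shows only the plain ratio.

```python
import math
from collections import Counter

# Hypothetical three-document collection, tokenized on whitespace.
docs = {
    "d1": "flowers for sale in porto",
    "d2": "strange sunday in porto",
    "d3": "flowers flowers and more flowers",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}

def idf(term):
    """log(documents in collection / documents containing term); 0 if unseen."""
    containing = sum(1 for tokens in tokenized.values() if term in tokens)
    if containing == 0:
        return 0.0
    return math.log(len(tokenized) / containing)

def tf_idf(term, doc_id):
    """Term frequency in the document multiplied by the term's IDF."""
    tf = Counter(tokenized[doc_id])[term]
    return tf * idf(term)

print("idf(sunday)  =", idf("sunday"))
print("idf(flowers) =", idf("flowers"))
print("tf-idf(flowers, d3) =", tf_idf("flowers", "d3"))
```

Here "sunday" occurs in one document and "flowers" in two, so "sunday" receives the higher IDF even though "flowers" is far more frequent inside d3.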

21 Term Position The position of a term within an HTML file has an impact on its meaning and importance. Terms within the title or strong tags are highlighted differently. [Figure: sample page of filler text in which the term "flowers" occurs in the heading and at several places in the body.]
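
One way to turn term position into a signal is to weight each occurrence by the HTML element that contains it. The sketch below uses Python's standard `html.parser`; the weights in `TAG_WEIGHTS` and the sample page are invented for illustration and are not values used by any real search engine.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights: occurrences inside these tags count more than body text.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "strong": 2.0}
DEFAULT_WEIGHT = 1.0

class WeightedTermCounter(HTMLParser):
    """Sum the occurrences of one term, weighted by the enclosing tags."""

    def __init__(self, term):
        super().__init__()
        self.term = term.lower()
        self.open_tags = []
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy nesting.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        occurrences = re.findall(r"[a-z0-9]+", data.lower()).count(self.term)
        if occurrences == 0:
            return
        # Use the highest weight among all currently open tags.
        weights = [TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT) for tag in self.open_tags]
        self.score += occurrences * max(weights, default=DEFAULT_WEIGHT)

def weighted_score(html, term):
    parser = WeightedTermCounter(term)
    parser.feed(html)
    parser.close()
    return parser.score

page = (
    "<html><head><title>Flowers shop</title></head>"
    "<body><p>Buy <strong>flowers</strong> today. Fresh flowers daily.</p></body></html>"
)
print(weighted_score(page, "flowers"))
```

The sample page has one occurrence in the title, one inside a strong tag and one in plain body text, so the three occurrences contribute differently to the final score.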

22 Term Position Regardless of the HTML structure, should terms in different positions have different weights? [Figure: sample page of filler text in which the term "flowers" is concentrated in the opening sentence and in the closing sentence.]

23 Web-based Signals

24 Host Structure Web documents in the same host are related to each other. A document in a high-value host like www.bbc.co.uk should be valued higher. The location of a document in a site structure is an important signal. Documents that are closer to the root of a site are typically more important.
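
A small sketch of how host and location could be extracted from a URL with the standard library. Counting path segments as "distance from the root" is an assumption made for this example, and the deeper URL is a made-up path.

```python
from urllib.parse import urlparse

def host(url):
    """Return the lowercase host part of a URL."""
    return urlparse(url).netloc.lower()

def url_depth(url):
    """Number of non-empty path segments, used here as distance from the site root."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])

# The second URL uses a hypothetical path, chosen only to show a deeper document.
root = "http://www.bbc.co.uk/"
deep = "http://www.bbc.co.uk/news/world/article.html"

for url in (root, deep):
    print(host(url), url_depth(url))
```

A ranking function could then, for instance, prefer documents with a smaller depth, or look up a per-host quality score keyed by the value of `host(url)`.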

25 Anchor Text A citation between web documents is defined by an HTML anchor tag that requires content (the anchor text). The text used in anchor tags is one of the most valuable signals. [Example: anchor tags with the texts "amazon sucks" and "books"; the link targets are missing.]
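
To illustrate how anchor text can be collected as a signal for the page being linked to, the sketch below parses a few HTML fragments and groups anchor texts by target URL. The fragments, the example.com and example.org URLs, and the class name are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.current_href = None
        self.current_text = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self.current_text).split())
            self.anchors.append((self.current_href, text))
            self.current_href = None

def collect_anchor_text(pages):
    """Map each link target to the anchor texts other pages use for it."""
    by_target = defaultdict(list)
    for html in pages:
        parser = AnchorCollector()
        parser.feed(html)
        parser.close()
        for href, text in parser.anchors:
            by_target[href].append(text)
    return dict(by_target)

# Hypothetical source pages linking to two targets.
pages = [
    '<p>Buy <a href="http://shop.example.com/">cheap books</a> here.</p>',
    '<a href="http://shop.example.com/">online bookstore</a> and '
    '<a href="http://news.example.org/">daily news</a>',
]
anchor_index = collect_anchor_text(pages)
print(anchor_index)
```

The collected texts describe the target page in other authors' words, so they can be indexed together with the target's own content.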

26 Link Analysis Link analysis has many aspects in common with the field of bibliometrics, more specifically citation analysis. Central assumption: a link is an endorsement. A hyperlink from page A to page B represents a vote for page B from the creator of page A. Simply using the in-degree of a page as a measure of its importance would be easy to manipulate (e.g. link spam).

27 PageRank Originated from Stanford and used by Google. The PageRank algorithm depends on the link structure of the web graph and assigns a score between 0 and 1 to each page. The PageRank weight is a query-independent score. The PageRank Citation Ranking: Bringing Order to the Web Larry Page, Sergey Brin, Rajeev Motwani and Terry Winograd (1998)

28 PageRank Random Surfer 1. Consider a random surfer visiting web pages and following the out-links in a random fashion at each point. 2. Eventually, the nodes with a higher in-degree will be visited more often. 3. The idea behind PageRank is that pages that have more visits are more important.
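
The random surfer can be simulated directly. The sketch below walks a hypothetical three-page graph and counts visits; it assumes every page has at least one out-link and does not model random jumps, so it illustrates the intuition rather than the full PageRank model.

```python
import random
from collections import Counter

def random_surfer(out_links, start, steps, seed=0):
    """Follow a uniformly random out-link at each step and count page visits."""
    rng = random.Random(seed)
    visits = Counter()
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(out_links[page])
    return visits

# Hypothetical graph: A and B both link to C, and C links back to both.
out_links = {"A": ["C"], "B": ["C"], "C": ["A", "B"]}
visits = random_surfer(out_links, start="A", steps=10_000)
print(visits.most_common())
```

Page C has two in-links while A and B have one each, and the visit counts reflect that: the surfer returns to C after every visit to A or B.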

29 PageRank Calculation
$PR(A) = (1 - d) + d \sum_{p \in In(A)} \frac{PR(p)}{|Out(p)|}$, with $d = 1$.
Computation is performed iteratively until a minimum threshold is achieved.
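
A minimal iterative sketch of the formula PR(A) = (1 - d) + d * sum of PR(p) / |Out(p)| over the pages p linking to A. Several things here are assumptions made for the example: the toy graph, the default d = 0.85 (the value suggested in the original PageRank paper, whereas the slide sets d = 1), the stopping threshold, and the requirement that every page has at least one out-link and appears as a key. In this unnormalized form the scores sum to the number of pages rather than to 1.

```python
def pagerank(out_links, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate PR(A) = (1 - d) + d * sum(PR(p) / |Out(p)|) until the change is below tol."""
    nodes = list(out_links)
    in_links = {node: [] for node in nodes}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    ranks = {node: 1.0 for node in nodes}
    for _ in range(max_iter):
        new_ranks = {
            node: (1 - d) + d * sum(ranks[p] / len(out_links[p]) for p in in_links[node])
            for node in nodes
        }
        change = max(abs(new_ranks[node] - ranks[node]) for node in nodes)
        ranks = new_ranks
        if change < tol:  # the "minimum threshold" stopping rule
            break
    return ranks

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 4))
```

Dividing each score by the number of pages afterwards would give values between 0 and 1, if that range is needed.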

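30 PageRank Example [Figure: graph with nodes A, B, C, D and E.]
$PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(E)}{3}$
$PR(B) = \frac{PR(D)}{1}$
$PR(C) = \frac{PR(E)}{3}$
$PR(D) = \frac{PR(A)}{1} + \frac{PR(E)}{3}$
$PR(E) = \frac{PR(B)}{2}$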
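
The five equations of this example can be solved by substitution. The sketch below expresses every score in terms of PR(B) and then, as an added assumption that is not stated on the slide, normalizes the scores so that they sum to 1.

```latex
\begin{align*}
PR(E) &= \tfrac{1}{2}\,PR(B), \qquad PR(C) = \tfrac{1}{3}\,PR(E) = \tfrac{1}{6}\,PR(B), \\
PR(A) &= \tfrac{1}{2}\,PR(B) + \tfrac{1}{6}\,PR(B) + \tfrac{1}{6}\,PR(B) = \tfrac{5}{6}\,PR(B), \\
PR(D) &= PR(A) + \tfrac{1}{3}\,PR(E) = \tfrac{5}{6}\,PR(B) + \tfrac{1}{6}\,PR(B) = PR(B), \\
1 &= PR(A) + PR(B) + PR(C) + PR(D) + PR(E) = \tfrac{7}{2}\,PR(B)
  \;\Longrightarrow\; PR(B) = \tfrac{2}{7}, \\
\bigl(PR(A), PR(B), PR(C), PR(D), PR(E)\bigr)
  &= \left(\tfrac{5}{21},\ \tfrac{6}{21},\ \tfrac{1}{21},\ \tfrac{6}{21},\ \tfrac{3}{21}\right).
\end{align*}
```

The result also satisfies PR(B) = PR(D), the one equation not used during the substitution, which confirms that the system is consistent.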

31 HITS Hyperlink-Induced Topic Search (HITS) was proposed by Jon Kleinberg. HITS is an algorithm that uses the link structure of the web to produce two query-dependent scores: an authority score and a hub score. An authority is a page with many citations from hubs. A hub is a page that cites a large number of authorities. Three major differences from PageRank: (1) it is computed at query time (!); (2) it produces two values for each page; (3) it is applied to subsets of the web.

32 HITS Calculation 1. Select a collection of documents related to a query. 2. Iteratively calculate authority and hub values for each document.
$\mathrm{Authority}(A) = \sum_{p \in In(A)} \mathrm{Hub}(p)$
$\mathrm{Hub}(A) = \sum_{p \in Out(A)} \mathrm{Authority}(p)$
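
The two update rules can be iterated as in the sketch below. The normalization step after each round (dividing by the Euclidean norm) is a standard addition that keeps the values bounded; it is an assumption here, as are the fixed number of iterations and the toy graph, which stands in for the query-related subset of documents.

```python
import math

def normalize(scores):
    """Scale the scores so that their Euclidean norm is 1 (no-op if all are zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
    return {node: value / norm for node, value in scores.items()}

def hits(out_links, iterations=50):
    """Return (authority, hub) score dictionaries for a small link graph."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    in_links = {node: [] for node in nodes}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority(A) = sum of Hub(p) over the pages p linking to A.
        auth = {node: sum(hub[p] for p in in_links[node]) for node in nodes}
        # Hub(A) = sum of Authority(p) over the pages p that A links to.
        hub = {node: sum(auth[p] for p in out_links.get(node, ())) for node in nodes}
        auth = normalize(auth)
        hub = normalize(hub)
    return auth, hub

# Hypothetical query subgraph: two hub pages citing two candidate authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"]}
auth, hub = hits(graph)
print("authority:", auth)
print("hub:", hub)
```

In this graph a1 is cited by both hubs and ends up with the highest authority score, while h1 cites both authorities and ends up as the stronger hub.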

33 Scoring With so many signals, how to obtain a single ranking score?
$\mathrm{Score}(P) = \alpha\,\mathrm{Signal}_1(P) + \beta\,S_2(P) + \gamma\,S_3(P)$
1. Manual tuning by experts based on real-data measurements. 2. Use machine-learning methods to automatically build ranking formulas: learning to rank / machine-learned relevance.
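
The weighted combination above can be sketched in a few lines. The signal names, their values and the weights are all invented for illustration; in practice the weights would come from manual tuning or from a learning-to-rank method.

```python
def score(signals, weights):
    """Linear combination: sum of weight * signal value over all signals."""
    return sum(weights[name] * value for name, value in signals.items())

# Hypothetical weights (alpha, beta, gamma) and per-document signal values.
weights = {"pagerank": 0.5, "tf_idf": 0.3, "title_match": 0.2}
docs = {
    "d1": {"pagerank": 0.2, "tf_idf": 0.9, "title_match": 1.0},
    "d2": {"pagerank": 0.8, "tf_idf": 0.1, "title_match": 0.0},
}

ranking = sorted(docs, key=lambda doc_id: score(docs[doc_id], weights), reverse=True)
for doc_id in ranking:
    print(doc_id, round(score(docs[doc_id], weights), 3))
```

With these particular weights the query-dependent signals outweigh d2's higher link-based score; a different weighting could reverse the order, which is what tuning is meant to settle.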

34 Search Engines

35 Discovering Information There are two broad categories of services for facilitating the discovery of information on the web. Full-text search engines: generically known as web search engines, these services crawl the web, index its contents and rank the documents. Web directories: topic-oriented collections, maintained by human editors.

36 Search Engine Architecture [Figure: architecture diagram with the components WEB, CRAWLER, INDEXER, SEARCH, RANKING and USER, plus three disks.]

37 Crawler Includes the software that finds and fetches web pages. Multiple distributed crawlers operate simultaneously. First-generation search engines had a scheduled periodic crawl of the web. In current search engines, crawlers operate continuously, e.g. very popular and dynamic documents are crawled multiple times a day. There is an infinite number of pages on the Web, thus the crawler must decide which will be crawled and which won't. A crawler must be robust and polite. A crawler should be distributed, scalable, efficient, fresh, quality-targeted and extensible.

38 robots.txt

User-agent: *
Disallow: /ADS/
Disallow: /banners/
Disallow: /bartoon/
Disallow: /bdt/
Disallow: /bin/
Disallow: /calvin_and_hobbes/
Disallow: /cinecartaz/
Disallow: /desportohtml/
Disallow: /emprego/
Disallow: /especial/
Disallow: /img/
Disallow: /includekimus/
Disallow: /lazer/
Disallow: /mail/
Disallow: /static/
Disallow: /xsl/

User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Allow: /news/directory
Disallow: /nwshp
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /imgres
Disallow: /imglanding
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand...
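
A polite crawler checks these rules before fetching a URL. The sketch below uses Python's standard `urllib.robotparser` on a short rule set given as a string; the rules are a small subset in the style of the examples above, and the bot name and example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A short, hypothetical robots.txt in the same style as the examples above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /images
"""

def build_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/about")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/search?q=flowers")
print("about page allowed:", allowed)
print("search page allowed:", blocked)
```

In a real crawler the file would be fetched once per host (for example with the parser's `set_url` and `read` methods) and cached, so that the check costs nothing per page.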

39 Indexer Indices are data structures designed for fast reading. The index is the biggest component of a search engine. Web documents are parsed and separated into tokens. This is a very challenging task due to the diversity of the web: file formats, language ambiguity, word boundaries, etc.
[Figure: inverted index mapping each term to the documents that contain it.]
a: d1...
domingo: d1,d17,d30
estranho: d2
flores: d1,d3,d5
porto: d4,d18
...
Research challenges in: size optimization, parallelism, maintenance, lookup speed, etc.
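
The structure described here is commonly called an inverted index. The sketch below builds one in memory and answers a conjunctive query by intersecting posting lists. The three documents are hypothetical (they only reuse a few of the terms shown above), and the regular-expression tokenizer sidesteps the hard cases the slide mentions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

def search(index, query):
    """Return the documents that contain every term of the query."""
    postings = [set(index.get(term, ())) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical documents reusing some of the terms from the figure.
docs = {
    "d1": "flores a domingo",
    "d2": "domingo estranho",
    "d3": "flores no porto",
}
index = build_index(docs)
print(index["flores"])
print(search(index, "flores porto"))
```

Lookups touch only the posting lists of the query terms, which is why this layout is fast to read even when the collection is large.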

40 Ranking and Presentation [Figure: QUERY, then "MAGIC" in x milliseconds, then 10 DOCS.] For a given query, documents are ordered by combining hundreds of signals. Additionally, ads are selected ($) and snippets are produced for each document. All in a few milliseconds.

41 Business 1% of the web search market is worth over $1 billion. The search engines' business model is based on advertising. The first business models were based on small per-view charges. Ads were published indiscriminately, resulting in low conversion rates. The use of targeted advertising (ads are related to searches) resulted in much higher conversion rates. Advertisers bid on query terms and pay per click. Search engines operate complex systems that try to maximize revenue by selecting which ads to display.

42 Summary The World Wide Web didn't exist 20 years ago. The Web is scientifically young and combines research from many different fields, not just technology. There are many open problems and many more to be discovered. Some currently hot topics: learning to rank, wisdom of the crowds, social media, real-time, contextual, HCIR.

43 Thank You ssn

44 Some Ideas for SSIIM
- ANT: evaluation of entity-oriented search. Queries in entity search: relation, attribute, entity, type, keyword
- State of the art report on DB ranking
- Web template extraction
- Web meta-search
- Web crawling
- Measuring diversity in search results
- Social Networks characterization

45 References An Introduction to Information Retrieval (2009) Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze Web Information Retrieval (2009) Nick Craswell and David Hawking
