DATA MINING - 1DL460

Size: px

Start display at page:

Download "DATA MINING - 1DL460"

Norma Nelson
6 years ago
Views:

1 DATA MINING - 1DL460 Spring 2014" A second course in data mining Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala University, Uppsala, Sweden 07/04/14 1

2 Introduction to Data Mining: Search Engines Technology and Algorithms for Efficient Searching of Information on the Web (slides + supplemental articles)" Kjell Orsborn Department of Information Technology Uppsala University, Uppsala, Sweden 07/04/14 2

3 Lecture s Outline" The Web and its Search Engines Heuristics-based Ranking PageRank (Google) for discovering the most important web pages HITS: hubs and authorities (Clever project) more detailed evaluation of pages importance 07/04/14 3

4 The Web in 2001: Some Facts " More than 3 billion pages; several terabytes Highly dynamic More than 1 million new pages every day! Over 600 GB of pages change per month Average page changes in a few weeks Largest crawlers Refresh less than 18% in a few weeks Cover less than 50% ever (invisible Web) Average page has 7 10 links Links form content-based communities 07/04/14 4

5 Chaos on the Web" Internet lacks organization and structure: pages written in any language, dialect or style; different cultures, interests and motivation; mixes truth, falsehood, wisdom, propaganda Challenge: Quickly extract from this digital morass, high-quality, relevant, up-to-date pages in response to specific information needs No precise mathematical measure of best results 07/04/14 5

6 Search Products and Services " Verity Fulcrum PLS Oracle text extender DB2 text extender Infoseek Intranet SMART (academic) Glimpse (academic) BING (Microsoft) Baidu (China) Yandex (Russia) Inktomi (HotBot) Alta Vista Raging Search Google Dmoz.org Yahoo! Infoseek Internet Lycos Excite * heuristics-based * humanly-selected 07/04/14 6

7 Web Search Queries" Web search queries are short: ~2.4 words on average (Aug. 2000) Has increased, was 1.7 (~1997) User expectations: The first item shown should be what I want to see! This works if the user has the most popular or most common notion in mind; not otherwise 07/04/14 7

8 Standard Web Search Engine Architecture" crawl the web eliminate duplicates DocIds user! query create an inverted index show results to user Search engine servers inverted index 07/04/14 8

9 Heuristics-based ranking" Naive attempt used by many search engines Heuristics employed: number of times a page contains the query term favor instances where the term appears early give weight to word appearing in a special place or form e.g. in a title or in bold. All heuristics fail miserably due to: spamming, or synonymy and polysemy of natural language words 07/04/14 9

10 Hyperlink Graph Analysis" Hypermedia is a social network Telephoned, advised, co-authored, paid Social network theory (cf. Barnes, 1954 and Wasserman & Faust, 1994) Extensive research applying graph notions Centrality and prestige Co-citation (relevance judgement) and bibliographic coupling Applications Web search: HITS, Google Classification and topic distillation 07/04/14 10

11 Hypertext models for classification" c = class, t = text, N = neighbors Text-only model: Pr[t c] Using neighbors text to judge my topic: Pr[t, t(n) c]? Better model: Pr[t, c(n) c] Recursive relationships 07/04/14 11

12 Exploiting the Web s Hyperlink Structure" Underlying assumption: view each link as an implicit endorsement of the location it points to Assumption: If the pages pointing to this page are good, then this is also a good page. References: Kleinberg 98, Page et al. 98 Draws upon earlier research in sociology and bibliometrics. Kleinberg s model includes authorities (highly referenced pages) and hubs (pages containing good reference lists). Google model is a version with no hubs, and is closely related to work on influence weights by Pinski-Narin (1976). 07/04/14 12

13 Link analysis for ranking pages" Why does this work? The official Ferrari site will be linked to by lots of other official (or highquality) sites The best Ferrari fan-club sites probably also have many links pointing to it Less high-quality sites do not have as many high-quality sites linking to them 07/04/14 13

14 Page Rank" Intuition: recursive definition of importance. A page is important if important pages link to it. Method: create a stochastic matrix of the Web each page corresponds to a matrix s row and column if page j has L successors, (out-links) then the ij-th entry is: 1/L if page i is one of these L successors of page j 0 otherwise 07/04/14 14

15 Page Rank intuition" Initially, each page has one unit of importance. At each round, each page shares whatever importance it has with its successors, and receives new importance from its predecessors. Eventually, the importance reaches a limit, which happens to be its component of the principal eigenvector of this matrix. Importance = probability that a random Web surfer, starting from a random Web page, and following random links will be at the page in question after a long series of links. 07/04/14 15

16 Page Rank example: the web in 1689" Netscape Equation: N M A = 1/2 0 1/ /2 1/2 1 0 N M A Microsoft Solution by relaxation (iterative solution): Amazon N M A = = = /2 3/2 5/4 3/4 1 9/8 1/2 11/8 5/4 11/16 17/ /5 3/5 6/5 07/04/14 16

17 Problems with real web graphs" Dead ends: a page that has no successors has nowhere to send its importance Eventually, all importance will leak out of the Web Spider traps: a group of one or more pages that have no links outside the group Eventually, these pages will accumulate all the importance of the Web. 07/04/14 17

18 The Bowtie structure of the Web" Tendrils Out Tendrils In In Component Strongly Connected Component Out Component Tubes Disconnected Components From Rajaraman and Ullman, Mining Massive Data Sets 07/04/14 18

19 Dead ends - rank leaks: Microsoft tries to duck monopoly charges..." Netscape Equation: N M A = 1/2 0 1/ /2 1/2 0 0 N M A Microsoft Solution by relaxation (iterative solution): Amazon N M A = = = /2 1/2 3/4 1/4 1/2 5/8 1/4 1/2 1/2 3/16 5/ /04/14 19

20 Spider traps - rank sinks Microsoft considers itself the center of the universe..." Netscape Equation: N M A = 1/2 0 1/ /2 1/2 0 0 N M A Microsoft Solution by relaxation (iterative solution): Amazon N M A = = = /2 1/2 3/4 7/4 1/2 5/8 2 3/8 1/2 35/16 5/ /04/14 20

21 Google solution to dead ends and spider traps" Instead of applying the matrix directly, tax each page with some fraction of its current importance, and distribute the taxed importance equally among all pages. N M A = 0.8 1/2 0 1/ /2 1/2 0 0 N M A The solution to this equation is now: N = 7/11; M = 21/11; A = 5/11 Formally, the system should be: stochastic (columns of non-negative numbers with the sum of 1) irreducible (strongly connected), i.e. all pairs of nodes connected and aperiodic, all states should be aperiodic, i.e. have a period of 1 07/04/14 21

22 PageRank" Where PR is the PageRank, p 1, p 2,..., p N are the pages under consideration, M(p i ) is the set of pages that link to p i, L(p j ) is the number of outbound links on page p j, N is the total number of pages, and d is the damping factor. In an iterative formulation we have at t= 0: At each time step, the PageRank can be computed as follows: 07/04/14 22

23 PageRank" Expressed as a matrix equation (using a modified adjacency matrix): where is given by:, 1 is an N-length column vector of ones and the matrix M Convergence is assumed when: 07/04/14 23

24 Google anti-spam devices" Spamming: an attempt by many web sites to appear to be about a subject that will attract surfers, without truly being about the subject Early search engines relied on the words on a page to tell what it is about. Led to tricks in which pages attracted attention by placing false words in the background color on their page. Google trusts the words in anchor text Relies on others telling the truth about your page, rather than relying on you. Makes it harder for a homepage to appear to be about something it is not. The use of page rank to measure importance, rather than the more naive number of links into a page, also protects against spamming. E.g. Page Rank recognizes as unimportant 1000 pages that mutually link to one another. 07/04/14 24

25 Google search engine evolution, (ref: wired.com)" Backrub [September 1997] This search engine, which had run on Stanford s servers for almost two years, is renamed Google. Its breakthrough innovation: ranking searches based on the number and quality of incoming links. New algorithm [August 2001] The search algorithm is completely revamped to incorporate additional ranking criteria more easily. Local connectivity analysis [February 2003] Google s first patent is granted for this feature, which gives more weight to links from authoritative sites. Fritz [Summer 2003] This initiative allows Google to update its index constantly, instead of in big batches. Personalized results [June 2005] Users can choose to let Google mine their own search behavior to provide individualized results. Bigdaddy [December 2005] Engine update allows for more-comprehensive Web crawling. Universal search [May 2007] Building on Image Search, Google News, and Book Search, the new Universal Search allows users to get links to any medium on the same results page. Google Caffeine [August 2009] A new architecture designed to return results faster and to better deal with rapidly updated information from services including Facebook and Twitter. With Caffeine, Google moved its back-end indexing system away from MapReduce and onto BigTable, the company's distributed database platform. Panda [February 2011] Change to lower the rank of "low-quality sites" or "thin sites and return higher-quality sites near the top of the search results, e.g. drop rankings for sites containing large amounts of advertising Penquin [April 2012] Change to decrease search engine rankings of websites that violate Google s Webmaster Guidelines, such as keyword stuffing, cloaking, participating in link schemes, deliberate creation of duplicate content, and others. 07/04/14 25

26 Google Facts (from end 2001)" Indexes 3 billion web pages (50 billion pages in 2012) If printed, they would result in a stack of paper 200 km high If a person reads a page per minute (and does nothing else), (s)he would need 6000 years to read them all 200 million search queries a day Approx. 80 billion searches a year! (in 2013 there were around billion searches) Most searches take less than half second Support for 35 non-english languages (now in 2014 its 151 language choices) Searchable index contained 3 trillion items Updated every 28 days 07/04/14 26

Google architecture (approx.)" 3. The search results are returned to the user typically in less than a second. 1. The web server sends the query to the index servers.

27 Google architecture (approx.)" 3. The search results are returned to the user typically in less than a second. 1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book - it tells which pages contain the words that match the query. 2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result. 07/04/14 27

28 Google Advanced Search" 07/04/14 28

29 Google strengths and weaknesses" Successful business implementation Good at fighting spam Query independent very efficient at query time Time is not considered (limited time interval for result available) Basic algorithm do not distinguish between authoritative in general and on a certain topic (no public info available on this topic) 07/04/14 29

30 Hubs and Authorities" HITS - hypertext-induced topic selection (IBM s Clever project) Defined in a mutually recursive way: a hub links to many (valuable) authorities; an authority is linked to by many (good) hubs. Authorities turn out to be pages that offer the best information about a topic. Hubs are pages that do not provide any information, but specify a collection of links on where to find the information. 07/04/14 30

31 Hubs and Authorities" Use a matrix formulation similar to that of Page Rank, but without the stochastic restriction. Rows and columns correspond to web pages; A[i,j] = 1 if page i links to page j A[i,j] = 0 otherwise The transpose of A looks like the matrix for Page Rank, but it has a 1 where the Page Rank matrix has a fraction An iterative solution where the matrix A is repeatedly applied will result in a diverging solution vector. However, by introducing scaling factors the computed values of authority and hubbiness for each page can be kept within finite bounds. 07/04/14 31

Computing Hubbiness and Authority of pages" Let a and h be vectors The i-th component of a and h correspond to the degrees of authority and hubbiness of the i-th page respectively.

32 Computing Hubbiness and Authority of pages" Let a and h be vectors The i-th component of a and h correspond to the degrees of authority and hubbiness of the i-th page respectively. Let λ and µ be suitable scaling factors. Then: the hubbiness of each page is the sum of authorities it links to, scaled by λ h = λ A a the authority of each page is the sum of the hubbiness of all pages that link to it, scaled by µ T a = µ A h 07/04/14 32

33 Computing Hubbiness and Authority of pages" By simple substitution, two equations that relate vectors a and h only to themselves: T a = λ µ A A a T h = λ µ A A h Thus, we can compute a and h by iteratively solving the equations, giving us the principal eigenvectors of the matrices AA T and A T A, respectively. 07/04/14 33

34 Hubs and Authorities example: Netscape acknowledges Microsoft s existence" Netscape Matrices: A = A T = Microsoft AA T = A T A = Amazon 07/04/14 34

35 Hubs and Authorities example: Netscape acknowledges Microsoft s existence" Netscape Iteratively solving the equations for a and h assuming λ = µ = 1: Microsoft a(n) a(m) a(a) = = = a(a) 1.3 a(a) a(a) Amazon h(n) h(m) h(a) = = = a(a) 0,268 a(a) 0,735 a(a) 07/04/14 35

36 Authorities and Hubs pragmatics" The system is jump-started by obtaining a set of root pages from a standard text index such as Google or Bing. The iterative process settles very rapidly. A root set of 3,000 pages requires just 5 rounds of calculations! The results are independent of the initial estimates Algorithm naturally separates web sites into clusters e.g. a search for abortion partitions the Web into a pro-life and a pro-choice community 07/04/14 36

37 HITS strengths and weaknesses" Rank pages according to query topic Less good at fighting spam (later developments more robust) Topic drift (due to expanded base set can include irrelevant pages) More time consuming (compared to Page Rank) 07/04/14 37

38 Moore s Law" 07/04/14 38

39 Web Sizes over Time" ~150M in 1998 ~5B in x increase Moore would predict 25x Monthly refresh in 1998 Daily refresh in 2005 What about 2010? 40B? (actually 29B but 50B in 2012) Where is the content? Public Web? Personal Web? 07/04/14 39

40 Philosophical Remarks" The Web of today is dramatically different from what it was five years ago. Predicting the next five years seems futile. Will even the basic act of indexing soon become infeasible? If so, will our notion of searching the Web undergo fundamental changes? The Web s relentless growth will continue to generate computational challenges for wading through the ever increasing volume of on-line information. 07/04/14 40

41 World Wide Web statistics" According to a 2001 study, there were massively more than 550 billion documents on the Web, mostly in the invisible Web, or deep Web.! A 2002 survey of 2,024 million Web pages determined that by far the most Web content was in English: 56.4%; next were pages in German (7.7%), French (5.6%), and Japanese (4.9%).! A more recent study, which used Web searches in 75 different languages to sample the Web, determined that there were over 11.5 billion Web pages in the publicly indexable Web as of the end of January 2005.! As of March 2009, the indexable web contains at least billion pages.! On July 25, 2008, Google software engineers Jesse Alpert and Nissan Hajaj announced that Google Search had discovered one trillion unique URLs.! As of May 2009, over million websites operated. Of these 74% were commercial or other sites operating in the.com generic top-level domain.! 07/04/14 41

DATA MINING - 1DL460

DATA MINING - 1DL460 Spring 2015 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt15 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala