
DATA MINING - 1DL460
Spring 2015
A second course in data mining
http://www.it.uu.se/edu/course/homepage/infoutv2/vt15
Kjell Orsborn, Uppsala Database Laboratory
Department of Information Technology, Uppsala University, Uppsala, Sweden
25/02/16 1

Introduction to Data Mining: Search Engines
Technology and algorithms for efficient searching of information on the Web (slides + supplemental articles)
Kjell Orsborn, Department of Information Technology, Uppsala University, Uppsala, Sweden

Lecture's Outline
- The Web and its search engines
- Heuristics-based ranking
- PageRank (Google): discovering the most important web pages
- HITS: hubs and authorities (the Clever project): a more detailed evaluation of pages' importance

The Web in 2001: Some Facts
- More than 3 billion pages; several terabytes
- Highly dynamic: more than 1 million new pages every day; over 600 GB of pages change per month; the average page changes within a few weeks
- Largest crawlers: refresh less than 18% of pages within a few weeks, and cover less than 50% of the Web ever (the invisible Web)
- The average page has 7-10 links; links form content-based communities

Chaos on the Web
The Internet lacks organization and structure: pages are written in any language, dialect or style, by authors with different cultures, interests and motivations, and the Web mixes truth, falsehood, wisdom and propaganda.
Challenge: quickly extract from this digital morass high-quality, relevant, up-to-date pages in response to specific information needs.
There is no precise mathematical measure of "best results".

Search Products and Services
Verity, Fulcrum, PLS, Oracle Text Extender, DB2 Text Extender, Infoseek Intranet, SMART (academic), Glimpse (academic), Bing (Microsoft), Baidu (China), Yandex (Russia), Inktomi (HotBot), AltaVista, Raging Search, Google, Ask.com (based on a HITS variant), Dmoz.org, Yahoo!, Infoseek Internet, Lycos, Excite
(* heuristics-based; * humanly-selected)

Web Search Queries
- Web search queries are short: ~2.4 words on average (Aug. 2000); this has increased from ~1.7 words (~1997)
- User expectation: "The first item shown should be what I want to see!"
- This works if the user has the most popular or most common notion in mind, but not otherwise

Standard Web Search Engine Architecture
[Diagram: crawl the web, eliminate duplicates, and create an inverted index (of DocIds); the inverted index is stored on search engine servers; a user query is matched against the index and the results are shown to the user]

Heuristics-based ranking
A naive approach used by many early search engines. Heuristics employed:
- count the number of times a page contains the query term
- favor instances where the term appears early in the page
- give extra weight to a word appearing in a special place or form, e.g. in a title or in bold
All these heuristics fail miserably, due to spamming and to the synonymy and polysemy of natural-language words.

Hyperlink Graph Analysis
- Hypermedia is a social network: telephoned, advised, co-authored, paid
- Social network theory (cf. Barnes, 1954 and Wasserman & Faust, 1994): extensive research applying graph notions such as centrality and prestige, co-citation (relevance judgement) and bibliographic coupling
- Applications: Web search (HITS, Google), classification and topic distillation

Hypertext models for classification
(c = class, t = text, N = neighbors)
- Text-only model: Pr[t | c]
- Using the neighbors' text to judge my topic: Pr[t, t(N) | c]?
- Better model: Pr[t, c(N) | c]
- Recursive relationships

Exploiting the Web's Hyperlink Structure
- Underlying assumption: view each link as an implicit endorsement of the location it points to. If the pages pointing to this page are good, then this is also a good page.
- References: Kleinberg '98, Page et al. '98. Draws upon earlier research in sociology and bibliometrics.
- Kleinberg's model includes authorities (highly referenced pages) and hubs (pages containing good reference lists).
- Google's model is a version with no hubs, and is closely related to work on influence weights by Pinski & Narin (1976).

Link analysis for ranking pages
Why does this work?
- The official Ferrari site will be linked to by lots of other official (or high-quality) sites
- The best Ferrari fan-club sites probably also have many links pointing to them
- Lower-quality sites do not have as many high-quality sites linking to them

PageRank
Intuition: a recursive definition of importance - a page is important if important pages link to it.
Method: create a stochastic matrix of the Web:
- each page corresponds to one row and one column of the matrix
- if page j has L successors (out-links), then the ij-th entry is 1/L if page i is one of those L successors of page j, and 0 otherwise
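As a minimal sketch, the construction above can be written out for a small illustrative link structure (here: the three-page Netscape/Microsoft/Amazon web used in the examples on the following slides):

```python
# Build the column-stochastic matrix from an out-link dictionary.
links = {"N": ["N", "A"], "M": ["A"], "A": ["N", "M"]}   # page -> successors

pages = sorted(links)                       # row/column order: A, M, N
idx = {p: i for i, p in enumerate(pages)}

# entry [i][j] = 1/L if page i is one of the L successors of page j, else 0
M = [[0.0] * len(pages) for _ in pages]
for j, src in enumerate(pages):
    L = len(links[src])
    for dst in links[src]:
        M[idx[dst]][j] = 1.0 / L

for row in M:
    print(row)
# Every column sums to 1: each page shares its importance equally
# among its successors.
```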

PageRank intuition
- Initially, each page has one unit of importance.
- At each round, each page shares whatever importance it has among its successors, and receives new importance from its predecessors.
- Eventually the importance converges to a limit, which happens to be the page's component of the principal eigenvector of this matrix.
- Importance = the probability that a random Web surfer, starting from a random Web page and following random links, will be at the page in question after a long series of steps.

PageRank example: the Web in 1689
Three pages: Netscape links to itself and Amazon; Microsoft links to Amazon; Amazon links to Netscape and Microsoft.
Equation:

  [N]   [1/2  0  1/2] [N]
  [M] = [ 0   0  1/2] [M]
  [A]   [1/2  1   0 ] [A]

Solution by relaxation (iterative solution):

  N: 1, 1,   5/4, 9/8,  5/4,   ... -> 6/5
  M: 1, 1/2, 3/4, 1/2,  11/16, ... -> 3/5
  A: 1, 3/2, 1,   11/8, 17/16, ... -> 6/5
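The relaxation above is easy to reproduce; a small sketch that iterates the matrix from this slide and confirms the limit (6/5, 3/5, 6/5):

```python
# Power iteration on the 3-page example, page order (N, M, A).
M = [
    [0.5, 0.0, 0.5],   # N receives from N (1/2) and A (1/2)
    [0.0, 0.0, 0.5],   # M receives from A (1/2)
    [0.5, 1.0, 0.0],   # A receives from N (1/2) and M (1)
]

v = [1.0, 1.0, 1.0]    # one unit of importance per page
for _ in range(100):
    v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

print([round(x, 4) for x in v])   # -> [1.2, 0.6, 1.2], i.e. (6/5, 3/5, 6/5)
```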

Problems with real web graphs
- Dead ends: a page that has no successors has nowhere to send its importance. Eventually, all importance would leak out of the Web.
- Spider traps: a group of one or more pages with no links outside the group. Eventually, these pages would accumulate all the importance of the Web.

The Bowtie structure of the Web
[Diagram: a Strongly Connected Component at the center, with an In Component and an Out Component on either side, plus Tendrils In/Out, Tubes, and Disconnected Components]
From Rajaraman and Ullman, Mining Massive Data Sets

The Bowtie structure of the Web
From the article "The web is a bow tie", Nature 405, May 2000

Dead ends - rank leaks
Microsoft tries to duck monopoly charges (and stops linking out)...
Equation:

  [N]   [1/2  0  1/2] [N]
  [M] = [ 0   0  1/2] [M]
  [A]   [1/2  0   0 ] [A]

Solution by relaxation (iterative solution):

  N: 1, 1,   3/4, 5/8, 1/2,  ... -> 0
  M: 1, 1/2, 1/4, 1/4, 3/16, ... -> 0
  A: 1, 1/2, 1/2, 3/8, 5/16, ... -> 0

Spider traps - rank sinks
Microsoft considers itself the center of the universe (and links only to itself)...
Equation:

  [N]   [1/2  0  1/2] [N]
  [M] = [ 0   1  1/2] [M]
  [A]   [1/2  0   0 ] [A]

Solution by relaxation (iterative solution):

  N: 1, 1,   3/4, 5/8, 1/2,   ... -> 0
  M: 1, 3/2, 7/4, 2,   35/16, ... -> 3
  A: 1, 1/2, 1/2, 3/8, 5/16,  ... -> 0
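Both failure modes can be demonstrated by iterating the two defective matrices above (page order N, M, A); a small sketch:

```python
# Iterate a matrix many times and watch where the importance ends up.
def iterate(M, v, steps=200):
    for _ in range(steps):
        v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]
    return v

dead_end    = [[0.5, 0, 0.5], [0, 0, 0.5], [0.5, 0, 0]]   # M has no out-links
spider_trap = [[0.5, 0, 0.5], [0, 1, 0.5], [0.5, 0, 0]]   # M links only to itself

print(iterate(dead_end,    [1.0, 1.0, 1.0]))   # all importance leaks away: ~ (0, 0, 0)
print(iterate(spider_trap, [1.0, 1.0, 1.0]))   # M hoards everything: ~ (0, 3, 0)
```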

Google's solution to dead ends and spider traps
Instead of applying the matrix directly, tax each page some fraction of its current importance, and distribute the taxed importance equally among all pages:

  [N]         [1/2  0  1/2] [N]    [0.2]
  [M] = 0.8 * [ 0   1  1/2] [M]  + [0.2]
  [A]         [1/2  0   0 ] [A]    [0.2]

The solution to this equation is now: N = 7/11, M = 21/11, A = 5/11.
Formally, the system should be:
- stochastic: columns of non-negative numbers summing to 1
- irreducible (strongly connected): all pairs of nodes are connected
- aperiodic: all states have a period of 1
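A small sketch of the taxed iteration, confirming the stated solution N = 7/11, M = 21/11, A = 5/11 (taxation fraction 0.2, i.e. beta = 0.8):

```python
# Damped ("taxed") PageRank on the spider-trap example, page order (N, M, A).
M = [
    [0.5, 0.0, 0.5],   # N receives from N (1/2) and A (1/2)
    [0.0, 1.0, 0.5],   # M receives from M (1) and A (1/2)
    [0.5, 0.0, 0.0],   # A receives from N (1/2)
]

beta, n = 0.8, 3
v = [1.0, 1.0, 1.0]    # one unit of importance per page
for _ in range(100):
    # keep beta of the redistributed importance, tax away (1 - beta) per page
    v = [beta * sum(M[i][j] * v[j] for j in range(n)) + (1 - beta)
         for i in range(n)]

print([round(x, 4) for x in v])   # -> [0.6364, 1.9091, 0.4545] = (7/11, 21/11, 5/11)
```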

PageRank
The PageRank of page p_i is defined by:

  PR(p_i) = (1 - d)/N + d * Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)

where p_1, p_2, ..., p_N are the pages under consideration, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of outbound links on page p_j, N is the total number of pages, and d is the damping factor.
In an iterative formulation we have, at t = 0:

  PR(p_i; 0) = 1/N

and at each subsequent time step the PageRank is computed as:

  PR(p_i; t+1) = (1 - d)/N + d * Σ_{p_j ∈ M(p_i)} PR(p_j; t) / L(p_j)

PageRank
Expressed as a matrix equation (using a modified adjacency matrix M):

  R = (1 - d)/N * 1 + d * M R

where R is the vector of PageRank values, 1 is an N-length column vector of ones, and the matrix M is given by:

  M_ij = 1/L(p_j) if page p_j links to page p_i, and 0 otherwise

Convergence is assumed when |R(t+1) - R(t)| < ε for some small ε.
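A minimal sketch of this matrix formulation (the link dictionary and page names are assumed, purely illustrative) with d = 0.85:

```python
# R(t+1) = (1 - d)/N * 1 + d * M R(t), built from an out-link dictionary.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # illustrative graph

pages = sorted(links)
N = len(pages)
idx = {p: i for i, p in enumerate(pages)}

# M[i][j] = 1/L(p_j) if p_j links to p_i, else 0
M = [[0.0] * N for _ in range(N)]
for j, src in enumerate(pages):
    for dst in links[src]:
        M[idx[dst]][j] = 1.0 / len(links[src])

d = 0.85
R = [1.0 / N] * N                     # PR(p_i; 0) = 1/N
for _ in range(100):
    R = [(1 - d) / N + d * sum(M[i][j] * R[j] for j in range(N))
         for i in range(N)]

print({p: round(R[idx[p]], 4) for p in pages})
```

Since there are no dead ends in this little graph, the PageRank values stay a probability distribution (they sum to 1).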

Variations of PageRank
Google anti-spam devices:
- Spamming: an attempt by web sites to appear to be about a subject that will attract surfers, without truly being about that subject.
- Early search engines relied on the words on a page to tell what it is about. This led to tricks in which pages attracted attention by placing false words in the background color of the page.
- Google trusts the words in anchor text: it relies on others telling the truth about your page, rather than relying on you. This makes it harder for a page to appear to be about something it is not.
- The use of PageRank to measure importance, rather than the more naive count of links into a page, also protects against spamming techniques such as term spamming.

Variations of PageRank
- TopicRank, a topic-sensitive PageRank: weight certain pages higher based on their topic, decided by correspondence to a teleport set of pages of known topics.
- TrustRank, a variant of the topic-sensitive PageRank method based on a set of trusted source pages. TrustRank handles link spamming, i.e. pages that mutually link to one another in order to increase their PageRank values, also called spam farms. It is reported that Google can recognize as unimportant spam farms involving 1000 pages that mutually link to one another.
- Spam mass is an alternative measure to identify spam farms:
  spam mass = (PR - TR) / PR
  Negative values, or values close to 0, are considered OK; values close to 1 indicate a page that is likely part of a spam farm.

Efficient calculation of PageRank
Very-large-scale PageRank calculation relies on:
- a sparse, blocked matrix representation
- parallel matrix-vector processing
- several variants for distributing the processing over a set of parallel processes, which can be based on MapReduce or other alternatives

Google search engine evolution (ref: wired.com)
- Backrub [September 1997]: this search engine, which had run on Stanford's servers for almost two years, is renamed Google. Its breakthrough innovation: ranking searches based on the number and quality of incoming links.
- New algorithm [August 2001]: the search algorithm is completely revamped to incorporate additional ranking criteria more easily.
- Local connectivity analysis [February 2003]: Google's first patent is granted for this feature, which gives more weight to links from authoritative sites.
- Fritz [Summer 2003]: this initiative allows Google to update its index constantly, instead of in big batches.
- Personalized results [June 2005]: users can choose to let Google mine their own search behavior to provide individualized results.
- Bigdaddy [December 2005]: engine update that allows for more comprehensive Web crawling.
- Universal Search [May 2007]: building on Image Search, Google News, and Book Search, Universal Search lets users get links to any medium on the same results page.
- Google Caffeine [August 2009]: a new architecture designed to return results faster and to better handle rapidly updated information from services such as Facebook and Twitter. With Caffeine, Google moved its back-end indexing system away from MapReduce and onto BigTable, the company's distributed database platform.
- Panda [February 2011]: change to lower the rank of "low-quality" or "thin" sites and return higher-quality sites near the top of the search results, e.g. dropping rankings for sites containing large amounts of advertising.
- Penguin [April 2012]: change to decrease the search rankings of websites that violate Google's Webmaster Guidelines by keyword stuffing, cloaking, participating in link schemes, deliberately creating duplicate content, and other techniques.

Google search engine evolution (ref: wired.com)
- Hummingbird [August 2013]: conversational search that leverages natural language, semantic search, and more to improve the way search queries are parsed. Unlike previous search algorithms, which focused on each individual word in the query, Hummingbird considers each word but also how the words make up the entirety of the query: the whole sentence, conversation or meaning is taken into account rather than particular words. The goal is that pages matching the meaning do better, rather than pages matching just a few words.
- Google Pigeon [July 2014]: update aimed at increasing the ranking of local listings in a search. The changes also affect the search results shown in Google Maps along with the regular Google search results. To improve the quality of local searches, Google relies on factors such as location and distance to provide better results to the user.

Google Facts (from end 2001)
- Indexes 3 billion web pages (50 billion pages in 2012). If printed, they would result in a stack of paper 200 km high. If a person read a page per minute (and did nothing else), (s)he would need 6000 years to read them all.
- 200 million search queries a day, i.e. approx. 80 billion searches a year (in 2013 there were around 2,160 billion searches). Most searches take less than half a second.
- Support for 35 non-English languages (in 2014 it is 151 language choices).
- The searchable index contained 3 trillion items, updated every 28 days.

Google architecture (approx.)
1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book: it tells which pages contain the words that match the query.
2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.
3. The search results are returned to the user, typically in less than a second.

Google Advanced Search

Google strengths and weaknesses
- Successful business implementation
- Good at fighting spam
- Query-independent: very efficient at query time
- Time is not considered (only a limited time interval of results is available)
- The basic algorithm does not distinguish between pages that are authoritative in general and pages that are authoritative on a certain topic (no public information available on this point)

Hubs and Authorities
- HITS: hypertext-induced topic selection (IBM's Clever project)
- Defined in a mutually recursive way: a hub links to many (valuable) authorities; an authority is linked to by many (good) hubs
- Authorities turn out to be pages that offer the best information about a topic
- Hubs are pages that do not themselves provide the information, but specify a collection of links to where the information can be found

Hubs and Authorities
- Use a matrix formulation similar to that of PageRank, but without the stochastic restriction
- Rows and columns correspond to web pages: A[i,j] = 1 if page i links to page j, and A[i,j] = 0 otherwise
- The transpose of A looks like the PageRank matrix, but it has a 1 where the PageRank matrix has a fraction
- An iterative solution that repeatedly applies the matrix A will produce a diverging solution vector. However, by introducing scaling factors, the computed values of authority and hubbiness for each page can be kept within finite bounds.

Computing Hubbiness and Authority of pages
Let a and h be vectors whose i-th components correspond to the degrees of authority and hubbiness of the i-th page, respectively, and let λ and µ be suitable scaling factors. Then:
- the hubbiness of each page is the sum of the authorities of the pages it links to, scaled by λ:
  h = λ A a
- the authority of each page is the sum of the hubbiness of all pages that link to it, scaled by µ:
  a = µ A^T h

Computing Hubbiness and Authority of pages
By simple substitution, we get two equations that relate the vectors a and h only to themselves:

  a = λµ A^T A a
  h = λµ A A^T h

Thus, we can compute a and h by iteratively solving these equations, giving us the principal eigenvectors of the matrices A^T A and A A^T, respectively.
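The iteration can be sketched as follows, using the adjacency matrix of the Netscape/Microsoft/Amazon example on the following slides; the per-round rescaling plays the role of the scaling factors λ and µ:

```python
# HITS by power iteration: repeatedly apply a = A^T h and h = A a.
# A[i][j] = 1 if page i links to page j; page order (N, M, A).
A = [
    [1, 1, 1],   # Netscape links to N, M, A
    [0, 0, 1],   # Microsoft links to A
    [1, 1, 0],   # Amazon links to N, M
]
n = len(A)

a = [1.0] * n    # authority scores
h = [1.0] * n    # hub scores
for _ in range(50):
    a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]   # a = A^T h
    h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]   # h = A a
    ma, mh = max(a), max(h)
    a = [x / ma for x in a]   # rescale so values stay in finite bounds
    h = [x / mh for x in h]

print([round(x, 3) for x in a])   # -> [1.0, 1.0, 0.732]
print([round(x, 3) for x in h])   # -> [1.0, 0.268, 0.732]
```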

Hubs and Authorities example: Netscape acknowledges Microsoft's existence
Matrices (rows/columns ordered Netscape, Microsoft, Amazon):

  A =       [1 1 1]      A^T =   [1 0 1]
            [0 0 1]              [1 0 1]
            [1 1 0]              [1 1 0]

  A A^T =   [3 1 2]      A^T A = [2 2 1]
            [1 1 0]              [2 2 1]
            [2 0 2]              [1 1 2]

Hubs and Authorities example: Netscape acknowledges Microsoft's existence
Iteratively solving the equations for a and h, assuming λ = µ = 1:

  a: (1, 1, 1) -> (5, 5, 4) -> (24, 24, 18) -> (114, 114, 84) -> ...
     converging to the ratios 1.37 : 1.37 : 1

  h: (1, 1, 1) -> (6, 2, 4) -> (28, 8, 20) -> (132, 36, 96) -> ...
     converging to the ratios 1.000 : 0.268 : 0.732

Authorities and Hubs: pragmatics
- The system is jump-started by obtaining a set of root pages from a standard text index such as Google or Bing
- The iterative process settles very rapidly: a root set of 3,000 pages requires just 5 rounds of calculation
- The results are independent of the initial estimates
- The algorithm naturally separates web sites into clusters; e.g. a search for "abortion" partitions the Web into a pro-life and a pro-choice community

HITS strengths and weaknesses
- Ranks pages according to the query topic
- Less good at fighting spam (later developments are more robust)
- Topic drift (the expanded base set can include irrelevant pages)
- More time-consuming (compared to PageRank)

Web sizes over time
- ~150M pages in 1998; ~5B in 2005: a 33x increase (Moore's law would predict 25x)
- Monthly refresh in 1998; daily refresh in 2005
- What about 2010? 40B? (actually 29B, but 50B in 2012)
- Where is the content? Public Web? Personal Web?

Compare the previous slide to Moore's Law (computing power)

Philosophical Remarks
- The Web of today is dramatically different from what it was five years ago; predicting the next five years seems futile
- Will even the basic act of indexing soon become infeasible? If so, will our notion of searching the Web undergo fundamental changes?
- The Web's relentless growth will continue to generate computational challenges for wading through the ever-increasing volume of on-line information

World Wide Web statistics
- According to a 2001 study, there were more than 550 billion documents on the Web, mostly in the invisible Web (or "deep Web").
- A 2002 survey of 2,024 million Web pages determined that by far the most Web content was in English (56.4%); next were pages in German (7.7%), French (5.6%), and Japanese (4.9%).
- A more recent study, which used Web searches in 75 different languages to sample the Web, determined that there were over 11.5 billion Web pages in the publicly indexable Web as of the end of January 2005. As of March 2009, the indexable web contained at least 25.21 billion pages.
- On July 25, 2008, Google software engineers Jesse Alpert and Nissan Hajaj announced that Google Search had discovered one trillion unique URLs.
- As of May 2009, over 109.5 million websites were in operation; of these, 74% were commercial or other sites operating in the .com generic top-level domain.