Jeffrey D. Ullman Stanford University


Mutually recursive definition: a hub links to many authorities; an authority is linked to by many hubs. Authorities turn out to be places where information can be found. Example: course home pages. Hubs tell where the authorities are. Example: a departmental course-listing page.

HITS uses a matrix A with A[i, j] = 1 if page i links to page j, and 0 if not. Aᵀ, the transpose of A, is similar to the PageRank matrix M, but Aᵀ has 1's where M has fractions. HITS also uses column vectors h and a representing the degree to which each page is a hub or authority, respectively. Computation of h and a is similar to the iterative way we compute PageRank.

Example: a three-page Web with Yahoo (y), Amazon (a), and M'soft (m). The link matrix (rows are sources, columns are destinations) is:

         y  a  m
     y   1  1  1
A =  a   1  0  1
     m   0  1  0

Powers of A and Aᵀ have elements whose values grow exponentially with the exponent, so we need scale factors λ and μ. Let h and a be column vectors measuring the hubbiness and authority of each page. Equations: h = λAa; a = μAᵀh. Hubbiness = scaled sum of the authorities of successor pages (out-links). Authority = scaled sum of the hubbiness of predecessor pages (in-links).

From h = λAa and a = μAᵀh we can derive: h = λμAAᵀh and a = λμAᵀAa. Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority. Pick an appropriate value of λμ.

Remember: it is only the direction of the vectors, or the relative hubbiness and authority of Web pages, that matters. As for PageRank, the only reason to worry about scale is so you don't get overflows or underflows in the values as you iterate.

Recall a = λμAᵀAa and h = λμAAᵀh. For the example:

        1 1 1          1 1 0            3 2 1            2 1 2
   A =  1 0 1    Aᵀ =  1 0 1    AAᵀ =   2 2 0    AᵀA =   1 2 1
        0 1 0          1 1 0            1 0 1            2 1 2

Iterating from [1, 1, 1] and ignoring the scale factors:

   a(Yahoo)  = 1    5    24    114    ...    1+√3
   a(Amazon) = 1    4    18     84    ...    2
   a(M'soft) = 1    5    24    114    ...    1+√3

   h(Yahoo)  = 1    6    28    132    ...    1.000
   h(Amazon) = 1    4    20     96    ...    0.732
   h(M'soft) = 1    2     8     36    ...    0.268

The h values in the last column are scaled so the largest component is 1.000; the a values converge to the ratio (1+√3) : 2 : (1+√3).

Iterate as for PageRank; don't try to solve the equations. But keep components within bounds. Example: scale to keep the largest component of the vector at 1. A consequence is that λ and μ actually vary as the iteration proceeds.

Correct approach: start with h = [1, 1, …, 1]; multiply by Aᵀ to get the first a; scale, then multiply by A to get the next h, and repeat until approximate convergence. You may be tempted to compute AAᵀ and AᵀA first, then iterate multiplication by these matrices, as for PageRank. That is bad, because these matrices are not nearly as sparse as A and Aᵀ.
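
A minimal sketch of that iteration in Python/numpy, using the three-page Yahoo/Amazon/M'soft example from above; the fixed iteration count is an arbitrary stand-in for a real convergence test:

import numpy as np

# Link matrix of the example: rows are sources, columns are destinations.
A = np.array([[1, 1, 1],   # Yahoo  -> Yahoo, Amazon, M'soft
              [1, 0, 1],   # Amazon -> Yahoo, M'soft
              [0, 1, 0]])  # M'soft -> Amazon

h = np.ones(3)             # start: one unit of hubbiness per page
for _ in range(50):        # 50 iterations stand in for a convergence test
    a = A.T @ h            # authority = sum of hubbiness of predecessors
    a = a / a.max()        # scale so the largest component is 1
    h = A @ a              # hubbiness = sum of authorities of successors
    h = h / h.max()

print("h =", h)            # ~ [1.000, 0.732, 0.268]
print("a =", a)            # ~ [1.000, 0.732, 1.000]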

Spamming = any deliberate action taken solely to boost a Web page's position in search-engine results. Spam = Web pages that are the result of spamming. The SEO industry might disagree! (SEO = search-engine optimization.)

Boosting techniques. Techniques for achieving high relevance/importance for a Web page. Hiding techniques. Techniques to hide the use of boosting from humans and Web crawlers.

Term spamming. Manipulating the text of web pages in order to appear relevant to queries. Link spamming. Creating link structures that boost PageRank.

Repetition of terms, e.g., "Viagra," in order to subvert TF.IDF-based rankings. Dumping = adding large numbers of words to your page. Example: run the search query you would like your page to match, and add copies of the top 10 pages. Example: add a dictionary, so you match every search query. Key hiding technique: words are hidden by giving them the same color as the background.

PageRank prevents spammers from using term spam to fool a search engine. While spammers can still use those techniques, they cannot get a high enough PageRank to be in the top 10. Spammers now attempt to fool PageRank with link spam: creating structures on the Web, called spam farms, that increase the PageRank of undeserving pages.

Three kinds of Web pages from a spammer's point of view:
1. Own pages: completely controlled by the spammer.
2. Accessible pages: e.g., Web-log comment pages, where the spammer can post links to his pages.
3. Inaccessible pages: everything else.

Spammer's goal: maximize the PageRank of target page t. Technique: 1. Get as many links as possible from accessible pages to the target page t. 2. Construct a spam farm to get a PageRank-multiplier effect.

(Diagram: the spam farm. Target page t sits among the spammer's own pages; accessible pages link to t; t exchanges links with farm pages 1, 2, …, M.) Goal: boost the PageRank of page t. This is one of the most common and effective organizations for a spam farm. Note the links are two-way: page t links to all M farm pages, and they link back.

Suppose the rank contributed by accessible pages = x, and the PageRank of the target page = y. Let the taxation rate be 1−b, M = the number of farm pages, and N = the size of the Web, with total PageRank = 1. Then the rank of each farm page = by/M + (1−b)/N: the first term comes from t, the second is the page's share of the tax.

Now compute y, ignoring the tax share for t itself (it is very small):

   y = x + bM[by/M + (1−b)/N]
     = x + b²y + b(1−b)M/N
   y = x/(1−b²) + cM/N, where c = b/(1+b).

The bracketed term is the PageRank of each farm page.

y = x/(1−b²) + cM/N, where c = b/(1+b). For b = 0.85, 1/(1−b²) = 3.6: a multiplier effect for the acquired PageRank x. And by making M large, we can make y almost as large as we want.
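
As a quick sanity check of the formula, a sketch with illustrative numbers; x, M, and N below are made up, and only the formula itself comes from the analysis above:

def target_pagerank(x, M, N, b=0.85):
    # y = x/(1 - b^2) + c*M/N, with c = b/(1 + b)
    c = b / (1 + b)
    return x / (1 - b**2) + c * M / N

# Hypothetical spammer: tiny acquired rank x, 10,000 farm pages, 10^12-page Web.
print(target_pagerank(x=1e-10, M=1e4, N=1e12))  # ~4.95e-09, roughly 50x the raw x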

If you design your spam farm just as described, Google will notice it and drop it from the Web. More complex designs might go undetected, but SEO innovations can be tracked by Google et al. Fortunately, there are other techniques that do not rely on direct detection of spam farms.

Topic-specific PageRank with a set of trusted pages as the teleport set is called TrustRank. Spam mass = (PageRank − TrustRank)/PageRank. A high spam mass means most of your PageRank comes from untrusted sources; you may be link spam.
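
A minimal sketch of the spam-mass test, assuming per-page PageRank and TrustRank scores are already computed; the scores and the 0.9 threshold below are illustrative choices, not from the slides:

def spam_mass(pagerank, trustrank):
    # Fraction of the page's PageRank that does NOT come from trusted sources.
    return (pagerank - trustrank) / pagerank

# A page whose TrustRank is far below its PageRank is suspect.
if spam_mass(pagerank=2.0e-9, trustrank=1.0e-10) > 0.9:
    print("likely link spam")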

Two conflicting considerations: A human may have to inspect each trusted page, so this set should be as small as possible. But we must ensure every good page gets adequate TrustRank, so all good pages should be reachable from the trusted set by short paths; this implies that the trusted set must be geographically diverse, hence large.

Two ways to pick the trusted set: 1. Pick the top k pages by PageRank: it is almost impossible to get a spam page to the very top of the PageRank order. 2. Pick the home pages of universities: domains like .edu are controlled. Notice that both of these approaches avoid the requirement for human intervention.

Google computes the PageRank of a trillion pages (at least!). The PageRank vector of double-precision reals thus requires 8 terabytes (10¹² pages × 8 bytes), and another 8 terabytes for the next estimate of PageRank.

The matrix of the Web has two special properties: 1. It is very sparse: the average Web page has about 10 out-links. 2. Each column has a single value, 1 divided by the number of out-links, that appears wherever that column is not 0.

Trick: for each column, store n = the number of out-links and a list of the rows with nonzero values (each 1/n). With 4 bytes for the integer n, 8 bytes per row number, and an average of 10 links per column, the matrix of the Web requires at least (4×1 + 8×10) × 10¹² bytes = 84 terabytes.
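
A sketch of this compressed column representation; the page numbers happen to encode the three-page example from the HITS discussion, viewed as the PageRank matrix M:

# For each column (source page), store the out-degree n and the list of
# rows (destination pages); every nonzero entry in the column is 1/n.
web_matrix = {
    0: (3, [0, 1, 2]),   # Yahoo  has 3 out-links, to pages 0, 1, 2
    1: (2, [0, 2]),      # Amazon has 2 out-links, to pages 0 and 2
    2: (1, [1]),         # M'soft has 1 out-link, to page 1
}

def entry(matrix, row, col):
    # Recover M[row][col] from the compressed column.
    n, rows = matrix[col]
    return 1.0 / n if row in rows else 0.0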

Divide the current and next PageRank vectors into k stripes of equal size. Each stripe consists of the components in some consecutive rows. Divide the matrix into squares whose sides are the same length as one of the stripes. Pick k large enough that we can fit a stripe of each vector and a block of the matrix in main memory at the same time. Note: the multiplication may actually be done at many machines in parallel.

   w1     M11 M12 M13     v1
   w2  =  M21 M22 M23  ×  v2
   w3     M31 M32 M33     v3

At any one time, we need only wi, vj, and Mij in memory. Vary v slowest:

   w1  = M11 v1;  w2  = M21 v1;  w3  = M31 v1;
   w1 += M12 v2;  w2 += M22 v2;  w3 += M32 v2;
   w1 += M13 v3;  w2 += M23 v3;  w3 += M33 v3
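
A sketch of that loop order in Python; dense numpy blocks stand in for the compressed sparse blocks, purely for illustration:

import numpy as np

def block_multiply(M_blocks, v_stripes, k):
    # M_blocks[i][j] is block Mij; v_stripes[j] is stripe vj.
    w_stripes = [np.zeros_like(v, dtype=float) for v in v_stripes]
    for j in range(k):          # vary v slowest: read each vj once per pass
        for i in range(k):
            # only w[i], v[j], and block M[i][j] need be in memory here
            w_stripes[i] += M_blocks[i][j] @ v_stripes[j]
    return w_stripes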

Each column of a block is represented by: 1. The number n of nonzero elements in the entire column of the matrix (i.e., the total number of out-links for the corresponding Web page). 2. The list of rows of that block only that have nonzero values (which must each be 1/n). That is, for each column, we store n with each of the k blocks, and each out-link with whichever block contains the row to which the link goes.

Total space to represent the matrix = (4k + 8×10) × 10¹² bytes = 4k + 80 terabytes. The integer n for a column is repeated in each of the k blocks; the average of 10 links per column, at 8 bytes per row number, is spread over the k blocks.

We are not just multiplying a matrix and a vector. We need to multiply the result by a constant to reflect the taxation, and we need to add a constant to each component of the result w. Neither of these changes is hard to do: after computing each component wi of w, multiply by b and then add (1−b)/N.
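
The taxation step as a one-line adjustment after the multiply (a sketch, with b and N as defined above):

def apply_taxation(w, b, N):
    # After w = M v, scale by b and add the teleport share (1 - b)/N.
    return [b * w_i + (1 - b) / N for w_i in w]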

The strategy described can be executed on a single machine. But who would want to? There is a simple MapReduce algorithm to perform matrix-vector multiplication. But since the matrix is sparse, it is better to treat the multiplication as a relational join.
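
A sketch of the relational-join view: represent M as tuples (i, j, m) and v as tuples (j, v), join on j, then group by i and sum. Here it is in plain Python rather than actual MapReduce:

from collections import defaultdict

def multiply_as_join(M_tuples, v_tuples):
    v = dict(v_tuples)              # relation V(j, v)
    w = defaultdict(float)
    for i, j, m in M_tuples:        # join M(i, j, m) with V(j, v) on j...
        w[i] += m * v[j]            # ...then group by i and aggregate by sum
    return dict(w)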

Another approach is to use many jobs, each multiplying one row of matrix blocks by the entire v. Use main memory to hold the one stripe of w that will be produced. Read one stripe of v into main memory at a time, and read the block of M that needs to multiply the current stripe of v a tiny bit at a time. This works as long as k is large enough that the stripes fit in memory. M is read once and v is read k times, among all the jobs. That is OK, because M is much larger than v.

(Diagrams: main memory for job i always holds the output stripe wi, while the stripes v1, v2, …, vj and the corresponding blocks Mi1, Mi2, …, Mij stream through one at a time.)