Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Similar documents
Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Topic: Duplicate Detection and Similarity Computing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CSE 5243 INTRO. TO DATA MINING

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

How to organize the Web?

Finding Similar Items:Nearest Neighbor Search

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Information Networks: PageRank

Unit VIII. Chapter 9. Link Analysis

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Big Data Analytics CSCI 4030

Finding Similar Sets

Part 1: Link Analysis & Page Rank

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Slides based on those in:

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

Big Data Analytics CSCI 4030

Lecture 8: Linkage algorithms and web search

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Data Mining

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Jeffrey D. Ullman Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Link Analysis. Chapter PageRank

CS425: Algorithms for Web Scale Data

Jeffrey D. Ullman Stanford University/Infolab

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Big Data Analytics CSCI 4030

Analysis of Large Graphs: TrustRank and WebSpam

Introduction to Data Mining

Introduction to Data Mining

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Information Retrieval

Information Retrieval. Lecture 9 - Web search basics

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Mining Web Data. Lijun Zhang

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Clustering. Huanle Xu. Clustering 1

A brief history of Google

Brief (non-technical) history

Jeffrey D. Ullman Stanford University

CS290N Summary Tao Yang

2.3 Algorithms Using Map-Reduce

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Solutions to Problem Set 1

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Web Structure Mining using Link Analysis Algorithms

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #44. Multidimensional Array and pointers

Singular Value Decomposition, and Application to Recommender Systems

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Data-Intensive Computing with MapReduce

Package textrank. December 18, 2017

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #13: Mining Streams 1

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

COMP 4601 Hubs and Authorities

US Patent 6,658,423. William Pugh

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

Locality-Sensitive Hashing

Introduction to Data Mining

CS 6604: Data Mining Large Networks and Time-Series

Link Analysis. Hongning Wang

Link Analysis in the Cloud

Link Analysis in Web Mining

Automatic Summarization

Information Retrieval

Today we show how a search engine works

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Locality-Sensitive Hashing

CSI 445/660 Part 10 (Link Analysis and Web Search)

Scalable Techniques for Similarity Search

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

CS6220: DATA MINING TECHNIQUES

Final Exam Search Engines ( / ) December 8, 2014

Impact of Search Engines on Page Popularity

Information Retrieval. Lecture 11 - Link analysis

1 Data Mining. 1.1 What is Data Mining? Statistical Modeling

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2018

Towards a hybrid approach to Netflix Challenge

MinHash Sketches: A Brief Survey. June 14, 2016

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

Popularity of Twitter Accounts: PageRank on a Social Network

Search Engines Considered Harmful In Search of an Unbiased Web Ranking

Machine Learning using MapReduce

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Mining Web Data. Lijun Zhang

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Transcription:

Some Interesting Applications of Theory: PageRank, Minhashing, Locality-Sensitive Hashing 1

PageRank The thing that makes Google work. Intuition: solve the recursive equation: a page is important if important pages link to it. In high-falutin terms: importance = the principal eigenvector of the stochastic matrix of the Web. A few fixups needed. 2

Stochastic Matrix M of the Web. Suppose page j links to n pages, including page i; then M_ij = 1/n (and M_ij = 0 if j does not link to i). M expresses how importance flows around the Web. Equivalent to following random walkers. 3

Example: The Web in 1839. [Graph: Yahoo links to itself and to Amazon; Amazon links to Yahoo and to M'soft; M'soft links only to Amazon.]
          y    a    m
M =   y  1/2  1/2   0
      a  1/2   0    1
      m   0   1/2   0     4

The Google Idea. Imagine many random walkers on the Web. At each tick, each walker picks an out-link at random and follows it. The distribution of walkers v becomes M v after one tick. Compute M^50 v (approximately 50 iterations is enough). 5
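To make the iteration concrete, here is a minimal power-iteration sketch in Python for the 1839 Web matrix above (not from the slides; the variable names and iteration count are illustrative):

```python
import numpy as np

# Column-stochastic matrix M for the 1839 Web (rows and columns ordered y, a, m).
# Column j holds 1/n for each of page j's n out-links.
M = np.array([
    [0.5, 0.5, 0.0],   # links into Yahoo
    [0.5, 0.0, 1.0],   # links into Amazon
    [0.0, 0.5, 0.0],   # links into M'soft
])

# Start the walkers spread evenly, then apply M repeatedly (power iteration).
v = np.array([1/3, 1/3, 1/3])
for _ in range(50):
    v = M @ v

print(v)   # converges to about [0.4, 0.4, 0.2] for y, a, m
```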

The Walkers / In the Limit (slides 6-10): [Figures: the distribution of walkers over Yahoo, Amazon, and M'soft after successive ticks, converging in the limit to 2/5, 2/5, 1/5 respectively.]

Real-World Problems Some pages are dead ends (have no links out). Such a page causes importance to leak out. Other (groups of) pages are spider traps (all out-links are within the group). Eventually spider traps absorb all importance. 11

Microsoft Becomes a Dead End / In the Limit (slides 12-16): [Figures: with M'soft's out-link removed, walkers that reach M'soft vanish at the next tick; in the limit all importance leaks out and every page's importance goes to zero.]

Microsoft Becomes a Spider Trap / In the Limit (slides 17-20): [Figures: with M'soft linking only to itself, walkers that reach M'soft never leave; in the limit M'soft has absorbed all the importance.]

Topic-Specific PageRank. Goal: evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g., sports or cooking. Allows search queries to be answered based on the interests of the user. Example: the query "batter" wants different pages depending on whether you are interested in sports or cooking. 21

Teleport Sets. Assume each walker has a small probability of teleporting at any tick. The teleport can go to: 1. Any page, with equal probability (to avoid the dead-end and spider-trap problems). 2. A topic-specific set of relevant pages (the teleport set), for topic-specific PageRank. 22

Example: Topic = Software. Only Microsoft is in the teleport set. Assume a 20% tax, i.e., a teleport probability of 0.2 at each tick. 23
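A minimal sketch of the taxed iteration under these assumptions (beta = 0.8 for the 20% tax; teleports land only on M'soft); the names are illustrative, not code from the lecture:

```python
import numpy as np

M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
])

beta = 0.8                             # follow a link with prob. 0.8, teleport with prob. 0.2
teleport = np.array([0.0, 0.0, 1.0])   # teleport set = {M'soft} for the software topic

v = np.array([1/3, 1/3, 1/3])
for _ in range(50):
    v = beta * (M @ v) + (1 - beta) * teleport

print(v)   # M'soft ends up with a much larger share than under plain PageRank
```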

Only Microsoft in Teleport Set (slides 24-30): [Figures: successive ticks of the taxed walk on the Yahoo/Amazon/M'soft graph; teleporting walkers (pictured as Dr. Who's phone booth) always land on M'soft, which settles with the largest share of the importance.]

New Topic: Similarity Search. We have many objects that populate overlapping sets; find the pairs of sets that are similar. Jaccard similarity of sets = the size of the intersection divided by the size of the union. 31

Example: Jaccard Similarity. [Figure: two sets with 3 elements in the intersection and 8 elements in the union.] Jaccard similarity = 3/8. 32
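A minimal helper matching this example (the specific sets below are made up so that 3 elements are shared and 8 are in the union):

```python
def jaccard(s1, s2):
    # Jaccard similarity = |intersection| / |union|.
    s1, s2 = set(s1), set(s2)
    return len(s1 & s2) / len(s1 | s2)

# Two sets sharing 3 elements, with 8 elements in the union:
A = {1, 2, 3, 4, 5, 6}
B = {4, 5, 6, 7, 8}
print(jaccard(A, B))   # 3/8 = 0.375
```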

Applications. 1. Collaborative Filtering: represent Amazon customers by the sets of products they buy; recommend what similar customers bought. 2. Similar Documents: represent pages by their sets of k-shingles = strings of k consecutive characters; similar pages could be plagiarism. 33

When Is the Problem Interesting? 1. When the sets are so large or so many that they cannot fit in main memory. 2. When there are so many sets that comparing all pairs of sets takes too much time. 34

Key Ideas. 1. Minhashing (Edith Cohen, Andrei Broder): construct small signatures for sets so that the Jaccard similarity of sets can be estimated from the signatures. 2. Locality-Sensitive Hashing (Rajeev Motwani, Piotr Indyk): focus on pairs of (likely) similar sets without looking at all pairs. 35

Minhashing as a Matrix Problem. Think of sets as represented by a matrix of 0s and 1s. Row = element. Column = set. A 1 means that element is in that set. 36

Example.
C1 = {b, c, e}   C2 = {a, c, e, f}

     C1  C2
a     0   1
b     1   0
c     1   1
d     0   0
e     1   1
f     0   1

Sim(C1, C2) = 2/5 = 0.4   37

Four Types of Rows. Given columns C1 and C2, a row may be classified as:

type   C1  C2
 a      1   1
 b      1   0
 c      0   1
 d      0   0

Also let a = # of rows of type a, etc. Note Sim(C1, C2) = a/(a+b+c). 38

Minhashing. Imagine the rows permuted randomly. Define the hash function h(C) = the number of the first row (in the permuted order) in which column C has a 1. Use several (e.g., 100) independent hash functions (permutations) to create a signature. 39

Minhashing Example (slide 40): [Figure: a 7-row, 4-column input matrix, three random permutations of its rows, and the resulting 3-row signature matrix M; each signature entry is the position, under that permutation, of the first row in which the column has a 1.]
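Since the figure's exact numbers did not survive extraction, here is a small stand-in in Python: a 7x4 input matrix, signatures built from random row permutations, and a check that the fraction of agreeing signature rows approximates the Jaccard similarity (the matrix and function name are illustrative, not the slide's data):

```python
import random

# Illustrative 0/1 matrix: rows = elements, columns = sets (4 sets over 7 elements).
rows = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]

def minhash_signature(rows, num_hashes=100, seed=0):
    # One signature row per random permutation of the matrix rows.
    # Entry = position (in permuted order) of the first row where the column has a 1.
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    signature = []
    for _ in range(num_hashes):
        perm = list(range(n_rows))
        rng.shuffle(perm)                      # a random permutation of the rows
        sig_row = [None] * n_cols
        for pos, r in enumerate(perm):         # scan rows in permuted order
            for c in range(n_cols):
                if sig_row[c] is None and rows[r][c] == 1:
                    sig_row[c] = pos
            if all(s is not None for s in sig_row):
                break
        signature.append(sig_row)
    return signature

sig = minhash_signature(rows)
# Fraction of hash functions on which two columns agree ~ their Jaccard similarity.
agree = sum(s[0] == s[2] for s in sig) / len(sig)
print(agree)   # close to the true similarity of columns 0 and 2, which is 3/4 here
```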

Surprising Property. The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2). Both are a/(a+b+c)! Why? Look down columns C1 and C2 (in permuted order) until we see a 1. If it's a type-a row, then h(C1) = h(C2); if a type-b or type-c row, then not. 41

Finding Similar Sets. We can use minhashing to replace sets (columns of the matrix) by short lists of integers. But we still need to compare each pair of signatures. Example: 20 million Amazon customers means about 2x10^14 pairs of customers to evaluate. 42

Locality-Sensitive Hashing. What we want seems impossible: map signatures to buckets so that 1. two similar signatures have a very good chance of appearing in the same bucket, and 2. if two signatures are not very similar, they probably don't appear in one bucket. Then we only have to compare bucket-mates (candidate pairs). 43

The LSH Trick Think of the signature for each column as a column of the signature matrix S. Divide the rows of S into b bands of r rows each. 44

Partition Into Bands: [Figure: the signature matrix S divided into b bands of r rows each.] 45

Partition into Bands --- (2). For each band, hash its portion of each column to a hash table with many buckets. Candidate column pairs are those that hash to the same bucket for at least one band. Tune b and r to catch most similar pairs but few nonsimilar pairs. 46

Buckets: [Figure: each band of r rows of the signature matrix S is hashed separately into its own collection of buckets.] 47
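A sketch of the banding step under these definitions (a plain Python outline, not the lecture's code): hash each band's slice of every column and collect the column pairs that collide in at least one band.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signature, b, r):
    # signature: list of rows (one per hash function), each a list with one entry per column.
    # Assumes len(signature) == b * r. Returns column pairs colliding in >= 1 band.
    n_cols = len(signature[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col in range(n_cols):
            # The band's portion of this column serves as the hash key.
            key = tuple(signature[band * r + i][col] for i in range(r))
            buckets[key].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

# e.g., using the 100-row signature `sig` from the minhash sketch above:
# print(lsh_candidate_pairs(sig, b=20, r=5))
```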

LSH --- Graphically. Example target: all pairs with Sim > t. Ideally, the probability that a pair becomes a candidate would be a step function of Sim: 0 below t and 1 above t. If we used only one hash function, that probability would simply equal Sim (a straight line). Partitioning into b bands of r rows gives an S-curve in between: a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^b, and the threshold where the curve rises sharply is approximately t ~ (1/b)^(1/r). 48
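For instance, with a 100-row signature split as b = 20 bands of r = 5 rows (illustrative numbers, not from the slides), the threshold and the S-curve can be computed directly:

```python
b, r = 20, 5                          # 20 bands of 5 rows = a 100-row signature
threshold = (1 / b) ** (1 / r)
print(round(threshold, 2))            # about 0.55

def candidate_prob(s, b=20, r=5):
    # Probability that a pair with Jaccard similarity s becomes a candidate pair.
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_prob(s), 3))
```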

Summary of Minhash/LSH 1. Represent the objects you are comparing by sets (ad-hoc method). 2. Represent the sets by signatures (Minhashing). 3. Use LSH to create buckets; candidate pairs are those in the same bucket. 4. Evaluate only the candidate pairs. 49

Experience. 1. Finding news articles from the same source. 2. Entity resolution: finding customers shared by two businesses. 50

News Sources. Two members of the database group at Stanford were asked by the PoliSci Dept. to examine 1.5 million news articles and identify those that were really the same syndicated article published by different newspapers. Each newspaper decorates the article with its own material, e.g., its masthead. 51

News Sources (2). They developed their own algorithm and reported it to the group. I suggested minhashing + LSH. They reimplemented and found that minhash + LSH was faster and more accurate for all but very high degrees of similarity. 52

Matching Customer Records. I once took a consulting job solving the following problem: Company A agreed to solicit customers for Company B, for a fee. They then had a parting of the ways and argued over how many customers A had actually delivered. Neither had recorded exactly which customers were involved. 53

Customer Records (2). Company B had about 1 million records of all its customers. Company A had about 1 million records describing customers, some of whom it had signed up for B. Records had name, address, and phone, but for various reasons these could differ for the same person. 54

Customer Records (3). Step 1: design a measure of how similar records are, e.g., deduct points for small misspellings ("Jeffrey" vs. "Geoffery") or for the same phone number with a different area code. Step 2: score all pairs of records; report very similar pairs as matches. 55

Customer Records (4). Problem: (1 million)^2 / 2 = 5x10^11 is too many pairs of records to score. Solution: a simple LSH. Three hash functions: the exact values of name, address, and phone. Compare two records iff they are identical in at least one of the three. This misses similar records with a small difference in all three fields. 56
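A minimal sketch of that three-table bucketing (the field names and record layout are hypothetical placeholders, not the actual consulting code):

```python
from collections import defaultdict
from itertools import combinations

def candidate_record_pairs(records):
    # records: list of dicts with 'name', 'address', 'phone' (illustrative fields).
    # Returns index pairs of records identical in at least one of the three fields.
    candidates = set()
    for field in ("name", "address", "phone"):
        buckets = defaultdict(list)
        for i, rec in enumerate(records):
            buckets[rec[field]].append(i)      # exact value of the field is the bucket key
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))
    return candidates
```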

Customer Records Aside. We were able to tell which values of the scoring function were reliable in an interesting way. Truly identical customers had records whose creation dates differed by 10 days on average. We only compared records created within 90 days of each other, so bogus matches had an average difference of about 45 days. 57

Aside (2). By looking at the pool of matches with a fixed score, we could compute the average time difference, say x. If a fraction f of those matches were valid, then 10f + 45(1 - f) = x, so f = (45 - x)/35 of them were valid matches. Alas, the lawyers didn't think the jury would understand. 58