CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Similar documents
CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CSE 5243 INTRO. TO DATA MINING

Finding Similar Sets

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

Topic: Duplicate Detection and Similarity Computing

Finding Similar Items:Nearest Neighbor Search

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Locality-Sensitive Hashing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Clustering. Distance Measures Hierarchical Clustering. k -Means Algorithms

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Data-Intensive Computing with MapReduce

CSE 5243 INTRO. TO DATA MINING

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Mining di Dati Web (Web Data Mining). Lecture 3 - Clustering and Classification

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

Locality-Sensitive Hashing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar

Theorem 2.9: nearest addition algorithm

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Data Sets Winter Final

CS 124/LINGUIST 180 From Languages to Information

6 Randomized rounding of semidefinite programs

Scalable Techniques for Similarity Search

NP Completeness. Andreas Klappenecker [partially based on slides by Jennifer Welch]

CS 124/LINGUIST 180 From Languages to Information

An Efficient Recommender System Using Locality Sensitive Hashing

Hashing and sketching

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Introduction to Data Mining

HMMT February 2018 February 10, 2018

CS 161 Problem Set 4

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

Finite Math - J-term Homework. Section Inverse of a Square Matrix

Machine Learning: Symbol-based

Similarity Estimation Techniques from Rounding Algorithms. Moses Charikar Princeton University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Lecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

CS 124/LINGUIST 180 From Languages to Information

Outline. CS38 Introduction to Algorithms. Approximation Algorithms. Optimization Problems. Set Cover. Set cover 5/29/2014. coping with intractibility

UNC Charlotte 2010 Comprehensive

Vocabulary: Looking For Pythagoras

MapReduce Locality Sensitive Hashing GPUs. NLP ML Web Andrew Rosenberg

Chapter 2 Basic Structure of High-Dimensional Spaces

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

Information Retrieval. Lecture 9 - Web search basics

Compact Data Representations and their Applications. Moses Charikar Princeton University

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

MATH 54 - LECTURE 10

Coding for Random Projects

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Stanford University CS261: Optimization Handout 1 Luca Trevisan January 4, 2011

Hierarchical Clustering 4/5/17

Image Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18.

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Document Clustering: Comparison of Similarity Measures

Traveling Salesperson Problem (TSP)

BBS654 Data Mining. Pinar Duygulu

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

US Patent 6,658,423. William Pugh

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

CPSC 340: Machine Learning and Data Mining

val(y, I) α (9.0.2) α (9.0.3)

STA 4273H: Statistical Machine Learning

Parallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 16 Treaps; Augmented BSTs

Big Data Analytics CSCI 4030

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

Today s lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

Vector Calculus: Understanding the Cross Product

Problem 1: Complexity of Update Rules for Logistic Regression

Locality- Sensitive Hashing Random Projections for NN Search

Exploiting Social Network Structure for Person-to-Person Sentiment Analysis (Supplementary Material to [WPLP14])

These are not polished as solutions, but ought to give a correct idea of solutions that work. Note that most problems have multiple good solutions.

Introduction to Data Mining

MC 302 GRAPH THEORY 10/1/13 Solutions to HW #2 50 points + 6 XC points

2.3 Algorithms Using Map-Reduce

1 Metric spaces. d(x, x) = 0 for all x M, d(x, y) = d(y, x) for all x, y M,

Finding Similar Items and Locality Sensitive Hashing. STA 325, Supplemental Material

2 Geometry Solutions

Polar Coordinates. 2, π and ( )

Exam Review Session. William Cohen

Part II. Graph Theory. Year

Mining Social Network Graphs

Catalan Numbers. Table 1: Balanced Parentheses

Transcription:

CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Task: Given a large number of documents (N in the millions or billions), find near-duplicates.
Problem: Too many documents to compare all pairs.
Solution: Hash documents so that similar documents hash into the same bucket. Documents in the same bucket are then candidate pairs, whose similarity is then evaluated.

The big picture:
Document → Shingling → Set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.

A k-shingle (or k-gram) is a sequence of k tokens that appears in the document. Example: k = 2, D1 = abcab; the set of 2-shingles is C1 = S(D1) = {ab, bc, ca}. Represent a doc by the set of hash values of its k-shingles. A natural similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|. The similarity of two documents is the Jaccard similarity of their shingle sets.
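As a minimal sketch of character-level shingling and Jaccard similarity (the helper names and the toy strings are ours, not from the slides):

```python
def shingles(doc: str, k: int = 2) -> set[str]:
    """Set of k-shingles (character k-grams) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 & C2| / |C1 | C2|."""
    return len(c1 & c2) / len(c1 | c2)

d1, d2 = "abcab", "abcac"
print(shingles(d1))                          # {'ab', 'bc', 'ca'} -- matches the slide
print(jaccard(shingles(d1), shingles(d2)))   # 0.75
```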

Min-Hashing: Convert large sets into short signatures, while preserving similarity: Pr[h(C1) = h(C2)] = sim(D1, D2).
[Worked example omitted: a random permutation π of the rows of the input matrix (shingles × documents) yields a signature matrix M; the column/column similarities (0.75, 0.75, 0, 0) and the signature/signature similarities (0.67, 1.00, 0, 0) approximately match.]
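A sketch of Min-Hash signatures, simulating row permutations with salted hash functions as is standard in practice (names are ours):

```python
import random

def minhash_signature(shingle_set: set[str], num_hashes: int = 100,
                      seed: int = 0) -> list[int]:
    """For each of num_hashes salted hash functions (a cheap stand-in for
    row permutations), keep the minimum hash value over the set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    # Note: Python's str hash is salted per process, so signatures are
    # only comparable within one run.
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

def signature_similarity(sig1: list[int], sig2: list[int]) -> float:
    """Fraction of agreeing positions -- an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
```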

Hash columns of the signature matrix M: similar columns likely hash to the same bucket. Divide matrix M into b bands of r rows each (so M has b·r rows). Candidate column pairs are those that hash to the same bucket for at least 1 band.
[Figure omitted: matrix M split into b bands of r rows, each band hashed into buckets; the resulting probability of sharing a bucket vs. similarity follows an S-curve around a threshold s.]
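A sketch of the bands technique, assuming signatures maps a document id to its length-b·r signature (the names and this in-memory representation are ours):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures: dict, b: int, r: int) -> set:
    """Split each signature into b bands of r rows; two items become a
    candidate pair if they collide in >= 1 band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # Using the band tuple itself as the key is equivalent to a
            # collision-free bucket hash.
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates
```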

The generalized picture:
Points → Hash func. → Signatures: short integer signatures that reflect point similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.
Two design steps: design a locality-sensitive hash function (for a given distance metric), then apply the bands technique.

The S-curve is where the magic happens.
Remember: probability of equal hash values = similarity. This is what one hash-code gives you: Pr[hπ(C1) = hπ(C2)] = sim(D1, D2), a straight line in the similarity t of two sets.
This is what we want: a step function at the threshold s — no chance of becoming a candidate if t < s, probability 1 if t > s.
How to get a step function? By choosing r and b!

Remember: b bands, r rows per band. Let sim(C1, C2) = s. What is the probability that at least 1 band is equal?
Pick some band (r rows). The probability that the elements in a single row of columns C1 and C2 are equal is s.
Probability that all rows in a band are equal: s^r.
Probability that some row in a band is not equal: 1 - s^r.
Probability that no band is equal: (1 - s^r)^b.
Probability that at least 1 band is equal: 1 - (1 - s^r)^b.
So P(C1, C2 is a candidate pair) = 1 - (1 - s^r)^b.
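The derivation above is directly computable; a small sketch (the function name is ours):

```python
def candidate_prob(s: float, r: int, b: int) -> float:
    """Probability that two columns with similarity s become a
    candidate pair under b bands of r rows: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b

# With r = 5, b = 10 (50 hash functions):
print(candidate_prob(0.8, 5, 10))  # ~0.981: similar pairs are almost surely caught
print(candidate_prob(0.3, 5, 10))  # ~0.024: dissimilar pairs rarely become candidates
```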

Picking r and b to get the best S-curve: 50 hash functions (r = 5, b = 10).
[Plot omitted: probability of sharing a bucket vs. similarity s.]

Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a step right around s.
[Plots omitted: prob = 1 - (1 - t^r)^b for r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; and r = 10 with b = 1..50.]
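A handy rule of thumb (stated in the Mining of Massive Datasets book, though not on this slide): the threshold, where the S-curve rises fastest, is approximately (1/b)^(1/r). A sketch:

```python
def approx_threshold(r: int, b: int) -> float:
    """Approximate similarity threshold of the S-curve: ~(1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)

print(approx_threshold(5, 10))   # ~0.63 for r = 5, b = 10
print(approx_threshold(10, 5))   # ~0.85 for r = 10, b = 5
```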

Recap of the pipeline: Min-Hashing → Signatures: short vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.

We have used LSH to find similar documents. More generally, we found similar columns in large sparse matrices with high Jaccard similarity. Can we use LSH for other distance measures, e.g., Euclidean distance or cosine distance? Let's generalize what we've learned!

d(·,·) is a distance measure if it is a function from pairs of points x, y to real numbers such that:
d(x, y) ≥ 0
d(x, y) = 0 iff x = y
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
Examples: Jaccard distance for sets = 1 minus Jaccard similarity. Cosine distance for vectors = the angle between the vectors. Euclidean distances: the L2 norm, d(x, y) = the square root of the sum of the squares of the differences between x and y in each dimension, is the most common notion of distance; the L1 norm is the sum of the absolute differences in each dimension (Manhattan distance: the distance if you travel along coordinates only).
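The four distances just listed, as a minimal sketch in pure Python (names are ours):

```python
import math

def jaccard_dist(c1: set, c2: set) -> float:
    """1 minus Jaccard similarity."""
    return 1.0 - len(c1 & c2) / len(c1 | c2)

def cosine_dist(x: list, y: list) -> float:
    """Angle between x and y in radians (range 0..pi)."""
    dot = sum(a * b for a, b in zip(x, y))
    cos = dot / (math.hypot(*x) * math.hypot(*y))
    return math.acos(max(-1.0, min(1.0, cos)))   # clamp against float error

def l2_dist(x: list, y: list) -> float:
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_dist(x: list, y: list) -> float:
    """Manhattan (L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))
```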

For Min-Hashing signatures, we got a Min-Hash function for each permutation of rows. A hash function here is any function that allows us to say whether two elements are "equal"; shorthand: h(x) = h(y) means "h says x and y are equal". A family of hash functions is any set of hash functions from which we can pick one at random efficiently. Example: the set of Min-Hash functions generated from permutations of rows.

Suppose we have a space S of points with a distance measure d(x, y) — a critical assumption. A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1.
2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2.
With an LS family we can do LSH!

[Plot omitted: Pr[h(x) = h(y)] vs. distance d(x, y). Small distance ⇒ high probability of hashing to the same value (at least p1 below d1); large distance ⇒ low probability (at most p2 beyond d2).]

Let S = the space of all sets, d = the Jaccard distance, and H = the family of Min-Hash functions for all permutations of rows. Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y). This simply restates the theorem about Min-Hashing in terms of distances rather than similarities.

Claim: the Min-Hash family H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d: if the distance is < 1/3 (so the similarity is > 2/3), then the probability that the Min-Hash values agree is > 2/3. More generally, for Jaccard similarity, Min-Hashing gives a (d1, d2, (1 - d1), (1 - d2))-sensitive family for any d1 < d2.

Can we reproduce the S-curve effect we saw before for any LS family? Yes: the bands technique we learned for signature matrices carries over to this more general setting, so we can do LSH with any (d1, d2, p1, p2)-sensitive family! Two constructions: the AND construction, like the rows in a band, and the OR construction, like having many bands.

AND construction: Given family H, construct family H′ consisting of r functions from H. For h = [h1, ..., hr] in H′, we say h(x) = h(y) if and only if h_i(x) = h_i(y) for all i, 1 ≤ i ≤ r. Note this corresponds to creating a band of size r.
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H′ is (d1, d2, (p1)^r, (p2)^r)-sensitive. Proof: use the fact that the h_i's are independent.
It lowers the probability for large distances (good), but also lowers the probability for small distances (bad).

"Independence" of hash functions (HFs) here really means that the probability of two HFs saying "yes" is the product of each saying "yes". Two particular hash functions could be highly correlated — for example, two Min-Hash permutations that agree in the first one million entries. However, the probabilities in the definition of an LS family are over all possible members of H and H′ (i.e., the average case, not the worst case).

OR construction: Given family H, construct family H′ consisting of b functions from H. For h = [h1, ..., hb] in H′, h(x) = h(y) if and only if h_i(x) = h_i(y) for at least 1 i.
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H′ is (d1, d2, 1 - (1 - p1)^b, 1 - (1 - p2)^b)-sensitive. Proof: use the fact that the h_i's are independent.
It raises the probability for small distances (good), but also raises the probability for large distances (bad).
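The two amplification maps are one-liners; a sketch of how they act on a family's (p1, p2) pair (function names are ours):

```python
def and_construction(p: float, r: int) -> float:
    """Probability that all r independent functions agree."""
    return p ** r

def or_construction(p: float, b: int) -> float:
    """Probability that at least 1 of b independent functions agrees."""
    return 1.0 - (1.0 - p) ** b

# Acting on a (d1, d2, 0.8, 0.2)-sensitive family with r = 4 / b = 4:
print(and_construction(0.8, 4), and_construction(0.2, 4))  # 0.4096, 0.0016
print(or_construction(0.8, 4), or_construction(0.2, 4))    # 0.9984, 0.5904
```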

AND makes all probabilities shrink, but by choosing r correctly we can make the lower probability approach 0 while the higher does not. OR makes all probabilities grow, but by choosing b correctly we can make the upper probability approach 1 while the lower does not.
[Plots omitted: probability of sharing a bucket vs. similarity of a pair of items, for AND with r = 1..10, b = 1, and for OR with r = 1, b = 1..10.]

By choosing b and r correctly, we can make the lower probability approach 0 while the higher approaches 1. As for the signature matrix, we can use the AND construction followed by the OR construction, or vice versa, or any alternating sequence of ANDs and ORs.

AND-OR composition: an r-way AND followed by a b-way OR construction — exactly what we did with Min-Hashing. AND: a band matches only if all r values hash to the same bucket. OR: columns that share ≥ 1 common bucket become candidates.
Take points x and y such that Pr[h(x) = h(y)] = s. A single h from H makes (x, y) a candidate pair with probability s; the construction makes (x, y) a candidate pair with probability 1 - (1 - s^r)^b — the S-curve!
Example: take H and construct H′ by the AND construction with r = 4. Then, from H′, construct H″ by the OR construction with b = 4.

s    p = 1 - (1 - s^4)^4
.2   .0064
.3   .0320
.4   .0985
.5   .2275
.6   .4260
.7   .6666
.8   .8785
.9   .9860
So r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.
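The table is easy to reproduce (a minimal sketch):

```python
def and_or(s: float, r: int = 4, b: int = 4) -> float:
    """r-way AND then b-way OR: s -> 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"{s:.1f}  {and_or(s):.4f}")   # .0064, .0320, ..., .8785, .9860
```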

Picking r and b to get desired performance: 50 hash functions (r = 5, b = 10), with a threshold s. [Plot omitted: P(candidate pair) vs. similarity s, with areas X and Y marked around the threshold.]
Blue area X, the false negative rate: pairs with sim > s, an X fraction of which won't share a band and thus will never become candidates. This means we will never consider these pairs for the (slow, exact) similarity calculation!
Green area Y, the false positive rate: pairs with sim < s that we will consider as candidates. This is not too bad: we will evaluate them with the (slow, exact) similarity computation and discard them.

Picking r and b to get desired performance: 50 hash functions (r · b = 50), with a threshold s. [Plot omitted: P(candidate pair) vs. similarity s for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5.]

OR-AND composition: apply a b-way OR construction followed by an r-way AND construction. This transforms similarity s (probability p) into (1 - (1 - s)^b)^r: the same S-curve, mirrored horizontally and vertically.
Example: take H and construct H′ by the OR construction with b = 4. Then, from H′, construct H″ by the AND construction with r = 4.

s    p = (1 - (1 - s)^4)^4
.1   .0140
.2   .1215
.3   .3334
.4   .5740
.5   .7725
.6   .9015
.7   .9680
.8   .9936
[Plot omitted: P(candidate pair) vs. similarity s.]
The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family.

Cascading constructions. Example: apply the (4, 4) OR-AND construction followed by the (4, 4) AND-OR construction. This transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family. Note this family uses 256 (= 4·4·4·4) of the original hash functions.
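A quick numeric check of that cascade (a sketch; the two maps compose as plain functions):

```python
def and_or(p: float, r: int = 4, b: int = 4) -> float:
    return 1.0 - (1.0 - p ** r) ** b        # r-way AND, then b-way OR

def or_and(p: float, b: int = 4, r: int = 4) -> float:
    return (1.0 - (1.0 - p) ** b) ** r      # b-way OR, then r-way AND

for p in (0.8, 0.2):                        # the (.2,.8,.8,.2)-sensitive family
    print(f"{p} -> {and_or(or_and(p)):.7f}")
# 0.8 -> ~0.9999996, 0.2 -> ~0.0008715
```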

For each AND-OR S-curve 1 - (1 - s^r)^b, there is a threshold t for which 1 - (1 - t^r)^b = t. Above t, high probabilities are increased; below t, low probabilities are decreased. You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t. Iterate as you like. A similar observation holds for the OR-AND type of S-curve, (1 - (1 - s)^b)^r.
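The fixed point t can be found numerically, e.g. by bisection (a sketch; for r = b = 4 it comes out around 0.72):

```python
def s_curve(t: float, r: int = 4, b: int = 4) -> float:
    return 1.0 - (1.0 - t ** r) ** b

def fixed_point(r: int = 4, b: int = 4, lo: float = 0.5, hi: float = 0.9) -> float:
    """Bisection for the t with s_curve(t) = t; assumes one crossing in [lo, hi]."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if s_curve(mid, r, b) < mid:
            lo = mid     # curve below the diagonal: the fixed point is to the right
        else:
            hi = mid
    return lo

print(fixed_point())     # ~0.7245 for the (4, 4) AND-OR curve
```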

[Plot omitted: P(candidate pair) vs. s, with the fixed-point threshold t marked. Above t the probability is raised; below t it is lowered.]

Summary: pick any two distances d1 < d2 and start with a (d1, d2, (1 - d1), (1 - d2))-sensitive family. Apply constructions to amplify it into a (d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0. The closer to 0 and 1 we want to get, the more hash functions must be used!

LSH methods for other distance metrics — cosine distance: random hyperplanes; Euclidean distance: project on lines.
The pipeline: Points → Hash func. (design a (d1, d2, p1, p2)-sensitive family of hash functions for that particular distance metric — this step depends on the distance function used) → Signatures: short integer signatures that reflect their similarity → Locality-Sensitive Hashing (amplify the family using AND and OR constructions) → Candidate pairs: those pairs of signatures that we need to test for similarity.

Two instantiations of Data → Hash func. → Signatures → Locality-Sensitive Hashing → Candidate pairs:
Documents: 0/1 shingle matrix → MinHash → short integer signatures → bands technique → candidate pairs.
Data points: 0/1 vectors → Random Hyperplanes → +1/-1 sketches → bands technique → candidate pairs.

Cosine distance = the angle between the vectors from the origin to the points in question: d(A, B) = θ = arccos(A·B / (‖A‖·‖B‖)). It has range 0..π (equivalently 0..180°); we can divide θ by π to have the distance in the range 0..1. Cosine similarity = 1 - d(A, B), but it is often defined as cos(θ) = A·B / (‖A‖·‖B‖), which has range -1..1 for general vectors and 0..1 for non-negative vectors (angles up to 90°).

For cosine distance, there is a technique called Random Hyperplanes — a technique similar to Min-Hashing. The Random Hyperplanes method is a (d1, d2, (1 - d1/π), (1 - d2/π))-sensitive family for any d1 and d2.
Reminder: (d1, d2, p1, p2)-sensitive means: 1. if d(x, y) < d1, then the probability that h(x) = h(y) is at least p1; 2. if d(x, y) > d2, then the probability that h(x) = h(y) is at most p2.

Each vector v determines a hash function h_v with two buckets: h_v(x) = +1 if v·x ≥ 0, and -1 if v·x < 0. The LS family H is the set of all functions derived from any vector v. Claim: for points x and y, Pr[h(x) = h(y)] = 1 - d(x, y)/π.

Look in the plane of x and y. [Figure omitted: two points x and y with angle θ between them. A hyperplane normal to a random vector v separates x from y (h(x) ≠ h(y)) exactly when it falls inside the angle θ; a hyperplane outside the angle gives h(x) = h(y). Note: what is important is that the hyperplane is outside the angle, not that the vector is inside.]

So: Prob[red case, i.e., h(x) ≠ h(y)] = θ/π, and therefore P[h(x) = h(y)] = 1 - θ/π = 1 - d(x, y)/π.

Pick some number of random vectors, and hash your data for each vector. The result is a signature (sketch) of +1s and -1s for each data point. It can be used for LSH like we used the Min-Hash signatures for Jaccard distance, and amplified using AND/OR constructions.
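A minimal sketch of random-hyperplane signatures (often called SimHash; names are ours):

```python
import math
import random

def hyperplane_signature(x: list[float], num_vectors: int = 64,
                         seed: int = 0) -> list[int]:
    """Sign of the dot product with each random vector: a +1/-1 sketch.
    Each component agrees across two points with prob. 1 - angle/pi."""
    rng = random.Random(seed)
    sig = []
    for _ in range(num_vectors):
        v = [rng.gauss(0.0, 1.0) for _ in x]        # random direction
        dot = sum(vi * xi for vi, xi in zip(v, x))
        sig.append(1 if dot >= 0 else -1)
    return sig

def estimated_angle(sig1: list[int], sig2: list[int]) -> float:
    """Estimate the angle (radians) from the fraction of disagreeing signs."""
    frac_diff = sum(a != b for a, b in zip(sig1, sig2)) / len(sig1)
    return frac_diff * math.pi
```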

It is expensive to pick a random vector in M dimensions for large M: we would have to generate M random numbers. A more efficient approach: it suffices to consider only vectors v consisting of +1 and -1 components. Why? Assuming the data is random, vectors of ±1 components cover the entire space evenly (and do not bias in any way).

LSH for Euclidean distance. Idea: hash functions correspond to lines. Partition the line into buckets of size a; hash each point to the bucket containing its projection onto the line. An element of the signature is a bucket id for that given projection line. Nearby points are always close; distant points are rarely in the same bucket.
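A sketch of one projection-line hash with bucket width a (names are ours; the random offset is a common variant, not stated on the slide):

```python
import math
import random

def line_hash(point: list[float], a: float, seed: int = 0) -> int:
    """Project onto a random unit line (with a random bucket offset) and
    return the id of the bucket of width a containing the projection."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in point]         # random direction
    norm = math.sqrt(sum(vi * vi for vi in v))
    proj = sum(vi * xi for vi, xi in zip(v, point)) / norm
    offset = rng.uniform(0.0, a)                     # random boundary shift
    return math.floor((proj + offset) / a)
```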

[Figure omitted: points projected onto a line divided into buckets of size a. Lucky case: points that are close hash into the same bucket, and distant points end up in different buckets. Two unlucky cases: unlucky quantization (close points straddle a bucket boundary) and unlucky projection (distant points project close together).]

[Figure omitted: two points at distance d projected onto a randomly chosen line with bucket width a.] If d << a, then the chance the points are in the same bucket is at least 1 - d/a.

[Figure omitted: two points at distance d whose projections onto the line differ by d·cos θ.] If d >> a, then θ must be close to 90° for there to be any chance the points go to the same bucket.

If points are at distance d < a/2, the probability that they are in the same bucket is ≥ 1 - d/a ≥ 1/2. If points are at distance d > 2a apart, then they can be in the same bucket only if d·cos θ ≤ a, i.e., cos θ ≤ 1/2, i.e., 60° ≤ θ ≤ 90°, which happens with at most 1/3 probability. This yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a. Amplify using AND-OR cascades.
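A quick Monte Carlo sanity check of those two bounds in 2D (a sketch; assumes a random line direction and a random bucket offset):

```python
import math
import random

def same_bucket_prob(d: float, a: float, trials: int = 200_000) -> float:
    """Estimate Pr[two points at distance d share a bucket of width a]."""
    rng = random.Random(0)
    hits = 0
    for _ in range(trials):
        theta = rng.uniform(0.0, 2 * math.pi)   # random line direction
        gap = abs(d * math.cos(theta))          # distance between the projections
        # With a random boundary offset, same bucket iff no boundary in the gap.
        if gap < a and rng.uniform(0.0, a) > gap:
            hits += 1
    return hits / trials

a = 1.0
print(same_bucket_prob(a / 2, a))   # ~0.68, consistent with >= 1/2
print(same_bucket_prob(2 * a, a))   # ~0.16, consistent with <= 1/3
```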

Summary pipeline: Data → Hash func. (design a (d1, d2, p1, p2)-sensitive family of hash functions for that particular distance metric) → Signatures: short integer signatures that reflect their similarity → Locality-Sensitive Hashing (amplify the family using AND and OR constructions) → Candidate pairs: those pairs of signatures that we need to test for similarity.
Documents: 0/1 shingle matrix → MinHash → short integer signatures → bands technique → candidate pairs.
Data points: 0/1 vectors → Random Hyperplanes → +1/-1 sketches → bands technique → candidate pairs.

The property Pr[h(C1) = h(C2)] = sim(C1, C2) of a hash function h is the essential part of LSH; without it we can't do anything. LS hash functions transform data to signatures so that the bands technique (AND and OR constructions) can then be applied.