Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Similar documents
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Data-Intensive Computing with MapReduce

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Topic: Duplicate Detection and Similarity Computing

Data-Intensive Distributed Computing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Finding Similar Items: Nearest Neighbor Search

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

CSE 5243 INTRO. TO DATA MINING

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Finding Similar Sets

US Patent 6,658,423. William Pugh

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Data-Intensive Distributed Computing

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Data-Intensive Computing with MapReduce

University of Maryland. Tuesday, March 2, 2010

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Locality-Sensitive Hashing

Data-Intensive Computing with MapReduce

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing

MapReduce Locality Sensitive Hashing GPUs. NLP ML Web Andrew Rosenberg

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)

B490 Mining the Big Data. 5. Models for Big Data

Scalable Techniques for Similarity Search

Graph Algorithms. Revised based on the slides by Ruoming Kent State

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

Mining di Dati Web. Lezione 3 - Clustering and Classification

VK Multimedia Information Systems

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval

Fast Indexing Strategies for Robust Image Hashes

Association Pattern Mining. Lijun Zhang

What is Unsupervised Learning?

Knowledge Discovery and Data Mining 1 (VO) ( )

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Textual Similarity Analysis Using Locality Sensitive Hashing

Information Retrieval. Lecture 9 - Web search basics

CPSC 340: Machine Learning and Data Mining

Today s lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Problem 1: Complexity of Update Rules for Logistic Regression

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

CS290N Summary Tao Yang

Nearest neighbors. Focus on tree-based methods. Clément Jamin, GUDHI project, Inria March 2017

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Overview of Clustering

Data-Intensive Distributed Computing

Leveraging Set Relations in Exact Set Similarity Join

Map-Reduce Applications: Counting, Graph Shortest Paths

Chapter 2 Basic Structure of High-Dimensional Spaces

General Instructions. Questions

Introduction to Information Retrieval

BIG DATA ANALYSIS A CASE STUDY. Du, Haoran Supervisor: Dr. Wang, Qing. COMP 8790: Software Engineering Project AUSTRALIAN NATIONAL UNIVERSITY

Clustering. (Part 2)

An efficient approach in Near Duplicate Document Detection using Shingling-MD5

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Shortest path problems

Mining Web Data. Lijun Zhang

Graphs (Part II) Shannon Quinn

Predicting Popular Xbox games based on Search Queries of Users

CS5112: Algorithms and Data Structures for Applications

University of Maryland. Tuesday, March 23, 2010

Finding dense clusters in web graph

Clustering. Distance Measures Hierarchical Clustering. k -Means Algorithms

Vector Space Models: Theory and Applications

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

Coding for Random Projects

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Sergei Silvestrov, Christopher Engström. January 29, 2013

PROBABILISTIC SIMHASH MATCHING. A Thesis SADHAN SOOD

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Evaluating Classifiers

Similarity Joins in MapReduce

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

Cake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity.

Mahout in Action MANNING ROBIN ANIL SEAN OWEN TED DUNNING ELLEN FRIEDMAN. Shelter Island

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Finding Clusters 1 / 60

Collection Building on the Web. Basic Algorithm

Map-Reduce Applications: Counting, Graph Shortest Paths

Notes. Some of these slides are based on a slide set provided by Ulf Leser. CS 640 Query Processing Winter / 30. Notes

Locality-Sensitive Hashing

Network Traffic Measurements and Analysis

Databases 2 (VU) ( / )

Information Integration of Partially Labeled Data

Metric Learning Applied for Automatic Large Image Classification

Machine Learning for NLP

K- Nearest Neighbors(KNN) And Predictive Accuracy

5.2 Surface Registration

Transcription:

Big Data Infrastructure, CS 489/698 (Winter 2016). Week 9: Data Mining (3/4). March 8, 2016. Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo. These slides are available at http://lintool.github.io/bigdata-2016w/. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Structure of the Course
- Analyzing Text
- Analyzing Graphs
- Analyzing Relational Data
- Data Mining
Core framework features and algorithm design

What's the Problem? Finding similar items with respect to some distance metric. Two variants of the problem:
- Offline: extract all similar pairs of objects from a large collection
- Online: is this object similar to something I've seen before?

Literature Note Many communities have tackled similar problems:
- Theoretical computer science
- Information retrieval
- Data mining
- Databases
Issues:
- Slightly different terminology
- Results not easy to compare

Four Steps
1. Specify distance metric: Jaccard, Euclidean, cosine, etc.
2. Compute representation: shingling, tf.idf, etc.
3. Project: minhash, random projections, etc.
4. Extract: bucketing, sliding windows, etc.

Distances Source: www.flickr.com/photos/thiagoalmeida/250190676/

Distance Metrics
1. Non-negativity: $d(x, y) \geq 0$
2. Identity: $d(x, y) = 0 \Leftrightarrow x = y$
3. Symmetry: $d(x, y) = d(y, x)$
4. Triangle inequality: $d(x, y) \leq d(x, z) + d(z, y)$

Distance: Jaccard Given two sets A, B:
Jaccard similarity: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Jaccard distance: $d(A, B) = 1 - J(A, B)$
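A minimal sketch in Python of the Jaccard computation above, using built-in set operations (the handling of two empty sets is a convention chosen here, not specified on the slide):

```python
def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|; 0.0 for two empty sets by convention."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1.0 - jaccard(a, b)

# Example: A = {e1, e3, e7}, B = {e3, e5, e7} -> |A n B| = 2, |A u B| = 4
print(jaccard({"e1", "e3", "e7"}, {"e3", "e5", "e7"}))  # 0.5
```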

Distance: Norms Given: $x = [x_1, x_2, \ldots, x_n]$, $y = [y_1, y_2, \ldots, y_n]$
Euclidean distance ($L_2$-norm): $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan distance ($L_1$-norm): $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
$L_r$-norm: $d(x, y) = \left[\sum_{i=1}^{n} |x_i - y_i|^r\right]^{1/r}$

Distance: Cosine Given: $x = [x_1, x_2, \ldots, x_n]$, $y = [y_1, y_2, \ldots, y_n]$
Idea: measure the angle between the vectors. Thus:
$\mathrm{sim}(x, y) = \cos\theta = \frac{x \cdot y}{|x|\,|y|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
$d(x, y) = 1 - \mathrm{sim}(x, y)$
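To make the norm and cosine formulas concrete, here is a small sketch using only the standard library; the function names are illustrative, and vectors are plain Python lists of equal length:

```python
import math

def euclidean(x, y):
    # L2-norm: square root of the sum of squared component differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # L1-norm: sum of absolute component differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def lr_norm(x, y, r):
    # General Lr-norm; r = 2 recovers euclidean, r = 1 manhattan
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1.0 / r)

def cosine_distance(x, y):
    # d(x, y) = 1 - sim(x, y), where sim is the cosine of the angle
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / (norm_x * norm_y)
```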

Distance: Hamming Given two bit vectors, the Hamming distance is the number of positions in which they differ.

Representations

Representations: Text
Unigrams (i.e., words)
Shingles = n-grams:
- At the word level
- At the character level
Feature weights:
- boolean
- tf.idf
- BM25
- ...

Representations: Beyond Text
For recommender systems:
- Items as features for users
- Users as features for items
For graphs:
- Adjacency lists as features for vertices
With log data:
- Behaviors (clicks) as features

Minhash Source: www.flickr.com/photos/rheinitz/6158837748/

Near-Duplicate Detection of Webpages
What's the source of the problem?
- Mirror pages (legit)
- Spam farms (non-legit)
- Additional complications (e.g., nav bars)
Naïve algorithm:
- Compute a cryptographic hash for each webpage (e.g., MD5)
- Insert hash values into a big hash table
- Compute the hash for a new webpage: collision implies duplicate
What's the issue?
Intuition:
- The hash function needs to be tolerant of minor differences
- High similarity should imply a higher probability of hash collision
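The naïve algorithm is a few lines in any language; a sketch in Python with hashlib (an in-memory set standing in for the "big hash table") makes the issue obvious, since a single changed byte produces an unrelated digest:

```python
import hashlib

seen = set()  # stands in for the "big hash table"

def is_duplicate(page_html: str) -> bool:
    digest = hashlib.md5(page_html.encode("utf-8")).hexdigest()
    if digest in seen:
        return True          # collision implies an (exact) duplicate
    seen.add(digest)
    return False

print(is_duplicate("<html>hello</html>"))   # False: first sighting
print(is_duplicate("<html>hello</html>"))   # True: exact duplicate
print(is_duplicate("<html>hello!</html>"))  # False, despite near-identity
```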

Minhash
Naïve approach: $N^2$ comparisons. Can we do better?
Minhash is the seminal algorithm for near-duplicate detection of webpages:
- Used by AltaVista
- For details, see Broder et al. (1997)
Setup:
- Documents (HTML pages) represented by shingles (n-grams)
- Jaccard similarity: duplicates are pairs with high similarity

Preliminaries: Representation
Sets:
- A = {e1, e3, e7}
- B = {e3, e5, e7}
Can be equivalently expressed as matrices:

Element  A  B
e1       1  0
e2       0  0
e3       1  1
e4       0  0
e5       0  1
e6       0  0
e7       1  1

Preliminaries: Jaccard

Element  A  B
e1       1  0
e2       0  0
e3       1  1
e4       0  0
e5       0  1
e6       0  0
e7       1  1

Let:
M00 = # rows where both elements are 0
M11 = # rows where both elements are 1
M01 = # rows where A = 0, B = 1
M10 = # rows where A = 1, B = 0

$J(A, B) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$

Minhash
Computing minhash:
- Start with the matrix representation of the set
- Randomly permute the rows of the matrix
- minhash is the first row with a one
Example (original matrix, then one random permutation of its rows):

Element  A  B        Element  A  B
e1       1  0        e6       0  0
e2       0  0        e2       0  0
e3       1  1        e5       0  1
e4       0  0        e3       1  1
e5       0  1        e7       1  1
e6       0  0        e4       0  0
e7       1  1        e1       1  0

Under this permutation: h(A) = e3, h(B) = e5
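A sketch of minhash by explicit permutation, mirroring the example above; the universe and sets match the tables, and random.shuffle supplies the permutation:

```python
import random

universe = ["e1", "e2", "e3", "e4", "e5", "e6", "e7"]
A = {"e1", "e3", "e7"}
B = {"e3", "e5", "e7"}

def minhash(s, permutation):
    # The first element of the permuted order present in s,
    # i.e., the first row with a one in that set's column.
    for e in permutation:
        if e in s:
            return e

perm = universe[:]
random.shuffle(perm)
print(perm, minhash(A, perm), minhash(B, perm))
```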

Minhash and Jaccard
Annotate each row of the permuted matrix with its type:

Element  A  B   Row type
e6       0  0   M00
e2       0  0   M00
e5       0  1   M01
e3       1  1   M11
e7       1  1   M11
e4       0  0   M00
e1       1  0   M10

Scanning from the top, M00 rows are skipped; the first row with a one is an M01, M10, or M11 row, and the minhashes collide exactly when it is an M11 row. Since each such row is equally likely to come first under a random permutation:

$P[h(A) = h(B)] = \frac{M_{11}}{M_{01} + M_{10} + M_{11}} = J(A, B)$

Woah!

To Permute or Not to Permute?
Permutations are expensive!
Instead, interpret the hash value as the permutation:
- Only need to keep track of the minimum hash value
- Can keep track of multiple minhash values at once
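A sketch of the permutation-free version: a seeded hash imposes an ordering on elements, and one pass per set tracks the minimum hash value under each seed. The helper names here are illustrative, not from the slides:

```python
import hashlib

def h(seed, element):
    # A seeded hash; its ordering over elements plays the role of a permutation
    return hashlib.sha1(f"{seed}:{element}".encode("utf-8")).hexdigest()

def minhash_values(s, num_hashes=4):
    # One minhash per seed: the minimum hash value over the set
    return [min(h(seed, e) for e in s) for seed in range(num_hashes)]

A = {"e1", "e3", "e7"}
B = {"e3", "e5", "e7"}
sig_a, sig_b = minhash_values(A), minhash_values(B)
print(sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a))  # estimates J(A, B)
```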

Extracting Similar Pairs
Task: discover all pairs with similarity greater than s
Naïve approach: $N^2$ comparisons. Can we do better?
Tradeoffs:
- False positives: discovered pairs that have similarity less than s
- False negatives: pairs with similarity greater than s that are not discovered
The errors (and costs) are asymmetric!

Extracting Similar Pairs (LSH)
We know: $P[h(A) = h(B)] = J(A, B)$
Task: discover all pairs with similarity greater than s
Algorithm:
- For each object, compute its minhash value
- Group objects by their hash values
- Output all pairs within each group
Analysis: if J(A, B) = s, then the probability we detect the pair is s

[Figure: probability of detection (y-axis) vs. Jaccard similarity (x-axis); with a single minhash, the curve is the diagonal P(detect) = s]

[Figure: probability of detection vs. Jaccard similarity, with threshold = 0.8 marked; the false-positive and false-negative regions are labeled. What's the issue?]

2 Minhash Signatures
Task: discover all pairs with similarity greater than s
Algorithm:
- For each object, compute 2 minhash values and concatenate them into a signature
- Group objects by their signatures
- Output all pairs within each group
Analysis: if J(A, B) = s, then the probability we detect the pair is $s^2$

3 Minhash Signatures
Task: discover all pairs with similarity greater than s
Algorithm:
- For each object, compute 3 minhash values and concatenate them into a signature
- Group objects by their signatures
- Output all pairs within each group
Analysis: if J(A, B) = s, then the probability we detect the pair is $s^3$

k Minhash Signatures
Task: discover all pairs with similarity greater than s
Algorithm:
- For each object, compute k minhash values and concatenate them into a signature
- Group objects by their signatures
- Output all pairs within each group
Analysis: if J(A, B) = s, then the probability we detect the pair is $s^k$

[Figure: probability of detection vs. Jaccard similarity for k minhash signatures concatenated together, threshold = 0.8; false positives shrink but false negatives grow. What's the issue now?]

n different k Minhash Signatures
Task: discover all pairs with similarity greater than s
Algorithm:
- For each object, compute n sets of k minhash values
- For each set, concatenate the k minhash values into a signature
- Within each set: group objects by signature, output all pairs within each group
- De-dup pairs
Analysis:
- If J(A, B) = s, P(none of the n signatures collide) = $(1 - s^k)^n$
- If J(A, B) = s, then the probability we detect the pair is $1 - (1 - s^k)^n$
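A single-machine sketch of this scheme (the distributed version appears later). The input format is a hypothetical dict of object id to precomputed minhash values, at least n*k of them per object:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(minhashes, n, k):
    """minhashes: dict mapping object id -> list of at least n*k minhash values."""
    pairs = set()
    for p in range(n):
        buckets = defaultdict(list)
        for obj_id, values in minhashes.items():
            signature = tuple(values[p * k : (p + 1) * k])  # k values -> one signature
            buckets[signature].append(obj_id)
        for ids in buckets.values():
            for pair in combinations(sorted(ids), 2):
                pairs.add(pair)  # the set de-dups pairs found in multiple groups
    return pairs
```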

[Figure: probability of detection vs. Jaccard similarity for k minhash signatures concatenated together, threshold = 0.8; false negatives and false positives labeled]

[Figure: probability of detection vs. Jaccard similarity for 6 minhash signatures concatenated together, threshold = 0.8; false negatives and false positives labeled]

[Figure: probability of detection vs. Jaccard similarity for n different sets of 6 minhash signatures, threshold = 0.8; false negatives and false positives labeled]

n different k Minhash Signatures
Example: J(A, B) = 0.8, 10 sets of 6 minhash signatures:
- P(6 minhash signatures match) = $(0.8)^6 = 0.262$
- P(signature doesn't match in any of the 10 sets) = $(1 - (0.8)^6)^{10} = 0.0478$
- Thus, we should find $1 - (1 - (0.8)^6)^{10} = 0.952$ of all such pairs
Example: J(A, B) = 0.4, 10 sets of 6 minhash signatures:
- P(6 minhash signatures match) = $(0.4)^6 = 0.0041$
- P(signature doesn't match in any of the 10 sets) = $(1 - (0.4)^6)^{10} = 0.9598$
- Thus, we should find $1 - (1 - (0.4)^6)^{10} = 0.040$ of all such pairs

n different k Minhash Signatures

s     $1 - (1 - s^6)^{10}$
0.2   0.0006
0.3   0.0073
0.4   0.040
0.5   0.146
0.6   0.380
0.7   0.714
0.8   0.952
0.9   0.999

What's the issue?
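The table is easy to regenerate; a two-line check of the numbers above:

```python
for s in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    print(f"{s:.1f}  {1 - (1 - s ** 6) ** 10:.4f}")
```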

Practical Notes
Common implementation:
- Generate M minhash values, select k of them n times
- Reduces the number of hash computations needed
Determining the authoritative version among duplicates is non-trivial

MapReduce/Spark Implementation
Map over objects:
- Generate M minhash values, select k of them n times
- Each draw yields a signature; emit: key = (p, signature), where p ∈ [1…n]; value = object id
Shuffle/sort
Reduce:
- Receive all object ids with the same (p, signature); emit clusters
Second pass to de-dup and group clusters
(Optional) Third pass to eliminate false positives
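A minimal PySpark sketch of this pattern; compute_minhashes is a hypothetical helper standing in for "generate M minhash values," and n, k, and the objects collection of (id, features) pairs are assumed to be defined elsewhere:

```python
from pyspark import SparkContext

sc = SparkContext(appName="minhash-lsh")

def emit_signatures(record):
    obj_id, features = record
    values = compute_minhashes(features)            # hypothetical helper
    for p in range(n):                              # n, k assumed defined
        signature = tuple(values[p * k : (p + 1) * k])
        yield ((p, signature), obj_id)              # key = (p, signature)

clusters = (sc.parallelize(objects)                 # objects: (id, features) pairs
              .flatMap(emit_signatures)             # map phase
              .groupByKey()                         # shuffle/sort
              .mapValues(list)                      # reduce: ids sharing a signature
              .filter(lambda kv: len(kv[1]) > 1))   # keep non-trivial clusters
# A second pass de-dups pairs; an optional third pass removes false positives.
```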

Offline Extraction vs. Online Querying
Batch formulation of the problem:
- Discover all pairs with similarity greater than s
- Useful for post-hoc batch processing of a web crawl
Online formulation of the problem:
- Given a new webpage, is it similar to one I've seen before?
- Useful for incremental web crawl processing

Online Similarity Querying
Preparing the existing collection:
- For each object, compute n sets of k minhash values
- For each set, concatenate the k minhash values into a signature
- Keep each signature in a hash table (in memory)
- Note: can parallelize across multiple machines
Querying and updating:
- For a new webpage, compute signatures and check for collisions
- Collisions imply duplicate (determine which version to keep)
- Update the hash tables
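A sketch of the querying-and-updating loop, with one in-memory dict per signature set (the value of n here is illustrative):

```python
n = 10                                 # number of signature sets, illustrative
tables = [dict() for _ in range(n)]    # one hash table per set

def query_and_update(obj_id, signatures):
    """signatures: n tuples of k minhash values for the new webpage."""
    dup_of = None
    for table, sig in zip(tables, signatures):
        if sig in table:
            dup_of = table[sig]        # collision implies a (near-)duplicate
        table[sig] = obj_id            # update the table as we go
    return dup_of
```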

Random Projections Source: www.flickr.com/photos/roj/4179478228/

Limitations of Minhash
Minhash is great for near-duplicate detection:
- Set a high threshold for Jaccard similarity
Limitations:
- Jaccard similarity only
- Set-based representation: no way to assign weights to features
Random projections:
- Work with arbitrary vectors using cosine similarity
- Same basic idea, but details differ
- Slower but more accurate: no free lunch!

Random Projection Hashing
Generate a random vector r of unit length:
- Draw each component from a univariate Gaussian
- Normalize to unit length
Define:
$h_r(u) = \begin{cases} 1 & \text{if } r \cdot u \geq 0 \\ 0 & \text{if } r \cdot u < 0 \end{cases}$
Physical intuition? The bit records which side of the hyperplane normal to r the vector u falls on.
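A sketch of this hash family, assuming numpy is available; each component of r is a Gaussian draw, and r is then scaled to unit length:

```python
import numpy as np

def random_unit_vector(dim, rng):
    r = rng.standard_normal(dim)   # one univariate Gaussian draw per component
    return r / np.linalg.norm(r)   # normalize to unit length

def h(r, u):
    # Which side of the hyperplane normal to r does u fall on?
    return 1 if np.dot(r, u) >= 0 else 0

rng = np.random.default_rng(42)
r = random_unit_vector(5, rng)
print(h(r, np.ones(5)), h(r, -np.ones(5)))  # opposite vectors land on opposite sides
```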

RP Hash Collisions
It can be shown that:
$P[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi}$
- Proof in Goemans and Williamson (1995)
Thus:
$\cos(\theta(u, v)) = \cos\big((1 - P[h_r(u) = h_r(v)])\,\pi\big)$
Physical intuition? u and v hash differently exactly when the random hyperplane separates them, which happens with probability $\theta(u, v)/\pi$.

Random Projection Signature
Given D random vectors $[r_1, r_2, r_3, \ldots, r_D]$, convert each object into a D-bit signature:
$u \rightarrow [h_{r_1}(u), h_{r_2}(u), h_{r_3}(u), \ldots, h_{r_D}(u)]$
- Since $\cos(\theta(u, v)) = \cos\big((1 - P[h_r(u) = h_r(v)])\,\pi\big)$
- We can derive: $\cos(\theta(u, v)) \approx \cos\left(\frac{\mathrm{hamming}(s_u, s_v)}{D}\,\pi\right)$
Thus: similarity boils down to comparing hamming distances between signatures
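A sketch of D-bit signatures and the hamming-to-cosine estimate, again assuming numpy; with D = 1024 the estimate tracks the true cosine closely:

```python
import numpy as np

def signature(u, R):
    """R: D x dim matrix whose rows are random unit vectors."""
    return (R @ u >= 0).astype(np.uint8)

def estimated_cosine(s_u, s_v):
    hamming = np.count_nonzero(s_u != s_v)
    return np.cos(np.pi * hamming / len(s_u))

rng = np.random.default_rng(0)
D, dim = 1024, 20
R = rng.standard_normal((D, dim))
R /= np.linalg.norm(R, axis=1, keepdims=True)   # unit-length rows
u, v = rng.standard_normal(dim), rng.standard_normal(dim)
true_cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(true_cos, estimated_cosine(signature(u, R), signature(v, R)))
```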

One-RP Signature
Task: discover all pairs with cosine similarity greater than s
Algorithm:
- Compute a D-bit RP signature for every object
- Take the first bit; bucket objects into two sets
- Perform brute-force pairwise (hamming distance) comparison in each bucket, retaining pairs below a hamming distance threshold
Analysis:
- Probability we discover a given similar pair: $1 - \frac{\cos^{-1}(s)}{\pi}$
- Efficiency: $N^2$ vs. $2\left(\frac{N}{2}\right)^2$*
* Note, this is actually a simplification: see Ture et al. (SIGIR 2011) for details.

Two-RP Signature
Task: discover all pairs with cosine similarity greater than s
Algorithm:
- Compute a D-bit RP signature for every object
- Take the first two bits; bucket objects into four sets
- Perform brute-force pairwise (hamming distance) comparison in each bucket, retaining pairs below a hamming distance threshold
Analysis:
- Probability we discover a given similar pair: $\left[1 - \frac{\cos^{-1}(s)}{\pi}\right]^2$
- Efficiency: $N^2$ vs. $4\left(\frac{N}{4}\right)^2$

k-rp Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first k bits, bucket objects into 2 k sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: apple cos 1 k (s) 1 l Efficiency: 2 N N 2 vs. 2 k 2 k

m Sets of k-rp Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Choose m sets of k bits l For each set, use k selected bits to partition objects into 2 k sets l Perform brute force pairwise (hamming distance) comparison in each bucket (of each set), retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: " apple cos 1 (s) 1 1 1 l Efficiency: k # m N 2 vs. m 2 k N 2 k 2

MapReduce/Spark Implementation
Map over objects:
- Compute a D-bit RP signature for every object
- Choose m sets of k bits and use them to bucket; for each, emit: key = (p, k bits), where p ∈ [1…m]; value = (object id, rest of signature bits)
Shuffle/sort
Reduce:
- Receive all objects with the same (p, k bits)
- Perform brute-force pairwise (hamming distance) comparison for each key, retaining pairs below the hamming distance threshold
Second pass to de-dup and group clusters
(Optional) Third pass to eliminate false positives

Online Querying
Preprocessing:
- Compute a D-bit RP signature for every object
- Choose m sets of k bits and use them to bucket
- Store signatures in memory (across multiple machines)
Querying:
- Compute the D-bit signature of the query object; choose the m sets of k bits in the same way
- Perform a brute-force scan of the matching bucket (in parallel)

Additional Issues to Consider
Emphasis on recall, not precision
Two sources of error:
- From LSH
- From using hamming distance as a proxy for cosine similarity
Load imbalance
Parameter tuning

Sliding Window Algorithm
- Compute a D-bit RP signature for every object
- For each object, permute the bit signature m times
- For each permutation, sort the bit signatures
- Apply a sliding window of width B over the sorted order
- Compute hamming distances between bit signatures within the window

MapReduce Implementation
Mapper:
- Process each individual object in parallel
- Load the random vectors as side data
- Compute the bit signature
- Permute m times; for each, emit: key = (p, signature), where p ∈ [1…m]; value = object id
Reducer:
- Keep a FIFO queue of B bit signatures
- For each newly encountered bit signature, compute the hamming distance w.r.t. all bit signatures in the queue
- Add the new bit signature to the end of the queue, displacing the oldest (see the sketch below)
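A sketch of the reducer's FIFO queue, assuming signatures arrive in sorted order; deque(maxlen=B) displaces the oldest entry automatically, and the hamming() helper is illustrative:

```python
from collections import deque

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def sliding_window(sorted_signatures, B, max_dist):
    """sorted_signatures: iterable of (object id, bit signature), already sorted."""
    window = deque(maxlen=B)
    for sig_id, sig in sorted_signatures:
        for other_id, other_sig in window:
            if hamming(sig, other_sig) <= max_dist:
                yield (other_id, sig_id)       # a near-duplicate candidate pair
        window.append((sig_id, sig))           # oldest falls off when full
```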

Four Steps to Finding Similar Items
1. Specify distance metric: Jaccard, Euclidean, cosine, etc.
2. Compute representation: shingling, tf.idf, etc.
3. Project: minhash, random projections, etc.
4. Extract: bucketing, sliding windows, etc.

Questions? Source: Wikipedia (Japanese rock garden)