Data-Intensive Computing with MapReduce

Size: px
Start display at page:

Download "Data-Intensive Computing with MapReduce"

Transcription

1 Data-Intensive Computing with MapReduce Session 6: Similar Item Detection Jimmy Lin University of Maryland Thursday, February 28, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See for details

2 Developing big data intuitions Source: Wikipedia (Mahout)

3 Today s Agenda The problem of similar item detection Distances and representations Minhash Random projections End-to-end Wikipedia application

4 What s the Problem? Finding similar items with respect to some distance metric Two variants of the problem: l Offline: extract all similar pairs of objects from a large collection l Online: is this object similar to something I ve seen before? Similarities/differences with the clustering problem

5 Applications Near-duplicate detection of webpages l Offline vs. online variant of problem Recommender systems Forensic text analysis (e.g., plagiarism detection)

6 Near-Duplicate Detection of Webpages What s the source of the problem? l Mirror pages (legit) l Spam farms (non-legit) l Additional complications (e.g., nav bars) Naïve algorithm: l Compute cryptographic hash for webpage (e.g., MD5) l Insert hash values into a big hash table l Compute hash for new webpage: collision implies duplicate What s the issue? Intuition: l Hash function needs to be tolerant of minor differences l High similarity implies higher probability of hash collision

7 Literature Note Many communities have tackled similar problems: l Theoretical computer science l Information retrieval l Data mining l Databases l Issues l Slightly different terminology l Results not easy to compare

8 Four Steps Specify distance metric l Jaccard, Euclidean, cosine, etc. Compute representation l Shingling, tf.idf, etc. Project l Minhash, random projections, etc. Extract l Bucketing, sliding windows, etc.

9 Distances Source:

10 Distance Metrics 1. Non-negativity: 2. Identity: d(x, y) 0 3. Symmetry: d(x, y) = 0 () x=y d(x, y) = d(y, x) 4. Triangle Inequality d(x, y) apple d(x, z) + d(z, y)

11 Distance: Jaccard Given two sets A, B Jaccard similarity: J(A, B) = A \ B A [ B d(a, B) =1 J(A, B)

12 Distance: Norms Given: x=[x 1,x 2,...x n ] y=[y 1,y 2,...y n ] Euclidean distance (L 2 -norm) d(x, y) = Manhattan distance (L 1 -norm) d(x, y) = L r -norm d(x, y) = v ux t n (x i y i ) 2 i=0 nx x i y i i=0 " X n # 1/r x i y i r i=0

13 Distance: Cosine Given: x=[x 1,x 2,...x n ] y=[y 1,y 2,...y n ] Idea: measure distance between the vectors Thus: cos = x y x y sim(x, y) = pp n i=0 x2 i d(x, y) = 1 sim(x, y) P n i=0 x iy i pp n i=0 y2 i

14 Distance: Hamming Given two bit vectors Hamming distance: number of elements which differ

15 Representations

16 Representations: Text Unigrams (i.e., words) Shingles = n-grams l At the word level l At the character level Feature weights l boolean l tf.idf l BM25 l Which representation to use? What s the actual text? Use the hashing trick!

17 Representations: Beyond Text For recommender systems: l Items as features for users l Users as features for items For graphs: l Adjacency lists as features for vertices With log data: l Behaviors (clicks) as features

18 Minhash Source:

19 Minhash Seminal algorithm for near-duplicate detection of webpages l Used by AltaVista l For details see Broder et al. (1997) Setup: l Documents (HTML pages) represented by shingles (n-grams) l Jaccard similarity: dups are pairs with high similarity

20 Preliminaries: Representation Sets: l A = {e 1, e 3, e 7 } l B = {e 3, e 5, e 7 } Can be equivalently expressed as matrices: Element A B e e e e e e e 7 1 1

21 Preliminaries: Jaccard Element A B e e e e e e e Let: M 00 = # rows where both elements are 0 M 11 = # rows where both elements are 1 M 01 = # rows where A=0, B=1 M 10 = # rows where A=1, B=0 J(A, B) = M 11 M 01 + M 10 + M 11

22 Minhash Computing minhash l Start with the matrix representation of the set l Randomly permute the rows of the matrix l minhash is the first row with a one Example: Element A B e e e e e e e h(a) = e 3 h(b) = e 5 Element A B e e e e e e e 1 1 0

23 Minhash and Jaccard Element A B e e e e e e e M 00 M 00 M 01 M 11 M 11 M 00 M 10 P [h(a) =h(b)] = J(A, B) M 11 M 01 + M 10 + M 11 M 11 M 01 + M 10 + M 11

24 To Permute or Not to Permute? Permutations are expensive Interpret the hash value as the permutation Only need to keep track of the minimum hash value l Can keep track of multiple minhash values at once

25 Extracting Similar Pairs (LSH) We know: P [h(a) =h(b)] = J(A, B) Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute its minhash value l Group objects by their hash values l Output all pairs within each group Analysis: l Probability we will discovered all pairs is s l Probability that any pair is invalid is (1 s) What s the fundamental issue?

26 Two Minhash Signatures Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute two minhash values and concatenate together into a signature l Group objects by their signatures l Output all pairs within each group Analysis: l Probability we will discovered all pairs is s 2 l Probability that any pair is invalid is (1 s) 2

27 k Minhash Signatures Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute k minhash values and concatenate together into a signature l Group objects by their signatures l Output all pairs within each group Analysis: l Probability we will discovered all pairs is s k l Probability that any pair is invalid is (1 s) k What s the issue now?

28 n different k Minhash Signatures Task: discover all pairs with similarity greater than s Algorithm: l For each object, compute n sets k minhash values l For each set, concatenate k minhash values together l Within each set: Group objects by their signatures Output all pairs within each group l De-dup pairs Analysis: l Probability we will miss a pair is (1 s k ) n l Probability that any pair is invalid is n(1 s) k

29 Practical Notes In some cases, checking all candidate pairs may be possible l Time cost is small relative to everything else l Easy method to discard false positives Most common practical implementation: l Generate M minhash values, randomly select k of them n times l Reduces amount of hash computations needed Determining authoritative version is non-trivial

30 MapReduce Implementation Map over objects: l Generate M minhash values, randomly select k of them n times l Each draw yields a signature: emit as intermediate key, value is object id Shuffle/sort: Reduce: l Receive all object ids with same signature, generate candidate pairs l Okay to buffer in memory? Second pass to de-dup

31 Offline Extraction vs. Online Querying Batch formulation of the problem: l Discover all pairs with similarity greater than s l Useful for post-hoc batch processing of web crawl Online formulation of the problem: l Given new webpage, is it similar to one I ve seen before? l Useful for incremental web crawl processing

32 Online Similarity Querying Preparing the existing collection: l For each object, compute n sets of k minhash values l For each set, concatenate k minhash values together l Keep each signature in hash table (in memory) l Note: can parallelize across multiple machines Querying and updating: l For new webpage, compute signatures and check for collisions l Collisions imply duplicate (determine which version to keep) l Update hash tables

33 Random Projections Source:

34 Limitations of Minhash Minhash is great for near-duplicate detection l Set high threshold for Jaccard similarity Limitations: l Jaccard similarity only l Set-based representation, no way to assign weights to features Random projections: l Works with arbitrary vectors using cosine similarity l Same basic idea, but details differ l Slower but more accurate: no free lunch!

35 Random Projection Hashing Generate a random vector r of unit length l Draw from univariate Gaussian for each component l Normalize length Define: h r (u) = 1 if r u 0 0 if r u < 0 Physical intuition?

36 RP Hash Collisions It can be shown that: P [h r (u) = h r (v)] = 1 (u, v) l Proof in (Goemans and Williamson, 1995) Physical intuition?

37 Random Projection Signature Given D random vectors: [r 1, r 2, r 3,...r D ] Convert each object into a D bit signature u! [h r1 (u),h r2 (u),h r3 (u),...h rd (u)] l We can derive: h sim(u, v) = cos D i hamming(s u, s v ) Thus: similarity boils down to comparison of hamming distances between signatures

38 One-RP Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first bit, bucket objects into two sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: cos 1 (s) 1 l Efficiency: 2 N N 2 vs. 2 2 * * Note, this is actually a simplification: see Ture et al. (SIGIR 2011) for details.

39 Two-RP Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first two bits, bucket objects into four sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: apple cos 1 2 (s) 1 l Efficiency: N 2 vs. 4 N 4 2

40 k-rp Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Take first k bits, bucket objects into 2 k sets l Perform brute force pairwise (hamming distance) comparison in each bucket, retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: apple cos 1 k (s) 1 l Efficiency: 2 N N 2 vs. 2 k 2 k

41 m Sets of k-rp Signature Task: discover all pairs with cosine similarity greater than s Algorithm: l Compute D-bit RP signature for every object l Choose m sets of k bits l For each set, use k selected bits to partition objects into 2 k sets l Perform brute force pairwise (hamming distance) comparison in each bucket (of each set), retain those below hamming distance threshold Analysis: l Probability we will discover all pairs: " apple cos 1 (s) l Efficiency: k # m N 2 vs. m 2 k N 2 k 2

42 Theoretical Results Threshold = 0.7 What s interesting here? Recall Precision (theoretical) 1 set 2 sets 3 sets 4 sets 0.7 Precision Recall Number of random projections Number of Random Projections

43 Online Querying Preprocessing: l Compute D-bit RP signature for every object l Choose m sets of k bits and use to bucket l Store signatures in memory (across multiple machines) Querying l Compute D-bit signature of query object, choose m sets of k bits in same way l Perform brute-force scan of correct bucket (in parallel)

44 Additional Issues to Consider Emphasis on recall, not precision Two sources of error: l From LSH l From using hamming distance as proxy for cosine similarity Balance of bucket sizes? Parameter tuning?

45 Sliding Window Algorithm Compute D-bit RP signature for every object For each object, permute bit signature m times For each permutation, sort bit signatures l Apply sliding window of width B over sorted l Compute hamming distances of bit signatures within window

46 MapReduce Implementation Mapper: l Process each individual object in parallel l Load in random vectors as side data l Compute bit signature l Permute m times, for each emit: key = (n, signature), where n = [ 1 m ] value = object id Reduce l Keep FIFO queue of B bit signatures l For each newly-encountered bit signature, compute hamming distance wrt all bit signatures in queue l Add new bit signature to end of queue, displacing oldest

47 Slightly Faster MapReduce Variant In reducer, don t perform hamming distance computations l Write directly to HDFS in large blocks Perform hamming distance computations in second mapperonly pass l Advantages?

48 Case Study: Wikipedia Ture et al., SIGIR 2011 Ture and Lin, NAACL 2012 Source: Wikipedia (Wikipedia)

49 Mining Bitext from Wikipedia Task: extract similar article pairs in different languages l Why is this useful? l Why not just use inter-wiki links? Use LSH! l Why is this non-trivial?

50 German Doc A MT translate English doc vector v A Machine Translation English Doc B doc vector v B Tradeoffs? German Doc A doc vector v A CLIR translate doc vector v A Vector Translation English Doc B doc vector v B tf 0 (e, D) = X f2f df 0 (e) = X f2f p(f e) tf(f,d) p(f e) df(f)

51 Translations are noisy! Question: To what extent are ground truth mutual translation pairs similar?

52 Architecture German articles English articles Preprocess document vectors (English) Signature generation Signatures LSH Similar article pairs

53 Evaluation Setup Collection: 3.44m English m German Wikipedia l Brute force: 5.05 trillion comparisons Task: extract similar documents with cosine similarity > 0.3 Representation: 1000 bit random projections Ground truth: l Sample ~1000 German Wikipedia articles Evaluation metrics: l Time l Relative cost l Recall

54 Results (16 node cluster) tables 200 tables 300 tables 64 million document pairs Time (hours) Window Size (B)

55 Sources of Error Hamming distance computation introduces error! l 1000-bit random projections yields ~0.04 absolute error l Maximum obtainable recall is 0.76 Why not just compute cosine similarity? l 20 times slower than computing hamming distance Let s renormalize with respect to hamming distance upper bound

56 No Free Lunch! 100 Relative Effectivenes (%) 95 95% recall 39% cost 99% recall 70% cost Relative Cost (%) 500 tables

57 No Free Lunch! % recall? Don t use LSH! Relative Effectivenes (%) 95 95% recall 39% cost 99% recall 62% cost 99% recall 70% cost Relative Cost (%) 500 tables 1500 tables

58 Parallel Sentence Extraction Model parallel sentence extraction as a classification problem l Given a candidate pair, classify as parallel or not parallel What s the problem? l 64m document pairs yields 400b raw sentence pairs What s that quote about premature optimization?

59 Two-Step Classification Approach: l Apply a few simple heuristics l Simple classifier: efficiently filters out irrelevant pairs l Complex classifier: effectively classifies remaining pairs

60 Join Algorithm 1 l Mapper processes both English and foreign collection Load all (d e, d f ) similarity pairs in memory as side data When encountering relevant document, emit (d e, d f ) pair as key, list of sentence vectors as value l Reducer: Gather sentence vectors, compute Cartesian product and apply classifier Algorithm 2 l Same idea as first, except with stripes pattern Which is faster?

61 Complete Pipeline source collection F target collection E Preprocess Preprocess doc vectors F doc vectors E Signature Generation Signature Generation signatures E Phase 1 Sliding Window Algorithm cross-lingual document pairs aligned bilingual sentence pairs 2-step Parallel Sentence Classifier candidate sentence pairs Candidate Generation Phase 2

62 Statistical Machine Translation Training Data i saw the small table vi la mesa pequeña Parallel Sentences Word Alignment Phrase Extraction (vi, i saw) (la mesa pequeña, the small table) he sat at the table the service was good Target-Language Text Language Model Translation Model Decoder maria no daba una bofetada a la bruja verde Foreign Input Sentence mary did not slap the green witch English Output Sentence ê 1 I = argmax e 1 I!P(e " I 1 f J 1 )# $ = argmax e 1 I!P(e " I 1 )P( f J 1 e I 1 )# $

63 Evaluation

64 Today s Agenda The problem of similar item detection Distances and representations Minhash Random projections End-to-end Wikipedia application

65 Questions? Source: Wikipedia (Japanese rock garden)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 9: Data Mining (3/4) March 8, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (3/4) March 7, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (4/4) March 9, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

Finding Similar Items:Nearest Neighbor Search

Finding Similar Items:Nearest Neighbor Search Finding Similar Items:Nearest Neighbor Search Barna Saha February 23, 2016 Finding Similar Items A fundamental data mining task Finding Similar Items A fundamental data mining task May want to find whether

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 45/65 43/63 (Winter 08) Part 3: Analyzing Text (/) January 30, 08 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu /2/8 Jure Leskovec, Stanford CS246: Mining Massive Datasets 2 Task: Given a large number (N in the millions or

More information

Topic: Duplicate Detection and Similarity Computing

Topic: Duplicate Detection and Similarity Computing Table of Content Topic: Duplicate Detection and Similarity Computing Motivation Shingling for duplicate comparison Minhashing LSH UCSB 290N, 2013 Tao Yang Some of slides are from text book [CMS] and Rajaraman/Ullman

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions

More information

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 53 INTRO. TO DATA MINING Locality Sensitive Hashing (LSH) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan Parthasarathy @OSU MMDS Secs. 3.-3.. Slides

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 07) Week 4: Analyzing Text (/) January 6, 07 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

US Patent 6,658,423. William Pugh

US Patent 6,658,423. William Pugh US Patent 6,658,423 William Pugh Detecting duplicate and near - duplicate files Worked on this problem at Google in summer of 2000 I have no information whether this is currently being used I know that

More information

Finding Similar Sets

Finding Similar Sets Finding Similar Sets V. CHRISTOPHIDES vassilis.christophides@inria.fr https://who.rocq.inria.fr/vassilis.christophides/big/ Ecole CentraleSupélec Motivation Many Web-mining problems can be expressed as

More information

University of Maryland. Tuesday, March 2, 2010

University of Maryland. Tuesday, March 2, 2010 Data-Intensive Information Processing Applications Session #5 Graph Algorithms Jimmy Lin University of Maryland Tuesday, March 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 5: Analyzing Graphs (2/2) February 2, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Problem 1: Complexity of Update Rules for Logistic Regression

Problem 1: Complexity of Update Rules for Logistic Regression Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1

More information

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document

More information

with MapReduce The ischool University of Maryland JHU Summer School on Human Language Technology Wednesday, June 17, 2009

with MapReduce The ischool University of Maryland JHU Summer School on Human Language Technology Wednesday, June 17, 2009 Data-Intensive Text Processing with MapReduce Jimmy Lin The ischool University of Maryland JHU Summer School on Human Language Technology Wednesday, June 17, 2009 This work is licensed under a Creative

More information

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011 Data-Intensive Information Processing Applications! Session #5 Graph Algorithms Jordan Boyd-Graber University of Maryland Thursday, March 3, 2011 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 7: Clustering and Classification Jimmy Lin University of Maryland Thursday, March 7, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017 CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start

More information

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University Shingling Minhashing Locality-Sensitive Hashing Jeffrey D. Ullman Stanford University 2 Wednesday, January 13 Computer Forum Career Fair 11am - 4pm Lawn between the Gates and Packard Buildings Policy for

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 5: Analyzing Relational Data (1/3) February 8, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Locality-Sensitive Hashing

Locality-Sensitive Hashing Locality-Sensitive Hashing & Image Similarity Search Andrew Wylie Overview; LSH given a query q (or not), how do we find similar items from a large search set quickly? Can t do all pairwise comparisons;

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 (Fall 2018) Part 4: Analyzing Graphs (1/2) October 4, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are

More information

MapReduce Locality Sensitive Hashing GPUs. NLP ML Web Andrew Rosenberg

MapReduce Locality Sensitive Hashing GPUs. NLP ML Web Andrew Rosenberg MapReduce Locality Sensitive Hashing GPUs NLP ML Web Andrew Rosenberg Big Data What is Big Data? Data Analysis based on more data than previously considered. Analysis that requires new or different processing

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 1: MapReduce Algorithm Design (4/4) January 16, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Mahout in Action MANNING ROBIN ANIL SEAN OWEN TED DUNNING ELLEN FRIEDMAN. Shelter Island

Mahout in Action MANNING ROBIN ANIL SEAN OWEN TED DUNNING ELLEN FRIEDMAN. Shelter Island Mahout in Action SEAN OWEN ROBIN ANIL TED DUNNING ELLEN FRIEDMAN II MANNING Shelter Island contents preface xvii acknowledgments about this book xx xix about multimedia extras xxiii about the cover illustration

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing Some Interesting Applications of Theory PageRank Minhashing Locality-Sensitive Hashing 1 PageRank The thing that makes Google work. Intuition: solve the recursive equation: a page is important if important

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (2/2) January 12, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia An Overview of Search Engine Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia haixu@microsoft.com July 24, 2007 1 Outline History of Search Engine Difference Between Software and

More information

Mining di Dati Web. Lezione 3 - Clustering and Classification

Mining di Dati Web. Lezione 3 - Clustering and Classification Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement

More information

Cake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity.

Cake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity. Cake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity Jesse Kornblum Outline Introduction Artificial Intelligence Spam Detection Clustering

More information

Scalable Techniques for Similarity Search

Scalable Techniques for Similarity Search San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 Scalable Techniques for Similarity Search Siddartha Reddy Nagireddy San Jose State University

More information

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies. CSE547: Machine Learning for Big Data Spring 2019 Problem Set 1 Please read the homework submission policies. 1 Spark (25 pts) Write a Spark program that implements a simple People You Might Know social

More information

MapReduce Algorithms

MapReduce Algorithms Large-scale data processing on the Cloud Lecture 3 MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation Commons Attribution 3.0 License) Outline

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

B490 Mining the Big Data. 5. Models for Big Data

B490 Mining the Big Data. 5. Models for Big Data B490 Mining the Big Data 5. Models for Big Data Qin Zhang 1-1 2-1 MapReduce MapReduce The MapReduce model (Dean & Ghemawat 2004) Input Output Goal Map Shuffle Reduce Standard model in industry for massive

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Today s lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Today s lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications Today s lecture Introduction to Information Retrieval Web Crawling (Near) duplicate detection CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Crawling and Duplicates 2 Sec. 20.2

More information

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications Today s lecture Introduction to Information Retrieval Web Crawling (Near) duplicate detection CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and Prabhakar Raghavan Crawling and Duplicates

More information

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Graph Algorithms. Revised based on the slides by Ruoming Kent State Graph Algorithms Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

New Issues in Near-duplicate Detection

New Issues in Near-duplicate Detection New Issues in Near-duplicate Detection Martin Potthast and Benno Stein Bauhaus University Weimar Web Technology and Information Systems Motivation About 30% of the Web is redundant. [Fetterly 03, Broder

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Going nonparametric: Nearest neighbor methods for regression and classification

Going nonparametric: Nearest neighbor methods for regression and classification Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 46: Machine Learning Emily Fox University of Washington May 3, 208 Locality sensitive hashing for approximate NN

More information

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016 CPSC 340: Machine Learning and Data Mining Non-Parametric Models Fall 2016 Assignment 0: Admin 1 late day to hand it in tonight, 2 late days for Wednesday. Assignment 1 is out: Due Friday of next week.

More information

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please) Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Spring 2017, Prakash Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please) Reminders:

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University  Infinite data. Filtering data streams /9/7 Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them

More information

Epilog: Further Topics

Epilog: Further Topics Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Epilog: Further Topics Lecture: Prof. Dr. Thomas

More information

Overview of Clustering

Overview of Clustering based on Loïc Cerfs slides (UFMG) April 2017 UCBL LIRIS DM2L Example of applicative problem Student profiles Given the marks received by students for different courses, how to group the students so that

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Chapter 8: Implementation- Clipping and Rasterization

Chapter 8: Implementation- Clipping and Rasterization Chapter 8: Implementation- Clipping and Rasterization Clipping Fundamentals Cohen-Sutherland Parametric Polygons Circles and Curves Text Basic Concepts: The purpose of clipping is to remove objects or

More information

Map-Reduce Applications: Counting, Graph Shortest Paths

Map-Reduce Applications: Counting, Graph Shortest Paths Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

University of Maryland. Tuesday, March 23, 2010

University of Maryland. Tuesday, March 23, 2010 Data-Intensive Information Processing Applications Session #7 MapReduce and databases Jimmy Lin University of Maryland Tuesday, March 23, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

Similarity Joins in MapReduce

Similarity Joins in MapReduce Similarity Joins in MapReduce Benjamin Coors, Kristian Hunt, and Alain Kaeslin KTH Royal Institute of Technology {coors,khunt,kaeslin}@kth.se Abstract. This paper studies how similarity joins can be implemented

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

More information

Statistical Machine Translation Lecture 3. Word Alignment Models

Statistical Machine Translation Lecture 3. Word Alignment Models p. Statistical Machine Translation Lecture 3 Word Alignment Models Stephen Clark based on slides by Philipp Koehn p. Statistical Modeling p Mary did not slap the green witch Maria no daba una bofetada

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Supervised Learning: Nearest Neighbors

Supervised Learning: Nearest Neighbors CS 2750: Machine Learning Supervised Learning: Nearest Neighbors Prof. Adriana Kovashka University of Pittsburgh February 1, 2016 Today: Supervised Learning Part I Basic formulation of the simplest classifier:

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research

More information

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Fundamentals of learning (continued) and the k-nearest neighbours classifier Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Readings in unsupervised Learning Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Hashing 2 Hashing: definition Hashing is the process of converting

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

Going nonparametric: Nearest neighbor methods for regression and classification

Going nonparametric: Nearest neighbor methods for regression and classification Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 46: Machine Learning Emily Fox University of Washington May 8, 28 Locality sensitive hashing for approximate NN

More information

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each

More information

Predicting Popular Xbox games based on Search Queries of Users

Predicting Popular Xbox games based on Search Queries of Users 1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which

More information

Information Integration of Partially Labeled Data

Information Integration of Partially Labeled Data Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de

More information

Locality- Sensitive Hashing Random Projections for NN Search

Locality- Sensitive Hashing Random Projections for NN Search Case Study 2: Document Retrieval Locality- Sensitive Hashing Random Projections for NN Search Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 18, 2017 Sham Kakade

More information

MapReduce. Stony Brook University CSE545, Fall 2016

MapReduce. Stony Brook University CSE545, Fall 2016 MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining

More information

Algorithms for Nearest Neighbors

Algorithms for Nearest Neighbors Algorithms for Nearest Neighbors Classic Ideas, New Ideas Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura University of Toronto, July 2007 1 / 39 Outline

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (2/2) March 16, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Map-Reduce Applications: Counting, Graph Shortest Paths

Map-Reduce Applications: Counting, Graph Shortest Paths Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016 Machine Learning 10-701, Fall 2016 Nonparametric methods for Classification Eric Xing Lecture 2, September 12, 2016 Reading: 1 Classification Representing data: Hypothesis (classifier) 2 Clustering 3 Supervised

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Information Retrieval. Lecture 9 - Web search basics

Information Retrieval. Lecture 9 - Web search basics Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

CS290N Summary Tao Yang

CS290N Summary Tao Yang CS290N Summary 2015 Tao Yang Text books [CMS] Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Publisher: Addison-Wesley, 2010. Book website. [MRS] Christopher

More information

Birkbeck (University of London)

Birkbeck (University of London) Birkbeck (University of London) MSc Examination for Internal Students Department of Computer Science and Information Systems Information Retrieval and Organisation (COIY64H7) Credit Value: 5 Date of Examination:

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Vector Space Models: Theory and Applications

Vector Space Models: Theory and Applications Vector Space Models: Theory and Applications Alexander Panchenko Centre de traitement automatique du langage (CENTAL) Université catholique de Louvain FLTR 2620 Introduction au traitement automatique du

More information

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman http://www.mmds.org Overview of Recommender Systems Content-based Systems Collaborative Filtering J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive

More information