Being Prepared In A Sparse World: The Case of KNN Graph Construction. Antoine Boutet DRIM LIRIS, Lyon


Co-authors: joint work with François Taiani, Nupur Mittal, and Anne-Marie Kermarrec. Published at ICDE 2016.

Context: Engineering & Scale. $1 million prize for recommendation: too much engineering effort (F. Taiani). Which practical approaches for scale and performance?

Outline: (1) The problem: KNN graph construction. (2) The intuition: Is greed all there is? (3) KIFF: K-nearest neighbor Fast and efficient. (4) Evaluation.

KNN Graph Construction. Entities (e.g. users) rate items (e.g. locations). The ratings of a user, e.g. Alice, form her user profile.

KNN Graph Construction. Similarity: sim(Alice, Bob) is a similarity function applied to the profile of Alice and the profile of Bob. Goal: for each entity, find the k closest entities. Many applications: recommendation, search, learning, ...
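The goal stated on this slide can be sketched as an exact, brute-force construction. This is only an illustration: the slide does not name a similarity function, so Jaccard similarity over item sets is an assumption, and the profiles and the helper names (`jaccard`, `brute_force_knn`) are made up for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two item sets (an assumed choice of sim)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def brute_force_knn(profiles, k, sim=jaccard):
    """Exact KNN graph: compare every pair of users once, keep the top k."""
    scores = {u: [] for u in profiles}
    for u, v in combinations(profiles, 2):
        s = sim(profiles[u], profiles[v])
        scores[u].append((s, v))
        scores[v].append((s, u))
    # Rank by decreasing similarity; break ties by name for determinism.
    return {
        u: [v for _, v in sorted(cand, key=lambda sv: (-sv[0], sv[1]))[:k]]
        for u, cand in scores.items()
    }


# Hypothetical profiles: user -> set of rated items.
profiles = {
    "Alice": {"chalet", "bank", "park"},
    "Bob": {"chalet", "bank"},
    "Carl": {"park"},
    "Dave": {"museum"},
}
knn = brute_force_knn(profiles, k=2)
print(knn["Alice"])
```

Every pair of users is compared exactly once, which is what makes this approach impractical for large user sets.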

Challenges. Brute force is not scalable. Alternatives: approximate KNN graph construction, either using Locality Sensitive Hashing (LSH) or using greedy construction (best at the moment): Vicinity [VS05], T-Man [JMB09], NNDescent [DML11], HyRec [BFGKP14].

Greedy KNN Construction. Parallel, iterative algorithm that starts from a random graph. Each node looks for potential new neighbours: (1) among friends of friends; (2) among random nodes (optional). [Figure: example graph with nodes Alice, Bob, Carl, Dave, Xavier, Yann.]

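Greedy Procedure. Repeat for all users until #changes < ε: (a) take the current neighborhood; (b) gather neighbor candidates from (1) & (2); (c) compute sim(node, candidate) for each candidate; (d) rank the candidates; (e) select the best ones; (f) these form the new neighborhood. [Figure: worked example with similarity values.]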
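The greedy procedure can be sketched as follows. This is a simplified, sequential illustration in the spirit of the cited greedy approaches, not the code of any of them. Jaccard similarity, the parameter names (`eps`, `n_random`, `max_iters`) and the sample profiles are assumptions.

```python
import random


def jaccard(a, b):
    """Jaccard similarity of two item sets (an assumed choice of sim)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_knn(profiles, k, eps=1, n_random=1, max_iters=20, seed=0):
    """Greedy KNN sketch: refine a random graph via friends of friends."""
    rng = random.Random(seed)
    users = sorted(profiles)
    # Start from a random graph: k random neighbours per node.
    knn = {u: set(rng.sample([v for v in users if v != u], k)) for u in users}
    for _ in range(max_iters):
        changes = 0
        for u in users:
            # Candidates: current neighbours, (1) friends of friends,
            # and (2) a few random nodes.
            cand = set(knn[u])
            for v in knn[u]:
                cand |= knn[v]
            cand |= set(rng.sample(users, n_random))
            cand.discard(u)
            # Compute similarities, rank, and keep the k best.
            ranked = sorted(
                cand, key=lambda v: (-jaccard(profiles[u], profiles[v]), v)
            )
            new = set(ranked[:k])
            changes += len(new - knn[u])
            knn[u] = new
        if changes < eps:  # stop once the graph has stabilised
            break
    return knn


# Hypothetical profiles: user -> set of rated items.
profiles = {
    "Alice": {"a", "b", "c"},
    "Bob": {"a", "b"},
    "Carl": {"b", "c"},
    "Dave": {"x", "y"},
    "Xavier": {"x", "y", "z"},
    "Yann": {"y", "z"},
}
graph = greedy_knn(profiles, k=2)
```

The loop here visits users one after another and reuses neighbourhoods already updated in the same round; a parallel version would instead read from the previous round's graph.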

Is Greed all there is? Observation 1: similarity remains the bottleneck; 90% of execution time is spent on similarity computations (Wikipedia dataset). Observation 2: datasets are (often) sparse; many datasets use item-based profiles, and most items are shared by few users (the curse of dimensionality).

The Problem with Sparsity. Density: $\mathrm{density} = E / (U \times I)$, with E = #ratings, U = #users, I = #items. [Figure: example with density = 35%.] Only a few rungs ("ratings") on the ladder: 2 random nodes are unlikely to share items.
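As a small worked example, density can be computed from a set of user profiles. The sketch assumes the usual definition, the number of ratings divided by (number of users times number of items). The profiles are made up and do not correspond to the 35% figure on the slide.

```python
def density(profiles):
    """Fraction of possible (user, item) pairs that carry a rating."""
    n_users = len(profiles)
    items = set().union(*profiles.values())
    n_ratings = sum(len(p) for p in profiles.values())
    return n_ratings / (n_users * len(items))


# Hypothetical profiles: 4 users, 4 distinct items, 7 ratings in total.
profiles = {
    "Alice": {"chalet", "bank", "park"},
    "Bob": {"chalet", "bank"},
    "Carl": {"park"},
    "Dave": {"museum"},
}
d = density(profiles)
print(f"density = {d:.4f}")  # 7 / (4 * 4)
```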

The Problem with Sparsity. Similarity metrics account for shared items, and two random nodes are unlikely to be close. Hence greedy processes are slow to start: it is difficult to find relevant candidates, and many similarity evaluations are executed.

KIFF's Intuition. Greedy KNN approaches assume no initial structure and start from a random graph. In practice, there is an underlying bipartite user / item graph that can be used to bootstrap the greedy process. Use items to create Ranked Candidate Sets (RCS): $v \in RCS(u) \iff items(u) \cap items(v) \neq \emptyset$.

RCS Construction. From the bipartite user / item graph, build the item profiles (IP), then construct the Ranked Candidate Sets. [Figure: bipartite graph of users (Alice, Bob, ...) and items, with item profiles IP_chalet and IP_bank, user profiles items_Alice and items_Bob, and the further labels Darth and Stormy; RCS_Alice contains Bob with count 1, and RCS_Bob contains Alice with count 1.] Unrelated users are never compared.
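The construction above can be sketched with an inverted index from items to users. This is a minimal illustration rather than the authors' implementation: the profiles, the function name `build_rcs`, and the choice to store every pair in both directions are assumptions made for readability.

```python
from collections import Counter, defaultdict


def build_rcs(profiles):
    """Ranked Candidate Sets: for each user, the users sharing at least
    one item with them, sorted by decreasing number of shared items."""
    # Item profiles: item -> set of users who rated it.
    item_profiles = defaultdict(set)
    for user, items in profiles.items():
        for item in items:
            item_profiles[item].add(user)
    # Count shared items for each pair that co-occurs in an item profile.
    shared = {u: Counter() for u in profiles}
    for users in item_profiles.values():
        for u in users:
            for v in users:
                if u != v:
                    shared[u][v] += 1
    # Rank by decreasing shared-item count; break ties by name.
    return {
        u: sorted(counts.items(), key=lambda vc: (-vc[1], vc[0]))
        for u, counts in shared.items()
    }


# Hypothetical profiles: Alice and Bob share one item, Darth shares none.
profiles = {
    "Alice": {"chalet", "bank"},
    "Bob": {"bank"},
    "Darth": {"deathstar"},
}
rcs = build_rcs(profiles)
print(rcs)
```

Pairs of users are only ever touched through an item they both rated, so users with no item in common never appear in each other's candidate set.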

KIFF Iteration (example for Alice). RCS_Alice is sorted by item count (e.g. Xavier 6, Yann 3, ...). Repeat for all users until #changes < β: (1) take the current neighborhood; (2) take the top γ candidates in RCS_Alice by item count; (3) compute similarities and keep the top k by sim; (4) these form the new neighborhood. [Figure: with sim(Alice, C) = 0.9, B = 0.4, D = 0.3, X = 0.6, Y = 0.5, the ranking is C 0.9, X 0.6, Y 0.5, B 0.4, D 0.3, and the new neighborhood is Carl, Xavier, Yann.]

Indexing followed by "greedy" iteration. Trivially parallelizable and highly local. Indexing: $O(E)$. Iteration: $O(U \times |RCS|)$.
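The iteration can be sketched as follows. This is a hedged illustration, not the published implementation. Jaccard similarity is an assumed choice of sim, the profiles are made up, and `shared_item_rcs` is a quadratic stand-in for the index-based RCS construction, used only to keep the example short.

```python
def jaccard(a, b):
    """Jaccard similarity of two item sets (an assumed choice of sim)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_item_rcs(profiles):
    """Stand-in RCS builder: candidates sharing >= 1 item, by shared count."""
    rcs = {}
    for u in profiles:
        counts = [(len(profiles[u] & profiles[v]), v) for v in profiles if v != u]
        ranked = sorted(counts, key=lambda cv: (-cv[0], cv[1]))
        rcs[u] = [v for c, v in ranked if c > 0]
    return rcs


def kiff_iterate(profiles, rcs, k, gamma, beta=1):
    """Each round, every user pops the top gamma candidates from its RCS,
    scores them, and keeps the k most similar users seen so far."""
    pending = {u: list(rcs.get(u, [])) for u in profiles}
    knn = {u: [] for u in profiles}  # user -> list of (sim, neighbour)
    while True:
        changes = 0
        for u in profiles:
            batch, pending[u] = pending[u][:gamma], pending[u][gamma:]
            if not batch:
                continue
            scored = knn[u] + [(jaccard(profiles[u], profiles[v]), v) for v in batch]
            scored.sort(key=lambda sv: (-sv[0], sv[1]))
            new = scored[:k]
            changes += len({v for _, v in new} - {v for _, v in knn[u]})
            knn[u] = new
        # Stop when few neighbourhoods changed or all candidates are used up.
        if changes < beta or not any(pending.values()):
            break
    return {u: [v for _, v in nbrs] for u, nbrs in knn.items()}


# Hypothetical profiles: user -> set of rated items.
profiles = {
    "Alice": {"a", "b", "c"},
    "Carl": {"a", "b", "c"},
    "Xavier": {"a", "b", "x"},
    "Bob": {"a", "y", "z", "w"},
}
result = kiff_iterate(profiles, shared_item_rcs(profiles), k=2, gamma=2)
print(result["Alice"])
```

Because candidates are consumed in order of decreasing shared-item count, the most promising comparisons happen in the first rounds, and users outside a candidate set are never scored.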

Evaluation: Datasets. Long-tail profile size distribution.

Evaluation: Metrics. Wall-clock computation time. Recall, which compares the ideal KNN neighborhood for user u with the approximated KNN neighborhood for user u. Scan rate.
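The exact recall formula did not survive in this transcript. As a rough illustration only, the sketch below assumes the common definition: for each user, the fraction of the ideal neighbours that also appear in the approximated neighbourhood, averaged over users. The neighbourhoods are made up.

```python
def recall(approx, ideal):
    """Average per-user overlap between approximated and ideal KNN
    neighbourhoods (assumed definition, see the note above)."""
    per_user = [
        len(set(approx[u]) & set(ideal[u])) / len(ideal[u])
        for u in ideal
        if ideal[u]
    ]
    return sum(per_user) / len(per_user)


# Hypothetical neighbourhoods for two users with k = 2.
ideal = {"Alice": ["Bob", "Carl"], "Bob": ["Alice", "Carl"]}
approx = {"Alice": ["Bob", "Dave"], "Bob": ["Alice", "Carl"]}
r = recall(approx, ideal)
print(r)  # (0.5 + 1.0) / 2
```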

Performance Details. Much reduced scan rate.

Overall Performance, compared with NNDescent [DML11] and HyRec [BFGKP14]: faster (x14) and better (+20%).

Execution time. [Figure: panels for the Arxiv and Wikipedia datasets.]

KIFF's Scan Rate (Arxiv dataset). KIFF: first iterations yield the highest gains.

Impact of RCS on Bootstrap. Iteration 0: RCS_Alice, sorted by item count (Bob 8, Dave 7, ...).

Termination Criteria. Repeat for all users until #changes < β. Vertical bars: RCS truncation imposed by KIFF. Termination only impacts a minority of users.

Effect of Density. Scan rate grows with density, hurting performance.

Conclusion. Novel KNN construction algorithm. Intuition: reduce similarity computations. Faster (x14) and more accurate (+20%) than the state of the art. Performs best on sparse datasets.

Some References
[JMB09] M. Jelasity, A. Montresor, and O. Babaoglu, T-Man: Gossip-based fast overlay topology construction, Comput. Netw. 53, 13 (August 2009), pp. 2321-2339.
[VS05] S. Voulgaris and M. van Steen, Epidemic-style management of semantic overlays for content-based searching, in Euro-Par, 2005, pp. 1143-1152.
[DML11] W. Dong, C. Moses, and K. Li, Efficient k-nearest neighbor graph construction for generic similarity measures, in WWW, 2011, pp. 577-586.
[BFGKP14] A. Boutet, D. Frey, R. Guerraoui, A.-M. Kermarrec, and R. Patra, HyRec: Leveraging Browsers for Scalable Recommenders, in Middleware, 2014, pp. 85-96.