Being Prepared In A Sparse World: The Case of KNN Graph Construction. Antoine Boutet, DRIM, LIRIS, Lyon

Co-authors. Joint work with François Taiani, Nupur Mittal, and Anne-Marie Kermarrec. Published at ICDE 2016.

Context: Engineering & Scale. $1 million prize for recommendation: too much engineering effort (F. Taiani). Which practical approaches for scale and performance?

Outline. The problem: KNN graph construction. The intuition: Is greed all there is? KIFF: K-nearest neighbor Fast and efficient. Evaluation.

KNN Graph Construction. Entities (e.g. users), items (e.g. locations), and ratings. The ratings of a user form the user profile (e.g. Alice's).

KNN Graph Construction. Similarity: sim(Alice, Bob) = f(profile of Alice, profile of Bob), where f is a similarity function. Goal: for each entity, find its k closest entities. Many applications: search, recommendation, learning, ...
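To make the goal concrete, here is a minimal brute-force sketch. The slide does not name the similarity function, so Jaccard similarity over item sets is an assumption, as are the function names and the toy profiles.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two item sets (assumed metric, not from the slides)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def brute_force_knn(profiles, k):
    """Return {user: [k most similar users]} by comparing every pair of users."""
    sims = {u: [] for u in profiles}
    for u, v in combinations(profiles, 2):
        s = jaccard(profiles[u], profiles[v])
        sims[u].append((s, v))
        sims[v].append((s, u))
    # Rank by decreasing similarity; break ties by name for determinism.
    return {
        u: [v for _, v in sorted(cands, key=lambda c: (-c[0], c[1]))[:k]]
        for u, cands in sims.items()
    }


# Hypothetical toy profiles (user -> set of rated items).
profiles = {
    "Alice": {"chalet", "bank", "museum"},
    "Bob": {"chalet", "bank"},
    "Carl": {"museum", "park"},
    "Dave": {"stadium"},
}
knn = brute_force_knn(profiles, k=2)
```

Every pair of users is compared once, which is the quadratic cost that the rest of the talk tries to avoid.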

Challenges. Brute force is not scalable. Alternatives: approximate KNN graph construction, either using Locality Sensitive Hashing (LSH) or using greedy construction (best at the moment): Vicinity [VS05], T-Man [JMB09], NNDescent [DML11], HyRec [BFGKP14].

Greedy KNN Construction. Parallel, iterative algorithm, starting from a random graph. Each node looks for potential new neighbours (1) among friends of friends and (2) among random nodes (optional). Example graph: Alice, Bob, Carl, Dave, Xavier, Yann.

Greedy Procedure. Repeat for all users until #changes < ε: take the current neighborhood, gather neighbor candidates from (1) & (2), compute distances with sim(·, ·), rank the candidates, select the best ones, and keep them as the new neighborhood.
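The greedy procedure can be sketched as follows. This is an illustrative simplification, not the code of Vicinity, T-Man, NNDescent, or HyRec: the Jaccard metric, the synchronous update, the parameter names (`epsilon`, `n_random`, `max_iters`), and the toy data are all assumptions.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two item sets (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_knn(profiles, k, epsilon=1, n_random=1, max_iters=50, seed=0):
    """Greedy KNN sketch; returns (knn graph, number of similarity computations)."""
    rng = random.Random(seed)
    users = sorted(profiles)
    # Start from a random graph: k random neighbours per user.
    knn = {u: rng.sample([v for v in users if v != u], k) for u in users}
    sim_count = 0
    for _ in range(max_iters):
        changes = 0
        new_knn = {}
        for u in users:
            # Candidates: current neighbours, (1) friends of friends, (2) random nodes.
            candidates = set(knn[u])
            for v in knn[u]:
                candidates.update(knn[v])
            candidates.update(rng.sample(users, n_random))
            candidates.discard(u)
            # Similarity computation, ranking, selection of the top k.
            scored = []
            for v in sorted(candidates):
                scored.append((jaccard(profiles[u], profiles[v]), v))
                sim_count += 1
            scored.sort(key=lambda c: (-c[0], c[1]))
            new_knn[u] = [v for _, v in scored[:k]]
            changes += len(set(new_knn[u]) - set(knn[u]))
        knn = new_knn
        if changes < epsilon:  # stop once (almost) nothing changes
            break
    return knn, sim_count


# Hypothetical toy profiles (user -> set of rated items).
profiles = {
    "Alice": {"a", "b", "c"},
    "Bob": {"a", "b"},
    "Carl": {"b", "c", "d"},
    "Dave": {"d", "e"},
    "Xavier": {"e", "f"},
    "Yann": {"a", "f"},
}
graph, n_sims = greedy_knn(profiles, k=2)
```

The returned counter makes the cost of the procedure visible: each candidate examined costs one similarity computation, whether or not it ends up in the neighbourhood.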

Is Greed all there is? Observation 1: similarity computation remains the bottleneck; 90% of the execution time is spent on similarity (Wikipedia dataset). Observation 2: datasets are (often) sparse. Many datasets use item-based profiles, and most items are shared by few users, hence the sparsity (the curse of dimensionality).

The Problem with Sparsity. Density = E / (U × I), where E = #ratings, U = #users, I = #items. Illustrated example: density = 35%. There are only few rungs ("ratings") on the ladder, so two random nodes are unlikely to share items.
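As a small worked example, density can be computed directly from a set of (user, item) ratings. The sketch assumes the usual definition, density = E / (U × I); the function name and the toy ratings are hypothetical.

```python
def density(ratings):
    """Density of a rating set, assuming density = E / (U * I)."""
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    return len(ratings) / (len(users) * len(items))


# Hypothetical toy ratings as (user, item) pairs: E = 4, U = 3, I = 3.
ratings = {
    ("Alice", "chalet"),
    ("Alice", "bank"),
    ("Bob", "bank"),
    ("Carl", "museum"),
}
d = density(ratings)  # 4 / (3 * 3)
```

The lower this value, the less likely it is that two users picked at random have any item in common.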

The Problem with Sparsity. Similarity metrics account for shared items, so two random nodes are unlikely to be close. Hence greedy processes are slow to start: it is difficult to find relevant candidates, and many similarity evaluations are executed.

KIFF's Intuition. Greedy KNN approaches assume no initial structure and start from a random graph. In practice, there is an underlying bipartite user/item graph, which can be used to bootstrap the greedy process: use items to create Ranked Candidate Sets (RCS), with v ∈ RCS(u) iff items(u) ∩ items(v) ≠ ∅.

RCS Construction. Item profiles; bipartite user/item graph; Ranked Candidate Set construction.

RCS Construction. Example (bipartite user/item graph): users Alice, Bob, ... with user profiles items_Alice and items_Bob; item profiles IP_chalet and IP_bank; RCS_Alice = (Bob: 1, ...), RCS_Bob = (Alice: 1, ...); other labels in the figure: Darth, Stormy. Unrelated users are never compared.
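One way to build ranked candidate sets from the bipartite graph is sketched below: invert the user profiles into item profiles, then count, for each user, how many items are shared with every co-occurring user. This is an illustration under assumptions, not KIFF's actual implementation; the function name and the toy data are hypothetical and do not reproduce the figure.

```python
from collections import Counter, defaultdict


def build_rcs(user_items):
    """Return {user: [(candidate, shared_item_count), ...]} sorted by decreasing count."""
    # Item profiles: item -> set of users who rated it.
    item_profiles = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_profiles[item].add(user)
    # Two users become candidates for each other only if they share an item.
    counts = {user: Counter() for user in user_items}
    for users in item_profiles.values():
        for u in users:
            for v in users:
                if u != v:
                    counts[u][v] += 1
    # Rank by decreasing shared-item count; break ties by name for determinism.
    return {
        u: sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))
        for u, c in counts.items()
    }


# Hypothetical toy data (user -> set of items).
user_items = {
    "Alice": {"chalet", "bank"},
    "Bob": {"chalet", "bank"},
    "Darth": {"bank"},
    "Stormy": {"beach"},
}
rcs = build_rcs(user_items)
```

In this toy data Stormy shares no item with anyone, so her candidate set is empty and she is never compared with the other users.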

KIFF iteration. Repeat for all users until #changes < β: (1) take the current neighborhood; (2) take the top γ candidates in the user's RCS by item count; (3) keep the top k by similarity; (4) this is the new neighborhood. Example for Alice: current neighborhood Carl, Bob, Dave; RCS_Alice sorted by item count: Xavier (6), Yann (3), ...; sim(Alice, ·): Carl 0.9, Bob 0.4, Dave 0.3, Xavier 0.6, Yann 0.5; new neighborhood: Carl, Xavier, Yann. Indexing followed by "greedy" iteration; trivially parallelizable and highly local. Indexing: O(E). Iteration: O(U × |RCS|).
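The iteration phase can be sketched as below. It is a simplification under stated assumptions: the Jaccard metric, starting from empty neighbourhoods, counting total changes against `beta`, and the names `kiff_iterate` and `max_iters` are all choices made for this example, not details confirmed by the slides. The ranked candidate sets are given as precomputed lists, ordered by shared-item count.

```python
def jaccard(a, b):
    """Jaccard similarity between two item sets (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def kiff_iterate(profiles, rcs, k, gamma, beta, max_iters=100):
    """Refine neighbourhoods by consuming gamma ranked candidates per user and round."""
    remaining = {u: list(cands) for u, cands in rcs.items()}
    knn = {u: [] for u in profiles}  # user -> [(similarity, neighbour), ...]
    for _ in range(max_iters):
        changes = 0
        for u in profiles:
            # Pop the top gamma candidates still unseen in u's ranked candidate set.
            batch, remaining[u] = remaining[u][:gamma], remaining[u][gamma:]
            scored = [(jaccard(profiles[u], profiles[v]), v) for v in batch]
            # Merge with the current neighbourhood and keep the top k by similarity.
            merged = sorted(knn[u] + scored, key=lambda c: (-c[0], c[1]))[:k]
            changes += len({v for _, v in merged} - {v for _, v in knn[u]})
            knn[u] = merged
        if changes < beta:
            break
    return {u: [v for _, v in nbrs] for u, nbrs in knn.items()}


# Hypothetical toy data; each list is ordered by decreasing shared-item count.
profiles = {
    "Alice": {"a", "b", "c", "d"},
    "Bob": {"a", "b", "c"},
    "Carl": {"a", "b", "c", "d", "e"},
    "Dave": {"d", "f"},
}
rcs = {
    "Alice": ["Carl", "Bob", "Dave"],
    "Bob": ["Alice", "Carl"],
    "Carl": ["Alice", "Bob", "Dave"],
    "Dave": ["Alice", "Carl"],
}
result = kiff_iterate(profiles, rcs, k=2, gamma=1, beta=1)
```

Each user only ever touches its own candidate list and neighbourhood, which is why the per-user loop body could run in parallel. Bob and Dave share no item, so their similarity is never computed.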

Evaluation: Datasets. Long-tail profile size distribution.

Evaluation: Metrics. Wall-clock computation time. Recall, defined from the ideal KNN neighborhood for user u and the approximated KNN neighborhood for user u. Scan rate.

Performance Details. Much reduced scan rate.

Overall Performance, compared with [DML11] and [BFGKP14]. Faster (x14), better (+20%).

Execution time (Arxiv and Wikipedia datasets).

KIFF's Scan Rate (Arxiv dataset). KIFF: the first iterations yield the highest gains.

Impact of RCS on Bootstrap (iteration 0). Example: RCS_Alice sorted by item count: Bob (8), Dave (7), ...

Termination Criteria. Repeat for all users until #changes < β. Vertical bars: RCS truncation imposed by KIFF. Termination only impacts a minority of users.

Effect of Density. Scan rate grows with density, hurting performance.

Conclusion. Novel KNN construction algorithm. Intuition: reduce similarity computations. Faster (x14) and more accurate (+20%) than the state of the art. Performs best on sparse datasets.

Some References
[JMB09] M. Jelasity, A. Montresor, and O. Babaoglu, T-Man: Gossip-based fast overlay topology construction, Comput. Netw. 53, 13 (August 2009), pp. 2321-2339.
[VS05] S. Voulgaris and M. v. Steen, Epidemic-style management of semantic overlays for content-based searching, in Euro-Par, 2005, pp. 1143-1152.
[DML11] W. Dong, C. Moses, and K. Li, Efficient k-nearest neighbor graph construction for generic similarity measures, in WWW, 2011, pp. 577-586.
[BFGKP14] A. Boutet, D. Frey, R. Guerraoui, A.-M. Kermarrec, and R. Patra, HyRec: Leveraging Browsers for Scalable Recommenders, in Middleware, 2014, pp. 85-96.