Being Prepared In A Sparse World: The Case of KNN Graph Construction. Antoine Boutet, DRIM, LIRIS, Lyon
Co-authors: joint work with François Taiani, Nupur Mittal, and Anne-Marie Kermarrec. Published at ICDE 2016.
Context: Engineering & Scale. $1 million prize for recommendation: too much engineering effort (F. Taiani). Which practical approaches for scale and performance?
Outline: (1) The problem: KNN graph construction; (2) The intuition: Is greed all there is?; (3) KIFF: K-nearest neighbor Fast and efficient; (4) Evaluation.
KNN Graph Construction: entities (e.g. users) are linked to items (e.g. locations) by ratings; the items rated by a user, e.g. Alice, form her user profile.
KNN Graph Construction: a similarity function compares two profiles, e.g. sim(Alice, Bob) = f(profile of Alice, profile of Bob). Goal: for each entity, find its k closest entities. Many applications: recommendation, search, learning, ...
Challenges: brute force is not scalable, since it compares every pair of entities. Alternatives: approximate KNN graphs, built either with Locality Sensitive Hashing (LSH) or with greedy construction, the best approach at the moment: Vicinity [VS05], T-Man [JMB09], NNDescent [DML11], HyRec [BFGKP14].
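To make the cost of the brute-force baseline concrete, here is a minimal Python sketch. The slides do not name a similarity metric, so the Jaccard similarity over item sets used here is an assumption, as are the function names and the toy profiles. The point is only that the exact approach evaluates the similarity for every unordered pair of users.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two item sets (an assumed metric)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def brute_force_knn(profiles, k, sim=jaccard):
    """Exact KNN graph: evaluates sim on every unordered pair of users."""
    scores = {u: [] for u in profiles}
    evaluations = 0
    for u, v in combinations(profiles, 2):
        s = sim(profiles[u], profiles[v])
        evaluations += 1
        scores[u].append((s, v))
        scores[v].append((s, u))
    graph = {}
    for u, cands in scores.items():
        # Highest similarity first; ties broken by user name for determinism.
        cands.sort(key=lambda sv: (-sv[0], sv[1]))
        graph[u] = [v for _, v in cands[:k]]
    return graph, evaluations


# Hypothetical toy profiles: user -> set of rated items.
profiles = {
    "alice": {"chalet", "bank", "park"},
    "bob": {"chalet", "bank"},
    "carl": {"park"},
    "dave": {"museum"},
}
graph, evaluations = brute_force_knn(profiles, k=2)
```

With 4 users the sketch performs 4 × 3 / 2 = 6 similarity evaluations; the count grows quadratically with the number of users.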
Greedy KNN Construction: a parallel, iterative algorithm that starts from a random graph. Each node (e.g. Alice, among Bob, Carl, Dave, Xavier, Yann) looks for potential new neighbours (1) among friends of friends and (2) among random nodes (optional).
Greedy Procedure, for each node: (a) take the current neighborhood; (b) gather neighbor candidates from (1) & (2); (c) compute the distance sim(node, candidate) for each candidate; (d) rank; (e) select; (f) obtain the new neighborhood. Repeat for all users until #changes < ε.
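The greedy procedure can be sketched in a few lines of Python. This is a simplified illustration in the spirit of the cited greedy approaches, not a reimplementation of any of them: the Jaccard metric, the synchronous update, the `n_random` and `max_iters` parameters, and the toy profiles are all assumptions made for the example.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two item sets (an assumed metric)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_knn(profiles, k, epsilon=1, n_random=1, seed=0, max_iters=50):
    """Greedy KNN sketch: refine a random graph via friends of friends."""
    rng = random.Random(seed)
    users = sorted(profiles)
    # Start from a random graph: k random neighbours per node (needs k < #users).
    graph = {u: set(rng.sample([v for v in users if v != u], k)) for u in users}
    for _ in range(max_iters):
        changes = 0
        new_graph = {}
        for u in users:
            # Candidates: current neighbours, (1) friends of friends,
            # and (2) a few random nodes.
            candidates = set(graph[u])
            for v in graph[u]:
                candidates |= graph[v]
            candidates |= set(rng.sample(users, n_random))
            candidates.discard(u)
            # Rank candidates by similarity and keep the k best.
            ranked = sorted(
                candidates,
                key=lambda v: (-jaccard(profiles[u], profiles[v]), v),
            )
            new_graph[u] = set(ranked[:k])
            changes += len(new_graph[u] - graph[u])
        graph = new_graph
        if changes < epsilon:  # "until #changes < epsilon"
            break
    return graph


# Hypothetical toy profiles: user -> set of rated items.
profiles = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b"},
    "carl": {"c", "d"},
    "dave": {"d", "e"},
    "xavier": {"a", "e"},
    "yann": {"b", "c"},
}
graph = greedy_knn(profiles, k=2)
```

Because the starting graph is random, the first iterations mostly compare unrelated nodes; the sketch makes no attempt to avoid those evaluations.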
Is Greed all there is? Observation 1: similarity remains the bottleneck, with 90% of execution time spent on similarity (Wikipedia dataset). Observation 2: datasets are (often) sparse. Many datasets use item-based profiles, and most items are shared by few users: the data is sparse (the curse of dimensionality).
The Problem with Sparsity: density = E / (U × I), where E = #ratings, U = #users, I = #items. In the illustration, density = 35%. There are only few rungs ("ratings") on the ladder: 2 random nodes are unlikely to share items.
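As a small worked example of the density formula, the sketch below computes E / (U × I) from a list of (user, item) ratings. The ratings are hypothetical and were chosen so that the result is 35%; they are not the data behind the slide's illustration.

```python
def density(ratings):
    """Density of a rating set: E / (U * I).

    ratings: iterable of (user, item) pairs; duplicates are counted once.
    """
    pairs = set(ratings)
    if not pairs:
        return 0.0
    users = {u for u, _ in pairs}
    items = {i for _, i in pairs}
    return len(pairs) / (len(users) * len(items))


# Hypothetical ratings: 7 ratings over 4 users and 5 items.
ratings = [
    ("u1", "i1"), ("u1", "i2"),
    ("u2", "i2"), ("u2", "i3"),
    ("u3", "i4"),
    ("u4", "i5"), ("u4", "i1"),
]
d = density(ratings)  # 7 / (4 * 5)
```

Here E = 7, U = 4 and I = 5, so the density is 7 / 20.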
The Problem with Sparsity: similarity metrics account for shared items, so two random nodes are unlikely to be close. Hence greedy processes are slow to start: relevant candidates are difficult to find, and many similarity evaluations are executed.
KIFF's Intuition: greedy KNN approaches assume no initial structure and start from a random graph. In practice, there is an underlying bipartite user/item graph that can be used to bootstrap the greedy process: use items to create Ranked Candidate Sets (RCS), with v ∈ RCS(u) iff items(u) ∩ items(v) ≠ ∅.
RCS Construction: from the bipartite user/item graph, build the item profiles (IP), then the Ranked Candidate Sets. Example: users Alice, Bob, Darth and Stormy; item profiles IP_chalet and IP_bank; user profiles items_Alice and items_Bob. RCS_Alice contains Bob with count 1, and RCS_Bob contains Alice with count 1. Unrelated users are never compared.
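The construction above can be sketched as follows in Python. This is an illustration under assumptions, not the authors' implementation: the function name `build_rcs`, the dictionary-based layout, and the items given to Darth and Stormy are hypothetical. The sketch stores both directions, matching the example where Bob appears in RCS_Alice and Alice in RCS_Bob.

```python
from collections import Counter, defaultdict


def build_rcs(profiles):
    """Build Ranked Candidate Sets from user profiles.

    profiles: dict user -> set of items.
    Returns dict user -> list of (candidate, shared item count),
    sorted by decreasing count.
    """
    # Item profiles: for each item, the set of users who rated it.
    item_profiles = defaultdict(set)
    for user, items in profiles.items():
        for item in items:
            item_profiles[item].add(user)
    # Count, for each user, how many items each co-rater shares with them.
    counts = {u: Counter() for u in profiles}
    for users in item_profiles.values():
        for u in users:
            for v in users:
                if u != v:
                    counts[u][v] += 1
    # Rank candidates by shared item count (ties broken by name).
    return {
        u: sorted(c.items(), key=lambda vc: (-vc[1], vc[0]))
        for u, c in counts.items()
    }


# Hypothetical profiles loosely following the slide's example.
profiles = {
    "alice": {"chalet", "bank"},
    "bob": {"bank"},
    "darth": {"star"},
    "stormy": {"star"},
}
rcs = build_rcs(profiles)
```

Users who share no item never appear in each other's candidate set, so no similarity is ever computed between them.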
KIFF iteration, illustrated for Alice: (1) take the current neighborhood (Carl, Bob, Dave); (2) take the top γ candidates in RCS_Alice, which is sorted by item count (Xavier 6, Yann 3, ...); (3) compute sim(Alice, ·) for all of them (Carl 0.9, Xavier 0.6, Yann 0.5, Bob 0.4, Dave 0.3) and keep the top k by sim; (4) the new neighborhood is Carl, Xavier, Yann. Repeat for all users until #changes < β. Indexing followed by "greedy" iteration: trivially parallelizable and highly local. Indexing: O(|E|). Iteration: O(|U| · |RCS|).
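A minimal Python sketch of this iteration follows. It reads the slide literally (stop when the total number of changes drops below β) and adds a stop when every candidate set is exhausted; both choices, along with the function names `kiff_step` and `kiff` and the data layout, are assumptions for illustration rather than the published algorithm's exact form.

```python
def kiff_step(neighborhood, candidates, sim_to, k):
    """One update for one user.

    neighborhood: dict neighbor -> similarity (current neighborhood).
    candidates: list of users (the next top-gamma entries of the RCS).
    sim_to: function mapping a candidate to its similarity with the user.
    Returns the new neighborhood and the number of new neighbors.
    """
    scored = dict(neighborhood)
    for v in candidates:
        if v not in scored:
            scored[v] = sim_to(v)
    # Keep the top k by similarity (ties broken by name).
    top = sorted(scored.items(), key=lambda vs: (-vs[1], vs[0]))[:k]
    new = dict(top)
    return new, len(set(new) - set(neighborhood))


def kiff(profiles, rcs, k, gamma, beta, sim):
    """Iterate over all users until #changes < beta or the RCSs run out."""
    graph = {u: {} for u in profiles}
    cursor = {u: 0 for u in profiles}  # position reached in each RCS
    while True:
        changes = 0
        for u in profiles:
            batch = [v for v, _ in rcs[u][cursor[u]:cursor[u] + gamma]]
            cursor[u] += len(batch)
            graph[u], c = kiff_step(
                graph[u], batch,
                lambda v, u=u: sim(profiles[u], profiles[v]), k,
            )
            changes += c
        exhausted = all(cursor[u] >= len(rcs[u]) for u in profiles)
        if changes < beta or exhausted:
            return graph


# The slide's example for Alice, with k = 3 and candidates Xavier and Yann.
sims = {"Carl": 0.9, "Bob": 0.4, "Dave": 0.3, "Xavier": 0.6, "Yann": 0.5}
current = {"Carl": 0.9, "Bob": 0.4, "Dave": 0.3}
new, changes = kiff_step(current, ["Xavier", "Yann"], sims.get, k=3)
```

On the slide's numbers, `new` holds Carl, Xavier and Yann, with two changes. Each user only reads their own candidate set and neighborhood, which is what makes the loop body easy to run in parallel.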
Evaluation: Datasets. Long-tail profile size distribution.
Evaluation: Metrics. Wall-clock computation time. Recall: the fraction of the ideal KNN neighborhood for user u that is found in the approximated KNN neighborhood for user u. Scan rate.
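The two quality metrics can be sketched in Python as below. Recall follows the usual definition (overlap between ideal and approximated neighborhoods, averaged over users). The slide does not define scan rate, so the version here, similarity evaluations divided by the number of user pairs, is an assumption, as are the function names and the toy data.

```python
def recall(ideal, approx):
    """Average over users of |ideal(u) & approx(u)| / |ideal(u)|.

    ideal, approx: dict user -> list of neighbors.
    Users with an empty ideal neighborhood are skipped.
    """
    per_user = [
        len(set(ideal[u]) & set(approx.get(u, ()))) / len(ideal[u])
        for u in ideal
        if ideal[u]
    ]
    return sum(per_user) / len(per_user) if per_user else 0.0


def scan_rate(evaluations, n_users):
    """Assumed definition: similarity evaluations / number of user pairs."""
    pairs = n_users * (n_users - 1) // 2
    return evaluations / pairs if pairs else 0.0


# Hypothetical neighborhoods for two users, with k = 2.
ideal = {"alice": ["bob", "carl"], "bob": ["alice", "carl"]}
approx = {"alice": ["bob", "dave"], "bob": ["alice", "carl"]}
r = recall(ideal, approx)            # (1/2 + 2/2) / 2
s = scan_rate(evaluations=3, n_users=4)  # 3 / 6
```

Under this reading, a brute-force construction has a scan rate of 1, and an approach that skips unrelated pairs has a scan rate below 1.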
Performance Details: much reduced scan rate.
Overall Performance, compared with NNDescent [DML11] and HyRec [BFGKP14]: faster (x14), better (+20%).
Execution time (Arxiv and Wikipedia datasets).
KIFF's Scan Rate (Arxiv dataset). KIFF: first iterations yield the highest gains.
Impact of RCS on Bootstrap: at iteration 0, candidates come from RCS_Alice, sorted by item count (Bob 8, Dave 7, ...).
Termination Criteria: repeat for all users until #changes < β. Vertical bars: RCS truncation imposed by KIFF. Termination only impacts a minority of users.
Effect of Density: scan rate grows with density, hurting performance.
Conclusion: a novel KNN graph construction algorithm. Intuition: reduce similarity computations. Faster (x14) and more accurate (+20%) than the state of the art. Performs best on sparse datasets.
Some References
[JMB09] M. Jelasity, A. Montresor, and O. Babaoglu, T-Man: Gossip-based fast overlay topology construction, Comput. Netw. 53, 13 (August 2009), pp. 2321-2339.
[VS05] S. Voulgaris and M. van Steen, Epidemic-style management of semantic overlays for content-based searching, in Euro-Par, 2005, pp. 1143-1152.
[DML11] W. Dong, C. Moses, and K. Li, Efficient k-nearest neighbor graph construction for generic similarity measures, in WWW, 2011, pp. 577-586.
[BFGKP14] A. Boutet, D. Frey, R. Guerraoui, A.-M. Kermarrec, and R. Patra, HyRec: Leveraging Browsers for Scalable Recommenders, in Middleware, 2014, pp. 85-96.