Sanny: Scalable Approximate Nearest Neighbors Search System Using Partial Nearest Neighbors Sets

Sanny: EC 1,a) 1,b) EC EC EC EC Sanny Sanny ( ) Sanny: Scalable Approximate Nearest Neighbors Search System Using Partial Nearest Neighbors Sets Yusuke Miyake 1,a) Ryosuke Matsumoto 1,b) Abstract: Building a recommendation system has become a big concern to solve the information overload problem of the electronic commerce sites. Implementing a feature selection system based on machine learning requires approximation of the nearest neighbor search for maintaining the availability and speed while allowing to reducing the accuracy and complexity since the feature sets consist of large and high dimensional vectors. In this report, we propose Sanny, a search engine by approximation of nearest neighbors and using the partial neighbor set. Sanny aggregates the neighbors of the low dimensional vectors using the property that the neighbors of high dimensional vectors obtained from the neural network overlap with the neighbors of partial vectors. Sanny searches the candidates generated by the aggregation, which results in reducing the total time of searching. We evaluate performance and accuracy comparison of Sanny with the existing methods. 1. EC EC 1 GMO Pepabo R&D Institute, GMO Pepabo, Inc., Tenjin, Chuo ku, Fukuoka 810-0001 Japan a) miyakey@pepabo.com b) matsumotory@pepabo.com [4], [16] c 2018 Information Processing Society of Japan 1

EC [20], [21] EC EC [1], [15], [19] [13] [10] ( ) EC [6] EC Sanny 2 EC 3 4 5 2. 2.1 EC EC EC [8] [2] [4], [16] [5], [9], [12] [11] [20], [21] TF-IDF Word2vec Doc2vec [14], [18] Word2vec EC c 2018 Information Processing Society of Japan 2

2.2 2.1 [22] [1], [15], [19] [13] [10] [17] EC 2.3 EC EC EC EC 3. 2 EC ( 1 ) ( 2 ) 2.1 c 2018 Information Processing Society of Japan 3

Word2vec 3.1 Y R D D = D/m S j (1 j m) 3.2 Y q R D q j (1 j m) j S j n NN(q j ) = arg min d(q j, s) (1) s S j d (1) NN(p j ) n N j N j N Y N Y N R D q NN(q) = arg min d(q, yn) (2) yn Y N Y N n m (2) 3.3 Sanny *1 OSS Sanny *1 https://github.com/monochromegane/sanny 1 Fig. 1 1 Sanny Process flow of Sanny. 3 ( 1 ) ( 2 ) ( 3 ) (1) (2) (3) Sanny (3) a b argmin a (a b) 2 = argmin a a 2 2ab + b 2 = argmin a a 2 2ab (3) b b 2 a 2 2ab a 2 CPU Single Instruction Multiple Data (SIMD) Sanny 4. 4.1 c 2018 Information Processing Society of Japan 4

6 4 10 30 5 Inception-v3[3] ImageNet 2048 EC 6 Word2vec Wikipedia *2 200 6 Fashion-MNIST *3 28x28 786 6 200 6 1 Inception-v3 Word2vec Fashion-MNIST Fashion-MNIST Inception-v3 Word2vec Inception-v3 Word2vec 95 100 2 3 *2 http://www.cl.ecei.tohoku.ac.jp/ m-suzuki/jawiki vector/ *3 https://github.com/zalandoresearch/fashion-mnist Table 1 1 Precision by proposed method. 1 2 3 4 Inception-v3 0.57 0.37 0.50 0.65 0.97 Word2vec 0.67 0.68 0.68 0.69 0.97 Fashion-MNIST 0.10 0.23 0.31 0.16 0.63 0.05 0.05 0.05 0.05 0.18 0.05 0.04 0.04 0.04 0.17 2 2048 Fig. 2 Precision-Number of neighborhoods tradeoff. 20 1000 100 0.1% 4.2 4.1 10 500 c 2018 Information Processing Society of Japan 5

2 Table 2 Parameters of nearest neighbors search. 3 200 Fig. 3 Precision-Number of neighborhoods tradeoff. Annoy[19] *4 NGT[15] *5 LSHForest[1] *6 2 LSHForest ann-benchmarks *7 3 4 X Y *4 https://github.com/spotify/annoy *5 https://github.com/yahoojapan/ngt *6 https://github.com/ekzhu/lsh *7 https://github.com/erikbern/ann-benchmarks Annoy Tree 100 400 SearchK 100 400000 NGT Edge 0 80 LSH Hash table size 20 Hash value size 20 Slot size 5.0 Sanny-Annoy Split 8 Top 20 100 Tree 100 400 SearchK 100 400000 Sanny-NGT Split 8 Top 20 100 Edge 0 80 Sanny-LSH Split 8 Top 20 100 Hash table size 20 Hash value size 20 Slot size 5.0 4 Fig. 4 Precision-Query per second tradeoff. 5. EC Sanny c 2018 Information Processing Society of Japan 6

[7] [1] Bawa, Mayank, Tyson Condie, and Prasanna Ganesan. LSH forest: self-tuning indexes for similarity search. Proceedings of the 14th international conference on World Wide Web. ACM, 2005. [2] Burke, Robin. Hybrid recommender systems: Survey and experiments. User modeling and user-adapted interaction 12.4 (2002): 331-370. [3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going Deeper with Convolutions, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015 [4] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, Zbigniew Wojna, Rethinking the Inception Architecture for Computer Vision, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818-2826, 2016 [5] K. Fukushima. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics, 36(4):93-202, 1980 [6] Galletta, Dennis F., et al. Web site delays: How tolerant are users?. Journal of the Association for Information Systems 5.1 (2004): 1. [7] Ge, Tiezheng, et al. Optimized product quantization for approximate nearest neighbor search. Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013. [8] Hyung Jun Ahn, A new similarity measure for collaborative filtering to alleviate the new user coldstarting problem, Information Sciences 178, pp. 37-51, 2008 [9] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell, DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Proc. ICML, 2014 [10] Jegou, Herve, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33.1 (2011): 117-128. [11] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems. 2012. [12] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradientbased learning applied to document recognition, Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998 [13] Muja, Marius, and David G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE transactions on pattern analysis and machine intelligence 36.11 (2014): 2227-2240. [14] Mikolov, Tomas, et al. Efficient estimation of word representations in vector space. arxiv preprint arxiv:1301.3781 (2013). [15] Kohei Sugawara, Hayato Kobayashi, and Masajiro Iwasaki, On Approximately Searching for Similar Word Embeddings, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics ACL 2016, pages 2265-2275. Association for Computational Linguistics, 2016. [16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vision 115, 3, pp. 211-252, December 2015 [17] Pearson, Karl. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2.11 (1901): 559-572. [18] Le, Quoc, and Tomas Mikolov. Distributed representations of sentences and documents. International Conference on Machine Learning. 2014. [19] Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Wenjie Zhang, Xuemin Lin, Approximate Nearest Neighbor Search on High Dimensional Data Experiments, Analyses, and Improvement (v1.0), In Proc. 2016 [20] S. Zhang, L. Yao, and A. Sun, Deep learning based recommender system: A survey and new perspectives, arxiv preprint arxiv:1707.07435, 2017. [21],,,,, IOT,2017-IOT-37(4),1-8 (2017-05-18), 2188-8787 [22].. (CVIM) 2009.13 (2009): 1-12. c 2018 Information Processing Society of Japan 7