Efficient and Scalable Friend Recommendations
Comparing Traditional and Graph-Processing Approaches

Nicholas Tietz
Software Engineer at GraphSQL
nicholas@graphsql.com
January 13, 2014

1. Introduction
2. Problem Description
3. RDBMS Approach
4. NoSQL Approach
5. GraphSQL Approach
6. Questions

Introduction
I'm Nicholas Tietz
- B.S. in Mathematics and Computer Science
- Software Engineer at GraphSQL (1+ years)
We're GraphSQL
- Founding team from Teradata, Twitter, Google, IBM, etc.
- Founded 1.5 years ago
- Working on the fastest, most scalable graph platform
- (We're hiring!)

Background - Graphs
- Graph: a collection of edges and vertices (a network)
- Big graphs contain:
  - over 100 million vertices
  - billions of edges
- Graphs provide clear insights into:
  - Recommendations
  - Fraud detection
  - Resource optimization
  - Churn analysis
- Difficult to process traditionally
  - Offerings are sparse, but improving

The Problem: Providing friend recommendations
- Goal:
  - Provide friend recommendations to all users
  - Must be fast and scalable
- Motivation:
  - Offered by many social services
  - Keeps users engaged
  - Drives business
- Extremely hard problem to solve:
  - Worked on at LinkedIn, Facebook, Twitter, etc.
  - Lots of money spent solving it
  - Lots of servers used

Requirements: Providing friend recommendations
- Provide 10 recommendations to each user
- Must be fast
  - sub-second required
  - under 0.1 seconds ideal
- Must support real-time updates
  - New users added constantly
  - Cannot do in a batch
- Must scale well
  - Needs to support hundreds of millions of users
- Must be good
  - Require reasonably high acceptance rate
  - Cannot just return random users
  - Friends-of-friends

Naive Algorithm
We'll use a simple friend-of-friends algorithm [1]:
1. Retrieve your friends of friends.
2. Rank by number of common neighbors.
3. Select the top 10 scores.
[1] We use a much more complicated algorithm in production.

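
The three steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the production algorithm: the function name `recommend`, the dictionary-of-sets graph representation, and the tie-breaking by user id are all assumptions made for the example.

```python
from collections import Counter

def recommend(friends, user, k=10):
    """Rank friends-of-friends of `user` by number of common friends.

    `friends` maps each user id to the set of that user's friends
    (hypothetical in-memory representation, assumed symmetric).
    """
    direct = friends.get(user, set())
    scores = Counter()
    # Step 1: walk every path user -> friend -> friend-of-friend.
    for f in direct:
        for fof in friends.get(f, set()):
            # Skip the user and people who are already friends.
            if fof != user and fof not in direct:
                # Step 2: each path contributes one common neighbor.
                scores[fof] += 1
    # Step 3: top k by score; ties broken by id to keep output deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

friends = {
    "a": {"b", "c"},
    "b": {"a", "d", "e"},
    "c": {"a", "d"},
    "d": {"b", "c"},
    "e": {"b"},
}
print(recommend(friends, "a"))  # [('d', 2), ('e', 1)]
```

Here "d" ranks first for "a" because it is reachable through both "b" and "c".
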
RDBMS - Schema
    CREATE TABLE friends (
        user_id   INTEGER,
        friend_id INTEGER
    );
(Assumption: if (a, b) are friends, then (b, a) are friends.)

RDBMS - Query
Query for naive algorithm.
    WITH my_friends AS (
        SELECT friend_id
        FROM friends f
        WHERE f.user_id = 679328
    )
    SELECT fof.friend_id AS recommended_id,
           count(*) AS common_friends
    FROM friends fof
    WHERE fof.user_id IN (SELECT * FROM my_friends)
      AND fof.friend_id != 679328
      AND fof.friend_id NOT IN (SELECT * FROM my_friends)
    GROUP BY recommended_id
    ORDER BY common_friends DESC
    LIMIT 10;
And the real algorithm used is much more complicated!

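
To try the query on a small data set, the sketch below runs an equivalent statement against an in-memory SQLite database from Python. The choice of SQLite, the helper name `recommend_sql`, the index on `user_id`, and the extra tie-break in `ORDER BY` are assumptions for illustration; the talk does not say which RDBMS or indexes were used.

```python
import sqlite3

QUERY = """
WITH my_friends AS (
    SELECT friend_id FROM friends WHERE user_id = :uid
)
SELECT fof.friend_id AS recommended_id, count(*) AS common_friends
FROM friends fof
WHERE fof.user_id IN (SELECT friend_id FROM my_friends)
  AND fof.friend_id != :uid
  AND fof.friend_id NOT IN (SELECT friend_id FROM my_friends)
GROUP BY fof.friend_id
ORDER BY common_friends DESC, recommended_id
LIMIT 10
"""

def recommend_sql(pairs, uid):
    """Load undirected friendship pairs and run the friend-of-friends query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE friends (user_id INTEGER, friend_id INTEGER)")
    # Store both directions, matching the schema's symmetry assumption.
    rows = [(a, b) for a, b in pairs] + [(b, a) for a, b in pairs]
    conn.executemany("INSERT INTO friends VALUES (?, ?)", rows)
    # Hypothetical index; every friend-list lookup goes through user_id.
    conn.execute("CREATE INDEX friends_by_user ON friends (user_id)")
    result = conn.execute(QUERY, {"uid": uid}).fetchall()
    conn.close()
    return result

pairs = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 4)]
print(recommend_sql(pairs, 1))  # [(4, 2), (5, 1)]
```
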
RDBMS - Problems
This approach has a few problems:
- Will not scale
  - Difficult-to-optimize multiway self-joins
  - Requires many thousands of index lookups
- Will not feel responsive to users
- Effective way to DoS your own DB

NoSQL Approach
- Replace RDBMS with HBase
- Python app server to:
  - retrieve friend lists from HBase
  - perform join logic and recommendations
- (Optional) Batch mode in Hadoop
- Does not solve the problem:
  - Still need to do many lookups
  - Not a natural programming model for the problem
  - Difficult to deal with hub nodes

GraphSQL - Design Overview
- Store data in a native graph format
  - Users are vertices
  - Friendships (or contacts) are edges
- REST server built into our stack supports this use case
  - Can modify the graph
  - Can call your functions
- Perform recommendations via quick graph-based computations
- (Optional) Batch pre-compute recommendations

GraphSQL - Algorithm in Graph Model
[Figure sequence (six slides) illustrating the algorithm in the graph model; images not available.]

GraphSQL - Graph Programming Model
- Analogous to Pregel + MapReduce
  - Iteration-based
  - Activation-based or whole-graph modes
- Each iteration has [2]:
  - EdgeMap: called on each outgoing edge for each active vertex
  - VertexReduce: called for each vertex which received messages
- Very efficient for many problems
  - Graph problems fit naturally
  - Database join problems fit easily
[2] Other specialized functions are available.

GraphSQL - Pseudocode (cont.)
EdgeMap:
    def edge_map(from_vertex, to_vertex):
        # Iteration 1 sends along every edge; iteration 2 sends only
        # to vertices whose value is still 0.
        if iteration == 1 or (iteration == 2 and to_vertex.value == 0):
            emit(to_vertex.id, from_vertex.value)
Reduce:
    def reduce(vertex, messages):
        score = sum(messages)
        if iteration == 1:
            set_value(vertex.id, score)
        elif iteration == 2:
            result_heap.add((vertex.id, score))

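
The pseudocode can be exercised with a single-process simulation of the iteration loop. This is only a sketch of the EdgeMap / VertexReduce pattern, not the GraphSQL engine or its API: the function `run_recommendation`, the `inbox` message buffer, and the initial vertex values (1 for the querying user, 0 for everyone else) are assumptions that the slides do not spell out.

```python
import heapq

def run_recommendation(adj, user, k=10):
    """Simulate the two-iteration EdgeMap / VertexReduce job on one machine.

    `adj` maps every vertex id to a list of its neighbors (symmetric).
    """
    value = {v: 0 for v in adj}  # per-vertex state, assumed to start at 0
    value[user] = 1              # assumption: the querying user starts at 1
    active = {user}
    results = []

    for iteration in (1, 2):
        inbox = {}
        # EdgeMap: called on each outgoing edge of each active vertex.
        for src in active:
            for dst in adj[src]:
                if iteration == 1 or (iteration == 2 and value[dst] == 0):
                    inbox.setdefault(dst, []).append(value[src])
        # VertexReduce: called for each vertex that received messages.
        for vid, messages in inbox.items():
            score = sum(messages)
            if iteration == 1:
                value[vid] = score
            else:
                results.append((vid, score))
        # Vertices that received messages are active in the next iteration.
        active = set(inbox)

    # Highest score first; ties broken by vertex id.
    return heapq.nsmallest(k, results, key=lambda r: (-r[1], r[0]))

adj = {1: [2, 3], 2: [1, 4, 5], 3: [1, 4], 4: [2, 3], 5: [2]}
print(run_recommendation(adj, 1))  # [(4, 2), (5, 1)]
```

Under these assumptions, iteration 1 marks the user's friends with value 1, and iteration 2 lets each friend send its value to every unmarked neighbor, so the summed messages equal the number of common friends.
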
GraphSQL - Implementation Process
Implementing is easy on our platform:
- Only need to write one class defining EdgeMap, Reduce, etc.
- REST API already exists
- Similar experience to Hadoop
- Currently: no public API or SDK, more on this later

GraphSQL - Industry Experience
We have this deployed for two companies.
- Requires fewer servers than the NoSQL approach
- Faster end-to-end response times
- Allows more sophisticated recommendation algorithms
- Easier to write and maintain

Moral of the Story
"I don't do friend recommendations, why do I care?"
Two reasons:
1. Networks are everywhere, and you have one in your data. Use the right tools for the right jobs.
2. Joins are everywhere. They are expensive to do, but with a graph platform you can have pre-computed, always-up-to-date joins.

Shameless Plug
We're hiring!
- Software engineer (data / log analysis, RDBMS, dashboard / data visualization)
- Systems software engineer (file systems, database storage, distributed systems)
- POC software engineer (algorithms background, work with customers)
We're looking for companies to work with!
- Develop proof-of-concept for you
- Helps us improve our offering
Contact us for more information!

Questions?
nicholas@graphsql.com
graphsql.com

Appendix A: NoSQL Pseudocode
    def get_recommendations(user_id):
        friends = hbase.get_row(user_id).as_set
        candidates = {}
        for f1 in friends:
            f1_friends = hbase.get_row(f1).as_set
            for f2 in f1_friends:
                # Count one common friend per path user -> f1 -> f2.
                if f2 in candidates:
                    candidates[f2] += 1
                else:
                    candidates[f2] = 1
        # Drop existing friends and the user from the candidates.
        for f1 in friends:
            candidates.pop(f1, None)
        candidates.pop(user_id, None)
        return get_top_10(candidates)

Appendix B: Hub Nodes (NoSQL)
- A hub node has high degree
  - Dangerous to traverse from
  - Difficult to join on
- No obvious way to avoid expanding hub nodes (in NoSQL)
- Storing degree information shifts the problem
  - How do you safely apply graph updates that change degree?
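
To make the hub-node cost concrete, the sketch below counts the work that one friends-of-friends request does in a lookup-per-row design like the one in Appendix A. The function `expansion_cost` and the toy graph are hypothetical; the numbers describe this toy graph only, not any measured system.

```python
def expansion_cost(friends, user):
    """Return (store_lookups, entries_scanned) for one request.

    One lookup fetches the user's own friend list, then one lookup per friend
    fetches that friend's list; every fetched list entry must be scanned.
    """
    direct = friends[user]
    lookups = 1 + len(direct)
    scanned = len(direct) + sum(len(friends[f]) for f in direct)
    return lookups, scanned

# Toy graph: "hub" has 1000 friends, each of whom knows only "hub".
friends = {"hub": {f"u{i}" for i in range(1000)}}
for i in range(1000):
    friends[f"u{i}"] = {"hub"}
# Two users who know only each other, for comparison.
friends["x"] = {"y"}
friends["y"] = {"x"}

print(expansion_cost(friends, "u0"))  # (2, 1001)
print(expansion_cost(friends, "x"))   # (2, 2)
```

Both requests issue two lookups, but the user with a single hub friend scans about 500 times as many entries, which is why traversing from a hub is dangerous even when the requesting user has low degree.
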