Efficient and Scalable Friend Recommendations

Efficient and Scalable Friend Recommendations
Comparing Traditional and Graph-Processing Approaches

Nicholas Tietz
Software Engineer at GraphSQL
nicholas@graphsql.com
January 13, 2014

1 Introduction
2 Problem Description
3 RDBMS Approach
4 NoSQL Approach
5 GraphSQL Approach
6 Questions

Nicholas Tietz (GraphSQL) Efficient and Scalable Friend Recommendations January 13, 2014 2 / 32

Introduction

- I'm Nicholas Tietz
  - B.S. in Mathematics and Computer Science
  - Software Engineer at GraphSQL (1+ years)
- We're GraphSQL
  - Founding team from Teradata, Twitter, Google, IBM, etc.
  - Founded 1.5 years ago
  - Working on the fastest, most scalable graph platform
  - (We're hiring!)

Background - Graphs

- Graph: a collection of edges and vertices (a network)
- Big graphs contain:
  - over 100 million vertices
  - billions of edges
- Graphs provide clear insights into:
  - Recommendations
  - Fraud detection
  - Resource optimization
  - Churn analysis
- Difficult to process with traditional tools
  - Graph-specific offerings are sparse, but improving

The Problem
Providing friend recommendations

- Goal:
  - Provide friend recommendations to all users
  - Must be fast and scalable
- Motivation:
  - Many social services
  - Keeps users engaged
  - Drives business
- Extremely hard problem to solve:
  - Worked on at LinkedIn, Facebook, Twitter, etc.
  - Lots of money spent solving
  - Lots of servers used

Requirements
Providing friend recommendations

- Provide 10 recommendations to each user
- Must be fast
  - sub-second required
  - under 0.1 seconds ideal
- Must support real-time updates
  - New users added constantly
  - Cannot do in a batch
- Must scale well
  - Needs to support hundreds of millions of users
- Must be good
  - Require reasonably high acceptance rate
  - Cannot just return random users
  - Friends-of-friends

Naive Algorithm

We'll use a simple friend-of-friends algorithm [1]:

1. Retrieve your friends of friends.
2. Rank by number of common neighbors.
3. Select the top 10 scores.

[1] We use a much more complicated algorithm in production.
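The three steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, assuming the whole graph fits in a dict that maps each user id to a set of friend ids; the names `recommend_friends` and `graph` are hypothetical and are not part of any GraphSQL API.

```python
from collections import Counter


def recommend_friends(graph, user, k=10):
    """Return up to k (candidate, common_friend_count) pairs for user.

    graph is assumed to map each user id to a set of friend ids, with
    every friendship stored in both directions.
    """
    friends = graph.get(user, set())
    common = Counter()
    # Step 1: walk from the user to friends, then to friends of friends.
    for friend in friends:
        for fof in graph.get(friend, set()):
            # Skip the user and anyone who is already a friend.
            if fof != user and fof not in friends:
                # Step 2: each path user -> friend -> fof is one common neighbor.
                common[fof] += 1
    # Step 3: keep the k highest scores (ties broken by id for determinism).
    return sorted(common.items(), key=lambda item: (-item[1], item[0]))[:k]


graph = {
    1: {2, 3},
    2: {1, 4, 5},
    3: {1, 4},
    4: {2, 3},
    5: {2},
}
print(recommend_friends(graph, 1))  # [(4, 2), (5, 1)]
```

User 4 ranks first because it shares two friends (2 and 3) with user 1, while user 5 shares only one.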

RDBMS - Schema

    CREATE TABLE friends (
        user_id   INTEGER,
        friend_id INTEGER
    );

(Assumption: friendships are stored in both directions, so if the row (a, b) exists, the row (b, a) exists as well.)

RDBMS - Query

Query for naive algorithm.

    WITH my_friends AS (
        SELECT friend_id
        FROM friends f
        WHERE f.user_id = 679328
    )
    SELECT fof.friend_id AS recommended_id,
           count(*) AS common_friends
    FROM friends fof
    WHERE fof.user_id IN (SELECT * FROM my_friends)
      AND fof.friend_id != 679328
      AND fof.friend_id NOT IN (SELECT * FROM my_friends)
    GROUP BY recommended_id
    ORDER BY common_friends DESC
    LIMIT 10;

And the real algorithm used is much more complicated!
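To try the query on a small data set, the following sketch runs it against an in-memory SQLite database through Python's standard `sqlite3` module. The choice of SQLite, the sample rows, the index, the `:uid` parameter, and the extra `recommended_id` tiebreak in `ORDER BY` are assumptions made for this illustration; the slides do not say which RDBMS or which indexes were used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (user_id INTEGER, friend_id INTEGER)")
# An index on user_id is an addition here; the slides do not show one.
conn.execute("CREATE INDEX friends_by_user ON friends (user_id)")

# Store each friendship in both directions, as the schema slide assumes.
pairs = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 4)]
conn.executemany(
    "INSERT INTO friends VALUES (?, ?)",
    pairs + [(b, a) for a, b in pairs],
)

QUERY = """
WITH my_friends AS (
    SELECT friend_id FROM friends f WHERE f.user_id = :uid
)
SELECT fof.friend_id AS recommended_id, count(*) AS common_friends
FROM friends fof
WHERE fof.user_id IN (SELECT * FROM my_friends)
  AND fof.friend_id != :uid
  AND fof.friend_id NOT IN (SELECT * FROM my_friends)
GROUP BY recommended_id
ORDER BY common_friends DESC, recommended_id
LIMIT 10
"""


def recommend(uid):
    """Run the friend-of-friends query for one user id."""
    return conn.execute(QUERY, {"uid": uid}).fetchall()


print(recommend(1))  # [(4, 2), (5, 1)]
```

Even on this toy table the query touches `friends` three times (once directly and twice through `my_friends`), which hints at the self-join cost on a table with billions of rows.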

RDBMS - Problems

This approach has a few problems:

- Will not scale
  - Difficult-to-optimize multiway self-joins
  - Requires many thousands of index lookups
- Will not feel responsive to users
- Effective way to DoS your own DB

NoSQL Approach

- Replace RDBMS with HBase
- Python app server to:
  - retrieve friend lists from HBase
  - perform join logic and recommendations
- (Optional) Batch mode in Hadoop
- Does not solve the problem:
  - Still need to do many lookups
  - Not a natural programming model for the problem
  - Difficult to deal with hub nodes

GraphSQL - Design Overview

- Store data in a native graph format
  - Users are vertices
  - Friendships (or contacts) are edges
- REST server built into our stack supports this use case
  - Can modify the graph
  - Can call your functions
- Perform recommendations via quick graph-based computations
- (Optional) Batch pre-compute recommendations

GraphSQL - Algorithm in Graph Model

[Slides 17-22: step-by-step figures of the algorithm in the graph model; images not preserved in this transcript.]

GraphSQL - Graph Programming Model

- Analogous to Pregel + MapReduce
  - Iteration-based
  - Activation-based or whole-graph modes
- Each iteration has [2]:
  - EdgeMap: called on each outgoing edge for each active vertex
  - VertexReduce: called for each vertex which received messages
- Very efficient for many problems
  - Graph problems fit naturally
  - Database join problems fit easily

[2] Other specialized functions are available.
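The iteration structure described above can be mimicked with a small single-machine loop. This is a sketch of the general idea only, not GraphSQL's engine or API: the function `run_iterations`, its callback signatures, and the hop-count example are all hypothetical, and the sketch assumes the activation-based mode in which a vertex becomes active when it receives a message.

```python
from collections import defaultdict


def run_iterations(graph, values, active, edge_map, vertex_reduce, iterations):
    """Run an activation-based EdgeMap / VertexReduce loop in memory.

    graph:  dict of vertex id -> iterable of out-neighbor ids
    values: dict of vertex id -> vertex value (updated in place)
    active: set of vertex ids that are active in the first iteration
    """
    for iteration in range(1, iterations + 1):
        inbox = defaultdict(list)
        # EdgeMap: called once per outgoing edge of every active vertex.
        for src in active:
            for dst in graph.get(src, ()):
                message = edge_map(iteration, src, dst, values)
                if message is not None:
                    inbox[dst].append(message)
        # VertexReduce: called once per vertex that received messages.
        for vertex, messages in inbox.items():
            values[vertex] = vertex_reduce(iteration, vertex, messages, values)
        # Assumption: receiving a message activates a vertex for the next round.
        active = set(inbox)
    return values


# Example use: hop counts from vertex "a" on a small directed graph.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
values = {v: None for v in graph}
values["a"] = 0


def hop_edge_map(iteration, src, dst, values):
    # Only send to vertices that have not been reached yet.
    return values[src] + 1 if values[dst] is None else None


def hop_reduce(iteration, vertex, messages, values):
    return min(messages)


hops = run_iterations(graph, values, {"a"}, hop_edge_map, hop_reduce, iterations=3)
print(hops)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2, 'e': 3}
```

Because only active vertices send messages, the work per iteration is proportional to the edges leaving the current frontier rather than to the size of the whole graph.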

GraphSQL - Pseudocode (cont.)

EdgeMap:

    def edge_map(from_vertex, to_vertex):
        # Iteration 1 sends along every edge; iteration 2 only reaches
        # vertices whose value is still 0.
        if iteration == 1 or (iteration == 2 and to_vertex.value == 0):
            emit(to_vertex.id, from_vertex.value)

Reduce:

    def reduce(vertex, messages):
        score = sum(messages)
        if iteration == 1:
            set_value(vertex.id, score)
        elif iteration == 2:
            result_heap.add((vertex.id, score))
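The pseudocode above can be simulated on one machine to check what the two iterations compute. The transcript does not show how vertex values are initialized, so this sketch assumes the requesting user starts with value 1 and is the only active vertex, while every other vertex starts with value 0. The function `recommend` and the sample graph are hypothetical, and vertex ids are assumed to be integers.

```python
import heapq


def recommend(graph, source, k=10):
    """Simulate the two-iteration EdgeMap / Reduce pseudocode in memory."""
    # Assumption: source starts at 1, all other vertices at 0.
    value = {v: 0 for v in graph}
    value[source] = 1
    active = {source}
    results = []

    for iteration in (1, 2):
        inbox = {}
        # EdgeMap over the outgoing edges of each active vertex.
        for src in active:
            for dst in graph[src]:
                if iteration == 1 or (iteration == 2 and value[dst] == 0):
                    inbox.setdefault(dst, []).append(value[src])
        # Reduce over every vertex that received messages.
        for vertex, messages in inbox.items():
            score = sum(messages)
            if iteration == 1:
                value[vertex] = score      # marks direct friends with 1
            else:
                results.append((vertex, score))  # score = common friends
        active = set(inbox)

    # Highest score first; ties go to the smaller vertex id.
    return heapq.nlargest(k, results, key=lambda r: (r[1], -r[0]))


graph = {
    1: {2, 3},
    2: {1, 4, 5},
    3: {1, 4},
    4: {2, 3},
    5: {2},
}
print(recommend(graph, 1))  # [(4, 2), (5, 1)]
```

Under these assumptions the `value == 0` check in iteration 2 does double duty: it filters out the user (value 1 from initialization) and existing friends (value 1 set in iteration 1), so only true friends-of-friends reach the result heap.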

GraphSQL - Implementation Process

Implementing is easy on our platform:

- Only need to write one class defining EdgeMap, Reduce, etc.
- REST API already exists
- Similar experience to Hadoop
- Current: no public API or SDK, more on this later

GraphSQL - Industry Experience

We have this deployed for two companies.

- Requires fewer servers than the NoSQL approach
- Faster end-to-end response times
- Allows more sophisticated recommendation algorithms
- Easier to write and maintain

Moral of the Story

I don't do friend recommendations, why do I care?

Two reasons:

1. Networks are everywhere, and you have one in your data. Use the right tools for the right jobs.
2. Joins are everywhere. They are expensive to do, but with a graph platform you can have pre-computed, always-up-to-date joins.

Shameless Plug

- We're hiring!
  - Software engineer (data / log analysis, RDBMS, dashboard / data visualization)
  - Systems software engineer (file systems, database storage, distributed systems)
  - POC software engineer (algorithms background, work with customers)
- We're looking for companies to work with!
  - Develop proof-of-concept for you
  - Helps us improve our offering
- Contact us for more information!

Questions?

nicholas@graphsql.com
graphsql.com

Appendix A: NoSQL Pseudocode

    def get_recommendations(user_id):
        friends = hbase.get_row(user_id).as_set
        candidates = {}
        for f1 in friends:
            # One extra HBase lookup per friend.
            f1_friends = hbase.get_row(f1).as_set
            for f2 in f1_friends:
                # Each path user -> f1 -> f2 counts one common friend.
                if f2 in candidates:
                    candidates[f2] += 1
                else:
                    candidates[f2] = 1
        # Exclude existing friends and the user.
        for f1 in friends:
            candidates.pop(f1, None)
        candidates.pop(user_id, None)
        return get_top_10(candidates)
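A runnable variant of the pseudocode above makes the lookup cost visible. The class `FakeFriendStore` is a hypothetical in-memory stand-in for the HBase table, not a real HBase client; it only counts how many row fetches a request would trigger.

```python
from collections import Counter


class FakeFriendStore:
    """Stand-in for an HBase table keyed by user id (not a real client)."""

    def __init__(self, rows):
        self.rows = rows
        self.lookups = 0

    def get_friends(self, user_id):
        self.lookups += 1  # each call models one round trip to the store
        return set(self.rows.get(user_id, ()))


def get_recommendations(store, user_id, k=10):
    friends = store.get_friends(user_id)
    candidates = Counter()
    for f1 in friends:
        for f2 in store.get_friends(f1):
            candidates[f2] += 1
    # Drop existing friends and the user before ranking.
    for seen in friends | {user_id}:
        candidates.pop(seen, None)
    # Highest count first; ties broken by id for determinism.
    return sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


store = FakeFriendStore({
    1: {2, 3},
    2: {1, 4, 5},
    3: {1, 4},
    4: {2, 3},
    5: {2},
})
print(get_recommendations(store, 1))  # [(4, 2), (5, 1)]
print(store.lookups)                  # 3
```

A single request costs one lookup for the user plus one per friend, so a user with a thousand friends triggers about a thousand round trips before any ranking happens.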

Appendix B: Hub Nodes (NoSQL)

- A hub node has high degree
  - Dangerous to traverse from
  - Difficult to join on
- No obvious way to avoid expanding hub nodes (in NoSQL)
- Storing degree information shifts the problem
  - How do you safely apply graph updates that change degree?
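A rough way to see why hubs hurt is to count the work that one friend-of-friends request performs. The sketch below is an illustration on synthetic data, assuming an in-memory adjacency dict; `expansion_cost` and the 1000-neighbor hub are hypothetical and are not measurements from any real deployment.

```python
def expansion_cost(graph, user):
    """Return (row_lookups, adjacency_entries_read) for one request."""
    friends = graph[user]
    lookups = 1 + len(friends)  # the user's row plus one row per friend
    entries = len(friends) + sum(len(graph[f]) for f in friends)
    return lookups, entries


# Synthetic graph: vertex 0 is a hub connected to users 1..1000,
# and users 1 and 2 are also friends with each other.
hub = 0
graph = {hub: set(range(1, 1001))}
for u in range(1, 1001):
    graph[u] = {hub}
graph[1].add(2)
graph[2].add(1)

print(expansion_cost(graph, 1))  # (3, 1004)
print(expansion_cost(graph, 3))  # (2, 1001)
```

User 3 has a single friend, yet the request still reads about a thousand adjacency entries because that friend is the hub. Skipping the hub would require knowing its degree before fetching its row, which is the bookkeeping problem the slide refers to.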