Apache Giraph for applications in Machine Learning & Recommendation Systems. Maria Stylianou, Novartis

Apache Giraph for applications in Machine Learning & Recommendation Systems Maria Stylianou @marsty5 Novartis Züri Machine Learning Meetup #5 June 16, 2014

Outline: (1) Machine Learning cases (2) Why Apache Giraph? (3) Walk-through example for Recommendation Systems (4) What's more?

Machine Learning cases: friend recommendation, fake account detection.

Machine Learning cases: product recommendation, online advertisements.

Machine Learning cases: route planning, delivery scheduling.

Machine Learning cases: graphs are everywhere. Graphs need processing!

Graphs ABC. Graph: a representation of a set of objects. V = vertices (nodes), E = edges (links). Graphs capture the relationships between objects. Graphs can be directed or undirected.
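To make the V/E definition concrete, here is a minimal Python sketch of a graph stored as an adjacency structure. The function name `build_adjacency` and the sample vertices are illustrative assumptions and do not come from the slides.

```python
def build_adjacency(vertices, edges, directed=True):
    """Return {vertex: set_of_neighbors} built from an edge list."""
    adjacency = {v: set() for v in vertices}
    for src, dst in edges:
        adjacency[src].add(dst)
        if not directed:
            # An undirected edge is stored in both directions.
            adjacency[dst].add(src)
    return adjacency


V = ["a", "b", "c"]            # vertices (nodes)
E = [("a", "b"), ("b", "c")]   # edges (links)

directed_graph = build_adjacency(V, E, directed=True)
undirected_graph = build_adjacency(V, E, directed=False)
```

In the directed version "b" only points to "c"; in the undirected version "b" is connected to both "a" and "c".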

Graphs need processing! So what?

Challenge #1: Scale of graphs. [...] indexes ~50B pages; [...] has ~1.1B users; [...] has ~570M users; [...] has ~530M users.

Challenge #2: Complexity of graphs. Compute shortest distance from google.com → multiple passes to compute the result → inherent dependencies make it hard to parallelize.

MapReduce: well established; efficient for big data analytics; not efficient with iterative algorithms (stateless). Graph algorithms are iterative.

Why Apache Giraph? It is explicitly designed for graph processing on top of the Hadoop ecosystem.

The story: Google Pregel (2010) → donated to ASF by Yahoo! (2011) → Apache Top Level Project (2012) → 1.0 release (2013) → 1.1 release (2014). Supported by: Facebook, Yahoo!, LinkedIn.

Giraph follows the Pregel model, which is based on Bulk Synchronous Parallel (BSP).

Thinking like a vertex: I am a vertex! How would I coordinate with other vertices to solve the problem?

Shortest Paths: I only know my value and who my neighbors are.

Receive messages → update value → send messages. Within a superstep, vertices compute independently of one another, in parallel.

Global synchronization: a synchronization barrier ends each superstep.

And again, and again: the receive, update, send cycle repeats, one superstep after another.
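The vertex-centric shortest-paths walk-through above can be sketched as a small single-machine simulation in Python. This is not the Giraph API: the names `bsp_shortest_paths`, `inbox` and `outbox`, the sample graph, and the choice to stop when no messages remain are illustrative assumptions meant only to show the receive, update, send cycle and the barrier between supersteps.

```python
import math


def bsp_shortest_paths(edges, source):
    """Simulate Pregel-style single-source shortest paths.

    edges: {vertex: [(neighbor, weight), ...]}; every vertex must be a key.
    Returns (distances, number_of_supersteps).
    """
    value = {v: math.inf for v in edges}
    # Superstep 0: only the source has a message (distance 0 to itself).
    inbox = {source: [0]}
    supersteps = 0
    while inbox:  # stop when no messages are in flight
        outbox = {}
        # Each vertex uses only its own value, its messages and its out-edges.
        for vertex, messages in inbox.items():
            best = min(messages)               # receive messages
            if best < value[vertex]:
                value[vertex] = best           # update value
                for neighbor, weight in edges[vertex]:
                    # send messages to neighbors
                    outbox.setdefault(neighbor, []).append(best + weight)
        # Synchronization barrier: messages sent now are read next superstep.
        inbox = outbox
        supersteps += 1
    return value, supersteps


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("d", 1)],
    "d": [],
}
distances, steps = bsp_shortest_paths(graph, "a")
print(distances, steps)
```

Note that vertex "c" first receives the distance 4 directly from "a" and only improves it to 3 one superstep later, when the message routed through "b" arrives. This is why the computation needs several passes.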

Giraph super powers: message-passing communication; in-memory computation → stateful; global synchronization; iterations → iterations → iterations.

Recommendation Systems

Collaborative Filtering: a recommendation systems technique.

Giraph for Recommendation Systems: Stochastic Gradient Descent algorithm (SGD)
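The slide diagrams for this walk-through did not survive in the transcript, so the following is only a hedged sketch of how SGD is commonly used for collaborative filtering: each user and each item gets a small latent vector, and every observed rating nudges the two vectors involved so that their dot product moves closer to the rating. The update rule, the hyperparameters, the sample ratings and all names (`sgd_factorize`, `rmse`) are assumptions for illustration, not taken from the talk or from any Giraph library. In a vertex-centric setting one would typically treat users and items as vertices and ratings as edges; this sketch runs sequentially on one machine.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rmse(ratings, user_vecs, item_vecs):
    """Root mean squared error of predictions on the given ratings."""
    total = sum((r - dot(user_vecs[u], item_vecs[i])) ** 2
                for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


def init_vectors(ratings, num_factors, seed=0):
    """Small random latent vectors for every user and item."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_vecs = {u: [rng.uniform(0.01, 0.1) for _ in range(num_factors)]
                 for u in users}
    item_vecs = {i: [rng.uniform(0.01, 0.1) for _ in range(num_factors)]
                 for i in items}
    return user_vecs, item_vecs


def sgd_factorize(ratings, user_vecs, item_vecs,
                  learning_rate=0.02, regularization=0.01, epochs=1000):
    """Update the latent vectors in place with per-rating SGD steps."""
    for _ in range(epochs):
        for u, i, r in ratings:
            p, q = user_vecs[u], item_vecs[i]
            error = r - dot(p, q)
            for k in range(len(p)):
                p_k, q_k = p[k], q[k]  # use the old values for both updates
                p[k] += learning_rate * (error * q_k - regularization * p_k)
                q[k] += learning_rate * (error * p_k - regularization * q_k)


# (user, item, rating) triples: hypothetical sample data
ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4),
           ("u2", "i3", 1), ("u3", "i2", 2), ("u3", "i3", 5)]

user_vecs, item_vecs = init_vectors(ratings, num_factors=2)
error_before = rmse(ratings, user_vecs, item_vecs)
sgd_factorize(ratings, user_vecs, item_vecs)
error_after = rmse(ratings, user_vecs, item_vecs)
print(error_before, error_after)
```

A missing rating, for example user "u1" on item "i3", can then be estimated as `dot(user_vecs["u1"], item_vecs["i3"])`.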

What's more: Okapi ML, the 1st advanced ML toolkit for Giraph. Available as open source. Code available at: https://github.com/grafos-ml-okapi Documentation: http://grafos.ml/okapi.html

The Okapi library
Collaborative filtering: Alternating Least Squares, Stochastic Gradient Descent, Singular Value Decomposition, Collaborative Less-is-More (CLiMF), Context-aware recommendation (TFMAP), Bayesian Personalized Ranking, Popularity Ranking
Clustering: Affinity propagation, K-means
Graph analytics: Clustering coefficient, Graph partitioning, K-Core, PageRank, Semi-clustering, Shortest distances, SybilRank, Triangle counting (and adding)

What's more: Giraph in Action, the 1st book for Giraph. First steps with Giraph; build applications; integrate with other tools; more! More details: http://manning.com/martella/
