Apache Giraph for applications in Machine Learning & Recommendation Systems. Maria Novartis

1 Apache Giraph for applications in Machine Learning & Recommendation Systems Maria Novartis Züri Machine Learning Meetup #5 June 16, 2014

3 Outline: Machine Learning Cases; Why Apache Giraph?; Walk-through example for Recommendation Systems; What's more?

4 Machine Learning cases: Friend Recommendation; Fake account detection

5 Machine Learning cases: Product recommendation; Online advertisements

6 Machine Learning cases: Route planning; Delivery scheduling

7 Machine Learning cases: Graphs are everywhere. Graphs need processing!

8 (Graphs ABC) Graph: a representation of a set of objects. V = Vertices (nodes), E = Edges (links). Graphs capture the relationships between objects. Graphs can be directed or undirected.
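
The definition on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `build_graph` and the sample edges are made up for the example and do not come from the talk. The sketch stores V and E as an adjacency list and shows how the same edge list yields a directed or an undirected graph.

```python
from collections import defaultdict


def build_graph(edges, directed=True):
    """Return an adjacency list {vertex: set(neighbors)} from (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v]  # touch v so vertices without outgoing edges still appear in V
        if not directed:
            adjacency[v].add(u)  # undirected: store the edge in both directions
    return dict(adjacency)


# Hypothetical sample data: three objects and the relationships between them.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
directed = build_graph(edges, directed=True)
undirected = build_graph(edges, directed=False)
```

In the directed graph, vertex "c" has no outgoing neighbors; in the undirected one it is linked back to both "a" and "b".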

9 Graphs need processing! So what?

10 Challenge #1: Scale of graphs. [logo] indexes ~50B pages; [logo] has ~1.1B users; [logo] has ~570M users; [logo] has ~530M users

11 Challenge #2: Complexity of graphs. Compute shortest distance from google.com → Multiple passes to compute the result → Inherent dependencies make it hard to parallelize

12 MapReduce: Well established. Efficient for big data analytics. Not efficient with iterative algorithms (stateless). Graph algorithms are iterative.

13 Why Apache Giraph? Explicitly designed for graph processing on top of the Hadoop ecosystem

14 The story: Google Pregel (2010); Donated to ASF by Yahoo! (2011); Apache Top Level Project (2012); 1.0 release (2013); 1.1 release (2014). Supported by: Facebook, Yahoo!, LinkedIn

15 Giraph follows the Pregel model, which is based on Bulk Synchronous Parallel (BSP)

16 Thinking like a vertex: I am a vertex! How would I coordinate with other vertices to solve the problem?

17 Shortest Paths: I only know my value and who my neighbors are

18 Receive messages → Update value → Send messages. Vertices compute asynchronously

19 Global Synchronization: Synchronization barrier

20 And again

21 And again
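
The shortest-paths walk-through on slides 17 to 21 can be simulated on a single machine. The sketch below is plain Python, not Giraph's Java API, and the function name `sssp_supersteps` and the sample graph are assumptions made for illustration. Each vertex knows only its own value and its outgoing edges; in every superstep it receives messages, updates its value, and sends messages, and swapping the inbox for the outbox plays the role of the synchronization barrier.

```python
import math


def sssp_supersteps(edges, source):
    """Single-source shortest paths, computed superstep by superstep.

    edges: {vertex: [(neighbor, weight), ...]}, with every vertex present as a key.
    Returns (distances, number_of_supersteps).
    """
    value = {v: math.inf for v in edges}           # each vertex's own state
    inbox = {v: [] for v in edges}
    inbox[source].append(0.0)                      # seed message for the source
    supersteps = 0

    while any(inbox.values()):                     # stop when no messages are in flight
        outbox = {v: [] for v in edges}
        for v, messages in inbox.items():          # conceptually runs in parallel
            if not messages:
                continue                           # no messages: vertex stays idle
            candidate = min(messages)              # 1. receive messages
            if candidate < value[v]:
                value[v] = candidate               # 2. update value
                for neighbor, weight in edges[v]:  # 3. send messages
                    outbox[neighbor].append(candidate + weight)
        inbox = outbox                             # synchronization barrier
        supersteps += 1
    return value, supersteps


# Hypothetical weighted, directed graph used only for this example.
graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0), ("d", 6.0)],
    "c": [("d", 1.0)],
    "d": [],
}
distances, steps = sssp_supersteps(graph, "a")
```

The "And again" slides correspond to further passes through the `while` loop: the computation ends once a superstep produces no new messages.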

22 Giraph super powers: Message-passing communication; In-memory computation → stateful; Global synchronization; Iterations → Iterations → Iterations

23 Recommendation Systems

24 Collaborative Filtering: a recommendation systems technique. June 16, 2014, Apache Giraph for applications in Machine Learning, Maria Stylianou
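
The slide does not spell out a model, so the following is an assumption: a common way to cast collaborative filtering as an optimization problem is matrix factorization. Here $r_{ui}$ is the observed rating of item $i$ by user $u$, $\mathcal{K}$ is the set of observed (user, item) pairs, $p_u$ and $q_i$ are latent vectors for users and items, and $\lambda$ is a regularization weight. All of these symbols are introduced for illustration only.

```latex
\hat{r}_{ui} = p_u^{\top} q_i, \qquad
\min_{P,\,Q} \sum_{(u,i) \in \mathcal{K}}
  \left[ \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right]
```

Viewed as a graph, users and items are the vertices of a bipartite graph and each observed rating is an edge, which is what makes the problem a natural fit for a vertex-centric system.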

25 Giraph for Recommendation Systems: Stochastic Gradient Descent algorithm (SGD)
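
A minimal single-machine sketch of SGD for recommendations follows, assuming the usual matrix-factorization setup in which every user and every item holds a latent vector and a rating is predicted as their dot product. It is not the Giraph or Okapi implementation; the function names, hyperparameters, and sample ratings are all hypothetical. For each observed rating the error `err = r - p_u . q_i` is computed, and then both vectors move along the gradient: `p_u += lr * (err * q_i - reg * p_u)` and `q_i += lr * (err * p_u - reg * q_i)`.

```python
import random


def predict(p_u, q_i):
    """Predicted rating: dot product of a user vector and an item vector."""
    return sum(a * b for a, b in zip(p_u, q_i))


def rmse(ratings, p, q):
    """Root mean squared error over the observed ratings."""
    squared = [(r - predict(p[u], q[i])) ** 2 for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5


def init_factors(ratings, num_factors, seed=0):
    """Give every user and item vertex a small random latent vector."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.uniform(0.0, 0.5) for _ in range(num_factors)] for u in users}
    q = {i: [rng.uniform(0.0, 0.5) for _ in range(num_factors)] for i in items}
    return p, q


def sgd_epoch(ratings, p, q, learning_rate=0.02, reg=0.02):
    """One pass over all ratings (edges), updating both endpoint vectors in place."""
    for u, i, r in ratings:
        err = r - predict(p[u], q[i])
        old_p = list(p[u])  # keep the old user vector for the item update
        p[u] = [pf + learning_rate * (err * qf - reg * pf)
                for pf, qf in zip(old_p, q[i])]
        q[i] = [qf + learning_rate * (err * pf - reg * qf)
                for pf, qf in zip(old_p, q[i])]


# Hypothetical (user, item, rating) edges of a small bipartite graph.
ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0),
    ("u2", "i1", 4.0), ("u2", "i3", 1.0),
    ("u3", "i2", 2.0), ("u3", "i3", 5.0),
]
p, q = init_factors(ratings, num_factors=2)
initial_rmse = rmse(ratings, p, q)
for _ in range(200):
    sgd_epoch(ratings, p, q)
final_rmse = rmse(ratings, p, q)
```

In a vertex-centric version, each user or item vertex would keep its own vector as its value and exchange it with its neighbors through messages, so that one superstep updates one side of the bipartite graph.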

31 What's more: Okapi ML. The 1st advanced ML toolkit for Giraph. Available as open source. Code available at: Documentation:

32 The Okapi library. Collaborative filtering: Alternating Least Squares, Stochastic Gradient Descent, Singular Value Decomposition, Collaborative Less-is-More (CLiMF), Context-aware recom. (TFMAP), Bayesian Personalized Ranking, Popularity Ranking. Clustering: Affinity propagation, Kmeans. Graph analytics: Clustering coefficient, Graph partitioning, K-Core, PageRank, Semi-clustering, Shortest distances, SybilRank, and adding Triangle counting

33 What's more: Giraph in Action. The 1st book for Giraph: First steps with Giraph; Build applications; Integrate with other tools; More! More details:



More information

Scaled Machine Learning at Matroid

Scaled Machine Learning at Matroid Scaled Machine Learning at Matroid Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Machine Learning Pipeline Learning Algorithm Replicate model Data Trained Model Serve Model Repeat entire pipeline Scaling

More information

Similarity Ranking in Large- Scale Bipartite Graphs

Similarity Ranking in Large- Scale Bipartite Graphs Similarity Ranking in Large- Scale Bipartite Graphs Alessandro Epasto Brown University - 20 th March 2014 1 Joint work with J. Feldman, S. Lattanzi, S. Leonardi, V. Mirrokni [WWW, 2014] 2 AdWords Ads Ads

More information

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs Alessandro Epasto J. Feldman*, S. Lattanzi*, S. Leonardi, V. Mirrokni*. *Google Research Sapienza U. Rome Motivation Recommendation

More information

CS 5220: Parallel Graph Algorithms. David Bindel

CS 5220: Parallel Graph Algorithms. David Bindel CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :

More information

The Stratosphere Platform for Big Data Analytics

The Stratosphere Platform for Big Data Analytics The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

Praynaa Rawlani. at the. August 2014 Fseovevber 20H L4-RARIES. Department of Electrical Engineering and Computer Science August 22, 2014

Praynaa Rawlani. at the. August 2014 Fseovevber 20H L4-RARIES. Department of Electrical Engineering and Computer Science August 22, 2014 Graph Analytics on Relational Databases by Praynaa Rawlani S.B., Electrical Engineering and Computer Science, MIT (2013) Submitted to the Department of Electrical Engineering and Computer Science in Partial

More information

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing /34 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

More information

King Abdullah University of Science and Technology. CS348: Cloud Computing. Large-Scale Graph Processing

King Abdullah University of Science and Technology. CS348: Cloud Computing. Large-Scale Graph Processing King Abdullah University of Science and Technology CS348: Cloud Computing Large-Scale Graph Processing Zuhair Khayyat 10/March/2013 The Importance of Graphs A graph is a mathematical structure that represents

More information

Graph and Link Mining

Graph and Link Mining Graph and Link Mining Graphs - Basics A graph is a powerful abstraction for modeling entities and their pairwise relationships. G = (V,E) Set of nodes V = v,, v 5 Set of edges E = { v, v 2, v 4, v 5 }

More information

MapReduce and Friends

MapReduce and Friends MapReduce and Friends Craig C. Douglas University of Wyoming with thanks to Mookwon Seo Why was it invented? MapReduce is a mergesort for large distributed memory computers. It was the basis for a web

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

High Performance Data Analytics: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster

High Performance Data Analytics: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster High Performance Data Analytics: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster Summary Open source analytic frameworks, such as those in the Apache

More information

Lecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa

Lecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa Instructors: Parth Shah, Riju Pahwa Lecture 2 Notes Outline 1. Neural Networks The Big Idea Architecture SGD and Backpropagation 2. Convolutional Neural Networks Intuition Architecture 3. Recurrent Neural

More information

Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web

Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web Graph Algorithms using Map-Reduce Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web Graph Algorithms using Map-Reduce Graphs are ubiquitous in modern society. Some

More information