GraphLab: A New Framework for Parallel Machine Learning


1 GraphLab: A New Framework for Parallel Machine Learning Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein Presented by Guozhang Wang DB Lunch, Nov. 8, 2010

2 Overview Programming ML Algorithms in Parallel Common Parallelism and MapReduce Global Synchronization Barriers GraphLab Data Dependency as a Graph Synchronization as Fold/Reduce Implementation and Experiments From Multicore to Distributed Environment

3 Parallel Processing for ML Parallel ML is a Necessity 13 Million Wikipedia Pages 3.6 Billion photos on Flickr etc Parallel ML is Hard to Program Concurrency vs. Deadlock Load Balancing Debugging etc

4 MapReduce is the Solution? High-level abstraction: Statistical Query Model [Chu et al, 2006] Weighted Linear Regression: only sufficient statistics $\Theta = A^{-1} b$, $A = \sum_i w_i (x_i x_i^T)$, $b = \sum_i w_i (x_i y_i)$
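The sufficient-statistics idea on this slide can be sketched in a few lines. This is only an illustration, assuming a scalar input with an intercept term (so A is 2x2) and hypothetical helper names (`map_partition`, `combine`, `solve_2x2`, `fit`) that do not come from the talk: each mapper computes partial sums for A and b on its own partition, the reducer adds them, and Θ is solved once at the end.

```python
from functools import reduce

def map_partition(rows):
    """Map step: partial sufficient statistics (A, b) for one data partition.

    Each row is (w, x, y) with a scalar x; the feature vector is [1, x].
    """
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for w, x, y in rows:
        f = (1.0, x)
        for i in range(2):
            b[i] += w * f[i] * y
            for j in range(2):
                a[i][j] += w * f[i] * f[j]
    return a, b

def combine(s, t):
    """Reduce step: sufficient statistics simply add across partitions."""
    (a1, b1), (a2, b2) = s, t
    a = [[a1[i][j] + a2[i][j] for j in range(2)] for i in range(2)]
    b = [b1[i] + b2[i] for i in range(2)]
    return a, b

def solve_2x2(a, b):
    """Solve A * theta = b by Cramer's rule (2x2 systems only)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]

def fit(partitions):
    a, b = reduce(combine, map(map_partition, partitions))
    return solve_2x2(a, b)

# Two partitions of weighted points that lie exactly on y = 1 + 2x.
partitions = [
    [(1.0, 0.0, 1.0), (2.0, 1.0, 3.0)],
    [(1.0, 2.0, 5.0), (0.5, 3.0, 7.0)],
]
theta = fit(partitions)  # [intercept, slope]
```

Only the small (A, b) pairs travel between mappers and the reducer, never the raw data, which is why this kind of algorithm fits MapReduce well.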

5 MapReduce is the Solution? High-level abstraction: Statistical Query Model [Chu et al, 2006] K-Means: only data assignments; class mean $= \mathrm{avg}(x_i)$, $x_i$ in class Embarrassingly Parallel: independent computation, No Communication needed
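A minimal sketch of K-means in this style, assuming one-dimensional points and hypothetical function names (`assign`, `kmeans_iteration`, `kmeans`) chosen for illustration: the assignment of each point is independent of every other point (the map side), and the new class means are plain averages (the reduce side).

```python
def assign(point, means):
    """Map step: index of the nearest class mean, independent per point."""
    return min(range(len(means)), key=lambda k: (point - means[k]) ** 2)

def kmeans_iteration(points, means):
    """One iteration: map the assignments, then reduce to new class means."""
    sums = [0.0] * len(means)
    counts = [0] * len(means)
    for x in points:  # each assignment could run on a separate mapper
        k = assign(x, means)
        sums[k] += x
        counts[k] += 1
    # Reduce: class mean = avg(x_i) over the points in the class.
    # An empty class keeps its previous mean.
    return [sums[k] / counts[k] if counts[k] else means[k]
            for k in range(len(means))]

def kmeans(points, means, max_iters=100):
    for _ in range(max_iters):
        new_means = kmeans_iteration(points, means)
        if new_means == means:  # converged: assignments no longer change
            break
        means = new_means
    return means

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_means = kmeans(points, [0.0, 5.0])
```

Every pass through the loop in `kmeans` has to wait for all assignments to finish before the means can be recomputed, which is the global synchronization point that iterative MapReduce jobs pay for.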

6 ML in MapReduce Single Reducer Multiple Mapper Iterative MapReduce needs global synchronization at the single reducer K-means EM for graphical models gradient descent algorithms, etc

7 Not always Embarrassingly Parallel Data Dependency: not MapReducable Gibbs Sampling Belief Propagation SVM etc Capture Dependency as a Graph!

8 Overview Programming ML Algorithms in Parallel Common Parallelism and MapReduce Global Synchronization Barriers GraphLab Data Dependency as a Graph Synchronization as Fold/Reduce Implementation and Experiments From Multicore to Distributed Environment

9 Key Idea of GraphLab Sparse Data Dependencies Local Computations [Figure: dependency graph over variables X1 to X9]

10 GraphLab for ML High-level Abstraction Express data dependencies Iterative Automatic Multicore Parallelism Data Synchronization Consistency Scheduling

11 Main Components of GraphLab Data Graph Shared Data Table GraphLab Model Scheduling Update Functions and Scopes

12 Data Graph A Graph with data associated with every vertex and edge. $x_3$: Sample value $C(X_3)$: sample counts $\Phi(X_6, X_9)$: Binary potential [Figure: data graph over vertices X1 to X11]

13 Update Functions Operations applied on a vertex that transform data in the scope of the vertex Gibbs Update: - Read samples on adjacent vertices - Read edge potentials - Compute a new sample for the current vertex
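The Gibbs update described above can be sketched against a toy data graph. Everything here is an assumption made for illustration, not GraphLab's actual C++ API: the `DataGraph` class, its field names, binary-valued variables, and symmetric edge potentials stored as 2x2 tables. The update function touches only the scope of one vertex: its own data, its adjacent edges, and its neighbors' samples.

```python
import random

class DataGraph:
    """Hypothetical data graph: data attached to every vertex and edge."""

    def __init__(self):
        self.vertex_data = {}
        self.edge_data = {}   # keyed by frozenset({u, v})
        self.neighbors = {}

    def add_vertex(self, v, sample):
        # Each vertex stores its current sample and per-value sample counts.
        self.vertex_data[v] = {"sample": sample, "counts": [0, 0]}
        self.neighbors[v] = []

    def add_edge(self, u, v, potential):
        # potential[a][b] is the binary potential; assumed symmetric.
        self.edge_data[frozenset((u, v))] = potential
        self.neighbors[u].append(v)
        self.neighbors[v].append(u)

def gibbs_update(graph, v, rng):
    """Resample vertex v conditioned on the samples of adjacent vertices."""
    weights = [1.0, 1.0]
    for u in graph.neighbors[v]:
        phi = graph.edge_data[frozenset((u, v))]   # read edge potential
        x_u = graph.vertex_data[u]["sample"]       # read adjacent sample
        for value in (0, 1):
            weights[value] *= phi[value][x_u]
    p_one = weights[1] / (weights[0] + weights[1])
    new_sample = 1 if rng.random() < p_one else 0
    data = graph.vertex_data[v]
    data["sample"] = new_sample                    # write own vertex data
    data["counts"][new_sample] += 1
    return new_sample

# A three-vertex chain a - b - c whose potentials favor agreement.
graph = DataGraph()
for name in ("a", "b", "c"):
    graph.add_vertex(name, sample=0)
agree = [[2.0, 1.0], [1.0, 2.0]]
graph.add_edge("a", "b", agree)
graph.add_edge("b", "c", agree)

rng = random.Random(0)
for _ in range(100):
    for v in ("a", "b", "c"):
        gibbs_update(graph, v, rng)
```

This loop is sequential; the point of the abstraction is that a runtime can run many such updates at once as long as their scopes do not conflict.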

14 Scope Rules Consistency vs. Parallelism Belief Propagation: Only uses edge data Gibbs Sampling: Needs to read adjacent vertices
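One way to picture the trade-off is to ask which pairs of vertices may be updated at the same time. The sketch below assumes three consistency levels named "vertex", "edge", and "full"; these names and the `conflicts` helper are assumptions for illustration, since the slide does not spell the models out. The stronger the model, the more of the scope is protected and the fewer updates can overlap.

```python
def conflicts(neighbors, u, v, model):
    """True if update functions on u and v must not run at the same time."""
    if u == v:
        return True
    if model == "vertex":   # only the vertex's own data is protected
        return False
    adjacent = v in neighbors[u]
    if model == "edge":     # own data plus adjacent edges are protected
        return adjacent
    if model == "full":     # the whole scope, adjacent vertices included
        shared = set(neighbors[u]) & set(neighbors[v])
        return adjacent or bool(shared)
    raise ValueError(f"unknown consistency model: {model}")

# Chain a - b - c - d
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

parallel_pairs = {
    model: [(u, v) for u in neighbors for v in neighbors
            if u < v and not conflicts(neighbors, u, v, model)]
    for model in ("vertex", "edge", "full")
}
```

On this four-vertex chain the weakest model allows all six pairs to run together, the "edge" model allows three, and the "full" model only the pair (a, d).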

15 Scheduling Scheduler determines the order of Update Function evaluations Static Scheduling Round Robin, etc Dynamic Scheduling FIFO, Priority Queue, etc
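The difference between the two scheduling families can be sketched as follows. This is a sequential illustration with hypothetical names (`run_round_robin`, `run_fifo`, `make_update`), not GraphLab's scheduler interface: the static schedule visits every vertex in a fixed order, while the dynamic FIFO schedule lets an update function ask for its neighbors to be scheduled again only when its own data changed. The example update propagates the minimum label along a chain.

```python
from collections import deque

def run_round_robin(vertices, update, sweeps):
    """Static schedule: visit every vertex in a fixed order, sweep by sweep."""
    calls = 0
    for _ in range(sweeps):
        for v in vertices:
            update(v)
            calls += 1
    return calls

def run_fifo(initial, update):
    """Dynamic schedule: update(v) returns the vertices to schedule next."""
    queue = deque(initial)
    queued = set(initial)
    calls = 0
    while queue:
        v = queue.popleft()
        queued.discard(v)
        calls += 1
        for u in update(v):
            if u not in queued:   # avoid duplicate entries in the queue
                queue.append(u)
                queued.add(u)
    return calls

def make_update(neighbors, labels):
    """Update function: take the minimum label in the vertex's scope."""
    def update(v):
        best = min([labels[v]] + [labels[u] for u in neighbors[v]])
        if best < labels[v]:
            labels[v] = best
            return list(neighbors[v])   # data changed: reschedule neighbors
        return []                       # nothing changed: schedule nothing
    return update

neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

fifo_labels = {v: v for v in neighbors}
fifo_calls = run_fifo(list(neighbors), make_update(neighbors, fifo_labels))

rr_labels = {v: v for v in neighbors}
rr_calls = run_round_robin(list(neighbors), make_update(neighbors, rr_labels), sweeps=3)
```

The FIFO run stops by itself once no update changes anything, whereas the round-robin run needs the number of sweeps chosen in advance. A priority queue would follow the same pattern with `heapq` in place of `deque`.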

16 Dynamic Scheduling [Figure: vertices a to k scheduled dynamically across CPU 1 and CPU 2]

17 Global Information Shared Data Table in Shared Memory Model parameters (updatable) Sufficient statistics (updatable) Constants, etc (fixed) Sync Functions for Updatable Shared Data Accumulate performs an aggregation over vertices Apply makes a final modification to the accumulated data

18 Sync Functions Much like Fold/Reduce Execute Accumulate over every vertex in turn Execute Apply once at the end Can be called Periodically when update functions are active (asynchronous) or By the update function or user code (synchronous)
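The fold-then-apply pattern can be sketched with `functools.reduce`. The `sync` helper, its parameter names, and the dictionary standing in for the Shared Data Table are assumptions for illustration only: Accumulate is folded over the vertex data one vertex at a time, and Apply runs once on the result before it is stored.

```python
from functools import reduce

def sync(vertex_data, accumulate, apply, initial):
    """Fold accumulate over every vertex in turn, then run apply once."""
    total = reduce(accumulate, vertex_data.values(), initial)
    return apply(total)

vertex_data = {"x1": 2.0, "x2": 4.0, "x3": 9.0}
shared_data = {}  # stand-in for the Shared Data Table

# Example: keep the mean vertex value as an updatable shared entry.
shared_data["mean"] = sync(
    vertex_data,
    accumulate=lambda acc, x: (acc[0] + x, acc[1] + 1),  # running (sum, count)
    apply=lambda acc: acc[0] / acc[1],                   # final modification
    initial=(0.0, 0),
)
```

Calling `sync` from user code after the updates finish corresponds to the synchronous case; the asynchronous case would run the same fold periodically while updates are still in flight.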

19 GraphLab Data Graph Shared Data Table GraphLab Model Scheduling Update Functions and Scopes

20 Overview Programming ML Algorithms in Parallel Common Parallelism and MapReduce Global Synchronization Barriers GraphLab Data Dependency as a Graph Synchronization as Fold/Reduce Implementation and Experiments From Multicore to Distributed Environment

21 Implementation and Experiments Shared Memory Implementation in C++ using Pthreads Applications: Belief Propagation Gibbs Sampling CoEM Lasso etc (more on the project page)

22 Parallel Performance [Figure: speedup vs. number of CPUs (higher is better); curves: Optimal, Colored Schedule, Round robin schedule]


23 From Multicore to Distributed Environment MapReduce and GraphLab work well for Multicores Simple High-level Abstraction Local computation + global synchronization When Migrating to Clusters Rethink Scope synchronization Rethink Shared Data single reducer Think Load Balancing Maybe rethink the abstract model?

24 Thanks
