Case Study 4: Collaborative Filtering
Graph-Parallel Problems; Synchronous vs. Asynchronous Computation
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, February 20th, 2014

ML in the Context of Parallel Architectures

GPUs, multicore, clusters, clouds, supercomputers. But scalable ML in these systems is hard, especially in terms of:
1. Programmability
2. Data distribution
3. Failures
Move Towards a Higher-Level Abstraction

Distributed computing challenges are hard and annoying!
1. Programmability
2. Data distribution
3. Failures

High-level abstractions try to simplify distributed programming by hiding these challenges: they provide different levels of robustness to failures, optimize data movement and communication, and protect against race conditions. Generally, you are still on your own with respect to designing parallel algorithms.

Some common parallel abstractions:
Lower-level:
- Pthreads: abstraction for threads on a single machine
- MPI: abstraction for distributed communication in a cluster of computers
Higher-level:
- MapReduce (Hadoop is the open-source version): mostly data-parallel problems
- GraphLab: for graph-structured distributed problems

Simplest Type of Parallelism: Data-Parallel Problems

You have already learned a classifier. What's the test error? You have 10B labeled documents and 1000 machines. Problems that can be broken into independent subproblems are called data-parallel (or embarrassingly parallel). MapReduce is a great tool for this, and it is the focus of today's lecture. But first, a simple example.
Data Parallelism (MapReduce)

[Figure: a large collection of independent subproblems split across CPUs 1-4.] Solve a huge number of independent subproblems, e.g., extract features in images.

MapReduce Abstraction

Map:
- Data-parallel over elements, e.g., documents
- Generates (key, value) pairs; the value can be any data type
Reduce:
- Aggregates the values for each key
- Must be a commutative-associative operation
- Data-parallel over keys
- Generates (key, value) pairs

MapReduce has a long history in functional programming, but was popularized by Google, and subsequently by the open-source Hadoop implementation from Yahoo!
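The abstraction above can be sketched in a few lines of single-machine Python. The function names (`word_count_map`, `map_reduce`, etc.) are illustrative, not part of any framework's API; word count is the classic example, with addition as the commutative-associative reduce:

```python
from collections import defaultdict
from functools import reduce

# Map: data-parallel over documents -> emits (key, value) pairs.
def word_count_map(doc):
    return [(word, 1) for word in doc.split()]

# Reduce: aggregate values per key with a commutative-associative op (+).
def word_count_reduce(key, values):
    return (key, reduce(lambda a, b: a + b, values))

def map_reduce(docs, mapper, reducer):
    # Shuffle: group values by key (in a cluster, key k goes to machine h[k]).
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

counts = map_reduce(["the cat sat", "the hat"], word_count_map, word_count_reduce)
```

Because `+` is commutative and associative, the reducers can run in any order on any machine and still produce the same counts.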
MapReduce Execution Overview

Map phase: the big data is split across machines M1...M1000, and each mapper emits (k, v) pairs.
Shuffle phase: each tuple (k_i, v_i) is assigned to machine h[k_i], where h hashes keys to machines.
Reduce phase: each machine aggregates the values for the keys it owns.

Issues with the MapReduce Abstraction
- Often all data gets moved around the cluster; very bad for iterative settings
- The definition of the Map & Reduce functions can be unintuitive in many apps
- Graphs are challenging
- Computation is synchronous
SGD for Matrix Factorization in MapReduce?

\[
\begin{bmatrix} L_u^{(t+1)} \\ R_v^{(t+1)} \end{bmatrix}
\leftarrow
\begin{bmatrix}
(1 - \eta_t \lambda_u)\, L_u^{(t)} - \eta_t \epsilon_t R_v^{(t)} \\
(1 - \eta_t \lambda_v)\, R_v^{(t)} - \eta_t \epsilon_t L_u^{(t)}
\end{bmatrix},
\qquad
\epsilon_t = L_u^{(t)} \cdot R_v^{(t)} - r_{uv}
\]

What would the Map and Reduce functions be? MapReduce is data-parallel over all mappers, and data-parallel over reducers with the same key. Here, however, we make one update at a time: each observed rating r_uv touches both L_u and R_v, so consecutive updates are not independent.

Matrix Factorization as a Graph

[Figure: bipartite graph of users and movies (Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita), with edges labeled by ratings such as 4, 3, 2, 5.]
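The update above translates directly into Python. This is a sketch: the function name and the choice of a single shared regularizer lam are illustrative, not from the lecture.

```python
import numpy as np

def sgd_mf_step(L_u, R_v, r_uv, eta, lam):
    """One SGD update for a single observed rating r_uv.

    L_u, R_v: latent factor vectors for user u and movie v.
    eta: step size eta_t; lam: regularization strength (shared here for brevity).
    """
    eps = L_u @ R_v - r_uv                          # prediction error epsilon_t
    L_u_new = (1 - eta * lam) * L_u - eta * eps * R_v
    R_v_new = (1 - eta * lam) * R_v - eta * eps * L_u
    return L_u_new, R_v_new
```

Note that both new vectors are computed from the *old* L_u and R_v, and that two ratings sharing a user or a movie write to the same vector; this write conflict is exactly why the update does not decompose into independent map tasks.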
Flashback to 1998: Why? The first Google advantage was a graph algorithm & a system to support it!

Graphs are central to social media, science, advertising, and the web. Graphs encode the relationships between people, facts, products, interests, and ideas. And they are big: 100 billion vertices and edges, with rich metadata.
- Facebook (10/2012): 1B users, 144B friendships
- Twitter (2011): 15B follower edges
Facebook Graph
[Slide from a Facebook Engineering presentation.]

Label a Face and Propagate
Pairwise Similarity Is Not Enough
[Figure: two face photos that are not similar enough to be sure.]

Propagate Similarities & Co-occurrences for Accurate Predictions
[Figure: similarity edges plus co-occurring faces provide further evidence.]
Example: Estimate Political Bias
[Figure: a social graph with vertices labeled liberal or conservative.]

Topic Modeling (e.g., LDA)
[Figure: a bipartite graph linking documents to words such as cat, apple, growth, hat, plant.]
ML Tasks Beyond Data-Parallelism

Data-parallel (MapReduce):
- Feature extraction
- Cross validation
- Computing sufficient statistics

Graph-parallel:
- Graphical models: Gibbs sampling, belief propagation, variational optimization
- Collaborative filtering: tensor factorization
- Semi-supervised learning: label propagation, CoEM
- Graph analysis: PageRank, triangle counting

Example of a Graph-Parallel Algorithm
PageRank

What's the rank of this user? It depends on the rank of those who follow her, which depends on the rank of those who follow them... There are loops in the graph, so we must iterate!

PageRank Iteration

\[
R[i] = \alpha + (1 - \alpha) \sum_{(j,i) \in E} w_{ji}\, R[j]
\]

where α is the random reset probability and w_{ji} is the probability of transitioning (similarity) from j to i.
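The synchronous version of this iteration can be sketched as follows; the function name and graph encoding are illustrative. Every vertex is recomputed each round from the previous round's ranks:

```python
def pagerank(out_edges, alpha=0.15, iters=50):
    """Synchronous (Jacobi-style) PageRank on a weighted directed graph.

    out_edges: {j: [(i, w_ji), ...]}, where the w_ji out of each j sum to 1
    and every vertex appears as a key.
    Implements R[i] = alpha + (1 - alpha) * sum_{(j,i) in E} w_ji * R[j].
    """
    R = {v: 1.0 for v in out_edges}
    for _ in range(iters):
        R_new = {v: alpha for v in out_edges}
        for j, edges in out_edges.items():
            for i, w_ji in edges:
                R_new[i] += (1 - alpha) * w_ji * R[j]  # reads OLD R: synchronous
        R = R_new
    return R
```

On a 3-cycle a -> b -> c -> a with unit weights, every vertex settles at rank 1.0, the fixed point of the equation above.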
Properties of Graph-Parallel Algorithms
- Dependency graph
- Local updates (my rank depends on my friends' ranks)
- Iterative computation

Addressing Graph-Parallel ML

Data-parallel tasks (feature extraction, cross validation, computing sufficient statistics) map well onto MapReduce. Graph-parallel tasks (graphical models: Gibbs sampling, belief propagation, variational optimization; semi-supervised learning: label propagation, CoEM; collaborative filtering: tensor factorization; data mining: PageRank, triangle counting) call for a graph-parallel abstraction.
Graph Computation: Synchronous vs. Asynchronous

Bulk Synchronous Parallel (BSP) model [Valiant '90], implemented by Pregel (Giraph): repeat compute, communicate, barrier.
MapReduce Execution Overview (repeated for comparison): split the data across machines, map, shuffle each tuple (k_i, v_i) to machine h[k_i], reduce.

BSP Execution Overview

Compute phase: the big graph is split across machines M1...M1000, and each machine updates its vertices.
Communicate phase: a message (vid, vid, val) is sent for every edge that crosses machines; all machines then synchronize at a barrier before the next compute phase.
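The compute/communicate/barrier loop can be sketched on a single machine; `bsp_run` and its vertex-program signature are illustrative, not the actual Pregel API.

```python
def bsp_run(vertices, edges, vertex_program, max_supersteps=30):
    """Minimal single-machine BSP (Pregel-style) sketch.

    vertices: {vid: state}; edges: {vid: [neighbor vids]}.
    vertex_program(state, messages) -> (new_state, outgoing_message).
    """
    inbox = {v: [] for v in vertices}
    for _ in range(max_supersteps):
        outbox = {v: [] for v in vertices}
        # Compute phase: every vertex runs on messages from the PREVIOUS superstep.
        for v in vertices:
            vertices[v], msg = vertex_program(vertices[v], inbox[v])
            # Communicate phase: send the message along every out-edge.
            for nbr in edges.get(v, []):
                outbox[nbr].append(msg)
        # Barrier: the next superstep sees only this round's messages.
        inbox = outbox
    return vertices
```

As a toy vertex program, `lambda s, msgs: (max([s] + msgs), max([s] + msgs))` floods the maximum state through a connected graph, one hop per superstep, which makes the synchronous round structure easy to see.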
The bulk synchronous parallel model is provably inefficient for some ML tasks.

Analyzing Belief Propagation [Gonzalez, Low, Guestrin '09]

[Figure: a chain from A to B; the important influence flows along the chain.] Smart scheduling with a priority queue focuses computation on the vertices that matter most right now.
Asynchronous Belief Propagation

Challenge = boundaries. [Figure: a synthetic noisy image and its graphical model; the cumulative vertex-update map shows many updates near region boundaries and few updates elsewhere.] The algorithm identifies and focuses on the hidden sequential structure.

BSP ML Problem: Synchronous Algorithms Can Be Inefficient

[Figure: runtime in seconds (0-10000) vs. number of CPUs (1-8); bulk synchronous BP (e.g., Pregel) is dramatically slower than asynchronous Splash BP.]

Theorem: Bulk synchronous BP is O(#vertices) slower than asynchronous BP.
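The priority-queue scheduling idea can be illustrated on PageRank rather than BP, which keeps the sketch short. This is a single-machine, residual-prioritized (Gauss-Seidel-style) update schedule in the spirit of the slide, not the Splash BP algorithm itself; all names are illustrative.

```python
import heapq

def async_pagerank(out_edges, alpha=0.15, tol=1e-6):
    """Asynchronous PageRank: update the vertex with the largest pending
    change ("residual") first, instead of sweeping every vertex each round.

    out_edges: {j: [(i, w_ji), ...]}; every vertex must appear as a key.
    """
    in_edges = {v: [] for v in out_edges}
    for j, nbrs in out_edges.items():
        for i, w in nbrs:
            in_edges[i].append((j, w))
    R = {v: 1.0 for v in out_edges}
    residual = {v: float("inf") for v in out_edges}   # everyone starts dirty
    heap = [(-residual[v], v) for v in out_edges]     # max-heap via negation
    heapq.heapify(heap)
    while heap:
        neg_res, i = heapq.heappop(heap)
        if -neg_res != residual[i] or residual[i] < tol:
            continue                                  # stale or converged entry
        new = alpha + (1 - alpha) * sum(w * R[j] for j, w in in_edges[i])
        delta = abs(new - R[i])
        R[i], residual[i] = new, 0.0
        if delta >= tol:
            # Updating i changes the pending residual of i's out-neighbors.
            for k, w in out_edges[i]:
                residual[k] += (1 - alpha) * w * delta
                heapq.heappush(heap, (-residual[k], k))
    return R
```

Each update reads the freshest neighbor values, and work automatically concentrates where ranks are still changing, mirroring how Splash BP focuses on the boundaries in the figure above.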
Synchronous vs. Asynchronous

Bulk synchronous processing:
- Computation proceeds in phases; all vertices participate in each phase (though it is OK to no-op), and all messages are sent
- Simpler to build, like MapReduce: no worries about race conditions, since the barrier guarantees data consistency
- Simpler to make fault-tolerant: save data at the barrier
- Slower convergence for many ML problems
- In matrix-land, called Jacobi iteration
- Implemented by Google's Pregel (2010)

Asynchronous processing:
- Vertices see the latest information from their neighbors; most closely related to sequential execution
- Harder to build: race conditions can happen all the time, so you must protect against them; fault tolerance is more complex (when are you done?); and you must implement a scheduler over vertices
- Faster convergence for many ML problems
- In matrix-land, called Gauss-Seidel iteration
- Implemented by GraphLab (2010, 2012)

Acknowledgements

Slides based on Carlos Guestrin's GraphLab talk.
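The Jacobi/Gauss-Seidel distinction mentioned above is easy to see for a linear system Ax = b; this illustrative sketch is not from the lecture. Jacobi recomputes every coordinate from the old iterate (like a BSP superstep), while Gauss-Seidel immediately uses each freshly updated coordinate (like asynchronous execution):

```python
import numpy as np

def jacobi_step(A, b, x):
    """Synchronous: every x[i] is recomputed from the OLD iterate x."""
    D = np.diag(A)                       # diagonal entries of A
    return (b - (A - np.diag(D)) @ x) / D

def gauss_seidel_step(A, b, x):
    """Asynchronous flavor: each x[i] immediately sees the freshest values."""
    x = x.copy()
    for i in range(len(b)):
        x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x
```

For diagonally dominant A both sweeps converge to the solution of Ax = b, but Gauss-Seidel typically needs fewer sweeps, which is the matrix-land version of the faster convergence claimed for asynchronous ML.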