More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
1 More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server Q. Ho, J. Cipar, H. Cui, J.K. Kim, S. Lee, *P.B. Gibbons, G.A. Gibson, G.R. Ganger, E.P. Xing Carnegie Mellon University *Intel Labs
2 Distributed ML: one machine to many Setting: have an iterative, parallel ML algorithm E.g. optimization, MCMC algorithms For topic models, regression, matrix factorization, SVMs, DNNs, etc. Critical updates executed on one machine, in parallel Worker threads share global model parameters θ via RAM for (t = 1 to T) { dothings() parallelupdate(x,θ) dootherthings() } Parallelize over worker threads; share global model parameters θ via RAM
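The single-machine pattern above can be sketched in runnable form. This is an illustrative toy (the names, the lock, and the update rule are mine, not from the talk): worker threads share the global parameters theta via RAM and apply additive updates in parallel.

```python
# Toy, runnable version of the loop on this slide: worker threads share the
# global model parameters theta in RAM and update them in parallel.
import threading

theta = [0.0] * 8             # global model parameters, shared via RAM
lock = threading.Lock()       # serializes the += in this toy example

def worker(data_shard, num_iters):
    for t in range(num_iters):            # for (t = 1 to T)
        for x in data_shard:
            delta = 0.1 * x               # stand-in for parallelupdate(x, theta)
            with lock:
                theta[x % len(theta)] += delta

threads = [threading.Thread(target=worker, args=([i, i + 1], 3)) for i in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

With a real algorithm, delta would come from a gradient step or a sampler; the point is only that every thread reads and writes the same theta through shared memory.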
3 Distributed ML: one machine to many Want: scale up by distributing the ML algorithm Must now share parameters over a network Seems like a simple task Many distributed tools available, so just pick one and go? [Figure: single machine with multiple threads vs. multiple machines communicating over network switches]
4 Distributed ML Challenges Not quite that easy Two distributed challenges: (1) networks are slow (low bandwidth, high delay); (2) identical machines rarely perform equally (unequal performance)
5 Networks are (relatively) slow Low network bandwidth: 0.1-1 GB/s (inter-machine) vs 10 GB/s (CPU-RAM) Fewer parameters transmitted per second High network latency (messaging time): 10,000-100,000 ns (inter-machine) vs 100 ns (CPU-RAM) Wait much longer to receive parameters
6 Networks are (relatively) slow Parallel ML requires frequent synchronization Exchange 10-1000K scalars per second, per thread Parameters not shared quickly enough → communication bottleneck Significant bottleneck over a network!
7 Networks are (relatively) slow [Chart: time breakdown, compute time vs network waiting time, for BSP; LDA on 32 machines (256 cores), 10% data per iter] For a clean setting with full control over machines and full network capacity Real clusters with many users have even worse network:compute ratios!
8 Machines don't perform equally Even when configured identically Variety of reasons: Vibrating hard drive Background programs; part of a distributed filesystem Other users Machine is a VM/cloud service Occasional, random slowdowns in different machines
9 Consequence: Scaling up ML is hard! Going from 1 to N machines: Naïve implementations rarely yield N-fold speedup Slower convergence due to machine slowdowns, network bottlenecks If not careful, even worse than a single machine! Algorithm diverges due to errors from slowdowns!
10 Existing general-purpose scalable ML Theory-oriented: Focus on algorithm correctness/convergence Examples: Cyclic fixed-delay schemes (Langford et al., Agarwal & Duchi) Single-machine asynchronous (Niu et al.) Naively-parallel SGD (Zinkevich et al.) Partitioned SGD (Gemulla et al.) May oversimplify systems issues e.g. need machines to perform consistently e.g. need lots of synchronization e.g. or even try not to communicate at all Systems-oriented: Focus on high iteration throughput Examples: MapReduce: Hadoop and Mahout Spark Graph-based: GraphLab, Pregel May oversimplify ML issues e.g. assume algorithms just work in distributed setting, without proof e.g. must convert programs to new programming model; nontrivial effort
11 Existing general-purpose scalable ML Theory-oriented Focus on algorithm correctness/convergence Examples: Cyclic fixed-delay schemes (Langford et al., Agarwal & Duchi) Single-machine asynchronous (Niu et al.) Naively-parallel SGD (Zinkevich et al.) Partitioned SGD (Gemulla et al.) May oversimplify systems issues e.g. need machines to perform consistently e.g. need lots of synchronization e.g. or even try not to communicate at all Systems-oriented Focus on high iteration throughput Examples: MapReduce: Hadoop and Mahout Spark Graph-based: GraphLab, Pregel May oversimplify ML issues e.g. assume algorithms just work in distributed setting, without proof e.g. must convert programs to new programming model; nontrivial effort Can we take both sides into account?
12 Middle of the road approach Want: ML algorithms converge quickly under imperfect systems conditions e.g. slow network performance e.g. random machine slowdowns Parameters are not communicated consistently Existing work: mostly use one of two communication models Bulk Synchronous Parallel (BSP) Asynchronous (Async) First, understand pros and cons of BSP and Async
13 Bulk Synchronous Parallel [Figure: threads advancing through iterations, meeting at synchronization barriers where parameters are read/updated] Threads synchronize (wait for each other) every iteration Threads all on same iteration # Parameters read/updated at synchronization barriers
14 The cost of synchronicity (a) Machines perform unequally (b) Algorithmic workload imbalanced So threads must wait for each other End-of-iteration sync gets longer with larger clusters (due to slow network)
15 The cost of synchronicity Threads must wait for each other End-of-iteration sync gets longer with larger clusters Precious computing time wasted!
16 Asynchronous Parameters read/updated at any time Threads proceed to next iteration without waiting Threads not on same iteration #
17 Slowdowns and Async Machine suddenly slows down (hard drive, background process, etc.) Causing iteration difference between threads Leading to error in parameters Difference in iterations → parameter error
18 Async worst-case situation Large clusters have arbitrarily large slowdowns! Machines become inaccessible for extended periods Difference in iterations → parameter error Error becomes unbounded!
19 What we really want Partial synchronicity: Spread network comms evenly (don't sync unless needed) Threads usually shouldn't wait, but mustn't drift too far apart! Straggler tolerance: Slow threads must somehow catch up Is there a middle ground between BSP and Async?
20 That middle ground Partial synchronicity: Spread network comms evenly (don't sync unless needed) Threads usually shouldn't wait, but mustn't drift too far apart! Straggler tolerance: Slow threads must somehow catch up [Figure: force threads to sync up; make a slow thread catch up]
21 That middle ground How do we realize this? [Figure: force threads to sync up; make a slow thread catch up]
22 Stale Synchronous Parallel [Figure: staleness threshold 3; Thread 1 waits until Thread 2 has reached iter 4] Note: x-axis is now iteration count, not time! Allow threads to usually run at own pace Fastest/slowest threads not allowed to drift >S iterations apart Threads cache local (stale) versions of the parameters, to reduce network syncing
23 Stale Synchronous Parallel A thread at iter T sees all parameter updates before iter T-S; updates within the staleness window may not be seen (possible error) Protocol: check cache first; if too old, get latest version from network Consequence: fast threads must check network every iteration Slow threads only check every S iterations → fewer network accesses, so they catch up!
24 SSP provides best-of-both-worlds SSP combines best properties of BSP and Async BSP-like convergence guarantees: Threads cannot drift more than S iterations apart Every thread sees all updates before iteration T-S Asynchronous-like speed: Threads usually don't wait (unless there is drift) Slower threads read from network less often, thus catching up SSP is a spectrum of choices: Can be fully synchronous (S = 0) or very asynchronous (S → ∞) Or just take the middle ground, and benefit from both!
25 Why does SSP converge? Instead of x_true, SSP sees x_stale = x_true + error The error caused by staleness is bounded Over many iterations, average error goes to zero
26 Why does SSP converge? SSP approximates sequential execution Compare actual update order to ideal sequential execution
27 Why does SSP converge? SSP approximates sequential execution Possible error windows for this update: SSP may lose up to S iterations of updates to the left...
28 Why does SSP converge? SSP approximates sequential execution Possible error windows for this update: ...as well as gain up to S iterations of updates to the right
29 Why does SSP converge? SSP approximates sequential execution Error window: 2S-1 = 5 iters (for staleness S = 3) Thus, at most 2S-1 iterations of erroneous updates Hence numeric error in parameters is also bounded Partial, but bounded, loss of serializability
30 Convergence Theorem Want: minimize convex f(x) = Σ_t f_t(x) (Example: Stochastic Gradient) Components f_t are L-Lipschitz, problem diameter bounded by F² Staleness s, using P threads across all machines Use step size η_t = σ/√t with σ = F/(L√(2(s+1)P))
31 Convergence Theorem Want: minimize convex f(x) = Σ_t f_t(x) (Example: Stochastic Gradient) Components f_t are L-Lipschitz, problem diameter bounded by F² Staleness s, using P threads across all machines Use step size η_t = σ/√t with σ = F/(L√(2(s+1)P)) Then the difference between the SSP estimate and the true optimum vanishes as the number of iterations T grows Note: the RHS bound contains both (L, F) and (s, P): the interaction between theory and systems parameters
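The equations lost in transcription can be reconstructed; the following is my reading of the SGD-under-SSP result in the accompanying NIPS 2013 paper (Ho et al., Theorem 1), not text recovered from the slide itself:

```latex
% Setting: minimize convex f(x) = \sum_{t=1}^{T} f_t(x), each component f_t
% L-Lipschitz, problem diameter bounded by F^2, staleness s, P threads.
\[
  \eta_t = \frac{\sigma}{\sqrt{t}}, \qquad
  \sigma = \frac{F}{L\sqrt{2(s+1)P}}
\]
% Regret of the SSP iterates \tilde{x}_t against the true optimum x^*:
\[
  R[X] \;:=\; \sum_{t=1}^{T} f_t(\tilde{x}_t) - f(x^\ast)
  \;\le\; 4FL\sqrt{2(s+1)P\,T},
  \qquad\text{so}\qquad
  \frac{R[X]}{T} \;=\; O\!\left(\sqrt{\frac{(s+1)P}{T}}\right)
  \;\longrightarrow\; 0 .
\]
```

This matches the slide's remark: the bound couples the theory parameters (L, F) with the systems parameters (s, P), and degrades gracefully as staleness or thread count grows.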
32 SSP solves Distributed ML challenges SSP is a synchronization model for fast and correct distributed ML For abelian parameter updates of the form θ_new = θ_old + Δ SSP reduces network traffic: Threads use a stale local cache whenever possible Addresses slow network and occasional machine slowdowns
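The abelian (additive) requirement is what makes stale application of updates safe: increments commute, so the order in which delayed updates arrive does not change the final parameter. A toy check, with invented delta values:

```python
# Additive updates theta_new = theta_old + delta commute: every arrival order
# of the workers' deltas yields the same final parameter value.
import itertools

updates = [0.5, -1.25, 2.0, 0.75]     # deltas from different workers (invented)

finals = set()
for order in itertools.permutations(updates):
    theta = 1.0                        # starting value of the parameter
    for delta in order:
        theta += delta                 # theta_new = theta_old + delta
    finals.add(theta)                  # values chosen to be exact in binary

# all 24 application orders agree on the final value
```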
33 SSP + Parameter Server = Easy Distributed ML We implement SSP as a parameter server (PS), called SSPTable Provides all machines with convenient access to global model parameters Can be run on multiple machines → reduces load per machine SSPTable allows easy conversion of single-machine parallel ML algorithms Distributed shared memory programming style No need for complicated message passing Replace local memory access with PS access (cf. Ahmed et al. (WSDM 2012), Power and Li (OSDI 2010)) Single Machine Parallel: UpdateVar(i) { old = y[i]; delta = f(old); y[i] += delta } Distributed with SSPTable: UpdateVar(i) { old = PS.read(y,i); delta = f(old); PS.inc(y,i,delta) }
34 SSPTable Programming Easy, table-based programming: just 3 commands! No message passing, barriers, locks, etc. read_row(table,row,s): Retrieve a table row with staleness s inc(table,row,col,value): Increment table's (row,col) by value clock(): Inform PS that this thread is advancing to the next iteration
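A sketch of how a worker loop might exercise these three commands. The Python binding below (class name PS, in-process dict tables, keyword arguments) is hypothetical; the real SSPTable client is not shown in the talk.

```python
# Hypothetical in-process stand-in for an SSPTable client, exposing the three
# commands described above: read_row, inc, and clock.
class PS:
    def __init__(self):
        self.tables = {}       # table -> row -> {col: value}
        self.clock_val = 0     # this thread's iteration counter

    def read_row(self, table, row, s):
        # A real client would return a cached copy no staler than s clocks,
        # fetching from the server only when the cache is too old.
        return self.tables.setdefault(table, {}).setdefault(row, {})

    def inc(self, table, row, col, value):
        r = self.tables.setdefault(table, {}).setdefault(row, {})
        r[col] = r.get(col, 0.0) + value    # additive (abelian) update

    def clock(self):
        self.clock_val += 1    # tell the PS we advance to the next iteration

ps = PS()
for it in range(3):                         # a worker's iterations
    w = ps.read_row("weights", 0, s=2)      # read with staleness bound s
    ps.inc("weights", 0, col=7, value=0.5)  # increment one entry
    ps.clock()
```

Note how little changes relative to shared-memory code: reads and writes to a local array become read_row/inc calls, plus one clock() per iteration.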
35 SSPTable Programming Just put global parameters in SSPTable! Examples: Topic Modeling (MCMC): topic-word table Matrix Factorization (SGD): factor matrices L, R Lasso Regression (CD): coefficients β SSPTable supports generic classes of algorithms, with these models as examples
36 SSPTable uses networks efficiently [Chart: time breakdown, compute time vs network waiting time, for BSP; LDA on 32 machines (256 cores), 10% data per iter]
37 SSPTable uses networks efficiently [Chart: time breakdown, compute time vs network waiting time, BSP vs increasing staleness; LDA on 32 machines (256 cores), 10% data per iter] Network communication is a huge bottleneck with many machines SSP balances network and compute time
38 SSPTable vs BSP and Async [Chart: log-likelihood vs seconds; LDA on NYtimes dataset, 32 machines (256 cores), 10% docs per iter; curves: BSP (stale 0), async] NYtimes data: N = 100M tokens, K = 100 topics, V = 100K terms BSP has strong convergence guarantees but is slow Asynchronous is fast but has weak convergence guarantees
39 SSPTable vs BSP and Async [Chart: log-likelihood vs seconds; LDA on NYtimes dataset, 32 machines (256 cores), 10% docs per iter; curves: BSP (stale 0), SSP (stale > 0), async] NYtimes data: N = 100M tokens, K = 100 topics, V = 100K terms BSP has strong convergence guarantees but is slow Asynchronous is fast but has weak convergence guarantees SSPTable is fast and has strong convergence guarantees
40 The Quality vs Quantity tradeoff Quantity: iterations versus time [chart: iterations vs seconds] Quality: objective versus iterations [chart: log-likelihood vs iterations] LDA, 32 machines, 10% data; curves: BSP (stale 0), stale 16, stale 24, stale 48 Progress per time is (iters/sec) * (progress/iter) High staleness yields more iters/sec, but lowers progress/iter Find the sweet spot (staleness > 0) for maximum progress per second
41 The Quality vs Quantity tradeoff [Figure: effect of increasing staleness on the two factors] Progress per time is (iters/sec) * (progress/iter) High staleness yields more iters/sec, but lowers progress/iter Find the sweet spot (staleness > 0) for maximum progress per second
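The sweet-spot claim is just the product of the two factors. A toy calculation with invented numbers (not the measured values from these figures):

```python
# Progress per second = (iterations/sec) * (progress/iteration). Higher
# staleness raises the first factor but lowers the second, so the product
# peaks at an intermediate staleness. All numbers below are invented.
profiles = {
    # staleness: (iterations per second, progress per iteration)
    0:  (1.0, 1.00),
    8:  (1.8, 0.80),
    16: (2.2, 0.60),
    48: (2.5, 0.30),
}
progress_per_sec = {s: ips * ppi for s, (ips, ppi) in profiles.items()}
best = max(progress_per_sec, key=progress_per_sec.get)
```

With these made-up profiles the product is maximized at an intermediate staleness, not at BSP (stale 0) or at very high staleness, which is exactly the shape of the tradeoff the slide describes.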
42 Matrix Factorization (Netflix) [Chart: objective function versus time; MF on 32 machines (256 threads); curves: BSP (stale 0), SSP (stale > 0)] Netflix data: 100M nonzeros, 480K rows, 18K columns, rank 100
43 LASSO (Synthetic) [Chart: objective function versus time; Lasso on 16 machines (128 threads); curves: BSP (stale 0), stale 10, stale 20, stale 40, stale 80] Synthetic data: N = 500 samples, P = 400K features
44 SSPTable scaling with # machines [Charts: log-likelihood vs seconds, and inverse time to convergence vs # machines; LDA on NYtimes dataset; configurations: 32 machines (256 cores), 16 machines (128 cores), 8 machines (64 cores), 4 machines (32 cores), 2 machines (16 cores), 1 machine (8 cores); ideal scaling vs SSP] Double # machines: 78% speedup, converge in 56% of the time The SSP computational model scales with increasing # machines (given a fixed dataset)
45 Recent Results Using 8 machines * 16 cores = 128 threads, 8GB RAM per machine Latent Dirichlet Allocation: NYTimes dataset (100M tokens, 100K words, 10K topics): SSP 00K tokens/s, GraphLab 80K tokens/s PubMed dataset (7.5B tokens, 141K words, 100 topics): SSP .M tokens/s, GraphLab .8M tokens/s Network latent space role modeling: Friendster network sample (9M nodes, 80M edges), 50 roles: SSP takes 4h to converge (vs 5 days on one machine)
46 Future work Theory: SSP for MCMC Automatic staleness tuning Average-case analysis for better bounds Systems: Load balancing Fault tolerance Prefetching Other consistency schemes Applications: Hard-to-parallelize ML models (DNNs, Regularized Bayes, Network Analysis models)
47 Coauthors James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, Eric P. Xing
48 Workshop Demo SSP is part of a bigger system: Petuum SSP parameter server STRADS dynamic variable scheduler More features in the works We have a demo! Topic modeling (8.2M docs, 7.5B tokens, 141K words, 10K topics) Lasso regression (100K samples, 100M dimensions, 5 billion nonzeros) Network latent space modeling (9M nodes, 80M edges, 50 roles) At the BigLearning 2013 workshop (Monday)
49 Summary Distributed ML is nontrivial: Slow network Unequal machine performance SSP addresses those problems: Efficiently uses network resources; reduces waiting time Allows slow machines to catch up Fast like Async, converges like BSP The SSPTable parameter server provides an easy table interface: Quickly convert single-machine parallel ML algorithms to distributed Slides:
More informationScaled Machine Learning at Matroid
Scaled Machine Learning at Matroid Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Machine Learning Pipeline Learning Algorithm Replicate model Data Trained Model Serve Model Repeat entire pipeline Scaling
More informationNoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre
NoSQL systems: sharding, replication and consistency Riccardo Torlone Università Roma Tre Data distribution NoSQL systems: data distributed over large clusters Aggregate is a natural unit to use for data
More informationRACS: Extended Version in Java Gary Zibrat gdz4
RACS: Extended Version in Java Gary Zibrat gdz4 Abstract Cloud storage is becoming increasingly popular and cheap. It is convenient for companies to simply store their data online so that they don t have
More informationTowards the world s fastest k-means algorithm
Greg Hamerly Associate Professor Computer Science Department Baylor University Joint work with Jonathan Drake May 15, 2014 Objective function and optimization Lloyd s algorithm 1 The k-means clustering
More informationAuthors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G.
Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G. Speaker: Chong Li Department: Applied Health Science Program: Master of Health Informatics 1 Term
More informationFlat Datacenter Storage. Edmund B. Nightingale, Jeremy Elson, et al. 6.S897
Flat Datacenter Storage Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Motivation Imagine a world with flat data storage Simple, Centralized, and easy to program Unfortunately, datacenter networks
More informationCase Study 1: Estimating Click Probabilities
Case Study 1: Estimating Click Probabilities SGD cont d AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade March 31, 2015 1 Support/Resources Office Hours Yao Lu:
More informationarxiv: v1 [stat.ml] 31 Dec 2015
Strategies and Principles of Distributed Machine Learning on Big Data Eric P. Xing, Qirong Ho, Pengtao Xie, Wei Dai School of Computer Science, Carnegie Mellon University January 1, 2016 arxiv:1512.09295v1
More informationThe Future of High Performance Computing
The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer
More informationDistributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability
Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,
More informationOptimizing Network Performance in Distributed Machine Learning. Luo Mai Chuntao Hong Paolo Costa
Optimizing Network Performance in Distributed Machine Learning Luo Mai Chuntao Hong Paolo Costa Machine Learning Successful in many fields Online advertisement Spam filtering Fraud detection Image recognition
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationYCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores
YCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores Swapnil Patil Milo Polte, Wittawat Tantisiriroj, Kai Ren, Lin Xiao, Julio Lopez, Garth Gibson, Adam Fuchs *, Billie
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationNative Offload of Haskell Repa Programs to Integrated GPUs
Native Offload of Haskell Repa Programs to Integrated GPUs Hai (Paul) Liu with Laurence Day, Neal Glew, Todd Anderson, Rajkishore Barik Intel Labs. September 28, 2016 General purpose computing on integrated
More informationMachine Learning at the Limit
Machine Learning at the Limit John Canny*^ * Computer Science Division University of California, Berkeley ^ Yahoo Research Labs @GTC, March, 2015 My Other Job(s) Yahoo [Chen, Pavlov, Canny, KDD 2009]*
More informationThe Stratosphere Platform for Big Data Analytics
The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationGraphLab: A New Framework for Parallel Machine Learning
GraphLab: A New Framework for Parallel Machine Learning Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein Presented by Guozhang Wang DB Lunch, Nov.8, 2010 Overview
More informationHyperparameter optimization. CS6787 Lecture 6 Fall 2017
Hyperparameter optimization CS6787 Lecture 6 Fall 2017 Review We ve covered many methods Stochastic gradient descent Step size/learning rate, how long to run Mini-batching Batch size Momentum Momentum
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationAsynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU?
Asynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU? Florin Rusu Yujing Ma, Martin Torres (Ph.D. students) University of California Merced Machine Learning (ML) Boom Two SIGMOD
More informationMap-Reduce. Marco Mura 2010 March, 31th
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationGradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz
Gradient Descent Wed Sept 20th, 2017 James McInenrey Adapted from slides by Francisco J. R. Ruiz Housekeeping A few clarifications of and adjustments to the course schedule: No more breaks at the midpoint
More informationCS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.
CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE
More informationBIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE
BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BRETT WENINGER, MANAGING DIRECTOR 10/21/2014 ADURANT APPROACH TO BIG DATA Align to Un/Semi-structured Data Instead of Big Scale out will become Big Greatest
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data
More informationMySQL Database Scalability
MySQL Database Scalability Nextcloud Conference 2016 TU Berlin Oli Sennhauser Senior MySQL Consultant at FromDual GmbH oli.sennhauser@fromdual.com 1 / 14 About FromDual GmbH Support Consulting remote-dba
More informationPartitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization
Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization Hung Nghiep Tran University of Information Technology VNU-HCMC Vietnam Email: nghiepth@uit.edu.vn Atsuhiro Takasu National
More informationFLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568
FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected
More informationLitz: Elastic Framework for High-Performance Distributed Machine Learning
Litz: Elastic Framework for High-Performance Distributed Machine Learning Aurick Qiao 1,2, Abutalib Aghayev 2, Weiren Yu 1,3, Haoyang Chen 1, Qirong Ho 1, Garth A. Gibson 2,4, Eric P. Xing 1,2, 1 Petuum,
More informationMEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS
MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing
More informationApache Flink. Alessandro Margara
Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate
More informationCOMP6237 Data Mining Data Mining & Machine Learning with Big Data. Jonathon Hare
COMP6237 Data Mining Data Mining & Machine Learning with Big Data Jonathon Hare jsh2@ecs.soton.ac.uk Contents Going to look at two case-studies looking at how we can make machine-learning algorithms work
More informationI/O CANNOT BE IGNORED
LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More information