Scaling Distributed Machine Learning
|
|
- Andrew Hardy
- 6 years ago
- Views:
Transcription
1 Scaling Distributed Machine Learning with System and Algorithm Co-design Mu Li Thesis Defense CSD, CMU Feb 2nd, 2017
2 nx min w f i (w) Distributed systems i=1 Large scale optimization methods Large-scale problems 2
3 Large Scale Machine Learning More complex models Machine learning learns from data More data better accuracy can use more complex models Accuracy Data size 3
4 Ads Click Prediction Predict if an ad will be clicked Each ad impression is an example Logistic regression Single machine processes 1 million examples per second 4
5 Ads Click Prediction Predict if an ad will be clicked 700 Each ad impression is an example Logistic regression Single machine processes 1 million examples per second A typical industrial size problem has 100 billion examples Training data size (TB) billion unique features Year 5
6 Image Recognition Recognize the object in an image Convolutional neural network A state-of-the-art network Hundreds of layers Billions of floating-point operation for processing a single image 6
7 Distributed Computing for Large Scale Problems Distribute workload among many machines Widely available thanks to cloud providers (AWS, GCP, Azure) switch switch machine machine switch switch machine machine 7
8 Distributed Computing for Large Scale Problems Distribute workload among many switch switch machines switch switch Widely available thanks to cloud providers (AWS, GCP, Azure) machine machine Challenges machine machine Limited communication bandwidth (10x less than memory bandwidth) Large synchronization cost (1ms latency) Job failures Failure rate (%) Machine time (hour)
9 Distributed Optimization for Large Scale ML nx m[ min w i=1 f i (w) i=1 I i = {1, 2,...,n} X i2i i (w t ) X w + w t t+1 i2i i (w t ) Challenges Massive communication traffic Expensive global synchronization X i2i i (w t ) 9
10 Scaling Distributed Machine Learning Distributed Systems Distributed Systems Distributed Systems Large Scale Optimization Large data size, complex models Fault tolerant Communication efficient Convergence guarantee Easy to use 10
11 Scaling Distributed Machine Learning Distributed Systems Distributed Systems Distributed Systems Large Scale Optimization Parameter Server DBPG for non-convex non-smooth f i MXNet for deep learning EMSO for efficient minibatch SGD 11
12 Scaling Distributed Machine Learning Distributed Systems Distributed Systems Distributed Systems Large Scale Optimization Parameter Server DBPG for non-convex non-smooth f i MXNet for deep learning EMSO for efficient minibatch SGD 12
13 Existing Open Source Systems in 2012 MPI (message passing interface) Hard to use for sparse problems No fault tolerance Key-value store, e.g. redis Expensive individual key-value pair communication Difficult to program on the server side Hadoop/Spark BSP data consistency makes efficient implementation challenging 13
14 Parameter Server Architecture Model [Smola 10, Dean 12] update Server machines push pull Worker machines Training data 14
15 Keys Features of our Implementation [Li et al, OSDI 14] Trade off data consistency for speed X Flexible consistency models User-defined filters Fault tolerance with chain replication X 15
16 Flexible Consistency Model iter 0 gradient push & pull execute after finished dependency iter 1 gradient push & pull Sequential / BSP iter 0 gradient push & pull iter 1 gradient push & pull Eventual / Total asynchronous [Smola 10] no dependency Flexible models via task dependency graph Bounded delay / SSP [Langford 09, Cipar 13]
17 User-defined Filters machine Data User defined encoder/decoder for efficient communication Encoder Lossless compression General data compression: LZ, LZR,.. efficient communication Lossy compression Decoder Random skip Fixed-point encoding Data machine 17
18 Fault Tolerance with Chain Replication Model is partitioned by consistent hashing Chain replication ack ack worker 0 server 0 server 1 push push Option: aggregation reduces backup traffic worker 0 ack push ack worker 1 ack push server 0 server 1 push 18
19 Scaling Distributed Machine Learning Distributed Systems Distributed Systems Distributed Systems Large Scale Optimization Parameter Server DBPG for non-convex non-smooth f i MXNet for deep learning EMSO for efficient minibatch SGD 19
20 Proximal Gradient Method [Combettes 09] min w2 nx i=1 f i (w)+h(w) f i : continuously differentiable but not necessarily convex h: convex but possibly non-smooth w t+1 =Prox t " w t t n X f i (w t ) # Iterative update i=1 where Prox (x) := argmin y2 h(y)+ 1 2 kx yk2 20
21 Delayed Block Proximal Gradient [Li et al, NIPS 14] Algorithm design tailored for parameter model server implementation Update a block of coordinates each time Allow delay among blocks Use filters during communication data Only 300 lines of codes 21
22 Convergence Analysis Assumptions: Block Lipschitz continuity: within block Delay is bounded by τ L var, cross blocks L cor Lossy compressions such as random skip filter and significantly-modified filter DBPG converges to a stationary point if the learning rate is chosen as t < 1 L var + L cor 22
23 Experiments on Ads Click Prediction Real dataset used in production Time to achieve the same objective value 170 billion examples, 65 billion unique features, 636 TB in total sequential 2 computing waiting 1000 machines Sparse logistic regression time (hour) x best trade-off min w nx log(1 + exp( y i hx i,wi)) + kwk 1 i= Maximal delay τ 23
24 Filters to Reduce Communication Traffic 100 Server Key caching Traffic (%) x 40x 40x Cache feature IDs on both sender and receiver Data compression 0 Baseline Key Caching Compre- ssing KKT Filter KKT filter 100 Worker Shrink gradient to 0 based on the KKT condition Traffic (%) x 2.5x 12x 24 0 Baseline Key Caching Compre- ssing KKT Filter
25 Scaling Distributed Machine Learning Distributed Systems Distributed Systems Distributed Systems Large Scale Optimization Parameter Server DBPG for non-convex non-smooth f i MXNet for deep learning EMSO for efficient minibatch SGD 25
26 Deep Learning is Unique Complex workloads Heterogeneous computing Easy to use programming interface deep learning trend in the past 5 years 26
27 Key Features of MXNet [Chen et al, NIPS 15 workshop] (corresponding author) Easy-to-use front-end Mixed programming Scalable and efficient back-end Computation and memory optimization Auto-parallelization Scaling to multiple GPU/machines 27
28 Mixed Programming Declarative programs are easy to optimize Imperative programming is flexible e.g. Numpy, Matlab, Torch, e.g. TensorFlow, Theano, Caffe, import mxnet as mx net = mx.symbol.variable('data') net = mx.symbol.fullyconnected( data=net, num_hidden=128) net = mx.symbol.softmaxoutput(data=net) model = mx.module.module(net) model.forward(data=c) model.backward() Good for defining the neural network import mxnet as mx a = mx.nd.zeros((100, 50)) b = mx.nd.ones((100, 50)) c = a * b print(c) c += 1 Good for updating and interacting with the neural network 28
29 Back-end System Front-end import mxnet as mx a = mx.nd.zeros((100, 50)) b = mx.nd.ones((100, 50)) c = a * b c += 1 import mxnet as mx net = mx.symbol.variable('data') net = mx.symbol.fullyconnected( data=net, num_hidden=128) net = mx.symbol.softmaxoutput(data=net) texec = mx.module.module(net) texec.forward(data=c) texec.backward() Back-end a b c weight Optimization Memory optimization + 1 fullc softmax bias Operator fusion and runtime compilation Scheduling Auto-parallelization 29
30 Scale to Multiple GPU Machines Hierarchical parameter server Network Switch 1.25 GB/s 10 Gbit Ethernet CPU CPUs Level-2 Servers GB/s PCIe x PCIe Switch Level-1 Servers 63 GB/s 4 PCIe x GPUs Workers GPU GPU GPU GPU 1000 lines of codes 30
31 Experiment Setup Minibatch SGD 1.2 million images with 1000 classes Draw a random set of examples I t at Resnet 152-layer model iteration t EC2 P2.8 xlarge Update 8 K80 GPUs per machine CPU w t+1 = w t t I t X i2i i (w t ) PCIe switches Synchronized updating GPU
32 Communication Cost Fix #GPUs per machine GPU/machine 2 GPU/machine 4 GPU/machine time (sec) GPU/machine # of GPUs 32
33 Scalability 115x speedup 1 Communication cost batch size/gpu= batch size/gpu=8 time (sec) 0.5 batch size/gpu= # of GPUs 33
34 Convergence 0.8 Increase learning rate by 5x Increase learning rate by 10x, decrease it at epoch 50, 80 Top-1 validation accuracy batch size=256 batch size=2,560 batch size=5, # of epochs 34
35 Scaling Distributed Machine Learning Distributed Systems Distributed Systems Distributed Systems Large Scale Optimization Parameter Server DBPG for non-convex non-smooth f i MXNet for deep learning EMSO for efficient minibatch SGD 35
36 Minibatch SGD Large batch size b in SGD Better parallelization within a batch Less switching/communication cost Better system performance Small batch size b Faster convergence convergence O(1/ p N + b/n) Worse rate Batch size (b) N: number of examples processed 36
37 Motivation [Li et al, KDD 14] Better system Improve converge rate for large batch size performance Example variance decreases with batch size Solve a more accurate optimization subproblem over each batch convergence Worse rate Batch size (b) 37
38 Efficient Minibatch SGD Define f It (w) := X i2i t f i (w). Minibatch SGD solves w t = argmin w2 apple first-order approximation conservative penalty f (w It t 1 )+h@f (w It t 1 ),w w t 1 i + 1 kw w t 1 k t EMSO solves the subproblem at each iteration w t = argmin w2 apple f It (w)+ 1 2 t kw w t 1 k 2 2 exact objective p For convex f i, choose t = O(b/ N). EMSO has convergence rate O(1/ p N) (compared to p O(1/ N + b/n) ) 38
39 Experiment Ads click prediction with fixed run time Single machine 10 machines SGD 0.22 SGD EMSO EMSO Objective Objective e3 1e4 1e e3 1e4 1e5 Batch size Batch size Extended to deep learning in [Keskar et al, arxiv 16] 39
40 Large-scale problems min w nx i=1 f i (w) Reduce communication cost Distributed systems Large scale optimization Co-design Communicate less With appropriate computational frameworks and algorithm Message compression design, distributed machine learning can be made simple, Relaxed data consistency fast, and scalable, both in theory and in practice. 40
41 Acknowledgement Committee members QQ & Alex Advisors with other 13 collaborators 41
42 Backup slides 42
43 Scaling to 16 GPUs in a Single Machine x Comm Cost bs/gpu=2 bs/gpu=4 time (sec) bs/gpu=8 bs/gpu=16 Communication dominates # of GPUs 43
44 Compare to a L-BFGS Based System computing waiting Objective System A Parameter Server Time (hour) System A Parameter Time (hour) Server 44
45 Sections not Covered AdaDelay: model the actual delay for asynchronized SGD Parsa: data placement to reduce communication cost Difacto: large scale factorization machine 45
Scaling Distributed Machine Learning with the Parameter Server
Scaling Distributed Machine Learning with the Parameter Server Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su Presented
More informationOptimizing Network Performance in Distributed Machine Learning. Luo Mai Chuntao Hong Paolo Costa
Optimizing Network Performance in Distributed Machine Learning Luo Mai Chuntao Hong Paolo Costa Machine Learning Successful in many fields Online advertisement Spam filtering Fraud detection Image recognition
More informationPoseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters Hao Zhang Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jianliang Wei, Pengtao Xie,
More informationCS 6453: Parameter Server. Soumya Basu March 7, 2017
CS 6453: Parameter Server Soumya Basu March 7, 2017 What is a Parameter Server? Server for large scale machine learning problems Machine learning tasks in a nutshell: Feature Extraction (1, 1, 1) (2, -1,
More informationTensorFlow: A System for Learning-Scale Machine Learning. Google Brain
TensorFlow: A System for Learning-Scale Machine Learning Google Brain The Problem Machine learning is everywhere This is in large part due to: 1. Invention of more sophisticated machine learning models
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationToward Scalable Deep Learning
한국정보과학회 인공지능소사이어티 머신러닝연구회 두번째딥러닝워크샵 2015.10.16 Toward Scalable Deep Learning 윤성로 Electrical and Computer Engineering Seoul National University http://data.snu.ac.kr Breakthrough: Big Data + Machine Learning
More informationCuckoo Linear Algebra
Cuckoo Linear Algebra Li Zhou, CMU Dave Andersen, CMU and Mu Li, CMU and Alexander Smola, CMU and select advertisement p(click user, query) = logist (hw, (user, query)i) select advertisement find weight
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences
More informationLecture 22 : Distributed Systems for ML
10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 4, 2016 Outline Multi-core v.s. multi-processor Parallel Gradient Descent Parallel Stochastic Gradient Parallel Coordinate Descent Parallel
More informationMulti-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung
More informationDistributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability
Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,
More informationDeep Learning and Its Applications
Convolutional Neural Network and Its Application in Image Recognition Oct 28, 2016 Outline 1 A Motivating Example 2 The Convolutional Neural Network (CNN) Model 3 Training the CNN Model 4 Issues and Recent
More informationEfficient Communication Library for Large-Scale Deep Learning
IBM Research AI Efficient Communication Library for Large-Scale Deep Learning Mar 26, 2018 Minsik Cho (minsikcho@us.ibm.com) Deep Learning changing Our Life Automotive/transportation Security/public safety
More informationAsynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU?
Asynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU? Florin Rusu Yujing Ma, Martin Torres (Ph.D. students) University of California Merced Machine Learning (ML) Boom Two SIGMOD
More informationCNN optimization. Rassadin A
CNN optimization Rassadin A. 01.2017-02.2017 What to optimize? Training stage time consumption (CPU / GPU) Inference stage time consumption (CPU / GPU) Training stage memory consumption Inference stage
More informationLinear Regression Optimization
Gradient Descent Linear Regression Optimization Goal: Find w that minimizes f(w) f(w) = Xw y 2 2 Closed form solution exists Gradient Descent is iterative (Intuition: go downhill!) n w * w Scalar objective:
More informationNeural Network Neurons
Neural Networks Neural Network Neurons 1 Receives n inputs (plus a bias term) Multiplies each input by its weight Applies activation function to the sum of results Outputs result Activation Functions Given
More informationParallel Methods for Convex Optimization. A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney
Parallel Methods for Convex Optimization A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney Problems minimize g(x)+f(x; A, b) Sparse regression g(x) =kxk 1 f(x) =kax bk 2 2 mx Sparse SVM g(x) =kxk
More informationDeep Learning Frameworks with Spark and GPUs
Deep Learning Frameworks with Spark and GPUs Abstract Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel,
More informationOptimization for Machine Learning
with a focus on proximal gradient descent algorithm Department of Computer Science and Engineering Outline 1 History & Trends 2 Proximal Gradient Descent 3 Three Applications A Brief History A. Convex
More informationDecentralized and Distributed Machine Learning Model Training with Actors
Decentralized and Distributed Machine Learning Model Training with Actors Travis Addair Stanford University taddair@stanford.edu Abstract Training a machine learning model with terabytes to petabytes of
More informationMoonRiver: Deep Neural Network in C++
MoonRiver: Deep Neural Network in C++ Chung-Yi Weng Computer Science & Engineering University of Washington chungyi@cs.washington.edu Abstract Artificial intelligence resurges with its dramatic improvement
More informationDevice Placement Optimization with Reinforcement Learning
Device Placement Optimization with Reinforcement Learning Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Mohammad Norouzi, Naveen Kumar, Rasmus Larsen, Yuefeng Zhou, Quoc Le, Samy Bengio, and
More informationDistributed Machine Learning: An Intro. Chen Huang
: An Intro. Chen Huang Feature Engineering Group, Data Mining Lab, Big Data Research Center, UESTC Contents Background Some Examples Model Parallelism & Data Parallelism Parallelization Mechanisms Synchronous
More informationAccelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru
More informationScaled Machine Learning at Matroid
Scaled Machine Learning at Matroid Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Machine Learning Pipeline Learning Algorithm Replicate model Data Trained Model Serve Model Repeat entire pipeline Scaling
More informationLecture 11: Distributed Training and Communication Protocols. CSE599W: Spring 2018
Lecture 11: Distributed Training and Communication Protocols CSE599W: Spring 2018 Where are we High level Packages User API Programming API Gradient Calculation (Differentiation API) System Components
More informationThe Future of High Performance Computing
The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer
More informationTraining Deep Neural Networks (in parallel)
Lecture 9: Training Deep Neural Networks (in parallel) Visual Computing Systems How would you describe this professor? Easy? Mean? Boring? Nerdy? Professor classification task Classifies professors as
More informationGradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz
Gradient Descent Wed Sept 20th, 2017 James McInenrey Adapted from slides by Francisco J. R. Ruiz Housekeeping A few clarifications of and adjustments to the course schedule: No more breaks at the midpoint
More informationMore Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server Q. Ho, J. Cipar, H. Cui, J.K. Kim, S. Lee, *P.B. Gibbons, G.A. Gibson, G.R. Ganger, E.P. Xing Carnegie Mellon University
More informationSparse Training Data Tutorial of Parameter Server
Carnegie Mellon University Sparse Training Data Tutorial of Parameter Server Mu Li! CSD@CMU & IDL@Baidu! muli@cs.cmu.edu High-dimensional data are sparse Why high dimension?! make the classifier s job
More informationBeyond Training The next steps of Machine Learning. Chris /in/chrisparsonsdev
Beyond Training The next steps of Machine Learning Chris Parsons chrisparsons@uk.ibm.com @chrisparsonsdev /in/chrisparsonsdev What is this talk? Part 1 What is Machine Learning? AI Infrastructure PowerAI
More informationScalable Distributed Training with Parameter Hub: a whirlwind tour
Scalable Distributed Training with Parameter Hub: a whirlwind tour TVM Stack Optimization High-Level Differentiable IR Tensor Expression IR AutoTVM LLVM, CUDA, Metal VTA AutoVTA Edge FPGA Cloud FPGA ASIC
More informationDeep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford Unviersity
Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford Unviersity Abstract: This project aims at creating a benchmark for Deep Learning (DL) algorithms
More informationDeep Learning Training System: Adam & Google Cloud ML. CSE 5194: Introduction to High- Performance Deep Learning Qinghua Zhou
Deep Learning Training System: Adam & Google Cloud ML CSE 5194: Introduction to High- Performance Deep Learning Qinghua Zhou Outline Project Adam: Building an Efficient and Scalable Deep Learning Training
More informationChainerMN: Scalable Distributed Deep Learning Framework
ChainerMN: Scalable Distributed Deep Learning Framework Takuya Akiba Preferred Networks, Inc. akiba@preferred.jp Keisuke Fukuda Preferred Networks, Inc. kfukuda@preferred.jp Shuji Suzuki Preferred Networks,
More informationParallel Stochastic Gradient Descent: The case for native GPU-side GPI
Parallel Stochastic Gradient Descent: The case for native GPU-side GPI J. Keuper Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Mark Silberstein Accelerated Computer
More informationPouya Kousha Fall 2018 CSE 5194 Prof. DK Panda
Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Motivation And Intro Programming Model Spark Data Transformation Model Construction Model Training Model Inference Execution Model Data Parallel Training
More informationFastText. Jon Koss, Abhishek Jindal
FastText Jon Koss, Abhishek Jindal FastText FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy But it is way faster: FastText can train on more than one billion words
More informationProgramming Systems for Big Data
Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There
More informationResidual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina
Residual Networks And Attention Models cs273b Recitation 11/11/2016 Anna Shcherbina Introduction to ResNets Introduced in 2015 by Microsoft Research Deep Residual Learning for Image Recognition (He, Zhang,
More informationAnalyzing Stochastic Gradient Descent for Some Non- Convex Problems
Analyzing Stochastic Gradient Descent for Some Non- Convex Problems Christopher De Sa Soon at Cornell University cdesa@stanford.edu stanford.edu/~cdesa Kunle Olukotun Christopher Ré Stanford University
More informationInference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA
Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos
More informationMALT: Distributed Data-Parallelism for Existing ML Applications
MALT: Distributed -Parallelism for Existing ML Applications Hao Li, Asim Kadav, Erik Kruus NEC Laboratories America {asim, kruus@nec-labs.com} Abstract We introduce MALT, a machine learning library that
More informationIn-Place Activated BatchNorm for Memory- Optimized Training of DNNs
In-Place Activated BatchNorm for Memory- Optimized Training of DNNs Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder Mapillary Research Paper: https://arxiv.org/abs/1712.02616 Code: https://github.com/mapillary/inplace_abn
More informationProfiling DNN Workloads on a Volta-based DGX-1 System
Profiling DNN Workloads on a Volta-based DGX-1 System Saiful A. Mojumder 1, Marcia S Louis 1, Yifan Sun 2, Amir Kavyan Ziabari 3, José L. Abellán 4, John Kim 5, David Kaeli 2, Ajay Joshi 1 1 ECE Department,
More informationMocha.jl. Deep Learning in Julia. Chiyuan Zhang CSAIL, MIT
Mocha.jl Deep Learning in Julia Chiyuan Zhang (@pluskid) CSAIL, MIT Deep Learning Learning with multi-layer (3~30) neural networks, on a huge training set. State-of-the-art on many AI tasks Computer Vision:
More informationDemocratizing Machine Learning on Kubernetes
Democratizing Machine Learning on Kubernetes Joy Qiao, Senior Solution Architect - AI and Research Group, Microsoft Lachlan Evenson - Principal Program Manager AKS/ACS, Microsoft Who are we? The Data Scientist
More informationLAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning
LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning Tianyi Chen Georgios Giannakis Tao Sun Wotao Yin UMN, ECE UCLA, Math NeurIPS 208 Overview M := {,...,M} aaacaxicbvdlssnafj3uv62vqbvbzwarxjssseerhiibn4uk9gfnkjpjpb06mqkze6geuvfx3lhqxk/4c6/cdpmoa0hlhzouzd77wksrpv2ng+rslk6tr5r3cxtbe/s7tn7b20luoljcwsmzddaijdksuttzug3kqtfasodyhqz9tsprcoq+l0ej8sp0ydtigkkjds3jzipiwybk6trl3mrhgufvpwgn+nbzafqzacxizutmsjr7ntfxihwghoumunk9vwn0x6gpkaykunjsxvjeb6haekzylfmlj/nppjau6oemblsfndwpv6eyfcsdgotgem9fatelpxp6+x6ujszyhpuk04ni+kugagnm4yeglwzqndufyunmrxemkedymtjijwv8ezm0z6uuu3xvauv6ly+jci7bctgdlrgadxalmqafmhgez+avvflpovbn3mwwtwpnmi/sd6/aeiejx9
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationCharacterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures
Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi
More informationCOMP6237 Data Mining Data Mining & Machine Learning with Big Data. Jonathon Hare
COMP6237 Data Mining Data Mining & Machine Learning with Big Data Jonathon Hare jsh2@ecs.soton.ac.uk Contents Going to look at two case-studies looking at how we can make machine-learning algorithms work
More informationParallel Deep Network Training
Lecture 19: Parallel Deep Network Training Parallel Computer Architecture and Programming How would you describe this professor? Easy? Mean? Boring? Nerdy? Professor classification task Classifies professors
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationDeep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
More informationThe exam is closed book, closed notes except your one-page (two-sided) cheat sheet.
CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or
More informationDeep Learning with Tensorflow AlexNet
Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification
More informationCS 179 Lecture 16. Logistic Regression & Parallel SGD
CS 179 Lecture 16 Logistic Regression & Parallel SGD 1 Outline logistic regression (stochastic) gradient descent parallelizing SGD for neural nets (with emphasis on Google s distributed neural net implementation)
More informationEFFICIENT INFERENCE WITH TENSORRT. Han Vanholder
EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60
More informationAsynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features
Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Xu SUN ( 孙栩 ) Peking University xusun@pku.edu.cn Motivation Neural networks -> Good Performance CNN, RNN, LSTM
More informationDeep Learning for Computer Vision II
IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L
More informationUsable while performant: the challenges building. Soumith Chintala
Usable while performant: the challenges building Soumith Chintala Problem Statement Deep Learning Workloads Problem Statement Deep Learning Workloads for epoch in range(max_epochs): for data, target in
More informationNVIDIA DLI HANDS-ON TRAINING COURSE CATALOG
NVIDIA DLI HANDS-ON TRAINING COURSE CATALOG Valid Through July 31, 2018 INTRODUCTION The NVIDIA Deep Learning Institute (DLI) trains developers, data scientists, and researchers on how to use artificial
More informationS8765 Performance Optimization for Deep- Learning on the Latest POWER Systems
S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems Khoa Huynh Senior Technical Staff Member (STSM), IBM Jonathan Samn Software Engineer, IBM Evolving from compute systems to
More informationAlternatives to Direct Supervision
CreativeAI: Deep Learning for Graphics Alternatives to Direct Supervision Niloy Mitra Iasonas Kokkinos Paul Guerrero Nils Thuerey Tobias Ritschel UCL UCL UCL TUM UCL Timetable Theory and Basics State of
More informationDL4J Components. Open Source DeepLearning Library CPU & GPU support Hadoop-Yarn & Spark integration Futre additions
DeepLearning4j DL4J Components Open Source DeepLearning Library CPU & GPU support Hadoop-Yarn & Spark integration Futre additions DL4J Components Open Source Deep Learning Library What is DeepLearning
More informationMapReduce ML & Clustering Algorithms
MapReduce ML & Clustering Algorithms Reminder MapReduce: A trade-off between ease of use & possible parallelism Graph Algorithms Approaches: Reduce input size (filtering) Graph specific optimizations (Pregel
More informationScalable deep learning on distributed GPUs with a GPU-specialized parameter server
Scalable deep learning on distributed GPUs with a GPU-specialized parameter server Henggang Cui, Gregory R. Ganger, and Phillip B. Gibbons Carnegie Mellon University CMU-PDL-15-107 October 2015 Parallel
More informationGaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu Machine Learning and Big
More informationCERN openlab & IBM Research Workshop Trip Report
CERN openlab & IBM Research Workshop Trip Report Jakob Blomer, Javier Cervantes, Pere Mato, Radu Popescu 2018-12-03 Workshop Organization 1 full day at IBM Research Zürich ~25 participants from CERN ~10
More informationTENSORFLOW: LARGE-SCALE MACHINE LEARNING ON HETEROGENEOUS DISTRIBUTED SYSTEMS. by Google Research. presented by Weichen Wang
TENSORFLOW: LARGE-SCALE MACHINE LEARNING ON HETEROGENEOUS DISTRIBUTED SYSTEMS by Google Research presented by Weichen Wang 2016.11.28 OUTLINE Introduction The Programming Model The Implementation Single
More informationKeras: Handwritten Digit Recognition using MNIST Dataset
Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA January 31, 2018 1 / 30 OUTLINE 1 Keras: Introduction 2 Installing Keras 3 Keras: Building, Testing, Improving A Simple Network 2 / 30
More informationTrack Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross
Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk
More informationData Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research
Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO Albis Pocket
More informationFrameworks in Python for Numeric Computation / ML
Frameworks in Python for Numeric Computation / ML Why use a framework? Why not use the built-in data structures? Why not write our own matrix multiplication function? Frameworks are needed not only because
More informationMachine Learning Basics. Sargur N. Srihari
Machine Learning Basics Sargur N. srihari@cedar.buffalo.edu 1 Overview Deep learning is a specific type of ML Necessary to have a solid understanding of the basic principles of ML 2 Topics Stochastic Gradient
More informationHigh Performance Computing
High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason
More informationDistributed Delayed Proximal Gradient Methods
Distributed Delayed Proximal Gradient Methods Mu Li, David G. Andersen and Alexander Smola, Carnegie Mellon University Google Strategic Technologies Abstract We analyze distributed optimization algorithms
More informationA Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics
A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics Shuhao Liu*, Li Chen, Baochun Li, Aiden Carnegie University of Toronto April 17, 2018 Graph Analytics What is Graph Analytics? 2
More informationDiffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting
Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting Yaguang Li Joint work with Rose Yu, Cyrus Shahabi, Yan Liu Page 1 Introduction Traffic congesting is wasteful of time,
More informationIndex. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,
Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110
More informationDistributed Optimization for Machine Learning
Distributed Optimization for Machine Learning Martin Jaggi EPFL Machine Learning and Optimization Laboratory mlo.epfl.ch AI Summer School - MSR Cambridge - July 5 th Machine Learning Methods to Analyze
More informationParallel Deep Network Training
Lecture 26: Parallel Deep Network Training Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Tunes Speech Debelle Finish This Album (Speech Therapy) Eat your veggies and study
More informationLogistic Regression. Abstract
Logistic Regression Tsung-Yi Lin, Chen-Yu Lee Department of Electrical and Computer Engineering University of California, San Diego {tsl008, chl60}@ucsd.edu January 4, 013 Abstract Logistic regression
More informationCS-541 Wireless Sensor Networks
CS-541 Wireless Sensor Networks Lecture 14: Big Sensor Data Prof Panagiotis Tsakalides, Dr Athanasia Panousopoulou, Dr Gregory Tsagkatakis 1 Overview Big Data Big Sensor Data Material adapted from: Recent
More information732A54/TDDE31 Big Data Analytics
732A54/TDDE31 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Peña IDA, Linköping University, Sweden 1/27 Contents MapReduce Framework Machine Learning with MapReduce Neural Networks
More information1 Overview Definitions (read this section carefully) 2
MLPerf User Guide Version 0.5 May 2nd, 2018 1 Overview 2 1.1 Definitions (read this section carefully) 2 2 General rules 3 2.1 Strive to be fair 3 2.2 System and framework must be consistent 4 2.3 System
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationImageNet Classification with Deep Convolutional Neural Networks
ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture
More informationWhy DNN Works for Speech and How to Make it More Efficient?
Why DNN Works for Speech and How to Make it More Efficient? Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering, York University, CANADA Joint work with Y.
More informationUnsupervised Learning
Deep Learning for Graphics Unsupervised Learning Niloy Mitra Iasonas Kokkinos Paul Guerrero Vladimir Kim Kostas Rematas Tobias Ritschel UCL UCL/Facebook UCL Adobe Research U Washington UCL Timetable Niloy
More informationarxiv: v1 [cs.dc] 8 Jun 2018
PipeDream: Fast and Efficient Pipeline Parallel DNN Training Aaron Harlap Deepak Narayanan Amar Phanishayee Vivek Seshadri Nikhil Devanur Greg Ganger Phil Gibbons Microsoft Research Carnegie Mellon University
More information10. Replication. Motivation
10. Replication Page 1 10. Replication Motivation Reliable and high-performance computation on a single instance of a data object is prone to failure. Replicate data to overcome single points of failure
More informationGuarding user Privacy with Federated Learning and Differential Privacy
Guarding user Privacy with Federated Learning and Differential Privacy Brendan McMahan mcmahan@google.com DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy
More informationPREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING
PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, G. Czajkowski Google, Inc. SIGMOD 2010 Presented by Ke Hong (some figures borrowed from
More informationCarnegie Mellon University Implementation Tutorial of Parameter Server
Carnegie Mellon University Implementation Tutorial of Parameter Server Mu Li! CSD@CMU & IDL@Baidu! muli@cs.cmu.edu Model/Data partition Learn w from training data X W X Worker/Server Architecture Servers
More informationMachine Learning at the Limit
Machine Learning at the Limit John Canny*^ * Computer Science Division University of California, Berkeley ^ Yahoo Research Labs @GTC, March, 2015 My Other Job(s) Yahoo [Chen, Pavlov, Canny, KDD 2009]*
More information