Toward Scalable Deep Learning
1 KIISE Artificial Intelligence Society, Machine Learning Research Group: Second Deep Learning Workshop. Toward Scalable Deep Learning. Sungroh Yoon, Electrical and Computer Engineering, Seoul National University
2 Breakthrough: Big Data + Machine Learning. (Pictured: Daphne Koller, Andrew Ng.)
3 Shadow beyond the Revolution: training challenges. (Figure: MNIST test results.) Ciresan, Dan Claudiu, et al., "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation (2010). T. Chilimbi et al. (OSDI 2014)
4 Machine Learning = Representation + Training. Representations: sparse structured input/output regression, nonparametric Bayesian models, graphical models, deep learning. Training.
5 Parallelism in Machine Learning. Basic form of ML: $F(D, \theta) = L(D, \theta) + r(\theta)$. Iterative-convergent update: $\theta^{(t+1)} = \theta^{(t)} + \Delta\theta(D)$. Data parallel: each worker computes $\Delta\theta(D_1), \Delta\theta(D_2), \ldots, \Delta\theta(D_n)$ on its own data shard. Model parallel: each worker computes $\Delta\theta_1(D), \Delta\theta_2(D), \ldots, \Delta\theta_m(D)$ for its own block of parameters. E. Xing & Q. Ho, 2015 KDD Tutorial
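To make the decomposition concrete, here is a minimal NumPy sketch (mine, not from the talk; the least-squares step is an illustrative stand-in for the update function Δθ) contrasting the two forms of parallelism:

```python
import numpy as np

def delta(D, theta, lr=0.01):
    # Illustrative update function: a gradient step for least squares.
    X, y = D
    return -lr * X.T @ (X @ theta - y) / len(y)

def data_parallel_step(shards, theta):
    # Each worker computes delta on its own shard D_i; updates are summed.
    return theta + sum(delta(D_i, theta) for D_i in shards)

def model_parallel_step(D, theta, blocks):
    # Each worker owns one block of parameters and updates only that block.
    full = delta(D, theta)              # conceptually computed per worker
    for idx in blocks:                  # in practice, blocks run concurrently
        theta[idx] += full[idx]
    return theta

# Usage with synthetic data:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 4)), rng.normal(size=100)
theta = np.zeros(4)
shards = [(X[:50], y[:50]), (X[50:], y[50:])]
theta = data_parallel_step(shards, theta)
theta = model_parallel_step((X, y), theta, blocks=[[0, 1], [2, 3]])
```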
6 Deep Learning: A New Learning Paradigm. Far more complex and larger than conventional ML models: a large number of model parameters to learn, and many (mostly simple) computations with latent variables. Requires scaling up/out of both computation and numerical optimization.
7 Dealing with the Challenges (1). Minimize computation [Bengio, 2014]: improve (reduce) the ratio # of computations / # of parameters. Extreme success story (but poor generalization): decision trees, with O(n) computations for O(2^n) parameters. Extreme unlucky story: deep neural nets, with O(n) computations for O(n) parameters. Example: conditional computation (Bengio, 2014); see the toy sketch below.
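As a concrete illustration of the conditional-computation idea (this toy sketch is mine, not Bengio's algorithm): a cheap gating network picks a few blocks of units, and only those blocks are actually computed, so computations grow much more slowly than parameters.

```python
import numpy as np

def conditional_layer(x, W, W_gate, k=2):
    # A cheap gater scores parameter blocks; only the top-k blocks are
    # computed, so far fewer computations than parameters are touched.
    n_blocks = W_gate.shape[0]
    block = W.shape[0] // n_blocks
    scores = W_gate @ x                        # gating network (cheap)
    active = np.argsort(scores)[-k:]           # indices of active blocks
    out = np.zeros(W.shape[0])
    for b in active:
        rows = slice(b * block, (b + 1) * block)
        out[rows] = np.maximum(W[rows] @ x, 0.0)   # ReLU on active blocks only
    return out

# Usage: 8 blocks of 16 units each, but only 2 blocks computed per input.
rng = np.random.default_rng(0)
x = rng.normal(size=32)
W, W_gate = rng.normal(size=(128, 32)), rng.normal(size=(8, 32))
h = conditional_layer(x, W, W_gate)
```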
8 Dealing with the Challenges (2). Scale-up approaches: run the learning workload on a co-processor/accelerator (SIMD, GPGPU, ...) for enhanced single-machine performance. Work organized in SIMD blocks yields 10-fold to 100-fold speed-ups. But stuck with memory constraints!
9 Dealing with the Challenges (3). Scale-out approaches: run the learning workload on a distributed system (GraphLab, Hadoop, Spark, ...). Can handle enormous data or model sizes by splitting the entire workload via data parallelism and model parallelism. But parameter communication issues!
10 Notable ML Platforms. Spark: RDD-based programming model; ML library (includes deep learning). GraphLab: Gather-Apply-Scatter programming model; large-scale graph mining. Petuum: key-value store + scheduler; general-purpose large-scale ML.
11 Notable ML Platforms. GPU-based (scale-up): Keras. Distributed (scale-out): DistBelief [J. Dean et al., "Large scale distributed deep networks," NIPS 2012], Project Adam [T. Chilimbi et al., "Project Adam: Building an efficient and scalable deep learning training system," OSDI 2014]
12 Recent Technological Trends DistBelief: supports both data and model parallelism J. Dean et al. "Large scale distributed deep networks," NIPS 2012
13 Recent Technological Trends. cuDNN: a GPU-accelerated library of primitives for DNNs, used by frameworks such as Caffe, Theano, etc. Ex) cuDNN v3 vs. cuDNN v2 on Caffe
14 Recent Technological Trends. DeepLearning4j: an open-source, distributed, commercial-grade DL framework. ND4J (scientific computing library for the JVM). Scalable backends: Apache Hadoop and Spark; GPUs.
15 Recent Technological Trends. Petuum: large-scale distributed machine learning. Considers both data and model parallelism. Key-value store + dynamic scheduler.
16 Recent Technological Trends. REEF (Retainable Evaluator Execution Framework): an Apache incubator project that packages a variety of data-processing libraries in a reusable form, spanning MapReduce, query, graph processing, and stream data processing.
17 Scalable Deep Learning Techniques. Examples of distributed schemes: 1) data parallelism: Hogwild! (B. Recht et al., NIPS 2011), Downpour SGD (J. Dean et al., NIPS 2012), Dogwild (C. Noel et al., 2014); 2) parameter server (M. Li et al., NIPS 2013); 3) model parallelism: STRADS (S. Lee et al., NIPS 2014); 4) acceleration with GPUs: cuda-convnet.
18 Data Parallelism. Based on the independence between data samples; leads to concurrent execution on each data shard and hence a speed-up. (Figure: a DATA matrix of samples x attributes split across Worker 1-3, followed by model aggregation.)
19 Data Parallelism: Hogwild! Asynchronous execution: don't lock, don't communicate! Each processor calculates gradients independently, and processors can overwrite each other's work. Y. Nishioka et al., "Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures," IPDPS 2015
20 Data Parallelism: Hogwild! Guarantees a reasonable convergence rate and exploits sparsity. Achieves better performance than traditional synchronized techniques even on non-sparse problems (e.g., SVM training). B. Recht et al., "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," NIPS 2011
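A minimal single-machine sketch of the Hogwild! idea (toy code, not the authors' implementation): threads share one weight vector and apply sparse SGD updates with no locks, accepting occasional overwrites.

```python
import numpy as np
from threading import Thread

rng = np.random.default_rng(0)
w = np.zeros(1000)                       # shared parameters: no lock, ever

def make_shard(n=500, nnz=5):
    # Synthetic sparse samples: (nonzero indices, values, target).
    return [(rng.integers(0, 1000, nnz), rng.normal(size=nnz), rng.normal())
            for _ in range(n)]

def worker(shard, lr=0.01):
    for idx, x_s, y in shard:
        pred = w[idx] @ x_s              # may read other threads' fresh writes
        w[idx] -= lr * (pred - y) * x_s  # racy in-place update, by design

threads = [Thread(target=worker, args=(make_shard(),)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```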
21 Data Parallelism: Downpour SGD, Dogwild! Hogwild! was designed for shared-memory machines, so its scalability is limited. These methods extend the Hogwild! concept to distributed systems: gradients are asynchronously pushed to a master or parameter server. Ex) Downpour SGD and Dogwild! (= distributed Hogwild!). J. Dean et al., "Large scale distributed deep networks," NIPS 2012
22 Parameter Server. A widely used concept for distributed machine learning: separate servers hold the model parameters. Key features (Li et al., 2013): efficient communication, flexible consistency models, elastic scalability, fault tolerance and durability, and ease of use. M. Li et al., "Parameter server for distributed machine learning," Big Learning NIPS Workshop, 2013
23 Parameter Server: Key-Value Vector. The model is usually expressed as a vector or an array. With sparse data and a linear model, not all parameters are used to calculate the gradients. Key-value vector: $(w_1, w_2, \ldots, w_n) \rightarrow \{(i, w_i) \mid i \in \text{features}, w_i \in \text{weights}\}$, used to transmit only the parameters that workers actually need. Example: $(w_1, w_2, w_3, w_4) \rightarrow (1, w_1), (2, w_2), (3, w_3), (4, w_4)$
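A tiny sketch of this representation (values are illustrative): the dense weight vector becomes (index, weight) pairs, and only the keys a worker's minibatch touches are transmitted.

```python
w_dense = [0.5, -1.2, 0.0, 3.4]                 # (w1, w2, w3, w4)
kv = {i + 1: w for i, w in enumerate(w_dense)}  # {1: w1, 2: w2, 3: w3, 4: w4}

needed = {2, 4}                                 # features used by this worker
payload = {i: kv[i] for i in needed}            # only these pairs are sent
```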
24 Parameter Server: Interface. Server node — data: a partition of the globally shared parameters. Worker node — data: a portion of the training data; task: local computation. Push — direction: worker to server; data: calculated update values. Pull — direction: server to worker; data: updated parameters. M. Li et al., "Parameter server for distributed machine learning," Big Learning NIPS Workshop, 2013
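A toy, single-process stand-in for this interface (the push/pull names follow the slide; everything else — update rule, placeholder gradient — is an assumption):

```python
from collections import defaultdict

class ParameterServer:
    def __init__(self):
        self.w = defaultdict(float)       # this server's parameter partition

    def push(self, updates, lr=0.01):     # worker -> server: update values
        for k, g in updates.items():
            self.w[k] -= lr * g

    def pull(self, keys):                 # server -> worker: fresh parameters
        return {k: self.w[k] for k in keys}

# One worker step: pull the keys its minibatch touches, compute, push back.
ps = ParameterServer()
w_local = ps.pull({1, 3})
grads = {k: w_local[k] - 0.5 for k in w_local}   # placeholder gradient
ps.push(grads)
```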
25 Parameter Server: Data & Model Partition. (Figure: the model is partitioned across server nodes along the parameter dimension and the data across worker nodes; workers push updates and pull parameters.)
26 STRADS. STRADS (Lee et al., 2014): STRucture-Aware Dynamic Scheduler. A parameter server with a dynamic scheduler that chooses sets of parameters which can be updated in parallel. Parameters are not transmitted between masters and workers. S. Lee et al., "On model parallelization and scheduling strategies for distributed machine learning," NIPS 2014
27 STRADS: Execution. Basic execution order: schedule, push, pull. Schedule — subject: master; task: pick sets of model parameters that can be safely updated in parallel. Push — subject: master; tasks: dispatch computation jobs via the coordinator to the workers, which execute push to compute partial updates for each parameter. Pull — subject: key-value store; tasks: aggregate the partial updates and keep the newly updated parameters.
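A toy sketch of one schedule-push-pull round (my simplification, not the STRADS code): the scheduler picks parameters with no pairwise dependency, workers compute partial updates, and the key-value store commits the aggregate.

```python
import numpy as np

def schedule(theta, dep, k=2):
    # Pick up to k mutually independent parameter indices (toy priority:
    # larger coordinates first; `dep` is a boolean dependency matrix).
    chosen = []
    for j in np.argsort(-np.abs(theta)):
        if all(not dep[j, c] for c in chosen):
            chosen.append(int(j))
        if len(chosen) == k:
            break
    return chosen

def strads_round(theta, dep, grad_fn, lr=0.1):
    block = schedule(theta, dep)                        # 1. schedule
    partials = [(j, grad_fn(theta, j)) for j in block]  # 2. push (parallel jobs)
    for j, g in partials:                               # 3. pull: aggregate/commit
        theta[j] -= lr * g
    return theta

# Usage with the quadratic objective f(theta) = ||theta - 1||^2 / 2:
theta = np.array([3.0, -2.0, 0.5, 1.5])
dep = np.zeros((4, 4), dtype=bool)        # no dependencies in this toy case
theta = strads_round(theta, dep, lambda t, j: t[j] - 1.0)
```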
28 STRADS: Performance. Performance advantages of STRADS: faster convergence and support for larger model sizes, demonstrated on Latent Dirichlet Allocation and Matrix Factorization. S. Lee et al., "On model parallelization and scheduling strategies for distributed machine learning," NIPS 2014
29 CUDA-convnet. A fast C++/CUDA implementation of convolutional neural networks that supports multi-GPU training. (Figure: convolutional layers and fully-connected layers partitioned across GPU1 and GPU2; A. Krizhevsky, 2012.)
30 CUDA-convnet2. New features (w.r.t. cuda-convnet): improved training time; enhanced data parallelism, model parallelism, and hybrids of the two. Possible parallelizing schemes: (a) computing fully-connected activities after assembling a big batch from last-stage convlayer activities; (b) each worker sending its last-stage convlayer activities to all the other workers in turn, with the next worker's activities transferred in parallel with feedforward and backprop computation; (c) all of the workers sending #examples/K of their convlayer activities to all other workers, then proceeding as in (b). A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," 2014
31 CUDA-convnet2: Model Parallelism (Fully-Connected Layers). A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," 2014
32 Caffe. An open framework, models, and worked examples for deep learning. Pure C++/CUDA architecture (with Python and Matlab interfaces). Fast, well-tested code; tools, reference models, demos, and recipes; seamless switching between CPU and GPU. Applications: object classification, learning semantic features, object detection, sequences, reinforcement learning, speech + text.
33 Caffe: Example (LeNet). A network is a set of layers and their connections; Caffe creates and checks the net from its definition. Layers are declared in a plain-text scheme, not in code, as sketched below.
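For instance, one LeNet-style convolution layer in Caffe's plain-text (protobuf text) scheme looks roughly like this (field values here are illustrative, not the exact LeNet reference model):

```protobuf
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"          # input blob
  top: "conv1"            # output blob
  convolution_param {
    num_output: 20        # number of filters
    kernel_size: 5
    stride: 1
  }
}
```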
34 Caffe: Pros and Cons. Performance: 2 ms/image on a K40 GPU; <1 ms inference with Caffe + cuDNN v2 on a Titan X; 72 million images per day with batched IO. Pros: a fast way to apply deep neural networks; GPU support; many common and new functions; Python and Matlab bindings. Cons: only a few input formats and only one output format (HDF5).
35 DistBelief. Introduced by the Google Brain research team (J. Dean et al., "Large scale distributed deep networks," NIPS 2012). Uses a large-scale cluster to distribute training and inference, exploiting both data and model parallelism. Distributed optimization algorithms using a parameter server: Downpour SGD and Sandblaster L-BFGS. Trains a deep network with billions of parameters using tens of thousands of CPU cores; capable of training a deep network 30x larger than previously reported; state-of-the-art performance on ImageNet (as of 2012) 1); faster than a GPU on modestly sized deep networks. 1) An image database with 16M images and 20K categories; the model had 1B parameters.
36 DistBelief : Partition Model Across Machines J. Dean et al., "Large scale distributed deep networks," NIPS 2012
37 DistBelief: Asynchronous Distributed SGD. Each worker computes its gradient on a partial slice of the data; e.g., machine 3 holds examples $(x_{201}, y_{201}), \ldots, (x_{300}, y_{300})$ and computes $\mathrm{temp}_j = w_j - \alpha \frac{\partial}{\partial w_j} \sum_{i=201}^{300} \left( h_w(x_i) - y_i \right)^2$. Asynchronous communication on partitioned data; utilization of a parameter server. J. Dean et al., "Large scale distributed deep networks," NIPS 2012
38 DistBelief: Downpour SGD. Asynchronous distributed SGD: robust to machine failures, though it introduces additional stochasticity. Adagrad's adaptive learning rate improves robustness and scalability. Procedure: 1. asynchronously fetch parameters from the parameter server into multiple model replicas; 2. run the SGD process inside each model replica; 3. asynchronously push gradients back to the parameter server. J. Dean et al., "Large scale distributed deep networks," NIPS 2012
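A toy single-process sketch of one Downpour replica's loop (assumes a `ps` object exposing pull()/push() over the full dense parameter vector, in the spirit of the interface sketched earlier; the fetch/push periods and the one-example gradient are illustrative):

```python
import numpy as np

def downpour_replica(ps, data, n_fetch=5, n_push=5, lr=0.1, eps=1e-8):
    w = np.asarray(ps.pull())                 # 1. fetch parameters (async)
    acc = np.zeros_like(w)                    # gradients since the last push
    hist = np.zeros_like(w)                   # Adagrad per-coordinate sums
    for t, (x, y) in enumerate(data, 1):
        g = (w @ x - y) * x                   # 2. local SGD on one example
        hist += g * g
        w -= lr * g / (np.sqrt(hist) + eps)   # Adagrad-scaled local step
        acc += g
        if t % n_push == 0:                   # 3. push gradients (async)
            ps.push(acc)
            acc[:] = 0
        if t % n_fetch == 0:                  # refresh the possibly-stale replica
            w = np.asarray(ps.pull())
```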
39 Adam. Optimizes and balances computation and communication. Exploits model parallelism with minimized memory bandwidth and communication overhead. Achieves high performance and scalability, along with accuracy improvements. Multi-threaded model-parameter updates without locks; asynchronous batched parameter updates. Supports training any combination of stacked convolutional and fully-connected network layers.
40 Adam: Architecture. On a single machine: multi-threaded training with fast, lock-free weight updates (similar to Hogwild!). Across multiple machines: model partitioning; reduced memory copies (= data transfers) using its own network library; memory-system optimizations (L3 cache, cache locality); vector processing units for matrix multiplication; asynchronous updates with a global parameter server, mitigating speed variance across machines. T. Chilimbi et al., "Project Adam: Building an efficient and scalable deep learning training system," OSDI 2014
41 Adam: Results. Applications: MNIST / ImageNet. 120 machines: 90 (training) + 20 (parameter server) + 10 (image server). (Figures: performance of training nodes; scaling model size with more workers; accuracy of the two applications.) 30x fewer machines with 2x accuracy improvements.
42 Petuum. A data- and model-parallel approach that considers three properties of general ML: error tolerance (robustness against limited errors in the middle of calculation), dynamic structural dependency (changes in the correlations between parameters), and non-uniform convergence (differences in convergence speed between parameters).
43 Petuum: Architecture. Scheduler: the core of the model-parallelism support; the user specifies which parameters are updated via schedule(), and partial updates are aggregated by pull(). Worker: receives the parameters selected by schedule() and computes updates via push(); any data storage system can be used. Parameter server: uses the Stale Synchronous Parallel (SSP) consistency model; table-based or key-value stores.
44 Petuum: Stale Synchronous Parallel (SSP). A parallel consistency model that bounds the difference in the number of iterations progressed between workers. Reduces network synchronization and communication costs, relying on error-tolerant convergence. Q. Ho et al., "More effective distributed ML via a stale synchronous parallel parameter server," NIPS 2013
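A minimal sketch of the staleness bound (my toy construction, not Petuum's implementation): a worker may advance its iteration clock only while it stays within `staleness` iterations of the slowest worker.

```python
import threading

class SSPClock:
    def __init__(self, n_workers, staleness):
        self.clocks = [0] * n_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def tick(self, worker_id):
        # Advance this worker's clock, blocking if it would run more than
        # `staleness` iterations ahead of the slowest worker.
        with self.cond:
            while self.clocks[worker_id] - min(self.clocks) >= self.staleness:
                self.cond.wait()
            self.clocks[worker_id] += 1
            self.cond.notify_all()

# Each worker calls clock.tick(i) once per iteration; staleness=0 degenerates
# to bulk-synchronous execution, larger values reduce waiting.
clock = SSPClock(n_workers=4, staleness=3)
```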
45 Petuum: Performance. High relative speed-up compared to other implementations; near-linear speed-up as machines are added.
46 SINGA. A distributed deep learning platform for big data analytics. Supports CNNs, RBMs, RNNs, and other models; flexible enough to run synchronous, asynchronous, and hybrid training frameworks; supports various neural-net partitioning schemes. Design goals: generality (different categories of models, different training frameworks); scalability (scales to large models and training datasets, e.g., trained with 1 billion parameters and 10M images); ease of use (a simple programming model, built-in models, a Python binding, and a web interface, usable without much awareness of the underlying distributed platform).
47 SINGA: Distributed Training. Worker group: loads a subset of the training data and computes gradients for a model replica; workers within a group run synchronously, while different worker groups run asynchronously (see the sketch below). Server group: maintains one ParamShard; handles parameter-update requests from multiple worker groups; synchronizes with neighboring groups.
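A toy sketch of this hybrid pattern (mine, not SINGA's API): members of a worker group compute gradients synchronously and push one averaged update, while different groups run asynchronously as separate threads with no cross-group barrier.

```python
import numpy as np
from threading import Thread

rng = np.random.default_rng(0)
params = np.zeros(4)                       # the ParamShard (shared, racy)
X, y = rng.normal(size=(200, 4)), rng.normal(size=200)

def worker_group(shard_slices, steps=50, lr=0.05):
    global params
    for _ in range(steps):
        w = params.copy()                  # each replica pulls the same params
        grads = [X[s].T @ (X[s] @ w - y[s]) / len(y[s])  # sync: all members
                 for s in shard_slices]
        params -= lr * np.mean(grads, axis=0)   # one averaged update per group;
                                                # groups interleave asynchronously

groups = [Thread(target=worker_group, args=([slice(0, 50), slice(50, 100)],)),
          Thread(target=worker_group, args=([slice(100, 150), slice(150, 200)],))]
for t in groups:
    t.start()
for t in groups:
    t.join()
```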
48 SINGA: Configurations. Synchronous frameworks (1 server group & 1 worker group): AllReduce (Baidu's DeepImage) with co-located workers and servers; Sandblaster with separate worker and server groups. Asynchronous frameworks (1 server group & multiple worker groups): Dogwild (distributed Hogwild!) with co-located workers and servers; Downpour with separate worker and server groups.
49 SINGA: Pros and Cons. Pros: easy to use, supporting programming without much awareness of the underlying distributed platform; a distributed architecture offering synchronous, asynchronous, and hybrid updates. Cons: limited scale-up support (e.g., no support for GPUs).
50 Summary. In the era of big data, deep learning techniques show higher accuracy than traditional machine learning algorithms. However, deep learning often requires a huge amount of resources to reach state-of-the-art performance on large-scale data. This talk surveyed recent proposals for alleviating the computational challenges involved in training large-scale deep neural networks, with emphasis on examples of scale-up and scale-out techniques.