PARAMETER SERVER FOR DISTRIBUTED MACHINE LEARNING

Chapter 1 : A Comparison of Distributed Machine Learning Platforms

The parameter server architecture, shown above, has two classes of nodes: the server nodes maintain a partition of the globally shared parameters (machine-local parameters are not synchronized by default). GPU-based training, however, has so far mostly been limited to a single GPU, and distributed deep learning systems are typically CPU-based. Clearly the ideal would be to efficiently harness clusters of GPUs in a general-purpose framework. This is exactly what GeePS does, and the results are impressive. GeePS is a parameter server supporting data-parallel model training. In data-parallel training, the input data is partitioned among workers on different machines, which collectively update shared model parameters. These parameters themselves may be sharded across machines.

Using parameter servers to scale machine learning

[In] the basic parameter server architecture, all state shared among application workers (i.e., the model parameters being learned) is kept in the parameter server. Client-side caches are also used to serve most operations locally. The consistency model can conform to the Bulk Synchronous Parallel (BSP) model, in which all updates from the previous clock must be visible before proceeding to the next clock, or can use a looser but still bounded model. For example, the Stale Synchronous Parallel model allows the fastest worker to be ahead of the slowest worker by a bounded number of clocks. While logically the parameter server is separate from the worker machines, in practice the server-side parameter server state is commonly sharded across the same machines as the worker state. Doing so was straightforward and immediately enabled distributed deep learning on GPUs, confirming the application programmability benefits of the data-parallel parameter server approach. While it was easy to get working, however, the performance was not acceptable. As noted by Chilimbi et al. and others, the need to fit the full model, as well as a mini-batch of input data and intermediate neural network states, in GPU memory limits the size of models that can be trained.

Specializing a parameter server for GPUs

To enable a parameter server to support parallel ML applications running on distributed GPUs, the authors make three important changes.

Batching operations. One-at-a-time read and update operations on model parameter values can significantly slow execution. To realize sufficient performance, our GPU-specialized parameter server supports batch-based interfaces for reads and updates. These changes make parameter server accesses much more efficient for GPU-based training. GeePS implements an operation sequence gathering mechanism that gathers the operation sequence either in the first iteration or in a virtual iteration. Before real training starts, the application performs a virtual iteration with all GeePS calls marked with a virtual flag: operations are recorded by GeePS, but no real actions are taken. Since the gathered access information is used only as a hint, knowing the exact operation sequence is not a requirement for correctness, but a performance optimization.
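To make the batch-based interface concrete, here is a minimal in-memory sketch. It is an illustration, not the GeePS API; all names are invented.

    import numpy as np

    # Illustrative sketch of a batch-based parameter interface (not GeePS):
    # a whole list of keys is read or updated per call, instead of one
    # parameter row per round trip.
    class BatchedParameterStore:
        def __init__(self, table):
            self.table = table                      # key -> np.ndarray shard

        def read_batch(self, keys):
            # On a GPU-specialized server these gathers would run in
            # parallel on GPU cores; here we just stack the requested rows.
            return np.stack([self.table[k] for k in keys])

        def update_batch(self, keys, grads, lr=0.01):
            # Apply one batch of updates in a single call.
            for k, g in zip(keys, grads):
                self.table[k] -= lr * g

The operation-sequence hint described above would let such a store build the access index for a given batch of keys once and reuse it every iteration.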
Managing GPU memory. The parameter server uses pre-allocated GPU buffers to pass data to an application, rather than copying the parameter data into application-provided buffers. When an application wants to update parameter values, it also does so in GPU-allocated buffers, and it can store local non-parameter data (e.g., intermediate neural network states) there as well. The parameter server client library can then manage all the GPU memory on a machine, provided the application keeps all its local data in the parameter server and uses the parameter-server-managed buffers. When the GPU memory of a machine is not big enough to host all the data, the parameter server will store parts of the data in CPU memory. Fortunately, iterative applications like neural network training typically apply the same parameter data accesses every iteration, so the parameter server can easily predict the read operations and perform them in advance in the background. The parameter data is sharded across all instances and cached locally, with periodic refresh. When an application issues a read or update operation, it provides a list of keys, and all parameters are fetched or updated in parallel on the GPU cores. The access index built from the list of keys can be built just once for each batch of keys, using the operation sequence gathering process described earlier, and then re-used for each instance of the given batch access. While keeping the access buffer pool twice the peak size for double buffering, our policy will first try to pin the local data that is used at the peak in GPU memory, in order to reduce the peak size and thus the size of the buffer pool. Then, it will try to use the available capacity to pin

more local data and parameter cache data in GPU memory. Finally, it will add any remaining available GPU memory to the access buffer pool.

Figure 8 below shows the throughput scalability of GeePS on an image classification task, compared to a CPU-based distributed parameter server system and a single-node GPU Caffe system. To evaluate convergence speed, we compare the amount of time required to reach a given level of accuracy, which is a combination of image training throughput and model convergence per trained image. Using unmodified Caffe, a video classification RNN can support a maximum video length of 48 frames. To use longer videos, Ng et al. had to work around this limit. By contrast, with the memory management support of GeePS, videos with many more frames can be trained using solely data parallelism.

Many recent ML model training systems, including for neural network training, use a parameter server architecture to share state among data-parallel workers executing on CPUs. Consistent reports indicate that, in such an architecture, some degree of asynchrony (bounded or not) in parameter update exchanges among workers leads to significantly faster convergence than when using BSP. We observe the opposite with data-parallel workers executing on GPUs: while synchronization delays can be largely eliminated, as expected, convergence is much slower with the more asynchronous models because of reduced training quality. We believe there are two reasons causing this outcome. First, with our specializations, there is little to no communication delay for DNN applications, so adding data staleness does not increase throughput much.
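As a reference point, here is a minimal sketch of the BSP/SSP clock rule this chapter keeps referring to. The names are illustrative, not from the GeePS implementation.

    # A worker may start clock c+1 only if every worker has finished
    # clock c - staleness. BSP is the special case staleness == 0.
    class SSPClock:
        def __init__(self, num_workers, staleness):
            self.clocks = [0] * num_workers   # per-worker clock
            self.staleness = staleness        # bound on fastest minus slowest

        def can_advance(self, worker_id):
            return min(self.clocks) >= self.clocks[worker_id] - self.staleness

        def tick(self, worker_id):
            # In a real system the worker would block instead of asserting.
            assert self.can_advance(worker_id), "must wait for stragglers"
            self.clocks[worker_id] += 1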

Chapter 2 : Train TensorFlow Models with Azure Machine Learning

Scaling Distributed Machine Learning with the Parameter Server. Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, Bor-Yiing Su.

Deep learning has achieved great success in many areas recently, attaining state-of-the-art performance in applications ranging from image classification and speech recognition to time series forecasting. The key success factors of deep learning are big volumes of data, flexible models, and ever-growing computing power. With the increase in the number of parameters and training data, it has been observed that the performance of deep learning can be improved dramatically. However, when models and training data get big, they may not fit in the memory of a single CPU or GPU machine, and model training becomes slow. One approach to this challenge is to use large-scale clusters of machines to distribute the training of deep neural networks (DNNs). This technique enables a seamless integration of scalable data processing with deep learning. Other approaches, like using multiple GPUs on a single machine, work well with modest data but can be inefficient for big data. I demonstrate this by walking you through two examples: one for image classification and another for time series forecasting. The source code is available on GitHub. Essentially, we are solving a stochastic optimization problem in a distributed fashion. To help understand this concept, I will introduce the basic ideas and a few popular algorithms of distributed deep learning.

Data Parallelism

There are two main types of distributed deep learning frameworks, namely model parallelism and data parallelism. Model parallelism distributes the computations of a single model across multiple machines. As a simple example, it is possible to split the computation of the output of a perceptron (a single neuron) by having each input node compute the product of the input and the associated weight. In contrast, data parallelism tries to parallelize gradient descent over multiple machines by splitting the training data into several partitions, or data shards. Since this blog uses the data parallelism framework, I will briefly describe its typical steps. As illustrated in Figure 1, the training data is split into mini-batches over the workers. When training starts, every worker gets a copy of the model parameters and computes gradients on its local mini-batch. The local gradient information is then sent back to the parameter server, which averages all the accumulated gradients and applies the combined gradient to update the model parameters. After this, the workers download the new model parameters and the process repeats. (Figure 1: typical architecture of data parallelism for deep learning [1].)

Representative Algorithms

In this blog, we exploit distributed optimization algorithms based on data parallelism. Many such algorithms have been proposed to speed up the training of DNNs; DOWNPOUR is a representative example. DOWNPOUR is an asynchronous optimization algorithm that allows the parameter server to update the model parameters whenever it receives information from a worker. Although it is one of the most commonly used algorithms, it has large communication overhead and is not very stable with a large number of workers.

Environment Setup

To illustrate the above concept, we will use the Distributed Keras Python package (referred to as dist-keras) in the examples.
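Before turning to the cluster setup, here is a toy, in-memory version of the synchronous data-parallel step described above. The least-squares gradient is a stand-in for backprop on a DNN, and all names are illustrative.

    import numpy as np

    def worker_gradient(params, x_batch, y_batch):
        # Least-squares gradient as a stand-in for backpropagation.
        return 2 * x_batch.T @ (x_batch @ params - y_batch) / len(x_batch)

    def parameter_server_step(params, shards, lr=0.1):
        # Each (x, y) shard plays the role of one worker's mini-batch; in a
        # real cluster these gradients are computed on separate machines.
        grads = [worker_gradient(params, x, y) for x, y in shards]
        avg_grad = np.mean(grads, axis=0)   # the server averages the gradients
        return params - lr * avg_grad       # workers then download new params

DOWNPOUR differs from this synchronous sketch in that the server applies each worker's gradient as soon as it arrives, instead of waiting to average all of them.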
Note that you can also install the relevant packages on Azure Databricks and perform distributed deep learning, as in this example. Before we create a new cluster, as shown in Figure 2, we can specify additional packages that we want available in Jupyter. Note that we need to install the packages on all the nodes; hence, we should set the runon option to all-nodes in the cluster configuration. Another option for installing packages on AZTK is to use a custom Docker image with the packages preinstalled. You can easily create a Spark cluster in HDInsight and run Jupyter notebooks on it by following this tutorial. In particular, we just need to run the bash script shown in Figure 3, following this guidance. (Figure 3: bash script for installing packages on an HDInsight Spark cluster.)

Examples

Image Classification

Image classification is one of the first areas to be dominated by deep learning. In this example script, I train a convolutional network for handwritten digit classification using distributed deep learning on an AZTK Spark cluster. We can scale the number of processes and the number of executors up or down as needed; each executor and process corresponds to a worker and a core of that worker, respectively. Then, we load the training data and testing data from the

MNIST dataset as Spark DataFrames and perform a series of transformations to convert the data into the format that Distributed Keras requires. After the data is prepared, we can define the DNN model using Keras and perform distributed training on Spark with any algorithm provided by dist-keras, such as ADAG (Figure 5: using the ADAG algorithm to train a convolutional network).

Time Series Forecasting

Time series forecasting is a ubiquitous problem in many domains, including energy, retail, finance, healthcare, and others. The models used here are recurrent neural networks (RNNs), which have proven powerful in time series forecasting. The data has three informative columns, indicating the time, hourly energy consumption, and temperature. (Figure: hourly energy consumption of New York City.) The tricky parts are how to normalize the data and how to create features for time series forecasting using PySpark. MinMaxTransformer in Distributed Keras is used to normalize the range of each data column. Unlike MinMaxScaler in scikit-learn, we need to specify the range of values both before and after the transformation (Figure 7: normalizing the range of a data column), but this allows us to use MinMaxTransformer for the inverse transformation as well. Since the transformers defined in the Distributed Keras package operate on Spark DataFrames, I use window functions in PySpark to create the input features and output targets (a sketch follows at the end of this section). Then, I assemble all the features into a vector and reshape the vectors into the format that Keras requires; similarly, I do the assembling and reshaping for the target variables. After the data is prepared, the networks are trained (Figure: creating and fitting an LSTM model). The input sequence length and output sequence length are 24 and 1, respectively: the model maps the energy consumption in the last 24 hours to the consumption in the next hour. I use the last 5 days of data as testing data and all the previous data as training data. After the model is trained, we can apply it to predict the energy consumption and convert the predictions back to the original range of the energy consumption. (Figure: training time and MAPE.)
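A hedged sketch of that window-function step, using standard PySpark APIs. The DataFrame and column names here are hypothetical, not taken from the original script.

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # df is assumed to have "time" and "load" (hourly consumption) columns.
    w = Window.orderBy("time")

    # The previous 24 hourly loads become the input features; the current
    # row's load is the target, matching the 24-in/1-out setup above.
    features = df
    for i in range(1, 25):
        features = features.withColumn("load_lag_%d" % i,
                                       F.lag("load", i).over(w))
    features = features.dropna()  # the first 24 rows lack a full history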

Chapter 3 : DMTK - Microsoft Research

USENIX Association, 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14): Scaling Distributed Machine Learning with the Parameter Server.

Object detection powers some of the most widely adopted computer vision applications, from people counting in crowd control to pedestrian detection in self-driving cars. Training an object detection model can take weeks on a single GPU, a prohibitively long time for experimenting with hyperparameters and model architectures. This blog will show how you can train an object detection model by distributing deep learning training across multiple GPUs, whether on a single machine or several machines. You will learn how to perform distributed deep learning on Azure using Horovod running on Azure Batch AI.

Object Detection

Object detection combines the task of classification with localization, outputting both a category and a set of coordinates representing the bounding box for each object it detects in the image, as illustrated in Figure 1 below. (Figure 1: different computer vision tasks.) Over the past few years, many exciting deep learning approaches for object detection have emerged. Models such as Faster R-CNN use a two-stage procedure, first proposing regions that contain some object, then classifying the object in each region and adjusting its bounding box. These are a few examples of the array of model architectures available for object detection. Instead of taking the raw image as input, these object detection models work off the feature map produced by a backbone network, which is often the convolutional layers of a classification network such as ResNet. Several promising approaches have been introduced recently, and this paper provides a good overview of the trade-offs between different object detection architectures. (Figure: tradeoffs between accuracy and inference time across object detection approaches. Marker shapes indicate the meta-architecture and colors indicate the feature extractor; each meta-architecture and feature extractor pair corresponds to multiple points on the plot due to changing input sizes, strides, etc.)

While in many situations a powerful GPU can carry out model training in a reasonable amount of time, for elaborate models such as object detectors it can take days or weeks to complete. To make hyperparameter search and rapid iterative experimentation practical, we look to speed up training by distributing the computation to multiple GPUs in a computer, or even across a cluster of computers. Below we briefly discuss the several ways distributed training can be accomplished, and introduce Horovod, a distributed deep learning framework that can be used with TensorFlow, Keras, and PyTorch.

Model Parallelism and Data Parallelism

The gain in speed from distributing training to more than one GPU comes from parallelizing compute operations across multiple processes, each running on a separate GPU, for instance. There are two approaches for doing this. In the model parallelism approach, the parameters of the model are distributed across multiple devices and one batch of data is processed in each iteration. This is helpful for very large models that cannot fit on a single device. Parallelizing a model requires it to be implemented with the compute resources in mind, so there is no easy way to rely on a framework to do this for new models or device settings.
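To make the model-parallel idea concrete, here is a toy sketch that splits one linear layer across two "devices". It is pure NumPy and purely illustrative; no real devices are involved.

    import numpy as np

    W = np.random.randn(512, 256)                # full layer weights
    W_dev0, W_dev1 = np.hsplit(W, 2)             # each half lives on one device

    def forward(x):
        out0 = x @ W_dev0                        # computed on device 0
        out1 = x @ W_dev1                        # computed on device 1
        return np.concatenate([out0, out1], axis=1)  # gather the slices

    x = np.random.randn(32, 512)                 # one mini-batch
    assert np.allclose(forward(x), x @ W)        # matches the unsplit layer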
(Figure: model parallelism.) When using data parallelism, the same copy of the training script (a replica) is run on all devices, but each device reads in a different chunk of data at each iteration. The gradients computed by all copies are averaged by some mechanism, and the model gets updated. In distributed TensorFlow, parameter servers are used to average the gradients. Each process running in a distributed TensorFlow setup plays either a worker or a parameter server role. Workers process training data, compute the gradients of the model parameters, and send them to one or more parameter servers to be averaged, later obtaining a copy of the updated model for the next iteration. To use this, training code designed for running on a single GPU needs to be carefully adapted, a rather error-prone process. In addition, this distribution often suffers from various scaling inefficiencies, and the GPUs are not fully utilized. (Figure: the parameter server approach. Source: Horovod presentation.) A different approach to gradient averaging, called ring-allreduce, was popularized by Baidu in early 2017 and first implemented as a fork of TensorFlow. In this approach, workers are connected in a ring, each communicating with two neighboring workers,

and can average gradients and disperse them without a central parameter server. Below is an illustration of how ring-allreduce works. (Figure: the ring-allreduce approach.)

Why Horovod?

A key consideration in distributed deep learning is how to efficiently use the resources that are available (CPUs, GPUs, and more). Horovod is an open source project, initially developed at Uber, that implements the ring-allreduce algorithm, first designed for TensorFlow. It provides several advantages compared to the default distributed TensorFlow implementation. The Horovod API enables you to convert a training script designed to run on one GPU into a distributed-training-ready script with a few lines of code; we will demonstrate how to do this in the next section. Horovod improves training speed by more fully utilizing the GPUs, and it works with different deep learning frameworks: TensorFlow, Keras, and PyTorch. Models written using these frameworks can be easily trained on Azure Batch AI, which has native support for Horovod. In addition, Batch AI enables you to train models for different use cases at scale, and the trained models can be deployed to the cloud or to edge devices.

Model Training and Deployment

In this section, you will learn how you can build, train, and deploy an object detection model using Azure. We will use the following resources: Azure Blob Storage to store the dataset for easy access during training and evaluation; Horovod for distributed training; and Azure Machine Learning, a suite of services for experiment run history, version control, and model management and deployment. The diagram below illustrates the architecture of our solution; each of these steps is discussed in the following sections. The COCO training and validation sets contain over 100,000 images representing scenes in everyday life, annotated with bounding boxes labeling 80 classes of common objects such as bicycles and cars, humans and pets, foods, and furniture. We downloaded the training and validation images and annotations from the COCO dataset download page, unzipped the files, and used the AzCopy utility to transfer them into a blob container on an Azure Storage Account for fast and easy access during training. During deployment, the Batch AI service coordinates setup tasks that need to be performed on each VM in the cluster: installing less common Python and Anaconda packages, mounting the blob container with our training data, and ensuring each VM can access a file share on our Azure Storage Account where scripts, logs, and output models will be stored. One of the advantages of Horovod is its simplicity. Notably, you need to: (1) import the Horovod package; (2) configure the number of GPUs visible to Horovod; (3) use a distributed optimizer to wrap a regular optimizer, such as Adam; and (4) add the necessary callbacks to avoid conflicts between workers when saving the model. First, we added Horovod import and initialization statements at the top of the script:
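A condensed sketch of those four changes, following the standard horovod.keras API. The learning rate, checkpoint path, and model code are placeholders, not the original training script.

    import keras
    import tensorflow as tf
    import horovod.keras as hvd

    hvd.init()  # (1) initialize Horovod

    # (2) pin one GPU per process
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    keras.backend.set_session(tf.Session(config=config))

    # (3) wrap a regular optimizer; scaling the learning rate by the number
    # of workers is the usual convention
    opt = hvd.DistributedOptimizer(keras.optimizers.Adam(0.001 * hvd.size()))

    # (4) broadcast initial weights from rank 0, and let only rank 0 save
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    if hvd.rank() == 0:
        callbacks.append(keras.callbacks.ModelCheckpoint("checkpoint-{epoch}.h5"))

    # model.compile(optimizer=opt, loss=..., metrics=...)
    # model.fit(..., callbacks=callbacks)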

Chapter 4 : Distributed Machine Learning Toolkit

Many machine learning problems rely on large amounts of data for training and then for inference. Big internet-scale companies train on terabytes or petabytes of data and create models from it. This page describes the key concepts you need in order to make the most of your model training.

How training works

Your training application, implemented in Python and TensorFlow, is the core of the training process. Cloud ML Engine runs your training job on computing resources in the cloud. You create a TensorFlow application that trains your model; Cloud ML Engine has almost no specific requirements of your application during the training process, so you build it as you would to run locally in your development environment. You get your training and verification data into a source that Cloud ML Engine can access. When your application is ready to run, you package it and transfer it to a Cloud Storage bucket that your project can access; this is automated when you use the gcloud command-line tool to run a training job. The Cloud ML Engine training service then sets up resources for your job: it allocates one or more virtual machines, called training instances, based on your job configuration. Each training instance is set up by applying the standard machine image for the version of Cloud ML Engine your job uses, loading your application package and installing it with pip, and installing any additional packages that you specify as dependencies. The training service runs your application, passing through any command-line arguments you specify when you create the training job. You can get information about your running job by requesting job details or streaming logs with the gcloud command-line tool, or by programmatically making status requests to the training service. When your training job succeeds or encounters an unrecoverable error, Cloud ML Engine halts all job processes and cleans up the resources.

If you run a distributed TensorFlow job with Cloud ML Engine, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify and performs the setup steps above on each. Your running job on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training. Exactly one replica is designated the master; this task manages the others and reports status for the job as a whole. In the distributed case, it is the status of the master replica that signals the overall job status (if you are running a single-process job, the sole replica is the master for the job). One or more replicas may be designated as workers; these replicas do their portion of the work as you designate in your job configuration. One or more replicas may be designated as parameter servers; these replicas coordinate shared model state between the workers. The training service runs until your job succeeds or encounters an unrecoverable error.

A typical machine learning application

The Cloud ML Engine training service is designed to have as little impact on your application as possible, meaning you can focus on your TensorFlow code to define the model you want instead of being confined to a rigid structure. Most machine learning applications: provide a way to get training data and evaluation data; process data instances in batches;
use evaluation data to test the accuracy of the model (how often it predicts the right value); and provide a way to export the trained model when the application finishes.

Packaging your application

Before you can run your training application with Cloud Machine Learning Engine, you must package your application and any additional dependencies you require, and upload the package to a Cloud Storage bucket that your Google Cloud Platform project can access. The gcloud command-line tool automates much of the process; specifically, you can use gcloud ml-engine jobs submit training to upload your application package and submit your training job. See the detailed instructions on packaging a training application.

Submitting your training job

Cloud Machine Learning Engine provides model training as an asynchronous batch service. You can submit a training job by running gcloud ml-engine jobs submit training from the command line or by sending a request to the projects.jobs.create API method. See the detailed instructions on starting a training job.

Job ID

You must give your training job a name that obeys these rules: It must be unique within your Google Cloud Platform project. It may only contain mixed-case letters, digits, and underscores. It must start with a letter. It must be no more

than 128 characters long. You can use whatever job naming convention you want. If you run a lot of jobs, you may need to find your job ID in large lists; a common convention is to append a timestamp to the model name, which makes it easy to sort lists of jobs by name, because all jobs for a model are then grouped together in ascending order.

Scale tiers

When running a training job on Cloud ML Engine, you must specify the number and types of machines you need. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers; alternatively, you can choose a custom tier and specify the machine types yourself. To specify a scale tier, add it to the TrainingInput object in your job configuration. See the detailed definitions of scale tiers and machine types.

Hyperparameter tuning

If you want to use hyperparameter tuning, you must include configuration details when you create your training job. See the conceptual guide to hyperparameter tuning and how to use hyperparameter tuning.

Regions and zones

GCP uses regions, subdivided into zones, to define the geographic location of physical computing resources. When you run a training job on Cloud ML Engine, you specify the region that you want it to run in. If you must run your job in a different region from your data bucket, your job may take longer.

Using job-dir as a common output directory

You can specify the output directory for your job by setting a job directory when you configure the job. When you submit the job, Cloud ML Engine validates the directory (so that you can fix any problems before the job runs) and passes the path to your application as a command-line argument named --job-dir. You need to account for the --job-dir argument in your application (a sketch follows at the end of this section). See the guide to starting a training job.

Runtime version

You should specify a supported Cloud ML Engine runtime version for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances; specify a version that gives you the functionality you need. If you run the training job locally as well as in the cloud, make sure the local and cloud jobs use the same runtime version.

Input data

The data that you use in your training job must obey the following rules to run on Cloud ML Engine: it must be in a format that you can read and feed to your TensorFlow code, and it must be in a location that your code can access. This typically means that it should be stored with one of the GCP storage or big data services.

Output data

It is common for applications to output data, including checkpoints during training and a saved model when training is complete; you can output other data as needed by your application. It is easiest to save your output files to a Cloud Storage bucket in the same GCP project as your training job. Training instances can be restarted, so you should ensure that your training job is resilient to these restarts by saving model checkpoints regularly and by configuring your job to restore the most recent checkpoint. You usually save model checkpoints in the Cloud Storage path that you specify with the --job-dir argument in the gcloud ml-engine jobs submit training command.

GPUs are designed to perform mathematically intensive operations at high speed. They can be more effective at running certain operations on tensor data than adding another machine with one or more CPU cores. You can specify GPU-enabled machines to run your job, and the service allocates them for you.
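A minimal sketch of the --job-dir handling and checkpointing described above, as a hypothetical trainer script in TensorFlow 1.x style. The variable and paths are placeholders, not part of any real job.

    import argparse
    import os
    import tensorflow as tf

    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", required=True,
                        help="Cloud Storage path for checkpoints and outputs")
    args, _ = parser.parse_known_args()  # tolerate other service-supplied flags

    w = tf.Variable(0.0, name="w")       # stand-in for the real model variables
    saver = tf.train.Saver()

    with tf.Session() as sess:
        latest = tf.train.latest_checkpoint(args.job_dir)
        if latest:                        # resume after a restart
            saver.restore(sess, latest)
        else:
            sess.run(tf.global_variables_initializer())
        # ... inside the training loop, periodically save a checkpoint:
        saver.save(sess, os.path.join(args.job_dir, "model.ckpt"), global_step=0)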
When you specify a machine type with GPU access for a task type, each instance assigned to that task type is configured identically. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.

Chapter 5 : What is the reason to use a parameter server in distributed TensorFlow learning? - Stack Overflow

We propose a parameter server framework for distributed machine learning problems. Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. Especially in recent years, practice has demonstrated the trend that more training data and bigger models tend to generate better accuracies in various applications. However, it remains a challenge for common machine learning researchers and practitioners to learn big models from huge amounts of data, because the task usually requires a large amount of computational resources. These innovations make machine learning tasks on big data highly scalable, efficient, and flexible. We will continue to add new algorithms to DMTK on a regular basis. The current version of DMTK includes the following components (more components will be added in future versions). Machine learning researchers and practitioners can also build their own distributed machine learning algorithms on top of our framework, with small modifications to their existing single-machine algorithms. We believe that in order to push the frontier of distributed machine learning, we need the collective effort of the entire community, and the organic combination of both machine learning innovations and system innovations. This belief strongly motivates us to open source the DMTK project.

LightLDA

LightLDA is a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is surprisingly agnostic of model size and which empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers. In the distributed implementation, we leverage the hybrid data structure, model scheduling, and automatic pipelining functions provided by the DMTK framework to make LightLDA capable of handling extremely big data and big models even on a modest computer cluster. In particular, on a cluster of as few as 8 machines, we can train a topic model with 1 million topics and a million-word vocabulary, for a total of 10 trillion parameters, on a document collection with billions of tokens, a scale not yet reported even with thousands of machines in the literature.

Distributed Multi-sense Word Embedding

Word embedding has become a very popular tool for computing semantic representations of words, which can serve as high-quality word features for natural language processing tasks. We provide distributed implementations of two word embedding algorithms: one is the standard word2vec algorithm, and the other is a multi-sense word embedding algorithm that learns multiple embedding vectors for polysemous words. By leveraging the model scheduling and automatic pipelining functions provided by the DMTK framework, we are able to train word embedding vectors for a million-word vocabulary, on a document collection with billions of tokens, on a cluster of 8 machines.

LightGBM

LightGBM has been shown to be several times faster than existing implementations of gradient boosting trees, due to its fully greedy tree-growth method and histogram-based memory and computation optimization. It also has a complete solution for distributed training, based on the DMTK framework. DMTK is a platform designed for distributed machine learning. Deep learning is not our focus, and the algorithms released in DMTK are mostly non-deep-learning algorithms.
If you want to use state-of-the-art deep learning tools, we highly recommend Microsoft CNTK. We have close collaborations with CNTK and provide support for its asynchronous parallel training functionality.

Chapter 6 : Distributed Deep Learning on AZTK and HDInsight Spark Clusters

Communication Efficient Distributed Machine Learning with the Parameter Server. Mu Li, David G. Andersen, Alexander Smola, and Kai Yu (Carnegie Mellon University, Baidu, Google).

Monday, July 31, 2017. A Comparison of Distributed Machine Learning Platforms. This paper surveys the design approaches used in distributed machine learning (ML) platforms and proposes future research directions. This is joint work with my students Kuo Zhang and Salem Alqahtani. These technologies have very promising applications in self-driving cars, digital health systems, CRM, advertising, the internet of things, etc. Due to the huge dataset and model sizes involved in training, ML platforms are often distributed, employing 10s or 100s of workers in parallel to train the models. It is estimated that an overwhelming majority of the tasks in datacenters will be machine learning tasks in the near future.

My background is in distributed systems, so we decided to study these ML platforms from a distributed systems perspective and analyze their communication and control bottlenecks. We also looked at fault tolerance and ease of programming. We categorize the distributed ML platforms under three basic design approaches: basic dataflow, parameter-server, and advanced dataflow. We discuss each approach in brief, using Apache Spark as an example of the basic dataflow approach, PMLS (Petuum) as an example of the parameter-server model, and TensorFlow and MXNet as examples of the advanced dataflow model. We provide a couple of evaluation results comparing their performance; see the paper for more evaluation results. Unfortunately, we were unable to evaluate at scale as a small team from academia. At the end of this post, I present concluding remarks and recommendations for future work on distributed ML platforms. Skip to the end if you already have some experience with these distributed ML platforms.

There are two kinds of operations: transformations and actions. The DAG is compiled into stages, and each stage is executed as a series of tasks that run in parallel (one task for each partition). Narrow dependencies are good for efficient execution, whereas wide dependencies introduce bottlenecks, since they disrupt pipelining and require communication-intensive shuffle operations. Distributed execution in Spark is performed by partitioning the DAG stages across machines. The figure shows the master-worker architecture clearly: the driver contains two scheduler components, the DAG scheduler and the task scheduler, and it tasks and coordinates the workers.

Spark was designed for general data processing, not specifically for machine learning. In the basic setup, Spark stores the model parameters in the driver node, and the workers communicate with the driver to update the parameters after each iteration. For large-scale deployments, the model parameters may not fit in the driver and would be maintained as an RDD. This introduces a lot of overhead, because a new RDD must be created in each iteration to hold the updated model parameters. This is where the basic dataflow model (the DAG) in Spark falls short: Spark does not support the iteration needed in ML well.

PMLS introduced the parameter-server (PS) abstraction for serving the iteration-intensive ML training process. The PS (shown in the green boxes in the figure) is maintained as a distributed in-memory key-value store; thus the PS scales well with respect to the number of nodes.
The workers request up-to-date model parameters from their local PS copy and carry out computation over the partition of the dataset assigned to them. The relaxed consistency model is still OK for ML training, due to the noise tolerance of the process. I had covered this in an April blog post.

Here is my review of the DistBelief paper. From what I can tell, the major complaint about DistBelief was that it required messing with low-level code for writing ML applications. Google wanted any of its employees to be able to write ML code without being well-versed in distributed execution (this is the same reason why Google wrote the MapReduce framework for big data processing), so TensorFlow is designed to enable that goal. TensorFlow adopts the dataflow paradigm, but an advanced version, in which the computation graph does not need to be a DAG but can include cycles and support mutable state. I think the Naiad design might have had some influence on the TensorFlow design. TensorFlow denotes computation with a directed graph of nodes and edges: the nodes represent computations, with mutable state, and the edges represent multidimensional data arrays (tensors) communicated between nodes. When you use the PS abstraction in TensorFlow, you use a

parameter-server and data parallelism. TensorFlow says you can do more complicated things, but that requires writing custom code and marching into uncharted territory.

Some evaluation results

For our evaluations we used Amazon EC2 m4 instances with EBS-backed storage. We used two common machine learning tasks for evaluation: logistic regression and deep neural networks. I am only providing a couple of graphs here; check our paper for more experiments. Our experiments had several limitations. (Figures: platform speeds for logistic regression and for DNNs.) Spark sees a greater performance loss going from single-layer logistic regression to a two-layer neural network, due to the additional iterative computation needed. We kept the parameters in the driver in Spark because they could fit; things would have been much worse if we had kept the parameters in an RDD and updated them after every iteration. (Figure: CPU utilization of the platforms.) The Spark application has significantly higher CPU utilization, which comes mainly from serialization overhead; this problem has been pointed out by earlier work.

It is safe to say the parameter-server approach won for training in distributed ML platforms. As far as bottlenecks are concerned, the network still remains a bottleneck for distributed ML applications. However, there can be some surprises and subtleties: in Spark, the CPU overhead became the bottleneck before the network limitations did, and the programming language used in Spark (i.e., Scala on the JVM) contributes to this overhead. Some tools addressing the problem for Spark data processing applications have been proposed recently, such as Ernest and CherryPick. There are many open questions for distributed systems support of the ML runtime, such as resource scheduling and runtime performance improvement. What are suitable [distributed] programming abstractions for ML applications? More research is also needed on verification and validation (testing DNNs with particularly problematic inputs) of distributed ML applications.

Chapter 7 : Multiverso, a Parameter Server Platform for Distributed Machine Learning - Microsoft Research

Scaling Distributed Machine Learning with the Parameter Server. Presented by Liang Gong, Electrical Engineering & Computer Science, University of California, Berkeley.

Chapter 8 : Training Overview, Cloud ML Engine for TensorFlow, Google Cloud

The Parameter Server (PS) is a popular system architecture for large-scale machine learning systems; by self-tuning we mean adjusting the expert-suggested configuration while a long-running ML job is iteratively training.

Chapter 9 : Train TensorFlow Models with Azure Machine Learning - Microsoft Docs

Using a parameter server can give you better network utilization, and it lets you scale your models to more machines. As a concrete example, suppose you have M parameters, it takes 1 second to compute a gradient on each worker, and there are 10 workers; each worker must then send M gradient values to, and receive M parameter values from, the aggregator every second (a back-of-the-envelope sketch follows).
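Continuing that example with assumed numbers: the parameter count and link bandwidth below are hypothetical, chosen only to show the scaling argument, not taken from the original answer.

    # Back-of-the-envelope continuation of the example above; M and the
    # link bandwidth are assumed values, for illustration only.
    M = 100_000_000            # parameters (assumed)
    workers = 10
    bytes_per_value = 4        # float32
    link_gbps = 10             # per-node network bandwidth (assumed)

    # A single aggregation node must exchange 2*M values (gradients in,
    # parameters out) with every worker each second.
    central_bits = 2 * M * bytes_per_value * workers * 8
    print("single node: %.1f s of network time per 1 s of compute"
          % (central_bits / (link_gbps * 1e9)))

    # Sharding the parameters over S parameter servers splits that traffic,
    # so communication capacity scales with the number of servers.
    S = 10
    print("%d shards: %.1f s per server"
          % (S, central_bits / S / (link_gbps * 1e9)))

With these numbers, the single aggregation node spends several seconds of network time for every second of gradient computation, while sharding over ten servers brings the per-node communication back under the compute time, which is the reason to use a parameter server.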


CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

Data Centers and Cloud Computing. Data Centers

Data Centers and Cloud Computing. Data Centers Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

Accelerating Spark Workloads using GPUs

Accelerating Spark Workloads using GPUs Accelerating Spark Workloads using GPUs Rajesh Bordawekar, Minsik Cho, Wei Tan, Benjamin Herta, Vladimir Zolotov, Alexei Lvov, Liana Fong, and David Kung IBM T. J. Watson Research Center 1 Outline Spark

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Optimizing Apache Spark with Memory1. July Page 1 of 14

Optimizing Apache Spark with Memory1. July Page 1 of 14 Optimizing Apache Spark with Memory1 July 2016 Page 1 of 14 Abstract The prevalence of Big Data is driving increasing demand for real -time analysis and insight. Big data processing platforms, like Apache

More information

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems Khoa Huynh Senior Technical Staff Member (STSM), IBM Jonathan Samn Software Engineer, IBM Evolving from compute systems to

More information

Machine Learning in WAN Research

Machine Learning in WAN Research Machine Learning in WAN Research Mariam Kiran mkiran@es.net Energy Sciences Network (ESnet) Lawrence Berkeley National Lab Oct 2017 Presented at Internet2 TechEx 2017 Outline ML in general ML in network

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Parallel Implementation of Deep Learning Using MPI

Parallel Implementation of Deep Learning Using MPI Parallel Implementation of Deep Learning Using MPI CSE633 Parallel Algorithms (Spring 2014) Instructor: Prof. Russ Miller Team #13: Tianle Ma Email: tianlema@buffalo.edu May 7, 2014 Content Introduction

More information

Keras: Handwritten Digit Recognition using MNIST Dataset

Keras: Handwritten Digit Recognition using MNIST Dataset Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA February 9, 2017 1 / 24 OUTLINE 1 Introduction Keras: Deep Learning library for Theano and TensorFlow 2 Installing Keras Installation

More information

CERN openlab & IBM Research Workshop Trip Report

CERN openlab & IBM Research Workshop Trip Report CERN openlab & IBM Research Workshop Trip Report Jakob Blomer, Javier Cervantes, Pere Mato, Radu Popescu 2018-12-03 Workshop Organization 1 full day at IBM Research Zürich ~25 participants from CERN ~10

More information

Matrix Computations and " Neural Networks in Spark

Matrix Computations and  Neural Networks in Spark Matrix Computations and " Neural Networks in Spark Reza Zadeh Paper: http://arxiv.org/abs/1509.02256 Joint work with many folks on paper. @Reza_Zadeh http://reza-zadeh.com Training Neural Networks Datasets

More information

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation Accelerating GPU inferencing with DirectML and DirectX 12 Shrinath Shanbhag Senior Software Engineer Microsoft Corporation Machine Learning Machine learning has become immensely popular over the last decade

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Democratizing Machine Learning on Kubernetes

Democratizing Machine Learning on Kubernetes Democratizing Machine Learning on Kubernetes Joy Qiao, Senior Solution Architect - AI and Research Group, Microsoft Lachlan Evenson - Principal Program Manager AKS/ACS, Microsoft Who are we? The Data Scientist

More information

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING Volume 119 No. 16 2018, 937-948 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING K.Anusha

More information

Parallel learning of content recommendations using map- reduce

Parallel learning of content recommendations using map- reduce Parallel learning of content recommendations using map- reduce Michael Percy Stanford University Abstract In this paper, machine learning within the map- reduce paradigm for ranking

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu Machine Learning and Big

More information

WHITEPAPER. Pipelining Machine Learning Models Together

WHITEPAPER. Pipelining Machine Learning Models Together WHITEPAPER Pipelining Machine Learning Models Together Table of Contents Introduction 2 Performance and Organizational Benefits of Pipelining 4 Practical Use Case: Twitter Sentiment Analysis 5 Practical

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017 Memory Bandwidth and Low Precision Computation CS6787 Lecture 9 Fall 2017 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY Tutorial on Keras CAP 6412 - ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY Deep learning packages TensorFlow Google PyTorch Facebook AI research Keras Francois Chollet (now at Google) Chainer Company

More information

Onto Petaflops with Kubernetes

Onto Petaflops with Kubernetes Onto Petaflops with Kubernetes Vishnu Kannan Google Inc. vishh@google.com Key Takeaways Kubernetes can manage hardware accelerators at Scale Kubernetes provides a playground for ML ML journey with Kubernetes

More information

MoonRiver: Deep Neural Network in C++

MoonRiver: Deep Neural Network in C++ MoonRiver: Deep Neural Network in C++ Chung-Yi Weng Computer Science & Engineering University of Washington chungyi@cs.washington.edu Abstract Artificial intelligence resurges with its dramatic improvement

More information

Real-time Object Detection CS 229 Course Project

Real-time Object Detection CS 229 Course Project Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

15.1 Optimization, scaling, and gradient descent in Spark

15.1 Optimization, scaling, and gradient descent in Spark CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 16, 5/24/2017. Scribed by Andreas Santucci. Overview

More information

Distributed Machine Learning: An Intro. Chen Huang

Distributed Machine Learning: An Intro. Chen Huang : An Intro. Chen Huang Feature Engineering Group, Data Mining Lab, Big Data Research Center, UESTC Contents Background Some Examples Model Parallelism & Data Parallelism Parallelization Mechanisms Synchronous

More information

WHITE PAPER PernixData FVP

WHITE PAPER PernixData FVP WHITE PAPER PernixData FVP Technical White Paper 1 EXECUTIVE SUMMARY The last decade has seen virtualization become a mainstay in the enterprise data center. Enterprises are now looking to virtualize their

More information

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center Machine Learning With Python Bin Chen Nov. 7, 2017 Research Computing Center Outline Introduction to Machine Learning (ML) Introduction to Neural Network (NN) Introduction to Deep Learning NN Introduction

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Data Analytics on RAMCloud

Data Analytics on RAMCloud Data Analytics on RAMCloud Jonathan Ellithorpe jdellit@stanford.edu Abstract MapReduce [1] has already become the canonical method for doing large scale data processing. However, for many algorithms including

More information

Stream Processing on IoT Devices using Calvin Framework

Stream Processing on IoT Devices using Calvin Framework Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised

More information

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm Instructions This is an individual assignment. Individual means each student must hand in their

More information

Apparel Classifier and Recommender using Deep Learning

Apparel Classifier and Recommender using Deep Learning Apparel Classifier and Recommender using Deep Learning Live Demo at: http://saurabhg.me/projects/tag-that-apparel Saurabh Gupta sag043@ucsd.edu Siddhartha Agarwal siagarwa@ucsd.edu Apoorve Dave a1dave@ucsd.edu

More information

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,

More information

Leveraging AI on the Cloud to transform your business. Florida Business Analytics Forum 2018 at University of South Florida

Leveraging AI on the Cloud to transform your business. Florida Business Analytics Forum 2018 at University of South Florida Leveraging AI on the Cloud to transform your business Florida Business Analytics Forum 2018 at University of South Florida 1 My (unusual) path to Google Neural networks at NOAA 2 DNNs solved image analysis

More information

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60

More information