PARAMETER SERVER FOR DISTRIBUTED MACHINE LEARNING

Chapter 1 : A Comparison of Distributed Machine Learning Platforms

The parameter server architecture, shown above, has two classes of nodes: the server nodes maintain a partition of the globally shared parameters (machine-local parameters are not synchronized by default). GPU-based training, however, has so far mostly been limited to a single GPU, and distributed deep learning systems are typically CPU-based. Clearly the ideal would be to efficiently harness clusters of GPUs in a general-purpose framework. This is exactly what GeePS does, and the results are impressive. GeePS is a parameter server supporting data-parallel model training. In data-parallel training, the input data is partitioned among workers on different machines, which collectively update shared model parameters. These parameters themselves may be sharded across machines.

Using parameter servers to scale machine learning

[In] the basic parameter server architecture, all state shared among application workers (i.e., the model parameters being learned) is kept in the parameter server. Client-side caches are also used to serve most operations locally. The consistency model can conform to the Bulk Synchronous Parallel (BSP) model, in which all updates from the previous clock must be visible before proceeding to the next clock, or can use a looser but still bounded model. For example, the Stale Synchronous Parallel model allows the fastest worker to be ahead of the slowest worker by a bounded number of clocks. While logically the parameter server is separate from the worker machines, in practice the server-side parameter server state is commonly sharded across the same machines as the worker state. Doing so was straightforward and immediately enabled distributed deep learning on GPUs, confirming the application programmability benefits of the data-parallel parameter server approach. While it was easy to get working, however, the performance was not acceptable. As noted by Chilimbi et al. and others, the need to fit the full model, as well as a mini-batch of input data and intermediate neural network states, in GPU memory limits the size of models that can be trained.

Specializing a parameter server for GPUs

To enable a parameter server to support parallel ML applications running on distributed GPUs, the authors make three important changes.

Batching operations. One-at-a-time read and update operations on model parameter values can significantly slow execution. To realize sufficient performance, our GPU-specialized parameter server supports batch-based interfaces for reads and updates. These changes make parameter server accesses much more efficient for GPU-based training. GeePS implements an operation sequence gathering mechanism that gathers the operation sequence either in the first iteration or in a virtual iteration. Before real training starts, the application performs a virtual iteration with all GeePS calls marked with a virtual flag: operations are recorded by GeePS, but no real actions are taken. Since the gathered access information is used only as a hint, knowing the exact operation sequence is not a requirement for correctness, but a performance optimization.
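To make the batch-based interface concrete, here is a minimal in-memory sketch. It is an illustration, not the GeePS API; all names are invented.

    import numpy as np

    # Illustrative sketch of a batch-based parameter interface (not GeePS):
    # a whole list of keys is read or updated per call, instead of one
    # parameter row per round trip.
    class BatchedParameterStore:
        def __init__(self, table):
            self.table = table                      # key -> np.ndarray shard

        def read_batch(self, keys):
            # On a GPU-specialized server these gathers would run in
            # parallel on GPU cores; here we just stack the requested rows.
            return np.stack([self.table[k] for k in keys])

        def update_batch(self, keys, grads, lr=0.01):
            # Apply one batch of updates in a single call.
            for k, g in zip(keys, grads):
                self.table[k] -= lr * g

The operation-sequence hint described above would let such a store build the access index for a given batch of keys once and reuse it every iteration.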
Managing GPU memory. The parameter server uses pre-allocated GPU buffers to pass data to an application, rather than copying the parameter data into application-provided buffers. When an application wants to update parameter values, it also does so in GPU-allocated buffers, and it can store local non-parameter data (e.g., intermediate neural network states) there as well. The parameter server client library can then manage all the GPU memory on a machine, provided the application keeps all its local data in the parameter server and uses the parameter-server-managed buffers. When the GPU memory of a machine is not big enough to host all the data, the parameter server will store parts of the data in CPU memory. Fortunately, iterative applications like neural network training typically apply the same parameter data accesses every iteration, so the parameter server can easily predict the read operations and perform them in advance in the background. The parameter data is sharded across all instances and cached locally, with periodic refresh. When an application issues a read or update operation, it provides a list of keys, and all parameters are fetched or updated in parallel on the GPU cores. The access index built from the list of keys can be built just once for each batch of keys, using the operation sequence gathering process described earlier, and then re-used for each instance of the given batch access. While keeping the access buffer pool twice the peak size for double buffering, our policy will first try to pin the local data that is used at the peak in GPU memory, in order to reduce the peak size and thus the size of the buffer pool. Then, it will try to use the available capacity to pin

more local data and parameter cache data in GPU memory. Finally, it will add any remaining available GPU memory to the access buffer pool.

Figure 8 below shows the throughput scalability of GeePS on an image classification task, compared to a CPU-based distributed parameter server system and a single-node GPU Caffe system. To evaluate convergence speed, we compare the amount of time required to reach a given level of accuracy, which is a combination of image training throughput and model convergence per trained image. Using unmodified Caffe, a video classification RNN can support a maximum video length of 48 frames. To use longer videos, Ng et al. had to work around this limit. By contrast, with the memory management support of GeePS, videos with many more frames can be trained using solely data parallelism.

Many recent ML model training systems, including for neural network training, use a parameter server architecture to share state among data-parallel workers executing on CPUs. Consistent reports indicate that, in such an architecture, some degree of asynchrony (bounded or not) in parameter update exchanges among workers leads to significantly faster convergence than when using BSP. We observe the opposite with data-parallel workers executing on GPUs: while synchronization delays can be largely eliminated, as expected, convergence is much slower with the more asynchronous models because of reduced training quality. We believe there are two reasons causing this outcome. First, with our specializations, there is little to no communication delay for DNN applications, so adding data staleness does not increase throughput much.
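As a reference point, here is a minimal sketch of the BSP/SSP clock rule this chapter keeps referring to. The names are illustrative, not from the GeePS implementation.

    # A worker may start clock c+1 only if every worker has finished
    # clock c - staleness. BSP is the special case staleness == 0.
    class SSPClock:
        def __init__(self, num_workers, staleness):
            self.clocks = [0] * num_workers   # per-worker clock
            self.staleness = staleness        # bound on fastest minus slowest

        def can_advance(self, worker_id):
            return min(self.clocks) >= self.clocks[worker_id] - self.staleness

        def tick(self, worker_id):
            # In a real system the worker would block instead of asserting.
            assert self.can_advance(worker_id), "must wait for stragglers"
            self.clocks[worker_id] += 1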

Chapter 2 : Train TensorFlow Models with Azure Machine Learning

Scaling Distributed Machine Learning with the Parameter Server. Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, Bor-Yiing Su.

Deep learning has achieved great success in many areas recently, attaining state-of-the-art performance in applications ranging from image classification and speech recognition to time series forecasting. The key success factors of deep learning are big volumes of data, flexible models, and ever-growing computing power. With the increase in the number of parameters and training data, it has been observed that the performance of deep learning can be improved dramatically. However, when models and training data get big, they may not fit in the memory of a single CPU or GPU machine, and model training becomes slow. One approach to this challenge is to use large-scale clusters of machines to distribute the training of deep neural networks (DNNs). This technique enables a seamless integration of scalable data processing with deep learning. Other approaches, like using multiple GPUs on a single machine, work well with modest data but can be inefficient for big data. I demonstrate this by walking you through two examples: one for image classification and another for time series forecasting. The source code is available on GitHub. Essentially, we are solving a stochastic optimization problem in a distributed fashion. To help understand this concept, I will introduce the basic ideas and a few popular algorithms of distributed deep learning.

Data Parallelism

There are two main types of distributed deep learning frameworks, namely model parallelism and data parallelism. Model parallelism distributes the computations of a single model across multiple machines. As a simple example, it is possible to split the computation of the output of a perceptron (a single neuron) by having each input node compute the product of the input and the associated weight. In contrast, data parallelism tries to parallelize gradient descent over multiple machines by splitting the training data into several partitions, or data shards. Since this blog uses the data parallelism framework, I will briefly describe its typical steps. As illustrated in Figure 1, the training data is split into mini-batches over the workers. When training starts, every worker gets a copy of the model parameters and computes gradients on its local mini-batch. The local gradient information is then sent back to the parameter server, which averages all the accumulated gradients and applies the combined gradient to update the model parameters. After this, the workers download the new model parameters and the process repeats. (Figure 1: typical architecture of data parallelism for deep learning [1].)

Representative Algorithms

In this blog, we exploit distributed optimization algorithms based on data parallelism. Many such algorithms have been proposed to speed up the training of DNNs; DOWNPOUR is a representative example. DOWNPOUR is an asynchronous optimization algorithm that allows the parameter server to update the model parameters whenever it receives information from a worker. Although it is one of the most commonly used algorithms, it has large communication overhead and is not very stable with a large number of workers.

Environment Setup

To illustrate the above concept, we will use the Distributed Keras Python package (referred to as dist-keras) in the examples.
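Before turning to the cluster setup, here is a toy, in-memory version of the synchronous data-parallel step described above. The least-squares gradient is a stand-in for backprop on a DNN, and all names are illustrative.

    import numpy as np

    def worker_gradient(params, x_batch, y_batch):
        # Least-squares gradient as a stand-in for backpropagation.
        return 2 * x_batch.T @ (x_batch @ params - y_batch) / len(x_batch)

    def parameter_server_step(params, shards, lr=0.1):
        # Each (x, y) shard plays the role of one worker's mini-batch; in a
        # real cluster these gradients are computed on separate machines.
        grads = [worker_gradient(params, x, y) for x, y in shards]
        avg_grad = np.mean(grads, axis=0)   # the server averages the gradients
        return params - lr * avg_grad       # workers then download new params

DOWNPOUR differs from this synchronous sketch in that the server applies each worker's gradient as soon as it arrives, instead of waiting to average all of them.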
Note that you can also install the relevant packages on Azure Databricks and perform distributed deep learning, as in this example. Before we create a new cluster, as shown in Figure 2, we can specify additional packages that we want available in Jupyter. Note that we need to install the packages on all the nodes; hence, we should set the runon option to all-nodes in the cluster configuration. Another option for installing packages on AZTK is to use a custom Docker image with the packages preinstalled. You can easily create a Spark cluster in HDInsight and run Jupyter notebooks on it by following this tutorial. In particular, we just need to run the bash script shown in Figure 3, following this guidance. (Figure 3: bash script for installing packages on an HDInsight Spark cluster.)

Examples

Image Classification

Image classification is one of the first areas to be dominated by deep learning. In this example script, I train a convolutional network for handwritten digit classification using distributed deep learning on an AZTK Spark cluster. We can scale the number of processes and the number of executors up or down as needed; each executor and process corresponds to a worker and a core of that worker, respectively. Then, we load the training data and testing data from the

MNIST dataset as Spark DataFrames and perform a series of transformations to convert the data into the format that Distributed Keras requires. After the data is prepared, we can define the DNN model using Keras and perform distributed training on Spark with any algorithm provided by dist-keras, such as ADAG (Figure 5: using the ADAG algorithm to train a convolutional network).

Time Series Forecasting

Time series forecasting is a ubiquitous problem in many domains, including energy, retail, finance, healthcare, and others. The models used here are recurrent neural networks (RNNs), which have proven powerful in time series forecasting. The data has three informative columns, indicating the time, hourly energy consumption, and temperature. (Figure: hourly energy consumption of New York City.) The tricky parts are how to normalize the data and how to create features for time series forecasting using PySpark. MinMaxTransformer in Distributed Keras is used to normalize the range of each data column. Unlike MinMaxScaler in scikit-learn, we need to specify the range of values both before and after the transformation (Figure 7: normalizing the range of a data column), but this allows us to use MinMaxTransformer for the inverse transformation as well. Since the transformers defined in the Distributed Keras package operate on Spark DataFrames, I use window functions in PySpark to create the input features and output targets (a sketch follows at the end of this section). Then, I assemble all the features into a vector and reshape the vectors into the format that Keras requires; similarly, I do the assembling and reshaping for the target variables. After the data is prepared, the networks are trained (Figure: creating and fitting an LSTM model). The input sequence length and output sequence length are 24 and 1, respectively: the model maps the energy consumption in the last 24 hours to the consumption in the next hour. I use the last 5 days of data as testing data and all the previous data as training data. After the model is trained, we can apply it to predict the energy consumption and convert the predictions back to the original range of the energy consumption. (Figure: training time and MAPE.)
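A hedged sketch of that window-function step, using standard PySpark APIs. The DataFrame and column names here are hypothetical, not taken from the original script.

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # df is assumed to have "time" and "load" (hourly consumption) columns.
    w = Window.orderBy("time")

    # The previous 24 hourly loads become the input features; the current
    # row's load is the target, matching the 24-in/1-out setup above.
    features = df
    for i in range(1, 25):
        features = features.withColumn("load_lag_%d" % i,
                                       F.lag("load", i).over(w))
    features = features.dropna()  # the first 24 rows lack a full history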

Chapter 3 : DMTK - Microsoft Research

USENIX Association, 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14): Scaling Distributed Machine Learning with the Parameter Server.

Object detection powers some of the most widely adopted computer vision applications, from people counting in crowd control to pedestrian detection in self-driving cars. Training an object detection model can take weeks on a single GPU, a prohibitively long time for experimenting with hyperparameters and model architectures. This blog will show how you can train an object detection model by distributing deep learning training across multiple GPUs, whether on a single machine or several machines. You will learn how to perform distributed deep learning on Azure using Horovod running on Azure Batch AI.

Object Detection

Object detection combines the task of classification with localization, outputting both a category and a set of coordinates representing the bounding box for each object it detects in the image, as illustrated in Figure 1 below. (Figure 1: different computer vision tasks.) Over the past few years, many exciting deep learning approaches for object detection have emerged. Models such as Faster R-CNN use a two-stage procedure, first proposing regions that contain some object, then classifying the object in each region and adjusting its bounding box. These are a few examples of the array of model architectures available for object detection. Instead of taking the raw image as input, these object detection models work off the feature map produced by a backbone network, which is often the convolutional layers of a classification network such as ResNet. Several promising approaches have been introduced recently, and this paper provides a good overview of the trade-offs between different object detection architectures. (Figure: tradeoffs between accuracy and inference time across object detection approaches. Marker shapes indicate the meta-architecture and colors indicate the feature extractor; each meta-architecture and feature extractor pair corresponds to multiple points on the plot due to changing input sizes, strides, etc.)

While in many situations a powerful GPU can carry out model training in a reasonable amount of time, for elaborate models such as object detectors it can take days or weeks to complete. To make hyperparameter search and rapid iterative experimentation practical, we look to speed up training by distributing the computation to multiple GPUs in a computer, or even across a cluster of computers. Below we briefly discuss the several ways distributed training can be accomplished, and introduce Horovod, a distributed deep learning framework that can be used with TensorFlow, Keras, and PyTorch.

Model Parallelism and Data Parallelism

The gain in speed from distributing training to more than one GPU comes from parallelizing compute operations across multiple processes, each running on a separate GPU, for instance. There are two approaches for doing this. In the model parallelism approach, the parameters of the model are distributed across multiple devices and one batch of data is processed in each iteration. This is helpful for very large models that cannot fit on a single device. Parallelizing a model requires it to be implemented with the compute resources in mind, so there is no easy way to rely on a framework to do this for new models or device settings.
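To make the model-parallel idea concrete, here is a toy sketch that splits one linear layer across two "devices". It is pure NumPy and purely illustrative; no real devices are involved.

    import numpy as np

    W = np.random.randn(512, 256)                # full layer weights
    W_dev0, W_dev1 = np.hsplit(W, 2)             # each half lives on one device

    def forward(x):
        out0 = x @ W_dev0                        # computed on device 0
        out1 = x @ W_dev1                        # computed on device 1
        return np.concatenate([out0, out1], axis=1)  # gather the slices

    x = np.random.randn(32, 512)                 # one mini-batch
    assert np.allclose(forward(x), x @ W)        # matches the unsplit layer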
(Figure: model parallelism.) When using data parallelism, the same copy of the training script (a replica) is run on all devices, but each device reads in a different chunk of data at each iteration. The gradients computed by all copies are averaged by some mechanism, and the model gets updated. In distributed TensorFlow, parameter servers are used to average the gradients. Each process running in a distributed TensorFlow setup plays either a worker or a parameter server role. Workers process training data, compute the gradients of the model parameters, and send them to one or more parameter servers to be averaged, later obtaining a copy of the updated model for the next iteration. To use this, training code designed for running on a single GPU needs to be carefully adapted, a rather error-prone process. In addition, this distribution often suffers from various scaling inefficiencies, and the GPUs are not fully utilized. (Figure: the parameter server approach. Source: Horovod presentation.) A different approach to gradient averaging, called ring-allreduce, was popularized by Baidu in early 2017 and first implemented as a fork of TensorFlow. In this approach, workers are connected in a ring, each communicating with two neighboring workers,

and can average gradients and disperse them without a central parameter server. Below is an illustration of how ring-allreduce works. (Figure: the ring-allreduce approach.)

Why Horovod?

A key consideration in distributed deep learning is how to efficiently use the resources that are available (CPUs, GPUs, and more). Horovod is an open source project, initially developed at Uber, that implements the ring-allreduce algorithm, first designed for TensorFlow. It provides several advantages compared to the default distributed TensorFlow implementation. The Horovod API enables you to convert a training script designed to run on one GPU into a distributed-training-ready script with a few lines of code; we will demonstrate how to do this in the next section. Horovod improves training speed by more fully utilizing the GPUs, and it works with different deep learning frameworks: TensorFlow, Keras, and PyTorch. Models written using these frameworks can be easily trained on Azure Batch AI, which has native support for Horovod. In addition, Batch AI enables you to train models for different use cases at scale, and the trained models can be deployed to the cloud or to edge devices.

Model Training and Deployment

In this section, you will learn how you can build, train, and deploy an object detection model using Azure. We will use the following resources: Azure Blob Storage to store the dataset for easy access during training and evaluation; Horovod for distributed training; and Azure Machine Learning, a suite of services for experiment run history, version control, and model management and deployment. The diagram below illustrates the architecture of our solution; each of these steps is discussed in the following sections. The COCO training and validation sets contain over 100,000 images representing scenes in everyday life, annotated with bounding boxes labeling 80 classes of common objects such as bicycles and cars, humans and pets, foods, and furniture. We downloaded the training and validation images and annotations from the COCO dataset download page, unzipped the files, and used the AzCopy utility to transfer them into a blob container on an Azure Storage Account for fast and easy access during training. During deployment, the Batch AI service coordinates setup tasks that need to be performed on each VM in the cluster: installing less common Python and Anaconda packages, mounting the blob container with our training data, and ensuring each VM can access a file share on our Azure Storage Account where scripts, logs, and output models will be stored. One of the advantages of Horovod is its simplicity. Notably, you need to: (1) import the Horovod package; (2) configure the number of GPUs visible to Horovod; (3) use a distributed optimizer to wrap a regular optimizer, such as Adam; and (4) add the necessary callbacks to avoid conflicts between workers when saving the model. First, we added Horovod import and initialization statements at the top of the script:
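A condensed sketch of those four changes, following the standard horovod.keras API. The learning rate, checkpoint path, and model code are placeholders, not the original training script.

    import keras
    import tensorflow as tf
    import horovod.keras as hvd

    hvd.init()  # (1) initialize Horovod

    # (2) pin one GPU per process
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    keras.backend.set_session(tf.Session(config=config))

    # (3) wrap a regular optimizer; scaling the learning rate by the number
    # of workers is the usual convention
    opt = hvd.DistributedOptimizer(keras.optimizers.Adam(0.001 * hvd.size()))

    # (4) broadcast initial weights from rank 0, and let only rank 0 save
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    if hvd.rank() == 0:
        callbacks.append(keras.callbacks.ModelCheckpoint("checkpoint-{epoch}.h5"))

    # model.compile(optimizer=opt, loss=..., metrics=...)
    # model.fit(..., callbacks=callbacks)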

Chapter 4 : Distributed Machine Learning Toolkit

Many machine learning problems rely on large amounts of data for training and then for inference. Big internet-scale companies train on terabytes or petabytes of data and create models from it. This page describes the key concepts you need in order to make the most of your model training.

How training works

Your training application, implemented in Python and TensorFlow, is the core of the training process. Cloud ML Engine runs your training job on computing resources in the cloud. You create a TensorFlow application that trains your model; Cloud ML Engine has almost no specific requirements of your application during the training process, so you build it as you would to run locally in your development environment. You get your training and verification data into a source that Cloud ML Engine can access. When your application is ready to run, you package it and transfer it to a Cloud Storage bucket that your project can access; this is automated when you use the gcloud command-line tool to run a training job. The Cloud ML Engine training service then sets up resources for your job: it allocates one or more virtual machines, called training instances, based on your job configuration. Each training instance is set up by applying the standard machine image for the version of Cloud ML Engine your job uses, loading your application package and installing it with pip, and installing any additional packages that you specify as dependencies. The training service runs your application, passing through any command-line arguments you specify when you create the training job. You can get information about your running job by requesting job details or streaming logs with the gcloud command-line tool, or by programmatically making status requests to the training service. When your training job succeeds or encounters an unrecoverable error, Cloud ML Engine halts all job processes and cleans up the resources.

If you run a distributed TensorFlow job with Cloud ML Engine, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify and performs the setup steps above on each. Your running job on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training. Exactly one replica is designated the master; this task manages the others and reports status for the job as a whole. In the distributed case, it is the status of the master replica that signals the overall job status (if you are running a single-process job, the sole replica is the master for the job). One or more replicas may be designated as workers; these replicas do their portion of the work as you designate in your job configuration. One or more replicas may be designated as parameter servers; these replicas coordinate shared model state between the workers. The training service runs until your job succeeds or encounters an unrecoverable error.

A typical machine learning application

The Cloud ML Engine training service is designed to have as little impact on your application as possible, meaning you can focus on your TensorFlow code to define the model you want instead of being confined to a rigid structure. Most machine learning applications: provide a way to get training data and evaluation data; process data instances in batches;
use evaluation data to test the accuracy of the model (how often it predicts the right value); and provide a way to export the trained model when the application finishes.

Packaging your application

Before you can run your training application with Cloud Machine Learning Engine, you must package your application and any additional dependencies you require, and upload the package to a Cloud Storage bucket that your Google Cloud Platform project can access. The gcloud command-line tool automates much of the process; specifically, you can use gcloud ml-engine jobs submit training to upload your application package and submit your training job. See the detailed instructions on packaging a training application.

Submitting your training job

Cloud Machine Learning Engine provides model training as an asynchronous batch service. You can submit a training job by running gcloud ml-engine jobs submit training from the command line or by sending a request to the projects.jobs.create API method. See the detailed instructions on starting a training job.

Job ID

You must give your training job a name that obeys these rules: It must be unique within your Google Cloud Platform project. It may only contain mixed-case letters, digits, and underscores. It must start with a letter. It must be no more

than 128 characters long. You can use whatever job naming convention you want. If you run a lot of jobs, you may need to find your job ID in large lists; a common convention is to append a timestamp to the model name, which makes it easy to sort lists of jobs by name, because all jobs for a model are then grouped together in ascending order.

Scale tiers

When running a training job on Cloud ML Engine, you must specify the number and types of machines you need. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers; alternatively, you can choose a custom tier and specify the machine types yourself. To specify a scale tier, add it to the TrainingInput object in your job configuration. See the detailed definitions of scale tiers and machine types.

Hyperparameter tuning

If you want to use hyperparameter tuning, you must include configuration details when you create your training job. See the conceptual guide to hyperparameter tuning and how to use hyperparameter tuning.

Regions and zones

GCP uses regions, subdivided into zones, to define the geographic location of physical computing resources. When you run a training job on Cloud ML Engine, you specify the region that you want it to run in. If you must run your job in a different region from your data bucket, your job may take longer.

Using job-dir as a common output directory

You can specify the output directory for your job by setting a job directory when you configure the job. When you submit the job, Cloud ML Engine validates the directory (so that you can fix any problems before the job runs) and passes the path to your application as a command-line argument named --job-dir. You need to account for the --job-dir argument in your application (a sketch follows at the end of this section). See the guide to starting a training job.

Runtime version

You should specify a supported Cloud ML Engine runtime version for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances; specify a version that gives you the functionality you need. If you run the training job locally as well as in the cloud, make sure the local and cloud jobs use the same runtime version.

Input data

The data that you use in your training job must obey the following rules to run on Cloud ML Engine: it must be in a format that you can read and feed to your TensorFlow code, and it must be in a location that your code can access. This typically means that it should be stored with one of the GCP storage or big data services.

Output data

It is common for applications to output data, including checkpoints during training and a saved model when training is complete; you can output other data as needed by your application. It is easiest to save your output files to a Cloud Storage bucket in the same GCP project as your training job. Training instances can be restarted, so you should ensure that your training job is resilient to these restarts by saving model checkpoints regularly and by configuring your job to restore the most recent checkpoint. You usually save model checkpoints in the Cloud Storage path that you specify with the --job-dir argument in the gcloud ml-engine jobs submit training command.

GPUs are designed to perform mathematically intensive operations at high speed. They can be more effective at running certain operations on tensor data than adding another machine with one or more CPU cores. You can specify GPU-enabled machines to run your job, and the service allocates them for you.
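A minimal sketch of the --job-dir handling and checkpointing described above, as a hypothetical trainer script in TensorFlow 1.x style. The variable and paths are placeholders, not part of any real job.

    import argparse
    import os
    import tensorflow as tf

    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", required=True,
                        help="Cloud Storage path for checkpoints and outputs")
    args, _ = parser.parse_known_args()  # tolerate other service-supplied flags

    w = tf.Variable(0.0, name="w")       # stand-in for the real model variables
    saver = tf.train.Saver()

    with tf.Session() as sess:
        latest = tf.train.latest_checkpoint(args.job_dir)
        if latest:                        # resume after a restart
            saver.restore(sess, latest)
        else:
            sess.run(tf.global_variables_initializer())
        # ... inside the training loop, periodically save a checkpoint:
        saver.save(sess, os.path.join(args.job_dir, "model.ckpt"), global_step=0)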
When you specify a machine type with GPU access for a task type, each instance assigned to that task type is configured identically. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.

Chapter 5 : What is the reason to use a parameter server in distributed TensorFlow learning? - Stack Overflow

We propose a parameter server framework for distributed machine learning problems. Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. Especially in recent years, practice has demonstrated the trend that more training data and bigger models tend to generate better accuracies in various applications. However, it remains a challenge for common machine learning researchers and practitioners to learn big models from huge amounts of data, because the task usually requires a large amount of computational resources. These innovations make machine learning tasks on big data highly scalable, efficient, and flexible. We will continue to add new algorithms to DMTK on a regular basis. The current version of DMTK includes the following components (more components will be added in future versions). Machine learning researchers and practitioners can also build their own distributed machine learning algorithms on top of our framework, with small modifications to their existing single-machine algorithms. We believe that in order to push the frontier of distributed machine learning, we need the collective effort of the entire community, and the organic combination of both machine learning innovations and system innovations. This belief strongly motivates us to open source the DMTK project.

LightLDA

LightLDA is a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is surprisingly agnostic of model size and which empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers. In the distributed implementation, we leverage the hybrid data structure, model scheduling, and automatic pipelining functions provided by the DMTK framework to make LightLDA capable of handling extremely big data and big models even on a modest computer cluster. In particular, on a cluster of as few as 8 machines, we can train a topic model with 1 million topics and a million-word vocabulary, for a total of 10 trillion parameters, on a document collection with billions of tokens, a scale not yet reported even with thousands of machines in the literature.

Distributed Multi-sense Word Embedding

Word embedding has become a very popular tool for computing semantic representations of words, which can serve as high-quality word features for natural language processing tasks. We provide distributed implementations of two word embedding algorithms: one is the standard word2vec algorithm, and the other is a multi-sense word embedding algorithm that learns multiple embedding vectors for polysemous words. By leveraging the model scheduling and automatic pipelining functions provided by the DMTK framework, we are able to train word embedding vectors for a million-word vocabulary, on a document collection with billions of tokens, on a cluster of 8 machines.

LightGBM

LightGBM has been shown to be several times faster than existing implementations of gradient boosting trees, due to its fully greedy tree-growth method and histogram-based memory and computation optimization. It also has a complete solution for distributed training, based on the DMTK framework. DMTK is a platform designed for distributed machine learning. Deep learning is not our focus, and the algorithms released in DMTK are mostly non-deep-learning algorithms.
If you want to use state-of-the-art deep learning tools, we highly recommend Microsoft CNTK. We have close collaborations with CNTK and provide support for its asynchronous parallel training functionality.

Chapter 6 : Distributed Deep Learning on AZTK and HDInsight Spark Clusters

Communication Efficient Distributed Machine Learning with the Parameter Server. Mu Li, David G. Andersen, Alexander Smola, and Kai Yu (Carnegie Mellon University, Baidu, Google).

Monday, July 31, 2017. A Comparison of Distributed Machine Learning Platforms. This paper surveys the design approaches used in distributed machine learning (ML) platforms and proposes future research directions. This is joint work with my students Kuo Zhang and Salem Alqahtani. These technologies have very promising applications in self-driving cars, digital health systems, CRM, advertising, the internet of things, etc. Due to the huge dataset and model sizes involved in training, ML platforms are often distributed, employing 10s or 100s of workers in parallel to train the models. It is estimated that an overwhelming majority of the tasks in datacenters will be machine learning tasks in the near future.

My background is in distributed systems, so we decided to study these ML platforms from a distributed systems perspective and analyze their communication and control bottlenecks. We also looked at fault tolerance and ease of programming. We categorize the distributed ML platforms under three basic design approaches: basic dataflow, parameter-server, and advanced dataflow. We discuss each approach in brief, using Apache Spark as an example of the basic dataflow approach, PMLS (Petuum) as an example of the parameter-server model, and TensorFlow and MXNet as examples of the advanced dataflow model. We provide a couple of evaluation results comparing their performance; see the paper for more evaluation results. Unfortunately, we were unable to evaluate at scale as a small team from academia. At the end of this post, I present concluding remarks and recommendations for future work on distributed ML platforms. Skip to the end if you already have some experience with these distributed ML platforms.

There are two kinds of operations: transformations and actions. The DAG is compiled into stages, and each stage is executed as a series of tasks that run in parallel (one task for each partition). Narrow dependencies are good for efficient execution, whereas wide dependencies introduce bottlenecks, since they disrupt pipelining and require communication-intensive shuffle operations. Distributed execution in Spark is performed by partitioning the DAG stages across machines. The figure shows the master-worker architecture clearly: the driver contains two scheduler components, the DAG scheduler and the task scheduler, and it tasks and coordinates the workers.

Spark was designed for general data processing, not specifically for machine learning. In the basic setup, Spark stores the model parameters in the driver node, and the workers communicate with the driver to update the parameters after each iteration. For large-scale deployments, the model parameters may not fit in the driver and would be maintained as an RDD. This introduces a lot of overhead, because a new RDD must be created in each iteration to hold the updated model parameters. This is where the basic dataflow model (the DAG) in Spark falls short: Spark does not support the iteration needed in ML well.

PMLS introduced the parameter-server (PS) abstraction for serving the iteration-intensive ML training process. The PS (shown in the green boxes in the figure) is maintained as a distributed in-memory key-value store; thus the PS scales well with respect to the number of nodes.
The workers request up-to-date model parameters from their local PS copy and carry out computation over the partition of the dataset assigned to them. The relaxed consistency model is still OK for ML training, due to the noise tolerance of the process. I had covered this in an April blog post.

Here is my review of the DistBelief paper. From what I can tell, the major complaint about DistBelief was that it required messing with low-level code for writing ML applications. Google wanted any of its employees to be able to write ML code without being well-versed in distributed execution (this is the same reason why Google wrote the MapReduce framework for big data processing), so TensorFlow is designed to enable that goal. TensorFlow adopts the dataflow paradigm, but an advanced version, in which the computation graph does not need to be a DAG but can include cycles and support mutable state. I think the Naiad design might have had some influence on the TensorFlow design. TensorFlow denotes computation with a directed graph of nodes and edges: the nodes represent computations, with mutable state, and the edges represent multidimensional data arrays (tensors) communicated between nodes. When you use the PS abstraction in TensorFlow, you use a

parameter-server and data parallelism. TensorFlow says you can do more complicated things, but that requires writing custom code and marching into uncharted territory.

Some evaluation results

For our evaluations we used Amazon EC2 m4 instances with EBS-backed storage. We used two common machine learning tasks for evaluation: logistic regression and deep neural networks. I am only providing a couple of graphs here; check our paper for more experiments. Our experiments had several limitations. (Figures: platform speeds for logistic regression and for DNNs.) Spark sees a greater performance loss going from single-layer logistic regression to a two-layer neural network, due to the additional iterative computation needed. We kept the parameters in the driver in Spark because they could fit; things would have been much worse if we had kept the parameters in an RDD and updated them after every iteration. (Figure: CPU utilization of the platforms.) The Spark application has significantly higher CPU utilization, which comes mainly from serialization overhead; this problem has been pointed out by earlier work.

It is safe to say the parameter-server approach won for training in distributed ML platforms. As far as bottlenecks are concerned, the network still remains a bottleneck for distributed ML applications. However, there can be some surprises and subtleties: in Spark, the CPU overhead became the bottleneck before the network limitations did, and the programming language used in Spark (i.e., Scala on the JVM) contributes to this overhead. Some tools addressing the problem for Spark data processing applications have been proposed recently, such as Ernest and CherryPick. There are many open questions for distributed systems support of the ML runtime, such as resource scheduling and runtime performance improvement. What are suitable [distributed] programming abstractions for ML applications? More research is also needed on verification and validation (testing DNNs with particularly problematic inputs) of distributed ML applications.

Chapter 7 : Multiverso, a Parameter Server Platform for Distributed Machine Learning - Microsoft Research

Scaling Distributed Machine Learning with the Parameter Server. Presented by Liang Gong, Electrical Engineering & Computer Science, University of California, Berkeley.

Chapter 8 : Training Overview, Cloud ML Engine for TensorFlow, Google Cloud

The Parameter Server (PS) is a popular system architecture for large-scale machine learning systems; by self-tuning we mean adjusting the expert-suggested configuration while a long-running ML job is iteratively training.

Chapter 9 : Train TensorFlow Models with Azure Machine Learning - Microsoft Docs

Using a parameter server can give you better network utilization, and it lets you scale your models to more machines. As a concrete example, suppose you have M parameters, it takes 1 second to compute a gradient on each worker, and there are 10 workers; each worker must then send M gradient values to, and receive M parameter values from, the aggregator every second (a back-of-the-envelope sketch follows).
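Continuing that example with assumed numbers: the parameter count and link bandwidth below are hypothetical, chosen only to show the scaling argument, not taken from the original answer.

    # Back-of-the-envelope continuation of the example above; M and the
    # link bandwidth are assumed values, for illustration only.
    M = 100_000_000            # parameters (assumed)
    workers = 10
    bytes_per_value = 4        # float32
    link_gbps = 10             # per-node network bandwidth (assumed)

    # A single aggregation node must exchange 2*M values (gradients in,
    # parameters out) with every worker each second.
    central_bits = 2 * M * bytes_per_value * workers * 8
    print("single node: %.1f s of network time per 1 s of compute"
          % (central_bits / (link_gbps * 1e9)))

    # Sharding the parameters over S parameter servers splits that traffic,
    # so communication capacity scales with the number of servers.
    S = 10
    print("%d shards: %.1f s per server"
          % (S, central_bits / S / (link_gbps * 1e9)))

With these numbers, the single aggregation node spends several seconds of network time for every second of gradient computation, while sharding over ten servers brings the per-node communication back under the compute time, which is the reason to use a parameter server.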


CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

Data Centers and Cloud Computing. Data Centers

Data Centers and Cloud Computing. Data Centers Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

Accelerating Spark Workloads using GPUs

Accelerating Spark Workloads using GPUs Accelerating Spark Workloads using GPUs Rajesh Bordawekar, Minsik Cho, Wei Tan, Benjamin Herta, Vladimir Zolotov, Alexei Lvov, Liana Fong, and David Kung IBM T. J. Watson Research Center 1 Outline Spark

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Optimizing Apache Spark with Memory1. July Page 1 of 14

Optimizing Apache Spark with Memory1. July Page 1 of 14 Optimizing Apache Spark with Memory1 July 2016 Page 1 of 14 Abstract The prevalence of Big Data is driving increasing demand for real -time analysis and insight. Big data processing platforms, like Apache

More information

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems Khoa Huynh Senior Technical Staff Member (STSM), IBM Jonathan Samn Software Engineer, IBM Evolving from compute systems to

More information

Machine Learning in WAN Research

Machine Learning in WAN Research Machine Learning in WAN Research Mariam Kiran mkiran@es.net Energy Sciences Network (ESnet) Lawrence Berkeley National Lab Oct 2017 Presented at Internet2 TechEx 2017 Outline ML in general ML in network

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Parallel Implementation of Deep Learning Using MPI

Parallel Implementation of Deep Learning Using MPI Parallel Implementation of Deep Learning Using MPI CSE633 Parallel Algorithms (Spring 2014) Instructor: Prof. Russ Miller Team #13: Tianle Ma Email: tianlema@buffalo.edu May 7, 2014 Content Introduction

More information

Keras: Handwritten Digit Recognition using MNIST Dataset

Keras: Handwritten Digit Recognition using MNIST Dataset Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA February 9, 2017 1 / 24 OUTLINE 1 Introduction Keras: Deep Learning library for Theano and TensorFlow 2 Installing Keras Installation

More information

CERN openlab & IBM Research Workshop Trip Report

CERN openlab & IBM Research Workshop Trip Report CERN openlab & IBM Research Workshop Trip Report Jakob Blomer, Javier Cervantes, Pere Mato, Radu Popescu 2018-12-03 Workshop Organization 1 full day at IBM Research Zürich ~25 participants from CERN ~10

More information

Matrix Computations and " Neural Networks in Spark

Matrix Computations and  Neural Networks in Spark Matrix Computations and " Neural Networks in Spark Reza Zadeh Paper: http://arxiv.org/abs/1509.02256 Joint work with many folks on paper. @Reza_Zadeh http://reza-zadeh.com Training Neural Networks Datasets

More information

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation Accelerating GPU inferencing with DirectML and DirectX 12 Shrinath Shanbhag Senior Software Engineer Microsoft Corporation Machine Learning Machine learning has become immensely popular over the last decade

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Democratizing Machine Learning on Kubernetes

Democratizing Machine Learning on Kubernetes Democratizing Machine Learning on Kubernetes Joy Qiao, Senior Solution Architect - AI and Research Group, Microsoft Lachlan Evenson - Principal Program Manager AKS/ACS, Microsoft Who are we? The Data Scientist

More information

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING Volume 119 No. 16 2018, 937-948 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING K.Anusha

More information

Parallel learning of content recommendations using map- reduce

Parallel learning of content recommendations using map- reduce Parallel learning of content recommendations using map- reduce Michael Percy Stanford University Abstract In this paper, machine learning within the map- reduce paradigm for ranking

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu Machine Learning and Big

More information

WHITEPAPER. Pipelining Machine Learning Models Together

WHITEPAPER. Pipelining Machine Learning Models Together WHITEPAPER Pipelining Machine Learning Models Together Table of Contents Introduction 2 Performance and Organizational Benefits of Pipelining 4 Practical Use Case: Twitter Sentiment Analysis 5 Practical

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017 Memory Bandwidth and Low Precision Computation CS6787 Lecture 9 Fall 2017 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY Tutorial on Keras CAP 6412 - ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY Deep learning packages TensorFlow Google PyTorch Facebook AI research Keras Francois Chollet (now at Google) Chainer Company

More information

Onto Petaflops with Kubernetes

Onto Petaflops with Kubernetes Onto Petaflops with Kubernetes Vishnu Kannan Google Inc. vishh@google.com Key Takeaways Kubernetes can manage hardware accelerators at Scale Kubernetes provides a playground for ML ML journey with Kubernetes

More information

MoonRiver: Deep Neural Network in C++

MoonRiver: Deep Neural Network in C++ MoonRiver: Deep Neural Network in C++ Chung-Yi Weng Computer Science & Engineering University of Washington chungyi@cs.washington.edu Abstract Artificial intelligence resurges with its dramatic improvement

More information

Real-time Object Detection CS 229 Course Project

Real-time Object Detection CS 229 Course Project Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

15.1 Optimization, scaling, and gradient descent in Spark

15.1 Optimization, scaling, and gradient descent in Spark CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 16, 5/24/2017. Scribed by Andreas Santucci. Overview

More information

Distributed Machine Learning: An Intro. Chen Huang

Distributed Machine Learning: An Intro. Chen Huang : An Intro. Chen Huang Feature Engineering Group, Data Mining Lab, Big Data Research Center, UESTC Contents Background Some Examples Model Parallelism & Data Parallelism Parallelization Mechanisms Synchronous

More information

WHITE PAPER PernixData FVP

WHITE PAPER PernixData FVP WHITE PAPER PernixData FVP Technical White Paper 1 EXECUTIVE SUMMARY The last decade has seen virtualization become a mainstay in the enterprise data center. Enterprises are now looking to virtualize their

More information

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center Machine Learning With Python Bin Chen Nov. 7, 2017 Research Computing Center Outline Introduction to Machine Learning (ML) Introduction to Neural Network (NN) Introduction to Deep Learning NN Introduction

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Data Analytics on RAMCloud

Data Analytics on RAMCloud Data Analytics on RAMCloud Jonathan Ellithorpe jdellit@stanford.edu Abstract MapReduce [1] has already become the canonical method for doing large scale data processing. However, for many algorithms including

More information

Stream Processing on IoT Devices using Calvin Framework

Stream Processing on IoT Devices using Calvin Framework Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised

More information

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm Instructions This is an individual assignment. Individual means each student must hand in their

More information

Apparel Classifier and Recommender using Deep Learning

Apparel Classifier and Recommender using Deep Learning Apparel Classifier and Recommender using Deep Learning Live Demo at: http://saurabhg.me/projects/tag-that-apparel Saurabh Gupta sag043@ucsd.edu Siddhartha Agarwal siagarwa@ucsd.edu Apoorve Dave a1dave@ucsd.edu

More information

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,

More information

Leveraging AI on the Cloud to transform your business. Florida Business Analytics Forum 2018 at University of South Florida

Leveraging AI on the Cloud to transform your business. Florida Business Analytics Forum 2018 at University of South Florida Leveraging AI on the Cloud to transform your business Florida Business Analytics Forum 2018 at University of South Florida 1 My (unusual) path to Google Neural networks at NOAA 2 DNNs solved image analysis

More information

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60

More information