Deep Learning Frameworks with Spark and GPUs


Abstract

Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. In parallel, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data-savvy companies mature, they'll need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage. We'll discuss and show in action TensorFlowOnSpark, CaffeOnSpark, DeepLearning4J, IBM's SystemML, and Intel's BigDL, as well as distributed versions of various deep learning frameworks, namely TensorFlow, Caffe, and Torch.

The Deep Learning Landscape

Frameworks - TensorFlow, MXNet, Caffe, Torch, Theano, etc.
Boutique Frameworks - TensorFlowOnSpark, CaffeOnSpark
Data Backends - File system, Amazon EFS, Spark, Hadoop, etc.
Cloud Ecosystems - AWS, GCP, IBM Cloud, etc.

Why Spark

Ecosystem - many data sources, environments, and applications
Real-time data
In-memory RDDs (resilient distributed datasets)
HDFS integration
Data scientist workflow - DataFrames, SQL, APIs
Pipelining with job chains or Spark Streaming

Approaches to Deep Learning + Spark

Yahoo: CaffeOnSpark, TensorFlowOnSpark

Designed to run on existing Spark and Hadoop clusters, and to use existing Spark libraries like SparkSQL or Spark's MLlib machine learning libraries. It should be noted that the Caffe and TensorFlow versions used by these frameworks lag the release versions by about 4-6 weeks. Source: Yahoo

Skymind.ai: DeepLearning4J

Deeplearning4j is an open-source, distributed deep-learning library written for Java and Scala. DL4J can import neural net models from most major frameworks via Keras, including TensorFlow, Caffe, Torch, and Theano. Keras is employed as Deeplearning4j's Python API. Source: deeplearning4j.org

IBM: SystemML

Apache SystemML runs on top of Apache Spark, where it automatically scales your data, line by line, determining whether your code should be run on the driver or on an Apache Spark cluster.
- Algorithm customizability via R-like and Python-like languages
- Multiple execution modes, including Spark MLContext, Spark Batch, Hadoop Batch, Standalone, and JMLC (see the example below)
- Automatic optimization based on data and cluster characteristics to ensure both efficiency and scalability
- Limited set of algorithms supported so far
Source: Niketan Panesar
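As a rough illustration of the Spark MLContext execution mode, a minimal PySpark snippet might look like the following; it assumes the systemml Python package and an existing SparkSession named spark, and the DML script itself is purely illustrative:

```python
# Run a tiny DML script through SystemML's MLContext on Spark.
from systemml import MLContext, dml

ml = MLContext(spark)

# Illustrative DML: build a random matrix and print the sum of its entries.
script = dml("""
    X = rand(rows=1000, cols=10)
    s = sum(X)
    print(s)
""")

ml.execute(script)
```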

Intel: BigDL

Modeled after Torch; supports Scala and Python programs. Scales out via Spark. Leverages Intel MKL (Math Kernel Library). Source: MSDN

Google: TensorFlow

Flexible, powerful deep learning framework that supports CPU, GPU, multi-GPU, and multi-server GPU with Distributed TensorFlow. Keras support. Strong ecosystem (we'll talk more about this). Source: Google

Why We Chose TensorFlow for Our Study

TensorFlow's Web Popularity

We decided to focus on TensorFlow as it represents the majority of deep learning framework usage in the market right now. Source: http://redmonk.com/fryan/2016/06/06/a-look-at-popular-machine-learning-frameworks/

TensorFlow's Academic Popularity

It is also worth noting that TensorFlow represents a large portion of what the research community is currently interested in. Source: https://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106

A Quick Case Study

Model and Benchmark Information

Our model uses the following:
- Dataset: CIFAR10
- Architecture: pictured to the right (described below)
- Optimizer: Adam

For benchmarking, we will consider:
- Images per second
- Time to 85% accuracy (averaged across 10 runs)
- All models run on p2 AWS instances

Model Configurations:
- Single server - CPU
- Single server - GPU
- Single server - multi-GPU
- Multi-server distributed TensorFlow
- Multi-server TensorFlow on Spark

Architecture: Input -> 2 Convolution Module (64 depth) -> 2 Convolution Module (128 depth) -> 4 Convolution Module (256 depth) -> Fully Connected Layer (1024 depth) -> Fully Connected Layer (1024 depth) -> Output Layer (10 depth)

2 Convolution Module: 3x3 Convolution, 3x3 Convolution, 2x2 (stride 2) Max Pooling
4 Convolution Module: 3x3 Convolution, 3x3 Convolution, 3x3 Convolution, 3x3 Convolution, 2x2 (stride 2) Max Pooling
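As a rough illustration only, the architecture above could be sketched in TensorFlow 1.x as follows; activation functions, initializers, and other hyperparameters are assumptions rather than details taken from the slides:

```python
# Hedged sketch of the benchmark network (TF 1.x layers API).
import tensorflow as tf

def conv_module(x, depth, num_convs):
    """num_convs 3x3 convolutions followed by 2x2 (stride 2) max pooling."""
    for _ in range(num_convs):
        x = tf.layers.conv2d(x, depth, 3, padding='same', activation=tf.nn.relu)
    return tf.layers.max_pooling2d(x, pool_size=2, strides=2)

def build_model(images):
    """images: [batch, 32, 32, 3] CIFAR10 inputs; returns 10-class logits."""
    x = conv_module(images, 64, 2)     # 2 Convolution Module (64 depth)
    x = conv_module(x, 128, 2)         # 2 Convolution Module (128 depth)
    x = conv_module(x, 256, 4)         # 4 Convolution Module (256 depth)
    x = tf.layers.flatten(x)
    x = tf.layers.dense(x, 1024, activation=tf.nn.relu)   # FC (1024 depth)
    x = tf.layers.dense(x, 1024, activation=tf.nn.relu)   # FC (1024 depth)
    return tf.layers.dense(x, 10)      # Output Layer (10 depth)
```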

TensorFlow - Single Server CPU and GPU

This is really well documented and is the basis for why most of the frameworks were created. Using GPUs for deep learning creates high returns quickly. Managing dependencies for GPU-enabled deep learning frameworks can be tedious (CUDA drivers, CUDA versions, cuDNN versions, framework versions). Bitfusion can help alleviate a lot of these issues with our AMIs and Docker containers.

Performance:
- CPU: Images per second ~ 40, Time to 85% accuracy ~
- GPU: Images per second ~ 1420, Time to 85% accuracy ~

Note: CPU can be sped up on more CPU-focused machines. When going from CPU to GPU, it can be hugely beneficial to explicitly put data tasks on the CPU and number crunching (gradient calculations) on GPUs with tf.device(), as sketched below.
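A minimal sketch of that kind of placement, assuming a single machine with one GPU and reusing the hypothetical build_model() from the architecture sketch above:

```python
# Pin data/input ops to the CPU and number crunching to the GPU with tf.device().
import tensorflow as tf

with tf.device('/cpu:0'):
    # Input pipeline / data preparation ops stay on the CPU.
    images = tf.placeholder(tf.float32, [None, 32, 32, 3], name='images')
    labels = tf.placeholder(tf.int64, [None], name='labels')

with tf.device('/gpu:0'):
    # Forward pass, loss, and gradient computation go to the GPU.
    logits = build_model(images)   # build_model() from the earlier sketch
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```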

TensorFlow - Single Server Multi-GPU

Implementation Details: For this example we used 3 GPUs on a single machine (p2.8xlarge).

Performance:
- Images per second ~ 3200
- Time to 85% accuracy ~

Code Changes:
- Need to write code to define device placement with tf.device()
- Need to write code that calculates gradients on each GPU and then averages the gradients for the update (see the sketch below)
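A rough sketch of that pattern, assuming 3 GPUs; the per-GPU tower_loss() helper is hypothetical, and real code also needs variable sharing across the towers (e.g. a shared variable scope):

```python
# One "tower" per GPU computes gradients; the averaged gradient is applied once.
import tensorflow as tf

NUM_GPUS = 3
optimizer = tf.train.AdamOptimizer(1e-3)
tower_grads = []

for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):
        loss_i = tower_loss(i)   # hypothetical helper: loss on GPU i's batch shard
        tower_grads.append(optimizer.compute_gradients(loss_i))

# Average the per-GPU gradients variable by variable, then apply one update.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars]
    var = grads_and_vars[0][1]
    avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0), var))

train_op = optimizer.apply_gradients(avg_grads)
```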

Distributed TensorFlow

Implementation Details: For this example we used 1 parameter server and 3 worker servers. Either a cluster manager needs to be set up or jobs need to be kicked off individually on the workers and parameter servers. We used EFS, but other shared file systems or native file systems could be used.

Performance:
- Images per second ~ 2250
- Time to 85% accuracy ~

Code Changes:
- Need to write code to define a parameter server and worker server
- Need to use a special monitored session: tf.train.MonitoredTrainingSession() (see the sketch below)
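A hedged sketch of that setup (TF 1.x between-graph replication) with 1 parameter server and 3 workers; hostnames, ports, and the input pipeline are placeholders, and build_model() is the hypothetical function from the earlier sketch:

```python
# Each process runs this script with its own job_name / task_index.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps':     ['ps0:2222'],                                     # 1 parameter server
    'worker': ['worker0:2222', 'worker1:2222', 'worker2:2222']  # 3 workers
})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # Variables are placed on the parameter server, compute on this worker.
    images = tf.placeholder(tf.float32, [None, 32, 32, 3])
    labels = tf.placeholder(tf.int64, [None])
    logits = build_model(images)          # from the earlier sketch
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, global_step=global_step)

# MonitoredTrainingSession handles chief initialization and recovery;
# the actual feeding / input pipeline is omitted here.
with tf.train.MonitoredTrainingSession(master=server.target, is_chief=True) as sess:
    pass  # training loop goes here
```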

TensorFlow on Spark

Implementation Details: For this example we used 1 parameter server and 3 worker servers. Requires data to be placed in HDFS.

Performance:
- Images per second ~ 2450
- Time to 85% accuracy ~

Code Changes:
- TensorFlow on Spark requires renaming a few TF calls from a vanilla distributed TF implementation (tf.train.Server() becomes TFNode.start_cluster_server(), etc.); see the sketch below
- Need to understand multiple utils for reading/writing to HDFS
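A hedged sketch of the TensorFlowOnSpark driver and map function, based on the project's published examples; argument values here are illustrative and may differ between TFoS versions:

```python
# Driver side: wrap the distributed-TF map function in a TFCluster.
from tensorflowonspark import TFCluster, TFNode

def map_fun(args, ctx):
    # Runs on each Spark executor; ctx carries job_name, task_index, cluster spec.
    cluster, server = TFNode.start_cluster_server(ctx)  # replaces tf.train.Server()
    if ctx.job_name == 'ps':
        server.join()
    else:
        # Build the same graph as in the vanilla distributed TF sketch,
        # reading training data from HDFS.
        pass

# sc is the SparkContext; positional args: tf_args, num_executors, num_ps.
cluster = TFCluster.run(sc, map_fun, None, 4, 1,
                        False, TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()
```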

Performance Summary

- It should be noted again that the CPU performance is on a p2.xlarge
- For this example, local communication through multi-GPU is much more efficient than distributed communication
- TensorFlow on Spark provided speedups in the distributed setting (most likely from RDMA)
- For each processing step the effective batch size is identical (125 images for single GPU/CPU and 42 per device for the 3-GPU and 3-server distributed configurations)

When to Use What and Practical Considerations

Practical Considerations

Size of Data
- If the size of the input data is especially large (large images, for example), the entire model might not fit onto a single GPU
- If the number of records is prohibitively large, a shared file system or database might be required
- If the number of records is large, convergence can be sped up by using multiple GPUs or distributed models

Size of Model
- If the size of the network is larger than your GPU's memory, splitting the model across multiple GPUs (model parallelism) may be required (see the sketch below)
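As an illustration of model parallelism (an assumption, not a pattern taken from the slides), early layers can be placed on one GPU and later layers on another with the same tf.device() mechanism:

```python
# Split a network that is too large for one GPU's memory across two GPUs.
import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 32, 32, 3])

with tf.device('/gpu:0'):
    # Early layers live on GPU 0.
    x = tf.layers.conv2d(images, 64, 3, padding='same', activation=tf.nn.relu)
    x = tf.layers.max_pooling2d(x, pool_size=2, strides=2)

with tf.device('/gpu:1'):
    # Later layers (and the classifier) live on GPU 1; activations cross
    # the PCIe/NVLink boundary between the two device scopes.
    x = tf.layers.conv2d(x, 128, 3, padding='same', activation=tf.nn.relu)
    x = tf.layers.flatten(x)
    logits = tf.layers.dense(x, 10)
```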

Practical Considerations Continued

Number of Updates
- If the number of updates and the size of the updates are considerable, multiple-GPU configurations on a single server should be preferred over distributed configurations

Hardware Configurations
- Network speeds play a crucial role in distributed model settings
- GPUs connected with RDMA over InfiniBand have shown significant speedups over more basic gRPC over Ethernet

gRPC vs RDMA vs MPI

gRPC and MPI are possible transports for distributed TF, however:
- gRPC latency is 100-200 us/message due to userspace <-> kernel switches
- MPI latency is 1-3 us/message due to OS bypass, but requires communications to be serialized to a single thread; otherwise, 100-200 us latency
- TensorFlow typically spawns ~100 threads, making MPI suboptimal

Our approach, powered by the Bitfusion Core compute virtualization engine, is to use native IB verbs + RDMA for the primary transport: 1-3 us per message across all threads of execution. Use multiple channels simultaneously where applicable: PCIe, multiple IB ports.

Key Takeaways and Future Work

Key Takeaways
- GPUs are almost always better
- Saturating GPUs locally should be a priority before going to a distributed setting
- If your data is already built on Spark, TensorFlow on Spark provides an easy way to integrate
- If you want to get the most recent TensorFlow features, TFoS has a version release delay

Future Work
- Use TensorFlow on Spark on our Dell InfiniBand cluster

Questions?