SCALABLE DISTRIBUTED DEEP LEARNING


1 SCALABLE DISTRIBUTED DEEP LEARNING
Han Hee Song, PhD, Soft On Net
Seoul, Oct. 7, 2016

2 BATCH PROCESSING FRAMEWORKS FOR DL
Data parallelism provides efficient big data processing: data collecting, feeding, cleaning
Easy integration with distributed data storage solutions: HBase, Hive, HDFS
Fault tolerance provides reliable, fail-safe job scheduling
Additionally, for Apache Spark: in-memory data processing without storing intermediate results to disk
A nice single framework for deep learning. However...

3 CHALLENGES OF USING SPARK FOR DL
Traditional batch processing tasks:
  Each task is independently mappable to mappers
  Global updates among workers are applied synchronously
Shards of deep learning tasks:
  Gradient descent used in DL incurs a lot of communication among tasks
  DL parameter updating is asynchronous

4 DISTRIBUTED DEEP LEARNING
Data-parallel deep learning* on Spark
Use multiple model replicas to process different examples concurrently
[Diagram: Spark driver, Spark workers, HDFS]
* Jeff Dean, NIPS, 2013
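The data-parallel idea on this slide can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: the model is a one-parameter linear fit, the replicas run one after another in a loop rather than on Spark workers, and the names `shard`, `replica_gradient`, and `train` are hypothetical.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y) pair


def shard(examples: List[Example], num_replicas: int) -> List[List[Example]]:
    """Split the examples so each replica sees a different subset."""
    return [examples[i::num_replicas] for i in range(num_replicas)]


def replica_gradient(w: float, examples: List[Example]) -> float:
    """Gradient of the mean of 0.5 * (w*x - y)^2 over one replica's shard."""
    return sum((w * x - y) * x for x, y in examples) / len(examples)


def train(examples: List[Example], num_replicas: int = 4,
          lr: float = 0.01, steps: int = 200) -> float:
    """Synchronous data-parallel gradient descent on a single weight w."""
    shards = shard(examples, num_replicas)  # assumes every shard is non-empty
    w = 0.0
    for _ in range(steps):
        # Each replica computes a gradient on its own shard (sequential here;
        # on a cluster these calls would run concurrently on the workers).
        grads = [replica_gradient(w, s) for s in shards]
        # The driver averages the replica gradients and updates the model.
        w -= lr * sum(grads) / len(grads)
    return w
```

Calling `train([(float(x), 3.0 * x) for x in range(1, 9)])` fits data generated from y = 3x, so the returned weight should end up close to 3.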

5 DISTRIBUTED TRAINING ARCHITECTURE
[Diagram labels]
Apache Oozie: schedule jobs when new training needed
Hadoop YARN: monitor and main memory usage
Training data on HDFS
Parameter server
Spark driver
Model trainer (x4)
Shared virtual GPUs (x2)
Trained model on HDFS
Legend: control flow, data flow
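The diagram shows model trainers exchanging parameters with a parameter server. A rough sketch of that pattern with asynchronous updates follows. Everything in it is an assumption made for illustration: threads stand in for the model trainers, the model is a single weight, and `ParameterServer`, `pull`, `push`, `worker`, and `train_async` are hypothetical names, not an API from the talk.

```python
import threading
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y) pair


class ParameterServer:
    """Holds the shared weight; trainers pull it and push gradients."""

    def __init__(self, w: float = 0.0, lr: float = 0.005) -> None:
        self._w = w
        self._lr = lr
        self._lock = threading.Lock()

    def pull(self) -> float:
        with self._lock:
            return self._w

    def push(self, grad: float) -> None:
        # Applied as soon as it arrives, without waiting for other trainers.
        with self._lock:
            self._w -= self._lr * grad


def worker(server: ParameterServer, shard: List[Example], steps: int) -> None:
    for _ in range(steps):
        w = server.pull()  # may already be stale when the gradient is pushed
        grad = sum((w * x - y) * x for x, y in shard) / len(shard)
        server.push(grad)


def train_async(examples: List[Example], num_workers: int = 4,
                steps: int = 500) -> float:
    server = ParameterServer()
    shards = [examples[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(server, s, steps))
               for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.pull()
```

Unlike a synchronous scheme, no trainer waits for the others before its gradient is applied, which is the asynchronous behavior described on the challenges slide. The small learning rate is chosen here so that stale gradients do not destabilize this toy problem.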

6 TRAINING PERFORMANCE
Training GoogLeNet with Batch Normalization (lower error is better)
[Plot: normalized error vs. learning time (in hours). Series: Baseline 1 GPU; UC Berkeley SparkNet (4 x 4 GPU); DeepDatamining / SoftOnNet (4 x 4 GPU); Yahoo CaffeOnSpark (4 x 4 GPU)]
Parallelizing training with 16 GPUs yields 5x faster convergence
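To put the 5x figure in context, the speedup can be divided by the number of GPUs to get a parallel efficiency. The helper below is a small sketch; it assumes the 5x is measured against the 1-GPU baseline, and the function name `parallel_efficiency` is made up for this example.

```python
def parallel_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


# 5x faster convergence on 4 x 4 = 16 GPUs
efficiency = parallel_efficiency(5.0, 16)
print(f"{efficiency:.2%}")  # 31.25%
```

Under that assumption, the reported setup reaches roughly a third of ideal linear scaling, consistent with the communication overhead noted on the challenges slide.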


7 DISTRIBUTED INFERENCE
Problems
  Real-time processing of 1,000s of videos
  Significant performance degradation due to IPC delay
  Low utilization of GPU
Solution
  In each GPU card, improve GPU occupancy
  In the cluster, promote fine-grained resource allocation among machines
For an operational system, promoting efficiency in inference is much more important than in training

8 INFERENCE PERFORMANCE
Time taken to process k 60-second videos (for a single GPU)
[Plot: execution time in seconds vs. # of tasks]
Each GPU can only process 2 videos in real time

9 EFFICIENT USE OF CORES IN EACH GPU
Observation
  Inference on a deep network does not use all 2,500 GPU cores
  By default, the GPU dedicates all cores to one task at a time
Solution: Nvidia Multi-Process Service (MPS)
  Launches multiple kernels concurrently
  Concurrently processes multiple tasks, each partially occupying the GPU cores
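The effect described on this slide can be illustrated with a deliberately simplified timing model. It is not a description of how MPS schedules kernels: it assumes every task needs a fixed number of cores for a fixed duration, and the per-task core count and duration used in the example are invented values, not figures from the talk.

```python
import math


def exclusive_makespan(num_tasks: int, task_seconds: float) -> float:
    """Default mode in this model: one task owns the whole GPU at a time."""
    return num_tasks * task_seconds


def shared_makespan(num_tasks: int, task_seconds: float,
                    total_cores: int, cores_per_task: int) -> float:
    """Shared mode in this model: tasks run side by side while cores last."""
    concurrent = max(1, total_cores // cores_per_task)
    waves = math.ceil(num_tasks / concurrent)  # batches of concurrent tasks
    return waves * task_seconds


# Hypothetical numbers: 8 tasks, 10 s each, 300 cores per task, 2,500 cores.
print(exclusive_makespan(8, 10.0))          # 80.0
print(shared_makespan(8, 10.0, 2500, 300))  # 10.0
```

In this toy model the gain comes entirely from tasks that leave most cores idle. A task that already fills the GPU would see no benefit from sharing.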

10 EFFICIENT USE OF CORES IN EACH GPU
Micro-benchmark using Nvidia Visual Profiler
[Figure panels: default single-process mode; Multi-Process Service (MPS) mode]
Source: Priyanka, Improving GPU utilization with MPS, Nvidia GTC

11 INFERENCE PERFORMANCE WITH MPS
Time taken to process k 60-second videos (for a single GPU)
[Plot: execution time in seconds vs. # of tasks. Series: default single-process mode; Multi-Process Service (MPS)]
Each GPU can now process as many as 8 videos in real time
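"Real time" here can be read as: k videos of 60 seconds each are processed in at most 60 seconds. The sketch below derives the real-time capacity from a table of measurements under that reading. The function name `realtime_capacity` is hypothetical, and the timings in the example are made-up placeholders, not values read from the plot.

```python
from typing import Dict


def realtime_capacity(exec_seconds_by_k: Dict[int, float],
                      video_seconds: float = 60.0) -> int:
    """Largest measured k whose execution time fits within the video length."""
    feasible = [k for k, t in exec_seconds_by_k.items() if t <= video_seconds]
    return max(feasible, default=0)


# Placeholder timings only (k concurrent videos -> seconds to process them).
default_mode = {1: 30.0, 2: 58.0, 3: 90.0, 4: 120.0}
mps_mode = {2: 20.0, 4: 35.0, 8: 59.0, 10: 75.0}

print(realtime_capacity(default_mode))  # 2
print(realtime_capacity(mps_mode))      # 8
```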


12 DISTRIBUTED INFERENCE ARCHITECTURE
[Diagram labels]
Apache Oozie: schedule jobs when new video arrives
Hadoop YARN: monitor GPUs, main memory, YARN labeling
Video on HDFS
Video to image
Spark driver
Trained model on HDFS
Object detection, trajectory tracking (x4)
Assign tasks up to 4 x 8
Partial GPU (x4)
Assign tasks up to 1 x 8
Annotated video on HDFS
Discovered object metadata on HBase
Legend: control flow, data flow
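One way to read the "4 x 8" and "1 x 8" labels is as a cap of 8 concurrent tasks on each of 4 GPUs. The sketch below assigns incoming videos to the least-loaded GPU under that cap. This reading of the diagram is an assumption, and `assign_tasks` is a hypothetical helper rather than part of Spark or YARN.

```python
from typing import Dict, List, Tuple


def assign_tasks(videos: List[str], num_gpus: int = 4,
                 slots_per_gpu: int = 8) -> Tuple[Dict[int, List[str]], List[str]]:
    """Place each video on the least-loaded GPU; overflow goes to a wait list."""
    assignment: Dict[int, List[str]] = {gpu: [] for gpu in range(num_gpus)}
    waiting: List[str] = []
    for video in videos:
        # Ties resolve to the lowest GPU index, giving round-robin placement.
        gpu = min(assignment, key=lambda g: len(assignment[g]))
        if len(assignment[gpu]) < slots_per_gpu:
            assignment[gpu].append(video)
        else:
            waiting.append(video)  # every GPU already holds slots_per_gpu tasks
    return assignment, waiting
```

With the defaults, up to 4 x 8 = 32 videos are placed at once and any further videos wait until a slot frees up.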


13 SMART EYE
Deep-learning-based video surveillance on the cloud
Cost-effective, reliable, and mobile solution for both small and large consumers
[Diagram: SmartEye serving a small-scale home user; a mid-size enterprise; a large, multi-site enterprise, government, etc.]
Copyright 2016 deep datamining, Inc.

14 THANK YOU
