A performance comparison of Deep Learning frameworks on KNL


R. Zanella, G. Fiameni, M. Rorro
Middleware, Data Management - SCAI - CINECA
IXPUG Bologna, March 5, 2018

Table of Contents
1. Problem description
2. Tested Neural Networks
3. Architectures and Software
4. Single node benchmark
5. Multinode benchmark
6. Conclusions

Neural Network Internals: classification problem
Given a single input, a trained neural network predicts a probability distribution:
$\hat{y} = f(x;\theta)$, where $x$ is the input, $\theta$ the NN parameters, $\hat{y}$ the prediction.
The NN parameters are chosen to minimize the average error on a given training set $\{x^{(i)}, y^{(i)}\}_{i=1,\dots,N}$:
$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\big(f(x^{(i)};\theta),\, y^{(i)}\big)$$
A Stochastic Gradient Descent (SGD) step is:
$$\theta^{(k+1)} = \theta^{(k)} - \epsilon_k\, \hat{g}^{(k)}, \qquad \hat{g}^{(k)} = \frac{1}{n}\sum_{l \in \text{batch}} \nabla_\theta L\big(f(x^{(l)};\theta^{(k)}),\, y^{(l)}\big), \quad n = \#\text{batch}$$
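To make the update rule concrete, here is a minimal NumPy sketch of one SGD step for a linear softmax classifier with cross-entropy loss; the model, learning rate, and shapes are illustrative assumptions, not part of the benchmarked networks.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(theta, x_batch, y_batch, lr):
    """One update theta <- theta - lr * g_hat on a mini-batch.

    theta: (features, classes) weights of a linear classifier (toy model).
    x_batch: (n, features); y_batch: (n,) integer class labels.
    """
    n = x_batch.shape[0]
    probs = softmax(x_batch @ theta)            # y_hat = f(x; theta)
    probs[np.arange(n), y_batch] -= 1.0         # gradient of cross-entropy w.r.t. logits
    g_hat = x_batch.T @ probs / n               # (1/n) * sum of per-sample gradients
    return theta - lr * g_hat

# toy usage: 1000-class classification on random data
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.01, size=(256, 1000))
x = rng.normal(size=(32, 256))
y = rng.integers(0, 1000, size=32)
theta = sgd_step(theta, x, y, lr=0.01)
```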

Convolutional Neural Networks
ImageNet Large Scale Visual Recognition Competition (ILSVRC): an annual software contest (since 2010); tasks: image classification, object localization/detection, scene detection.
Tested networks:
AlexNet: winner of the 2012 classification task;
OverFeat: winner of the 2013 localization task, with remarkable results also in classification and detection;
VGG: winner of the 2014 localization task, second place in classification.
Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, 2015

AlexNet (Krizhevsky, Sutskever, Hinton, 2012)
Structure:
five convolutional layers: kernel sizes 11×11, 5×5, three 3×3; output channels 64, 192, 384, 256, 256;
three max-pool layers (kernel 3×3, stride 2×2);
three fully-connected layers;
Rectified Linear Unit (ReLU) as the nonlinear activation function.
A note on the benchmarked network: the original version exploited multiple GPUs to overcome memory limits; the benchmarked version is AlexNet_v2 (single device).
Krizhevsky, Sutskever, Hinton, ImageNet Classification with Deep Convolutional Neural Networks, 2012
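As an illustration of the layer structure listed above (and only that: this is not the convnet-benchmarks definition), a minimal tf.keras sketch of an AlexNet_v2-like single-device network might look as follows; the 224x224x3 input size and the padding choices are assumptions.

```python
import tensorflow as tf

def alexnet_v2_sketch(num_classes=1000):
    # Five conv layers (11x11, 5x5, three 3x3), three 3x3/stride-2 max-pools,
    # three fully-connected layers, ReLU activations throughout.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 11, strides=4, activation="relu",
                               input_shape=(224, 224, 3)),
        tf.keras.layers.MaxPooling2D(3, strides=2),
        tf.keras.layers.Conv2D(192, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(3, strides=2),
        tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(3, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(4096, activation="relu"),
        tf.keras.layers.Dense(4096, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = alexnet_v2_sketch()
model.summary()
```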

OverFeat (Sermanet et al., 2014)
Structure (fast model):
five convolutional layers: kernel sizes 11×11, 5×5, three 3×3; output channels 96, 256, 512, 1024, 1024;
three max-pool layers (kernel 2×2, stride 2×2);
three fully-connected layers;
Rectified Linear Unit (ReLU) as the nonlinear activation function.
Differences w.r.t. AlexNet: remarkable increase in the number of output channels of the convolutional layers (more filters); no overlap in the max-pool layers.
Sermanet et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, 2014

VGG (Simonyan, Zisserman, 2014)
Structure (vgg_a / vgg11 network):
eight convolutional layers: fixed kernel size 3×3; output channels 64, 128, 2×256, 4×512;
five max-pool layers (kernel 2×2, stride 2×2);
three fully-connected layers;
Rectified Linear Unit (ReLU) as the nonlinear activation function.
Differences w.r.t. AlexNet: slight increase in the number of output channels of the convolutional layers; remarkable depth increase (convolutional layers: from five to eight); smaller kernels; no overlap in the max-pool layers.
Simonyan, Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014

Systems description
CPU architectures:
2x Intel Broadwell: 2x Intel Xeon E5-2697 v4 @2.3GHz, 36 cores (total), 128 GB RAM
Intel Knights Landing: Intel Xeon Phi 7250 @1.4GHz, 68 cores, 96 GB RAM (+16 GB MCDRAM)
2x Intel Skylake: 2x Intel Xeon 8160 @2.1GHz, 48 cores (total), 192 GB RAM
GPU architectures (hosted on Intel Haswell: 2x Intel Xeon 2630 v3 @2.4GHz, 16 cores (total), 128 GB RAM):
Nvidia K80: 5.6 Tflops peak performance (sp), 24 GB RAM
Nvidia P100 (PCIe): 10.6 Tflops peak performance (sp), 16 GB RAM
Knights Landing settings: Hyper-Threading enabled; Memory Mode: cache; Cluster Mode: quadrant.
Network: Intel Omnipath, 100 Gb/s

Tested software

Caffe
Berkeley Vision and Learning Center (BVLC); C++, Matlab, and Python APIs; network definition in a prototxt text file.
GPU: CUDA, cuDNN (+ NCCL for single-node multi-GPU); CPU: a BLAS implementation (ATLAS, MKL, or OpenBLAS).
Tested configuration: BVLC/Caffe 1.0.0, CUDA 8.0, cuDNN 6.0.
Intel branch of the original project; prerequisites: mkl-dnn (based on mklml). Tested configuration: intel/caffe 1.0.0, mkl-dnn 0.9.
Project status: BVLC/Caffe: project closed with 1.0 (18 Apr 2017), development efforts moved to Caffe2; Intel/Caffe: latest release is 1.1.0 (13 Jan 2018).
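For context, a minimal sketch of driving such a prototxt-defined network through Caffe's Python interface (pycaffe); the prototxt file name is a placeholder and this is not the benchmark script itself (for the forward-backward measurement the network definition is assumed to include a loss layer).

```python
import caffe  # pycaffe; the CPU runs use the intel/caffe branch built against mkl-dnn

caffe.set_mode_cpu()               # or: caffe.set_mode_gpu(); caffe.set_device(0)
# "alexnet.prototxt" is a placeholder network-definition file
net = caffe.Net("alexnet.prototxt", caffe.TEST)

net.forward()                      # forward step: inference on one batch
net.backward()                     # backward step: backpropagation of the errors

# inspect the blob shapes (batch size is the first dimension of the data blob)
print({name: blob.data.shape for name, blob in net.blobs.items()})
```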

Theano
Montreal Institute for Learning Algorithms (MILA); Python API (dynamic C code generation).
Prerequisites: CUDA, cuDNN, libgpuarray. Tested configuration: MILA/Theano (git: 12/07/2017), CUDA 8.0, cuDNN 6.0, libgpuarray 0.6.8.
Intel branch of the original project; prerequisite: MKL. Tested configuration: intel/theano 1.1, MKL 2017.
Project status: MILA/Theano: development closed, latest bug fix is 1.0.1 (7 Dec 2017); Intel/Theano: latest release is 1.1.0 (1 Apr 2017).

Neon
Intel (previously Nervana Systems); Python API; prerequisite: CUDA.
Tested configuration: Neon 2.1.0, CUDA 8.0, (mklml_lnx_2018).
Project status: Nervana/Neon: latest release is 2.6.0 (5 Jan 2018).

TensorFlow
Google; Python API; prerequisite: CUDA.
Tested configuration: CPU: TensorFlow (git: 18/07/2017), (mklml_lnx_2018); GPU: TensorFlow 1.2.1, CUDA 8.0.
Project status: tensorflow/tensorflow: latest release 1.6.0 (28 Feb 2018).
https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
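On the CPU side, the MKL-enabled TensorFlow build is typically tuned through its intra-/inter-op thread pools and the OpenMP environment. A minimal TensorFlow 1.x sketch follows; the thread counts and KMP settings are illustrative values for a 68-core KNL node, not the exact settings used in these runs.

```python
import os
import tensorflow as tf

# OpenMP / MKL settings (illustrative values for a 68-core KNL node)
os.environ["OMP_NUM_THREADS"] = "68"
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

# TensorFlow 1.x session configuration: intra-op threads parallelize the math
# inside a single op (e.g. a convolution), inter-op threads run independent ops.
config = tf.ConfigProto(
    intra_op_parallelism_threads=68,
    inter_op_parallelism_threads=2)

with tf.Session(config=config) as sess:
    a = tf.random_normal([4096, 4096])
    b = tf.random_normal([4096, 4096])
    sess.run(tf.matmul(a, b))   # dummy workload to exercise the thread pools
```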

Software summary
CPU: Intel/Caffe, Intel/Theano, TensorFlow, Neon
GPU: BVLC/Caffe, MILA/Theano, TensorFlow, Neon

Single node benchmark settings
Network structures are simplified: no preprocessing phase is implemented; input data is generated in memory; no dropout is considered; classification task: 1000 classes.
Network source code: Caffe/Theano/TensorFlow use the convnet-benchmarks source code; for Neon, the code is part of the Neon sources.
Chosen performance measure: number of images per second, considering:
forward step: evaluation of the network on a batch (inference);
forward-backward step: forward step + backpropagation of the errors on a batch (training).
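The images-per-second figure can be computed framework-independently from wall-clock timings over synthetic, in-memory batches. A minimal sketch, where the step function is a placeholder for a framework-specific forward or forward-backward call:

```python
import time
import numpy as np

def images_per_second(step_fn, batch_size, warmup=5, iters=20):
    """Return images/s for a callable that processes one batch per call."""
    for _ in range(warmup):               # discard warm-up iterations (lazy init, caches)
        step_fn()
    t0 = time.time()
    for _ in range(iters):
        step_fn()
    return batch_size * iters / (time.time() - t0)

# synthetic in-memory input, as in the benchmark settings (no dataset, no preprocessing)
batch = 256
images = np.random.rand(batch, 3, 224, 224).astype(np.float32)
labels = np.random.randint(0, 1000, size=batch)

# placeholder step: replace with a framework-specific forward (inference)
# or forward-backward (training) call on `images` / `labels`
dummy_step = lambda: np.tanh(images).sum()

print("%.1f images/s" % images_per_second(dummy_step, batch))
```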

AlexNet: figure with bar charts of images per second for the forward and forward-backward steps, batch sizes 256 to 4096, comparing the tested frameworks (intel/caffe, intel/theano, BVLC/caffe, MILA/theano) on 2x Broadwell, Knights Landing, 2x Skylake, K80 and P100.

OverFeat: figure with the same comparison for batch sizes 128 to 2048.

VGG (I): figure with the same comparison for batch sizes 64 to 512.

VGG (II): figure with the forward and forward-backward comparison restricted to Knights Landing and K80, batch sizes 64 to 512.

Multinode benchmark
Based on the official TensorFlow benchmark suite: https://github.com/tensorflow/benchmarks
input data can be generated in memory, or read from a dataset (+ preprocessing);
contains definitions of fully functional neural networks (dropout is present);
supports multinode (and multinode, multi-GPU) runs;
supports a larger (and growing) number of models;
but the support is for TensorFlow only.
Multinode parallelization approach: given $n = \#\text{batch}$ and $M = \#\text{nodes}$,
$$\theta^{(k+1)} = \theta^{(k)} - \epsilon_k\, \hat{g}^{(k)}, \qquad \hat{g}^{(k)} = \frac{1}{Mn}\sum_{m \in \text{nodes}} \sum_{l \in \text{batch}_m} \nabla_\theta L\big(f(x^{(l)};\theta^{(k)}),\, y^{(l)}\big)$$
Goyal et al., Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017
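A minimal NumPy sketch of this data-parallel update, with the reduction over nodes emulated in-process by a plain mean (a real run would use an all-reduce or a parameter server); the toy loss, gradient function, and shapes are assumptions.

```python
import numpy as np

def multinode_sgd_step(theta, node_batches, grad_fn, lr):
    """Data-parallel SGD step: each of the M nodes averages the gradient over
    its local batch of n samples; the global gradient is the mean over nodes,
    giving the 1/(M*n) factor of the update rule."""
    per_node = [np.mean([grad_fn(theta, x, y) for x, y in batch], axis=0)
                for batch in node_batches]           # local mean over n samples
    g_hat = np.mean(per_node, axis=0)                # mean over M nodes
    return theta - lr * g_hat

# toy usage: quadratic loss L = 0.5*(theta.x - y)^2, gradient (theta.x - y)*x
grad_fn = lambda theta, x, y: (theta @ x - y) * x
rng = np.random.default_rng(0)
theta = np.zeros(8)
node_batches = [[(rng.normal(size=8), 1.0) for _ in range(32)]   # n = 32 per node
                for _ in range(4)]                               # M = 4 nodes
theta = multinode_sgd_step(theta, node_batches, grad_fn, lr=0.1)
```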

Supported communication protocols
TensorFlow supported protocols:
grpc: Google Remote Procedure Call;
grpc+verbs: grpc for administrative tasks (setting up the RDMA path), RDMA (Remote Direct Memory Access) for the actual exchange of tensors (weights, gradients, etc.);
grpc+mpi: not tested yet.
Supported variable-management procedures:
parameter server: variables are stored on a parameter server (ps) that holds the master copy of each variable; at each step, every worker node (wn) gets a copy of the variables from the ps and sends its gradients back to the ps;
distributed replicated: each wn holds a copy of the variables and updates it once the ps copies have been updated with the gradients from all wn (not tested yet).
Chosen performance measure: average number of images per second per node for the forward step + backpropagation of the errors on a batch (training); comparison at fixed local batch size: the overall batch size grows with the number of nodes.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/verbs/README.md
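For reference, a minimal sketch of how a parameter-server job selects the grpc+verbs protocol with TensorFlow 1.x's tf.train.Server; the host names, ports, and toy model are placeholders, and this is not the tf_cnn_benchmarks launcher (the protocol requires a TensorFlow build with verbs support).

```python
import tensorflow as tf

# placeholder cluster: one parameter server and two worker nodes
cluster = tf.train.ClusterSpec({
    "ps":     ["node01:2222"],
    "worker": ["node02:2222", "node03:2222"],
})

# each process starts its own server; "grpc+verbs" moves tensor traffic over RDMA
job_name, task_index = "worker", 0      # set per process (e.g. from the job scheduler)
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index,
                         protocol="grpc+verbs")

if job_name == "ps":
    server.join()                       # the parameter server just serves variables
else:
    # variables are placed on the ps, computation on the local worker
    with tf.device(tf.train.replica_device_setter(
            cluster=cluster,
            worker_device="/job:worker/task:%d" % task_index)):
        w = tf.get_variable("w", shape=[1000], initializer=tf.zeros_initializer())
        loss = tf.reduce_sum(tf.square(w - 1.0))     # toy loss standing in for a CNN
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)
```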

grpc (forward-backward, VGG): figure showing VGG forward-backward images per second per node with the grpc protocol and a dedicated parameter server, local batch sizes 64 to 512, for a single node and for 1 to 6 worker nodes (+1 ps).

grpc+verbs (forward-backward, VGG): figures showing the same measurement with grpc+verbs, both with a dedicated parameter server (1 to 16 workers, +1 ps) and with the parameter server co-located with a worker (1 to 8 nodes).

Conclusions
Single node benchmarks:
the best combination is an Nvidia P100 with Neon;
nodes composed of a single Intel Knights Landing or 2x Intel Skylake achieve performance comparable to, or slightly better than, a single Nvidia K80;
among the CPU systems, no software clearly outperforms the others; TensorFlow exhibits good performance on both architectures.
Multinode benchmarks (TensorFlow only):
RDMA (grpc+verbs) is fundamental for acceptable efficiency.
Ongoing work:
Caffe and Theano can exploit the Intel Machine Learning Scaling Library (MLSL); TensorFlow can also exploit MPI.

References and Acknowledgements
References:
Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, 2015
Krizhevsky, Sutskever, Hinton, ImageNet Classification with Deep Convolutional Neural Networks, 2012
Sermanet et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, 2014
Simonyan, Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014
Goyal et al., Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017
https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/verbs/README.md
We would like to thank Andrea Luiselli (Intel) for support on the new Intel software branches, and Walter Riviera (Intel) for multinode TensorFlow support.

Net structures recap
AlexNet: input RGB image; conv k 11×11, ch 64, stride 4; maxpool k 3×3, stride 2; conv k 5×5, ch 192, stride 1; maxpool k 3×3, stride 2; conv k 3×3, ch 384, stride 1; conv k 3×3, ch 256, stride 1; conv k 3×3, ch 256, stride 1; maxpool k 3×3, stride 2; FC: output 4096; FC: output 4096; FC: output 1000.
OverFeat: input RGB image; conv k 11×11, ch 96, stride 4; maxpool k 2×2, stride 2; conv k 5×5, ch 256, stride 1; maxpool k 2×2, stride 2; conv k 3×3, ch 512, stride 1; conv k 3×3, ch 1024, stride 1; conv k 3×3, ch 1024, stride 1; maxpool k 2×2, stride 2; FC: output 3072; FC: output 4096; FC: output 1000.
VGG_a (vgg11): input RGB image; conv k 3×3, ch 64, stride 1; maxpool k 2×2, stride 2; conv k 3×3, ch 128, stride 1; maxpool k 2×2, stride 2; conv k 3×3, ch 256, stride 1; conv k 3×3, ch 256, stride 1; maxpool k 2×2, stride 2; conv k 3×3, ch 512, stride 1; conv k 3×3, ch 512, stride 1; maxpool k 2×2, stride 2; conv k 3×3, ch 512, stride 1; conv k 3×3, ch 512, stride 1; maxpool k 2×2, stride 2; FC: output 4096; FC: output 4096; FC: output 1000.
Table: convolutional and fully-connected layers are followed by the ReLU nonlinear function.

Reviewers' feedback
Not familiar with the applications, so the legend on the graphs is unclear. Unless you want to compare across FWD and FWD-BWD, I would put the graphs on separate slides so that you can increase their sizes. I hope the talk includes a brief discussion about how the applications differ.
It isn't clear how the work is being distributed, and there are many gaps in the results.
This seems a nice benchmark for a relevant case study. I suggest improving the quality of slide 6: what is on the y axis? Too many columns make the comparison difficult: I suggest splitting the plots. On slide 7: capitalize Intel.

Reviewers' questions
You state that only TensorFlow supports MPI. Does that mean that all of the applications are running OpenMP-only for this work?
Single-node benchmark results are obtained with the applications running OpenMP-only. Multinode TensorFlow relies on grpc or grpc+verbs.
Why aren't there results for 4096 with Neon? Why are the GPU results missing 4096? Why are half of the GPU results missing 2048?
The K80 has a total of 24 GB of memory: for large batch sizes, all SDKs run out of memory.
Were any of the results surprising?
Intel's efforts on CPU-based Deep Learning allow TensorFlow to require comparable CPU time on KNL and on the K80.
Are there remaining challenges or questions to be answered? Comment on the parallelization of Neon: multithreading?
All tested SDKs exploit multithreading parallelization.