Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability.


arXiv preprint (v4) [cs.CV], 5 Dec 2016

Janis Keuper
Fraunhofer ITWM, Competence Center High Performance Computing
Kaiserslautern, Germany
janis.keuper@itwm.fhg.de

Franz-Josef Pfreundt
Fraunhofer ITWM, Competence Center High Performance Computing
Kaiserslautern, Germany
franz-josef.pfreundt@itwm.fhg.de

Abstract - This paper presents a theoretical analysis and practical evaluation of the main bottlenecks towards a scalable distributed solution for the training of Deep Neural Networks (DNNs). The presented results show that the current state-of-the-art approach, using data-parallelized Stochastic Gradient Descent (SGD), is quickly turning into a vastly communication-bound problem. In addition, we present simple but fixed theoretical constraints that prevent effective scaling of DNN training beyond only a few dozen nodes. This leads to poor scalability of DNN training in most practical scenarios.

I. INTRODUCTION

The tremendous success of Deep Neural Networks (DNNs) [18], [14] in a wide range of practically relevant applications has triggered a race to build larger and larger DNNs [20], which need to be trained with more and more data to solve learning problems in rapidly expanding fields of application. However, training DNNs is a compute- and data-intensive task: current models take several ExaFLOP to compute, while processing hundreds of petabytes of data [20]. Table I gives an impression of the compute complexity and shows that even the latest compute hardware takes days to train the medium-sized benchmark networks used in our experiments.

TABLE I
APPROXIMATE COMPUTATION TIMES FOR ALEXNET WITH BATCH SIZE B = 256 AND 450K ITERATIONS AND GOOGLENET WITH B = 32 AND 1000K ITERATIONS. KNL (XEON PHI "KNIGHTS LANDING") RESULTS WITH MKL17. TITANX WITH PASCAL GPU. SEE SECTION I-B3.

                                    CPU      K80      TitanX       KNL
AlexNet:   time per iteration       2 s      0.9 s    0.2 s [10]   0.6 s
           time till convergence    250 h    112 h    25 h [10]    75 h
GoogLeNet: time per iteration       1.3 s    0.36 s   -
           time till convergence    361 h    100 h    -            89 h

While a parallelization of the training problem over up to 8 GPUs hosted in a single compute node can be considered the current state of the art, available distributed approaches [4], [15], [1], [2], [7] yield disappointing results [19] in terms of scalability and efficiency. Figure 1 shows representative experimental evaluations, where strong scaling stalls after only a few dozen nodes.

Fig. 1. Experimental evaluation of DNN training scalability (strong scaling) for different DNNs with varying global batch sizes B. Results from an out-of-the-box installation of IntelCaffe on a common HPC system (details are given in Section I-B).

In this paper, we investigate the theoretical and practical constraints preventing better scalability, namely model distribution overheads (Section II), data-parallelized matrix multiplication (Section III) and training data distribution (Section IV).

A. Stochastic Gradient Descent

Deep Neural Networks are trained using the Backpropagation algorithm [16]. Numerically, this is formulated as a highly non-convex optimization problem in a very high dimensional space, which is typically solved via Stochastic Gradient Descent (SGD) [3]. SGD, using moderate mini-batch sizes B, provides stable convergence at fair computational cost on a single node. (Usually, SGD with additional second-order terms (momentum) is used, but this has no impact on the parallelization.)
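For concreteness, the following is a minimal NumPy sketch of the plain, sequential mini-batch SGD iteration described here; the gradient function, the toy least-squares problem and all parameter values are illustrative placeholders, not the Caffe implementation used in the experiments.

```python
import numpy as np

def minibatch_sgd(grad, w, X, y, epsilon=0.01, batch_size=256, iterations=1000, rng=None):
    """Plain sequential mini-batch SGD: w <- w - epsilon * grad(w, mini-batch).
    `grad(w, X_batch, y_batch)` is assumed to return the averaged gradient."""
    rng = rng or np.random.default_rng(0)
    for t in range(iterations):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # draw mini-batch M
        w = w - epsilon * grad(w, X[idx], y[idx])                 # sequential update step
    return w

# Toy usage: a least-squares gradient as a stand-in for the DNN loss.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, w_true = rng.normal(size=(10000, 50)), rng.normal(size=50)
    y = X @ w_true
    lsq_grad = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(Xb)
    w = minibatch_sgd(lsq_grad, np.zeros(50), X, y, epsilon=0.05, iterations=2000)
    print("error:", np.linalg.norm(w - w_true))
```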

Fig. 2. Schematic overview of a distributed SGD implementation of the Backpropagation algorithm.

However, SGD is very hard to parallelize. This is due to the inherently sequential nature of the algorithm, shown in Equation 1 and Algorithm 1:

    w_{t+1} ← w_t − ε ∇_w x_j(w_t),    (1)

where w_t represents the current state (e.g. the weights at the neurons), ε defines the step size and ∇_w x_j(w_t) is computed from a given loss function over the forward results of a small set of training samples (the mini-batch) and the given training labels. In fact, there are only two ways to speed up SGD: (I) computing the updates ∇_w faster and (II) making larger update steps ε. While (I) is hard to achieve in a distributed setting, given the already low compute times (< 1 s) per iteration, (II) is bound by the difficult topologies of the non-convex problems, which cause SGD to diverge easily.

1) Parallelizing SGD: Figure 2 shows the data-parallel version of SGD [4], which is commonly used for single-node multi-GPU and distributed implementations: the global batch of B training samples for the current iteration is split into n equally sized sets of size b = B/n, which are then fed to n workers holding synchronous local copies of the model state. The results (gradients) of all workers are then accumulated and used to update the model. Hence, the entire approach implements a simple map-reduce scheme. Notably, this scheme implies two different levels of parallelization: the data- and task-parallel [12] Inner Parallelization, located at the compute units of the nodes, which uses parallel algorithms to compute the forward and backward operations within the layers of the DNN (see Section III for details on the local parallelization of layer operations), and the Outer Parallelization over the distributed batches.

Algorithm 1  Mini-batch SGD with samples X = {x_0, ..., x_m}, iterations T, step size ε, batch size B
Require: ε > 0
 1: for all t = 0 ... T do
 2:     randomly draw a batch M of B samples from X
 3:     init Δw_t = 0
 4:     for all x in M do
 5:         aggregate update Δw_t ← Δw_t + ∇_w x(w_t)
 6:     update w_{t+1} ← w_t − ε Δw_t
 7: return w_T
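To make the map-reduce structure of Figure 2 concrete, here is a minimal single-process NumPy sketch of one synchronous data-parallel SGD iteration; the worker loop and the averaging step stand in for the distributed gradient exchange (an all-reduce or parameter-server accumulation in a real MPI setup), and the gradient function is a generic placeholder, not the paper's implementation.

```python
import numpy as np

def data_parallel_sgd_step(grad, w, X_batch, y_batch, epsilon, n_workers):
    """One synchronous data-parallel SGD iteration (Figure 2): split the global
    batch B into n local batches b = B/n, compute local gradients on identical
    model copies, reduce them, then apply a single synchronous update."""
    local_X = np.array_split(X_batch, n_workers)   # scatter the global batch
    local_y = np.array_split(y_batch, n_workers)
    # "map": every worker computes a gradient on its local batch b
    local_grads = [grad(w, Xb, yb) for Xb, yb in zip(local_X, local_y)]
    # "reduce": accumulate/average the worker gradients
    g = np.mean(local_grads, axis=0)
    # identical update on all synchronous model copies
    return w - epsilon * g
```

For equally sized local batches the result is numerically identical to a single large-batch update; the parallelism only changes where the gradient terms are computed.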

B. Experimental Setup

1) Benchmarks: We apply two widely used convolutional neural networks (CNNs), AlexNet [13] and GoogLeNet [22], for the benchmarking of our experimental evaluations. Both networks follow different strategies to learn predictive models for the ImageNet [17] visual recognition challenge: while AlexNet implements a rather shallow network with 3 dominant fully-connected (FC) layers, GoogLeNet uses a very deep network with many convolutional layers. Table II shows the technical details of both networks.

TABLE II
PROPERTIES OF THE DEEP NEURAL NETWORKS USED FOR THE FOLLOWING BENCHMARKS.

                                    AlexNet    GoogLeNet
ExaFLOP to convergence
# Iterations till convergence       450k       1000k
Model size (32-bit FP)              250 MB     50 MB
Default batch size                  256        32
Default step size
# Layers
# Convolutional layers              5          59
# Fully-connected (FC) layers       3          1
# Weights in FC layers              55M        1M

2) Software Framework: We use the MPI-based distributed version (IntelCaffe) [1] of the popular open-source framework Caffe [9] for our evaluation. IntelCaffe was built with CUDA 7.5 and cuDNN 5, using the latest Intel compiler, MKL and IntelMPI. (Some CPU experiments used the latest DNN extensions of the MKL17 library, which provides special-purpose functions for the fast implementation of several layer types, similar to cuDNN for CUDA.)

3) Hardware: All distributed experiments were conducted on an HPC cluster whose nodes hold a dual Xeon E5 v3 CPU, an NVIDIA Tesla K80 GPU and FDR-Infiniband interconnects.

II. DISTRIBUTION OVERHEAD

Fig. 3. Communication overhead for different models and batch sizes. Scalability stalls when the compute times drop below the communication times, leaving compute units idle; training hence becomes a communication-bound problem. Results were generated using a binary-tree communication scheme [7].

The parallelization of DNN training via SGD (as shown in Algorithm 1 and Figure 2) requires the communication of the model w_t and the computed gradients Δw_t between all nodes in every iteration t. Since w has to be synchronous on all nodes and w_{t+1} cannot be computed before Δw_t is available, the entire communication has to be completed before the next iteration t + 1 can start. Naturally, one would try to overlap this communication (which can be done layer by layer) with the compute times. However, there are several pitfalls to this strategy: (I) w and Δw have the size of all weights in the neural network, which can be hundreds of megabytes (see Table II); (II) the compute times per iteration (see Table I) are rather low and decrease further when scaling to more nodes (see Section III); (III) communication cannot start before the forward pass of the network has finished, practically cutting the overlap time in half. Ironically, faster compute units (e.g. newer GPUs) in the compute nodes aggravate the fundamental problem that the communication time exceeds the compute time after scaling to only a few nodes, leaving valuable compute units idle.

Figure 3 shows the strong divergence of communication and compute times. Depending on the model size, the training problem becomes communication bound after scaling to only 4 to 8 nodes. This directly correlates with the general scaling results shown in Figure 1. Figure 3 also shows that the network layout has a large impact on the crucial communication/compute ratio: shallow networks with many neurons per layer (like AlexNet) scale worse than deep networks with fewer neurons (like GoogLeNet), where longer compute times meet smaller model sizes.

A. Limited Network Bandwidth

Limited network bandwidth is one of the key bottlenecks for the scalability of distributed DNN training. Recently, several approaches have been proposed to overcome this problem: e.g. [7] introduced a binary communication tree, which reduces the network load to a maximum of log2(n) peer-to-peer model/gradient transfers at a time. However, assuming linear speedups on the compute side, Figure 3 shows that this approach will only move the intersection point of the communication/compute ratio by a small factor, as the additional overhead grows with the depth of the communication tree. Other methods try to reduce the model size before the communication. This can be done by (I) a redesign of the network [8], eliminating unused weights, (II) limiting the numerical precision of the model weights ([6] has shown that one byte per weight is enough), (III) compression (which is available in [1]), or (IV) transmitting only sparse gradient and model information [21].
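As a rough, back-of-the-envelope companion to Figure 3, the sketch below computes the node count at which a fixed per-iteration communication time starts to exceed the (ideally linearly shrinking) per-node compute time; the communication times in the comment are illustrative assumptions, not measurements from the paper.

```python
def breakeven_nodes(t_comm_s, t_iter_s, max_nodes=256):
    """Smallest worker count n at which a fixed per-iteration communication
    time exceeds the per-node compute time t_iter_s / n, i.e. the point where
    ideal linear compute scaling can no longer hide the communication."""
    for n in range(2, max_nodes + 1):
        if t_comm_s > t_iter_s / n:
            return n
    return None

# Illustrative numbers only: single-node iteration time of 0.9 s (AlexNet on a
# K80, Table I) with an assumed 0.2 s or 0.1 s per full gradient/model exchange.
print(breakeven_nodes(0.2, 0.9))   # -> 5
print(breakeven_nodes(0.1, 0.9))   # -> 10
```

With the communication times actually measured in Figure 3, this intersection falls at roughly 4 to 8 nodes, depending on the model size; model-size reduction (the methods listed above) shifts it outward by roughly the reduction factor.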
All of these model-size reduction methods have practical impact, moving the scalability limit by the factor of the model reduction rate. But none of these approaches solves the problem in principle: as model sizes are growing much faster than the available network bandwidth, the communication overhead remains an unsolved problem.

III. COMPUTATIONAL COSTS AND SCALING OF MATRIX MULTIPLICATIONS

The previously discussed communication overhead is actually a well-known problem that has recently been drawing more and more attention in the deep learning community [7], [8], [6], [21]. But communication overhead is not the only problem preventing DNN scalability: there is an even more severe limitation, which turns out to be a hard theoretical constraint. We illustrate this problem by means of a simple experiment: assuming that the communication in distributed DNN training were free, one would expect close to linear strong-scaling properties (because distributed SGD is data-parallel). However, Figure 4 shows that this is not the case. Again, scalability stalls after only a few nodes. While it is obvious that the global batch cannot be split into local batches of size b < 1, which imposes a strict scalability limit at n = B, the limitations induced by the batch size take effect even for b >> 1. To allow further investigation of these results, we provide a layer-by-layer analysis of the computational complexity and scalability of our benchmark networks.

A. Layer-by-Layer Analysis

Fig. 4. Evaluation of the scalability assuming free communication (simulated by measuring the compute times at a single node at decreasing batch sizes). Results for different compute units.

Fig. 5. Evaluation of the relative compute time for each layer type (several layers of the same type are accumulated) per training iteration on a single-node CPU-based system. Top: results for AlexNet. Bottom: results for GoogLeNet.

Fig. 6. Evaluation of the relative compute time for each layer type (several layers of the same type are accumulated) per training iteration on a single-node GPU-based system (one K80). Top: results for AlexNet. Bottom: results for GoogLeNet.

Figure 5 shows the analysis for DNN training on CPUs. The dominance of the Local Response Normalization (LRN) layer is caused by a rather poor multi-threaded implementation in Caffe (this has been fixed by the MKL17 implementation, as shown in Figure 14) and is negligible in terms of scalability (as shown in Figure 7). More interesting is the growing portion of compute time spent in the InnerProduct (= fully-connected; the name follows Caffe's convention) layer. Figure 6 shows the same tendencies for the layer computations on GPUs, where the LRN layer has no significant impact. Yet another interesting observation can be made in Figure 15, which shows the impact of the convolution optimization of the cuDNN library used in Figure 6 (this optimization strategy is also available in MKL17, as shown in Figure 14).

Even more evident than the relative compute portions of the different layer types shown in Figures 5, 6, 15 and 14 are the scaling properties of the different layer types. Figure 7 depicts these for DNN training on CPUs (see Figure 14 for results on the new Xeon Phi): all but one layer type show almost perfectly linear scaling. Only the significantly compute-intensive InnerProduct layer scales poorly for batch sizes b < 64, which is equivalent to scaling to only n > 4 nodes for the original batch size B = 256. On the GPU, the crucial InnerProduct layer scales much better than on the CPU, but still fails to reach linear speedup, as we see acceleration factors of only around 32x at 256 nodes. Again, the speedup stalls for batch sizes b ≤ 32.
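The measurements behind Figures 5-8 boil down to timing each layer's computation at decreasing local batch sizes b = B/n and deriving relative compute shares and speedups from those timings. The following generic sketch illustrates that procedure with placeholder layer functions; it is not the Caffe instrumentation used for the paper's figures.

```python
import time
import numpy as np

def profile_layers(layers, make_batch, batch_sizes, repeats=10):
    """Time each named layer callable for every batch size; relative compute
    shares and strong-scaling speedups follow directly from the returned times."""
    times = {name: {} for name, _ in layers}
    for b in batch_sizes:
        x = make_batch(b)
        for name, fn in layers:
            start = time.perf_counter()
            for _ in range(repeats):
                fn(x)
            times[name][b] = (time.perf_counter() - start) / repeats
    return times

# Placeholder "layers": a GEMM-heavy fully-connected layer and a cheap element-wise layer.
W = np.random.rand(4096, 4096).astype(np.float32)
layers = [("InnerProduct", lambda x: x @ W), ("ReLU", lambda x: np.maximum(x, 0))]
batch_sizes = [256, 128, 64, 32, 16, 8]          # b = B/n for n = 1, 2, 4, ...
t = profile_layers(layers, lambda b: np.random.rand(b, 4096).astype(np.float32), batch_sizes)
for b in batch_sizes:
    total = sum(t[name][b] for name, _ in layers)
    print(b, {name: f"{t[name][b] / total:.0%}" for name, _ in layers})  # relative compute share
```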

Fig. 7. Speedup achieved by reducing the batch size, computed from the results in Fig. 5. Top: results for AlexNet. Bottom: results for GoogLeNet.

Fig. 8. Speedup achieved by reducing the batch size, computed from the results in Fig. 6. Top: results for AlexNet. Bottom: results for GoogLeNet.

TABLE III
SIZE AND NUMBER OF THE MATRIX MULTIPLICATIONS (SGEMM) PER FORWARD PASS FOR SELECTED LAYERS.

Layer            # operations    matrix sizes
Fully Connected  1               (b x I)(I x O)
Convolutional    b               (C x I)(I x Z)
Softmax          b               I

Definitions: I: input size from the top layer; O: output size of this layer; b: local batch size (train or validation); C: number of filters; c: number of input channels (RGB image: c = 3); P: patch size (in pixels); k: kernel size; Z: effective size after application of a k x k kernel to a P x P patch.

B. Scaling Fully-Connected Layers

The layer-by-layer analysis revealed the impact of the fully-connected (FC) layers on the overall scalability. FC layers are the conventional neural layers in a deep network architecture, where the actual decision boundaries of a classification problem are modeled. Typically, FC layers hold a large number of neurons which are connected to all inputs. Computationally, FC layers perform a single matrix multiplication per pass (a second, small matrix multiplication for the bias is neglected here). Table III shows the impact of the batch size b on the size, shape and number of matrix multiplications. While b only affects the number of matrix operations for convolutional layers (which can be implemented task-parallel [12]), it directly reshapes the left-hand matrix of the FC sgemm operation in a very unfavorable way. For the typically large I and O (e.g. for a layer in AlexNet we find I = 4096, O = 9192), b = B/n decreases from B = 256 towards 1, producing degenerate (maximally non-square) matrices. This degeneration hurts the Inner Parallelization of the matrix multiplication (see Section I-A1), where the sgemm is either multi-threaded by the MKL BLAS library or parallelized via cuBLAS. Both implementations reach their optimal performance for square matrices and suffer from the degeneration [5]. Hence, speedups gained by the Outer Parallelization (the data-parallel SGD) start to harm the performance of the Inner Parallelization and cause a scalability deadlock. Figure 9 shows the impact of b on the MKL sgemm speedup; it is not surprising that this evaluation exhibits exactly the same speedup characteristic as the overall communication-free scaling experiment in Figure 4.
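The shape effect behind Figure 9 is easy to reproduce in isolation: the sketch below measures the achieved GFLOP/s of the (b x I)(I x O) product for shrinking local batch sizes b, using NumPy's BLAS-backed matmul as a stand-in for the MKL/cuBLAS sgemm calls inside the FC layer; the layer dimensions follow the AlexNet example above.

```python
import time
import numpy as np

def fc_gemm_gflops(b, I=4096, O=9192, repeats=20):
    """Achieved GFLOP/s of the fully-connected forward sgemm (b x I) * (I x O)."""
    X = np.random.rand(b, I).astype(np.float32)   # activations: shrink with b = B/n
    W = np.random.rand(I, O).astype(np.float32)   # weights: shape is independent of b
    X @ W                                         # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        X @ W
    elapsed = (time.perf_counter() - start) / repeats
    return 2.0 * b * I * O / elapsed / 1e9        # 2*b*I*O flops per product

for n in [1, 2, 4, 8, 16, 32]:
    b = 256 // n                                  # local batch size at n workers
    print(f"n={n:3d}  b={b:4d}  {fc_gemm_gflops(b):7.1f} GFLOP/s")
```

On typical BLAS builds the achieved throughput, and hence the possible Inner-Parallelization speedup, drops noticeably as b shrinks, mirroring the stalling curves of Figures 4 and 9.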

Fig. 9. MKL sgemm: impact of the batch size b on the achievable sgemm speedup for matrix multiplications of the shape (b x I)(I x O). These matrix shapes correspond to the sgemms computed in the largest fully-connected layer of AlexNet.

Fig. 10. Full validation accuracy plot for AlexNet with different large batch sizes. Settings: [B = 256, ε = 0.01, iter = 450k], [B = 512, ε = 0.02, iter = 225k], [B = 1024, ε = 0.04, iter = 112k], [B = 2048, ε = 0.08, iter = 56k].

Fig. 11. Scaling properties for increased global batch sizes in the free-communication scenario. Yellow lines show the results for AlexNet, blue lines for GoogLeNet, and red lines indicate perfect linear speedup. NOTE: all speedups are computed with respect to the compute time of the enlarged global batch sizes, not the original batch sizes (B = 256 for AlexNet and B = 32 for GoogLeNet).

C. Increasing the Global Batch Size

A simple way to overcome the stalling scalability beyond 8 nodes (or b < 32) has recently been suggested in [7] and is also utilized in [1]: increasing the global batch size B to the extent that the worker batch size keeps an effective size b ≥ 32 for the Inner Parallelization. Figure 11 shows that this strategy indeed provides almost perfect linear speedup up to 128 nodes for sufficiently enlarged global batch sizes. However, these results have to be taken with strong caution: increasing the global batch size also increases the computational complexity of the problem linearly. Beyond a certain batch size, SGD will not converge significantly faster in terms of the number of iterations. Hence, large batch sizes increase the computation time per iteration while the number of iterations stays constant. In order to reduce the number of iterations till convergence, one would have to increase the step size as well. The authors of [7] argue that larger batch sizes provide more stable gradient information, which should allow larger step sizes. If it were possible to increase the step size at the same rate as the batch size, one would obtain perfect linear scaling. Sadly, this is hardly the case.

Figure 10 shows the accuracy plots for AlexNet, computed till full convergence with differently large global batch sizes. The experiments were performed on a single KNL node to avoid possible interference from a distributed setting (the KNL provides enough memory for such large batch sizes; on common GPUs with 12 GB of memory, the batch size limit is b = 256 for AlexNet and b = 128 for GoogLeNet). The step sizes ε were increased according to the batch size as suggested by [7], while the number of iterations was decreased by the same factor. The results are quite disappointing: while we reach linear speedup as expected, the validation accuracy suffers significantly.

TABLE IV
EFFECT OF CHOOSING LARGER STEP SIZES ε ON THE RESULTING TEST/VALIDATION ACCURACY. OUR EXPERIMENTS SHOW THAT LARGER BATCH SIZES CAN NOT COMPENSATE FOR THE LOSS IN ACCURACY.

batch size    speedup    step size     accuracy
256                      ε = 0.01
512                      ε = 0.02
1024                     ε = 0.04
2048                     ε = 0.08

These experimental results confirm the theoretical analysis by [11], who showed that large batch sizes lead to sharp minima with poorer generalization properties. Considering that early stopping of the original problem, once it reaches the corresponding error rates, yields almost the same speedup as the parallelized large-batch variants, this approach does not appear suitable to solve the scaling problem of the matrix multiplications.
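The large-batch schedule of Figure 10 follows a simple linear rule: scale the step size with the batch size and shrink the iteration budget by the same factor, so the number of epochs (and the total compute) stays roughly constant. A small helper generating such settings might look as follows; the baseline values are the AlexNet defaults from Figure 10, everything else is illustrative.

```python
def large_batch_settings(base_batch=256, base_lr=0.01, base_iters=450_000, factors=(1, 2, 4, 8)):
    """Linear scaling rule used for Figure 10: batch size and step size are
    scaled up by k, the iteration count is scaled down by k."""
    settings = []
    for k in factors:
        settings.append({
            "batch_size": base_batch * k,
            "step_size": base_lr * k,
            "iterations": base_iters // k,
        })
    return settings

for s in large_batch_settings():
    print(s)
# prints one dict per setting: B = 256/512/1024/2048 with step sizes
# 0.01/0.02/0.04/0.08 and 450000/225000/112500/56250 iterations.
```

As Table IV and [11] indicate, this schedule preserves the compute budget but not the final validation accuracy.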

D. Non-Scaling Layers

Figure 8 also shows very poor scalability for some layers like Dropout, Pooling or LRN. This is mostly due to the fact that these layers are computed so fast that the latency of loading the data onto the GPU becomes the dominant constant factor. Overall, these particular layers consume only a marginal portion of the total compute time (0.1% for AlexNet and 0.3% for GoogLeNet, assuming that all other layers are parallelizable). Applying Amdahl's Law, Figure 12 shows that this still affects the scalability in the long run. Again, scalability begins to stall at n > 32.

Fig. 12. Effect of non-scaling layers on the overall scalability according to Amdahl's Law.

IV. PARALLEL TRAINING DATA ACCESS

So far, the analysis in the previous sections has neglected another crucial bottleneck towards scalable distributed DNN training: the distribution of the training data (i.e. the batches) to the worker nodes. We specifically avoided this problem in all prior experiments by holding copies of the entire training set on local SSDs on every worker node. However, this approach not only requires the availability of NVRAM (or other high-speed local storage) at every node, it is also very inefficient to copy hundreds of gigabytes (ImageNet 1000 [17] has 150 GB of training data; larger real-world problems easily exceed many terabytes) to each worker node before the actual training can even start.

A. Network Bandwidth Revisited

Using centralized storage for the training data, such as a database server or a distributed file system, would offer a more convenient and resource-effective solution. Compared to the petabytes of communication caused by the distributed SGD (see Section II), the distribution load of the training data appears to be negligible: e.g. for AlexNet we have 100 epochs (= full passes of the training data) till convergence, resulting in 100 x 150 GB = 15 TB of total data traffic, compared to 250 MB x 2(n-1) per iteration in gradient and update communication (assuming a parameter server with n workers and 450k iterations). But this assumption holds only as long as the SGD communication leaves some bandwidth for the data transfers.

Fig. 13. Influence of the Data layer on compute times. The figure shows that storing the training data on a distributed file system is prone to cause huge performance bottlenecks: compute units idle during the time spent in the data layer (compare these results with the compute times shown in Fig. 6, where the training data was stored locally).

Figure 13 shows the practical consequences of storing the training data on a Lustre distributed file system (storage on SSDs, FDR-Infiniband interconnect) when the network bandwidth is exhausted by the SGD communication.

B. Small Files - High-Speed Random Access

Bandwidth is not the only problem when it comes to the use of parallel file systems, which are the standard storage solution on HPC clusters. There are also latency issues, caused by the structure of the training data used in many deep learning applications: typically, learning samples come as large collections of small files (e.g. images, audio sequences or texts) which need to be accessed in random order during DNN training. Many workers simultaneously polling the file system metadata servers for large numbers of random files easily causes long response times or even the breakdown of the distributed file system.
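To put the two traffic sources side by side, the following sketch evaluates the estimate from Section IV-A for a few node counts; the numbers are the AlexNet figures quoted above (150 GB dataset, 100 epochs, 250 MB model, 450k iterations) under the stated parameter-server assumption.

```python
def training_traffic_tb(dataset_gb=150, epochs=100):
    """Total volume of training data that has to be shipped to the workers."""
    return dataset_gb * epochs / 1000.0

def sgd_traffic_tb(n_workers, model_mb=250, iterations=450_000):
    """Gradient + update volume for a parameter server: 2*(n-1) model-sized
    transfers per iteration (workers send gradients, the server sends updates)."""
    per_iter_gb = model_mb / 1000.0 * 2 * (n_workers - 1)
    return per_iter_gb * iterations / 1000.0

print(f"training data: {training_traffic_tb():.0f} TB total")
for n in (4, 16, 64):
    print(f"SGD traffic at n={n:2d}: {sgd_traffic_tb(n):,.0f} TB")
# training data: 15 TB in total; the SGD traffic already amounts to hundreds of
# TB at n=4 and reaches the petabyte range well before n=64.
```

The data distribution is therefore only "negligible" as long as the SGD traffic has not already saturated the interconnect, which is exactly the situation shown in Figure 13.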

V. CONCLUSIONS

In this paper, we analysed and discussed the major theoretical and practical limits of current approaches towards scalable distributed DNN training. We showed three specific bottlenecks, namely the communication overhead, the parallelization of the matrix operations and the training data distribution, which need to be solved in order to achieve a sustainably scalable solution that allows strong scaling to thousands of nodes. Currently, effective scaling is not possible beyond 16 nodes.

VI. ACKNOWLEDGMENTS

The authors thank the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for generous allocations of computer time.

REFERENCES

[1] IntelCaffe: MPI-based distributed version of the Caffe framework.
[2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint.
[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010. Springer.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems.
[5] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger. Communication-optimal parallel recursive rectangular matrix multiplication. In Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE.
[6] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. CoRR.
[7] F. N. Iandola, K. Ashraf, M. W. Moskewicz, and K. Keutzer. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. CoRR.
[8] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1 MB model size. arXiv preprint.
[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint.
[10] J. Johnson.
[11] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint.
[12] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
[14] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553).
[15] H. Ma, F. Mao, and G. W. Taylor. Theano-MPI: a Theano-based distributed training framework. arXiv preprint.
[16] J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).
[17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3).
[18] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85-117.
[19] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. On parallelizability of stochastic gradient descent for speech DNNs. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint.
[21] R. Spring and A. Shrivastava. Scalable and sustainable deep learning via randomized hashing. arXiv preprint.
[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9.

APPENDIX

Fig. 14. Additional layer-by-layer results for the KNL (see Section III-A for details). Top: proportional compute times by layer type and batch size for AlexNet. Bottom: speedups by layer type and batch size for AlexNet.

Fig. 15. Layer-by-layer analysis for GoogLeNet without cuDNN.
