ENERGY EFFICIENCY ANALYSIS AND OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORKS FOR IMAGE RECOGNITION. Xinbo Chen, B.Eng.

Size: px
Start display at page:

Download "ENERGY EFFICIENCY ANALYSIS AND OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORKS FOR IMAGE RECOGNITION. Xinbo Chen, B.Eng."

Transcription

1 ENERGY EFFICIENCY ANALYSIS AND OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORKS FOR IMAGE RECOGNITION by Xinbo Chen, B.Eng. A thesis submitted to the Graduate Council of Texas State University in partial fulfillment of the requirements for the degree of Master of Science With a Major in Computer Science May 2016 Committee Members: Ziliang Zong, Chair Yijuan Lu Martin Burtscher

2 COPYRIGHT By Xinbo Chen 2016

3 FAIR USE AND AUTHOR S PERMISSION STATEMENT Fair Use This work is protected by the Copyright Laws of the United States (Public Law , section 107). Consistent with fair use as defined in the Copyright Laws, brief quotations from this material are allowed with proper acknowledgment. Use of this material for financial gain without the author s express written permission is not allowed. Duplication Permission As the copyright holder of this work I, Xinbo Chen, authorize duplication of this work, in whole or in part, for educational or scholarly purposes only.

4 DEDICATION Dedicated to my parents and my United States family, whose support and encouragement during my two years graduate studies motivated me to complete this thesis.

5 ACKNOWLEDGEMENTS I am very appreciated Dr. Ziliang Zong for his guidance and support during my graduate study; he is a great advisor not only in teaching and researching, but also in life principles. I would also like to thank Da Li, Dr. Yijuan Lu, and Dr. Martin Burtscher for their expertise, feedback, and overall support in achieving this thesis work. v

6 TABLE OF CONTENTS Page ACKNOWLEDGEMENTS... v LIST OF TABLES... ix LIST OF FIGURES... xi LIST OF ABBREVIATIONS... xii ABSTRACT... xiii CHAPTER 1.INTRODUCTION BACKGROUND RELATED WORK ENERGY CONSUMPTION ANALYSIS OF NEURAL NETWORKS IN INNN DIFFERENT LEARNING FRAMEWORKS Introduction of Training Framework Caffe Torch TensorFlow MXNet Benchmark Setup Data Benchmark Neural Network Benchmark System Setup GPU Mode Results and Discussions vi

7 4.3.1 Native GPU v.s. GPU with cudnn Library Overall Framework Energy Consumption and Performance INNN Comparison CPU Mode Results and Discussions Math Libraries for CPU Caffe - CPU Mode Energy Efficiency Analysis with BLAS Conclusion NETWORK WISE ENERGY CONSUMPTION ANALYSIS Each Layer Energy Occupation Batch Size Factor of Neural Network LAYER WISE ENERGY CONSUMPTION ANALYSIS Convolutional Layer Common Convolutional Layers Data Input Size Kernel Size Kernel Number Conclusion Fully Connected Layer Pooling Layer Input Size Filter Kernel Size Pooling Method Conclusion HARDWARE TUNING DVFS K20m and Titan X GPU Comparison Conclusion CONCLUSION FUTURE WORK vii

8 REFERENCES viii

9 LIST OF TABLES Table Page 1-1: Recent Year CNNs Top 5 Error : Community involvements of four deep learning frameworks as of 03/18/ : Convent Benchmark parameters setting for four deep learning frameworks : Caffe and Torch energy and performance comparison in native GPU :Caffe and Torch 7 performance and energy with OverFeat : Caffe and Torch7 energy efficiency comparison in GPU with cudnn : Caffe and Torch7 actual execution time and energy efficiency comparison in GPU with cudnn : Four framework s overall energy consumption and performance : BLAS library energy efficiency and performance comparison in Caffe : Separate Layer Time and Energy Consumption : AlexNet Energy Consumption with different batch size : Convolution Layer Specification : The performance comparison of six convolutional layers : The GPU energy comparison of six convolutional layers : Kernel and input size parameter of six convolutional layer : Convolutional layer with different data input size : Convolutional layer with different kernel size : Convolutional layer with different kernel number : The variety impact of performance and energy of convolutional layer : The configuration setting of three fully connected layers : The performance of three fully connected layer : The energy consumption of three fully connected layer : Different input size of pooling layer : Different input size of pooling layer performance and energy : Configuration setting of three pooling layers : Different kernel size of pooling layer performance : Different kernel size of pooling layer GPU energy : The pooling method configuration of pooling layers : Pooling layers performance with various pooling method : Pooling layers energy consumption with various pooling method ix

10 7-1: Support memory frequency and core frequency options in K20m GPU : Titan X and K20m hardware comparison [Nvidia, 2012; Nvidia 2014] : Titan X GPU and K20m GPU comparison x

11 LIST OF FIGURES Figure Page 2-1: LeNet: an algorithm to recognize hand write letter : Caffe s power consumption with native GPU : Torch s power consumption with native GPU : Caffe s power consumption in GPU Mode with cudnn : Torch s AlexNet power consumption in GPU Mode with cudnn : Torch benchmark script file source code : Torch s power consumption in GPU Mode with cudnn in 50 Dry-run : The actual execution time and power consumption of Caffe : The actual execution time and power consumption of Torch : TensorFlow s power consumption in GPU Mode with cudnn : MXNet s power consumption in GPU Mode with cudnn : Caffe - CPU mode with Atlas energy consumption : Caffe - CPU mode with OpenBLAS energy consumption : Caffe - CPU mode with MKL energy consumption : Time Occupation percentage of layers : Energy Occupation percentage of layers : Batch size parameter influence of performance : Batch size parameter influence of energy : Convolutional layer GPU energy with varied data input size : One pooling method: max pooling : Performance impact of DVFS in K20m : Total system energy impact of DVFS : GPU Average Power with the impact of DVFS : GPU energy portion with the impact of DVFS : K20m energy curve of Caffe with AlexNet : Titan X energy curve of Caffe with AlexNet xi

12 LIST OF ABBREVIATIONS Abbreviation Description CPU - Central Processing Unit GPU - Graphic Processing Unit GPGPU - General-purpose GPU CUDA - Compute Unified Device Architecture CNN - Convolutional Neural Network AI - Artificial Intelligence FC - Fully Connected Layer cudnn - CUDA Deep Neural Network Library xii

13 ABSTRACT In recent years, convolutional neural network (CNN) has been widely used to improve the training time and accuracy of image recognition applications. These CNNs are based on deep learning algorithms, which simulate the learning process of human s brain. After sufficient number of images being used to train the neural networks, high recognition accuracy of images can be achieved. In the past few years, especially after GPUs are utilized to carry heavy-duty computation, the accuracy and training time of machine learning algorithms have been significantly improved (e.g. the error rate has dropped from 28.2% in 2010 to 6.66% in 2014). The newest neural network developed by Microsoft in 2015 has already surpassed human s recognition ability with the error rate less than 5%. This is an exciting achievement but also indicates that the room for accuracy improvement is narrowing. On the other hand, due to the massive volume of training data sets, the increasing complexity of neural network structure, and the significant amount of computation, the training process consumes more and more time and energy. In this thesis, we conduct a comprehensive study on analyzing the performance and energy impact of a variety of outer and inner factors in popular CNN algorithms, which provides detailed workload characterization to facilitate the design of more energy efficient CNN algorithms and training frameworks. xiii

14 1. INTRODUCTION In 1958, Frank Rosenblatt proposed the Perceptron concepts and a theory on how do neurons in human brains operate [Rosenblatt, 1958]. This theory created a new field of artificial intelligence, called neural network. In 1989, Yann LeCun et al. applied neural network algorithms to recognize handwritten ZIP codes with the training time to be approximately 3 days. After that, numerous interesting practices such as the wake-sleep algorithm [Hinton, 1995] and vanishing gradient problem [Hochreiter, 1998] appeared. However, the slow speed of training continues to be a key factor to impede the advancement of neural network s algorithms and applications. This situation started to change after 2010 with the high speed GPGPU s being utilized to speedup the training time. For example, a single K20 GPU can achieve 1.17 trillion floating-point operations per second [NVIDIA, 2013] and the new Maxwell GPU Titan X is even faster, which makes it feasible to train complex neural networks with large data sets in reasonable time. In the past few years, we have witnessed the rapid growth of deep learning algorithms (i.e. Learning algorithms with more than one stage of non-linear feature transformation), which have been successfully used to solve challenging problems such as image recognition, speech recognition, and natural language processing. This thesis focus on the image recognition field. To evaluate a variety of deep learning algorithms for object detection and image classification at large scale, the ImageNet Large Scale Visual Recognition Challenge 1

15 (ILSVRC) has been held annually since The imagenet benchmark with 15 million manual-labeled high-resolution images belonging to roughly 22,000 categories is used to train and evaluate the accuracy of state of art neural network algorithms [Russakovsky, 2015]. This challenge uses the percent of top 5 error rate (i.e. If correct class is not int the top 5 classification recommended by the algorithm, it is counted as an error) as the evaluation metric for accuracy. Table 1-1 shows the error rate of winner algorithms in the past six years, which clearly demonstrates the significant improvement in accuracy [Russakovsky, 2015]. Table 1-1: Recent Year CNNs Top 5 Error Year Percent of Top 5 Error (Microsoft) (GoogLeNet) (OverFeat) (AlexNet) After Microsoft s algorithm already surpassed human recognition ability (~5%) [He, 2015], neural network research has achieved the goal of matching or beating human s capability on image recognition in terms of accuracy. However, human consumes extremely less amount of energy to recognize a picture. One new direction of neural 2

16 network research is to design more energy-efficient deep learning algorithms without compromising accuracy. This is a challenging task because there are too many factors (e.g. hardware configurations, different learning framework, external matrix calculation libraries, neural network structure, configuration setting of each layer inside a network, etc) that can affect the training time, the energy consumption and the accuracy. All such factors have unknown contribution to the total energy consumption. To the best of my knowledge, there is no existing study that comprehensively investigates the energy behavior of neural network, especially CNN based deep learning algorithms. The major contributions of this thesis are summarized below: 1) It describes a strategy to analyze neural network energy consumption from the high level to the detailed level. 2) It presents a comparative study of performance and energy efficiency of four popular neural network training frameworks and finds the most energy efficient framework. 3) It studies the impact of a variety of external math libraries (for both GPUs and CPUs) on the performance and energy consumption of deep learning algorithms. 4) It presents the decomposed energy consumption of each layer in AlexNet and studies how do different factors affect energy behaviors of each layer. 5) It studies the impact of DVFS on the energy consumption of deep learning algorithms when running them on K20 and Titan X GPUs. 3

17 This thesis is organized as follows. In Chapter 4, I discuss the energy efficiency of several popular deep learning frameworks. Since neural network training mostly in GPU, I mainly focus on GPU mode of those frameworks. In Chapter 5, I analyze the network wise energy consumption of neural networks. Particularly, I show the energy percentage of each layer consumes during the total network training time. In Chapter 6, major energy consuming layers are broken down from neural network and analyzed separately. In Chapter 7, I use GPU tuning technology to optimize the energy consumption of neural network training and also compare different hardware. 4

18 2. BACKGROUND In this Chapter, I present the background information and terminologies that help to understand how do deep learning algorithms work and the experimental results shown in later chapters. The convolutional neural network simulates the way human process and recognize images. It consists of multiple layers. The classic CNNs have three kinds of layers: the convolutional layer, the pooling layer and the fully connected layer. For example, Figure 2-1 shows a simple CNN include all these three layers [LeCun, 1998]. Each layer contains thousands or millions of neurons. A single neuron takes some input, computes the weighted sum of inputs, and sends output to the neurons in the next layer. Figure 2-1: LeNet: an algorithm to recognize hand write letter The following example briefly illustrates how convolution neural network works. CNNs take images as input, for gray image, the channel is 1, for RGB image, the channel is 3. Most current CNNs accept fixed size RGB image as input. If there is a 5 x 5 size gray (channel = 1) image pixel matrix and a 3 x 3 size kernel (or filter). We assume this kernel is used to capture border feature of this image. 5

19 With the setting of stride = 1, and bias = 1, kernel does inner production with picture matrix, and move one stride, continuing inner production to traverse the whole image matrix, the result is with ( ) width and (5-3 +1) height. This process called convolution and the result called feature map is the following matrix. The kernel is used to detect a specified feature from an image. Each inner production needs to add a bias, so the final feature map matrix is If we want to capture as many features as possible from this image, we need more different kernels. If there are 100 kernels, then we have 100 feature maps as output after this convolutional layer. We can use the following two formula to measure the workload of one convolutional layer, (1) Parameter = (Kernel width x Kernel height + 1) x feature map quantity x channel

20 (2) Connection = Parameter x Feature map width x Feature map height Based on the formula, the parameter of this convolutional layer is (3 x 3 +1) x 100 x 1= The connection quantity is 1000 x 3 x 3 = Typically, a threshold function (or activation function) will follow a convolution layer. For example, in AlexNet, the activation function is called a Rectified Linear Unit, or ReLu layer. Its formula is f(x) = max (0, x). The x is the input of a neuron. If x >0, the neuron will be activated and pass value to the next layer (not activated otherwise). The pooling layers, or down sampling layer, are usually placed between two convolutional layers. It partitions the input feature map into a set of non-overlapping rectangles (the stride size is equal to the filter size). There are mainly two methods of pooling: max pooling and average pooling which will be explained in section 6.3. The pooling layer progressively reduces the amount of parameters as well as controls overfitting. Overfitting means that the training model has higher accuracy in recognizing images of pre-trained data sets, but has lower accuracy in recognizing unseen images. This is usually caused by data noises or limited training data sets. Finally, after several convolutional and pooling layers, the CNN algorithms finalize the results via fully connected layers. Neurons in a fully connected layer have full connections to all neurons in the previous layer. The output can hence be computed with an one-dimensional matrix multiplication followed by a bias offset. Each value of this 7

21 one-dimensional matrix is the classification probability of this image. Forward propagation refers to the learning process from convolution layers to fully connected layers. Classifying an image only needs forward propagation. Backward propagation is needed to adjust deviation when the training algorithms leverage the temporary training results to verify whether or not the temporary recognition results have a big offset. 8

22 3. RELATED WORK Over the last decade, graphics processing units (GPUs) have been widely used in many fields for application accelerations (e.g. Bioinformatics [Truong, 2014], Graph [Li, 2013], Nested Parallelism [Wu, 2016], Irregular Loops [Li, 2015]). Nowadays, the peak double-precision performance of high-end GPUs from Nvidia is well above 1 teraflops [NVIDIA, 2013]. With the emergence of high speed GPUs and the availability of large data sets for training, we have witnessed the significant improvement of deep learning algorithms in terms of training time and accuracy and the boom of deep learning applications. The Visual Geometry Group (VGG) of University of Oxford has designed a 16 layers model with 7.4% top 5 error and a 19-layer model with 7.3% top 5 error rate [Simonyan, 2014]. In 2015, Microsoft implemented a model with 152 layers - 8x deeper then VGG nets with 3.57 % error [Zhang, 2015]. The fast development not only represents in the structure of neural network, but also reflects in wider scope. For example, The Deep Face deployed at Facebook for Auto tagging [Taigman, 2014]; Estimation of person pose [Tompson, 2014]; Generating a descriptive sentence [Lebret, 2015] etc. In spite of the advancement in developing deeper and more complicated neural network structures, the research on investigating the energy consumption behavior of different neural network and training framework is still in its infancy. To the best of my knowledge, there are no comprehensive studies on analyzing the energy-aware deep learning algorithms. With the size of data sets increase exponentially and the ever 9

23 increased energy consumption for training such data sets, it is desirable to design more energy efficient deep learning algorithms. This thesis contributes to this field by presenting the energy behaviors of numerous well-known deep learning algorithms and exploring the impact of various factors on the energy consumption of these algorithms. Although other accelerators like Xeon Phi and FPGA can provide high throughput and are widely used in high performance computing domains including tree/graph traversal [Li, 2014], information fusion [Song, 2011] and deep learning [Lacey, 2016], this thesis focuses on quantifying and studying power and energy behaviors on GPU platforms. 10

24 4. ENERGY CONSUMPTION ANALYSIS OF NEURAL NETWORKS IN DIFFERENT LEARNING FRAMEWORKS In this chapter, I present a comparative study of four popular deep learning frameworks, namely Caffe, Torch, TensorFlow, and MXNet, in terms of performance and energy efficiency. I evaluate these two aspects for both CPU and GPU settings. 4.1 Introduction of Training Framework In this section, I first briefly introduce each framework and their features. Table 4-1 shows the number of users in Google groups and the number of open source contributors for each framework in their Github repositories. It is clear that these four frameworks are widely used and supported by the deep learning community. Table 4-1: Community involvements of four deep learning frameworks as of 03/18/2016 Measures Caffe Torch7 TensorFlow MXNet Number of N/A members in Google groups Number of contributors in Github Caffe Caffe, an abbreviation of Convolutional Architecture for Fast Feature Embedding, is a well-known and widely used open source framework that was originally designed by the U.C. Berkeley Vision and Learning Center (BVLC) then co-designed by community contributors. Caffe is written in C++, and supports CUDA for GPU computation. It is designed with expressive architecture and supports open source math libraries to ensure 11

25 high performance. In addition, Caffe emphasizes usability by allowing users to configure each layer of a neural network easily without complicated coding [Jia, 2014] Torch Torch is a computational framework written in Lua, which is a multi-paradigm scripting language. Compared with C, Lua is more readable and easy to learn. In addition, rich interfaces to C keeps Lua s high performance for large scale applications. Torch currently is used by large tech companies such as Google DeepMind and Facebook, which devote in-house teams to customize their deep learning platforms. In our experiments, we used the newest version Torch 7 to train neural networks TensorFlow TensorFlow is another open source framework which derives from the Google Brain project. It is an interface for expressing machine learning algorithms, and an implementation for executing neural network training algorithms [Abadi, 2016]. TensorFlow is famous for its flexibility to express a wide variety of algorithms, including computer vision, robotics, and natural language processing etc. It also has strong portability to run on desktop, server, or even mobile computing platforms MXNet MXNet is a lightweight, portable and flexible mobile deep learning framework. It supports Python, R, Julia, Go, and JavaScript, which is the framework that supports most programming languages [Chen, 2015]. 12

26 4.2 Benchmark Setup There are two types of benchmark in this thesis. The first type of benchmark is the data benchmark, which includes the training data with training setting such as iteration quantity. The second type of benchmark is neural network algorithms. Due to numerous memory occupation during training, the K20 GPU can only fully support AlexNet [Krizhevsky, 2012], and partly support OverFeat [Sermanet, 2013] because some frameworks require more memory to run Data Benchmark The benchmark I select comes from an open source project called Convent Benchmark, which supports most publicly accessible implementation of CNNs under the same data sets entry [Convent-Benchmark, 2015]. In this benchmark, the author, Soumith Chintala, a member of Facebook AI research team, picked popular ImageNet models and clock the time for full forward and backward pass in his machine. However, this benchmark does not measure the energy consumption of each learning framework. I provide valuable extension to this benchmark by comparing the energy consumption of the aforementioned four frameworks. The benchmark parameters are shown in Table

27 Table 4-2: Convent Benchmark parameters setting for four deep learning frameworks Convent Benchmark Parameter Data Set Training Iteration Random generated data 10 times The data sets in Convent Benchmark is randomly generated and able to response to a variety of input size at different layers. The data will be fixed and reused during each iteration's training. I set the training iteration number as 10 to ensure sufficient number of power samples can be collected for energy consumption calculation Neural Network Benchmark I select AlexNet and OverFeat as algorithm benchmarks. Both of them have 128 batch size for each iteration s input. Other configurations are the same as the published configurations in BVLC s model zoo [Model Zoo, 2015] System Setup All experiments are performed on a single machine running on CentOS 7 with Intel Xeon E GHz; Nvidia Tesla K20m with 5GB memory; 32GB DDR3 main memory; and 128 GB SSD hard drive. The drivers and libraries used in our experiments include CUDA 7.0, cudnn v3, OPENBLAS , Caffe (commit ID be163be), Torch 7 (commit ID eb8d7f2), and TensorFlow (commit ID fd464ca), MXNet (commit ID d25053). For the power measurement, the CPU and DRAM power data are collected via the Intel Running Average Power Limit (RAPL) interface [Intel, 2012] and the GPU power is obtained via the Nvidia s System Management Interface. 14

28 4.3 GPU Mode Results and Discussions Native GPU v.s. GPU with cudnn Library In this subsection, I define the GPU without cudnn library as native GPU. The NVIDIA CUDA Deep Neural Network Library (cudnn) is a GPU-accelerated library of primitives for deep neural networks. The cudnn allows deep learning developers and researchers to focus on designing and training neural network models rather than spending their time to tune the low-level hardware performance counters for the best performance [cudnn, 2013]. Since TensorFlow and MXNet are already embeded with cudnn when released, Caffe and Torch7 are used to evaluate the impact of cudnn on performance and energy consumption. Figure 4-1: Caffe s power consumption with native GPU 15

29 Figure 4-1 shows the energy curve of Caffe during benchmark training. As shown in this figure, at the beginning the GPU power (red line) stays idle (< 50W) for around one second. Then after one fluctuation s initialization, ten iterations of similar fluctuations are observed, which represent the ten training iterations. When GPU is idle, CPU is running the benchmark to load and locate the training data, read neural network configuration file, etc. When GPU is doing initialization, CPU goes back to the idle state, followed by CPU s power burst to over 60Wto transfer the data GPU requested for training. Figure 4-2: Torch s power consumption with native GPU As we can see from figure 4-2, GPU stays in idle over three seconds, which means there are three seconds of performance loss and energy waste for the CPU waiting for GPU finishing its work. During running, CPU has two power spikes. The first one appears 16

30 right after GPU initialization and the second one happens after Torch starts doing backward calculation. Table 4-3: Caffe and Torch energy and performance comparison in native GPU Time (s) CPU GPU CPU GPU average average energy (J) energy Power Power (J) (W) (W) Native GPU GPU and CPU energy (J) Caffe Torch Table 4-3 shows the performance and energy consumption of two frameworks on AlexNet with native GPU. Under these restrictions, Caffe results in the best performance and the lowest energy consumption. Note that in Figure 4-2 of Torch, the power of GPU keeps idle for 3.5 seconds, then GPU instantly goes up to over 120 W. The similar initialization (GPU stays idle) in Caffe only takes 0.9 second. If consider without initialization time, Torch and Caffe have very similar performance and energy consmpution. To compare whether caffe is always faster than Torch in native GPU, I also executed the same benchmark with the OverFeat neural network. 17

31 Native GPU Table 4-4:Caffe and Torch 7 performance and energy with OverFeat Time (s) CPU GPU CPU GPU GPU and average average energy (J) energy CPU Power Power (J) energy (W) (W) (J) Caffe Torch Table 4-4 shows that Caffe saves 9% of time and 9.1% of energy when training the OverFeat algorithm than using Torch. Next, I evaluate how can cudnn accelerate these two frameworks. Figure 4-3 demonstrates several interesting observations: 1) the total training time drops from 9.64 second to 6.61 second; 2) the average GPU power is much higher than native GPU. Figure 4-3: Caffe s power consumption in GPU Mode with cudnn 18

32 Since cudnn has nothing to do with CPU acceleration, and its optimization does not affect Caffe s architecture, the initialization and CPU power curve doesn t change. However, this three seconds time saving saves energy for both GPU and CPU. Table 4-5: Caffe and Torch7 energy efficiency comparison in GPU with cudnn Time (s) CPU GPU CPU GPU average average energy (J) energy(j) Power Power(W) (W) GPU with cudnn GPU and CPU energy(j) Caffe Torch In addition, there are huge difference in Torch diagram. From Figure 4-4, we can observe Figure 4-4: Torch s AlexNet power consumption in GPU Mode with cudnn a long straight power line (at 85 W) that lasts over ten seconds after GPU s idle state and 19

33 before the training starts. On the contrary, based on the execution log of Torch, it finishes its training process in 2.58 second, which is the same duration for forward and backward propagation in Figure 4-4 (From the highest point to the end of the red line). However, the time recorded by our power meter reflects the true execution time of the entire benchmark script, which means that Torch starts its own timer right after the first forward propagation and ignored the time before training. To figure out this conflict, I check the source code of Torch s benchmark. The crucial portion is marked in Figure print('running on device: '.. cutorch.getdeviceproperties(cutorch.getdevice ()).name) 24 print('cudnn version: '.. cudnn.version) steps = nb of steps in loop to average perf 27 ndryruns = function makeinput(config, size) 30 local layout = config[4] 31 local osize 32 if layout == 'BDHW' then 33 osize = size 34 elseif layout == 'DHWB' then 35 osize = {size[2],size[3],size[4],size[1]} 36 elseif layout == 'BHWD' then 37 osize = {size[1], size[3], size[4], size[2]} 38 end 39 return torch.randn(torch.longstorage(osize)) 40 end for i=1,#nets do 43 for j=1,#libs do Figure 4-5: Torch benchmark script file source code 20

34 44 collectgarbage() 45 local model,model_name,size = nets[i](libs[j]) 46 model=model:cuda() 47 local input = makeinput(libs[j],size):cuda() 48 local lib_name = libs[j][5] 49 print('modeltype: '.. model_name, 'Kernels: '.. lib_name, 50 'Input shape: '.. input:size(1).. 'x'.. input:size(2).. 51 'x'.. input:size(3).. 'x'.. input:size(4)) dry-run for i=1,ndryruns do model:zerogradparameters() local output = model:updateoutput(input) local gradinput = model:updategradinput(input, output) model:accgradparameters(input, output) cutorch.synchronize() collectgarbage() end local tmf, tmbi, tmbg 63 sys.tic() 64 for t = 1,steps do 65 output = model:updateoutput(input) 66 end 67 cutorch.synchronize() 68 tmf = sys.toc()/steps 69 print(string.format("%-30s %25s %10.2f", lib_name, ':updateoutput():', tmf*1000)) collectgarbage() 72 sys.tic() 73 for t = 1,steps do 74 model:updategradinput(input, output) 75 end 76 cutorch.synchronize() 77 tmbi = sys.toc()/steps 78 print(string.format("%-30s %25s %10.2f", lib_name, ':updategradinput() :', tmbi*1000)) collectgarbage() 81 sys.tic() Figure 4-5: Continued 21

35 82 local ok = 1 83 for t = 1,steps do 84 ok = pcall(function() model:accgradparameters(input, output) end) 85 end Figure 4-5: Continued In Figure 4-5, line 52 to line 60 is the Dry-run code, which is used to prepare for training. To verify whether the straight line is related to the Dry-run code, I first set up independent timer to measure duration of each portion of code. Initialization takes 4.2 second, Dry-run portion takes near 12 seconds, forward and backward propagation takes 2.58 second, which altogether represent the energy curve in Figure 4-4. Additionally, I change the value of ndryruns from 1 to 50 and retrain again, the result shows in Figure 4-5. The initialization time in Figure 4-5 stays the same (4.2 second), the Dry-run portion takes 25.1 second, and the forward and backward propagation remains the same (2.58 second). Therefore, its clear that the straight line is related to the Dry-run code. Lua is known as one of the fastest scripting languages and very popular in game industry, which requires low latency [Lua Docuement, 2016]. This delay phenomenon in Torch should not relate to which programming language the framework choose. 22

36 Figure 4-6: Torch s power consumption in GPU Mode with cudnn in 50 Dry-run Then I decide to comment the whole Dry-run code and execute again. We surprisingly find that the forward and backward propagation time increases from 2.58 second to second, while Dry Run time becomes 0 second. It means the over ten seconds straight line still exists. Therefore the Dry-run is not meaningless and its the true initialization for forward and backward propagations. If such initialization time is included in the actual training time, Torch will be much slower and less energy efficient than Caffe. Note that Caffe also has near one second initialization time on GPU, I would like to see how much energy and time are actually used to train the neural network. I cut off both 23

37 Caffe and Torch s initialization time. The results are shown in Figure 4-7 and Figure 4-8. Figure 4-7: The actual execution time and power consumption of Caffe Figure 4-8: The actual execution time and power consumption of Torch 24

38 Table 4-6: Caffe and Torch7 actual execution time and energy efficiency comparison in GPU with cudnn GPU Time (s) CPU average Power (W) GPU average Power(W) CPU energy (J) GPU energy(j) GPU and CPU energy(j) Caffe-cudnn Caffe-native Torch7-cudnn Torch7-native Table 4-6 shows the energy consumption and performance caused by the training code. To analyze how cudnn promote the training process and ensure the results are comparable, I also isolate the actual execution of Caffe and Torch with native GPU. With the help of cudnn, Caffe trains the same data sets with 1.58 time faster and consumes 1.41 time less energy. Torch achieves 2.85 times of speedup and saves 2.55 times of energy Overall Framework Energy Consumption and Performance Comparison In this experiment, I only evaluate the impact of cudnn on the TensorFlow and the MXNet (results are shown in Figure 4-9, Figure 4-10 and Table 4 7). 25

39 . Figure 4-9: TensorFlow s power consumption in GPU Mode with cudnn Figure 4-10: MXNet s power consumption in GPU Mode with cudnn 26

40 Table 4-7: Four framework s overall energy consumption and performance GPU Time (s) CPU average Power (W) Caffe GPU average Power(W) CPU energy (J) GPU energy(j) GPU and CPU energy(j) Torch TensorFlow MXNet Table 4-7 shows the total energy consumption and performance of four deep learning frameworks, which include their initialization and actual training process. Overall, Caffe has the best performance and energy efficiency among these four benchmarks. 4.4 CPU Mode Results and Discussions Math Libraries for CPU The Basic Linear Algebra Subprograms, or BLAS libraries, are used to support CPU performing linear algebra operations such as matrix operations. Since Caffe provides flexible interfaces to incorporate different external libraries, I choose Caffe to compare three popular BLAS libraries: Atlas, OpenBLAS, and MKL. Atlas is the abbreviation of Automatically Tuned Linear Algebra Software, which provides C and Fortran interfaces to a portable and efficient BLAS implementation. OpenBLAS is an open source project of BLAS with many optimizations for specific processor type. MKL, or Math Kernel Library, is the commercialized BLAS of Intel Corporation, which has been specially optimized for Intel CPUs. All of these three BLAS libraries support multiple threads. 27

41 4.4.2 Caffe - CPU Mode Energy Efficiency Analysis with BLAS It s predictable that CPU will train neural network much slower than GPU. Therefore, I only select AlexNet to reduce the time of generating experimental results. Figure 4-11: Caffe - CPU mode with Atlas energy consumption 28

42 Figure 4-12: Caffe - CPU mode with OpenBLAS energy consumption Figure 4-13: Caffe - CPU mode with MKL energy consumption 29

43 Table 4-8: BLAS library energy efficiency and performance comparison in Caffe Caffe -CPU Time (s) CPU average Power (W) CPU energy (J) Atlas OpenBLAS MKL From Figure 4-11, 4-12, 4-13, and Table 4-8, we can observe that MKL helps CPU running in a high average power with less fluctuation. Fluctuated utilization of hardware is normal in deep learning training process because of its iterate characteristics. From the Atlas and OpenBLAS energy figures we can see that after each iteration s training, power drops to the same or similar level of the idle state, which is the first point drawing in the blue curve. This means after every training iteration, the CPU takes a break and needs a short amount of time to warm up again before training next iteration. However, MKL keeps CPU in high power after one iteration, which helps CPU better utilize its resources during training time. Compared to Atlas, MKL achieves 2.28 times of speedup and saves1.3 times of energy. 4.5 Conclusion In this chapter, I compare the performance and energy behavior of four learning frameworks in the GPU mode, and then discuss the impact of three math libraries on performance and energy in the CPU mode. For the overall training process, Caffe is the best choice because of its good training performance and short initialization time. For 30

44 only training portion, Torch 7 is the fastest one. Since the initialization step is unavoidable for each framework, so Caffe is the best choice both in performance and energy saving. In the CPU mode, MKL shows much better acceleration of Intel CPUs than other two libraries. Therefore, in the experiments shown in Chapters 5 and 6, I chose Caffe to train the neural network with the cudnn and MKL acceleration libraries enabled for GPU and CPU respectively. 31

45 5. NETWORK WISE ENERGY CONSUMPTION ANALYSIS In Chapter 4, I investigate the energy efficiency of different training framework and conclude that Caffe shows good energy efficiency and performance cross two neural networks. This Chapter focuses on studying the energy efficiency of the neural network itself. Since Caffe is both user friendly and highly efficient, I choose Caffe in this Chapter as the training framework to further explore the impact of neural network inner structure on performance and energy efficiency. Because of the time consuming of disassembling a neural network, I only disassemble AlexNet in this Chapter. I analyze the energy consumption distribution of all major layers and study the impact of batch size on energy consumption. 5.1 Each Layer Energy Occupation To split an integrated neural network into separate layers, the key point is acknowledgment of each layers input and output data size. For example, the previous layers output size should perfectly match the next layer s input size. It is difficult to disassemble the layers of a neural network. Fortunately, AlexNet has relatively simpler structure and I select it to perform the studies of this Chapter. The primary difference of an integrated neural network and separate layers collections are data. For neural network, I need to input data in one time, then data will transmit between layers and generate output. However, separate layers need execute respectively with unrelated data. Since different data will not affect the energy consumption of each layer, 32

46 only data input size matters. As a result, I do not need to use actual data calculated by previous layer as the next layer s input data. Rather, I can use benchmark data with specified input size each time when I target only on one layer. If this break-down approach succeeds, the total breakdown time and accumulated energy consumption of each layer should be equal or similar to the time and energy consumption of the integrated neural network. Since some layers run faster, in order to collect sufficient number of power samples for accurate energy calculation, I use benchmark data with 100 iterations to train each layer as well as the whole AlexNet. 33

47 Table 5-1: Separate Layer Time and Energy Consumption Layer Time (Second) Energy Consumption (Joules) Conv Conv Conv Conv Conv conv1-relu conv2-relu conv3-relu conv4-relu conv5-relu pooling pooling pooling Fully connected Fully connected Fully connected Accumulations Integrated AlexNet From the last two rows in Table 5-1, the accumulation of each layers time is seconds, and the total energy is Joules. These two data are very close to the whole AlexNet training time and energy consumption, which means our break-down approach is acceptable. To analyze which layer consumes more energy, I present each 34

48 layer s percentage of total energy in Figure 5-1 and Figure 5-2. Figure 5-1: Time Occupation percentage of layers Figure 5-2: Energy Occupation percentage of layers Figures 5-1 and 5-2 demonstrate that convolutional layer uses 85% of total time and 87% 35

49 of the total energy. The fully connected layer accounts for 11% of total time and 10% of total energy. The pooling layer is responsible for 3% of time and 2% of energy consumption. Based on the break-down approach of AlexNet and the energy distribution of each layer, researchers will be able to get more knowledge about which layer should put more concentration in order to reduce the overall energy consumption of neural networks. 5.2 Batch Size Factor of Neural Network Batch size is an important setting of neural network but unrelated to neural network structure. A bigger batch size indicates that the data can better represent the full data sets' feature. However, loading in a bigger batch requires more GPU memory thereby may consume more energy. Table 5-2: AlexNet Energy Consumption with different batch size Batch Size Time (s) CPU average Power (W) GPU average Power(W) CPU energy (J) GPU energy(j) GPU and CPU energy(j) In this section, I evaluate integrated AlexNet and OverFeat with batch size ranging from 16 to 128 to in terms of training time and energy consumption. Table 5-2 shows that the training time increases proportionally with batch size. In 36

50 addition, I find that the average power of both CPU and GPU goes up as batch size increases. Figures 5-3 and 5-4 reveal the linear growth of training time and energy consumption with batch size. Figure 5-3: Batch size parameter influence of performance Figure 5-4: Batch size parameter influence of energy 37

51 In addition, I use the same strategy to analyze the batch size effect on OverFeat, Table 5-3 shows that OverFeat s performance and energy consumption both grow linearly with the batch size. Table 5-3:OverFeat Energy Consumption with different batch size Batch Size Time (s) CPU average Power (W) GPU average Power(W) CPU energy (J) GPU energy(j) GPU and CPU energy(j)

52 6. LAYER WISE ENERGY CONSUMPTION ANALYSIS In this chapter, I focus on analyzing power and energy behaviors in each layer. Based on our analysis in previous sections, there are main types of layers to a neural network: 1) Convolutional Layer, 2) Fully-Connected Layer; and 3) Pooling Layer. I use these three layers to analyze layer wise power / energy consumption. Since I already conclude that GPU is more energy-efficient than CPU, I only concentrate on GPU energy analysis in this chapter. 6.1 Convolutional Layer Common Convolutional Layers In our AlexNet analysis, I find that convolutional layer consume over 85% of total energy in training. This is important to understand which factors contribute to the energy consumption most for convolutional layers. First of all, I conduct the measurement on six most commonly seen convolutional layers adapted from [Convent Benchmark, 2015]. These layers are frequently used in these championship neural network of ILSVRC in recent years. The following table shows the configuration of six layers. 39

53 Convolution Layer # Table 6-1: Convolution Layer Specification Input Size Batch Size Feature Map Kernel Size (Channel->Kernel number) Stride L1 128 x > x 11 1 x 1 L2 64 x > x 9 1 x 1 L3 32 x > x 9 1 x 1 L4 16 x > x 7 1 x 1 L5 13 x > x 3 1 x 1 L6 27 x > x 5 1 x 1 In this experiment, I run each layer with 100 iterations with the same benchmark settings to collect results. Table 6-2: The performance comparison of six convolutional layers Convolution Layer # Performance (Second) L L L L L L

54 Table 6-3: The GPU energy comparison of six convolutional layers Convolution Layer # GPU Average Power(Watt) GPU energy (Joules) L L L L L L Based on Table 6-2, 6-3, I can conclude that L1-L6 lead to similar average GPU power regardless the various kernel sizes and input sizes. Compare the layer with the largest power (L2, 12q.78 Watt) with the layer with the smallest power (L1, Watt), L2 only leads to 9% more power. Since each layer has more or less similar average power, the differences in total energy consumption are determined by the processing time. It is intuitive that layers with larger kernels and larger inputs would requires longer processing time, thus would consume more energy. Our initialized thinking is that the total number of kernels parameters and total size of input data determine the processing time and energy. For convolutional layer, numbers of kernels parameters are calculated using the following formula: Kernel parameter = (kernel width x kernel height + 1) x channel x output feature map number Total size of input data is quantified using following rule: Total size of input data = input width x input height x batch size x output feature map 41

55 number Based on above formulas, I calculate numbers in Table 6-4. Table 6-4: Kernel and input size parameter of six convolutional layer Convolution Layer # Kernel Parameter Input Size Parameter L L L L L L At the first glance, it seems that there is not any obvious pattern for the energy consumption for each layer if we only consider single factor between number of kernel parameter and the size of input data. L2 consumes the most energy, but its total number of kernel parameters is smaller than L3 and L5. But if we take both factors into consideration, we can easily see that why L2 consumes more power. Although L2 s number of kernel parameters is roughly half of those in L3 and L5, its input has roughly 4 times compared to L3 and L5. So the total energy consumption of L2 is more than L3 and L5. From Table 6-4, L1 has the smallest kernel parameter, but L1 isn t the fastest one. Also L1 has similar performance with L3, but L3 s kernel parameter is 38 times of L1 s parameter. So it s clear the kernel parameter can t determine a convolutional layer s performance and energy consumption. 42

56 Although all six layers with the same batch size and stride, with the integrated influence of input size, kernel number, and kernel size, it s hard to find obvious principles to explain how one layer has better performance and energy saving than a another one. Because the data input size, kernel size, and kernel quantity, all play an important role in determining workload of each layer, they all affect performance and energy consumption on GPU. Thus, it s hard to figure out any patterns based on these commonly used convolutional layers unless I conduct a serial of controlled experiments. So in the following subsections, I only change one factor while fix all the other factors, which decouple these factors and is much easier to find the rules of energy consumption Data Input Size In this subsection, I keep the kernel size and kernel quantity, but vary the size of input feature map. To best model power behavior of real-world neural network, I carefully select the layer configurations and input data size. The kernel size I use is 3 x 3, which is the most commonly seen kernel size among all ImageNet winner models; The number of input feature map is 96 and the number of output feature map is also set to

57 Table 6-5: Convolutional layer with different data input size Input Size Performance (second) GPU Average Power(Watt) GPU energy (Joules) 13 x x x x x x 224 GPU out of memory In the current setting, the input data size is the only changing factor. From Table 6-5, we can see that when the input data size is small, increasing input data size will increase both the GPU average power and total energy consumption. However, when the input data size is bigger than 56 x 56, the power will be stable while the energy consumption continues to increase. Figure 6-1 shows how the energy changes with input data size. The x-axis is the input size and the y-axis is the energy. I can conclude that with the data input size increased, the energy of training growing 44

58 linearly Kernel Size Figure 6-1: Convolutional layer GPU energy with varied data input size In this section, I want to analyze the influence of kernel size on energy. I configure the input data in the size of 64 x 64. The number of input feature map is set to 128 and the number of output feature map is also set to 128. So the total number of kernel for each covolutional layer is 128 x

59 Table 6-6: Convolutional layer with different kernel size Kernel Size Performance (second) GPU Average Power(Watt) GPU energy (Joules) 3 x x x x x Kernel size is extremely important to reduce the quantity of parameters. The larger kernel size means one nerve cell can learn more features in bigger region of pictures. But too big kernel size will affect neural network s accuracy. From the Table 6-6 we can clearly see that for small kernels, increasing kernel size will lead to increase of both power and energy consumption. However, once the kernel size is relative big (e.g. 7x7), increasing kernel size will not affect power while the energy consumption always increases Kernel Number In the last experiment, I want to see the effect on energy when total number of kernels in a convolutional layer is changing. To achieve this goal, I keep the input feature map size to 64x64, and use kernels in the size of 3x3. 46

60 Kernel Number (input -> output) Table 6-7: Convolutional layer with different kernel number Performance (second) GPU Average Power(Watt) GPU energy (Joules) 3 -> 96 (288) > 128(8192) > 128(16384) > 384(49152) >384(147456) From Table 6-7, we can see that the performance and energy consumption of one convolutional layer is strongly related to the multiplication of input and output kernel number, except of the 3->96 layer, start from 64->128 to 384->384 these four layers, the training time consuming increase in the same speed of the multiplication of kernel number. For example, 128 -> 128 is two times of 64->128, the time also increased near two times. This phenomenon also reflects in GPU energy increment Conclusion From the previous three sections analysis of convolution layer s performance and energy consumption in input size, kernel size, and kernel number three aspects. I summarize the impact of each factor of one convolution in Table 6-8 based on the gap between minimum and maximum performance and energy consumption with three factors increasing. 47

61 Table 6-8: The variety impact of performance and energy of convolutional layer Time Increment Energy Increment Input Size increase 297 x 100 x 121 x Kernel Size increase 13 x 100 x 10 x Kernel number increase 512 x 132 x 142 x From Table 6-8, the setting of these three factors has different influence of performance and energy increment. First of all, all three factors increasing will result in both performance and energy go up. Secondly, input size and kernel number has relatively weak affect of time and energy. Thirdly, kernel size has lower affect of energy increment but stronger impact of performance. 6.2 Fully Connected Layer In this subsection, based on the number of the layer s parameter, I choose three fully connected layer (large, median, and small) to test their energy consumption. The quantity of parameter of one fully connected layer is calculated by following formula: Fully connected parameter = (feature map number input input + 1) x output. I want to see how the parameter quantity affect fully connected layer s performance and energy consumption. 48

62 Table 6-9: The configuration setting of three fully connected layers Input Size Feature Map (Kernel Size) Output Size Parameter Quantity Large ,752,832 Median ,781,312 Small ,100,096 The following table shows the performance of three fully connected layers. From Small to Median, the parameter quantity increase 4.09 times, which is similar to the 3.89 times performance drop. In the same manner, Large has 2.25 times parameter number than Median, and also 2.06 times slower than Median. Table 6-10: The performance of three fully connected layer Performance (Second) Large Median Small Table 6-11: The energy consumption of three fully connected layer GPU Average Power (Watt) GPU Energy (Joules) Large Median Small Table 6-11 shows, the GPU average power increase with the parameter becomes bigger. The GPU energy also shows linearly increasing with parameter. Therefore, based on the analysis of this subsection, both performance and energy 49

63 consumption has linear relation with fully connected layer parameters. 6.3 Pooling Layer The pooling layer is used to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network. Figure 6-2: One pooling method: max pooling For example, the figure 6-2 shows how the pooling layer with max method works. With the 2 2 size of filter kernel, the max-valued pixel is picked up to represent the 4 pixels area. After this process, 75% area was discarded and the depth dimension remains unchanged. Besides max pooling, there are another popular pooling methods is average pooling, which calculate average of all pixel value inside filter as the activation in each pooling region, but not the biggest one. In this subsection, I analyze the variables of pooling layer, I find the variables can be 50

64 from: (1) input size, (2) filter kernel size, (3) pooling method. Based on these three portions, I modify one variable while keeping other two fixed Input Size In the first experiment, I choose three different sizes of input and keep filter kernel size as 3 3, and also keep pooling method as max pooling. These three pooling layers come from AlexNet, but I set the feature map all the same to maintain the impartiality of experiments. Table 6-12: Different input size of pooling layer Input Size Feature Map Kernel Size Pooling Method Pooling Max Pooling Max Pooling Max Table 6-13: Different input size of pooling layer performance and energy Performance (Second) GPU Average Power (Watt) GPU Energy (Joules) Pooling Pooling Pooling From Table 6-12, 6-13, they directly show input size is positively correlated to performance, GPU average power, and energy. Based on previous chapter data, pooling layer only occupy 2% of total training time and energy of neural network, so the base energy is relatively small. Compared with the input size affect energy exponentially, the input size in pooling layer has linear relation with energy. 51

65 6.3.2 Filter Kernel Size In this subsection, I reset the feature map as the original number of AlexNet to reflect real world neural network. Since recent imagenet winners commonly choose 3 3 or 2 2 kernel size, I design the experiments with these two size. Noted I only compare the kernel size influence between each pooling layer, so the difference of feature map will not affect experiment result. Table 6-14: Configuration setting of three pooling layers Input Size Feature Map Kernel Size Pooling Method Pooling / 2 2 Max Pooling / 2 2 Max Pooling / 2 2 Max For the performance I can see in Table 6-15, the kernel size does not has big impact of performance. Although I average the time of 5 runs, the difference already in millisecond Table 6-15: Different kernel size of pooling layer performance Performance (second) 3 3 Kernel 2 2 Kernel Pooling Pooling Pooling level. So I bet there is not big influence of performance between these two sizes of kernels. Table 6-16: Different kernel size of pooling layer GPU energy GPU Energy (Joules) 3 3 Kernel 2 2 Kernel Pooling Pooling Pooling Both performance and energy difference between two kinds of kernel becomes bigger, 52

66 For small input size, choosing 3 3 kernel size has not much difference of choosing 2 2 from energy aspect. For larger input size, 2 2 can save more time and energy. From this subsection experiment we know, although small kernel size means the kernel sliding window need to move more time than bigger kernel to traverse the whole input data, small kernel also save time to do every down sampling than big kernel size and finally save energy Pooling Method In this subsection, I keep input size and kernel size, and change pooling method to compare how energy efficient of these two methods in poling layer. Table 6-17: The pooling method configuration of pooling layers Input Size Feature Map Kernel Size Pooling Method Pooling Max/Average Pooling Max/ Average Pooling Max/ Average Table 6-18: Pooling layers performance with various pooling method Performance(second) Max Pooling Average Pooling Pooling Pooling Pooling

67 Table 6-19: Pooling layers energy consumption with various pooling method GPU energy (Joules) Max Pooling Average Pooling Pooling Pooling Pooling From table 6-18, 6-19, max pooling have better performance and also more energy efficiency than average pooling in each layer. Since Max pooling only pick the max value of sliding window, which need less calculation than calculating average of all the pixel value that helping to reduce the workload of GPU, then save more energy. In here, with the input size increasing, the ratio of average pooling energy and max pooling becomes smaller from 1.77 to Conclusion In this chapter, I systematically analysis the major layers. Then I discuss main factors of each layers influence of layer performance and energy consumption. In convolutional layer, I discuss three different factor s contribution of layer. For fully connected layer, one factor used to modify to see how it affect the layer. In pooling layer, I visualize three kinds of setting s effect of pooling. Based on all the results, we can see how single factors affect one layer, then affect the whole neural network in both performance and energy consumption field. 54

68 7. HARDWARE TUNING In previous chapters, I analyze how the frameworks and neural network itself will affect energy consumption of training process. But I also want to save more energy by tuning hardware. Because most of the training job is responsible for GPU, so this tuning mainly focus on GPU tuning. In our work, I experimentally study the impacts of DVFS about neural network training performance and energy efficiency in Nvidia K20m GPU. Besides, I compare the K20m and Titan X GPU performance and energy consumption difference with the same benchmark. 7.1 DVFS Dynamic Voltage and Frequency Scaling (DVFS) is an advanced power-saving technology whose aim is to lower a component s power state while still meeting the performance requirement of the running workload [Ge, 2013]. The K20m GPU I used is Nvidia Kepler architecture. It includes total 2496 CUDA cores, six 64-bit memory controllers, and 5 GB global memory. The GPU cores and memory are capable of DVFS and support the clock frequency shown in Table

69 Table 7-1: Support memory frequency and core frequency options in K20m GPU Memory Frequency (MHz) GPU Core Frequency(MHz) To control the clock frequency of the K20m GPU, I used nvidia-smi utility or Nvidia System Management Interface. This interface is a command line utility that uses the Nvidia Management Library (NVML) for management and control of Nvidia devices [Price, 2014]. With the help of GPU boost commands on k20m GPU, I test six experiments which include all the pair combinations from table 7-1. The benchmark for this experiment is based on previous Convent Benchmark but with 100 iteration. The experiment performance results shows in Figure 7-1 with 100 times forward and backward propagation. 56

70 Figure 7-1: Performance impact of DVFS in K20m Figure 7-1 shows how big influence of GPU frequency affect the performance. The best performance is second with core speed with combination of 758 MHz and memory with 2600 MHz, which is 3 times faster than GPU with 324 MHz core speed and 324 MHz memory speed. In addition, when the memory frequency is fixed in 2600 MHz, the performance going better with the increasing of GPU core frequency. But this increase is limited. When core speed from 614 MHz to Max 758 MHz, the time saving is second. The average time saving prompted 2.6 second each change when memory fixed in 2600 MHz. 57

71 Figure 7-2: Total system energy impact of DVFS The Figure 7-2 shows the system energy of the whole training process. The system energy in here includes CPU and GPU energy consumption. From our data, CPU portion of average power doesn t affected too much during GPU tuning and keep in 63 ~ 67 Watt. So that means when GPU can run faster with higher frequency, CPU energy portion will drop certainly. Figure7-3 shows the average power of GPU under different frequency pair. 58

72 Figure 7-3: GPU Average Power with the impact of DVFS In Figure 7-3, both memory and core frequency with 324 MHz has Watt average power, which slightly higher than 46-watt idle power of K20m. It means the utilization of GPU is very low under this frequency pair. Then the GPU power go up with the core frequency increased and arrives the top when core speed is 758 MHz with Watt. 59

73 Figure 7-4: GPU energy portion with the impact of DVFS Compared with Figure 7-2 (total energy) with Figure 7-4 (GPU energy), I find that the most energy saving comes from CPU but not GPU. Even though GPU run faster with DVFS tuning, the average power goes high, so the energy of GPU doesn t improve much. From Figure 7-4, when K20m GPU is in 614/2600 pair, it achieves the best GPU energy efficiency with 6206 Joules, then with the faster core speed, K20m consume more energy generally. When K20m GPU with 758/2600 frequency, the second highest GPU energy consumption can not stop this pair becomes the best system energy efficiency option because of the saving in CPU make up the extra lost in GPU. To sum up, DVFS has pros (better performance) and cons (higher average power) of GPU running in neural network training, but DVFS helps GPU runs faster, CPU can save more energy to make up the shortage of DVFS, the total energy can drop slightly. 60

74 7.2 K20m and Titan X GPU Comparison When my work going to the end, I got Titan X GPU from Nvidia s donation, so I add this subsection to make a brief comparison of Titan X and K20m GPU s performance and energy efficiency with the same benchmark. Titan X is the newest and strongest GPU in current market. From table 7-2, Titan X has 3072 CUDA cores, and a total 12 GB global memory. Also, Titan X is MaxWell architecture that each CUDA core is stronger than cores with Kepler architecture of K20m. Based on such specifications, I can assume that Titan X has better performance than K20m, but I want to quantify the result and I also interested in whether Titan X can be more energy efficient than K20m GPU. Table 7-2: Titan X and K20m hardware comparison [Nvidia, 2012; Nvidia 2014] Chip CUDA Core single-precision Max Power floating point performance Titan X GM TFLOPS 250 W (MaxWell) K20m GK100 (Kepler) TFLOPS 225 W I used the Convent Benchmark with 100 iterations and AlexNet running in Caffe framework to compare two GPUs. Table 7-3: Titan X GPU and K20m GPU comparison Time CPU Average Power GPU Average Power Total Energy Titan X K20m From Table 7-3, Titan X has 3.9 times faster than K20m. With almost two time higher 61

75 average GPU power then K20m, Titan X still can achieve 2.56 times energy efficiency. Figure 7-5: K20m energy curve of Caffe with AlexNet Figure 7-6: Titan X energy curve of Caffe with AlexNet 62

A performance comparison of Deep Learning frameworks on KNL

A performance comparison of Deep Learning frameworks on KNL A performance comparison of Deep Learning frameworks on KNL R. Zanella, G. Fiameni, M. Rorro Middleware, Data Management - SCAI - CINECA IXPUG Bologna, March 5, 2018 Table of Contents 1. Problem description

More information

Deep Learning with Tensorflow AlexNet

Deep Learning with Tensorflow   AlexNet Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification

More information

High Performance Computing

High Performance Computing High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason

More information

C-Brain: A Deep Learning Accelerator

C-Brain: A Deep Learning Accelerator C-Brain: A Deep Learning Accelerator that Tames the Diversity of CNNs through Adaptive Data-level Parallelization Lili Song, Ying Wang, Yinhe Han, Xin Zhao, Bosheng Liu, Xiaowei Li State Key Laboratory

More information

Know your data - many types of networks

Know your data - many types of networks Architectures Know your data - many types of networks Fixed length representation Variable length representation Online video sequences, or samples of different sizes Images Specific architectures for

More information

Accelerating Convolutional Neural Nets. Yunming Zhang

Accelerating Convolutional Neural Nets. Yunming Zhang Accelerating Convolutional Neural Nets Yunming Zhang Focus Convolutional Neural Nets is the state of the art in classifying the images The models take days to train Difficult for the programmers to tune

More information

Convolutional Neural Networks

Convolutional Neural Networks NPFL114, Lecture 4 Convolutional Neural Networks Milan Straka March 25, 2019 Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics unless otherwise

More information

World s most advanced data center accelerator for PCIe-based servers

World s most advanced data center accelerator for PCIe-based servers NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:

More information

Mocha.jl. Deep Learning in Julia. Chiyuan Zhang CSAIL, MIT

Mocha.jl. Deep Learning in Julia. Chiyuan Zhang CSAIL, MIT Mocha.jl Deep Learning in Julia Chiyuan Zhang (@pluskid) CSAIL, MIT Deep Learning Learning with multi-layer (3~30) neural networks, on a huge training set. State-of-the-art on many AI tasks Computer Vision:

More information

Dynamic Routing Between Capsules

Dynamic Routing Between Capsules Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet

More information

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong Using Capsule Networks for Image and Speech Recognition Problems by Yan Xiong A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science Approved November 2018 by the

More information

INTRODUCTION TO DEEP LEARNING

INTRODUCTION TO DEEP LEARNING INTRODUCTION TO DEEP LEARNING CONTENTS Introduction to deep learning Contents 1. Examples 2. Machine learning 3. Neural networks 4. Deep learning 5. Convolutional neural networks 6. Conclusion 7. Additional

More information

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao University of California, Santa Cruz, *The College of William

More information

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60

More information

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS TECHNICAL OVERVIEW NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS A Guide to the Optimized Framework Containers on NVIDIA GPU Cloud Introduction Artificial intelligence is helping to solve some of the most

More information

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU, Machine Learning 10-701, Fall 2015 Deep Learning Eric Xing (and Pengtao Xie) Lecture 8, October 6, 2015 Eric Xing @ CMU, 2015 1 A perennial challenge in computer vision: feature engineering SIFT Spin image

More information

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance

More information

Convolutional Neural Network Layer Reordering for Acceleration

Convolutional Neural Network Layer Reordering for Acceleration R1-15 SASIMI 2016 Proceedings Convolutional Neural Network Layer Reordering for Acceleration Vijay Daultani Subhajit Chaudhury Kazuhisa Ishizaka System Platform Labs Value Co-creation Center System Platform

More information

IMPLEMENTING DEEP LEARNING USING CUDNN 이예하 VUNO INC.

IMPLEMENTING DEEP LEARNING USING CUDNN 이예하 VUNO INC. IMPLEMENTING DEEP LEARNING USING CUDNN 이예하 VUNO INC. CONTENTS Deep Learning Review Implementation on GPU using cudnn Optimization Issues Introduction to VUNO-Net DEEP LEARNING REVIEW BRIEF HISTORY OF NEURAL

More information

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School

More information

Fuzzy Set Theory in Computer Vision: Example 3

Fuzzy Set Theory in Computer Vision: Example 3 Fuzzy Set Theory in Computer Vision: Example 3 Derek T. Anderson and James M. Keller FUZZ-IEEE, July 2017 Overview Purpose of these slides are to make you aware of a few of the different CNN architectures

More information

High-Performance Data Loading and Augmentation for Deep Neural Network Training

High-Performance Data Loading and Augmentation for Deep Neural Network Training High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose

More information

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

NVIDIA FOR DEEP LEARNING. Bill Veenhuis NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA

More information

Deep Learning Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD.

Deep Learning Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD. Deep Learning 861.061 Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD asan.agibetov@meduniwien.ac.at Medical University of Vienna Center for Medical Statistics,

More information

Fuzzy Set Theory in Computer Vision: Example 3, Part II

Fuzzy Set Theory in Computer Vision: Example 3, Part II Fuzzy Set Theory in Computer Vision: Example 3, Part II Derek T. Anderson and James M. Keller FUZZ-IEEE, July 2017 Overview Resource; CS231n: Convolutional Neural Networks for Visual Recognition https://github.com/tuanavu/stanford-

More information

Deep Face Recognition. Nathan Sun

Deep Face Recognition. Nathan Sun Deep Face Recognition Nathan Sun Why Facial Recognition? Picture ID or video tracking Higher Security for Facial Recognition Software Immensely useful to police in tracking suspects Your face will be an

More information

Report: Privacy-Preserving Classification on Deep Neural Network

Report: Privacy-Preserving Classification on Deep Neural Network Report: Privacy-Preserving Classification on Deep Neural Network Janno Veeorg Supervised by Helger Lipmaa and Raul Vicente Zafra May 25, 2017 1 Introduction In this report we consider following task: how

More information

OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi

OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015 Outline About nervana Optimizing deep learning at assembler level Limited precision for deep learning neon benchmarks 2 About nervana

More information

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer DEEP NEURAL NETWORKS AND GPUS Julie Bernauer GPU Computing GPU Computing Run Computations on GPUs x86 CUDA Framework to Program NVIDIA GPUs A simple sum of two vectors (arrays) in C void vector_add(int

More information

HENet: A Highly Efficient Convolutional Neural. Networks Optimized for Accuracy, Speed and Storage

HENet: A Highly Efficient Convolutional Neural. Networks Optimized for Accuracy, Speed and Storage HENet: A Highly Efficient Convolutional Neural Networks Optimized for Accuracy, Speed and Storage Qiuyu Zhu Shanghai University zhuqiuyu@staff.shu.edu.cn Ruixin Zhang Shanghai University chriszhang96@shu.edu.cn

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

Study of Residual Networks for Image Recognition

Study of Residual Networks for Image Recognition Study of Residual Networks for Image Recognition Mohammad Sadegh Ebrahimi Stanford University sadegh@stanford.edu Hossein Karkeh Abadi Stanford University hosseink@stanford.edu Abstract Deep neural networks

More information

Yelp Restaurant Photo Classification

Yelp Restaurant Photo Classification Yelp Restaurant Photo Classification Rajarshi Roy Stanford University rroy@stanford.edu Abstract The Yelp Restaurant Photo Classification challenge is a Kaggle challenge that focuses on the problem predicting

More information

CSE 559A: Computer Vision

CSE 559A: Computer Vision CSE 559A: Computer Vision Fall 2018: T-R: 11:30-1pm @ Lopata 101 Instructor: Ayan Chakrabarti (ayan@wustl.edu). Course Staff: Zhihao Xia, Charlie Wu, Han Liu http://www.cse.wustl.edu/~ayan/courses/cse559a/

More information

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 7. Image Processing COMP9444 17s2 Image Processing 1 Outline Image Datasets and Tasks Convolution in Detail AlexNet Weight Initialization Batch Normalization

More information

Fei-Fei Li & Justin Johnson & Serena Yeung

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9-1 Administrative A2 due Wed May 2 Midterm: In-class Tue May 8. Covers material through Lecture 10 (Thu May 3). Sample midterm released on piazza. Midterm review session: Fri May 4 discussion

More information

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI Overview Unparalleled Value Product Portfolio Software Platform From Desk to Data Center to Cloud Summary AI researchers depend on computing performance to gain

More information

Machine Learning on VMware vsphere with NVIDIA GPUs

Machine Learning on VMware vsphere with NVIDIA GPUs Machine Learning on VMware vsphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology

More information

Deep Learning and Its Applications

Deep Learning and Its Applications Convolutional Neural Network and Its Application in Image Recognition Oct 28, 2016 Outline 1 A Motivating Example 2 The Convolutional Neural Network (CNN) Model 3 Training the CNN Model 4 Issues and Recent

More information

Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur

Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 05 Classification with Perceptron Model So, welcome to today

More information

Big Orange Bramble. August 09, 2016

Big Orange Bramble. August 09, 2016 Big Orange Bramble August 09, 2016 Overview HPL SPH PiBrot Numeric Integration Parallel Pi Monte Carlo FDS DANNA HPL High Performance Linpack is a benchmark for clusters Created here at the University

More information

Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms

Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk

More information

Using Machine Learning for Classification of Cancer Cells

Using Machine Learning for Classification of Cancer Cells Using Machine Learning for Classification of Cancer Cells Camille Biscarrat University of California, Berkeley I Introduction Cell screening is a commonly used technique in the development of new drugs.

More information

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

More information

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

Deploying Deep Learning Networks to Embedded GPUs and CPUs

Deploying Deep Learning Networks to Embedded GPUs and CPUs Deploying Deep Learning Networks to Embedded GPUs and CPUs Rishu Gupta, PhD Senior Application Engineer, Computer Vision 2015 The MathWorks, Inc. 1 MATLAB Deep Learning Framework Access Data Design + Train

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Spatial Localization and Detection. Lecture 8-1

Spatial Localization and Detection. Lecture 8-1 Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday

More information

Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets)

Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets) Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets) Sayan D. Pathak, Ph.D., Principal ML Scientist, Microsoft Roland Fernandez, Senior Researcher, Microsoft Module Outline

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

GPU-Accelerated Deep Learning

GPU-Accelerated Deep Learning GPU-Accelerated Deep Learning July 6 th, 2016. Greg Heinrich. Credits: Alison B. Lowndes, Julie Bernauer, Leo K. Tam. PRACTICAL DEEP LEARNING EXAMPLES Image Classification, Object Detection, Localization,

More information

Intro to Deep Learning. Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn

Intro to Deep Learning. Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn Intro to Deep Learning Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn Why this class? Deep Features Have been able to harness the big data in the most efficient and effective

More information

Face Detection CUDA Accelerating

Face Detection CUDA Accelerating Face Detection CUDA Accelerating Jaromír Krpec Department of Computer Science VŠB Technical University Ostrava Ostrava, Czech Republic krpec.jaromir@seznam.cz Martin Němec Department of Computer Science

More information

Channel Locality Block: A Variant of Squeeze-and-Excitation

Channel Locality Block: A Variant of Squeeze-and-Excitation Channel Locality Block: A Variant of Squeeze-and-Excitation 1 st Huayu Li Northern Arizona University Flagstaff, United State Northern Arizona University hl459@nau.edu arxiv:1901.01493v1 [cs.lg] 6 Jan

More information

Neural Network Exchange Format

Neural Network Exchange Format Copyright Khronos Group 2017 - Page 1 Neural Network Exchange Format Deploying Trained Networks to Inference Engines Viktor Gyenes, specification editor Copyright Khronos Group 2017 - Page 2 Outlook The

More information

Unified Deep Learning with CPU, GPU, and FPGA Technologies

Unified Deep Learning with CPU, GPU, and FPGA Technologies Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine

More information

3D model classification using convolutional neural network

3D model classification using convolutional neural network 3D model classification using convolutional neural network JunYoung Gwak Stanford jgwak@cs.stanford.edu Abstract Our goal is to classify 3D models directly using convolutional neural network. Most of existing

More information

POINT CLOUD DEEP LEARNING

POINT CLOUD DEEP LEARNING POINT CLOUD DEEP LEARNING Innfarn Yoo, 3/29/28 / 57 Introduction AGENDA Previous Work Method Result Conclusion 2 / 57 INTRODUCTION 3 / 57 2D OBJECT CLASSIFICATION Deep Learning for 2D Object Classification

More information

Revolutionizing the Datacenter

Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5

More information

Profiling DNN Workloads on a Volta-based DGX-1 System

Profiling DNN Workloads on a Volta-based DGX-1 System Profiling DNN Workloads on a Volta-based DGX-1 System Saiful A. Mojumder 1, Marcia S Louis 1, Yifan Sun 2, Amir Kavyan Ziabari 3, José L. Abellán 4, John Kim 5, David Kaeli 2, Ajay Joshi 1 1 ECE Department,

More information

Convolutional Neural Networks

Convolutional Neural Networks Lecturer: Barnabas Poczos Introduction to Machine Learning (Lecture Notes) Convolutional Neural Networks Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

More information

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru

More information

Deep Learning with Torch

Deep Learning with Torch Deep Learning with Torch The good, the bad, the ugly since 2002 Jimmy Ba jimmy@psi.utoronto.ca What is Torch? Year 2012 Google Answer: Torch7 provides a Matlab-like environment for state-of-the-art machine

More information

DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017

DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017 DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE Dennis Lui August 2017 THE RISE OF GPU COMPUTING APPLICATIONS 10 7 10 6 GPU-Computing perf 1.5X per year 1000X by 2025 ALGORITHMS 10 5 1.1X

More information

Neural Network Neurons

Neural Network Neurons Neural Networks Neural Network Neurons 1 Receives n inputs (plus a bias term) Multiplies each input by its weight Applies activation function to the sum of results Outputs result Activation Functions Given

More information

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides Deep Learning in Visual Recognition Thanks Da Zhang for the slides Deep Learning is Everywhere 2 Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object

More information

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed

More information

Is Bigger CNN Better? Samer Hijazi on behalf of IPG CTO Group Embedded Neural Networks Summit (enns2016) San Jose Feb. 9th

Is Bigger CNN Better? Samer Hijazi on behalf of IPG CTO Group Embedded Neural Networks Summit (enns2016) San Jose Feb. 9th Is Bigger CNN Better? Samer Hijazi on behalf of IPG CTO Group Embedded Neural Networks Summit (enns2016) San Jose Feb. 9th Today s Story Why does CNN matter to the embedded world? How to enable CNN in

More information

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python. Inception and Residual Networks Hantao Zhang Deep Learning with Python https://en.wikipedia.org/wiki/residual_neural_network Deep Neural Network Progress from Large Scale Visual Recognition Challenge (ILSVRC)

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks Jakob Verbeek 2017-2018 Biological motivation Neuron is basic computational unit of the brain about 10^11 neurons in human brain Simplified neuron model as linear threshold

More information

Additive Manufacturing Defect Detection using Neural Networks

Additive Manufacturing Defect Detection using Neural Networks Additive Manufacturing Defect Detection using Neural Networks James Ferguson Department of Electrical Engineering and Computer Science University of Tennessee Knoxville Knoxville, Tennessee 37996 Jfergu35@vols.utk.edu

More information

All You Want To Know About CNNs. Yukun Zhu

All You Want To Know About CNNs. Yukun Zhu All You Want To Know About CNNs Yukun Zhu Deep Learning Deep Learning Image from http://imgur.com/ Deep Learning Image from http://imgur.com/ Deep Learning Image from http://imgur.com/ Deep Learning Image

More information

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,

More information

Brainchip OCTOBER

Brainchip OCTOBER Brainchip OCTOBER 2017 1 Agenda Neuromorphic computing background Akida Neuromorphic System-on-Chip (NSoC) Brainchip OCTOBER 2017 2 Neuromorphic Computing Background Brainchip OCTOBER 2017 3 A Brief History

More information

Neural Networks for unsupervised learning From Principal Components Analysis to Autoencoders to semantic hashing

Neural Networks for unsupervised learning From Principal Components Analysis to Autoencoders to semantic hashing Neural Networks for unsupervised learning From Principal Components Analysis to Autoencoders to semantic hashing feature 3 PC 3 Beate Sick Many slides are taken form Hinton s great lecture on NN: https://www.coursera.org/course/neuralnets

More information

Classification of objects from Video Data (Group 30)

Classification of objects from Video Data (Group 30) Classification of objects from Video Data (Group 30) Sheallika Singh 12665 Vibhuti Mahajan 12792 Aahitagni Mukherjee 12001 M Arvind 12385 1 Motivation Video surveillance has been employed for a long time

More information

Deep Learning Based Real-time Object Recognition System with Image Web Crawler

Deep Learning Based Real-time Object Recognition System with Image Web Crawler , pp.103-110 http://dx.doi.org/10.14257/astl.2016.142.19 Deep Learning Based Real-time Object Recognition System with Image Web Crawler Myung-jae Lee 1, Hyeok-june Jeong 1, Young-guk Ha 2 1 Department

More information

Cisco UCS C480 ML M5 Rack Server Performance Characterization

Cisco UCS C480 ML M5 Rack Server Performance Characterization White Paper Cisco UCS C480 ML M5 Rack Server Performance Characterization The Cisco UCS C480 ML M5 Rack Server platform is designed for artificial intelligence and machine-learning workloads. 2018 Cisco

More information

Semantic Segmentation

Semantic Segmentation Semantic Segmentation UCLA:https://goo.gl/images/I0VTi2 OUTLINE Semantic Segmentation Why? Paper to talk about: Fully Convolutional Networks for Semantic Segmentation. J. Long, E. Shelhamer, and T. Darrell,

More information

CafeGPI. Single-Sided Communication for Scalable Deep Learning

CafeGPI. Single-Sided Communication for Scalable Deep Learning CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks

More information

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017 Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others

More information

A Deep Learning Approach to Vehicle Speed Estimation

A Deep Learning Approach to Vehicle Speed Estimation A Deep Learning Approach to Vehicle Speed Estimation Benjamin Penchas bpenchas@stanford.edu Tobin Bell tbell@stanford.edu Marco Monteiro marcorm@stanford.edu ABSTRACT Given car dashboard video footage,

More information

TESLA V100 PERFORMANCE GUIDE. Life Sciences Applications

TESLA V100 PERFORMANCE GUIDE. Life Sciences Applications TESLA V100 PERFORMANCE GUIDE Life Sciences Applications NOVEMBER 2017 TESLA V100 PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

Deep Learning With Noise

Deep Learning With Noise Deep Learning With Noise Yixin Luo Computer Science Department Carnegie Mellon University yixinluo@cs.cmu.edu Fan Yang Department of Mathematical Sciences Carnegie Mellon University fanyang1@andrew.cmu.edu

More information

Minimizing Computation in Convolutional Neural Networks

Minimizing Computation in Convolutional Neural Networks Minimizing Computation in Convolutional Neural Networks Jason Cong and Bingjun Xiao Computer Science Department, University of California, Los Angeles, CA 90095, USA {cong,xiao}@cs.ucla.edu Abstract. Convolutional

More information

Multi-Glance Attention Models For Image Classification

Multi-Glance Attention Models For Image Classification Multi-Glance Attention Models For Image Classification Chinmay Duvedi Stanford University Stanford, CA cduvedi@stanford.edu Pararth Shah Stanford University Stanford, CA pararth@stanford.edu Abstract We

More information

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Tianyu Wang Australia National University, Colledge of Engineering and Computer Science u@anu.edu.au Abstract. Some tasks,

More information

DL Tutorial. Xudong Cao

DL Tutorial. Xudong Cao DL Tutorial Xudong Cao Historical Line 1960s Perceptron 1980s MLP BP algorithm 2006 RBM unsupervised learning 2012 AlexNet ImageNet Comp. 2014 GoogleNet VGGNet ImageNet Comp. Rule based AI algorithm Game

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Transfer Learning. Style Transfer in Deep Learning

Transfer Learning. Style Transfer in Deep Learning Transfer Learning & Style Transfer in Deep Learning 4-DEC-2016 Gal Barzilai, Ram Machlev Deep Learning Seminar School of Electrical Engineering Tel Aviv University Part 1: Transfer Learning in Deep Learning

More information

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center Machine Learning With Python Bin Chen Nov. 7, 2017 Research Computing Center Outline Introduction to Machine Learning (ML) Introduction to Neural Network (NN) Introduction to Deep Learning NN Introduction

More information

Does the Brain do Inverse Graphics?

Does the Brain do Inverse Graphics? Does the Brain do Inverse Graphics? Geoffrey Hinton, Alex Krizhevsky, Navdeep Jaitly, Tijmen Tieleman & Yichuan Tang Department of Computer Science University of Toronto How to learn many layers of features

More information

Caffe2C: A Framework for Easy Implementation of CNN-based Mobile Applications

Caffe2C: A Framework for Easy Implementation of CNN-based Mobile Applications Caffe2C: A Framework for Easy Implementation of CNN-based Mobile Applications Ryosuke Tanno and Keiji Yanai Department of Informatics, The University of Electro-Communications, Tokyo 1. INTRODUCTION Deep

More information

Implementation of Deep Convolutional Neural Net on a Digital Signal Processor

Implementation of Deep Convolutional Neural Net on a Digital Signal Processor Implementation of Deep Convolutional Neural Net on a Digital Signal Processor Elaina Chai December 12, 2014 1. Abstract In this paper I will discuss the feasibility of an implementation of an algorithm

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Observe novel applicability of DL techniques in Big Data Analytics. Applications of DL techniques for common Big Data Analytics problems. Semantic indexing

More information

Application of Deep Learning Techniques in Satellite Telemetry Analysis.

Application of Deep Learning Techniques in Satellite Telemetry Analysis. Application of Deep Learning Techniques in Satellite Telemetry Analysis. Greg Adamski, Member of Technical Staff L3 Technologies Telemetry and RF Products Julian Spencer Jones, Spacecraft Engineer Telenor

More information

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications TESLA P PERFORMANCE GUIDE HPC and Deep Learning Applications MAY 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

Lecture 37: ConvNets (Cont d) and Training

Lecture 37: ConvNets (Cont d) and Training Lecture 37: ConvNets (Cont d) and Training CS 4670/5670 Sean Bell [http://bbabenko.tumblr.com/post/83319141207/convolutional-learnings-things-i-learned-by] (Unrelated) Dog vs Food [Karen Zack, @teenybiscuit]

More information