Convolutional Neural Networks for Object Classification in CUDA

Convolutional Neural Networks for Object Classification in CUDA

Alex Krizhevsky

April 16,

1 Introduction

Here I will present my implementation of a simple convolutional neural network in CUDA. The network takes as input a colour image and produces as its output one of ten possible class labels. The convolution operations, which account for 90% of the time required to train this network, are well over 100x faster on the GPU than on an Intel Core 2 Duo 2.4GHz (Sections 6.2 and 6.6 give the exact figures).

2 Related work

I have not found any other implementations of 2D convolution. The CUDA SDK has an implementation for the separable case, but that is quite different. I am interested in non-separable filters. I have heard from Yann LeCun of NYU (the primary pusher of convolutional neural nets) that students at UC Irvine have claimed a 100x speedup of convolutional neural net implementations on the GPU.

3 Neurons

A neuron is simply a function. It takes an input x, computes some function f(x) and outputs it. In my implementation, I use neurons that compute the logistic function f(x) = 1/(1 + e^{-x}), which is plotted in Figure 1. This is by far the most commonly used function. It has the nice property that its output is effectively linear in the input if the size of the input is small. This means that neural networks with small weights essentially compute a linear function, and gradually increasing the weights allows one to control the degree of nonlinearity. The degree of nonlinearity controls the capacity of the neural network. For example, a network with purely linear neurons can only compute linear functions.

Figure 1: The logistic function f(x) = 1/(1 + e^{-x}). [1]
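As a concrete illustration (a sketch, not code from the paper), the logistic neuron can be written as a single host/device function in CUDA, so that the same definition serves the CPU reference code and the GPU kernels:

#include <math.h>

// The logistic function f(x) = 1 / (1 + e^(-x)), usable from both host
// and device code. Illustrative only; not the paper's actual code.
__host__ __device__ inline float logistic(float x)
{
    return 1.0f / (1.0f + expf(-x));
}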

4 Feed-forward neural networks

Figure 2: A feed-forward neural network with one hidden layer.

Feed-forward neural networks link together layers of neurons. Every neuron in layer l is connected to every neuron in layer l + 1, but there are no intra-layer connections. This architecture is depicted in Figure 2. On each connection there is a real-valued weight. Neuron k in layer l receives as input the value

    x^l_k = b^l_k + \sum_{i=1}^{N_{l-1}} w^{l-1}_{ik} y^{l-1}_i,

where b^l_k is the bias into neuron k in layer l, N_{l-1} is the number of neurons in layer l-1, w^{l-1}_{ik} is the weight between unit i in layer l-1 and unit k in layer l, and y^{l-1}_i is the output of unit i in layer l-1. The neuron then computes its output y^l_k = f(x^l_k), where f is any differentiable function of the neuron's total input (as mentioned, I use the logistic function). The neurons in the data layer just output the data. Finally, we come up with a function E(y^L_1, ..., y^L_{N_L}) of the output that we would like the neural net to maximize (this can be seen as just another layer on top of the output layer), where L is the number of layers in the neural network. E should be differentiable so that \partial E / \partial y^L_k is readily computable.

Training the network consists of clamping the data neurons at the data and updating the parameters (the weights and biases) in the direction of the gradient. The derivatives can be computed as follows:

    \partial E / \partial w^{l-1}_{ik} = (\partial E / \partial x^l_k) y^{l-1}_i,        \partial E / \partial b^l_k = \partial E / \partial x^l_k,

where

    \partial E / \partial x^l_k = (\partial E / \partial y^l_k)(\partial y^l_k / \partial x^l_k),
    \partial E / \partial y^l_k = \partial E / \partial y^L_k                                      if l = L,
    \partial E / \partial y^l_k = \sum_{i=1}^{N_{l+1}} (\partial E / \partial x^{l+1}_i) w^l_{ki}   otherwise.        (1)
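To make equation (1) concrete, here is a minimal CPU-style sketch of the backward recursion for one fully-connected layer, assuming the layer above has already been processed. The array names and layout are illustrative and not taken from the paper's code.

// Backward pass for one fully-connected layer l, per equation (1).
//   N[l]              number of units in layer l
//   y[l][k]           output of unit k in layer l
//   w[l][k*N[l+1]+i]  weight from unit k in layer l to unit i in layer l+1
//   dEdy, dEdx        derivatives of E w.r.t. outputs and total inputs
void backprop_layer(int l, const int* N, float** y, float** w,
                    float** dEdy, float** dEdx)
{
    for (int k = 0; k < N[l]; ++k) {
        float d = 0.0f;        // dE/dy^l_k = sum_i dE/dx^{l+1}_i * w^l_{ki}
        for (int i = 0; i < N[l + 1]; ++i)
            d += dEdx[l + 1][i] * w[l][k * N[l + 1] + i];
        dEdy[l][k] = d;
        // dE/dx^l_k = dE/dy^l_k * y^l_k * (1 - y^l_k) for logistic units.
        dEdx[l][k] = d * y[l][k] * (1.0f - y[l][k]);
    }
    // The weight and bias derivatives then follow directly:
    // dE/dw^{l-1}_{ik} = dE/dx^l_k * y^{l-1}_i  and  dE/db^l_k = dE/dx^l_k.
}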

In equation (1), \partial E / \partial y^L_k is assumed to be readily computable. From this, derivatives with respect to all the weights and biases can be computed, working down from the top layer. This is known as the backpropagation algorithm. Typically, the gradient is averaged over a whole batch of images, and then the parameters are updated with this average gradient. This is known as batch learning. If the network is to be used for classification, as in my case, the number of outputs is typically set to the number of possible classes. The output of each output unit then corresponds in some way to the network's belief in some class label (I will get more specific about this in the next section when I describe my network).

5 Convolutional neural networks

In ordinary feed-forward neural networks, every neuron in layer l is connected to every neuron in layer l - 1. This is fine if there is no local structure in the activations of the neurons of layer l - 1. But images have just this kind of structure. Nearby pixels are highly correlated and faraway pixels are weakly correlated. Therefore we would like to encourage the neural net to focus on extracting local features of the image. Convolutional nets achieve this by connecting each unit in the hidden layer to only a small local patch of units in the image. In my case, this patch is 8 x 8. A convolutional net with one hidden unit will apply this 8 x 8 filter at all possible locations in the image (with a 4px-wide padding of zeros on each edge). The outputs of the hidden layer will then consist of the outputs of this one filter, applied all over the image. This is where the convolution comes in.

But having just one hidden unit is too limiting, even though its filter is applied all over the image and thus the unit has over a thousand outputs. We would like to have lots of hidden units that learn different filters, each applied all over the image. This scenario is depicted in Figure 3. In the figure there are F hidden units in total, and each plate represents the outputs produced by one hidden unit by convolving its 8 x 8 filter with the image and applying the logistic function. There are 32 x 32 outputs instead of 33 x 33 because GPUs are more fond of numbers like 32 than 33, and the neural net will not miss those few missing outputs at the edges of the image. In a standard neural net, to get 32 x 32 outputs from the hidden layer, we would need 32 x 32 units in that hidden layer. Here this is achieved with just one unit. So we can think of convolutional neural nets as ordinary neural nets, but with the constraint that certain groups of units must share weights (and also that the units be only locally connected to the image).

Clearly, the outputs of a hidden unit when applied to nearby regions of the image will be similar. Therefore the values in nearby regions of the same plate of hidden unit outputs (Figure 3) in the hidden layer are similar. So convolutional nets often include a subsampling layer right above the convolutional layer, to locally average nearby hidden unit responses. In my network, the averaging units average 4 x 4 non-overlapping patches of hidden unit outputs. The averaging layer reduces the size of the plates of hidden unit responses from 32 x 32 to 8 x 8. It can also be argued that losing some precision as to the exact location of the features in the image is actually advantageous, because it achieves a greater degree of invariance. It often turns out in vision that the precise location of a feature is not as important as its approximate location relative to other features.
The averaging units produce a combined total of 8 x 8 x F outputs. From this point on, the network is just an ordinary fully-connected feed-forward network. The network has 10 outputs because my dataset consists of 60,000 images in 10 classes. The 8 x 8 x F averaging units are connected to all 10 of the output units. Output unit k computes the function

    y_k = e^{x_k} / \sum_{j=1}^{10} e^{x_j},

where x_k is the total input going into output unit k. Since \sum_{k=1}^{10} y_k = 1, y_k is interpreted as the probability that the network assigns to class label k. To train the network, I maximize the average log probability of getting the correct label (a small sketch of this output layer is given at the end of this section). To evaluate the network's performance, I show it an image and compare the class to which it assigns the highest probability with the true label of the image. In my results I simply report the percentage of labels that it got correct.

Convolutional neural nets are actively developed by Yann LeCun's group at New York University, and they have been shown to achieve state-of-the-art performance in handwritten digit recognition as well as other classification tasks. The types of networks that achieve such performance are typically more complicated than the one that I have implemented. Specifically, they have more than one layer of convolution. But the principle is the same.

5.1 Training convolutional neural networks

Mathematically, the algorithm for training convolutional neural networks is the same as it is for training ordinary feed-forward networks. This is because, as mentioned above, a convolutional network can be viewed as a feed-forward network with weight-sharing constraints. In the next section I will describe in more detail the GPU kernels that need to be implemented to compute the various derivatives.
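Before moving on to the kernels, here is the small sketch of the softmax output layer promised above (plain host code; the function and array names are mine, not the paper's):

#include <math.h>

// Softmax over the 10 class inputs x[], returning log p(correct label).
// Subtracting the maximum input before exponentiating is a standard
// numerical-stability trick and is not described in the paper.
float softmax_logprob(const float x[10], int label, float y[10])
{
    float maxx = x[0];
    for (int k = 1; k < 10; ++k)
        if (x[k] > maxx) maxx = x[k];
    float sum = 0.0f;
    for (int k = 0; k < 10; ++k) {
        y[k] = expf(x[k] - maxx);     // proportional to e^{x_k}
        sum += y[k];
    }
    for (int k = 0; k < 10; ++k)
        y[k] /= sum;                  // y_k = e^{x_k} / sum_j e^{x_j}
    return logf(y[label]);            // the quantity to maximize, per image
}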

Figure 3: The architecture of the convolutional neural network that I implemented. The convolutional units compute the logistic function f(x) = 1/(1 + e^{-x}). The output layer is a standard logistic regression classifier: f(x_i) = e^{x_i} / \sum_{j=1}^{10} e^{x_j}. Not all the connections are shown, to avoid cluttering the diagram. For simplicity, this diagram does not make it explicit that the images and filters are colour. In reality, each image should be thought of as having dimensions 32 x 32 x 3 and each filter as having dimensions 8 x 8 x 3.

Figure 4: The layout of (a) the images and (b) the filters in memory. Note that both images and filters are stored as floats: each pixel consumes 4 bytes of memory. This is because in many applications, the data will have been pre-processed in some way, or normalized to be in the range 0-1, etc. The filters too must be floats because the filters represent the weights on the connections of the neural net. One cannot take the derivative with respect to a discrete quantity.

6 Implementation in CUDA

6.1 Image format

The first step in training the convolutional net is to convolve the F filters of the net with the N images that make up a batch of training cases. I generally used 64 filters and batches of 128 images, although these values are runtime variables. Figure 4 makes clear the exact format of the data and filters.

6.2 Convolution of 8x8 filters with 32x32 images

The algorithm for doing this convolution is explained in Figure 5. It uses a block size of 8 x 32 = 256 threads. Each block convolves one image with one filter, so the grid size is F x N. The algorithm uses 14 registers and 3264 bytes of shared memory, so it achieves 100% occupancy. It is 125x faster than the equivalent CPU algorithm on a Core 2 Duo 2.4GHz when F = 64, N = 128. This convolution takes 13.18ms on the GPU.

This is the best algorithm that I have been able to find. I have tried variants in which each thread convolves more than one filter with the image, but they perform slightly worse (they also don't achieve 100% occupancy). I have also tried a variant that uses 512 threads per block and loads the entire image into shared memory at once. It also performed slightly worse because loading was slightly more complicated. I have made some effort to make sure that there are no warp-splitting if statements. In the few conditionals that exist, all threads in the same warp take the same branch (except in the initial filter load).
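For comparison with the optimized kernel that Figure 5 describes, a naive CPU reference of this convolution (of the kind used for correctness checking in Section 7.1) might look as follows. The array layouts here are illustrative rather than the exact layout of Figure 4, and the asymmetric zero padding (4px top/left, 3px bottom/right) follows the description in Figure 5.

// Naive CPU reference: convolve N 32x32x3 images with F 8x8x3 filters,
// producing a 32x32 output plate per image-filter pair. Sketch only.
void conv_reference(const float* images,   // [N][3][32][32]
                    const float* filters,  // [F][3][8][8]
                    float* out,            // [N][F][32][32]
                    int N, int F)
{
    for (int n = 0; n < N; ++n)
    for (int f = 0; f < F; ++f)
    for (int oy = 0; oy < 32; ++oy)
    for (int ox = 0; ox < 32; ++ox) {
        float sum = 0.0f;
        for (int c = 0; c < 3; ++c)
        for (int fy = 0; fy < 8; ++fy)
        for (int fx = 0; fx < 8; ++fx) {
            int iy = oy + fy - 4;          // shift by the 4px top/left padding
            int ix = ox + fx - 4;
            if (iy < 0 || iy > 31 || ix < 0 || ix > 31) continue;  // zero pad
            sum += images[((n*3 + c)*32 + iy)*32 + ix]
                 * filters[((f*3 + c)*8 + fy)*8 + fx];
        }
        out[((n*F + f)*32 + oy)*32 + ox] = sum;   // logistic applied elsewhere
    }
}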

I have checked that removing all loads and their corresponding if statements improves the algorithm's performance by only 2%, so memory bandwidth does not appear to be a bottleneck. All of the loops have bounds that are known at compile time, so they are unrolled by the compiler. All global memory accesses are coalesced and there are no bank conflicts in accessing shared memory. I have also tried to minimize the amount of pointer arithmetic. I got a slight speed improvement by incrementing pointers at each loop iteration rather than recomputing the index of the array that I want to access.

6.3 Subsampling (averaging) layer

Figure 6 makes clear the goal of this step. We need to go from a 32 x 32 array of floats to an 8 x 8 array by averaging every non-overlapping 4 x 4 region. Figure 7 describes the algorithm. The algorithm uses blocks of 8 x 32 threads, so the grid size is again F x N. The algorithm has 2-way bank conflicts in one of its steps, but avoids bank conflicts in all other steps. All global memory accesses are coalesced. The index computations are particularly simple because they only involve division by powers of 2. Avoiding all bank conflicts requires slightly more complicated index computations, which negate the benefits. All of the reduction in this algorithm is performed by threads in the same warp, so little synchronization is necessary. In any case, this step accounts for less than 1% of the time required to train the convolutional net. The kernel uses 1152 bytes of shared memory and 10 registers, and so it achieves full occupancy. This kernel is 31x faster than its CPU counterpart when F = 64, N = 128.

6.4 Output and derivative computation

Once we have the averaged outputs of the previous step, computing the outputs of the neural net is a trivial task since the remaining layers are fully connected. I will not go into detail for lack of space; I will just say that it involves some calls to CUBLAS. The derivatives with respect to the output units and the averaging units are likewise computed with a few matrix multiplications.

6.5 Derivatives with respect to convolutional hidden units

Given the derivatives with respect to the outputs of the averaging units, it is conceptually simple to compute the derivatives with respect to the outputs of the convolutional units. The averaging units simply locally average the convolutional unit outputs, so the derivative with respect to a given convolutional unit output will be the same as the derivative with respect to the averaging unit assigned to its region. This means we require a supersampling kernel. This step is essentially the reverse of step 6.3. Figure 8 depicts the goal of this step.

There is one minor complication: the CUBLAS SGEMM (matrix multiplication) routine outputs its result in column-major order. Since I used CUBLAS to compute the derivatives with respect to the activities of the averaging units, I will have to deal with this (because I want the output of the kernel to be in row-major order for the next step). Figure 9 depicts the format of the input matrix. Figure 10 explains the implementation of the algorithm. All reads and writes are coalesced and there are no bank conflicts. When 4 threads read the same value, that value should be broadcast, so there should be no bank conflict there. This kernel too achieves 100% occupancy and accounts for less than 1% of the time spent training the convolutional net. It is 58x faster than its CPU counterpart when F = 64, N = 128. This kernel computes derivatives with respect to the outputs of the convolutional units.
Computing derivatives with respect to the inputs to the convolutional units is accomplished with a simple element-wise vector multiplication, since from equation (1) we know that

    \partial E / \partial x^l_k = (\partial E / \partial y^l_k)(\partial y^l_k / \partial x^l_k) = (\partial E / \partial y^l_k) y^l_k (1 - y^l_k),

where the latter equality is due to the fact that y^l_k = 1/(1 + e^{-x^l_k}). (A short sketch of this element-wise step is given below.)

6.6 Derivatives with respect to weights between data and convolutional units

Once we have the derivatives with respect to the inputs to the convolutional units, the final step is to compute the derivatives with respect to the weights between the data units and the convolutional units.
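Here is the promised sketch of that element-wise step (the kernel name and memory layout are mine, not the paper's):

// dE/dx = dE/dy * y * (1 - y), one thread per element.
__global__ void logistic_grad(const float* y, const float* dEdy,
                              float* dEdx, int numElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements)
        dEdx[i] = dEdy[i] * y[i] * (1.0f - y[i]);
}

// Launch sketch:
// logistic_grad<<<(numElements + 255) / 256, 256>>>(y, dEdy, dEdx, numElements);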

Figure 5: The algorithm for convolving 8 x 8 colour filters with 32 x 32 colour images, with 4px padding of zeros on the left and top edges and 3px padding on the right and bottom edges. The block size is (y, x) = (8, 32). Each block convolves one filter with one image. (1) 3/4 of the threads load the entire filter into shared memory. Although they do not need all three colour channels at once, there is a slight advantage to loading them all together, and the shared memory is available. (2) The threads allocate a region of shared memory (highlighted) and initialize it with zeros. The threads then load the chunk of the (red colour channel of the) image that corresponds to the non-border area. The 8 x 32 threads then compute the 8 x 32 dot products with this chunk of the image and store the result in a local variable. Each thread computes one dot product. This result will later have to be added to the results from the blue and green colour channels and then written out to global memory. (3) The threads shift the shared memory window down by 8 pixels and perform the 8 x 32 dot products with the new image chunk. They repeat the process twice more. Each time, there are 8 x 32 dot products to perform. At the last step, the window is seen to exceed the size of the padding because there is only 3px of padding at the bottom. This points out the fact that the threads really only need a shared memory region one row shorter, but the extra row doesn't hurt. After the above steps, each thread has 4 local variables with the results of the 4 dot products that the thread computed. The threads now load the green channel of the image and repeat the above steps for the green channel of the filter, adding to the 4 local variables. Then the threads repeat the steps for the blue channel. Finally, each thread writes its 4 outputs to global memory.
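To connect Figure 5 to actual CUDA code, here is a much-simplified kernel with the same decomposition (8 x 32 threads per block, one block per image-filter pair, 4 outputs per thread) but without the paper's sliding shared-memory window: it simply keeps one whole 32 x 32 image channel in shared memory. It is a sketch for illustration, not the paper's kernel, and the array layouts are assumptions.

#define IMG 32
#define FILT 8

// Simplified convolution: grid (F, N), block (x=32, y=8). Each thread
// accumulates 4 outputs: rows ty, ty+8, ty+16, ty+24 of column tx.
__global__ void conv_simple(const float* images,   // [N][3][32][32]
                            const float* filters,  // [F][3][8][8]
                            float* out)            // [N][F][32][32]
{
    __shared__ float sImg[IMG][IMG];
    __shared__ float sFilt[FILT][FILT];

    const int f = blockIdx.x, n = blockIdx.y, F = gridDim.x;
    const int tx = threadIdx.x, ty = threadIdx.y;
    const int tid = ty * 32 + tx;

    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (int c = 0; c < 3; ++c) {                  // loop over colour channels
        if (tid < 64)                              // 64 threads load the filter
            sFilt[tid / 8][tid % 8] =
                filters[((f * 3 + c) * 8 + tid / 8) * 8 + tid % 8];
        for (int r = ty; r < IMG; r += 8)          // all threads load the image
            sImg[r][tx] = images[((n * 3 + c) * IMG + r) * IMG + tx];
        __syncthreads();

        for (int k = 0; k < 4; ++k) {              // 4 output rows per thread
            const int oy = ty + 8 * k;
            float sum = 0.0f;
            for (int fy = 0; fy < FILT; ++fy)
                for (int fx = 0; fx < FILT; ++fx) {
                    const int iy = oy + fy - 4;    // 4px of zero padding
                    const int ix = tx + fx - 4;    // on the top/left edges
                    if (iy >= 0 && iy < IMG && ix >= 0 && ix < IMG)
                        sum += sImg[iy][ix] * sFilt[fy][fx];
                }
            acc[k] += sum;
        }
        __syncthreads();                           // before the next channel's load
    }
    for (int k = 0; k < 4; ++k)
        out[((n * F + f) * IMG + ty + 8 * k) * IMG + tx] = acc[k];
}

// Launch sketch: conv_simple<<<dim3(F, N), dim3(32, 8)>>>(d_images, d_filters, d_out);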

Figure 6: The goal of the local averaging kernel: to go from the 32 x 32 array of floats (left) to the 8 x 8 array (right) by averaging the highlighted 4 x 4 regions.

Figure 7: The local averaging algorithm. The block size is (y, x) = (8, 32). Each block subsamples one plate of hidden unit outputs (pictured left). The threads allocate a 4 x 64 array in shared memory with 8 padding columns, to avoid bank conflicts later. The values are summed as pictured. Thread 0 sums the 4 values in the blue column and writes the result to the corresponding cell in shared memory. Thread 1 sums the yellow column, thread 2 the orange, and thread 3 the green. The rest of the threads proceed in an analogous fashion. Notice that this produces 2-way bank conflicts, since threads 0 and 2 write to the same bank, as do 1 and 3, 4 and 6, etc. The reads from global memory are of course coalesced. After this step is complete, the threads in the first 8 warps do the remaining reduction by summing out the columns of the shared memory array, 2 threads per column. So each warp sums out 16 columns. No synchronization is necessary in this step. At this step there are no bank conflicts because thread i reads from bank (i % 2) * 8 + i/2 (modulo 16). Once all the final values are written to the first row of the shared memory array, the first two warps divide those values by 16 and write them out to global memory. This step also has no bank conflicts because thread i accesses bank i (modulo 16). The writes to global memory are coalesced.
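A deliberately naive version of the same subsampling step (one thread per 4 x 4 patch, no shared memory) shows the arithmetic that the optimized kernel of Figure 7 performs; it is a sketch, not the paper's kernel, and the layout is an assumption:

// Naive subsampling: one block per plate of convolutional outputs,
// an 8x8 block of threads, one output cell per thread.
__global__ void subsample_simple(const float* convOut,  // [N*F][32][32]
                                 float* avgOut)         // [N*F][8][8]
{
    const int plate = blockIdx.x;
    const int ox = threadIdx.x, oy = threadIdx.y;       // 0..7 each
    float sum = 0.0f;
    for (int dy = 0; dy < 4; ++dy)
        for (int dx = 0; dx < 4; ++dx)
            sum += convOut[(plate * 32 + oy * 4 + dy) * 32 + ox * 4 + dx];
    avgOut[(plate * 8 + oy) * 8 + ox] = sum / 16.0f;    // average of the 4x4 patch
}

// Launch sketch: subsample_simple<<<N * F, dim3(8, 8)>>>(d_convOut, d_avgOut);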

9 Figure 8: The goal of the kernel of step 6.5 for one image and one lter. We need to supersample the derivatives with respect to the averaging units (left) by a factor of 4 4. So each value on the left must get written to 16 locations on the right. Figure 9: The layout of the input matrix for step 6.5. These are derivatives with respect to the outputs of the averaging units. The images are in columns because the CUBLAS SGEMM routine outputs in column-major order. 9

Figure 10: The supersampling algorithm of step 6.5. The block size is 256 threads (16 x 16); the grid size is (y, x) = (64F/16, N/16). The grid simply partitions the input matrix into chunks (refer to Figure 9 for the input layout). (1) Each block allocates a 16 x (16 + 1) region of shared memory and reads its chunk of the input matrix into shared memory, without transposing it. The threads supersample the matrix in the following way: threads 0, 1, 2, 3 read the blue value (0, 0) of (1); threads 4, 5, 6, 7 read the yellow value (1, 0); the rest of the threads proceed analogously. Note that there are only enough threads to read four of the columns of the shared memory array in this way. (2) Each thread then writes the value it read to four different rows of the output matrix. The hue in (2) indicates the thread that is doing the writing. Thread 0 writes the first column, thread 1 the second, etc. (Note that the threads that read the second column of the shared memory array will write to a different image (output matrix) than the ones that read the first column. Only one image is pictured in (2).) The threads repeat (2) three more times, each time processing four more columns of the shared memory array.
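Ignoring the column-major complication that Figures 9 and 10 deal with, the supersampling itself reduces to copying each averaging-unit derivative to the 16 convolutional-unit positions it covers. A naive sketch (row-major layouts assumed, names mine):

// Naive supersampling: one block per plate, 8x8 threads, each thread
// copies one averaging-unit derivative to its 4x4 region. If the 1/16
// factor from the averaging is not absorbed elsewhere, it could be
// multiplied in here.
__global__ void supersample_simple(const float* dAvg,   // [N*F][8][8]
                                   float* dConv)        // [N*F][32][32]
{
    const int plate = blockIdx.x;
    const int ax = threadIdx.x, ay = threadIdx.y;       // 0..7 each
    const float v = dAvg[(plate * 8 + ay) * 8 + ax];
    for (int dy = 0; dy < 4; ++dy)
        for (int dx = 0; dx < 4; ++dx)
            dConv[(plate * 32 + ay * 4 + dy) * 32 + ax * 4 + dx] = v;
}

// Launch sketch: supersample_simple<<<N * F, dim3(8, 8)>>>(d_dAvg, d_dConv);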

Figure 11: The convolution required to compute the derivatives with respect to the weights between the data units and the convolutional units. This convolution produces 8 x 8 outputs per image-filter pair. Each image must be convolved with F filters, but unlike in the convolution of step 6.2, the F filters are different for each image.

For a particular filter and image pair, the derivative with respect to weight w_ik is the average over all applications of the filter of the derivative with respect to w_ik. This amounts to the convolution depicted in Figure 11. The filters now are the derivatives with respect to hidden unit inputs, which we computed in the previous step. Figure 12 describes the algorithm. This kernel is 25% faster than the convolution kernel of Section 6.2, even though the two kernels require an identical number of FLOPs. This kernel uses a block size of 8 x 8 x 8, 5856 bytes of shared memory, and 16 registers, so it too achieves 100% occupancy. All global memory accesses are coalesced. There are no bank conflicts, due to the one column of padding depicted in Figure 12. There are no warp-splitting if statements. I have checked that removing all global memory loads and their if statements makes the kernel only 5% faster. Here too I obtained a slight speedup from incrementing pointers at each loop iteration rather than recomputing indices. As can be discerned from Figure 12 (bottom), on the second iteration the kernel loads a chunk of the image matrix into shared memory, but 64% of that chunk is already in shared memory from the previous load. Despite the redundancy, it turns out that this is the fastest alternative, due to the simplicity of the code. This kernel is 156x faster than the equivalent CPU kernel when F = 64, N = 128. This convolution takes 10.54ms on the GPU.

6.7 Other derivatives

Once we have the above derivatives, we need to do a simple reduction to average over all images in the batch. I won't go into detail here because it's straightforward. The bias derivatives are likewise computed by averaging the derivatives with respect to the hidden unit inputs.

7 Results

7.1 Testing methodology

I have tested each of my kernels against its CPU equivalent. Each kernel produces a matrix of results. I compared the maximum absolute difference between the matrix produced by the CPU and the matrix produced by the GPU. Observing that the maximum difference was something like 10^{-5} instilled confidence that, at the very least, the GPU code was doing an excellent job of replicating the bugs in the CPU code.

7.2 Experiments

I have trained the convolutional neural net on my dataset of 60,000 images from 10 classes; it takes about 20 minutes to fully train with 64 convolutional hidden units. It eventually learns to guess the correct class about 58% of the time on a test set. The best result I have achieved with other methods is about 65%, so the convolutional net is not doing too badly. A non-convolutional neural net with one hidden layer with 10,000 hidden units gets about 50% of test cases right.

Figure 12: The algorithm for performing the convolution of Figure 11. The block size is (z, y, x) = (8, 8, 8). The x and y dimensions of the thread index indicate which of the 64 dot products (see Figure 11) the thread is to compute. Each block convolves one image with 16 filters; the z dimension indicates which two filters the thread is concerned with. Here, I have drawn just one image and one filter. The filters are the derivatives that we computed in step 6.5. The images are the input data. They are colour, but only one colour channel is shown. The threads process each colour in turn (it is the outer-most loop). The threads initialize the shared memory matrix with zeros. (Top-left) A subset of the 512 threads loads a chunk of the image into shared memory. (Top-centre) The threads load a 4 x 16 chunk of the filter into shared memory. Each thread computes the length-64 dot product between the filter and image chunks that it needs to compute here. (Top-right) The threads shift the shared memory window on the filter to the right by 16 pixels and add to the dot product from the previous window. (Bottom) The threads shift the shared memory window on the image down by 4 pixels and repeat the above steps. They continue shifting the windows until all chunks of the filter and image have been processed. Then they write the output to global memory and repeat all of the above for the next colour.
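As a CPU-style reference for the convolution of Figures 11 and 12 (one image-filter pair, one colour channel handled the same as the others), the derivative at each of the 8 x 8 filter positions can be accumulated as below. Names, layout, and the use of a plain sum rather than the paper's average over filter applications are my assumptions.

// dW[c][fy][fx] accumulates, over every position at which the filter was
// applied, dE/d(conv input at that position) times the image pixel that sat
// under weight (c, fy, fx). Padding matches Section 6.2: 4px on the top/left,
// 3px on the bottom/right.
void weight_grad_reference(const float* image,    // [3][32][32]
                           const float* dConvIn,  // [32][32]: dE/dx per position
                           float* dW)             // [3][8][8]
{
    for (int c = 0; c < 3; ++c)
    for (int fy = 0; fy < 8; ++fy)
    for (int fx = 0; fx < 8; ++fx) {
        float sum = 0.0f;
        for (int oy = 0; oy < 32; ++oy)
        for (int ox = 0; ox < 32; ++ox) {
            const int iy = oy + fy - 4, ix = ox + fx - 4;
            if (iy < 0 || iy > 31 || ix < 0 || ix > 31) continue;  // zero padding
            sum += dConvIn[oy * 32 + ox] * image[(c * 32 + iy) * 32 + ix];
        }
        dW[(c * 8 + fy) * 8 + fx] = sum;   // divide by the number of applications to average
    }
}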

Figure 13: The 64 filters learned by the convolutional neural net. Each filter is applied by overlaying it on a colour image and taking the dot product.

Figure 13 shows the 64 filters that this convolutional net learned. The filters have 8 x 8 x 3 weights, and each weight is connected to a pixel in the image. So we can visualize these filters as if they were 8 x 8 colour images. I initialized the weights randomly, so this at least confirms that the network learns some sensible filters.

8 Possible improvements

I would have liked to have had more time to explore kernels in which each block convolves with multiple filters. It may be hard to maintain full occupancy in such kernels, but it may not be needed. I suspect that giving each thread more work, even at the expense of occupancy, may produce some speedups. Maybe an algorithm of the sort used for step 6.6 could work for step 6.2. The only difference would be that the threads would be able to load the entire filter into shared memory.

It may also be useful to consider alternative data representations. For example, instead of storing all the red pixel values, then all the green values, then all the blue values, I could store the red, green, and blue values for each pixel one after the other. Or, instead of storing the filters and images in row-major order, I could store one of them in column-major order. This would make it easy for one thread to compute a whole bunch of convolutions at once, since it would be able to get the nth pixel of a bunch of images in just one global memory transaction.

9 Conclusion

It works! The next step is to build convolutional nets with multiple layers of convolution.

References

[1] Figure from the Wikipedia article on the logistic function.
