Parallel Processing Neural Networks on SIMD/GPU Architectures by Derek Kern CSC7551, December 8th, 2011

Project Description

Neural networks can often have hundreds, if not thousands, of neurons when used to solve a pattern-matching task. Specifically, a backpropagation neural network must, when responding to an input, 'ripple' the effect of the input across each and every layer before producing an output. Furthermore, when training, this 'rippling' must go from input to output and then back from the output into the hidden layers. Obviously, depending upon the size of the network, these tasks can be computationally daunting. In this project, a backpropagation neural network will be modelled and computed on a GPU vector processor such that each neuron will occupy one or many individual PEs. This is thought to be an interesting parallel computation task for a number of reasons: (1) since some layers of neurons (PEs) must fire while others remain idle, it will require significant effort to coordinate PE behavior; (2) since each neuron (PE) in a layer must be able to read the output of many or all of the neurons in the previous layer, there is a significant risk of memory access collisions; and (3) given the number of computations needed to determine the output weight for a neuron, there is a chance that multilevel parallelism may be used, i.e., for each neuron being handled in parallel, multiple PEs may be used to compute its weight.

Analysis and Results

Broad Results

The overall goals of the project were: (1) to achieve a basic vectorization of a backpropagation neural network; (2) to explore the coordination and other issues that arise when running the neural network on a GPU; and (3) to achieve an extreme vectorization of a backpropagation neural network. During the project all three of these goals were met. On top of this, both the basic and extreme vector versions of the neural network vastly outperformed the sequential version. Furthermore, a thorough understanding of the GPU hardware was gained. The GPU threading model is something that is not well covered in most texts. It was through this project that an understanding of how to fully exploit the threading model of the GPU was gained; blocks and threads need to be specified so that the streaming multiprocessors and the cores within them are used with the greatest efficiency. It was also through this project that the details of kernel thread synchronization were learned: the only way to synchronize across blocks is via separate kernel calls, since __syncthreads() only synchronizes threads within a block.
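
To make the synchronization point concrete, here is a minimal sketch (not the project's code; the kernel name, the host-side arrays d_out, d_w, and size, and the weight layout are all hypothetical) of driving a feedforward pass layer by layer, where each kernel launch acts as the cross-block barrier that __syncthreads() cannot provide:

// Hypothetical sketch: one thread computes one neuron of the current layer.
__global__ void feedforward_layer( const double *prev_outputs, const double *weights,
                                   double *outputs, int prev_size, int layer_size )
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per neuron
    if ( j >= layer_size ) return;

    int w0 = j * ( prev_size + 1 );                  // assumed layout: prev_size weights + 1 bias per neuron
    double sum = 0.0;
    for ( int k = 0; k < prev_size; k++ )            // read every output of the previous layer
        sum += prev_outputs[k] * weights[w0 + k];
    sum += weights[w0 + prev_size];                  // bias term

    outputs[j] = 1.0 / ( 1.0 + exp( -sum ) );        // sigmoid activation
}

// Host side: launching one kernel per layer on the same stream enforces the
// layer-to-layer ordering, since each launch waits for the previous one to finish.
// for ( int i = 1; i < layer_count; i++ )
//     feedforward_layer<<< (size[i] + 31) / 32, 32 >>>( d_out[i-1], d_w[i], d_out[i],
//                                                       size[i-1], size[i] );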

Detailed Results

The simplest and most straightforward vector version is called vectorized simple. Below is the runtime comparison of it versus the sequential version.

From the chart above, it is clear that this simple vectorization outperforms the sequential version across all of the test networks.

The next two vector versions, vectorized warp bad and vectorized warp good, are meant to display the effects of allocating blocks and threads within the GPU and how these settings can affect the utilization of the GPU's streaming multiprocessors (SMs). For the record, the vectorized simple version does a poor job of allocating blocks and threads; its block/thread configuration results in each thread residing within its own warp. Vectorized warp bad allocates 50 threads per block. This means that each SM that is doing processing will end up with two warps (one of 32 threads and one of 18 threads) to manage; the SM can only run one warp at a time, so the other warp will remain idle. However, this is still better than vectorized simple. Vectorized warp good allocates 32 threads per block. This means that each SM that is doing processing will end up with only one warp to manage, unless more than 448 threads are needed (which is the case for test networks Net 4, Net 6, and Net 8). However, even if, say, 500 threads are needed, most SMs will remain with only one warp to manage; only two will be saddled with an extra warp. This means that most warps can be fully processed without waiting on other warps to finish.
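
The block-and-warp arithmetic behind these three configurations can be sketched with a small host-side calculation (an illustration only; the 500-thread figure comes from the wide test networks, and the 14-SM count is an assumption about the Fermi-class GPU used, which is what would make 14 x 32 = 448 single-warp threads):

#include <cstdio>

int main()
{
    const int warp_size  = 32;
    const int neurons    = 500;                     // threads needed for one wide layer
    const int configs[3] = { 1, 50, 32 };           // vectorized simple, warp bad, warp good

    for ( int c = 0; c < 3; c++ ) {
        int threads_per_block = configs[c];
        int blocks          = ( neurons + threads_per_block - 1 ) / threads_per_block;
        int warps_per_block = ( threads_per_block + warp_size - 1 ) / warp_size;

        // 1 thread/block   -> 500 blocks, each warp holding a single active thread
        // 50 threads/block -> 10 blocks, two warps per block (32 + 18 threads)
        // 32 threads/block -> 16 blocks, one full warp per block; on an assumed
        //                     14-SM GPU, only two SMs end up carrying a second block
        printf( "%2d threads/block -> %3d blocks, %d warp(s) per block\n",
                threads_per_block, blocks, warps_per_block );
    }
    return 0;
}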

Below is the runtime comparison of vectorized simple, vectorized warp bad, and vectorized warp good.

As the chart shows, the results aren't as stark as one might imagine. However, it is clear that as the need for parallelism increases (as in the wide test networks Net 4, Net 6, and Net 8), the vectorized warp good version does outperform the other versions. Still, it isn't yet clear why it doesn't perform as well on the test networks that require less parallelism. The theory is that the warped versions, given their higher density of active threads relative to the amount of memory being accessed, experience a slowdown due to memory bank collisions. This is especially the case for the test networks that have 200 or fewer neurons per layer (Nets 1, 2, 3, 5, and 7). As the number of neurons per layer increases (say to 500, as in Nets 4, 6, and 8), the warped versions are able to spread their memory accesses over a greater space of memory, which results in fewer collisions and better runtimes. This is a significant result. In essence, it means that even though there isn't significant documentation on the exact layout of global memory on the GPU, faster access can still be achieved, in certain circumstances, by deliberately choosing a sparse data structure. Certainly, if the neural network software were to be redesigned today, this is something that would drive the design of the neural network data structure.

The next vector version, vectorized kcm, is meant to display the overhead of making repeated kernel calls. Vectorized simple was written so that the weight adjustment step is done with two loops over all of the layers in the network; each iteration invokes another kernel call. The vectorized kcm version combines these loops and the kernel calls within them.
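
The kernel-call-merging idea can be illustrated with the following sketch (hypothetical; it is modelled on the sequential momentum and learning-rate kernels listed later in this report, not on the actual vectorized kcm source, and the weight indexing is simplified). The momentum pass and the learning-rate pass for a layer are fused into one kernel, so the host makes one launch per layer instead of two:

__global__ void adjust_weights_merged( double *weights, double *cached_weights,
                                       const double *outputs, const double *errors,
                                       double momentum, double rate,
                                       int iw_offset, int io_offset, int io_prev_offset,
                                       int prev_size, int layer_size )
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per neuron
    if ( j >= layer_size ) return;

    for ( int k = 0; k < prev_size; k++ ) {
        int w = iw_offset + j * prev_size + k;       // simplified, assumed weight layout
        weights[w]        += momentum * cached_weights[w];                           // momentum pass
        cached_weights[w]  = rate * errors[io_offset + j] * outputs[io_prev_offset + k];
        weights[w]        += cached_weights[w];                                      // learning-rate pass
    }
}

// Host side: one launch per layer replaces the two separate per-layer launches
// that vectorized simple makes during weight adjustment.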

Below is the runtime comparison of vectorized simple and vectorized kcm.

The vectorized kcm version does indeed yield a modest improvement, but not as stark as hoped.

The vectorized kcm version led to the creation of a version that was initially called vectorized full-kcm. However, this version was ultimately deemed unworkable since it required block-level synchronization, which is not possible on NVIDIA GPUs without separate kernel calls. This version was eventually redubbed vectorized kcm failed. Just to see whether it could be made to work at all, it was run within a single block. Below is the runtime comparison of the vectorized simple, vectorized kcm, and vectorized kcm failed versions.

From the chart, it is easy to see that vectorized kcm failed was a total failure. Running it within a single block doomed it to very modest parallelism (though it still outperforms the sequential version).

The next vector version, vectorized mass, is meant to be a more fully parallelized version of vectorized simple. While vectorized simple parallelizes over the neurons only, vectorized mass parallelizes the processing of the weights as well.
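
One way to parallelize the weights as well as the neurons is sketched below (again a hypothetical illustration, not the project's vectorized mass code): one block per neuron, one thread per incoming weight, with a shared-memory tree reduction forming the weighted sum. The sketch assumes blockDim.x is a power of two at least as large as prev_size and that shared memory is sized to blockDim.x doubles at launch.

__global__ void feedforward_mass( const double *prev_outputs, const double *weights,
                                  double *outputs, int iw_offset, int io_offset,
                                  int io_prev_offset, int prev_size )
{
    extern __shared__ double partial[];              // one slot per thread in the block
    int j = blockIdx.x;                              // neuron index within the layer
    int k = threadIdx.x;                             // incoming-weight index for that neuron

    int w0 = iw_offset + j * ( prev_size + 1 );      // assumed layout: prev_size weights + 1 bias
    partial[k] = ( k < prev_size )
               ? prev_outputs[io_prev_offset + k] * weights[w0 + k]
               : 0.0;
    __syncthreads();

    // Tree reduction across the block to form the weighted sum.
    for ( int stride = blockDim.x / 2; stride > 0; stride /= 2 ) {
        if ( k < stride ) partial[k] += partial[k + stride];
        __syncthreads();
    }

    if ( k == 0 )
        outputs[io_offset + j] = 1.0 / ( 1.0 + exp( -( partial[0] + weights[w0 + prev_size] ) ) );
}

// Host side (hypothetical): one block per neuron in the layer.
// feedforward_mass<<< layer_size, block_threads, block_threads * sizeof( double ) >>>( ... );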

Below is the runtime comparison of the vectorized simple and vectorized mass versions.

Clearly, from the chart, vectorized mass was a complete success. It outperforms vectorized simple, and the gap grows with the size of the neural network.

The next and final vector version, vectorized kcm mass, is meant to combine what was learned from vectorized mass with what was learned from vectorized kcm. Essentially, it is the vectorized mass version with the weight adjustment steps combined. This version, though only a modest improvement upon vectorized mass, was the version that ultimately performed the best. Below is the runtime comparison of the vectorized mass and vectorized kcm mass versions.

Now that all of the versions have been compared locally, below is a global comparison of all versions.

Again, all of the vector versions outperform the sequential version. The versions that employ massive parallelism outperform all comers. Below is a chart that compares the speedups offered by the various vector versions.

As expected, the chart shows that the versions employing massive parallelism enjoy the largest speedups over the sequential version. In fact, on Net 8, vectorized mass and vectorized kcm mass offer more than a 20-times speedup.
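
For reference, the speedup and efficiency figures discussed here follow the usual definitions: speedup is sequential runtime divided by parallel runtime, and efficiency is speedup divided by the number of processing elements committed. The small helper below is only an illustration with placeholder numbers; the report does not show the exact runtimes or PE counts behind its charts.

#include <cstdio>

int main()
{
    double t_sequential = 100.0;   // placeholder runtime (seconds), not a measured value
    double t_parallel   = 4.8;     // placeholder runtime (seconds), not a measured value
    int    pe_count     = 448;     // placeholder: number of cores assumed committed

    double speedup    = t_sequential / t_parallel;
    double efficiency = speedup / pe_count;

    printf( "Speedup: %.2fx, Efficiency: %.4f\n", speedup, efficiency );
    return 0;
}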

Finally, now that the runtimes and speedups of the vector versions are known, it is worth noting how efficiently each uses the parallel resources of the GPU. Below is a chart that compares the efficiencies of the various versions.

As is obvious from the chart, the vectorized kcm and vectorized simple versions are the most efficient; but, of course, this comes with a smaller speedup. The vectorized mass and vectorized kcm mass versions are the least efficient but offer the most significant speedups. As is typical in parallel processing, the commitment of more resources brings more speed.

Overall, the project was a success. Neural networks can be effectively processed on GPUs. Furthermore, not only can they be processed on GPUs, it appears to be desirable to do so. GPUs offer very significant speedups over sequential processing. Down the road, one can imagine, for very large networks, using MPI to distribute portions of the network to various nodes. However, instead of simply passing the network portions off to the cores on each node, perhaps it would be more desirable to pass them off to the various GPUs on each node.

Compiling and Running Instructions

Compiling

To compile the sequential version, execute the following:

g++ RunNNetwork.cpp NNetworkUtils.cpp NNetwork.cpp -o RunNNetwork

To compile any of the vector versions, execute the following:

nvcc -arch sm_20 RunNNetwork.cu NNetworkCuda.cu NNetworkUtils.cpp NNetwork.cu -o RunNNetwork

Note that the architecture switch is specified because doubles are used and because it makes placing printf statements in kernel code possible.

8 Running Whether running the sequential or one of the vector versions, two arguments are required. One is a configuration file and the other is a test file. The configuration file contains the information necessary for building and training a neural network. The test file contains the information necessary for testing the neural network. To run the sequential version, execute the following: bpsh <node> <path to>/runnnetwork <path to>/network_config.cfg <path to>/network_test.tst Below is a good example: bpsh 6 /home/derek.kern/csc7551/project/sequential/runnnetwork /home/derek.kern/csc7551/project/ nnetwork1.cfg /home/derek.kern/csc7551/project/nnetwork1.tst Running the vector versions requires a node with a GPU. Also, all of the vector versions take a final optional argument: GPU number. This allows the parallel code to be run on either GPU #0 or GPU #1 on the respective node. To run the sequential version, execute the following: bpsh <node> <path to>/runnnetwork <path to>/network_config.cfg <path to>/network_test.tst <gpu #> Below is a good example: bpsh 14 /home/derek.kern/csc7551/project/vectorized_simple/runnnetwork /home/derek.kern/csc7551/project/ nnetwork1.cfg /home/derek.kern/csc7551/project/nnetwork1.tst 1 Code Sequential Version RunNNetwork.cpp #include "NNetwork.h" #include "NNetworkUtils.h" bool check_command_line( int argc, char* argv[] ) { Make sure that the correct arguments were passed. FILE *fp = NULL; bool ok = true; if( argc < 3 ) { cout << "Format: RunNNetwork <network configuration file> <network test file>" << endl; cout << "Arguments:" << endl; cout << " network configuration file - This file should contain parameters for" << endl; cout << " network size, training rate, etc as " << endl; cout << " a set of data to train the network" << endl; cout << " network test file - This file should contain data for testing the " << endl; cout << " network after it has been trained" << endl; ok = false; else { Make sure that the configuration file exists. if( fp = fopen( argv[1], "r" ) ) { fclose( fp else { cout << "Specified network configuration file [" << argv[1] << "] doesn't exist or cannot be opened" << endl; ok = false; Make sure that the test file exists if( fp = fopen( argv[2], "r" ) ) { 8

9 fclose( fp else { cout << "Specified network test file [" << argv[2] << "] doesn't exist or cannot be opened" << endl; ok = false; return ok; int main( int argc, char* argv[] ) { Main function for running the network First make sure that the user has provided the necessary input. if (!check_command_line( argc, argv ) ) { return 1; Read in the network configuration. NNetworkConfig nnc = read_network_configuration( argv[1] Read in the network tests. TestInputs tests = read_network_tests( argv[2], nnc->layer_config->input_layer_size(), nnc->layer_config->output_layer_size() Build the neural network. NeuralNetwork net = build_neural_network( nnc->layer_config Initialize the network to begin with. initialize_neural_network( net Train the network. do_network_training( net, nnc->tests, nnc->params Test the network and report on results. cout << "Applying test data to network:" << endl; apply_network_tests( net, tests Free up the memory associated with the neural network. destroy_neural_network( net free( net return 0; NNetworkUtils.h #ifndef nnetworkutils_h #define nnetworkutils_h #include <stdlib.h> #include <string.h> #define LINE_SIZE 1024 NNetworkConfig read_network_configuration( char *config_filename TestInputs read_network_tests( char *test_filename, int input_layer_size, int output_layer_size 9

10 TestInputs _read_network_tests( FILE *test_file, int input_layer_size, int output_layer_size #endif NNetworkUtils.cpp #include "NNetwork.h" NeuralNetwork build_neural_network( NetworkLayerConfig layer_config ) { Build the neural network that corresponds to the layer configuration. NeuralNetwork net = (NeuralNetwork) malloc( sizeof( struct NeuralNetwork ) int total_neurons_needed = layer_config->total_neurons_needed_for_network( int total_weights_needed = layer_config->total_neuron_weights_needed_for_network( Setup the basic layer layout. net->layer_count = layer_config->layer_count; Copy the sizes of the layers. net->layer_sizes = (int*) malloc( sizeof( int ) * net->layer_count for( int i = 0; i < net->layer_count; i++ ) { net->layer_sizes[i] = layer_config->layer_sizes[i]; Setup the memory for the neuronal weights Total weight slots needed is given by the following: Sum from i to layer_count: layer_sizes[i - 1] * layer_sizes[i] See total_neuron_weights_needed_for_network() for details. net->weights = (double*) malloc( sizeof( double ) * total_weights_needed Setup the memory for caching of the neuronal weights. Total weight slots needed is given by the following: Sum from i to layer_count: layer_sizes[i - 1] * layer_sizes[i] See total_neuron_weights_needed_for_network() for details. net->cached_weights = (double*) malloc( sizeof( double ) * total_weights_needed Setup the memory of the outputs of the neurons. Total output slots needed is given by the following: Sum from i to layer_count: layer_sizes[i] See total_neurons_needed_for_network() for details net->outputs = (double*) malloc( sizeof( double ) * total_neurons_needed Setup the memory of the errors of the neurons. Total error slots needed is given by the following: Sum from i to layer_count: layer_sizes[i] See total_neurons_needed_for_network() for details net->errors = (double*) malloc( sizeof( double ) * total_neurons_needed return net; void destroy_neural_network( NeuralNetwork net ) { Free memory from the network network. 10

11 Delete the memory for the neuron weights. free( net->weights Delete the memory for the weight caching. free( net->cached_weights Delete the memory for the neuron outputs. free( net->outputs Delete the memory for the error (differences). free( net->errors Finally, clear out the layer sizes. free( net->layer_sizes void initialize_neural_network( NeuralNetwork net ) { Initial the weights of the network with random values and zero out the cache. int i_offset, j_offset; Seed the random number generator. srand( (unsigned) time( NULL ) Set the neuronal weights to random values. for( int i = 1; i < net->layer_count; i++ ) { i_offset = net->total_neuron_weights_before_layer( i for( int j = 0; j < net->layer_sizes[i]; j++ ) { This is the total number of weights in this layer prior to this neuron. j_offset = j * net->layer_sizes[i - 1]; for( int k = 0; k < net->layer_sizes[i - 1] + 1; k++ ) { net->weights[i_offset + j_offset + k] = (double) ( rand() ) / ( RAND_MAX / 2 ) - 1; Zero out the weight cache. for( int i = 1; i < net->layer_count; i++ ) { i_offset = net->total_neuron_weights_before_layer( i for( int j = 0; j < net->layer_sizes[i]; j++ ) { This is the total number of weights in this layer prior to this neuron. j_offset = j * net->layer_sizes[i - 1]; for( int k = 0; k < net->layer_sizes[i - 1] + 1; k++ ) { net->cached_weights[i_offset + j_offset + k] = 0.0f; 11

12 void feedforward( NeuralNetwork net, double *inputs ) { Feed the inputs forward through the neural network until the outpus are determined. double weighted_sum; Start by putting the inputs onto the input layer. for( int j = 0; j < net->layer_sizes[0]; j++ ) { net->outputs[0 + j] = inputs[j]; Now ripple the effect of the input across the layers. for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output offsets int iw_offset = net->total_neuron_weights_before_layer( i int io_offset = net->total_neurons_before_layer( i int io_prev_offset = net->total_neurons_before_layer( i - 1 Apply the result to each neuron in the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_feedforward( i, net->outputs, net->weights, iw_offset, io_offset, io_prev_offset, net- >layer_sizes[i - 1], j void kernel_feedforward( int layer_number, double *outputs, double *weights, int iw_offset, int io_offset, int io_prev_offset, int prev_layer_size, int j ) { Do the feedforward, but model it for kernel computation. double weighted_sum; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; Reset the sum. weighted_sum = 0.0f; Sum the outputs from the previous layer, adjusted by the connection weights. for( int k = 0; k < prev_layer_size; k++ ) { weighted_sum += outputs[io_prev_offset + k] * weights[iw_offset + jw_offset + k]; Now, for this neuron, set the output. outputs[io_offset + j] = calculate_sigmoid( weighted_sum + weights[iw_offset + jw_offset + prev_layer_size] void backpropogate( NeuralNetwork net, double *inputs, double *desired_outputs, TrainingParameters params ) { 12

13 Feed the inputs forward through the neural network until the outpus are determined. Afterwards, turn around and neuro-connection weights so that they more reliably produce the desired output. double weighted_sum; Start by feeding forward the input values. This will put values onto the output nodes. We can then compare these to the desired values and backpropogate the changes. feedforward( net, inputs Calculate the error values for the output layer. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 for( int j = 0; j < net->layer_sizes[net->layer_count - 1]; j++ ) { net->errors[i_offset + j] = ( net->outputs[i_offset + j] * ( 1 - net->outputs[i_offset + j] ) * ( desired_outputs[j] - net->outputs[i_offset + j] ) Calculate the error values for the hidden layers. for( int i = net->layer_count - 2; i > 0; i-- ) { Figure out layer-based weight and output/error offsets int iw_next_offset = net->total_neuron_weights_before_layer( i + 1 int io_offset = net->total_neurons_before_layer( i int io_next_offset = net->total_neurons_before_layer( i + 1 Calculate the error for each neuron in the layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_backfeed_errors( i, net->outputs, net->weights, net->errors, iw_next_offset, io_offset, io_next_offset, net->layer_sizes[i], net->layer_sizes[i + 1], j Adjust the weights according to the learning momentum for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output/error offsets int iw_offset = net->total_neuron_weights_before_layer( i Adjust the weight for each neuron within the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_apply_momentum( i, net->weights, net->cached_weights, params->learning_momentum, iw_offset, net->layer_sizes[i - 1], j Adjust weights according to the learning rate. Also, cache the weights. 13

14 for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output/error offsets int iw_offset = net->total_neuron_weights_before_layer( i int io_offset = net->total_neurons_before_layer( i int io_prev_offset = net->total_neurons_before_layer( i - 1 Adjust the weight for each neuron within the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_apply_rate( i, net->weights, net->cached_weights, net->outputs, net->errors, params->learning_rate, io_offset, io_prev_offset, iw_offset, net->layer_sizes[i - 1], j void kernel_backpropogation_backfeed_errors( int layer_number, double *outputs, double *weights, double *errors, int iw_next_offset, int io_offset, int io_next_offset, int current_layer_size, int next_layer_size, int j ) { Do the backfeed of errors, but model it for kernel computation. double weighted_sum = 0.0f; Sum the weighted errors from the layer after the current one. for( int k = 0; k < next_layer_size; k++ ) { Figure out the neuron-based weight offset int kw_offset = k * current_layer_size; weighted_sum += errors[io_next_offset + k] * weights[iw_next_offset + j + kw_offset]; Set the error. errors[io_offset + j] = outputs[io_offset + j] * ( 1 - outputs[io_offset + j] ) * weighted_sum; void kernel_backpropogation_apply_momentum( int layer_number, double *weights, double *cached_weights, double learning_momentum, int iw_offset, int prev_layer_size, int j ) { Apply the momentum to the weights, but model it for kernel computation. double weighted_sum = 0.0f; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; for( int k = 0; k < prev_layer_size; k++ ) { weights[iw_offset + jw_offset + k] += ( learning_momentum * cached_weights[iw_offset + jw_offset + k] void kernel_backpropogation_apply_rate( int layer_number, double *weights, double *cached_weights, double *outputs, double *errors, double learning_rate, int io_offset, int io_prev_offset, int iw_offset, int prev_layer_size, int j ) { 14

15 Apply the momentum to the weights, but model it for kernel computation. double weighted_sum = 0.0f; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; for( int k = 0; k < prev_layer_size; k++ ) { cached_weights[iw_offset + jw_offset + k]= ( learning_rate * errors[io_offset + j] * outputs[io_prev_offset + k] weights[iw_offset + jw_offset + k] += cached_weights[iw_offset + jw_offset + k]; double calculate_sigmoid( double value ) { Calculate the sigmoid function for the value. return (double) ( 1 / ( 1 + exp( -value ) ) double get_mean_square_error( NeuralNetwork net, double *desired_outputs ) { Get the mean square error of the network based upon the desired outputs. double error = 0; Sum the error up from the output layer int i_offset = net->total_neurons_before_layer( net->layer_count - 1 for( int j = 0; j < net->layer_sizes[net->layer_count - 1]; j++ ) { error += ( ( desired_outputs[j] - net->outputs[i_offset + j] ) * ( desired_outputs[j] - net->outputs[i_offset + j] ) return error / 2; double get_output_value( NeuralNetwork net, int index ) { Return the specified output value from the network. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 return net->outputs[i_offset + index]; int get_rounded_output_value( NeuralNetwork net, int index ) { Return the specified output value from the network, but rounded into an integer. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 return (int) floor( net->outputs[i_offset + index] double do_network_training( NeuralNetwork net, TestInputs tests, TrainingParameters params ) { Iteratively train the neural network and report on the progress. double error = 0.0f; long iteration = 0, total_iterations = 0; float backprop_runtime, total_backprop_runtime = 0, runtime, total_runtime = 0; cout << endl << "Training the network:" << endl; for ( iteration = 0; iteration < params->training_max_iterations ; iteration++ ) { runtime = ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) 15

16 Setup to record the time total_iterations += 1; Train through backpropogation backprop_runtime = ( clock() / (float)( CLOCKS_PER_SEC / 1000 ) backpropogate( net, tests->input_values[iteration % tests->test_count], tests->desired_output_values[iteration % tests->test_count], params total_backprop_runtime += ( ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) ) - backprop_runtime How bad is the error? error = get_mean_square_error( net, tests->desired_output_values[iteration % tests->test_count] if( error < params->training_threshold ) { cout << "Network has been trained. It took " << iteration << " iterations." << endl; cout << "Final error is " << error << endl << endl; break; Report on the training process. if ( iteration % ( params->training_max_iterations / 10 ) == 0 ) { cout << "Current error is " << error << ". Continuing with training..." << endl; Add to the total runtime total_runtime += ( ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) ) - runtime if ( iteration == params->training_max_iterations ) { error = get_mean_square_error( net, tests->desired_output_values[(iteration - 1) % tests->test_count] cout << "Maximum of " << iteration << " iterations completed with error of " << error << endl; Write out the time for backpropogation. cout << endl << "Total time in backpropogation: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_backprop_runtime / 1000 ) << " seconds" << endl; cout << "Average time per backpropogation: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_backprop_runtime / total_iterations ) << " milliseconds" << endl << endl; cout << "Total time iterating: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_runtime / 1000 ) << " seconds" << endl; cout << "Average time per iteration: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_runtime / total_iterations ) << " milliseconds" << endl << endl; void apply_network_tests( NeuralNetwork net, TestInputs tests ) { Apply the tests to the neural network. Report on the success failure. int total_iterations = 0; float feedforward_runtime, total_feedforward_runtime = 0, runtime, total_runtime = 0; for ( int test_index = 0; test_index < tests->test_count; test_index++ ) { runtime = ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) Setup to record the time total_iterations += 1; Start by feeding forward the provided test inputs. feedforward_runtime = ( clock() / (float)( CLOCKS_PER_SEC / 1000 ) 16

17 feedforward( net, tests->input_values[test_index] total_feedforward_runtime += ( ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) ) - feedforward_runtime Now, report what the expected output is. cout << "For test input " << ( test_index + 1 ) << endl; cout << " Expected = "; for( int i = 0; i < tests->output_value_size; i++ ) { cout << (int) tests->desired_output_values[test_index][i]; cout << endl; Finally, report what the actual output was. cout << " Received = "; for( int i = 0; i < tests->output_value_size; i++ ) { cout << get_rounded_output_value( net, i cout << endl << endl; total_runtime += ( ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) ) - runtime Write out the time for feedforward. cout << endl << "Total time in feedforward: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_feedforward_runtime / 1000 ) << " seconds " << endl; cout << "Average time per feedforward: " << setiosflags( ios::fixed ) << setprecision( 9 ) << ( total_feedforward_runtime / total_iterations ) << " milliseconds " << endl << endl; cout << "Total time iterating: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_runtime / 1000 ) << " seconds" << endl; cout << "Average time per iteration: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_runtime / total_iterations ) << " milliseconds" << endl << endl; NNetwork.h #ifndef nnetwork_h #define nnetwork_h #include <assert.h> #include <iostream> #include <iomanip> #include <stdio.h> #include <math.h> #include <time.h> using namespace std; typedef struct NeuralNetwork { These variables will hold information about the layers int layer_count; int *layer_sizes; This will hold the weights of the neurons. Used to be a double***. double *weights; This will preserve weights for later use. Used to be a double***. double *cached_weights; This will hold the output for the neurons. Used to be a double**. 17

18 double *outputs; This will hold the difference between the target training values and the current outputs. Used to be a double**. double *errors; int input_layer_size() { return layer_sizes[0]; int output_layer_size() { return layer_sizes[layer_count - 1]; int total_neurons_in_network() { int total = 0; for( int i = 0; i < layer_count; i++ ) total += layer_sizes[i]; return total; int total_neurons_before_layer( int layer_number ) { int total = 0; for( int i = 0; i < layer_number; i++ ) total += layer_sizes[i]; return total; int total_neuron_weights_in_network() { int total = 0; for( int i = 1; i < layer_count; i++ ) { total += ( layer_sizes[i - 1] * layer_sizes[i] return total; int total_neuron_weights_before_layer( int layer_number ) { int total = 0; for( int i = 1; i < layer_number; i++ ) { total += ( layer_sizes[i - 1] * layer_sizes[i] return total; *NeuralNetwork; typedef struct TrainingParameters { This setting determines how quickly the network will learn. double learning_rate; This setting determines the momentum of learning. double learning_momentum; This setting determines the point where the network is finished learning. double training_threshold; This setting determines the maximum number of iterations to train. long training_max_iterations; *TrainingParameters; typedef struct TestInput { 18

19 This will hold input values for this training input double **input_values; This will hold desired output values for this training input. double **desired_output_values; This will hold the number of tests stored. int test_count; This will hold the number of values that are stored each of the input and output values vector. int input_value_size; int output_value_size; *TestInputs; typedef struct NetworkLayerConfig { This will hold details about the network config. int layer_sizes[100]; int layer_count; int input_layer_size() { return layer_sizes[0]; int output_layer_size() { return layer_sizes[layer_count - 1]; int total_neurons_needed_for_network() { int total = 0; for( int i = 0; i < layer_count; i++ ) total += layer_sizes[i]; return total; int total_neuron_weights_needed_for_network() { int total = 0; for( int i = 1; i < layer_count; i++ ) { total += ( ( layer_sizes[i - 1] + 1 ) * layer_sizes[i] return total; *NetworkLayerConfig; typedef struct NNetworkConfig { This will hold onto the layer configuration. NetworkLayerConfig layer_config; This will hold onto training parameters. TrainingParameters params; This will hold onto training inputs. TestInputs tests; *NNetworkConfig; Function prototypes NeuralNetwork build_neural_network( NetworkLayerConfig layer_config void initialize_neural_network( NeuralNetwork net 19

20 void destroy_neural_network( NeuralNetwork net void feedforward( NeuralNetwork net, double *inputs void backpropogate( NeuralNetwork net, double *inputs, double *desired_outputs, TrainingParameters params double calculate_sigmoid( double value double get_mean_square_error( NeuralNetwork net, double *desired_outputs double get_output_value( NeuralNetwork net, int index int get_rounded_output_value( NeuralNetwork net, int index double do_network_training( NeuralNetwork net, TestInputs tests, TrainingParameters params void apply_network_tests( NeuralNetwork net, TestInputs tests void kernel_feedforward( int layer_number, double *outputs, double *weights, int iw_offset, int io_offset, int io_prev_offset, int prev_layer_size, int j void kernel_backpropogation_backfeed_errors( int layer_number, double *outputs, double *weights, double *errors, int iw_next_offset, int io_offset, int io_next_offset, int current_layer_size, int next_layer_size, int j void kernel_backpropogation_apply_momentum( int layer_number, double *weights, double *cached_weights, double learning_momentum, int iw_offset, int prev_layer_size, int j void kernel_backpropogation_apply_rate( int layer_number, double *weights, double *cached_weights, double *outputs, double *errors, double learning_rate, int io_offset, int io_prev_offset, int iw_offset, int prev_layer_size, int j #endif NNetwork.cpp #include "NNetwork.h" NeuralNetwork build_neural_network( NetworkLayerConfig layer_config ) { Build the neural network that corresponds to the layer configuration. NeuralNetwork net = (NeuralNetwork) malloc( sizeof( struct NeuralNetwork ) int total_neurons_needed = layer_config->total_neurons_needed_for_network( int total_weights_needed = layer_config->total_neuron_weights_needed_for_network( Setup the basic layer layout. net->layer_count = layer_config->layer_count; Copy the sizes of the layers. net->layer_sizes = (int*) malloc( sizeof( int ) * net->layer_count for( int i = 0; i < net->layer_count; i++ ) { net->layer_sizes[i] = layer_config->layer_sizes[i]; Setup the memory for the neuronal weights Total weight slots needed is given by the following: Sum from i to layer_count: layer_sizes[i - 1] * layer_sizes[i] See total_neuron_weights_needed_for_network() for details. net->weights = (double*) malloc( sizeof( double ) * total_weights_needed Setup the memory for caching of the neuronal weights. Total weight slots needed is given by the following: Sum from i to layer_count: layer_sizes[i - 1] * layer_sizes[i] See total_neuron_weights_needed_for_network() for details. net->cached_weights = (double*) malloc( sizeof( double ) * total_weights_needed Setup the memory of the outputs of the neurons. 20

21 Total output slots needed is given by the following: Sum from i to layer_count: layer_sizes[i] See total_neurons_needed_for_network() for details net->outputs = (double*) malloc( sizeof( double ) * total_neurons_needed Setup the memory of the errors of the neurons. Total error slots needed is given by the following: Sum from i to layer_count: layer_sizes[i] See total_neurons_needed_for_network() for details net->errors = (double*) malloc( sizeof( double ) * total_neurons_needed return net; void destroy_neural_network( NeuralNetwork net ) { Free memory from the network network. Delete the memory for the neuron weights. free( net->weights Delete the memory for the weight caching. free( net->cached_weights Delete the memory for the neuron outputs. free( net->outputs Delete the memory for the error (differences). free( net->errors Finally, clear out the layer sizes. free( net->layer_sizes void initialize_neural_network( NeuralNetwork net ) { Initial the weights of the network with random values and zero out the cache. int i_offset, j_offset; Seed the random number generator. srand( (unsigned) time( NULL ) Set the neuronal weights to random values. for( int i = 1; i < net->layer_count; i++ ) { i_offset = net->total_neuron_weights_before_layer( i for( int j = 0; j < net->layer_sizes[i]; j++ ) { This is the total number of weights in this layer prior to this neuron. j_offset = j * net->layer_sizes[i - 1]; 21

22 for( int k = 0; k < net->layer_sizes[i - 1] + 1; k++ ) { net->weights[i_offset + j_offset + k] = (double) ( rand() ) / ( RAND_MAX / 2 ) - 1; Zero out the weight cache. for( int i = 1; i < net->layer_count; i++ ) { i_offset = net->total_neuron_weights_before_layer( i for( int j = 0; j < net->layer_sizes[i]; j++ ) { This is the total number of weights in this layer prior to this neuron. j_offset = j * net->layer_sizes[i - 1]; for( int k = 0; k < net->layer_sizes[i - 1] + 1; k++ ) { net->cached_weights[i_offset + j_offset + k] = 0.0f; void feedforward( NeuralNetwork net, double *inputs ) { Feed the inputs forward through the neural network until the outpus are determined. double weighted_sum; Start by putting the inputs onto the input layer. for( int j = 0; j < net->layer_sizes[0]; j++ ) { net->outputs[0 + j] = inputs[j]; Now ripple the effect of the input across the layers. for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output offsets int iw_offset = net->total_neuron_weights_before_layer( i int io_offset = net->total_neurons_before_layer( i int io_prev_offset = net->total_neurons_before_layer( i - 1 Apply the result to each neuron in the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_feedforward( i, net->outputs, net->weights, iw_offset, io_offset, io_prev_offset, net- >layer_sizes[i - 1], j void kernel_feedforward( int layer_number, double *outputs, double *weights, int iw_offset, int io_offset, int io_prev_offset, int prev_layer_size, int j ) { Do the feedforward, but model it for kernel computation. double weighted_sum; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; 22

23 Reset the sum. weighted_sum = 0.0f; Sum the outputs from the previous layer, adjusted by the connection weights. for( int k = 0; k < prev_layer_size; k++ ) { weighted_sum += outputs[io_prev_offset + k] * weights[iw_offset + jw_offset + k]; Now, for this neuron, set the output. outputs[io_offset + j] = calculate_sigmoid( weighted_sum + weights[iw_offset + jw_offset + prev_layer_size] void backpropogate( NeuralNetwork net, double *inputs, double *desired_outputs, TrainingParameters params ) { Feed the inputs forward through the neural network until the outpus are determined. Afterwards, turn around and neuro-connection weights so that they more reliably produce the desired output. double weighted_sum; Start by feeding forward the input values. This will put values onto the output nodes. We can then compare these to the desired values and backpropogate the changes. feedforward( net, inputs Calculate the error values for the output layer. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 for( int j = 0; j < net->layer_sizes[net->layer_count - 1]; j++ ) { net->errors[i_offset + j] = ( net->outputs[i_offset + j] * ( 1 - net->outputs[i_offset + j] ) * ( desired_outputs[j] - net->outputs[i_offset + j] ) Calculate the error values for the hidden layers. for( int i = net->layer_count - 2; i > 0; i-- ) { Figure out layer-based weight and output/error offsets int iw_next_offset = net->total_neuron_weights_before_layer( i + 1 int io_offset = net->total_neurons_before_layer( i int io_next_offset = net->total_neurons_before_layer( i + 1 Calculate the error for each neuron in the layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_backfeed_errors( i, net->outputs, net->weights, net->errors, iw_next_offset, io_offset, io_next_offset, net->layer_sizes[i], net->layer_sizes[i + 1], j 23

24 Adjust the weights according to the learning momentum for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output/error offsets int iw_offset = net->total_neuron_weights_before_layer( i Adjust the weight for each neuron within the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_apply_momentum( i, net->weights, net->cached_weights, params->learning_momentum, iw_offset, net->layer_sizes[i - 1], j Adjust weights according to the learning rate. Also, cache the weights. for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output/error offsets int iw_offset = net->total_neuron_weights_before_layer( i int io_offset = net->total_neurons_before_layer( i int io_prev_offset = net->total_neurons_before_layer( i - 1 Adjust the weight for each neuron within the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_apply_rate( i, net->weights, net->cached_weights, net->outputs, net->errors, params->learning_rate, io_offset, io_prev_offset, iw_offset, net->layer_sizes[i - 1], j void kernel_backpropogation_backfeed_errors( int layer_number, double *outputs, double *weights, double *errors, int iw_next_offset, int io_offset, int io_next_offset, int current_layer_size, int next_layer_size, int j ) { Do the backfeed of errors, but model it for kernel computation. double weighted_sum = 0.0f; Sum the weighted errors from the layer after the current one. for( int k = 0; k < next_layer_size; k++ ) { Figure out the neuron-based weight offset int kw_offset = k * current_layer_size; weighted_sum += errors[io_next_offset + k] * weights[iw_next_offset + j + kw_offset]; Set the error. errors[io_offset + j] = outputs[io_offset + j] * ( 1 - outputs[io_offset + j] ) * weighted_sum; void kernel_backpropogation_apply_momentum( int layer_number, double *weights, double *cached_weights, double learning_momentum, int iw_offset, 24

25 int prev_layer_size, int j ) { Apply the momentum to the weights, but model it for kernel computation. double weighted_sum = 0.0f; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; for( int k = 0; k < prev_layer_size; k++ ) { weights[iw_offset + jw_offset + k] += ( learning_momentum * cached_weights[iw_offset + jw_offset + k] void kernel_backpropogation_apply_rate( int layer_number, double *weights, double *cached_weights, double *outputs, double *errors, double learning_rate, int io_offset, int io_prev_offset, int iw_offset, int prev_layer_size, int j ) { Apply the momentum to the weights, but model it for kernel computation. double weighted_sum = 0.0f; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; for( int k = 0; k < prev_layer_size; k++ ) { cached_weights[iw_offset + jw_offset + k]= ( learning_rate * errors[io_offset + j] * outputs[io_prev_offset + k] weights[iw_offset + jw_offset + k] += cached_weights[iw_offset + jw_offset + k]; double calculate_sigmoid( double value ) { Calculate the sigmoid function for the value. return (double) ( 1 / ( 1 + exp( -value ) ) double get_mean_square_error( NeuralNetwork net, double *desired_outputs ) { Get the mean square error of the network based upon the desired outputs. double error = 0; Sum the error up from the output layer int i_offset = net->total_neurons_before_layer( net->layer_count - 1 for( int j = 0; j < net->layer_sizes[net->layer_count - 1]; j++ ) { error += ( ( desired_outputs[j] - net->outputs[i_offset + j] ) * ( desired_outputs[j] - net->outputs[i_offset + j] ) return error / 2; double get_output_value( NeuralNetwork net, int index ) { Return the specified output value from the network. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 return net->outputs[i_offset + index]; 25

26 int get_rounded_output_value( NeuralNetwork net, int index ) { Return the specified output value from the network, but rounded into an integer. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 return (int) floor( net->outputs[i_offset + index] double do_network_training( NeuralNetwork net, TestInputs tests, TrainingParameters params ) { Iteratively train the neural network and report on the progress. double error = 0.0f; long iteration = 0, total_iterations = 0; float backprop_runtime, total_backprop_runtime = 0, runtime, total_runtime = 0; cout << endl << "Training the network:" << endl; for ( iteration = 0; iteration < params->training_max_iterations ; iteration++ ) { runtime = ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) Setup to record the time total_iterations += 1; Train through backpropogation backprop_runtime = ( clock() / (float)( CLOCKS_PER_SEC / 1000 ) backpropogate( net, tests->input_values[iteration % tests->test_count], tests->desired_output_values[iteration % tests->test_count], params total_backprop_runtime += ( ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) ) - backprop_runtime How bad is the error? error = get_mean_square_error( net, tests->desired_output_values[iteration % tests->test_count] if( error < params->training_threshold ) { cout << "Network has been trained. It took " << iteration << " iterations." << endl; cout << "Final error is " << error << endl << endl; break; Report on the training process. if ( iteration % ( params->training_max_iterations / 10 ) == 0 ) { cout << "Current error is " << error << ". Continuing with training..." << endl; Add to the total runtime total_runtime += ( ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) ) - runtime if ( iteration == params->training_max_iterations ) { error = get_mean_square_error( net, tests->desired_output_values[(iteration - 1) % tests->test_count] cout << "Maximum of " << iteration << " iterations completed with error of " << error << endl; Write out the time for backpropogation. cout << endl << "Total time in backpropogation: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_backprop_runtime / 1000 ) << " seconds" << endl; cout << "Average time per backpropogation: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_backprop_runtime / total_iterations ) << " milliseconds" << endl << endl; cout << "Total time iterating: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_runtime / 1000 ) << " seconds" << endl; 26

27 cout << "Average time per iteration: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_runtime / total_iterations ) << " milliseconds" << endl << endl; void apply_network_tests( NeuralNetwork net, TestInputs tests ) { Apply the tests to the neural network. Report on the success failure. int total_iterations = 0; float feedforward_runtime, total_feedforward_runtime = 0, runtime, total_runtime = 0; for ( int test_index = 0; test_index < tests->test_count; test_index++ ) { runtime = ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) Setup to record the time total_iterations += 1; Start by feeding forward the provided test inputs. feedforward_runtime = ( clock() / (float)( CLOCKS_PER_SEC / 1000 ) feedforward( net, tests->input_values[test_index] total_feedforward_runtime += ( ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) ) - feedforward_runtime Now, report what the expected output is. cout << "For test input " << ( test_index + 1 ) << endl; cout << " Expected = "; for( int i = 0; i < tests->output_value_size; i++ ) { cout << (int) tests->desired_output_values[test_index][i]; cout << endl; Finally, report what the actual output was. cout << " Received = "; for( int i = 0; i < tests->output_value_size; i++ ) { cout << get_rounded_output_value( net, i cout << endl << endl; total_runtime += ( ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) ) - runtime Write out the time for feedforward. cout << endl << "Total time in feedforward: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_feedforward_runtime / 1000 ) << " seconds " << endl; cout << "Average time per feedforward: " << setiosflags( ios::fixed ) << setprecision( 9 ) << ( total_feedforward_runtime / total_iterations ) << " milliseconds " << endl << endl; cout << "Total time iterating: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_runtime / 1000 ) << " seconds" << endl; cout << "Average time per iteration: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_runtime / total_iterations ) << " milliseconds" << endl << endl; Common To All Vector Versions RunNNetwork.cu #include "NNetwork.h" #include "NNetworkUtils.h" #include "NNetworkCuda.h" bool check_command_line( int argc, char* argv[] ) { Make sure that the correct arguments were passed. 27

28 FILE *fp = NULL; bool ok = true; if( argc < 3 ) { cout << "Format: RunNNetwork <network configuration file> <network test file>" << endl; cout << "Arguments:" << endl; cout << " network configuration file - This file should contain parameters for" << endl; cout << " network size, training rate, etc as " << endl; cout << " a set of data to train the network" << endl; cout << " network test file - This file should contain data for testing the " << endl; cout << " network after it has been trained" << endl; ok = false; else { Make sure that the configuration file exists. if( fp = fopen( argv[1], "r" ) ) { fclose( fp else { cout << "Specified network configuration file [" << argv[1] << "] doesn't exist or cannot be opened" << endl; ok = false; Make sure that the test file exists if( fp = fopen( argv[2], "r" ) ) { fclose( fp else { cout << "Specified network test file [" << argv[2] << "] doesn't exist or cannot be opened" << endl; ok = false; return ok; int main( int argc, char* argv[] ) { Main function for running the network First make sure that the user has provided the necessary input. if (!check_command_line( argc, argv ) ) { return 1; If we received a 3rd argument, then it must be the GPU number. if ( argc == 4 ) { Select the GPU that was called for. selectgpubynumber( argv[3] Make sure that CUDA resources get cleaned up on exit. atexit( cleanupcuda Read in the network configuration. NNetworkConfig nnc = read_network_configuration( argv[1] Read in the network tests. TestInputs tests = read_network_tests( argv[2], nnc->layer_config->input_layer_size(), 28

29 nnc->layer_config->output_layer_size() Build the neural network. NeuralNetwork net = build_neural_network( nnc->layer_config Initialize the network to begin with. initialize_neural_network( net Train the network. do_network_training( net, nnc->tests, nnc->params Test the network and report on results. cout << "Applying test data to network:" << endl; apply_network_tests( net, tests Free up the memory associated with the neural network. destroy_neural_network( net free( net return 0; NNetworkCuda.h #ifndef nnetworkcuda_h #define nnetworkcuda_h #include <stdio.h> #include <cuda.h> #define err ) ( HandleError( err, FILE, LINE ) ) Prototypes void HandleError( cudaerror_t err, const char *file, int line void checkcudaerror( const char *msg, bool exitonerror void selectgpubynumber( char *device_number void cleanupcuda( void #endif NNetworkCuda.cu #include "NNetworkCuda.h" void HandleError( cudaerror_t err, const char *file, int line ) { Handle and report on CUDA errors. if ( err!= cudasuccess ) { printf( "%s in %s at line %d\n", cudageterrorstring( err ), file, line exit( EXIT_FAILURE void checkcudaerror( const char *msg, bool exitonerror ) { Check cuda error and print result if appropriate. cudaerror_t err = cudagetlasterror( if( cudasuccess!= err) { fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudageterrorstring(err) if (exitonerror) { 29


Functions. Angela Chih-Wei Tang ( 唐之瑋 ) Department of Communication Engineering National Central University JhongLi, Taiwan. Functions Angela Chih-Wei Tang ( 唐之瑋 ) Department of Communication Engineering National Central University JhongLi, Taiwan 2009 Fall Outline 5.1 Introduction 5.3 Math Library Functions 5.4 Functions 5.5

More information

Multiple Choice (Questions 1 14) 28 Points Select all correct answers (multiple correct answers are possible)

Multiple Choice (Questions 1 14) 28 Points Select all correct answers (multiple correct answers are possible) Name Closed notes, book and neighbor. If you have any questions ask them. Notes: Segment of code necessary C++ statements to perform the action described not a complete program Program a complete C++ program

More information

Chapter 10 - Notes Applications of Arrays

Chapter 10 - Notes Applications of Arrays Chapter - Notes Applications of Arrays I. List Processing A. Definition: List - A set of values of the same data type. B. Lists and Arrays 1. A convenient way to store a list is in an array, probably a

More information

Multiple Choice (Questions 1 14) 28 Points Select all correct answers (multiple correct answers are possible)

Multiple Choice (Questions 1 14) 28 Points Select all correct answers (multiple correct answers are possible) Name Closed notes, book and neighbor. If you have any questions ask them. Notes: Segment of code necessary C++ statements to perform the action described not a complete program Program a complete C++ program

More information

Distributed Real-Time Control Systems. Lecture 17 C++ Programming Intro to C++ Objects and Classes

Distributed Real-Time Control Systems. Lecture 17 C++ Programming Intro to C++ Objects and Classes Distributed Real-Time Control Systems Lecture 17 C++ Programming Intro to C++ Objects and Classes 1 Bibliography Classical References Covers C++ 11 2 What is C++? A computer language with object oriented

More information

CSCI 171 Chapter Outlines

CSCI 171 Chapter Outlines Contents CSCI 171 Chapter 1 Overview... 2 CSCI 171 Chapter 2 Programming Components... 3 CSCI 171 Chapter 3 (Sections 1 4) Selection Structures... 5 CSCI 171 Chapter 3 (Sections 5 & 6) Iteration Structures

More information

Matlab? Chapter 3-4 Matlab and IPT Basics. Working Environment. Matlab Demo. Array. Data Type. MATLAB Desktop:

Matlab? Chapter 3-4 Matlab and IPT Basics. Working Environment. Matlab Demo. Array. Data Type. MATLAB Desktop: Matlab? Lecture Slides ME 4060 Machine Vision and Vision-based Control Chapter 3-4 Matlab and IPT Basics By Dr. Debao Zhou 1 MATric LABoratory data analysis, prototype and visualization Matrix operation

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail.

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. PCAP Assignment I 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. The multicore CPUs are designed to maximize the execution speed

More information

CS3157: Advanced Programming. Outline

CS3157: Advanced Programming. Outline CS3157: Advanced Programming Lecture #8 Feb 27 Shlomo Hershkop shlomo@cs.columbia.edu 1 Outline More c Preprocessor Bitwise operations Character handling Math/random Review for midterm Reading: k&r ch

More information

CSE123. Program Design and Modular Programming Functions 1-1

CSE123. Program Design and Modular Programming Functions 1-1 CSE123 Program Design and Modular Programming Functions 1-1 5.1 Introduction A function in C is a small sub-program performs a particular task, supports the concept of modular programming design techniques.

More information

1 PHASE1PRUNE INTRODUCTION 1

1 PHASE1PRUNE INTRODUCTION 1 1 PHASE1PRUNE INTRODUCTION 1 1. Introduction. Phase one of Kociemba s two-phase algorithm involves finding a sequence of moves that takes an arbitrary position into the H group, generated by U, F 2, R2,

More information

Pointers, Dynamic Data, and Reference Types

Pointers, Dynamic Data, and Reference Types Pointers, Dynamic Data, and Reference Types Review on Pointers Reference Variables Dynamic Memory Allocation The new operator The delete operator Dynamic Memory Allocation for Arrays 1 C++ Data Types simple

More information

CS 326 Operating Systems C Programming. Greg Benson Department of Computer Science University of San Francisco

CS 326 Operating Systems C Programming. Greg Benson Department of Computer Science University of San Francisco CS 326 Operating Systems C Programming Greg Benson Department of Computer Science University of San Francisco Why C? Fast (good optimizing compilers) Not too high-level (Java, Python, Lisp) Not too low-level

More information

Agenda. The main body and cout. Fundamental data types. Declarations and definitions. Control structures

Agenda. The main body and cout. Fundamental data types. Declarations and definitions. Control structures The main body and cout Agenda 1 Fundamental data types Declarations and definitions Control structures References, pass-by-value vs pass-by-references The main body and cout 2 C++ IS AN OO EXTENSION OF

More information

COMP322 - Introduction to C++ Lecture 02 - Basics of C++

COMP322 - Introduction to C++ Lecture 02 - Basics of C++ COMP322 - Introduction to C++ Lecture 02 - Basics of C++ School of Computer Science 16 January 2012 C++ basics - Arithmetic operators Where possible, C++ will automatically convert among the basic types.

More information

Programming. C++ Basics

Programming. C++ Basics Programming C++ Basics Introduction to C++ C is a programming language developed in the 1970s with the UNIX operating system C programs are efficient and portable across different hardware platforms C++

More information

CSE au Midterm Exam Nov. 2, 2018 Sample Solution

CSE au Midterm Exam Nov. 2, 2018 Sample Solution Question 1. (16 points) Build tools and make. We re building a C++ software back-end prototype for a new food web site. So far, we ve got the following source files with the code for two main programs

More information

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de December 15, 2015 CUDA Programming Fundamentals CUDA

More information

C Review. MaxMSP Developers Workshop Summer 2009 CNMAT

C Review. MaxMSP Developers Workshop Summer 2009 CNMAT C Review MaxMSP Developers Workshop Summer 2009 CNMAT C Syntax Program control (loops, branches): Function calls Math: +, -, *, /, ++, -- Variables, types, structures, assignment Pointers and memory (***

More information

Lecture 04 FUNCTIONS AND ARRAYS

Lecture 04 FUNCTIONS AND ARRAYS Lecture 04 FUNCTIONS AND ARRAYS 1 Motivations Divide hug tasks to blocks: divide programs up into sets of cooperating functions. Define new functions with function calls and parameter passing. Use functions

More information

C Functions. 5.2 Program Modules in C

C Functions. 5.2 Program Modules in C 1 5 C Functions 5.2 Program Modules in C 2 Functions Modules in C Programs combine user-defined functions with library functions - C standard library has a wide variety of functions Function calls Invoking

More information

Common Misunderstandings from Exam 1 Material

Common Misunderstandings from Exam 1 Material Common Misunderstandings from Exam 1 Material Kyle Dewey Stack and Heap Allocation with Pointers char c = c ; char* p1 = malloc(sizeof(char)); char** p2 = &p1; Where is c allocated? Where is p1 itself

More information

Main Program. C Programming Notes. #include <stdio.h> main() { printf( Hello ); } Comments: /* comment */ //comment. Dr. Karne Towson University

Main Program. C Programming Notes. #include <stdio.h> main() { printf( Hello ); } Comments: /* comment */ //comment. Dr. Karne Towson University C Programming Notes Dr. Karne Towson University Reference for C http://www.cplusplus.com/reference/ Main Program #include main() printf( Hello ); Comments: /* comment */ //comment 1 Data Types

More information

The output: The address of i is 0xbf85416c. The address of main is 0x80483e4. arrays.c. 1 #include <stdio.h> 3 int main(int argc, char **argv) 4 {

The output: The address of i is 0xbf85416c. The address of main is 0x80483e4. arrays.c. 1 #include <stdio.h> 3 int main(int argc, char **argv) 4 { Memory A bit is a binary digit, either 0 or 1. A byte is eight bits, and can thus represent 256 unique values, such as 00000000 and 10010110. Computer scientists often think in terms of hexadecimal, rather

More information

THE C STANDARD LIBRARY & MAKING YOUR OWN LIBRARY. ISA 563: Fundamentals of Systems Programming

THE C STANDARD LIBRARY & MAKING YOUR OWN LIBRARY. ISA 563: Fundamentals of Systems Programming THE C STANDARD LIBRARY & MAKING YOUR OWN LIBRARY ISA 563: Fundamentals of Systems Programming Announcements Homework 2 posted Homework 1 due in two weeks Typo on HW1 (definition of Fib. Sequence incorrect)

More information

CSE 333 Final Exam June 6, 2017 Sample Solution

CSE 333 Final Exam June 6, 2017 Sample Solution Question 1. (24 points) Some C and POSIX I/O programming. Given an int file descriptor returned by open(), write a C function ReadFile that reads the entire file designated by that file descriptor and

More information

CSE 333 Autumn 2013 Midterm

CSE 333 Autumn 2013 Midterm CSE 333 Autumn 2013 Midterm Please do not read beyond this cover page until told to start. A question involving what could be either C or C++ is about C, unless it explicitly states that it is about C++.

More information

CSE 333 Midterm Exam July 24, Name UW ID#

CSE 333 Midterm Exam July 24, Name UW ID# Name UW ID# There are 6 questions worth a total of 100 points. Please budget your time so you get to all of the questions. Keep your answers brief and to the point. The exam is closed book, closed notes,

More information

Chapter Four: Loops. Slides by Evan Gallagher. C++ for Everyone by Cay Horstmann Copyright 2012 by John Wiley & Sons. All rights reserved

Chapter Four: Loops. Slides by Evan Gallagher. C++ for Everyone by Cay Horstmann Copyright 2012 by John Wiley & Sons. All rights reserved Chapter Four: Loops Slides by Evan Gallagher The Three Loops in C++ C++ has these three looping statements: while for do The while Loop while (condition) { statements } The condition is some kind of test

More information

10/23/02 21:20:33 IO_Examples

10/23/02 21:20:33 IO_Examples 1 Oct 22 22:07 2000 extractor1.c Page 1 istream &operator>>( istream &in, Point &p ){ char junk; in >> junk >> p.x >> junk >> p.y >> junk; return in; 2 Oct 22 22:07 2000 extractor2.c Page 1 istream &operator>>(

More information

ESC101N: Fundamentals of Computing End-sem st semester

ESC101N: Fundamentals of Computing End-sem st semester ESC101N: Fundamentals of Computing End-sem 2010-11 1st semester Instructor: Arnab Bhattacharya 8:00-11:00am, 15th November, 2010 Instructions 1. Please write your name, roll number and section below. 2.

More information

Building on the foundation. Now that we know a little about cout cin math operators boolean operators making decisions using if statements

Building on the foundation. Now that we know a little about cout cin math operators boolean operators making decisions using if statements Chapter 5 Looping Building on the foundation Now that we know a little about cout cin math operators boolean operators making decisions using if statements Advantages of Computers Computers are really

More information

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part

More information

Chapter 3 - Functions

Chapter 3 - Functions Chapter 3 - Functions 1 Outline 3.1 Introduction 3.2 Program Components in C++ 3.3 Math Library Functions 3.4 Functions 3.5 Function Definitions 3.6 Function Prototypes 3.7 Header Files 3.8 Random Number

More information

CPSC 427: Object-Oriented Programming

CPSC 427: Object-Oriented Programming CPSC 427: Object-Oriented Programming Michael J. Fischer Lecture 10 October 1, 2018 CPSC 427, Lecture 10, October 1, 2018 1/20 Brackets Example (continued from lecture 8) Stack class Brackets class Main

More information

Tutorial 13 Salary Survey Application: Introducing One- Dimensional Arrays

Tutorial 13 Salary Survey Application: Introducing One- Dimensional Arrays Tutorial 13 Salary Survey Application: Introducing One- Dimensional Arrays Outline 13.1 Test-Driving the Salary Survey Application 13.2 Introducing Arrays 13.3 Declaring and Initializing Arrays 13.4 Constructing

More information

Lab 6. Review of Variables, Formatting & Loops By: Dr. John Abraham, Professor, UTPA

Lab 6. Review of Variables, Formatting & Loops By: Dr. John Abraham, Professor, UTPA Variables: Lab 6 Review of Variables, Formatting & Loops By: Dr. John Abraham, Professor, UTPA We learned that a variable is a name assigned to the first byte of the necessary memory to store a value.

More information

My malloc: mylloc and mhysa. Johan Montelius HT2016

My malloc: mylloc and mhysa. Johan Montelius HT2016 1 Introduction My malloc: mylloc and mhysa Johan Montelius HT2016 So this is an experiment where we will implement our own malloc. We will not implement the world s fastest allocator, but it will work

More information

The American University in Cairo Department of Computer Science & Engineering CSCI &09 Dr. KHALIL Exam-I Fall 2011

The American University in Cairo Department of Computer Science & Engineering CSCI &09 Dr. KHALIL Exam-I Fall 2011 The American University in Cairo Department of Computer Science & Engineering CSCI 106-07&09 Dr. KHALIL Exam-I Fall 2011 Last Name :... ID:... First Name:... Form I Section No.: EXAMINATION INSTRUCTIONS

More information

Integer Data Types. Data Type. Data Types. int, short int, long int

Integer Data Types. Data Type. Data Types. int, short int, long int Data Types Variables are classified according to their data type. The data type determines the kind of information that may be stored in the variable. A data type is a set of values. Generally two main

More information

The following program computes a Calculus value, the "trapezoidal approximation of

The following program computes a Calculus value, the trapezoidal approximation of Multicore machines and shared memory Multicore CPUs have more than one core processor that can execute instructions at the same time. The cores share main memory. In the next few activities, we will learn

More information

CSCI-243 Exam 2 Review February 22, 2015 Presented by the RIT Computer Science Community

CSCI-243 Exam 2 Review February 22, 2015 Presented by the RIT Computer Science Community CSCI-43 Exam Review February, 01 Presented by the RIT Computer Science Community http://csc.cs.rit.edu C Preprocessor 1. Consider the following program: 1 # include 3 # ifdef WINDOWS 4 # include

More information

CA341 - Comparative Programming Languages

CA341 - Comparative Programming Languages CA341 - Comparative Programming Languages David Sinclair Dynamic Data Structures Generally we do not know how much data a program will have to process. There are 2 ways to handle this: Create a fixed data

More information

CS 376b Computer Vision

CS 376b Computer Vision CS 376b Computer Vision 09 / 25 / 2014 Instructor: Michael Eckmann Today s Topics Questions? / Comments? Enhancing images / masks Cross correlation Convolution C++ Cross-correlation Cross-correlation involves

More information

C Syntax Arrays and Loops Math Strings Structures Pointers File I/O. Final Review CS Prof. Jonathan Ventura. Prof. Jonathan Ventura Final Review

C Syntax Arrays and Loops Math Strings Structures Pointers File I/O. Final Review CS Prof. Jonathan Ventura. Prof. Jonathan Ventura Final Review CS 2060 Variables Variables are statically typed. Variables must be defined before they are used. You only specify the type name when you define the variable. int a, b, c; float d, e, f; char letter; //

More information

1- Write a single C++ statement that: A. Calculates the sum of the two integrates 11 and 12 and outputs the sum to the consol.

1- Write a single C++ statement that: A. Calculates the sum of the two integrates 11 and 12 and outputs the sum to the consol. 1- Write a single C++ statement that: A. Calculates the sum of the two integrates 11 and 12 and outputs the sum to the consol. B. Outputs to the console a floating point number f1 in scientific format

More information

Review Topics. Final Exam Review Slides

Review Topics. Final Exam Review Slides Review Topics Final Exam Review Slides!! Transistors and Gates! Combinational Logic! LC-3 Programming!! Original slides from Gregory Byrd, North Carolina State University Modified slides by Chris Wilcox,

More information

Chapter Four: Loops II

Chapter Four: Loops II Chapter Four: Loops II Slides by Evan Gallagher & Nikolay Kirov Chapter Goals To understand nested loops To implement programs that read and process data sets To use a computer for simulations Processing

More information

ECE264 Fall 2013 Exam 3, November 20, 2013

ECE264 Fall 2013 Exam 3, November 20, 2013 ECE264 Fall 2013 Exam 3, November 20, 2013 In signing this statement, I hereby certify that the work on this exam is my own and that I have not copied the work of any other student while completing it.

More information

Reference operator (&)

Reference operator (&) Pointers Each cell can be easily located in the memory because it has a unique address and all the memory cells follow a successive pattern. For example, if we are looking for cell 1776 we know that it

More information

GPU Programming. Rupesh Nasre.

GPU Programming. Rupesh Nasre. GPU Programming Rupesh Nasre. http://www.cse.iitm.ac.in/~rupesh IIT Madras July 2017 Debugging Debugging parallel programs is difficult. Non-determinism due to thread-scheduling Output can be different

More information

Two s Complement Review. Two s Complement Review. Agenda. Agenda 6/21/2011

Two s Complement Review. Two s Complement Review. Agenda. Agenda 6/21/2011 Two s Complement Review CS 61C: Great Ideas in Computer Architecture (Machine Structures) Introduction to C (Part I) Instructor: Michael Greenbaum http://inst.eecs.berkeley.edu/~cs61c/su11 Suppose we had

More information

When you add a number to a pointer, that number is added, but first it is multiplied by the sizeof the type the pointer points to.

When you add a number to a pointer, that number is added, but first it is multiplied by the sizeof the type the pointer points to. Refresher When you add a number to a pointer, that number is added, but first it is multiplied by the sizeof the type the pointer points to. i.e. char *ptr1 = malloc(1); ptr1 + 1; // adds 1 to pointer

More information

ECE264 Fall 2013 Exam 2, October 24, 2013

ECE264 Fall 2013 Exam 2, October 24, 2013 ECE Fall 0 Exam, October, 0 If this is an on-line exam, you have 0 minutes to finish the exam. When the time limit is reached, the system will automatically close. If this is a paper exam, you have 0 minutes.

More information

Multiple Choice (Questions 1 13) 26 Points Select all correct answers (multiple correct answers are possible)

Multiple Choice (Questions 1 13) 26 Points Select all correct answers (multiple correct answers are possible) Name Closed notes, book and neighbor. If you have any questions ask them. Notes: Segment of code necessary C++ statements to perform the action described not a complete program Program a complete C++ program

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Variables. Data Types.

Variables. Data Types. Variables. Data Types. The usefulness of the "Hello World" programs shown in the previous section is quite questionable. We had to write several lines of code, compile them, and then execute the resulting

More information

Programming in C. Pointers and Arrays

Programming in C. Pointers and Arrays Programming in C Pointers and Arrays NEXT SET OF SLIDES FROM DENNIS FREY S FALL 2011 CMSC313 http://www.csee.umbc.edu/courses/undergraduate/313/fall11/" Pointers and Arrays In C, there is a strong relationship

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

For personnal use only

For personnal use only Inverting Large Images Using CUDA Finnbarr P. Murphy (fpm@fpmurphy.com) This is a simple example of how to invert a very large image, stored as a vector using nvidia s CUDA programming environment and

More information

Optimizing CUDA for GPU Architecture. CSInParallel Project

Optimizing CUDA for GPU Architecture. CSInParallel Project Optimizing CUDA for GPU Architecture CSInParallel Project August 13, 2014 CONTENTS 1 CUDA Architecture 2 1.1 Physical Architecture........................................... 2 1.2 Virtual Architecture...........................................

More information

A Crash Course in C. Steven Reeves

A Crash Course in C. Steven Reeves A Crash Course in C Steven Reeves This class will rely heavily on C and C++. As a result this section will help students who are not familiar with C or who need a refresher. By the end of this section

More information

Copyright 2013 Thomas W. Doeppner. IX 1

Copyright 2013 Thomas W. Doeppner. IX 1 Copyright 2013 Thomas W. Doeppner. IX 1 If we have only one thread, then, no matter how many processors we have, we can do only one thing at a time. Thus multiple threads allow us to multiplex the handling

More information

BIL 104E Introduction to Scientific and Engineering Computing. Lecture 4

BIL 104E Introduction to Scientific and Engineering Computing. Lecture 4 BIL 104E Introduction to Scientific and Engineering Computing Lecture 4 Introduction Divide and Conquer Construct a program from smaller pieces or components These smaller pieces are called modules Functions

More information

Non-numeric types, boolean types, arithmetic. operators. Comp Sci 1570 Introduction to C++ Non-numeric types. const. Reserved words.

Non-numeric types, boolean types, arithmetic. operators. Comp Sci 1570 Introduction to C++ Non-numeric types. const. Reserved words. , ean, arithmetic s s on acters Comp Sci 1570 Introduction to C++ Outline s s on acters 1 2 3 4 s s on acters Outline s s on acters 1 2 3 4 s s on acters ASCII s s on acters ASCII s s on acters Type: acter

More information

GPU 1. CSCI 4850/5850 High-Performance Computing Spring 2018

GPU 1. CSCI 4850/5850 High-Performance Computing Spring 2018 GPU 1 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives

More information

COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE

COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE VANYA YANEVA Ajitha Rajan, Christophe Dubach ISSTA 2017 10 July 2017 Santa Barbara, CA EMBEDDED SOFTWARE IS EVERYWHERE ITS SAFETY AND CORRECTNESS

More information

C: How to Program. Week /Apr/23

C: How to Program. Week /Apr/23 C: How to Program Week 9 2007/Apr/23 1 Review of Chapters 1~5 Chapter 1: Basic Concepts on Computer and Programming Chapter 2: printf and scanf (Relational Operators) keywords Chapter 3: if (if else )

More information

Computing and Statistical Data Analysis Lecture 3

Computing and Statistical Data Analysis Lecture 3 Computing and Statistical Data Analysis Lecture 3 Type casting: static_cast, etc. Basic mathematical functions More i/o: formatting tricks Scope, namspaces Functions 1 Type casting Often we need to interpret

More information

today cs3157-fall2002-sklar-lect05 1

today cs3157-fall2002-sklar-lect05 1 today homework #1 due on monday sep 23, 6am some miscellaneous topics: logical operators random numbers character handling functions FILE I/O strings arrays pointers cs3157-fall2002-sklar-lect05 1 logical

More information

Project 1: Convex hulls and line segment intersection

Project 1: Convex hulls and line segment intersection MCS 481 / David Dumas / Spring 2014 Project 1: Convex hulls and line segment intersection Due at 10am on Monday, February 10 0. Prerequisites For this project it is expected that you already have CGAL

More information

CSE 333 Midterm Exam Sample Solution 7/28/14

CSE 333 Midterm Exam Sample Solution 7/28/14 Question 1. (20 points) C programming. For this question implement a C function contains that returns 1 (true) if a given C string appears as a substring of another C string starting at a given position.

More information