Parallel Processing Neural Networks on SIMD/GPU Architectures by Derek Kern CSC7551, December 8th, 2011

Project Description

Neural networks can often have hundreds, if not thousands, of neurons when used to solve a pattern-matching task. Specifically, a backpropagation neural network must, when responding to an input, 'ripple' the effect of the input across each and every layer before producing an output. Furthermore, when training, this 'rippling' must go from input to output and then back from the output into the hidden layers. Obviously, depending upon the size of the network, these tasks can be computationally daunting. In this project, a backpropagation neural network will be modelled and computed on a GPU vector processor such that each neuron will occupy one or many individual PEs. This is thought to be an interesting parallel computation task for a number of reasons: (1) since some layers of neurons (PEs) must fire while others remain idle, it will require significant effort to coordinate PE behavior; (2) since each neuron (PE) in a layer must be able to read the output of many or all of the neurons in the previous layer, there is a significant risk of memory access collisions; and (3) given the number of computations needed to determine the output weight for a neuron, there is a chance that multilevel parallelism may be used, i.e., for each neuron being handled in parallel, multiple PEs may be used to compute its weight.

Analysis and Results

Broad Results

The overall goals of the project were: (1) to achieve a basic vectorization of a backpropagation neural network; (2) to explore the coordination and other issues that arise when running the neural network on a GPU; and (3) to achieve an extreme vectorization of a backpropagation neural network. During the project all three of these goals were met. On top of this, both the basic and extreme vector versions of the neural network vastly outperformed the sequential version. Furthermore, a thorough understanding of the GPU hardware was gained. The GPU threading model is something that is not well covered in most texts. It was through this project that an understanding of how to fully exploit the threading model of the GPU was gained; blocks and threads need to be specified so that the streaming multiprocessors and the cores within them are used with the greatest efficiency. It was also through this project that the details of kernel thread synchronization were learned: the only way to synchronize across blocks is via separate kernel calls, since __syncthreads() only synchronizes threads within a block.
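
To make the synchronization point concrete, here is a minimal sketch (not the project's code; the kernel name, the host-side arrays d_out, d_w, and size, and the weight layout are all hypothetical) of driving a feedforward pass layer by layer, where each kernel launch acts as the cross-block barrier that __syncthreads() cannot provide:

// Hypothetical sketch: one thread computes one neuron of the current layer.
__global__ void feedforward_layer( const double *prev_outputs, const double *weights,
                                   double *outputs, int prev_size, int layer_size )
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per neuron
    if ( j >= layer_size ) return;

    int w0 = j * ( prev_size + 1 );                  // assumed layout: prev_size weights + 1 bias per neuron
    double sum = 0.0;
    for ( int k = 0; k < prev_size; k++ )            // read every output of the previous layer
        sum += prev_outputs[k] * weights[w0 + k];
    sum += weights[w0 + prev_size];                  // bias term

    outputs[j] = 1.0 / ( 1.0 + exp( -sum ) );        // sigmoid activation
}

// Host side: launching one kernel per layer on the same stream enforces the
// layer-to-layer ordering, since each launch waits for the previous one to finish.
// for ( int i = 1; i < layer_count; i++ )
//     feedforward_layer<<< (size[i] + 31) / 32, 32 >>>( d_out[i-1], d_w[i], d_out[i],
//                                                       size[i-1], size[i] );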

Detailed Results

The simplest and most straightforward vector version is called vectorized simple. Below is the runtime comparison of it versus the sequential version.

From the chart above, it is clear that this simple vectorization outperforms the sequential version across all of the test networks.

The next two vector versions, vectorized warp bad and vectorized warp good, are meant to display the effects of allocating blocks and threads within the GPU and how these settings can affect the utilization of the GPU's streaming multiprocessors (SMs). For the record, the vectorized simple version does a poor job of allocating blocks and threads; its block/thread configuration results in each thread residing within its own warp. Vectorized warp bad allocates 50 threads per block. This means that each SM that is doing processing will end up with two warps (one of 32 threads and one of 18 threads) to manage; the SM can only run one warp at a time, so the other warp will remain idle. However, this is still better than vectorized simple. Vectorized warp good allocates 32 threads per block. This means that each SM that is doing processing will end up with only one warp to manage, unless more than 448 threads are needed (which is the case for test networks Net 4, Net 6, and Net 8). However, even if, say, 500 threads are needed, most SMs will remain with only one warp to manage; only two will be saddled with an extra warp. This means that most warps can be fully processed without waiting on other warps to finish.
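
The block-and-warp arithmetic behind these three configurations can be sketched with a small host-side calculation (an illustration only; the 500-thread figure comes from the wide test networks, and the 14-SM count is an assumption about the Fermi-class GPU used, which is what would make 14 x 32 = 448 single-warp threads):

#include <cstdio>

int main()
{
    const int warp_size  = 32;
    const int neurons    = 500;                     // threads needed for one wide layer
    const int configs[3] = { 1, 50, 32 };           // vectorized simple, warp bad, warp good

    for ( int c = 0; c < 3; c++ ) {
        int threads_per_block = configs[c];
        int blocks          = ( neurons + threads_per_block - 1 ) / threads_per_block;
        int warps_per_block = ( threads_per_block + warp_size - 1 ) / warp_size;

        // 1 thread/block   -> 500 blocks, each warp holding a single active thread
        // 50 threads/block -> 10 blocks, two warps per block (32 + 18 threads)
        // 32 threads/block -> 16 blocks, one full warp per block; on an assumed
        //                     14-SM GPU, only two SMs end up carrying a second block
        printf( "%2d threads/block -> %3d blocks, %d warp(s) per block\n",
                threads_per_block, blocks, warps_per_block );
    }
    return 0;
}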

Below is the runtime comparison of vectorized simple, vectorized warp bad, and vectorized warp good.

As the chart shows, the results aren't as stark as one might imagine. However, it is clear that as the need for parallelism increases (as in the wide test networks Net 4, Net 6, and Net 8), the vectorized warp good version does outperform the other versions. Still, it isn't yet clear why it doesn't perform as well on the test networks that require less parallelism. The theory is that the warped versions, given their higher density of active threads relative to the amount of memory being accessed, experience a slowdown due to memory bank collisions. This is especially the case for the test networks that have 200 or fewer neurons per layer (Nets 1, 2, 3, 5, and 7). As the number of neurons per layer increases (say to 500, as in Nets 4, 6, and 8), the warped versions are able to spread their memory accesses over a greater space of memory, which results in fewer collisions and better runtimes. This is a significant result. In essence, it means that even though there isn't significant documentation on the exact layout of global memory on the GPU, faster access can still be achieved, in certain circumstances, by deliberately choosing a sparse data structure. Certainly, if the neural network software were to be redesigned today, this is something that would drive the design of the neural network data structure.

The next vector version, vectorized kcm, is meant to display the overhead of making repeated kernel calls. Vectorized simple was written so that the weight adjustment step is done with two loops over all of the layers in the network; each iteration invokes another kernel call. The vectorized kcm version combines these loops and the kernel calls within them.
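
The kernel-call-merging idea can be illustrated with the following sketch (hypothetical; it is modelled on the sequential momentum and learning-rate kernels listed later in this report, not on the actual vectorized kcm source, and the weight indexing is simplified). The momentum pass and the learning-rate pass for a layer are fused into one kernel, so the host makes one launch per layer instead of two:

__global__ void adjust_weights_merged( double *weights, double *cached_weights,
                                       const double *outputs, const double *errors,
                                       double momentum, double rate,
                                       int iw_offset, int io_offset, int io_prev_offset,
                                       int prev_size, int layer_size )
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per neuron
    if ( j >= layer_size ) return;

    for ( int k = 0; k < prev_size; k++ ) {
        int w = iw_offset + j * prev_size + k;       // simplified, assumed weight layout
        weights[w]        += momentum * cached_weights[w];                           // momentum pass
        cached_weights[w]  = rate * errors[io_offset + j] * outputs[io_prev_offset + k];
        weights[w]        += cached_weights[w];                                      // learning-rate pass
    }
}

// Host side: one launch per layer replaces the two separate per-layer launches
// that vectorized simple makes during weight adjustment.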

Below is the runtime comparison of vectorized simple and vectorized kcm.

The vectorized kcm version does indeed yield a modest improvement, but not as stark as hoped.

The vectorized kcm version led to the creation of a version that was initially called vectorized full-kcm. However, this version was ultimately deemed unworkable since it required block-level synchronization, which is not possible on NVIDIA GPUs without separate kernel calls. This version was eventually redubbed vectorized kcm failed. Just to see whether it could be made to work at all, it was run within a single block. Below is the runtime comparison of the vectorized simple, vectorized kcm, and vectorized kcm failed versions.

From the chart, it is easy to see that vectorized kcm failed was a total failure. Running it within a single block doomed it to very modest parallelism (though it still outperforms the sequential version).

The next vector version, vectorized mass, is meant to be a more fully parallelized version of vectorized simple. While vectorized simple parallelizes over the neurons only, vectorized mass parallelizes the processing of the weights as well.
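
One way to parallelize the weights as well as the neurons is sketched below (again a hypothetical illustration, not the project's vectorized mass code): one block per neuron, one thread per incoming weight, with a shared-memory tree reduction forming the weighted sum. The sketch assumes blockDim.x is a power of two at least as large as prev_size and that shared memory is sized to blockDim.x doubles at launch.

__global__ void feedforward_mass( const double *prev_outputs, const double *weights,
                                  double *outputs, int iw_offset, int io_offset,
                                  int io_prev_offset, int prev_size )
{
    extern __shared__ double partial[];              // one slot per thread in the block
    int j = blockIdx.x;                              // neuron index within the layer
    int k = threadIdx.x;                             // incoming-weight index for that neuron

    int w0 = iw_offset + j * ( prev_size + 1 );      // assumed layout: prev_size weights + 1 bias
    partial[k] = ( k < prev_size )
               ? prev_outputs[io_prev_offset + k] * weights[w0 + k]
               : 0.0;
    __syncthreads();

    // Tree reduction across the block to form the weighted sum.
    for ( int stride = blockDim.x / 2; stride > 0; stride /= 2 ) {
        if ( k < stride ) partial[k] += partial[k + stride];
        __syncthreads();
    }

    if ( k == 0 )
        outputs[io_offset + j] = 1.0 / ( 1.0 + exp( -( partial[0] + weights[w0 + prev_size] ) ) );
}

// Host side (hypothetical): one block per neuron in the layer.
// feedforward_mass<<< layer_size, block_threads, block_threads * sizeof( double ) >>>( ... );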

Below is the runtime comparison of the vectorized simple and vectorized mass versions.

Clearly, from the chart, vectorized mass was a complete success. It outperforms vectorized simple, and the gap grows with the size of the neural network.

The next and final vector version, vectorized kcm mass, is meant to combine what was learned from vectorized mass with what was learned from vectorized kcm. Essentially, it is the vectorized mass version with the weight adjustment steps combined. This version, though only a modest improvement upon vectorized mass, was the version that ultimately performed the best. Below is the runtime comparison of the vectorized mass and vectorized kcm mass versions.

Now that all of the versions have been compared locally, below is a global comparison of all versions.

Again, all of the vector versions outperform the sequential version. The versions that employ massive parallelism outperform all comers. Below is a chart that compares the speedups offered by the various vector versions.

As expected, the chart shows that the versions employing massive parallelism enjoy the largest speedups over the sequential version. In fact, on Net 8, vectorized mass and vectorized kcm mass offer more than a 20-times speedup.
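
For reference, the speedup and efficiency figures discussed here follow the usual definitions: speedup is sequential runtime divided by parallel runtime, and efficiency is speedup divided by the number of processing elements committed. The small helper below is only an illustration with placeholder numbers; the report does not show the exact runtimes or PE counts behind its charts.

#include <cstdio>

int main()
{
    double t_sequential = 100.0;   // placeholder runtime (seconds), not a measured value
    double t_parallel   = 4.8;     // placeholder runtime (seconds), not a measured value
    int    pe_count     = 448;     // placeholder: number of cores assumed committed

    double speedup    = t_sequential / t_parallel;
    double efficiency = speedup / pe_count;

    printf( "Speedup: %.2fx, Efficiency: %.4f\n", speedup, efficiency );
    return 0;
}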

Finally, now that the runtimes and speedups of the vector versions are known, it is worth noting how efficiently each uses the parallel resources of the GPU. Below is a chart that compares the efficiencies of the various versions.

As is obvious from the chart, the vectorized kcm and vectorized simple versions are the most efficient; but, of course, this comes with a smaller speedup. The vectorized mass and vectorized kcm mass versions are the least efficient but offer the most significant speedups. As is typical in parallel processing, the commitment of more resources brings more speed.

Overall, the project was a success. Neural networks can be effectively processed on GPUs. Furthermore, not only can they be processed on GPUs, it appears to be desirable to do so. GPUs offer very significant speedups over sequential processing. Down the road, one can imagine, for very large networks, using MPI to distribute portions of the network to various nodes. However, instead of simply passing the network portions off to the cores on each node, perhaps it would be more desirable to pass them off to the various GPUs on each node.

Compiling and Running Instructions

Compiling

To compile the sequential version, execute the following:

g++ RunNNetwork.cpp NNetworkUtils.cpp NNetwork.cpp -o RunNNetwork

To compile any of the vector versions, execute the following:

nvcc -arch sm_20 RunNNetwork.cu NNetworkCuda.cu NNetworkUtils.cpp NNetwork.cu -o RunNNetwork

Note that the architecture switch is specified because doubles are used and because it makes placing printf statements in kernel code possible.

8 Running Whether running the sequential or one of the vector versions, two arguments are required. One is a configuration file and the other is a test file. The configuration file contains the information necessary for building and training a neural network. The test file contains the information necessary for testing the neural network. To run the sequential version, execute the following: bpsh <node> <path to>/runnnetwork <path to>/network_config.cfg <path to>/network_test.tst Below is a good example: bpsh 6 /home/derek.kern/csc7551/project/sequential/runnnetwork /home/derek.kern/csc7551/project/ nnetwork1.cfg /home/derek.kern/csc7551/project/nnetwork1.tst Running the vector versions requires a node with a GPU. Also, all of the vector versions take a final optional argument: GPU number. This allows the parallel code to be run on either GPU #0 or GPU #1 on the respective node. To run the sequential version, execute the following: bpsh <node> <path to>/runnnetwork <path to>/network_config.cfg <path to>/network_test.tst <gpu #> Below is a good example: bpsh 14 /home/derek.kern/csc7551/project/vectorized_simple/runnnetwork /home/derek.kern/csc7551/project/ nnetwork1.cfg /home/derek.kern/csc7551/project/nnetwork1.tst 1 Code Sequential Version RunNNetwork.cpp #include "NNetwork.h" #include "NNetworkUtils.h" bool check_command_line( int argc, char* argv[] ) { Make sure that the correct arguments were passed. FILE *fp = NULL; bool ok = true; if( argc < 3 ) { cout << "Format: RunNNetwork <network configuration file> <network test file>" << endl; cout << "Arguments:" << endl; cout << " network configuration file - This file should contain parameters for" << endl; cout << " network size, training rate, etc as " << endl; cout << " a set of data to train the network" << endl; cout << " network test file - This file should contain data for testing the " << endl; cout << " network after it has been trained" << endl; ok = false; else { Make sure that the configuration file exists. if( fp = fopen( argv[1], "r" ) ) { fclose( fp else { cout << "Specified network configuration file [" << argv[1] << "] doesn't exist or cannot be opened" << endl; ok = false; Make sure that the test file exists if( fp = fopen( argv[2], "r" ) ) { 8

9 fclose( fp else { cout << "Specified network test file [" << argv[2] << "] doesn't exist or cannot be opened" << endl; ok = false; return ok; int main( int argc, char* argv[] ) { Main function for running the network First make sure that the user has provided the necessary input. if (!check_command_line( argc, argv ) ) { return 1; Read in the network configuration. NNetworkConfig nnc = read_network_configuration( argv[1] Read in the network tests. TestInputs tests = read_network_tests( argv[2], nnc->layer_config->input_layer_size(), nnc->layer_config->output_layer_size() Build the neural network. NeuralNetwork net = build_neural_network( nnc->layer_config Initialize the network to begin with. initialize_neural_network( net Train the network. do_network_training( net, nnc->tests, nnc->params Test the network and report on results. cout << "Applying test data to network:" << endl; apply_network_tests( net, tests Free up the memory associated with the neural network. destroy_neural_network( net free( net return 0; NNetworkUtils.h #ifndef nnetworkutils_h #define nnetworkutils_h #include <stdlib.h> #include <string.h> #define LINE_SIZE 1024 NNetworkConfig read_network_configuration( char *config_filename TestInputs read_network_tests( char *test_filename, int input_layer_size, int output_layer_size 9

10 TestInputs _read_network_tests( FILE *test_file, int input_layer_size, int output_layer_size #endif NNetworkUtils.cpp #include "NNetwork.h" NeuralNetwork build_neural_network( NetworkLayerConfig layer_config ) { Build the neural network that corresponds to the layer configuration. NeuralNetwork net = (NeuralNetwork) malloc( sizeof( struct NeuralNetwork ) int total_neurons_needed = layer_config->total_neurons_needed_for_network( int total_weights_needed = layer_config->total_neuron_weights_needed_for_network( Setup the basic layer layout. net->layer_count = layer_config->layer_count; Copy the sizes of the layers. net->layer_sizes = (int*) malloc( sizeof( int ) * net->layer_count for( int i = 0; i < net->layer_count; i++ ) { net->layer_sizes[i] = layer_config->layer_sizes[i]; Setup the memory for the neuronal weights Total weight slots needed is given by the following: Sum from i to layer_count: layer_sizes[i - 1] * layer_sizes[i] See total_neuron_weights_needed_for_network() for details. net->weights = (double*) malloc( sizeof( double ) * total_weights_needed Setup the memory for caching of the neuronal weights. Total weight slots needed is given by the following: Sum from i to layer_count: layer_sizes[i - 1] * layer_sizes[i] See total_neuron_weights_needed_for_network() for details. net->cached_weights = (double*) malloc( sizeof( double ) * total_weights_needed Setup the memory of the outputs of the neurons. Total output slots needed is given by the following: Sum from i to layer_count: layer_sizes[i] See total_neurons_needed_for_network() for details net->outputs = (double*) malloc( sizeof( double ) * total_neurons_needed Setup the memory of the errors of the neurons. Total error slots needed is given by the following: Sum from i to layer_count: layer_sizes[i] See total_neurons_needed_for_network() for details net->errors = (double*) malloc( sizeof( double ) * total_neurons_needed return net; void destroy_neural_network( NeuralNetwork net ) { Free memory from the network network. 10

11 Delete the memory for the neuron weights. free( net->weights Delete the memory for the weight caching. free( net->cached_weights Delete the memory for the neuron outputs. free( net->outputs Delete the memory for the error (differences). free( net->errors Finally, clear out the layer sizes. free( net->layer_sizes void initialize_neural_network( NeuralNetwork net ) { Initial the weights of the network with random values and zero out the cache. int i_offset, j_offset; Seed the random number generator. srand( (unsigned) time( NULL ) Set the neuronal weights to random values. for( int i = 1; i < net->layer_count; i++ ) { i_offset = net->total_neuron_weights_before_layer( i for( int j = 0; j < net->layer_sizes[i]; j++ ) { This is the total number of weights in this layer prior to this neuron. j_offset = j * net->layer_sizes[i - 1]; for( int k = 0; k < net->layer_sizes[i - 1] + 1; k++ ) { net->weights[i_offset + j_offset + k] = (double) ( rand() ) / ( RAND_MAX / 2 ) - 1; Zero out the weight cache. for( int i = 1; i < net->layer_count; i++ ) { i_offset = net->total_neuron_weights_before_layer( i for( int j = 0; j < net->layer_sizes[i]; j++ ) { This is the total number of weights in this layer prior to this neuron. j_offset = j * net->layer_sizes[i - 1]; for( int k = 0; k < net->layer_sizes[i - 1] + 1; k++ ) { net->cached_weights[i_offset + j_offset + k] = 0.0f; 11

12 void feedforward( NeuralNetwork net, double *inputs ) { Feed the inputs forward through the neural network until the outpus are determined. double weighted_sum; Start by putting the inputs onto the input layer. for( int j = 0; j < net->layer_sizes[0]; j++ ) { net->outputs[0 + j] = inputs[j]; Now ripple the effect of the input across the layers. for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output offsets int iw_offset = net->total_neuron_weights_before_layer( i int io_offset = net->total_neurons_before_layer( i int io_prev_offset = net->total_neurons_before_layer( i - 1 Apply the result to each neuron in the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_feedforward( i, net->outputs, net->weights, iw_offset, io_offset, io_prev_offset, net- >layer_sizes[i - 1], j void kernel_feedforward( int layer_number, double *outputs, double *weights, int iw_offset, int io_offset, int io_prev_offset, int prev_layer_size, int j ) { Do the feedforward, but model it for kernel computation. double weighted_sum; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; Reset the sum. weighted_sum = 0.0f; Sum the outputs from the previous layer, adjusted by the connection weights. for( int k = 0; k < prev_layer_size; k++ ) { weighted_sum += outputs[io_prev_offset + k] * weights[iw_offset + jw_offset + k]; Now, for this neuron, set the output. outputs[io_offset + j] = calculate_sigmoid( weighted_sum + weights[iw_offset + jw_offset + prev_layer_size] void backpropogate( NeuralNetwork net, double *inputs, double *desired_outputs, TrainingParameters params ) { 12

13 Feed the inputs forward through the neural network until the outpus are determined. Afterwards, turn around and neuro-connection weights so that they more reliably produce the desired output. double weighted_sum; Start by feeding forward the input values. This will put values onto the output nodes. We can then compare these to the desired values and backpropogate the changes. feedforward( net, inputs Calculate the error values for the output layer. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 for( int j = 0; j < net->layer_sizes[net->layer_count - 1]; j++ ) { net->errors[i_offset + j] = ( net->outputs[i_offset + j] * ( 1 - net->outputs[i_offset + j] ) * ( desired_outputs[j] - net->outputs[i_offset + j] ) Calculate the error values for the hidden layers. for( int i = net->layer_count - 2; i > 0; i-- ) { Figure out layer-based weight and output/error offsets int iw_next_offset = net->total_neuron_weights_before_layer( i + 1 int io_offset = net->total_neurons_before_layer( i int io_next_offset = net->total_neurons_before_layer( i + 1 Calculate the error for each neuron in the layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_backfeed_errors( i, net->outputs, net->weights, net->errors, iw_next_offset, io_offset, io_next_offset, net->layer_sizes[i], net->layer_sizes[i + 1], j Adjust the weights according to the learning momentum for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output/error offsets int iw_offset = net->total_neuron_weights_before_layer( i Adjust the weight for each neuron within the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_apply_momentum( i, net->weights, net->cached_weights, params->learning_momentum, iw_offset, net->layer_sizes[i - 1], j Adjust weights according to the learning rate. Also, cache the weights. 13

14 for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output/error offsets int iw_offset = net->total_neuron_weights_before_layer( i int io_offset = net->total_neurons_before_layer( i int io_prev_offset = net->total_neurons_before_layer( i - 1 Adjust the weight for each neuron within the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_apply_rate( i, net->weights, net->cached_weights, net->outputs, net->errors, params->learning_rate, io_offset, io_prev_offset, iw_offset, net->layer_sizes[i - 1], j void kernel_backpropogation_backfeed_errors( int layer_number, double *outputs, double *weights, double *errors, int iw_next_offset, int io_offset, int io_next_offset, int current_layer_size, int next_layer_size, int j ) { Do the backfeed of errors, but model it for kernel computation. double weighted_sum = 0.0f; Sum the weighted errors from the layer after the current one. for( int k = 0; k < next_layer_size; k++ ) { Figure out the neuron-based weight offset int kw_offset = k * current_layer_size; weighted_sum += errors[io_next_offset + k] * weights[iw_next_offset + j + kw_offset]; Set the error. errors[io_offset + j] = outputs[io_offset + j] * ( 1 - outputs[io_offset + j] ) * weighted_sum; void kernel_backpropogation_apply_momentum( int layer_number, double *weights, double *cached_weights, double learning_momentum, int iw_offset, int prev_layer_size, int j ) { Apply the momentum to the weights, but model it for kernel computation. double weighted_sum = 0.0f; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; for( int k = 0; k < prev_layer_size; k++ ) { weights[iw_offset + jw_offset + k] += ( learning_momentum * cached_weights[iw_offset + jw_offset + k] void kernel_backpropogation_apply_rate( int layer_number, double *weights, double *cached_weights, double *outputs, double *errors, double learning_rate, int io_offset, int io_prev_offset, int iw_offset, int prev_layer_size, int j ) { 14

15 Apply the momentum to the weights, but model it for kernel computation. double weighted_sum = 0.0f; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; for( int k = 0; k < prev_layer_size; k++ ) { cached_weights[iw_offset + jw_offset + k]= ( learning_rate * errors[io_offset + j] * outputs[io_prev_offset + k] weights[iw_offset + jw_offset + k] += cached_weights[iw_offset + jw_offset + k]; double calculate_sigmoid( double value ) { Calculate the sigmoid function for the value. return (double) ( 1 / ( 1 + exp( -value ) ) double get_mean_square_error( NeuralNetwork net, double *desired_outputs ) { Get the mean square error of the network based upon the desired outputs. double error = 0; Sum the error up from the output layer int i_offset = net->total_neurons_before_layer( net->layer_count - 1 for( int j = 0; j < net->layer_sizes[net->layer_count - 1]; j++ ) { error += ( ( desired_outputs[j] - net->outputs[i_offset + j] ) * ( desired_outputs[j] - net->outputs[i_offset + j] ) return error / 2; double get_output_value( NeuralNetwork net, int index ) { Return the specified output value from the network. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 return net->outputs[i_offset + index]; int get_rounded_output_value( NeuralNetwork net, int index ) { Return the specified output value from the network, but rounded into an integer. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 return (int) floor( net->outputs[i_offset + index] double do_network_training( NeuralNetwork net, TestInputs tests, TrainingParameters params ) { Iteratively train the neural network and report on the progress. double error = 0.0f; long iteration = 0, total_iterations = 0; float backprop_runtime, total_backprop_runtime = 0, runtime, total_runtime = 0; cout << endl << "Training the network:" << endl; for ( iteration = 0; iteration < params->training_max_iterations ; iteration++ ) { runtime = ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) 15

16 Setup to record the time total_iterations += 1; Train through backpropogation backprop_runtime = ( clock() / (float)( CLOCKS_PER_SEC / 1000 ) backpropogate( net, tests->input_values[iteration % tests->test_count], tests->desired_output_values[iteration % tests->test_count], params total_backprop_runtime += ( ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) ) - backprop_runtime How bad is the error? error = get_mean_square_error( net, tests->desired_output_values[iteration % tests->test_count] if( error < params->training_threshold ) { cout << "Network has been trained. It took " << iteration << " iterations." << endl; cout << "Final error is " << error << endl << endl; break; Report on the training process. if ( iteration % ( params->training_max_iterations / 10 ) == 0 ) { cout << "Current error is " << error << ". Continuing with training..." << endl; Add to the total runtime total_runtime += ( ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) ) - runtime if ( iteration == params->training_max_iterations ) { error = get_mean_square_error( net, tests->desired_output_values[(iteration - 1) % tests->test_count] cout << "Maximum of " << iteration << " iterations completed with error of " << error << endl; Write out the time for backpropogation. cout << endl << "Total time in backpropogation: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_backprop_runtime / 1000 ) << " seconds" << endl; cout << "Average time per backpropogation: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_backprop_runtime / total_iterations ) << " milliseconds" << endl << endl; cout << "Total time iterating: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_runtime / 1000 ) << " seconds" << endl; cout << "Average time per iteration: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_runtime / total_iterations ) << " milliseconds" << endl << endl; void apply_network_tests( NeuralNetwork net, TestInputs tests ) { Apply the tests to the neural network. Report on the success failure. int total_iterations = 0; float feedforward_runtime, total_feedforward_runtime = 0, runtime, total_runtime = 0; for ( int test_index = 0; test_index < tests->test_count; test_index++ ) { runtime = ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) Setup to record the time total_iterations += 1; Start by feeding forward the provided test inputs. feedforward_runtime = ( clock() / (float)( CLOCKS_PER_SEC / 1000 ) 16

17 feedforward( net, tests->input_values[test_index] total_feedforward_runtime += ( ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) ) - feedforward_runtime Now, report what the expected output is. cout << "For test input " << ( test_index + 1 ) << endl; cout << " Expected = "; for( int i = 0; i < tests->output_value_size; i++ ) { cout << (int) tests->desired_output_values[test_index][i]; cout << endl; Finally, report what the actual output was. cout << " Received = "; for( int i = 0; i < tests->output_value_size; i++ ) { cout << get_rounded_output_value( net, i cout << endl << endl; total_runtime += ( ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) ) - runtime Write out the time for feedforward. cout << endl << "Total time in feedforward: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_feedforward_runtime / 1000 ) << " seconds " << endl; cout << "Average time per feedforward: " << setiosflags( ios::fixed ) << setprecision( 9 ) << ( total_feedforward_runtime / total_iterations ) << " milliseconds " << endl << endl; cout << "Total time iterating: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_runtime / 1000 ) << " seconds" << endl; cout << "Average time per iteration: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_runtime / total_iterations ) << " milliseconds" << endl << endl; NNetwork.h #ifndef nnetwork_h #define nnetwork_h #include <assert.h> #include <iostream> #include <iomanip> #include <stdio.h> #include <math.h> #include <time.h> using namespace std; typedef struct NeuralNetwork { These variables will hold information about the layers int layer_count; int *layer_sizes; This will hold the weights of the neurons. Used to be a double***. double *weights; This will preserve weights for later use. Used to be a double***. double *cached_weights; This will hold the output for the neurons. Used to be a double**. 17

18 double *outputs; This will hold the difference between the target training values and the current outputs. Used to be a double**. double *errors; int input_layer_size() { return layer_sizes[0]; int output_layer_size() { return layer_sizes[layer_count - 1]; int total_neurons_in_network() { int total = 0; for( int i = 0; i < layer_count; i++ ) total += layer_sizes[i]; return total; int total_neurons_before_layer( int layer_number ) { int total = 0; for( int i = 0; i < layer_number; i++ ) total += layer_sizes[i]; return total; int total_neuron_weights_in_network() { int total = 0; for( int i = 1; i < layer_count; i++ ) { total += ( layer_sizes[i - 1] * layer_sizes[i] return total; int total_neuron_weights_before_layer( int layer_number ) { int total = 0; for( int i = 1; i < layer_number; i++ ) { total += ( layer_sizes[i - 1] * layer_sizes[i] return total; *NeuralNetwork; typedef struct TrainingParameters { This setting determines how quickly the network will learn. double learning_rate; This setting determines the momentum of learning. double learning_momentum; This setting determines the point where the network is finished learning. double training_threshold; This setting determines the maximum number of iterations to train. long training_max_iterations; *TrainingParameters; typedef struct TestInput { 18

19 This will hold input values for this training input double **input_values; This will hold desired output values for this training input. double **desired_output_values; This will hold the number of tests stored. int test_count; This will hold the number of values that are stored each of the input and output values vector. int input_value_size; int output_value_size; *TestInputs; typedef struct NetworkLayerConfig { This will hold details about the network config. int layer_sizes[100]; int layer_count; int input_layer_size() { return layer_sizes[0]; int output_layer_size() { return layer_sizes[layer_count - 1]; int total_neurons_needed_for_network() { int total = 0; for( int i = 0; i < layer_count; i++ ) total += layer_sizes[i]; return total; int total_neuron_weights_needed_for_network() { int total = 0; for( int i = 1; i < layer_count; i++ ) { total += ( ( layer_sizes[i - 1] + 1 ) * layer_sizes[i] return total; *NetworkLayerConfig; typedef struct NNetworkConfig { This will hold onto the layer configuration. NetworkLayerConfig layer_config; This will hold onto training parameters. TrainingParameters params; This will hold onto training inputs. TestInputs tests; *NNetworkConfig; Function prototypes NeuralNetwork build_neural_network( NetworkLayerConfig layer_config void initialize_neural_network( NeuralNetwork net 19

20 void destroy_neural_network( NeuralNetwork net void feedforward( NeuralNetwork net, double *inputs void backpropogate( NeuralNetwork net, double *inputs, double *desired_outputs, TrainingParameters params double calculate_sigmoid( double value double get_mean_square_error( NeuralNetwork net, double *desired_outputs double get_output_value( NeuralNetwork net, int index int get_rounded_output_value( NeuralNetwork net, int index double do_network_training( NeuralNetwork net, TestInputs tests, TrainingParameters params void apply_network_tests( NeuralNetwork net, TestInputs tests void kernel_feedforward( int layer_number, double *outputs, double *weights, int iw_offset, int io_offset, int io_prev_offset, int prev_layer_size, int j void kernel_backpropogation_backfeed_errors( int layer_number, double *outputs, double *weights, double *errors, int iw_next_offset, int io_offset, int io_next_offset, int current_layer_size, int next_layer_size, int j void kernel_backpropogation_apply_momentum( int layer_number, double *weights, double *cached_weights, double learning_momentum, int iw_offset, int prev_layer_size, int j void kernel_backpropogation_apply_rate( int layer_number, double *weights, double *cached_weights, double *outputs, double *errors, double learning_rate, int io_offset, int io_prev_offset, int iw_offset, int prev_layer_size, int j #endif NNetwork.cpp #include "NNetwork.h" NeuralNetwork build_neural_network( NetworkLayerConfig layer_config ) { Build the neural network that corresponds to the layer configuration. NeuralNetwork net = (NeuralNetwork) malloc( sizeof( struct NeuralNetwork ) int total_neurons_needed = layer_config->total_neurons_needed_for_network( int total_weights_needed = layer_config->total_neuron_weights_needed_for_network( Setup the basic layer layout. net->layer_count = layer_config->layer_count; Copy the sizes of the layers. net->layer_sizes = (int*) malloc( sizeof( int ) * net->layer_count for( int i = 0; i < net->layer_count; i++ ) { net->layer_sizes[i] = layer_config->layer_sizes[i]; Setup the memory for the neuronal weights Total weight slots needed is given by the following: Sum from i to layer_count: layer_sizes[i - 1] * layer_sizes[i] See total_neuron_weights_needed_for_network() for details. net->weights = (double*) malloc( sizeof( double ) * total_weights_needed Setup the memory for caching of the neuronal weights. Total weight slots needed is given by the following: Sum from i to layer_count: layer_sizes[i - 1] * layer_sizes[i] See total_neuron_weights_needed_for_network() for details. net->cached_weights = (double*) malloc( sizeof( double ) * total_weights_needed Setup the memory of the outputs of the neurons. 20

21 Total output slots needed is given by the following: Sum from i to layer_count: layer_sizes[i] See total_neurons_needed_for_network() for details net->outputs = (double*) malloc( sizeof( double ) * total_neurons_needed Setup the memory of the errors of the neurons. Total error slots needed is given by the following: Sum from i to layer_count: layer_sizes[i] See total_neurons_needed_for_network() for details net->errors = (double*) malloc( sizeof( double ) * total_neurons_needed return net; void destroy_neural_network( NeuralNetwork net ) { Free memory from the network network. Delete the memory for the neuron weights. free( net->weights Delete the memory for the weight caching. free( net->cached_weights Delete the memory for the neuron outputs. free( net->outputs Delete the memory for the error (differences). free( net->errors Finally, clear out the layer sizes. free( net->layer_sizes void initialize_neural_network( NeuralNetwork net ) { Initial the weights of the network with random values and zero out the cache. int i_offset, j_offset; Seed the random number generator. srand( (unsigned) time( NULL ) Set the neuronal weights to random values. for( int i = 1; i < net->layer_count; i++ ) { i_offset = net->total_neuron_weights_before_layer( i for( int j = 0; j < net->layer_sizes[i]; j++ ) { This is the total number of weights in this layer prior to this neuron. j_offset = j * net->layer_sizes[i - 1]; 21

22 for( int k = 0; k < net->layer_sizes[i - 1] + 1; k++ ) { net->weights[i_offset + j_offset + k] = (double) ( rand() ) / ( RAND_MAX / 2 ) - 1; Zero out the weight cache. for( int i = 1; i < net->layer_count; i++ ) { i_offset = net->total_neuron_weights_before_layer( i for( int j = 0; j < net->layer_sizes[i]; j++ ) { This is the total number of weights in this layer prior to this neuron. j_offset = j * net->layer_sizes[i - 1]; for( int k = 0; k < net->layer_sizes[i - 1] + 1; k++ ) { net->cached_weights[i_offset + j_offset + k] = 0.0f; void feedforward( NeuralNetwork net, double *inputs ) { Feed the inputs forward through the neural network until the outpus are determined. double weighted_sum; Start by putting the inputs onto the input layer. for( int j = 0; j < net->layer_sizes[0]; j++ ) { net->outputs[0 + j] = inputs[j]; Now ripple the effect of the input across the layers. for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output offsets int iw_offset = net->total_neuron_weights_before_layer( i int io_offset = net->total_neurons_before_layer( i int io_prev_offset = net->total_neurons_before_layer( i - 1 Apply the result to each neuron in the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_feedforward( i, net->outputs, net->weights, iw_offset, io_offset, io_prev_offset, net- >layer_sizes[i - 1], j void kernel_feedforward( int layer_number, double *outputs, double *weights, int iw_offset, int io_offset, int io_prev_offset, int prev_layer_size, int j ) { Do the feedforward, but model it for kernel computation. double weighted_sum; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; 22

23 Reset the sum. weighted_sum = 0.0f; Sum the outputs from the previous layer, adjusted by the connection weights. for( int k = 0; k < prev_layer_size; k++ ) { weighted_sum += outputs[io_prev_offset + k] * weights[iw_offset + jw_offset + k]; Now, for this neuron, set the output. outputs[io_offset + j] = calculate_sigmoid( weighted_sum + weights[iw_offset + jw_offset + prev_layer_size] void backpropogate( NeuralNetwork net, double *inputs, double *desired_outputs, TrainingParameters params ) { Feed the inputs forward through the neural network until the outpus are determined. Afterwards, turn around and neuro-connection weights so that they more reliably produce the desired output. double weighted_sum; Start by feeding forward the input values. This will put values onto the output nodes. We can then compare these to the desired values and backpropogate the changes. feedforward( net, inputs Calculate the error values for the output layer. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 for( int j = 0; j < net->layer_sizes[net->layer_count - 1]; j++ ) { net->errors[i_offset + j] = ( net->outputs[i_offset + j] * ( 1 - net->outputs[i_offset + j] ) * ( desired_outputs[j] - net->outputs[i_offset + j] ) Calculate the error values for the hidden layers. for( int i = net->layer_count - 2; i > 0; i-- ) { Figure out layer-based weight and output/error offsets int iw_next_offset = net->total_neuron_weights_before_layer( i + 1 int io_offset = net->total_neurons_before_layer( i int io_next_offset = net->total_neurons_before_layer( i + 1 Calculate the error for each neuron in the layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_backfeed_errors( i, net->outputs, net->weights, net->errors, iw_next_offset, io_offset, io_next_offset, net->layer_sizes[i], net->layer_sizes[i + 1], j 23

24 Adjust the weights according to the learning momentum for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output/error offsets int iw_offset = net->total_neuron_weights_before_layer( i Adjust the weight for each neuron within the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_apply_momentum( i, net->weights, net->cached_weights, params->learning_momentum, iw_offset, net->layer_sizes[i - 1], j Adjust weights according to the learning rate. Also, cache the weights. for( int i = 1; i < net->layer_count; i++ ) { Figure out the layer-based weight and output/error offsets int iw_offset = net->total_neuron_weights_before_layer( i int io_offset = net->total_neurons_before_layer( i int io_prev_offset = net->total_neurons_before_layer( i - 1 Adjust the weight for each neuron within the current layer. for( int j = 0; j < net->layer_sizes[i]; j++ ) { Mock up the kernel computation. kernel_backpropogation_apply_rate( i, net->weights, net->cached_weights, net->outputs, net->errors, params->learning_rate, io_offset, io_prev_offset, iw_offset, net->layer_sizes[i - 1], j void kernel_backpropogation_backfeed_errors( int layer_number, double *outputs, double *weights, double *errors, int iw_next_offset, int io_offset, int io_next_offset, int current_layer_size, int next_layer_size, int j ) { Do the backfeed of errors, but model it for kernel computation. double weighted_sum = 0.0f; Sum the weighted errors from the layer after the current one. for( int k = 0; k < next_layer_size; k++ ) { Figure out the neuron-based weight offset int kw_offset = k * current_layer_size; weighted_sum += errors[io_next_offset + k] * weights[iw_next_offset + j + kw_offset]; Set the error. errors[io_offset + j] = outputs[io_offset + j] * ( 1 - outputs[io_offset + j] ) * weighted_sum; void kernel_backpropogation_apply_momentum( int layer_number, double *weights, double *cached_weights, double learning_momentum, int iw_offset, 24

25 int prev_layer_size, int j ) { Apply the momentum to the weights, but model it for kernel computation. double weighted_sum = 0.0f; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; for( int k = 0; k < prev_layer_size; k++ ) { weights[iw_offset + jw_offset + k] += ( learning_momentum * cached_weights[iw_offset + jw_offset + k] void kernel_backpropogation_apply_rate( int layer_number, double *weights, double *cached_weights, double *outputs, double *errors, double learning_rate, int io_offset, int io_prev_offset, int iw_offset, int prev_layer_size, int j ) { Apply the momentum to the weights, but model it for kernel computation. double weighted_sum = 0.0f; Figure out the neuron-based weight int jw_offset = j * prev_layer_size; for( int k = 0; k < prev_layer_size; k++ ) { cached_weights[iw_offset + jw_offset + k]= ( learning_rate * errors[io_offset + j] * outputs[io_prev_offset + k] weights[iw_offset + jw_offset + k] += cached_weights[iw_offset + jw_offset + k]; double calculate_sigmoid( double value ) { Calculate the sigmoid function for the value. return (double) ( 1 / ( 1 + exp( -value ) ) double get_mean_square_error( NeuralNetwork net, double *desired_outputs ) { Get the mean square error of the network based upon the desired outputs. double error = 0; Sum the error up from the output layer int i_offset = net->total_neurons_before_layer( net->layer_count - 1 for( int j = 0; j < net->layer_sizes[net->layer_count - 1]; j++ ) { error += ( ( desired_outputs[j] - net->outputs[i_offset + j] ) * ( desired_outputs[j] - net->outputs[i_offset + j] ) return error / 2; double get_output_value( NeuralNetwork net, int index ) { Return the specified output value from the network. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 return net->outputs[i_offset + index]; 25

26 int get_rounded_output_value( NeuralNetwork net, int index ) { Return the specified output value from the network, but rounded into an integer. int i_offset = net->total_neurons_before_layer( net->layer_count - 1 return (int) floor( net->outputs[i_offset + index] double do_network_training( NeuralNetwork net, TestInputs tests, TrainingParameters params ) { Iteratively train the neural network and report on the progress. double error = 0.0f; long iteration = 0, total_iterations = 0; float backprop_runtime, total_backprop_runtime = 0, runtime, total_runtime = 0; cout << endl << "Training the network:" << endl; for ( iteration = 0; iteration < params->training_max_iterations ; iteration++ ) { runtime = ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) Setup to record the time total_iterations += 1; Train through backpropogation backprop_runtime = ( clock() / (float)( CLOCKS_PER_SEC / 1000 ) backpropogate( net, tests->input_values[iteration % tests->test_count], tests->desired_output_values[iteration % tests->test_count], params total_backprop_runtime += ( ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) ) - backprop_runtime How bad is the error? error = get_mean_square_error( net, tests->desired_output_values[iteration % tests->test_count] if( error < params->training_threshold ) { cout << "Network has been trained. It took " << iteration << " iterations." << endl; cout << "Final error is " << error << endl << endl; break; Report on the training process. if ( iteration % ( params->training_max_iterations / 10 ) == 0 ) { cout << "Current error is " << error << ". Continuing with training..." << endl; Add to the total runtime total_runtime += ( ( clock() / (double) ( CLOCKS_PER_SEC / 1000 ) ) - runtime if ( iteration == params->training_max_iterations ) { error = get_mean_square_error( net, tests->desired_output_values[(iteration - 1) % tests->test_count] cout << "Maximum of " << iteration << " iterations completed with error of " << error << endl; Write out the time for backpropogation. cout << endl << "Total time in backpropogation: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_backprop_runtime / 1000 ) << " seconds" << endl; cout << "Average time per backpropogation: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_backprop_runtime / total_iterations ) << " milliseconds" << endl << endl; cout << "Total time iterating: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_runtime / 1000 ) << " seconds" << endl; 26

27 cout << "Average time per iteration: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_runtime / total_iterations ) << " milliseconds" << endl << endl; void apply_network_tests( NeuralNetwork net, TestInputs tests ) { Apply the tests to the neural network. Report on the success failure. int total_iterations = 0; float feedforward_runtime, total_feedforward_runtime = 0, runtime, total_runtime = 0; for ( int test_index = 0; test_index < tests->test_count; test_index++ ) { runtime = ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) Setup to record the time total_iterations += 1; Start by feeding forward the provided test inputs. feedforward_runtime = ( clock() / (float)( CLOCKS_PER_SEC / 1000 ) feedforward( net, tests->input_values[test_index] total_feedforward_runtime += ( ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) ) - feedforward_runtime Now, report what the expected output is. cout << "For test input " << ( test_index + 1 ) << endl; cout << " Expected = "; for( int i = 0; i < tests->output_value_size; i++ ) { cout << (int) tests->desired_output_values[test_index][i]; cout << endl; Finally, report what the actual output was. cout << " Received = "; for( int i = 0; i < tests->output_value_size; i++ ) { cout << get_rounded_output_value( net, i cout << endl << endl; total_runtime += ( ( clock() / (double)( CLOCKS_PER_SEC / 1000 ) ) - runtime Write out the time for feedforward. cout << endl << "Total time in feedforward: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_feedforward_runtime / 1000 ) << " seconds " << endl; cout << "Average time per feedforward: " << setiosflags( ios::fixed ) << setprecision( 9 ) << ( total_feedforward_runtime / total_iterations ) << " milliseconds " << endl << endl; cout << "Total time iterating: " << setiosflags( ios::fixed ) << setprecision( 5 ) << ( total_runtime / 1000 ) << " seconds" << endl; cout << "Average time per iteration: " << setiosflags( ios::fixed ) << setprecision( 7 ) << ( total_runtime / total_iterations ) << " milliseconds" << endl << endl; Common To All Vector Versions RunNNetwork.cu #include "NNetwork.h" #include "NNetworkUtils.h" #include "NNetworkCuda.h" bool check_command_line( int argc, char* argv[] ) { Make sure that the correct arguments were passed. 27

28 FILE *fp = NULL; bool ok = true; if( argc < 3 ) { cout << "Format: RunNNetwork <network configuration file> <network test file>" << endl; cout << "Arguments:" << endl; cout << " network configuration file - This file should contain parameters for" << endl; cout << " network size, training rate, etc as " << endl; cout << " a set of data to train the network" << endl; cout << " network test file - This file should contain data for testing the " << endl; cout << " network after it has been trained" << endl; ok = false; else { Make sure that the configuration file exists. if( fp = fopen( argv[1], "r" ) ) { fclose( fp else { cout << "Specified network configuration file [" << argv[1] << "] doesn't exist or cannot be opened" << endl; ok = false; Make sure that the test file exists if( fp = fopen( argv[2], "r" ) ) { fclose( fp else { cout << "Specified network test file [" << argv[2] << "] doesn't exist or cannot be opened" << endl; ok = false; return ok; int main( int argc, char* argv[] ) { Main function for running the network First make sure that the user has provided the necessary input. if (!check_command_line( argc, argv ) ) { return 1; If we received a 3rd argument, then it must be the GPU number. if ( argc == 4 ) { Select the GPU that was called for. selectgpubynumber( argv[3] Make sure that CUDA resources get cleaned up on exit. atexit( cleanupcuda Read in the network configuration. NNetworkConfig nnc = read_network_configuration( argv[1] Read in the network tests. TestInputs tests = read_network_tests( argv[2], nnc->layer_config->input_layer_size(), 28

29 nnc->layer_config->output_layer_size() Build the neural network. NeuralNetwork net = build_neural_network( nnc->layer_config Initialize the network to begin with. initialize_neural_network( net Train the network. do_network_training( net, nnc->tests, nnc->params Test the network and report on results. cout << "Applying test data to network:" << endl; apply_network_tests( net, tests Free up the memory associated with the neural network. destroy_neural_network( net free( net return 0; NNetworkCuda.h #ifndef nnetworkcuda_h #define nnetworkcuda_h #include <stdio.h> #include <cuda.h> #define err ) ( HandleError( err, FILE, LINE ) ) Prototypes void HandleError( cudaerror_t err, const char *file, int line void checkcudaerror( const char *msg, bool exitonerror void selectgpubynumber( char *device_number void cleanupcuda( void #endif NNetworkCuda.cu #include "NNetworkCuda.h" void HandleError( cudaerror_t err, const char *file, int line ) { Handle and report on CUDA errors. if ( err!= cudasuccess ) { printf( "%s in %s at line %d\n", cudageterrorstring( err ), file, line exit( EXIT_FAILURE void checkcudaerror( const char *msg, bool exitonerror ) { Check cuda error and print result if appropriate. cudaerror_t err = cudagetlasterror( if( cudasuccess!= err) { fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudageterrorstring(err) if (exitonerror) { 29


Functions. Angela Chih-Wei Tang ( 唐之瑋 ) Department of Communication Engineering National Central University JhongLi, Taiwan. Functions Angela Chih-Wei Tang ( 唐之瑋 ) Department of Communication Engineering National Central University JhongLi, Taiwan 2009 Fall Outline 5.1 Introduction 5.3 Math Library Functions 5.4 Functions 5.5

More information

Multiple Choice (Questions 1 14) 28 Points Select all correct answers (multiple correct answers are possible)

Multiple Choice (Questions 1 14) 28 Points Select all correct answers (multiple correct answers are possible) Name Closed notes, book and neighbor. If you have any questions ask them. Notes: Segment of code necessary C++ statements to perform the action described not a complete program Program a complete C++ program

More information

Chapter 10 - Notes Applications of Arrays

Chapter 10 - Notes Applications of Arrays Chapter - Notes Applications of Arrays I. List Processing A. Definition: List - A set of values of the same data type. B. Lists and Arrays 1. A convenient way to store a list is in an array, probably a

More information

Multiple Choice (Questions 1 14) 28 Points Select all correct answers (multiple correct answers are possible)

Multiple Choice (Questions 1 14) 28 Points Select all correct answers (multiple correct answers are possible) Name Closed notes, book and neighbor. If you have any questions ask them. Notes: Segment of code necessary C++ statements to perform the action described not a complete program Program a complete C++ program

More information

Distributed Real-Time Control Systems. Lecture 17 C++ Programming Intro to C++ Objects and Classes

Distributed Real-Time Control Systems. Lecture 17 C++ Programming Intro to C++ Objects and Classes Distributed Real-Time Control Systems Lecture 17 C++ Programming Intro to C++ Objects and Classes 1 Bibliography Classical References Covers C++ 11 2 What is C++? A computer language with object oriented

More information

CSCI 171 Chapter Outlines

CSCI 171 Chapter Outlines Contents CSCI 171 Chapter 1 Overview... 2 CSCI 171 Chapter 2 Programming Components... 3 CSCI 171 Chapter 3 (Sections 1 4) Selection Structures... 5 CSCI 171 Chapter 3 (Sections 5 & 6) Iteration Structures

More information

Matlab? Chapter 3-4 Matlab and IPT Basics. Working Environment. Matlab Demo. Array. Data Type. MATLAB Desktop:

Matlab? Chapter 3-4 Matlab and IPT Basics. Working Environment. Matlab Demo. Array. Data Type. MATLAB Desktop: Matlab? Lecture Slides ME 4060 Machine Vision and Vision-based Control Chapter 3-4 Matlab and IPT Basics By Dr. Debao Zhou 1 MATric LABoratory data analysis, prototype and visualization Matrix operation

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail.

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. PCAP Assignment I 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. The multicore CPUs are designed to maximize the execution speed

More information

CS3157: Advanced Programming. Outline

CS3157: Advanced Programming. Outline CS3157: Advanced Programming Lecture #8 Feb 27 Shlomo Hershkop shlomo@cs.columbia.edu 1 Outline More c Preprocessor Bitwise operations Character handling Math/random Review for midterm Reading: k&r ch

More information

CSE123. Program Design and Modular Programming Functions 1-1

CSE123. Program Design and Modular Programming Functions 1-1 CSE123 Program Design and Modular Programming Functions 1-1 5.1 Introduction A function in C is a small sub-program performs a particular task, supports the concept of modular programming design techniques.

More information

1 PHASE1PRUNE INTRODUCTION 1

1 PHASE1PRUNE INTRODUCTION 1 1 PHASE1PRUNE INTRODUCTION 1 1. Introduction. Phase one of Kociemba s two-phase algorithm involves finding a sequence of moves that takes an arbitrary position into the H group, generated by U, F 2, R2,

More information

Pointers, Dynamic Data, and Reference Types

Pointers, Dynamic Data, and Reference Types Pointers, Dynamic Data, and Reference Types Review on Pointers Reference Variables Dynamic Memory Allocation The new operator The delete operator Dynamic Memory Allocation for Arrays 1 C++ Data Types simple

More information

CS 326 Operating Systems C Programming. Greg Benson Department of Computer Science University of San Francisco

CS 326 Operating Systems C Programming. Greg Benson Department of Computer Science University of San Francisco CS 326 Operating Systems C Programming Greg Benson Department of Computer Science University of San Francisco Why C? Fast (good optimizing compilers) Not too high-level (Java, Python, Lisp) Not too low-level

More information

Agenda. The main body and cout. Fundamental data types. Declarations and definitions. Control structures

Agenda. The main body and cout. Fundamental data types. Declarations and definitions. Control structures The main body and cout Agenda 1 Fundamental data types Declarations and definitions Control structures References, pass-by-value vs pass-by-references The main body and cout 2 C++ IS AN OO EXTENSION OF

More information

COMP322 - Introduction to C++ Lecture 02 - Basics of C++

COMP322 - Introduction to C++ Lecture 02 - Basics of C++ COMP322 - Introduction to C++ Lecture 02 - Basics of C++ School of Computer Science 16 January 2012 C++ basics - Arithmetic operators Where possible, C++ will automatically convert among the basic types.

More information

Programming. C++ Basics

Programming. C++ Basics Programming C++ Basics Introduction to C++ C is a programming language developed in the 1970s with the UNIX operating system C programs are efficient and portable across different hardware platforms C++

More information

CSE au Midterm Exam Nov. 2, 2018 Sample Solution

CSE au Midterm Exam Nov. 2, 2018 Sample Solution Question 1. (16 points) Build tools and make. We re building a C++ software back-end prototype for a new food web site. So far, we ve got the following source files with the code for two main programs

More information

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de December 15, 2015 CUDA Programming Fundamentals CUDA

More information

C Review. MaxMSP Developers Workshop Summer 2009 CNMAT

C Review. MaxMSP Developers Workshop Summer 2009 CNMAT C Review MaxMSP Developers Workshop Summer 2009 CNMAT C Syntax Program control (loops, branches): Function calls Math: +, -, *, /, ++, -- Variables, types, structures, assignment Pointers and memory (***

More information

Lecture 04 FUNCTIONS AND ARRAYS

Lecture 04 FUNCTIONS AND ARRAYS Lecture 04 FUNCTIONS AND ARRAYS 1 Motivations Divide hug tasks to blocks: divide programs up into sets of cooperating functions. Define new functions with function calls and parameter passing. Use functions

More information

C Functions. 5.2 Program Modules in C

C Functions. 5.2 Program Modules in C 1 5 C Functions 5.2 Program Modules in C 2 Functions Modules in C Programs combine user-defined functions with library functions - C standard library has a wide variety of functions Function calls Invoking

More information

Common Misunderstandings from Exam 1 Material

Common Misunderstandings from Exam 1 Material Common Misunderstandings from Exam 1 Material Kyle Dewey Stack and Heap Allocation with Pointers char c = c ; char* p1 = malloc(sizeof(char)); char** p2 = &p1; Where is c allocated? Where is p1 itself

More information

Main Program. C Programming Notes. #include <stdio.h> main() { printf( Hello ); } Comments: /* comment */ //comment. Dr. Karne Towson University

Main Program. C Programming Notes. #include <stdio.h> main() { printf( Hello ); } Comments: /* comment */ //comment. Dr. Karne Towson University C Programming Notes Dr. Karne Towson University Reference for C http://www.cplusplus.com/reference/ Main Program #include main() printf( Hello ); Comments: /* comment */ //comment 1 Data Types

More information

The output: The address of i is 0xbf85416c. The address of main is 0x80483e4. arrays.c. 1 #include <stdio.h> 3 int main(int argc, char **argv) 4 {

The output: The address of i is 0xbf85416c. The address of main is 0x80483e4. arrays.c. 1 #include <stdio.h> 3 int main(int argc, char **argv) 4 { Memory A bit is a binary digit, either 0 or 1. A byte is eight bits, and can thus represent 256 unique values, such as 00000000 and 10010110. Computer scientists often think in terms of hexadecimal, rather

More information

THE C STANDARD LIBRARY & MAKING YOUR OWN LIBRARY. ISA 563: Fundamentals of Systems Programming

THE C STANDARD LIBRARY & MAKING YOUR OWN LIBRARY. ISA 563: Fundamentals of Systems Programming THE C STANDARD LIBRARY & MAKING YOUR OWN LIBRARY ISA 563: Fundamentals of Systems Programming Announcements Homework 2 posted Homework 1 due in two weeks Typo on HW1 (definition of Fib. Sequence incorrect)

More information

CSE 333 Final Exam June 6, 2017 Sample Solution

CSE 333 Final Exam June 6, 2017 Sample Solution Question 1. (24 points) Some C and POSIX I/O programming. Given an int file descriptor returned by open(), write a C function ReadFile that reads the entire file designated by that file descriptor and

More information

CSE 333 Autumn 2013 Midterm

CSE 333 Autumn 2013 Midterm CSE 333 Autumn 2013 Midterm Please do not read beyond this cover page until told to start. A question involving what could be either C or C++ is about C, unless it explicitly states that it is about C++.

More information

CSE 333 Midterm Exam July 24, Name UW ID#

CSE 333 Midterm Exam July 24, Name UW ID# Name UW ID# There are 6 questions worth a total of 100 points. Please budget your time so you get to all of the questions. Keep your answers brief and to the point. The exam is closed book, closed notes,

More information

Chapter Four: Loops. Slides by Evan Gallagher. C++ for Everyone by Cay Horstmann Copyright 2012 by John Wiley & Sons. All rights reserved

Chapter Four: Loops. Slides by Evan Gallagher. C++ for Everyone by Cay Horstmann Copyright 2012 by John Wiley & Sons. All rights reserved Chapter Four: Loops Slides by Evan Gallagher The Three Loops in C++ C++ has these three looping statements: while for do The while Loop while (condition) { statements } The condition is some kind of test

More information

10/23/02 21:20:33 IO_Examples

10/23/02 21:20:33 IO_Examples 1 Oct 22 22:07 2000 extractor1.c Page 1 istream &operator>>( istream &in, Point &p ){ char junk; in >> junk >> p.x >> junk >> p.y >> junk; return in; 2 Oct 22 22:07 2000 extractor2.c Page 1 istream &operator>>(

More information

ESC101N: Fundamentals of Computing End-sem st semester

ESC101N: Fundamentals of Computing End-sem st semester ESC101N: Fundamentals of Computing End-sem 2010-11 1st semester Instructor: Arnab Bhattacharya 8:00-11:00am, 15th November, 2010 Instructions 1. Please write your name, roll number and section below. 2.

More information

Building on the foundation. Now that we know a little about cout cin math operators boolean operators making decisions using if statements

Building on the foundation. Now that we know a little about cout cin math operators boolean operators making decisions using if statements Chapter 5 Looping Building on the foundation Now that we know a little about cout cin math operators boolean operators making decisions using if statements Advantages of Computers Computers are really

More information

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part

More information

Chapter 3 - Functions

Chapter 3 - Functions Chapter 3 - Functions 1 Outline 3.1 Introduction 3.2 Program Components in C++ 3.3 Math Library Functions 3.4 Functions 3.5 Function Definitions 3.6 Function Prototypes 3.7 Header Files 3.8 Random Number

More information

CPSC 427: Object-Oriented Programming

CPSC 427: Object-Oriented Programming CPSC 427: Object-Oriented Programming Michael J. Fischer Lecture 10 October 1, 2018 CPSC 427, Lecture 10, October 1, 2018 1/20 Brackets Example (continued from lecture 8) Stack class Brackets class Main

More information

Tutorial 13 Salary Survey Application: Introducing One- Dimensional Arrays

Tutorial 13 Salary Survey Application: Introducing One- Dimensional Arrays Tutorial 13 Salary Survey Application: Introducing One- Dimensional Arrays Outline 13.1 Test-Driving the Salary Survey Application 13.2 Introducing Arrays 13.3 Declaring and Initializing Arrays 13.4 Constructing

More information

Lab 6. Review of Variables, Formatting & Loops By: Dr. John Abraham, Professor, UTPA

Lab 6. Review of Variables, Formatting & Loops By: Dr. John Abraham, Professor, UTPA Variables: Lab 6 Review of Variables, Formatting & Loops By: Dr. John Abraham, Professor, UTPA We learned that a variable is a name assigned to the first byte of the necessary memory to store a value.

More information

My malloc: mylloc and mhysa. Johan Montelius HT2016

My malloc: mylloc and mhysa. Johan Montelius HT2016 1 Introduction My malloc: mylloc and mhysa Johan Montelius HT2016 So this is an experiment where we will implement our own malloc. We will not implement the world s fastest allocator, but it will work

More information

The American University in Cairo Department of Computer Science & Engineering CSCI &09 Dr. KHALIL Exam-I Fall 2011

The American University in Cairo Department of Computer Science & Engineering CSCI &09 Dr. KHALIL Exam-I Fall 2011 The American University in Cairo Department of Computer Science & Engineering CSCI 106-07&09 Dr. KHALIL Exam-I Fall 2011 Last Name :... ID:... First Name:... Form I Section No.: EXAMINATION INSTRUCTIONS

More information

Integer Data Types. Data Type. Data Types. int, short int, long int

Integer Data Types. Data Type. Data Types. int, short int, long int Data Types Variables are classified according to their data type. The data type determines the kind of information that may be stored in the variable. A data type is a set of values. Generally two main

More information

The following program computes a Calculus value, the "trapezoidal approximation of

The following program computes a Calculus value, the trapezoidal approximation of Multicore machines and shared memory Multicore CPUs have more than one core processor that can execute instructions at the same time. The cores share main memory. In the next few activities, we will learn

More information

CSCI-243 Exam 2 Review February 22, 2015 Presented by the RIT Computer Science Community

CSCI-243 Exam 2 Review February 22, 2015 Presented by the RIT Computer Science Community CSCI-43 Exam Review February, 01 Presented by the RIT Computer Science Community http://csc.cs.rit.edu C Preprocessor 1. Consider the following program: 1 # include 3 # ifdef WINDOWS 4 # include

More information

CA341 - Comparative Programming Languages

CA341 - Comparative Programming Languages CA341 - Comparative Programming Languages David Sinclair Dynamic Data Structures Generally we do not know how much data a program will have to process. There are 2 ways to handle this: Create a fixed data

More information

CS 376b Computer Vision

CS 376b Computer Vision CS 376b Computer Vision 09 / 25 / 2014 Instructor: Michael Eckmann Today s Topics Questions? / Comments? Enhancing images / masks Cross correlation Convolution C++ Cross-correlation Cross-correlation involves

More information

C Syntax Arrays and Loops Math Strings Structures Pointers File I/O. Final Review CS Prof. Jonathan Ventura. Prof. Jonathan Ventura Final Review

C Syntax Arrays and Loops Math Strings Structures Pointers File I/O. Final Review CS Prof. Jonathan Ventura. Prof. Jonathan Ventura Final Review CS 2060 Variables Variables are statically typed. Variables must be defined before they are used. You only specify the type name when you define the variable. int a, b, c; float d, e, f; char letter; //

More information

1- Write a single C++ statement that: A. Calculates the sum of the two integrates 11 and 12 and outputs the sum to the consol.

1- Write a single C++ statement that: A. Calculates the sum of the two integrates 11 and 12 and outputs the sum to the consol. 1- Write a single C++ statement that: A. Calculates the sum of the two integrates 11 and 12 and outputs the sum to the consol. B. Outputs to the console a floating point number f1 in scientific format

More information

Review Topics. Final Exam Review Slides

Review Topics. Final Exam Review Slides Review Topics Final Exam Review Slides!! Transistors and Gates! Combinational Logic! LC-3 Programming!! Original slides from Gregory Byrd, North Carolina State University Modified slides by Chris Wilcox,

More information

Chapter Four: Loops II

Chapter Four: Loops II Chapter Four: Loops II Slides by Evan Gallagher & Nikolay Kirov Chapter Goals To understand nested loops To implement programs that read and process data sets To use a computer for simulations Processing

More information

ECE264 Fall 2013 Exam 3, November 20, 2013

ECE264 Fall 2013 Exam 3, November 20, 2013 ECE264 Fall 2013 Exam 3, November 20, 2013 In signing this statement, I hereby certify that the work on this exam is my own and that I have not copied the work of any other student while completing it.

More information

Reference operator (&)

Reference operator (&) Pointers Each cell can be easily located in the memory because it has a unique address and all the memory cells follow a successive pattern. For example, if we are looking for cell 1776 we know that it

More information

GPU Programming. Rupesh Nasre.

GPU Programming. Rupesh Nasre. GPU Programming Rupesh Nasre. http://www.cse.iitm.ac.in/~rupesh IIT Madras July 2017 Debugging Debugging parallel programs is difficult. Non-determinism due to thread-scheduling Output can be different

More information

Two s Complement Review. Two s Complement Review. Agenda. Agenda 6/21/2011

Two s Complement Review. Two s Complement Review. Agenda. Agenda 6/21/2011 Two s Complement Review CS 61C: Great Ideas in Computer Architecture (Machine Structures) Introduction to C (Part I) Instructor: Michael Greenbaum http://inst.eecs.berkeley.edu/~cs61c/su11 Suppose we had

More information

When you add a number to a pointer, that number is added, but first it is multiplied by the sizeof the type the pointer points to.

When you add a number to a pointer, that number is added, but first it is multiplied by the sizeof the type the pointer points to. Refresher When you add a number to a pointer, that number is added, but first it is multiplied by the sizeof the type the pointer points to. i.e. char *ptr1 = malloc(1); ptr1 + 1; // adds 1 to pointer

More information

ECE264 Fall 2013 Exam 2, October 24, 2013

ECE264 Fall 2013 Exam 2, October 24, 2013 ECE Fall 0 Exam, October, 0 If this is an on-line exam, you have 0 minutes to finish the exam. When the time limit is reached, the system will automatically close. If this is a paper exam, you have 0 minutes.

More information

Multiple Choice (Questions 1 13) 26 Points Select all correct answers (multiple correct answers are possible)

Multiple Choice (Questions 1 13) 26 Points Select all correct answers (multiple correct answers are possible) Name Closed notes, book and neighbor. If you have any questions ask them. Notes: Segment of code necessary C++ statements to perform the action described not a complete program Program a complete C++ program

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Variables. Data Types.

Variables. Data Types. Variables. Data Types. The usefulness of the "Hello World" programs shown in the previous section is quite questionable. We had to write several lines of code, compile them, and then execute the resulting

More information

Programming in C. Pointers and Arrays

Programming in C. Pointers and Arrays Programming in C Pointers and Arrays NEXT SET OF SLIDES FROM DENNIS FREY S FALL 2011 CMSC313 http://www.csee.umbc.edu/courses/undergraduate/313/fall11/" Pointers and Arrays In C, there is a strong relationship

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

For personnal use only

For personnal use only Inverting Large Images Using CUDA Finnbarr P. Murphy (fpm@fpmurphy.com) This is a simple example of how to invert a very large image, stored as a vector using nvidia s CUDA programming environment and

More information

Optimizing CUDA for GPU Architecture. CSInParallel Project

Optimizing CUDA for GPU Architecture. CSInParallel Project Optimizing CUDA for GPU Architecture CSInParallel Project August 13, 2014 CONTENTS 1 CUDA Architecture 2 1.1 Physical Architecture........................................... 2 1.2 Virtual Architecture...........................................

More information

A Crash Course in C. Steven Reeves

A Crash Course in C. Steven Reeves A Crash Course in C Steven Reeves This class will rely heavily on C and C++. As a result this section will help students who are not familiar with C or who need a refresher. By the end of this section

More information

Copyright 2013 Thomas W. Doeppner. IX 1

Copyright 2013 Thomas W. Doeppner. IX 1 Copyright 2013 Thomas W. Doeppner. IX 1 If we have only one thread, then, no matter how many processors we have, we can do only one thing at a time. Thus multiple threads allow us to multiplex the handling

More information

BIL 104E Introduction to Scientific and Engineering Computing. Lecture 4

BIL 104E Introduction to Scientific and Engineering Computing. Lecture 4 BIL 104E Introduction to Scientific and Engineering Computing Lecture 4 Introduction Divide and Conquer Construct a program from smaller pieces or components These smaller pieces are called modules Functions

More information

Non-numeric types, boolean types, arithmetic. operators. Comp Sci 1570 Introduction to C++ Non-numeric types. const. Reserved words.

Non-numeric types, boolean types, arithmetic. operators. Comp Sci 1570 Introduction to C++ Non-numeric types. const. Reserved words. , ean, arithmetic s s on acters Comp Sci 1570 Introduction to C++ Outline s s on acters 1 2 3 4 s s on acters Outline s s on acters 1 2 3 4 s s on acters ASCII s s on acters ASCII s s on acters Type: acter

More information

GPU 1. CSCI 4850/5850 High-Performance Computing Spring 2018

GPU 1. CSCI 4850/5850 High-Performance Computing Spring 2018 GPU 1 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives

More information

COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE

COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE VANYA YANEVA Ajitha Rajan, Christophe Dubach ISSTA 2017 10 July 2017 Santa Barbara, CA EMBEDDED SOFTWARE IS EVERYWHERE ITS SAFETY AND CORRECTNESS

More information

C: How to Program. Week /Apr/23

C: How to Program. Week /Apr/23 C: How to Program Week 9 2007/Apr/23 1 Review of Chapters 1~5 Chapter 1: Basic Concepts on Computer and Programming Chapter 2: printf and scanf (Relational Operators) keywords Chapter 3: if (if else )

More information

Computing and Statistical Data Analysis Lecture 3

Computing and Statistical Data Analysis Lecture 3 Computing and Statistical Data Analysis Lecture 3 Type casting: static_cast, etc. Basic mathematical functions More i/o: formatting tricks Scope, namspaces Functions 1 Type casting Often we need to interpret

More information

today cs3157-fall2002-sklar-lect05 1

today cs3157-fall2002-sklar-lect05 1 today homework #1 due on monday sep 23, 6am some miscellaneous topics: logical operators random numbers character handling functions FILE I/O strings arrays pointers cs3157-fall2002-sklar-lect05 1 logical

More information

Project 1: Convex hulls and line segment intersection

Project 1: Convex hulls and line segment intersection MCS 481 / David Dumas / Spring 2014 Project 1: Convex hulls and line segment intersection Due at 10am on Monday, February 10 0. Prerequisites For this project it is expected that you already have CGAL

More information

CSE 333 Midterm Exam Sample Solution 7/28/14

CSE 333 Midterm Exam Sample Solution 7/28/14 Question 1. (20 points) C programming. For this question implement a C function contains that returns 1 (true) if a given C string appears as a substring of another C string starting at a given position.

More information