ECE/CSC 506: Architecture of Parallel Computers
Program 2: Simulating Ocean Currents (Serial, OpenMP, and CUDA versions)
Due: Friday, September 29, 2017
Preliminary version; differences between the preliminary version and this version

1. Overall Problem Description

In this project, you will add new features to a trace-driven Ocean Current Simulator. You can fetch the simulation code to login.hpc.ncsu.edu with the command

    wget https://www.csc2.ncsu.edu/faculty/efg/506/f17/www/homework/p2/program2.tgz

You are provided with a superclass grid.cpp and a derived class solver_serial.cpp. There are incomplete functions in solver_serial that need to be completed. You will work on three versions of this simulation: a serial version, an OpenMP version, and a CUDA version. Your project should build on a Linux machine. The most challenging part of this machine problem is to understand the loop dependences and the decomposition of the tasks. In this project, you will implement red-black ordering as discussed in the lecture. The purpose is to understand the importance of parallelizing a program.

In the OpenMP implementation, you will need to build a new derived class solver_omp.cpp in line with solver_serial. In the CUDA version, you need to build a new derived class solver_cuda.cu. Your project should build on login.hpc.ncsu.edu. The objective of this part of the project is to understand how to parallelize a serial application.

2. Simulator

The specifications of the project are as follows. A grid (array) of regular dimension N x N is created and initialized with the values from the input trace file. The heart of this simulator is an equation-solver function, which solves a simple partial differential equation on the grid. The border rows and columns do not participate in the computation, as the boundary values do not change; only the interior (N-2) x (N-2) points are updated by the equation solver.

The computation proceeds over a number of sweeps. In each sweep, it operates on all the elements of the grid, replacing the value of each element with a weighted average of itself and its four nearest neighbor elements. The updates are done in place in the grid, so a point sees the new values of the points above and to the left of it, and the old values of the points below it and to its right. During each sweep, the equation solver also computes the average difference of an updated element from its previous value. If this average difference over all elements is smaller than a predefined tolerance parameter, the solution is said to have converged and the solver exits at the end of the sweep. Otherwise, it performs another sweep and tests for convergence again.

This project exploits parallelism using red-black ordering. The idea is to separate the grid points into alternating red points and black points as on a checkerboard, as shown in the figure, so that no red point is adjacent to another red point, and no black point is adjacent to another black point. Since each point reads only its four nearest neighbors, in order to compute a red point we do not need the updated value of any other red point, but only the updated values of the above and left black points (in a standard sweep), and vice versa. We can therefore divide a grid sweep into two phases: first compute all red points, and then compute all black points. Within each phase there are no dependences among grid points, so we can compute all red points in parallel, then synchronize globally, and then compute all black points in parallel.

[Figure: checkerboard pattern of alternating red and black grid points]

3. Building the simulator

You are provided with the superclass grid.cpp and the derived class solver_serial.cpp, along with main.cpp. The grid class is used to create a grid array and defines the functions needed to traverse the grid, which are as follows:

    void initialize_grid(file* p_file);
    void set_tol_value(float tol) {tolerance = tol;}
    void print_grid();
    virtual void simulate_eqn_solver() = 0;

The initialize_grid method initializes the grid with the values from the input file. The set_tol_value method updates the tolerance value associated with the grid. The print_grid method displays the final contents of the grid. The simulate_eqn_solver function performs the finite-difference operation over the grid and invokes the function for red-black ordering. This function is declared as a pure virtual function, since the actual definition is given in solver_serial.cpp.
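The OpenMP and CUDA solvers are further derived classes that override the same pure virtual function. As a rough illustration only (the constructor signature and member names here are assumptions; follow the actual interfaces in the provided grid.h and solver_serial.h), such a header might look like this:

    // solver_omp.h -- illustrative sketch, not the provided code
    #ifndef SOLVER_OMP_H
    #define SOLVER_OMP_H

    #include "grid.h"

    class solver_omp : public grid {
    public:
        // Override the pure virtual function declared in the grid class.
        virtual void simulate_eqn_solver();

    private:
        // Helper that performs the two-phase red-black sweep in parallel.
        void red_black_ordering();
    };

    #endif

The point of the pattern is that main.cpp can drive any of the three solvers through the common grid interface, while each derived class supplies its own simulate_eqn_solver and red_black_ordering.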

The solver_serial.cpp class contains the functions simulate_eqn_solver and red_black_ordering. You need to implement them as shown in the following pseudocode:

    while (!done) do                 /* outermost loop over sweeps */
        diff = 0;                    /* initialize accumulated difference to 0 */
        for i <- 1 to n do           /* sweep over non-border points of grid */
            for j <- 1 to n do
                temp = A[i,j];       /* save old value of element */
                A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
                                A[i,j+1] + A[i+1,j]);   /* compute average */
                diff += abs(A[i,j] - temp);
            end for
        end for
        if (diff/(n*n) < TOL) then done = 1;
    end while

Once you're ready to build your program, you can compile as follows:

    make serial
    g++ -O0 -Wall -Werror -D SERIAL -c main.cpp -o SERIAL/main.o
    g++ -O0 -Wall -Werror -D SERIAL -c grid.cpp -o SERIAL/grid.o
    g++ -O0 -Wall -Werror -D SERIAL -c solver_serial.cpp -o SERIAL/solver_serial.o
    g++ -O0 -Wall -Werror -D SERIAL -o ocean_sim_serial SERIAL/main.o SERIAL/grid.o SERIAL/solver_serial.o -lm
    FA OCEAN SIMULATOR SERIAL VERSION
    Compilation Done ---> nothing else to make :)

An executable called ocean_sim_serial will be created. In order to run your simulator, you need to execute the following command:

    ./ocean_sim_serial dimension tolerance trace_file
    ./ocean_sim_serial dimension tolerance trace_file num_of_threads

where
- ocean_sim_serial is the executable of the Ocean simulator generated by make
- dimension is the grid dimension
- tolerance is the point of convergence for the equation solver
- trace_file is the input file that has the dummy ocean current trace
- num_of_threads is the number of threads or ...
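The pseudocode above is the standard in-place sweep. In the red-black version, each sweep is split into the two phases described in Section 2, and within a phase the point updates are independent, so the OpenMP solver can parallelize the loop over rows. The following is a minimal sketch, assuming the grid is a 2-D array of floats named A with full dimension n; the actual member names and storage layout in the provided classes may differ:

    #include <cmath>
    #include <omp.h>

    // One red-black sweep over the interior points of an n x n grid.
    // Phase 0 updates "red" points ((i + j) even); phase 1 updates "black" points.
    // Returns the accumulated absolute difference for the convergence test.
    float red_black_sweep(float **A, int n)
    {
        float diff = 0.0f;
        for (int phase = 0; phase < 2; phase++) {
            // Within a phase there are no dependences, so rows run in parallel.
            #pragma omp parallel for reduction(+:diff)
            for (int i = 1; i < n - 1; i++) {
                for (int j = 1; j < n - 1; j++) {
                    if ((i + j) % 2 != phase)
                        continue;                 // skip points of the other color
                    float temp = A[i][j];
                    A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                              + A[i][j+1] + A[i+1][j]);
                    diff += std::fabs(A[i][j] - temp);
                }
            }
            // The implicit barrier at the end of the parallel for separates the phases.
        }
        // Caller averages diff over the interior points and compares it against the
        // tolerance, as in the pseudocode above.
        return diff;
    }

The serial red_black_ordering is identical except that the #pragma is omitted, which is one reason the OpenMP derived class can be written "in line with" solver_serial.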

Your output should match the given validation runs in terms of results and format. You will need to check the results using the diff command:

    diff -iw <given output file> <your output file>

You can dump the output from your simulator to stdout and redirect it to a file using the > operator. You will be provided with outputs of 5 validation runs.

You may build the OpenMP version and the CUDA version similarly by running

    make omp
    make cuda

ocean_sim_omp and ocean_sim_cuda will be generated for each implementation respectively. You may also run make to generate all three executables in one go, once you have tested all three implementations. Please ensure that your serial and OpenMP programs run on hpc.ncsu.edu, and that your CUDA program runs on the ARC Cluster. TAs will use these environments to verify your code.

Editing, Compiling and Running

Please refer to the Guide to ARC document for how to connect to the ARC Cluster. Once you have a prompt on a compute node, you should still be able to see the files you uploaded. To edit your program, you can make edits on a Linux host and just push changes up to ARC using sftp. However, it will probably be easier to just edit directly on the ARC machine. If you're accustomed to vim or emacs, they are both there for you. If you're not used to editing on a Linux machine, nano is available and fairly easy to use.

To run the test vecadd code, run the make command, then ./vec_add 10 input.txt. You should see the following output:

    [unityid@c23 vecadd]$ make
    /usr/local/cuda/bin/nvcc -arch=sm_30 -g -G -O0 -o vectoradd.o -c vectoradd.cu
    /usr/local/cuda/bin/nvcc -L/usr/local/cuda/lib64 -lcuda -o vec_add vectoradd.o
    FA VECTOR ADDITION CUDA SAMPLE PROGRAM
    Compilation Done ---> nothing else to make :)
    [unityid@c23 vecadd]$ ./vec_add 10 input.txt
    ===== CSC506 Vector Add CUDA Sample Code =====
    NUM OF ELEMENTS: 10
    TRACE FILE: input.txt
    [Vector addition of 10 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 1 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    =====Completed Vector Addition=====
    ========== Done ==========
    [unityid@c23 vecadd]$

Once you're ready to build your program, you can compile as follows:

    [unityid@c23 506_ocean_sim_cuda]$ make clean; make cuda
    rm -rf SERIAL OMP CUDA
    rm -f *.o ocean_sim_*
    g++ -O0 -Wall -Werror -D CUDA -c main.cpp -o CUDA/main.o
    g++ -O0 -Wall -Werror -D CUDA -c grid.cpp -o CUDA/grid.o
    /usr/local/cuda-7.0/bin/nvcc -arch=sm_21 -g -G -O0 -o CUDA/solver_cuda.o -c solver_cuda.cu
    /usr/local/cuda-7.0/bin/nvcc -arch=sm_21 -g -G -O0 -o ocean_sim_cuda CUDA/main.o CUDA/grid.o CUDA/solver_cuda.o
    FA OCEAN SIMULATOR CUDA VERSION
    [unityid@c23 506_ocean_sim_cuda]$

When you're ready to run, be sure you're on a compute node. Your prompt should say something like [unityid@c23 506_ocean_sim_cuda]$. You can run your program just like you would an ordinary program:

    ./ocean_sim_cuda input_16x16.txt > val_16x16.txt

Your output should match the given validation runs in terms of results and format. You will need to check the results using the diff command:

    diff -iw <given output file> <your output file>

You can dump the output from your simulator to stdout and redirect it to a file using the > operator. You will be provided with outputs of 5 validation runs.
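The Report section below asks you to compare the sequence of commands needed to launch a CUDA kernel against launching serial code. As a rough, illustrative sketch only (not the provided code, and assuming the grid has been flattened into a 1-D device array d_A of dimension n x n), one red-black phase might be expressed as a kernel like this:

    // Illustrative sketch of one red-black phase as a CUDA kernel.
    // d_A is a flattened n x n grid in device memory; d_diff accumulates differences.
    __global__ void red_black_phase(float *d_A, float *d_diff, int n, int phase)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column

        if (i < 1 || i >= n - 1 || j < 1 || j >= n - 1)  // skip border points
            return;
        if ((i + j) % 2 != phase)                        // skip the other color
            return;

        float temp = d_A[i * n + j];
        d_A[i * n + j] = 0.2f * (d_A[i * n + j] + d_A[i * n + (j - 1)] +
                                 d_A[(i - 1) * n + j] + d_A[i * n + (j + 1)] +
                                 d_A[(i + 1) * n + j]);
        atomicAdd(d_diff, fabsf(d_A[i * n + j] - temp));
    }

    // Host-side launch sequence for one sweep (two phases, one launch each):
    //   dim3 threads(16, 16);
    //   dim3 blocks((n + threads.x - 1) / threads.x, (n + threads.y - 1) / threads.y);
    //   cudaMemset(d_diff, 0, sizeof(float));
    //   red_black_phase<<<blocks, threads>>>(d_A, d_diff, n, 0);   // red phase
    //   red_black_phase<<<blocks, threads>>>(d_A, d_diff, n, 1);   // black phase
    //   cudaMemcpy(&diff, d_diff, sizeof(float), cudaMemcpyDeviceToHost);

Launching the kernel twice per sweep, once per color, on the default stream preserves the red-then-black ordering, since kernels on the same stream execute in order. The host/device copies and launch configuration are exactly the extra steps you will be contrasting with the serial version in your report.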

4. Report

For the OpenMP version, investigate the effect of varying the grid size and the number of threads. Graph the results and explain what you see. For the CUDA version, compare the sequence of commands to launch a CUDA kernel against launching serial code. If you obtain a speedup, explain why. If you don't, explain the overhead that prevents it.

5. Grading

- 20%: Your code compiles successfully.
- 40%: Your output matches exactly for runs on all five files (points will be equally distributed).
- 40%: Report. Credit will be given on the statistics shown and discussion presented.

6. Submission Format

In order to grade all submissions promptly, we have to ask you to follow the submission format. Your final submission should be in a zip file named unityid1_unityid2_program2.zip. If you are not working as a team, just name the file unityid_program2.zip. Your Unity ID is the one generated from your name, with letters and sometimes digits at the end. Do NOT use your campus card ID or your alias. Your zip file should only contain the following files:

    grid.cpp
    grid.h
    main.cpp
    main.h
    Makefile
    solver_cuda.cu
    solver_cuda.h
    solver_omp.cpp
    solver_omp.h
    solver_serial.cpp
    solver_serial.h

Do not include the parent folder in your zip file. Also, there is no need to include the input and output files. This command should help you generate the zip file:

    zip -r unityid1_unityid2_program2.zip *.cpp *.h *.cu Makefile

Not following the submission format properly will result in a maximum of 5 points penalty.

7. Suggestions

- Read the main program and the superclasses carefully, and understand how the program works.
- Most of the code given to you is well encapsulated, so you do not have to modify most of the existing functions. You just need to complete the definitions of the incomplete functions.
- Understand how the different CUDA APIs handle memory-allocation errors and out-of-bounds references.
- Make sure there are no memory leaks in your program by de-allocating memory after you are done with it (see the sketch after this list).
- There will be occasional downtime with ARC. Please start working on the assignment as early as possible and do not wait until the last minute.
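For the CUDA-related suggestions above, a minimal, illustrative error-checking and cleanup pattern looks like the following; the names are placeholders, not those used in the provided code. Every CUDA runtime call returns a cudaError_t, and checking it is the simplest way to catch failed allocations early.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Check the status returned by a CUDA runtime call and abort on failure.
    #define CUDA_CHECK(call)                                                \
        do {                                                                \
            cudaError_t err = (call);                                       \
            if (err != cudaSuccess) {                                       \
                fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                        cudaGetErrorString(err), __FILE__, __LINE__);       \
                exit(EXIT_FAILURE);                                         \
            }                                                               \
        } while (0)

    void example(int n)
    {
        float *d_A = NULL;
        // Fails cleanly (instead of silently) if the device is out of memory.
        CUDA_CHECK(cudaMalloc((void **)&d_A, n * n * sizeof(float)));

        // ... copy data in, launch kernels, copy results back ...

        // Release device memory when done with it, so there are no leaks.
        CUDA_CHECK(cudaFree(d_A));
    }

The same idea applies to host-side allocations: pair every new/malloc with a delete/free once the grid is no longer needed.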
