r.cuda.example
Developing custom GRASS GIS CUDA parallel modules
Andrej Osterman
s51mo.net/fresnel
Compiled on 1 June 2015 at 19:11
Contents

1 Introduction
2 File main.cpp
3 File src.description.html
4 Common source cudagrass
5 File cudamain.cpp
6 Class CUserKernelArg (file CUserModule.h)
7 Class CUserModule (file CUserModule.h)
7.1 Allocate memory on host
7.2 Allocate memory on device
7.3 Run CUDA kernel once
7.4 Run CUDA kernel for each segment
8 Variable suffix
9 Compile all together
10 Debug printouts
10.1 Print last CUDA error (debug=1)
10.2 Print arguments to CUDA part of the module (debug=2)
10.3 Print total elapsed time for CUDA calculation part (debug=4)
10.4 Print compile properties (debug=16)
10.5 Print device(s) properties (debug=32)
10.6 Print host properties (debug=64)
10.7 Print device buffers properties (debug=128)
10.8 Print host buffers properties (debug=256)
10.9 Print input and output map properties (debug=512)
10.10 Print scheduler properties (debug=1024)
10.11 Print run properties (debug=2048)
10.12 Print clutter map properties (debug=4096)
11 Contact
1 Introduction

This is a programmer's manual for developing custom parallel raster GRASS GIS modules running on a CUDA GPU. This manual also contains source code provided by NVIDIA Corporation. If you have not installed the modules yet, please download the sources and the installation manual first.

A good starting point for developing custom sequential raster programs is the r.example module. To develop a sequential module, only two source files must be modified: main.c and description.html (directory doc/raster/r.example in the GRASS GIS source code). Into main.c one can write the new ANSI-C application code for the new module; into description.html one can write the help and a description of the module.

Developing a parallel module in GRASS GIS running on the GPU is a slightly more complicated job than developing a sequential module. Two different software packages, GRASS GIS and CUDA, must be merged together into one application (module). Due to the GPU's limited memory size, the memory management is more complicated. Program arguments must be copied to the GPU. Scheduling must be implemented to run individual pieces of the program.

In this manual we will briefly describe the r.cuda.example module, which does a very simple job on raster maps: it reads each grid cell and colors it black if its height is above (or below) some preset threshold. The source code of the r.cuda.example module can be found in the cuda-workspace/cudaexample directory.

To develop a parallel module, five files must be modified. Files main.cpp and src.description.html (directory cuda-workspace/cudaexample/src_gui_grass) are related more to GRASS GIS. Files cudamain.cpp, CUserModule.cu and CUserModule.h (directory cuda-workspace/cudaexample/src) are related more to CUDA. Below we describe each file separately.
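To fix ideas, the cell-by-cell job the module performs can be sketched as a sequential C++ function. This is only an illustration: the real module works on GRASS rasters and runs the loop as a CUDA kernel, and the threshold value and names here are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Sequential sketch of the r.cuda.example job: read each grid cell and
// "color it black" (represented here by the value 0.0) when its height
// exceeds a preset threshold. The real module does this per cell on the GPU.
std::vector<double> threshold_cells(const std::vector<double>& heights,
                                    double threshold) {
    std::vector<double> out(heights.size());
    for (std::size_t i = 0; i < heights.size(); ++i)
        out[i] = (heights[i] > threshold) ? 0.0 : heights[i];
    return out;
}
```

Because every cell is processed independently, this loop is exactly the kind of work that maps onto one GPU thread per cell.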
2 File main.cpp

File main.cpp in directory cuda-workspace/cudaexample/src_gui_grass is practically inherited from file main.c (sequential module r.example). It is the entry point of the program because it contains the main() function. File main.cpp contains GRASS GIS routines for module initialization, definitions and manipulation of options and flags, and history storage routines. It also contains wrappers for the CUDA environment. But it no longer contains buffer allocation routines, raster reading routines or data processing routines.

The program starts with the GRASS initialization routine:

    G_gisinit(argv[0]);

which reads the GRASS environment and stores the program name for G_program_name(). Then follows the module initialization routine:

    struct GModule *module;
    module = G_define_module();
    module->keywords = _("example");
    module->description = _("CUDA example raster module");
    module->verbose = 1;

Flags are defined as follows:

    struct Flag *flag_negative;
    flag_negative = G_define_flag();
    flag_negative->key = 'n';
    flag_negative->description = _("Negative result");

Options are defined as:

    struct Option *debugopt;
    debugopt = G_define_option();
    debugopt->key = "debug";
    debugopt->type = TYPE_INTEGER;
    debugopt->required = NO;
    debugopt->key_desc = "value";
    debugopt->description = _("Debug number");
    debugopt->answer = _("0");
    debugopt->guisection = _("Optional");

The breaking point is the options and flags parser:

    if (G_parser(argc, argv))
        exit(EXIT_FAILURE);
The status of the flag is now in flag_negative->answer and the value of the option is in debugopt->answer. To transfer flags and options to the CUDA part of the program, we define a vector:

    vector<string> argument;

and fill it with all those arguments. The classic argument form is made with the code:

    char **argv_cuda;
    argv_cuda = new char *[argument.size()];
    for (unsigned int i = 0; i < argument.size(); i++) {
        argv_cuda[i] = new char[argument[i].size() + 1];
        strcpy(argv_cuda[i], argument[i].c_str());
    }

When the module is initialized and all arguments are set, the wrapping routine:

    cudacalculation(argument.size(), argv_cuda);

is called. This routine is the entry point to the CUDA calculation part of the program and its definition is in file cudamain.cpp. The running code for this routine is in the static library libcudauser.a (see Algorithm 1). The other wrapper is the routine:

    void percentgrass(long percent, long all)
    {
        G_percent(percent, all, 2);
    }

which is called from the CUDA part of the program to indicate how much of the calculation has completed.
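The conversion loop above is self-contained enough to test on its own; a sketch with the matching cleanup added (make_argv and free_argv are illustrative helper names, not part of the module):

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Build a classic char** argument array from a vector<string>, as main.cpp
// does before calling the CUDA entry point.
char** make_argv(const std::vector<std::string>& argument) {
    char** argv_cuda = new char*[argument.size()];
    for (unsigned int i = 0; i < argument.size(); i++) {
        argv_cuda[i] = new char[argument[i].size() + 1];
        std::strcpy(argv_cuda[i], argument[i].c_str());
    }
    return argv_cuda;
}

// Release the array once the CUDA part of the program has returned.
void free_argv(char** argv_cuda, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        delete[] argv_cuda[i];
    delete[] argv_cuda;
}
```

Freeing the array after the CUDA call avoids leaking one heap block per argument on every module run.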
3 File src.description.html

File src.description.html in directory cuda-workspace/cudaexample/src_gui_grass is the help and description HTML file for the module. It should follow the same format as description.html in directory doc/raster/r.example in the GRASS GIS source code. The only difference is that you should omit the "Last changed" line, because it is added automatically at compilation time.

4 Common source cudagrass

The common source cudagrass, which is compiled together with any parallel GRASS GIS CUDA module, can be found in directory cuda-workspace/cudagrass. It contains a number of C++ classes which take care of reading/writing maps, memory management, scheduler management, argument management, etc. The only class which is unique for each module is CUserModule. Although a CUserModule class exists in directory cuda-workspace/cudagrass, it is not used when a user module is compiled. Each module has its own CUserModule class in the CUserModule.h and CUserModule.cu files in the module's directory. A schematic of the common source classes and the CUserModule class is shown in Figure 1. The developer of a new module normally needs to modify only the CUserModule class. All other classes are better left unmodified. But if there is no alternative, care must be taken, because a modification of the common sources has an impact on all user parallel modules!

Figure 1: C++ CUserModule and common classes in a general CUDA GRASS GIS module.

Classes CDevice and CDeviceBuff are intended for the GPU part; they read the GPU properties and allocate memory on the GPU. Classes CHost and CHostBuff are intended for the CPU part; they read the CPU properties and allocate memory on the computer. Class CBuff connects the two wings and contains functions for copying the contents of memory from the CPU to the GPU and
vice versa. Class CMap is intended for the storage and manipulation of the properties of the digital maps. Segmentation of digital maps is done in class CSegmentScheduler. Class CRunScheduler is responsible for the proper run-time ordering of the CUDA kernels. Reading the input arguments is implemented in class CArgs. Class CCompress takes care of data compression. All source code that determines the properties of the user's dedicated module is written in class CUserModule.

5 File cudamain.cpp

File cudamain.cpp in directory cuda-workspace/cudaexample/src is the starting point of the CUDA part of the program. It contains only one function:

    int cudacalculation(int argc, char** argv)
    {
        ...
    }

This function is a wrapper and is run from int main(int argc, char** argv) (file main.cpp, directory cuda-workspace/cudaexample/src_gui_grass). Note that the arguments char** argv in function cudacalculation(..) are not necessarily the same as the arguments char** argv in function main(..). It is the programmer's responsibility to bring the appropriate arguments to the CUDA part of the program. First the user object must be created (see Algorithm 1):

    CUserModule *user;
    user = new CUserModule;

and then follows the argument parser:

    user->parseUserInputArguments(argc, argv);

Input and output maps must be set with:

    user->setMap(INPUT_MAP,
                 user->string_args.dir,
                 user->string_args.projection,
                 user->string_args.mapset,
                 user->string_args.input_map_name);
    user->setMap(OUTPUT_MAP,
                 user->string_args.dir,
                 user->string_args.projection,
                 user->string_args.cur_mapset,
                 user->string_args.output_map_name,
                 user->kernel_args.consider_region,
                 user->kernel_args.out_format,
                 compress,
                 chunks);

Allocation of buffers is done in:
    user->allocateSegmentBuffers(force_segments_n);

Processes must be added to the scheduler with the following functions:

    user->addUserProcess("kernelUser", false);
    user->addUserProcess();

and finally the calculation:

    user->runCalculation();

Output properties to cellhd, range, null, colr and cats are written with:

    user->writeOutMapProp(PROP_CELLHD, ACTION_OVERWRITE);
    user->writeOutMapProp(PROP_RANGE, ACTION_OVERWRITE, "0 1");
    user->writeOutMapProp(PROP_NULL, ACTION_ERASE);
    user->writeOutMapProp(PROP_COLR, ACTION_WRITE, viewshed_colr);
    user->writeOutMapProp(PROP_CATS, ACTION_WRITE, default_cats);

At the end, the temporary files and the user object must be deleted:

    user->deleteTempFiles();
    delete user;

Finally, the function cudacalculation() terminates with:

    return 0;

6 Class CUserKernelArg (file CUserModule.h)

Class CUserKernelArg is defined in file CUserModule.h. It is used to define the user arguments which he/she wants to bring into the kernel. Only simple data types are allowed (bool, char, short, int, long int, float, double). From CPU code the members of the class are accessible with the prefix kernel_args, for example:

    kernel_args.water_level = 300.0;

and from GPU code the members of the class are accessible with the prefix args, for example:

    double w = args.water_level;
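A minimal sketch of such an argument class, restricted to simple data types as the rules above require (the water_level member and its default of 300.0 are illustrative, not necessarily what the real module defines):

```cpp
// Illustrative CUserKernelArg: only simple data types are allowed, so the
// same bytes can be copied verbatim to the device before a kernel launch.
// Defaults would normally be preset in the CUserModule constructor.
struct CUserKernelArg {
    double water_level;
    int    negative;    // hypothetical mirror of the -n flag
    CUserKernelArg() : water_level(300.0), negative(0) {}
};
```

Keeping the class free of pointers and non-trivial types is what makes the automatic host-to-device copy of the arguments safe.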
Default values of the members of class CUserKernelArg are usually preset in the constructor of the CUserModule class, that is in CUserModule::CUserModule(). Values are parsed from char** argv in the function:

    void CUserModule::parseUserInputArguments(int argc, char** argv);

where the user can put his own code. There is another function:

    void CUserModule::userSetArguments();

This function runs before the first user kernel launch and before the arguments are copied to the device. It is also a suitable place to set user arguments. The arguments are copied automatically before any user kernel is launched. However, the user can also copy the arguments to the device explicitly with the function:

    void CUserModule::copyArgumentsToDevice(cudaStream_t stream);

7 Class CUserModule (file CUserModule.h)

Class CUserModule is defined in file CUserModule.h. It contains several useful functions. The previous chapter described three argument functions: parseUserInputArguments(..), userSetArguments() and copyArgumentsToDevice(..). One of the most important functions is userAllocateMemory(), where the user can put calls that allocate additional memory space on the host or on the device. These two functions are:

    int CHostBuff::allocatePinnedBuffers(int n, size_t s, string name);
    int CDeviceBuff::allocateDevGlobalBuffers(int n, size_t s, string name);

and they are described in the following two sub-chapters.

7.1 Allocate memory on host

The function for allocating memory on the host is:

    int allocatePinnedBuffers(int n, size_t s, string name);

where int n is the number of (same-size) buffers to allocate, size_t s is the size of each buffer in bytes and string name is the name of the buffers. The function returns the index of the first allocated buffer. This index is the only handle to the allocated buffers, so it must be saved, for example, to an int my_host_mem_id variable, which can be defined in the private part of class CUserModule. The name of the allocated buffers could be just the string of the index variable, in our case "my_host_mem_id". This is the best way for debugging purposes.
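The index bookkeeping behind allocatePinnedBuffers can be pictured with a small host-side sketch. The Buff struct, the global vector and alloc_buffers are illustrative stand-ins, not the real API; the real allocator uses pinned CUDA memory rather than malloc.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Allocating n same-size buffers returns the index of the first one;
// buffer i is then reachable as first_id + i, as described in the manual.
struct Buff {
    void*       pbuff;       // pointer to the buffer
    std::size_t pbuff_size;  // size in bytes
    std::string pbuff_name;  // name, useful for debug printouts
};

std::vector<Buff> host_buff;

int alloc_buffers(int n, std::size_t s, const std::string& name) {
    int first_id = static_cast<int>(host_buff.size());
    for (int i = 0; i < n; i++)
        host_buff.push_back(Buff{ std::malloc(s), s, name });
    return first_id;  // the only handle to these buffers - save it!
}
```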
The pointer to the first allocated buffer can then be reached with:
    void* p = host_buff[my_host_mem_id+0].pbuff;

the second with:

    void* p = host_buff[my_host_mem_id+1].pbuff;

the third with:

    void* p = host_buff[my_host_mem_id+2].pbuff;

and so on. Of course, care must be taken of how many buffers were allocated. The size of one buffer is reached with:

    size_t size = host_buff[my_host_mem_id+0].pbuff_size;

and the name with:

    string name = host_buff[my_host_mem_id+0].pbuff_name;

7.2 Allocate memory on device

The function for allocating memory on the device is:

    int allocateDevGlobalBuffers(int n, size_t s, string name);

where int n is the number of (same-size) buffers to allocate, size_t s is the size of each buffer in bytes and string name is the name of the buffers. The function returns the index of the first allocated buffer. This index is the only handle to the allocated buffers, so it must be saved, for example, to an int my_device_mem_id variable, which can be defined in the private part of class CUserModule. The name of the allocated buffers could be just the string of the index variable, in our case "my_device_mem_id". This is the best way for debugging purposes. The pointer to the first allocated buffer can then be reached with:

    void* p = dev_buff[my_device_mem_id+0].gbuff;

the second with:

    void* p = dev_buff[my_device_mem_id+1].gbuff;

the third with:

    void* p = dev_buff[my_device_mem_id+2].gbuff;
and so on. Of course, care must be taken of how many buffers were allocated. The size of one buffer is reached with:

    size_t size = dev_buff[my_device_mem_id+0].gbuff_size;

and the name with:

    string name = dev_buff[my_device_mem_id+0].gbuff_name;

To reach a global buffer from a CUDA kernel, the pointer to the buffer must be transferred to the kernel through a kernel argument itself or through the CUserKernelArg class (see section 6).

7.3 Run CUDA kernel once

In the function:

    void CUserModule::userPriorRun(cudaStream_t cuda_stream);

the user can put a call to a kernel function which will run only once, prior to the other segments' kernels. The minimum code to run a kernel is:

    __global__ void kernelPriorRun_()
    {
        ...
    }

    void CUserModule::userPriorRun(cudaStream_t cuda_stream)
    {
        dim3 threads(threads_n, 1, 1);
        dim3 blocks(1, 1, 1);
        blocks.x = (my_size - 1 + threads_n) / threads_n;
        int shared_memory_size = 0;
        kernelPriorRun_<<< blocks, threads, shared_memory_size, cuda_stream >>>();
    }

The CUDA function __global__ void kernelPriorRun_() is not part of the class CUserModule, because CUDA kernel routines do not support the C++ style of programming. The solution is to put the kernel routines in the global scope and just call them from a class member function.
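The blocks.x expression above is a ceiling division: it picks the smallest number of blocks of threads_n threads each that covers my_size elements. Factored out for clarity:

```cpp
// Smallest block count such that blocks * threads_n >= my_size.
// This is the standard CUDA launch-geometry idiom used in the manual:
// (my_size - 1 + threads_n) / threads_n == ceil(my_size / threads_n).
int blocks_for(int my_size, int threads_n) {
    return (my_size - 1 + threads_n) / threads_n;
}
```

The kernel then typically guards against the over-provisioned tail with a check like `if (idx < my_size)`.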
7.4 Run CUDA kernel for each segment

In the function:

    void CUserModule::kernelUser(int idx_seg, int idx_flow,
                                 char* src_buff, char* dst_buff,
                                 cudaStream_t cuda_stream);

the user can put a call to a kernel function which will run once for each memory segment. The minimum code to run a kernel is:

    __global__ void kernelUser_()
    {
        ...
    }

    void CUserModule::kernelUser(int seg_id, int flow_id,
                                 char* src_buff, char* dst_buff,
                                 cudaStream_t cuda_stream)
    {
        dim3 threads(threads_n, 1, 1);
        dim3 blocks(1, 1, 1);
        blocks.x = (my_size - 1 + threads_n) / threads_n;
        int shared_memory_size = 0;
        kernelUser_<<< blocks, threads, shared_memory_size, cuda_stream >>>();
    }

The CUDA function __global__ void kernelUser_() is not part of the class CUserModule, because CUDA kernel routines do not support the C++ style of programming. The solution is to put the kernel routines in the global scope and just call them from a class member function.
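How the scheduler drives kernelUser can be pictured with a host-side mock: one call per segment, with the source and destination buffers supplied by the framework. The ping-pong alternation shown here is illustrative only; the real ordering is governed by CRunScheduler (see the debug=2048 printout later in this manual).

```cpp
#include <string>
#include <vector>

// Mock of the per-segment dispatch: record one "launch" per segment with
// the A/B buffer roles swapped each time, as a ping-pong scheme would do.
std::vector<std::string> run_segments(int segments_n) {
    std::vector<std::string> log;
    for (int idx_seg = 0; idx_seg < segments_n; idx_seg++) {
        const char* src = (idx_seg % 2 == 0) ? "A" : "B";
        const char* dst = (idx_seg % 2 == 0) ? "B" : "A";
        log.push_back("kernelUser seg=" + std::to_string(idx_seg)
                      + " " + src + "->" + dst);
    }
    return log;
}
```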
8 Variable suffix

Variables have a suffix which follows a simple convention:

    _buff   - pointer to a buffer (char*, void*, ...); always points to the beginning of the buffer
    _p      - pointer to a buffer (char*, void*, ...); similar to _buff, but may change during the calculation and point to a particular member of the buffer
    _id     - index (int, int64_t, ...)
    _idc    - shorter index (char)
    _count  - counter (int, int64_t, ...)
    _stream - CUDA stream (cudaStream_t)
    _size   - size of the buffer in bytes (size_t)
    _size2  - size of the buffer in shorts
    _size4  - size of the buffer in ints or floats
    _size8  - size of the buffer in int64_ts or doubles
    _seek   - index seek (size_t)
    _offset - index seek (int)
    _length - size of one data member in bytes (for example, 4 for int)
    _n      - number, quantity (except for the size of a buffer, where _size is used)
    _items  - number of elements (for example, the number of elements in a vector)
    _name   - name (string)
    _prop   - pointer to properties
    _fp     - file pointer (FILE*)
    _fname  - file name (string)
    _v      - vector
    _args   - pointer to function arguments
    _tex    - texture memory

9 Compile all together

The object user is constructed from class CUserModule, which inherits all properties and data from the other classes (see Figure 1). The CUDA nvcc and GNU gcc compilers build a static library libcudauser.a with the CUDA make script (see Algorithm 1). The user module is then built with the GRASS GIS make script, with the gcc compiler/linker, from the files main.cpp, libcudauser.a and description.html.

Algorithm 1: GRASS GIS CUDA module compile flow chart. (The chart shows the object user in cudamain.cpp — CUserModule* user; user = new CUserModule; — compiled by the CUDA make script (nvcc + gcc) into the static library libcudauser.a, which the GRASS GIS make script (gcc) then links together with main.cpp, the CUDA RT library libcudart, the GRASS GIS libraries and description.html into the user module r.cuda.user.)
10 Debug printouts

Several printouts are prepared for easier debugging and development of a new module. A certain debug printout is enabled with the debug parameter, for example:

    r.cuda.viewshed --overwrite output=viewshed \
        coordinate=592094,... obs_elev=20 max_dist=10000 debug=1

Each debug printout has its own number. To print several different debug printouts, just sum the numbers together. The following sub-chapters describe all the printouts.

10.1 Print last CUDA error (debug=1)

It prints the last CUDA error. If there is no error, it should print:

    cuda last error = no error

If an error occurs, the module prints out the last CUDA error regardless of whether the debug parameter is set or not.

10.2 Print arguments to CUDA part of the module (debug=2)

It prints out the arguments which are passed to the CUDA part of the program. These arguments are not necessarily the same as the input arguments. The printout could look like:

    dir=/home/andrej/grass_data
    projection=slovenija
    mapset=permanent
    cur_mapset=fresnel_testing
    input_map_name=mobitel_slo_dem12i
    output_map_name=viewshed
    range=800
    raw_range=10000
    obs_elev=20
    tgt_elev=0.0
    azim_angle=0
    azim_sector=360
    elev_angle=0
    elev_sector=180
    earth_radius=...
    coordinate=967,832
    raw_coordinate=592094,...
    verbose=1
    debug=2
    segments=...

10.3 Print total elapsed time for CUDA calculation part (debug=4)

It prints out the total elapsed time:

    Finished, total elapsed time is ... ms

10.4 Print compile properties (debug=16)

It prints out some compilation data:

    16 COMPILE PROPERTIES:
    DATE = May ...
    TIME = 11:33:24
    VERSION = ...
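Because every printout code is a power of two, summing codes is the same as OR-ing bits, so a combined debug value can be decomposed again. A small sketch (enabled_printouts is an illustrative helper, not part of the module):

```cpp
#include <vector>

// Decompose a combined debug value into the individual printout codes
// (1, 2, 4, 16, 32, ..., 4096) that it enables.
std::vector<int> enabled_printouts(int debug) {
    std::vector<int> codes;
    for (int bit = 1; bit <= 4096; bit <<= 1)
        if (debug & bit)
            codes.push_back(bit);
    return codes;
}
```

For example, debug=37 enables the printouts for codes 1, 4 and 32 at once.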
10.5 Print device(s) properties (debug=32)

Prints out the device properties, like the deviceQuery application from the CUDA samples does.

10.6 Print host properties (debug=64)

Prints out memory data from your computer. The first few lines are the important ones and should look like:

    MemTotal=...
    MaxUserMemory=...
    MaxPinnedMemory=...
    MemFree=...
    Buffers=...

10.7 Print device buffers properties (debug=128)

Prints out the device buffers properties. These buffers are located in global memory on the device. The printout looks like:

    **************************************************************************
    128 DEVICE BUFFERS PROPERTIES:
    buff  size  pointer  dev_buff[...].gbuff
    ...   ...   0x...    in_map_prop.head_d_id+0
    ...   ...   0x...    ns_v_id+0
    ...   ...   0x...    ns_v_id+1
    ...   ...   0x...    ev_v_id+0
    ...   ...   0x...    ev_v_id+1
    ...   ...   0x...    buff_a_id+0
    ...   ...   0x...    buff_b_id+0
    SUM= ... B, ... kB, ... MB, ... GB

The table lists all buffers allocated during the calculation. The first column is the buffer index, the second column is the size, the third column is the buffer pointer (on the CUDA device), and the fourth column is the program variable (which stores the buffer index) and the offset. For example, to get the pointer of the first ns_v_id buffer one can use:

    void* p = dev_buff[ns_v_id+0].gbuff;

and to get the pointer of the second ev_v_id buffer, one can use:

    void* p = dev_buff[ev_v_id+1].gbuff;
10.8 Print host buffers properties (debug=256)

Prints out the host buffers properties. These buffers are located in pinned memory on the host. The printout looks like:

    **************************************************************************
    256 HOST BUFFERS PROPERTIES:
    buff  size  pointer  host_buff[...].pbuff
    ...   ...   0x...    in_map_prop.head_h_id+0
    ...   ...   0x...    out_map_prop.head_h_id+0
    ...   ...   0x...    buff_h_id+0
    SUM= ... B, ... kB, ... MB, ... GB

The table lists all buffers allocated during the calculation. The first column is the buffer index, the second column is the size, the third column is the buffer pointer, and the fourth column is the program variable (which stores the buffer index) and the offset. For example, to get the pointer of the buff_h_id buffer one can use:

    void* p = host_buff[buff_h_id+0].pbuff;

10.9 Print input and output map properties (debug=512)

Prints out the input and output map properties, such as bounds, resolution, format, number of rows, number of columns, pitch size, etc.

10.10 Print scheduler properties (debug=1024)

Prints out the segment properties, for example:

    // vector<CSegment> segment:
    seg  seq  row_in_start_id  row_in_stop_id  in_seek  in_size  row_out_start_id  row_out_stop_id  out_seek  out_size
              (gross,+net)     (gross,-net)                                                                   (net)
    ...  ...  (0,+0)           (588,-1)        ...
    ...  ...  (587,+1)         (1174,-1)       ...
    ...  ...  (1173,+1)        (1759,-0)       ...

10.11 Print run properties (debug=2048)

Prints out the running sequence, for example:

    (readhd,1,0)            D->H
    (copyh2d,2,0)           H->A
    (decompress,3,0)        A->B
    (regression,4,0)        B->A
    (kernelvisibility,0,0)  A->B
    (alignment,5,0)         B->B
    (compress,6,0)          B->B
    (copyd2h,7,0)           B->H
    (writehd,8,0)           H->D
    (exit,9,0)              ->
The letters have the following meanings:

    D - data on the hard disk
    H - data in a host buffer
    A - data in buffer A on the GPU
    B - data in buffer B on the GPU

The arrows (->) indicate how the data flows.

10.12 Print clutter map properties (debug=4096)

Prints out the clutter map properties, such as bounds, resolution, format, number of rows, number of columns, pitch size, etc.

11 Contact

Author: Andrej Osterman

Any feedback is welcome. Please email me at s51mo@hamradio.si or andrej.osterman@guest.arnes.si. Please note that the modules are in an experimental phase and bugs are still alive.
More information2/9/18. CYSE 411/AIT681 Secure Software Engineering. Readings. Secure Coding. This lecture: String management Pointer Subterfuge
CYSE 411/AIT681 Secure Software Engineering Topic #12. Secure Coding: Formatted Output Instructor: Dr. Kun Sun 1 This lecture: [Seacord]: Chapter 6 Readings 2 String management Pointer Subterfuge Secure
More informationArchitecture: Caching Issues in Performance
Architecture: Caching Issues in Performance Mike Bailey mjb@cs.oregonstate.edu Problem: The Path Between a CPU Chip and Off-chip Memory is Slow CPU Chip Main Memory This path is relatively slow, forcing
More informationRecitation: Cache Lab & C
15-213 Recitation: Cache Lab & C Jack Biggs 16 Feb 2015 Agenda Buffer Lab! C Exercises! C Conventions! C Debugging! Version Control! Compilation! Buffer Lab... Is due soon. So maybe do it soon Agenda Buffer
More informationArchitecture: Caching Issues in Performance
Architecture: Caching Issues in Performance Mike Bailey mjb@cs.oregonstate.edu Problem: The Path Between a CPU Chip and Off-chip Memory is Slow CPU Chip Main Memory This path is relatively slow, forcing
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationAutomated Finite Element Computations in the FEniCS Framework using GPUs
Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationreclaim disk space by shrinking files
Sandeep Sahore reclaim disk space by shrinking files Sandeep Sahore holds a Master s degree in computer science from the University of Toledo and has nearly 15 years of experience in the computing industry.
More informationRDBE Host Software. Doc No: X3C 2009_07_21_1 TODO: Add appropriate document number. XCube Communication 1(13)
RDBE Host Software Doc No: X3C 2009_07_21_1 TODO: Add appropriate document number XCube Communication 1(13) Document history Change date Changed by Version Notes 09-07-21 09:12 Mikael Taveniku PA1 New
More informationCSE 374 Programming Concepts & Tools
CSE 374 Programming Concepts & Tools Hal Perkins Fall 2017 Lecture 8 C: Miscellanea Control, Declarations, Preprocessor, printf/scanf 1 The story so far The low-level execution model of a process (one
More informationFor personnal use only
Inverting Large Images Using CUDA Finnbarr P. Murphy (fpm@fpmurphy.com) This is a simple example of how to invert a very large image, stored as a vector using nvidia s CUDA programming environment and
More informationCSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory
Announcements CSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory There will be no lecture on Tuesday, Feb. 16. Prof. Thompson s office hours are canceled for Monday, Feb. 15. Prof.
More informationThe Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics
The Compositional C++ Language Denition Peter Carlin Mani Chandy Carl Kesselman March 12, 1993 Revision 0.95 3/12/93, Comments welcome. Abstract This document gives a concise denition of the syntax and
More informationCOMP 250: Java Programming I. Carlos G. Oliver, Jérôme Waldispühl January 17-18, 2018 Slides adapted from M. Blanchette
COMP 250: Java Programming I Carlos G. Oliver, Jérôme Waldispühl January 17-18, 2018 Slides adapted from M. Blanchette Variables and types [Downey Ch 2] Variable: temporary storage location in memory.
More informationOther array problems. Integer overflow. Outline. Integer overflow example. Signed and unsigned
Other array problems CSci 5271 Introduction to Computer Security Day 4: Low-level attacks Stephen McCamant University of Minnesota, Computer Science & Engineering Missing/wrong bounds check One unsigned
More informationDirect Memory Access. Lecture 2 Pointer Revision Command Line Arguments. What happens when we use pointers. Same again with pictures
Lecture 2 Pointer Revision Command Line Arguments Direct Memory Access C/C++ allows the programmer to obtain the value of the memory address where a variable lives. To do this we need to use a special
More informationCh. 11: References & the Copy-Constructor. - continued -
Ch. 11: References & the Copy-Constructor - continued - const references When a reference is made const, it means that the object it refers cannot be changed through that reference - it may be changed
More informationIntroduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research
Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationComputer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring Topic Notes: C and Unix Overview
Computer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring 2009 Topic Notes: C and Unix Overview This course is about computer organization, but since most of our programming is
More informationMore on C programming
Applied mechatronics More on C programming Sven Gestegård Robertz sven.robertz@cs.lth.se Department of Computer Science, Lund University 2017 Outline 1 Pointers and structs 2 On number representation Hexadecimal
More informationBlocks, Grids, and Shared Memory
Blocks, Grids, and Shared Memory GPU Course, Fall 2012 Last week: ax+b Homework Threads, Blocks, Grids CUDA threads are organized into blocks Threads operate in SIMD(ish) manner -- each executing same
More informationComputer Science 322 Operating Systems Mount Holyoke College Spring Topic Notes: C and Unix Overview
Computer Science 322 Operating Systems Mount Holyoke College Spring 2010 Topic Notes: C and Unix Overview This course is about operating systems, but since most of our upcoming programming is in C on a
More informationVirtual Memory 1. Virtual Memory
Virtual Memory 1 Virtual Memory key concepts virtual memory, physical memory, address translation, MMU, TLB, relocation, paging, segmentation, executable file, swapping, page fault, locality, page replacement
More informationP2: Collaborations. CSE 335, Spring 2009
P2: Collaborations CSE 335, Spring 2009 Milestone #1 due by Thursday, March 19 at 11:59 p.m. Completed project due by Thursday, April 2 at 11:59 p.m. Objectives Develop an application with a graphical
More informationIntroduction to GPU Computing. Design and Analysis of Parallel Algorithms
Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part
More informationChapter 8: Main Memory
Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:
More informationOpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR
OpenCL Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Architecture Parallel computing for heterogenous devices CPUs, GPUs, other processors (Cell, DSPs, etc) Portable accelerated code Defined
More informationCS 179: GPU Computing. Lecture 2: The Basics
CS 179: GPU Computing Lecture 2: The Basics Recap Can use GPU to solve highly parallelizable problems Performance benefits vs. CPU Straightforward extension to C language Disclaimer Goal for Week 1: Fast-paced
More informationWhat the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope. C Flow Control.
C Flow Control David Chisnall February 1, 2011 Outline What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope Disclaimer! These slides contain a lot of
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationLecture 07 Debugging Programs with GDB
Lecture 07 Debugging Programs with GDB In this lecture What is debugging Most Common Type of errors Process of debugging Examples Further readings Exercises What is Debugging Debugging is the process of
More informationVisual Profiler. User Guide
Visual Profiler User Guide Version 3.0 Document No. 06-RM-1136 Revision: 4.B February 2008 Visual Profiler User Guide Table of contents Table of contents 1 Introduction................................................
More informationBIL 104E Introduction to Scientific and Engineering Computing. Lecture 14
BIL 104E Introduction to Scientific and Engineering Computing Lecture 14 Because each C program starts at its main() function, information is usually passed to the main() function via command-line arguments.
More informationBase Component. Chapter 1. *Memory Management. Memory management Errors Exception Handling Messages Debug code Options Basic data types Multithreading
Chapter 1. Base Component Component:, *Mathematics, *Error Handling, *Debugging The Base Component (BASE), in the base directory, contains the code for low-level common functionality that is used by all
More informationCS333 Intro to Operating Systems. Jonathan Walpole
CS333 Intro to Operating Systems Jonathan Walpole Threads & Concurrency 2 Threads Processes have the following components: - an address space - a collection of operating system state - a CPU context or
More informationprimitive arrays v. vectors (1)
Arrays 1 primitive arrays v. vectors (1) 2 int a[10]; allocate new, 10 elements vector v(10); // or: vector v; v.resize(10); primitive arrays v. vectors (1) 2 int a[10]; allocate new, 10 elements
More informationConcurrency, Thread. Dongkun Shin, SKKU
Concurrency, Thread 1 Thread Classic view a single point of execution within a program a single PC where instructions are being fetched from and executed), Multi-threaded program Has more than one point
More informationPrograms. Function main. C Refresher. CSCI 4061 Introduction to Operating Systems
Programs CSCI 4061 Introduction to Operating Systems C Program Structure Libraries and header files Compiling and building programs Executing and debugging Instructor: Abhishek Chandra Assume familiarity
More informationAn Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel
An Introduction to OpenACC Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel Chapter 1 Introduction OpenACC is a software accelerator that uses the host and the device. It uses compiler
More informationGDB Tutorial. A Walkthrough with Examples. CMSC Spring Last modified March 22, GDB Tutorial
A Walkthrough with Examples CMSC 212 - Spring 2009 Last modified March 22, 2009 What is gdb? GNU Debugger A debugger for several languages, including C and C++ It allows you to inspect what the program
More informationArray Initialization
Array Initialization Array declarations can specify initializations for the elements of the array: int primes[10] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 ; initializes primes[0] to 2, primes[1] to 3, primes[2]
More informationEECS 213 Introduction to Computer Systems Dinda, Spring Homework 3. Memory and Cache
Homework 3 Memory and Cache 1. Reorder the fields in this structure so that the structure will (a) consume the most space and (b) consume the least space on an IA32 machine on Linux. struct foo { double
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationIntroduction to Supercomputing
Introduction to Supercomputing TMA4280 Introduction to UNIX environment and tools 0.1 Getting started with the environment and the bash shell interpreter Desktop computers are usually operated from a graphical
More informationThis is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.
David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008 This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. Please send any comment to dkirk@nvidia.com
More informationCCReflect has a few interesting features that are quite desirable for DigiPen game projects:
CCReflect v1.0 User Manual Contents Introduction... 2 Features... 2 Dependencies... 2 Compiler Dependencies... 2 Glossary... 2 Type Registration... 3 POD Registration... 3 Non-Pod Registration... 3 External
More informationProgramming Assignment #1: A Simple Shell
Programming Assignment #1: A Simple Shell Due: Check My Courses In this assignment you are required to create a C program that implements a shell interface that accepts user commands and executes each
More informationCS140 Operating Systems Final December 12, 2007 OPEN BOOK, OPEN NOTES
CS140 Operating Systems Final December 12, 2007 OPEN BOOK, OPEN NOTES Your name: SUNet ID: In accordance with both the letter and the spirit of the Stanford Honor Code, I did not cheat on this exam. Furthermore,
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationCSE 333 Midterm Exam Sample Solution 7/28/14
Question 1. (20 points) C programming. For this question implement a C function contains that returns 1 (true) if a given C string appears as a substring of another C string starting at a given position.
More informationOutline. Computer programming. Debugging. What is it. Debugging. Hints. Debugging
Outline Computer programming Debugging Hints Gathering evidence Common C errors "Education is a progressive discovery of our own ignorance." Will Durant T.U. Cluj-Napoca - Computer Programming - lecture
More informationCS510 Operating System Foundations. Jonathan Walpole
CS510 Operating System Foundations Jonathan Walpole Threads & Concurrency 2 Why Use Threads? Utilize multiple CPU s concurrently Low cost communication via shared memory Overlap computation and blocking
More information