r.cuda.example
Developing custom GRASS GIS CUDA parallel modules
Andrej Osterman
s51mo.net/fresnel
Compiled on 1 June 2015 at 19:11
Contents

1 Introduction
2 File main.cpp
3 File src.description.html
4 Common source cudagrass
5 File cudamain.cpp
6 Class CUserKernelArg (file CUserModule.h)
7 Class CUserModule (file CUserModule.h)
7.1 Allocate memory on host
7.2 Allocate memory on device
7.3 Run CUDA kernel once
7.4 Run CUDA kernel for each segment
8 Variable suffix
9 Compile all together
10 Debug printouts
10.1 Print last CUDA error (debug=1)
10.2 Print arguments to CUDA part of the module (debug=2)
10.3 Print total elapsed time for CUDA calculation part (debug=4)
10.4 Print compile properties (debug=16)
10.5 Print device(s) properties (debug=32)
10.6 Print host properties (debug=64)
10.7 Print device buffers properties (debug=128)
10.8 Print host buffers properties (debug=256)
10.9 Print input and output map properties (debug=512)
10.10 Print scheduler properties (debug=1024)
10.11 Print run properties (debug=2048)
10.12 Print clutter map properties (debug=4096)
11 Contact
1 Introduction

This is a programmer's manual for developing custom parallel raster GRASS GIS modules running on a CUDA GPU. This manual also contains source code provided by NVIDIA Corporation. If you have not installed the modules yet, please download the sources and the installation manual first.

A good starting point for developing custom sequential raster programs is the r.example module. To develop a sequential module, only two source files must be modified: main.c and description.html (directory doc/raster/r.example in the GRASS GIS source code). Into main.c one can write the new ANSI-C application code for the new module; into description.html one can write the help and a description of the module.

Developing a parallel module in GRASS GIS running on the GPU is a slightly more complicated job than developing a sequential module. Two different software packages, GRASS GIS and CUDA, must be merged together into one application (module). Due to the GPU's limited memory size, the memory management is more complicated. Program arguments must be copied to the GPU. Scheduling must be implemented to run individual pieces of the program.

In this manual we will briefly describe the r.cuda.example module, which does a very simple job on raster maps: it reads each grid cell and colors it black if its height is above (or below) some preset threshold. The source code of the r.cuda.example module can be found in the cuda-workspace/cudaexample directory.

To develop a parallel module, five files must be modified. Files main.cpp and src.description.html (directory cuda-workspace/cudaexample/src_gui_grass) are related more to GRASS GIS. Files cudamain.cpp, CUserModule.cu and CUserModule.h (directory cuda-workspace/cudaexample/src) are related more to CUDA. Below we describe each file separately.
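To fix ideas, the cell-by-cell job the module performs can be sketched as a sequential C++ function. This is only an illustration: the real module works on GRASS rasters and runs the loop as a CUDA kernel, and the threshold value and names here are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Sequential sketch of the r.cuda.example job: read each grid cell and
// "color it black" (represented here by the value 0.0) when its height
// exceeds a preset threshold. The real module does this per cell on the GPU.
std::vector<double> threshold_cells(const std::vector<double>& heights,
                                    double threshold) {
    std::vector<double> out(heights.size());
    for (std::size_t i = 0; i < heights.size(); ++i)
        out[i] = (heights[i] > threshold) ? 0.0 : heights[i];
    return out;
}
```

Because every cell is processed independently, this loop is exactly the kind of work that maps onto one GPU thread per cell.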
2 File main.cpp

File main.cpp in directory cuda-workspace/cudaexample/src_gui_grass is practically inherited from file main.c (sequential module r.example). It is the entry point of the program because it contains the main() function. File main.cpp contains GRASS GIS routines for module initialization, definitions and manipulation of options and flags, and history storage routines. It also contains wrappers for the CUDA environment. But it no longer contains buffer allocation routines, raster reading routines or data processing routines.

The program starts with the GRASS initialization routine:

    G_gisinit(argv[0]);

which reads the GRASS environment and stores the program name for G_program_name(). Then follows the module initialization routine:

    struct GModule *module;
    module = G_define_module();
    module->keywords = _("example");
    module->description = _("CUDA example raster module");
    module->verbose = 1;

Flags are defined as follows:

    struct Flag *flag_negative;
    flag_negative = G_define_flag();
    flag_negative->key = 'n';
    flag_negative->description = _("Negative result");

Options are defined as:

    struct Option *debugopt;
    debugopt = G_define_option();
    debugopt->key = "debug";
    debugopt->type = TYPE_INTEGER;
    debugopt->required = NO;
    debugopt->key_desc = "value";
    debugopt->description = _("Debug number");
    debugopt->answer = _("0");
    debugopt->guisection = _("Optional");

The breaking point is the options and flags parser:

    if (G_parser(argc, argv))
        exit(EXIT_FAILURE);
The status of the flag is now in flag_negative->answer and the value of the option is in debugopt->answer. To transfer flags and options to the CUDA part of the program, we define a vector:

    vector<string> argument;

and fill it with all those arguments. The classic argument form is made with the code:

    char **argv_cuda;
    argv_cuda = new char *[argument.size()];
    for (unsigned int i = 0; i < argument.size(); i++) {
        argv_cuda[i] = new char[argument[i].size() + 1];
        strcpy(argv_cuda[i], argument[i].c_str());
    }

When the module is initialized and all arguments are set, the wrapping routine:

    cudacalculation(argument.size(), argv_cuda);

is called. This routine is the entry point to the CUDA calculation part of the program and its definition is in file cudamain.cpp. The running code for this routine is in the static library libcudauser.a (see Algorithm 1). The other wrapper is the routine:

    void percentgrass(long percent, long all)
    {
        G_percent(percent, all, 2);
    }

which is called from the CUDA part of the program to indicate how much of the calculation has completed.
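The conversion loop above is self-contained enough to test on its own; a sketch with the matching cleanup added (make_argv and free_argv are illustrative helper names, not part of the module):

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Build a classic char** argument array from a vector<string>, as main.cpp
// does before calling the CUDA entry point.
char** make_argv(const std::vector<std::string>& argument) {
    char** argv_cuda = new char*[argument.size()];
    for (unsigned int i = 0; i < argument.size(); i++) {
        argv_cuda[i] = new char[argument[i].size() + 1];
        std::strcpy(argv_cuda[i], argument[i].c_str());
    }
    return argv_cuda;
}

// Release the array once the CUDA part of the program has returned.
void free_argv(char** argv_cuda, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        delete[] argv_cuda[i];
    delete[] argv_cuda;
}
```

Freeing the array after the CUDA call avoids leaking one heap block per argument on every module run.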
3 File src.description.html

File src.description.html in directory cuda-workspace/cudaexample/src_gui_grass is the help and description HTML file for the module. It should follow the same format as description.html in directory doc/raster/r.example in the GRASS GIS source code. The only difference is that you should omit the "Last changed" line, because it is added automatically at compilation time.

4 Common source cudagrass

The common source cudagrass, which is compiled together with any parallel GRASS GIS CUDA module, can be found in directory cuda-workspace/cudagrass. It contains a number of C++ classes which take care of reading/writing maps, memory management, scheduler management, argument management, etc. The only class which is unique for each module is CUserModule. Although a CUserModule class exists in directory cuda-workspace/cudagrass, it is not used when a user module is compiled. Each module has its own CUserModule class in the CUserModule.h and CUserModule.cu files in the module's directory. A schematic of the common source classes and the CUserModule class is shown in Figure 1. The developer of a new module normally needs to modify only the CUserModule class. All other classes are better left unmodified. But if there is no alternative, care must be taken, because a modification of the common sources has an impact on all user parallel modules!

Figure 1: C++ CUserModule and common classes in a general CUDA GRASS GIS module.

Classes CDevice and CDeviceBuff are intended for the GPU part; they read the GPU properties and allocate memory on the GPU. Classes CHost and CHostBuff are intended for the CPU part; they read the CPU properties and allocate memory on the computer. Class CBuff connects the two wings and contains functions for copying the contents of memory from the CPU to the GPU and
vice versa. Class CMap is intended for the storage and manipulation of the properties of the digital maps. Segmentation of digital maps is done in class CSegmentScheduler. Class CRunScheduler is responsible for the proper run-time ordering of the CUDA kernels. Reading the input arguments is implemented in class CArgs. Class CCompress takes care of data compression. All source code that determines the properties of the user's dedicated module is written in class CUserModule.

5 File cudamain.cpp

File cudamain.cpp in directory cuda-workspace/cudaexample/src is the starting point of the CUDA part of the program. It contains only one function:

    int cudacalculation(int argc, char** argv)
    {
        ...
    }

This function is a wrapper and is run from int main(int argc, char** argv) (file main.cpp, directory cuda-workspace/cudaexample/src_gui_grass). Note that the arguments char** argv in function cudacalculation(..) are not necessarily the same as the arguments char** argv in function main(..). It is the programmer's responsibility to bring the appropriate arguments to the CUDA part of the program. First the user object must be created (see Algorithm 1):

    CUserModule *user;
    user = new CUserModule;

and then follows the argument parser:

    user->parseUserInputArguments(argc, argv);

Input and output maps must be set with:

    user->setMap(INPUT_MAP,
                 user->string_args.dir,
                 user->string_args.projection,
                 user->string_args.mapset,
                 user->string_args.input_map_name);
    user->setMap(OUTPUT_MAP,
                 user->string_args.dir,
                 user->string_args.projection,
                 user->string_args.cur_mapset,
                 user->string_args.output_map_name,
                 user->kernel_args.consider_region,
                 user->kernel_args.out_format,
                 compress,
                 chunks);

Allocation of buffers is done in:
    user->allocateSegmentBuffers(force_segments_n);

Processes must be added to the scheduler with the following functions:

    user->addUserProcess("kernelUser", false);
    user->addUserProcess();

and finally the calculation:

    user->runCalculation();

Output properties to cellhd, range, null, colr and cats are written with:

    user->writeOutMapProp(PROP_CELLHD, ACTION_OVERWRITE);
    user->writeOutMapProp(PROP_RANGE, ACTION_OVERWRITE, "0 1");
    user->writeOutMapProp(PROP_NULL, ACTION_ERASE);
    user->writeOutMapProp(PROP_COLR, ACTION_WRITE, viewshed_colr);
    user->writeOutMapProp(PROP_CATS, ACTION_WRITE, default_cats);

At the end, the temporary files and the user object must be deleted:

    user->deleteTempFiles();
    delete user;

Finally, the function cudacalculation() terminates with:

    return 0;

6 Class CUserKernelArg (file CUserModule.h)

Class CUserKernelArg is defined in file CUserModule.h. It is used to define the user arguments which he/she wants to bring into the kernel. Only simple data types are allowed (bool, char, short, int, long int, float, double). From CPU code the members of the class are accessible with the prefix kernel_args, for example:

    kernel_args.water_level = 300.0;

and from GPU code the members of the class are accessible with the prefix args, for example:

    double w = args.water_level;
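A minimal sketch of such an argument class, restricted to simple data types as the rules above require (the water_level member and its default of 300.0 are illustrative, not necessarily what the real module defines):

```cpp
// Illustrative CUserKernelArg: only simple data types are allowed, so the
// same bytes can be copied verbatim to the device before a kernel launch.
// Defaults would normally be preset in the CUserModule constructor.
struct CUserKernelArg {
    double water_level;
    int    negative;    // hypothetical mirror of the -n flag
    CUserKernelArg() : water_level(300.0), negative(0) {}
};
```

Keeping the class free of pointers and non-trivial types is what makes the automatic host-to-device copy of the arguments safe.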
Default values of the members of class CUserKernelArg are usually preset in the constructor of the CUserModule class, that is in CUserModule::CUserModule(). Values are parsed from char** argv in the function:

    void CUserModule::parseUserInputArguments(int argc, char** argv);

where the user can put his own code. There is another function:

    void CUserModule::userSetArguments();

This function runs before the first user kernel launch and before the arguments are copied to the device. It is also a suitable place to set user arguments. The arguments are copied automatically before any user kernel is launched. However, the user can also copy the arguments to the device explicitly with the function:

    void CUserModule::copyArgumentsToDevice(cudaStream_t stream);

7 Class CUserModule (file CUserModule.h)

Class CUserModule is defined in file CUserModule.h. It contains several useful functions. The previous chapter described three argument functions: parseUserInputArguments(..), userSetArguments() and copyArgumentsToDevice(..). One of the most important functions is userAllocateMemory(), where the user can put calls that allocate additional memory space on the host or on the device. These two functions are:

    int CHostBuff::allocatePinnedBuffers(int n, size_t s, string name);
    int CDeviceBuff::allocateDevGlobalBuffers(int n, size_t s, string name);

and they are described in the following two sub-chapters.

7.1 Allocate memory on host

The function for allocating memory on the host is:

    int allocatePinnedBuffers(int n, size_t s, string name);

where int n is the number of (same-size) buffers to allocate, size_t s is the size of each buffer in bytes and string name is the name of the buffers. The function returns the index of the first allocated buffer. This index is the only handle to the allocated buffers, so it must be saved, for example, to an int my_host_mem_id variable, which can be defined in the private part of class CUserModule. The name of the allocated buffers could be just the string of the index variable, in our case "my_host_mem_id". This is the best way for debugging purposes.
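The index bookkeeping behind allocatePinnedBuffers can be pictured with a small host-side sketch. The Buff struct, the global vector and alloc_buffers are illustrative stand-ins, not the real API; the real allocator uses pinned CUDA memory rather than malloc.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Allocating n same-size buffers returns the index of the first one;
// buffer i is then reachable as first_id + i, as described in the manual.
struct Buff {
    void*       pbuff;       // pointer to the buffer
    std::size_t pbuff_size;  // size in bytes
    std::string pbuff_name;  // name, useful for debug printouts
};

std::vector<Buff> host_buff;

int alloc_buffers(int n, std::size_t s, const std::string& name) {
    int first_id = static_cast<int>(host_buff.size());
    for (int i = 0; i < n; i++)
        host_buff.push_back(Buff{ std::malloc(s), s, name });
    return first_id;  // the only handle to these buffers - save it!
}
```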
The pointer to the first allocated buffer can then be reached with:
    void* p = host_buff[my_host_mem_id+0].pbuff;

the second with:

    void* p = host_buff[my_host_mem_id+1].pbuff;

the third with:

    void* p = host_buff[my_host_mem_id+2].pbuff;

and so on. Of course, care must be taken of how many buffers were allocated. The size of one buffer is reached with:

    size_t size = host_buff[my_host_mem_id+0].pbuff_size;

and the name with:

    string name = host_buff[my_host_mem_id+0].pbuff_name;

7.2 Allocate memory on device

The function for allocating memory on the device is:

    int allocateDevGlobalBuffers(int n, size_t s, string name);

where int n is the number of (same-size) buffers to allocate, size_t s is the size of each buffer in bytes and string name is the name of the buffers. The function returns the index of the first allocated buffer. This index is the only handle to the allocated buffers, so it must be saved, for example, to an int my_device_mem_id variable, which can be defined in the private part of class CUserModule. The name of the allocated buffers could be just the string of the index variable, in our case "my_device_mem_id". This is the best way for debugging purposes. The pointer to the first allocated buffer can then be reached with:

    void* p = dev_buff[my_device_mem_id+0].gbuff;

the second with:

    void* p = dev_buff[my_device_mem_id+1].gbuff;

the third with:

    void* p = dev_buff[my_device_mem_id+2].gbuff;
and so on. Of course, care must be taken of how many buffers were allocated. The size of one buffer is reached with:

    size_t size = dev_buff[my_device_mem_id+0].gbuff_size;

and the name with:

    string name = dev_buff[my_device_mem_id+0].gbuff_name;

To reach a global buffer from a CUDA kernel, the pointer to the buffer must be transferred to the kernel through a kernel argument itself or through the CUserKernelArg class (see section 6).

7.3 Run CUDA kernel once

In the function:

    void CUserModule::userPriorRun(cudaStream_t cuda_stream);

the user can put a call to a kernel function which will run only once, prior to the other segments' kernels. The minimum code to run a kernel is:

    __global__ void kernelPriorRun_()
    {
        ...
    }

    void CUserModule::userPriorRun(cudaStream_t cuda_stream)
    {
        dim3 threads(threads_n, 1, 1);
        dim3 blocks(1, 1, 1);
        blocks.x = (my_size - 1 + threads_n) / threads_n;
        int shared_memory_size = 0;
        kernelPriorRun_<<< blocks, threads, shared_memory_size, cuda_stream >>>();
    }

The CUDA function __global__ void kernelPriorRun_() is not part of the class CUserModule, because CUDA kernel routines do not support the C++ style of programming. The solution is to put the kernel routines in the global scope and just call them from a class member function.
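The blocks.x expression above is a ceiling division: it picks the smallest number of blocks of threads_n threads each that covers my_size elements. Factored out for clarity:

```cpp
// Smallest block count such that blocks * threads_n >= my_size.
// This is the standard CUDA launch-geometry idiom used in the manual:
// (my_size - 1 + threads_n) / threads_n == ceil(my_size / threads_n).
int blocks_for(int my_size, int threads_n) {
    return (my_size - 1 + threads_n) / threads_n;
}
```

The kernel then typically guards against the over-provisioned tail with a check like `if (idx < my_size)`.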
7.4 Run CUDA kernel for each segment

In the function:

    void CUserModule::kernelUser(int idx_seg, int idx_flow,
                                 char* src_buff, char* dst_buff,
                                 cudaStream_t cuda_stream);

the user can put a call to a kernel function which will run once for each memory segment. The minimum code to run a kernel is:

    __global__ void kernelUser_()
    {
        ...
    }

    void CUserModule::kernelUser(int seg_id, int flow_id,
                                 char* src_buff, char* dst_buff,
                                 cudaStream_t cuda_stream)
    {
        dim3 threads(threads_n, 1, 1);
        dim3 blocks(1, 1, 1);
        blocks.x = (my_size - 1 + threads_n) / threads_n;
        int shared_memory_size = 0;
        kernelUser_<<< blocks, threads, shared_memory_size, cuda_stream >>>();
    }

The CUDA function __global__ void kernelUser_() is not part of the class CUserModule, because CUDA kernel routines do not support the C++ style of programming. The solution is to put the kernel routines in the global scope and just call them from a class member function.
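How the scheduler drives kernelUser can be pictured with a host-side mock: one call per segment, with the source and destination buffers supplied by the framework. The ping-pong alternation shown here is illustrative only; the real ordering is governed by CRunScheduler (see the debug=2048 printout later in this manual).

```cpp
#include <string>
#include <vector>

// Mock of the per-segment dispatch: record one "launch" per segment with
// the A/B buffer roles swapped each time, as a ping-pong scheme would do.
std::vector<std::string> run_segments(int segments_n) {
    std::vector<std::string> log;
    for (int idx_seg = 0; idx_seg < segments_n; idx_seg++) {
        const char* src = (idx_seg % 2 == 0) ? "A" : "B";
        const char* dst = (idx_seg % 2 == 0) ? "B" : "A";
        log.push_back("kernelUser seg=" + std::to_string(idx_seg)
                      + " " + src + "->" + dst);
    }
    return log;
}
```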
8 Variable suffix

Variables have a suffix which follows a simple convention:

    _buff   - pointer to a buffer (char*, void*, ...); always points to the beginning of the buffer
    _p      - pointer to a buffer (char*, void*, ...); similar to _buff, but may change during the calculation and point to a particular member of the buffer
    _id     - index (int, int64_t, ...)
    _idc    - shorter index (char)
    _count  - counter (int, int64_t, ...)
    _stream - CUDA stream (cudaStream_t)
    _size   - size of the buffer in bytes (size_t)
    _size2  - size of the buffer in shorts
    _size4  - size of the buffer in ints or floats
    _size8  - size of the buffer in int64_ts or doubles
    _seek   - index seek (size_t)
    _offset - index seek (int)
    _length - size of one data member in bytes (for example, 4 for int)
    _n      - number, quantity (except for the size of a buffer, where _size is used)
    _items  - number of elements (for example, the number of elements in a vector)
    _name   - name (string)
    _prop   - pointer to properties
    _fp     - file pointer (FILE*)
    _fname  - file name (string)
    _v      - vector
    _args   - pointer to function arguments
    _tex    - texture memory

9 Compile all together

The object user is constructed from class CUserModule, which inherits all properties and data from the other classes (see Figure 1). The CUDA nvcc and GNU gcc compilers build a static library libcudauser.a with the CUDA make script (see Algorithm 1). The user module is then built with the GRASS GIS make script, with the gcc compiler/linker, from the files main.cpp, libcudauser.a and description.html.

Algorithm 1: GRASS GIS CUDA module compile flow chart. (The chart shows the object user in cudamain.cpp — CUserModule* user; user = new CUserModule; — compiled by the CUDA make script (nvcc + gcc) into the static library libcudauser.a, which the GRASS GIS make script (gcc) then links together with main.cpp, the CUDA RT library libcudart, the GRASS GIS libraries and description.html into the user module r.cuda.user.)
10 Debug printouts

Several printouts are prepared for easier debugging and development of a new module. A certain debug printout is enabled with the debug parameter, for example:

    r.cuda.viewshed --overwrite output=viewshed \
        coordinate=592094,... obs_elev=20 max_dist=10000 debug=1

Each debug printout has its own number. To print several different debug printouts, just sum the numbers together. The following sub-chapters describe all the printouts.

10.1 Print last CUDA error (debug=1)

It prints the last CUDA error. If there is no error, it should print:

    cuda last error = no error

If an error occurs, the module prints out the last CUDA error regardless of whether the debug parameter is set or not.

10.2 Print arguments to CUDA part of the module (debug=2)

It prints out the arguments which are passed to the CUDA part of the program. These arguments are not necessarily the same as the input arguments. The printout could look like:

    dir=/home/andrej/grass_data
    projection=slovenija
    mapset=permanent
    cur_mapset=fresnel_testing
    input_map_name=mobitel_slo_dem12i
    output_map_name=viewshed
    range=800
    raw_range=10000
    obs_elev=20
    tgt_elev=0.0
    azim_angle=0
    azim_sector=360
    elev_angle=0
    elev_sector=180
    earth_radius=...
    coordinate=967,832
    raw_coordinate=592094,...
    verbose=1
    debug=2
    segments=...

10.3 Print total elapsed time for CUDA calculation part (debug=4)

It prints out the total elapsed time:

    Finished, total elapsed time is ... ms

10.4 Print compile properties (debug=16)

It prints out some compilation data:

    16 COMPILE PROPERTIES:
    DATE = May ...
    TIME = 11:33:24
    VERSION = ...
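Because every printout code is a power of two, summing codes is the same as OR-ing bits, so a combined debug value can be decomposed again. A small sketch (enabled_printouts is an illustrative helper, not part of the module):

```cpp
#include <vector>

// Decompose a combined debug value into the individual printout codes
// (1, 2, 4, 16, 32, ..., 4096) that it enables.
std::vector<int> enabled_printouts(int debug) {
    std::vector<int> codes;
    for (int bit = 1; bit <= 4096; bit <<= 1)
        if (debug & bit)
            codes.push_back(bit);
    return codes;
}
```

For example, debug=37 enables the printouts for codes 1, 4 and 32 at once.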
10.5 Print device(s) properties (debug=32)

Prints out the device properties, like the deviceQuery application from the CUDA samples does.

10.6 Print host properties (debug=64)

Prints out memory data from your computer. The first few lines are the important ones and should look like:

    MemTotal=...
    MaxUserMemory=...
    MaxPinnedMemory=...
    MemFree=...
    Buffers=...

10.7 Print device buffers properties (debug=128)

Prints out the device buffers properties. These buffers are located in global memory on the device. The printout looks like:

    **************************************************************************
    128 DEVICE BUFFERS PROPERTIES:
    buff  size  pointer  dev_buff[...].gbuff
    ...   ...   0x...    in_map_prop.head_d_id+0
    ...   ...   0x...    ns_v_id+0
    ...   ...   0x...    ns_v_id+1
    ...   ...   0x...    ev_v_id+0
    ...   ...   0x...    ev_v_id+1
    ...   ...   0x...    buff_a_id+0
    ...   ...   0x...    buff_b_id+0
    SUM= ... B, ... kB, ... MB, ... GB

The table lists all buffers allocated during the calculation. The first column is the buffer index, the second column is the size, the third column is the buffer pointer (on the CUDA device), and the fourth column is the program variable (which stores the buffer index) and the offset. For example, to get the pointer of the first ns_v_id buffer one can use:

    void* p = dev_buff[ns_v_id+0].gbuff;

and to get the pointer of the second ev_v_id buffer, one can use:

    void* p = dev_buff[ev_v_id+1].gbuff;
10.8 Print host buffers properties (debug=256)

Prints out the host buffers properties. These buffers are located in pinned memory on the host. The printout looks like:

    **************************************************************************
    256 HOST BUFFERS PROPERTIES:
    buff  size  pointer  host_buff[...].pbuff
    ...   ...   0x...    in_map_prop.head_h_id+0
    ...   ...   0x...    out_map_prop.head_h_id+0
    ...   ...   0x...    buff_h_id+0
    SUM= ... B, ... kB, ... MB, ... GB

The table lists all buffers allocated during the calculation. The first column is the buffer index, the second column is the size, the third column is the buffer pointer, and the fourth column is the program variable (which stores the buffer index) and the offset. For example, to get the pointer of the buff_h_id buffer one can use:

    void* p = host_buff[buff_h_id+0].pbuff;

10.9 Print input and output map properties (debug=512)

Prints out the input and output map properties, such as bounds, resolution, format, number of rows, number of columns, pitch size, etc.

10.10 Print scheduler properties (debug=1024)

Prints out the segment properties, for example:

    // vector<CSegment> segment:
    seg  seq  row_in_start_id  row_in_stop_id  in_seek  in_size  row_out_start_id  row_out_stop_id  out_seek  out_size
              (gross,+net)     (gross,-net)                                                                   (net)
    ...  ...  (0,+0)           (588,-1)        ...
    ...  ...  (587,+1)         (1174,-1)       ...
    ...  ...  (1173,+1)        (1759,-0)       ...

10.11 Print run properties (debug=2048)

Prints out the running sequence, for example:

    (readhd,1,0)            D->H
    (copyh2d,2,0)           H->A
    (decompress,3,0)        A->B
    (regression,4,0)        B->A
    (kernelvisibility,0,0)  A->B
    (alignment,5,0)         B->B
    (compress,6,0)          B->B
    (copyd2h,7,0)           B->H
    (writehd,8,0)           H->D
    (exit,9,0)              ->
The letters have the following meanings:

    D - data on the hard disk
    H - data in a host buffer
    A - data in buffer A on the GPU
    B - data in buffer B on the GPU

The arrows (->) indicate how the data flows.

10.12 Print clutter map properties (debug=4096)

Prints out the clutter map properties, such as bounds, resolution, format, number of rows, number of columns, pitch size, etc.

11 Contact

Author: Andrej Osterman

Any feedback is welcome. Please email me at s51mo@hamradio.si or andrej.osterman@guest.arnes.si. Please note that the modules are in an experimental phase and bugs are still alive.
More information2/9/18. CYSE 411/AIT681 Secure Software Engineering. Readings. Secure Coding. This lecture: String management Pointer Subterfuge
CYSE 411/AIT681 Secure Software Engineering Topic #12. Secure Coding: Formatted Output Instructor: Dr. Kun Sun 1 This lecture: [Seacord]: Chapter 6 Readings 2 String management Pointer Subterfuge Secure
More informationArchitecture: Caching Issues in Performance
Architecture: Caching Issues in Performance Mike Bailey mjb@cs.oregonstate.edu Problem: The Path Between a CPU Chip and Off-chip Memory is Slow CPU Chip Main Memory This path is relatively slow, forcing
More informationRecitation: Cache Lab & C
15-213 Recitation: Cache Lab & C Jack Biggs 16 Feb 2015 Agenda Buffer Lab! C Exercises! C Conventions! C Debugging! Version Control! Compilation! Buffer Lab... Is due soon. So maybe do it soon Agenda Buffer
More informationArchitecture: Caching Issues in Performance
Architecture: Caching Issues in Performance Mike Bailey mjb@cs.oregonstate.edu Problem: The Path Between a CPU Chip and Off-chip Memory is Slow CPU Chip Main Memory This path is relatively slow, forcing
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationAutomated Finite Element Computations in the FEniCS Framework using GPUs
Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationreclaim disk space by shrinking files
Sandeep Sahore reclaim disk space by shrinking files Sandeep Sahore holds a Master s degree in computer science from the University of Toledo and has nearly 15 years of experience in the computing industry.
More informationRDBE Host Software. Doc No: X3C 2009_07_21_1 TODO: Add appropriate document number. XCube Communication 1(13)
RDBE Host Software Doc No: X3C 2009_07_21_1 TODO: Add appropriate document number XCube Communication 1(13) Document history Change date Changed by Version Notes 09-07-21 09:12 Mikael Taveniku PA1 New
More informationCSE 374 Programming Concepts & Tools
CSE 374 Programming Concepts & Tools Hal Perkins Fall 2017 Lecture 8 C: Miscellanea Control, Declarations, Preprocessor, printf/scanf 1 The story so far The low-level execution model of a process (one
More informationFor personnal use only
Inverting Large Images Using CUDA Finnbarr P. Murphy (fpm@fpmurphy.com) This is a simple example of how to invert a very large image, stored as a vector using nvidia s CUDA programming environment and
More informationCSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory
Announcements CSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory There will be no lecture on Tuesday, Feb. 16. Prof. Thompson s office hours are canceled for Monday, Feb. 15. Prof.
More informationThe Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics
The Compositional C++ Language Denition Peter Carlin Mani Chandy Carl Kesselman March 12, 1993 Revision 0.95 3/12/93, Comments welcome. Abstract This document gives a concise denition of the syntax and
More informationCOMP 250: Java Programming I. Carlos G. Oliver, Jérôme Waldispühl January 17-18, 2018 Slides adapted from M. Blanchette
COMP 250: Java Programming I Carlos G. Oliver, Jérôme Waldispühl January 17-18, 2018 Slides adapted from M. Blanchette Variables and types [Downey Ch 2] Variable: temporary storage location in memory.
More informationOther array problems. Integer overflow. Outline. Integer overflow example. Signed and unsigned
Other array problems CSci 5271 Introduction to Computer Security Day 4: Low-level attacks Stephen McCamant University of Minnesota, Computer Science & Engineering Missing/wrong bounds check One unsigned
More informationDirect Memory Access. Lecture 2 Pointer Revision Command Line Arguments. What happens when we use pointers. Same again with pictures
Lecture 2 Pointer Revision Command Line Arguments Direct Memory Access C/C++ allows the programmer to obtain the value of the memory address where a variable lives. To do this we need to use a special
More informationCh. 11: References & the Copy-Constructor. - continued -
Ch. 11: References & the Copy-Constructor - continued - const references When a reference is made const, it means that the object it refers cannot be changed through that reference - it may be changed
More informationIntroduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research
Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationComputer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring Topic Notes: C and Unix Overview
Computer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring 2009 Topic Notes: C and Unix Overview This course is about computer organization, but since most of our programming is
More informationMore on C programming
Applied mechatronics More on C programming Sven Gestegård Robertz sven.robertz@cs.lth.se Department of Computer Science, Lund University 2017 Outline 1 Pointers and structs 2 On number representation Hexadecimal
More informationBlocks, Grids, and Shared Memory
Blocks, Grids, and Shared Memory GPU Course, Fall 2012 Last week: ax+b Homework Threads, Blocks, Grids CUDA threads are organized into blocks Threads operate in SIMD(ish) manner -- each executing same
More informationComputer Science 322 Operating Systems Mount Holyoke College Spring Topic Notes: C and Unix Overview
Computer Science 322 Operating Systems Mount Holyoke College Spring 2010 Topic Notes: C and Unix Overview This course is about operating systems, but since most of our upcoming programming is in C on a
More informationVirtual Memory 1. Virtual Memory
Virtual Memory 1 Virtual Memory key concepts virtual memory, physical memory, address translation, MMU, TLB, relocation, paging, segmentation, executable file, swapping, page fault, locality, page replacement
More informationP2: Collaborations. CSE 335, Spring 2009
P2: Collaborations CSE 335, Spring 2009 Milestone #1 due by Thursday, March 19 at 11:59 p.m. Completed project due by Thursday, April 2 at 11:59 p.m. Objectives Develop an application with a graphical
More informationIntroduction to GPU Computing. Design and Analysis of Parallel Algorithms
Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part
More informationChapter 8: Main Memory
Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:
More informationOpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR
OpenCL Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Architecture Parallel computing for heterogenous devices CPUs, GPUs, other processors (Cell, DSPs, etc) Portable accelerated code Defined
More informationCS 179: GPU Computing. Lecture 2: The Basics
CS 179: GPU Computing Lecture 2: The Basics Recap Can use GPU to solve highly parallelizable problems Performance benefits vs. CPU Straightforward extension to C language Disclaimer Goal for Week 1: Fast-paced
More informationWhat the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope. C Flow Control.
C Flow Control David Chisnall February 1, 2011 Outline What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope Disclaimer! These slides contain a lot of
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationLecture 07 Debugging Programs with GDB
Lecture 07 Debugging Programs with GDB In this lecture What is debugging Most Common Type of errors Process of debugging Examples Further readings Exercises What is Debugging Debugging is the process of
More informationVisual Profiler. User Guide
Visual Profiler User Guide Version 3.0 Document No. 06-RM-1136 Revision: 4.B February 2008 Visual Profiler User Guide Table of contents Table of contents 1 Introduction................................................
More informationBIL 104E Introduction to Scientific and Engineering Computing. Lecture 14
BIL 104E Introduction to Scientific and Engineering Computing Lecture 14 Because each C program starts at its main() function, information is usually passed to the main() function via command-line arguments.
More informationBase Component. Chapter 1. *Memory Management. Memory management Errors Exception Handling Messages Debug code Options Basic data types Multithreading
Chapter 1. Base Component Component:, *Mathematics, *Error Handling, *Debugging The Base Component (BASE), in the base directory, contains the code for low-level common functionality that is used by all
More informationCS333 Intro to Operating Systems. Jonathan Walpole
CS333 Intro to Operating Systems Jonathan Walpole Threads & Concurrency 2 Threads Processes have the following components: - an address space - a collection of operating system state - a CPU context or
More informationprimitive arrays v. vectors (1)
Arrays 1 primitive arrays v. vectors (1) 2 int a[10]; allocate new, 10 elements vector v(10); // or: vector v; v.resize(10); primitive arrays v. vectors (1) 2 int a[10]; allocate new, 10 elements
More informationConcurrency, Thread. Dongkun Shin, SKKU
Concurrency, Thread 1 Thread Classic view a single point of execution within a program a single PC where instructions are being fetched from and executed), Multi-threaded program Has more than one point
More informationPrograms. Function main. C Refresher. CSCI 4061 Introduction to Operating Systems
Programs CSCI 4061 Introduction to Operating Systems C Program Structure Libraries and header files Compiling and building programs Executing and debugging Instructor: Abhishek Chandra Assume familiarity
More informationAn Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel
An Introduction to OpenACC Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel Chapter 1 Introduction OpenACC is a software accelerator that uses the host and the device. It uses compiler
More informationGDB Tutorial. A Walkthrough with Examples. CMSC Spring Last modified March 22, GDB Tutorial
A Walkthrough with Examples CMSC 212 - Spring 2009 Last modified March 22, 2009 What is gdb? GNU Debugger A debugger for several languages, including C and C++ It allows you to inspect what the program
More informationArray Initialization
Array Initialization Array declarations can specify initializations for the elements of the array: int primes[10] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 ; initializes primes[0] to 2, primes[1] to 3, primes[2]
More informationEECS 213 Introduction to Computer Systems Dinda, Spring Homework 3. Memory and Cache
Homework 3 Memory and Cache 1. Reorder the fields in this structure so that the structure will (a) consume the most space and (b) consume the least space on an IA32 machine on Linux. struct foo { double
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationIntroduction to Supercomputing
Introduction to Supercomputing TMA4280 Introduction to UNIX environment and tools 0.1 Getting started with the environment and the bash shell interpreter Desktop computers are usually operated from a graphical
More informationThis is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.
David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008 This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. Please send any comment to dkirk@nvidia.com
More informationCCReflect has a few interesting features that are quite desirable for DigiPen game projects:
CCReflect v1.0 User Manual Contents Introduction... 2 Features... 2 Dependencies... 2 Compiler Dependencies... 2 Glossary... 2 Type Registration... 3 POD Registration... 3 Non-Pod Registration... 3 External
More informationProgramming Assignment #1: A Simple Shell
Programming Assignment #1: A Simple Shell Due: Check My Courses In this assignment you are required to create a C program that implements a shell interface that accepts user commands and executes each
More informationCS140 Operating Systems Final December 12, 2007 OPEN BOOK, OPEN NOTES
CS140 Operating Systems Final December 12, 2007 OPEN BOOK, OPEN NOTES Your name: SUNet ID: In accordance with both the letter and the spirit of the Stanford Honor Code, I did not cheat on this exam. Furthermore,
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationCSE 333 Midterm Exam Sample Solution 7/28/14
Question 1. (20 points) C programming. For this question implement a C function contains that returns 1 (true) if a given C string appears as a substring of another C string starting at a given position.
More informationOutline. Computer programming. Debugging. What is it. Debugging. Hints. Debugging
Outline Computer programming Debugging Hints Gathering evidence Common C errors "Education is a progressive discovery of our own ignorance." Will Durant T.U. Cluj-Napoca - Computer Programming - lecture
More informationCS510 Operating System Foundations. Jonathan Walpole
CS510 Operating System Foundations Jonathan Walpole Threads & Concurrency 2 Why Use Threads? Utilize multiple CPU s concurrently Low cost communication via shared memory Overlap computation and blocking
More information