r.cuda.example: Developing custom GRASS GIS CUDA parallel modules
Andrej Osterman
s51mo.net/fresnel
Compiled on June 1, 2015 at 19:11


Contents

1 Introduction
2 File main.cpp
3 File src.description.html
4 Common source cudagrass
5 File cudamain.cpp
6 Class CUserKernelArg (file CUserModule.h)
7 Class CUserModule (file CUserModule.h)
7.1 Allocate memory on host
7.2 Allocate memory on device
7.3 Run CUDA kernel once
7.4 Run CUDA kernel for each segment
8 Variable suffix
9 Compile all together
10 Debug printouts
10.1 Print last CUDA error (debug=1)
10.2 Print arguments to CUDA part of the module (debug=2)
10.3 Print total elapsed time for CUDA calculation part (debug=4)
10.4 Print compile properties (debug=16)
10.5 Print device(s) properties (debug=32)
10.6 Print host properties (debug=64)
10.7 Print device buffers properties (debug=128)
10.8 Print host buffers properties (debug=256)
10.9 Print input and output map properties (debug=512)
10.10 Print scheduler properties (debug=1024)
10.11 Print run properties (debug=2048)
10.12 Print clutter map properties (debug=4096)
11 Contact

1 Introduction

This is a programmer's manual for developing custom parallel raster GRASS GIS modules running on a CUDA GPU. This manual also contains source code provided by NVIDIA Corporation. If you have not installed the modules yet, please download the sources and the installation manual from:

A good point to start developing custom sequential raster programs is:

To develop a sequential module, only two source files must be modified: main.c and description.html (directory doc/raster/r.example in the GRASS GIS source code). In main.c one writes the ANSI C application code for the new module; in description.html one writes the help text and a description of the module.

Developing a parallel module in GRASS GIS running on the GPU is a somewhat more complicated job than developing a sequential module. Two different software packages, GRASS GIS and CUDA, must be merged into one application (module). Due to the limited GPU memory size, memory management is more complicated. Program arguments must be copied to the GPU. A scheduler must be implemented to run the individual pieces of the program.

In this manual we briefly describe the r.cuda.example module, which does a very simple job on raster maps: it reads each grid cell and colors it black if its height is above (or below) some preset threshold. The source code of the r.cuda.example module can be found in the cuda-workspace/cudaexample directory.

To develop a parallel module, five files must be modified. The files main.cpp and src.description.html (directory cuda-workspace/cudaexample/src_gui_grass) are related more to GRASS GIS. The files cudamain.cpp, CUserModule.cu and CUserModule.h (directory cuda-workspace/cudaexample/src) are related more to CUDA. Below we describe each file separately.
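The per-cell job of the example module is embarrassingly parallel, which is what makes it a good first CUDA module. A minimal sequential C++ sketch of the same thresholding idea (the function name, the encoding of "black" as 0.0 and the comparison direction are illustrative assumptions, not the module's actual code):

```cpp
#include <cstddef>
#include <vector>

// Color every cell "black" (encoded here as 0.0f) whose height exceeds
// a preset threshold; all other cells keep their original height.
std::vector<float> threshold_cells(const std::vector<float>& heights,
                                   float water_level)
{
    std::vector<float> out(heights.size());
    for (std::size_t i = 0; i < heights.size(); ++i)
        out[i] = (heights[i] > water_level) ? 0.0f : heights[i];
    return out;
}
```

On the GPU, the loop body becomes the kernel and each thread processes one grid cell independently.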

2 File main.cpp

The file main.cpp in directory cuda-workspace/cudaexample/src_gui_grass is practically inherited from the file main.c (sequential module r.example). It is the entry point of the program because it contains the main() function. The file main.cpp contains the GRASS GIS routines for module initialization, the definition and manipulation of options and flags, and the history storage routines. It also contains the wrappers to the CUDA environment. But it no longer contains the buffer allocation routines, the raster reading routines or the data processing routines.

The program starts with the GRASS initialization routine:

    G_gisinit(argv[0]);

which reads the GRASS environment and stores the program name for G_program_name(). Then follows the module initialization routine:

    struct GModule *module;
    module = G_define_module();
    module->keywords = _("example");
    module->description = _("CUDA example raster module");
    module->verbose = 1;

Flags are defined as follows:

    struct Flag *flag_negative;
    flag_negative = G_define_flag();
    flag_negative->key = 'n';
    flag_negative->description = _("Negative result");

Options are defined as:

    struct Option *debugopt;
    debugopt = G_define_option();
    debugopt->key = "debug";
    debugopt->type = TYPE_INTEGER;
    debugopt->required = NO;
    debugopt->key_desc = "value";
    debugopt->description = _("Debug number");
    debugopt->answer = _("0");
    debugopt->guisection = _("Optional");

The breaking point is the options and flags parser:

    if (G_parser(argc, argv))
        exit(EXIT_FAILURE);

The status of the flag is now in flag_negative->answer and the value of the option is in debugopt->answer. To transfer flags and options to the CUDA part of the program, we define a vector:

    vector<string> argument;

and fill it with all those arguments. The classic argument form is made with the code:

    char **argv_cuda;
    argv_cuda = new char *[argument.size()];
    for (unsigned int i = 0; i < argument.size(); i++) {
        argv_cuda[i] = new char[argument[i].size() + 1];
        strcpy(argv_cuda[i], argument[i].c_str());
    }

When the module is initialized and all arguments are set, the wrapping routine:

    cudacalculation(argument.size(), argv_cuda);

is called. This routine is the entry point to the CUDA calculation part of the program and its definition is in the file cudamain.cpp. The running code for this routine is in the static library libcudauser.a (see Algorithm 1). The other wrapper is the routine:

    void percentgrass(long percent, long all)
    {
        G_percent(percent, all, 2);
    }

which is called from the CUDA part of the program to indicate what percentage of the calculation has completed.
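The conversion above must keep each string alive and NUL-terminated for the C-style consumer. A self-contained sketch, with the matching cleanup added for completeness (the free_argv helper is not part of the manual's code):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Build a classic char** argument array from a vector of strings,
// exactly as in the module's main.cpp.
char** make_argv(const std::vector<std::string>& argument)
{
    char** argv_cuda = new char*[argument.size()];
    for (unsigned int i = 0; i < argument.size(); i++) {
        argv_cuda[i] = new char[argument[i].size() + 1];
        std::strcpy(argv_cuda[i], argument[i].c_str());
    }
    return argv_cuda;
}

// Release the array once cudacalculation() has returned.
void free_argv(char** argv_cuda, std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        delete[] argv_cuda[i];
    delete[] argv_cuda;
}
```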

3 File src.description.html

The file src.description.html in directory cuda-workspace/cudaexample/src_gui_grass is the help and description HTML file for the module. It should follow the same format as description.html in directory doc/raster/r.example in the GRASS GIS source code. The only difference is that you should omit the "Last changed" line, because it is added automatically at compilation time.

4 Common source cudagrass

The common source cudagrass, which is compiled together with every parallel GRASS GIS CUDA module, can be found in directory cuda-workspace/cudagrass. It contains a number of C++ classes which take care of reading/writing maps, memory management, scheduler management, argument management etc. The only class which is unique for each module is CUserModule. Although a CUserModule class also exists in directory cuda-workspace/cudagrass, it is not used when a user module is compiled: each module has its own CUserModule class in the CUserModule.h and CUserModule.cu files in the module's directory. A schematic of the common source classes and the CUserModule class is shown in Figure 1. The developer of a new module normally needs to modify only the CUserModule class. It is better not to modify the other classes; if there is no alternative, care must be taken, because a modification of the common sources has an impact on all user parallel modules!

Figure 1: C++ CUserModule and common classes in a general CUDA GRASS GIS module.

The classes CDevice and CDeviceBuff are intended for the GPU part; they read the GPU properties and allocate memory on the GPU. The classes CHost and CHostBuff are intended for the CPU part; they read the CPU properties and allocate memory on the computer. The class CBuff connects the two wings and contains functions for copying the contents of memory from the CPU to the GPU and

vice versa. The class CMap is intended for the storage and manipulation of the properties of the digital maps. The segmentation of the digital maps is done in the class CSegmentScheduler. The class CRunScheduler is responsible for the proper run-time sequencing of the CUDA kernels. Reading the input arguments is implemented in the class CArgs. Data compression is taken care of by the class CCompress. All source code that determines the properties of the user's dedicated module is written in the class CUserModule.

5 File cudamain.cpp

The file cudamain.cpp in directory cuda-workspace/cudaexample/src is the starting point of the CUDA part of the program. It contains only one function:

    int cudacalculation(int argc, char** argv)
    {
        ...
    }

This function is a wrapper and it is run from int main(int argc, char** argv) (file main.cpp, directory cuda-workspace/cudaexample/src_gui_grass). Note that the arguments char** argv of cudacalculation(..) are not necessarily the same as the arguments char** argv of main(..). It is the programmer's responsibility to bring the appropriate arguments to the CUDA part of the program.

First the user object must be created (see Algorithm 1):

    CUserModule *user;
    user = new CUserModule;

and then follows the argument parser:

    user->parseUserInputArguments(argc, argv);

The input and output maps must be set with:

    user->setMap(INPUT_MAP,
                 user->string_args.dir,
                 user->string_args.projection,
                 user->string_args.mapset,
                 user->string_args.input_map_name);
    user->setMap(OUTPUT_MAP,
                 user->string_args.dir,
                 user->string_args.projection,
                 user->string_args.cur_mapset,
                 user->string_args.output_map_name,
                 user->kernel_args.consider_region,
                 user->kernel_args.out_format,
                 compress,
                 chunks);

The allocation of buffers is done in:

    user->allocateSegmentBuffers(force_segments_n);

A process must be added to the scheduler with the following functions:

    user->addUserProcess("kernelUser", false);
    user->addUserProcess();

and finally the calculation is run:

    user->runCalculation();

The output properties cellhd, range, null, colr and cats are written with:

    user->writeOutMapProp(PROP_CELLHD, ACTION_OVERWRITE);
    user->writeOutMapProp(PROP_RANGE, ACTION_OVERWRITE, "0 1");
    user->writeOutMapProp(PROP_NULL, ACTION_ERASE);
    user->writeOutMapProp(PROP_COLR, ACTION_WRITE, viewshed_colr);
    user->writeOutMapProp(PROP_CATS, ACTION_WRITE, default_cats);

At the end, the temporary files and the user object must be deleted:

    user->deleteTempFiles();
    delete user;

Finally the function cudacalculation() terminates with:

    return 0;

6 Class CUserKernelArg (file CUserModule.h)

The class CUserKernelArg is defined in the file CUserModule.h. It is used to define the user arguments which he or she wants to bring into the kernel. Only simple data types are allowed (bool, char, short, int, long int, float, double). From CPU code the members of the class are accessible with the prefix kernel_args, for example:

    kernel_args.water_level = 300.0;

and from GPU code the members of the class are accessible with the prefix args, for example:

    double w = args.water_level;

The default values of the members of the class CUserKernelArg are usually preset in the constructor of the CUserModule class, that is in CUserModule::CUserModule(). The values are parsed from char** argv in the function:

    void CUserModule::parseUserInputArguments(int argc, char** argv);

where the user can put his own code. There is another function:

    void CUserModule::userSetArguments();

This function runs before the first user kernel launch and before the arguments are copied to the device. It is also suitable for setting the user's arguments. The arguments are copied automatically before any user kernel is launched. However, the user can also copy the arguments to the device explicitly with the function:

    void CUserModule::copyArgumentsToDevice(cudaStream_t stream);

7 Class CUserModule (file CUserModule.h)

The class CUserModule is defined in the file CUserModule.h. It contains several useful functions. The three argument functions parseUserInputArguments(..), userSetArguments() and copyArgumentsToDevice(..) were described in the previous chapter. One of the most important functions is userAllocateMemory(), where the user can put calls for allocating additional memory space on the host or on the device. These two functions are:

    int CHostBuff::allocatePinnedBuffers(int n, size_t s, string name);
    int CDeviceBuff::allocateDevGlobalBuffers(int n, size_t s, string name);

and they are described in the following two sub-chapters.

7.1 Allocate memory on host

The function for allocating memory on the host is:

    int allocatePinnedBuffers(int n, size_t s, string name);

where int n is the number of (same size) buffers to allocate, size_t s is the size of each buffer in bytes and string name is the name of the buffer. The function returns the index of the first allocated buffer. This index is the only handle to the allocated buffers, so it must be saved, for example to an int my_host_mem_id variable, which can be defined in the private part of the class CUserModule. The name of the allocated buffers can simply be the name of the index variable, in our case my_host_mem_id. This is the best way for debugging purposes.
The pointer to the first allocated buffer can then be reached with:

    void* p = host_buff[my_host_mem_id + 0].pbuff;

to the second:

    void* p = host_buff[my_host_mem_id + 1].pbuff;

to the third:

    void* p = host_buff[my_host_mem_id + 2].pbuff;

and so on. Of course, care must be taken over how many buffers were allocated. The size of one buffer is reached with:

    size_t size = host_buff[my_host_mem_id + 0].pbuff_size;

and the name with:

    string name = host_buff[my_host_mem_id + 0].pbuff_name;

7.2 Allocate memory on device

The function for allocating memory on the device is:

    int allocateDevGlobalBuffers(int n, size_t s, string name);

where int n is the number of (same size) buffers to allocate, size_t s is the size of each buffer in bytes and string name is the name of the buffer. The function returns the index of the first allocated buffer. This index is the only handle to the allocated buffers, so it must be saved, for example to an int my_device_mem_id variable, which can be defined in the private part of the class CUserModule. The name of the allocated buffers can simply be the name of the index variable, in our case my_device_mem_id. This is the best way for debugging purposes. The pointer to the first allocated buffer can then be reached with:

    void* p = dev_buff[my_device_mem_id + 0].gbuff;

to the second:

    void* p = dev_buff[my_device_mem_id + 1].gbuff;

to the third:

    void* p = dev_buff[my_device_mem_id + 2].gbuff;

and so on. Of course, care must be taken over how many buffers were allocated. The size of one buffer is reached with:

    size_t size = dev_buff[my_device_mem_id + 0].gbuff_size;

and the name with:

    string name = dev_buff[my_device_mem_id + 0].gbuff_name;

To reach a global buffer from a CUDA kernel, the pointer to the buffer must be transferred to the kernel, either through a kernel argument itself or through the CUserKernelArg class (see section 6).

7.3 Run CUDA kernel once

In the function:

    void CUserModule::userPriorRun(cudaStream_t cuda_stream);

the user can put a call to a kernel function which will run only once, prior to the other segment kernels. The minimum code to run the kernel is:

    __global__ void kernelPriorRun_()
    {
        ...
    }

    void CUserModule::userPriorRun(cudaStream_t cuda_stream)
    {
        dim3 threads(threads_n, 1, 1);
        dim3 blocks(1, 1, 1);
        blocks.x = (my_size - 1 + threads_n) / threads_n;
        int shared_memory_size = 0;
        kernelPriorRun_<<<blocks, threads, shared_memory_size, cuda_stream>>>();
    }

The CUDA function __global__ void kernelPriorRun_() is not part of the class CUserModule, because CUDA kernel routines do not support the C++ style of programming. The solution is to put the kernel routines in global scope and just call them from a class member function.
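The pattern described above, a kernel as a free function in global scope called from a class member, can be illustrated without a GPU; in this sketch an ordinary function stands in for the __global__ kernel and all names are illustrative:

```cpp
#include <vector>

// Stand-in for a __global__ kernel: a free function at global scope.
// On the device this body would run once per thread rather than in a loop.
static void kernelPriorRunStandIn(std::vector<int>& data)
{
    for (int& x : data)
        x *= 2;
}

class UserModuleSketch {
public:
    // Member function launching the "kernel", mirroring how
    // CUserModule::userPriorRun() launches kernelPriorRun_<<<...>>>().
    void userPriorRun(std::vector<int>& data)
    {
        kernelPriorRunStandIn(data);
    }
};
```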

7.4 Run CUDA kernel for each segment

In the function:

    void CUserModule::kernelUser(int idx_seg, int idx_flow,
                                 char* src_buff, char* dst_buff,
                                 cudaStream_t cuda_stream);

the user can put a call to a kernel function which will run once for each memory segment. The minimum code to run the kernel is:

    __global__ void kernelUser_()
    {
        ...
    }

    void CUserModule::kernelUser(int seg_id, int flow_id,
                                 char* src_buff, char* dst_buff,
                                 cudaStream_t cuda_stream)
    {
        dim3 threads(threads_n, 1, 1);
        dim3 blocks(1, 1, 1);
        blocks.x = (my_size - 1 + threads_n) / threads_n;
        int shared_memory_size = 0;
        kernelUser_<<<blocks, threads, shared_memory_size, cuda_stream>>>();
    }

The CUDA function __global__ void kernelUser_() is not part of the class CUserModule, because CUDA kernel routines do not support the C++ style of programming. The solution is to put the kernel routines in global scope and just call them from a class member function.
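The block-count expression blocks.x = (my_size - 1 + threads_n) / threads_n used in both launch wrappers is an integer ceiling division: it yields the smallest number of blocks of threads_n threads that covers all my_size elements. A small sketch:

```cpp
#include <cstddef>

// Ceiling division: the smallest number of blocks of threads_n threads
// whose total thread count is at least my_size.
std::size_t grid_blocks(std::size_t my_size, std::size_t threads_n)
{
    return (my_size - 1 + threads_n) / threads_n;
}
```

Because the grid may contain a few more threads than elements, a kernel computed this way typically guards its body with a bounds check such as if (idx < my_size).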

8 Variable suffix

Variables have a suffix which follows a simple convention:

    _buff   - pointer to a buffer (char*, void*, ...); always points to the beginning of the buffer
    _p      - pointer to a buffer (char*, void*, ...); similar to _buff, but during the calculation it can change and point to a particular member of the buffer
    _id     - index (int, int64_t, ...)
    _idc    - shorter index (char)
    _count  - counter (int, int64_t, ...)
    _stream - CUDA stream (cudaStream_t)
    _size   - size of the buffer in bytes (size_t)
    _size2  - size of the buffer in shorts
    _size4  - size of the buffer in ints or floats
    _size8  - size of the buffer in int64_ts or doubles
    _seek   - index seek (size_t)
    _offset - index seek (int)
    _length - size of one data member in bytes (for example, 4 for an int)
    _n      - number, quantity (except for the size of a buffer, where _size is used)
    _items  - number of elements (for example, the number of elements in a vector)
    _name   - name (string)
    _prop   - pointer to properties
    _fp     - file pointer (FILE*)
    _fname  - file name (string)
    _v      - vector
    _args   - pointer to a function argument
    _tex    - texture memory

9 Compile all together

The object user is constructed from the class CUserModule, which inherits all properties and data from the other classes (see Figure 1). The CUDA nvcc and GNU gcc compilers build the static library libcudauser.a with the CUDA make script (see Algorithm 1). The user module is then built by the GRASS GIS make script with the gcc compiler/linker from the files main.cpp, libcudauser.a and description.html.

Algorithm 1: GRASS GIS CUDA module compile flow chart. The object user (CUserModule* user; user = new CUserModule;) lives in cudamain.cpp; the CUDA make script (nvcc + gcc) builds the static library libcudauser.a; the GRASS GIS make script (gcc) links main.cpp, libcudauser.a, the CUDA RT library (libcudart), the GRASS GIS libraries and description.html into the user module r.cuda.user.

10 Debug printouts

Several printouts are prepared for easier debugging and development of a new module. A certain debug printout is enabled with the debug parameter, for example:

    r.cuda.viewshed --overwrite output=viewshed \
        coordinate=592094, obs_elev=20 max_dist=10000 debug=1

Each debug printout has its own number. To print out several different debug printouts, just sum the numbers together. In the following sub-chapters all printouts are described.

10.1 Print last CUDA error (debug=1)

It prints the last CUDA error. If there is no error, it should print:

    cuda last error = no error

If an error occurs, the module prints out the last CUDA error regardless of whether the debug parameter is set or not.

10.2 Print arguments to CUDA part of the module (debug=2)

It prints out the arguments which are passed to the CUDA part of the program. These arguments are not necessarily the same as the input arguments. The printout could look like:

    dir=/home/andrej/grass_data projection=slovenija mapset=permanent
    cur_mapset=fresnel_testing input_map_name=mobitel_slo_dem12i
    output_map_name=viewshed range=800 raw_range=10000 obs_elev=20
    tgt_elev=0.0 azim_angle=0 azim_sector=360 elev_angle=0 elev_sector=180
    earth_radius= u coordinate=967,832 raw_coordinate=592094,
    verbose=1 debug=2 segments=

10.3 Print total elapsed time for CUDA calculation part (debug=4)

It prints out the total elapsed time:

    Finished, total elapsed time is ms

10.4 Print compile properties (debug=16)

It prints out some compilation data:

    16 COMPILE PROPERTIES:
    DATE = May
    TIME = 11:33:24
    VERSION =
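Because each printout owns a distinct power of two, the summed debug value behaves as a bitmask: summing the numbers is the same as OR-ing flags, and each printout can test its own bit independently. An illustrative sketch of such decoding (the enum names are assumptions, not the module's actual identifiers):

```cpp
// Debug printout flags, one bit each, matching the numbers above.
enum DebugFlag {
    DBG_LAST_ERROR = 1,
    DBG_ARGUMENTS  = 2,
    DBG_TIME       = 4,
    DBG_COMPILE    = 16,
};

// True when the given printout was requested in the summed debug value.
bool debug_enabled(int debug, DebugFlag f)
{
    return (debug & f) != 0;
}
```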

10.5 Print device(s) properties (debug=32)

Prints out the device properties, like the deviceQuery application from the CUDA samples does.

10.6 Print host properties (debug=64)

Prints out memory data from your computer. The first few entries are the important ones and should look like:

    MemTotal=
    MaxUserMemory=
    MaxPinnedMemory=
    MemFree=
    Buffers=

10.7 Print device buffers properties (debug=128)

Prints out the device buffer properties. These buffers are located in global memory on the device. The printout looks like:

    **************************************************************************
    128 DEVICE BUFFERS PROPERTIES:
    buff size pointer     dev_buff[...].gbuff
              0x600f40000 in_map_prop.head_d_id
              0x          ns_v_id
              0x          ns_v_id
              0x          ev_v_id
              0x60104a800 ev_v_id
              0x          buff_a_id
              0x60cfa0000 buff_b_id+0
    SUM= B, kb, MB, GB

The table lists all buffers allocated during the calculation. The first column is the buffer index, the second column the size. The third column is the buffer pointer (on the CUDA device). The fourth column is the program variable (which stores the buffer index) plus the offset. For example, to get the pointer of the first ns_v_id buffer one can use:

    void* p = dev_buff[ns_v_id + 0].gbuff;

and to get the pointer to the second ev_v_id buffer one can use:

    void* p = dev_buff[ev_v_id + 1].gbuff;

10.8 Print host buffers properties (debug=256)

Prints out the host buffer properties. These buffers are located in pinned memory on the host. The printout looks like:

    **************************************************************************
    256 HOST BUFFERS PROPERTIES:
    buff size pointer     host_buff[...].pbuff
              0x          in_map_prop.head_h_id
              0x          out_map_prop.head_h_id
              0x          buff_h_id+0
    SUM= B, kb, MB, GB

The table lists all buffers allocated during the calculation. The first column is the buffer index, the second column the size. The third column is the buffer pointer. The fourth column is the program variable (which stores the buffer index) plus the offset. For example, to get the pointer of the buff_h_id buffer one can use:

    void* p = host_buff[buff_h_id + 0].pbuff;

10.9 Print input and output map properties (debug=512)

Prints out the input and output map properties, such as bounds, resolution, format, number of rows, number of columns, pitch size etc.

10.10 Print scheduler properties (debug=1024)

Prints out the segment properties, for example:

    // vector<CSegment> segment: (gross,+net) (gross,-net)
    seg seq row_in_start_id row_in_stop_id in_seek in_size
            row_out_start_id row_out_stop_id out_seek (net) out_size
    (0,+0)    (588,-1)
    (587,+1)  (1174,-1)
    (1173,+1) (1759,-0)

10.11 Print run properties (debug=2048)

Prints out the running sequence, for example:

    (readhd,1,0)           D->H
    (copyh2d,2,0)          H->A
    (decompress,3,0)       A->B
    (regression,4,0)       B->A
    (kernelvisibility,0,0) A->B
    (alignment,5,0)        B->B
    (compress,6,0)         B->B
    (copyd2h,7,0)          B->H
    (writehd,8,0)          H->D
    (exit,9,0)             ->

The letters have the following meanings:

    D - data on the hard disk
    H - data in a host buffer
    A - data in buffer A on the GPU
    B - data in buffer B on the GPU

The arrows -> indicate how the data flows.

10.12 Print clutter map properties (debug=4096)

Prints out the clutter map properties, such as bounds, resolution, format, number of rows, number of columns, pitch size etc.

11 Contact

Author: Andrej Osterman. Any feedback is welcome. Please e-mail me at s51mo@hamradio.si or andrej.osterman@guest.arnes.si. Please note that the modules are in an experimental phase and bugs are still alive.


More information

CSE 333 Midterm Exam Sample Solution 7/29/13

CSE 333 Midterm Exam Sample Solution 7/29/13 Question 1. (44 points) C hacking a question of several parts. The next several pages are questions about a linked list of 2-D points. Each point is represented by a Point struct containing the point s

More information

Pace University. Fundamental Concepts of CS121 1

Pace University. Fundamental Concepts of CS121 1 Pace University Fundamental Concepts of CS121 1 Dr. Lixin Tao http://csis.pace.edu/~lixin Computer Science Department Pace University October 12, 2005 This document complements my tutorial Introduction

More information

Introduction to C. Sami Ilvonen Petri Nikunen. Oct 6 8, CSC IT Center for Science Ltd, Espoo. int **b1, **b2;

Introduction to C. Sami Ilvonen Petri Nikunen. Oct 6 8, CSC IT Center for Science Ltd, Espoo. int **b1, **b2; Sami Ilvonen Petri Nikunen Introduction to C Oct 6 8, 2015 @ CSC IT Center for Science Ltd, Espoo int **b1, **b2; /* Initialise metadata */ board_1->height = height; board_1->width = width; board_2->height

More information

Procedures, Parameters, Values and Variables. Steven R. Bagley

Procedures, Parameters, Values and Variables. Steven R. Bagley Procedures, Parameters, Values and Variables Steven R. Bagley Recap A Program is a sequence of statements (instructions) Statements executed one-by-one in order Unless it is changed by the programmer e.g.

More information

Advanced Topics: Streams, Multi-GPU, Tools, Libraries, etc.

Advanced Topics: Streams, Multi-GPU, Tools, Libraries, etc. CSC 391/691: GPU Programming Fall 2011 Advanced Topics: Streams, Multi-GPU, Tools, Libraries, etc. Copyright 2011 Samuel S. Cho Streams Until now, we have largely focused on massively data-parallel execution

More information

CS 261 Fall Mike Lam, Professor. Structs and I/O

CS 261 Fall Mike Lam, Professor. Structs and I/O CS 261 Fall 2018 Mike Lam, Professor Structs and I/O Typedefs A typedef is a way to create a new type name Basically a synonym for another type Useful for shortening long types or providing more meaningful

More information

CSE 333 Midterm Exam 7/29/13

CSE 333 Midterm Exam 7/29/13 Name There are 5 questions worth a total of 100 points. Please budget your time so you get to all of the questions. Keep your answers brief and to the point. The exam is closed book, closed notes, closed

More information

CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce

More information

CYSE 411/AIT681 Secure Software Engineering Topic #12. Secure Coding: Formatted Output

CYSE 411/AIT681 Secure Software Engineering Topic #12. Secure Coding: Formatted Output CYSE 411/AIT681 Secure Software Engineering Topic #12. Secure Coding: Formatted Output Instructor: Dr. Kun Sun 1 This lecture: [Seacord]: Chapter 6 Readings 2 Secure Coding String management Pointer Subterfuge

More information

2/9/18. CYSE 411/AIT681 Secure Software Engineering. Readings. Secure Coding. This lecture: String management Pointer Subterfuge

2/9/18. CYSE 411/AIT681 Secure Software Engineering. Readings. Secure Coding. This lecture: String management Pointer Subterfuge CYSE 411/AIT681 Secure Software Engineering Topic #12. Secure Coding: Formatted Output Instructor: Dr. Kun Sun 1 This lecture: [Seacord]: Chapter 6 Readings 2 String management Pointer Subterfuge Secure

More information

Architecture: Caching Issues in Performance

Architecture: Caching Issues in Performance Architecture: Caching Issues in Performance Mike Bailey mjb@cs.oregonstate.edu Problem: The Path Between a CPU Chip and Off-chip Memory is Slow CPU Chip Main Memory This path is relatively slow, forcing

More information

Recitation: Cache Lab & C

Recitation: Cache Lab & C 15-213 Recitation: Cache Lab & C Jack Biggs 16 Feb 2015 Agenda Buffer Lab! C Exercises! C Conventions! C Debugging! Version Control! Compilation! Buffer Lab... Is due soon. So maybe do it soon Agenda Buffer

More information

Architecture: Caching Issues in Performance

Architecture: Caching Issues in Performance Architecture: Caching Issues in Performance Mike Bailey mjb@cs.oregonstate.edu Problem: The Path Between a CPU Chip and Off-chip Memory is Slow CPU Chip Main Memory This path is relatively slow, forcing

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

Automated Finite Element Computations in the FEniCS Framework using GPUs

Automated Finite Element Computations in the FEniCS Framework using GPUs Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

reclaim disk space by shrinking files

reclaim disk space by shrinking files Sandeep Sahore reclaim disk space by shrinking files Sandeep Sahore holds a Master s degree in computer science from the University of Toledo and has nearly 15 years of experience in the computing industry.

More information

RDBE Host Software. Doc No: X3C 2009_07_21_1 TODO: Add appropriate document number. XCube Communication 1(13)

RDBE Host Software. Doc No: X3C 2009_07_21_1 TODO: Add appropriate document number. XCube Communication 1(13) RDBE Host Software Doc No: X3C 2009_07_21_1 TODO: Add appropriate document number XCube Communication 1(13) Document history Change date Changed by Version Notes 09-07-21 09:12 Mikael Taveniku PA1 New

More information

CSE 374 Programming Concepts & Tools

CSE 374 Programming Concepts & Tools CSE 374 Programming Concepts & Tools Hal Perkins Fall 2017 Lecture 8 C: Miscellanea Control, Declarations, Preprocessor, printf/scanf 1 The story so far The low-level execution model of a process (one

More information

For personnal use only

For personnal use only Inverting Large Images Using CUDA Finnbarr P. Murphy (fpm@fpmurphy.com) This is a simple example of how to invert a very large image, stored as a vector using nvidia s CUDA programming environment and

More information

CSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory

CSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory Announcements CSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory There will be no lecture on Tuesday, Feb. 16. Prof. Thompson s office hours are canceled for Monday, Feb. 15. Prof.

More information

The Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics

The Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics The Compositional C++ Language Denition Peter Carlin Mani Chandy Carl Kesselman March 12, 1993 Revision 0.95 3/12/93, Comments welcome. Abstract This document gives a concise denition of the syntax and

More information

COMP 250: Java Programming I. Carlos G. Oliver, Jérôme Waldispühl January 17-18, 2018 Slides adapted from M. Blanchette

COMP 250: Java Programming I. Carlos G. Oliver, Jérôme Waldispühl January 17-18, 2018 Slides adapted from M. Blanchette COMP 250: Java Programming I Carlos G. Oliver, Jérôme Waldispühl January 17-18, 2018 Slides adapted from M. Blanchette Variables and types [Downey Ch 2] Variable: temporary storage location in memory.

More information

Other array problems. Integer overflow. Outline. Integer overflow example. Signed and unsigned

Other array problems. Integer overflow. Outline. Integer overflow example. Signed and unsigned Other array problems CSci 5271 Introduction to Computer Security Day 4: Low-level attacks Stephen McCamant University of Minnesota, Computer Science & Engineering Missing/wrong bounds check One unsigned

More information

Direct Memory Access. Lecture 2 Pointer Revision Command Line Arguments. What happens when we use pointers. Same again with pictures

Direct Memory Access. Lecture 2 Pointer Revision Command Line Arguments. What happens when we use pointers. Same again with pictures Lecture 2 Pointer Revision Command Line Arguments Direct Memory Access C/C++ allows the programmer to obtain the value of the memory address where a variable lives. To do this we need to use a special

More information

Ch. 11: References & the Copy-Constructor. - continued -

Ch. 11: References & the Copy-Constructor. - continued - Ch. 11: References & the Copy-Constructor - continued - const references When a reference is made const, it means that the object it refers cannot be changed through that reference - it may be changed

More information

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

Computer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring Topic Notes: C and Unix Overview

Computer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring Topic Notes: C and Unix Overview Computer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring 2009 Topic Notes: C and Unix Overview This course is about computer organization, but since most of our programming is

More information

More on C programming

More on C programming Applied mechatronics More on C programming Sven Gestegård Robertz sven.robertz@cs.lth.se Department of Computer Science, Lund University 2017 Outline 1 Pointers and structs 2 On number representation Hexadecimal

More information

Blocks, Grids, and Shared Memory

Blocks, Grids, and Shared Memory Blocks, Grids, and Shared Memory GPU Course, Fall 2012 Last week: ax+b Homework Threads, Blocks, Grids CUDA threads are organized into blocks Threads operate in SIMD(ish) manner -- each executing same

More information

Computer Science 322 Operating Systems Mount Holyoke College Spring Topic Notes: C and Unix Overview

Computer Science 322 Operating Systems Mount Holyoke College Spring Topic Notes: C and Unix Overview Computer Science 322 Operating Systems Mount Holyoke College Spring 2010 Topic Notes: C and Unix Overview This course is about operating systems, but since most of our upcoming programming is in C on a

More information

Virtual Memory 1. Virtual Memory

Virtual Memory 1. Virtual Memory Virtual Memory 1 Virtual Memory key concepts virtual memory, physical memory, address translation, MMU, TLB, relocation, paging, segmentation, executable file, swapping, page fault, locality, page replacement

More information

P2: Collaborations. CSE 335, Spring 2009

P2: Collaborations. CSE 335, Spring 2009 P2: Collaborations CSE 335, Spring 2009 Milestone #1 due by Thursday, March 19 at 11:59 p.m. Completed project due by Thursday, April 2 at 11:59 p.m. Objectives Develop an application with a graphical

More information

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms

Introduction to GPU Computing. Design and Analysis of Parallel Algorithms Introduction to GPU Computing Design and Analysis of Parallel Algorithms Sources CUDA Programming Guide (3.2) CUDA Best Practices Guide (3.2) CUDA Toolkit Reference Manual (3.2) CUDA SDK Examples Part

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:

More information

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Architecture Parallel computing for heterogenous devices CPUs, GPUs, other processors (Cell, DSPs, etc) Portable accelerated code Defined

More information

CS 179: GPU Computing. Lecture 2: The Basics

CS 179: GPU Computing. Lecture 2: The Basics CS 179: GPU Computing Lecture 2: The Basics Recap Can use GPU to solve highly parallelizable problems Performance benefits vs. CPU Straightforward extension to C language Disclaimer Goal for Week 1: Fast-paced

More information

What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope. C Flow Control.

What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope. C Flow Control. C Flow Control David Chisnall February 1, 2011 Outline What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope Disclaimer! These slides contain a lot of

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Lecture 07 Debugging Programs with GDB

Lecture 07 Debugging Programs with GDB Lecture 07 Debugging Programs with GDB In this lecture What is debugging Most Common Type of errors Process of debugging Examples Further readings Exercises What is Debugging Debugging is the process of

More information

Visual Profiler. User Guide

Visual Profiler. User Guide Visual Profiler User Guide Version 3.0 Document No. 06-RM-1136 Revision: 4.B February 2008 Visual Profiler User Guide Table of contents Table of contents 1 Introduction................................................

More information

BIL 104E Introduction to Scientific and Engineering Computing. Lecture 14

BIL 104E Introduction to Scientific and Engineering Computing. Lecture 14 BIL 104E Introduction to Scientific and Engineering Computing Lecture 14 Because each C program starts at its main() function, information is usually passed to the main() function via command-line arguments.

More information

Base Component. Chapter 1. *Memory Management. Memory management Errors Exception Handling Messages Debug code Options Basic data types Multithreading

Base Component. Chapter 1. *Memory Management. Memory management Errors Exception Handling Messages Debug code Options Basic data types Multithreading Chapter 1. Base Component Component:, *Mathematics, *Error Handling, *Debugging The Base Component (BASE), in the base directory, contains the code for low-level common functionality that is used by all

More information

CS333 Intro to Operating Systems. Jonathan Walpole

CS333 Intro to Operating Systems. Jonathan Walpole CS333 Intro to Operating Systems Jonathan Walpole Threads & Concurrency 2 Threads Processes have the following components: - an address space - a collection of operating system state - a CPU context or

More information

primitive arrays v. vectors (1)

primitive arrays v. vectors (1) Arrays 1 primitive arrays v. vectors (1) 2 int a[10]; allocate new, 10 elements vector v(10); // or: vector v; v.resize(10); primitive arrays v. vectors (1) 2 int a[10]; allocate new, 10 elements

More information

Concurrency, Thread. Dongkun Shin, SKKU

Concurrency, Thread. Dongkun Shin, SKKU Concurrency, Thread 1 Thread Classic view a single point of execution within a program a single PC where instructions are being fetched from and executed), Multi-threaded program Has more than one point

More information

Programs. Function main. C Refresher. CSCI 4061 Introduction to Operating Systems

Programs. Function main. C Refresher. CSCI 4061 Introduction to Operating Systems Programs CSCI 4061 Introduction to Operating Systems C Program Structure Libraries and header files Compiling and building programs Executing and debugging Instructor: Abhishek Chandra Assume familiarity

More information

An Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel

An Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel An Introduction to OpenACC Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel Chapter 1 Introduction OpenACC is a software accelerator that uses the host and the device. It uses compiler

More information

GDB Tutorial. A Walkthrough with Examples. CMSC Spring Last modified March 22, GDB Tutorial

GDB Tutorial. A Walkthrough with Examples. CMSC Spring Last modified March 22, GDB Tutorial A Walkthrough with Examples CMSC 212 - Spring 2009 Last modified March 22, 2009 What is gdb? GNU Debugger A debugger for several languages, including C and C++ It allows you to inspect what the program

More information

Array Initialization

Array Initialization Array Initialization Array declarations can specify initializations for the elements of the array: int primes[10] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 ; initializes primes[0] to 2, primes[1] to 3, primes[2]

More information

EECS 213 Introduction to Computer Systems Dinda, Spring Homework 3. Memory and Cache

EECS 213 Introduction to Computer Systems Dinda, Spring Homework 3. Memory and Cache Homework 3 Memory and Cache 1. Reorder the fields in this structure so that the structure will (a) consume the most space and (b) consume the least space on an IA32 machine on Linux. struct foo { double

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

Introduction to Supercomputing

Introduction to Supercomputing Introduction to Supercomputing TMA4280 Introduction to UNIX environment and tools 0.1 Getting started with the environment and the bash shell interpreter Desktop computers are usually operated from a graphical

More information

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008 This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. Please send any comment to dkirk@nvidia.com

More information

CCReflect has a few interesting features that are quite desirable for DigiPen game projects:

CCReflect has a few interesting features that are quite desirable for DigiPen game projects: CCReflect v1.0 User Manual Contents Introduction... 2 Features... 2 Dependencies... 2 Compiler Dependencies... 2 Glossary... 2 Type Registration... 3 POD Registration... 3 Non-Pod Registration... 3 External

More information

Programming Assignment #1: A Simple Shell

Programming Assignment #1: A Simple Shell Programming Assignment #1: A Simple Shell Due: Check My Courses In this assignment you are required to create a C program that implements a shell interface that accepts user commands and executes each

More information

CS140 Operating Systems Final December 12, 2007 OPEN BOOK, OPEN NOTES

CS140 Operating Systems Final December 12, 2007 OPEN BOOK, OPEN NOTES CS140 Operating Systems Final December 12, 2007 OPEN BOOK, OPEN NOTES Your name: SUNet ID: In accordance with both the letter and the spirit of the Stanford Honor Code, I did not cheat on this exam. Furthermore,

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

CSE 333 Midterm Exam Sample Solution 7/28/14

CSE 333 Midterm Exam Sample Solution 7/28/14 Question 1. (20 points) C programming. For this question implement a C function contains that returns 1 (true) if a given C string appears as a substring of another C string starting at a given position.

More information

Outline. Computer programming. Debugging. What is it. Debugging. Hints. Debugging

Outline. Computer programming. Debugging. What is it. Debugging. Hints. Debugging Outline Computer programming Debugging Hints Gathering evidence Common C errors "Education is a progressive discovery of our own ignorance." Will Durant T.U. Cluj-Napoca - Computer Programming - lecture

More information

CS510 Operating System Foundations. Jonathan Walpole

CS510 Operating System Foundations. Jonathan Walpole CS510 Operating System Foundations Jonathan Walpole Threads & Concurrency 2 Why Use Threads? Utilize multiple CPU s concurrently Low cost communication via shared memory Overlap computation and blocking

More information