Cover Page

The handle http://hdl.handle.net/1887/18622 holds various files of this Leiden University dissertation.

Author: Vu, Van Thieu
Title: Opportunities for performance optimization of applications through code generation
Issue Date:

Chapter 6

Automatic code generation of efficient CUDA programs

In Chapter 5 we presented the implementation of the dynamics routine of the HIRLAM weather forecast model on programmable NVIDIA GPUs. The results showed that the use of GPUs to accelerate a weather forecast model is very promising. The CUDA program used in Chapter 5 was created by hand from the original Fortran code. The handwritten CUDA code is difficult to optimize and complicated to maintain. Therefore, in this chapter we present our extension of CTADEL to generate CUDA code. We show a technique that generates an efficient CUDA program for a general problem. Then we apply this technique to generate CUDA code for the dynamics routine of the HIRLAM weather forecast model.

6.1 Motivation

Although the CUDA programming model is more convenient than the previous graphics programming APIs for developing GPU codes, the manual development of high-performance codes with the CUDA model is still more complicated than the use of parallel programming models such as OpenMP [59] for general-purpose multi-core systems [7, 28]. Therefore, it is attractive, both for programmer productivity and for software quality, to develop a technique that supports automatic generation of CUDA programs. Recently, this issue has been studied in several projects. Some of them constructed a tool that automatically generates CUDA code at runtime, such as Klockner [37] and Perryman [61]. Other researchers developed a code generation tool that translates to a CUDA program from other languages, such as from Fortran [26], C [7], or Java [85]. The above CUDA code generation tools have one common point, namely taking an existing program, implemented in a different language, as input.

We introduce a new method that automatically generates CUDA code from an input problem specification using the code generation tool CTADEL [18]. Originally CTADEL was designed to generate Fortran code. We extend CTADEL to generate C and CUDA instead of Fortran.

6.2 CUDA programming model

The architecture of GPU-based computing has been described in detail in Chapter 5. Below we give the structure of a CUDA program. A CUDA program consists of two parts. The code that executes on the GPU is called the kernel; the other part, which executes on the CPU, is called the host code. The kernel is executed by a set of threads in single program multiple data (SPMD) mode. These threads are organized into blocks and a grid. A block is a group of threads, and several blocks form a grid. Within a block, threads are organized in a 1-, 2-, or 3-dimensional structure. Blocks in a grid are organized in a 1- or 2-dimensional structure. The host code includes program I/O, invocations of data transfers between CPU and GPU, and launches of kernels. In detail, the host code consists of the following steps, illustrated by the sketch after this list:

- Allocate memory on the GPU;
- Copy input data from CPU to GPU;
- Define the block and grid structures;
- Invoke the GPU to start kernels;
- Copy output data from GPU to CPU;
- Deallocate memory on the GPU.
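To make the six steps concrete, the following is a minimal host-code sketch in the spirit of the generated programs. It is our illustration, not CTADEL output; the kernel scale and the array size are assumptions.

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *a, int n) {        /* illustrative kernel */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        size_t size = n * sizeof(float);
        float *h_a = (float*)malloc(size);
        for (int i = 0; i < n; i++) h_a[i] = 1.0f;

        float *d_a;
        cudaMalloc((void**)&d_a, size);                     /* 1. allocate GPU memory */
        cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice); /* 2. copy input CPU -> GPU */
        dim3 dimBlock(256, 1, 1);                           /* 3. define block and grid */
        dim3 dimGrid((n + 255) / 256, 1);
        scale<<<dimGrid, dimBlock>>>(d_a, n);               /* 4. launch the kernel */
        cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost); /* 5. copy output GPU -> CPU */
        cudaFree(d_a);                                      /* 6. deallocate GPU memory */

        printf("a[0] = %f\n", h_a[0]);
        free(h_a);
        return 0;
    }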

4 6.3. Specification in CTADEL 107 %TEMPLATE declaref: FORTRAN DECLARE $stmt(declaref(x, Xs, T)) ->(fortran % print Fortran code T," ",X,"(",Xs,")" ). %TEMPLATE forallf: FORTRAN DO-LOOP $stmt(forallf(s, I=L..U)) ->(fortran % print Fortran code "DO ",I,"=",L,",",U, $stmt(s), "ENDDO" ). (a) %TEMPLATE declarec: C DECLARE $stmt(declarec(x, Xs, T)) ->(C % print C code T," ",X,"[",Xs,"]",";" ). %TEMPLATE forallc: C FOR-LOOP $stmt(forallc(s, I=L..U)) ->(C % print C code "for(",i,"=",l,";", I,"<=",U,";",I,"++)", {",$stmt(s),"}" ). (b) Figure 6.1: Example templates to generate a declaration and a for-loop in Fortran (a) and C (b) Adapting CTADEL for C code generation The choice of the target language is implemented in the last stage of the code generation process in CTADEL. In this nal step, CTADEL produces the generated code based on predened templates. In these templates, the grammar of the target language is dened. As examples, Figure 6.1 (a) shows templates for the generation of a declaration (declaref ) and a for-loop (forallf ) in Fortran. In the template declaref, X denotes a variable which has the type T and Xs is the memory to be allocated. In the template forallf, L and U are the loop boundaries, I denotes the loop index, and S is the statement inside the loop. The change in order to generate C code is realized straightforward by the grammar modication, as in Figure 6.1 (b). In this gure, the grammar structures that convert to Fortran such as ( ) and DO-ENDDO now convert to [ ] and for { }, respectively. The modication to generate other structures of C are done in a similar way Generating the kernel code In the CUDA context, a kernel is a function that is executed in parallel by a number of threads. Similar to a conventional parallel program where each processor works on a small part of the data, each of the threads that execute the kernel is assigned to a portion of the data. We call the portion of data processed by each thread the thread domain. The generation of the kernel code involves splitting the computational domain into thread domains.

6.3.2 Generating the kernel code

In the CUDA context, a kernel is a function that is executed in parallel by a number of threads. Similar to a conventional parallel program, where each processor works on a small part of the data, each of the threads that execute the kernel is assigned a portion of the data. We call the portion of data processed by each thread the thread domain. The generation of the kernel code involves splitting the computational domain into thread domains.

Figure 6.2: Splitting of the computational domain into thread domains

Figure 6.2 illustrates the splitting of the computational domain. To determine the thread domain, we need the thread index, threadID. In CUDA, a thread in a block and a block in a grid are each given a unique index, threadIdx and blockIdx, respectively. From threadIdx and blockIdx, we can determine threadID as

    threadID.X = blockIdx.X * blockDim.X + threadIdx.X,    (6.1)

where X can be the x- or y-direction and blockDim.X is the size of the block in the X direction. Because a grid is only two-dimensional, threadID in the z-direction is determined as

    threadID.z = threadIdx.z.    (6.2)

Based on the thread index threadID, we can determine the thread domain. Consider the x-direction and suppose that each thread processes a number of grid points whose indexes vary from L_x to U_x; the thread domain is then determined as

    L_x = threadID.x * npoint_x + 1,
    U_x = threadID.x * npoint_x + npoint_x,    (6.3)

where npoint_x is the size of the thread domain in the x-direction. Equation (6.3) applies when all thread domains have the same size. If the computational domain is not exactly divisible by the thread domain, we assign the remainder to the first threads; we have discussed this issue in Subsection 1.2.1, Chapter 4. The thread domains in the y- and z-directions are determined in a similar way. A sketch of this index computation, including the remainder case, is given below.
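The following device-side sketch is our illustration (the function name and the domain size nx are assumptions; variable names follow the text). It computes the thread domain in the x-direction, assigning the remainder to the first threads when nx is not exactly divisible by npoint_x:

    __device__ void thread_domain_x(int nx, int npoint_x, int *Lx, int *Ux) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  /* Equation (6.1) */
        int rem = nx % npoint_x;                          /* leftover grid points */
        /* the first 'rem' threads process npoint_x+1 points, the rest npoint_x */
        if (tid < rem) {
            *Lx = tid * (npoint_x + 1) + 1;
            *Ux = *Lx + npoint_x;          /* npoint_x + 1 points */
        } else {
            *Lx = rem * (npoint_x + 1) + (tid - rem) * npoint_x + 1;
            *Ux = *Lx + npoint_x - 1;      /* npoint_x points */
        }
    }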

Figure 6.3 is an example of a kernel code. For simplicity, we show only the code in the x-direction. The kernel code consists of two parts: determining the thread domain and performing the calculations on the thread domain. The calculation part is similar to that of a C program; we have presented how to generate a C program in Subsection 6.3.1.

    __global__ void kernel() {
      % Determine the thread domain:
      int threadID.x = blockIdx.x*blockDim.x+threadIdx.x;
      int Lx = threadID.x*npoint_x+1;
      int Ux = threadID.x*npoint_x+npoint_x;
      % Perform the calculation:
      for(i=Lx; i<=Ux; i++){ ... }
    }

Figure 6.3: Example kernel code

To generate code for the determination of the thread domain, we define the template

    $splitdomain(npoint,X) ->(cuda % print CUDA code
      "int threadID.",X,"=blockIdx.",X,"*blockDim.",X,"+threadIdx.",X,";",
      "int L",X,"=threadID.",X,"*npoint_",X,"+1",";",
      "int U",X,"=threadID.",X,"*npoint_",X,"+npoint_",X,";"
    ).

The code generated by the template splitdomain is exactly the code in the first part of the kernel in Figure 6.3. The variables in the kernel in Figure 6.3 are allocated in the register file, or in local memory if the register file is full. Variables inside the kernel can also be allocated in shared memory. Shared memory is allocated using the __shared__ declaration specifier, e.g., as:

    __shared__ int Lx, Ux;
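The kernel of Figure 6.3 is schematic: threadID.x is not a legal C identifier and the % comments follow the template notation. A directly compilable variant (our sketch; the array a, its length nx, and the placeholder computation are assumptions) could look as follows:

    __global__ void kernel(float *a, int nx, int npoint_x) {
        /* Determine the thread domain, Equations (6.1) and (6.3): */
        int threadID_x = blockIdx.x * blockDim.x + threadIdx.x;
        int Lx = threadID_x * npoint_x + 1;
        int Ux = threadID_x * npoint_x + npoint_x;
        /* Perform the calculation on the thread domain (1-based indexes): */
        for (int i = Lx; i <= Ux && i <= nx; i++)
            a[i - 1] = 2.0f * a[i - 1];   /* placeholder computation */
    }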

6.3.3 Generating the host code

The host code of a CUDA program consists of allocating and deallocating memory on the GPU, transferring data between CPU and GPU, defining the block and grid, and invoking kernels on the GPU. The synchronization between two steps is obtained by the fact that an operation is started only after the previous operation has finished. In the following paragraphs we show examples of the code for each step and how they are generated by CTADEL.

Allocate and deallocate memory on the GPU

In CUDA, global memory on the GPU is allocated using the cudaMalloc() primitive and deallocated using the cudaFree() primitive. Example codes for allocating and deallocating a variable S, which has type float and size Size, are given by

    % ALLOCATE GPU MEMORY
    float *S;
    cudaMalloc((void**)&S,Size);

    % DEALLOCATE GPU MEMORY
    cudaFree(S);

To generate code for allocating and deallocating global memory on the GPU, we define the templates

    % TEMPLATE TO GENERATE CODE FOR ALLOCATING MEMORY
    allocategpu(Type,Var,Size) ->(cuda % print CUDA code
      Type," *",Var,";",
      "cudaMalloc((void**)","&",Var,",",Size,");"
    ).

    % TEMPLATE TO GENERATE CODE FOR DEALLOCATING MEMORY
    deallocategpu(Type,Var,Size) ->(cuda % print CUDA code
      "cudaFree(",Var,");"
    ).

Transfer data between CPU and GPU

Data transfer between CPU and GPU is performed using cudaMemcpy(). Examples of transferring data are given by

    % TRANSFER DATA FROM CPU TO GPU
    cudaMemcpy(Mem_dest,Mem_source,size_of_data,cudaMemcpyHostToDevice);

    % TRANSFER DATA FROM GPU TO CPU
    cudaMemcpy(Mem_dest,Mem_source,size_of_data,cudaMemcpyDeviceToHost);

The data transfer command has four parameters: the destination memory address (Mem_dest), the source memory address (Mem_source), the size of the data to be transferred (size_of_data), and the direction of the transfer, which is specified by cudaMemcpyHostToDevice for copying from CPU to GPU and cudaMemcpyDeviceToHost for copying from GPU to CPU. The code for transferring data is generated by the following templates

    % TEMPLATE TO GENERATE CODE FOR TRANSFERRING DATA FROM CPU TO GPU
    copycpu2gpu(Mem_dest,Mem_source,Size) ->(cuda % print CUDA code
      "cudaMemcpy(",Mem_dest,",",Mem_source,",",Size,",cudaMemcpyHostToDevice);"
    ).

    % TEMPLATE TO GENERATE CODE FOR TRANSFERRING DATA FROM GPU TO CPU
    copygpu2cpu(Mem_dest,Mem_source,Size) ->(cuda % print CUDA code
      "cudaMemcpy(",Mem_dest,",",Mem_source,",",Size,",cudaMemcpyDeviceToHost);"
    ).

Define the block and grid

In CUDA, threads are organized into blocks and a grid. The block and grid are defined by the dimBlock and dimGrid primitives, respectively, as

    % DEFINE THE BLOCK AND GRID
    dim3 dimBlock(blockX,blockY,blockZ);
    dim3 dimGrid(gridX,gridY);

In the above code, (blockX,blockY,blockZ) are the block sizes and (gridX,gridY) are the grid sizes. The template that generates code for defining the block and grid reads as

    % TEMPLATE TO GENERATE CODE FOR DEFINING THE BLOCK AND GRID
    define_thread(BlockX,BlockY,BlockZ,GridX,GridY) ->(cuda % print CUDA code
      "dim3 ","dimBlock(",BlockX,",",BlockY,",",BlockZ,")",";",
      "dim3 ","dimGrid(",GridX,",",GridY,")",";"
    ).

Invoke the kernels

A kernel is invoked by specifying the name of the kernel followed by the <<<dimGrid,dimBlock>>> construct, as

    % INVOKE KERNEL
    kernel<<<dimGrid,dimBlock>>>();

To generate code for invoking the kernel, we define the template

    % TEMPLATE TO GENERATE CODE FOR INVOKING KERNEL
    invoke_kernel(Kernel_name) ->(cuda % print CUDA code
      Kernel_name,"<<<dimGrid,dimBlock>>>();"
    ).
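To connect the block and grid definition with the thread domains of Subsection 6.3.2: the grid must supply one thread per thread domain, which leads to ceiling divisions of the following form. This is our sketch (nx, npoint_x, the block size, and the kernel arguments are assumptions), not CTADEL output:

    int threads_x = (nx + npoint_x - 1) / npoint_x;  /* one thread per thread domain */
    int blockX = 256;                                /* chosen block size in x */
    int gridX = (threads_x + blockX - 1) / blockX;   /* enough blocks to cover all threads */
    dim3 dimBlock(blockX, 1, 1);
    dim3 dimGrid(gridX, 1);
    kernel<<<dimGrid, dimBlock>>>();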

6.3.4 Generating the optimal CUDA stream code

In Chapter 5, we showed that a particular choice of multiple streams to overlap the kernel calculations with the data transfers increases the performance of the CUDA implementation of the DYN routine of the HIRLAM weather forecast model by up to 36%. In this subsection we present a technique that generates the optimal CUDA stream program, i.e., the program with the maximal overlap between kernel calculations and data transfers. The technique described here is applicable to general problems.

Suppose that we need to execute a number of kernels on the GPU. Each kernel produces one or more output results from a number of input values. Because the CPU and the GPU cannot access each other's memory, we need to transfer input values from the CPU to the GPU before the calculation and output results from the GPU to the CPU after the calculation. To reduce the data transfer time, we only transfer inputs and outputs of the program while keeping intermediate values on the GPU. The calculation in a kernel may require an output value of another kernel as its input. Therefore the execution of the kernels has to be arranged in an appropriate order, given by the dependency graph. This dependency graph is a directed acyclic graph (DAG) that represents the dependencies of the kernel calculations on their inputs. In the dependency graph, an invocation of a kernel is called a node. A node takes a set of input values and/or output data of other nodes and uses them to create one or a set of output values. Figure 6.4 is an example of a dependency graph. In this figure a node is denoted by an oval, an input/output value is denoted by a rectangle, and an arrow represents the dependence of a node on its input. In this example, the calculation of kernel A needs data from inputs 1 and 2, the calculation of kernel B needs data from inputs 2, 4 and 5, and the calculation of kernel C needs data from the outputs of kernels A and B and from inputs 3 and 4. The output of the total process consists of outputs 1 and 2 of kernels A and C, respectively. Based on the dependency graph, we can generate the optimal CUDA stream program through four steps, as follows:

1. Generate the calculation streams. A kernel has to be invoked after another kernel if its calculation uses the output of that kernel. Therefore, from the dependency graph we can determine the invocation order of the kernels. We group the kernels into a list that reflects their invocation order: the first kernel in the list is invoked first, then the second kernel, and so on. We call this list of kernels the calculation stream. In the example of Figure 6.4, the calculation of kernel A does not depend on the output of any other kernel; hence, kernel A can be invoked first. Similarly, we can also invoke kernel B first. The calculation of kernel C needs the output of kernels A and B. Therefore, kernel C has to be invoked after kernels A and B.

Figure 6.4: An example data dependency graph

As a result, we can create two calculation streams: [kernel A, kernel B, kernel C] and [kernel B, kernel A, kernel C].

2. Generate the CUDA stream codes. The CUDA stream code is generated by adding the input/output transfers required by each kernel to the calculation stream. To obtain overlap, we split the transfers of inputs/outputs and the calculations of kernels over two CUDA streams, so that one stream executes a kernel while the other stream is transferring data. To avoid confusion, we note that a CUDA stream code is the host code of a CUDA program, consisting of invocations of kernel executions and data transfers. In terms of the host code, these invocations are issued serially; but in terms of the CUDA stream code, they are organized into two CUDA streams that execute in parallel. To describe the generation of the CUDA stream code, we will use the terms previous kernel and next kernel for the kernels that stand directly in front of and behind the current kernel, respectively, and the term later kernel for any of the kernels that stand behind the current kernel in the calculation stream.

The CUDA stream code is created based on three principles. First, the inputs have to be present on the GPU before they can be referred to; therefore, we add the input transfers of the next kernel to overlap with the calculation of the current kernel. Second, the outputs can only be transferred back after their actual calculation; hence, we add the output transfers of the current kernel to overlap with the calculation of the next kernel. Third, if the calculation of a kernel in a CUDA stream refers to inputs or values that are transferred or calculated in a different CUDA stream, a synchronization is needed before the invocation of this kernel.

Similarly, a synchronization is needed before the transfer of an output that is calculated in a different CUDA stream. The approach to generate the CUDA stream code is as follows:

- Transfer the inputs of the first kernel in the calculation stream;
- For all kernels in the calculation stream:
  - Synchronize if necessary;
  - Transfer the inputs of the next kernel and the outputs of the previous kernel in one CUDA stream;
  - Execute the kernel in the other CUDA stream;
- Transfer the outputs of the last kernel.

Table 6.1 shows an example CUDA stream code that is generated from the calculation stream [kernel A, kernel B, kernel C] and the dependency graph in Figure 6.4. First, inputs 1 and 2 of kernel A are transferred in CUDA stream 1. Next, the calculation of kernel A is overlapped with the transfers of inputs 4 and 5 of kernel B, and the calculation of kernel B is overlapped with the transfers of input 3 of kernel C and output 1 of kernel A. Kernel B, invoked in CUDA stream 2, uses input 2, which is transferred in CUDA stream 1; hence a synchronization is added before the invocation of kernel B. Similarly, since kernel C, invoked in CUDA stream 1, uses the output of kernel B, which is invoked in CUDA stream 2, a synchronization is needed before the invocation of kernel C. Finally, output 2 of kernel C is transferred.

Table 6.1: Example CUDA stream code generated from the calculation stream [kernel A, kernel B, kernel C]. Overlap of a kernel calculation with input/output transfers is indicated by placing them in the same row.

Code line | CUDA stream 1                      | CUDA stream 2
1         | transfer input 1, transfer input 2 |
2         | kernel A                           | transfer input 4
3         | synchronization                    | transfer input 5
4         | transfer input 3, transfer output 1 | kernel B
5         | synchronization                    |
6         | kernel C                           |
7         | transfer output 2                  |

In the above CUDA stream code generation method, it can happen that the execution time of a kernel is much larger than the time of the input/output transfers that overlap with it, whereas in another overlap phase the transfer time is larger than the kernel execution time. In that case, we do not achieve the maximum benefit of overlapping; hence, we call this the simple overlap approach. To achieve more efficient overlapping, we adapt the simple overlap approach by balancing the data transfer time against the kernel execution time: we add transfers to overlap with an expensive kernel, and remove transfers if the kernel execution time is smaller than the transfer time. We call this the efficient overlap approach. The CUDA stream code is generated by the efficient overlap approach as follows:

- Transfer the inputs of the first kernel in the calculation stream;
- For all kernels:
  - Synchronize if necessary;
  - Transfer the inputs of the next kernel in a CUDA stream;
  - While the input transfer time is smaller than the kernel execution time, transfer one input of a later kernel, or one calculated output if all inputs have been transferred;
  - Execute the kernel in the other CUDA stream;
- Transfer the outputs of the last kernel.

The kernel execution times and the input/output transfer times can be extrapolated from the number of operations executed by each kernel and the size of the data to be transferred, respectively, or can be determined empirically.

3. Derive the theoretical time of the CUDA stream codes. The theoretical time of a CUDA stream code is the aggregate of the transfer and execution times in the case of no overlap. Examples are the theoretical time of row 1 in Table 6.1, which is equal to the total transfer time of inputs 1 and 2, and the theoretical time of row 6, which is the execution time of kernel C. If there is overlap between a kernel execution and data transfers, then the theoretical time is equal to the maximum of those two times. For example, the theoretical time of row 2 in Table 6.1 is the maximum of the kernel A execution time and the total transfer time of inputs 4 and 5.

4. Generate the optimal CUDA stream program. In general, from a dependency graph we can generate many calculation streams, corresponding to many generated CUDA stream codes with different theoretical times. In the final step, we choose the CUDA stream code that has the smallest theoretical time to generate the optimal CUDA stream program.

6.4 Generating the optimal CUDA stream program for the DYN routine

In this section we present how to generate the optimal CUDA stream program for the dynamics routine (DYN) of the HIRLAM weather forecast model [29].

6.4.1 The DYN routine

The DYN routine of the HIRLAM weather forecast model solves the so-called primitive equations, which describe the conservation of horizontal momentum, energy and mass (air and water), and the ideal gas law, on the grid points. Specifically, the DYN routine calculates the tendencies of surface pressure ps_t, humidity q_t, temperature T_t, and wind components (u_t, v_t) from the inputs: the vertical hybrid coordinates (A, B), the Coriolis effect f, the metric coefficients (hxu, hxv, hxt, hyu, hyv, hyt), the surface geopotential phis, the surface pressure ps, the humidity q, the temperature T, and the wind (u, v). The dependency graph of the DYN routine is shown in Figure 6.5. In this graph, p, lnp, etap, E, Z, and phi are user-defined variables; t25, t60, and t64 are temporary variables introduced by common subexpression elimination.

To derive the theoretical time of the CUDA stream code, we need information about the time to execute the kernels and the time to transfer the input/output data. The kernel execution times and the input/output transfer times depend on the computational domain. We measure these times for the domain of 512x192x64 grid points, the base domain that we used in Chapter 5. The results are shown in Tables 6.2 and 6.3.

Table 6.2: Execution time of the kernels (microseconds) for the domain of 512x192x64 grid points

Kernel | p | etap | lnp | E | Z | phi | ps_t | t60 | t64 | t25 | q_t | T_t | u_t | v_t
Time   |   |      |     |   |   |     |      |     |     |     |     |     |     |

Figure 6.5: Dependency graph of the DYN routine. The inputs and outputs are denoted by rectangles. The calculated variables are denoted by ovals. An arrow shows the dependence of a kernel calculation on an input or on a value calculated by another kernel, or the dependence of an output on a value calculated by a kernel.

Table 6.3: Transfer time of the input/output data (microseconds) for the domain of 512x192x64 grid points

Variable                                              | Size   | Copy time
1D inputs (A, B)                                      | 256 B  | 18
2D inputs (ps, f, phis, hxu, hxv, hxt, hyu, hyv, hyt) | 384 KB | 85
3D inputs (q, T, u, v)                                | 24 MB  | 4400
2D outputs (ps_t)                                     | 384 KB | 85
3D outputs (q_t, T_t, u_t, v_t)                       | 24 MB  | 4130

Table 6.3 shows that for the 3D data, transferring the input from the CPU to the GPU is more expensive than transferring the output from the GPU to the CPU. This result is also observed in other studies such as [2, 19].

6.4.2 The simple and efficient overlap approaches

In Subsection 6.3.4, we proposed two methods to generate the CUDA stream code, namely the simple and the efficient overlap approach. In this subsection we first show how the CUDA stream code is generated from a calculation stream by the simple and the efficient overlap approach, and how the theoretical time of a CUDA stream code is derived. Next, we assess the performance of the two CUDA stream code generation approaches.

As an example, Tables 6.4 and 6.5 show the CUDA stream codes generated from the calculation stream [p, t60, t64, ps_t, lnp, etap, E, Z, q_t, t25, phi, T_t, u_t, v_t], which is the calculation order of the CUDA stream code that we presented in Chapter 5, by the simple and the efficient overlap approach, respectively. Common to the two approaches is that the inputs of the first kernel p (A, B, ps) are transferred before the first kernel starts its calculation, and the output of the last kernel v_t is transferred back after its calculation has finished. In addition, the input transfers of the next kernel are overlapped with the calculation of the current kernel. For example, in both Tables 6.4 and 6.5, to calculate t60 (line 3) we need the input u; therefore, u is transferred at line 2. The difference is that with the efficient overlap approach, if the total transfer time in an overlapped phase is smaller than the kernel calculation time, we transfer more data until the total transfer time approximates the calculation time. For example, to calculate ps_t (line 5 of Tables 6.4 and 6.5), we need the inputs hyu, hxt, and hyt. With the simple overlap approach, the transfers of these inputs are overlapped with the calculation of t64 at line 4 of Table 6.4. However, the total transfer time of the inputs hyu, hxt, and hyt, which is 255 µs, is smaller than the calculation time of t64 (666 µs). Therefore, in the efficient overlap approach we add the transfers of hxv, hyv, hxu, and phis to overlap with the calculation of t64 (see Table 6.5).

The last column of Tables 6.4 and 6.5 shows the theoretical time of the CUDA stream code. The theoretical time of the CUDA stream code is the total time of all lines in the CUDA stream code. The time of a line is the transfer or calculation time in the case of no overlap, or the maximum of the transfer and calculation times if there is overlap between calculation and transfer. For example, in Table 6.4, the transfers of A, B, and ps (line 1) are not overlapped with any calculation; hence the time of line 1 is 121 µs, which is the time to transfer A, B, and ps. In line 2, the calculation of p is overlapped with the transfer of u. The transfer time of u, which is 4400 µs, is larger than the calculation time of p (230 µs). Therefore the time of line 2 is 4400 µs. Note that at this stage we cannot measure the synchronization time exactly; we do not include it in the theoretical time. In Subsection 6.4.4, we will investigate whether the synchronization time is significant for the theoretical time.
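Before turning to the measured tables, it may help to see how the rows of such a CUDA stream code map onto the CUDA API. The sketch below is our illustration, not CTADEL output: the kernels A and B, the buffers, and their sizes are assumptions, and the host buffers are assumed to be page-locked (allocated with cudaMallocHost) so that cudaMemcpyAsync can overlap with kernel execution.

    #include <cuda_runtime.h>

    __global__ void A(const float *in1, const float *in2, float *outA) { /* ... */ }
    __global__ void B(const float *in2, const float *in4, float *outB) { /* ... */ }

    void stream_code(float *h_in1, float *h_in2, float *h_in4, float *h_outA,
                     float *d_in1, float *d_in2, float *d_in4,
                     float *d_outA, float *d_outB,
                     size_t n1, size_t n2, size_t n4, size_t nA,
                     dim3 dimGrid, dim3 dimBlock)
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        /* line 1: transfer the inputs of the first kernel in stream 1 */
        cudaMemcpyAsync(d_in1, h_in1, n1, cudaMemcpyHostToDevice, s1);
        cudaMemcpyAsync(d_in2, h_in2, n2, cudaMemcpyHostToDevice, s1);

        /* line 2: kernel A in stream 1 overlaps the transfer of input 4 in stream 2 */
        A<<<dimGrid, dimBlock, 0, s1>>>(d_in1, d_in2, d_outA);
        cudaMemcpyAsync(d_in4, h_in4, n4, cudaMemcpyHostToDevice, s2);

        /* lines 3-4: kernel B (stream 2) uses input 2 transferred in stream 1,
           so synchronize on stream 1 first; the output of A is copied back in
           stream 1 while B executes in stream 2 */
        cudaStreamSynchronize(s1);
        B<<<dimGrid, dimBlock, 0, s2>>>(d_in2, d_in4, d_outB);
        cudaMemcpyAsync(h_outA, d_outA, nA, cudaMemcpyDeviceToHost, s1);

        cudaDeviceSynchronize();   /* wait for both streams before using results */
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }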

Table 6.4: The CUDA stream code and theoretical time obtained by the simple overlap approach. Overlap of a kernel calculation with input/output transfers is indicated by placing them in the same row.

Line | CUDA stream 1                           | CUDA stream 2                              | Time (µs)
1    | Transfer(A), Transfer(B), Transfer(ps)  |                                            | 121
2    | Calculation(p)                          | Transfer(u)                                | 4400
3    | Transfer(v)                             | Calculation(t60)                           | 4400
4    | Calculation(t64)                        | Transfer(hyu), Transfer(hxt), Transfer(hyt) | 666
5    | Transfer(hxv)                           | Calculation(ps_t)                          | 85
6    | Calculation(lnp)                        | Transfer(hyv), Transfer(hxu)               |
7    |                                         | Calculation(etap)                          |
8    | Calculation(E)                          | Transfer(f), Transfer(q)                   |
9    |                                         | Calculation(Z)                             |
10   | Calculation(q_t)                        | Transfer(T), Transfer(phis)                |
11   |                                         | Calculation(t25)                           |
12   | Calculation(phi)                        | Transfer(ps_t), Transfer(q_t)              |
13   |                                         | Calculation(T_t)                           |
14   | Calculation(u_t)                        | Transfer(T_t)                              |
15   | Transfer(u_t)                           | Calculation(v_t)                           |
16   | Transfer(v_t)                           |                                            | 4130
Total theoretical time                                                                      | 46451

Table 6.5: The CUDA stream code and theoretical time obtained by the efficient overlap approach

Line | CUDA stream 1                           | CUDA stream 2                              | Time (µs)
1    | Transfer(A), Transfer(B), Transfer(ps)  |                                            | 121
2    | Calculation(p)                          | Transfer(u)                                | 4400
3    | Transfer(v)                             | Calculation(t60)                           | 4400
4    | Calculation(t64)                        | Transfer(hxv), Transfer(hyv), Transfer(hxu), Transfer(hyu), Transfer(hxt), Transfer(hyt), Transfer(phis) | 666
5    |                                         | Calculation(ps_t)                          | 54
6    | Calculation(lnp)                        | Transfer(f)                                |
7    |                                         | Calculation(etap)                          |
8    | Calculation(E)                          | Transfer(q)                                |
9    |                                         | Calculation(Z)                             |
10   | Calculation(q_t)                        | Transfer(T)                                |
11   |                                         | Calculation(t25)                           |
12   | Calculation(phi)                        |                                            | 1044
13   | Transfer(ps_t), Transfer(q_t)           | Calculation(T_t)                           |
14   | Calculation(u_t)                        | Transfer(T_t)                              |
15   | Transfer(u_t)                           | Calculation(v_t)                           |
16   | Transfer(v_t)                           |                                            | 4130
Total theoretical time                                                                      | 46420
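The per-line times in Tables 6.4 and 6.5 follow the rule of step 3 in Subsection 6.3.4; a small helper makes the rule explicit. This is our sketch, not part of CTADEL:

    /* Theoretical time of one line of a CUDA stream code: the transfers and the
       kernel in the other stream overlap, so the line costs the maximum of the
       two; without overlap it is just the time of what is present. Times in µs. */
    double line_time(double calc_us, const double *xfer_us, int nxfer) {
        double transfer_us = 0.0;
        for (int i = 0; i < nxfer; i++)
            transfer_us += xfer_us[i];   /* total transfer time of the line */
        return calc_us > transfer_us ? calc_us : transfer_us;
    }

    /* Example: line 2 of Table 6.4 overlaps Calculation(p) = 230 µs with
       Transfer(u) = 4400 µs, so line_time(230.0, &u, 1) with u = 4400.0
       yields 4400.0, as in the table. */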

To evaluate the performance of the two CUDA stream code generation approaches, we derive the theoretical times of the CUDA stream codes generated from different calculation streams. From the dependency graph we can generate thousands of calculation streams, and it would take a lot of time to generate CUDA stream codes for all of them. Hence, we assess the two approaches on the calculation streams listed in Table 6.6. These calculation streams were created by hand. In order to obtain some representativity, we choose different kernels for each node: for example, the first node can be the kernel p or E, and the second node can be the kernel lnp, t60, t25, or Z.

Table 6.6: Theoretical time (µs) of the CUDA stream codes obtained by the simple and the efficient overlap approach

Calculation stream                                                 | Simple overlap | Efficient overlap
[p, lnp, t60, t64, ps_t, etap, q_t, T_t, E, Z, t25, phi, u_t, v_t] |                |
[p, t60, t64, ps_t, etap, q_t, lnp, T_t, E, Z, t25, phi, u_t, v_t] |                |
[p, t60, t64, lnp, ps_t, etap, E, Z, q_t, t25, phi, T_t, u_t, v_t] |                |
[p, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, E, Z, u_t, v_t] |                |
[p, E, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, Z, u_t, v_t] |                |
[p, E, t60, t64, lnp, ps_t, etap, Z, q_t, t25, phi, T_t, u_t, v_t] |                |
[E, p, t60, t64, ps_t, etap, q_t, lnp, T_t, Z, t25, phi, u_t, v_t] |                |
[E, p, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, Z, u_t, v_t] |                |
[E, p, Z, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, u_t, v_t] |                |
[E, p, Z, t60, t64, ps_t, etap, q_t, lnp, T_t, t25, phi, u_t, v_t] |                |

We observe that the CUDA stream codes generated by the efficient overlap approach are more efficient than those generated by the simple overlap approach. Therefore, in the next subsection we apply the efficient overlap approach to generate the optimal CUDA stream program.

6.4.3 Generating the optimal CUDA stream program

As presented in Subsection 6.3.4, the optimal CUDA stream program is generated through the following steps. First, based on the dependency graph, CTADEL generates the calculation streams. Figure 6.6 shows, as an example, how the first three kernels of the calculation streams are generated.

Figure 6.6: The calculation streams generated for the first three kernels. An arrow shows which kernel can be invoked after the calculation of another kernel.

Table 6.7: The optimal CUDA stream code

CUDA stream 1                                              | CUDA stream 2                 | Time (µs)
Transfer(A,B,ps)                                           |                               | 121
Calculation(p)                                             | Transfer(hxv), Transfer(hyv)  | 230
Transfer(v)                                                | Calculation(lnp)              | 4400
Calculation(t64)                                           | Transfer(u)                   | 4400
Transfer(hxu), Transfer(hyu), Transfer(hxt), Transfer(hyt) | Calculation(t60)              | 359
                                                           | Calculation(ps_t)             | 54
Transfer(f), Transfer(phis)                                | Calculation(etap)             | 1044
Calculation(Z)                                             | Transfer(q)                   | 4400
Transfer(T)                                                | Calculation(q_t)              | 4641
Calculation(T_t)                                           | Transfer(q_t), Transfer(ps_t) |
Calculation(E)                                             |                               | 791
Calculation(t25)                                           |                               | 786
Calculation(phi)                                           |                               | 1044
Calculation(u_t)                                           | Transfer(T_t)                 | 4295
Transfer(u_t)                                              | Calculation(v_t)              | 4290
Transfer(v_t)                                              |                               | 4130
Total theoretical time                                                                     | 45905

At first, either the kernel p or E can be chosen, because their calculations do not use the output of any other kernel. Next, if we choose p as the first kernel, the kernels lnp, Z, t25, t60, t64, and E can be selected as the second kernel, because the calculations of these kernels only depend on the calculation of p. Similarly, if we choose lnp as the second kernel, the kernels E, Z, t25, t60, and t64 can be chosen as the third kernel.

Second, from the calculation streams, CTADEL generates the CUDA stream codes by adding the input/output transfers, applying the efficient overlap approach. An example of a CUDA stream code obtained by the efficient overlap approach is presented in Table 6.5. Next, CTADEL derives the theoretical times of all generated CUDA stream codes. Finally, CTADEL chooses the CUDA stream code with the smallest theoretical time for the generation of the optimal CUDA stream program. Table 6.7 shows the CUDA stream code with the smallest theoretical time, 45905 µs.

6.4.4 Experiments

The experiments are performed on the system that we used in Chapter 5. The codes for the experiments are the CUDA implementations of the DYN routine generated by CTADEL. We compare the performance of the generated CUDA code with the handwritten code and with the C program. We verify the correctness of the generated CUDA programs by comparing the calculated results of DYN from the generated programs with those from the handwritten code. This comparison shows that the generated and handwritten programs reproduce bit-wise identical output.

The domain and thread structure are chosen the same as in Chapter 5: the computational domain consists of 512x192x64 grid points, the thread domain contains 1x1x32 grid points, and the block has 256x1x32 threads. The elapsed times are shown in Table 6.8. The elapsed time of the C and CUDA codes is the total execution time as seen from the CPU. For the C code, it is the calculation time on the CPU. For the CUDA code, it consists of the calculation time on the GPU and the time to transfer data between CPU and GPU. Because we did not include the synchronization time in deriving the theoretical time of the CUDA stream code, the real run times of the generated CUDA codes are larger than the theoretical times, as can be seen from Table 6.8. The small difference between the real run time and the theoretical time indicates that the synchronization time is small and can safely be neglected when deriving the theoretical time of the CUDA stream code.

Table 6.8: The theoretical and run times (in ms) of the CUDA stream programs on a GTX 480 GPU and, for reference, of the C code on an Intel i7-940 CPU

Code                               | Theoretical time | Run time
C code                             |                  | 2610
CUDA: hand code (Chapter 5)        |                  | 46.9
CUDA: stream code (Table 6.4)      | 46.5             |
CUDA: stream code (Table 6.5)      | 46.4             |
CUDA: optimal stream code          | 45.9             |

The optimal CUDA stream program generated by CTADEL is slightly faster than the handwritten code. This demonstrates that, although we applied many optimizations, the handwritten code is not optimal. This result also shows that CTADEL can generate efficient CUDA code that is comparable with optimized handwritten code. In summary, we obtain a speedup of 57 for the CUDA program over the C code.

6.5 Conclusion

We have presented how to extend CTADEL to generate CUDA programs. To overlap kernel calculations with data transfers, we introduced two approaches to generate the optimal CUDA stream program. We applied this technique to the DYN routine of the HIRLAM weather forecast model. The experimental results showed that the real run times approximate the theoretical times. This indicates that the approach we use to derive the theoretical time of the CUDA stream program is reliable. The generated code is more efficient than the handwritten code; the difference between the optimal generated code and the handwritten code is small (2%) because of the many optimizations that we had already applied to the handwritten code. The main advantages of code generation here are hence the ease of obtaining efficient code, of maintaining its efficiency across platforms, and of implementing new problems and algorithms; in other situations it may also take away the burden of manual code optimization.


More information

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications

More information

Pinned-Memory. Table of Contents. Streams Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Stream. Pinned-memory.

Pinned-Memory. Table of Contents. Streams Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Stream. Pinned-memory. Table of Contents Streams Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain miguel.cardenas@ciemat.es

More information

CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci

CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci TECHNISCHE UNIVERSITÄT WIEN Fakultät für Informatik Cyber-Physical Systems Group CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci Outline of CUDA Basics Basic Kernels and Execution on GPU

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing GPGPU general-purpose

More information

EEM528 GPU COMPUTING

EEM528 GPU COMPUTING EEM528 CS 193G GPU COMPUTING Lecture 2: GPU History & CUDA Programming Basics Slides Credit: Jared Hoberock & David Tarjan CS 193G History of GPUs Graphics in a Nutshell Make great images intricate shapes

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Memory spaces and memory access Shared memory Examples Lecture questions: 1. Suggest two significant

More information

Zero-copy. Table of Contents. Multi-GPU Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Zero-copy. Multigpu.

Zero-copy. Table of Contents. Multi-GPU Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Zero-copy. Multigpu. Table of Contents Multi-GPU Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes 2 Zero-copy Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain

More information

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh.

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh. GPGPU Alan Gray/James Perry EPCC The University of Edinburgh a.gray@ed.ac.uk Contents Introduction GPU Technology Programming GPUs GPU Performance Optimisation 2 Introduction 3 Introduction Central Processing

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Josef Pelikán, Jan Horáček CGG MFF UK Praha GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing

More information

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Cartoon parallel architectures; CPUs and GPUs

Cartoon parallel architectures; CPUs and GPUs Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD

More information

CUDA C/C++ BASICS. NVIDIA Corporation

CUDA C/C++ BASICS. NVIDIA Corporation CUDA C/C++ BASICS NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on industry-standard C/C++ Small set of extensions

More information

CUDA Parallel Programming Model. Scalable Parallel Programming with CUDA

CUDA Parallel Programming Model. Scalable Parallel Programming with CUDA CUDA Parallel Programming Model Scalable Parallel Programming with CUDA Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of

More information

CS 179: GPU Computing. Lecture 2: The Basics

CS 179: GPU Computing. Lecture 2: The Basics CS 179: GPU Computing Lecture 2: The Basics Recap Can use GPU to solve highly parallelizable problems Performance benefits vs. CPU Straightforward extension to C language Disclaimer Goal for Week 1: Fast-paced

More information

Data Parallel Execution Model

Data Parallel Execution Model CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling

More information

CUDA Parallel Programming Model Michael Garland

CUDA Parallel Programming Model Michael Garland CUDA Parallel Programming Model Michael Garland NVIDIA Research Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of a parallel

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today

More information

Computational Fluid Dynamics (CFD) using Graphics Processing Units

Computational Fluid Dynamics (CFD) using Graphics Processing Units Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores

More information

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal

More information

Matrix Multiplication in CUDA. A case study

Matrix Multiplication in CUDA. A case study Matrix Multiplication in CUDA A case study 1 Matrix Multiplication: A Case Study Matrix multiplication illustrates many of the basic features of memory and thread management in CUDA Usage of thread/block

More information

CS179 GPU Programming: CUDA Memory. Lecture originally by Luke Durant and Tamas Szalay

CS179 GPU Programming: CUDA Memory. Lecture originally by Luke Durant and Tamas Szalay : CUDA Memory Lecture originally by Luke Durant and Tamas Szalay CUDA Memory Review of Memory Spaces Memory syntax Constant Memory Allocation Issues Global Memory Gotchas Shared Memory Gotchas Texture

More information

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan CUDA Workshop High Performance GPU computing EXEBIT- 2014 Karthikeyan CPU vs GPU CPU Very fast, serial, Low Latency GPU Slow, massively parallel, High Throughput Play Demonstration Compute Unified Device

More information

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics 1 GPU Computing Workshop CSU 2013 Getting Started Garland Durham Quantos Analytics nvidia-smi 2 At command line, run command nvidia-smi to get/set GPU properties. nvidia-smi Options: -q query -L list attached

More information