Cover Page. The handle holds various files of this Leiden University dissertation. Author: Vu, Van Thieu. Title: Opportunities for performance optimization of applications through code generation.
Chapter 6

Automatic code generation of efficient CUDA programs

In Chapter 5 we have presented the implementation of the dynamics routine of the HIRLAM weather forecast model on programmable NVIDIA GPUs. The results showed that the use of GPUs to accelerate a weather forecast model is very promising. The CUDA program used in Chapter 5 was created by hand from the original Fortran code. The handwritten CUDA code is difficult to optimize and complicated to maintain. Therefore, in this chapter we present our extension to CTADEL to be able to generate CUDA code. We show a technique that generates an efficient CUDA program for a general problem. Then we apply this technique to generate CUDA code for the dynamics routine of the HIRLAM weather forecast model.

6.1 Motivation

Although the CUDA programming model is more convenient than the previous graphics programming APIs for developing GPU codes, the manual development of high-performance codes with the CUDA model is still more complicated than the use of parallel programming models such as OpenMP [59] for general-purpose multi-core systems [7, 28]. Therefore, it is attractive, both for programmer productivity and for software quality, to develop a technique that supports automatic generation of CUDA programs. Recently, this issue has been studied in several projects. Some of them constructed a tool that automatically generates CUDA code at runtime, such as Klockner [37] and Perryman [61]. Other researchers developed code generation tools that translate to a CUDA program from other languages, such as from Fortran [26], C [7], or Java [85]. The above CUDA code generation tools have one common point, namely taking an existing program which is implemented in a different language as input. We introduce a new method that automatically generates CUDA code from an input problem specification using the code generation tool CTADEL [18]. Originally CTADEL was designed to generate Fortran code. We extend CTADEL to generate C and CUDA instead of Fortran.

6.2 CUDA programming model

The architecture of GPU-based computing has been described in detail in Chapter 5. Below we give the structure of a CUDA program. A CUDA program includes two parts: the code that executes on the GPU is called the kernel, and the part that executes on the CPU is called the host code. The kernel is executed by a set of threads in single program multiple data (SPMD) mode. These threads are organized into blocks and a grid. A block is a group of threads and several blocks form a grid. Within a block, threads are organized in a 1-, 2-, or 3-dimensional structure. Blocks in a grid are organized in a 1- or 2-dimensional structure. The host code includes program I/O, invocations of data transfers between CPU and GPU, and launches of kernels. In detail, the host code consists of the following steps:

- Allocate memory on the GPU;
- Copy input data from CPU to GPU;
- Define the block and grid structures;
- Invoke the GPU to start kernels;
- Copy output data from GPU to CPU;
- Deallocate memory on the GPU.

6.3 Specification in CTADEL

CUDA is an extension of C, while CTADEL was implemented to generate Fortran. Therefore we first adapt CTADEL to generate C and then extend it for CUDA generation. We note that after the extension CTADEL can automatically generate CUDA from the specification.
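As a concrete picture of the host-code steps listed in Section 6.2, and of the kind of output the templates below must produce, here is a minimal Python sketch that assembles such a host-code skeleton as a string. The variable, size, and kernel names are illustrative, not CTADEL's actual output:

```python
def host_code_skeleton(var="S", size="Size", kernel="kernel"):
    """Assemble the six host-code steps of Section 6.2 as a CUDA host-code string."""
    steps = [
        f"float *{var};",
        f"cudaMalloc((void**)&{var},{size});",                        # allocate GPU memory
        f"cudaMemcpy({var},h_{var},{size},cudaMemcpyHostToDevice);",  # copy input CPU -> GPU
        "dim3 dimBlock(blockx,blocky,blockz);",                       # define the block
        "dim3 dimGrid(gridx,gridy);",                                 # define the grid
        f"{kernel}<<<dimGrid,dimBlock>>>();",                         # invoke the kernel
        f"cudaMemcpy(h_{var},{var},{size},cudaMemcpyDeviceToHost);",  # copy output GPU -> CPU
        f"cudaFree({var});",                                          # deallocate GPU memory
    ]
    return "\n".join(steps)

print(host_code_skeleton())
```

The generator's task, described in the rest of this section, is to emit exactly this kind of text from templates.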
4 6.3. Specification in CTADEL 107 %TEMPLATE declaref: FORTRAN DECLARE $stmt(declaref(x, Xs, T)) ->(fortran % print Fortran code T," ",X,"(",Xs,")" ). %TEMPLATE forallf: FORTRAN DO-LOOP $stmt(forallf(s, I=L..U)) ->(fortran % print Fortran code "DO ",I,"=",L,",",U, $stmt(s), "ENDDO" ). (a) %TEMPLATE declarec: C DECLARE $stmt(declarec(x, Xs, T)) ->(C % print C code T," ",X,"[",Xs,"]",";" ). %TEMPLATE forallc: C FOR-LOOP $stmt(forallc(s, I=L..U)) ->(C % print C code "for(",i,"=",l,";", I,"<=",U,";",I,"++)", {",$stmt(s),"}" ). (b) Figure 6.1: Example templates to generate a declaration and a for-loop in Fortran (a) and C (b) Adapting CTADEL for C code generation The choice of the target language is implemented in the last stage of the code generation process in CTADEL. In this nal step, CTADEL produces the generated code based on predened templates. In these templates, the grammar of the target language is dened. As examples, Figure 6.1 (a) shows templates for the generation of a declaration (declaref ) and a for-loop (forallf ) in Fortran. In the template declaref, X denotes a variable which has the type T and Xs is the memory to be allocated. In the template forallf, L and U are the loop boundaries, I denotes the loop index, and S is the statement inside the loop. The change in order to generate C code is realized straightforward by the grammar modication, as in Figure 6.1 (b). In this gure, the grammar structures that convert to Fortran such as ( ) and DO-ENDDO now convert to [ ] and for { }, respectively. The modication to generate other structures of C are done in a similar way Generating the kernel code In the CUDA context, a kernel is a function that is executed in parallel by a number of threads. Similar to a conventional parallel program where each processor works on a small part of the data, each of the threads that execute the kernel is assigned to a portion of the data. We call the portion of data processed by each thread the thread domain. 
The generation of the kernel code involves splitting the computational domain into thread domains.
Figure 6.2: Splitting of the computational domain into thread domains

Figure 6.2 describes the splitting of the computational domain. To determine the thread domain, we need the information of the thread index, threadID. In CUDA, a thread in a block and a block in a grid are given a unique index, threadIdx and blockIdx, respectively. From threadIdx and blockIdx, we can determine threadID as

threadID.X = blockIdx.X * blockDim.X + threadIdx.X,  (6.1)

where X can be the x- or y-direction, and blockDim.X is the size of the block in the X-direction. Because a grid is only two-dimensional, threadID in the z-direction is determined as

threadID.z = threadIdx.z.  (6.2)

Based on the thread index threadID, we can determine the thread domain. Consider the x-direction and suppose that each thread processes a number of grid points whose indices vary from L_x to U_x; the thread domain is then determined as

L_x = threadID.x * npoint_x + 1,
U_x = threadID.x * npoint_x + npoint_x,  (6.3)

where npoint_x is the size of the thread domain in the x-direction. Equation (6.3) applies when all thread domains have the same size. If the computational domain is not exactly divisible by the thread domain, we assign the remainder to the first threads. We have presented this issue in Subsection 1.2.1, Chapter 4. The thread domains in the y- and z-directions are determined in a similar way. Figure 6.3 is an example of a kernel code. For simplicity, we show only the code in the x-direction. The kernel code includes two parts: determine the thread domain and perform the calculations on the thread domain. The calculation part is similar to that of the C program. We have presented how to generate
a C program in the previous subsection.

__global__ void kernel() {
  // Determine the thread domain:
  int threadID_x = blockIdx.x*blockDim.x+threadIdx.x;
  int Lx = threadID_x*npoint_x+1;
  int Ux = threadID_x*npoint_x+npoint_x;
  // Perform the calculation:
  for(i=Lx;i<=Ux;i++){ ... }
}

Figure 6.3: Example kernel code

To generate code for the determination of the thread domain we define the template

$splitdomain(npoint,X)
->(cuda % print CUDA code
"int threadID_",X,"=blockIdx.",X,"*blockDim.",X,"+threadIdx.",X,";",
"int L",X,"=threadID_",X,"*npoint_",X,"+1",";",
"int U",X,"=threadID_",X,"*npoint_",X,"+npoint_",X,";"
).

The code generated by the template splitdomain is exactly the code in the first part of the kernel in Figure 6.3. The variables in the kernel in Figure 6.3 are allocated in the register file, or in local memory if the register file is full. The variables inside the kernel can also be allocated in shared memory. Shared memory is allocated using the __shared__ declaration specifier, e.g., as

__shared__ int Lx, Ux;

Generating the host code

The host code of a CUDA program consists of allocating and deallocating memory on the GPU, transferring data between CPU and GPU, defining the block and grid, and invoking kernels on the GPU. The synchronization between two steps is obtained by the fact that an operation is started only if the previous operation has finished. In the following paragraphs we will show examples of the code for each step and how they are generated by CTADEL.

Allocate and deallocate memory on the GPU

In CUDA, global memory on the GPU is allocated using the cudaMalloc() primitive and deallocated using the cudaFree() primitive. Example codes for allocating and deallocating a variable S, which has type float and size Size, are given by

% ALLOCATE GPU MEMORY
float *S;
cudaMalloc((void**)&S,Size);

% DEALLOCATE GPU MEMORY
cudaFree(S);

To generate code for allocating and deallocating global memory on the GPU, we define the templates

% TEMPLATE TO GENERATE CODE FOR ALLOCATING MEMORY
allocategpu(Type,Var,Size)
->(cuda % print CUDA code
Type," *",Var,";",
"cudaMalloc((void**)","&",Var,",",Size,");"
).

% TEMPLATE TO GENERATE CODE FOR DEALLOCATING MEMORY
deallocategpu(Var)
->(cuda % print CUDA code
"cudaFree(",Var,");"
).

Transfer data between CPU and GPU

Data transfer between CPU and GPU is performed using cudaMemcpy(). An example of transferring data is given by

% TRANSFER DATA FROM CPU TO GPU
cudaMemcpy(Mem_dest,Mem_source,Size_of_data,cudaMemcpyHostToDevice);

% TRANSFER DATA FROM GPU TO CPU
cudaMemcpy(Mem_dest,Mem_source,Size_of_data,cudaMemcpyDeviceToHost);

The data transfer command has four parameters: the destination memory address (Mem_dest), the source memory address (Mem_source), the size of the data to be transferred (Size_of_data), and the direction of the transfer, which is specified by cudaMemcpyHostToDevice for copying from CPU to GPU and cudaMemcpyDeviceToHost for copying from GPU to CPU. The code for transferring data is generated by the following templates

% TEMPLATE TO GENERATE CODE FOR TRANSFERRING DATA FROM CPU TO GPU
copycpu2gpu(Mem_dest,Mem_source,Size)
->(cuda % print CUDA code
"cudaMemcpy(",Mem_dest,",",Mem_source,",",Size,",cudaMemcpyHostToDevice);"
).
% TEMPLATE TO GENERATE CODE FOR TRANSFERRING DATA FROM GPU TO CPU
copygpu2cpu(Mem_dest,Mem_source,Size)
->(cuda % print CUDA code
"cudaMemcpy(",Mem_dest,",",Mem_source,",",Size,",cudaMemcpyDeviceToHost);"
).

Define the block and grid

In CUDA, threads are organized into blocks and a grid. The block and grid are defined by the dim3 primitives dimBlock and dimGrid, respectively, as

% DEFINE THE BLOCK AND GRID
dim3 dimBlock(blockx,blocky,blockz);
dim3 dimGrid(gridx,gridy);

In the above code, (blockx,blocky,blockz) are the block sizes, and (gridx,gridy) are the grid sizes. The template that generates code for defining the block and grid reads as

% TEMPLATE TO GENERATE CODE FOR DEFINING THE BLOCK AND GRID
define_thread(blockx,blocky,blockz,gridx,gridy)
->(cuda % print CUDA code
"dim3 ","dimBlock(",blockx,",",blocky,",",blockz,")",";",
"dim3 ","dimGrid(",gridx,",",gridy,")",";"
).

Invoke the kernels

A kernel is invoked by specifying the name of the kernel followed by the <<<dimGrid,dimBlock>>> construct, as

% INVOKE KERNEL
kernel<<<dimGrid,dimBlock>>>();

To generate code for invoking the kernel, we define the template as

% TEMPLATE TO GENERATE CODE FOR INVOKING KERNEL
invoke_kernel(Kernel_name)
->(cuda % print CUDA code
Kernel_name,"<<<dimGrid,dimBlock>>>();"
).
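Stepping back to the kernel-side index arithmetic of Eqs. (6.1)-(6.3): the thread-domain bounds, including the assignment of the remainder to the first threads when the domain is not exactly divisible, can be checked with a short Python sketch. The thread counts and domain sizes here are illustrative:

```python
def thread_domain(threadid, npoints, nthreads):
    """Return the 1-based bounds (L, U) of a thread domain in one direction.

    Each thread gets npoints // nthreads grid points; the remainder is handed
    out one point at a time to the first threads, as described for Eq. (6.3).
    """
    base = npoints // nthreads
    rem = npoints % nthreads
    # Threads 0..rem-1 process base+1 points, the rest process base points.
    mine = base + (1 if threadid < rem else 0)
    start = threadid * base + min(threadid, rem)  # points handled by earlier threads
    return start + 1, start + mine                # 1-based L and U

# Evenly divisible case, Eq. (6.3): thread 2 of 4 over 8 points gives L=5, U=6.
print(thread_domain(2, 8, 4))
# Remainder case: 10 points over 4 threads; the first two threads get 3 points.
print(thread_domain(0, 10, 4))
```

In the divisible case this reduces exactly to L = threadID * npoint + 1 and U = threadID * npoint + npoint.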
Generating the optimal CUDA stream code

In Chapter 5, we have shown that a particular choice of multiple streams to overlap the kernel calculations with the data transfers increases the performance of the CUDA implementation of the DYN routine of the HIRLAM weather forecast model by up to 36%. In this subsection we present a technique that generates the optimal CUDA stream program, i.e., the one with the maximal overlap between kernel calculations and data transfers. The technique that we describe here is applicable to general problems. Suppose that we need to execute a number of kernels on the GPU. Each kernel produces one or more output results from a number of input values. Because the CPU and the GPU cannot access each other's memory, we need to transfer input values from the CPU to the GPU before the calculation and output results from the GPU to the CPU after the calculation. To reduce the data transfer time, we only transfer the inputs/outputs of the program while keeping intermediate values on the GPU. The calculation in a kernel may require an output value of another kernel as its input. Therefore the execution of kernels has to be arranged in an appropriate order, given by the dependency graph. This dependency graph is a directed acyclic graph (DAG) that represents the dependencies of the kernel calculations on their inputs. In the dependency graph, an invocation of a kernel is called a node. A node takes a set of input values and/or output data of other nodes and uses them to create one or a set of output values. Figure 6.4 is an example of a dependency graph. In this figure a node is denoted by an oval, an input/output value is denoted by a rectangle, and an arrow represents a dependence of a node on its input.
In this example, the calculation of kernel A needs data from inputs 1 and 2, the calculation of kernel B needs data from inputs 2, 4 and 5, and the calculation of kernel C needs data from the outputs of kernels A and B and from inputs 3 and 4. The output of the total process consists of outputs 1 and 2 of kernels A and C, respectively. Based on the dependency graph, we can generate the optimal CUDA stream program through four steps as follows:

1. Generate the calculation streams. A kernel has to be invoked after another kernel if its calculation uses the output of that kernel. Therefore, from the dependency graph we can determine the invocation order of the kernels. We group the kernels into a list that reflects their invocation order: the first kernel in the list is invoked first, then the second kernel, and so on. We call this list of kernels the calculation stream. In the example in Figure 6.4, the calculation of kernel A does not depend on the output of any other kernel. Hence, kernel A can be invoked first. Similarly, we could also invoke kernel B first. The calculation of kernel C needs the output of kernels A and B. Therefore, kernel C has to be invoked after kernels A
and B. As a result, we can create two calculation streams: [kernel A, kernel B, kernel C] and [kernel B, kernel A, kernel C].

Figure 6.4: An example data dependency graph

2. Generate the CUDA stream codes. The CUDA stream code is generated by adding the input/output transfers required by each kernel to the calculation stream. To have overlap, we split the transfer of inputs/outputs and the calculation of kernels into two CUDA streams, so that one stream executes a kernel while the other stream is transferring data. To avoid confusion, we note that a CUDA stream code is the host code of a CUDA program, which consists of invocations of kernel executions and data transfers. In terms of the host code, these invocations are processed serially. But in terms of the CUDA stream code, these invocations are organized into two CUDA streams which are executed in parallel. To describe the generation of the CUDA stream code, we will use the terms previous kernel and next kernel for the kernels immediately before and after the current kernel in the calculation stream, respectively, and the term later kernel for any of the kernels that come after the current kernel. The CUDA stream code is created based on three principles. First, the inputs have to be present on the GPU before they can be referred to. Therefore, we add the input transfers of the next kernel to overlap with the calculation of the current kernel. Second, the outputs can only be transferred back after their actual calculation. Hence, we add the output transfers of the current kernel to overlap with the calculation of the next kernel. Third, if the calculation of a kernel in a CUDA stream refers to inputs/values which
are transferred/calculated in a different CUDA stream, a synchronization is needed before the invocation of this kernel. Similarly, a synchronization is needed before a transfer of an output that is calculated in a different CUDA stream. The approach to generate the CUDA stream code is as follows:

- Transfer the inputs of the first kernel in the calculation stream;
- For all kernels in the calculation stream:
  - Synchronize if necessary;
  - Transfer the inputs of the next kernel and the outputs of the previous kernel in one CUDA stream;
  - Execute the kernel in the other CUDA stream;
- Transfer the outputs of the last kernel.

Table 6.1 shows an example CUDA stream code that is generated from the calculation stream [kernel A, kernel B, kernel C] and the dependency graph in Figure 6.4. Firstly, inputs 1 and 2 of kernel A are transferred in CUDA stream 1. Next, the calculation of kernel A is overlapped with the transfers of inputs 4 and 5 of kernel B, and the calculation of kernel B is overlapped with the transfers of input 3 of kernel C and output 1 of kernel A. Kernel B, invoked in CUDA stream 2, uses input 2, which is transferred in CUDA stream 1; hence a synchronization is added before the invocation of kernel B. Similarly, since kernel C, invoked in CUDA stream 1, uses the output of kernel B, which is invoked in CUDA stream 2, a synchronization is needed before the invocation of kernel C. Finally, output 2 of kernel C is transferred. In the above CUDA stream code generation method, it can happen that the execution time of a kernel is much larger than the time to transfer the inputs/outputs which are overlapped with that kernel, whereas in another overlap phase the transfer time is larger than the kernel execution time. If this is the case, we do not achieve the maximum profit from overlapping. Hence, we call this the simple overlap approach.
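Step 1 above, deriving the calculation streams from the dependency graph, amounts to enumerating the topological orderings of the kernel DAG. For the graph of Figure 6.4, where kernel C consumes the outputs of kernels A and B, this can be sketched in Python as:

```python
def calculation_streams(deps):
    """Enumerate all invocation orders compatible with the dependency graph.

    deps maps each kernel to the set of kernels whose output it consumes.
    """
    kernels = set(deps)

    def extend(order, done):
        if len(order) == len(kernels):
            yield list(order)
            return
        for k in sorted(kernels - done):
            if deps[k] <= done:          # all producers already invoked
                order.append(k)
                yield from extend(order, done | {k})
                order.pop()

    return list(extend([], set()))

# Figure 6.4: A and B depend only on inputs; C needs the outputs of A and B.
deps = {"A": set(), "B": set(), "C": {"A", "B"}}
print(calculation_streams(deps))   # [['A', 'B', 'C'], ['B', 'A', 'C']]
```

This reproduces exactly the two calculation streams [kernel A, kernel B, kernel C] and [kernel B, kernel A, kernel C] named in the text.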
To achieve more efficient overlapping, we adapt the simple overlap approach by balancing the data transfer time and the kernel execution time. We add more transfers to overlap with an expensive kernel and remove transfers if the kernel execution time is smaller than the transfer time. We call this the efficient overlap approach. The CUDA stream code is generated by the efficient overlap approach as follows:

- Transfer the inputs of the first kernel in the calculation stream;
Table 6.1: Example CUDA stream code generated from the calculation stream [kernel A, kernel B, kernel C]. Overlap of a kernel calculation and input/output transfers is indicated by the kernel calculation and the input/output transfers appearing in the same row.

Code line | CUDA stream 1                        | CUDA stream 2
1         | transfer input 1, transfer input 2   |
2         | kernel A                             | transfer input 4
3         | synchronization                      | transfer input 5
4         | transfer input 3, transfer output 1  | kernel B
5         | synchronization                      |
6         | kernel C                             |
7         | transfer output 2                    |

- For all kernels:
  - Synchronize if necessary;
  - Transfer the inputs of the next kernel in one CUDA stream;
  - While the input transfer time is smaller than the kernel execution time, transfer one input of a later kernel, or one calculated output if all inputs have been transferred;
  - Execute the kernel in the other CUDA stream;
- Transfer the outputs of the last kernel.

The kernel execution time and the input/output transfer time can be estimated from the number of operations executed by each kernel and the size of the data to be transferred, respectively, or can be found empirically.

3. Derive the theoretical time of the CUDA stream codes. The theoretical time of a CUDA stream code is the aggregate of the transfer and execution times in the case of no overlap. Examples are the theoretical time of row 1 in Table 6.1, which is equal to the total transfer time of inputs 1 and 2, and the theoretical time of row 6, which is the execution time of kernel C. If there is overlap between a kernel execution and data transfers, then the theoretical time is equal to the maximum of those two times. For
example, the theoretical time of row 2 in Table 6.1 is the maximum value of the kernel A execution time and the total transfer time of inputs 4 and 5.

4. Generate the optimal CUDA stream program. In general, from a dependency graph we can generate many calculation streams, corresponding with many generated CUDA stream codes with different theoretical times. At the final step, we choose the CUDA stream code which has the smallest theoretical time to generate the optimal CUDA stream program.

6.4 Generating the optimal CUDA stream program for the DYN routine

In this section we present how to generate the optimal CUDA stream program for the dynamics routine (DYN) of the HIRLAM weather forecast model [29].

The DYN routine

The DYN routine of the HIRLAM weather forecast model solves the so-called primitive equations that describe the conservation of horizontal momentum, energy and mass (air and water), and the ideal gas law, on the grid points. Specifically, the DYN routine calculates the tendency of surface pressure ps_t, humidity q_t, temperature T_t, and wind components (u_t, v_t) from the inputs, including the vertical hybrid coordinate (A, B), Coriolis effect f, metric coefficients (hxu, hxv, hxt, hyu, hyv, hyt), surface geopotential phis, surface pressure ps, humidity q, temperature T, and wind (u, v). The dependency graph of the DYN routine is shown in Figure 6.5. In this graph, p, lnp, etap, E, Z, and phi are the user-defined variables; t25, t60, and t64 are temporary variables introduced by common subexpression elimination.

Figure 6.5: Dependency graph of the DYN routine. The inputs and outputs are denoted by a rectangle. The calculated variables are denoted by an oval. An arrow shows the dependence of a kernel calculation on an input or on a calculated value of another kernel, or the dependence of an output on a calculated value of a kernel.

To derive the theoretical time of the CUDA stream code, we need information about the time to execute the kernels and the time to transfer the input/output data. The kernel execution times and the input/output transfer times depend on the computational domain. We measure these times for the domain of 512x192x64 grid points, the base domain that we used in Chapter 5. The results are shown in Tables 6.2 and 6.3.

Table 6.2: Execution time of the kernels (in microseconds) for the domain of 512x192x64 grid points

Kernel | p | etap | lnp | E | Z | phi | ps_t | t60 | t64 | t25 | q_t | T_t | u_t | v_t
Time   |

Table 6.3: Transfer time of the input/output data (in microseconds) for the domain of 512x192x64 grid points

Variable                                              | Size   | Copy time
1D inputs (A, B)                                      | 256 B  | 18
2D inputs (ps, f, phis, hxu, hxv, hxt, hyu, hyv, hyt) | 384 KB | 85
3D inputs (q, T, u, v)                                | 24 MB  | 4400
2D outputs (ps_t)                                     | 384 KB | 85
3D outputs (q_t, T_t, u_t, v_t)                       | 24 MB  | 4130

Table 6.3 shows that for the 3D data, transferring the input from the CPU
to the GPU is more expensive than transferring the output from the GPU to the CPU. This result is also observed in other studies, such as [2, 19].

The simple and efficient overlap approach

In Subsection 6.3.4, we proposed two methods to generate the CUDA stream code, namely the simple and the efficient overlap approach. In this subsection we first show how the CUDA stream code is generated from a calculation stream applying the simple and efficient overlap approaches, and how the theoretical time of a CUDA stream code is derived. Next, we assess the performance of the two CUDA stream code generation approaches. Tables 6.4 and 6.5 show, as an example, the CUDA stream code generated from the calculation stream [p, t60, t64, ps_t, lnp, etap, E, Z, q_t, t25, phi, T_t, u_t, v_t], which is the calculation order of the CUDA stream code that we have presented in Chapter 5, applying the simple and the efficient overlap approach, respectively. The common point of the two approaches is that the inputs of the first kernel p (A, B, ps) are transferred before the first kernel starts its calculation, and the output of the last kernel v_t is transferred back after its calculation has finished. In addition, the input transfers of the next kernel are overlapped with the calculation of the current kernel. For example, in both Tables 6.4 and 6.5, to calculate t60 (line 3) we need the input u; therefore, u is transferred at line 2. The difference is that with the efficient overlap approach, if the total transfer time in an overlapped phase is smaller than the kernel calculation time, we transfer more data until the total transfer time approximates the calculation time. For example, to calculate ps_t (line 5 in Tables 6.4 and 6.5), we need the inputs hyu, hxt, and hyt. With the simple overlap approach, the transfers of these inputs are overlapped with the calculation of t64 at line 4 of Table 6.4.
However, the total transfer time of the inputs hyu, hxt, and hyt, which is 255 µs, is smaller than the calculation time of t64 (666 µs). Therefore, in the efficient overlap approach we add the transfers of hxv, hyv, hxu, and phis to overlap with the calculation of t64 (see Table 6.5). The last column of Tables 6.4 and 6.5 shows the theoretical time of the CUDA stream code. The theoretical time of the CUDA stream code is the total time of all lines in the CUDA stream code. The time of a line is the transfer/calculation time in case of no overlap, or the maximum of the transfer and calculation times if there is overlap between the calculation and a transfer. For example, in Table 6.4, the transfers of A, B, and ps (line 1) are not overlapped with any calculation; hence the time of line 1 is 121 µs, which is the time to transfer A, B, and ps. In line 2, the calculation of p is overlapped with the transfer of u. The transfer time of u, which is 4400 µs, is larger than the calculation time of p (230 µs). Therefore the time of line 2 is 4400 µs. Note that at this stage we cannot measure the synchronization time exactly; we do not include it in the theoretical time. In Subsection 6.4.4, we will investigate
Table 6.4: The CUDA stream code and theoretical time obtained by the simple overlap approach. Overlap of a kernel calculation and input/output transfers is indicated by the kernel calculation and the input/output transfers appearing in the same row.

Line | CUDA streams 1 and 2 (overlapped entries) | Time (µs)
1 | Transfer(A), Transfer(B), Transfer(ps) | 121
2 | Calculation(p), Transfer(u) | 4400
3 | Transfer(v), Calculation(t60) |
4 | Calculation(t64), Transfer(hyu), Transfer(hxt), Transfer(hyt) | 666
5 | Transfer(hxv), Calculation(ps_t) | 85
6 | Calculation(lnp), Transfer(hyv), Transfer(hxu) |
  | Calculation(etap) |
  | Calculation(E), Transfer(f) |
  | Transfer(q), Calculation(Z) |
  | Calculation(q_t), Transfer(T), Transfer(phis) |
  | Calculation(t25) |
  | Calculation(phi), Transfer(ps_t), Transfer(q_t) |
  | Calculation(T_t) |
  | Calculation(u_t), Transfer(T_t) |
  | Transfer(u_t), Calculation(v_t) |
  | Transfer(v_t) | 4130
Total theoretical time | | 46451
Table 6.5: The CUDA stream code and theoretical time obtained by the efficient overlap approach

Line | CUDA streams 1 and 2 (overlapped entries) | Time (µs)
1 | Transfer(A), Transfer(B), Transfer(ps) | 121
2 | Calculation(p), Transfer(u) | 4400
3 | Transfer(v), Calculation(t60) |
4 | Calculation(t64), Transfer(hxv), Transfer(hyv), Transfer(hxu), Transfer(hyu), Transfer(hxt), Transfer(hyt), Transfer(phis) | 666
5 | Calculation(ps_t) | 54
6 | Calculation(lnp), Transfer(f) |
  | Calculation(etap) |
  | Calculation(E), Transfer(q) |
  | Calculation(Z) |
  | Calculation(q_t), Transfer(T) |
  | Calculation(t25) |
  | Calculation(phi) | 1044
  | Transfer(ps_t) | 13
  | Transfer(q_t) |
  | Calculation(T_t) |
  | Calculation(u_t), Transfer(T_t) |
  | Transfer(u_t), Calculation(v_t) |
  | Transfer(v_t) | 4130
Total theoretical time | | 46420
Table 6.6: Theoretical time (µs) of the CUDA stream codes obtained by the simple and efficient overlap approaches

Calculation stream | Simple overlap | Efficient overlap
[p, lnp, t60, t64, ps_t, etap, q_t, T_t, E, Z, t25, phi, u_t, v_t] | |
[p, t60, t64, ps_t, etap, q_t, lnp, T_t, E, Z, t25, phi, u_t, v_t] | |
[p, t60, t64, lnp, ps_t, etap, E, Z, q_t, t25, phi, T_t, u_t, v_t] | |
[p, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, E, Z, u_t, v_t] | |
[p, E, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, Z, u_t, v_t] | |
[p, E, t60, t64, lnp, ps_t, etap, Z, q_t, t25, phi, T_t, u_t, v_t] | |
[E, p, t60, t64, ps_t, etap, q_t, lnp, T_t, Z, t25, phi, u_t, v_t] | |
[E, p, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, Z, u_t, v_t] | |
[E, p, Z, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, u_t, v_t] | |
[E, p, Z, t60, t64, ps_t, etap, q_t, lnp, T_t, t25, phi, u_t, v_t] | |

whether the synchronization time is significant for the theoretical time. To evaluate the performance of the two CUDA stream code generation approaches, we derive the theoretical time of the CUDA stream codes generated from different calculation streams. From the dependency graph we can generate thousands of calculation streams. It would take a lot of time to generate CUDA stream codes for all suitable calculation streams. Hence, we assess the two CUDA stream code generation approaches based on a number of calculation streams, as in Table 6.6. These calculation streams are created by hand. In order to obtain some representativeness, we choose different kernels for each node. For example, the first node can be the kernel p or E, and the second node can be the kernel lnp, t60, t25, or Z. We observe that the CUDA stream codes generated by the efficient overlap approach are more efficient than those generated by the simple overlap approach.
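The two ingredients of this comparison can be sketched in Python: the theoretical time of a stream code (per row, the maximum of the kernel time and the overlapped transfer time; rows without overlap contribute their plain time) and the greedy balancing of the efficient overlap approach, which keeps adding transfers while they still fit under the kernel time. The times below are the measured values quoted in the text (230 µs for p, 4400 µs for u, 666 µs for t64, 85 µs per 2D field, 18 µs per 1D field); the packing rule "add a transfer while it still fits" is our reading of the approach, and the candidate list is illustrative:

```python
def theoretical_time(rows):
    """rows: list of (calc_time, [transfer_times]); either part may be empty (0 or [])."""
    return sum(max(calc, sum(transfers)) for calc, transfers in rows)

def pack_transfers(kernel_time, required, candidates):
    """Efficient overlap: overlap the required transfers with the kernel, then
    greedily add candidate transfers as long as the total stays under kernel_time."""
    chosen = list(required)
    total = sum(t for _, t in chosen)
    for name, t in candidates:
        if total + t <= kernel_time:
            chosen.append((name, t))
            total += t
    return chosen, total

# Rows 1-2 of Table 6.4: transfers of A, B, ps (121 us in total), then
# Calculation(p) (230 us) overlapped with Transfer(u) (4400 us).
print(theoretical_time([(0, [18, 18, 85]), (230, [4400])]))   # 121 + 4400 = 4521

# Balancing around Calculation(t64) (666 us): hyu, hxt, hyt are required
# (255 us); hxv, hyv, hxu, and phis (85 us each) still fit underneath.
required = [("hyu", 85), ("hxt", 85), ("hyt", 85)]
extra = [("hxv", 85), ("hyv", 85), ("hxu", 85), ("phis", 85), ("f", 85)]
chosen, total = pack_transfers(666, required, extra)
print(len(chosen), total)   # 7 transfers, 595 us
```

With these measured times, the greedy rule reproduces the seven transfers that overlap t64 in Table 6.5, and a further 85 µs transfer no longer fits.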
Therefore, in the next subsection we will apply the efficient overlap approach to generate the optimal CUDA stream program.

Generating the optimal CUDA stream program

As presented in Subsection 6.3.4, the optimal CUDA stream program is generated through the following steps:

- Based on the dependency graph, CTADEL generates the calculation streams. Figure 6.6 shows, as an example, how the first three kernels of the calculation streams are generated. At first, either the kernel p or E can be chosen, because their calculations do not use the output of any other kernel. Next, if we choose p as the first kernel, the kernels lnp, Z, t25, t60, t64, and E can be selected as the second kernel, because their calculations only depend on the calculation of p. Similarly, if we choose lnp as the second kernel, the kernels E, Z, t25, t60, and t64 can be chosen as the third kernel.

Figure 6.6: The calculation streams generated for the first three kernels. An arrow shows which kernel can be invoked after the calculation of another kernel.

- From the calculation streams, CTADEL generates the CUDA stream codes by adding the input/output transfers according to the efficient overlap approach. An example of a CUDA stream code obtained by the efficient overlap approach is presented in Table 6.5.

- Next, CTADEL derives the theoretical time of all generated CUDA stream codes.

- Finally, CTADEL chooses the CUDA stream code with the smallest theoretical time for the generation of the optimal CUDA stream program. Table 6.7 shows this CUDA stream code, which has the smallest theoretical time of 45905 µs.

Table 6.7: The optimal CUDA stream code (operations on two CUDA streams; times in µs)

  Transfer(A,B,ps)          121
  Transfer(hxv)
  Calculation(p)
  Transfer(hyv)             230
  Transfer(v)
  Calculation(lnp)         4400
  Calculation(t64)
  Transfer(u)              4400
  Transfer(hxu)
  Transfer(hyu)
  Transfer(hxt)
  Calculation(t60)          359
  Transfer(hyt)
  Calculation(ps_t)          54
  Transfer(f)
  Transfer(phis)
  Calculation(etap)        1044
  Calculation(Z)
  Transfer(q)              4400
  Transfer(T)
  Calculation(q_t)         4641
  Calculation(T_t)
  Transfer(q_t)
  Transfer(ps_t)
  Calculation(E)            791
  Calculation(t25)          786
  Calculation(phi)         1044
  Calculation(u_t)
  Transfer(T_t)            4295
  Transfer(u_t)
  Calculation(v_t)         4290
  Transfer(v_t)            4130
  Total theoretical time  45905

Experiments

The experiments are performed on the system that we used in Chapter 5. The codes for the experiments are the CUDA implementations of the DYN routine that are generated by CTADEL. We compare the performance of the generated CUDA code with the handwritten code and with the C program. We verify the correctness of the generated CUDA programs by comparing the calculated results of DYN from the generated programs with those from the handwritten code; this comparison shows that the generated and handwritten programs reproduce bit-wise identical output. The domain and thread structure are chosen the same as in Chapter 5: the computational domain consists of 512x192x64 grid points, the thread domain contains 1x1x32 grid points, and the block has 256x1x32 threads. The elapsed times are shown in Table 6.8. The elapsed time of the C and CUDA code is the total execution time on the CPU.
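The generation procedure described above, enumerating calculation streams that respect the dependency graph and keeping the one with the smallest theoretical time, can be sketched as a toy Python model. The dependency graph follows the Figure 6.6 example, but the per-kernel costs and the overlap cost model are illustrative assumptions, not CTADEL's implementation:

```python
from itertools import permutations

# Illustrative dependency graph for a few DYN kernels (after Figure 6.6):
# p and E need no other kernel's output; lnp, Z and t60 consume p.
deps = {"p": set(), "E": set(), "lnp": {"p"}, "Z": {"p"}, "t60": {"p"}}

# (input_transfer, compute) times per kernel in microseconds; made up.
cost = {"p": (121, 230), "E": (500, 791), "lnp": (4400, 100),
        "Z": (50, 1044), "t60": (60, 359)}

def calculation_streams(deps):
    """All orderings in which every kernel runs after its dependencies."""
    for order in permutations(deps):
        if all(deps[k] <= set(order[:i]) for i, k in enumerate(order)):
            yield order

def theoretical_time(order):
    """Toy cost model in the spirit of the efficient overlap approach:
    kernel i's computation hides kernel i+1's input transfer."""
    total = cost[order[0]][0]
    for i, k in enumerate(order):
        nxt = cost[order[i + 1]][0] if i + 1 < len(order) else 0
        total += max(cost[k][1], nxt)
    return total

# Choose the calculation stream with the smallest theoretical time.
best = min(calculation_streams(deps), key=theoretical_time)
```

Because the cost model is order-sensitive, different valid orderings yield different theoretical times, which is why the exhaustive enumeration and minimum selection are worthwhile; for the full DYN kernel set the search space is much larger, which motivates the hand-picked subset of streams evaluated in Table 6.6.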
For the C code, this is the calculation time on the CPU; for the CUDA code, it consists of the calculation time on the GPU and the time to transfer data between CPU and GPU. Because we did not include the synchronization time in deriving the theoretical time of the CUDA stream code, the real run times of the generated CUDA codes are larger than the theoretical times, as we can see from Table 6.8. The small difference between the real run time and the theoretical time indicates that the synchronization time is small and can safely be neglected when deriving the theoretical time of the CUDA stream code.

Table 6.8: The theoretical and run times (in ms) of the CUDA stream programs on a GTX 480 GPU, and, for reference, that of the C code on an Intel i7-940 CPU.

                                   Theoretical time   Run time
  C code                                                  2610
  Hand code (Chapter 5)                                   46.9
  CUDA stream code (Table 6.4)
  CUDA stream code (Table 6.5)
  Optimal CUDA stream code

The optimal CUDA stream program generated by CTADEL is slightly faster than the handwritten code. This demonstrates that, although we have done many optimizations, the handwritten code is not optimal. This result also shows that CTADEL can generate efficient CUDA code that is comparable with optimized handwritten code. In summary, we obtain a speedup of a factor of 57 of the CUDA program over the C code.

6.5 Conclusion

We have presented how to extend CTADEL to generate CUDA programs. To overlap kernel calculations with data transfers, we introduced two approaches to generate the optimal CUDA stream program. We applied this technique to the DYN routine of the HIRLAM weather forecast model. The experimental results showed that the real run times approximate the theoretical times. This indicates that the approach that we use to derive the theoretical time of the CUDA stream program is reliable. The generated code is more efficient than the handwritten code. The difference between the optimal generated code and the handwritten code is small (2%), owing to the many optimizations that we had already done on the handwritten code. The main advantages of code generation here are hence the ease of obtaining efficient code, of maintaining its efficiency across platforms, and of implementing new problems and algorithms; in other situations it may also take away the burden of manual code optimization.