Cover Page

The handle http://hdl.handle.net/1887/18622 holds various files of this Leiden University dissertation.

Author: Vu, Van Thieu
Title: Opportunities for performance optimization of applications through code generation
Issue Date:

Chapter 6

Automatic code generation of efficient CUDA programs

In Chapter 5 we presented the implementation of the dynamics routine of the HIRLAM weather forecast model on programmable NVIDIA GPUs. The results showed that the use of GPUs to accelerate a weather forecast model is very promising. The CUDA program used in Chapter 5 was created by hand from the original Fortran code. The handwritten CUDA code is difficult to optimize and complicated to maintain. Therefore, in this chapter we present our extension of CTADEL to generate CUDA code. We show a technique that generates an efficient CUDA program for a general problem. Then we apply this technique to generate CUDA code for the dynamics routine of the HIRLAM weather forecast model.

6.1 Motivation

Although the CUDA programming model is more convenient than the previous graphics programming APIs for developing GPU codes, the manual development of high-performance codes with the CUDA model is still more complicated than the use of parallel programming models such as OpenMP [59] for general-purpose multi-core systems [7, 28]. Therefore, it is attractive, both for programmer productivity and for software quality, to develop a technique that supports automatic generation of CUDA programs. Recently, this issue has been studied in several projects. Some of them constructed a tool that automatically generates CUDA code at runtime, such as Klockner [37] and Perryman [61]. Other researchers developed a code generation tool that translates to a CUDA program from other languages, such as from Fortran [26], C [7], or Java [85]. The above CUDA code generation tools have one common point, namely taking an existing program, implemented in a different language, as input.

We introduce a new method that automatically generates CUDA code from an input problem specification using the code generation tool CTADEL [18]. Originally CTADEL was designed to generate Fortran code. We extend CTADEL to generate C and CUDA instead of Fortran.

6.2 CUDA programming model

The architecture of GPU-based computing has been described in detail in Chapter 5. Below we give the structure of a CUDA program. A CUDA program consists of two parts. The code that executes on the GPU is called the kernel; the other part, which executes on the CPU, is called the host code. The kernel is executed by a set of threads in single program multiple data (SPMD) mode. These threads are organized into blocks and a grid. A block is a group of threads, and several blocks form a grid. Within a block, threads are organized in a 1-, 2-, or 3-dimensional structure. Blocks in a grid are organized in a 1- or 2-dimensional structure. The host code includes program I/O, invocations of data transfers between CPU and GPU, and launches of kernels. In detail, the host code consists of the following steps, illustrated by the sketch after this list:

- Allocate memory on the GPU;
- Copy input data from CPU to GPU;
- Define the block and grid structures;
- Invoke the GPU to start kernels;
- Copy output data from GPU to CPU;
- Deallocate memory on the GPU.
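To make the six steps concrete, the following is a minimal host-code sketch in the spirit of the generated programs. It is our illustration, not CTADEL output; the kernel scale and the array size are assumptions.

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *a, int n) {        /* illustrative kernel */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        size_t size = n * sizeof(float);
        float *h_a = (float*)malloc(size);
        for (int i = 0; i < n; i++) h_a[i] = 1.0f;

        float *d_a;
        cudaMalloc((void**)&d_a, size);                     /* 1. allocate GPU memory */
        cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice); /* 2. copy input CPU -> GPU */
        dim3 dimBlock(256, 1, 1);                           /* 3. define block and grid */
        dim3 dimGrid((n + 255) / 256, 1);
        scale<<<dimGrid, dimBlock>>>(d_a, n);               /* 4. launch the kernel */
        cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost); /* 5. copy output GPU -> CPU */
        cudaFree(d_a);                                      /* 6. deallocate GPU memory */

        printf("a[0] = %f\n", h_a[0]);
        free(h_a);
        return 0;
    }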

4 6.3. Specification in CTADEL 107 %TEMPLATE declaref: FORTRAN DECLARE $stmt(declaref(x, Xs, T)) ->(fortran % print Fortran code T," ",X,"(",Xs,")" ). %TEMPLATE forallf: FORTRAN DO-LOOP $stmt(forallf(s, I=L..U)) ->(fortran % print Fortran code "DO ",I,"=",L,",",U, $stmt(s), "ENDDO" ). (a) %TEMPLATE declarec: C DECLARE $stmt(declarec(x, Xs, T)) ->(C % print C code T," ",X,"[",Xs,"]",";" ). %TEMPLATE forallc: C FOR-LOOP $stmt(forallc(s, I=L..U)) ->(C % print C code "for(",i,"=",l,";", I,"<=",U,";",I,"++)", {",$stmt(s),"}" ). (b) Figure 6.1: Example templates to generate a declaration and a for-loop in Fortran (a) and C (b) Adapting CTADEL for C code generation The choice of the target language is implemented in the last stage of the code generation process in CTADEL. In this nal step, CTADEL produces the generated code based on predened templates. In these templates, the grammar of the target language is dened. As examples, Figure 6.1 (a) shows templates for the generation of a declaration (declaref ) and a for-loop (forallf ) in Fortran. In the template declaref, X denotes a variable which has the type T and Xs is the memory to be allocated. In the template forallf, L and U are the loop boundaries, I denotes the loop index, and S is the statement inside the loop. The change in order to generate C code is realized straightforward by the grammar modication, as in Figure 6.1 (b). In this gure, the grammar structures that convert to Fortran such as ( ) and DO-ENDDO now convert to [ ] and for { }, respectively. The modication to generate other structures of C are done in a similar way Generating the kernel code In the CUDA context, a kernel is a function that is executed in parallel by a number of threads. Similar to a conventional parallel program where each processor works on a small part of the data, each of the threads that execute the kernel is assigned to a portion of the data. We call the portion of data processed by each thread the thread domain. The generation of the kernel code involves splitting the computational domain into thread domains.

6.3.2 Generating the kernel code

In the CUDA context, a kernel is a function that is executed in parallel by a number of threads. Similar to a conventional parallel program, where each processor works on a small part of the data, each of the threads that execute the kernel is assigned a portion of the data. We call the portion of data processed by each thread the thread domain. The generation of the kernel code involves splitting the computational domain into thread domains.

Figure 6.2: Splitting of the computational domain into thread domains

Figure 6.2 illustrates the splitting of the computational domain. To determine the thread domain, we need the thread index, threadID. In CUDA, a thread in a block and a block in a grid are each given a unique index, threadIdx and blockIdx, respectively. From threadIdx and blockIdx, we can determine threadID as

    threadID.X = blockIdx.X * blockDim.X + threadIdx.X,    (6.1)

where X can be the x- or y-direction and blockDim.X is the size of the block in the X direction. Because a grid is only two-dimensional, threadID in the z-direction is determined as

    threadID.z = threadIdx.z.    (6.2)

Based on the thread index threadID, we can determine the thread domain. Consider the x-direction and suppose that each thread processes a number of grid points whose indexes vary from L_x to U_x; the thread domain is then determined as

    L_x = threadID.x * npoint_x + 1,
    U_x = threadID.x * npoint_x + npoint_x,    (6.3)

where npoint_x is the size of the thread domain in the x-direction. Equation (6.3) applies when all thread domains have the same size. If the computational domain is not exactly divisible by the thread domain, we assign the remainder to the first threads; we have discussed this issue in Subsection 1.2.1, Chapter 4. The thread domains in the y- and z-directions are determined in a similar way. A sketch of this index computation, including the remainder case, is given below.
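The following device-side sketch is our illustration (the function name and the domain size nx are assumptions; variable names follow the text). It computes the thread domain in the x-direction, assigning the remainder to the first threads when nx is not exactly divisible by npoint_x:

    __device__ void thread_domain_x(int nx, int npoint_x, int *Lx, int *Ux) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  /* Equation (6.1) */
        int rem = nx % npoint_x;                          /* leftover grid points */
        /* the first 'rem' threads process npoint_x+1 points, the rest npoint_x */
        if (tid < rem) {
            *Lx = tid * (npoint_x + 1) + 1;
            *Ux = *Lx + npoint_x;          /* npoint_x + 1 points */
        } else {
            *Lx = rem * (npoint_x + 1) + (tid - rem) * npoint_x + 1;
            *Ux = *Lx + npoint_x - 1;      /* npoint_x points */
        }
    }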

Figure 6.3 is an example of a kernel code. For simplicity, we show only the code in the x-direction. The kernel code consists of two parts: determining the thread domain and performing the calculations on the thread domain. The calculation part is similar to that of a C program; we have presented how to generate a C program in Subsection 6.3.1.

    __global__ void kernel() {
      % Determine the thread domain:
      int threadID.x = blockIdx.x*blockDim.x+threadIdx.x;
      int Lx = threadID.x*npoint_x+1;
      int Ux = threadID.x*npoint_x+npoint_x;
      % Perform the calculation:
      for(i=Lx; i<=Ux; i++){ ... }
    }

Figure 6.3: Example kernel code

To generate code for the determination of the thread domain, we define the template

    $splitdomain(npoint,X) ->(cuda % print CUDA code
      "int threadID.",X,"=blockIdx.",X,"*blockDim.",X,"+threadIdx.",X,";",
      "int L",X,"=threadID.",X,"*npoint_",X,"+1",";",
      "int U",X,"=threadID.",X,"*npoint_",X,"+npoint_",X,";"
    ).

The code generated by the template splitdomain is exactly the code in the first part of the kernel in Figure 6.3. The variables in the kernel in Figure 6.3 are allocated in the register file, or in local memory if the register file is full. Variables inside the kernel can also be allocated in shared memory. Shared memory is allocated using the __shared__ declaration specifier, e.g., as:

    __shared__ int Lx, Ux;
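The kernel of Figure 6.3 is schematic: threadID.x is not a legal C identifier and the % comments follow the template notation. A directly compilable variant (our sketch; the array a, its length nx, and the placeholder computation are assumptions) could look as follows:

    __global__ void kernel(float *a, int nx, int npoint_x) {
        /* Determine the thread domain, Equations (6.1) and (6.3): */
        int threadID_x = blockIdx.x * blockDim.x + threadIdx.x;
        int Lx = threadID_x * npoint_x + 1;
        int Ux = threadID_x * npoint_x + npoint_x;
        /* Perform the calculation on the thread domain (1-based indexes): */
        for (int i = Lx; i <= Ux && i <= nx; i++)
            a[i - 1] = 2.0f * a[i - 1];   /* placeholder computation */
    }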

6.3.3 Generating the host code

The host code of a CUDA program consists of allocating and deallocating memory on the GPU, transferring data between CPU and GPU, defining the block and grid, and invoking kernels on the GPU. The synchronization between two steps is obtained by the fact that an operation is started only after the previous operation has finished. In the following paragraphs we show examples of the code for each step and how they are generated by CTADEL.

Allocate and deallocate memory on the GPU

In CUDA, global memory on the GPU is allocated using the cudaMalloc() primitive and deallocated using the cudaFree() primitive. Example codes for allocating and deallocating a variable S, which has type float and size Size, are given by

    % ALLOCATE GPU MEMORY
    float *S;
    cudaMalloc((void**)&S,Size);

    % DEALLOCATE GPU MEMORY
    cudaFree(S);

To generate code for allocating and deallocating global memory on the GPU, we define the templates

    % TEMPLATE TO GENERATE CODE FOR ALLOCATING MEMORY
    allocategpu(Type,Var,Size) ->(cuda % print CUDA code
      Type," *",Var,";",
      "cudaMalloc((void**)","&",Var,",",Size,");"
    ).

    % TEMPLATE TO GENERATE CODE FOR DEALLOCATING MEMORY
    deallocategpu(Type,Var,Size) ->(cuda % print CUDA code
      "cudaFree(",Var,");"
    ).

Transfer data between CPU and GPU

Data transfer between CPU and GPU is performed using cudaMemcpy(). Examples of transferring data are given by

    % TRANSFER DATA FROM CPU TO GPU
    cudaMemcpy(Mem_dest,Mem_source,size_of_data,cudaMemcpyHostToDevice);

    % TRANSFER DATA FROM GPU TO CPU
    cudaMemcpy(Mem_dest,Mem_source,size_of_data,cudaMemcpyDeviceToHost);

The data transfer command has four parameters: the destination memory address (Mem_dest), the source memory address (Mem_source), the size of the data to be transferred (size_of_data), and the direction of the transfer, which is specified by cudaMemcpyHostToDevice for copying from CPU to GPU and cudaMemcpyDeviceToHost for copying from GPU to CPU. The code for transferring data is generated by the following templates

    % TEMPLATE TO GENERATE CODE FOR TRANSFERRING DATA FROM CPU TO GPU
    copycpu2gpu(Mem_dest,Mem_source,Size) ->(cuda % print CUDA code
      "cudaMemcpy(",Mem_dest,",",Mem_source,",",Size,",cudaMemcpyHostToDevice);"
    ).

    % TEMPLATE TO GENERATE CODE FOR TRANSFERRING DATA FROM GPU TO CPU
    copygpu2cpu(Mem_dest,Mem_source,Size) ->(cuda % print CUDA code
      "cudaMemcpy(",Mem_dest,",",Mem_source,",",Size,",cudaMemcpyDeviceToHost);"
    ).

Define the block and grid

In CUDA, threads are organized into blocks and a grid. The block and grid are defined by the dimBlock and dimGrid primitives, respectively, as

    % DEFINE THE BLOCK AND GRID
    dim3 dimBlock(blockX,blockY,blockZ);
    dim3 dimGrid(gridX,gridY);

In the above code, (blockX,blockY,blockZ) are the block sizes and (gridX,gridY) are the grid sizes. The template that generates code for defining the block and grid reads as

    % TEMPLATE TO GENERATE CODE FOR DEFINING THE BLOCK AND GRID
    define_thread(BlockX,BlockY,BlockZ,GridX,GridY) ->(cuda % print CUDA code
      "dim3 ","dimBlock(",BlockX,",",BlockY,",",BlockZ,")",";",
      "dim3 ","dimGrid(",GridX,",",GridY,")",";"
    ).

Invoke the kernels

A kernel is invoked by specifying the name of the kernel followed by the <<<dimGrid,dimBlock>>> construct, as

    % INVOKE KERNEL
    kernel<<<dimGrid,dimBlock>>>();

To generate code for invoking the kernel, we define the template

    % TEMPLATE TO GENERATE CODE FOR INVOKING KERNEL
    invoke_kernel(Kernel_name) ->(cuda % print CUDA code
      Kernel_name,"<<<dimGrid,dimBlock>>>();"
    ).
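To connect the block and grid definition with the thread domains of Subsection 6.3.2: the grid must supply one thread per thread domain, which leads to ceiling divisions of the following form. This is our sketch (nx, npoint_x, the block size, and the kernel arguments are assumptions), not CTADEL output:

    int threads_x = (nx + npoint_x - 1) / npoint_x;  /* one thread per thread domain */
    int blockX = 256;                                /* chosen block size in x */
    int gridX = (threads_x + blockX - 1) / blockX;   /* enough blocks to cover all threads */
    dim3 dimBlock(blockX, 1, 1);
    dim3 dimGrid(gridX, 1);
    kernel<<<dimGrid, dimBlock>>>();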

6.3.4 Generating the optimal CUDA stream code

In Chapter 5, we showed that a particular choice of multiple streams to overlap the kernel calculations with the data transfers increases the performance of the CUDA implementation of the DYN routine of the HIRLAM weather forecast model by up to 36%. In this subsection we present a technique that generates the optimal CUDA stream program, i.e., the program with the maximal overlap between kernel calculations and data transfers. The technique described here is applicable to general problems.

Suppose that we need to execute a number of kernels on the GPU. Each kernel produces one or more output results from a number of input values. Because the CPU and the GPU cannot access each other's memory, we need to transfer input values from the CPU to the GPU before the calculation and output results from the GPU to the CPU after the calculation. To reduce the data transfer time, we only transfer inputs and outputs of the program while keeping intermediate values on the GPU. The calculation in a kernel may require an output value of another kernel as its input. Therefore the execution of the kernels has to be arranged in an appropriate order, given by the dependency graph. This dependency graph is a directed acyclic graph (DAG) that represents the dependencies of the kernel calculations on their inputs. In the dependency graph, an invocation of a kernel is called a node. A node takes a set of input values and/or output data of other nodes and uses them to create one or a set of output values. Figure 6.4 is an example of a dependency graph. In this figure a node is denoted by an oval, an input/output value is denoted by a rectangle, and an arrow represents the dependence of a node on its input. In this example, the calculation of kernel A needs data from inputs 1 and 2, the calculation of kernel B needs data from inputs 2, 4 and 5, and the calculation of kernel C needs data from the outputs of kernels A and B and from inputs 3 and 4. The output of the total process consists of outputs 1 and 2 of kernels A and C, respectively. Based on the dependency graph, we can generate the optimal CUDA stream program through four steps, as follows:

1. Generate the calculation streams. A kernel has to be invoked after another kernel if its calculation uses the output of that kernel. Therefore, from the dependency graph we can determine the invocation order of the kernels. We group the kernels into a list that reflects their invocation order: the first kernel in the list is invoked first, then the second kernel, and so on. We call this list of kernels the calculation stream. In the example of Figure 6.4, the calculation of kernel A does not depend on the output of any other kernel; hence, kernel A can be invoked first. Similarly, we can also invoke kernel B first. The calculation of kernel C needs the output of kernels A and B. Therefore, kernel C has to be invoked after kernels A and B.

Figure 6.4: An example data dependency graph

As a result, we can create two calculation streams: [kernel A, kernel B, kernel C] and [kernel B, kernel A, kernel C].

2. Generate the CUDA stream codes. The CUDA stream code is generated by adding the input/output transfers required by each kernel to the calculation stream. To obtain overlap, we split the transfers of inputs/outputs and the calculations of kernels over two CUDA streams, so that one stream executes a kernel while the other stream is transferring data. To avoid confusion, we note that a CUDA stream code is the host code of a CUDA program, consisting of invocations of kernel executions and data transfers. In terms of the host code, these invocations are issued serially; but in terms of the CUDA stream code, they are organized into two CUDA streams that execute in parallel. To describe the generation of the CUDA stream code, we will use the terms previous kernel and next kernel for the kernels that stand directly in front of and behind the current kernel, respectively, and the term later kernel for any of the kernels that stand behind the current kernel in the calculation stream.

The CUDA stream code is created based on three principles. First, the inputs have to be present on the GPU before they can be referred to; therefore, we add the input transfers of the next kernel to overlap with the calculation of the current kernel. Second, the outputs can only be transferred back after their actual calculation; hence, we add the output transfers of the current kernel to overlap with the calculation of the next kernel. Third, if the calculation of a kernel in a CUDA stream refers to inputs or values that are transferred or calculated in a different CUDA stream, a synchronization is needed before the invocation of this kernel.

Similarly, a synchronization is needed before the transfer of an output that is calculated in a different CUDA stream. The approach to generate the CUDA stream code is as follows:

- Transfer the inputs of the first kernel in the calculation stream;
- For all kernels in the calculation stream:
  - Synchronize if necessary;
  - Transfer the inputs of the next kernel and the outputs of the previous kernel in one CUDA stream;
  - Execute the kernel in the other CUDA stream;
- Transfer the outputs of the last kernel.

Table 6.1 shows an example CUDA stream code that is generated from the calculation stream [kernel A, kernel B, kernel C] and the dependency graph in Figure 6.4. First, inputs 1 and 2 of kernel A are transferred in CUDA stream 1. Next, the calculation of kernel A is overlapped with the transfers of inputs 4 and 5 of kernel B, and the calculation of kernel B is overlapped with the transfers of input 3 of kernel C and output 1 of kernel A. Kernel B, invoked in CUDA stream 2, uses input 2, which is transferred in CUDA stream 1; hence a synchronization is added before the invocation of kernel B. Similarly, since kernel C, invoked in CUDA stream 1, uses the output of kernel B, which is invoked in CUDA stream 2, a synchronization is needed before the invocation of kernel C. Finally, output 2 of kernel C is transferred.

Table 6.1: Example CUDA stream code generated from the calculation stream [kernel A, kernel B, kernel C]. Overlap of a kernel calculation with input/output transfers is indicated by placing them in the same row.

Code line | CUDA stream 1                      | CUDA stream 2
1         | transfer input 1, transfer input 2 |
2         | kernel A                           | transfer input 4
3         | synchronization                    | transfer input 5
4         | transfer input 3, transfer output 1 | kernel B
5         | synchronization                    |
6         | kernel C                           |
7         | transfer output 2                  |

In the above CUDA stream code generation method, it can happen that the execution time of a kernel is much larger than the time of the input/output transfers that overlap with it, whereas in another overlap phase the transfer time is larger than the kernel execution time. In that case, we do not achieve the maximum benefit of overlapping; hence, we call this the simple overlap approach. To achieve more efficient overlapping, we adapt the simple overlap approach by balancing the data transfer time against the kernel execution time: we add transfers to overlap with an expensive kernel, and remove transfers if the kernel execution time is smaller than the transfer time. We call this the efficient overlap approach. The CUDA stream code is generated by the efficient overlap approach as follows:

- Transfer the inputs of the first kernel in the calculation stream;
- For all kernels:
  - Synchronize if necessary;
  - Transfer the inputs of the next kernel in a CUDA stream;
  - While the input transfer time is smaller than the kernel execution time, transfer one input of a later kernel, or one calculated output if all inputs have been transferred;
  - Execute the kernel in the other CUDA stream;
- Transfer the outputs of the last kernel.

The kernel execution times and the input/output transfer times can be extrapolated from the number of operations executed by each kernel and the size of the data to be transferred, respectively, or can be determined empirically.

3. Derive the theoretical time of the CUDA stream codes. The theoretical time of a CUDA stream code is the aggregate of the transfer and execution times in the case of no overlap. Examples are the theoretical time of row 1 in Table 6.1, which is equal to the total transfer time of inputs 1 and 2, and the theoretical time of row 6, which is the execution time of kernel C. If there is overlap between a kernel execution and data transfers, then the theoretical time is equal to the maximum of those two times. For example, the theoretical time of row 2 in Table 6.1 is the maximum of the kernel A execution time and the total transfer time of inputs 4 and 5.

4. Generate the optimal CUDA stream program. In general, from a dependency graph we can generate many calculation streams, corresponding to many generated CUDA stream codes with different theoretical times. In the final step, we choose the CUDA stream code that has the smallest theoretical time to generate the optimal CUDA stream program.

6.4 Generating the optimal CUDA stream program for the DYN routine

In this section we present how to generate the optimal CUDA stream program for the dynamics routine (DYN) of the HIRLAM weather forecast model [29].

6.4.1 The DYN routine

The DYN routine of the HIRLAM weather forecast model solves the so-called primitive equations, which describe the conservation of horizontal momentum, energy and mass (air and water), and the ideal gas law, on the grid points. Specifically, the DYN routine calculates the tendencies of surface pressure ps_t, humidity q_t, temperature T_t, and wind components (u_t, v_t) from the inputs: the vertical hybrid coordinates (A, B), the Coriolis effect f, the metric coefficients (hxu, hxv, hxt, hyu, hyv, hyt), the surface geopotential phis, the surface pressure ps, the humidity q, the temperature T, and the wind (u, v). The dependency graph of the DYN routine is shown in Figure 6.5. In this graph, p, lnp, etap, E, Z, and phi are user-defined variables; t25, t60, and t64 are temporary variables introduced by common subexpression elimination.

To derive the theoretical time of the CUDA stream code, we need information about the time to execute the kernels and the time to transfer the input/output data. The kernel execution times and the input/output transfer times depend on the computational domain. We measure these times for the domain of 512x192x64 grid points, the base domain that we used in Chapter 5. The results are shown in Tables 6.2 and 6.3.

Table 6.2: Execution time of the kernels (microseconds) for the domain of 512x192x64 grid points

Kernel | p | etap | lnp | E | Z | phi | ps_t | t60 | t64 | t25 | q_t | T_t | u_t | v_t
Time   |   |      |     |   |   |     |      |     |     |     |     |     |     |

Figure 6.5: Dependency graph of the DYN routine. The inputs and outputs are denoted by rectangles. The calculated variables are denoted by ovals. An arrow shows the dependence of a kernel calculation on an input or on a value calculated by another kernel, or the dependence of an output on a value calculated by a kernel.

Table 6.3: Transfer time of the input/output data (microseconds) for the domain of 512x192x64 grid points

Variable                                              | Size   | Copy time
1D inputs (A, B)                                      | 256 B  | 18
2D inputs (ps, f, phis, hxu, hxv, hxt, hyu, hyv, hyt) | 384 KB | 85
3D inputs (q, T, u, v)                                | 24 MB  | 4400
2D outputs (ps_t)                                     | 384 KB | 85
3D outputs (q_t, T_t, u_t, v_t)                       | 24 MB  | 4130

Table 6.3 shows that for the 3D data, transferring the input from the CPU to the GPU is more expensive than transferring the output from the GPU to the CPU. This result is also observed in other studies such as [2, 19].

6.4.2 The simple and efficient overlap approaches

In Subsection 6.3.4, we proposed two methods to generate the CUDA stream code, namely the simple and the efficient overlap approach. In this subsection we first show how the CUDA stream code is generated from a calculation stream by the simple and the efficient overlap approach, and how the theoretical time of a CUDA stream code is derived. Next, we assess the performance of the two CUDA stream code generation approaches.

As an example, Tables 6.4 and 6.5 show the CUDA stream codes generated from the calculation stream [p, t60, t64, ps_t, lnp, etap, E, Z, q_t, t25, phi, T_t, u_t, v_t], which is the calculation order of the CUDA stream code that we presented in Chapter 5, by the simple and the efficient overlap approach, respectively. Common to the two approaches is that the inputs of the first kernel p (A, B, ps) are transferred before the first kernel starts its calculation, and the output of the last kernel v_t is transferred back after its calculation has finished. In addition, the input transfers of the next kernel are overlapped with the calculation of the current kernel. For example, in both Tables 6.4 and 6.5, to calculate t60 (line 3) we need the input u; therefore, u is transferred at line 2. The difference is that with the efficient overlap approach, if the total transfer time in an overlapped phase is smaller than the kernel calculation time, we transfer more data until the total transfer time approximates the calculation time. For example, to calculate ps_t (line 5 of Tables 6.4 and 6.5), we need the inputs hyu, hxt, and hyt. With the simple overlap approach, the transfers of these inputs are overlapped with the calculation of t64 at line 4 of Table 6.4. However, the total transfer time of the inputs hyu, hxt, and hyt, which is 255 µs, is smaller than the calculation time of t64 (666 µs). Therefore, in the efficient overlap approach we add the transfers of hxv, hyv, hxu, and phis to overlap with the calculation of t64 (see Table 6.5).

The last column of Tables 6.4 and 6.5 shows the theoretical time of the CUDA stream code. The theoretical time of the CUDA stream code is the total time of all lines in the CUDA stream code. The time of a line is the transfer or calculation time in the case of no overlap, or the maximum of the transfer and calculation times if there is overlap between calculation and transfer. For example, in Table 6.4, the transfers of A, B, and ps (line 1) are not overlapped with any calculation; hence the time of line 1 is 121 µs, which is the time to transfer A, B, and ps. In line 2, the calculation of p is overlapped with the transfer of u. The transfer time of u, which is 4400 µs, is larger than the calculation time of p (230 µs). Therefore the time of line 2 is 4400 µs. Note that at this stage we cannot measure the synchronization time exactly; we do not include it in the theoretical time. In Subsection 6.4.4, we will investigate whether the synchronization time is significant for the theoretical time.
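Before turning to the measured tables, it may help to see how the rows of such a CUDA stream code map onto the CUDA API. The sketch below is our illustration, not CTADEL output: the kernels A and B, the buffers, and their sizes are assumptions, and the host buffers are assumed to be page-locked (allocated with cudaMallocHost) so that cudaMemcpyAsync can overlap with kernel execution.

    #include <cuda_runtime.h>

    __global__ void A(const float *in1, const float *in2, float *outA) { /* ... */ }
    __global__ void B(const float *in2, const float *in4, float *outB) { /* ... */ }

    void stream_code(float *h_in1, float *h_in2, float *h_in4, float *h_outA,
                     float *d_in1, float *d_in2, float *d_in4,
                     float *d_outA, float *d_outB,
                     size_t n1, size_t n2, size_t n4, size_t nA,
                     dim3 dimGrid, dim3 dimBlock)
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        /* line 1: transfer the inputs of the first kernel in stream 1 */
        cudaMemcpyAsync(d_in1, h_in1, n1, cudaMemcpyHostToDevice, s1);
        cudaMemcpyAsync(d_in2, h_in2, n2, cudaMemcpyHostToDevice, s1);

        /* line 2: kernel A in stream 1 overlaps the transfer of input 4 in stream 2 */
        A<<<dimGrid, dimBlock, 0, s1>>>(d_in1, d_in2, d_outA);
        cudaMemcpyAsync(d_in4, h_in4, n4, cudaMemcpyHostToDevice, s2);

        /* lines 3-4: kernel B (stream 2) uses input 2 transferred in stream 1,
           so synchronize on stream 1 first; the output of A is copied back in
           stream 1 while B executes in stream 2 */
        cudaStreamSynchronize(s1);
        B<<<dimGrid, dimBlock, 0, s2>>>(d_in2, d_in4, d_outB);
        cudaMemcpyAsync(h_outA, d_outA, nA, cudaMemcpyDeviceToHost, s1);

        cudaDeviceSynchronize();   /* wait for both streams before using results */
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }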

Table 6.4: The CUDA stream code and theoretical time obtained by the simple overlap approach. Overlap of a kernel calculation with input/output transfers is indicated by placing them in the same row.

Line | CUDA stream 1                           | CUDA stream 2                              | Time (µs)
1    | Transfer(A), Transfer(B), Transfer(ps)  |                                            | 121
2    | Calculation(p)                          | Transfer(u)                                | 4400
3    | Transfer(v)                             | Calculation(t60)                           | 4400
4    | Calculation(t64)                        | Transfer(hyu), Transfer(hxt), Transfer(hyt) | 666
5    | Transfer(hxv)                           | Calculation(ps_t)                          | 85
6    | Calculation(lnp)                        | Transfer(hyv), Transfer(hxu)               |
7    |                                         | Calculation(etap)                          |
8    | Calculation(E)                          | Transfer(f), Transfer(q)                   |
9    |                                         | Calculation(Z)                             |
10   | Calculation(q_t)                        | Transfer(T), Transfer(phis)                |
11   |                                         | Calculation(t25)                           |
12   | Calculation(phi)                        | Transfer(ps_t), Transfer(q_t)              |
13   |                                         | Calculation(T_t)                           |
14   | Calculation(u_t)                        | Transfer(T_t)                              |
15   | Transfer(u_t)                           | Calculation(v_t)                           |
16   | Transfer(v_t)                           |                                            | 4130
Total theoretical time                                                                      | 46451

Table 6.5: The CUDA stream code and theoretical time obtained by the efficient overlap approach

Line | CUDA stream 1                           | CUDA stream 2                              | Time (µs)
1    | Transfer(A), Transfer(B), Transfer(ps)  |                                            | 121
2    | Calculation(p)                          | Transfer(u)                                | 4400
3    | Transfer(v)                             | Calculation(t60)                           | 4400
4    | Calculation(t64)                        | Transfer(hxv), Transfer(hyv), Transfer(hxu), Transfer(hyu), Transfer(hxt), Transfer(hyt), Transfer(phis) | 666
5    |                                         | Calculation(ps_t)                          | 54
6    | Calculation(lnp)                        | Transfer(f)                                |
7    |                                         | Calculation(etap)                          |
8    | Calculation(E)                          | Transfer(q)                                |
9    |                                         | Calculation(Z)                             |
10   | Calculation(q_t)                        | Transfer(T)                                |
11   |                                         | Calculation(t25)                           |
12   | Calculation(phi)                        |                                            | 1044
13   | Transfer(ps_t), Transfer(q_t)           | Calculation(T_t)                           |
14   | Calculation(u_t)                        | Transfer(T_t)                              |
15   | Transfer(u_t)                           | Calculation(v_t)                           |
16   | Transfer(v_t)                           |                                            | 4130
Total theoretical time                                                                      | 46420
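The per-line times in Tables 6.4 and 6.5 follow the rule of step 3 in Subsection 6.3.4; a small helper makes the rule explicit. This is our sketch, not part of CTADEL:

    /* Theoretical time of one line of a CUDA stream code: the transfers and the
       kernel in the other stream overlap, so the line costs the maximum of the
       two; without overlap it is just the time of what is present. Times in µs. */
    double line_time(double calc_us, const double *xfer_us, int nxfer) {
        double transfer_us = 0.0;
        for (int i = 0; i < nxfer; i++)
            transfer_us += xfer_us[i];   /* total transfer time of the line */
        return calc_us > transfer_us ? calc_us : transfer_us;
    }

    /* Example: line 2 of Table 6.4 overlaps Calculation(p) = 230 µs with
       Transfer(u) = 4400 µs, so line_time(230.0, &u, 1) with u = 4400.0
       yields 4400.0, as in the table. */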

To evaluate the performance of the two CUDA stream code generation approaches, we derive the theoretical times of the CUDA stream codes generated from different calculation streams. From the dependency graph we can generate thousands of calculation streams, and it would take a lot of time to generate CUDA stream codes for all of them. Hence, we assess the two approaches on the calculation streams listed in Table 6.6. These calculation streams were created by hand. In order to obtain some representativity, we choose different kernels for each node: for example, the first node can be the kernel p or E, and the second node can be the kernel lnp, t60, t25, or Z.

Table 6.6: Theoretical time (µs) of the CUDA stream codes obtained by the simple and the efficient overlap approach

Calculation stream                                                 | Simple overlap | Efficient overlap
[p, lnp, t60, t64, ps_t, etap, q_t, T_t, E, Z, t25, phi, u_t, v_t] |                |
[p, t60, t64, ps_t, etap, q_t, lnp, T_t, E, Z, t25, phi, u_t, v_t] |                |
[p, t60, t64, lnp, ps_t, etap, E, Z, q_t, t25, phi, T_t, u_t, v_t] |                |
[p, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, E, Z, u_t, v_t] |                |
[p, E, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, Z, u_t, v_t] |                |
[p, E, t60, t64, lnp, ps_t, etap, Z, q_t, t25, phi, T_t, u_t, v_t] |                |
[E, p, t60, t64, ps_t, etap, q_t, lnp, T_t, Z, t25, phi, u_t, v_t] |                |
[E, p, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, Z, u_t, v_t] |                |
[E, p, Z, t25, phi, t60, t64, ps_t, etap, q_t, lnp, T_t, u_t, v_t] |                |
[E, p, Z, t60, t64, ps_t, etap, q_t, lnp, T_t, t25, phi, u_t, v_t] |                |

We observe that the CUDA stream codes generated by the efficient overlap approach are more efficient than those generated by the simple overlap approach. Therefore, in the next subsection we apply the efficient overlap approach to generate the optimal CUDA stream program.

6.4.3 Generating the optimal CUDA stream program

As presented in Subsection 6.3.4, the optimal CUDA stream program is generated through the following steps. First, based on the dependency graph, CTADEL generates the calculation streams. Figure 6.6 shows, as an example, how the first three kernels of the calculation streams are generated.

Figure 6.6: The calculation streams generated for the first three kernels. An arrow shows which kernel can be invoked after the calculation of another kernel.

Table 6.7: The optimal CUDA stream code

CUDA stream 1                                              | CUDA stream 2                 | Time (µs)
Transfer(A,B,ps)                                           |                               | 121
Calculation(p)                                             | Transfer(hxv), Transfer(hyv)  | 230
Transfer(v)                                                | Calculation(lnp)              | 4400
Calculation(t64)                                           | Transfer(u)                   | 4400
Transfer(hxu), Transfer(hyu), Transfer(hxt), Transfer(hyt) | Calculation(t60)              | 359
                                                           | Calculation(ps_t)             | 54
Transfer(f), Transfer(phis)                                | Calculation(etap)             | 1044
Calculation(Z)                                             | Transfer(q)                   | 4400
Transfer(T)                                                | Calculation(q_t)              | 4641
Calculation(T_t)                                           | Transfer(q_t), Transfer(ps_t) |
Calculation(E)                                             |                               | 791
Calculation(t25)                                           |                               | 786
Calculation(phi)                                           |                               | 1044
Calculation(u_t)                                           | Transfer(T_t)                 | 4295
Transfer(u_t)                                              | Calculation(v_t)              | 4290
Transfer(v_t)                                              |                               | 4130
Total theoretical time                                                                     | 45905

At first, either the kernel p or E can be chosen, because their calculations do not use the output of any other kernel. Next, if we choose p as the first kernel, the kernels lnp, Z, t25, t60, t64, and E can be selected as the second kernel, because the calculations of these kernels only depend on the calculation of p. Similarly, if we choose lnp as the second kernel, the kernels E, Z, t25, t60, and t64 can be chosen as the third kernel.

Second, from the calculation streams, CTADEL generates the CUDA stream codes by adding the input/output transfers, applying the efficient overlap approach. An example of a CUDA stream code obtained by the efficient overlap approach is presented in Table 6.5. Next, CTADEL derives the theoretical times of all generated CUDA stream codes. Finally, CTADEL chooses the CUDA stream code with the smallest theoretical time for the generation of the optimal CUDA stream program. Table 6.7 shows the CUDA stream code with the smallest theoretical time, 45905 µs.

6.4.4 Experiments

The experiments are performed on the system that we used in Chapter 5. The codes for the experiments are the CUDA implementations of the DYN routine generated by CTADEL. We compare the performance of the generated CUDA code with the handwritten code and with the C program. We verify the correctness of the generated CUDA programs by comparing the calculated results of DYN from the generated programs with those from the handwritten code. This comparison shows that the generated and handwritten programs reproduce bit-wise identical output.

The domain and thread structure are chosen the same as in Chapter 5: the computational domain consists of 512x192x64 grid points, the thread domain contains 1x1x32 grid points, and the block has 256x1x32 threads. The elapsed times are shown in Table 6.8. The elapsed time of the C and CUDA codes is the total execution time as seen from the CPU. For the C code, it is the calculation time on the CPU. For the CUDA code, it consists of the calculation time on the GPU and the time to transfer data between CPU and GPU. Because we did not include the synchronization time in deriving the theoretical time of the CUDA stream code, the real run times of the generated CUDA codes are larger than the theoretical times, as can be seen from Table 6.8. The small difference between the real run time and the theoretical time indicates that the synchronization time is small and can safely be neglected when deriving the theoretical time of the CUDA stream code.

Table 6.8: The theoretical and run times (in ms) of the CUDA stream programs on a GTX 480 GPU and, for reference, of the C code on an Intel i7-940 CPU

Code                               | Theoretical time | Run time
C code                             |                  | 2610
CUDA: hand code (Chapter 5)        |                  | 46.9
CUDA: stream code (Table 6.4)      | 46.5             |
CUDA: stream code (Table 6.5)      | 46.4             |
CUDA: optimal stream code          | 45.9             |

The optimal CUDA stream program generated by CTADEL is slightly faster than the handwritten code. This demonstrates that, although we applied many optimizations, the handwritten code is not optimal. This result also shows that CTADEL can generate efficient CUDA code that is comparable with optimized handwritten code. In summary, we obtain a speedup of 57 for the CUDA program over the C code.

6.5 Conclusion

We have presented how to extend CTADEL to generate CUDA programs. To overlap kernel calculations with data transfers, we introduced two approaches to generate the optimal CUDA stream program. We applied this technique to the DYN routine of the HIRLAM weather forecast model. The experimental results showed that the real run times approximate the theoretical times. This indicates that the approach we use to derive the theoretical time of the CUDA stream program is reliable. The generated code is more efficient than the handwritten code; the difference between the optimal generated code and the handwritten code is small (2%) because of the many optimizations that we had already applied to the handwritten code. The main advantages of code generation here are hence the ease of obtaining efficient code, of maintaining its efficiency across platforms, and of implementing new problems and algorithms; in other situations it may also take away the burden of manual code optimization.


More information

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications

More information

Pinned-Memory. Table of Contents. Streams Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Stream. Pinned-memory.

Pinned-Memory. Table of Contents. Streams Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Stream. Pinned-memory. Table of Contents Streams Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain miguel.cardenas@ciemat.es

More information

CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci

CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci TECHNISCHE UNIVERSITÄT WIEN Fakultät für Informatik Cyber-Physical Systems Group CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci Outline of CUDA Basics Basic Kernels and Execution on GPU

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing GPGPU general-purpose

More information

EEM528 GPU COMPUTING

EEM528 GPU COMPUTING EEM528 CS 193G GPU COMPUTING Lecture 2: GPU History & CUDA Programming Basics Slides Credit: Jared Hoberock & David Tarjan CS 193G History of GPUs Graphics in a Nutshell Make great images intricate shapes

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Memory spaces and memory access Shared memory Examples Lecture questions: 1. Suggest two significant

More information

Zero-copy. Table of Contents. Multi-GPU Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Zero-copy. Multigpu.

Zero-copy. Table of Contents. Multi-GPU Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Zero-copy. Multigpu. Table of Contents Multi-GPU Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes 2 Zero-copy Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain

More information

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh.

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh. GPGPU Alan Gray/James Perry EPCC The University of Edinburgh a.gray@ed.ac.uk Contents Introduction GPU Technology Programming GPUs GPU Performance Optimisation 2 Introduction 3 Introduction Central Processing

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Josef Pelikán, Jan Horáček CGG MFF UK Praha GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing

More information

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Cartoon parallel architectures; CPUs and GPUs

Cartoon parallel architectures; CPUs and GPUs Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD

More information

CUDA C/C++ BASICS. NVIDIA Corporation

CUDA C/C++ BASICS. NVIDIA Corporation CUDA C/C++ BASICS NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on industry-standard C/C++ Small set of extensions

More information

CUDA Parallel Programming Model. Scalable Parallel Programming with CUDA

CUDA Parallel Programming Model. Scalable Parallel Programming with CUDA CUDA Parallel Programming Model Scalable Parallel Programming with CUDA Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of

More information

CS 179: GPU Computing. Lecture 2: The Basics

CS 179: GPU Computing. Lecture 2: The Basics CS 179: GPU Computing Lecture 2: The Basics Recap Can use GPU to solve highly parallelizable problems Performance benefits vs. CPU Straightforward extension to C language Disclaimer Goal for Week 1: Fast-paced

More information

Data Parallel Execution Model

Data Parallel Execution Model CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling

More information

CUDA Parallel Programming Model Michael Garland

CUDA Parallel Programming Model Michael Garland CUDA Parallel Programming Model Michael Garland NVIDIA Research Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of a parallel

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today

More information

Computational Fluid Dynamics (CFD) using Graphics Processing Units

Computational Fluid Dynamics (CFD) using Graphics Processing Units Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores

More information

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal

More information

Matrix Multiplication in CUDA. A case study

Matrix Multiplication in CUDA. A case study Matrix Multiplication in CUDA A case study 1 Matrix Multiplication: A Case Study Matrix multiplication illustrates many of the basic features of memory and thread management in CUDA Usage of thread/block

More information

CS179 GPU Programming: CUDA Memory. Lecture originally by Luke Durant and Tamas Szalay

CS179 GPU Programming: CUDA Memory. Lecture originally by Luke Durant and Tamas Szalay : CUDA Memory Lecture originally by Luke Durant and Tamas Szalay CUDA Memory Review of Memory Spaces Memory syntax Constant Memory Allocation Issues Global Memory Gotchas Shared Memory Gotchas Texture

More information

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan CUDA Workshop High Performance GPU computing EXEBIT- 2014 Karthikeyan CPU vs GPU CPU Very fast, serial, Low Latency GPU Slow, massively parallel, High Throughput Play Demonstration Compute Unified Device

More information

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics 1 GPU Computing Workshop CSU 2013 Getting Started Garland Durham Quantos Analytics nvidia-smi 2 At command line, run command nvidia-smi to get/set GPU properties. nvidia-smi Options: -q query -L list attached

More information