Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library


Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library: Benchmarking the NVIDIA GPU

A White Paper by Rogue Wave Software. October, 2010

Rogue Wave Software
5500 Flatiron Parkway, Suite 200
Boulder, CO 80301, USA
ave.com

Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library: Benchmarking the NVIDIA GPU
by Rogue Wave Software

© 2010 by Rogue Wave Software. All Rights Reserved. Printed in the United States of America.

Publishing History: October, 2010

Trademark Information
The Rogue Wave Software name and logo, SourcePro, Stingray, HostAccess, IMSL and PV-WAVE are registered trademarks of Rogue Wave Software, Inc. or its subsidiaries in the US and other countries. JMSL, JWAVE, TS-WAVE, PyIMSL and Knowledge in Motion are trademarks of Rogue Wave Software, Inc. or its subsidiaries. All other company, product or brand names are the property of their respective owners.

IMPORTANT NOTICE: The information contained in this document is subject to change without notice. Rogue Wave Software, Inc. makes no warranty of any kind with regards to this material, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Rogue Wave Software, Inc. shall not be liable for errors contained herein or for incidental, consequential, or other indirect damages in connection with the furnishing, performance, or use of this material.

Abstract

The use of the NVIDIA GPU with the corresponding CUDA BLAS Library and the IMSL Fortran Numerical Library is an effective means of boosting performance for problem sizes above certain thresholds. A broad set of Level 2 and Level 3 BLAS functions has been implemented, and benchmarks are presented for GEMM, TRMM, GEMV and GER. These benchmarks have been performed on standard systems using the publicly available version of the software, and while we expect you should get similar results, it is always best to evaluate the algorithms you use on your deployment hardware for best performance. The CUDA BLAS versions can be hundreds of times faster than general BLAS written in Fortran, and up to eight times faster than vendor-supplied hardware-optimized BLAS running on four CPU cores.

TABLE OF CONTENTS

Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library
Use of NVIDIA BLAS with the IMSL FORTRAN Numerical Library
Strategy
The Benchmarks
Measurements
Linux with Intel Fortran Professional Compiler (64-bit)
Speedup Summary
Note About Floating Point Exception Handling
Sample Benchmarking Code
Conclusion
About the Author

Use of NVIDIA BLAS with the IMSL FORTRAN Numerical Library

In recent years, traditional high-performance hardware has been supplemented with graphics processing units once used only for 3D visualization. These general-purpose graphics processing units (GPGPUs) have matured to the point that BLAS packages are now available for them and both single- and double-precision calculations are supported. Together, these two facts indicate that the environment is mature enough for general-purpose libraries such as IMSL to consider leveraging the hardware.

In the GPGPU market, the clear leader is NVIDIA. Its CUDA BLAS library is widely used in conjunction with the hardware as an enabling technology for higher-level applications. NVIDIA Corp. implemented certain Level 1, 2 and 3 BLAS in its library (CUDA CUBLAS Library, V3.1, July). The NVIDIA external names and argument protocols differ from the equivalent Fortran names and argument addressing. See Table 1.0 for the names marked in the color GREEN. IMSL has written these marked Fortran BLAS so that they call the equivalent NVIDIA C language codes from the CUBLAS library.

A Fortran programmer needs no direct use or knowledge of C to take advantage of these codes. However, the user code or application package must be compiled with a Fortran 2003 compiler that implements the C Interoperability Standard feature. See The Fortran 2003 Handbook, Adams, et al. IMSL's use of this feature is the key to providing a portable version of these Fortran-callable IMSL/NVIDIA BLAS. The program or application is then compiled and linked using the IMSL and NVIDIA libraries that contain these BLAS.

Strategy

The strategy for using the attached NVIDIA GPU is given by the following algorithm:

If the maximum of the vector or matrix dimensions is larger than a switchover array size, NSTART, and NVIDIA provides a CUBLAS code, then
  o Copy the required vector and matrix data from the CPU to the GPU
  o Compute the result on the GPU
  o Copy the result from the GPU to the CPU
Else, use a Fortran equivalent version of the BLAS routine that does not use the GPU.

Normally, code that calls an IMSL/NVIDIA BLAS routine does not have to be aware of the copy steps or the switchover size, NSTART. These are hidden from the user code. In the first algorithm step, a working block is allocated on the GPU for each array argument.

A table within the IMSL module, CUBLAS_LIBRARY, records the sizes and GPU addresses of these blocks. If the sizes are too small for the current problem size and data type, the blocks are reallocated to be of adequate size. The same working block on the GPU may be used for many calls to the IMSL/NVIDIA BLAS.

The IMSL versions of the BLAS also allow a user to define individual values of NSTART for each routine. This is important because using the GPU may be slower than using a CPU Fortran version until a switchover array size is reached. Beyond that size the GPU version is typically faster, and its advantage grows as the problem size increases. The default value of NSTART = 32 is used for each vector argument of each routine, but it may not be optimal. This default allows the routines to function correctly without initial attention to this value. The user can reset this value for each individual routine in the listings of Table 1.0 marked with the color GREEN by using the IMSL routine CUBLAS_SET( ). This must be done before any use of the CUDA CUBLAS routine occurs. The switchover values can be obtained using the IMSL routine CUBLAS_GET( ).

The floating point results obtained using the CPU vs. the GPU will likely differ in units of the low order bits in each component. These differences come from non-equivalent strategies of floating point arithmetic and rounding modes that are implemented in the NVIDIA board. This can be an important detail when comparing results for purposes of benchmarking or code regression. Generally, either result should be acceptable for numerical work.

As an added feature, the user can flag when the data values for a vector or matrix are already present on the GPU and hence suppress the IMSL/NVIDIA BLAS code from first copying the data. This is often important since the data movement from the CPU to the GPU may be a significant part of the computation time. If there is no indication that the data is present, it is copied from the CPU to the GPU each time a routine is called. The necessity of copying for each use of a BLAS code depends on the application. Valid results are always copied back from the GPU to the CPU memory.

A negative switchover value for a positional array argument indicates that the data for that argument requires no initial copy step. The absolute value is used as the switchover value. When utilizing this feature, it is important that the user reset this to a positive value when the argument again requires an initial copy step.

There are four utility routines provided in the IMSL module CUDABLAS_LIBRARY that can be used to help manage the use of NVIDIA BLAS. These utilities can be used to:
  o Get the current switchover point value
  o Set the switchover point value to a new value
  o Maintain buffer sizes on the NVIDIA device
  o Print error messages generated through the use of the NVIDIA device using the IMSL error handler
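The switchover rule described above can be summarized in a short sketch. This is not IMSL code: the module name switchover_sketch and the functions use_gpu and needs_copy are hypothetical, and the rule encoded here is only one reading of the text (GPU when the largest dimension exceeds the absolute switchover value, a zero value forcing the Fortran version, a negative value meaning the data is already on the GPU).

```fortran
! Sketch of the switchover rule, assuming the reading stated above.
! The names below are hypothetical and are not part of IMSL or CUBLAS.
module switchover_sketch
  implicit none
  private
  public :: use_gpu, needs_copy
contains

  ! The GPU path is taken when a CUBLAS version exists, the switchover
  ! value is nonzero, and the largest dimension exceeds |nstart|.
  pure logical function use_gpu(max_dim, nstart, has_cublas)
    integer, intent(in) :: max_dim, nstart
    logical, intent(in) :: has_cublas
    use_gpu = has_cublas .and. nstart /= 0 .and. max_dim > abs(nstart)
  end function use_gpu

  ! A negative switchover value flags data already resident on the GPU,
  ! so only a positive value triggers the initial CPU-to-GPU copy.
  pure logical function needs_copy(nstart)
    integer, intent(in) :: nstart
    needs_copy = nstart > 0
  end function needs_copy

end module switchover_sketch

program demo_switchover
  use switchover_sketch
  implicit none
  integer, parameter :: nstart_default = 32   ! default quoted in the text

  print *, use_gpu(16,   nstart_default, .true.)   ! F: below the switchover size
  print *, use_gpu(1000, nstart_default, .true.)   ! T: large enough for the GPU
  print *, use_gpu(1000, 0,              .true.)   ! F: zero forces the Fortran version
  print *, use_gpu(1000, nstart_default, .false.)  ! F: no CUBLAS version available
  print *, needs_copy(nstart_default)              ! T: data must be copied first
  print *, needs_copy(-nstart_default)             ! F: data flagged as already on the GPU
end program demo_switchover
```

Keeping the "already on the GPU" flag in the sign of the switchover value means a single integer per array argument carries both settings, which is why the text asks the user to restore a positive value when a fresh copy is needed.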

Some NVIDIA hardware does not have working double precision versions of the BLAS because there is no double precision arithmetic available. However, the double precision code itself is part of the CUDA CUBLAS library. It will appear to execute even though it will not give correct results when the device has no double precision arithmetic. When the IMSL software detects that the correct results are not returned, a warning error message will be printed. The user may instruct the application to henceforth use the Fortran code by setting the switchover value to zero. For example, if it is known that the hardware does not support DOUBLE PRECISION, then a code that has calls to DGEMM will use an alternate version of this routine if the switchover value for DGEMM has been reset to zero.

Table 1.0: Level 2 and Level 3 Basic Linear Algebra Subprograms
GREEN Denotes NVIDIA Version Available

Operation                                                   Real     Double   Complex  Double Complex
Matrix-vector Multiply, General                             SGEMV    DGEMV    CGEMV    ZGEMV
Matrix-vector Multiply, Banded                              SGBMV    DGBMV    CGBMV    ZGBMV
Rank-One Matrix Update, General and Real                    SGER     DGER
Rank-One Matrix Update, General, Complex and Transpose                        CGERU    ZGERU
Rank-One Matrix Update, Symmetric and Real                  SSYR     DSYR
Matrix-matrix Multiply, General                             SGEMM    DGEMM    CGEMM    ZGEMM
Matrix-matrix Multiply, Symmetric                           SSYMM    DSYMM    CSYMM    ZSYMM
Matrix-matrix Multiply, Hermitian                                             CHEMM    ZHEMM
Rank-k Update, Hermitian                                                      CHERK    ZHERK
Rank-2k Update, Symmetric                                   SSYR2K   DSYR2K   CSYR2K   ZSYR2K
Matrix-matrix Multiply, Triangular                          STRMM    DTRMM    CTRMM    ZTRMM
Matrix-matrix Solve, Triangular                             STRSM    DTRSM    CTRSM    ZTRSM

The Benchmarks

The performance of selected BLAS was measured on multi-core systems using the NVIDIA Tesla C2050. Each function was used to solve a problem large enough to guarantee that the NVIDIA CUDA BLAS was used and to significantly outweigh the time required for the data copy between the CPU and GPU. Each test case was run with a varying number of threads allowed. The number of threads was set using the OpenMP environment variable OMP_NUM_THREADS. The Intel 11.1 FORTRAN compiler was used for both tested environments.

Note that in order to realize the performance gains recorded for the GEMV routines, the steps needed to keep the data on the GPU (discussed above) were implemented.

Measurements

The benchmark results are reported as elapsed wall clock times for each test. Comparative times are given for an alternate Fortran version of the BLAS routine, a vendor-supplied version of the BLAS routine, and the corresponding NVIDIA CUDA BLAS routine.

Linux with Intel Fortran Professional Compiler (64-bit)

CPU Hardware: Dual Quad Core Xeon E5420 (Harpertown) 2.5GHz, 1333MHz Front Side Bus
GPU Hardware: NVIDIA Tesla C2050
Operating System: Red Hat 5
Compiler: Intel Fortran 11.1 (64-bit)

[Tables and charts: SGEMM Times; DGEMM Times; CGEMM Times]

[Table and chart: ZGEMM Times]

Note that for the array size of 8000 for ZGEMM, a CPU to GPU copy failure occurred because the array was too large. Therefore, the error checking was activated and the benchmark dropped down to the pure Fortran version of the algorithm.


[Tables and charts, matrix kept on the GPU, vector updated 30 times: SGEMV Times (ms); DGEMV Times (ms); CGEMV Times (ms); ZGEMV Times (ms)]


[Tables and charts, matrix kept on the GPU, vector updated 30 times: SGER Times (ms); DGER Times (ms); CGER Times (ms); ZGER Times (ms)]


[Tables and charts: STRMM Times (s); DTRMM Times (s); CTRMM Times (s); ZTRMM Times (s)]

Note that for the array size of 8000 for ZTRMM, a CPU to GPU copy failure occurred because the array was too large. Therefore, the error checking was activated and the benchmark dropped down to the pure Fortran version of the algorithm.

Speedup Summary

The tables and charts shown above present raw timings from the benchmark test suite. Of interest to many users is the speedup observed when using the GPU hardware relative to another case. Here speedup is defined as the ratio of times, T[CPU]/T[GPU]. A value of 1 indicates that the test takes the same time on either hardware, while values greater than 1 indicate cases where running on the GPU hardware is favorable. A value of 2 would mean that the code ran in half the time on the GPU card.

All of the above tests can be combined into a single heatmap-style chart where results are shaded relative to the computed speedup value. The following two charts compare the NVIDIA GPU runs first against the pure Fortran BLAS implementation and then against the most favorable case for the CPU, an optimized BLAS library with 4 threads.

[Heatmap: T[CPU]/T[GPU] for Fortran BLAS. Columns: S, D, C, Z. Rows: GEMM, TRMM, GEMV (30 repeats), GER (30 repeats)]

[Heatmap: T[CPU]/T[GPU] for Vendor BLAS with 4 cores. Columns: S, D, C, Z. Rows: GEMM, TRMM, GEMV (30 repeats), GER (30 repeats)]

The speedup against pure Fortran is considerable in most cases, up to almost 400 times faster for a large SGEMM problem. A highly optimized BLAS library using 4 threads is a much tougher baseline, but the GPU hardware still does very well even for the relatively small 500x500 problem size for Level 3 BLAS functions. In both charts the large ZGEMM and ZTRMM problems that encountered memory issues in the copy phase are clearly evident. The need for rather large problems to see performance increases using the Level 2 BLAS function GEMV is also very clear. In general, the GER function should not be used in a standalone manner on the GPU card; instead, this function is available for convenience when users can keep data on the card and then perform this operation without additional data I/O.

Note About Floating Point Exception Handling

If exceptions resulting in NaN or Inf are a concern, then the user code must examine the output for the occurrence of these. However, details about gradual underflow or the occurrence of underflow or an invalid operation are not available from the GPU. If these are to be detected in a running program within the BLAS, then one must use the Fortran version.

Sample Benchmarking Code

The following code benchmarks the BLAS routines SGEMV, DGEMV, CGEMV, and ZGEMV. This code requires the BLAS_INTERFACE module from the IMSL Library. The array breakpoints are set so that a Fortran version of the BLAS is used first, then reset so that the NVIDIA version of the BLAS is used. Each benchmark is run 30 times with the initial matrix kept on the GPU in order to realize the performance gains. The Fortran intrinsic, system_clock, is used to measure the actual elapsed ("wall clock") time in seconds.

PROGRAM BENCHMARK
  USE BLAS_INTERFACE
  IMPLICIT NONE
  INTEGER I, jj
  !integer, PARAMETER :: NBASE=1000, NREPEATS=10, INC=1000, NCASES=09
  !INTEGER, PARAMETER :: NVALUES(NCASES)=(/(NBASE+I*INC,I=1,NCASES)/)
  !REAL(skind) :: TF(NCASES),TN(NCASES)
  REAL(SKIND), ALLOCATABLE :: TF(:), TN(:)
  INTEGER :: NBASE, NREPEATS, INC, NCASES, MODE
  INTEGER, ALLOCATABLE :: NVALUES(:)
  ! Full list of problem sizes; the first NCASES entries are used.
  INTEGER, PARAMETER :: ALLSIZES(5) = (/500, 1000, 2000, 4000, 8000/)

  ! By default sizes of 500, 1000, 2000, 4000, 8000 are used. You
  ! can shorten execution time by executing fewer cases. For example
  ! when NCASES=3 only 500, 1000, and 2000 are executed.
  NCASES=5
  ! Can be used for averaging. Each sample size is executed NREPEATS
  ! times and then averaged. While repeating, arrays which are not
  ! changed are left on the GPU. If multiple runs with the same
  ! data array are not used, using the GPU is not cost effective.
  NREPEATS=30
  ! Mode=1 both NVIDIA and default Fortran are executed.
  ! Mode=2 only NVIDIA is executed.
  MODE=1
  ALLOCATE(NVALUES(NCASES),TF(NCASES),TN(NCASES))
  ! NVALUES=(/(NBASE+(I-1)*INC,I=1,NCASES)/)
  NVALUES = ALLSIZES(1:NCASES)

  do jj=1,nrepeats
    write(*,'(//"================ Number of Repeats = ",i4)') jj
    TF=0; TN=0
    CALL BENCH_Sgemv(NVALUES, jj, TF, TN)
    write(*,'(//"(sgemv Times: Fortran then NVIDIA, F/N)")')
    DO I=1,NCASES
      WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I)
    END DO
    TF=0; TN=0
    CALL BENCH_Dgemv(NVALUES, jj, TF, TN)
    write(*,'(//"(dgemv Times: Fortran then NVIDIA, F/N)")')
    DO I=1,NCASES
      WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I)
    END DO
    TF=0; TN=0
    CALL BENCH_Cgemv(NVALUES, jj, TF, TN)
    write(*,'(//"(cgemv Times: Fortran then NVIDIA, F/N)")')
    DO I=1,NCASES
      WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I)
    END DO
    TF=0; TN=0
    CALL BENCH_Zgemv(NVALUES, jj, TF, TN)
    write(*,'(//"(zgemv Times: Fortran then NVIDIA, F/N)")')
    DO I=1,NCASES
      WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I)
    END DO
  end do

CONTAINS

  SUBROUTINE BENCH_Sgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)
    ! Benchmark NVIDIA and Fortran BLAS code, Sgemv.
    INTEGER, INTENT(IN) :: NVALUES(:)
    INTEGER, INTENT(IN) :: NREPEATS
    REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)
    ! Type variables used in calling routines:
    REAL(skind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:)
    REAL(skind) ALPHA, BETA
    REAL(skind) TEMP, NORM, RERR
    INTEGER :: I, II, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX
    integer :: iend, istart, icount_rate
    INTEGER :: ISWITCH

    NCASE=size(NVALUES)
    NMIN=minval(NVALUES)
    NMAX=maxval(NVALUES)
    DO I = 1,NCASE
      N=NVALUES(I)
      LDA=N
      LDB=1
      LDC=1
      ALLOCATE(A(LDA,N), B(LDB*N), C(LDC*N),CSAVE(LDC*N))
      CALL RANDOM_NUMBER(A)
      CALL RANDOM_NUMBER(B)
      ALPHA=SONE
      BETA = SZERO
      ! Set breakover value to use Fortran and then NVIDIA:
      IF(MODE == 1) THEN
        DO II=1,3
          CALL CUBLAS_SET(CUDABLAS_Sgemv,II,0)
        END DO
        call system_clock(count=istart, COUNT_RATE=icount_rate)
        DO J=1,NREPEATS
          CALL Sgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        END DO
        call system_clock(count=iend)
        TFORTRAN(I)=float(iend-istart)/icount_rate
        CSAVE=C
        NORM=maxval(abs(C))
      ELSE
        TFORTRAN=0
      END IF
      ! Use NVIDIA for all dimensions here:
      DO II=1,3
        CALL CUBLAS_SET(CUDABLAS_Sgemv,II,NMIN-1)
      END DO
      C=SZERO
      call system_clock(count=istart, COUNT_RATE=icount_rate)
      ISWITCH=CUBLAS_GET(CUDABLAS_Sgemv,1)
      DO J=1,NREPEATS
        CALL Sgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        ! Time without copy each use:
        CALL CUBLAS_SET(CUDABLAS_Sgemv,1,-ABS(ISWITCH))
      END DO
      CALL CUBLAS_SET(CUDABLAS_Sgemv,1,ISWITCH)
      call system_clock(count=iend)
      TNVIDIA(I)=float(iend-istart)/icount_rate
      ! Compute relative error norm:
      IF(MODE == 1) RERR=MAXVAL(ABS(C-CSAVE))/NORM
      DEALLOCATE(A,B,C,CSAVE)
      IF(MODE == 2) CYCLE
      IF(RERR <= SQRT(EPSILON(SONE))) CYCLE
      ! Quit with no timing if results do not agree.
      TNVIDIA=SZERO
      TFORTRAN=SZERO
      RETURN
    END DO ! I
    ! Average the time per call.
    TNVIDIA=TNVIDIA/REAL(NREPEATS, skind)
    TFORTRAN=TFORTRAN/REAL(NREPEATS, skind)
  END SUBROUTINE BENCH_Sgemv

  SUBROUTINE BENCH_Dgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)
    ! Benchmark NVIDIA and Fortran BLAS code, Dgemv.
    INTEGER, INTENT(IN) :: NVALUES(:)
    INTEGER, INTENT(IN) :: NREPEATS
    REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)
    ! Type variables used in calling routines:
    REAL(dkind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:)
    REAL(dkind) ALPHA, BETA
    REAL(dkind) TEMP, NORM, RERR
    INTEGER :: I, II, ISWITCH, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX
    integer :: iend, istart, icount_rate

    NCASE=size(NVALUES)
    NMIN=minval(NVALUES)
    NMAX=maxval(NVALUES)
    DO I = 1,NCASE
      N=NVALUES(I)
      LDA=N
      LDB=1
      LDC=1
      ALLOCATE(A(LDA,N), B(LDB*N), C(LDC*N),CSAVE(LDC*N))
      CALL RANDOM_NUMBER(A)
      CALL RANDOM_NUMBER(B)
      ALPHA=DONE
      BETA = DZERO
      ! Set breakover value to use Fortran and then NVIDIA:
      IF(MODE == 1) THEN
        DO II=1,3
          CALL CUBLAS_SET(CUDABLAS_Dgemv,II,0)
        END DO
        call system_clock(count=istart, COUNT_RATE=icount_rate)
        DO J=1,NREPEATS
          CALL Dgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        END DO
        call system_clock(count=iend)
        TFORTRAN(I)=float(iend-istart)/icount_rate
        CSAVE=C
        NORM=maxval(abs(C))
      ELSE
        TFORTRAN=0
      END IF
      ! Use NVIDIA for all dimensions here:
      DO II=1,3
        CALL CUBLAS_SET(CUDABLAS_Dgemv,II,NMIN-1)
      END DO
      C=DZERO
      call system_clock(count=istart, COUNT_RATE=icount_rate)
      ISWITCH=CUBLAS_GET(CUDABLAS_Dgemv,1)
      DO J=1,NREPEATS
        CALL Dgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        ! Time without copy each use:
        CALL CUBLAS_SET(CUDABLAS_Dgemv,1,-ABS(ISWITCH))
      END DO
      CALL CUBLAS_SET(CUDABLAS_Dgemv,1,ISWITCH)
      call system_clock(count=iend)
      TNVIDIA(I)=float(iend-istart)/icount_rate
      ! Compute relative error norm:
      IF(MODE==1) RERR=MAXVAL(ABS(C-CSAVE))/NORM
      DEALLOCATE(A,B,C,CSAVE)
      IF(MODE==2) CYCLE
      IF(RERR <= SQRT(EPSILON(DONE))) CYCLE
      ! Quit with no timing if results do not agree.
      TNVIDIA=SZERO
      TFORTRAN=SZERO
      RETURN
    END DO ! I
    ! Average the time per call.
    TNVIDIA=TNVIDIA/REAL(NREPEATS, dkind)
    TFORTRAN=TFORTRAN/REAL(NREPEATS, dkind)
  END SUBROUTINE BENCH_Dgemv

  SUBROUTINE BENCH_Cgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)
    ! Benchmark NVIDIA and Fortran BLAS code, Cgemv.
    INTEGER, INTENT(IN) :: NVALUES(:)
    INTEGER, INTENT(IN) :: NREPEATS
    REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)
    ! Type variables used in calling routines:
    COMPLEX(skind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:)
    REAL(skind),allocatable :: S(:),T(:)
    COMPLEX(skind) ALPHA, BETA
    REAL(skind) TEMP, NORM, RERR
    INTEGER :: I, II, ISWITCH, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX
    integer :: iend, istart, icount_rate

    NCASE=size(NVALUES)
    NMIN=minval(NVALUES)
    NMAX=maxval(NVALUES)
    DO I = 1,NCASE
      N=NVALUES(I)
      LDA=N
      LDB=1
      LDC=1
      ALLOCATE(S(N), T(N),A(LDA,N), B(LDB*N),&
               C(LDC*N),CSAVE(LDC*N))
      ! Fill A and B with random complex values, one column at a time.
      DO J=1,N
        call random_number(s)
        call random_number(t)
        A(:,J)=cmplx(S,T,skind)
      END DO
      call random_number(s)
      call random_number(t)
      B(:)=cmplx(S,T,skind)
      ! CALL RANDOM_NUMBER(A)
      ! CALL RANDOM_NUMBER(B)
      ALPHA=SONE
      BETA = SZERO
      ! Set breakover value to use Fortran and then NVIDIA:
      IF(MODE == 1) THEN
        DO II=1,3
          CALL CUBLAS_SET(CUDABLAS_Cgemv,II,0)
        END DO
        call system_clock(count=istart, COUNT_RATE=icount_rate)
        DO J=1,NREPEATS
          CALL Cgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        END DO
        call system_clock(count=iend)
        TFORTRAN(I)=float(iend-istart)/icount_rate
        CSAVE=C
        NORM=maxval(abs(C))
      ELSE
        TFORTRAN=0
      END IF
      ! Use NVIDIA for all dimensions here:
      DO II=1,3
        CALL CUBLAS_SET(CUDABLAS_Cgemv,II,NMIN-1)
      END DO
      call system_clock(count=istart, COUNT_RATE=icount_rate)
      ISWITCH=CUBLAS_GET(CUDABLAS_Cgemv,1)
      DO J=1,NREPEATS
        CALL Cgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        ! Time without copy each use:
        CALL CUBLAS_SET(CUDABLAS_Cgemv,1,-ABS(ISWITCH))
      END DO
      CALL CUBLAS_SET(CUDABLAS_Cgemv,1,ISWITCH)
      call system_clock(count=iend)
      TNVIDIA(I)=float(iend-istart)/icount_rate
      ! Compute relative error norm:
      IF(MODE==1) RERR=MAXVAL(ABS(C-CSAVE))/NORM
      DEALLOCATE(A,B,C,CSAVE,S,T)
      IF(MODE==2) CYCLE
      IF(RERR <= SQRT(EPSILON(SONE))) CYCLE
      ! Quit with no timing if results do not agree.
      TNVIDIA=SZERO
      TFORTRAN=SZERO
      RETURN
    END DO ! I
    ! Average the time per call.
    TNVIDIA=TNVIDIA/REAL(NREPEATS, skind)
    TFORTRAN=TFORTRAN/REAL(NREPEATS, skind)
  END SUBROUTINE BENCH_Cgemv

  SUBROUTINE BENCH_Zgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)
    ! Benchmark NVIDIA and Fortran BLAS code, Zgemv.
    INTEGER, INTENT(IN) :: NVALUES(:)
    INTEGER, INTENT(IN) :: NREPEATS
    REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)
    ! Type variables used in calling routines:
    COMPLEX(dkind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:)
    REAL(dkind),allocatable :: S(:),T(:)
    COMPLEX(dkind) ALPHA, BETA
    REAL(dkind) TEMP, NORM, RERR
    INTEGER :: I, II, ISWITCH, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX
    integer :: iend, istart, icount_rate

    NCASE=size(NVALUES)
    NMIN=minval(NVALUES)
    NMAX=maxval(NVALUES)
    DO I = 1,NCASE
      N=NVALUES(I)
      LDA=N
      LDB=1
      LDC=1
      ALLOCATE(S(N), T(N),A(LDA,N), B(LDB*N),&
               C(LDC*N),CSAVE(LDC*N))
      ! Fill A and B with random complex values, one column at a time.
      DO J=1,N
        call random_number(s)
        call random_number(t)
        A(:,J)=cmplx(S,T,dkind)
      END DO
      call random_number(s)
      call random_number(t)
      B(:)=cmplx(S,T,dkind)
      ! CALL RANDOM_NUMBER(A)
      ! CALL RANDOM_NUMBER(B)
      ALPHA=SONE
      BETA = SZERO
      ! Set breakover value to use Fortran and then NVIDIA:
      IF(MODE == 1) THEN
        DO II=1,3
          CALL CUBLAS_SET(CUDABLAS_Zgemv,II,0)
        END DO
        call system_clock(count=istart, COUNT_RATE=icount_rate)
        DO J=1,NREPEATS
          CALL Zgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        END DO
        call system_clock(count=iend)
        TFORTRAN(I)=float(iend-istart)/icount_rate
        CSAVE=C
        NORM=maxval(abs(C))
      ELSE
        TFORTRAN=0
      END IF
      ! Use NVIDIA for all dimensions here:
      DO II=1,3
        CALL CUBLAS_SET(CUDABLAS_Zgemv,II,NMIN-1)
      END DO
      call system_clock(count=istart, COUNT_RATE=icount_rate)
      ISWITCH=CUBLAS_GET(CUDABLAS_Zgemv,1)
      DO J=1,NREPEATS
        CALL Zgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        ! Time without copy each use:
        CALL CUBLAS_SET(CUDABLAS_Zgemv,1,-ABS(ISWITCH))
      END DO
      CALL CUBLAS_SET(CUDABLAS_Zgemv,1,ISWITCH)
      call system_clock(count=iend)
      TNVIDIA(I)=float(iend-istart)/icount_rate
      ! Compute relative error norm:
      IF(MODE==1) RERR=MAXVAL(ABS(C-CSAVE))/NORM
      DEALLOCATE(A,B,C,CSAVE,S,T)
      IF(MODE==2) CYCLE
      IF(RERR <= SQRT(EPSILON(SONE))) CYCLE
      ! Quit with no timing if results do not agree.
      TNVIDIA=SZERO
      TFORTRAN=SZERO
      RETURN
    END DO ! I
    ! Average the time per call.
    TNVIDIA=TNVIDIA/REAL(NREPEATS, dkind)
    TFORTRAN=TFORTRAN/REAL(NREPEATS, dkind)
  END SUBROUTINE BENCH_Zgemv

END PROGRAM BENCHMARK
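The timing and agreement check used by the benchmark can be tried without IMSL or a GPU. The following is a minimal sketch under that assumption: it times two pure Fortran matrix-vector products (the intrinsic MATMUL and an explicit loop) with system_clock, averages over the repeats, and accepts the timings only if the relative error is within sqrt(epsilon), as the listing does. The program name, the problem size, and the choice of the two implementations are illustrative and do not come from the benchmark suite.

```fortran
! Stand-alone sketch of the benchmark's timing pattern; no IMSL or GPU needed.
! All names here are illustrative assumptions, not part of the IMSL listing.
program timing_sketch
  implicit none
  integer, parameter :: sp = kind(1.0)
  integer, parameter :: n = 500, nrepeats = 30
  real(sp), allocatable :: a(:,:), x(:), y_ref(:), y(:)
  integer :: istart, iend, icount_rate, i, j, k
  real(sp) :: t_ref, t_loop, rerr

  allocate(a(n,n), x(n), y_ref(n), y(n))
  call random_number(a)
  call random_number(x)

  ! Reference version: intrinsic MATMUL, timed over nrepeats calls.
  call system_clock(count=istart, count_rate=icount_rate)
  do k = 1, nrepeats
     y_ref = matmul(a, x)
  end do
  call system_clock(count=iend)
  t_ref = real(iend - istart, sp) / real(icount_rate, sp) / real(nrepeats, sp)

  ! Alternative version: explicit column-oriented loops.
  call system_clock(count=istart)
  do k = 1, nrepeats
     y = 0.0_sp
     do j = 1, n
        do i = 1, n
           y(i) = y(i) + a(i,j) * x(j)
        end do
     end do
  end do
  call system_clock(count=iend)
  t_loop = real(iend - istart, sp) / real(icount_rate, sp) / real(nrepeats, sp)

  ! Relative error in the max norm, with the same tolerance as the listing.
  rerr = maxval(abs(y - y_ref)) / maxval(abs(y_ref))
  if (rerr > sqrt(epsilon(1.0_sp))) then
     print *, 'results do not agree, relative error =', rerr
  else
     print *, 'average time per call (s), MATMUL:', t_ref
     print *, 'average time per call (s), loops: ', t_loop
     ! Guard against a clock too coarse to resolve the second timing.
     if (t_loop > 0.0_sp) print *, 'ratio MATMUL/loops:', t_ref / t_loop
  end if
end program timing_sketch
```

Averaging over many repeats matters because a single call at these sizes can be shorter than one tick of the system clock; the sketch ignores the possibility of the clock count wrapping around between the two readings.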

Conclusion

The use of the NVIDIA GPU with the corresponding CUDA BLAS Library and the IMSL FORTRAN Numerical Library is an effective means of boosting performance for problem sizes above certain thresholds. In some cases, users need to be familiar enough with their application with regard to keeping data on the GPU in order to realize the full benefit of using the CUDA BLAS Library with the IMSL Library. It is expected that further leveraging of the CUDA BLAS Library will be available when demand and scheduling suggest adding additional IMSL interface codes to this library. Finally, IMSL has performed these benchmarks on standard systems using the publicly available version of the software, and while we expect you should get similar results, it is always best to evaluate the algorithms you use on your deployment hardware for best performance.

About the Author

Edward Stewart is the product manager for the IMSL Numerical Libraries and Director of Research at Rogue Wave Software. Ed received his Ph.D. in physical ocean science and engineering from the University of Delaware. He has experience in many quantitative areas including quantification and interpretation of statistics and probability, coordination and analysis of large data sets, frequency domain time series analysis, partial differential equations, finite difference numerical modeling, and nonlinear dynamics. Edward has also been a major contributor in the development of new features and algorithms in PV-WAVE and the IMSL Libraries. He has published journal articles on experimental fluid dynamics and technical documents regarding Rogue Wave products.
