Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library

Size: px
Start display at page:

Download "Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library"

Transcription

1 Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library Benchmarking the NVIDIA GPU A White Paper by Rogue Wave Software. October, 2010 Rogue Wave Softw are 5500 Flatiron Parkw ay, Suite 200 Boulder, CO 80301, USA ave.com

2 Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library Benchmarking the NVIDIA GPU by Rogue Wave Software 2010 by Rogue Wave Software. All Rights Reserved Printed in the United States of America Publishing History: October, 2010 Trademark Information The Rogue Wave Softw are name and logo, SourcePro, Stingray, HostAccess, IMSL and PV-WAV E are registered trademar ks of Rogue Wave Softw are, Inc. or its subsidiaries in the US and other countries. JMSL, JWAV E, TS-WAVE, Py IMSL and Know ledge in Motion are trademarks of Rogue Wave Softw are, Inc. or its subsidiaries. All other company, product or brand names are the property of their respective ow ners. IMPORTA NT NOTICE: The infor mation contained in this document is subject to change w ithout notice. Rogue Wave Softw are, Inc. makes no w arranty of any kind w ith regards to this material, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Rogue Wave Softw are, Inc. shall not be liable for errors contained herein or for incidental, consequential, or other indirect damages in connection w ith the furnishing, performance, or use of this material.

3 Abstract The use of the NVIDIA GPU with the corresponding CUDA BLAS Library and the IMSL Fortran Numerical Library is an effective means of boosting performance for problem sizes above certain thresholds. A broad set of Level 2 and Level 3 BLAS functions have been implemented and benchmarks are presented for GEMM, TRMM, GEMV and GER. These benchmarks have been performed on standard systems using the publicly available version of the software, and while we expect you should get similar results, it is always best to evaluate the algorithms you use on your deployment hardware for best performance. The CUDA BLAS versions can be hundreds of time faster than general BLAS written in Fortran, but also up to eight times faster than vendor-supplied hardware-optimized BLAS running on four CPU cores.

4 TABLE OF CONTENTS Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library... 1 Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library... 2 Use of NVIDIA BLAS with the IMSL FORTRAN Numerical Library... 4 Strategy... 4 The Benchmarks... 6 Measurements... 7 Linux with Intel Fortran Professional Compiler (64-bit)... 7 Speedup Summary Note About Floating Point Exception Handling Sample Benchmarking Code Conclusion About the Author... 28

5 Use of NVIDIA BLAS with the IMSL FORTRAN Numerical Library In recent years, traditional high-performance hardware has been supplemented with graphic processing units once utilized only for 3D visualization. These general purpose graphics processing units (GPGPUs) have matured enough that BLAS packages are now available and both single and double-precision calculations are supported. These two facts indicate the environment has reached a maturity level high enough for general purpose libraries such as IMSL to consider leveraging the hardware. From the GPGPU market viewpoint, the clear leader is NVIDIA. Their CUDA BLAS library is widely used in conjunction with the hardware as an enabling technology for higher level applications. NVIDIA Corp. implemented certain Level 1, 2 and 3 BLAS in their Library, CUDA CUBLAS Library, V3.1, July, The NVIDIA external names and argument protocols are different from the equivalent Fortran names and argument addressing. See Table1.0 for names marked in the color GREEN. IMSL has written these marked Fortran BLAS so that they call equivalent NVIDIA C language codes from the CUBLAS library. No direct use or knowledge of C is required by a Fortran programmer in order to take advantage of these codes. However, it is necessary that a user code or application package be compiled with a Fortran 2003 compiler that has implemented the C Interoperability Standard feature. See The Fortran 2003 Handbook, Adams, et al., p IMSL s use of this feature is the key to providing a portable version of these Fortran-callable IMSL/NVIDIA BLAS. The program or application is then compiled and linked using IMSL and NVIDIA libraries that contain these BLAS. Strategy The strategy for using the attached NVIDIA GPU is given by the following algorithm: If the maximum of vector or matrix dimensions are larger than a switchover array size, NSTART, and NVIDIA provides a CUBLAS code, then o Copy the required vector and matrix data from the CPU to the GPU o Compute the result on the GPU o Copy the result from the GPU to the CPU Else, use a FORTRAN equivalent version of the BLAS routine that does not use the GPU. Normally a code that calls an IMSL/NVIDIA BLAS code does not have to be aware of the copy steps or the switchover size, NSTART. These are hidden from the user code. In the first algorithm step, a working block is allocated on the GPU for each array argument. 4

6 A table within the IMSL module, CUBLAS_LIBRARY, records the sizes and GPU addresses of these blocks. If the sizes are too small for the current problem size and data type the blocks are reallocated to be of adequate size. The same working block on the GPU may be used for many calls to the IMSL/NVIDIA BLAS. The IMSL versions of the BLAS also allow a user to define individual values of NSTART for each routine. This is important because using the GPU may be slower than using a CPU Fortran version until a switchover array size is reached. Thereafter the GPU version is typically faster and increasingly much faster as the problem size increases. The default value of NSTART = 32 is used for each vector argument of each routine, but it may not be optimal. This default allows the routines to function correctly without initial attention to this value. The user can reset this value for each individual routine in the listings of Table1.0 marked with the color GREEN by using the IMSL routine CUBLAS_SET( ). This must be done before any use of the CUDA CUBLAS routine occurs. The switchover values can be obtained using the IMSL routine CUBLAS_GET( ). The floating point results obtained using the CPU vs. the GPU will likely differ in units of the low order bits in each component. These differences come from non-equivalent strategies of floating point arithmetic and rounding modes that are implemented in the NVIDIA board. This can be an important detail when comparing results for purposes of benchmarking or code regression. Generally either result should be acceptable for numerical work. As an added feature, the user can flag when the data values for a vector or matrix are present on the GPU and hence suppress the IMSL/NVIDIA BLAS code from first copying the data. This is often important since the data movement from the CPU to the GPU may be a significant part of the computation time. If there is no indication that the data is present, it is copied from the CPU to the GPU each time a routine is called. The necessity of copying for each use of a BLAS code depends on the application. Valid results are always copied back from the GPU to the CPU memory. The indication that data for that positional array argument requires no initial copy step is that the switchover value for that array argument is negative. The absolute value is used as the switchover value. When utilizing this feature, it is important that the user reset this to a positive value when the argument requires an initial copy step. There are four utility routines provided in the IMSL module CUDABLAS_LIBRARY that can be used to help manage the use of NVIDIA BLAS. These utilities can be used to: Get the current switchover point value Set the switchover point value to a new value Maintain buffer sizes on the NVIDIA device Print error messages generated through the use of the NVIDIA device using the IMSL error handler 5

7 Some NVIDIA hardware does not have working double precision versions of BLAS because there is no double precision arithmetic available. However, the double precision code itself is part of the CUDA CUBLAS library. It will appear to execute even though it will not give correct results when the device has no double precision arithmetic. When the IMSL software detects that the correct results are not returned, a warning error message will be printed. The user may instruct the application to henceforth use the Fortran code by setting the switchover value to zero. For example, if it is known that the hardware does not support DOUBLE PRECISION, then a code that has calls to DGEMM will use an alternate version of this routine if the switchover value for DGEMM has been reset to zero. Table 1.0: Level 2 and Level 3 Basic Linear Algebra Subprograms GREEN Denotes NVIDIA Version Available Operation Real Double Complex Double Complex Matri x-vector Multiply, General SGEMV DGEMV CGEMV ZGEMV Matri x-vector Multiply, Banded SGBMV DGBMV CGBMV ZGBMV Rank-One Matrix Update, General and Real Rank-One Matrix Update, General, Complex and Transpose SGER DGER CGERU ZGERU Rank-One Matrix Update, Symmetric and Real SSYR DSYR Matri x--matrix Multiply, General SGEMM DGEMM CGEMM ZGEMM Matri x-matrix Multiply, Symmetric SSYMM DSYMM CSYMM ZSYMM Matri x-matrix Multiply, Hermitian CHEMM ZHEMM Rank - k Update, Hermitian CHERK ZHERK Rank - 2k Update, Symmetric SSYR2K DSYR2K CSYR2K ZSYR2K Matri x-matrix Multiply, Triangular STRMM DTRMM CTRMM ZTRMM Matri x-matrix solve, Triangular STRSM DTRSM CTRSM ZTRSM The Benchmarks The performance of selected BLAS was measured on multi-core systems using the NVIDIA Tesla C2050. Each function was used to solve a large enough problem to allow for the use of the NVIDIA CUDA BLAS to be guaranteed and to significantly overcome the time required for the data copy between the CPU and GPU. Each test case was run with a varying number of threads allowed. The number of threads was set using the OpenMP environment variable OMP_NUM_THREADS. The Intel 11.1 FORTRAN compiler was used for both tested environments. 6

8 Note that in order to realize the performance gains recorded for the GEMV routines, the necessary steps to keep the data on the GPU (discussed above) were implemented. Measurements The reported results in the benchmark times are the elapsed wall clock times for each test. Comparative times for an alternate Fortran version of the BLAS routine, a vendor-supplied version of the BLAS routine, and the corresponding NVIDIA CUDA BLAS routine are given. Linux with Intel Fortran Professional Compiler (64-bit) CPU Hardware: Dual Quad Core Xeon E5420 (Harpertown) 2.5GHz, 133MHz Front Side Bus GPU Hardware: NVIDIA Tesla C2050 Operating System: Red Hat 5 Compiler: Intel Fortran 11.1 ( 64-bit) SGEMM Times DGEMM Times CGEMM Times

9 ZGEMM Times Note that for the array size of 8000 for ZGEMM, a CPU to GPU copy failure occurred because the array was too large. Therefore, the error checking was activated and the benchmark dropped down to the pure Fortran version of the algorithm. 8

10 9

11 SGEMV Times (ms) - Matrix kept on the GPU, Vector updated 30 times DGEMV Times (ms) - Matrix kept on the GPU, Vector updated 30 times CGEMV Times (ms) - Matrix kept on the GPU, Vector updated 30 times ZGEMV Times (ms) - Matrix kept on the GPU, Vector updated 30 times

12 11

13 SGER Times (ms)- Matrix kept on the GPU, Vector updated 30 times DGER Times (ms) - Matrix kept on the GPU, Vector updated 30 times

14 CGER Times (ms) - Matrix kept on the GPU, Vector updated 30 times ZGER Times (ms) - Matrix kept on the GPU, Vector updated 30 times

15 14

16 STRMM Times (s) DTRMM Times (s) CTRMM Times (s) ZTRMM Times (s)

17 Note that for the array size of 8000 for ZTRMM, a CPU to GPU copy failure occurred because the array was too large. Therefore, the error checking was activated and the benchmark dropped down to the pure Fortran version of the algorithm. 16

18 Speedup Summary The tables and charts shown above present raw timings from the benchmark test suite. Of interest to many users is the speedup observed when using the GPU hardware relative to another case. Here speedup is defined as the ratio of times, T CPU /T GPU. A value of 1 indicates the test runs in the same duration on either hardware while values greater than 1 indicate cases where running on the GPU hardware is favorable. A value of 2 would mean that the code ran in half the time on the GPU card. All of the above tests can be combined into a single heatmap style chart where results are shaded relative to the computed speedup value. The following two charts compare the NVIDIA GPU runs against first the pure Fortran BLAS implementation and then the most favorable case for the CPU, using an optimized BLAS library with 4 threads. 17

19 T[CPU]/T[GPU] for Fortran BLAS S D C Z GEMM TRMM GEMV (30 repeats) GER (30 repeats)

20 T[CPU]/T[GPU] for Vendor BLAS with 4 cores S D C Z GEMM TRMM GEMV (30 repeats) GER (30 repeats) The speedup against pure Fortran is considerable in most cases, up to almost 400 times faster for a large SGEMM problem. A highly optimized BLAS library using 4 threads is a difficult opponent, but the GPU hardware still does very well even for the relatively small 500x500 problem size for Level 3 BLAS functions. In both cases the large ZGEMM and ZTRMM problems that encountered memory issues in the copy phase are clearly evident. The need for rather large problems to see performance increases using the Level 2 BLAS function GEMV is also very clear. In general, the GER function should not be used in a standalone manner on the GPU card; instead, this function is available for convenience when users can keep data on the card and then perform this operation without additional data I/O. 19

21 Note About Floating Point Exception Handling If exceptions resulting in NaN or Inf are a concern then the user code must examine the output for the occurrence of these. However, details about gradual underflow or the occurrence of underflow or an invalid operation is not available from the GPU. If this is to be detected in a running program within the BLAS, then one must use the Fortran version. Sample Benchmarking Code The following code benchmarks the BLAS routines SGEMV, DGEMV, CGEMV, and ZGEMV. This code requires the BLAS_INTERFACE module from the IMSL Library. The array breakpoints are set so that a Fortran version of the BLAS is used first, then reset so that the NVIDIA version of the BLAS is used. Each benchmark is run 30 times with the initial matrix kept on the GPU in order to realize the performance gains. The Fortran intrinsic, system_clock, is used to measure the actual elapsed ( wall clock ) time in seconds. PROGRAM BENCHMARK USE BLAS_INTERFACE IMPLICIT NONE INTEGER I, jj!integer, PARAMETER :: NBASE=1000, NREPEATS=10, INC=1000, NCASES=09!INTEGER, PARAMETER :: NVALUES(NCASES)=(/(NBASE+I*INC,I=1,NCASES)/)!REAL(skind) :: TF(NCASES),TN(NCASES) REAL(SKIND), ALLOCATABLE :: TF(:), TN(:) INTEGER :: NBASE, NREPEATS, INC, NCASES, MODE INTEGER, ALLOCATABLE :: NVALUES(:)! By default sizes of 500, 1000, 2000, 4000, 8000 are used. You! can shorten execution time by executing fewer cases. For example! when NCASES=3 only 500, 1000, and 2000 are executed. NCASES=5! Can be used for averaging, Each sample size is executed NREPEATS! times and then averaged. During repeating arrays which are not! changed are left on the GPU. If multiple runs with the same! data array is not used using the GPU is not cost effective. NREPEATS=30! Mode 1 both NVIDEA and default Fortran are executed.! Mode=2 only NVIDEA is executed MODE=1 ALLOCATE(NVALUES(NCASES),TF(NCASES),TN(NCASES))! NVALUES=(/(NBASE+(I-1)*INC,I=1,NCASES)/) NVALUES(1) = 500 NVALUES(2) = 1000 NVALUES(3) = 2000 NVALUES(4) = 4000 NVALUES(5) =

22 do jj=1,nrepeats write(*,'(//"================ Number of Repeats = ",i4)') jj TF=0; TN=0 CALL BENCH_Sgemv(NVALUES, jj, TF, TN) write(*,'(//"(sgemv Times: Fortran then NVIDIA, F/N)")') DO I=1,NCASES WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I) TF=0; TN=0 CALL BENCH_Dgemv(NVALUES, jj, TF, TN) write(*,'(//"(dgemv Times: Fortran then NVIDIA, F/N)")') DO I=1,NCASES WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I) TF=0; TN=0 CALL BENCH_Cgemv(NVALUES, jj, TF, TN) write(*,'(//"(cgemv Times: Fortran then NVIDIA, F/N)")') DO I=1,NCASES WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I) TF=0; TN=0 CALL BENCH_Zgemv(NVALUES, jj, TF, TN) write(*,'(//"(zgemv Times: Fortran then NVIDIA, F/N)")') DO I=1,NCASES WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I) end do CONTAINS SUBROUTINE BENCH_Sgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)! Benchmark NVIDIA and Fortran BLA code, Sgemv. INTEGER, INTENT(IN) :: NVALUES(:) INTEGER, INTENT(IN) :: NREPEATS REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)! Type variables used in calling routines: REAL(skind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:) REAL(skind) ALPHA, BETA REAL(skind) TEMP, NORM, RERR INTEGER :: I, II, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX integer :: iend, istart, icount_rate INTEGER :: ISWITCH NCASE=size(NVALUES) NMIN=minval(NVALUES) NMAX=maxval(NVALUES) DO I = 1,NCASE N=NVALUES(I) LDA=N LDB=1 LDC=1 21

23 ALLOCATE(A(LDA,N), B(LDB*N), C(LDC*N),CSAVE(LDC*N)) CALL RANDOM_NUMBER(A) CALL RANDOM_NUMBER(B) ALPHA=SONE BETA = SZERO! Set breakover value to use Fortran and then NVIDIA: IF(MODE == 1) THEN DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Sgemv,II,0) call system_clock(count=istart, COUNT_RATE=icount_rate) DO J=1,NREPEATS CALL Sgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC) call system_clock(count=iend) TFORTRAN(I)=float(iend-istart)/icount_rate CSAVE=C NORM=maxval(abs(C)) ELSE TFORTRAN=0 END IF! Use NVIDIA for all dimensions here: DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Sgemv,II,NMIN-1) C=SZERO call system_clock(count=istart, COUNT_RATE=icount_rate) ISWITCH=CUBLAS_GET(CUDABLAS_Sgemv,1) DO J=1,NREPEATS CALL Sgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)! Time without copy each use: CALL CUBLAS_SET(CUDABLAS_Sgemv,1,-ABS(ISWITCH)) CALL CUBLAS_SET(CUDABLAS_Sgemv,1,ISWITCH) call system_clock(count=iend) TNVIDIA(I)=float(iend-istart)/icount_rate! Compute relative error norm: IF(MODE == 1) RERR=MAXVAL(ABS(C-CSAVE))/NORM DEALLOCATE(A,B,C,CSAVE) IF(MODE == 2) CYCLE IF(RERR <= SQRT(EPSILON(SONE))) CYCLE! Quit with no timing if results do not agree. TNVIDIA=SZERO TFORTRAN=SZERO RETURN! I! Average the time per call. TNVIDIA=TNVIDIA/REAL(NREPEATS, skind) TFORTRAN=TFORTRAN/REAL(NREPEATS, skind) END SUBROUTINE BENCH_Sgemv SUBROUTINE BENCH_Dgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)! Benchmark NVIDIA and Fortran BLA code, Dgemv. INTEGER, INTENT(IN) :: NVALUES(:) INTEGER, INTENT(IN) :: NREPEATS REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:) 22

24 ! Type variables used in calling routines: REAL(dkind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:) REAL(dkind) ALPHA, BETA REAL(dkind) TEMP, NORM, RERR INTEGER :: I, II, ISWITCH, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX integer :: iend, istart, icount_rate NCASE=size(NVALUES) NMIN=minval(NVALUES) NMAX=maxval(NVALUES) DO I = 1,NCASE N=NVALUES(I) LDA=N LDB=1 LDC=1 ALLOCATE(A(LDA,N), B(LDB*N), C(LDC*N),CSAVE(LDC*N)) CALL RANDOM_NUMBER(A) CALL RANDOM_NUMBER(B) ALPHA=DONE BETA = DZERO! Set breakover value to use Fortran and then NVIDIA: IF(MODE == 1) THEN DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Dgemv,II,0) call system_clock(count=istart, COUNT_RATE=icount_rate) DO J=1,NREPEATS CALL Dgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC) call system_clock(count=iend) TFORTRAN(I)=float(iend-istart)/icount_rate CSAVE=C NORM=maxval(abs(C)) ELSE TFORTRAN=0 END IF! Use NVIDIA for all dimensions here: DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Dgemv,II,NMIN-1) C=DZERO call system_clock(count=istart, COUNT_RATE=icount_rate) ISWITCH=CUBLAS_GET(CUDABLAS_Dgemv,1) DO J=1,NREPEATS CALL Dgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)! Time without copy each use: CALL CUBLAS_SET(CUDABLAS_Dgemv,1,-ABS(ISWITCH)) CALL CUBLAS_SET(CUDABLAS_Dgemv,1,ISWITCH) call system_clock(count=iend) TNVIDIA(I)=float(iend-istart)/icount_rate 23

25 ! Compute relative error norm: IF(MODE==1)RERR=MAXVAL(ABS(C-CSAVE))/NORM DEALLOCATE(A,B,C,CSAVE) IF(MODE==2) CYCLE IF(RERR <= SQRT(EPSILON(DONE))) CYCLE! Quit with no timing if results do not agree. TNVIDIA=SZERO TFORTRAN=SZERO RETURN! I! Average the time per call. TNVIDIA=TNVIDIA/REAL(NREPEATS, dkind) TFORTRAN=TFORTRAN/REAL(NREPEATS, dkind) END SUBROUTINE BENCH_Dgemv SUBROUTINE BENCH_Cgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)! Benchmark NVIDIA and Fortran BLA code, Cgemv. INTEGER, INTENT(IN) :: NVALUES(:) INTEGER, INTENT(IN) :: NREPEATS REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)! Type variables used in calling routines: COMPLEX(skind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:) REAL(skind),allocatable :: S(:),T(:) COMPLEX(skind) ALPHA, BETA REAL(skind) TEMP, NORM, RERR INTEGER :: I, II, ISWITCH, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX integer :: iend, istart, icount_rate NCASE=size(NVALUES) NMIN=minval(NVALUES) NMAX=maxval(NVALUES) DO I = 1,NCASE N=NVALUES(I) LDA=N LDB=1 LDC=1 ALLOCATE(S(N), T(N),A(LDA,N), B(LDB*N),& C(LDC*N),CSAVE(LDC*N)) DO J=1,N call random_number(s) call random_number(t) A(:,J)=cmplx(S,T,skind) call random_number(s) call random_number(t) B(:)=cmplx(S,T,skind)! CALL RANDOM_NUMBER(A)! CALL RANDOM_NUMBER(B) ALPHA=SONE BETA = SZERO 24

26 ! Set breakover value to use Fortran and then NVIDIA: IF(MODE == 1) THEN DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Cgemv,II,0) call system_clock(count=istart, COUNT_RATE=icount_rate) DO J=1,NREPEATS CALL Cgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC) call system_clock(count=iend) TFORTRAN(I)=float(iend-istart)/icount_rate CSAVE=C NORM=maxval(abs(C)) ELSE TFORTRAN=0 END IF! Use NVIDIA for all dimensions here: DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Cgemv,II,NMIN-1) call system_clock(count=istart, COUNT_RATE=icount_rate) ISWITCH=CUBLAS_GET(CUDABLAS_Cgemv,1) DO J=1,NREPEATS CALL Cgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)! Time without copy each use: CALL CUBLAS_SET(CUDABLAS_Cgemv,1,-ABS(ISWITCH)) CALL CUBLAS_SET(CUDABLAS_Cgemv,1,ISWITCH) call system_clock(count=iend) TNVIDIA(I)=float(iend-istart)/icount_rate! Compute relative error norm: IF(MODE==1)RERR=MAXVAL(ABS(C-CSAVE))/NORM DEALLOCATE(A,B,C,CSAVE,S,T) IF(MODE==2) CYCLE IF(RERR <= SQRT(EPSILON(SONE))) CYCLE! Quit with no timing if results do not agree. TNVIDIA=SZERO TFORTRAN=SZERO RETURN! I! Average the time per call. TNVIDIA=TNVIDIA/REAL(NREPEATS, skind) TFORTRAN=TFORTRAN/REAL(NREPEATS, skind) END SUBROUTINE BENCH_Cgemv SUBROUTINE BENCH_Zgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)! Benchmark NVIDIA and Fortran BLA code, Zgemv. INTEGER, INTENT(IN) :: NVALUES(:) INTEGER, INTENT(IN) :: NREPEATS REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)! Type variables used in calling routines: COMPLEX(dkind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:) REAL(dkind),allocatable :: S(:),T(:) COMPLEX(dkind) ALPHA, BETA REAL(dkind) TEMP, NORM, RERR 25

27 INTEGER :: I, II, ISWITCH, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX integer :: iend, istart, icount_rate NCASE=size(NVALUES) NMIN=minval(NVALUES) NMAX=maxval(NVALUES) DO I = 1,NCASE N=NVALUES(I) LDA=N LDB=1 LDC=1 ALLOCATE(S(N), T(N),A(LDA,N), B(LDB*N),& C(LDC*N),CSAVE(LDC*N)) DO J=1,N call random_number(s) call random_number(t) A(:,J)=cmplx(S,T,dkind) call random_number(s) call random_number(t) B(:)=cmplx(S,T,dkind)! CALL RANDOM_NUMBER(A)! CALL RANDOM_NUMBER(B) ALPHA=SONE BETA = SZERO! Set breakover value to use Fortran and then NVIDIA: IF(MODE == 1) THEN DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Zgemv,II,0) call system_clock(count=istart, COUNT_RATE=icount_rate) DO J=1,NREPEATS CALL Zgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC) call system_clock(count=iend) TFORTRAN(I)=float(iend-istart)/icount_rate CSAVE=C NORM=maxval(abs(C)) ELSE TFORTRAN=0 END IF! Use NVIDIA for all dimensions here: DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Zgemv,II,NMIN-1) call system_clock(count=istart, COUNT_RATE=icount_rate) ISWITCH=CUBLAS_GET(CUDABLAS_Zgemv,1) DO J=1,NREPEATS CALL Zgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)! Time without copy each use: CALL CUBLAS_SET(CUDABLAS_Zgemv,1,-ABS(ISWITCH)) CALL CUBLAS_SET(CUDABLAS_Zgemv,1,ISWITCH) call system_clock(count=iend) 26

28 TNVIDIA(I)=float(iend-istart)/icount_rate! Compute relative error norm: IF(MODE==1)RERR=MAXVAL(ABS(C-CSAVE))/NORM DEALLOCATE(A,B,C,CSAVE,S,T) IF(MODE==2) CYCLE IF(RERR <= SQRT(EPSILON(SONE))) CYCLE! Quit with no timing if results do not agree. TNVIDIA=SZERO TFORTRAN=SZERO RETURN! I! Average the time per call. TNVIDIA=TNVIDIA/REAL(NREPEATS, dkind) TFORTRAN=TFORTRAN/REAL(NREPEATS, dkind) END SUBROUTINE BENCH_Zgemv END PROGRAM BENCHMARK 27

29 Conclusion The use of the NVIDIA GPU with the corresponding CUDA BLAS Library and the IMSL FORTRAN Numerical Library is an effective means of boosting performance for problem sizes above certain thresholds. In some cases, users need to be familiar enough with their application with regard to keeping data on the GPU in order to realize the full benefit of using the CUDA BLAS Library with the IMSL Library. It is expected that further leveraging of the CUDA BLAS Library will be available when demand and scheduling suggest adding additional IMSL interface codes to this library. Finally, IMSL has performed these benchmarks on standard systems using the publicly available version of the software, and while we expect you should get similar results, it is always best to evaluate the algorithms you use on your deployment hardware for best performance. About the Author Edward Stewart is the product manager for the IMSL Numerical Libraries and Director of Research at Rogue Wave Software. Ed received his Ph.D. in physical ocean science and engineering from the University of Delaware. He has experience in many quantitative areas including quantification and interpretation of statistics and probability, coordination and analysis of large data sets, frequency domain time series analysis, partial differential equations, finite difference numerical modeling, and nonlinear dynamics. Edward has also been a major contributor in the development of new features and algorithms in PV-WAVE and the IMSL Libraries. He has published journal articles on experimental fluid dynamics and technical documents regarding Rogue Wave products. 28

CUDA Toolkit 4.0 Performance Report. June, 2011

CUDA Toolkit 4.0 Performance Report. June, 2011 CUDA Toolkit 4. Performance Report June, 211 CUDA Math Libraries High performance math routines for your applications: cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse

More information

Using OpenACC With CUDA Libraries

Using OpenACC With CUDA Libraries Using OpenACC With CUDA Libraries John Urbanic with NVIDIA Pittsburgh Supercomputing Center Copyright 2015 3 Ways to Accelerate Applications Applications Libraries Drop-in Acceleration CUDA Libraries are

More information

Using OpenACC With CUDA Libraries

Using OpenACC With CUDA Libraries Using OpenACC With CUDA Libraries John Urbanic with NVIDIA Pittsburgh Supercomputing Center Copyright 2018 3 Ways to Accelerate Applications Applications Libraries Drop-in Acceleration CUDA Libraries are

More information

CUDA 6.0 Performance Report. April 2014

CUDA 6.0 Performance Report. April 2014 CUDA 6. Performance Report April 214 1 CUDA 6 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random

More information

JMSL Library for Java Applications

JMSL Library for Java Applications JMSL Library for Java Applications A White Paper by Rogue Wave Software. Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301, USA www.rougewave.com JMSL Library for Java Applications

More information

Parallel Functionality and Automatic Thread Control in PV-WAVE 10.0

Parallel Functionality and Automatic Thread Control in PV-WAVE 10.0 Parallel Functionality and Automatic Thread Control in PV-WAVE 10.0 PV-WAVE Performance Improvement Brochure Rogue Wave Software April 2012 Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder,

More information

IMSL C Numerical Library

IMSL C Numerical Library IMSL C Numerical Library Getting Started Guide for Windows A Technical Guide by Rogue Wave Software. Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301, USA www.roguewave.com IMSL C

More information

Premiers retours d expérience sur l utilisation de GPU pour des applications de mécanique des structures

Premiers retours d expérience sur l utilisation de GPU pour des applications de mécanique des structures Premiers retours d expérience sur l utilisation de GPU pour des applications de mécanique des structures Antoine Petitet et Stefanos Vlachoutsis Juin 2011 Copyright ESI Group, 2009. 2010. All rights reserved.

More information

Level-3 BLAS on the TI C6678 multi-core DSP

Level-3 BLAS on the TI C6678 multi-core DSP Level-3 BLAS on the TI C6678 multi-core DSP Murtaza Ali, Eric Stotzer Texas Instruments {mali,estotzer}@ti.com Francisco D. Igual Dept. Arquitectura de Computadores y Automática Univ. Complutense de Madrid

More information

Parallel Performance of the IMSL C Numerical Library

Parallel Performance of the IMSL C Numerical Library Parallel Performance of the IMSL C Numerical Library Benchmarking OpenMP A White Paper by Rogue Wave Software October 2012 Update Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301,

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra

More information

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides

More information

Parallel Programming and the IMSL Libraries

Parallel Programming and the IMSL Libraries Parallel Programming and the IMSL Libraries An Overview A White Paper by Rogue Wave Software. February 2012 Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301, USA www.rougewave.com

More information

CUDA 7.0 Performance Report. May 2015

CUDA 7.0 Performance Report. May 2015 CUDA 7.0 Performance Report May 2015 1 CUDA 7.0 Performance Report cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library New in cusolver Linear Solver Library

More information

CUDA 6.5 Performance Report

CUDA 6.5 Performance Report CUDA 6.5 Performance Report 1 CUDA 6.5 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random Number

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

9. Linear Algebra Computation

9. Linear Algebra Computation 9. Linear Algebra Computation Basic Linear Algebra Subprograms (BLAS) Routines that provide standard, low-level, building blocks for performing basic vector and matrix operations. Originally developed

More information

Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method

Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Josh Romero, Massimiliano Fatica - NVIDIA Vamsi Spandan, Roberto Verzicco -

More information

A Sampling of CUDA Libraries Michael Garland

A Sampling of CUDA Libraries Michael Garland A Sampling of CUDA Libraries Michael Garland NVIDIA Research CUBLAS Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA driver Self-contained at the API level, no direct interaction

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

INTEL MKL Vectorized Compact routines

INTEL MKL Vectorized Compact routines INTEL MKL Vectorized Compact routines Mesut Meterelliyoz, Peter Caday, Timothy B. Costa, Kazushige Goto, Louise Huot, Sarah Knepper, Arthur Araujo Mitrano, Shane Story 2018 BLIS RETREAT 09/17/2018 OUTLINE

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

NVBLAS LIBRARY. DU _v6.0 February User Guide

NVBLAS LIBRARY. DU _v6.0 February User Guide NVBLAS LIBRARY DU-06702-001_v6.0 February 2014 User Guide DU-06702-001_v6.0 2 Chapter 1. INTRODUCTION The is a GPU-accelerated Libary that implements BLAS (Basic Linear Algebra Subprograms). It can accelerate

More information

CUDA Fortran COMPILERS &TOOLS. Porting Guide

CUDA Fortran COMPILERS &TOOLS. Porting Guide Porting Guide CUDA Fortran CUDA Fortran is the Fortran analog of the NVIDIA CUDA C language for programming GPUs. This guide includes examples of common language features used when porting Fortran applications

More information

Optimizing the operations with sparse matrices on Intel architecture

Optimizing the operations with sparse matrices on Intel architecture Optimizing the operations with sparse matrices on Intel architecture Gladkikh V. S. victor.s.gladkikh@intel.com Intel Xeon, Intel Itanium are trademarks of Intel Corporation in the U.S. and other countries.

More information

GPU LIBRARY ADVISOR. DA _v8.0 September Application Note

GPU LIBRARY ADVISOR. DA _v8.0 September Application Note GPU LIBRARY ADVISOR DA-06762-001_v8.0 September 2016 Application Note TABLE OF CONTENTS Chapter 1. Overview... 1 Chapter 2. Usage... 2 DA-06762-001_v8.0 ii Chapter 1. OVERVIEW The NVIDIA is a cross-platform

More information

A Standard for Batching BLAS Operations

A Standard for Batching BLAS Operations A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community

More information

Dense matrix algebra and libraries (and dealing with Fortran)

Dense matrix algebra and libraries (and dealing with Fortran) Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)

More information

A Few Numerical Libraries for HPC

A Few Numerical Libraries for HPC A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear

More information

ECMWF Workshop on High Performance Computing in Meteorology. 3 rd November Dean Stewart

ECMWF Workshop on High Performance Computing in Meteorology. 3 rd November Dean Stewart ECMWF Workshop on High Performance Computing in Meteorology 3 rd November 2010 Dean Stewart Agenda Company Overview Rogue Wave Product Overview IMSL Fortran TotalView Debugger Acumem ThreadSpotter 1 Copyright

More information

Porting Guide. CUDA Fortran COMPILERS &TOOLS

Porting Guide. CUDA Fortran COMPILERS &TOOLS Porting Guide CUDA Fortran COMPILERS &TOOLS 1 Simple Increment Code Host CPU and its memory The cudafor module incudes CUDA Fortran definitions and interfaces to the runtime API The device variable attribute

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Resources for parallel computing

Resources for parallel computing Resources for parallel computing BLAS Basic linear algebra subprograms. Originally published in ACM Toms (1979) (Linpack Blas + Lapack). Implement matrix operations upto matrix-matrix multiplication and

More information

Accelerating GPU Kernels for Dense Linear Algebra

Accelerating GPU Kernels for Dense Linear Algebra Accelerating GPU Kernels for Dense Linear Algebra Rajib Nath, Stan Tomov, and Jack Dongarra Innovative Computing Lab University of Tennessee, Knoxville July 9, 21 xgemm performance of CUBLAS-2.3 on GTX28

More information

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code

More information

Debugging Programs Accelerated with Intel Xeon Phi Coprocessors

Debugging Programs Accelerated with Intel Xeon Phi Coprocessors Debugging Programs Accelerated with Intel Xeon Phi Coprocessors A White Paper by Rogue Wave Software. Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301, USA www.roguewave.com Debugging

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures

PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures Solovev S. A, Pudov S.G sergey.a.solovev@intel.com, sergey.g.pudov@intel.com Intel Xeon, Intel Core 2 Duo are trademarks of

More information

Basic Linear Algebra Subprograms Library

Basic Linear Algebra Subprograms Library Software Development Kit for Multicore Acceleration Version 3.1 Basic Linear Algebra Subprograms Library Programmer s Guide and API Reference SC33-8426-01 Software Development Kit for Multicore Acceleration

More information

NAG Fortran Library Routine Document F01CTF.1

NAG Fortran Library Routine Document F01CTF.1 NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent

More information

Introduction to OpenACC Directives. Duncan Poole, NVIDIA

Introduction to OpenACC Directives. Duncan Poole, NVIDIA Introduction to OpenACC Directives Duncan Poole, NVIDIA GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers Oil & Gas CAE CFD Finance

More information

Sarah Knepper. Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018

Sarah Knepper. Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018 Sarah Knepper Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018 Outline Motivation Problem statement and solutions Simple example Performance comparison 2 Motivation Partial differential equations

More information

NAG Fortran Library Routine Document F01CWF.1

NAG Fortran Library Routine Document F01CWF.1 NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Accelerating GPU kernels for dense linear algebra

Accelerating GPU kernels for dense linear algebra Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,

More information

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering

More information

Optimizations of BLIS Library for AMD ZEN Core

Optimizations of BLIS Library for AMD ZEN Core Optimizations of BLIS Library for AMD ZEN Core 1 Introduction BLIS [1] is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [2] The framework was

More information

Lecture V: Introduction to parallel programming with Fortran coarrays

Lecture V: Introduction to parallel programming with Fortran coarrays Lecture V: Introduction to parallel programming with Fortran coarrays What is parallel computing? Serial computing Single processing unit (core) is used for solving a problem One task processed at a time

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

Scientific Computing. Some slides from James Lambers, Stanford

Scientific Computing. Some slides from James Lambers, Stanford Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

A Comprehensive Study on the Performance of Implicit LS-DYNA

A Comprehensive Study on the Performance of Implicit LS-DYNA 12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four

More information

Index. classes, 47, 228 coarray examples, 163, 168 copystring, 122 csam, 125 csaxpy, 119 csaxpyval, 120 csyscall, 127 dfetrf,14 dfetrs, 14

Index. classes, 47, 228 coarray examples, 163, 168 copystring, 122 csam, 125 csaxpy, 119 csaxpyval, 120 csyscall, 127 dfetrf,14 dfetrs, 14 Index accessor-mutator routine example in a module, 7 PUBLIC or PRIVATE components, 6 ACM, ix editors of CALGO, ix Adams, Brainerd et al., see books, Fortran reference Airy s equation boundary value problem,

More information

Matrix Multiplication Specialization in STAPL

Matrix Multiplication Specialization in STAPL Matrix Multiplication Specialization in STAPL Adam Fidel, Lena Olson, Antal Buss, Timmie Smith, Gabriel Tanase, Nathan Thomas, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger Parasol Lab, Dept. of Computer

More information

Overcoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science

Overcoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science Overcoming the Barriers to Sustained Petaflop Performance William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp But First Are we too CPU-centric? What about I/O? What do applications

More information

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning

More information

Dell EMC Ready Bundle for HPC Digital Manufacturing Dassault Systѐmes Simulia Abaqus Performance

Dell EMC Ready Bundle for HPC Digital Manufacturing Dassault Systѐmes Simulia Abaqus Performance Dell EMC Ready Bundle for HPC Digital Manufacturing Dassault Systѐmes Simulia Abaqus Performance This Dell EMC technical white paper discusses performance benchmarking results and analysis for Simulia

More information

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and

More information

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary white paper Computer-Aided Engineering ANSYS Mechanical on Intel Xeon Processors Engineer Productivity Boosted by Higher-Core CPUs Engineers can be significantly more productive when ANSYS Mechanical runs

More information

PRACE PATC Course: Intel MIC Programming Workshop, MKL LRZ,

PRACE PATC Course: Intel MIC Programming Workshop, MKL LRZ, PRACE PATC Course: Intel MIC Programming Workshop, MKL LRZ, 27.6-29.6.2016 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi - Compiler Assisted Offload - Automatic Offload - Native Execution

More information

BLAS. Christoph Ortner Stef Salvini

BLAS. Christoph Ortner Stef Salvini BLAS Christoph Ortner Stef Salvini The BLASics Basic Linear Algebra Subroutines Building blocks for more complex computations Very widely used Level means number of operations Level 1: vector-vector operations

More information

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018 CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected

More information

Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units

Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units Technical Report 2014-001 Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units Kazuya Matsumoto, Naohito Nakasato, and Stanislav Sedukhin October 22, 2014 Graduate School of Computer

More information

Dealing with Heterogeneous Multicores

Dealing with Heterogeneous Multicores Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism

More information

AMath 483/583 Lecture 22. Notes: Another Send/Receive example. Notes: Notes: Another Send/Receive example. Outline:

AMath 483/583 Lecture 22. Notes: Another Send/Receive example. Notes: Notes: Another Send/Receive example. Outline: AMath 483/583 Lecture 22 Outline: MPI Master Worker paradigm Linear algebra LAPACK and the BLAS References: $UWHPSC/codes/mpi class notes: MPI section class notes: Linear algebra Another Send/Receive example

More information

Linear Algebra for Modern Computers. Jack Dongarra

Linear Algebra for Modern Computers. Jack Dongarra Linear Algebra for Modern Computers Jack Dongarra Tuning for Caches 1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining. 2 Indirect Addressing d

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming

More information

Customer Oriented Q&A: Intel Visual Fortran Composer XE 2011 with IMSL

Customer Oriented Q&A: Intel Visual Fortran Composer XE 2011 with IMSL Customer Oriented Q&A: Intel Visual Fortran Composer XE 2011 with IMSL New Deployment Licensing for Your Applications that Include Rogue Wave* IMSL* Numerical Libraries Content Intel resells the Rogue

More information

A quick guide to Fortran

A quick guide to Fortran A quick guide to Fortran Sergiy Bubin Department of Physics Nazarbayev University History of Fortran One of the oldest general purpose high-level computer languages First developed in 1957 at IBM in the

More information

Simulation and Benchmarking of Modelica Models on Multi-core Architectures with Explicit Parallel Algorithmic Language Extensions

Simulation and Benchmarking of Modelica Models on Multi-core Architectures with Explicit Parallel Algorithmic Language Extensions Simulation and Benchmarking of Modelica Models on Multi-core Architectures with Explicit Parallel Algorithmic Language Extensions Afshin Hemmati Moghadam Mahder Gebremedhin Kristian Stavåker Peter Fritzson

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

Performance Modeling for Ranking Blocked Algorithms

Performance Modeling for Ranking Blocked Algorithms Performance Modeling for Ranking Blocked Algorithms Elmar Peise Aachen Institute for Advanced Study in Computational Engineering Science 27.4.2012 Elmar Peise (AICES) Performance Modeling 27.4.2012 1 Blocked

More information

Intel MIC Architecture. Dr. Momme Allalen, LRZ, PRACE PATC: Intel MIC&GPU Programming Workshop

Intel MIC Architecture. Dr. Momme Allalen, LRZ, PRACE PATC: Intel MIC&GPU Programming Workshop Intel MKL @ MIC Architecture Dr. Momme Allalen, LRZ, allalen@lrz.de PRACE PATC: Intel MIC&GPU Programming Workshop 1 2 Momme Allalen, HPC with GPGPUs, Oct. 10, 2011 What is the Intel MKL? Math library

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2

More information

Automatic Development of Linear Algebra Libraries for the Tesla Series

Automatic Development of Linear Algebra Libraries for the Tesla Series Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source

More information

GPU Programming Paradigms

GPU Programming Paradigms GPU Programming with PGI CUDA Fortran and the PGI Accelerator Programming Model Boris Bierbaum, Sandra Wienke (26.3.2010) 1 GPUs@RZ Current: linuxc7: CentOS 5.3, Nvidia GeForce GT 220 hpc-denver: Windows

More information

The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.

The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations. TITLE Basic Linear Algebra Subprograms BYLINE Robert A. van de Geijn Department of Computer Science The University of Texas at Austin Austin, TX USA rvdg@cs.utexas.edu Kazushige Goto Texas Advanced Computing

More information

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF

More information

Solution of Out-of-Core Lower-Upper Decomposition for Complex Valued Matrices

Solution of Out-of-Core Lower-Upper Decomposition for Complex Valued Matrices Solution of Out-of-Core Lower-Upper Decomposition for Complex Valued Matrices Marianne Spurrier and Joe Swartz, Lockheed Martin Corp. and ruce lack, Cray Inc. ASTRACT: Matrix decomposition and solution

More information

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I 1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street

More information

Intel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth

Intel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth Intel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth Contents Intel Visual Fortran Compiler Professional Edition for Windows*........................ 3 Features...3 New in This

More information

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015 The Rise of Open Programming Frameworks JC BARATAULT IWOCL May 2015 1,000+ OpenCL projects SourceForge GitHub Google Code BitBucket 2 TUM.3D Virtual Wind Tunnel 10K C++ lines of code, 30 GPU kernels CUDA

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS

CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS Roberto Gomperts (NVIDIA, Corp.) Michael Frisch (Gaussian, Inc.) Giovanni Scalmani (Gaussian, Inc.) Brent Leback (PGI) TOPICS Gaussian Design

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

CS 179: Lecture 10. Introduction to cublas

CS 179: Lecture 10. Introduction to cublas CS 179: Lecture 10 Introduction to cublas Table of contents, you are here. Welcome to week 4, this is new material from here on out so please ask questions and help the TAs to improve the lectures and

More information

Allocating Storage for 1-Dimensional Arrays

Allocating Storage for 1-Dimensional Arrays Allocating Storage for 1-Dimensional Arrays Recall that if we know beforehand what size we want an array to be, then we allocate storage in the declaration statement, e.g., real, dimension (100 ) :: temperatures

More information

1.3 Data processing; data storage; data movement; and control.

1.3 Data processing; data storage; data movement; and control. CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical

More information

NAG Library Routine Document C05RBF.1

NAG Library Routine Document C05RBF.1 C05 Roots of One or More Transcendental Equations NAG Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised

More information

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee.

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee. Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee Outline Pre-intro: BLAS Motivation What is ATLAS Present release How ATLAS works

More information

CUDA Accelerated Compute Libraries. M. Naumov

CUDA Accelerated Compute Libraries. M. Naumov CUDA Accelerated Compute Libraries M. Naumov Outline Motivation Why should you use libraries? CUDA Toolkit Libraries Overview of performance CUDA Proprietary Libraries Address specific markets Third Party

More information

PACKAGE SPECIFICATION 21 I S 22 P = I

PACKAGE SPECIFICATION 21 I S 22 P = I PACKAGE SPECFCATON 1 SUMMARY For a full matrix that is real symmetric, complex Hermitian, or complex symmetric, this module performs partial or full factorizations and performs solutions of corresponding

More information

Software Announcement October 14, 2003

Software Announcement October 14, 2003 Software Announcement October 14, 2003 IBM Parallel Engineering and Scientific Subroutine Library for (Parallel ESSL) offers scientific subroutines for optimum performance for AIX 5L Overview IBM Parallel

More information