Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library
|
|
- Adrian Stafford
- 5 years ago
- Views:
Transcription
1 Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library Benchmarking the NVIDIA GPU A White Paper by Rogue Wave Software. October, 2010 Rogue Wave Softw are 5500 Flatiron Parkw ay, Suite 200 Boulder, CO 80301, USA ave.com
2 Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library Benchmarking the NVIDIA GPU by Rogue Wave Software 2010 by Rogue Wave Software. All Rights Reserved Printed in the United States of America Publishing History: October, 2010 Trademark Information The Rogue Wave Softw are name and logo, SourcePro, Stingray, HostAccess, IMSL and PV-WAV E are registered trademar ks of Rogue Wave Softw are, Inc. or its subsidiaries in the US and other countries. JMSL, JWAV E, TS-WAVE, Py IMSL and Know ledge in Motion are trademarks of Rogue Wave Softw are, Inc. or its subsidiaries. All other company, product or brand names are the property of their respective ow ners. IMPORTA NT NOTICE: The infor mation contained in this document is subject to change w ithout notice. Rogue Wave Softw are, Inc. makes no w arranty of any kind w ith regards to this material, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Rogue Wave Softw are, Inc. shall not be liable for errors contained herein or for incidental, consequential, or other indirect damages in connection w ith the furnishing, performance, or use of this material.
3 Abstract The use of the NVIDIA GPU with the corresponding CUDA BLAS Library and the IMSL Fortran Numerical Library is an effective means of boosting performance for problem sizes above certain thresholds. A broad set of Level 2 and Level 3 BLAS functions have been implemented and benchmarks are presented for GEMM, TRMM, GEMV and GER. These benchmarks have been performed on standard systems using the publicly available version of the software, and while we expect you should get similar results, it is always best to evaluate the algorithms you use on your deployment hardware for best performance. The CUDA BLAS versions can be hundreds of time faster than general BLAS written in Fortran, but also up to eight times faster than vendor-supplied hardware-optimized BLAS running on four CPU cores.
4 TABLE OF CONTENTS Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library... 1 Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library... 2 Use of NVIDIA BLAS with the IMSL FORTRAN Numerical Library... 4 Strategy... 4 The Benchmarks... 6 Measurements... 7 Linux with Intel Fortran Professional Compiler (64-bit)... 7 Speedup Summary Note About Floating Point Exception Handling Sample Benchmarking Code Conclusion About the Author... 28
5 Use of NVIDIA BLAS with the IMSL FORTRAN Numerical Library In recent years, traditional high-performance hardware has been supplemented with graphic processing units once utilized only for 3D visualization. These general purpose graphics processing units (GPGPUs) have matured enough that BLAS packages are now available and both single and double-precision calculations are supported. These two facts indicate the environment has reached a maturity level high enough for general purpose libraries such as IMSL to consider leveraging the hardware. From the GPGPU market viewpoint, the clear leader is NVIDIA. Their CUDA BLAS library is widely used in conjunction with the hardware as an enabling technology for higher level applications. NVIDIA Corp. implemented certain Level 1, 2 and 3 BLAS in their Library, CUDA CUBLAS Library, V3.1, July, The NVIDIA external names and argument protocols are different from the equivalent Fortran names and argument addressing. See Table1.0 for names marked in the color GREEN. IMSL has written these marked Fortran BLAS so that they call equivalent NVIDIA C language codes from the CUBLAS library. No direct use or knowledge of C is required by a Fortran programmer in order to take advantage of these codes. However, it is necessary that a user code or application package be compiled with a Fortran 2003 compiler that has implemented the C Interoperability Standard feature. See The Fortran 2003 Handbook, Adams, et al., p IMSL s use of this feature is the key to providing a portable version of these Fortran-callable IMSL/NVIDIA BLAS. The program or application is then compiled and linked using IMSL and NVIDIA libraries that contain these BLAS. Strategy The strategy for using the attached NVIDIA GPU is given by the following algorithm: If the maximum of vector or matrix dimensions are larger than a switchover array size, NSTART, and NVIDIA provides a CUBLAS code, then o Copy the required vector and matrix data from the CPU to the GPU o Compute the result on the GPU o Copy the result from the GPU to the CPU Else, use a FORTRAN equivalent version of the BLAS routine that does not use the GPU. Normally a code that calls an IMSL/NVIDIA BLAS code does not have to be aware of the copy steps or the switchover size, NSTART. These are hidden from the user code. In the first algorithm step, a working block is allocated on the GPU for each array argument. 4
6 A table within the IMSL module, CUBLAS_LIBRARY, records the sizes and GPU addresses of these blocks. If the sizes are too small for the current problem size and data type the blocks are reallocated to be of adequate size. The same working block on the GPU may be used for many calls to the IMSL/NVIDIA BLAS. The IMSL versions of the BLAS also allow a user to define individual values of NSTART for each routine. This is important because using the GPU may be slower than using a CPU Fortran version until a switchover array size is reached. Thereafter the GPU version is typically faster and increasingly much faster as the problem size increases. The default value of NSTART = 32 is used for each vector argument of each routine, but it may not be optimal. This default allows the routines to function correctly without initial attention to this value. The user can reset this value for each individual routine in the listings of Table1.0 marked with the color GREEN by using the IMSL routine CUBLAS_SET( ). This must be done before any use of the CUDA CUBLAS routine occurs. The switchover values can be obtained using the IMSL routine CUBLAS_GET( ). The floating point results obtained using the CPU vs. the GPU will likely differ in units of the low order bits in each component. These differences come from non-equivalent strategies of floating point arithmetic and rounding modes that are implemented in the NVIDIA board. This can be an important detail when comparing results for purposes of benchmarking or code regression. Generally either result should be acceptable for numerical work. As an added feature, the user can flag when the data values for a vector or matrix are present on the GPU and hence suppress the IMSL/NVIDIA BLAS code from first copying the data. This is often important since the data movement from the CPU to the GPU may be a significant part of the computation time. If there is no indication that the data is present, it is copied from the CPU to the GPU each time a routine is called. The necessity of copying for each use of a BLAS code depends on the application. Valid results are always copied back from the GPU to the CPU memory. The indication that data for that positional array argument requires no initial copy step is that the switchover value for that array argument is negative. The absolute value is used as the switchover value. When utilizing this feature, it is important that the user reset this to a positive value when the argument requires an initial copy step. There are four utility routines provided in the IMSL module CUDABLAS_LIBRARY that can be used to help manage the use of NVIDIA BLAS. These utilities can be used to: Get the current switchover point value Set the switchover point value to a new value Maintain buffer sizes on the NVIDIA device Print error messages generated through the use of the NVIDIA device using the IMSL error handler 5
7 Some NVIDIA hardware does not have working double precision versions of BLAS because there is no double precision arithmetic available. However, the double precision code itself is part of the CUDA CUBLAS library. It will appear to execute even though it will not give correct results when the device has no double precision arithmetic. When the IMSL software detects that the correct results are not returned, a warning error message will be printed. The user may instruct the application to henceforth use the Fortran code by setting the switchover value to zero. For example, if it is known that the hardware does not support DOUBLE PRECISION, then a code that has calls to DGEMM will use an alternate version of this routine if the switchover value for DGEMM has been reset to zero. Table 1.0: Level 2 and Level 3 Basic Linear Algebra Subprograms GREEN Denotes NVIDIA Version Available Operation Real Double Complex Double Complex Matri x-vector Multiply, General SGEMV DGEMV CGEMV ZGEMV Matri x-vector Multiply, Banded SGBMV DGBMV CGBMV ZGBMV Rank-One Matrix Update, General and Real Rank-One Matrix Update, General, Complex and Transpose SGER DGER CGERU ZGERU Rank-One Matrix Update, Symmetric and Real SSYR DSYR Matri x--matrix Multiply, General SGEMM DGEMM CGEMM ZGEMM Matri x-matrix Multiply, Symmetric SSYMM DSYMM CSYMM ZSYMM Matri x-matrix Multiply, Hermitian CHEMM ZHEMM Rank - k Update, Hermitian CHERK ZHERK Rank - 2k Update, Symmetric SSYR2K DSYR2K CSYR2K ZSYR2K Matri x-matrix Multiply, Triangular STRMM DTRMM CTRMM ZTRMM Matri x-matrix solve, Triangular STRSM DTRSM CTRSM ZTRSM The Benchmarks The performance of selected BLAS was measured on multi-core systems using the NVIDIA Tesla C2050. Each function was used to solve a large enough problem to allow for the use of the NVIDIA CUDA BLAS to be guaranteed and to significantly overcome the time required for the data copy between the CPU and GPU. Each test case was run with a varying number of threads allowed. The number of threads was set using the OpenMP environment variable OMP_NUM_THREADS. The Intel 11.1 FORTRAN compiler was used for both tested environments. 6
8 Note that in order to realize the performance gains recorded for the GEMV routines, the necessary steps to keep the data on the GPU (discussed above) were implemented. Measurements The reported results in the benchmark times are the elapsed wall clock times for each test. Comparative times for an alternate Fortran version of the BLAS routine, a vendor-supplied version of the BLAS routine, and the corresponding NVIDIA CUDA BLAS routine are given. Linux with Intel Fortran Professional Compiler (64-bit) CPU Hardware: Dual Quad Core Xeon E5420 (Harpertown) 2.5GHz, 133MHz Front Side Bus GPU Hardware: NVIDIA Tesla C2050 Operating System: Red Hat 5 Compiler: Intel Fortran 11.1 ( 64-bit) SGEMM Times DGEMM Times CGEMM Times
9 ZGEMM Times Note that for the array size of 8000 for ZGEMM, a CPU to GPU copy failure occurred because the array was too large. Therefore, the error checking was activated and the benchmark dropped down to the pure Fortran version of the algorithm. 8
10 9
11 SGEMV Times (ms) - Matrix kept on the GPU, Vector updated 30 times DGEMV Times (ms) - Matrix kept on the GPU, Vector updated 30 times CGEMV Times (ms) - Matrix kept on the GPU, Vector updated 30 times ZGEMV Times (ms) - Matrix kept on the GPU, Vector updated 30 times
12 11
13 SGER Times (ms)- Matrix kept on the GPU, Vector updated 30 times DGER Times (ms) - Matrix kept on the GPU, Vector updated 30 times
14 CGER Times (ms) - Matrix kept on the GPU, Vector updated 30 times ZGER Times (ms) - Matrix kept on the GPU, Vector updated 30 times
15 14
16 STRMM Times (s) DTRMM Times (s) CTRMM Times (s) ZTRMM Times (s)
17 Note that for the array size of 8000 for ZTRMM, a CPU to GPU copy failure occurred because the array was too large. Therefore, the error checking was activated and the benchmark dropped down to the pure Fortran version of the algorithm. 16
18 Speedup Summary The tables and charts shown above present raw timings from the benchmark test suite. Of interest to many users is the speedup observed when using the GPU hardware relative to another case. Here speedup is defined as the ratio of times, T CPU /T GPU. A value of 1 indicates the test runs in the same duration on either hardware while values greater than 1 indicate cases where running on the GPU hardware is favorable. A value of 2 would mean that the code ran in half the time on the GPU card. All of the above tests can be combined into a single heatmap style chart where results are shaded relative to the computed speedup value. The following two charts compare the NVIDIA GPU runs against first the pure Fortran BLAS implementation and then the most favorable case for the CPU, using an optimized BLAS library with 4 threads. 17
19 T[CPU]/T[GPU] for Fortran BLAS S D C Z GEMM TRMM GEMV (30 repeats) GER (30 repeats)
20 T[CPU]/T[GPU] for Vendor BLAS with 4 cores S D C Z GEMM TRMM GEMV (30 repeats) GER (30 repeats) The speedup against pure Fortran is considerable in most cases, up to almost 400 times faster for a large SGEMM problem. A highly optimized BLAS library using 4 threads is a difficult opponent, but the GPU hardware still does very well even for the relatively small 500x500 problem size for Level 3 BLAS functions. In both cases the large ZGEMM and ZTRMM problems that encountered memory issues in the copy phase are clearly evident. The need for rather large problems to see performance increases using the Level 2 BLAS function GEMV is also very clear. In general, the GER function should not be used in a standalone manner on the GPU card; instead, this function is available for convenience when users can keep data on the card and then perform this operation without additional data I/O. 19
21 Note About Floating Point Exception Handling If exceptions resulting in NaN or Inf are a concern then the user code must examine the output for the occurrence of these. However, details about gradual underflow or the occurrence of underflow or an invalid operation is not available from the GPU. If this is to be detected in a running program within the BLAS, then one must use the Fortran version. Sample Benchmarking Code The following code benchmarks the BLAS routines SGEMV, DGEMV, CGEMV, and ZGEMV. This code requires the BLAS_INTERFACE module from the IMSL Library. The array breakpoints are set so that a Fortran version of the BLAS is used first, then reset so that the NVIDIA version of the BLAS is used. Each benchmark is run 30 times with the initial matrix kept on the GPU in order to realize the performance gains. The Fortran intrinsic, system_clock, is used to measure the actual elapsed ( wall clock ) time in seconds. PROGRAM BENCHMARK USE BLAS_INTERFACE IMPLICIT NONE INTEGER I, jj!integer, PARAMETER :: NBASE=1000, NREPEATS=10, INC=1000, NCASES=09!INTEGER, PARAMETER :: NVALUES(NCASES)=(/(NBASE+I*INC,I=1,NCASES)/)!REAL(skind) :: TF(NCASES),TN(NCASES) REAL(SKIND), ALLOCATABLE :: TF(:), TN(:) INTEGER :: NBASE, NREPEATS, INC, NCASES, MODE INTEGER, ALLOCATABLE :: NVALUES(:)! By default sizes of 500, 1000, 2000, 4000, 8000 are used. You! can shorten execution time by executing fewer cases. For example! when NCASES=3 only 500, 1000, and 2000 are executed. NCASES=5! Can be used for averaging, Each sample size is executed NREPEATS! times and then averaged. During repeating arrays which are not! changed are left on the GPU. If multiple runs with the same! data array is not used using the GPU is not cost effective. NREPEATS=30! Mode 1 both NVIDEA and default Fortran are executed.! Mode=2 only NVIDEA is executed MODE=1 ALLOCATE(NVALUES(NCASES),TF(NCASES),TN(NCASES))! NVALUES=(/(NBASE+(I-1)*INC,I=1,NCASES)/) NVALUES(1) = 500 NVALUES(2) = 1000 NVALUES(3) = 2000 NVALUES(4) = 4000 NVALUES(5) =
22 do jj=1,nrepeats write(*,'(//"================ Number of Repeats = ",i4)') jj TF=0; TN=0 CALL BENCH_Sgemv(NVALUES, jj, TF, TN) write(*,'(//"(sgemv Times: Fortran then NVIDIA, F/N)")') DO I=1,NCASES WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I) TF=0; TN=0 CALL BENCH_Dgemv(NVALUES, jj, TF, TN) write(*,'(//"(dgemv Times: Fortran then NVIDIA, F/N)")') DO I=1,NCASES WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I) TF=0; TN=0 CALL BENCH_Cgemv(NVALUES, jj, TF, TN) write(*,'(//"(cgemv Times: Fortran then NVIDIA, F/N)")') DO I=1,NCASES WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I) TF=0; TN=0 CALL BENCH_Zgemv(NVALUES, jj, TF, TN) write(*,'(//"(zgemv Times: Fortran then NVIDIA, F/N)")') DO I=1,NCASES WRITE(*,'(2I6,1P3E14.4)') I,NVALUES(I),TF(I),TN(I),TF(I)/TN(I) end do CONTAINS SUBROUTINE BENCH_Sgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)! Benchmark NVIDIA and Fortran BLA code, Sgemv. INTEGER, INTENT(IN) :: NVALUES(:) INTEGER, INTENT(IN) :: NREPEATS REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)! Type variables used in calling routines: REAL(skind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:) REAL(skind) ALPHA, BETA REAL(skind) TEMP, NORM, RERR INTEGER :: I, II, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX integer :: iend, istart, icount_rate INTEGER :: ISWITCH NCASE=size(NVALUES) NMIN=minval(NVALUES) NMAX=maxval(NVALUES) DO I = 1,NCASE N=NVALUES(I) LDA=N LDB=1 LDC=1 21
23 ALLOCATE(A(LDA,N), B(LDB*N), C(LDC*N),CSAVE(LDC*N)) CALL RANDOM_NUMBER(A) CALL RANDOM_NUMBER(B) ALPHA=SONE BETA = SZERO! Set breakover value to use Fortran and then NVIDIA: IF(MODE == 1) THEN DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Sgemv,II,0) call system_clock(count=istart, COUNT_RATE=icount_rate) DO J=1,NREPEATS CALL Sgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC) call system_clock(count=iend) TFORTRAN(I)=float(iend-istart)/icount_rate CSAVE=C NORM=maxval(abs(C)) ELSE TFORTRAN=0 END IF! Use NVIDIA for all dimensions here: DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Sgemv,II,NMIN-1) C=SZERO call system_clock(count=istart, COUNT_RATE=icount_rate) ISWITCH=CUBLAS_GET(CUDABLAS_Sgemv,1) DO J=1,NREPEATS CALL Sgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)! Time without copy each use: CALL CUBLAS_SET(CUDABLAS_Sgemv,1,-ABS(ISWITCH)) CALL CUBLAS_SET(CUDABLAS_Sgemv,1,ISWITCH) call system_clock(count=iend) TNVIDIA(I)=float(iend-istart)/icount_rate! Compute relative error norm: IF(MODE == 1) RERR=MAXVAL(ABS(C-CSAVE))/NORM DEALLOCATE(A,B,C,CSAVE) IF(MODE == 2) CYCLE IF(RERR <= SQRT(EPSILON(SONE))) CYCLE! Quit with no timing if results do not agree. TNVIDIA=SZERO TFORTRAN=SZERO RETURN! I! Average the time per call. TNVIDIA=TNVIDIA/REAL(NREPEATS, skind) TFORTRAN=TFORTRAN/REAL(NREPEATS, skind) END SUBROUTINE BENCH_Sgemv SUBROUTINE BENCH_Dgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)! Benchmark NVIDIA and Fortran BLA code, Dgemv. INTEGER, INTENT(IN) :: NVALUES(:) INTEGER, INTENT(IN) :: NREPEATS REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:) 22
24 ! Type variables used in calling routines: REAL(dkind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:) REAL(dkind) ALPHA, BETA REAL(dkind) TEMP, NORM, RERR INTEGER :: I, II, ISWITCH, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX integer :: iend, istart, icount_rate NCASE=size(NVALUES) NMIN=minval(NVALUES) NMAX=maxval(NVALUES) DO I = 1,NCASE N=NVALUES(I) LDA=N LDB=1 LDC=1 ALLOCATE(A(LDA,N), B(LDB*N), C(LDC*N),CSAVE(LDC*N)) CALL RANDOM_NUMBER(A) CALL RANDOM_NUMBER(B) ALPHA=DONE BETA = DZERO! Set breakover value to use Fortran and then NVIDIA: IF(MODE == 1) THEN DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Dgemv,II,0) call system_clock(count=istart, COUNT_RATE=icount_rate) DO J=1,NREPEATS CALL Dgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC) call system_clock(count=iend) TFORTRAN(I)=float(iend-istart)/icount_rate CSAVE=C NORM=maxval(abs(C)) ELSE TFORTRAN=0 END IF! Use NVIDIA for all dimensions here: DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Dgemv,II,NMIN-1) C=DZERO call system_clock(count=istart, COUNT_RATE=icount_rate) ISWITCH=CUBLAS_GET(CUDABLAS_Dgemv,1) DO J=1,NREPEATS CALL Dgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)! Time without copy each use: CALL CUBLAS_SET(CUDABLAS_Dgemv,1,-ABS(ISWITCH)) CALL CUBLAS_SET(CUDABLAS_Dgemv,1,ISWITCH) call system_clock(count=iend) TNVIDIA(I)=float(iend-istart)/icount_rate 23
25 ! Compute relative error norm: IF(MODE==1)RERR=MAXVAL(ABS(C-CSAVE))/NORM DEALLOCATE(A,B,C,CSAVE) IF(MODE==2) CYCLE IF(RERR <= SQRT(EPSILON(DONE))) CYCLE! Quit with no timing if results do not agree. TNVIDIA=SZERO TFORTRAN=SZERO RETURN! I! Average the time per call. TNVIDIA=TNVIDIA/REAL(NREPEATS, dkind) TFORTRAN=TFORTRAN/REAL(NREPEATS, dkind) END SUBROUTINE BENCH_Dgemv SUBROUTINE BENCH_Cgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)! Benchmark NVIDIA and Fortran BLA code, Cgemv. INTEGER, INTENT(IN) :: NVALUES(:) INTEGER, INTENT(IN) :: NREPEATS REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)! Type variables used in calling routines: COMPLEX(skind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:) REAL(skind),allocatable :: S(:),T(:) COMPLEX(skind) ALPHA, BETA REAL(skind) TEMP, NORM, RERR INTEGER :: I, II, ISWITCH, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX integer :: iend, istart, icount_rate NCASE=size(NVALUES) NMIN=minval(NVALUES) NMAX=maxval(NVALUES) DO I = 1,NCASE N=NVALUES(I) LDA=N LDB=1 LDC=1 ALLOCATE(S(N), T(N),A(LDA,N), B(LDB*N),& C(LDC*N),CSAVE(LDC*N)) DO J=1,N call random_number(s) call random_number(t) A(:,J)=cmplx(S,T,skind) call random_number(s) call random_number(t) B(:)=cmplx(S,T,skind)! CALL RANDOM_NUMBER(A)! CALL RANDOM_NUMBER(B) ALPHA=SONE BETA = SZERO 24
26 ! Set breakover value to use Fortran and then NVIDIA: IF(MODE == 1) THEN DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Cgemv,II,0) call system_clock(count=istart, COUNT_RATE=icount_rate) DO J=1,NREPEATS CALL Cgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC) call system_clock(count=iend) TFORTRAN(I)=float(iend-istart)/icount_rate CSAVE=C NORM=maxval(abs(C)) ELSE TFORTRAN=0 END IF! Use NVIDIA for all dimensions here: DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Cgemv,II,NMIN-1) call system_clock(count=istart, COUNT_RATE=icount_rate) ISWITCH=CUBLAS_GET(CUDABLAS_Cgemv,1) DO J=1,NREPEATS CALL Cgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)! Time without copy each use: CALL CUBLAS_SET(CUDABLAS_Cgemv,1,-ABS(ISWITCH)) CALL CUBLAS_SET(CUDABLAS_Cgemv,1,ISWITCH) call system_clock(count=iend) TNVIDIA(I)=float(iend-istart)/icount_rate! Compute relative error norm: IF(MODE==1)RERR=MAXVAL(ABS(C-CSAVE))/NORM DEALLOCATE(A,B,C,CSAVE,S,T) IF(MODE==2) CYCLE IF(RERR <= SQRT(EPSILON(SONE))) CYCLE! Quit with no timing if results do not agree. TNVIDIA=SZERO TFORTRAN=SZERO RETURN! I! Average the time per call. TNVIDIA=TNVIDIA/REAL(NREPEATS, skind) TFORTRAN=TFORTRAN/REAL(NREPEATS, skind) END SUBROUTINE BENCH_Cgemv SUBROUTINE BENCH_Zgemv(NVALUES, NREPEATS, TFORTRAN, TNVIDIA)! Benchmark NVIDIA and Fortran BLA code, Zgemv. INTEGER, INTENT(IN) :: NVALUES(:) INTEGER, INTENT(IN) :: NREPEATS REAL(skind), INTENT(OUT) :: TNVIDIA(:), TFORTRAN(:)! Type variables used in calling routines: COMPLEX(dkind),allocatable, target :: A(:,:),B(:),C(:),CSAVE(:) REAL(dkind),allocatable :: S(:),T(:) COMPLEX(dkind) ALPHA, BETA REAL(dkind) TEMP, NORM, RERR 25
27 INTEGER :: I, II, ISWITCH, J, LDA, LDB, LDC, N, NCASE, NMIN, NMAX integer :: iend, istart, icount_rate NCASE=size(NVALUES) NMIN=minval(NVALUES) NMAX=maxval(NVALUES) DO I = 1,NCASE N=NVALUES(I) LDA=N LDB=1 LDC=1 ALLOCATE(S(N), T(N),A(LDA,N), B(LDB*N),& C(LDC*N),CSAVE(LDC*N)) DO J=1,N call random_number(s) call random_number(t) A(:,J)=cmplx(S,T,dkind) call random_number(s) call random_number(t) B(:)=cmplx(S,T,dkind)! CALL RANDOM_NUMBER(A)! CALL RANDOM_NUMBER(B) ALPHA=SONE BETA = SZERO! Set breakover value to use Fortran and then NVIDIA: IF(MODE == 1) THEN DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Zgemv,II,0) call system_clock(count=istart, COUNT_RATE=icount_rate) DO J=1,NREPEATS CALL Zgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC) call system_clock(count=iend) TFORTRAN(I)=float(iend-istart)/icount_rate CSAVE=C NORM=maxval(abs(C)) ELSE TFORTRAN=0 END IF! Use NVIDIA for all dimensions here: DO II=1,3 CALL CUBLAS_SET(CUDABLAS_Zgemv,II,NMIN-1) call system_clock(count=istart, COUNT_RATE=icount_rate) ISWITCH=CUBLAS_GET(CUDABLAS_Zgemv,1) DO J=1,NREPEATS CALL Zgemv('N',N,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)! Time without copy each use: CALL CUBLAS_SET(CUDABLAS_Zgemv,1,-ABS(ISWITCH)) CALL CUBLAS_SET(CUDABLAS_Zgemv,1,ISWITCH) call system_clock(count=iend) 26
28 TNVIDIA(I)=float(iend-istart)/icount_rate! Compute relative error norm: IF(MODE==1)RERR=MAXVAL(ABS(C-CSAVE))/NORM DEALLOCATE(A,B,C,CSAVE,S,T) IF(MODE==2) CYCLE IF(RERR <= SQRT(EPSILON(SONE))) CYCLE! Quit with no timing if results do not agree. TNVIDIA=SZERO TFORTRAN=SZERO RETURN! I! Average the time per call. TNVIDIA=TNVIDIA/REAL(NREPEATS, dkind) TFORTRAN=TFORTRAN/REAL(NREPEATS, dkind) END SUBROUTINE BENCH_Zgemv END PROGRAM BENCHMARK 27
29 Conclusion The use of the NVIDIA GPU with the corresponding CUDA BLAS Library and the IMSL FORTRAN Numerical Library is an effective means of boosting performance for problem sizes above certain thresholds. In some cases, users need to be familiar enough with their application with regard to keeping data on the GPU in order to realize the full benefit of using the CUDA BLAS Library with the IMSL Library. It is expected that further leveraging of the CUDA BLAS Library will be available when demand and scheduling suggest adding additional IMSL interface codes to this library. Finally, IMSL has performed these benchmarks on standard systems using the publicly available version of the software, and while we expect you should get similar results, it is always best to evaluate the algorithms you use on your deployment hardware for best performance. About the Author Edward Stewart is the product manager for the IMSL Numerical Libraries and Director of Research at Rogue Wave Software. Ed received his Ph.D. in physical ocean science and engineering from the University of Delaware. He has experience in many quantitative areas including quantification and interpretation of statistics and probability, coordination and analysis of large data sets, frequency domain time series analysis, partial differential equations, finite difference numerical modeling, and nonlinear dynamics. Edward has also been a major contributor in the development of new features and algorithms in PV-WAVE and the IMSL Libraries. He has published journal articles on experimental fluid dynamics and technical documents regarding Rogue Wave products. 28
CUDA Toolkit 4.0 Performance Report. June, 2011
CUDA Toolkit 4. Performance Report June, 211 CUDA Math Libraries High performance math routines for your applications: cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse
More informationUsing OpenACC With CUDA Libraries
Using OpenACC With CUDA Libraries John Urbanic with NVIDIA Pittsburgh Supercomputing Center Copyright 2015 3 Ways to Accelerate Applications Applications Libraries Drop-in Acceleration CUDA Libraries are
More informationUsing OpenACC With CUDA Libraries
Using OpenACC With CUDA Libraries John Urbanic with NVIDIA Pittsburgh Supercomputing Center Copyright 2018 3 Ways to Accelerate Applications Applications Libraries Drop-in Acceleration CUDA Libraries are
More informationCUDA 6.0 Performance Report. April 2014
CUDA 6. Performance Report April 214 1 CUDA 6 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random
More informationJMSL Library for Java Applications
JMSL Library for Java Applications A White Paper by Rogue Wave Software. Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301, USA www.rougewave.com JMSL Library for Java Applications
More informationParallel Functionality and Automatic Thread Control in PV-WAVE 10.0
Parallel Functionality and Automatic Thread Control in PV-WAVE 10.0 PV-WAVE Performance Improvement Brochure Rogue Wave Software April 2012 Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder,
More informationIMSL C Numerical Library
IMSL C Numerical Library Getting Started Guide for Windows A Technical Guide by Rogue Wave Software. Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301, USA www.roguewave.com IMSL C
More informationPremiers retours d expérience sur l utilisation de GPU pour des applications de mécanique des structures
Premiers retours d expérience sur l utilisation de GPU pour des applications de mécanique des structures Antoine Petitet et Stefanos Vlachoutsis Juin 2011 Copyright ESI Group, 2009. 2010. All rights reserved.
More informationLevel-3 BLAS on the TI C6678 multi-core DSP
Level-3 BLAS on the TI C6678 multi-core DSP Murtaza Ali, Eric Stotzer Texas Instruments {mali,estotzer}@ti.com Francisco D. Igual Dept. Arquitectura de Computadores y Automática Univ. Complutense de Madrid
More informationParallel Performance of the IMSL C Numerical Library
Parallel Performance of the IMSL C Numerical Library Benchmarking OpenMP A White Paper by Rogue Wave Software October 2012 Update Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301,
More informationIntel Math Kernel Library
Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra
More informationIntel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager
Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides
More informationParallel Programming and the IMSL Libraries
Parallel Programming and the IMSL Libraries An Overview A White Paper by Rogue Wave Software. February 2012 Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301, USA www.rougewave.com
More informationCUDA 7.0 Performance Report. May 2015
CUDA 7.0 Performance Report May 2015 1 CUDA 7.0 Performance Report cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library New in cusolver Linear Solver Library
More informationCUDA 6.5 Performance Report
CUDA 6.5 Performance Report 1 CUDA 6.5 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random Number
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More information9. Linear Algebra Computation
9. Linear Algebra Computation Basic Linear Algebra Subprograms (BLAS) Routines that provide standard, low-level, building blocks for performing basic vector and matrix operations. Originally developed
More informationPorting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method
Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Josh Romero, Massimiliano Fatica - NVIDIA Vamsi Spandan, Roberto Verzicco -
More informationA Sampling of CUDA Libraries Michael Garland
A Sampling of CUDA Libraries Michael Garland NVIDIA Research CUBLAS Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA driver Self-contained at the API level, no direct interaction
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationINTEL MKL Vectorized Compact routines
INTEL MKL Vectorized Compact routines Mesut Meterelliyoz, Peter Caday, Timothy B. Costa, Kazushige Goto, Louise Huot, Sarah Knepper, Arthur Araujo Mitrano, Shane Story 2018 BLIS RETREAT 09/17/2018 OUTLINE
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationNVBLAS LIBRARY. DU _v6.0 February User Guide
NVBLAS LIBRARY DU-06702-001_v6.0 February 2014 User Guide DU-06702-001_v6.0 2 Chapter 1. INTRODUCTION The is a GPU-accelerated Libary that implements BLAS (Basic Linear Algebra Subprograms). It can accelerate
More informationCUDA Fortran COMPILERS &TOOLS. Porting Guide
Porting Guide CUDA Fortran CUDA Fortran is the Fortran analog of the NVIDIA CUDA C language for programming GPUs. This guide includes examples of common language features used when porting Fortran applications
More informationOptimizing the operations with sparse matrices on Intel architecture
Optimizing the operations with sparse matrices on Intel architecture Gladkikh V. S. victor.s.gladkikh@intel.com Intel Xeon, Intel Itanium are trademarks of Intel Corporation in the U.S. and other countries.
More informationGPU LIBRARY ADVISOR. DA _v8.0 September Application Note
GPU LIBRARY ADVISOR DA-06762-001_v8.0 September 2016 Application Note TABLE OF CONTENTS Chapter 1. Overview... 1 Chapter 2. Usage... 2 DA-06762-001_v8.0 ii Chapter 1. OVERVIEW The NVIDIA is a cross-platform
More informationA Standard for Batching BLAS Operations
A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community
More informationDense matrix algebra and libraries (and dealing with Fortran)
Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)
More informationA Few Numerical Libraries for HPC
A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear
More informationECMWF Workshop on High Performance Computing in Meteorology. 3 rd November Dean Stewart
ECMWF Workshop on High Performance Computing in Meteorology 3 rd November 2010 Dean Stewart Agenda Company Overview Rogue Wave Product Overview IMSL Fortran TotalView Debugger Acumem ThreadSpotter 1 Copyright
More informationPorting Guide. CUDA Fortran COMPILERS &TOOLS
Porting Guide CUDA Fortran COMPILERS &TOOLS 1 Simple Increment Code Host CPU and its memory The cudafor module incudes CUDA Fortran definitions and interfaces to the runtime API The device variable attribute
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationResources for parallel computing
Resources for parallel computing BLAS Basic linear algebra subprograms. Originally published in ACM Toms (1979) (Linpack Blas + Lapack). Implement matrix operations upto matrix-matrix multiplication and
More informationAccelerating GPU Kernels for Dense Linear Algebra
Accelerating GPU Kernels for Dense Linear Algebra Rajib Nath, Stan Tomov, and Jack Dongarra Innovative Computing Lab University of Tennessee, Knoxville July 9, 21 xgemm performance of CUBLAS-2.3 on GTX28
More informationPorting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation
Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code
More informationDebugging Programs Accelerated with Intel Xeon Phi Coprocessors
Debugging Programs Accelerated with Intel Xeon Phi Coprocessors A White Paper by Rogue Wave Software. Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301, USA www.roguewave.com Debugging
More informationIssues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM
Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.
More informationPARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures
PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures Solovev S. A, Pudov S.G sergey.a.solovev@intel.com, sergey.g.pudov@intel.com Intel Xeon, Intel Core 2 Duo are trademarks of
More informationBasic Linear Algebra Subprograms Library
Software Development Kit for Multicore Acceleration Version 3.1 Basic Linear Algebra Subprograms Library Programmer s Guide and API Reference SC33-8426-01 Software Development Kit for Multicore Acceleration
More informationNAG Fortran Library Routine Document F01CTF.1
NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent
More informationIntroduction to OpenACC Directives. Duncan Poole, NVIDIA
Introduction to OpenACC Directives Duncan Poole, NVIDIA GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers Oil & Gas CAE CFD Finance
More informationSarah Knepper. Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018
Sarah Knepper Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018 Outline Motivation Problem statement and solutions Simple example Performance comparison 2 Motivation Partial differential equations
More informationNAG Fortran Library Routine Document F01CWF.1
NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More informationAccelerating GPU kernels for dense linear algebra
Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,
More informationBatch Linear Algebra for GPU-Accelerated High Performance Computing Environments
Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering
More informationOptimizations of BLIS Library for AMD ZEN Core
Optimizations of BLIS Library for AMD ZEN Core 1 Introduction BLIS [1] is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [2] The framework was
More informationLecture V: Introduction to parallel programming with Fortran coarrays
Lecture V: Introduction to parallel programming with Fortran coarrays What is parallel computing? Serial computing Single processing unit (core) is used for solving a problem One task processed at a time
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationScientific Computing. Some slides from James Lambers, Stanford
Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationA Comprehensive Study on the Performance of Implicit LS-DYNA
12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four
More informationIndex. classes, 47, 228 coarray examples, 163, 168 copystring, 122 csam, 125 csaxpy, 119 csaxpyval, 120 csyscall, 127 dfetrf,14 dfetrs, 14
Index accessor-mutator routine example in a module, 7 PUBLIC or PRIVATE components, 6 ACM, ix editors of CALGO, ix Adams, Brainerd et al., see books, Fortran reference Airy s equation boundary value problem,
More informationMatrix Multiplication Specialization in STAPL
Matrix Multiplication Specialization in STAPL Adam Fidel, Lena Olson, Antal Buss, Timmie Smith, Gabriel Tanase, Nathan Thomas, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger Parasol Lab, Dept. of Computer
More informationOvercoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science
Overcoming the Barriers to Sustained Petaflop Performance William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp But First Are we too CPU-centric? What about I/O? What do applications
More informationCAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to
CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationAutomatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University
Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning
More informationDell EMC Ready Bundle for HPC Digital Manufacturing Dassault Systѐmes Simulia Abaqus Performance
Dell EMC Ready Bundle for HPC Digital Manufacturing Dassault Systѐmes Simulia Abaqus Performance This Dell EMC technical white paper discusses performance benchmarking results and analysis for Simulia
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationEngineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary
white paper Computer-Aided Engineering ANSYS Mechanical on Intel Xeon Processors Engineer Productivity Boosted by Higher-Core CPUs Engineers can be significantly more productive when ANSYS Mechanical runs
More informationPRACE PATC Course: Intel MIC Programming Workshop, MKL LRZ,
PRACE PATC Course: Intel MIC Programming Workshop, MKL LRZ, 27.6-29.6.2016 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi - Compiler Assisted Offload - Automatic Offload - Native Execution
More informationBLAS. Christoph Ortner Stef Salvini
BLAS Christoph Ortner Stef Salvini The BLASics Basic Linear Algebra Subroutines Building blocks for more complex computations Very widely used Level means number of operations Level 1: vector-vector operations
More informationCAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018
CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected
More informationImplementing Level-3 BLAS Routines in OpenCL on Different Processing Units
Technical Report 2014-001 Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units Kazuya Matsumoto, Naohito Nakasato, and Stanislav Sedukhin October 22, 2014 Graduate School of Computer
More informationDealing with Heterogeneous Multicores
Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism
More informationAMath 483/583 Lecture 22. Notes: Another Send/Receive example. Notes: Notes: Another Send/Receive example. Outline:
AMath 483/583 Lecture 22 Outline: MPI Master Worker paradigm Linear algebra LAPACK and the BLAS References: $UWHPSC/codes/mpi class notes: MPI section class notes: Linear algebra Another Send/Receive example
More informationLinear Algebra for Modern Computers. Jack Dongarra
Linear Algebra for Modern Computers Jack Dongarra Tuning for Caches 1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining. 2 Indirect Addressing d
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming
More informationCustomer Oriented Q&A: Intel Visual Fortran Composer XE 2011 with IMSL
Customer Oriented Q&A: Intel Visual Fortran Composer XE 2011 with IMSL New Deployment Licensing for Your Applications that Include Rogue Wave* IMSL* Numerical Libraries Content Intel resells the Rogue
More informationA quick guide to Fortran
A quick guide to Fortran Sergiy Bubin Department of Physics Nazarbayev University History of Fortran One of the oldest general purpose high-level computer languages First developed in 1957 at IBM in the
More informationSimulation and Benchmarking of Modelica Models on Multi-core Architectures with Explicit Parallel Algorithmic Language Extensions
Simulation and Benchmarking of Modelica Models on Multi-core Architectures with Explicit Parallel Algorithmic Language Extensions Afshin Hemmati Moghadam Mahder Gebremedhin Kristian Stavåker Peter Fritzson
More informationAutomatic Intra-Application Load Balancing for Heterogeneous Systems
Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena
More informationPerformance Modeling for Ranking Blocked Algorithms
Performance Modeling for Ranking Blocked Algorithms Elmar Peise Aachen Institute for Advanced Study in Computational Engineering Science 27.4.2012 Elmar Peise (AICES) Performance Modeling 27.4.2012 1 Blocked
More informationIntel MIC Architecture. Dr. Momme Allalen, LRZ, PRACE PATC: Intel MIC&GPU Programming Workshop
Intel MKL @ MIC Architecture Dr. Momme Allalen, LRZ, allalen@lrz.de PRACE PATC: Intel MIC&GPU Programming Workshop 1 2 Momme Allalen, HPC with GPGPUs, Oct. 10, 2011 What is the Intel MKL? Math library
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationAddressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer
Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2
More informationAutomatic Development of Linear Algebra Libraries for the Tesla Series
Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source
More informationGPU Programming Paradigms
GPU Programming with PGI CUDA Fortran and the PGI Accelerator Programming Model Boris Bierbaum, Sandra Wienke (26.3.2010) 1 GPUs@RZ Current: linuxc7: CentOS 5.3, Nvidia GeForce GT 220 hpc-denver: Windows
More informationThe Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.
TITLE Basic Linear Algebra Subprograms BYLINE Robert A. van de Geijn Department of Computer Science The University of Texas at Austin Austin, TX USA rvdg@cs.utexas.edu Kazushige Goto Texas Advanced Computing
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationSolution of Out-of-Core Lower-Upper Decomposition for Complex Valued Matrices
Solution of Out-of-Core Lower-Upper Decomposition for Complex Valued Matrices Marianne Spurrier and Joe Swartz, Lockheed Martin Corp. and ruce lack, Cray Inc. ASTRACT: Matrix decomposition and solution
More informationAdvanced School in High Performance and GRID Computing November Mathematical Libraries. Part I
1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street
More informationIntel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth
Intel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth Contents Intel Visual Fortran Compiler Professional Edition for Windows*........................ 3 Features...3 New in This
More informationAccelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic
Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago
More informationThe Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015
The Rise of Open Programming Frameworks JC BARATAULT IWOCL May 2015 1,000+ OpenCL projects SourceForge GitHub Google Code BitBucket 2 TUM.3D Virtual Wind Tunnel 10K C++ lines of code, 30 GPU kernels CUDA
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationCURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS
CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS Roberto Gomperts (NVIDIA, Corp.) Michael Frisch (Gaussian, Inc.) Giovanni Scalmani (Gaussian, Inc.) Brent Leback (PGI) TOPICS Gaussian Design
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationCS 179: Lecture 10. Introduction to cublas
CS 179: Lecture 10 Introduction to cublas Table of contents, you are here. Welcome to week 4, this is new material from here on out so please ask questions and help the TAs to improve the lectures and
More informationAllocating Storage for 1-Dimensional Arrays
Allocating Storage for 1-Dimensional Arrays Recall that if we know beforehand what size we want an array to be, then we allocate storage in the declaration statement, e.g., real, dimension (100 ) :: temperatures
More information1.3 Data processing; data storage; data movement; and control.
CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical
More informationNAG Library Routine Document C05RBF.1
C05 Roots of One or More Transcendental Equations NAG Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised
More informationAutomatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee.
Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee Outline Pre-intro: BLAS Motivation What is ATLAS Present release How ATLAS works
More informationCUDA Accelerated Compute Libraries. M. Naumov
CUDA Accelerated Compute Libraries M. Naumov Outline Motivation Why should you use libraries? CUDA Toolkit Libraries Overview of performance CUDA Proprietary Libraries Address specific markets Third Party
More informationPACKAGE SPECIFICATION 21 I S 22 P = I
PACKAGE SPECFCATON 1 SUMMARY For a full matrix that is real symmetric, complex Hermitian, or complex symmetric, this module performs partial or full factorizations and performs solutions of corresponding
More informationSoftware Announcement October 14, 2003
Software Announcement October 14, 2003 IBM Parallel Engineering and Scientific Subroutine Library for (Parallel ESSL) offers scientific subroutines for optimum performance for AIX 5L Overview IBM Parallel
More information