GPUBenchmark results for tesla2

Size: px

Start display at page:

Download "GPUBenchmark results for tesla2"

Stewart Bell
6 years ago
Views:

1 Benchmark results for tesla2 May 4, 202 Abstract This report shows the Benchmark results obtained on tesla2 on May 4, 202. Contents Introduction 2 Hardware description 3 Transfer speed between hard disk and main memory 3 4 Transfer speed between and main memory 3 5 Transfer speed between two s 5 6 Matrix-matrix multiplication performance 7 7 Matrix-vector multiplication performance 9 Introduction Benchmark has been used to evaluate different aspects of the tesla2 computer. Depending on its hardware architecture and the libraries available when Benchmark was run, some or all of the following aspects will be reported in this document: Transfer speed between hard disk and main memory. Transfer speed between and main memory. Transfer speed between two s. Matrix-matrix multiplication performance. Matrix-vector multiplication performance. The next section describes the hardware characteristics of tesla2. Each one of the remainder sections will focus in one of the previously enumerated performance aspects.

2 Executed on May 4, 202 Benchmark results for tesla2 2 Hardware description This section shows the characteristics of the s and of tesla2. The s available at tesla2 have the next characteristics: cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB 2

3 Benchmark results for tesla2 Executed on May 4, 202 bogomips : The Device installed on tesla2 that has been used to perform some of the tests has the next characteristics: Tesla T20 Processor CUDA driver version : 4000 CUDA Runtime version : 4000 Multiprocessors : 4 Global memory (total) : bytes Constant memory (total) : bytes Shared memory per block (total) : 4952 bytes Available registers per block : Threads per block : 24 Max. dimension of a block : 24 x 24 x 64 Max. dimension of a grid : x x Clock rate :.5 GHz 3 Transfer speed between hard disk and main memory To obtain the transfer speed from hard disk to main memory, and from main memory to hard disk, several matrices of floats with different number of rows and columns have been created, and the time required to transfer these matrices from hard disk to main memory, and viceversa, has been measured. In order to obtain a more accurate measure, for each number of rows and columns, ten transfers were carried out in both directions. The median of the results for each case is reported. Table shows the transfer speed obtained for several numbers of rows and columns, from hard disk to main memory, and from main memory to hard disk. Note that the transfer speed is given in Mebibytes per second (MiB/s). These results are reported graphically in Figure. Rows Columns Size Hard disk Main memory (MiB) Main memory (MiB/s) Hard disk (MiB/s) Table : Transfer speed from hard disk to main memory, and from main memory to hard disk, for different matrix sizes A Mebibyte is defined as 2 20 bytes 3

4 Executed on May 4, 202 Benchmark results for tesla Transfer speed (MiB/s) Hard disk to memory Memory to hard disk Size of matrix (MiB) Figure : Transfer speed from hard disk to main memory, and from main memory to hard disk, for different matrix sizes 4 Transfer speed between and main memory To obtain the transfer speed from internal memory to main memory, and from main memory to internal memory, several matrices of floats with different number of rows and columns have been created, and the time required to transfer these matrices from internal memory to main memory, and viceversa, has been measured. In order to obtain a more accurate measure, for each number of rows and columns, ten transfers were carried out in both directions. The median of the results for each case is reported. Table 2 shows the transfer speed obtained for several numbers of rows and columns, from to main memory, and from main memory to. Note that the transfer speed is given in Mebibytes per second (MiB/s) 2. Rows Columns Size Main memory (MiB) Main memory (MiB/s) (MiB/s) Table 2: Transfer speed from to main memory, and from main memory to, for different matrix sizes 2 A Mebibyte is defined as 2 20 bytes 4

5 Benchmark results for tesla2 Executed on May 4, 202 The same tests have been done using padding to allocate the matrices in the internal memory. When padding is applied, it may be necessary to reserve additional storage to ensure that corresponding pointers in any given row will continue to meet the alignment requirements for coalescing. Padding is the recommended allocation method for 2D arrays. Table 3 shows the transfer speed obtained for several numbers of rows and columns, from to main memory, and from main memory to, when padding is in use. Note that the transfer speed is given in Mebibytes per second. Rows Columns Size Main memory (MiB) Main memory (MiB/s) (MiB/s) Table 3: Transfer speed from to main memory, and from main memory to, when using padding, for different matrix sizes Figure 2 shows the transfer speed from to main memory, and from main memory to, with and without padding Transfer speed (MiB/s) > main memory (with padding) Main memory -> (with padding) -> main memory Main memory -> Size of matrix (MiB) Figure 2: Transfer speed from to main memory, and from main memory to, with and without padding, for different matrix sizes 5

6 Executed on May 4, 202 Benchmark results for tesla2 5 Transfer speed between two s To obtain the transfer speed between two s using Direct or via main memory, several matrices of floats with different number of rows and columns have been created, and the time required to transfer these matrices from one to another, both using Direct and via main memory, has been measured. In order to obtain a more accurate measure, for each number of rows and columns, ten transfers were carried out in both directions. The median of the results for each case is reported. Table 4 shows the transfer speed obtained for several numbers of rows and columns, from one to another both using Direct and via main memory. Note that the transfer speed is given in Mebibytes per second (MiB/s) 3. These results are reported graphically in Figure 3. Rows Columns Size (MiB) 2 (MiB/s) MM 2 (MiB/s) Table 4: Transfer speed from one to another using Direct and via main memory Transfer speed (MiB/s) >2 ->MM-> Size of matrix (MiB) Figure 3: Transfer speed from one to another using Direct and via main memory 3 A Mebibyte is defined as 2 20 bytes 6

7 Benchmark results for tesla2 Executed on May 4, Matrix-matrix multiplication performance This section shows both the time required and the GFLOPS obtained by the and the on tesla2 when computing the operation: C = αab + βc, where A R m k, B R k n, C R m n, and α and β are scalars. This operation has been performed on the by calling the CBLAS cblas_sgemm() function, and on the by calling the CUBLAS cublassgemm() function. Different m, n, and k values have been used. For each value of m, the n and k values have been obtained as 25, 50, 75, and 0% of m. In order to obtain a more accurate measure, for each combination of m, n, and k, ten operations have been carried on in both and. Then, the median of the outcomes for each case has been computed. The GFLOPS for each combination of m, n, and k has been computed as: 2mnk 9 GF LOP S = time Table 5 shows the time and GFLOPS results obtained for the given m, n, and k values. m n k (ms) (ms) GFLOPS GFLOPS Table 5: Time and GFLOPS for the operation C = αab + βc on the and on the for different m, n and k values Figure 4 shows the time to compute the operation C = αab + βc on the and on the (notice that n and k are 25, 50, 75, and 0% of m). Figure 5 shows the and GFLOPS for these values of m, n and k. 7

8 Executed on May 4, 202 Benchmark results for tesla (a) n and k are 25% of m (b) n and k are 50% of m 0000 e (c) n and k are 75% of m (d) n and k are 0% of m Figure 4: Time to compute the operation C = αab + βc on the and on the 00 GFLOPS GFLOPS n=k=25% m n=k=50% m n=k=75% m n=k=0% m n=k=25% m n=k=50% m n=k=75% m n=k=0% m Figure 5: and GFLOPS for C = αab + βc with several m, n and k values (n and k are 25, 50, 75, and 0% of m) 8

9 Benchmark results for tesla2 Executed on May 4, Matrix-vector multiplication performance This section shows both the time required and the GFLOPS obtained by the and the on tesla2 when computing the operation: y = αax + βy, where A R m n, x R n, y R m, and α and β are scalars. This operation has been performed on the by calling the CBLAS cblas_sgemv() function, and on the by calling the CUBLAS cublassgemv function. Different m and n values have been used. For each value of m, the n values have been obtained as 25, 50, 75, and 0% of m. In order to obtain a more accurate measure, for each combination of m and n, ten operations have been carried on in both and. Then, the median of the outcomes for each case has been computed. The GFLOPS for each combination of m and n has been computed as: GF LOP S = 2mn 9 time Table 6 shows the time and GFLOPS results obtained for the given m and n values. Figure 6 shows the time to compute the operation y = αax + βy on the and on the (notice that n is 25, 50, 75, and 0% of m). Figure 7 shows the and GFLOPS for these values of m and n. 9

10 Executed on May 4, 202 Benchmark results for tesla2 m n (ms) (ms) GFLOPS GFLOPS Table 6: Time and GFLOPS for the operation y = αax + βy on the and on the for different m and n values

11 Benchmark results for tesla2 Executed on May 4, (a) n is 25% of m (b) n is 50% of m (c) n is 75% of m (d) n is 0% of m Figure 6: Time to compute the operation y = αax + βy on the and on the 0 GFLOPS GFLOPS n=25% m n=50% m n=75% m n=0% m n=25% m n=50% m n=75% m n=0% m Figure 7: and GFLOPS for y = αax + βy with several m and n values (n is 25, 50, 75, and 0% of m)

Accelerating GPU Kernels for Dense Linear Algebra

Accelerating GPU Kernels for Dense Linear Algebra Rajib Nath, Stan Tomov, and Jack Dongarra Innovative Computing Lab University of Tennessee, Knoxville July 9, 21 xgemm performance of CUBLAS-2.3 on GTX28