Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA


Kazuya Matsumoto 1, Norihisa Fujita 2, Toshihiro Hanawa 3, and Taisuke Boku 1,2

1 Center for Computational Sciences, University of Tsukuba
2 Graduate School of Systems and Information Engineering, University of Tsukuba
3 Information Technology Center, The University of Tokyo

Abstract. We have been developing a proprietary interconnect technology called the Tightly Coupled Accelerators (TCA) architecture to improve communication latency and bandwidth between accelerators (GPUs) on different nodes. This paper presents a Conjugate Gradient (CG) benchmark implementation using the TCA and the results of a performance evaluation on the HA-PACS/TCA system, a proof-of-concept GPU cluster based on the TCA concept. The implementation is based on the CG benchmark in the NAS Parallel Benchmarks, and it is parallelized by a two-dimensional decomposition of the matrix data. For small benchmark classes, using the TCA improves communication performance compared with the implementation that uses MPI/InfiniBand. This study also shows that the CG implementation with the two-dimensional decomposition makes better use of the TCA interconnect than a CG implementation with a one-dimensional decomposition.

1 Introduction

GPU clusters are now widely used as high performance computing systems. A problem with GPU clusters is that communication between compute nodes is slow relative to their high computation speed. To address this problem, we have been researching the Tightly Coupled Accelerators (TCA) architecture [2]. The TCA is a technology built on a proprietary interconnect network that enables direct communication between accelerators on different nodes. We have conducted basic performance evaluations of the TCA [4, 1].
In [4], a Conjugate Gradient (CG) method was implemented using allgather and allreduce collective communications built on TCA's communication functions. That CG implementation is parallelized by a one-dimensional decomposition of the matrix data. The performance evaluation shows that the CG method implementation using TCA outperforms the implementation using MPI/InfiniBand for sparse matrices whose matrix (problem) size is nine thousand or smaller; however, the communication performance of TCA tends to become lower when the target matrix size, the number of processes, or both are larger.

In the present study, we evaluate a different CG method implementation that uses the TCA. The CG implementation is based on the CG benchmark in the NAS Parallel Benchmarks. We apply a two-dimensional decomposition of the matrix as an enhancement of the implementation in [4], and present results of a performance evaluation on the HA-PACS/TCA GPU cluster. Additionally, this study presents the communication performance differences between the CG implementations with the two-dimensional decomposition and the one-dimensional decomposition.

2 Tightly Coupled Accelerators Architecture

This section briefly explains the Tightly Coupled Accelerators (TCA) architecture (see [2, 1] for detailed information on the TCA). The TCA is a novel proprietary interconnect technology for PC clusters. The PCI Express Adaptive Communication Hub ver. 2 (PEACH2) is a prototype implementation of the TCA architecture. A cluster system can be constructed by connecting PEACH2 boards with each other. Communication over the PEACH2 uses only the PCIe protocol, which eliminates the protocol conversion overhead required by InfiniBand. Consequently, the PEACH2 enables data communication with extremely low latency.

The PEACH2 provides two types of data communication functions: PIO and DMA. In PIO communication, the data is transferred to a remote node by the CPU's remote write operation. The latency of PIO is very small, so PIO is useful for transferring short messages. The DMA function is provided by the DMA controller, which has four DMA channels. While the latency of DMA is larger than that of PIO, DMA achieves a higher maximum bandwidth.

The HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences) is a GPU cluster system at the Center for Computational Sciences, University of Tsukuba.
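Returning to the two PEACH2 transfer functions described above, the trade-off between PIO and DMA can be pictured as a size-based dispatch. The following is a minimal sketch, not the PEACH2 driver API: the function name `choose_transfer_mode` and the constant name are hypothetical, and the 128-byte threshold is the crossover that this paper reports for its ping-pong measurement (Fig. 2), so it should be read as a property of that measurement rather than a general rule.

```python
# Hypothetical helper illustrating the PIO/DMA trade-off; not part of any real TCA API.
# Assumption: the 128-byte crossover is the one this paper reports for its ping-pong test.
PIO_DMA_CROSSOVER_BYTES = 128


def choose_transfer_mode(message_bytes: int) -> str:
    """Return "PIO" for short messages and "DMA" for messages above the crossover."""
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    # PIO has the smaller latency, DMA the higher peak bandwidth.
    if message_bytes <= PIO_DMA_CROSSOVER_BYTES:
        return "PIO"
    return "DMA"


if __name__ == "__main__":
    # An 8-byte scalar (one double) versus a block of 8 * 1400 / 4 bytes.
    for size in (8, 8 * 1400 // 4):
        print(size, choose_transfer_mode(size))
```

Under this rule an 8-byte scalar goes through PIO, while a block of a few kilobytes goes through DMA.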
The HA-PACS/TCA is a proof-of-concept system for the TCA architecture and a performance evaluation test-bed for the PEACH2 board. The HA-PACS/TCA contains the PEACH2 as an interconnect adapter in addition to a commodity InfiniBand interconnect. Table 1 shows the specification of HA-PACS/TCA. The HA-PACS/TCA consists of four sub-clusters, each composed of 16 compute nodes. The 16 nodes are connected by the PEACH2 and form a 2 × 8 torus network. Note that the 64 nodes of HA-PACS/TCA are also connected by two ports of InfiniBand QDR in a fat-tree configuration with full bisection bandwidth.

Table 1. HA-PACS/TCA system configuration

Node configuration
  Motherboard        SuperMicro X9DRG-QF
  CPU                Intel Xeon E v2 2.8 GHz × 2 (IvyBridge, 10 cores / CPU)
  Memory             DDR MHz × 4 ch, 128 GB (= 8 × 16 GB)
  Peak performance   224 Gflops / CPU
  GPU                NVIDIA Tesla K20X 732 MHz × 4 (Kepler GK cores / GPU)
  Memory             GDDR5 6 GB / GPU
  Peak performance   1.31 Tflops / GPU
  Interconnect       InfiniBand: Mellanox Connect-X3 Dual-port QDR
                     TCA: PEACH2 board (Altera Stratix-IV GX 530 FPGA)

System configuration
  Number of nodes    64
  Interconnect       InfiniBand QDR 108 ports switch × 2 ch
  Peak performance   364 Tflops

3 Implementation

The CG benchmark in the NAS Parallel Benchmarks (NPB) is originally written in Fortran. Though several CG benchmark studies on GPUs have been conducted [3, 8], to our knowledge no implementation written in MPI and CUDA is available. Therefore, in the present study, we modify the MPI version of the NAS CG benchmark such that the main computation part (the conj_grad function) is written in C and CUDA, and its inter-node data communications are conducted with the TCA/PEACH2.

In the following, let us denote a linear equation system as Ax = b, where A is an n × n symmetric positive definite matrix, and both x and b are vectors with n elements. The modified implementation uses the same data distribution and communication pattern as the original MPI version. The CG benchmark is parallelized by a two-dimensional decomposition of the matrix A and a conformable distribution of the vectors. When we define the number of processes as P, the matrix A is two-dimensionally decomposed over P = P_r × P_c processes. Based on the decomposition, each process holds an (n/P_r) × (n/P_c) sub-matrix of A and vectors with n/P_c elements. Figure 1 shows the process distribution by two-dimensional decomposition of the matrix A for P = 2, 4, 8, 16.

Fig. 1. Process distribution by two-dimensional decomposition of matrix data (panels: P = 2, P = 4, P = 8, P = 16). The number in each rectangle represents its process rank i.

Almost all computations in the implementation are carried out by GPUs. While we implement the sparse matrix-vector multiplication (SpMV) ourselves, our CG implementation uses NVIDIA's CUBLAS library for the vector dot product (DOT) and vector addition (AXPY) operations. Note that the computation part is not deeply tuned, since this study focuses on the performance evaluation of data communication.

Three kinds of data communication are required in the implementation. The first sends vector data to obtain the SpMV product after the local SpMV computation on the GPU of each node. This communication is conducted among the P_c processes in the same process row in a binary tree fashion, and thus log2 P_c communication steps are needed (each step sends 8n/P_c bytes of vector data in double precision). The second sends a scalar (8 bytes) value to compute the sum of the local products of the DOT computation. This communication is also made among the P_c processes, and log2 P_c steps are required. The third sends 8n/P_c bytes of a vector for data exchange. In the following, let us name the first, second and third communication COMM_SpMV, COMM_DOT and COMM_EXCH, respectively.

COMM_DOT is a scalar data communication between the CPU memories of different processes, and the latency of issuing communication operations accounts for almost all of its communication time. COMM_SpMV and COMM_EXCH are block data communications of 8n/P_c bytes between different GPU memories, for which communication bandwidth matters as well as the issue latency. Considering these characteristics, we implement COMM_SpMV and COMM_EXCH with the DMA communication function of TCA/PEACH2 and implement COMM_DOT with the PIO function. Note that, as shown in Figure 2, DMA communication is faster than PIO communication for messages larger than 128 bytes, and the block communications required in every problem class of the CG benchmark exceed that size.

The TCA/PEACH2 forms a 2 × 8 torus network on a sub-cluster of HA-PACS/TCA; thus, the process (node) mapping also affects the communication performance. As can be seen from Fig. 1, the CG implementation with two-dimensional decomposition requires communication among P_c processes (4 processes at most when we use up to 16 nodes). We use the node mapping shown in Figure 3. This mapping does not cause message data contention or collisions within the TCA/PEACH2 communication network on a sub-cluster, even when P = 16.

Fig. 2. Ping-pong communication performance between two neighboring nodes on the HA-PACS/TCA.

Fig. 3. Node mapping optimized for the CG benchmark implementation on a sub-cluster of HA-PACS/TCA. Circles represent compute nodes, lines between circles represent data links, and the number corresponds to the process rank i.

4 Performance Evaluation

We conduct performance measurements on a sub-cluster of the HA-PACS/TCA. A single GPU and a single CPU socket are used per node (footnote 4). For comparison with the implementation using TCA/PEACH2, this section also presents the performance of an implementation using MPI/InfiniBand (MPI/IB) for inter-node communications. We use the MVAPICH2 GDR 2.1a (MV2GDR) MPI library implementation [6]. As with the TCA/PEACH2, the MV2GDR uses the GPUDirect for RDMA (GDR) technology [5] for direct communication between GPUs. Note that the theoretical peak bandwidth of TCA/PEACH2 (PCIe Gen2 x8) is half that of MPI/IB (dual-rail InfiniBand QDR; footnote 5); therefore, the implementation using TCA/PEACH2 is outperformed when message sizes become large. With the program compiled by the Intel C compiler using MV2GDR 2.1a and CUDA 6.5, the TCA/PEACH2 is faster than the MPI/IB for message sizes up to 64 KB in terms of ping-pong communication performance, as shown in Fig. 2.

Footnote 4: This is because using two or more sub-clusters entails a hybrid use of the TCA/PEACH2 and MPI/IB, and because using two or more GPUs requires additional considerations to use the TCA/PEACH2 effectively. Both the hybrid use and the multi-GPU usage are among our future work.

Footnote 5: The theoretical peak bandwidth of the dual-rail InfiniBand QDR is 8 GB/s, which is equivalent to that of PCIe Gen3 x8.

In the CG benchmark, the problem size (CLASS) and the number of processes (P) can be specified. This section presents results of performance evaluations for CLASS=S, W, A, B and P = 2, 4, 8, 16. The problem (matrix/vector) size is 1,400 for CLASS=S, 7,000 for CLASS=W, 14,000 for CLASS=A, and 75,000 for CLASS=B. We measure the time consumed by each computation and communication operation in the conj_grad function. Figure 4 shows the measured time breakdown, averaged over ten calls to the function. Note that this is the breakdown for process rank 0, and the communication time is the communication wait time. A single call of the conj_grad function includes 26 log2 P_c COMM_SpMV, 52 log2 P_c COMM_DOT, and 26 COMM_EXCH communications, together with 26 SpMV and 52 DOT operations and the accompanying AXPY operations.

Fig. 4. Time breakdown of a single conj_grad function call of the CG benchmark. In each figure, the performance (time in microseconds) is shown for the implementation in Fortran (original NPB-MPI code), C language, CUDA with TCA/PEACH2 communication, and CUDA with MPI/IB communication. The upper part of the bars for several Fortran and C results, such as the CLASS=B, P=2 case, is cut off for simplicity.

Fig. 5. Time breakdown of the CG implementation with one-dimensional decomposition.

In Fig. 4, the performance is shown for the implementation in Fortran (original NPB-MPI code), C language, CUDA with TCA/PEACH2 communication, and CUDA with MPI/IB communication. The Fortran implementation is faster than the C implementation (similar performance results were reported for the NPB-LU benchmark by Pennycook et al. [7]). Compared with the original Fortran implementation, using GPU/CUDA for the NPB implementation degrades the performance in several cases (particularly for smaller size classes and larger numbers of processes). This is mainly because GPU usage brings larger overheads for the CUDA kernel invocations (footnote 6) and for the inter-node communication calls between GPUs.

Using the TCA/PEACH2 improves performance for the three small size classes (CLASS=S, CLASS=W, and CLASS=A) compared with using MPI/IB. In particular, in the case of CLASS=S with P = 16, the communication performance is 3.00 times higher, and the overall performance including the computation time and communication time is 1.24 times higher. This large improvement by TCA/PEACH2 derives from improvements in COMM_SpMV and COMM_DOT: in this case, the communication time is 2.42 times shorter for COMM_SpMV and 4.72 times shorter for COMM_DOT. In the case of CLASS=A with P = 16, the communication performance using the TCA/PEACH2 is 1.44 times higher and the overall performance is 1.12 times higher. The performance of COMM_DOT with TCA/PEACH2 is higher in all four size classes, since the communication latency mostly determines the performance of the scalar data communication.

However, the TCA/PEACH2 is not always effective. The performance for CLASS=B is almost the same for both interconnects. In the CLASS=B case, COMM_EXCH is the main performance bottleneck.

To see how the performance differs between the CG benchmark implementation with two-dimensional decomposition (2D-CG) in the present study and the CG implementation with one-dimensional decomposition (1D-CG) in our previous study [4], we additionally measure the performance of the 1D-CG implementation under equivalent conditions and cases (1D-CG is implemented by ourselves and is not part of the NAS Parallel Benchmarks). Figure 5 shows the measured time breakdown of the 1D-CG for matrices corresponding to CLASS=S and CLASS=B. The 1D-CG uses allgather and allreduce collective communications (see [4] for implementation details).
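To illustrate why a one-dimensional decomposition leans on allgather and allreduce, the following minimal sketch simulates the pattern inside a single Python process. It is an illustration under stated assumptions, not the implementation from [4]: the helper names (`allgather`, `allreduce_sum`, `spmv_1d`, `dot_1d`) are hypothetical, the ranks are plain list entries rather than MPI processes, and dense rows stand in for sparse storage to keep the example short.

```python
# Single-process simulation of a 1D (row-block) decomposition; all names are hypothetical.

def allgather(chunks):
    """Every rank ends up with the concatenation of all ranks' chunks."""
    full = [value for chunk in chunks for value in chunk]
    return [list(full) for _ in chunks]


def allreduce_sum(values):
    """Every rank ends up with the sum of all ranks' local values."""
    total = sum(values)
    return [total for _ in values]


def spmv_1d(row_blocks, x_chunks):
    """Each rank gathers the whole vector x, then multiplies its own block of rows."""
    gathered = allgather(x_chunks)
    return [
        [sum(a * xj for a, xj in zip(row, x_full)) for row in rows]
        for rows, x_full in zip(row_blocks, gathered)
    ]


def dot_1d(x_chunks, y_chunks):
    """Each rank computes a partial dot product, then all ranks sum the partials."""
    partial = [sum(a * b for a, b in zip(xc, yc)) for xc, yc in zip(x_chunks, y_chunks)]
    return allreduce_sum(partial)


if __name__ == "__main__":
    # A 4 x 4 matrix split into two row blocks (two simulated ranks).
    row_blocks = [[[4, 1, 0, 0], [1, 3, 0, 0]], [[0, 0, 2, 1], [0, 0, 1, 5]]]
    x_chunks = [[1, 2], [3, 4]]
    print(spmv_1d(row_blocks, x_chunks))
    print(dot_1d(x_chunks, x_chunks))
```

In this pattern every rank takes part in both collectives, and the allgather moves vector data proportional to the full length n regardless of how many ranks there are.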
All P processes are involved in both collective communications in 1D-CG, whereas at most P_c processes are involved in 2D-CG. Since the network topology of TCA/PEACH2 is a 2 × 8 torus, collisions within the communication network cannot be avoided for P = 16 in 1D-CG. As shown in Fig. 5, the communication time in 1D-CG grows as P increases. In addition, the largest message size of 1D-CG is larger than that of 2D-CG for P = 8 and 16 (8n/2 bytes in 1D-CG versus 8n/P_c bytes in 2D-CG). This is a disadvantage for large matrix sizes (CLASS=B), and the relative performance difference between TCA/PEACH2 and MPI/IB is larger than in the 2D-CG implementation. In general, each message in 2D-CG is no longer than the corresponding message in 1D-CG. Performance degradation with shorter messages is serious in MPI/IB, while TCA/PEACH2 provides good performance thanks to its very small latency. Thus, the combination of such short messages and the avoidance of message collisions on the torus network of TCA/PEACH2 leads to the performance improvement of the 2D-CG benchmark.

Footnote 6: A CUDA kernel invocation takes at least 9.7 µs, including the CUDA stream synchronization time, in our measurement.
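The message-size comparison above can be checked with a few lines of arithmetic. The sketch below is hedged: the function names are hypothetical, and the rule for splitting P into P_r × P_c (P_c equal to P_r, or twice P_r when P is an odd power of two) is an assumption that matches the cases stated in this paper (P_c = 4 for P = 8 and 16) rather than something taken from the authors' code.

```python
# Hypothetical helpers for the 1D versus 2D message-size arithmetic.

def process_grid(p: int):
    """Split a power-of-two P into (P_r, P_c), assuming P_c = P_r or P_c = 2 * P_r."""
    if p < 1 or p & (p - 1):
        raise ValueError("P must be a positive power of two")
    log_p = p.bit_length() - 1
    p_c = 1 << ((log_p + 1) // 2)
    return p // p_c, p_c


def reduction_steps(p: int) -> int:
    """Number of binary-tree steps among the P_c processes of one row: log2(P_c)."""
    _, p_c = process_grid(p)
    return p_c.bit_length() - 1


def largest_message_bytes_2d(n: int, p: int) -> int:
    """Block size 8n / P_c for double-precision vector data in 2D-CG."""
    _, p_c = process_grid(p)
    return 8 * n // p_c


def largest_message_bytes_1d(n: int) -> int:
    """Largest message size 8n / 2 stated for 1D-CG."""
    return 8 * n // 2


if __name__ == "__main__":
    sizes = {"S": 1400, "W": 7000, "A": 14000, "B": 75000}
    for name, n in sizes.items():
        for p in (2, 4, 8, 16):
            print(name, p, process_grid(p), reduction_steps(p),
                  largest_message_bytes_2d(n, p), largest_message_bytes_1d(n))
```

For CLASS=B (n = 75,000) with P = 16 this gives 150,000 bytes per block in 2D-CG against 300,000 bytes for the largest 1D-CG message, while for P = 2 and 4 the two sizes coincide.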

5 Conclusion

The present study has used the TCA/PEACH2 for an implementation of the NAS Parallel CG benchmark and conducted its performance evaluation on the HA-PACS/TCA GPU cluster. The results show that the CG implementation parallelized by a two-dimensional decomposition of the matrix data does not cause message data collisions within the communication network of TCA/PEACH2 when processes are properly mapped to nodes, and the present CG implementation is considered better suited to the TCA/PEACH2 than the previous CG method implementation with a one-dimensional decomposition [4]. The performance improvement over MPI/IB is due to the very small latency of TCA/PEACH2. We will continue research on the TCA with the view that reducing the latency between accelerators by direct communication is important for strong-scaling computing.

Acknowledgements. The present study was supported by the Japan Science and Technology Agency's CREST program entitled "Research and Development of Unified Environment on Accelerated Computing and Interconnection for Post-Petascale Era". The authors would like to thank the Center for Computational Sciences, University of Tsukuba, for allowing us to use the HA-PACS/TCA system as part of the Interdisciplinary Collaborative Research Program.

References

1. Hanawa, T., Fujii, H., Fujita, N., Odajima, T., Matsumoto, K., Boku, T.: Evaluation of FFT for GPU Cluster Using Tightly Coupled Accelerators Architecture. In: Proc. Cluster. IEEE (2015)
2. Hanawa, T., Kodama, Y., Boku, T., Sato, M.: Tightly Coupled Accelerators Architecture for Minimizing Communication Latency among Accelerators. In: Proc. IPDPSW. IEEE (2013)
3. Lee, S., Vetter, J.S.: Early evaluation of directive-based GPU programming models for productive exascale computing. In: Proc. SC '12 (2012)
4. Matsumoto, K., Hanawa, T., Kodama, Y., Fujii, H., Boku, T.: Implementation of CG Method on GPU Cluster with Proprietary Interconnect TCA for GPU Direct Communication. In: Proc. IPDPSW. IEEE (2015)
5. NVIDIA: NVIDIA GPUDirect (Accessed April 25, 2016)
6. Panda, D.K.: MVAPICH2-GDR (MVAPICH2 with GPUDirect RDMA) (Accessed April 25, 2016)
7. Pennycook, S.J., Hammond, S.D., Jarvis, S.A., Mudalige, G.R.: Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark. SIGMETRICS Perform. Eval. Rev. 38(4) (2011)
8. Xu, R., Tian, X., Chandrasekaran, S., Yan, Y., Chapman, B.: NAS Parallel Benchmarks for GPGPUs Using a Directive-Based Programming Model. In: Languages and Compilers for Parallel Computing. LNCS, vol. 8967. Springer (2015)


Interconnection Network for Tightly Coupled Accelerators Architecture Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato Center for Computational Sciences University of Tsukuba, Japan 1 What