Software Distributed Shared Memory with High Bandwidth Network: Production and Evaluation

Akira Nishida
Department of Computer Science, the University of Tokyo / CREST, JST

The recent development of commodity hardware technologies makes building a software distributed shared memory computing environment a more practical approach for scientific computing that requires the repetitive solution of large linear systems. In this study, we build a tightly connected PC cluster with PCI Express and InfiniBand technology for the scalable implementation of parallel iterative linear solvers, and evaluate the performance and bottlenecks of these commodity hardware technologies. The scalability of the implementation of parallel sparse matrix computations is evaluated, considering both the local memory bandwidth of the compute nodes and their internode communication bandwidth.

1. Introduction

Commodity-based PC clusters 2),6) have become a practical platform for large-scale scientific computing, in particular for applications that repeatedly solve large sparse linear systems. For such workloads the achievable performance is limited both by the memory bandwidth within each node and by the communication bandwidth between nodes. In this work we build a tightly coupled PC cluster using the recently available InfiniBand and PCI Express technologies, and evaluate it as a platform for software distributed shared memory and scalable parallel iterative solvers.

2. PCI Express and InfiniBand

PCI Express is a serial I/O standard promoted by Intel; servers equipped with PCI Express slots, such as those from NEC, became available in 2004. Where conventional PCI buses deliver on the order of 1GB/s, PCI Express provides 2.5Gb/s per lane and up to 32 lanes, for an aggregate bidirectional bandwidth of up to 16GB/s. PCI Express host adapters are also available for InfiniBand: the InfiniHost host channel adapter (HCA) from Mellanox Technologies drives a 4x DDR InfiniBand link at 5Gb/s per lane (Fig. 1) through an 8-lane PCI Express interface.
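The per-lane rates quoted above can be turned into usable data rates with a short calculation of our own (not taken from the original text), assuming the 8b/10b encoding used by both PCI Express 1.x and InfiniBand:

\[ \text{PCI Express x8: } 8 \times 2.5\,\mathrm{Gb/s} \times 8/10 = 16\,\mathrm{Gb/s} = 2\,\mathrm{GB/s\ per\ direction} \]
\[ \text{4x DDR InfiniBand: } 4 \times 5\,\mathrm{Gb/s} \times 8/10 = 16\,\mathrm{Gb/s} = 2\,\mathrm{GB/s\ per\ direction} \]

so an 8-lane PCI Express interface can in principle feed a 4x DDR InfiniBand port at its full data rate in each direction.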
Each compute node of our cluster is a 4-way server built around the 2.2GHz Opteron 848 processor with 1MB of L2 cache (four processors and eight 512MB PC3200 DDR SDRAM (ECC Registered) modules per node), and provides PCI Express x16 slots (Fig. 2). The nodes are based on the NVIDIA nForce Professional chipset; we used Uniwide (Appro) and Iwill platforms, including the UniServer 3346.

Fig. 1 InfiniHost HCA MHGS18-XT DDR for PCI Express bus from Mellanox Technologies, Inc.

Fig. 2 System Diagram of dual Opteron Cluster Node (Opteron processors, DDR400 SDRAM, NVIDIA nForce Professional 2200, AMD-8132 PCI-X tunnel, and an InfiniHost HCA in a PCI Express x16 slot for internode communication).

The nodes run the 64-bit AMD64 environment with Linux kernel 2.6.11 and SuSE Linux 9.3; the 2.6 series of the Linux kernel provides mature support for the Opteron architecture. We constructed a cluster of four of these 4-way Opteron nodes, 16 processors in total, which we call IBQ. Each node is equipped with a Mellanox Technologies 8x PCI Express single-port DDR InfiniHost HCA (MHGS18-XT DDR), and the nodes are connected with a 24-port DDR InfiniBand switch, the MTS2400 DDR (Fig. 3). We have previously reported a 2-way, 8-node cluster, IBD 9), in which each CPU has about 1.5GB/s of memory bandwidth. The DDR InfiniBand HCAs are PCI Express x8 devices and are installed in the PCI Express slots of the 4-way nodes. The IBQ cluster was completed in January 2005, and we evaluate it with MPI, OpenMP, and SCASH-MPI.

Fig. 3 Opteron Cluster IBQ, connected with InfiniBand switch MTS2400 DDR.

3. Basic Performance Evaluation

Since SCASH-MPI is implemented on top of MPI, its performance depends directly on the point-to-point performance of the underlying MPI library. For MPI over InfiniBand we use MVAPICH 4), which also supports multirail configurations 5); the version used here is MVAPICH 0.9.6. The measurements were taken with the MHGS-18XT HCAs on the UniServer 3346 nodes, using one DDR InfiniBand HCA per Opteron node. Figures 4 and 5 show the MPI latency and bidirectional bandwidth of MVAPICH 0.9.6 on the Opteron/MHGS-18XT combination, for both intranode and internode communication. For the intranode path, kernel-level support such as LiMIC 3) has been proposed to accelerate MPI communication within a node.

Fig. 4 MPI Latency of MVAPICH 0.9.6 (nodes connected with DDR InfiniHost HCA (Mellanox MHGS18-XT)); time in microseconds versus message size in bytes, intranode and internode.

Fig. 5 MPI Bidirectional Bandwidth of MVAPICH 0.9.6 (nodes connected with DDR InfiniHost HCA (Mellanox MHGS18-XT)); bandwidth in MB/s versus message size in bytes, intranode and internode.
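The measurement code behind Figs. 4 and 5 is not included in the paper. The following is a minimal ping-pong sketch of the kind commonly used to produce such latency and bandwidth curves; it uses only standard MPI calls, and the repetition count NREPS and the 4MB maximum message size are our own choices.

/* Minimal MPI ping-pong sketch (illustrative; not the code used for
 * Figs. 4 and 5).  Ranks 0 and 1 bounce a message of increasing size;
 * half of the averaged round-trip time is reported as one-way latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NREPS   1000          /* repetitions per message size (assumed) */
#define MAXSIZE (1 << 22)     /* largest message: 4MB (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }

    char *buf = malloc(MAXSIZE);
    memset(buf, 0, MAXSIZE);

    for (int bytes = 1; bytes <= MAXSIZE; bytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%8d bytes  %10.2f us one-way\n",
                   bytes, t / NREPS / 2.0 * 1.0e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Placing the two ranks on the same node gives the intranode curve, and placing them on different nodes gives the internode curve; bandwidth follows as message size divided by one-way time.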
3.1 STREAM Benchmark

We measured the sustainable memory bandwidth of the Uniwide nodes with the STREAM benchmark 1), which is distributed in MPI form (for LAM and MPICH) as well as in an OpenMP form. Table 1 lists the four STREAM kernels together with the number of bytes transferred per loop iteration, and Fig. 6 shows the main measurement loop.

/* --- MAIN LOOP --- repeat NTIMES times --- */
scalar = 3.0;
for (k=0; k<ntimes; k++) {
    times[0][k] = second();
    for (j=0; j<n; j++)
        c[j] = a[j];
    times[0][k] = second() - times[0][k];

    times[1][k] = second();
    for (j=0; j<n; j++)
        b[j] = scalar*c[j];
    times[1][k] = second() - times[1][k];

    times[2][k] = second();
    for (j=0; j<n; j++)
        c[j] = a[j]+b[j];
    times[2][k] = second() - times[2][k];

    times[3][k] = second();
    for (j=0; j<n; j++)
        a[j] = b[j]+scalar*c[j];
    times[3][k] = second() - times[3][k];
}

Fig. 6 Main loops of STREAM benchmark.

For measurements within a node we also used the OpenMP version, stream_d_omp.c, in which each of the four loops of Fig. 6 is parallelized.
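stream_d_omp.c itself is not reproduced here. As an illustration of how each loop of Fig. 6 is parallelized in the OpenMP version, the following sketch shows only the Triad kernel; the array length N, the use of omp_get_wtime(), and the single timed repetition are our own simplifications (N is chosen so that the three arrays roughly match the 91.6MB working set quoted below).

/* Illustrative OpenMP Triad sketch; the measured runs used the
 * stream_d_omp.c distributed with STREAM, which times all four kernels
 * NTIMES times and reports the best run. */
#include <stdio.h>
#include <omp.h>

#define N 4000000   /* assumed length: 3 arrays x 8 bytes ~= 91.6MB */

static double a[N], b[N], c[N];

int main(void)
{
    double scalar = 3.0;

    /* First-touch initialization: each thread touches its own slice,
     * so the pages are allocated on that thread's local memory. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++) {
        a[j] = 1.0; b[j] = 2.0; c[j] = 0.0;
    }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];   /* Triad: 24 bytes per iteration */
    t = omp_get_wtime() - t;

    printf("Triad: %.1f MB/s\n", 24.0 * N / t / 1.0e6);
    return 0;
}

The first-touch loop matters on Opteron nodes: it spreads the arrays across the memory controllers of the processors that will later access them.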
Table 1 STREAM benchmark types.

Benchmark   Operation                 Bytes per iteration
Copy        a[i] = b[i]               16
Scale       a[i] = q * b[i]           16
Add         a[i] = b[i] + c[i]        24
Triad       a[i] = b[i] + q * c[i]    24

In each run the three working arrays of the STREAM benchmark total 91.6MB. Figure 7 shows the speedup of the STREAM results with up to four MPI processes per node, i.e., up to 16 processes on the four nodes. With SCASH-MPI 8), one MPI process runs on each node, so the bandwidth obtained with one or two processes per node is of particular interest; Figs. 8 and 9 show the speedup with up to two processes per node over InfiniBand and over Gigabit Ethernet (GbE), respectively. Over GbE the MPI latency is 24.27us and the bandwidth 226MB/s, far below the InfiniBand figures. Figure 10 shows the results for the for loops of Fig. 6 when the mapping of processes to CPUs is specified explicitly and the barrier synchronization between the kernels is removed.

Fig. 7 Speedup of STREAM benchmark results with up to four processes per node.

Fig. 8 Speedup of STREAM benchmark results with up to two processes per node.

Fig. 9 Speedup of STREAM benchmark results with up to two processes per node on GbE.

Fig. 10 Speedup of STREAM benchmark results with up to four processes per node (mapping specified and barrier synchronization removed).
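Figure 10 refers to runs in which the mapping of processes to CPUs was specified explicitly; the paper does not say how this was done. One common mechanism on a 2.6 Linux kernel is sched_setaffinity(), sketched below; the LOCAL_RANK environment variable and the rank-to-CPU assignment are purely illustrative assumptions.

/* Illustrative only: pin the calling process to one CPU of a 4-way node,
 * e.g. according to its local rank.  The paper does not state the
 * mechanism actually used for the runs of Fig. 10. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *env = getenv("LOCAL_RANK");   /* assumed environment variable */
    int cpu = (env ? atoi(env) : 0) % 4;      /* 4 CPUs per node */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  /* 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("bound to CPU %d\n", cpu);
    return 0;
}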
4. NAS Parallel Benchmarks

We next evaluate the cluster with the NAS Parallel Benchmarks (NPB) 7), version 3.2, concentrating on the CG kernel, run for Classes S through C (Classes S through B for the OpenMP version). The kernel estimates an eigenvalue of a large sparse symmetric positive definite matrix by the inverse power method, with a conjugate gradient solve in each iteration; the algorithm of the main loop is shown in Fig. 11.

vector r,p,q,z; /* working vectors */
/* solve A*z = x */
real CGSOLVE(matrix A, vector x, real zeta) {
  int it;
  real alpha, beta, rho, rho0;
  r = x; p = x; z = 0.0;
  rho = x * x;
  for (it=0; it < NITCG; it++) {
    q = A*p;
    alpha = rho/(p * q);
    z += alpha * p;
    rho0 = rho;
    r += -alpha * q;
    rho = r*r;
    beta = rho / rho0;
    p = r + beta*p;
  }
  r = A*z - x;
  zeta = 1.0/(x*z);
  x = (1.0/sqrt(z*z))*z;
  return sqrt(r*r);
}

Fig. 11 Algorithm of NPB CG main loop.

The dimension n of the matrix is 1,400 for Class S, 7,000 for W, 14,000 for A, 75,000 for B, and 150,000 for C, with 7, 8, 11, 13, and 15 nonzeros per row, respectively.
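Fig. 11 is written in a compact vector notation. The sketch below is our own plain-C rendering of the same iteration, assuming a CSR (compressed sparse row) matrix format and sequential loops; the matrix partitioning and interprocess communication performed by the actual NPB code are omitted.

/* Plain-C sketch of the CG iteration of Fig. 11, assuming the sparse
 * matrix is stored in CSR form (row_ptr/col_idx/val).  Illustrative only;
 * the NPB reference code partitions the matrix and exchanges boundary
 * data between processes, which is omitted here. */
#include <math.h>

typedef struct { int n; int *row_ptr; int *col_idx; double *val; } csr_matrix;

static double dot(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

static void spmv(const csr_matrix *A, const double *p, double *q)
{
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i+1]; k++)
            s += A->val[k] * p[A->col_idx[k]];
        q[i] = s;
    }
}

/* Performs nitcg CG iterations on A*z = x and returns ||A*z - x||. */
double cg_solve(const csr_matrix *A, double *x, double *z,
                double *r, double *p, double *q, int nitcg, double *zeta)
{
    int n = A->n;
    for (int i = 0; i < n; i++) { r[i] = x[i]; p[i] = x[i]; z[i] = 0.0; }
    double rho = dot(n, r, r);

    for (int it = 0; it < nitcg; it++) {
        spmv(A, p, q);
        double alpha = rho / dot(n, p, q);
        for (int i = 0; i < n; i++) { z[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        double rho0 = rho;
        rho = dot(n, r, r);
        double beta = rho / rho0;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
    }

    spmv(A, z, r);                                    /* r = A*z            */
    double rnorm = 0.0;
    for (int i = 0; i < n; i++) {                     /* ||A*z - x||        */
        double d = r[i] - x[i];
        rnorm += d * d;
    }
    *zeta = 1.0 / dot(n, x, z);                       /* as in Fig. 11      */
    double znorm = sqrt(dot(n, z, z));
    for (int i = 0; i < n; i++) x[i] = z[i] / znorm;  /* x = z / ||z||      */
    return sqrt(rnorm);
}

Each iteration costs one sparse matrix-vector product, two dot products, and three vector updates, which is why the kernel stresses both the memory bandwidth of the nodes and, in the parallel case, the reduction and boundary-exchange performance of the interconnect.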
We ran the MPI version of the NPB CG kernel on the nodes equipped with the DDR InfiniHost HCAs; results for the CG kernel on the IBD cluster were reported in 9). Figure 12 shows the performance of the MPI version on IBD with two ports of SDR InfiniHost HCA per node, and Fig. 13 shows the performance of the MPI version on the 4-way nodes with the DDR InfiniHost HCA, which for Class C can be compared directly with the two-rail SDR InfiniBand results. Figure 14 shows the performance of the OpenMP version of NPB CG with up to two MPI processes per node. Results for NPB BT and SP are reported in 8); here we concentrate on CG.

Fig. 12 Speedup of NPB CG with two ports of SDR InfiniHost HCA (performance in MFLOPS versus number of processes, Classes S to C).

Fig. 13 Speedup of NPB CG (MPI version) with DDR InfiniHost HCA (performance in MFLOPS versus number of processes, Classes S to C).

Fig. 14 Speedup of NPB CG (OpenMP version) with DDR InfiniHost HCA, up to two processes per node (performance in MFLOPS versus number of processes, Classes S to B).

5. Conclusion

We have built a PC cluster of 4-way Opteron nodes tightly connected with PCI Express and DDR InfiniBand, and evaluated its local memory bandwidth and internode communication bandwidth with the STREAM benchmark and the NPB CG kernel. The measured scalability of the parallel sparse matrix computations reflects both the memory bandwidth available within each node and the bandwidth of the commodity interconnect between nodes.

Acknowledgments  We thank Mellanox Technologies, Inc. for their cooperation regarding the InfiniBand hardware used in this study. This work was supported in part by Grant No. 1616225.

References

1) STREAM: Sustainable Memory Bandwidth in High Performance Computers, http://www.cs.virginia.edu/stream/.
2) Hennessy, J. L. and Patterson, D. A.: Computer Architecture: A Quantitative Approach, Third Edition, Morgan Kaufmann (2003).
3) Jin, H. W., Sur, S., Chai, L. and Panda, D. K.: LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster, Proceedings of the International Conference on Parallel Processing (2005).
4) Liu, J., Chandrasekaran, B., Wu, J., Jiang, W., Kini, S. P., Yu, W., Buntinas, D., Wyckoff, P. and Panda, D. K.: Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics, Proceedings of the International Conference on Supercomputing '03 (2003).
5) Liu, J., Vishnu, A. and Panda, D. K.: Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation, Proceedings of the International Conference on Supercomputing '04 (2004).
6) Pfister, G. F.: In Search of Clusters, Prentice-Hall, second edition (1998).
7) van der Wijngaart, R.: The NAS Parallel Benchmarks 2.4, Technical Report NAS-02-007, NASA (2002).
8) IPSJ Transactions, Vol. 46, No. SIG 7, pp. 63-73 (2005) (in Japanese).
9) IPSJ SIG Technical Report, Vol. 2005, No. 81, pp. 97-102 (2005) (in Japanese).