Benchmark runs of pcmalib on Nehalem and Shanghai nodes
MOSAIC Group, Institute of Theoretical Computer Science, Department of Computer Science

Benchmark runs of pcmalib on Nehalem and Shanghai nodes

Christian Lorenz Müller, April 9

Address: Institute for Theoretical Computer Science and Swiss Institute of Bioinformatics, ETH Zurich, Zurich, Switzerland. christian.mueller@inf.ethz.ch

1 Introduction

This document summarizes parallel benchmark results of pcmalib on different processors. pcmalib implements the Evolution Strategy with Covariance Matrix Adaptation (CMA-ES) and its particle swarm extension (PS-CMA-ES). The next section describes the computer-experimental set-up and its technical realization on the different platforms. Section 3 describes in detail the results obtained on the MOSAIC cluster, together with the best hardware/software combination of the newly conducted tests. Section 4 presents the wall-clock times for all conducted tests, and the last section summarizes and discusses the results.

2 Experimental Set-up

We assess the parallel efficiency of our implementations of CMA-ES and PS-CMA-ES on a distributed-memory computer cluster on the constrained random fitness landscape f_rand(x) = Y, where x is defined in a bounded subset [,]^n of R^n. For any x, Y is drawn from the uniform distribution U(,). Each algorithm evaluates the fitness function 5 times (corresponding to drawing 5 uniformly distributed random numbers) on p processor cores (for the newly tested processors we only go up to the cores of a single node). The number of CMA-ES instances, i.e. the swarm size in PS-CMA-ES, is always chosen equal to p in order to avoid cache and memory congestion effects. Distributing a problem of fixed size onto an increasing number of processors measures the strong scaling of the algorithms: the workload per processor decreases while the communication overhead increases. The random landscape f_rand ensures several properties that are indispensable for an unbiased assessment of the parallel scaling.
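For illustration, the random landscape described above can be sketched in a few lines (a hedged sketch, not pcmalib's Fortran implementation; the interval U(0, 1) and the name f_rand are assumptions made for the example):

```python
import random

def f_rand(x):
    """Random fitness landscape: for any query point x in the bounded
    domain, return a freshly drawn uniform random value. The cost of one
    evaluation is independent of the dimension of x and of the path the
    optimizer has taken."""
    del x                              # the point itself never enters the value
    return random.uniform(0.0, 1.0)    # U(0, 1) assumed for illustration

# Every evaluation reduces to drawing one pseudo-random number:
y = f_rand([0.2, 0.7, 0.5])            # a 3-dimensional query point
assert 0.0 <= y <= 1.0
```

Because the returned value ignores x, every CMA-ES instance sees statistically the same landscape, and the per-evaluation cost is a fixed constant, which is exactly what an unbiased scaling benchmark needs.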
First, the computational cost of evaluating the objective function is independent of the search dimension and of the specific optimization path. Second, the random landscape guarantees that all CMA-ES instances experience the same search space. We perform three benchmarks with varying values of the communication interval I_c in order to disentangle the influence of the covariance matrix eigendecomposition and of the MPI communication in PS-CMA-ES. The first set-up considers the standard parallel CMA-ES without any swarm communication. The second benchmark evaluates the performance of the standard PS-CMA-ES with a fixed I_c. Since I_c is given in units of generations, and increasing the swarm size S also increases the number of function evaluations per generation, the total number of MPI communications performed during the fixed 5 function evaluations decreases. Therefore, the third set-up considers PS-CMA-ES with a constant number of MPI communication steps, independent of the swarm size S. This is achieved by setting I_c inversely proportional to S. All three benchmarks are conducted in n = ,3,5, dimensions. Each benchmark is repeated R times. For each repetition r = 1,...,R, we measure the elapsed wall-clock time
t_{i,r} on each processor core i = 1,...,p. The overall run time t(p) of the algorithm on p processors is given by the maximum time over all processes, averaged over the R independent runs:

t(p) = mean_{r=1,...,R} max_{i=1,...,p} t_{i,r}.   (1)

From this, the parallel speedup s and the parallel efficiency e are defined as:

s(p) = t(1) / t(p),   e(p) = s(p) / p.   (2)

2.1 Reference set-up of pcmalib on the MOSAIC cluster

The Fortran library is compiled with the Intel Fortran compiler at optimization level O3 and linked against OpenMPI. The tests are performed on a Gentoo Linux cluster. Each node contains Intel Xeon quad-core processors, with GB of RAM per core. The nodes are connected by a dedicated Gigabit Ethernet network that is entirely reserved for MPI communication (there is a second, identical network for system communication). TORQUE and Maui are used as resource manager and queuing system, respectively. In order to assess the influence of intra- vs. inter-node MPI communication, the scheduler is instructed to assign a fixed number of MPI processes per node.

2.2 Set-up on the Nehalem node

Enda O'Brien from DALCO used a Nehalem node (Xeon 557, .93 GHz) running SUSE Linux. Three different software combinations have been tested in order to disentangle the influence of the processor and of the numerical libraries on the speed-up:

1. The first and fastest software combination consists of the Intel compilers, the Intel MPI, and the LAPACK that is included with the MKL libraries.

2. The second software combination consists of the Intel compilers, the Intel MPI, and the LAPACK that is included with the AMD Core Math Library (ACML).

3. The third combination uses the Pathscale compilers, ACML, and OpenMPI (also built with Pathscale).

2.3 Set-up on the Shanghai node

Enda O'Brien also tested an AMD 37 ("Shanghai") node with quad-core processors.
Again, the same three software combinations have been tested on this platform:

1. The first and again fastest software combination consists of the Intel compilers, the Intel MPI, and the LAPACK that is included with the MKL libraries.

2. The second software combination consists of the Intel compilers, the Intel MPI, and the LAPACK that is included with the AMD Core Math Library (ACML).

3. The third (and, for AMD, natural) combination uses the Pathscale compilers, ACML, and OpenMPI (also built with Pathscale).

3 Results on the MOSAIC cluster and the Nehalem node with Intel software

We first present and analyze the results obtained on the MOSAIC cluster (identical to the ones in Mueller9b.pdf). The second subsection contains the detailed results of the Intel compiler/MPI/MKL test on the Nehalem node.
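Before turning to the results, the run-time, speedup, and efficiency metrics from the experimental set-up can be sketched as follows (an illustrative Python sketch with made-up timings; the helper names are ours and are not part of pcmalib):

```python
# Illustrative computation of t(p), s(p) = t(1)/t(p), e(p) = s(p)/p
# from per-core wall-clock times. All timings below are invented.

def run_time(times_per_run):
    """t(p): mean over the R repetitions of the maximum time over all
    p processes -- the slowest process determines each run's time."""
    maxima = [max(run) for run in times_per_run]   # slowest process per run
    return sum(maxima) / len(maxima)               # average over R runs

def speedup(t1, tp):
    """s(p) = t(1) / t(p)."""
    return t1 / tp

def efficiency(s, p):
    """e(p) = s(p) / p."""
    return s / p

# R = 2 repetitions on p = 1 core, then on p = 4 cores.
t_serial = run_time([[100.0], [104.0]])            # -> 102.0 s
t_par = run_time([[26.0, 27.0, 25.5, 25.0],
                  [25.0, 24.0, 26.5, 25.5]])       # -> 26.75 s
s = speedup(t_serial, t_par)                       # ~3.81
e = efficiency(s, 4)                               # ~0.95: near-linear scaling
print(round(s, 2), round(e, 2))
```

Taking the maximum over processes before averaging is the conservative choice: it charges each run with its slowest process, which is what the user actually waits for.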
3.1 Results on the MOSAIC cluster

The measured maximum wall-clock times for all 3 benchmarks on the MOSAIC cluster are reported in Fig. 1, the speedups in Fig. 2, and the parallel efficiencies in Fig. 3.

Figure 1: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged from R = 5 runs. The standard deviations are close to zero (data not shown).
Figure 2: Parallel speedup s(p) of the parallel CMA-ES, and of PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged from R = 5 runs. The standard deviations are close to zero (data not shown).
Figure 3: Parallel efficiency e(p) of the parallel CMA-ES, and of PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged from R = 5 runs. The standard deviations are close to zero (data not shown).

In n = dimensions, there are no noticeable differences between the three test set-ups. Up to a full single node, the wall-clock time decreases from .5 s to below s. The speedup increases throughout, while the efficiency decreases to .7. When using two nodes, the wall-clock time increases again, and speedup and efficiency drop considerably. This is expected, as the network latency becomes the limiting factor for such a small test problem. The situation changes in higher dimensions. For n = 3, the wall-clock time of parallel CMA-ES decreases from s on a single core to below s on the maximum number of cores. The two PS-CMA-ES tests need around 7 s on a single core due to the additional construction of the rotation matrix. The PS-CMA-ES with a constant number of MPI communications shows a similar scaling as the parallel CMA-ES, with an offset of about 5 seconds, corresponding to the constant communication overhead. The PS-CMA-ES with a decreasing number of communications approaches the behavior of the standard parallel CMA-ES since, with increasing p, the MPI communication overhead and the rotations become negligible compared to the computational cost of CMA-ES. This is also reflected in the parallel speedup and efficiency. The standard PS-CMA-ES achieves the best efficiency (due to its higher computational cost on a single core), closely followed by the parallel CMA-ES. The same qualitative behaviour is observed in n = 5
(figures not shown), but, due to the higher computational cost, the parallel efficiency increases further. The computational costs of the basic CMA-ES operations and of the matrix rotations now dominate, and the communication overhead becomes less apparent. On a single core, parallel CMA-ES needs s and the two PS-CMA-ES variants around s. While the wall-clock time of the standard PS-CMA-ES rapidly approaches that of CMA-ES with increasing p, the PS-CMA-ES with a constant number of MPI communications shows an offset of around 5 s due to the communication overhead and the matrix rotation. The speedups of the parallel CMA-ES and of the standard PS-CMA-ES at the largest core count correspond to parallel efficiencies of 5 and .75, respectively. For n = , the parallel scaling further improves. The efficiency of the standard CMA-ES is .7 on the maximum number of cores, while the standard PS-CMA-ES achieves a super-linear efficiency (due to the decreasing number of MPI communications).

3.2 Results of the Intel compiler/Intel MKL/Intel MPI test on the Nehalem node

The measured maximum wall-clock times for all 3 benchmarks on the Nehalem node are reported in Fig. 4, the speedups in Fig. 5, and the parallel efficiencies in Fig. 6.
Figure 4: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown).
Figure 5: Parallel speedup s(p) of the parallel CMA-ES, and of PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown).
Figure 6: Parallel efficiency e(p) of the parallel CMA-ES, and of PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3, dimensions. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown).

In n = dimensions, there are again no significant differences between the three test set-ups. Up to a full single node, the wall-clock time decreases from . s to below . s. The speedup increases throughout, while the efficiency decreases to .7. For n = 3, the wall-clock time of parallel CMA-ES decreases from 7. s on a single core to below . s on the maximum number of cores. The two PS-CMA-ES tests need around 9.5 s on a single core due to the additional construction of the rotation matrix. The PS-CMA-ES with a constant number of MPI communications shows a similar scaling as the parallel CMA-ES, with a constant offset corresponding to the communication overhead. The PS-CMA-ES with a decreasing number of communications approaches the behavior of the standard parallel CMA-ES since, with increasing p, the MPI communication overhead and the rotations become negligible compared to the computational cost of CMA-ES. This is also reflected in the parallel speedup and efficiency. The standard PS-CMA-ES achieves the best efficiency (due to its higher computational cost on a single core), closely followed by the parallel CMA-ES. The same qualitative behaviour is observed in n = 5 (figures not shown), but, due to the higher computational cost, the parallel efficiency increases further. The computational costs of the basic CMA-ES operations and of the matrix rotations now dominate, and the communication overhead becomes less apparent. On a single core, parallel CMA-ES needs s and the two PS-CMA-ES variants around 35 s. While the wall-clock time of the standard PS-CMA-ES rapidly approaches that of CMA-ES with increasing p, the PS-CMA-ES with a constant number of MPI communications shows an offset of around s due to the communication overhead and the matrix rotation. The speedups of the parallel CMA-ES and of the standard PS-CMA-ES at the largest core count are around 7.5, corresponding to parallel efficiencies of about .9. For n = , the parallel scaling further improves. The efficiency of both the standard parallel CMA-ES and PS-CMA-ES is around .97.

4 Comparison of the performance on the Nehalem and the Shanghai node for different software libraries

Due to limited time on my side, I decided to only show the wall-clock times of the conducted test runs. Figure 7 summarizes the three different software combinations on the Nehalem node, Figure 8 the same combinations on the Shanghai node. Due to the automatic generation of the plots, the y-axes representing the wall-clock times are scaled according to the maximum wall-clock time occurring in each test run. This makes a visual comparison between test runs a bit tricky. I am sorry for the inconvenience, but this is all I have for the moment.

4.1 The Nehalem node

The main observation for the Nehalem node is that the combination of Intel compiler/MKL/Intel MPI indeed gives the overall best performance in all dimensions and for all settings of parallel (PS-)CMA-ES on a single core, compared to the other software combinations. The speed-up over them is between % and 3%, depending on the number of processors used, the algorithmic set-up, and the dimension. For n = , for example, the wall-clock time is below .5 s, while the ACML swap and the Pathscale/ACML/Open MPI combinations are around 3.5 s and 3 s, respectively. For higher dimensions, the Intel compiler/ACML/Intel MPI combination performs slightly better than Pathscale/ACML/Open MPI, indicating that the compiler/MPI combinations are less influential than the use of the optimized LAPACK routines in ACML/MKL.
4.2 The Shanghai node

The picture changes when evaluating the software combinations on the Shanghai node. Again, the Intel compiler/MKL/Intel MPI combination outperforms the other two, especially for n = , 3. For n = 5, , however, the Intel compiler/MKL/Intel MPI and Intel compiler/ACML/Intel MPI combinations give almost identical results. This indicates that the superiority of the MKL LAPACK routines vanishes on the AMD cores. The Pathscale/ACML/Open MPI combination is inferior to the first two combinations in all dimensions.

4.3 Nehalem and Shanghai in comparison

The Intel libraries in combination with the Intel cores give the overall best performance. However, the AMD cores give a better average performance independent of the libraries used. All three AMD set-ups are, for example, faster than software combinations 2 and 3 on the Nehalem node for n = 5, . However, as Enda O'Brien already pointed out in our correspondence, the LAPACK libraries should be considered a real part of the system, and as such, the Intel combinations are undoubtedly the best ones for pcmalib.
Figure 7: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3,5, dimensions on the Nehalem node. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown). The first column depicts the results for the combination Intel compiler/MKL/Intel MPI, the second for Intel compiler/ACML/Intel MPI, and the third for Pathscale/ACML/Open MPI.
Figure 8: Overall run time t(p) in seconds for the parallel CMA-ES, and for PS-CMA-ES with constant and with decreasing number of MPI communications, on the random landscape test problem in n = ,3,5, dimensions on the Shanghai node. The number of processor cores p is varied. Each point is averaged over R runs. The standard deviations are close to zero (data not shown). The first column depicts the results for the combination Intel compiler/MKL/Intel MPI, the second for Intel compiler/ACML/Intel MPI, and the third for Pathscale/ACML/Open MPI.
5 Concluding remarks

In this document I have tried to summarize the results that Enda O'Brien kindly generated with our pcmalib test bed. My general conclusion is that, indeed, the Intel cores together with the Intel software libraries achieve the best single-core performance. As pcmalib heavily relies on LAPACK routines for eigendecomposition, the results demonstrate the excellent performance of the Intel MKL in combination with Intel cores. Compared to our previous results on the MOSAIC cluster, the Nehalem node almost halves the wall-clock time for the n = set-up. Compared to our published results, a significant speed-up is also gained with the Shanghai node. When running parallel (PS-)CMA-ES on several cores, the differences between the hardware/software combinations become less apparent, except in the case n = . Although we believe that the given test bed provides a very good set-up to evaluate the speed-up and scalability of pcmalib on different processor cores, it is not clear how well the results generalize to real-world optimization tasks, e.g., when the fitness/objective function involves more complex numerical operations than generating pseudo-random numbers.
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationSparse Training Data Tutorial of Parameter Server
Carnegie Mellon University Sparse Training Data Tutorial of Parameter Server Mu Li! CSD@CMU & IDL@Baidu! muli@cs.cmu.edu High-dimensional data are sparse Why high dimension?! make the classifier s job
More informationSolving Large Complex Problems. Efficient and Smart Solutions for Large Models
Solving Large Complex Problems Efficient and Smart Solutions for Large Models 1 ANSYS Structural Mechanics Solutions offers several techniques 2 Current trends in simulation show an increased need for
More informationAerodynamics of a hi-performance vehicle: a parallel computing application inside the Hi-ZEV project
Workshop HPC enabling of OpenFOAM for CFD applications Aerodynamics of a hi-performance vehicle: a parallel computing application inside the Hi-ZEV project A. De Maio (1), V. Krastev (2), P. Lanucara (3),
More informationYasuo Okabe. Hitoshi Murai. 1. Introduction. 2. Evaluation. Elapsed Time (sec) Number of Processors
Performance Evaluation of Large-scale Parallel Simulation Codes and Designing New Language Features on the (High Performance Fortran) Data-Parallel Programming Environment Project Representative Yasuo
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationUsing Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System
Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System Overview This guide is intended to help developers use the latest version of Intel Math Kernel Library (Intel
More informationNUMA-aware Multicore Matrix Multiplication
Parallel Processing Letters c World Scientific Publishing Company NUMA-aware Multicore Matrix Multiplication WAIL Y. ALKOWAILEET Department of Computer Science (Systems), University of California, Irvine,
More informationCray events. ! Cray User Group (CUG): ! Cray Technical Workshop Europe:
Cray events! Cray User Group (CUG):! When: May 16-19, 2005! Where: Albuquerque, New Mexico - USA! Registration: reserved to CUG members! Web site: http://www.cug.org! Cray Technical Workshop Europe:! When:
More informationParallel calculation of LS factor for regional scale soil erosion assessment
Parallel calculation of LS factor for regional scale soil erosion assessment Kai Liu 1, Guoan Tang 2 1 Key Laboratory of Virtual Geographic Environment (Nanjing Normal University), Ministry of Education,
More informationDeep learning prevalence. first neuroscience department. Spiking Neuron Operant conditioning First 1 Billion transistor processor
WELCOME TO Operant conditioning 1938 Spiking Neuron 1952 first neuroscience department 1964 Deep learning prevalence mid 2000s The Turing Machine 1936 Transistor 1947 First computer science department
More informationSlurm Configuration Impact on Benchmarking
Slurm Configuration Impact on Benchmarking José A. Moríñigo, Manuel Rodríguez-Pascual, Rafael Mayo-García CIEMAT - Dept. Technology Avda. Complutense 40, Madrid 28040, SPAIN Slurm User Group Meeting 16
More informationPerformance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture. Alexander Berreth. Markus Bühler, Benedikt Anlauf
PADC Anual Workshop 20 Performance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture Alexander Berreth RECOM Services GmbH, Stuttgart Markus Bühler, Benedikt Anlauf IBM Deutschland
More informationDetection and Analysis of Iterative Behavior in Parallel Applications
Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University
More informationInfoBrief. Dell 2-Node Cluster Achieves Unprecedented Result with Three-tier SAP SD Parallel Standard Application Benchmark on Linux
InfoBrief Dell 2-Node Cluster Achieves Unprecedented Result with Three-tier SAP SD Parallel Standard Application Benchmark on Linux Leveraging Oracle 9i Real Application Clusters (RAC) Technology and Red
More informationDesigning Shared Address Space MPI libraries in the Many-core Era
Designing Shared Address Space MPI libraries in the Many-core Era Jahanzeb Hashmi hashmi.29@osu.edu (NBCL) The Ohio State University Outline Introduction and Motivation Background Shared-memory Communication
More informationLoad Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs
Load Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs Laércio Lima Pilla llpilla@inf.ufrgs.br LIG Laboratory INRIA Grenoble University Grenoble, France Institute of Informatics
More informationCIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )
Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationSupplementary Material for The Generalized PatchMatch Correspondence Algorithm
Supplementary Material for The Generalized PatchMatch Correspondence Algorithm Connelly Barnes 1, Eli Shechtman 2, Dan B Goldman 2, Adam Finkelstein 1 1 Princeton University, 2 Adobe Systems 1 Overview
More informationLS-DYNA Productivity and Power-aware Simulations in Cluster Environments
LS-DYNA Productivity and Power-aware Simulations in Cluster Environments Gilad Shainer 1, Tong Liu 1, Jacob Liberman 2, Jeff Layton 2 Onur Celebioglu 2, Scot A. Schultz 3, Joshua Mora 3, David Cownie 3,
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationA Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering
A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering HPRCTA 2010 Stefan Craciun Dr. Alan D. George Dr. Herman Lam Dr. Jose C. Principe November 14, 2010 NSF CHREC Center ECE Department,
More informationAdvanced School in High Performance and GRID Computing November Mathematical Libraries. Part I
1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationABySS Performance Benchmark and Profiling. May 2010
ABySS Performance Benchmark and Profiling May 2010 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource - HPC
More informationParallel Poisson Solver in Fortran
Parallel Poisson Solver in Fortran Nilas Mandrup Hansen, Ask Hjorth Larsen January 19, 1 1 Introduction In this assignment the D Poisson problem (Eq.1) is to be solved in either C/C++ or FORTRAN, first
More informationChapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance
Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)
More informationParallel Implementation of the NIST Statistical Test Suite
Parallel Implementation of the NIST Statistical Test Suite Alin Suciu, Iszabela Nagy, Kinga Marton, Ioana Pinca Computer Science Department Technical University of Cluj-Napoca Cluj-Napoca, Romania Alin.Suciu@cs.utcluj.ro,
More informationMATLAB Parallel Computing Toolbox Benchmark for an Embarrassingly Parallel Application
MATLAB Parallel Computing Toolbox Benchmark for an Embarrassingly Parallel Application By Nils Oberg, Benjamin Ruddell, Marcelo H. García, and Praveen Kumar Department of Civil and Environmental Engineering
More informationEfficient Clustering and Scheduling for Task-Graph based Parallelization
Center for Information Services and High Performance Computing TU Dresden Efficient Clustering and Scheduling for Task-Graph based Parallelization Marc Hartung 02. February 2015 E-Mail: marc.hartung@tu-dresden.de
More informationPARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures
PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures Solovev S. A, Pudov S.G sergey.a.solovev@intel.com, sergey.g.pudov@intel.com Intel Xeon, Intel Core 2 Duo are trademarks of
More informationHPCC Random Access Benchmark Excels on Data Vortex
HPCC Random Access Benchmark Excels on Data Vortex Version 1.1 * June 7 2016 Abstract The Random Access 1 benchmark, as defined by the High Performance Computing Challenge (HPCC), tests how frequently
More informationScientific Programming in C XIV. Parallel programming
Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence
More informationImproved Event Generation at NLO and NNLO. or Extending MCFM to include NNLO processes
Improved Event Generation at NLO and NNLO or Extending MCFM to include NNLO processes W. Giele, RadCor 2015 NNLO in MCFM: Jettiness approach: Using already well tested NLO MCFM as the double real and virtual-real
More informationDELIVERABLE D5.5 Report on ICARUS visualization cluster installation. John BIDDISCOMBE (CSCS) Jerome SOUMAGNE (CSCS)
DELIVERABLE D5.5 Report on ICARUS visualization cluster installation John BIDDISCOMBE (CSCS) Jerome SOUMAGNE (CSCS) 02 May 2011 NextMuSE 2 Next generation Multi-mechanics Simulation Environment Cluster
More informationOptimizing the operations with sparse matrices on Intel architecture
Optimizing the operations with sparse matrices on Intel architecture Gladkikh V. S. victor.s.gladkikh@intel.com Intel Xeon, Intel Itanium are trademarks of Intel Corporation in the U.S. and other countries.
More informationInvestigation of Intel MIC for implementation of Fast Fourier Transform
Investigation of Intel MIC for implementation of Fast Fourier Transform Soren Goyal Department of Physics IIT Kanpur e-mail address: soren@iitk.ac.in The objective of the project was to run the code for
More informationIntel Cluster Toolkit Compiler Edition 3.2 for Linux* or Windows HPC Server 2008*
Intel Cluster Toolkit Compiler Edition. for Linux* or Windows HPC Server 8* Product Overview High-performance scaling to thousands of processors. Performance leadership Intel software development products
More informationIBM InfoSphere Streams v4.0 Performance Best Practices
Henry May IBM InfoSphere Streams v4.0 Performance Best Practices Abstract Streams v4.0 introduces powerful high availability features. Leveraging these requires careful consideration of performance related
More informationCSCE 626 Experimental Evaluation.
CSCE 626 Experimental Evaluation http://parasol.tamu.edu Introduction This lecture discusses how to properly design an experimental setup, measure and analyze the performance of parallel algorithms you
More informationCommunication Models for Resource Constrained Hierarchical Ethernet Networks
Communication Models for Resource Constrained Hierarchical Ethernet Networks Speaker: Konstantinos Katrinis # Jun Zhu +, Alexey Lastovetsky *, Shoukat Ali #, Rolf Riesen # + Technical University of Eindhoven,
More information