Improving the Performance and Extending the Scalability in the Cluster of SMP based Petaflops Computing

1 Improving the Performance and Extending the Scalability in the Cluster of SMP based Petaflops Computing
Nagarajan Kathiresan, Ph.D., IBM India, Bangalore.

2 Agenda
o Different types of cluster/SMP architectures
o Trade-off between price and performance
o Memory bound and CPU bound applications
o Programming models
o Hybrid scenario
o Processor binding and memory affinity
o Advantages of hybrid parallelization
o Case study: molecular dynamics application
o Conclusion
o Acknowledgement

3 Different types of Cluster/SMP Architectures
Shared Memory MIMD
o Easy to build; processors communicate through shared memory.
o Limitation: reliability and expandability. A failure of a memory component or of any processor affects the whole system, and adding processors increases memory contention.
Distributed Memory MIMD
o Communication: Inter-Process Communication (IPC).
o The network can be configured as a tree, mesh, cube, etc.
o Unlike shared memory MIMD, it is easy to expand and highly reliable (a CPU failure does not affect the whole system).
[Figure: block diagrams of the two architectures. Labels: Processor A, B, C; Memory Bus; IPC channel; Global Memory System; Memory System A, B, C.]

4 Cluster of SMP Architectures
Multichip nodes
o Each node has one, two, four, eight or sixteen chips.
Multicore chips
o Each chip has 2, 4, 6, 8 or more cores.
Memory is associated with chips
o NUMA architecture: memory is more accessible from cores on the same chip.

5 Trade-off between Price and Performance
o Uniprocessor architecture: performance is limited because the application may not be able to run in parallel (e.g. an architecture limitation).
o SMP architecture: scalability is limited by shared memory contention.
o Cluster of SMP (pure MPI based applications): performance saturates at some point because of imbalanced communication across MPI processes, the inability to increase memory consumption per MPI process, etc.
Source:

6 Memory bound and CPU bound applications
o Year by year, CPU performance (speed) has increased exponentially, but memory speed has not kept pace.
o Most HPC applications are both CPU-intensive and memory-intensive workloads.
o The scalability of these HPC applications is limited because STREAM (memory bandwidth) performance saturates beyond a certain number of cores.
o An application configuration that uses the hybrid mechanism helps to bridge the gap between memory-bound and CPU-bound workloads.
Source:

7 Parallel Programming Models
o Shared memory: OpenMP
o Distributed memory: C + MPI
o Shared and distributed memory combined (hybrid): C + MPI + OpenMP

8 Hybrid Scenario
(i) MPI handles the communication across the SMP nodes, and OpenMP handles the parallelism within each node.
(ii) At node level, OpenMP threads avoid the extra explicit communication that MPI would require.
(iii) The shared memory/thread copy mechanism is used to improve dynamic load balancing.
(iv) MPI handles the parallelism across the SMP nodes for efficient communication.

9 Different Hybrid Configurations - views
o 16 application threads
o 4 MPI x 4 threads per task = 16 application threads
o 1 MPI x 16 threads per task = 16 application threads
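The configurations on this slide are different ways of factoring the same 16 application threads into MPI tasks and threads per task. A minimal sketch that enumerates every such split is given here; the function name hybrid_layouts is hypothetical and only illustrates the arithmetic, it does not launch anything.

```python
def hybrid_layouts(total_threads):
    """Return every (mpi_tasks, threads_per_task) pair whose product is total_threads."""
    return [
        (tasks, total_threads // tasks)
        for tasks in range(1, total_threads + 1)
        if total_threads % tasks == 0
    ]


# List the possible splits of 16 application threads.
for tasks, threads in hybrid_layouts(16):
    print(f"{tasks} MPI x {threads} threads per task = {tasks * threads} application threads")
```

Which split performs best depends on the application and the node layout; the enumeration only shows the candidates.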

10 Processor binding and Memory affinity
CPU or processor binding
o Binds one or more processes to one or more processors (CPUs). Example: /bin/taskset
NUMA architecture
o In a NUMA architecture, memory is physically distributed throughout the machine even though it is virtually treated as a single address space.
o Memory affinity plays a crucial role in improving the performance of memory-bound applications.
o Local memory (close to the MPI process running on the chip) is faster to access than remote memory. Example: MEMORY_AFFINITY=MCM
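Processor binding can also be requested from inside a program. The sketch below assumes a Linux system and uses Python's os.sched_setaffinity to bind the calling process to one CPU, which is roughly what taskset does from the command line. The helper name pin_to_one_cpu and the choice of the lowest-numbered CPU are illustrative assumptions, not part of the original slides.

```python
import os


def pin_to_one_cpu():
    """Pin the calling process to a single CPU and return the resulting CPU set.

    os.sched_setaffinity is only available on Linux; on other platforms
    this function returns None and changes nothing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    target = min(allowed)               # hypothetical choice: lowest-numbered CPU
    os.sched_setaffinity(0, {target})   # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


print("now bound to CPUs:", pin_to_one_cpu())
```

In an MPI job the binding is normally left to the launcher or the MPI library, so a call like this is mainly useful for experiments.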

11 Case Study #1
CPMD is a parallelized plane wave / pseudopotential implementation of density functional theory, particularly designed for ab-initio molecular dynamics.
Dependent libraries to build CPMD: a math library (e.g. ESSL/MKL/ACML/ATLAS)
Configuration: MPI, OpenMP, MPI + OpenMP (hybrid)
Configuration flags:
CC = mpicc
FC = mpif77 -c
LD = mpif77 -nofor_main
FFLAGS = -L/opt/intel/mkl/ /lib/em64t -lmkl_em64t -lmkl_core \
         -lmkl_lapack -lmkl_em64t -lm -openmp -lpthread
Ref:

12 Literature Review & Reference
Test cases: 32 water molecules, Si512, etc. (standard example test cases)
Each CPMD run consists of two parts:
(i) an optimization step (wave function optimization), and
(ii) a molecular dynamics simulation (the actual simulation).
Ref:

13 Our Case Study Exercise
Using pure MPI:
1. The elapsed time and TCPU time are worse than with the hybrid model.
2. Scalability is poor at larger core counts.
Using the hybrid model (MPI + OpenMP):
1. The elapsed time and TCPU time are improved.
2. We were able to improve scalability and performance.
Hardware configuration: HS22 Intel blade server (2-way quad-core), Nehalem processor
[Charts: "Pure MPI Performance" and "Hybrid Performance", each plotting elapsed time (in sec) and scalability (in %) against the number of nodes.]
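The slide reports "Scalability (in %)" without defining it. One common reading, assumed here, is parallel efficiency relative to the smallest run. The sketch below computes that quantity; the timings are hypothetical placeholders and are not the measurements behind the charts.

```python
def scalability_percent(nodes, elapsed):
    """Parallel efficiency in % for each run, relative to the first (smallest) run.

    efficiency = (t_base * n_base) / (t * n) * 100
    Values above 100 indicate super-linear scaling.
    """
    n_base, t_base = nodes[0], elapsed[0]
    return [100.0 * (t_base * n_base) / (t * n) for n, t in zip(nodes, elapsed)]


# Hypothetical timings, NOT the measurements from the slides.
nodes = [1, 2, 4, 8]
elapsed_sec = [800.0, 420.0, 230.0, 140.0]
for n, eff in zip(nodes, scalability_percent(nodes, elapsed_sec)):
    print(f"{n} nodes: {eff:.1f}% scalability")
```

Under this definition a value above 100% corresponds to the super-linear behaviour discussed later in the deck.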

14 Summary: Pure MPI vs Hybrid
Optimization at runtime:
export I_MPI_ADJUST_ALLTOALL=4
export I_MPI_ADJUST_ALLREDUCE=5
export I_MPI_DEVICE=rdssm
export I_MPI_PIN_DOMAIN=socket
export OMP_NUM_THREADS=8
i.e., 2 MPI processes x 8 threads per node
[Charts: total elapsed time (in sec) and performance improvement (in %) against the number of nodes, for the distributed memory configuration and the distributed shared memory configuration; further panels labelled AlltoAll and All Reduce.]
Ref: Intel MPI Library for Linux reference manual
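When jobs are driven from a script rather than an interactive shell, the same runtime settings can be passed to the child process through its environment. The following is a minimal sketch under that assumption: the variable names and values are the ones listed on this slide, while the helper run_with_hybrid_env is hypothetical, and a small Python child stands in for the real MPI launch command, which is not shown in the source.

```python
import os
import subprocess
import sys

# Settings taken from the slide; values are passed as strings.
HYBRID_ENV = {
    "I_MPI_ADJUST_ALLTOALL": "4",
    "I_MPI_ADJUST_ALLREDUCE": "5",
    "I_MPI_DEVICE": "rdssm",
    "I_MPI_PIN_DOMAIN": "socket",
    "OMP_NUM_THREADS": "8",
}


def run_with_hybrid_env(command):
    """Run command with the hybrid settings added to a copy of the current environment."""
    env = dict(os.environ)
    env.update(HYBRID_ENV)
    done = subprocess.run(command, env=env, capture_output=True, text=True, check=True)
    return done.stdout


# Stand-in child process that just reports the thread count it was given.
child = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
print(run_with_hybrid_env(child).strip())
```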

15 Case Study #2
What is CP2K? CP2K is a freely available (GPL) program, written in Fortran 95, to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems.
Dependent libraries:
o Fortran 95 compiler
o BLAS and LAPACK math libraries (ESSL, MKL or ACML may be used to improve performance)
o ScaLAPACK library for the parallel version of CP2K
o FFT library (optional, depending on the input test case)
o Libint library (required only when the input file uses Hartree-Fock exchange (HFX) calculations)
o MPI library for the parallel version
Configuration: MPI version, OpenMP version, MPI/OpenMP hybrid version
Ref:

16 Challenges in Fortran 95 Compilers
o The PGI, Pathscale and GNU Fortran compilers work fine with both the pure MPI and the hybrid (MPI + OpenMP) configuration.
o Intel Fortran ( ) may not work for CP2K in the hybrid configuration (only for HFX based calculations). GNU Fortran is therefore an alternative on Intel architecture.
o Gfortran or later version parallelizes array operations automatically through the workshare directives. The hybrid configuration therefore requires GNU Fortran version or later.
o Function pointer support (?) is absent in Intel Fortran, so the pure MPI version needs to be compiled with ISO_C_BINDING support.
o Due to some bugs (?) in the Intel Math Kernel Library (MKL), we were unable to link with Intel MKL for the hybrid model. Hence, we used open source math libraries (ScaLAPACK, BLAS, FFTW, LAPACK, etc.).

17 Example Hybrid Configuration
-D GRID_CORE=X (with X=1..6): specific optimized core routines can be selected
-D MAX_CONTR: defines the maximum angular momentum up to which specialized code will be used
CC = gcc
FC = mpif90
LD = mpif90
AR = ar r
DFLAGS = -D GFORTRAN -D FFTSG -D LIBINT -D parallel -D SCALAPACK \
         -D BLACS -D FFTW3 -D MAX_CONTR=3 (optional) \
         -D GRID_CORE=2 (optional)
FCFLAGS = -I <fftw-3.2.2/include> -O3 -fopenmp -ffast-math -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
LDFLAGS = $(FCFLAGS)
LIBS = libint_cpp_wrapper.o \ (from tools/hfx_tools/libint_tools/libint_cpp_wrapper.cpp)
       libderiv.a \
       libint.a \
       libscalapack.a \
       blacs_mpi-linux-0.a \
       blacscinit_mpi-linux-0.a \
       blacsf77init_mpi-linux-0.a \
       blacs_mpi-linux-0.a \
       lapack_linux.a \
       blas_linux.a \
       $(FFTW_LIB)/libfftw3.a \
       -lstdc++ \
       -lpthread
OBJECTS_ARCHITECTURE = machine_gfortran.o

18 Performance and Scalability
[Chart: "Ratio between compute and communication using Pure MPI"; computation time and communication time as a percentage (0% to 100%) against No. of Cores, No. of Nodes.]
Even though the computation time decreases as the number of cores increases, the overall performance was poor because of a tremendous increase in communication time. (Input: regtest-hfx based input files)
[Chart: "Ratio between Compute and Communication using Hybrid (MPI + OpenMP)"; computation time and communication time as a percentage (0% to 100%) against No. of Cores, No. of Nodes.]
Using the hybrid parallelization paradigm, the communication time, which keeps growing with the number of cores, is reduced by 49%, 530%, 814% and 579% for 264, 516, 1032 and 2052 cores respectively, compared with pure MPI.

19 Pure MPI - Communication and Compute statistics
[Chart: "Ratio between compute and communication using Pure MPI"; computation time and communication time on a wall time axis (0% to 100%) against No. of Cores, No. of Nodes.]
[Chart: "Max. and Min. Compute & Communication using Pure MPI"; maximum compute, minimum compute, maximum communication and minimum communication against No. of Cores, No. of Nodes.]
Most of the communication time is spent in MPI_Barrier() and MPI_Wait() calls because the communication is highly imbalanced. The minimum and maximum communication ratios, measured against the communication time, are (1.08%, 44.28%), (5.94%, 63.42%), (6.97%, 66.87%) and (26.38%, 69.84%) for 256, 512, 1024 and 2048 cores respectively.
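The gap between the minimum and maximum communication ratio is a simple indicator of imbalance across MPI processes. The slide does not say exactly how the ratios were computed, so the sketch below assumes that each rank's ratio is its communication time divided by its own compute plus communication time. The per-rank timings are hypothetical.

```python
def communication_share(compute_sec, comm_sec):
    """Return (min %, max %) of per-rank wall time spent in communication."""
    shares = [
        100.0 * comm / (comp + comm)
        for comp, comm in zip(compute_sec, comm_sec)
    ]
    return min(shares), max(shares)


# Hypothetical per-rank timings in seconds, NOT profiling data from the slides.
compute = [90.0, 60.0, 75.0, 40.0]
comm = [10.0, 40.0, 25.0, 60.0]
low, high = communication_share(compute, comm)
print(f"communication share: min {low:.2f}%, max {high:.2f}%")
```

A wide spread between the two values means some ranks wait in barrier or wait calls while others are still computing.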

20 Hybrid parallelization
[Chart: "Max and Min Compute & Communication using Hybrid"; max. compute, max. communication, min. compute and min. communication on a wall time axis (0% to 100%) against No. of Cores, No. of Nodes.]
The maximum time spent on communication is no more than 36%, and the maximum time spent on computation is more than 98%. The communication time (minimum and maximum) is reduced tremendously, which helped us achieve better load balancing.

21 Performance impact in Hybrid parallelization
[Chart: "Scalability: Performance difference between Pure MPI and Hybrid model"; % of performance difference for Pure MPI and Hybrid against No. of Cores, No. of Nodes.]
Performance improvement using the hybrid model:
export I_MPI_DEVICE=rdssm
export OMP_NUM_THREADS=1
export I_MPI_PERHOST=allcores
export I_MPI_PIN=1
export I_MPI_PIN_MODE=mpd
export I_MPI_PIN_PROCS=allcores
o Hyper-threading is enabled, which doubles the number of (logical) cores.
o Physical cores are mostly used to run the MPI processes, and logical cores are used only for running OpenMP threads.
o Instructions from more than one thread can be executing in any given pipeline stage at a time.
o A single thread rarely uses all the execution units. When a thread stalls because it needs data that is not in the cache, the CPU would otherwise wait on memory for hundreds of cycles; a second thread that is ready to run can fill these empty slots and so prevent wasted resources.
o Therefore, the compute performance for 1032 and 2052 cores improves tremendously, and superlinear scalability of 17.46% and 42.94% (117% and 142% total improvement compared with pure MPI) is obtained for 1032 and 2052 cores respectively.

22 Additionally, increase the memory consumption by using MPI/OpenMP hybrid parallelism
The performance improvement using hybrid parallelism is obtained in the following manner:
o With the hybrid option, we can reduce the number of MPI tasks and significantly increase the memory per MPI task. Many of the 2-electron integrals can then be stored in memory, which avoids the time spent recomputing them.
o With the hybrid option, we can exploit HT technology, which is not possible in the pure MPI case because of memory constraints.
o The hybrid option improves communication performance and latency, as fewer MPI tasks contend for bandwidth.
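The memory argument is simple division: with a fixed amount of memory per node, fewer MPI tasks per node leave more memory for each task. A small sketch of that arithmetic follows; the node memory size and the task/thread splits are hypothetical values chosen for illustration.

```python
def memory_per_task_gb(node_memory_gb, tasks_per_node):
    """Memory available to each MPI task when a node's memory is shared evenly."""
    return node_memory_gb / tasks_per_node


# Hypothetical node with 24 GB and 16 logical cores.
node_memory_gb = 24.0
for tasks, threads in [(16, 1), (8, 2), (2, 8), (1, 16)]:
    share = memory_per_task_gb(node_memory_gb, tasks)
    print(f"{tasks} MPI tasks x {threads} threads: {share:.2f} GB per MPI task")
```

Going from 16 tasks to 2 tasks per node in this example raises the per-task share from 1.5 GB to 12 GB, which is the kind of headroom needed to keep integrals in memory.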

23 Advantages of Hybrid Parallelization
o Improves performance
o Overlaps communication and computation
o Increases the memory consumption per MPI process
o Supports nested parallelism
o Improves load balancing
o Uses system resources efficiently (e.g. Simultaneous Multi-Threading (SMT) on Power and Hyper-Threading (HT) on Intel are used to run the worker threads of an MPI process)
o Extends scalability towards linear or super-linear

24 Conclusion
Running the parallel program in hybrid mode helps to improve performance and achieves super-linear scalability. Appropriate process binding is used to pin the MPI processes at the core/socket level, which increases the inter-node communication efficiency.

25 Acknowledgement & Thanks!
Luigi Brochard, IBM, Distinguished Engineer
Swamy N. Kandadai, IBM, Senior Executive IT Specialist
Rajendra D. (Raj) Panda, IBM, Software Performance Analyst
Mathias Puetz, IBM, Application Performance Specialist

26 Thank you!
