Code-Agnostic Performance Characterisation and Enhancement

1 Code-Agnostic Performance Characterisation and Enhancement Ben Menadue Academic

2 Who Uses the NCI? NCI has a large user base: 1000s of users across 100s of projects. These projects encompass almost every research area: physical sciences, Earth sciences, engineering, mathematics, finance, and social science. Correspondingly, there is a huge variation in backgrounds and experience. Some users are programmers, who can optimise their algorithm and code to suit the machine. Most just run pre-packaged software and have no control over the source.

3 Performance Characterisation for Beginners. If source code is available, we can instrument and profile in the usual fashion. For non-advanced users, we often walk them through this and help them analyse the results. What about pre-built packages? Use an LD_PRELOAD library to catch and log calls of interest, e.g. MPI calls. We provide several such tools: IPM, mpiP, and perf (not an LD_PRELOAD tool, but it still doesn't require recompilation). IPM is our tool of choice: it is easy to use (module load openmpi ipm), it interfaces with PAPI for hardware counters, and it carries NCI patches for message binning, rounding off, and suspend-resume.
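The interposition idea behind an LD_PRELOAD profiler can be illustrated with a small Python analogy: wrap a call so that every invocation is counted and timed, keyed by a binned message size, without touching the caller. This is only a sketch of the concept. Real tools such as IPM intercept MPI calls at the shared-library level, and the names `call_log`, `size_bin`, `profiled`, and `send`, as well as the power-of-two binning rule, are assumptions made for illustration rather than anything taken from IPM or the NCI patches.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process log: (function name, size bin) -> [call count, total seconds]
call_log = defaultdict(lambda: [0, 0.0])


def size_bin(nbytes):
    """Round a message size up to the next power of two (assumed binning rule)."""
    if nbytes <= 1:
        return 1
    return 1 << (nbytes - 1).bit_length()


def profiled(func):
    """Wrap func so each call is counted and timed, keyed by binned payload size."""
    @wraps(func)
    def wrapper(payload, *args, **kwargs):
        start = time.perf_counter()
        try:
            return func(payload, *args, **kwargs)
        finally:
            entry = call_log[(func.__name__, size_bin(len(payload)))]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper


@profiled
def send(payload):
    """Stand-in for a communication call; it only returns the payload length."""
    return len(payload)


# Three calls with different payload sizes; the first two share a bin.
send(b"x" * 100)
send(b"x" * 120)
send(b"x" * 5000)

for (name, nbytes), (count, seconds) in sorted(call_log.items()):
    print(f"{name} <= {nbytes} B: {count} calls, {seconds:.6f} s")
```

Binning keeps the log small when an application sends many slightly different message sizes, at the cost of some resolution in the size histogram.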

4 IPM Profile of CCAM. Performance of CCAM on Raijin was not what we expected: slower than on Vayu! We profiled a run using IPM to see what was going on.

6 Performance Improvement in CCAM. What can we do to improve the performance? CCAM is standard software in use by many researchers, so we can't change the algorithm or code. We need a different strategy to improve performance: work with the communication and system libraries instead. The IPM profile shows a huge overhead coming from MPI calls. Mellanox Accelerators: the Messaging Accelerator (MXM) improves message passing by using extra Mellanox hardware features; the Fabric Collective Accelerator (FCA) offloads collectives from the processes to the interconnect hardware.

7 Performance Improvement in CCAM Without Changing Code. [Table and chart: CCAM execution time (s), average time and standard deviation, for four configurations: July 2013, original results; November 2013, Mellanox Accelerators; March 2014, kernel updates and tweaks, MXM, FCA; April 2014, latest result with HT.] Avg. Time is based on a varied number of runs.
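Since the slide reports an average time and a standard deviation per configuration, each based on a different number of runs, a minimal sketch of that summary may help. The run times below are hypothetical placeholders and are not the CCAM measurements; the labels `baseline` and `tuned` and the helper `summarise` are likewise assumptions for illustration.

```python
from statistics import mean, stdev

# Hypothetical wall-clock times in seconds; NOT the measurements from the slides.
# Configurations may have different numbers of runs.
runs = {
    "baseline": [100.0, 104.0, 98.0],
    "tuned": [80.0, 82.0, 81.0, 79.0],
}


def summarise(times):
    """Return (average, sample standard deviation) for a list of run times."""
    return mean(times), stdev(times)


for label, times in runs.items():
    avg, sd = summarise(times)
    print(f"{label}: n={len(times)} avg={avg:.2f} s stdev={sd:.2f} s")
```

Reporting the number of runs alongside each average matters here, because a standard deviation estimated from three runs is much less reliable than one estimated from many.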

9 Performance Improvement in CCAM. What can we do to improve the performance? CCAM is standard software in use by many researchers, so we can't change the algorithm or code. We need a different strategy to improve performance: work with the communication and system libraries instead. The operating system can also be impacting performance. We moved to the latest CentOS 6 kernel and operating system (new task scheduling, memory management). We enabled hyperthreading, which allows operating system tasks to run on separate hardware threads and so reduces their impact and jitter.
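One common way to observe operating system jitter is to time the same fixed amount of work many times and look at the spread. The sketch below does this in pure Python. It is an assumption-laden illustration rather than the method used for the CCAM results: the function names, the loop size, and the sample count are all hypothetical, and interpreter overhead adds noise of its own.

```python
import time
from statistics import median


def fixed_work(n=20000):
    """Do a fixed amount of arithmetic so every sample costs the same in principle."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def measure_jitter(samples=200):
    """Time fixed_work repeatedly and summarise the spread of the durations."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        fixed_work()
        durations.append(time.perf_counter() - start)
    fastest = min(durations)
    slowest = max(durations)
    return {
        "min": fastest,
        "median": median(durations),
        "max": slowest,
        # Ratio of slowest to fastest sample; values near 1 indicate little jitter.
        "max_over_min": slowest / fastest if fastest > 0 else float("inf"),
    }


stats = measure_jitter()
print(stats)
```

Comparing `max_over_min` across configurations (for example with and without hyperthreading) gives a rough indication of how much other activity on the core is perturbing the workload.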

11 Application Software Stack. General package installations are made on request to a central location, /apps, a Lustre filesystem mounted on all nodes. We typically build these so they pass all their tests, which normally means default optimisation and gcc. Fortran 90/03/08 modules and libraries are built using both gfortran and ifort, since the ABI is different. While this gives quite reasonable performance, sometimes users need or want more. We are working closely with a developer of Fluidity to compile a custom software stack. There are lots of dependencies: MPI, PETSc, Metis, Scotch, Zoltan, Python, GMSH, and others, all built using the latest Intel compilers and OpenMPI with very high optimisation settings. We found several compiler bugs, which were reported to Intel; several have already been fixed. 20% improvement in runtime using the custom software stack! This is while still using a debugging build of PETSc, which is known to have a significant performance impact.
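To put the 20% figure in other terms, a runtime reduction can be converted into a speedup factor. The sketch below assumes that "20% improvement in runtime" means the runtime is 20% shorter; the slide does not state this explicitly, and the function name is hypothetical.

```python
def speedup_from_runtime_reduction(reduction):
    """Convert a fractional runtime reduction (0.2 for 20%) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    # New runtime is (1 - reduction) times the old one, so speedup is its inverse.
    return 1.0 / (1.0 - reduction)


# Under the stated assumption, a 20% shorter runtime is a 1.25x speedup.
print(f"{speedup_from_runtime_reduction(0.20):.2f}x")
```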

12 Summary. Even without changing a line of source code, there are still lots of performance enhancements available! A highly optimised software stack gives the best serial performance. Using the latest kernel and system libraries can reduce the impact of the operating system. Hyperthreading reduces jitter and the impact of O/S tasks. Mellanox Accelerators can significantly improve MPI performance, especially for collectives.

nci.org.au @NCInews
