MPI-IO Performance Optimization IOR Benchmark on IBM ESS GL4 Systems

1 MPI-IO Performance Optimization: IOR Benchmark on IBM ESS GL4 Systems
Xinghong He
HPC Application Support, IBM Systems WW Client Centers
May

2 Agenda
- System configurations
  - Storage system, compute cluster
- IOR benchmark
  - Build, run-time environment, test cases (command line)
- IOR POSIX performance
  - Baseline, capability of the file system
- IOR MPIIO performance
  - PE and MPIIO (ROMIO) parameters
  - Collective IO, independent IO
  - File transfer size

3 System configurations
[Diagram: compute-1 through compute-40 (8GB pagepool) connect over FDR InfiniBand to IB Switch1 and IB Switch2; server1 through server4 (147GB pagepool) connect to the switches over 3xFDR and to the two ESS GL4 units over 6Gbps SAS.]

4 System configurations
[Diagram: compute-1 through compute-40 (8GB pagepool) connect over EDR InfiniBand to IB Switch1 and IB Switch2; server1 through server4 (147GB pagepool) connect to the switches over 3xFDR and to the two ESS GL4 units over 6Gbps SAS.]

5 40 compute nodes
- IBM Power System S824L ( L)
- 2x10-core POWER GHz
- 256GB (16x16GB CDIMM) memory
- GPFS pagepool size: 8GB
- 2 FDR InfiniBand links (1 dual-port adapter)
- Ubuntu (LE)
- IBM Parallel Environment Run-time Edition

6 Compute nodes - updated
- IBM Power System S822LC (8335-GTA)
- 2x10-core POWER GHz
- 256GB (8x32GB 1333 MHz RDIMM) memory
- GPFS pagepool size: 8GB
- 2 EDR InfiniBand links (1 dual-port adapter)
- RHEL-7.2 LE
- IBM Parallel Environment Run-time Edition

7 2 ESS GL4, containing
- 4 IBM Power System S822L ( L)
  - 2x10-core POWER GHz
  - 256GB (16x16GB CDIMM) memory
  - GPFS pagepool size: 147GB
  - 6 FDR InfiniBand links (3 dual-port adapters)
  - RHEL-7.1 BE
  - IBM Spectrum Scale
- IBM DCS3700 Expansion Unit ( E)
  - 464 (8x58) NL-SAS 2TB HDDs, 4 400GB SSDs
  - RAID 8+2P

8 IOR Benchmark
- IOR downloaded from SourceForge
- Build: no change to any source or makefile
  - export PATH=/opt/ibmhpc/pecurrent/ppe.poe/bin:${PATH}
    - Adds mpicc, which is not in /usr/bin
  - cd IOR/src/C; make mpiio
    - Builds both POSIX and MPIIO
- Test cases
  - posix-1: -b $bsize -t 16M -s 1 -w -r -g -v -d 1 -i 4 -o $TARGET -a POSIX -F
  - posix-2: -b $bsize -t 16M -s 1 -w -r -g -v -d 1 -i 4 -o $TARGET -a POSIX
  - mpiio-1: -b $bsize -t 16M -s 1 -w -r -g -v -d 1 -i 4 -o $TARGET -a MPIIO -c -F
  - mpiio-2: -b $bsize -t 16M -s 1 -w -r -g -v -d 1 -i 4 -o $TARGET -a MPIIO -c
- bsize is chosen so that the total file size per compute node is 76800MB, ~10x the pagepool size
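The bsize rule above can be sketched in a few lines of Python. This is a minimal illustration, not the author's script: it assumes that with -s 1 each MPI task writes exactly one block, so the per-task block size is the per-node total divided by the tasks per node, and that IOR accepts an "m" size suffix. The function names and the target path in the usage note are hypothetical.

```python
# Sketch: derive the IOR block size (-b) from the per-node data volume.
# Assumption: with -s 1, each MPI task writes one block of size bsize,
# so bsize = per-node total / tasks per node (ppn).

PER_NODE_MB = 76800   # total data per compute node, from the slide
TRANSFER_MB = 16      # IOR -t value used in the four test cases


def block_size_mb(ppn, per_node_mb=PER_NODE_MB, transfer_mb=TRANSFER_MB):
    """Return the per-task block size in MB for `ppn` tasks per node."""
    if ppn <= 0 or per_node_mb % ppn != 0:
        raise ValueError("ppn must evenly divide the per-node data volume")
    bsize = per_node_mb // ppn
    # The block size should be a whole number of transfers.
    if bsize % transfer_mb != 0:
        raise ValueError("block size must be a multiple of the transfer size")
    return bsize


def ior_args(case, ppn, target):
    """Build the IOR argument list for one of the four test cases."""
    extra = {
        "posix-1": ["-a", "POSIX", "-F"],
        "posix-2": ["-a", "POSIX"],
        "mpiio-1": ["-a", "MPIIO", "-c", "-F"],
        "mpiio-2": ["-a", "MPIIO", "-c"],
    }[case]
    common = ["-b", f"{block_size_mb(ppn)}m", "-t", "16M", "-s", "1",
              "-w", "-r", "-g", "-v", "-d", "1", "-i", "4", "-o", target]
    return common + extra
```

For example, `ior_args("mpiio-2", 4, "/gpfs/fs1/ior.dat")` (a made-up path) starts with `-b 19200m`, since 76800 MB split across 4 tasks per node gives 19200 MB per task.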

9 PE environment variables
export MP_USE_BULK_XFER=yes
export MP_EAGER_LIMIT=65536
export MEMORY_AFFINITY=MCM
export MP_RESD=poe
export MP_PE_AFFINITY=yes
export MP_BINDPROC=yes
export MP_TASK_AFFINITY=cpu
export MP_CPU_BIND_LIST="152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,0"
- Adapter on the 2nd socket
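The MP_CPU_BIND_LIST value above follows a regular pattern that can be generated rather than typed. The sketch below assumes 8 hardware threads per core (inferred from the stride of 8 in the list, not stated on the slide) and 20 cores per node. It lists the first hardware thread of each core in descending order, so the highest-numbered cores are used first. The function name is hypothetical.

```python
def cpu_bind_list(cores=20, threads_per_core=8):
    """Return a comma-separated list with the first hardware thread of
    each core, highest-numbered core first.

    Assumption: logical CPUs are numbered core by core, so core k owns
    CPUs k*threads_per_core .. (k+1)*threads_per_core - 1.
    """
    first_threads = range(0, cores * threads_per_core, threads_per_core)
    return ",".join(str(cpu) for cpu in reversed(first_threads))


if __name__ == "__main__":
    # Prints an export line ready to paste into a job script.
    print(f'export MP_CPU_BIND_LIST="{cpu_bind_list()}"')
```

Listing the highest-numbered CPUs first matches the slide's note that the adapter is on the 2nd socket: with fewer than 20 tasks per node, tasks land on the cores nearest the adapter.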

10 Other settings: MPIIO related
export GPFSMPIO_COMM=1
- Use MPI_Isend/MPI_Irecv instead of MPI_Alltoallv for exchanging data between the aggregator and other processes
export GPFSMPIO_P2PCONTIG=1
export MP_IOTASKLIST=${io_list}
export ROMIO_HINTS=hints_file
- Equivalent to IOR option -U hints_file
export MP_I_SHOW_AGGRS=1
export ROMIO_PRINT_HINTS=1
- Equivalent to IOR option -H
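As an illustration of the ROMIO_HINTS setting above, the following sketch writes a hints file that disables collective buffering. It assumes the file that ROMIO_HINTS points to uses one whitespace-separated "key value" pair per line; check the ROMIO documentation for your MPI before relying on this. The function names and the file name are hypothetical.

```python
from pathlib import Path


def format_hints(hints):
    """Render hints as text, one "key value" pair per line.

    Assumption: this is the layout ROMIO expects in the file named by
    the ROMIO_HINTS environment variable.
    """
    return "".join(f"{key} {value}\n" for key, value in hints.items())


def write_hints_file(path, hints):
    """Write the hints to `path` and return the path as a Path object."""
    target = Path(path)
    target.write_text(format_hints(hints))
    return target


if __name__ == "__main__":
    # Hypothetical example: turn off collective buffering for reads and writes.
    write_hints_file("hints_file", {
        "romio_cb_write": "disable",
        "romio_cb_read": "disable",
    })
```

Setting ROMIO_PRINT_HINTS=1, as on the slide, is a simple way to confirm that the hints were actually picked up at run time.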

11 IOR POSIX IO on 2 GL4, 1 ppn
[Chart: IOR POSIX IO on 2 GL4, 1 MPI task per node. IO bandwidth in MiB/s versus number of compute nodes; series: posix-1 write, posix-1 read, posix-2 write, posix-2 read.]

12 IOR POSIX IO on 2 GL4, 4 ppn
[Chart: IOR POSIX IO on 2 GL4, 4 MPI tasks per node. IO bandwidth in MiB/s versus number of compute nodes; series: posix-1 write, posix-1 read, posix-2 write, posix-2 read.]

13 IOR MPIIO on 2 GL4, 1 ppn
[Chart: IOR MPIIO on 2 GL4, 1 MPI task per node. IO bandwidth in MiB/s versus number of compute nodes; series: mpiio-1 write, mpiio-1 read, mpiio-2 write, mpiio-2 read.]

14 IOR MPIIO on 2 GL4, 4 ppn
[Chart: IOR MPIIO on 2 GL4, 4 MPI tasks per node. IO bandwidth in MiB/s versus number of compute nodes; series: mpiio-1 write, mpiio-1 read, mpiio-2 write, mpiio-2 read.]

15 Parameter table of the test cases
Case         GPFSMPIO_COMM  GPFSMPIO_P2PCONTIG  MP_IOTASKLIST  romio_cb_write  romio_cb_read
def          default        default             default        default         default
comm         1              default             default        default         default
p2p          default        1                   default        default         default
both         1              1                   default        default         default
dd_def       default        default             default        disable         disable
dd_comm      1              default             default        disable         disable
dd_p2p       default        1                   default        disable         disable
dd_both      1              1                   default        disable         disable
tio_def      default        default             all            default         default
tio_comm     1              default             all            default         default
tio_p2p      default        1                   all            default         default
tio_both     1              1                   all            default         default
dd_tio_def   default        default             all            disable         disable
dd_tio_comm  1              default             all            disable         disable
dd_tio_p2p   default        1                   all            disable         disable
dd_tio_both  1              1                   all            disable         disable
Note: the defaults are GPFSMPIO_COMM=0, GPFSMPIO_P2PCONTIG=0, MP_IOTASKLIST=one aggregator per node, romio_cb_write=enable, romio_cb_read=enable.
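The 16 case names in the table above encode their own settings: a dd prefix disables collective buffering, a tio prefix makes all tasks aggregators, and the final word selects the GPFSMPIO options. The sketch below decodes a name into its settings. It is an illustration only: the function names are hypothetical, and "all" is the table's label rather than a literal MP_IOTASKLIST value, which on the slides is passed as ${io_list}.

```python
PREFIXES = ["", "dd_", "tio_", "dd_tio_"]
SUFFIXES = ["def", "comm", "p2p", "both"]


def all_cases():
    """Return the 16 case names in the order used by the table."""
    return [prefix + suffix for prefix in PREFIXES for suffix in SUFFIXES]


def case_settings(name):
    """Decode a case name such as "dd_tio_comm" into its table row."""
    *flags, suffix = name.split("_")
    if suffix not in SUFFIXES or not set(flags) <= {"dd", "tio"}:
        raise ValueError(f"unknown test case: {name}")
    cb = "disable" if "dd" in flags else "default"
    return {
        "GPFSMPIO_COMM": "1" if suffix in ("comm", "both") else "default",
        "GPFSMPIO_P2PCONTIG": "1" if suffix in ("p2p", "both") else "default",
        # "all" is a label: every task acts as an I/O aggregator.
        "MP_IOTASKLIST": "all" if "tio" in flags else "default",
        "romio_cb_write": cb,
        "romio_cb_read": cb,
    }


if __name__ == "__main__":
    for case in all_cases():
        print(case, case_settings(case))
```

Deriving the rows from the names keeps a run script and the table from drifting apart when cases are added or renamed.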

16 ROMIO hints parameter default values
PE
  cb_buffer_size =
  romio_cb_read = enable
  romio_cb_write = enable
  cb_nodes = 4
  romio_no_indep_rw = false
  romio_cb_pfr = disable
  romio_cb_fr_types = aar
  romio_cb_fr_alignment = 1
  romio_cb_ds_threshold = 0
  romio_cb_alltoall = automatic
  ind_rd_buffer_size =
  ind_wr_buffer_size =
  romio_ds_read = automatic
  romio_ds_write = automatic
  romio_filesystem_type = GPFS+PE: IBM GPFS for PE
OpenMPI
  cb_buffer_size =
  romio_cb_read = automatic
  romio_cb_write = automatic
  cb_nodes = 2
  romio_no_indep_rw = false
  romio_cb_pfr = disable
  romio_cb_fr_types = aar
  romio_cb_fr_alignment = 1
  romio_cb_ds_threshold = 0
  romio_cb_alltoall = automatic
  ind_rd_buffer_size =
  ind_wr_buffer_size =
  romio_ds_read = automatic
  romio_ds_write = automatic
  cb_config_list = *:1
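To see at a glance where the two sets of defaults above differ, the "key = value" lines can be parsed and compared. The sketch below is a minimal example under the assumption that the printed hints have exactly the "key = value" shape shown on the slide. The sample text uses a subset of the listed values, and the function names are hypothetical.

```python
def parse_hints(text):
    """Parse "key = value" lines into a dict, skipping other lines."""
    hints = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            hints[key.strip()] = value.strip()
    return hints


def diff_hints(left, right):
    """Return {key: (left_value, right_value)} for keys that differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k))
            for k in keys if left.get(k) != right.get(k)}


# Subset of the defaults listed on the slide.
PE_DEFAULTS = """\
romio_cb_read = enable
romio_cb_write = enable
cb_nodes = 4
romio_no_indep_rw = false
romio_ds_read = automatic
"""

OPENMPI_DEFAULTS = """\
romio_cb_read = automatic
romio_cb_write = automatic
cb_nodes = 2
romio_no_indep_rw = false
romio_ds_read = automatic
"""

if __name__ == "__main__":
    pe = parse_hints(PE_DEFAULTS)
    ompi = parse_hints(OPENMPI_DEFAULTS)
    for key, (a, b) in diff_hints(pe, ompi).items():
        print(f"{key}: PE={a} OpenMPI={b}")
```

On this subset the differences are romio_cb_read, romio_cb_write, and cb_nodes, which are the collective-buffering settings the test cases vary.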

17 16 MB transfer size - mpiio on 1 and 2 nodes

18 IOR mpiio-1 write from one node, 16 MB tsize
[Chart: IOR mpiio-1 write on 1 node, 16MB. Bandwidths in MiB/s by node x ppn (1x1, 1x2, 1x4) for the 16 test cases, def through dd_tio_both.]

19 IOR mpiio-1 write from two nodes, 16 MB tsize
[Chart: IOR mpiio-1 write on 2 nodes, 16MB. Bandwidths in MiB/s by node x ppn (2x1, 2x2, 2x4) for the 16 test cases, def through dd_tio_both.]

20 IOR mpiio-2 write from one node, 16 MB tsize
[Chart: IOR mpiio-2 write on 1 node, 16MB. Bandwidths in MiB/s by node x ppn (1x1, 1x2, 1x4) for the 16 test cases, def through dd_tio_both.]

21 IOR mpiio-2 write from two nodes, 16 MB tsize
[Chart: IOR mpiio-2 write on 2 nodes, 16MB. Bandwidths in MiB/s by node x ppn (2x1, 2x2, 2x4) for the 16 test cases, def through dd_tio_both.]

22 1 MB transfer size - much larger difference

23 IOR mpiio-1 write from one node, 1 MB tsize
[Chart: IOR mpiio-1 write on 1 node, 1MB. Bandwidths in MiB/s by node x ppn (1x1, 1x2, 1x4) for the 16 test cases, def through dd_tio_both.]

24 IOR mpiio-1 write from two nodes, 1 MB tsize
[Chart: IOR mpiio-1 write on 2 nodes, 1MB. Bandwidths in MiB/s by node x ppn (2x1, 2x2, 2x4) for the 16 test cases, def through dd_tio_both.]

25 IOR mpiio-2 write from one node, 1 MB tsize
[Chart: IOR mpiio-2 write on 1 node, 1MB. Bandwidths in MiB/s by node x ppn (1x1, 1x2, 1x4) for the 16 test cases, def through dd_tio_both.]

26 IOR mpiio-2 write from two nodes, 1 MB tsize
[Chart: IOR mpiio-2 write on 2 nodes, 1MB. Bandwidths in MiB/s by node x ppn (2x1, 2x2, 2x4) for the 16 test cases, def through dd_tio_both.]

27 IOR mpiio-2 write BW comparison
[Table: default, best, and ratio columns for each of the 16 MB and 1 MB transfer sizes, one row per node x ppn configuration.]
mpiio-2 write bandwidths in MiB/s for the default and the best parameters. 16 MB and 1 MB are the file transfer sizes (tsize) set with the IOR option -t.

28 Summary
- 45GB/s read and 35GB/s write for 2 ESS GL4
  - Both POSIX and MPIIO
  - Both file_per_proc and single_shared_file
- MPI collective IO is very sensitive to ROMIO hints parameters and other run-time parameters
  - More impact on single_shared_file than on file_per_proc
  - More impact with multiple MPI tasks per node than with one MPI task per node
  - More impact with smaller transfer sizes than with larger transfer sizes
- With a 1 MB transfer size, the worst case can be 136x worse
  - It can be expected to be even worse for sub-MB transfer sizes

29 Thank you!
