Optimization and Tuning Collectives in MVAPICH2
1 Optimization and Tuning Collectives in MVAPICH2. MVAPICH2 User Group (MUG) Meeting, by Hari Subramoni, The Ohio State University. E-mail: subramon@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~subramon
2 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
3 MPI Collective Operations: The MPI Standard defines collective communication primitives to implement group communication operations. Collectives such as MPI_Bcast and MPI_Allreduce are commonly used across parallel applications, owing to their ease of use and performance portability. Processor and network architecture is constantly evolving (multi-core/many-core architectures, InfiniBand FDR, etc.). Critical to design new algorithms to implement and optimize collective operations on emerging systems.
4 MPI Collective Operations: Performance of collective operations is sensitive to various factors: message length; number of processes in the communicator; processor architecture (number of CPU sockets per node, number of cores per CPU socket); network technology (InfiniBand DDR/QDR/FDR, Single-Rail, Multi-Rail). Important to carefully tune various designs to implement collective operations efficiently.
5 Conventional Collective Communication Algorithms: Basic collective communication algorithms: Binomial Tree (MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather); Recursive Doubling (MPI_Allreduce, MPI_Allgather); Ring Exchange (MPI_Allgather); Bruck Algorithm, Pairwise Exchange (MPI_Alltoall). Combining basic algorithms to implement more complex schemes: Scatter-Allgather (MPI_Bcast); Reduce-Scatter-Allgather (MPI_Allreduce); Reduce-Scatter-Gather (MPI_Reduce).
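To make the binomial-tree idea concrete, here is a minimal sketch of a root-0 broadcast built from point-to-point calls. It is illustrative only (the function name binomial_bcast is an assumption, and it handles a single contiguous message); it is not MVAPICH2's actual implementation.

#include <mpi.h>

/* Binomial-tree broadcast from rank 0: in round k, every rank that already
 * holds the data sends it to the rank 2^k positions away, so the set of
 * ranks holding the data doubles each round (ceil(log2(P)) rounds total). */
void binomial_bcast(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank < mask) {
            int dst = rank + mask;
            if (dst < size)
                MPI_Send(buf, count, dtype, dst, 0, comm);
        } else if (rank < (mask << 1)) {
            MPI_Recv(buf, count, dtype, rank - mask, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}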
6 Collective Communication Algorithms for Multicore Systems: Basic multi-core aware designs use a hierarchical communicator system: an intra-node communicator, which includes all processes that share the same address space (the lowest-ranked process is the node leader), and an inter-leader communicator. MPICH implements many collectives in a hierarchical manner; intra-node phases are implemented via simple point-to-point operations. Critical to design efficient schemes on emerging multi-/many-core systems.
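A minimal sketch of how such a two-level communicator hierarchy can be built with standard MPI-3 calls; the function and variable names are illustrative assumptions, and this is not MVAPICH2's internal code.

#include <mpi.h>

/* Build one communicator per node plus a communicator of node leaders. */
void build_hierarchy(MPI_Comm comm, MPI_Comm *node_comm, MPI_Comm *leader_comm)
{
    int world_rank, node_rank;
    MPI_Comm_rank(comm, &world_rank);

    /* Group together all processes that can share memory (same node). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, node_comm);
    MPI_Comm_rank(*node_comm, &node_rank);

    /* The lowest-ranked process on each node becomes the node leader;
     * only leaders obtain a valid inter-leader communicator. */
    MPI_Comm_split(comm, (node_rank == 0) ? 0 : MPI_UNDEFINED,
                   world_rank, leader_comm);
}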
7 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
8 Collective Communication in MVAPICH2: Collective algorithms in MV2 are either Conventional (Flat) or Multi-Core Aware Designs. Inter-node communication: Inter-Node (Pt-to-Pt) or Inter-Node (Hardware-Mcast). Intra-node communication: Intra-Node (Pt-to-Pt), Intra-Node (Shared-Memory), or Intra-Node (Kernel-Assisted). Run-time flags: all shared-memory based collectives: MV2_USE_SHMEM_COLL (Default: ON); hardware Mcast-based collectives: MV2_USE_MCAST (Default: OFF).
9 Collective Communication in MVAPICH2: MVAPICH2 relies on highly optimized and tuned shared-memory based optimizations for several important collectives: MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Barrier. MVAPICH2 also uses a combination of hierarchical designs to optimize MPI_Gather and MPI_Scatter. An optimized variant of recursive doubling is used to improve the latency of the MPI_Allgather collective. MVAPICH2 also relies on hardware multicast based designs to optimize MPI_Bcast and MPI_Scatter. Kernel-assisted mechanisms are also being used to optimize the performance of collectives, such as MPI_Allgather, in MVAPICH2.
10 OSU MicroBenchmarks for Collectives (executable / latency benchmark): osu_allgather (MPI_Allgather), osu_allgatherv (MPI_Allgatherv), osu_allreduce (MPI_Allreduce), osu_alltoall (MPI_Alltoall), osu_alltoallv (MPI_Alltoallv), osu_barrier (MPI_Barrier), osu_bcast (MPI_Bcast), osu_gather (MPI_Gather), osu_gatherv (MPI_Gatherv), osu_reduce (MPI_Reduce), osu_reduce_scatter (MPI_Reduce_scatter), osu_scatter (MPI_Scatter), osu_scatterv (MPI_Scatterv).
11 OSU MicroBenchmarks for Collectives: Running OMB collective benchmarks: mpirun_rsh -np # -hostfile hosts ./<benchmark name>. Reports the average time taken to complete the collective operation; latency is averaged across all processes, across 1,000 iterations. Additional run-time options: users may choose to view additional latency statistics, such as max and min latency, by enabling the -f option: mpirun_rsh -np # -hostfile hosts ./<benchmark name> -f. The -m option allows users to specify the maximum amount of memory and the message lengths used by the benchmark: mpirun_rsh -np # -hostfile hosts ./<benchmark name> -m # (useful to prevent seg-faults with large messages on systems with limited memory).
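As a concrete illustration of the command form above (the process count, hostfile name, and benchmark binary are placeholders, and exact option semantics may differ slightly across OMB versions), a full-statistics run might look like:

mpirun_rsh -np 1024 -hostfile hosts ./osu_allreduce -f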
12 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
13 Tuning Collective Operations in MVAPICH2: MVAPICH2 relies on several shared-memory-based, hardware-multicast-based, and kernel-assisted designs to optimize collectives. Critical to tune such designs to ensure low communication latency for collective operations across various message lengths, numbers of processes, and different systems. Collective designs are broadly classified by: - Communication algorithm (Binomial, Recursive-Doubling, etc.) - Communication mechanism (Shared-memory, LiMIC, Mcast, etc.)
14 Hardware Multicast-aware MPI_Bcast on TACC Stampede. [Charts: MPI_Bcast latency (us) vs. message size for small and large messages at 102,400 cores, and latency vs. number of nodes for fixed small message sizes, comparing the Default and Multicast designs.] MCAST-based designs improve the latency of MPI_Bcast by up to 85%. Use MV2_USE_MCAST=1 to enable the MCAST-based designs.
15 Enabling Hardware Multicast-aware Designs: Multicast is applicable to MPI_Bcast, MPI_Scatter, and MPI_Allreduce. Parameter: MV2_USE_MCAST - Description: Enables hardware Multicast features - Default: Disabled. The corresponding configure-time flag, --enable-mcast, is enabled by default.
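Assuming the library was configured with multicast support, a typical way to enable the MCAST-based designs for a single run follows the mpirun_rsh environment-variable convention used elsewhere in this document (the process count, hostfile, and benchmark are placeholders):

mpirun_rsh -np 1024 -hostfile hosts MV2_USE_MCAST=1 ./osu_bcast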
16 MPI_Scatter - Benefits of using Hardware-Mcast. [Charts: MPI_Scatter latency (usec) vs. message length for 512 and 1,024 processes, comparing Scatter-Default and Scatter-Mcast.] Enabling MCAST-based designs for MPI_Scatter improves small message latency by up to 75%. Use MV2_USE_MCAST=1 to enable the MCAST-based designs.
17 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
18 Effects of Shared-memory based communication: Shared-memory specific data structures are created to optimize collective communication. Tunable parameter, applicable to all MPI collectives (except the vector variants). For optimal performance this feature should be enabled. Parameter: MV2_USE_SHMEM_COLL=1 - Description: Enables shared memory optimizations for collectives - Default: Enabled.
19 Effects of Shared-memory based communication. [Chart: MPI_Bcast latency (us) vs. message size at 4,096 processes, comparing MV2_USE_SHMEM_COLL=0 and MV2_USE_SHMEM_COLL=1.] MV2_USE_SHMEM_COLL=1 activates the use of shared-memory-based collective optimizations. Latency of MPI_Bcast with 4,096 MPI processes improves by up to 75% through the optimized, tuned shared-memory-based designs.
20 Effects of Shared-memory based communication. [Chart: MPI_Allreduce latency (usec) vs. message length, comparing MV2_USE_SHMEM_COLL=0 and MV2_USE_SHMEM_COLL=1.] MV2_USE_SHMEM_COLL=1 activates the use of shared-memory-based collective optimizations. Latency of MPI_Allreduce improves by up to 66% through the optimized, tuned shared-memory-based designs.
21 Effects of Shared-memory based communication. [Chart: MPI_Gather latency (usec) vs. message length, comparing MV2_USE_SHMEM_COLL=0 and MV2_USE_SHMEM_COLL=1.] MV2_USE_SHMEM_COLL=1 activates the use of shared-memory-based collective optimizations. Latency of MPI_Gather improves by up to 85% through the optimized, tuned shared-memory-based designs.
22 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
23 Tuning the performance of MPI_Allgather: MV2_ALLGATHER_REVERSE_RANKING=1 enables optimizations for MPI_Allgather with small and medium length messages. Latency of MPI_Allgather with 2,048 MPI processes improves by up to 58% through the optimized and tuned designs. Parameter: MV2_ALLGATHER_REVERSE_RANKING=1 - Description: Enables the allgather reverse-ranking optimization.
24 Tuning the performance of MPI_Allgather. [Chart: MPI_Allgather latency (us) vs. message size at 2,048 processes, comparing MV2_ALLGATHER_REVERSE_RANKING=0 and MV2_ALLGATHER_REVERSE_RANKING=1.] MV2_ALLGATHER_REVERSE_RANKING=1 enables optimizations for MPI_Allgather with small and medium length messages. Latency of MPI_Allgather with 2,048 MPI processes improves by up to 58% through the optimized and tuned designs.
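A hedged example of enabling this optimization for an Allgather benchmark run, again using the mpirun_rsh environment-variable convention (the process count and hostfile are placeholders):

mpirun_rsh -np 2048 -hostfile hosts MV2_ALLGATHER_REVERSE_RANKING=1 ./osu_allgather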
25 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
26 Collective Tuning Using LiMIC2: The LiMIC2 module enables MVAPICH2 to use kernel-assisted data transfers. MPI jobs often run in fully subscribed mode and can involve significant intra-node communication; such patterns are good indicators for enabling LiMIC2 to speed up intra-node transfers. Collectives such as MPI_Allgather benefit from the use of LiMIC2. Parameter: MV2_SMP_USE_LIMIC2 - Description: Enables LiMIC2-based kernel-assisted transfers for collectives - Default: Enabled (when built with LiMIC2). Configure flag: --with-limic2 (disabled by default).
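A hedged sketch of building with LiMIC2 support and running a benchmark with it explicitly enabled; the install path, process count, and hostfile are placeholders:

./configure --with-limic2=/path/to/limic2-install && make && make install
mpirun_rsh -np 512 -hostfile hosts MV2_SMP_USE_LIMIC2=1 ./osu_allgather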
27 Collective Tuning Using LiMIC2 - Allgather Case Study. [Chart: MPI_Allgather latency (us) vs. message size from 64K to 1M bytes, comparing Shmem-Gather and LiMIC2-Gather.] MPI_Allgather relies on the ring-exchange pattern for large messages; critical to optimize the intra-node phases of the ring exchange on multi-core systems. Zero-copy LiMIC transfers improve performance by up to 20% for large-message MPI_Allgather operations.
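For reference, a minimal sketch of the ring-exchange Allgather pattern itself (the function name ring_allgather is an assumption; this is an illustration, not MVAPICH2's tuned implementation). Each rank forwards the block it most recently received to its right neighbor for P-1 steps.

#include <mpi.h>
#include <string.h>

/* Ring allgather over fixed-size byte blocks: rank r starts with its own
 * block and, after size-1 sendrecv steps, holds all blocks in recvbuf. */
void ring_allgather(const void *sendbuf, void *recvbuf, int block_bytes,
                    MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    char *rbuf = (char *)recvbuf;
    memcpy(rbuf + rank * block_bytes, sendbuf, block_bytes);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    for (int step = 0; step < size - 1; step++) {
        int send_block = (rank - step + size) % size;      /* block just obtained */
        int recv_block = (rank - step - 1 + size) % size;  /* block arriving now   */
        MPI_Sendrecv(rbuf + send_block * block_bytes, block_bytes, MPI_BYTE,
                     right, 0,
                     rbuf + recv_block * block_bytes, block_bytes, MPI_BYTE,
                     left, 0, comm, MPI_STATUS_IGNORE);
    }
}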
28 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
29 Improved Bcast in MVAPICH2-2.0a. [Charts: MPI_Bcast latency (us) vs. message size for 1,024-process and 2,048-process runs, comparing MVAPICH2-1.9 and MVAPICH2-2.0a.] Average of 25% improvement for 1,024 processes and an average of 22% improvement for 2,048 processes.
30 Non-Blocking Collectives in MVAPICH2: MVAPICH2 supports the MPI-3 non-blocking collective communication and neighborhood collective communication primitives (MVAPICH2 1.9 and MVAPICH2 2.0a). MPI-3 collectives in MVAPICH2 can use either the Gen2 or the Nemesis interface over InfiniBand. MVAPICH2 implements non-blocking collectives either in a multi-core-aware hierarchical manner or via a basic flat approach. Application developers can use MPI-3 collectives to achieve computation/communication overlap. Upcoming releases of MVAPICH2 will include support for non-blocking collectives based on the Mellanox CORE-Direct interface.
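A minimal sketch of the computation/communication overlap pattern with an MPI-3 non-blocking broadcast; do_independent_work is a placeholder for application computation that does not touch the broadcast buffer.

#include <mpi.h>

extern void do_independent_work(void);  /* placeholder for application compute */

void bcast_with_overlap(double *data, int count, MPI_Comm comm)
{
    MPI_Request req;

    /* Start the broadcast from rank 0 without waiting for completion. */
    MPI_Ibcast(data, count, MPI_DOUBLE, 0, comm, &req);

    /* Overlap: do useful work that does not read or write 'data'. */
    do_independent_work();

    /* Complete the collective before using the broadcast buffer. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}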
31 Outline: Introduction; Collective Communication in MVAPICH2; Tuning Collective communication operations in MVAPICH2 - Benefits of using Hardware Multicast - Effects of Shared-memory based communication - Tuning the performance of MPI_Allgather - Benefits of using kernel-assisted schemes for collectives; Improved Bcast in MVAPICH2-2.0a; Non-blocking collectives in MVAPICH2; Summary
32 Summary: Critical to optimize and tune MPI collective operations on emerging multi-core systems and high-speed networks. Necessary to tune various collective designs to ensure low communication latency on modern commodity systems. MVAPICH2 relies on several optimized and tuned designs to deliver low communication latency for collectives. MVAPICH2 has been tuned across several systems, considering various processor and network architectures.
33 Web Pointers: NOWLAB Web Page: http://nowlab.cse.ohio-state.edu; MVAPICH Web Page: http://mvapich.cse.ohio-state.edu