Design Challenges of High-Performance and Scalable MPI over InfiniBand. Presented by Karthik

1 Design Challenges of High-Performance and Scalable MPI over InfiniBand
Presented by Karthik

2 Presentation Overview
In-depth analysis of:
- High-Performance and Scalable MPI with Reduced Memory Usage
- Zero-Copy Protocol using Unreliable Datagram
- MVAPICH-Aptus: A Scalable High-Performance Multi-Transport MPI over InfiniBand

3 High Performance and Scalable MPI with Reduced Memory usage
Motivation
- Does aggressively reducing communication buffer memory degrade end-application performance?
- How much memory can we expect the MPI library to consume during execution of a typical application, while still providing the best available performance?

4 High Performance and Scalable MPI with Reduced Memory usage
IB provides several types of transport services:
Reliable Connection (RC)
- Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
- Most feature-rich: supports RDMA and provides reliable service.
- A dedicated QP must be created for each communicating peer.
Reliable Datagram (RD)
- Most of the same features as RC; however, a dedicated QP is not required.
- Not implemented in current hardware.
Unreliable Connection (UC)
- Provides RDMA capability.
- No guarantees on ordering or reliability.
- A dedicated QP must be created for each communicating peer.
Unreliable Datagram (UD)
- Connectionless. A single QP can communicate with any other peer QP.
- Limited message size.
- No guarantees on ordering or reliability.
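To make the connection cost concrete, the sketch below counts the queue pairs (QPs) one process needs to reach all of its peers under each transport, using only the properties listed on the slide. The table layout and the function name `qps_needed` are illustrative assumptions, not part of MVAPICH or the verbs API. RD is left out because the slide notes it is not implemented in hardware.

```python
# Illustrative model of the transport properties listed above.
# This is a sketch, not a real InfiniBand verbs interface.
TRANSPORTS = {
    # name: (connection_oriented, reliable, supports_rdma)
    "RC": (True, True, True),
    "UC": (True, False, True),
    "UD": (False, False, False),
}


def qps_needed(transport: str, peers: int) -> int:
    """Number of QPs one process needs to talk to `peers` other processes."""
    connection_oriented, _, _ = TRANSPORTS[transport]
    # Connection-oriented transports need a dedicated QP per peer;
    # a single UD QP can reach any peer.
    return peers if connection_oriented else 1


if __name__ == "__main__":
    for name in TRANSPORTS:
        print(name, qps_needed(name, peers=16383))
```

The linear growth for RC and UC is what drives the memory-scalability concerns discussed on the following slides.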

5 High Performance and Scalable MPI with Reduced Memory usage
Upper-level software service: Shared Receive Queue (SRQ)
- Allows multiple QPs to be attached to one receive queue, even for connection-oriented transports.
- This approach is memory efficient, because receive buffers are pooled instead of being posted separately for every QP.
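A rough way to see why a shared receive queue saves memory is to compare the bytes of posted receive buffers in both schemes. The buffer counts and the 8 KiB buffer size below are hypothetical values chosen for illustration; they are not MVAPICH defaults.

```python
def per_qp_receive_memory(peers: int, bufs_per_qp: int, buf_size: int) -> int:
    """Bytes of receive buffers when every QP has its own receive queue."""
    return peers * bufs_per_qp * buf_size


def srq_receive_memory(srq_bufs: int, buf_size: int) -> int:
    """Bytes of receive buffers when all QPs share one receive queue."""
    return srq_bufs * buf_size


if __name__ == "__main__":
    MiB = 1024 * 1024
    # Hypothetical sizes: 1024 peers, 32 buffers per QP, 8 KiB buffers.
    dedicated = per_qp_receive_memory(peers=1024, bufs_per_qp=32, buf_size=8192)
    # Hypothetical shared pool of 512 buffers, independent of the peer count.
    shared = srq_receive_memory(srq_bufs=512, buf_size=8192)
    print(f"per-QP queues: {dedicated / MiB:.0f} MiB, SRQ: {shared / MiB:.0f} MiB")
```

The per-QP figure grows with the number of peers, while the shared pool is sized by expected traffic rather than by connection count.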

6 High Performance and Scalable MPI with Reduced Memory usage
Remote Direct Memory Access (RDMA)
- An application can directly access the memory of the remote process.
- RDMA has very low latency.

7 High Performance and Scalable MPI with Reduced Memory usage
MVAPICH Design Overview
MVAPICH uses two major protocols:
1. Eager Protocol
- Used to transfer small messages.
- The messages are buffered inside the MPI library.
- Pre-allocated communication buffers are required on both the sender and the receiver side.
2. Rendezvous Protocol
- Used to transfer large messages.
- The messages are sent directly to the receiver's user memory.
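The split between the two protocols comes down to a size threshold. The sketch below shows that decision in isolation; the 8 KiB threshold and the function name `choose_protocol` are assumptions for illustration, since the slides do not state the value MVAPICH uses.

```python
EAGER_THRESHOLD = 8 * 1024  # bytes; hypothetical value, not taken from the slides


def choose_protocol(message_size: int, threshold: int = EAGER_THRESHOLD) -> str:
    """Pick the protocol for a message of `message_size` bytes.

    Small messages go through pre-allocated library buffers (eager);
    large messages are placed directly in the receiver's user memory
    after a handshake (rendezvous).
    """
    return "eager" if message_size <= threshold else "rendezvous"


if __name__ == "__main__":
    for size in (64, 8 * 1024, 1024 * 1024):
        print(size, choose_protocol(size))
```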

8 High Performance and Scalable MPI with Reduced Memory usage
1. Adaptive RDMA with Send/Receive
- To avoid a memory-scalability problem as the number of nodes increases, this channel is adaptive.
- Only a limited number of buffers is allocated initially.
- Once a threshold number of messages has been exchanged with a peer, subsequent messages are transferred using RDMA.
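The adaptive behaviour can be sketched as a per-peer message counter that switches a peer to RDMA once the threshold is reached. The class name, the threshold value, and the returned path labels are hypothetical; the sketch models the decision only and performs no communication.

```python
class AdaptiveChannel:
    """Toy model of a channel that upgrades busy peers to RDMA."""

    def __init__(self, threshold: int = 16):
        self.threshold = threshold      # hypothetical message-count threshold
        self.counts = {}                # peer -> messages exchanged so far
        self.rdma_peers = set()         # peers that have dedicated RDMA buffers

    def send(self, peer: int) -> str:
        """Return the path a message to `peer` would take."""
        if peer in self.rdma_peers:
            return "rdma"
        self.counts[peer] = self.counts.get(peer, 0) + 1
        if self.counts[peer] >= self.threshold:
            # RDMA buffers are allocated lazily, so only frequently
            # used peers consume the extra memory.
            self.rdma_peers.add(peer)
        return "send/recv"


if __name__ == "__main__":
    channel = AdaptiveChannel(threshold=3)
    print([channel.send(peer=7) for _ in range(5)])
```

Peers that exchange only a few messages never trigger the allocation, which is how the channel keeps memory bounded at scale.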

9 High Performance and Scalable MPI with Reduced Memory usage
2. Adaptive RDMA with SRQ Channel
- The idea is based on ARDMA-SR. The only difference is that the Shared Receive Queue is used.
- Drawback: the sender doesn't know whether receive buffers are available at the receiver.
- Solution: set a low watermark for the SRQ.
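The low-watermark idea can be illustrated with a small simulation: the receiver reposts buffers as soon as the number of posted buffers falls to the watermark, so senders are unlikely to find the queue empty. The class and its parameters are hypothetical; a real implementation would react to an asynchronous event from the HCA rather than check a counter on every message.

```python
class SharedReceiveQueue:
    """Toy SRQ that refills itself when it reaches a low watermark."""

    def __init__(self, capacity: int, low_watermark: int):
        self.capacity = capacity
        self.low_watermark = low_watermark
        self.posted = capacity   # buffers currently posted
        self.refills = 0         # how many times the watermark fired

    def consume(self) -> None:
        """One incoming message uses one posted buffer."""
        if self.posted == 0:
            raise RuntimeError("no receive buffer posted; message would be lost")
        self.posted -= 1
        if self.posted <= self.low_watermark:
            self._refill()

    def _refill(self) -> None:
        # Repost buffers up to full capacity.
        self.posted = self.capacity
        self.refills += 1


if __name__ == "__main__":
    srq = SharedReceiveQueue(capacity=8, low_watermark=2)
    for _ in range(20):
        srq.consume()
    print(srq.posted, srq.refills)
```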

10 High Performance and Scalable MPI with Reduced Memory usage
3. Shared Receive Queue
- This channel exclusively uses the SRQ feature.
- It follows the same low-watermark technique as ARDMA-SRQ.
- Although the RDMA channels have low latency, they consume more memory.

11 High Performance and Scalable MPI with Reduced Memory usage NAS Benchmark

12 High Performance and Scalable MPI with Reduced Memory usage
High Performance Linpack
- Benchmark that solves a system of linear equations.
- It is the primary measure used to rank the biannual Top 500 list of the world's fastest supercomputers.

13 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

14 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
Motivation
1. Performance Scalability
- Memory copies are detrimental to the overall performance of the application.
- The HCA cache can only hold a limited number of QPs.
2. Resource Scalability
- With a connection-oriented transport, memory requirements increase linearly with the number of connected processes.

15 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
Traditional Zero-Copy
1. Matched Queues Interface
- The receiver reads the message tag from the incoming message and matches it against the posted receive operations.
2. Rendezvous Protocol using RDMA
- A handshake is performed first, followed by an RDMA transfer.

16 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
UD vs RC memory usage
For 16k connections:
- UD = 40 MB / process
- RC = 240 MB / process
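The two figures above can be put in proportion with a few lines of arithmetic. The sketch assumes "16k" means 16 x 1024 connections and that 1 MB is 1024 x 1024 bytes; the slide does not specify either convention.

```python
MB = 1024 * 1024            # assumption: binary megabytes
connections = 16 * 1024     # assumption: "16k" means 16384

rc_total = 240 * MB         # RC memory per process, from the slide
ud_total = 40 * MB          # UD memory per process, from the slide

ratio = rc_total / ud_total                        # how much more RC uses
saved_mb = (rc_total - ud_total) / MB              # memory saved per process
rc_per_conn_kb = rc_total / connections / 1024     # RC cost per connection

if __name__ == "__main__":
    print(f"RC uses {ratio:.0f}x the memory of UD")
    print(f"UD saves {saved_mb:.0f} MB per process")
    print(f"RC costs about {rc_per_conn_kb:.0f} KB per connection")
```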

17 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
Challenges for a true zero-copy design
Limited MTU size
- The UD transport has a Maximum Transmission Unit (MTU) limit of 2 KB.
- Segmentation is required.
Lack of dedicated receive buffers
- It is difficult to post receive buffers for a particular peer, as they are all shared.
- If no buffer is posted to a QP, a message sent to it is silently dropped.
Lack of reliability
- There is no guarantee that a message will arrive at the receiver.
Lack of ordering
- Messages may not arrive in the order in which they were sent.
Lack of RDMA
- RDMA only works for connection-oriented transports.
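The segmentation requirement can be sketched as splitting a message into MTU-sized chunks. The function name `segment` is an assumption for illustration, and the sketch ignores any per-packet header space, so the usable payload per packet in a real implementation would be somewhat smaller than the MTU.

```python
MTU = 2048  # bytes; the 2 KB UD limit stated on the slide


def segment(message: bytes, mtu: int = MTU) -> list:
    """Split `message` into chunks of at most `mtu` bytes, in order."""
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]


if __name__ == "__main__":
    chunks = segment(bytes(5000))
    print([len(chunk) for chunk in chunks])
```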

18 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
Proposed Design
- The design is based on serialized communication, since RDMA is not specified for the UD transport.
- Serialized means that the order of transfer is agreed on beforehand, and only one sender transmits to a QP at a time.

19 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
Solutions to design challenges
1. Efficient Segmentation
- The design requests a completion signal only for the last packet.
- The underlying reliability layer marks packets as missing at the receiver's end, and the sender is notified.
2. Zero-Copy Pool
- A pool of QPs is maintained.
- When a message transfer is initiated, a QP is taken from the pool and the application receive buffer is posted to it.
3. Optimized Reliability and Ordering for Large Messages
- One approach is to perform a checksum over the entire receive buffer.
- Each operation can specify a 32-bit immediate field that is available to the receiver as part of the completion entry.
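One way the immediate field could support reliability and ordering is sketched below, under the assumption that the sender places each segment's sequence number in the 32-bit immediate value. The slide only says the field is available to the receiver, so this use, and the function name `check_transfer`, are illustrative guesses rather than the documented design.

```python
def check_transfer(imm_values, n_segments):
    """Validate a large-message transfer from its completion entries.

    `imm_values` holds the immediate values seen by the receiver, in
    arrival order; each is assumed to carry a segment sequence number.
    Returns (in_order, missing): whether every segment arrived exactly
    in sequence, and which sequence numbers never arrived.
    """
    expected = list(range(n_segments))
    missing = sorted(set(expected) - set(imm_values))
    in_order = list(imm_values) == expected
    return in_order, missing


if __name__ == "__main__":
    print(check_transfer([0, 1, 2, 3], 4))   # clean transfer
    print(check_transfer([0, 2, 3], 4))      # segment 1 was dropped
    print(check_transfer([0, 2, 1, 3], 4))   # segments arrived reordered
```

When the check fails, the receiver can report the missing sequence numbers so that only those segments are retransmitted.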

20 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram Experimental Evaluation Ping Pong Latency

21 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram Uni-Directional Bandwidth

22 Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram Bi-Directional Bandwidth

23 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

24 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Motivation
This paper seeks to address two main questions:
1. What are the different protocols developed for MPI over IB? How well do they perform at scale?
2. Given this knowledge, can the MPI library be designed to dynamically select protocols to optimize for performance and scalability?

25 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
IB provides several types of transport services:
Reliable Connection (RC)
- Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
- Most feature-rich: supports RDMA and provides reliable service.
- A dedicated QP must be created for each communicating peer.
Reliable Datagram (RD)
- Most of the same features as RC; however, a dedicated QP is not required.
- Not implemented in current hardware.
Unreliable Connection (UC)
- Provides RDMA capability.
- No guarantees on ordering or reliability.
- A dedicated QP must be created for each communicating peer.
Unreliable Datagram (UD)
- Connectionless. A single QP can communicate with any other peer QP.
- Limited message size.
- No guarantees on ordering or reliability.

26 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Message Channel: Eager Protocol Channel

27 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Message Channel: Rendezvous Protocol Channel

28 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Channel Evaluation
Performance: Eager Latency

29 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Channel Evaluation
Performance: Uni-Directional Bandwidth

30 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Channel Evaluation
Scalability Test: Memory Usage

31 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Channel Evaluation
Scalability Test: Latency

32 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand Channel Characteristics Summary

33 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Overview of Design
- As the experimental results show, no single channel is sufficient to achieve both performance and scalability.
- The solution is to use a combination of message channels and transports to optimize for performance as well as scalability.
Design Challenges
1. When should a channel be created?
2. When should a channel be used?

34 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Channel Allocation

35 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Channel Usage
- The experimental results show that the channels behave differently for different message sizes.
- A flexible form is defined for deciding how a message is sent.
- Using this flexible framework, send rules can be changed at the per-system or per-job level to meet application needs without changing the code within the MPI library.
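A send-rule framework of this kind can be sketched as an ordered list of size limits, loaded from a string that could come from a job-level setting. The rule syntax (`limit:channel`, with `*` as a catch-all), the channel names, and both function names are hypothetical; the slides do not show the actual rule format used by MVAPICH-Aptus.

```python
def parse_rules(text: str):
    """Parse 'limit:channel,...' into an ordered list of (limit, channel).

    A limit of '*' matches any message size and becomes None.
    """
    rules = []
    for item in text.split(","):
        limit, channel = item.strip().split(":")
        rules.append((None if limit == "*" else int(limit), channel))
    return rules


def select_channel(size: int, rules) -> str:
    """Return the channel of the first rule whose limit covers `size`."""
    for limit, channel in rules:
        if limit is None or size <= limit:
            return channel
    raise ValueError(f"no send rule matches a {size}-byte message")


if __name__ == "__main__":
    # Hypothetical rule string; changing it requires no change to library code.
    rules = parse_rules("2048:ud-eager,16384:rc-eager,*:rc-rendezvous")
    for size in (100, 4096, 1 << 20):
        print(size, select_channel(size, rules))
```

Because the rules are data rather than code, a different system or job can supply its own string and obtain different channel choices from the same library build.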

36 MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand
Performance Evaluation

37 QUESTIONS?

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The