Hardware Acceleration of Barrier Communication for Large Scale Parallel Computer


2013 8th International Conference on Communications and Networking in China (CHINACOM)

Pang Zhengbin, Wang Shaogang, Wu Dan, Lu Pingjing
School of Computer, National University of Defense Technology
Changsha, Hunan, China, 410073
zbpang@163.com, {wshaogang,daisydanwu,lupingjing}@nudt.edu.cn

Abstract: MPI collective communication overhead dominates the communication cost of large scale parallel computers, so the scalability and latency of collective operations are critical for next-generation machines. This paper proposes a fast and scalable barrier communication offload approach that supports millions of compute cores. In our approach, the barrier operation sequence is packed by the host MPI driver into a barrier "descriptor", which is pushed to the NIC (Network Interface). The NIC then completes the barrier automatically by following its algorithm descriptor. Our approach accelerates both intra-node and inter-node barrier communication. We show that it achieves both high barrier performance and scalability, especially for large scale systems. The paper also proposes an extendable and easy-to-implement NIC architecture that supports offloaded barrier communication as well as other communication patterns.

I. INTRODUCTION

Collective communication (barrier, broadcast, reduce, all-to-all) is very important for scientific applications running on parallel computers; it has been shown that collective communication can account for over 80% of the communication cost on large scale supercomputers[1]. The barrier operation is the most commonly used collective, and its performance is critical for most MPI parallel applications. In this paper, we focus on the implementation of a fast barrier for large scale parallel systems.

For next-generation exascale computers, the system could have over 1 million cores, and a good barrier implementation should achieve both low latency and scalability[2]. To overlap communication and computation, offloading the collective communication to hardware brings obvious benefits for these systems. Present barrier offload techniques, such as Core-Direct[3] and TIANHE-1A[4], use a triggered point-to-point communication approach: the software initiates multiple point-to-point communication requests to the hardware and sets each request to be triggered by other messages, so that the whole collective communication can be handled by hardware without further software intervention.

We observe that present barrier offload methods may suffer from long delay and poor scalability. Core-Direct must push many work-queue elements for a single barrier operation in each node; e.g., for a barrier group with 4096 nodes, each node needs to push 12 work requests to the hardware[5], which we observe incurs long host-NIC (Network-Interface-Card) communication. For next-generation computer networks, the point-to-point communication delay is usually high because the topology is typically a torus, while each chip's network bandwidth is high due to technology advances in serdes. In this paper, we propose a new barrier communication offload approach that fits the next-generation system network well. Next-generation supercomputer processors usually incorporate many cores, so each NIC must support more MPI threads, and we see that it is difficult for the NIC to support the communication requirements of so many threads.
In this paper, we leverage a hierarchical approach in which the intra-processor threads perform the barrier through dedicated hardware, and the node leader thread communicates with other nodes through an inter-node barrier algorithm that is executed fully by hardware. Compared with other approaches, ours requires the host to push only one communication descriptor to the NIC in each node. The barrier descriptor covers both the intra-node and inter-node barrier algorithms, and the hardware follows the descriptor to automatically finish the full barrier algorithm. We also present the hardware architecture that smoothly supports the new barrier offload approach; thanks to the simple barrier engine architecture, the NIC can dedicate more hardware resources to collective communication. Simulation results show that our approach performs better than the present barrier offload techniques.

II. BARRIER OFFLOAD ALGORITHM

Our approach offloads the MPI barrier operation through the following steps:

step 1: each barrier node's MPI driver calculates the barrier communication sequence, i.e., the communication pattern performed by the barrier algorithm, and packs it into the barrier descriptor. Every descriptor is packed following the enhanced dissemination algorithm, and the host performs no real communication during this step.

step 2: the descriptor is sent to the NIC. Our approach enables the NIC to complete the real barrier communication automatically without any further host intervention; the barrier communication is performed solely by hardware.

step 3: when the NIC completes the whole barrier communication, it informs the host through host-NIC communication.
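To make the three steps concrete, the following host-side C sketch shows one possible driver flow. The descriptor contents and the NIC interface (nic_push_descriptor, nic_barrier_done) are hypothetical stand-ins; the paper does not spell out the real driver API.

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint8_t  dtype;      /* descriptor type: barrier */
        uint32_t bid;        /* system-wide barrier ID */
        uint16_t send_vec;   /* per-round send bits */
        uint8_t  rc[16];     /* per-round expected receive counts */
    } barrier_desc;

    /* Hypothetical NIC interface (declarations only in this sketch). */
    extern void nic_push_descriptor(const barrier_desc *d);  /* step 2 */
    extern int  nic_barrier_done(uint32_t bid);              /* step 3 */

    void mpi_barrier_offload(uint32_t bid, int rank, int nnodes)
    {
        barrier_desc d;
        memset(&d, 0, sizeof d);
        /* Step 1: pack the communication sequence locally; no real
         * communication happens here (schedule computed from the
         * dissemination algorithm of Sec. II-A). */
        d.dtype = 1;
        d.bid   = bid;
        (void)rank; (void)nnodes;  /* would fill send_vec/rc[] in full code */
        /* Step 2: push the descriptor; the NIC then completes the whole
         * barrier without further host intervention. */
        nic_push_descriptor(&d);
        /* Step 3: wait until the NIC reports barrier completion. */
        while (!nic_barrier_done(bid))
            ;  /* spin, or block on a completion event */
    }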

    N: number of barrier nodes; rank: my local rank
    round <- -1
    repeat
        round <- round + 1
        sendpeer1 <- (rank + 3^round) mod N
        sendpeer2 <- (rank + 2 * 3^round) mod N
        recvpeer1 <- (rank - 3^round + N) mod N
        recvpeer2 <- (rank - 2 * 3^round + N) mod N
        send barrier msg to sendpeer1 with round id
        send barrier msg to sendpeer2 with round id
        receive barrier msg from recvpeer1 with round id
        receive barrier msg from recvpeer2 with round id
    until round >= log(3, N) - 1

Fig. 1. 2-way dissemination algorithm

[Fig. 2. The NIC architecture supporting collective communication.]

[Fig. 3. The intra-node barrier structure: each barrier group has a group config register and a group counter, with one fence counter (Counter 0..n) per thread (Thread 0..n).]

A. Inter-Node Barrier Algorithm

The dissemination algorithm is a commonly used barrier method[6], [7]. It supports barrier groups with an arbitrary number of nodes. The basic dissemination barrier algorithm requires multiple rounds; in each round a node sends one barrier message to another node and receives one barrier message from another node, and a round can be initiated only after the previous round has finished. We observe that large scale systems usually use a torus network topology, where the point-to-point communication delay is high because a message may require many hops to reach the destination node. For these systems, the basic dissemination algorithm is not efficient: in most cases, every node spends a long time in each round waiting for its source barrier message.

To efficiently hide the barrier message delay, our hardware uses an enhanced K-way dissemination algorithm to offload the barrier communication. The modified algorithm sends and receives K messages in parallel in each round. Our approach defines a new message type used solely for barrier communication. The new barrier message is very small, so even if the NIC does not support multi-port parallel message processing, barrier messages can be sent and received very fast. The example 2-way dissemination algorithm is shown in Fig. 1. We can prove that the 2-way dissemination algorithm requires log(3, N) rounds in total to complete the barrier for N nodes. The obvious benefit of the new algorithm is that it greatly reduces the number of communication rounds, e.g., from log(2, N) for the basic algorithm to log(3, N) for the 2-way variant, and the whole barrier delay benefits from fewer algorithm rounds. A runnable sketch of the 2-way schedule follows.
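This C sketch enumerates the send/receive peers of the 2-way dissemination schedule of Fig. 1. It is a minimal model assuming the peer formulas above; the hardware's exact round count and peer selection may differ in detail.

    #include <stdio.h>

    int main(void)
    {
        int N = 10;      /* barrier group size (example) */
        int rank = 0;    /* my rank within the group */
        int stride = 1;  /* 3^round */
        for (int round = 0; stride < N; round++, stride *= 3) {
            int send1 = (rank + stride) % N;
            int send2 = (rank + 2 * stride) % N;
            int recv1 = (rank - stride + N) % N;
            int recv2 = (rank - (2 * stride) % N + N) % N;
            printf("round %d: send to %d,%d; recv from %d,%d\n",
                   round, send1, send2, recv1, recv2);
        }
        return 0;
    }

For N = 10 the loop executes three rounds, matching log(3, 10) rounded up.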
B. Inter-Node Barrier Algorithm for Fat Tree

For supercomputers that use a fat tree interconnect, the topology is fast for broadcast operations. For example, in the TIANHE supercomputer each board is equipped with 8 NICs, and all on-board NICs are connected to one NR, a fat tree router chip. For the fat tree topology, the basic scatter-broadcast barrier algorithm is faster than the pair-wise algorithm, so we leverage the scatter-broadcast approach to perform the barrier communication within the board. Each board selects a leader NIC, to which all on-board NICs send their barrier notifications; the leader NIC then communicates with the other boards' leaders to finish the whole-system barrier operation. On receiving all notification messages, the leader broadcasts the barrier-reached notification to all other NICs.

C. Intra-Node Barrier Algorithm

For the MPI threads that are resident on the same processor, we use fence counters to accelerate the barrier operation. After the intra-node MPI threads have finished their communication, our approach selects a leader thread, through which the node communicates with the leader threads of other nodes. To accelerate the intra-node barrier, the NIC hardware incorporates several groups of fence counters, and each group supports one MPI communicator. Within a group there are several counters, each bound to one MPI thread. When a thread reaches the barrier, its host driver increments its own fence counter; the group counter always holds the smallest value of the thread counters, so when the group counter increases by 1, all MPI threads residing in the host process have reached the barrier point. The group config register designates which threads participate in the barrier operation. A barrier group is assigned to an MPI communicator when the communicator is created by the MPI driver, and the group counter is reset to its initial value when the group is re-assigned. A small behavioral model of the fence counters is given below.
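As a behavioral sketch (assuming a 32-thread group, an arbitrary width chosen only for this example), the group counter can be modeled as the minimum over the participating per-thread counters:

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_THREADS 32

    typedef struct {
        uint32_t config;                   /* bit i set: thread i participates */
        uint32_t thread_cnt[MAX_THREADS];  /* per-thread fence counters */
        uint32_t group_cnt;                /* min over participating counters */
    } barrier_group;

    /* Invoked by the host driver when thread tid reaches the barrier. */
    static void fence_arrive(barrier_group *g, int tid)
    {
        g->thread_cnt[tid]++;
        uint32_t min = UINT32_MAX;
        for (int i = 0; i < MAX_THREADS; i++)
            if ((g->config & (1u << i)) && g->thread_cnt[i] < min)
                min = g->thread_cnt[i];
        /* The group counter advances only once every participant arrives. */
        g->group_cnt = min;
    }

    int main(void)
    {
        barrier_group g = { .config = 0x3 };  /* threads 0 and 1 participate */
        fence_arrive(&g, 0);
        printf("group counter = %u\n", g.group_cnt);  /* 0: thread 1 missing */
        fence_arrive(&g, 1);
        printf("group counter = %u\n", g.group_cnt);  /* 1: all arrived */
        return 0;
    }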

III. BARRIER ALGORITHM DESCRIPTOR

One approach to offloading collective communication is to offload the full collective algorithm to the hardware; for example, the collective optimization over InfiniBand[8] uses an embedded processor to execute the algorithm. We observe that this approach greatly complicates the design: the embedded processor is usually limited in performance, being far slower than the NIC's bandwidth and the host processor.

We propose a new approach that does not require the hardware to execute the full barrier algorithm. Instead, the barrier's communication sequence is calculated by the host's MPI driver, and the NIC hardware simply follows the operation sequence to handle the real communication. This leads to a simple hardware design, through which more hardware can be dedicated to the real collective communication. For any node in the barrier group, the dissemination algorithm shows that each round's source and destination nodes can be statically determined even before any real communication. Our approach leverages each node's MPI driver to calculate the barrier sequence and pack it into the algorithm descriptor. After the descriptor is generated, it is pushed to the NIC through its host interface; the hardware then follows the descriptor to automatically communicate with other nodes, and once the sequence is completed, the whole barrier is completed. The host interface may vary between systems: for example, the command queue may reside in host memory, or the descriptor may be written directly to on-chip RAM through PCIE write commands.

An example structure of a barrier descriptor supporting the 2-way dissemination algorithm is shown in Fig. 4. The DType field indicates the descriptor type; besides the barrier descriptor, the system may support other collective or point-to-point communication types, and the remaining descriptor fields are interpreted according to the descriptor type. The BID is a system-wide barrier ID, predefined when the communication group is created; each node's barrier descriptor for the same barrier group uses the same barrier ID, and barrier messages use the BID to match the barrier descriptor at the target node. The SendVec field is a bit vector whose width equals the maximum number of barrier algorithm rounds; each bit indicates whether the corresponding round should send barrier messages to its target nodes. The RC1, RC2, ..., RC16 fields give the number of barrier messages that should be received in each algorithm round; only after receiving all source barrier messages and sending out all barrier messages does the barrier engine proceed to the next algorithm round. SendPeer indicates the target node ID for each communication round. The S flag synchronizes multiple barrier descriptors: when S is set, the descriptor waits for previous descriptors to complete before being issued to the barrier processing hardware. The M flag marks the current NIC as the barrier leader; it is used when the underlying topology is a fat tree. The V flag indicates whether to perform the fat tree topology optimization. The BVEC field indicates which threads participate in the barrier communication using the fence counter approach; when BVEC is all zero, the fence counter optimization is disabled.

We see that this descriptor can easily support next-generation systems. The host-NIC communication cost is low, as each node pushes only one descriptor to the NIC, and the barrier descriptor is small compared with most standard point-to-point message descriptors; for example, the TIANHE-1A computer's MP (Message Passing) descriptor has 1024 bits[4].

[Fig. 4. Example descriptor structure for the 2-way dissemination algorithm: a 512-bit descriptor holding DType, BarrierID, SendVec, the S/V/M flags, BVEC, and the per-round receive counts RC1..RC16.]
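For illustration, the descriptor of Fig. 4 can be mirrored in a C struct. The field widths below are inferred from the text (a 512-bit descriptor with up to 16 rounds) and are assumptions, not the hardware's actual bit assignment:

    #include <stdint.h>

    /* Illustrative 512-bit view of the barrier descriptor (widths assumed). */
    typedef struct {
        uint64_t dtype    : 4;   /* DType: barrier vs. other descriptor types */
        uint64_t bid      : 28;  /* BID: system-wide barrier ID */
        uint64_t s_flag   : 1;   /* S: wait for previous descriptors */
        uint64_t m_flag   : 1;   /* M: this NIC is the board barrier leader */
        uint64_t v_flag   : 1;   /* V: enable fat tree topology optimization */
        uint64_t          : 29;  /* padding to the 64-bit word boundary */
        uint64_t send_vec : 16;  /* SendVec: one bit per round */
        uint64_t bvec     : 48;  /* BVEC: threads using the fence counters */
        uint8_t  rc[16];         /* RC1..RC16: msgs to receive per round */
        uint16_t send_peer[16];  /* SendPeer: target node ID per round */
    } barrier_descriptor;        /* 64 + 64 + 128 + 256 bits = 512 bits */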
Each node has its own barrier descriptor, which is generated entirely from the node's local information. The target and source rank IDs for each barrier round are easily generated once the local process rank and the barrier group size are known. However, for the hardware to perform the real communication, the NIC must know the target and source node IDs, so our approach leverages the MPI driver to translate process rank IDs into physical node IDs. This translation must be handled by the local node without any communication, so our approach requires that the rank-to-node mapping information is saved in each node's memory when the MPI communicator group is created.

The barrier message carries the BID, the round id, and the DestID to the destination node. Through this information, the destination node can easily determine which point of the algorithm it has reached. If the destination node has not yet reached the barrier, incoming barrier messages are saved in a temporary buffer and wait for the destination node's own descriptor to be pushed to the NIC. When the NIC has completed the sequence defined in the descriptor, the group barrier communication is finished, and it informs the host that all other nodes have reached the barrier.

IV. THE HARDWARE IMPLEMENTATION

A. Barrier Engine Architecture

A complicated case for barrier communication is dealing with different process arriving patterns. The timing difference between the nodes of a collective communication group can have a significant impact on the performance of the operation, so the hardware must be carefully designed to avoid performance degradation. To handle this problem, the barrier engine (BE) leverages a DAMQ (Dynamically-Allocated Multi-Queue)[10] to hold incoming barrier messages; the packets stored in this queue can be processed out of order. If a barrier message reaches a target node that has not yet reached the barrier, the BE saves the packet in the DAMQ buffer. We use a simple barrier-message handshake protocol to synchronize the barrier between two nodes, sketched below.
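The sender's side of the handshake can be modeled as follows. The Barrier/BarrierRsp/BarrierNack message names follow the paper; the stand-in receiver, the DAMQ_SLOTS value, and the retry loop are assumptions for this self-contained sketch.

    #include <stdio.h>

    enum reply { BARRIER_RSP, BARRIER_NACK };

    #define DAMQ_SLOTS 4
    static int damq_used;  /* occupancy of the receiver's DAMQ buffer */

    /* Stand-in receiver: buffer the message if a DAMQ slot is free,
     * otherwise reply with a nack (as when the real DAMQ is full). */
    static enum reply send_barrier_msg(int bid, int round)
    {
        (void)bid; (void)round;
        if (damq_used < DAMQ_SLOTS) { damq_used++; return BARRIER_RSP; }
        return BARRIER_NACK;
    }

    /* Sender side: a nacked message is resent after a predefined delay. */
    static void send_with_retry(int bid, int round)
    {
        while (send_barrier_msg(bid, round) == BARRIER_NACK)
            ;  /* in hardware: wait the predefined delay, then resend */
        printf("bid=%d round=%d buffered by target\n", bid, round);
    }

    int main(void)
    {
        for (int round = 0; round < 3; round++)
            send_with_retry(7, round);
        return 0;
    }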

In the barrier descriptor, if the recvpeer is valid, the local node should wait for the source node to reach the barrier; if the sendpeer is valid, the local node should tell the destination node that it has reached the barrier. If a barrier message from a sendpeer reaches the target node but the target node has not yet reached the barrier, the message is saved in the DAMQ buffer, and the target barrier engine sends back a BarrierRsp message; on receiving the BarrierRsp message, the source node knows that the target node is sure to have the message. When the DAMQ buffer is full, the target node nacks the source node with a BarrierNack message; on receiving this message, the source node resends the barrier message after a predefined delay.

When the barrier descriptor reaches the NIC, the barrier engine first checks its local DAMQ buffer for barrier messages that arrived earlier; if there are any, the BE processes them immediately. Note that each barrier message carries the barrier id BID, through which it is matched with the destination node's descriptor: if the message's BID equals the descriptor's BID, the source and target nodes belong to the same barrier group. The BID can be derived from the MPI communicator group id, and all nodes in the same barrier group are required to agree on the BID. The BE engine supports multiple barriers running in parallel, each using a different ID.

The structure of the barrier engine is shown in Fig. 5. The logic is separated into the barrier message sending module (TE) and the receiving module (SE). SDQ and HDQ are descriptor queues: SDQ resides in host memory, while HDQ resides in on-chip RAM, is mainly used for fast communication, and is fully controlled by the host MPI driver. OF is the fetching module, which reads from the descriptor queues and dispatches descriptors to the barrier engine.

[Fig. 5. Barrier engine architecture: the OF module fetches descriptors from the SDQ and HDQ queues and dispatches them to the TE (Target Engine) and SE (Source Engine); Barrier and BarrierRsp messages pass through DAMQ buffers at the network interface, and SE exports the current source round to TE.]

The SE module is responsible for receiving the network barrier messages coming from sendpeers. The barrier engine saves the messages in the DAMQ buffer; to reduce the hardware requirement, the messages of all barrier groups are saved in one buffer, and the DAMQ buffer can be handled out of order. When the receiving DAMQ accepts a message, it directly sends the rsp reply back to the sender. The local node does not need to know which node a barrier message came from, so the barrier descriptor only holds the number of messages to be received, and the SE module uses a booking table to hold the number of received source barrier messages for each round. Because nodes may reach the barrier in arbitrary order, messages may reach the local node out of order, so the current receiving round is the round i such that all barrier messages for rounds 1 to i-1 have been received (modeled in the sketch below).

The TE module is responsible for sending barrier messages to target nodes following the sequence defined in the descriptor. For algorithm round i, barrier messages are sent to node sendpeer if the SE module has received all barrier messages before round i, and the sendpeer is valid for round i. The barrier messages of the current round are sent in a pipelined fashion, before their response messages are received; if a target node replies with a nack message, the barrier message is resent after a preconfigured delay. TE and SE run in parallel and independently; because the TE module needs to know the current receiving round, the SE module provides this information directly through module ports.
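Returning to the SE module's booking table: a minimal C model of the out-of-order arrival bookkeeping, assuming up to 16 rounds as in the descriptor, is:

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_ROUNDS 16

    typedef struct {
        uint8_t expected[MAX_ROUNDS];  /* RC[i]: msgs expected in round i */
        uint8_t received[MAX_ROUNDS];  /* booking table of arrivals */
    } booking_table;

    /* Messages may arrive out of order; simply book them per round. */
    static void record_arrival(booking_table *t, int round)
    {
        t->received[round]++;
    }

    /* Current receiving round i: all rounds before i are fully received. */
    static int current_round(const booking_table *t)
    {
        int i = 0;
        while (i < MAX_ROUNDS && t->received[i] >= t->expected[i])
            i++;
        return i;
    }

    int main(void)
    {
        booking_table t = { .expected = {2, 2, 2} };  /* 2-way: 2 msgs/round */
        record_arrival(&t, 1);  /* a round-1 message arrives early */
        record_arrival(&t, 0);
        record_arrival(&t, 0);  /* round 0 now complete */
        printf("current round = %d\n", current_round(&t));  /* prints 1 */
        return 0;
    }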
[Fig. 6. Barrier offload speedup over the software-only approach.]

V. EXPERIMENTS

We implemented the barrier engine in the SystemVerilog language and integrated the barrier engine module into TIANHE-1's RTL model. TIANHE-1's point-to-point communication engine uses descriptors for MP (Message-Passing) and RDMA (Remote-Direct-Memory-Access)[4], and we added the new barrier descriptor type. In our experience the barrier engine is easy to design: we modeled it in fewer than 6000 lines of SystemVerilog code. The new model is simulated with the Synopsys VCS simulator.

We test the barrier latency for different barrier group sizes. To simulate large scale barrier groups, we designed a simplified model in SystemVerilog; the simplified model requires fewer simulation resources and runs faster, yet its processing delay is similar to that of the real RTL model. To simulate the network, we use a general model that routes point-to-point messages to the target node, with the point-to-point delay calculated from the number of hops in the 2D torus network (sketched below).

A. Barrier Delay Compared With the Software-Only Approach

In this section, we test the average barrier speedup over the software-only approach. The software approach is simulated by modifying timing parameters collected from the real hardware. The software-only barrier delay is compared with our approach in Fig. 6.
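The simplified network model's per-hop delay computation might look like the following C sketch, assuming dimension-ordered routing on an X-by-Y torus with wraparound links; LINK_HOP_NS is an illustrative constant, not a measured parameter of the real network.

    #include <stdio.h>
    #include <stdlib.h>

    #define LINK_HOP_NS 50  /* assumed per-hop latency; illustrative only */

    /* Shortest distance along one torus dimension (wraparound allowed). */
    static int torus_hops(int dim, int a, int b)
    {
        int d = abs(a - b);
        return d < dim - d ? d : dim - d;
    }

    /* Point-to-point delay between linear node ids on an X-by-Y torus. */
    static int p2p_delay_ns(int X, int Y, int src, int dst)
    {
        int hops = torus_hops(X, src % X, dst % X)   /* X dimension */
                 + torus_hops(Y, src / X, dst / X);  /* Y dimension */
        return hops * LINK_HOP_NS;
    }

    int main(void)
    {
        /* e.g., a 16x16 torus: delay from node 0 to node 255 */
        printf("%d ns\n", p2p_delay_ns(16, 16, 0, 255));
        return 0;
    }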

[Fig. 7. Barrier communication delay under different process arriving patterns (one node late, two nodes late), relative to the baseline where all nodes arrive at the same time.]

Our barrier offload approach obtains an obvious delay reduction compared with the software-only approach and shows much better performance scalability; for large barrier groups, our offload approach is 7.6x faster than the software-only approach. The performance benefits come from the following sources: 1) the software approach uses standard MP messages for barrier communication, and the MP packet is too large for barrier communication, incurring long processing delay; 2) the host-NIC communication cost is high, even worse than in the Core-Direct approach with triggered point-to-point operations, since for each point-to-point message the host must push an MP descriptor to the NIC and wait for the NIC's completion events; 3) the offload approach permits more communication and computation overlapping, though this benefit depends on the application.

B. Process Arriving Patterns

Barrier performance is greatly affected by the process arriving patterns. We evaluate several sample process arriving patterns and show their impact on the total barrier delay; the performance results are shown in Fig. 7. The simulation is conducted on barrier groups with one node and with two nodes arriving late, while all other nodes arrive at the barrier at the same time. The performance is compared with the baseline simulation in which all barrier nodes reach the barrier at the same time.

The test results show that the barrier delay is greatly affected by the process arriving pattern, and the impact becomes more obvious as the barrier group size grows. This behavior arises because the barrier algorithm is executed in round sequence: if one node has not reached the barrier, it does not send barrier messages to its targets, so all subsequent barrier communication must wait for this node to reach the barrier. Comparing our results with the test results from [12], our approach is less affected by the process arriving pattern. This is because the hardware can resume the barrier algorithm quickly and automatically without any host intervention, whereas the software approach must spend a long time on host-NIC communication when the late node finally reaches the barrier.

VI. CONCLUSION

We propose a new barrier offload approach in which, thanks to the new hardware-software interface, the barrier engine stays simple: the hardware follows a descriptor to execute the complex K-way dissemination algorithm. Simulation results show that our approach reduces barrier delay efficiently and achieves good computation and communication overlap. In our experience the barrier engine is easy to implement and requires few chip resources, so the NIC can dedicate more logic to real communication; this is important for next-generation supercomputers, where each NIC must support more processor threads.

VII. ACKNOWLEDGEMENT

This research is sponsored by the Natural Science Foundation of China (61014), the Chinese 863 project (2013AA014301), and the Hunan Provincial Natural Science Foundation of China (13JJ4007).

REFERENCES

[1] R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49-66, Feb. 2005.
[2] H. Miyazaki, Y. Kusano, N. Shinjou, F. Shoji, M. Yokokawa, and T. Watanabe, "Overview of the K computer System," FUJITSU Sci. Tech. J., vol. 48, no. 3, pp. 255-265, 2012.
[3] M. G. Venkata, R. L. Graham, J. Ladd, and P. Shamis, "Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct," in 2012 41st International Conference on Parallel Processing (ICPP). IEEE, pp. 289-298.
[4] M. Xie, Y. Lu, L. Liu, H. Cao, and X. Yang, "Implementation and Evaluation of Network Interface and Message Passing Services for TianHe-1A Supercomputer," in 2011 IEEE 19th Annual Symposium on High-Performance Interconnects (HOTI). IEEE, pp. 78-86.
[5] K. S. Hemmert, B. Barrett, and K. D. Underwood, "Using triggered operations to offload collective communication operations," in EuroMPI '10: Proceedings of the 17th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface. Springer-Verlag, Sep. 2010.
[6] T. Hoefler, T. Mehlan, F. Mietke, and W. Rehm, "Fast barrier synchronization for InfiniBand," in IPDPS '06: Proceedings of the 20th International Conference on Parallel and Distributed Processing. IEEE Computer Society, Apr. 2006.
[7] D. Hensgen, R. Finkel, and U. Manber, "Two algorithms for barrier synchronization," International Journal of Parallel Programming, vol. 17, no. 1, Feb. 1988.
[8] A. R. Mamidala, "Scalable and High Performance Collective Communication for Next Generation Multicore InfiniBand Clusters," PhD thesis, 2008.
[9] F. Sonja, "Hardware Support for Efficient Packet Processing," PhD thesis, pp. 1-207, Mar. 2012.
[10] Y. Tamir and G. L. Frazier, "Dynamically-allocated multi-queue buffers for VLSI communication switches," IEEE Transactions on Computers, vol. 41, no. 6, pp. 725-737, 1992.
[11] V. Tipparaju, W. Gropp, H. Ritzdorf, R. Thakur, and J. L. Traff, "Investigating High Performance RMA Interfaces for the MPI-3 Standard," in 2009 International Conference on Parallel Processing (ICPP). IEEE, pp. 293-300.
[12] "Efficient Barrier and Allreduce on InfiniBand Clusters using Hardware Multicast and Adaptive Algorithms," pp. 1-10, May 2005.