Research on the Implementation of MPI on Multicore Architectures


Pengqi Cheng
Department of Computer Science & Technology, Tsinghua University, Beijing, China

Yan Gu
Department of Computer Science & Technology, Tsinghua University, Beijing, China

Abstract

We first introduce the multicore-oriented optimization modules of two common MPI implementations, MPICH2 and OpenMPI, and then test their performance on a multicore computer. By enabling and disabling these modules, we measure their performance, including bandwidth and latency, under different circumstances. Finally, we analyze the two MPI implementations and discuss the choice of MPI implementation and possible improvements.

Keywords: Message Passing Interface, Multicore, MPICH2, OpenMPI, Intranode Communication.

I. INTRODUCTION

As CPU frequency has stalled at about 3-4 GHz for a long time, multicore architectures are increasingly used to obtain better performance on a single computer. However, MPI, as a communication protocol, was originally designed for distributed-memory systems. Unlike OpenMP, it does not allow shared data. Instead, MPI programs transfer data by message passing. Because of the memory wall, when there are thousands of cores on one computer, message passing will be more efficient than access to shared data. Thus, it is important for MPI implementations to increase the speed of data communication.

There are many commonly used open-source implementations, such as MPICH2 and OpenMPI. To fully exploit multicore architectures, these implementations may use new techniques. The purpose of this project is to study the multicore awareness of such implementations, analyze their performance, and make improvements if possible.

II. MULTICORE AWARENESS OF MPI IMPLEMENTATIONS

A. Thread-Level Parallelism

Because they have traditionally been used on computer clusters, the major MPI implementations, including both MPICH2 and OpenMPI, focus on parallelism at the process level more than at the thread level.
For multicore nodes, lowering the parallelization level from processes to threads ("process fusion") can in theory improve performance. For better scalability, however, MPI does not provide a convenient means of thread-level communication. Instead, thread-level control is left to the user. Pthreads, OpenMP, or other thread parallelization methods are available inside each MPI process. To enable multithreaded programming, MPI_Init_thread() [3] should be called in place of MPI_Init(). After that, thread parallelization such as OpenMP directives can be used. Both implementations support this kind of hybrid programming. It leaves the programmer much freedom to decide the behavior of all processes and threads, and the programmer can choose private or shared data as appropriate to reach peak performance. Nevertheless, it moves the programming closer to the hardware and away from the algorithm itself. Moreover, existing MPI code cannot easily be migrated to a multicore architecture in this way.

B. MPICH2 Nemesis

Starting with the MPICH2 1.1 series, the default channel is ch3:nemesis, so in this article we mainly discuss Nemesis [6] for MPICH2. The Nemesis communication subsystem provides an efficient and scalable mechanism for both intranode and internode communication.

In Nemesis, each process has only one lock-free receive queue. When a process needs to send a message, it dequeues a free element from a lock-free free queue, fills this element with the message, and then enqueues it onto the receiver's receive queue. Receiving is simply the reverse of sending: dequeuing from the receive queue, handling the message, and enqueuing the element back onto its original free queue.
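The send and receive paths described above can be sketched in C11. This is a minimal single-threaded illustration under stated assumptions, not the Nemesis source: the types cell_t and queue_t, the origin back-pointer, the helpers send_msg and recv_msg, and the 64-byte payload are all hypothetical, and the empty-queue check in dequeue is an addition. The queue operations use atomic exchange and compare-and-swap so that several senders could enqueue while only the owning process dequeues.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define PAYLOAD_MAX 64              /* hypothetical cell capacity */

struct queue;                       /* forward declaration */

/* Hypothetical cell type; not the actual Nemesis definition. */
typedef struct cell {
    _Atomic(struct cell *) next;
    struct queue *origin;           /* free queue this cell returns to */
    size_t len;
    char payload[PAYLOAD_MAX];
} cell_t;

typedef struct queue {
    _Atomic(cell_t *) head;
    _Atomic(cell_t *) tail;
} queue_t;

/* Multiple-enqueuer side: swap the tail, then link the cell in. */
void enqueue(queue_t *q, cell_t *c) {
    atomic_store(&c->next, NULL);
    cell_t *prev = atomic_exchange(&q->tail, c);
    if (prev == NULL)
        atomic_store(&q->head, c);
    else
        atomic_store(&prev->next, c);
}

/* Single-dequeuer side. Returns NULL on an empty queue (an addition). */
cell_t *dequeue(queue_t *q) {
    cell_t *c = atomic_load(&q->head);
    if (c == NULL)
        return NULL;
    cell_t *next = atomic_load(&c->next);
    if (next != NULL) {
        atomic_store(&q->head, next);
    } else {
        atomic_store(&q->head, NULL);
        cell_t *expected = c;
        if (!atomic_compare_exchange_strong(&q->tail, &expected, NULL)) {
            /* An enqueuer has swapped the tail but not linked yet: wait. */
            while ((next = atomic_load(&c->next)) == NULL)
                ;
            atomic_store(&q->head, next);
        }
    }
    return c;
}

/* Sender: take a free cell, fill it, post it on the receiver's queue.
 * Returns 0 on success, -1 if no free cell is available. */
int send_msg(queue_t *my_free_q, queue_t *dest_recv_q, const char *text) {
    cell_t *c = dequeue(my_free_q);
    if (c == NULL)
        return -1;
    c->origin = my_free_q;
    c->len = strlen(text);
    if (c->len >= PAYLOAD_MAX)
        c->len = PAYLOAD_MAX - 1;   /* truncate oversized text */
    memcpy(c->payload, text, c->len);
    enqueue(dest_recv_q, c);
    return 0;
}

/* Receiver: dequeue a cell, handle it (copy it out), and return the cell
 * to its original free queue. Returns bytes copied, or -1 if empty. */
int recv_msg(queue_t *my_recv_q, char *out, size_t cap) {
    assert(cap > 0);
    cell_t *c = dequeue(my_recv_q);
    if (c == NULL)
        return -1;
    size_t n = c->len < cap - 1 ? c->len : cap - 1;
    memcpy(out, c->payload, n);
    out[n] = '\0';
    enqueue(c->origin, c);
    return (int)n;
}
```

A caller would first enqueue a few cells onto the sender's free queue, then call send_msg and recv_msg. Once the receiver has handled a message, its cell is available to the sender again.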
[Figure: the message queue mechanism. The sending process and the receiving process each have a Free queue and a Recv queue; numbered steps 1-6 trace an element from the free queue, through "Fill packet" on the sender, onto the receiver's Recv queue, through "Handle packet" on the receiver, and back to the free queue.]

There are three variations of the location of the free queue:
1) One global free queue
2) One free queue per process that will be dequeued by other processes while sending to it
3) One free queue per process that will be dequeued by the process itself while sending messages out

The first one is better for a small-scale shared-memory architecture such as a multicore node. Since the processes do not access the free queue in a balanced way, this method improves memory utilization. The other two are mainly designed for large-scale distributed-memory architectures such as NUMA, to decrease remote memory access latency. They only need either the sender or the receiver to access remote memory. The Nemesis implementation uses the third variation for large-scale clusters. This variation can be implemented with multiple-enqueuer single-dequeuer lock-free queues.

Here is the pseudo-code of enqueuing and dequeuing, with atomic swap (SWAP) and compare-and-swap (CAS) operations:

    Enqueue(queue, element)
        prev = SWAP(queue->tail, element);
        if (prev == NULL)
            queue->head = element;
        else
            prev->next = element;

    Dequeue(queue, &element)
        element = queue->head;
        if (element->next != NULL)
            queue->head = element->next;
        else
            queue->head = NULL;
            old = CAS(queue->tail, element, NULL);
            if (old != element)
                while (element->next == NULL)
                    SKIP;
                queue->head = element->next;

In addition, there are some optimizations for faster intranode communication:

1) Reducing L2 Cache Misses

Memory is much slower than cache on modern computers, so it is critical to reduce L2 cache misses. When enqueuing onto an empty queue or dequeuing the last element in a queue, a process has to access both the head and the tail of the queue. In these cases, there is only one L2 cache miss if the head and tail are in the same cache line. In all other cases, only one of the two is accessed: enqueuers touch the tail and the dequeuer touches the head. If the head and tail share a cache line, these accesses then cause L2 cache misses through false sharing. Hence neither placement of the head and tail is best in every case. In fact, Nemesis puts them in the same cache line and uses a shadow head pointer (initialized to NULL) in another cache line. The dequeuer first checks the shadow. If the shadow is not NULL, the dequeuer directly uses the shadow head.
Otherwise, it checks the real head. If the real head is not NULL, meaning that some elements have been enqueued, the dequeuer copies the real head to the shadow and then sets the real head to NULL. In this way, L2 cache misses occur only when enqueuing onto an empty queue or dequeuing from a queue with only one element.

2) Bypassing Queues

Another technique uses fastboxes to decrease latency. A fastbox is a single buffer, one per pair of processes. The sender puts the message into the fastbox rather than the queue if the fastbox is empty. Similarly, the receiver takes the message from the fastbox if it is full. Fastboxes can improve the performance of intranode communication. However, this method does not scale, so it cannot be applied across a large global shared memory. It also requires the receiver to check multiple fastboxes. This can be partly avoided in Nemesis when the receive specifies the sender. In addition, this implementation may change the order in which messages are sent and received. To address the problem, Nemesis uses a sequence number to keep the original order.

3) Memory Copy

Nemesis uses assembly string copy functions and MMX instructions, which are more efficient than the standard libc memcpy function.

4) Large Message Transfer

The shared-memory queue discussed above is efficient for transferring small messages. For large messages, however, it is not a good solution. Therefore, the Large Message Transfer (LMT) interface was added to CH3. It can increase the transfer bandwidth and decrease the impact on the application's data in the cache. The LMT interface uses the rendezvous protocol. Unlike the original eager protocol, it ensures that the receiver is matched before sending, so that the sender does not need to take more memory for unsent messages. The protocol threshold is 32KB by default.
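The choice between the two protocols can be illustrated with a short C sketch. It is a hedged illustration only: the names choose_protocol and protocol_steps are hypothetical, the step names follow the paper's protocol figure without modeling who sends what, and treating a message of exactly the threshold size as rendezvous is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Default protocol threshold quoted in the text: 32KB. */
#define LMT_THRESHOLD_BYTES ((size_t)32 * 1024)

enum protocol { PROTO_EAGER, PROTO_RENDEZVOUS };

/* Assumption: messages at or above the threshold use rendezvous (LMT),
 * smaller messages are sent eagerly through the shared-memory queue. */
enum protocol choose_protocol(size_t msg_bytes, size_t threshold) {
    return msg_bytes < threshold ? PROTO_EAGER : PROTO_RENDEZVOUS;
}

/* Message names as they appear in the paper's protocol figure. */
const char *const EAGER_STEPS[] = { "Send" };
const char *const RNDZ_STEPS[]  = { "RNDZ_START", "RNDZ_REPLY", "DATA", "FIN" };

/* Stores the step list for protocol p in *steps and returns its length. */
size_t protocol_steps(enum protocol p, const char *const **steps) {
    assert(steps != NULL);
    if (p == PROTO_EAGER) {
        *steps = EAGER_STEPS;
        return sizeof EAGER_STEPS / sizeof EAGER_STEPS[0];
    }
    *steps = RNDZ_STEPS;
    return sizeof RNDZ_STEPS / sizeof RNDZ_STEPS[0];
}
```

The extra handshake steps are the price of rendezvous: the sender holds no buffered copy of an unmatched large message, but the transfer cannot start until the receiver has replied.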
The procedure of the two protocols is shown in this figure:

[Figure: message exchange between Sender and Receiver. Eager protocol: Send. Rendezvous protocol: RNDZ_START, RNDZ_REPLY, DATA, FIN.]

For intranode communication, LMT copies through a buffer in shared memory. With double buffering, copying from the sender to the buffer and from the buffer to the receiver can proceed concurrently.

5) Bypassing the Posted Receive Queue

When receiving, the traditional CH3 implementation checks the entire message queue and waits if no message matches the receive currently being processed. With this optimization, CH3 does not wait when no message matches. Instead, it checks the other receive requests, and if it finds a matching pair, it can receive that message.

C. OpenMPI sm BTL

OpenMPI has an equivalent of Nemesis: the sm BTL (shared-memory Byte Transfer Layer), a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. It can only be used between processes on the same node. The sm BTL transfers fragments of messages broken up by the PML (Point-to-point Message Management Layer). The steps are: [4]
1) The sender takes a shared-memory fragment out of one of its free lists. Each process has one free list for smaller fragments and another for larger fragments.
2) The sender packs the user-message fragment into this shared-memory fragment.
3) The sender posts a pointer to this shared fragment into the appropriate FIFO queue of the receiver.
4) The receiver polls its FIFO(s). When it finds a new fragment pointer, it unpacks the data out of the shared-memory fragment and notifies the sender that the shared fragment is ready for reuse (to be returned to the sender's free list).

On each node where an MPI job has two or more processes running, the job creates a file that each process mmaps into its address space. Shared-memory resources that the job needs, such as FIFOs and fragment free lists, are allocated from this shared-memory area.

D. KNEM

KNEM (Kernel Nemesis) is a Linux kernel module enabling high-performance intranode MPI communication for large messages. The LMT module of MPICH2 (since 1.1.1) and the sm BTL of OpenMPI (since 1.5) use KNEM to improve intranode communication.

On a single node, both Nemesis and the sm BTL use a buffer for copying messages. This performs well for small messages when the number of cores is not large.
As the message size and the number of cores grow, this solution becomes too slow. Other potential problems include cache pollution and high CPU use. For better scalability and performance, KNEM moves the data sharing from user space down to kernel space. Here is how KNEM works: [5]
1) The sender declares a send buffer to KNEM.
2) KNEM returns to the sender a unique cookie identifying the virtual segments contained in the buffer.
3) The sender passes the cookie to the receiver.
4) The receiver gives KNEM the cookie and the location of its receive buffer.
5) KNEM performs the copy.

[Figure: KNEM transfer between the Sender LMT (Send Buffer) and the Receiver LMT (Recv Buffer): (1) Send Cmd, added to the Send Cmd List; (2) Cookie; (3) Inter LMT Communication; (4) Recv Cmd; (5) Acquire Send Cmd; (6) Copy.]

KNEM thus saves one copy, which is efficient for large messages and many-core systems. These operations are more complicated than double-buffer copying (the system call overhead is about 100ns [2]), so KNEM should not be applied to small messages. More details of KNEM can be found in [7]. KNEM can also improve communication by using Intel(R) I/O Acceleration Technology (I/OAT) [1], which transfers data with DMA.

III. EXPERIMENT

A. Platform

Hardware
    CPU         Intel Core i5 CPU 2.67GHz, 4 cores
    Cache       32+32KB L1 per core, 256KB L2 per core, 8MB L3 shared
    Memory      4GB 1333MHz
Software
    OS          Arch Linux x86-64 with Kernel
    Compiler    GCC
    MPICH       compiled with -O2; No LMT / LMT / LMT + KNEM
    OpenMPI     1.5.1, compiled with -O2; KNEM support enabled/disabled
    KNEM        0.9.4, compiled with -O2
    Benchmark   OSU Micro-Benchmarks 3.2, compiled with mpicc -O3
    Processes   2 processes for one-to-one tests, 4 for others

Due to hardware limitations, DMA and I/OAT are not enabled in KNEM.

B. Results

1) Bandwidth Test

2) Latency Test

C. Analysis

From the figures above, we see the following.

In most cases, Nemesis (without LMT/KNEM) is the best for small messages, while the sm BTL is the best for large messages. The crossover is at about 16KB.

For messages between 16KB and 4MB, KNEM clearly accelerates the sm BTL. For messages over 4MB, on the contrary, KNEM makes the sm BTL slower. Beyond 4MB, the message size is larger than the L3 cache. Because the OSU Micro-Benchmarks always send and receive the same piece of memory, cache misses occur only when the message size is larger than the cache. At the memory level, the sm BTL may be better because of differences in implementation. For example, the sm BTL possibly uses assembly code while KNEM does not. If DMA were enabled, KNEM might perform better.

LMT performs better than the original Nemesis only when the message is large enough. For different tests, the threshold varies from 32KB to 256KB. In general, the more concurrent memory accesses there are, the smaller the threshold will be, since more accesses cost more memory and cache space for unsent messages. The steep slopes at 32KB in the figures arise because LMT is not enabled for messages smaller than 32KB.

For small messages, the combination of KNEM and LMT is the slowest. Unlike with the sm BTL, KNEM shows its advantages for messages larger than the cache, which demonstrates that the copying efficiency is ordered as: sm BTL > KNEM > LMT.

Since there are only four cores on this computer, there is not much difference between the one-to-one and all-to-all tests. With more cores, however, they will differ significantly.

IV. CONCLUSION AND POSSIBLE IMPROVEMENTS

Use MPICH2 for programs that frequently pass small messages; use OpenMPI when messages are large. If the messages are large enough, use KNEM to accelerate message passing.

These conclusions are based on our platform. If the cache is not shared, if there are more cores, or if DMA can be enabled, the performance can be different. We highly recommend running a similar test before deciding which MPI implementation to use.

Finally, there are some possible improvements.

On a multicore architecture, each node actually executes multiple processes. Thus, it is unnecessary for every process to have its own free queue. All processes on one node can share one free queue, which is good for load balance. The prerequisite, however, is that the location of the processes is known, as with process fusion. This therefore needs either some modification of the source code or dynamic queue allocation at runtime.

The rendezvous protocol in LMT is driven by the sender. The sender cannot send a message before the receiver replies that it is ready, and the latency of the two transfers this requires can be very large. To avoid this, the receiver can tell the sender immediately when it needs a message. The sender then sends a message if it finds a matching receiver for it. This improvement saves one message transfer, but it clearly increases the overhead of the sender, so its effectiveness depends heavily on the hardware platform and network conditions.

REFERENCES

[1] Intel I/OAT webpage.
[2] KNEM website.
[3] Linux manual page of MPI_Init_thread.
[4] FAQ of sm BTL.
[5] D. Buntinas, B. Goglin, D. Goodell, G. Mercier and S. Moreaud. Cache-Efficient, Intranode Large-Message MPI Communication with MPICH2-Nemesis. Proceedings of ICPP.
[6] D. Buntinas, G. Mercier and W. Gropp. Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem. CCGRID.
[7] T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres and J. Dongarra. Kernel assisted collective intra-node communication among multicore and manycore CPUs. INRIA.
