Hardware Acceleration of Barrier Communication for Large Scale Parallel Computer


2013 8th International Conference on Communications and Networking in China (CHINACOM)

Hardware Acceleration of Barrier Communication for Large Scale Parallel Computer

Pang Zhengbin, Wang Shaogang, Wu Dan, Lu Pingjing
School of Computer, National University of Defense Technology, Changsha, Hunan, China

Abstract—MPI collective communication overhead dominates the communication cost for large scale parallel computers, so the scalability and operation latency of collective communication are critical for next generation computers. This paper proposes a fast and scalable barrier communication offload approach which supports millions of compute cores. Following our approach, the barrier operation sequence is packed by the host MPI driver into a barrier "descriptor", which is pushed to the NIC (Network Interface). The NIC can complete the barrier automatically by following its algorithm descriptor. Our approach accelerates both intra-node and inter-node barrier communication. We show that it achieves both good barrier performance and scalability, especially for large scale computer systems. This paper also proposes an extendable and easy-to-implement NIC architecture that supports barrier offload as well as other communication patterns.

I. INTRODUCTION

Collective communication (barrier, broadcast, reduce, all-to-all) is very important for scientific applications running on parallel computers; it has been shown that collective communication can account for over 80% of the communication cost on large scale supercomputers[1]. The barrier operation is the most commonly used collective communication, and its performance is critical for most MPI parallel applications. In this paper, we focus on the implementation of a fast barrier for large scale parallel systems. For next generation exascale computers, the system could have over 1 million cores, and a good barrier implementation should achieve both low latency and scalability[2]. To achieve overlapping of communication and computation, offloading the collective communication to hardware has obvious benefits for these systems.

Present barrier offload techniques, like Core-Direct[3] and TIANHE-1A[4], use a triggered point-to-point communication approach: the Core-Direct software initiates multiple point-to-point communication requests to the hardware and sets each request to be triggered by other messages; in this way, the whole collective communication can be handled by hardware without further software intervention.

We observe that present barrier offload methods may suffer from long delay and poor scalability. Core-Direct must push many work queue elements for a single barrier operation in each node; e.g., for a barrier group with 4096 nodes, each node needs to push 12 work requests to the hardware[5], and we observe that this incurs long host-NIC (Network Interface Card) communication delay. For next-generation computer networks, the point-to-point communication delay is usually high, as the topology typically uses a torus; but each chip's network bandwidth is high due to technology advances in SerDes. In this paper, we propose a new barrier communication offload approach which fits well with next generation system networking. For next generation supercomputers, the processor usually incorporates many cores, so each NIC must support more MPI threads. We see that it is difficult for the NIC to support the communication requirements of so many threads.
In this paper, we leverage a hierarchical approach in which the intra-processor threads perform the barrier through dedicated hardware, and the node leader thread communicates with the other nodes through an inter-node barrier algorithm that is executed fully by hardware. Compared with other approaches, ours only requires the host to push one communication descriptor to the NIC in each node. The barrier descriptor covers both the intra-node and the inter-node barrier algorithm, and the hardware can follow the descriptor to automatically finish the full barrier algorithm. We also present a hardware architecture that smoothly supports the new barrier offload approach; thanks to the simple barrier engine architecture, the NIC can dedicate more hardware resources to collective communication. Simulation results show that our approach performs better than present barrier offload techniques.

II. BARRIER OFFLOAD ALGORITHM

Our approach offloads the MPI barrier operation through the following steps:

Step 1: each barrier node's MPI driver calculates the barrier communication sequence, i.e., the communication pattern performed by the barrier algorithm, and packs it into the barrier descriptor. All descriptors are packed following the enhanced dissemination algorithm, and the host does not perform any real communication during this step.

Step 2: the descriptor is sent to the NIC. Our approach enables the NIC to complete the real barrier communication automatically without any further host intervention; the barrier communication is performed solely by hardware.

Step 3: when the NIC completes the whole barrier communication, it informs the host through the host-NIC interface.
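As a minimal sketch of this three-step host-side flow (the descriptor fields and the nic_push_descriptor / nic_wait_completion calls are hypothetical placeholders for illustration, not the paper's driver API):

/* Minimal sketch of the host-side offload flow; all NIC calls are stubs. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint32_t bid;         /* system-wide barrier ID                       */
    int      my_rank;     /* local rank in the barrier group              */
    int      group_size;  /* number of ranks in the barrier group         */
    /* ... packed send/receive schedule, see the descriptor in Sec. III   */
} barrier_desc;

static void nic_push_descriptor(const barrier_desc *d)    /* step 2 (stub) */
{
    printf("push descriptor: bid=%u rank=%d size=%d\n",
           d->bid, d->my_rank, d->group_size);
}

static void nic_wait_completion(uint32_t bid)              /* step 3 (stub) */
{
    printf("barrier %u completed by NIC\n", bid);
}

int main(void)
{
    /* Step 1: the MPI driver computes the communication sequence locally
     * and packs it into one descriptor -- no real communication happens. */
    barrier_desc d = { .bid = 7, .my_rank = 0, .group_size = 4096 };

    nic_push_descriptor(&d);      /* step 2: one descriptor per node       */
    nic_wait_completion(d.bid);   /* step 3: host is informed on finish    */
    return 0;
}

The key property is that step 1 is purely local computation, so the host-NIC traffic per barrier is a single descriptor push plus one completion notification.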

Fig. 1. The 2-way dissemination algorithm: in each round, every rank sends a barrier message to two peers and receives from two peers, repeating for log(3, N) rounds.

Fig. 2. The NIC architecture supporting collective communication (on-board NICs connected to an NR router).

Fig. 3. The fence counter structure supporting collective communication: barrier groups with a group config register, a barrier group counter, and per-thread counters (Thread 0..n, Counter 0..n).

A. Inter-Node Barrier Algorithm

The dissemination algorithm is a commonly used barrier method[6], [7]. It supports barrier groups with an arbitrary number of nodes. The basic dissemination barrier algorithm requires multiple rounds; in each round a node sends one barrier message to another node and receives one barrier message from another node, and the communication of the following round can be initiated only after the previous round has finished. We observe that in large scale systems the network usually uses a torus topology, so the point-to-point communication delay is high, as a message may require many hops to reach the destination node. For these systems, the basic dissemination algorithm is not efficient: in most cases, every node spends a long time in each round waiting for the source barrier message.

To efficiently hide the barrier message delay, our hardware uses an enhanced K-way dissemination algorithm to offload the barrier communication. The modified algorithm sends and receives K messages in parallel in each round. Our approach defines a new message type used solely for barrier communication. The new barrier message is very small, so even if the NIC does not support multi-port parallel message processing, the barrier messages can be sent and received very fast. The example 2-way dissemination algorithm is shown in Fig. 1. We can prove that the 2-way dissemination algorithm requires a total of log(3, N) rounds to complete the barrier for N nodes. The obvious benefit of the new algorithm is that it greatly reduces the number of communication rounds; e.g., for the 2-way dissemination algorithm, the number of rounds is reduced from log(2, N) to log(3, N), and we observe that the whole barrier delay benefits from fewer algorithm rounds.
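To make the peer selection concrete, the C sketch below prints the send/receive schedule of a k-way dissemination barrier; it is a minimal illustration rather than the paper's exact pseudocode, with the offset formula j*(k+1)^r inferred from the description of the k-way scheme.

/* Peer schedule for a k-way dissemination barrier (illustrative sketch).
 * In round r, rank i sends to (i + j*(k+1)^r) mod N and receives from
 * (i - j*(k+1)^r) mod N for j = 1..k; rounds = ceil(log_{k+1}(N)).       */
#include <stdio.h>

int main(void)
{
    const int N = 16, k = 2, rank = 0;   /* example group size, fan-out, rank */
    long step = 1;                        /* (k+1)^round                      */

    for (int round = 0; step < N; round++, step *= (k + 1)) {
        for (int j = 1; j <= k; j++) {
            int sendpeer = (int)((rank + j * step) % N);
            int recvpeer = (int)(((rank - j * step) % N + N) % N);
            printf("round %d: send to %d, recv from %d\n",
                   round, sendpeer, recvpeer);
        }
        /* A real implementation only advances to the next round after all
         * k messages of this round have been sent and received.          */
    }
    return 0;
}

For k = 2 and N = 4096 this gives ceil(log3 4096) = 8 rounds, compared with the 12 rounds of the basic dissemination algorithm.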
B. Inter-Node Barrier Algorithm for Fat Tree

For supercomputers which use a fat tree interconnect, the topology is fast on broadcast operations. For example, in the TIANHE supercomputer, each board is equipped with 8 NICs, and all the on-board NICs are connected to one NR, which is a fat tree router chip. We see that, for the fat tree topology, the basic scatter-broadcast barrier algorithm is faster than the pair-wise algorithm, so we leverage the scatter-broadcast approach to perform the barrier communication within the board. Each board selects a leader NIC, to which all on-board NICs send barrier notifications; the leader NIC then communicates with the other board leaders to finish the whole-system barrier operation. On receiving all the notification messages, the leader broadcasts the barrier-reached notification to all other NICs.

C. Intra-Node Barrier Algorithm

For the MPI threads resident on the same processor, we use a fence counter to accelerate the barrier operation. After the intra-node MPI threads have finished their communication, our approach selects a leader thread, through which the node communicates with the MPI leader threads of the other nodes. To accelerate the intra-node barrier, the NIC hardware incorporates several groups of fence counters; each group is used to support one MPI communicator. Within a group there are several counters, each bonded to one MPI thread. When a thread reaches the barrier, its host driver increases its own fence counter, and the group counter is updated to the smallest value among the thread counters; when the group counter increases by 1, all the MPI threads residing in the host process have reached the barrier point. The group config register designates which threads participate in the barrier operation. A barrier group is assigned to an MPI communicator when the communicator is created by the MPI driver, and the group counter is reset to its initial value when the group is re-assigned.
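The fence counter behaviour can be summarized by the small C model below, a software illustration of the mechanism described above rather than the NIC logic itself: the group counter advances only when every thread selected by the group config register has incremented its own counter.

/* Software model of one fence-counter group (illustrative, not the RTL). */
#include <stdio.h>
#include <stdint.h>

#define MAX_THREADS 8

typedef struct {
    uint32_t config;                  /* group config register: bit i set if  */
                                      /* thread i participates in the barrier */
    uint32_t thread_cnt[MAX_THREADS]; /* per-thread fence counters            */
    uint32_t group_cnt;               /* min over participating thread_cnt    */
} fence_group;

/* Called when thread `tid` reaches the barrier: bump its counter and
 * recompute the group counter as the minimum over participating threads. */
static int fence_arrive(fence_group *g, int tid)
{
    uint32_t old = g->group_cnt, min = UINT32_MAX;

    g->thread_cnt[tid]++;
    for (int i = 0; i < MAX_THREADS; i++)
        if (((g->config >> i) & 1u) && g->thread_cnt[i] < min)
            min = g->thread_cnt[i];
    g->group_cnt = min;
    return g->group_cnt > old;        /* 1: all local threads have arrived */
}

int main(void)
{
    fence_group g = { .config = 0x7 };    /* threads 0..2 participate */
    int a = fence_arrive(&g, 0);
    int b = fence_arrive(&g, 1);
    int c = fence_arrive(&g, 2);
    printf("%d %d %d\n", a, b, c);        /* prints 0 0 1 */
    return 0;
}

Re-assigning the group to a new communicator corresponds to resetting all counters and rewriting the config register.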

III. BARRIER ALGORITHM DESCRIPTOR

When offloading collective communication to the hardware, one approach is to offload the full collective algorithm to the hardware; for example, the collective optimization over InfiniBand[8] uses an embedded processor to execute the algorithm. We observe that this approach greatly complicates the NIC design: the embedded processor is usually limited by its performance, which is far slower than the NIC's bandwidth and the host processor's performance.

We propose a new approach that does not require the hardware to execute the full barrier algorithm; instead, the barrier's communication sequence is calculated by the host's MPI driver, and the hardware simply follows the operation sequence to handle the real communication. We see that this leads to a simple hardware design, through which more hardware can be dedicated to the real collective communication.

For any node in the barrier group, we can see from the dissemination algorithm that, even before any real communication, each round's source and destination nodes can be statically determined. Our approach leverages each node's MPI driver to calculate the barrier sequence and pack it into the algorithm descriptor. After the descriptor is generated, it is pushed to the NIC through its host interface. The hardware can follow the descriptor to automatically communicate with the other nodes; once the sequence is completed, the whole barrier is completed. The host interface may vary between systems: for example, the command queue may reside in host memory, or the descriptor may be written directly to on-chip RAM through PCIE write commands.

An example structure of a barrier descriptor supporting the 2-way dissemination algorithm is shown in Fig. 4. The DType field indicates the descriptor type; along with the barrier descriptor, the system may support other collective or point-to-point communication types, and the rest of the descriptor is interpreted according to its type. The BID is a system-wide barrier ID, predefined when the communication group is created; each node's barrier descriptor for the same barrier group uses the same barrier ID, and barrier messages use the BID to match the barrier descriptor on the target node. The SendVec field is a bit vector whose width equals the maximum number of algorithm rounds; each bit indicates whether the corresponding round should send barrier messages to the target nodes. The RC1, RC2, ..., RC16 fields give the number of barrier messages that should be received in each algorithm round; only after receiving all the source barrier messages and sending out all the barrier messages does the barrier engine proceed to the next algorithm round. SendPeer indicates the target node ID for each communication round. The S flag is used to synchronize multiple barrier descriptors: when it is set, the descriptor waits for previous descriptors to complete before being issued to the barrier processing hardware. The M flag marks the current NIC as the barrier leader and is used when the underlying topology is a fat tree. The V flag indicates whether to perform the fat-tree topology optimization. The BVEC field indicates which threads participate in the barrier communication using the fence counter approach; when BVEC is all zeros, the fence counter optimization is disabled.

We see that this approach can easily support next-generation systems. The host-NIC communication cost is low, as it only requires each node to push one descriptor to the NIC. The barrier descriptor is also small compared with most standard point-to-point message descriptors; for example, the TIANHE-1A computer's MP (Message Passing) descriptor has 1024 bits[4].

Fig. 4. Example descriptor structure for the 2-way dissemination algorithm (bit fields: DType, BarrierID, SendVec, S/V/M flags, BVEC, and per-round receive counts RC0–RC20).
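For illustration, the descriptor described above could be modelled by a C structure like the following; the field widths and the 16-round limit are assumptions made for the sketch, not the actual TIANHE layout:

/* Illustrative barrier descriptor layout (assumed field widths). */
#include <stdio.h>
#include <stdint.h>

#define MAX_ROUNDS 16

typedef struct {
    uint8_t  dtype;                 /* DType: descriptor type (barrier)        */
    uint32_t bid;                   /* BID: system-wide barrier ID             */
    uint16_t sendvec;               /* bit r set: send messages in round r     */
    unsigned s : 1;                 /* S: wait for previous descriptors        */
    unsigned m : 1;                 /* M: this NIC is the board barrier leader */
    unsigned v : 1;                 /* V: enable the fat-tree optimization     */
    uint32_t bvec;                  /* BVEC: threads using the fence counters  */
    uint8_t  rc[MAX_ROUNDS];        /* RCi: messages to receive in round i     */
    uint16_t sendpeer[MAX_ROUNDS];  /* target node ID per round (one shown)    */
} barrier_desc;

int main(void)
{
    printf("descriptor model: %zu bytes\n", sizeof(barrier_desc));
    return 0;
}

Packing the whole schedule into one such structure is what lets the host hand the complete barrier to the NIC with a single descriptor write.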
Each node has its own barrier descriptor, and it should be generated completely from the node's local information. The target and source rank IDs for each barrier round are easy to generate if the local process rank and the barrier group size are known. But for the hardware to perform the real communication, the NIC must know the target and source node IDs, so our approach leverages the MPI driver to translate the process rank IDs into physical node IDs. The translation must be handled by the local node without any communication, so our approach requires that the rank-to-node mapping information is saved in each node's memory when the MPI communicator group is created.

The barrier message carries the BID, the round ID, and the DestID to the destination node. Through this information, the destination node can easily determine which point of the algorithm it has reached. If the destination node has not reached the barrier, the barrier messages are saved in a temporary buffer and wait for the destination node's own descriptor to be pushed to the NIC. When the NIC has completed the sequence defined in the descriptor, the group barrier communication is finished; it then informs the host that all other nodes have reached the barrier.

IV. THE HARDWARE IMPLEMENTATION

A. Barrier Engine Architecture

For barrier communication, a complicated case is dealing with different process arriving patterns. The timing difference between the nodes of a collective communication group can have a significant impact on the performance of the operation, and the hardware must be carefully designed to avoid performance degradation. To handle this problem, the BE (Barrier Engine) leverages a DAMQ (Dynamically-Allocated Multi-Queue)[10] to hold incoming barrier messages. The packets stored in this queue can be processed out of order. If a barrier message reaches a target node whose NIC has not reached the barrier, the BE saves the packet in the DAMQ buffer.

We use a simple barrier message handshake protocol between two nodes. In the barrier descriptor, if the recvpeer is valid, the local node should wait for the source node to reach the barrier; if the sendpeer is valid, the local node should tell the destination node that it has reached the barrier. If the barrier message from the sendpeer reaches the target node but the target node has not reached the barrier, the barrier message is saved in the DAMQ buffer and the target barrier engine sends back a BarrierRsp message. On receiving the BarrierRsp message, the source node knows that the target node is sure to get the message. When the DAMQ buffer is full, the target node nacks the source node with a BarrierNack message; on receiving this message, the source node resends the barrier message after a predefined delay.

When the barrier descriptor reaches the NIC, the barrier engine first checks its local DAMQ buffer to see whether any barrier messages have already arrived; if there are any, the BE processes these messages immediately. Note that each barrier message carries the barrier ID (BID), through which it is matched against the destination node's descriptor: if the message's BID equals the descriptor's BID, the source node and target node belong to the same barrier group. The BID can be derived from the MPI communicator group ID, and it is required that all nodes in the same barrier group agree on the BID. The BE engine supports multiple barriers running in parallel, with each barrier using a different ID.

The structure of the barrier engine is shown in Fig. 5. The logic is separated into the barrier message sending module (TE, Target Engine) and the receiving module (SE, Source Engine). The SDQ and HDQ are descriptor queues: the SDQ resides in host memory, while the HDQ resides in on-chip RAM, is mainly used for fast communication, and is fully controlled by the host MPI driver. The OF is the fetching module, which reads from the descriptor queues and dispatches descriptors to the barrier engine.

The SE module is responsible for receiving the network barrier messages that come from the send peers. The barrier engine saves the messages in the DAMQ buffer; to reduce the hardware requirement, the messages of all barrier groups are saved in one buffer, and the DAMQ buffer can be handled out of order. When the receiving DAMQ accepts a message, it directly sends the BarrierRsp reply back to the sender. The local node does not need to know which node a barrier message comes from, so the barrier descriptor only holds the number of messages that should be received, and the SE module uses a bookkeeping table to hold the number of source barrier messages received for each round. Because each node may reach the barrier in an arbitrary order, the received messages may reach the local node out of order; the current receiving round is therefore the round for which all barrier messages of every preceding round have already been received.

The TE module is responsible for sending barrier messages to the target nodes following the sequence defined in the descriptor. For algorithm round i, barrier messages are sent to node sendpeer if the SE module has received all barrier messages before round i and the sendpeer is valid for round i. The barrier messages for the current round are sent in a pipelined fashion, before their response messages are received. If the target node replies with a nack message, the barrier message is resent after a preconfigured delay. The TE and SE run in parallel and independently. Because the TE module needs to know the current receiving round, the SE module provides this information directly through module ports.

Fig. 5. Barrier engine architecture: network interface, DAMQ buffers, TE (Target Engine), SE (Source Engine), OF, SDQ, and HDQ.
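The receive path can be summarized in software as follows; this is an illustrative single-group model of the SE behaviour described above, not the SystemVerilog implementation.

/* Illustrative model of the SE receive path (not the actual RTL). */
#include <stdio.h>
#include <stdint.h>

#define DAMQ_SIZE 8
enum reply { BARRIER_RSP, BARRIER_NACK };

typedef struct { uint32_t bid; uint8_t round; } barrier_msg;

static barrier_msg damq[DAMQ_SIZE];          /* shared buffer, out of order  */
static int damq_used = 0;

static int      desc_posted = 0;             /* has the local descriptor     */
static uint32_t desc_bid = 0;                /*   for this group arrived?    */
static uint8_t  rc_seen[16];                 /* messages counted per round   */

/* Handle one incoming barrier message; return the reply sent to the source. */
static enum reply se_receive(barrier_msg m)
{
    if (desc_posted && m.bid == desc_bid) {  /* BID matches local descriptor */
        rc_seen[m.round]++;                  /* book it for that round       */
        return BARRIER_RSP;
    }
    if (damq_used < DAMQ_SIZE) {             /* descriptor not here yet:     */
        damq[damq_used++] = m;               /*   park the message           */
        return BARRIER_RSP;
    }
    return BARRIER_NACK;                     /* buffer full: sender retries  */
}

/* Called when the local descriptor is pushed: drain early arrivals first. */
static void se_post_descriptor(uint32_t bid)
{
    desc_posted = 1;
    desc_bid = bid;
    for (int i = 0; i < damq_used; i++)
        if (damq[i].bid == bid)
            rc_seen[damq[i].round]++;
    damq_used = 0;                           /* simplification: single group */
}

int main(void)
{
    barrier_msg early = { .bid = 7, .round = 0 };
    se_receive(early);                       /* arrives before descriptor    */
    se_post_descriptor(7);
    printf("round 0 messages seen: %d\n", rc_seen[0]);   /* prints 1 */
    return 0;
}

Draining the DAMQ when the local descriptor arrives handles the case where other nodes reach the barrier first.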
V. EXPERIMENTS

We implemented the barrier engine in the SystemVerilog language and integrated the barrier engine module into TIANHE-1's RTL model. TIANHE-1's point-to-point communication engine uses descriptors for MP (Message Passing) and RDMA (Remote Direct Memory Access)[4], and we add the new barrier descriptor type. Our experience is that the barrier engine is easy to design: we model it with fewer than 6000 lines of SystemVerilog code. The new model is simulated with the Synopsys VCS simulator.

We test the barrier latency for different barrier group sizes. To simulate large scale barrier groups, we designed a simplified model in SystemVerilog; the simplified model requires fewer simulation resources and runs faster, yet its processing delay is similar to that of the real RTL model. To simulate the network, we use a general model which routes point-to-point messages to the target node; the point-to-point delay is calculated based on the number of hops in the torus network.
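As a reference for the delay model, a hop-count calculation on a torus can be as simple as the sketch below; it is illustrative only, with the torus dimensions and per-hop latency chosen as assumed values rather than the paper's actual parameters.

/* Illustrative point-to-point delay model on a 2D torus (assumed values). */
#include <stdio.h>
#include <stdlib.h>

static int torus_hops(int a, int b, int dim)     /* shortest hops on one ring */
{
    int d = abs(a - b);
    return d < dim - d ? d : dim - d;
}

int main(void)
{
    const int X = 16, Y = 16;                    /* assumed 16x16 torus        */
    const double hop_ns = 50.0;                  /* assumed per-hop latency    */
    int src = 0, dst = 3 * X + 9;                /* node (9,3) from node (0,0) */

    int hops = torus_hops(src % X, dst % X, X) + torus_hops(src / X, dst / X, Y);
    printf("hops=%d delay=%.0f ns\n", hops, hops * hop_ns);
    return 0;
}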

A. Barrier Delay Compared With the Software-Only Approach

In this section, we test the average barrier speedup over the software-only approach. The software approach is simulated by modifying the timing parameters collected from the real hardware. The software-only barrier delay is compared with our approach, as shown in Fig. 6.

Fig. 6. Barrier offload speedup over the software-only approach.

Our barrier offload approach obtains an obvious delay reduction compared with the software-only approach and shows much better performance scalability. For the largest barrier group tested, our offload approach is 7.6x faster than the software-only approach. The performance benefits come from the following reasons: 1) The software approach uses standard MP messages for barrier communication; we see that the MP packet is too large for barrier communication and incurs a long processing delay. 2) The host-NIC communication cost is high, even worse than in the Core-Direct approach, which uses triggered point-to-point operations: for each point-to-point message, the host needs to push the MP descriptor to the NIC and wait for the NIC's completion events. 3) The offload approach permits more communication and computation overlapping; this performance benefit depends on the application.

B. Process Arriving Patterns

Barrier performance is greatly impacted by the process arriving patterns. We give some sample process arriving patterns and show their impact on the total barrier delay. The performance results are shown in Fig. 7.

Fig. 7. Barrier communication delay with one node and with two nodes arriving late.

The simulation is conducted on the barrier group with one node and with two nodes arriving late, while the other nodes reach the barrier at the same time. The performance is compared with the baseline simulation in which all the barrier nodes reach the barrier at the same time. We observe from the test results that the barrier delay is greatly impacted by the process arriving pattern, and the impact becomes more obvious as the barrier group size grows. This behavior occurs because the barrier algorithm is executed following the round sequence: if one node does not reach the barrier, it does not send out its barrier messages to its targets, so all the following barrier communication must wait for this node to reach the barrier. We compare our results with the test results from [1]; the comparison shows that our approach is less affected by the process arriving pattern. This is because the hardware can resume the barrier algorithm quickly and automatically without any host intervention, whereas the software approach must spend a long time on host-NIC communication when the late node reaches the barrier.

VI. CONCLUSION

We propose a new barrier offload approach with new hardware-software interfaces that keep the barrier engine simple. The hardware follows the descriptor to execute the complex K-way dissemination algorithm. Simulation results show that our approach reduces barrier delay efficiently and achieves good computation and communication overlap. From our experience, the barrier engine is easy to implement and requires few chip resources, so the NIC can dedicate more logic to real communication; this is important for next-generation supercomputers, where each NIC must support more processor threads.

VII. ACKNOWLEDGEMENT

This research is sponsored by the Natural Science Foundation of China (61014), the Chinese 863 project (2013AA014301), and the Hunan Provincial Natural Science Foundation of China (13JJ4007).
REFERENCES

[1] R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49–66, Feb. 2005.
[2] H. Miyazaki, Y. Kusano, N. Shinjou, F. Shoji, M. Yokokawa, and T. Watanabe, "Overview of the K computer System," FUJITSU Sci. Tech. J., vol. 48, no. 3, 2012.
[3] M. G. Venkata, R. L. Graham, J. Ladd, and P. Shamis, "Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct," in 2012 41st International Conference on Parallel Processing (ICPP). IEEE, 2012.
[4] M. Xie, Y. Lu, L. Liu, H. Cao, and X. Yang, "Implementation and Evaluation of Network Interface and Message Passing Services for TianHe-1A Supercomputer," in 2011 IEEE 19th Annual Symposium on High-Performance Interconnects (HOTI). IEEE, 2011.
[5] K. S. Hemmert, B. Barrett, and K. D. Underwood, "Using triggered operations to offload collective communication operations," in EuroMPI '10: Proceedings of the 17th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface. Springer-Verlag, Sep. 2010.
[6] T. Hoefler, T. Mehlan, F. Mietke, and W. Rehm, "Fast barrier synchronization for InfiniBand," in IPDPS '06: Proceedings of the 20th International Conference on Parallel and Distributed Processing. IEEE Computer Society, Apr. 2006.
[7] D. Hensgen, R. Finkel, and U. Manber, "Two algorithms for barrier synchronization," International Journal of Parallel Programming, vol. 17, no. 1, Feb. 1988.
[8] A. R. Mamidala, "Scalable and High Performance Collective Communication for Next Generation Multicore InfiniBand Clusters," Ph.D. thesis, 2008.
[9] F. Sonja, "Hardware Support for Efficient Packet Processing," Ph.D. thesis.
[10] Y. Tamir and G. L. Frazier, "Dynamically-allocated multi-queue buffers for VLSI communication switches," IEEE Transactions on Computers, vol. 41, no. 6, 1992.
[11] V. Tipparaju, W. Gropp, H. Ritzdorf, R. Thakur, and J. L. Traff, "Investigating High Performance RMA Interfaces for the MPI-3 Standard," in 2009 International Conference on Parallel Processing (ICPP). IEEE, 2009.
[12] "Efficient Barrier and Allreduce on InfiniBand Clusters Using Hardware Multicast and Adaptive Algorithms," pp. 1–10.
