Low Latency Communication on DIMMnet-1 Network Interface Plugged into a DIMM Slot


Noboru Tanabe (Toshiba Corporation) noboru.tanabe@toshiba.co.jp; Hideki Imashiro (Hitachi Information Technology Co., Ltd.) himashi@hitachi-it.co.jp; Tomohiro Kudoh (National Institute of Advanced Industrial Science and Technology) t.kudoh@aist.go.jp; Yoshihiro Hamada and Hironori Nakajo (Tokyo University of Agriculture and Technology) hamada@nj.cs.tuat.ac.jp, nakajo@cc.tuat.ac.jp; Junji Yamamoto (Hitachi, Ltd.) junji-y@crl.hitachi.co.jp; Hideharu Amano (Keio University) hunga@am.ics.keio.ac.jp

Abstract

DIMMnet-1 is a high performance network interface for PC clusters that can be plugged directly into the DIMM slot of a PC. By using both low latency AOTF (Atomic On-The-Fly) sending and high bandwidth BOTF (Block On-The-Fly) sending, it overcomes the overhead caused by standard I/O buses such as PCI. Two types of DIMMnet-1 prototype boards (providing optical and electrical network interfaces), each containing a Martini network interface controller chip, are currently available. They can be plugged into a 100 MHz DIMM slot of a PC with a Pentium-3, Pentium-4 or Athlon processor. The round-trip time for AOTF on this incompletely tuned DIMMnet-1 is 7.5 times shorter than that of Myrinet2000, and the barrier synchronization time using AOTF is 4 times shorter than that of an SR8000 supercomputer. The floating-point sum operation time between two nodes is 1,903 ns. These results show that DIMMnet-1 holds promise for applications in which scalable performance is difficult to obtain with traditional approaches because of frequent data exchange.

1 Introduction

Many high performance PC clusters use system area networks such as Myrinet for interconnection. Myrinet2000, based on 64-bit 66 MHz PCI with a 2 Gbps electrical or optical link, is well known; its sustained one-way bandwidth is 245 MB/s and its short-message latency is 7 µs [1]. Dolphin's PCI-SCI D330, based on 64-bit 66 MHz PCI with a 667 MB/s link, has a good reputation; its sustained one-way bandwidth is 200 MB/s and its remote write latency is 1.46 µs [2]. However, most current PCs have only 32-bit 33 MHz conventional PCI slots. If Myrinet or a PCI-SCI card were plugged into a conventional PCI slot, its performance would be degraded. In addition, because the PCI bus is half duplex, the bandwidth of a PCI-based NIC (network interface card) degrades when sending and receiving are performed simultaneously. On the other hand, a communication bandwidth of over 1 GB/s per NIC is becoming realistic with optical interconnection; however, because of PCI bandwidth limitations, bus handling overheads and protocol software overheads, the potential of optical links cannot be fully exploited by such NICs.

In order to overcome these problems, we have proposed plugging high performance NICs into a memory slot on the motherboard of a low cost PC. We have named this class of NIC MEMOnet, and a NIC plugged into a DIMM slot DIMMnet. We are currently developing DIMMnet-1, a prototype NIC based on the DIMMnet approach. In this paper, the architecture, implementation and performance evaluation of DIMMnet-1 are described.

(Pentium is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Other product and company names mentioned herein may be trademarks and/or service marks of their respective owners.)

2 Architecture

2.1 MEMOnet

MEMOnet is the class of NIC, proposed by the authors, that is plugged into a memory slot; a NIC plugged into a DIMM slot is called DIMMnet. In order to realize an effective DIMMnet implementation, the four techniques described below are proposed.

Switched small memory modules: Small memory modules such as SO-DIMMs or micro-DIMMs provide inexpensive, upgradeable, broadband, large-capacity memory on the NIC board. Multiple memory modules allow double buffering.

Multiplexed address restoring: The physical address is multiplexed as a row address and a column address on the memory bus by the north bridge chip. The multiplexed address has to be restored for MEMOnet.

On-chip low latency common memory (LLCM): In spite of their large capacity, switched small memory modules incur bank-switching overhead. On-chip multi-ported memory compensates for this disadvantage: in spite of its small capacity, it allows low latency communication.

Memory capacity masquerading: In order to obtain a large address space for MEMOnet, memory capacity masquerading by the address decoder and the SPD information on the NIC board is effective.

2.2 AOTF with header TLB

Atomic on-the-fly (AOTF) sending is the communication architecture with the lowest overhead, and is realized by means of a header TLB. Figure 1 shows an example of a header TLB. A header TLB associates a packet header seed with the address of a memory access to the memory slot. A header seed is a complete packet header without the lower part of the remote address.

[Figure 1. Structure of a header TLB: the page part of A_kick_addr_P (20 bit) is matched against the tag entries to select a header seed, the offset part of A_kick_addr_P (12 bit) feeds the packet generator, the resulting header goes to the send FIFO, and a miss raises an interrupt to the core CPU.]

Figure 2 shows an overview of atomic on-the-fly sending. When the higher part of the address hits an entry in the header TLB, the packet generator produces a complete packet that can access the remote memory: it inserts the lower part of the address presented on the memory slot and combines the selected header with the data on the memory slot.

[Figure 2. Atomic on-the-fly sending: the host TLB translates A_kick_Addr_V to A_kick_Addr_P; on a header TLB hit the header seed and the data from the host CPU enter the header/data transaction FIFO and go to the network, while a miss is handled through the privileged header translation table in NIC memory by the on-chip CPU.]

In order to create an effective AOTF sending implementation, the following three techniques are proposed (a minimal sketch of the resulting lookup path appears after this list).

Utilizing physical remote addresses: Because the header TLB is located in kernel space, it can hold physical remote addresses. Physical remote addresses do not need to be translated at the receiving end. This technique allows for low latency communication.

Length generation from byte-enabled signals: One-byte access is convenient for flag manipulation and saves space for flags on the small-capacity LLCM. An 8-byte-wide DIMM slot has 8-bit byte-enable signals. Length generation logic derives a variable payload size for AOTF sending from these signals.

Credit-based flow control: Because current north bridge chips have no signal with which to stall a DRAM access, a special flow control for the FIFO transactions of the AOTF sending logic is needed. Checking for FIFO overflow by explicit accesses would add high latency; credit-based flow control reduces this flow control overhead.
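To make the AOTF path concrete, the following C sketch models the behaviour described above: a header TLB indexed by the upper bits of the kick address, a header seed completed with the lower address bits, and a payload length derived from the byte-enable signals. This is a minimal illustration under our own assumptions; the structure names, table size and field positions are ours and do not reflect the actual Martini register layout.

```c
#include <stdint.h>
#include <stdbool.h>

#define HTLB_ENTRIES 16           /* illustrative table depth, not the real one */

typedef struct {
    bool     valid;
    uint32_t page_tag;            /* upper (page) part of A_kick_addr_P */
    uint8_t  header_seed[32];     /* complete header except the low remote-address bits */
} htlb_entry_t;

typedef struct {
    uint8_t header[32];
    uint8_t payload[8];
    uint8_t length;               /* derived from byte enables */
} aotf_packet_t;

/* Payload length in bytes, derived from the 8-bit byte-enable mask
   of one 64-bit DIMM transfer (population count). */
static uint8_t length_from_byte_enables(uint8_t be_mask)
{
    uint8_t len = 0;
    while (be_mask) { len += be_mask & 1u; be_mask >>= 1u; }
    return len;
}

/* Model of AOTF sending: a write to a kick address either hits the header
   TLB and yields a complete packet, or is a miss handled by the on-chip CPU. */
static bool aotf_send(const htlb_entry_t htlb[HTLB_ENTRIES],
                      uint32_t kick_page, uint32_t kick_offset,
                      const uint8_t data[8], uint8_t be_mask,
                      aotf_packet_t *out)
{
    const htlb_entry_t *e = &htlb[kick_page % HTLB_ENTRIES];
    if (!e->valid || e->page_tag != kick_page)
        return false;                             /* miss: interrupt to the NIC CPU */

    for (int i = 0; i < 32; i++)                  /* copy the header seed ...        */
        out->header[i] = e->header_seed[i];
    out->header[28] = (uint8_t)(kick_offset & 0xff);        /* ... and insert the low */
    out->header[29] = (uint8_t)((kick_offset >> 8) & 0x0f); /* remote-address bits
                                                                (positions illustrative) */

    out->length = length_from_byte_enables(be_mask);
    for (int i = 0; i < 8; i++)
        out->payload[i] = data[i];
    return true;                                  /* packet goes to the send FIFO */
}
```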

2.3 On-the-fly receiving

On-the-fly receiving (OTFR) is a shortcut receiving method that avoids DMA controller setup and address translation. It is normally used for the short packets produced by AOTF sending. In order to implement effective on-the-fly receiving, the two techniques described below have been designed.

Flag in header specifying direct forwarding: The header TLB protects the header from user modification. A flag in such a protected header allows a packet to be forwarded directly to the shortcut logic, giving low latency receiving at minimal hardware cost.

Selective address representation: Selective representation of remote addresses in headers allows low latency when physical addresses are used and rich flexibility in the other cases.

2.4 BOTF with protection stampable window

Block on-the-fly (BOTF) sending is a low-overhead, high-bandwidth communication architecture realized by means of protection stampable window memory units. It achieves higher bandwidth than AOTF sending. Figure 3 shows an overview of block on-the-fly sending.

Window memory is the memory used to hold the data from each context on the host. Each window memory is mapped into user space to allow low-overhead user-level communication. The host CPU moves the data for the packets to be sent from its registers, cache or main memory into the window memory mapped into user space. When the kick address is accessed by the host CPU, the BOTF sending controller builds a packet and sends it to the network. In this way, even if a context switch occurs, words written by one process to window memory cannot be overwritten by other processes. In the packet-making stage, the BOTF sending controller stamps the protection information associated with the kick address. In addition, it checks the physical access enable flag, which may only be set by the AOTF sending controller, because the window can be written freely by the process that owns it; if that flag in a word in window memory has been set by a user process, the communication request is aborted.

[Figure 3. Block on-the-fly sending: combined block data (header seeds and data) assembled by the host CPU's write buffer in window memory, passed through a protection stamp & check stage with the process group ID table and window occupied flags, and into the send FIFO.]

In order to create an effective BOTF sending implementation, two techniques are proposed; the user-level send sequence they support is sketched after this list.

Using multi-window memory: If a user can use only a single window memory, the effective bandwidth of BOTF sending is cut in half. Using two window memories allows writing to one window while the other is being sent to the network. Multi-window memory also helps to reduce the status checking overhead, together with the technique described below.

Multi status bits prefetch: Because window status checking accesses an uncached area, its cost is not insignificant. Prefetching the status bits of multiple windows reduces the status checking overhead and increases the effective bandwidth of BOTF sending.
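The following C fragment sketches the user-level BOTF send sequence implied by the description above: the host process copies a header seed and payload into the memory-mapped window, checks the window's status bit, and then writes to the kick address to trigger packet generation. All names (botf_window_t, botf_kick, window_status, and the field layout) are hypothetical placeholders for illustration, not the DIMMnet-1 driver API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of one user-mapped BOTF window (illustration only). */
typedef struct {
    volatile uint64_t length;      /*  8 B: payload length                   */
    volatile uint8_t  header[32];  /* 32 B: header seed prepared by the user */
    volatile uint8_t  payload[8];  /*  8 B: data words                       */
} botf_window_t;

/* Assumed memory-mapped control locations (addresses are placeholders). */
extern volatile uint64_t *botf_kick;       /* write here to launch the packet  */
extern volatile uint32_t *window_status;   /* uncached; bit i = window i busy  */

/* Send one small message through window `w` (0 or 1).  Alternating between
   two windows lets the CPU fill one while the NIC drains the other. */
static void botf_send(botf_window_t win[2], int w,
                      const uint8_t header_seed[32],
                      const uint8_t data[8], uint64_t len)
{
    /* Reading the status bits of both windows at once would amortize this
       uncached access (the "multi status bits prefetch" technique). */
    while (*window_status & (1u << w))
        ;                                  /* wait until window w is free */

    win[w].length = len;
    memcpy((void *)win[w].header,  header_seed, 32);
    memcpy((void *)win[w].payload, data,         8);

    /* The write-combining buffer flushes the block; the kick write makes the
       BOTF controller stamp protection info and push the packet to the FIFO. */
    *botf_kick = (uint64_t)w;
}
```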
3 Prototype

DIMMnet-1 is a prototype NIC based on the architecture described above.

3.1 Switch

The four types of switches for DIMMnet-1 listed below have been implemented.

1. Electrical version of the RHiNET2 switch
2. Optical version of the RHiNET2 switch
3. Optical RHiNET3 switch
4. Optical IP based switch with an electrical port

3.2 NIC

DIMMnet-1 is a prototype NIC based on the MEMOnet architecture, in which the NIC is plugged into a PC133 DIMM slot. Figure 4 shows the basic structure of DIMMnet-1, and Table 1 gives its specifications.

[Figure 4. Basic structure of DIMMnet-1: two SO-DIMM (SDRAM) modules connected through FET switches (FET-SW1 to FET-SW4) between the 168-pin DIMM interface and the Martini LSI with its link interface; the two PEMM signals /MWAIT and /MIRQ go to the host.]

Table 1. Basic specifications of DIMMnet-1
  Host interface: PEMM [8] or DIMM
  Common memory on NIC: PC133 SO-DIMM x 2
  SO-DIMM capacity / slot: 64 MB to 512 MB
  Low Latency Common Memory (LLCM) capacity: 128 KB (on-chip)
  Instruction SRAM capacity: 128 KB (on-chip)
  Data SRAM capacity: 128 KB (on-chip)
  On-chip processor: R3000-like 32-bit RISC
  ASIC link port: 12-pair LVDS
  Common memory bandwidth: 1064 MB/s (for network), 1064 MB/s (for host)
  Send hardware latency: 135 ns (DIMM to link I/F)
  Receive hardware latency: 68 ns (link I/F to LLCM)
  ASIC technology: 0.14 µm CMOS embedded array by Hitachi
  Supported chipsets: Pro133A, Pro266 (Pentium-3); P4X266, P4M266 (Pentium-4); KT133A (Athlon, Athlon XP)

Three types of DIMMnet-1 have been designed, as listed below. The first and the second prototypes have already been implemented.

1. Electrical version of DIMMnet-1, for switches 1 and 4
2. Optical version of DIMMnet-1, for switch 2
3. Optical version of DIMMnet-1, for switch 3

3.3 Martini: communication controller ASIC

Martini is a dedicated LSI used not only for DIMMnet-1 but also for RHiNET2/NI and RHiNET3/NI, which are PCI type NICs. Martini supports AOTF sending, BOTF sending, on-the-fly receiving and hardware-based remote DMA communication, and it implements all of the techniques mentioned in the previous section. There are two versions of Martini; the second version operates at a higher frequency than the first.

4 Performance

4.1 Remote write latency by AOTF

The assumed latency from the execution of a move instruction at the initiator node to the beginning of the write operation to the DIMM is 10 DIMM clock cycles (75 ns when the DIMM frequency is 133 MHz). The minimum AOTF sending latency of the NIC's low speed part is estimated at 18 clock cycles (135 ns at 133 MHz) by a cycle accurate simulator. The measured latency between the NIC's high speed part and low speed part is 151 ns at 250 MHz on the first optical prototype; the assumed latency between the high speed part and the low speed part is 75 ns at 500 MHz. The minimum reception latency of the NIC's low speed part for AOTF is estimated at 9 clock cycles (68 ns) by the cycle accurate simulator. Therefore, the remote write latency of DIMMnet-1 is about 353 ns when the DIMM frequency is 133 MHz and the high speed part frequency is 500 MHz, as summed up below. The remote write latency of Dolphin's latest PCI-SCI (D330) is 1,460 ns, so the remote write latency of DIMMnet-1 by AOTF is 4.1 times lower than that of the D330.
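Collecting the component latencies quoted above, the 353 ns figure is the simple sum of the four stages (the symbols below are introduced here only for this summary and do not appear in the original):

\[
T_{\mathrm{remote\ write}} \approx T_{\mathrm{CPU \to DIMM}} + T_{\mathrm{send}} + T_{\mathrm{hi/lo\ link}} + T_{\mathrm{recv}}
= 75\,\mathrm{ns} + 135\,\mathrm{ns} + 75\,\mathrm{ns} + 68\,\mathrm{ns} = 353\,\mathrm{ns}.
\]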

4.2 Remote write latency by BOTF

The assumed latency from the execution of a move instruction at the initiator node to the beginning of the write operation to the DIMM is 10 DIMM clock cycles (75 ns when the DIMM frequency is 133 MHz). The assumed number of memory bus cycles for 54 bytes of data (length = 8 B, header = 32 B, payload = 8 B, kick = 8 B) copied from the CPU to the window memory on DIMMnet-1 with a write-combining attribute is 10 DIMM clock cycles (75 ns). The latency increases by n DIMM interface clocks for sending 8 × n bytes of packet payload when using a high-performance CPU with the write-combining mechanism. The minimum BOTF sending latency of the NIC's low speed part is estimated at 21 clock cycles (158 ns at 133 MHz) by the cycle accurate simulator. The measured latency between the NIC's high speed part and low speed part is 151 ns at 250 MHz on the first optical prototype; the assumed latency between the high speed part and the low speed part is 75 ns at 500 MHz. The minimum reception latency of the NIC's low speed part for BOTF is estimated at 19 clock cycles (143 ns) by the cycle accurate simulator; this is larger than that for AOTF because of the additional overhead for address translation and the DMA operation. Therefore, the remote write latency of DIMMnet-1 by BOTF is about 441 ns when the DIMM frequency is 133 MHz and the high speed part frequency is 500 MHz.

4.3 Round-trip time by AOTF

The round-trip time includes some additional software overheads. The measured round-trip time of DIMMnet-1 is 1,875 ns, whereas the round-trip time of Myrinet2000 with GM is 14,000 ns. The measured round-trip time of the incompletely tuned DIMMnet-1 is therefore 7.5 times shorter than that of Myrinet2000.

4.4 Round-trip time by BOTF

The round-trip time for a ping-pong operation includes some additional software overheads. The measured round-trip time of the first optical DIMMnet-1 is 4,271 ns when the DIMM frequency is 66 MHz and the link frequency is 250 MHz, on an 850 MHz Pentium-3-based PC. A faster CPU, a higher DIMM frequency and a higher link frequency will allow shorter round-trip times.

4.5 Barrier synchronization time

On DIMMnet-1, AOTF sending and OTF receiving to the LLCM (on-chip Low Latency Common Memory) are recommended for implementing barrier synchronization, as this is the fastest DIMMnet-1 communication method. The simplest implementation of barrier synchronization among 8 processes is shown in Figure 5, and a host-side sketch of the same algorithm follows below. Each child process executes a remote write (PUSH) of a byte of data to the LLCM of a home node with AOTF. After initialization, each child process can request a synchronization packet simply by writing data to the special address called the AOTF kick address. The home node process polls the LLCM until all of the bytes have been written with the specified value. When the home node process detects that all child processes have updated their bytes, it executes a PUSH of a byte to the LLCM of every child process. The change of that byte in the LLCM is in turn polled by the child processes; when each child detects the change, the barrier synchronization phase is complete.

[Figure 5. Eight-node barrier synchronization by DIMMnet-1: each child (child1 to child7) writes its byte to the home node's LLCM by AOTF, the home node polls all 8 bytes and pushes a byte back to each child's LLCM, and the children poll their LLCMs before starting the next phase.]

The measured barrier synchronization time between two nodes is 2,075 ns. The barrier synchronization time of the Hitachi SR8000 supercomputer with its synchronization hardware is about 8,000 ns. Therefore, the barrier synchronization time of DIMMnet-1 is 4 times shorter than that of the SR8000.
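The following C sketch spells out the home/child roles described above. It is a minimal model of the algorithm under our own assumptions, not DIMMnet-1 driver code: aotf_push_byte() stands in for an AOTF remote write of one byte to a node's LLCM, llcm[] for the local on-chip common memory, and all names and offsets are hypothetical.

```c
#include <stdint.h>

#define NUM_NODES 8                /* home node 0 plus children 1..7 */

/* Local view of the on-chip Low Latency Common Memory (LLCM).
   Bytes 0..NUM_NODES-1 are the flags used by the barrier. */
extern volatile uint8_t llcm[];

/* Placeholder for an AOTF remote write of one byte: writing to the AOTF
   kick address makes the NIC emit a packet that stores `value` at
   `remote_offset` in the LLCM of `node`. */
void aotf_push_byte(int node, uint32_t remote_offset, uint8_t value);

/* Child `my_id` (1..7): announce arrival at the home node, then wait
   for the release byte pushed back by the home node. */
void barrier_child(int home_node, int my_id, uint8_t phase)
{
    aotf_push_byte(home_node, my_id, phase);   /* PUSH arrival flag          */
    while (llcm[0] != phase)                   /* poll own LLCM for release  */
        ;
}

/* Home node (node 0): poll until every child has written this phase's
   value, then PUSH the release byte to every child's LLCM. */
void barrier_home(uint8_t phase)
{
    for (int c = 1; c < NUM_NODES; c++)        /* wait for all arrivals      */
        while (llcm[c] != phase)
            ;
    for (int c = 1; c < NUM_NODES; c++)        /* release all children       */
        aotf_push_byte(c, 0, phase);
}
```

Using the phase value itself as the flag avoids clearing the LLCM bytes between consecutive barrier phases, which matches the "data for next phase" step in Figure 5.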
4.6 Reduction operations

Barrier synchronization is a kind of reduce-with-scatter operation. The fast barrier synchronization described above is implemented in software on the host, with no assistance from the CPU on the NIC. Therefore, other reduction operations that are difficult for the NIC to process [10] (for example, the global sum of floating-point numbers) will also be fast when performed with DIMMnet-1.

The measured time for a floating-point sum operation between two nodes, returned by an incompletely tuned DIMMnet-1 and host PC, is only 1,903 ns. A k-ary tree structured algorithm, sketched below, would achieve high-speed reduction operations on large scale PC clusters. Therefore, many applications that have not been accelerated on large scale parallel platforms because of reduction operations would appear to benefit from acceleration by PC clusters with DIMMnet-1.
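As an illustration of how the two-node floating-point sum generalizes, the sketch below arranges the nodes in a k-ary tree: each node sums the partial results pushed up by its children, using the same AOTF write-and-poll pattern as the barrier, and forwards its own partial sum to its parent. This is our own schematic reading of the k-ary tree idea; push_double() and wait_double() are hypothetical wrappers, not a DIMMnet-1 library.

```c
#include <stdint.h>

#define K 4                        /* tree fan-out (arity), chosen arbitrarily */

/* Placeholders for the communication primitives sketched earlier:
   send one double to a node's LLCM, and wait until a child's slot arrives. */
void   push_double(int node, int slot, double value);
double wait_double(int slot);

/* One step of a k-ary tree reduction of a local partial sum.
   Nodes are numbered 0..n-1; node 0 is the root and ends up with the total. */
double tree_sum(int my_id, int n, double local)
{
    double sum = local;

    /* Accumulate the partial sums of my children (my_id*K+1 .. my_id*K+K). */
    for (int i = 1; i <= K; i++) {
        int child = my_id * K + i;
        if (child < n)
            sum += wait_double(i);         /* poll LLCM slot i for that child */
    }

    /* Forward my partial sum to my parent (unless I am the root). */
    if (my_id != 0) {
        int parent  = (my_id - 1) / K;
        int my_slot = (my_id - 1) % K + 1; /* which child slot I occupy       */
        push_double(parent, my_slot, sum);
    }
    return sum;                            /* only the root holds the total   */
}
```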

5 Conclusion

A high performance network interface for PC clusters called DIMMnet-1, which can be plugged directly into the DIMM slot of a PC, has been presented. Low latency communication by AOTF sending and by BOTF sending has been evaluated on an incompletely tuned DIMMnet-1 prototype. The round-trip time by AOTF on the incompletely tuned DIMMnet-1 is 7.5 times shorter than that of Myrinet2000. The barrier synchronization time by AOTF is 4 times shorter than that of the SR8000 supercomputer with its hardware barrier mechanism. The floating-point reduction operation time between two nodes is only 1,903 ns.

Many applications that have not been accelerated on large scale parallel platforms because of frequent fine-grained communications or collective operations, such as reduction operations and barrier synchronization, appear likely to be accelerated by PC clusters with DIMMnet-1. Reduction operations have been inefficient not only on PC clusters with other NICs, but also on shared memory multiprocessors [11]. Some applications in electrical engineering have these characteristics; for example, circuit simulation on large PC clusters requires fine grained communication, caused by the random sparse matrix structure, and reduction operations, caused by pivot selection and convergence testing.

We are currently evaluating the second version of the Martini-based DIMMnet-1. In addition, we are building a software environment for PC clusters connected by DIMMnet-1 and its switches. Development of a message passing library, development of a parallelizing compiler and evaluation with applications are planned.

Acknowledgment

The authors would like to express their sincere gratitude to Hiroaki Nishi, Hitoshi Suda, Akihiro Mitsuhashi, Toshiaki Uejima, Hidetoshi Kinno, Hiroaki Terakawa, Kouzou Oosugi, Hidenobu Iwata, Hayato Yamamoto, Yoshimasa Kashiwabara, Toshiteru Keikouin, Jun-ichiro Tsuchiya, Kounosuke Watanabe and all the others who cooperated in the development of the Martini LSI and DIMMnet-1. This work is supported by the Real World Computing Partnership (RWCP).

References

[1] Myricom Corp., "GM API Performance with PCI64B and PCI64C Myrinet/PCI Interfaces", June 2002.

[2] Dolphin Corp., "PCI SCI-64 Adapter Card", pci-sci64.htm.

[3] Dolphin Corp., "PCI-SCI Adapter Card D320/D321 Functional Overview", Part no. D, Nov. 1999.

[4] InfiniBand Trade Association.

[5] N. Tanabe, J. Yamamoto, H. Nishi, T. Kudoh, Y. Hamada, H. Nakajo and H. Amano, "MEMOnet: Network interface plugged into a memory slot", IEEE International Conference on Cluster Computing (CLUSTER2000), Nov. 2000.

[6] N. Tanabe, J. Yamamoto, H. Nishi, T. Kudoh, Y. Hamada, H. Nakajo and H. Amano, "On-the-fly sending: a low latency high bandwidth message transfer mechanism", International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN2000), Dec. 2000.
[7] N. Tanabe, J. Yamamoto, H. Nishi, T. Kudoh, Y. Hamada, H. Nakajo and H. Amano, "Low Latency High Bandwidth Message Transfer Mechanisms for a Network Interface Plugged into a Memory Slot", Cluster Computing Journal, Vol. 5, No. 1, Jan. 2002, pp. 7-17.

[8] Standard of Electronic Industries Association of Japan, "Processor Enhanced Memory Module (PEMM): Standard for Processor Enhanced Memory Module Functional Specifications", EIAJ ED-5514, Jul. 1998.

[9] O. Tatebe, U. Nagashima, S. Sekiguchi, H. Kitabayashi and Y. Hayashida, "Design and implementation of FMPL, a fast message-passing library for remote memory operations", Proceedings of the Conference on High Performance Networking and Computing (SC2001), Nov. 2001.

[10] D. Buntinas, D. K. Panda and P. Sadayappan, "Performance Benefits of NIC-Based Barrier on Myrinet/GM", Proceedings of the Workshop on Communication Architecture for Clusters (CAC) with IPDPS'01, Apr. 2001.

[11] M. J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, L. Rauchwerger and J. Torrellas, "Architectural support for parallel reductions in scalable shared-memory multiprocessors", Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'01), Sep. 2001.
