Low Latency Communication on DIMMnet-1 Network Interface Plugged into a DIMM Slot
Noboru Tanabe (Toshiba Corporation) noboru.tanabe@toshiba.co.jp
Hideki Imashiro (Hitachi Information Technology Co., Ltd.) himashi@hitachi-it.co.jp
Tomohiro Kudoh (National Institute of Advanced Industrial Science and Technology) t.kudoh@aist.go.jp
Yoshihiro Hamada, Hironori Nakajo (Tokyo University of Agriculture and Technology) hamada@nj.cs.tuat.ac.jp, nakajo@cc.tuat.ac.jp
Junji Yamamoto (Hitachi, Ltd.) junji-y@crl.hitachi.co.jp
Hideharu Amano (Keio University) hunga@am.ics.keio.ac.jp

Pentium is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Other product and company names mentioned herein may be trademarks and/or service marks of their respective owners.

Abstract

DIMMnet-1 is a high performance network interface for PC clusters that can be plugged directly into the DIMM slot of a PC. By combining low latency AOTF (Atomic On-The-Fly) sending with high bandwidth BOTF (Block On-The-Fly) sending, it overcomes the overhead imposed by standard I/O buses such as PCI. Two types of DIMMnet-1 prototype boards (providing optical and electrical network interfaces), each containing a Martini network interface controller chip, are currently available. They can be plugged into a 100 MHz DIMM slot of a PC with a Pentium-3, Pentium-4 or Athlon processor. The round-trip time with AOTF on this incompletely tuned DIMMnet-1 is 7.5 times shorter than that of Myrinet2000, and the barrier synchronization time with AOTF is 4 times shorter than that of an SR8000 supercomputer. The inter-two-node floating-point sum operation time is 1903 ns. These results show that DIMMnet-1 holds promise for applications in which scalable performance is difficult to obtain with traditional approaches because of frequent data exchange.

1 Introduction

Many high performance PC clusters use system area networks such as Myrinet for interconnection. Myrinet2000,
based on 64-bit 66 MHz PCI with a 2 Gbps electrical or optical link, is well known. Its sustained one-way bandwidth is 245 MB/s and its short-message latency is 7 µs [1]. Dolphin's PCI-SCI D330, based on 64-bit 66 MHz PCI with a 667 MB/s link, also has a good reputation. Its sustained one-way bandwidth is 200 MB/s, and its remote write latency is 1.46 µs [2]. However, most current PCs have only 32-bit 33 MHz conventional PCI slots. If a Myrinet or PCI-SCI card were plugged into a conventional PCI slot, its performance would be degraded. In addition, because the PCI bus is half duplex, the bandwidth of a PCI-based NIC (network interface card) degrades when sending and receiving are executed simultaneously. On the other hand, communication bandwidths of over 1 GB/s per NIC are becoming realistic through optical interconnection. However, because of PCI bandwidth limitations, bus handling overheads, and protocol software overheads, the potential of optical links cannot be fully realized on a PCI NIC. In order to overcome these problems, we have proposed plugging high performance NICs into a memory slot on the PC's motherboard, bringing high performance to low cost PCs. We call this class of NIC MEMOnet, and a NIC plugged into a DIMM slot DIMMnet. We are currently developing DIMMnet-1, a prototype DIMMnet NIC. Here, the architecture, implementation and performance evaluation of DIMMnet-1 are described.
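For scale, the gap between the two PCI variants mentioned above can be checked with a back-of-envelope calculation using the nominal bus widths and clock rates; the helper function here is ours, for illustration only.

```c
/* Nominal peak PCI bandwidth = bus width (in bytes) x bus clock.
 * Sustained bandwidth is lower still because of arbitration and
 * protocol overhead; this only shows the raw gap between slot types. */
static long peak_mb_per_s(int width_bits, int clock_mhz)
{
    return (long)(width_bits / 8) * clock_mhz;   /* in MB/s */
}

/* 32-bit 33 MHz conventional slot: 4 bytes x 33 MHz = 132 MB/s peak.
 * 64-bit 66 MHz slot:              8 bytes x 66 MHz = 528 MB/s peak. */
```

Even the nominal 132 MB/s of a conventional slot is far below the 245 MB/s that Myrinet2000 sustains on 64-bit 66 MHz PCI, which is why a conventional slot degrades such NICs.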
2 Architecture

2.1 MEMOnet

MEMOnet is a class of NIC proposed by the authors: a NIC plugged into a memory slot. A NIC plugged into the DIMM slot is called DIMMnet. In order to realize an effective DIMMnet implementation, the four techniques described below are proposed.

Switched small memory modules: Small memory modules such as SO-DIMMs or micro-DIMMs provide inexpensive, upgradeable, broadband, large-capacity memory on a NIC board. Multiple memory modules allow double buffering.

Multiplexed address restoring: The physical address is multiplexed as a row address and a column address on the memory bus by the north bridge chip. The multiplexed address has to be restored for MEMOnet.

On-chip low latency common memory (LLCM): Despite their large capacity, switched small memory modules incur bank-switching overhead. On-chip multi-ported memory compensates for this disadvantage: despite its small capacity, it allows low latency communication.

Memory capacity masquerading: In order to obtain a large address space for MEMOnet, memory capacity masquerading by the address decoder and the SPD information on the NIC board is effective.

2.2 AOTF with header TLB

Atomic on-the-fly (AOTF) sending is the communication architecture with the lowest overhead, and is realized by means of a header TLB. Figure 1 shows an example of a header TLB. A header TLB associates a packet header seed with the address of a memory access to the memory slot. A header seed is a complete packet header without the lower part of the remote address.
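As a software sketch (not the actual RTL), the header-TLB association can be modeled as a small fully associative table. The four-entry size and the 20-bit page / 12-bit offset split follow Figure 1; the structure and function names are ours.

```c
#include <stdint.h>

#define HTLB_ENTRIES 4

/* One header-TLB entry: a 20-bit physical page tag and the header
 * seed (a complete packet header minus the low address bits). */
struct htlb_entry {
    int      valid;
    uint32_t page_tag;     /* upper 20 bits of the kick address */
    uint64_t header_seed;  /* illustrative 64-bit seed */
};

static struct htlb_entry htlb[HTLB_ENTRIES];

/* On a memory write to the kick area, split the 32-bit physical
 * address into page (20 bits) and offset (12 bits), match the page
 * against all entries (done in parallel in hardware, as a loop here),
 * and on a hit combine the seed with the offset to form the packet
 * header.  A miss would raise an interrupt to the on-chip CPU; this
 * model just returns -1. */
int aotf_lookup(uint32_t addr, uint64_t *header_out)
{
    uint32_t page   = addr >> 12;      /* 20-bit page part */
    uint32_t offset = addr & 0xFFFu;   /* 12-bit offset part */
    for (int i = 0; i < HTLB_ENTRIES; i++) {
        if (htlb[i].valid && htlb[i].page_tag == page) {
            *header_out = htlb[i].header_seed | offset;
            return i;                  /* hit: entry number */
        }
    }
    return -1;                         /* miss: interrupt core CPU */
}
```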
[Figure 1. Structure of a header TLB: the 20-bit page part of A_kick_addr_P is compared against four tag entries; on a hit the matching header seed goes to the packet generator, on a miss an interrupt is raised to the core CPU]

Figure 2 shows an overview of atomic on-the-fly sending. When the higher part of the address hits an entry of the header TLB, the packet generator produces a complete packet that can access the remote memory: it inserts the lower part of the address presented on the memory slot, and combines the header seed with the data on the memory slot to form the packet.

[Figure 2. Atomic on-the-fly sending: the host-side TLB translates A_kick_Addr_V to A_kick_Addr_P; on the NIC LSI, a header TLB hit combines the header seed with the data from the host CPU into the header transaction FIFO, while a miss is handled by the on-chip CPU]

In order to create an effective AOTF sending implementation, the following three techniques are proposed.

Utilizing physical remote addresses: Because the header TLB is located in kernel space, it can hold physical remote addresses. Physical remote addresses do not need to be translated at the receiving end. This allows for low latency communication.

Length generation from byte-enabled signals:
One-byte access is convenient for flag manipulation, and saves space for flags in the small-capacity LLCM. An 8-byte-wide DIMM slot has 8 byte-enable signals. Length generation logic derives a variable payload size for AOTF sending from these signals.

Credit-based flow control: Because current north bridge chips have no signal for making the CPU wait on a DRAM access, special flow control is needed for the FIFO transactions of the AOTF sending logic. Explicitly checking for FIFO overflow incurs high latency; credit-based flow control reduces this flow control overhead.

2.3 On-the-fly receiving

On-the-fly receiving (OTFR) is a shortcut receiving method that requires neither DMA controller setup nor address translation. It is usually used for short packets produced by AOTF sending. In order to implement effective on-the-fly receiving, the two techniques described below have been designed.

Flag in header specifying direct forwarding: The header TLB protects the header from user modification. A flag in such a protected header allows a packet to be forwarded rapidly to shortcut logic for low latency receiving, at minimal hardware cost.

Selective address representation: Selective representation of remote addresses in headers allows low latency when physical addresses are used, and rich flexibility in other cases.

2.4 BOTF with protection stampable window

Block on-the-fly (BOTF) sending is a low-overhead, high-bandwidth communication architecture realized by means of protection stampable window memory units. It achieves higher bandwidth than AOTF sending. Figure 3 shows an overview of block on-the-fly sending. Window memory holds the data from each context on the host. Each window memory is mapped into user space to allow low-overhead user-level communication. The high speed host CPU moves the data for a packet from registers, cache or main memory to the window memory mapped into user space.
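The credit-based flow control described in Section 2.2 can be modeled as a simple counter: the sender starts with one credit per free FIFO slot, so it never has to read back the FIFO occupancy. The FIFO depth and the function names here are illustrative, not taken from Martini.

```c
#include <assert.h>

#define FIFO_DEPTH 16   /* illustrative depth, not the Martini value */

static int credits = FIFO_DEPTH;

/* Sender side: a transaction may be issued only if a credit is
 * available, guaranteeing the FIFO can never overflow without any
 * high-latency occupancy check. */
int try_send(void)
{
    if (credits == 0)
        return 0;        /* would overflow the FIFO: hold the request */
    credits--;
    return 1;
}

/* Receiver side: each drained FIFO entry returns one credit. */
void credit_return(void)
{
    credits++;
    assert(credits <= FIFO_DEPTH);
}
```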
When the kick address is accessed by the host CPU, the BOTF sending controller builds a packet and sends it to the network. In the packet-building stage, the BOTF sending controller stamps the protection information associated with the kick address, so that even if a context switch has occurred, words written to window memory by one process cannot be sent on behalf of another process. In addition, the controller checks the physical-access-enable flag, which may only be set by the AOTF sending controller, because the window can be written freely by the process that owns it. When this flag in a word of window memory has been set by a user process, the communication request is aborted.

[Figure 3. Block on-the-fly sending: header seeds and data are combined by the host CPU's write buffer into window memories a and b; the Process Group ID table and the protection stamp and check stage guard the path to the send FIFO]

In order to create an effective BOTF sending implementation, two techniques are proposed.

Using multi-window memory: If a user can use only a single window memory, the effective bandwidth of BOTF sending is cut in half. Two window memories allow writing to one window while simultaneously sending from the other to the network. Multi-window memory also helps to reduce the status checking overhead, in combination with the technique described below.

Multi status bits prefetch: Because window status checking accesses an uncached area, its cost is not insignificant. Prefetching the status bits of multiple windows reduces the status checking overhead and increases the effective bandwidth of BOTF sending.

3 Prototype

DIMMnet-1 is a prototype NIC based on the architecture described above.

3.1 Switch

The four types of switches for DIMMnet-1 described below have been implemented.

1. Electrical version RHiNET2 switch
2. Optical version RHiNET2 switch
3. Optical RHiNET3 switch
4. Optical IP-based switch with an electrical port

3.2 NIC

DIMMnet-1 is a prototype NIC based on the MEMOnet architecture, in which the NIC is plugged into a PC133 DIMM slot. Figure 4 shows the basic structure of DIMMnet-1; Table 1 shows its specifications.

[Figure 4. Basic structure of DIMMnet-1: the Martini LSI behind a 168-pin DIMM interface, two SO-DIMMs (SDRAM) attached through FET switches, a link interface, and the /MWAIT and /MIRQ (PEMM) signals]

Table 1. Basic specifications of DIMMnet-1

  Host interface: PEMM [8] or DIMM
  Common memory on NIC: PC133 SO-DIMM x 2
  SO-DIMM capacity / slot: 64 MB to 512 MB
  Capacity of Low Latency Common Memory (LLCM): 128 KB (on-chip)
  Instruction SRAM capacity: 128 KB (on-chip)
  Data SRAM capacity: 128 KB (on-chip)
  On-chip processor: R3000-like 32-bit RISC
  ASIC's link port: 12-pair LVDS
  Common memory bandwidth: 1064 MB/s (for network), 1064 MB/s (for host)
  Send hardware latency: 135 ns (DIMM to link I/F)
  Receive hardware latency: 68 ns (link I/F to LLCM)
  Technology of the ASIC: 0.14 µm CMOS embedded array by Hitachi
  Supported chipsets: Pro133A, Pro266 (Pentium-3); P4X266, P4M266 (Pentium-4); KT133A (Athlon, Athlon XP)

Three types of DIMMnet-1 have been designed, as described below. The first and the second prototypes have already been implemented.

1. Electrical version DIMMnet-1, for switches 1 and 4
2. Optical version DIMMnet-1, for switch 2
3. Optical version DIMMnet-1, for switch 3

3.3 Martini: communication controller ASIC

Martini is a dedicated LSI used not only in DIMMnet-1, but also in RHiNET2/NI and RHiNET3/NI, which are PCI-type NICs. Martini supports AOTF sending, BOTF sending, on-the-fly receiving and hardware-based remote DMA communication. It implements all of the techniques mentioned in the previous section. There are two versions of Martini; the second version operates at a higher frequency than the first.
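As an illustration of the BOTF sending sequence of Section 2.4, the host side can be modeled as writing a combined block (length, header seed, payload) into the mapped window and then touching the kick address. The window and the kick trigger are simulated with plain arrays here; on real hardware the window is a user-space mapping of NIC memory, the stores would be write-combined, and the kick is a store to a special address decoded by the Martini LSI. All names are ours.

```c
#include <stdint.h>
#include <string.h>

#define WINDOW_BYTES 256

/* Software model of one BOTF window. */
struct botf_window {
    uint8_t mem[WINDOW_BYTES];  /* mapped window memory */
    int     sent_bytes;         /* set by the simulated controller */
};

/* Host side: copy the length word, the header seed and the payload
 * into the window, at the offsets used by the 4.2 example layout
 * (length = 8 B, header = 32 B, payload follows). */
void botf_fill(struct botf_window *w, uint64_t len, const uint64_t seed[4],
               const uint8_t *payload, int nbytes)
{
    memcpy(w->mem, &len, 8);         /* length  = 8 B  */
    memcpy(w->mem + 8, seed, 32);    /* header  = 32 B */
    memcpy(w->mem + 40, payload, (size_t)nbytes);
}

/* "Kick": in hardware, a store to the kick address makes the
 * controller stamp protection information and emit the packet.
 * This model just records how much payload was sent. */
void botf_kick(struct botf_window *w)
{
    uint64_t len;
    memcpy(&len, w->mem, 8);
    w->sent_bytes = (int)len;
}
```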
4 Performance

4.1 Remote write latency by AOTF

The assumed latency from the execution of a move instruction at the initiator node to the beginning of the write operation on the DIMM is 10 DIMM clock cycles (75 ns when the DIMM frequency is 133 MHz). The minimum AOTF sending latency of the NIC's low speed part, estimated with a cycle accurate simulator, is 18 clock cycles (135 ns at 133 MHz). The measured latency between the NIC's high speed part and low speed part is 151 ns at 250 MHz on the first optical prototype; at 500 MHz this latency is assumed to be 75 ns. The minimum reception latency of the NIC's low speed part for AOTF, estimated with the cycle accurate simulator, is 9 clock cycles (68 ns). Therefore, the remote write latency of DIMMnet-1 is about 353 ns when the DIMM frequency is 133 MHz and the high speed part frequency is 500 MHz. The remote write latency of Dolphin's latest PCI-SCI (D330) is 1,460 ns, so the remote write latency of DIMMnet-1 with AOTF is 4.1 times lower than that of the D330.

4.2 Remote write latency by BOTF

The assumed latency from the execution of a move instruction at the initiator node to the beginning of the write operation on the DIMM is 10 DIMM clock cycles (75 ns when the DIMM frequency is 133 MHz). The assumed number of memory bus cycles for the 56 bytes of data (length = 8 B, header = 32 B, payload = 8 B, kick = 8 B) copied from the CPU to the window memory on DIMMnet-1 with a write-combining attribute is 10 DIMM clock cycles (75 ns). The
latency is increased by n DIMM interface clocks when sending 8n bytes of packet payload, using a high performance CPU with a write-combining mechanism. The minimum BOTF sending latency of the NIC's low speed part, estimated with a cycle accurate simulator, is 21 clock cycles (158 ns at 133 MHz). The measured latency between the NIC's high speed part and low speed part is 151 ns at 250 MHz on the first optical prototype; at 500 MHz this latency is assumed to be 75 ns. The minimum reception latency of the NIC's low speed part for BOTF, estimated with the cycle accurate simulator, is 19 clock cycles (143 ns); this is larger than for AOTF because of the additional overhead of address translation and the DMA operation. Therefore, the remote write latency of DIMMnet-1 with BOTF is about 441 ns when the DIMM frequency is 133 MHz and the high speed part frequency is 500 MHz.

4.3 Round-trip time by AOTF

The round-trip time includes some software overheads. The measured round-trip time of DIMMnet-1 is 1,875 ns, while the round-trip time of Myrinet2000 with GM is 14,000 ns. The measured round-trip time of the incompletely tuned DIMMnet-1 is thus 7.5 times shorter than that of Myrinet2000.

4.4 Round-trip time by BOTF

The round-trip time of a ping-pong operation includes some software overheads. The measured round-trip time of the first optical DIMMnet-1 is 4,271 ns when the DIMM frequency is 66 MHz and the link frequency is 250 MHz, on an 850 MHz Pentium-3-based PC. A faster CPU, a faster DIMM frequency and a faster link frequency will allow shorter round-trip times.

4.5 Barrier synchronization time

On DIMMnet-1, AOTF sending and OTF receiving to the LLCM (on-chip Low Latency Common Memory) are recommended for realizing barrier synchronization, as this is the fastest DIMMnet-1 communication method.
The simplest implementation of barrier synchronization among 8 child processes is shown in Figure 5. All child processes execute a remote write (PUSH) of one byte to the LLCM of a home node with AOTF. After initialization, a child process can request a synchronization packet simply by writing its byte to the special address called the AOTF kick address. The home node process polls all of the bytes on its LLCM until each has been written with the specified value. When the home node process detects that all child processes have updated their bytes, it executes a PUSH of a byte to the LLCM of every child process. Each child process polls its LLCM for this change of byte data; when the change has been detected by all child processes, the barrier synchronization phase is complete.

[Figure 5. Eight-node barrier synchronization by DIMMnet-1: each child PUSHes a byte to the home node's LLCM with AOTF, the home node polls the 8 bytes, then PUSHes the release byte to each child's LLCM]

The measured barrier synchronization time between two nodes is 2,075 ns. The barrier synchronization time of the Hitachi supercomputer SR8000, which has synchronization hardware, is about 8,000 ns. Therefore, the barrier synchronization time of DIMMnet-1 is 4 times shorter than that of the SR8000.

4.6 Reduction operations

Barrier synchronization is a kind of reduce-with-scatter operation. The fast barrier synchronization described above is implemented in software on the host, with no assistance from the CPU on the NIC. Therefore, other reduction operations that are difficult to process on the NIC [10] (for example, the global sum of floating-point numbers) will also be fast on DIMMnet-1. The measured time for a floating-point sum between two nodes, on an incompletely tuned DIMMnet-1 and host PC, is only 1,903 ns. A K-ary tree structured algorithm will achieve high-speed reduction operations on large scale PC clusters.
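The LLCM-based barrier of Figure 5 can be sketched as a sequential software model in which the two LLCM regions are simulated with ordinary arrays; the function names are ours, and the real system runs the three roles concurrently on different nodes.

```c
#define NCHILD 8

static unsigned char home_llcm[NCHILD];   /* bytes PUSHed by children */
static unsigned char child_llcm[NCHILD];  /* release bytes at children */

/* Child i arrives at the barrier: AOTF PUSH of the phase byte
 * to the home node's LLCM. */
void barrier_arrive(int i, unsigned char phase)
{
    home_llcm[i] = phase;
}

/* Home node: poll all bytes; once every child has arrived,
 * PUSH the release byte to each child's LLCM.  Returns 1 on release. */
int barrier_home_poll(unsigned char phase)
{
    for (int i = 0; i < NCHILD; i++)
        if (home_llcm[i] != phase)
            return 0;                 /* keep polling */
    for (int i = 0; i < NCHILD; i++)
        child_llcm[i] = phase;        /* release PUSH */
    return 1;
}

/* Child side: poll the local LLCM byte until released. */
int barrier_released(int i, unsigned char phase)
{
    return child_llcm[i] == phase;
}
```

Using the phase value as the polled byte lets successive barrier phases reuse the same LLCM bytes without clearing them in between.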
Therefore, many applications which have not been accelerated on large scale parallel platforms because of reduction operations would appear to benefit from
acceleration by PC clusters with DIMMnet-1.

5 Conclusion

A high performance network interface for PC clusters, DIMMnet-1, that can be plugged directly into the DIMM slot of a PC has been presented. Low latency communication by AOTF sending and BOTF sending has been evaluated on an incompletely tuned DIMMnet-1 prototype. The round-trip time by AOTF on this prototype is 7.5 times shorter than that of Myrinet2000. The barrier synchronization time by AOTF is 4 times shorter than that of the SR8000 supercomputer with its hardware barrier mechanism. The floating-point reduction time between two nodes is only 1,903 ns. Many applications that have not been accelerated on large scale parallel platforms, because of frequent fine-grained communication or collective operations such as reductions and barrier synchronization, can be expected to be accelerated by PC clusters with DIMMnet-1. Reduction operations have been inefficient not only on PC clusters with other NICs, but also on shared memory multiprocessors [11]. Some applications in electrical engineering have these characteristics; for example, circuit simulation on large PC clusters requires fine grained communication, caused by the random sparse matrix structure, and reduction operations, caused by pivot selection and convergence testing. We are currently evaluating the second version of the Martini-based DIMMnet-1. In addition, we are building a software environment for PC clusters connected by DIMMnet-1 and its switches; message passing library development, parallelizing compiler development and evaluation with applications are planned.
Acknowledgment

The authors would like to express their sincere gratitude to Hiroaki Nishi, Hitoshi Suda, Akihiro Mitsuhashi, Toshiaki Uejima, Hidetoshi Kinno, Hiroaki Terakawa, Kouzou Oosugi, Hidenobu Iwata, Hayato Yamamoto, Yoshimasa Kashiwabara, Toshiteru Keikouin, Jun-ichiro Tsuchiya, Kounosuke Watanabe and the others who cooperated in the development of the Martini LSI and DIMMnet-1. This work is supported by the Real World Computing Partnership (RWCP).

References

[1] Myricom Corp., "GM API Performance with PCI64B and PCI64C Myrinet/PCI Interfaces," June 2002.
[2] Dolphin Corp., "PCI SCI-64 Adapter Card," pci-sci64.htm.
[3] Dolphin Corp., "PCI-SCI Adapter Card D320/D321 Functional Overview," Part no.: d, Nov. 1999.
[4] InfiniBand Trade Association.
[5] N. Tanabe, J. Yamamoto, H. Nishi, T. Kudoh, Y. Hamada, H. Nakajo, H. Amano, "MEMOnet: Network interface plugged into a memory slot," IEEE International Conference on Cluster Computing (CLUSTER2000), Nov. 2000.
[6] N. Tanabe, J. Yamamoto, H. Nishi, T. Kudoh, Y. Hamada, H. Nakajo, H. Amano, "On-the-fly sending: a low latency high bandwidth message transfer mechanism," International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN2000), Dec. 2000.
[7] N. Tanabe, J. Yamamoto, H. Nishi, T. Kudoh, Y. Hamada, H. Nakajo, H. Amano, "Low Latency High Bandwidth Message Transfer Mechanisms for a network interface plugged into a memory slot," Cluster Computing Journal, Vol. 5, No. 1, Jan. 2002, pp. 7-17.
[8] Standard of Electronic Industries Association of Japan, "Processor Enhanced Memory Module (PEMM): Standard for Processor Enhanced Memory Module Functional Specifications," EIAJ ED-5514, Jul. 1998.
[9] O. Tatebe, U. Nagashima, S. Sekiguchi, H. Kitabayashi, Y. Hayashida, "Design and implementation of FMPL, a fast message-passing library for remote memory operations," Proceedings of the Conference on High Performance Networking and Computing (SC2001), Nov. 2001.
[10] D. Buntinas, D. K. Panda, P. Sadayappan, "Performance Benefits of NIC-Based Barrier on Myrinet/GM," Proceedings of the Workshop on Communication Architecture for Clusters (CAC), with IPDPS '01, Apr. 2001.
[11] M. J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, L. Rauchwerger, J. Torrellas, "Architectural support for parallel reductions in scalable shared-memory multiprocessors," Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT '01), Sep. 2001.
More informationCache Designs and Tricks. Kyle Eli, Chun-Lung Lim
Cache Designs and Tricks Kyle Eli, Chun-Lung Lim Why is cache important? CPUs already perform computations on data faster than the data can be retrieved from main memory and microprocessor execution speeds
More informationPower Reduction Techniques in the Memory System. Typical Memory Hierarchy
Power Reduction Techniques in the Memory System Low Power Design for SoCs ASIC Tutorial Memories.1 Typical Memory Hierarchy On-Chip Components Control edram Datapath RegFile ITLB DTLB Instr Data Cache
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More informationInitial Performance Evaluation of the Cray SeaStar Interconnect
Initial Performance Evaluation of the Cray SeaStar Interconnect Ron Brightwell Kevin Pedretti Keith Underwood Sandia National Laboratories Scalable Computing Systems Department 13 th IEEE Symposium on
More informationLecture 8: Virtual Memory. Today: DRAM innovations, virtual memory (Sections )
Lecture 8: Virtual Memory Today: DRAM innovations, virtual memory (Sections 5.3-5.4) 1 DRAM Technology Trends Improvements in technology (smaller devices) DRAM capacities double every two years, but latency
More informationa) Memory management unit b) CPU c) PCI d) None of the mentioned
1. CPU fetches the instruction from memory according to the value of a) program counter b) status register c) instruction register d) program status word 2. Which one of the following is the address generated
More informationStructure of Computer Systems. advantage of low latency, read and write operations with auto-precharge are recommended.
148 advantage of low latency, read and write operations with auto-precharge are recommended. The MB81E161622 chip is targeted for small-scale systems. For that reason, the output buffer capacity has been
More informationCMSC 313 Lecture 26 DigSim Assignment 3 Cache Memory Virtual Memory + Cache Memory I/O Architecture
CMSC 313 Lecture 26 DigSim Assignment 3 Cache Memory Virtual Memory + Cache Memory I/O Architecture UMBC, CMSC313, Richard Chang CMSC 313, Computer Organization & Assembly Language Programming
More informationAddendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches
Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com
More informationTopic & Scope. Content: The course gives
Topic & Scope Content: The course gives an overview of network processor cards (architectures and use) an introduction of how to program Intel IXP network processors some ideas of how to use network processors
More information10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G
10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G Mohammad J. Rashti and Ahmad Afsahi Queen s University Kingston, ON, Canada 2007 Workshop on Communication Architectures
More informationCommunication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.
Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance
More informationInfiniBand SDR, DDR, and QDR Technology Guide
White Paper InfiniBand SDR, DDR, and QDR Technology Guide The InfiniBand standard supports single, double, and quadruple data rate that enables an InfiniBand link to transmit more data. This paper discusses
More informationImpact of Cache Coherence Protocols on the Processing of Network Traffic
Impact of Cache Coherence Protocols on the Processing of Network Traffic Amit Kumar and Ram Huggahalli Communication Technology Lab Corporate Technology Group Intel Corporation 12/3/2007 Outline Background
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationChapter Seven Morgan Kaufmann Publishers
Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be
More informationAn FPGA-Based Optical IOH Architecture for Embedded System
An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing
More informationCPCI-HPDI32ALT High-speed 64 Bit Parallel Digital I/O PCI Board 100 to 400 Mbytes/s Cable I/O with PCI-DMA engine
CPCI-HPDI32ALT High-speed 64 Bit Parallel Digital I/O PCI Board 100 to 400 Mbytes/s Cable I/O with PCI-DMA engine Features Include: 200 Mbytes per second (max) input transfer rate via the front panel connector
More informationNPE-300 and NPE-400 Overview
CHAPTER 3 This chapter describes the network processing engine (NPE) models NPE-300 and NPE-400 and contains the following sections: Supported Platforms, page 3-1 Software Requirements, page 3-1 NPE-300
More informationComputer Systems Laboratory Sungkyunkwan University
DRAMs Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Main Memory & Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width
More informationHY225 Lecture 12: DRAM and Virtual Memory
HY225 Lecture 12: DRAM and irtual Memory Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS May 16, 2011 Dimitrios S. Nikolopoulos Lecture 12: DRAM and irtual Memory 1 / 36 DRAM Fundamentals Random-access
More informationSONICS, INC. Sonics SOC Integration Architecture. Drew Wingard. (Systems-ON-ICS)
Sonics SOC Integration Architecture Drew Wingard 2440 West El Camino Real, Suite 620 Mountain View, California 94040 650-938-2500 Fax 650-938-2577 http://www.sonicsinc.com (Systems-ON-ICS) Overview 10
More informationKeystone Architecture Inter-core Data Exchange
Application Report Lit. Number November 2011 Keystone Architecture Inter-core Data Exchange Brighton Feng Vincent Han Communication Infrastructure ABSTRACT This application note introduces various methods
More informationCreating a Scalable Microprocessor:
Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B.
More informationA Network Storage LSI Suitable for Home Network
258 HAN-KYU LIM et al : A NETWORK STORAGE LSI SUITABLE FOR HOME NETWORK A Network Storage LSI Suitable for Home Network Han-Kyu Lim*, Ji-Ho Han**, and Deog-Kyoon Jeong*** Abstract Storage over (SoE) is
More informationBasic Low Level Concepts
Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock
More informationMemory hierarchy and cache
Memory hierarchy and cache QUIZ EASY 1). What is used to design Cache? a). SRAM b). DRAM c). Blend of both d). None. 2). What is the Hierarchy of memory? a). Processor, Registers, Cache, Tape, Main memory,
More informationA Cache Hierarchy in a Computer System
A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the
More informationSGI Challenge Overview
CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 2 (Case Studies) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived
More informationLearning with Purpose
Network Measurement for 100Gbps Links Using Multicore Processors Xiaoban Wu, Dr. Peilong Li, Dr. Yongyi Ran, Prof. Yan Luo Department of Electrical and Computer Engineering University of Massachusetts
More informationAN 829: PCI Express* Avalon -MM DMA Reference Design
AN 829: PCI Express* Avalon -MM DMA Reference Design Updated for Intel Quartus Prime Design Suite: 18.0 Subscribe Latest document on the web: PDF HTML Contents Contents 1....3 1.1. Introduction...3 1.1.1.
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationLighting the Blue Touchpaper for UK e-science - Closing Conference of ESLEA Project The George Hotel, Edinburgh, UK March, 2007
Working with 1 Gigabit Ethernet 1, The School of Physics and Astronomy, The University of Manchester, Manchester, M13 9PL UK E-mail: R.Hughes-Jones@manchester.ac.uk Stephen Kershaw The School of Physics
More informationConvergence of Parallel Architecture
Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty
More informationk -bit address bus n-bit data bus Control lines ( R W, MFC, etc.)
THE MEMORY SYSTEM SOME BASIC CONCEPTS Maximum size of the Main Memory byte-addressable CPU-Main Memory Connection, Processor MAR MDR k -bit address bus n-bit data bus Memory Up to 2 k addressable locations
More informationLecture 23. Finish-up buses Storage
Lecture 23 Finish-up buses Storage 1 Example Bus Problems, cont. 2) Assume the following system: A CPU and memory share a 32-bit bus running at 100MHz. The memory needs 50ns to access a 64-bit value from
More informationWilliam Stallings Computer Organization and Architecture 8th Edition. Cache Memory
William Stallings Computer Organization and Architecture 8th Edition Chapter 4 Cache Memory Characteristics Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics
More information15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 19: Main Memory Prof. Onur Mutlu Carnegie Mellon University Last Time Multi-core issues in caching OS-based cache partitioning (using page coloring) Handling
More informationSunFire range of servers
TAKE IT TO THE NTH Frederic Vecoven Sun Microsystems SunFire range of servers System Components Fireplane Shared Interconnect Operating Environment Ultra SPARC & compilers Applications & Middleware Clustering
More informationMainstream Computer System Components
Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-2800 (DDR3-600) 200 MHz (internal base chip clock) 8-way interleaved
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCan User-Level Protocols Take Advantage of Multi-CPU NICs?
Can User-Level Protocols Take Advantage of Multi-CPU NICs? Piyush Shivam Dept. of Comp. & Info. Sci. The Ohio State University 2015 Neil Avenue Columbus, OH 43210 shivam@cis.ohio-state.edu Pete Wyckoff
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationMotivation for Caching and Optimization of Cache Utilization
Motivation for Caching and Optimization of Cache Utilization Agenda Memory Technologies Bandwidth Limitations Cache Organization Prefetching/Replacement More Complexity by Coherence Performance Optimization
More informationArchitectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad
nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses
More informationCSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1
CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationChapter 12: Multiprocessor Architectures
Chapter 12: Multiprocessor Architectures Lesson 03: Multiprocessor System Interconnects Hierarchical Bus and Time Shared bus Systems and multi-port memory Objective To understand multiprocessor system
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationTextbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site:
Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, 2003 Textbook web site: www.vrtechnology.org 1 Textbook web site: www.vrtechnology.org Laboratory Hardware 2 Topics 14:332:331
More informationUniversal Serial Bus Host Interface on an FPGA
Universal Serial Bus Host Interface on an FPGA Application Note For many years, designers have yearned for a general-purpose, high-performance serial communication protocol. The RS-232 and its derivatives
More informationA First Implementation of In-Transit Buffers on Myrinet GM Software Λ
A First Implementation of In-Transit Buffers on Myrinet GM Software Λ S. Coll, J. Flich, M. P. Malumbres, P. López, J. Duato and F.J. Mora Universidad Politécnica de Valencia Camino de Vera, 14, 46071
More informationLecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )
Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case
More informationTAG Word 0 Word 1 Word 2 Word 3 0x0A0 D2 55 C7 C8 0x0A0 FC FA AC C7 0x0A0 A5 A6 FF 00
ELE 758 Final Examination 2000: Answers and solutions Number of hits = 15 Miss rate = 25 % Miss rate = [5 (misses) / 20 (total memory references)]* 100% = 25% Show the final content of cache using the
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationEECS 322 Computer Architecture Superpipline and the Cache
EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:
More informationInterconnection Structures. Patrick Happ Raul Queiroz Feitosa
Interconnection Structures Patrick Happ Raul Queiroz Feitosa Objective To present key issues that affect interconnection design. Interconnection Structures 2 Outline Introduction Computer Busses Bus Types
More informationGPU-centric communication for improved efficiency
GPU-centric communication for improved efficiency Benjamin Klenk *, Lena Oden, Holger Fröning * * Heidelberg University, Germany Fraunhofer Institute for Industrial Mathematics, Germany GPCDP Workshop
More information