Low Latency Communication on DIMMnet-1 Network Interface Plugged into a DIMM Slot


Noboru Tanabe (Toshiba Corporation) noboru.tanabe@toshiba.co.jp; Hideki Imashiro (Hitachi Information Technology Co., Ltd.) himashi@hitachi-it.co.jp; Tomohiro Kudoh (National Institute of Advanced Industrial Science and Technology) t.kudoh@aist.go.jp; Yoshihiro Hamada and Hironori Nakajo (Tokyo University of Agriculture and Technology) hamada@nj.cs.tuat.ac.jp, nakajo@cc.tuat.ac.jp; Junji Yamamoto (Hitachi, Ltd.) junji-y@crl.hitachi.co.jp; Hideharu Amano (Keio University) hunga@am.ics.keio.ac.jp

Abstract

DIMMnet-1 is a high performance network interface for PC clusters that can be plugged directly into the DIMM slot of a PC. By using both low latency AOTF (Atomic On-The-Fly) sending and high bandwidth BOTF (Block On-The-Fly) sending, it overcomes the overhead caused by standard I/O buses such as PCI. Two types of DIMMnet-1 prototype boards (providing optical and electrical network interfaces), each containing a Martini network interface controller chip, are currently available. They can be plugged into a 100 MHz DIMM slot of a PC with a Pentium-3, Pentium-4 or Athlon processor. The round-trip time for AOTF on this incompletely tuned DIMMnet-1 is 7.5 times shorter than that of Myrinet2000, and the barrier synchronization time using AOTF is 4 times shorter than that of an SR8000 supercomputer. The floating-point sum operation time between two nodes is 1,903 ns. These results show that DIMMnet-1 holds promise for applications in which scalable performance is difficult to obtain with traditional approaches because of frequent data exchange.

1 Introduction

Many high performance PC clusters use system area networks such as Myrinet for interconnection. Myrinet2000, based on 64-bit 66 MHz PCI with a 2 Gbps electrical or optical link, is well known; its sustained one-way bandwidth is 245 MB/s and its short-message latency is 7 µs [1]. Dolphin's PCI-SCI D330, based on 64-bit 66 MHz PCI with a 667 MB/s link, has a good reputation; its sustained one-way bandwidth is 200 MB/s and its remote write latency is 1.46 µs [2]. However, most current PCs have only 32-bit 33 MHz conventional PCI slots. If Myrinet or a PCI-SCI card were plugged into a conventional PCI slot, its performance would be degraded. In addition, because the PCI bus is half duplex, the bandwidth of a PCI-based NIC (network interface card) degrades when sending and receiving are performed simultaneously. On the other hand, a communication bandwidth of over 1 GB/s per NIC is becoming realistic with optical interconnection; however, because of PCI bandwidth limitations, bus handling overheads and protocol software overheads, the potential of optical links cannot be fully exploited by such NICs.

In order to overcome these problems, we have proposed plugging high performance NICs into a memory slot on the motherboard of a low cost PC. We have named this class of NIC MEMOnet, and a NIC plugged into a DIMM slot DIMMnet. We are currently developing DIMMnet-1, a prototype NIC based on the DIMMnet approach. In this paper, the architecture, implementation and performance evaluation of DIMMnet-1 are described.

(Pentium is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Other product and company names mentioned herein may be trademarks and/or service marks of their respective owners.)

2 Architecture

2.1 MEMOnet

MEMOnet is the class of NIC, proposed by the authors, that is plugged into a memory slot; a NIC plugged into a DIMM slot is called DIMMnet. In order to realize an effective DIMMnet implementation, the four techniques described below are proposed.

Switched small memory modules: Small memory modules such as SO-DIMMs or micro-DIMMs provide inexpensive, upgradeable, broadband, large-capacity memory on the NIC board. Multiple memory modules allow double buffering.

Multiplexed address restoring: The physical address is multiplexed as a row address and a column address on the memory bus by the north bridge chip. The multiplexed address has to be restored for MEMOnet.

On-chip low latency common memory (LLCM): In spite of their large capacity, switched small memory modules incur bank-switching overhead. On-chip multi-ported memory compensates for this disadvantage: in spite of its small capacity, it allows low latency communication.

Memory capacity masquerading: In order to obtain a large address space for MEMOnet, memory capacity masquerading by the address decoder and the SPD information on the NIC board is effective.

2.2 AOTF with header TLB

Atomic on-the-fly (AOTF) sending is the communication architecture with the lowest overhead, and is realized by means of a header TLB. Figure 1 shows an example of a header TLB. A header TLB associates a packet header seed with the address of a memory access to the memory slot. A header seed is a complete packet header without the lower part of the remote address.

[Figure 1. Structure of a header TLB: the page part of A_kick_addr_P (20 bit) is matched against the tag entries to select a header seed, the offset part of A_kick_addr_P (12 bit) feeds the packet generator, the resulting header goes to the send FIFO, and a miss raises an interrupt to the core CPU.]

Figure 2 shows an overview of atomic on-the-fly sending. When the higher part of the address hits an entry in the header TLB, the packet generator produces a complete packet that can access the remote memory: it inserts the lower part of the address presented on the memory slot and combines the selected header with the data on the memory slot.

[Figure 2. Atomic on-the-fly sending: the host TLB translates A_kick_Addr_V to A_kick_Addr_P; on a header TLB hit the header seed and the data from the host CPU enter the header/data transaction FIFO and go to the network, while a miss is handled through the privileged header translation table in NIC memory by the on-chip CPU.]

In order to create an effective AOTF sending implementation, the following three techniques are proposed (a minimal sketch of the resulting lookup path appears after this list).

Utilizing physical remote addresses: Because the header TLB is located in kernel space, it can hold physical remote addresses. Physical remote addresses do not need to be translated at the receiving end. This technique allows for low latency communication.

Length generation from byte-enabled signals: One-byte access is convenient for flag manipulation and saves space for flags on the small-capacity LLCM. An 8-byte-wide DIMM slot has 8-bit byte-enable signals. Length generation logic derives a variable payload size for AOTF sending from these signals.

Credit-based flow control: Because current north bridge chips have no signal with which to stall a DRAM access, a special flow control for the FIFO transactions of the AOTF sending logic is needed. Checking for FIFO overflow by explicit accesses would add high latency; credit-based flow control reduces this flow control overhead.
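To make the AOTF path concrete, the following C sketch models the behaviour described above: a header TLB indexed by the upper bits of the kick address, a header seed completed with the lower address bits, and a payload length derived from the byte-enable signals. This is a minimal illustration under our own assumptions; the structure names, table size and field positions are ours and do not reflect the actual Martini register layout.

```c
#include <stdint.h>
#include <stdbool.h>

#define HTLB_ENTRIES 16           /* illustrative table depth, not the real one */

typedef struct {
    bool     valid;
    uint32_t page_tag;            /* upper (page) part of A_kick_addr_P */
    uint8_t  header_seed[32];     /* complete header except the low remote-address bits */
} htlb_entry_t;

typedef struct {
    uint8_t header[32];
    uint8_t payload[8];
    uint8_t length;               /* derived from byte enables */
} aotf_packet_t;

/* Payload length in bytes, derived from the 8-bit byte-enable mask
   of one 64-bit DIMM transfer (population count). */
static uint8_t length_from_byte_enables(uint8_t be_mask)
{
    uint8_t len = 0;
    while (be_mask) { len += be_mask & 1u; be_mask >>= 1u; }
    return len;
}

/* Model of AOTF sending: a write to a kick address either hits the header
   TLB and yields a complete packet, or is a miss handled by the on-chip CPU. */
static bool aotf_send(const htlb_entry_t htlb[HTLB_ENTRIES],
                      uint32_t kick_page, uint32_t kick_offset,
                      const uint8_t data[8], uint8_t be_mask,
                      aotf_packet_t *out)
{
    const htlb_entry_t *e = &htlb[kick_page % HTLB_ENTRIES];
    if (!e->valid || e->page_tag != kick_page)
        return false;                             /* miss: interrupt to the NIC CPU */

    for (int i = 0; i < 32; i++)                  /* copy the header seed ...        */
        out->header[i] = e->header_seed[i];
    out->header[28] = (uint8_t)(kick_offset & 0xff);        /* ... and insert the low */
    out->header[29] = (uint8_t)((kick_offset >> 8) & 0x0f); /* remote-address bits
                                                                (positions illustrative) */

    out->length = length_from_byte_enables(be_mask);
    for (int i = 0; i < 8; i++)
        out->payload[i] = data[i];
    return true;                                  /* packet goes to the send FIFO */
}
```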

2.3 On-the-fly receiving

On-the-fly receiving (OTFR) is a shortcut receiving method that avoids DMA controller setup and address translation. It is normally used for the short packets produced by AOTF sending. In order to implement effective on-the-fly receiving, the two techniques described below have been designed.

Flag in header specifying direct forwarding: The header TLB protects the header from user modification. A flag in such a protected header allows a packet to be forwarded directly to the shortcut logic, giving low latency receiving at minimal hardware cost.

Selective address representation: Selective representation of remote addresses in headers allows low latency when physical addresses are used and rich flexibility in the other cases.

2.4 BOTF with protection stampable window

Block on-the-fly (BOTF) sending is a low-overhead, high-bandwidth communication architecture realized by means of protection stampable window memory units. It achieves higher bandwidth than AOTF sending. Figure 3 shows an overview of block on-the-fly sending.

Window memory is the memory used to hold the data from each context on the host. Each window memory is mapped into user space to allow low-overhead user-level communication. The host CPU moves the data for the packets to be sent from its registers, cache or main memory into the window memory mapped into user space. When the kick address is accessed by the host CPU, the BOTF sending controller builds a packet and sends it to the network. In this way, even if a context switch occurs, words written by one process to window memory cannot be overwritten by other processes. In the packet-making stage, the BOTF sending controller stamps the protection information associated with the kick address. In addition, it checks the physical access enable flag, which may only be set by the AOTF sending controller, because the window can be written freely by the process that owns it; if that flag in a word in window memory has been set by a user process, the communication request is aborted.

[Figure 3. Block on-the-fly sending: combined block data (header seeds and data) assembled by the host CPU's write buffer in window memory, passed through a protection stamp & check stage with the process group ID table and window occupied flags, and into the send FIFO.]

In order to create an effective BOTF sending implementation, two techniques are proposed; the user-level send sequence they support is sketched after this list.

Using multi-window memory: If a user can use only a single window memory, the effective bandwidth of BOTF sending is cut in half. Using two window memories allows writing to one window while the other is being sent to the network. Multi-window memory also helps to reduce the status checking overhead, together with the technique described below.

Multi status bits prefetch: Because window status checking accesses an uncached area, its cost is not insignificant. Prefetching the status bits of multiple windows reduces the status checking overhead and increases the effective bandwidth of BOTF sending.
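The following C fragment sketches the user-level BOTF send sequence implied by the description above: the host process copies a header seed and payload into the memory-mapped window, checks the window's status bit, and then writes to the kick address to trigger packet generation. All names (botf_window_t, botf_kick, window_status, and the field layout) are hypothetical placeholders for illustration, not the DIMMnet-1 driver API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of one user-mapped BOTF window (illustration only). */
typedef struct {
    volatile uint64_t length;      /*  8 B: payload length                   */
    volatile uint8_t  header[32];  /* 32 B: header seed prepared by the user */
    volatile uint8_t  payload[8];  /*  8 B: data words                       */
} botf_window_t;

/* Assumed memory-mapped control locations (addresses are placeholders). */
extern volatile uint64_t *botf_kick;       /* write here to launch the packet  */
extern volatile uint32_t *window_status;   /* uncached; bit i = window i busy  */

/* Send one small message through window `w` (0 or 1).  Alternating between
   two windows lets the CPU fill one while the NIC drains the other. */
static void botf_send(botf_window_t win[2], int w,
                      const uint8_t header_seed[32],
                      const uint8_t data[8], uint64_t len)
{
    /* Reading the status bits of both windows at once would amortize this
       uncached access (the "multi status bits prefetch" technique). */
    while (*window_status & (1u << w))
        ;                                  /* wait until window w is free */

    win[w].length = len;
    memcpy((void *)win[w].header,  header_seed, 32);
    memcpy((void *)win[w].payload, data,         8);

    /* The write-combining buffer flushes the block; the kick write makes the
       BOTF controller stamp protection info and push the packet to the FIFO. */
    *botf_kick = (uint64_t)w;
}
```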
3 Prototype

DIMMnet-1 is a prototype NIC based on the architecture described above.

3.1 Switch

The four types of switches for DIMMnet-1 listed below have been implemented.

1. Electrical version of the RHiNET2 switch
2. Optical version of the RHiNET2 switch
3. Optical RHiNET3 switch
4. Optical IP based switch with an electrical port

3.2 NIC

DIMMnet-1 is a prototype NIC based on the MEMOnet architecture, in which the NIC is plugged into a PC133 DIMM slot. Figure 4 shows the basic structure of DIMMnet-1, and Table 1 gives its specifications.

[Figure 4. Basic structure of DIMMnet-1: two SO-DIMM (SDRAM) modules connected through FET switches (FET-SW1 to FET-SW4) between the 168-pin DIMM interface and the Martini LSI with its link interface; the two PEMM signals /MWAIT and /MIRQ go to the host.]

Table 1. Basic specifications of DIMMnet-1
  Host interface: PEMM [8] or DIMM
  Common memory on NIC: PC133 SO-DIMM x 2
  SO-DIMM capacity / slot: 64 MB to 512 MB
  Low Latency Common Memory (LLCM) capacity: 128 KB (on-chip)
  Instruction SRAM capacity: 128 KB (on-chip)
  Data SRAM capacity: 128 KB (on-chip)
  On-chip processor: R3000-like 32-bit RISC
  ASIC link port: 12-pair LVDS
  Common memory bandwidth: 1064 MB/s (for network), 1064 MB/s (for host)
  Send hardware latency: 135 ns (DIMM to link I/F)
  Receive hardware latency: 68 ns (link I/F to LLCM)
  ASIC technology: 0.14 µm CMOS embedded array by Hitachi
  Supported chipsets: Pro133A, Pro266 (Pentium-3); P4X266, P4M266 (Pentium-4); KT133A (Athlon, Athlon XP)

Three types of DIMMnet-1 have been designed, as listed below. The first and the second prototypes have already been implemented.

1. Electrical version of DIMMnet-1, for switches 1 and 4
2. Optical version of DIMMnet-1, for switch 2
3. Optical version of DIMMnet-1, for switch 3

3.3 Martini: communication controller ASIC

Martini is a dedicated LSI used not only for DIMMnet-1 but also for RHiNET2/NI and RHiNET3/NI, which are PCI type NICs. Martini supports AOTF sending, BOTF sending, on-the-fly receiving and hardware-based remote DMA communication, and it implements all of the techniques mentioned in the previous section. There are two versions of Martini; the second version operates at a higher frequency than the first.

4 Performance

4.1 Remote write latency by AOTF

The assumed latency from the execution of a move instruction at the initiator node to the beginning of the write operation to the DIMM is 10 DIMM clock cycles (75 ns when the DIMM frequency is 133 MHz). The minimum AOTF sending latency of the NIC's low speed part is estimated at 18 clock cycles (135 ns at 133 MHz) by a cycle accurate simulator. The measured latency between the NIC's high speed part and low speed part is 151 ns at 250 MHz on the first optical prototype; the assumed latency between the high speed part and the low speed part is 75 ns at 500 MHz. The minimum reception latency of the NIC's low speed part for AOTF is estimated at 9 clock cycles (68 ns) by the cycle accurate simulator. Therefore, the remote write latency of DIMMnet-1 is about 353 ns when the DIMM frequency is 133 MHz and the high speed part frequency is 500 MHz, as summed up below. The remote write latency of Dolphin's latest PCI-SCI (D330) is 1,460 ns, so the remote write latency of DIMMnet-1 by AOTF is 4.1 times lower than that of the D330.
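Collecting the component latencies quoted above, the 353 ns figure is the simple sum of the four stages (the symbols below are introduced here only for this summary and do not appear in the original):

\[
T_{\mathrm{remote\ write}} \approx T_{\mathrm{CPU \to DIMM}} + T_{\mathrm{send}} + T_{\mathrm{hi/lo\ link}} + T_{\mathrm{recv}}
= 75\,\mathrm{ns} + 135\,\mathrm{ns} + 75\,\mathrm{ns} + 68\,\mathrm{ns} = 353\,\mathrm{ns}.
\]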

4.2 Remote write latency by BOTF

The assumed latency from the execution of a move instruction at the initiator node to the beginning of the write operation to the DIMM is 10 DIMM clock cycles (75 ns when the DIMM frequency is 133 MHz). The assumed number of memory bus cycles for 54 bytes of data (length = 8 B, header = 32 B, payload = 8 B, kick = 8 B) copied from the CPU to the window memory on DIMMnet-1 with a write-combining attribute is 10 DIMM clock cycles (75 ns). The latency increases by n DIMM interface clocks for sending 8 × n bytes of packet payload when using a high-performance CPU with the write-combining mechanism. The minimum BOTF sending latency of the NIC's low speed part is estimated at 21 clock cycles (158 ns at 133 MHz) by the cycle accurate simulator. The measured latency between the NIC's high speed part and low speed part is 151 ns at 250 MHz on the first optical prototype; the assumed latency between the high speed part and the low speed part is 75 ns at 500 MHz. The minimum reception latency of the NIC's low speed part for BOTF is estimated at 19 clock cycles (143 ns) by the cycle accurate simulator; this is larger than that for AOTF because of the additional overhead for address translation and the DMA operation. Therefore, the remote write latency of DIMMnet-1 by BOTF is about 441 ns when the DIMM frequency is 133 MHz and the high speed part frequency is 500 MHz.

4.3 Round-trip time by AOTF

The round-trip time includes some additional software overheads. The measured round-trip time of DIMMnet-1 is 1,875 ns, whereas the round-trip time of Myrinet2000 with GM is 14,000 ns. The measured round-trip time of the incompletely tuned DIMMnet-1 is therefore 7.5 times shorter than that of Myrinet2000.

4.4 Round-trip time by BOTF

The round-trip time for a ping-pong operation includes some additional software overheads. The measured round-trip time of the first optical DIMMnet-1 is 4,271 ns when the DIMM frequency is 66 MHz and the link frequency is 250 MHz, on an 850 MHz Pentium-3-based PC. A faster CPU, a higher DIMM frequency and a higher link frequency will allow shorter round-trip times.

4.5 Barrier synchronization time

On DIMMnet-1, AOTF sending and OTF receiving to the LLCM (on-chip Low Latency Common Memory) are recommended for implementing barrier synchronization, as this is the fastest DIMMnet-1 communication method. The simplest implementation of barrier synchronization among 8 processes is shown in Figure 5, and a host-side sketch of the same algorithm follows below. Each child process executes a remote write (PUSH) of a byte of data to the LLCM of a home node with AOTF. After initialization, each child process can request a synchronization packet simply by writing data to the special address called the AOTF kick address. The home node process polls the LLCM until all of the bytes have been written with the specified value. When the home node process detects that all child processes have updated their bytes, it executes a PUSH of a byte to the LLCM of every child process. The change of that byte in the LLCM is in turn polled by the child processes; when each child detects the change, the barrier synchronization phase is complete.

[Figure 5. Eight-node barrier synchronization by DIMMnet-1: each child (child1 to child7) writes its byte to the home node's LLCM by AOTF, the home node polls all 8 bytes and pushes a byte back to each child's LLCM, and the children poll their LLCMs before starting the next phase.]

The measured barrier synchronization time between two nodes is 2,075 ns. The barrier synchronization time of the Hitachi SR8000 supercomputer with its synchronization hardware is about 8,000 ns. Therefore, the barrier synchronization time of DIMMnet-1 is 4 times shorter than that of the SR8000.
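The following C sketch spells out the home/child roles described above. It is a minimal model of the algorithm under our own assumptions, not DIMMnet-1 driver code: aotf_push_byte() stands in for an AOTF remote write of one byte to a node's LLCM, llcm[] for the local on-chip common memory, and all names and offsets are hypothetical.

```c
#include <stdint.h>

#define NUM_NODES 8                /* home node 0 plus children 1..7 */

/* Local view of the on-chip Low Latency Common Memory (LLCM).
   Bytes 0..NUM_NODES-1 are the flags used by the barrier. */
extern volatile uint8_t llcm[];

/* Placeholder for an AOTF remote write of one byte: writing to the AOTF
   kick address makes the NIC emit a packet that stores `value` at
   `remote_offset` in the LLCM of `node`. */
void aotf_push_byte(int node, uint32_t remote_offset, uint8_t value);

/* Child `my_id` (1..7): announce arrival at the home node, then wait
   for the release byte pushed back by the home node. */
void barrier_child(int home_node, int my_id, uint8_t phase)
{
    aotf_push_byte(home_node, my_id, phase);   /* PUSH arrival flag          */
    while (llcm[0] != phase)                   /* poll own LLCM for release  */
        ;
}

/* Home node (node 0): poll until every child has written this phase's
   value, then PUSH the release byte to every child's LLCM. */
void barrier_home(uint8_t phase)
{
    for (int c = 1; c < NUM_NODES; c++)        /* wait for all arrivals      */
        while (llcm[c] != phase)
            ;
    for (int c = 1; c < NUM_NODES; c++)        /* release all children       */
        aotf_push_byte(c, 0, phase);
}
```

Using the phase value itself as the flag avoids clearing the LLCM bytes between consecutive barrier phases, which matches the "data for next phase" step in Figure 5.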
4.6 Reduction operations

Barrier synchronization is a kind of reduce-with-scatter operation. The fast barrier synchronization described above is implemented in software on the host, with no assistance from the CPU on the NIC. Therefore, other reduction operations that are difficult for the NIC to process [10] (for example, the global sum of floating-point numbers) will also be fast when performed with DIMMnet-1.

The measured time for a floating-point sum operation between two nodes, returned by an incompletely tuned DIMMnet-1 and host PC, is only 1,903 ns. A k-ary tree structured algorithm, sketched below, would achieve high-speed reduction operations on large scale PC clusters. Therefore, many applications that have not been accelerated on large scale parallel platforms because of reduction operations would appear to benefit from acceleration by PC clusters with DIMMnet-1.
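As an illustration of how the two-node floating-point sum generalizes, the sketch below arranges the nodes in a k-ary tree: each node sums the partial results pushed up by its children, using the same AOTF write-and-poll pattern as the barrier, and forwards its own partial sum to its parent. This is our own schematic reading of the k-ary tree idea; push_double() and wait_double() are hypothetical wrappers, not a DIMMnet-1 library.

```c
#include <stdint.h>

#define K 4                        /* tree fan-out (arity), chosen arbitrarily */

/* Placeholders for the communication primitives sketched earlier:
   send one double to a node's LLCM, and wait until a child's slot arrives. */
void   push_double(int node, int slot, double value);
double wait_double(int slot);

/* One step of a k-ary tree reduction of a local partial sum.
   Nodes are numbered 0..n-1; node 0 is the root and ends up with the total. */
double tree_sum(int my_id, int n, double local)
{
    double sum = local;

    /* Accumulate the partial sums of my children (my_id*K+1 .. my_id*K+K). */
    for (int i = 1; i <= K; i++) {
        int child = my_id * K + i;
        if (child < n)
            sum += wait_double(i);         /* poll LLCM slot i for that child */
    }

    /* Forward my partial sum to my parent (unless I am the root). */
    if (my_id != 0) {
        int parent  = (my_id - 1) / K;
        int my_slot = (my_id - 1) % K + 1; /* which child slot I occupy       */
        push_double(parent, my_slot, sum);
    }
    return sum;                            /* only the root holds the total   */
}
```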

5 Conclusion

A high performance network interface for PC clusters called DIMMnet-1, which can be plugged directly into the DIMM slot of a PC, has been presented. Low latency communication by AOTF sending and by BOTF sending has been evaluated on an incompletely tuned DIMMnet-1 prototype. The round-trip time by AOTF on the incompletely tuned DIMMnet-1 is 7.5 times shorter than that of Myrinet2000. The barrier synchronization time by AOTF is 4 times shorter than that of the SR8000 supercomputer with its hardware barrier mechanism. The floating-point reduction operation time between two nodes is only 1,903 ns.

Many applications that have not been accelerated on large scale parallel platforms because of frequent fine-grained communications or collective operations, such as reduction operations and barrier synchronization, appear likely to be accelerated by PC clusters with DIMMnet-1. Reduction operations have been inefficient not only on PC clusters with other NICs, but also on shared memory multiprocessors [11]. Some applications in electrical engineering have these characteristics; for example, circuit simulation on large PC clusters requires fine grained communication, caused by the random sparse matrix structure, and reduction operations, caused by pivot selection and convergence testing.

We are currently evaluating the second version of the Martini-based DIMMnet-1. In addition, we are building a software environment for PC clusters connected by DIMMnet-1 and its switches. Development of a message passing library, development of a parallelizing compiler and evaluation with applications are planned.

Acknowledgment

The authors would like to express their sincere gratitude to Hiroaki Nishi, Hitoshi Suda, Akihiro Mitsuhashi, Toshiaki Uejima, Hidetoshi Kinno, Hiroaki Terakawa, Kouzou Oosugi, Hidenobu Iwata, Hayato Yamamoto, Yoshimasa Kashiwabara, Toshiteru Keikouin, Jun-ichiro Tsuchiya, Kounosuke Watanabe and all the others who cooperated in the development of the Martini LSI and DIMMnet-1. This work is supported by the Real World Computing Partnership (RWCP).

References

[1] Myricom Corp., "GM API Performance with PCI64B and PCI64C Myrinet/PCI Interfaces", June 2002.

[2] Dolphin Corp., "PCI SCI-64 Adapter Card", pci-sci64.htm.

[3] Dolphin Corp., "PCI-SCI Adapter Card D320/D321 Functional Overview", Part no. D, Nov. 1999.

[4] InfiniBand Trade Association.

[5] N. Tanabe, J. Yamamoto, H. Nishi, T. Kudoh, Y. Hamada, H. Nakajo and H. Amano, "MEMOnet: Network interface plugged into a memory slot", IEEE International Conference on Cluster Computing (CLUSTER2000), Nov. 2000.

[6] N. Tanabe, J. Yamamoto, H. Nishi, T. Kudoh, Y. Hamada, H. Nakajo and H. Amano, "On-the-fly sending: a low latency high bandwidth message transfer mechanism", International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN2000), Dec. 2000.
[7] N. Tanabe, J. Yamamoto, H. Nishi, T. Kudoh, Y. Hamada, H. Nakajo and H. Amano, "Low Latency High Bandwidth Message Transfer Mechanisms for a Network Interface Plugged into a Memory Slot", Cluster Computing Journal, Vol. 5, No. 1, Jan. 2002, pp. 7-17.

[8] Standard of Electronic Industries Association of Japan, "Processor Enhanced Memory Module (PEMM): Standard for Processor Enhanced Memory Module Functional Specifications", EIAJ ED-5514, Jul. 1998.

[9] O. Tatebe, U. Nagashima, S. Sekiguchi, H. Kitabayashi and Y. Hayashida, "Design and implementation of FMPL, a fast message-passing library for remote memory operations", Proceedings of the Conference on High Performance Networking and Computing (SC2001), Nov. 2001.

[10] D. Buntinas, D. K. Panda and P. Sadayappan, "Performance Benefits of NIC-Based Barrier on Myrinet/GM", Proceedings of the Workshop on Communication Architecture for Clusters (CAC) with IPDPS'01, Apr. 2001.

[11] M. J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, L. Rauchwerger and J. Torrellas, "Architectural support for parallel reductions in scalable shared-memory multiprocessors", Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'01), Sep. 2001.
