Accessing Data on SGI Altix: An Experience with Reality

Guido Juckeland, Matthias S. Müller, Wolfgang E. Nagel, Stefan Pflüger
Technische Universität Dresden
Center for Information Services and High Performance Computing
01062 Dresden, Germany
{guido.juckeland, matthias.mueller, wolfgang.nagel, stefan.pflueger}@tu-dresden.de

Abstract. The SGI Altix system architecture supports very large ccNUMA shared memory systems. Nevertheless, the system layout places limits on the sustained memory performance that can only be avoided by selecting the right data access strategies. The paper presents the results of cache and memory performance studies on an SGI Altix 350. It demonstrates limitations and benefits of the system and the processor underneath.

I. INTRODUCTION

The increasing performance gap between processor and memory speed in computer systems strongly calls for sophisticated data management strategies. While vector systems utilize proprietary memory modules with many memory banks to allow a large number of parallel memory accesses, superscalar processors introduced caches to buffer small pieces of frequently used data. A lot of research has gone into optimizing the microarchitecture of the processors as well as applications and compilers for both techniques.

The memory system of a computer has two key characteristics: bandwidth (how much data the system can transfer per time unit) and latency (how long the CPU has to wait, at minimum, for the first data element). While the caches of a scalar processor have a low-latency, high-bandwidth connection to the processing units, they are very expensive and can only hold a limited amount of data. Modern microarchitectures of scalar processors therefore contain a multi-stage cache hierarchy to balance the trade-off between size and speed. It is beyond question that main memory access is the bottleneck in every computer system based on a scalar processor. Knowing the exact latencies and bandwidths for all cache levels and the various memory levels in a distributed shared memory environment, however, is key to understanding and improving application performance on a specific computer system.

This paper presents research on the SGI Altix 350 that establishes and verifies those metrics. The first section introduces the Altix system architecture and the Itanium 2 processor architecture with a focus on the memory subsystem. Afterwards, the measurement results for single and parallel access patterns are discussed. The presented results will enable programmers to avoid major pitfalls when working on SGI Altix systems.

II. THE SGI ALTIX 350

The Itanium 2 processor with its good floating point performance and the large amount of addressable shared memory combine for a unique HPC system in the SGI Altix series. The characteristics of these two components with regard to memory performance are the subject of this section.

A. The Processor

The Itanium 2 processor distinguishes itself from other superscalar processors in that it implements the IA-64 architecture and, therefore, an improved VLIW ISA (called EPIC). Since not every combination of operations within an instruction word (bundle) is permitted, this also has an impact on the number of memory access operations that can be issued at once. The Itanium instruction set allows for a maximum of two such operations per bundle. The microarchitecture can issue two bundles per clock, which results in at most four memory transactions issued per clock cycle.
It will be shown that this corresponds to the maximum transfer capabilities between the caches and the computation units (see [1]). The IA-64 ISA is a so-called load/store architecture: compute operations can only be issued on register contents, and the data has to be moved into and out of the registers by separate load/store instructions.

1) Cache Hierarchy: The three-stage cache hierarchy of the Itanium 2 processor is shown in figure 1. It can be seen that data, before reaching the computation units, has to flow through the three cache levels. There is, however, a unique exception to traditional access schemes, as can be seen in figure 2: the floating point unit (FPU) is not connected to the first level data cache but to the level two cache. This allows pointers to data structures to remain in the L1D cache while the floating point data is processed out of the L2 cache. The transfer rates from the caches into the computation units are shown in table I. It can be seen, as mentioned earlier, that the microarchitecture only allows at most four load/store instructions per clock cycle.

2) Translation Lookaside Buffers (TLBs): As will be shown later on, the TLB size is of significant importance to the cache and memory access latencies. Since cache and memory are accessed on the Itanium 2 with their physical addresses, the time for the address translation t_t adds to the physical access time t_p.

Fig. 1. Cache hierarchy of the Itanium 2 [1]. (L3: up to 9 MB, 128 Byte lines, 14+ clocks; L2: 256 KB, 128 Byte lines, 5+ clocks; L1I and L1D: 16 KB each, 64 Byte lines, 1 clock; 6.4 GB/s system bus to the northbridge.)

Fig. 2. Itanium 2 block diagram [1].

TABLE I
TRANSFER RATES BETWEEN COMPUTE UNITS AND THE ATTACHED CACHES (IN BYTE PER CLOCK CYCLE) [1]

Unit | Connected to | Read rate | Write rate
ALU  | L1D          | 16 Byte       | 16 Byte
FPU  | L2           | 16 or 32 Byte | 16 Byte

The translation time and the physical access time add up to the total access time t_a:

t_a = t_t + t_p    (1)

While the physical access time t_p is fixed (depending on the cache or memory level), the total access time t_a depends heavily on the translation time t_t. If the address translation has to be done by the operating system (OS), which is ultimately responsible for page allocation and translation, a penalty larger than the physical access time t_p will occur. The Itanium 2 features a two-level TLB structure which buffers a number of the most recently used translations to reduce the translation time t_t. Its characteristics are summarized in table II.

TABLE II
ITANIUM 2 TLB CHARACTERISTICS [1]

Characteristic      | Instruction TLB (Level 1 / Level 2) | Data TLB (Level 1 / Level 2)
Number of entries   | 32 / 128                            | 32 / 128
Associativity       | Full / Full                         | Full / Full
Penalty for L1 miss | 2 clock cycles                      | 4 clock cycles

The TLBs have to deal with the flexibility of the Itanium 2 microarchitecture regarding the supported memory page sizes, which can range from 4 KByte up to 4 GByte. The first level TLBs can only handle 4 KByte pages, but can work with larger pages by using 4 KByte segments of such a page. The L1 TLBs can, therefore, contain translations for 32 x 4 KByte = 128 KByte of memory space each. The L1 TLBs and the L1 caches are tightly coupled: an L1 cache line will be invalidated if a corresponding L1 TLB entry is evicted. The second level TLBs support all page sizes but suffer from OS limitations. SGI runs Linux on the Altix systems, which uses a fixed page size of 16 KByte for all processes. Hence, the address space for which translations can be placed into the L2 TLBs is limited to 128 x 16 KByte = 2 MByte. Beyond that point another Itanium 2 specific mechanism tries to avoid invoking the operating system: the hardware page walker (HPW).

3) Hardware Page Walking: When no address translation can be located in the TLBs, the hardware page walker will access the virtual hash page tables (VHPTs), a special data structure kept additionally to the L2 and L3 cache and in a portion of main memory. These tables contain the address translations for their corresponding caches. A memory access that, for example, misses the TLBs but whose cache line is held in the L2 cache will use the L2 VHPT. The best case penalties for this page walking are shown in table III.

TABLE III
BEST CASE HPW PENALTIES [1]

Event                      | Penalty in clock cycles
Hit in L2 VHPT             | 25
Miss in L2 VHPT, hit in L3 | 31
Miss in L2 and L3          | 20 + main memory latency

B. The SGI Altix 350 System Architecture

The SGI Altix series contains the small to medium range servers of the Altix 350 series and the large supercomputers of the Altix 3000 series. The system studied in this paper is an Altix 350 with four Itanium 2 (Madison) CPUs with 3 MBytes of L3 cache each. The system is set up of so-called modules of different types: base, extension and router modules. The base module contains the CPUs as well as the main memory. Its layout is shown in figure 3. The Altix 350 at hand is running Red Hat Enterprise Linux 3 with ProPack 3 (SP4); the Intel compilers were used for all measurements.

Fig. 3. SGI Altix 350 Base Module [2]. (Two CPUs on a shared front side bus connect to the SHUB, which links to the local main memory at 10.2 GB/s, to the I/O system at 2.4 GB/s with four PCI/PCI-X slots, and to other modules via NUMAlink.)
1) System Layout: The modules are combined to form larger systems using the two NUMAlink 4 connections available per base module. The connected modules form a ccNUMA shared memory environment, so that one running process can allocate all of the system's main memory. The system layout for the SGI Altix 350 at hand is shown in figure 4. The system contains 8 GByte of main memory.

2) Cache Coherency: Ensuring cache coherency in such large scale shared memory systems poses quite a challenge. SGI uses the SHUBs and NUMAlinks to communicate the coherency information. When the system is booted, the SHUB reserves a portion of its local memory to contain the addresses of all cache lines and a bit mask containing one bit for every CPU in the system. These bits are used to mark the processors sharing the cache line. The SHUB also listens to the cache snoop information transmitted by the Itanium 2 CPUs over the front side bus and manages the directory accordingly. When a CPU writes to a shared cache line, the SHUB will transmit that cache line to every CPU holding a copy [3].
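To picture this directory scheme, the following C fragment sketches one directory entry with a presence bit per CPU and the forwarding step on a write. It is purely illustrative: the field sizes, the names, and the send_line_to_cpu() helper are our assumptions, not SGI's actual implementation.

    #include <stdint.h>

    #define NUM_CPUS 4    /* the Altix 350 at hand: 2 modules x 2 CPUs */

    /* Illustrative directory entry kept in SHUB-local memory. */
    struct dir_entry {
        uint64_t line_addr;   /* address of the 128-byte cache line */
        uint32_t sharers;     /* bit i set: CPU i holds a copy      */
    };

    /* Hypothetical stand-in for a front side bus/NUMAlink transfer. */
    static void send_line_to_cpu(int cpu, uint64_t line_addr)
    {
        (void)cpu; (void)line_addr;
    }

    /* On a write by 'writer', forward the line to every other sharer. */
    static void directory_on_write(struct dir_entry *e, int writer)
    {
        for (int i = 0; i < NUM_CPUS; i++)
            if (i != writer && (e->sharers & (1u << i)))
                send_line_to_cpu(i, e->line_addr);
        e->sharers |= 1u << writer;
    }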

Fig. 4. SGI Altix 350 system layout [2]. (Two base modules, each with two CPUs on a front side bus, a SHUB, 10.2 GB/s local memory and 2.4 GB/s I/O, connected via NUMAlink.)

III. LATENCY

The time between the issue of a load instruction and the arrival of the requested data from one of the caches or the main memory is known as access latency. In the case of the Altix system it is (as shown earlier) the sum of the physical latency and the address translation time. Intel specifies in [1] the Itanium 2 latencies shown in table IV.

TABLE IV
ITANIUM 2 ACCESS LATENCIES [1]

Hierarchy level | Physical latency
Level 1 cache   | 1 clock cycle
Level 2 cache   | 5 clock cycles
Level 3 cache   | 12 clock cycles
Main memory     | system dependent (about 100 ns)

This section will establish the total access times for the three cache levels as well as the main memory latency with one and more than one process accessing the memory.

A. Measuring Algorithm

The exact measurement of cache and memory access times poses quite some problems for the performance analyst. One access is too fast to use a timer, and for multiple accesses one has to circumvent hard- and software optimizations that try to hide the latency. In this paper a pointer chasing algorithm was used to acquire the measurement data. Pointer chasing uses the data from one access to determine the address of the next access. In case of a random dispersal of the addresses within the allocated memory area, neither hardware nor software prefetching can determine the access pattern, and every access will encounter the full latency. By varying the size of the memory area used, one can determine all cache latencies as well as the main memory access time.

The used memory area is initialized using the following algorithm: Treat the area as a vector of pointers. Select a random position and have the first element point to that position. Remove the element from the list of available elements and pick a new random element. Have the last element point to this new element, and so on. Finally, have the last selected element in the vector point to the first element. One receives a kind of interwoven ring of pointers that allows random jumps through the allocated memory area.
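As an illustration, the initialization and the measurement loop can be sketched in C as follows. This is a minimal sketch of the idea just described (the names and structure are ours, not the exact benchmark source):

    #include <stdlib.h>

    /* Build the "interwoven ring" of pointers described above: every
       element points to another randomly chosen element, each element
       is visited exactly once, and the ring closes on the first one. */
    void **init_ring(void **area, size_t n)
    {
        size_t *avail = malloc(n * sizeof *avail);
        for (size_t i = 0; i < n; i++) avail[i] = i;

        size_t cur = 0, left = n - 1;   /* start the chain at element 0  */
        avail[0] = avail[left];         /* element 0 is no longer free   */
        while (left > 0) {
            size_t r = rand() % left;   /* pick a random free element    */
            size_t next = avail[r];
            avail[r] = avail[--left];   /* remove it from the free list  */
            area[cur] = &area[next];    /* current element points to it  */
            cur = next;
        }
        area[cur] = &area[0];           /* close the ring */
        free(avail);
        return area;
    }

    /* Chase the pointers. Each load depends on the previous one, so
       neither hardware nor software prefetching can hide the latency. */
    void *chase(void **p, long iters)
    {
        while (iters--) p = *p;
        return p;   /* returning p defeats dead-code elimination */
    }

Dividing the runtime of chase() by the number of iterations yields the average total access time t_a for the chosen memory size.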
B. Access on Separate Memory Segments

A parallel version of the pointer chasing algorithm was furthermore created using the MPI standard. The involved processes then all work in their own piece of main memory, and one can study their influence on each other regarding the access latency. They do not exchange any data using MPI messages, but do, however, use barriers to synchronize with each other.

1) Cache Latencies: When restricting the allocated memory area to fit within the processor cache, the cache latencies can be determined. Since all CPUs are then working out of their caches, they do not influence each other. Therefore, only the measurement results for one active process are shown in figure 5. They demonstrate the exceptionally low cache access times of the Itanium 2 processor.

Fig. 5. Cache access time for an Itanium 2 Madison 3M with 1.4 GHz (access time in clock cycles over the amount of memory used, in KByte).

The effects of the address translation can be clearly seen in the picture. While the L1 cache latency is, as expected, at 2 clock cycles for pointer chasing, the L2 cache latency increases from 5 to about 9 clock cycles: the L1 TLB cannot hold the translations for all data, and the 4 clock cycle penalty for accessing the L2 TLB is added to the access time. The L3 cache latency is somewhat larger than the value provided by Intel, again due to the L1 TLB misses. As the used memory area hits the capacity of the L2 TLB at 2 MByte, the access time rapidly increases: the hardware page walking adds to the address translation time, and the benefit of the Itanium 2's fast level three cache is gone. This is of specific importance since this result applies to all Itanium 2 CPUs: once the working set exceeds the TLB reach, the caches lose a lot of their performance.

2) Main Memory Latencies: Using the pointer chasing algorithm to determine the main memory latency yields somewhat contradictory results.

The process scheduling also influences process and data placement and has to be taken into account. Therefore, the processes were pinned to the CPUs using the dplace tool from SGI. The results gained with and without that tool are gathered in table V.¹

TABLE V
MEASURED MEMORY LATENCIES WITH 1 MBYTE OF USED MEMORY PER PROCESS

Without dplace:
Total # of processes | Latency for 1 / 2 / 3 / 4 active processes
1 | 110 ns
2 | 110 ns / 110 ns
3 | 110 ns / 144 ns / 126 ns
4 | 110 ns / 144 ns / 144 ns / 126 ns

With dplace:
Total # of processes | Latency for 1 / 2 / 3 / 4 active processes
1 | 110 ns
2 | 110 ns / 126 ns
3 | 110 ns / 126 ns / 126 ns
4 | 110 ns / 126 ns / 145 ns / 126 ns

The results raise a number of questions: Why is there a decrease for 4 processes when all of them are active, compared to keeping a number of them idle? Why does the usage of dplace increase some access times while it decreases others? And why does one only see three different latency values? One possible explanation is hidden in the system architecture. The values are all about 15 ns apart, which corresponds pretty well with the time for one NUMAlink hop. It seems that in case of the 126 ns at least one process is working with memory that is located in the other module (remote memory). In case of the 144 ns it could be that at least one process is in fact working with its local memory (located in its own module) but the data is somehow sent to the other module first, before coming back to the original module. This would result in two NUMAlink hops and would add about 30 ns to the initial latency.

¹ The access time increases to about 150 ns for 1 GByte of used memory space. Since the same effects are visible, the memory space was reduced to 1 MByte to reduce the measurement runtime.

IV. BANDWIDTH

While access latencies can be hidden by a number of hard- and software mechanisms, the available bandwidth cannot be increased by such techniques. Therefore, bandwidth is the limiting factor for most applications, especially in data intensive computing. The maximum data transmission rates for the SGI Altix are shown in table VI. This section will compare these values with the obtained measurement results and will discuss the scalability with more than one active process.

TABLE VI
MAXIMAL BANDWIDTH WITHIN SGI ALTIX 350

Connection                         | Bandwidth (1.4 GHz)
L2 cache                           | 22.4 GB/s read or write
L3 <-> L2                          | 44.8 GB/s read or write
Front side bus (within one module) | 6.4 GB/s read or write
SHUB <-> memory                    | 10.2 GB/s read or write
NUMAlink 4                         | 3.2 GB/s read + 3.2 GB/s write

A. Measuring Algorithm

The STREAM benchmark has established itself as a standard for the measurement of memory transfer rates. It uses a set of different vector operations (e.g. a vector triad) on double precision floating point numbers where no piece of data is reused. This worst-case scenario for scalar processors practically circumvents the caches, as it does not include any kind of data locality. The available bandwidth BW can be determined from the number of vectors used n_vector, the length of the vectors l_vector, the size of one element in the vector s_element, and the run time t_run as follows:

BW = (n_vector · l_vector · s_element) / t_run    (2)

Originally, the STREAM benchmark only measures the bandwidth for one fixed memory size. This was changed so that the bandwidth can be determined for a range of memory sizes. Furthermore, two self-selected vector operations,

Temp := Temp + A + B · C    (3)
A := A + B · C    (4)

have been selected to determine the available bandwidths for pure read (3) and combined read/write (4) operations.
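For illustration, operations (3) and (4) can be coded as the following C kernels (a sketch under our assumptions; the actual measurement kernels are part of BenchIT). Timing these loops and applying equation (2) yields the bandwidth; for (4), the store of A adds one more transferred vector to n_vector.

    #include <stddef.h>

    /* Operation (3): pure read. Only A, B and C are loaded; the scalar
       accumulator stays in a register, so nothing is written back. */
    double bw_read_only(const double *A, const double *B,
                        const double *C, size_t n)
    {
        double temp = 0.0;
        for (size_t i = 0; i < n; i++)
            temp += A[i] + B[i] * C[i];
        return temp;   /* returning temp keeps the loop from being removed */
    }

    /* Operation (4): combined read/write. A, B and C are loaded and
       the updated A is stored again. */
    void bw_read_write(double *A, const double *B,
                       const double *C, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            A[i] += B[i] * C[i];
    }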
The measurement routines have also been implemented to benchmark the accumulated bandwidth with more than one process accessing separate and shared memory segments.

B. Memory Bandwidth with Different Strides

A first benchmark, which focuses on cache bandwidth, measures the available bandwidth when accessing data with different strides. The memory range was varied from 32 KByte up to 10 MByte and the stride sizes from every element (8 Byte) to every 128th element (1 KByte). The results of the benchmark are displayed in figures 6 and 7.

The first figure underlines an important characteristic of the second level cache. The sudden drop in performance between strides of 128 Bytes and 256 Bytes points to bank conflicts within the L2 cache. As the bank width of that cache is 256 Bytes, all requests whose distance is a multiple of that width access the same cache bank and are, therefore, serialized. This cuts the available bandwidth in half. The drop in performance as the capacity of the L2 TLB is reached at 2 MByte can be observed as well.

The results for the combined read/write access point to a weakness of the compilers. The obtainable cache bandwidth for accessing every vector element stays below the bandwidth for accessing every second and fourth element. This is due to bad read/write interleaving and resulting bank conflicts, as the compiler assumes an access stride greater than one.
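A read-only kernel with a configurable stride, of the kind used for these measurements, might look like this (a sketch with our naming; the stride is given in bytes):

    #include <stddef.h>

    /* Sum every (stride/8)-th element of a working set of 'bytes' bytes.
       stride = 8 touches every double; any stride that is a multiple of
       256 Bytes hits the same L2 bank on every access and is serialized. */
    double strided_sum(const double *A, size_t bytes, size_t stride)
    {
        const size_t n = bytes / sizeof(double);
        const size_t step = stride / sizeof(double);
        double sum = 0.0;
        for (size_t i = 0; i < n; i += step)
            sum += A[i];
        return sum;
    }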

Fig. 6. Read-only bandwidth with different strides (strides of 8 to 1024 Bytes, over the amount of memory used, in KByte).

Fig. 7. Read/write bandwidth with different strides (strides of 8 to 1024 Bytes, over the amount of memory used, in KByte).

C. Access on Separate Memory Segments

For multi-processor shared memory systems it is always of interest to what degree the running processes influence each other as they access the main memory. To study that behavior, the presented algorithm was adapted as a parallel MPI program. All running processes were instructed to use their own memory segments, and no communication took place to exchange vector data. The processes were, however, synchronized with barriers, so that they were all working on the same vector sizes. Since one active process is enough to fill the front side (or system) bus of a base module, this experiment shows how well the Altix system handles memory access overload situations. The results of the benchmark are shown in figures 8 and 9.

Fig. 8. Read-only bandwidth with different # of processes (1 to 4 active processes, accumulated bandwidth over the amount of memory used per process, in KByte).

Fig. 9. Read/write bandwidth with different # of processes (1 to 4 active processes, accumulated bandwidth over the amount of memory used per process, in KByte).

At first, one can extract from the plots that the obtainable cache bandwidth is independent of the number of processes running on the system. Secondly, the cache bandwidth for read-only accesses is slightly below that for combined read/write accesses, which suggests that the read-only algorithm is not using the full L3 cache transfer capability. As introduced earlier, the L3 cache is capable of transmitting four double precision floating point numbers per clock cycle to the L2 cache. Those capabilities are not fully used by either version.

Another interesting discovery within the results is that the Itanium 2 is not capable of using the full 6.4 GB/s of front side bus bandwidth. The reason for such behavior can be too small memory request buffers within the CPU: the system bus cannot be saturated with memory requests by one CPU, and the performance is limited to some degree by the main memory latency as well. The read-only access reaches a transfer rate of 5.5 GB/s for one active CPU, the combined read/write access 4.7 GB/s. The lower performance for the combined accesses is caused by bus turnaround cycles on the system bus, since it is a unidirectional bus which requires a few clock cycles to change the transmission direction.
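The separate-segment runs just discussed follow the pattern sketched below: every rank sweeps its private buffer, barriers keep the ranks in lockstep, and the accumulated bandwidth follows from the runtime of the slowest rank. The sketch makes several assumptions (buffer size, read-only sweep, output format) and is not the original benchmark code.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        const size_t n = 1 << 22;                   /* 32 MByte per process (assumed) */
        double *A = malloc(n * sizeof *A);
        for (size_t i = 0; i < n; i++) A[i] = 1.0;  /* first touch: pages land locally */

        MPI_Barrier(MPI_COMM_WORLD);                /* all ranks start together */
        double t0 = MPI_Wtime(), sum = 0.0;
        for (size_t i = 0; i < n; i++) sum += A[i]; /* read-only sweep of own segment */
        MPI_Barrier(MPI_COMM_WORLD);                /* runtime is set by the slowest rank */
        double t = MPI_Wtime() - t0;

        if (rank == 0)                              /* accumulated read bandwidth */
            printf("%.2f GB/s (checksum %.0f)\n",
                   (double)nproc * n * sizeof(double) / t / 1e9, sum);
        free(A);
        MPI_Finalize();
        return 0;
    }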

Finally, the measurement data displays a somewhat surprising result: the accumulated bandwidth for three active processes drops below that of two active processes. This is due to load imbalances within the modules. While two processes can be placed by the scheduler so that each module contains one process (which then has the system bus all to itself), three processes require one module to use both CPUs. At that point those two CPUs have to share the system bus, thus limiting the available bandwidth per process to 3.2 GB/s. Since the runtime is taken when all processes have finished their work, the third process, which has the other module to itself, is held back by the other two. For three processes one then receives a theoretical bandwidth of 3 x 3.2 GB/s = 9.6 GB/s, versus 2 x 6.4 GB/s = 12.8 GB/s for two active processes.

D. Access on Shared Memory Segments

Within shared memory environments one can use multiple threads to work on data within the same address space. In this case, bandwidth is even more important, as the threads running on the different processors might access the same physical piece of main memory. When using OpenMP to share the work (in this case the vectors to compute), each thread will work on a part of the overall data. The number of threads used in this measurement was varied from one to four; the results are plotted in figures 10 and 11.

Fig. 10. Read-only bandwidth with different # of threads (1 to 4 threads, accumulated bandwidth over the total amount of memory used, in KByte).

Fig. 11. Read/write bandwidth with different # of threads (1 to 4 threads, accumulated bandwidth over the total amount of memory used, in KByte).

The cache expansion effect of shared memory computing can be observed well in both figures. While the performance drops for one active thread as it leaves the cache, two threads can actually postpone that drop as they divide the work between each other. Hence, they virtually double the available cache size. This effect can be observed for more threads as well. Additionally, the cache bandwidth also increases constantly when adding more threads. Interestingly, the accumulated memory bandwidth exceeds 6.4 GB/s when switching from two to three threads. This confirms the system design, which allows a transfer rate of 10.2 GB/s between the memory and the SHUB. The data for this experiment is located in only one module; the other module has to access that data remotely. The SHUB can then feed data to the local processors and the remote processor by saturating the connection to the memory. Therefore, adding a fourth thread does not result in any further performance improvements.
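The shared-segment measurement distributes one common set of vectors among the threads; with OpenMP this is a plain worksharing loop, sketched below (our simplification):

    #include <stddef.h>

    /* All threads share A, B and C; each thread updates its own slice,
       so t threads effectively combine their caches for one working set. */
    void shared_read_write(double *A, const double *B,
                           const double *C, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            A[i] += B[i] * C[i];
    }

With t threads each slice is only n/t elements long, which is exactly the cache expansion effect visible in the figures.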
E. Access with Varying Degree of Randomness

A final experiment was used to determine how the bandwidth develops as the randomness of the memory accesses increases. For that purpose a so-called gather code was produced. It accesses data indirectly over an index vector: the values of a vector A, which is accessed by using the index vector J, are summed up to generate a read-only access pattern. By arranging the elements in J sequentially and then interchanging a varying number of elements, one can generate the different degrees of randomness when accessing A. The experiment was run in parallel, where each process was working on its own memory segment; the results for 5 MByte of used memory space are displayed in figure 12.
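The gather code can be sketched as follows; J is the index vector, and the degree of randomness is set by the number of swapped index pairs (the swap construction is our assumption based on the description above):

    #include <stdlib.h>
    #include <stddef.h>

    /* Fill J with 0..n-1, then randomly interchange 'swaps' pairs:
       swaps = 0 yields sequential access, larger values approach
       a fully random access pattern. */
    void make_index(size_t *J, size_t n, size_t swaps)
    {
        for (size_t i = 0; i < n; i++) J[i] = i;
        while (swaps--) {
            size_t a = (size_t)rand() % n, b = (size_t)rand() % n;
            size_t t = J[a]; J[a] = J[b]; J[b] = t;
        }
    }

    /* Read-only gather: A is accessed indirectly through J. */
    double gather_sum(const double *A, const size_t *J, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += A[J[i]];
        return sum;
    }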

Fig. 12. Bandwidth with different # of processes (1 to 4 processes, accumulated bandwidth over the degree of randomness, in %).

The results are comparable to the ones received when accessing the memory with different strides: the higher the randomness, which corresponds to large distances between two accesses, the lower the obtained bandwidth. This behavior is not surprising, since the cache lines loaded from main memory are then only accessed once or twice before being evicted again, and only one eighth or one fourth of the main memory bandwidth is actually used. The usage of two processes doubles the received bandwidth. Adding more processes only leads to a slight increase in performance, as the front side busses of the two modules are already almost saturated.

V. CONCLUSION

The research presented in this paper has shown that in some cases the obtainable results for applications correspond with the best cases offered by the hard- and software. In most cases, however, the measured performance stays behind the processor's capabilities. This was to be expected and is in no way a surprising result, since it is the case for most modern superscalar microprocessors. It could, furthermore, be shown that the SGI Altix 350 usually can live up to its capabilities when a few basic principles are followed:

- Use the system evenly; try to avoid load imbalances, which occur when allocating an odd number of threads or processes.
- Bind processes to their CPUs using dplace. This avoids them being moved away from their data.
- Initialize memory pages by the process which will be working on the data. This will place the memory page (if possible) onto the module the process is running on and will, therefore, minimize remote memory accesses and network traffic.

The characteristics and limitations of a superscalar processor in general, and of the Itanium 2 specifically, also suggest keeping the following in mind while using the processor:

- Avoid access patterns that spill and reload cache lines.
- Avoid access patterns with a multiple of 256 Bytes between two accesses.
- Try not to use more than 2 MBytes of cache, as latency will rapidly increase beyond that point while bandwidth drops.

Overall, the scalability and sustainability of the obtained results encourage the usage of the Altix. The systems offer a unique amount of shared memory space and very good floating point performance at the same time. The main memory bandwidth is, however, the boundary that cannot be crossed by data intensive applications. The front side bus then becomes the bottleneck in feeding data to the two processors. Hence, more than two CPUs per module would show no performance gain, and SGI is well advised to reduce the number of Itanium 2 Montecito (dual core) CPUs per module in the next generation of Altix systems to one, to be capable of supplying them with data. Further research by the authors will evaluate whole applications with respect to their memory performance on the Altix. Additionally, code optimizations on those applications are planned for the large Altix system soon to be installed in Dresden.

ACKNOWLEDGMENT

All measurements presented in this paper have been done using BenchIT, a performance analysis environment developed at the Center for Information Services and High Performance Computing, Dresden.

REFERENCES

[1] Intel, "Itanium 2 Processor Reference Manual: For Software Development and Optimization," May 2004.
[2] SGI, "The SGI Altix 350 Server," January 2005.
[3] D. Lenoski et al., "The Stanford Dash Multiprocessor," IEEE Computer, vol. 25, no. 3, pp. 63-79, March 1992.


More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

anced computer architecture CONTENTS AND THE TASK OF THE COMPUTER DESIGNER The Task of the Computer Designer

anced computer architecture CONTENTS AND THE TASK OF THE COMPUTER DESIGNER The Task of the Computer Designer Contents advanced anced computer architecture i FOR m.tech (jntu - hyderabad & kakinada) i year i semester (COMMON TO ECE, DECE, DECS, VLSI & EMBEDDED SYSTEMS) CONTENTS UNIT - I [CH. H. - 1] ] [FUNDAMENTALS

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed Survey results CS 6354: Memory Hierarchy I 29 August 2016 1 2 Processor/Memory Gap Variety in memory technologies SRAM approx. 4 6 transitors/bit optimized for speed DRAM approx. 1 transitor + capacitor/bit

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

CS 614 COMPUTER ARCHITECTURE II FALL 2005

CS 614 COMPUTER ARCHITECTURE II FALL 2005 CS 614 COMPUTER ARCHITECTURE II FALL 2005 DUE : November 9, 2005 HOMEWORK III READ : - Portions of Chapters 5, 6, 7, 8, 9 and 14 of the Sima book and - Portions of Chapters 3, 4, Appendix A and Appendix

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information