Accessing Data on SGI Altix: An Experience with Reality

Guido Juckeland, Matthias S. Müller, Wolfgang E. Nagel, Stefan Pflüger
Technische Universität Dresden
Center for Information Services and High Performance Computing
01062 Dresden, Germany
{guido.juckeland, matthias.mueller, wolfgang.nagel, stefan.pflueger}@tu-dresden.de

Abstract. The SGI Altix system architecture supports very large ccNUMA shared memory systems. Nevertheless, the system layout places limits on the sustained memory performance that can only be avoided by selecting the right data access strategies. The paper presents the results of cache and memory performance studies on an SGI Altix 350. It demonstrates limitations and benefits of the system and the processor underneath.

I. INTRODUCTION

The increasing performance gap between processor and memory speed in computer systems strongly calls for sophisticated data management strategies. While vector systems utilize proprietary memory modules with many memory banks to allow a large number of parallel memory accesses, superscalar processors introduced caches to buffer small pieces of frequently used data. A lot of research has gone into optimizing the microarchitecture of the processors as well as applications and compilers for both techniques.

The memory system of a computer has two key characteristics: bandwidth (how much data the system can transfer per time unit) and latency (how long the CPU has to wait, at minimum, for the first data element). While the caches of a scalar processor have a low-latency, high-bandwidth connection to the processing units, they are very expensive and can only hold a limited amount of data. Modern microarchitectures of scalar processors therefore contain a multi-stage cache hierarchy to balance the trade-off between size and speed. It is beyond question that main memory access is the bottleneck in every computer system based on a scalar processor. Knowing the exact latencies and bandwidths for all cache levels and the various memory levels in a distributed shared memory environment, however, is key to understanding and improving application performance on a specific computer system.

This paper presents research on the SGI Altix 350 that establishes and verifies those metrics. The first section introduces the Altix system architecture and the Itanium 2 processor architecture with a focus on the memory subsystem. Afterwards, the measurement results for single and parallel access patterns are discussed. The presented results will enable programmers to avoid major pitfalls when working on SGI Altix systems.

II. THE SGI ALTIX 350

The Itanium 2 processor with its good floating point performance and the large amount of addressable shared memory combine for a unique HPC system in the SGI Altix series. The characteristics of these two components with regard to memory performance are the subject of this section.

A. The Processor

The Itanium 2 processor distinguishes itself from other superscalar processors in that it implements the IA-64 architecture and, therefore, an improved VLIW ISA (called EPIC). Since not every combination of operations within an instruction word (bundle) is permitted, this also has an impact on the number of memory access operations that can be issued at once. The Itanium instruction set allows for a maximum of two such operations per bundle. The microarchitecture can issue two bundles per clock, which results in at most four memory transactions issued per clock cycle.
It will be shown that this corresponds to the maximum transfer capabilities between the caches and the computation units (see [1]). The IA-64 ISA is a so-called load/store architecture: compute operations can only be issued on register contents, and the data has to be moved into and out of the registers by separate load/store instructions.

1) Cache Hierarchy: The three-stage cache hierarchy of the Itanium 2 processor is shown in figure 1. It can be seen that data, before reaching the computation units, has to flow through the three cache levels. There is, however, a unique exception to traditional access schemes, as can be seen in figure 2: the floating point unit (FPU) is not connected to the first level data cache but to the level two cache. This allows pointers to data structures to remain in the L1D cache while the floating point data is processed out of the L2 cache. The transfer rates from the caches into the computation units are shown in table I. It can be seen, as mentioned earlier, that the microarchitecture only allows at most four load/store instructions per clock cycle.

2) Translation Lookaside Buffers (TLBs): As will be shown later on, the TLB size is of significant importance to the cache and memory access latencies. Since cache and memory are accessed on the Itanium 2 with their physical addresses, the time for the address translation t_t adds to the physical access time t_p.

Fig. 1. Cache hierarchy of the Itanium 2 [1]. (L3: up to 9 MB, 128 Byte lines, 14+ clocks; L2: 256 KB, 128 Byte lines, 5+ clocks; L1I and L1D: 16 KB each, 64 Byte lines, 1 clock; 6.4 GB/s system bus to the northbridge.)

Fig. 2. Itanium 2 block diagram [1].

TABLE I
TRANSFER RATES BETWEEN COMPUTE UNITS AND THE ATTACHED CACHES (IN BYTE PER CLOCK CYCLE) [1]

Unit | Connected to | Read rate | Write rate
ALU  | L1D          | 16 Byte       | 16 Byte
FPU  | L2           | 16 or 32 Byte | 16 Byte

The translation time and the physical access time add up to the total access time t_a:

t_a = t_t + t_p    (1)

While the physical access time t_p is fixed (depending on the cache or memory level), the total access time t_a depends heavily on the translation time t_t. If the address translation has to be done by the operating system (OS), which is ultimately responsible for page allocation and translation, a penalty larger than the physical access time t_p will occur. The Itanium 2 features a two-level TLB structure which buffers a number of the most recently used translations to reduce the translation time t_t. Its characteristics are summarized in table II.

TABLE II
ITANIUM 2 TLB CHARACTERISTICS [1]

Characteristic      | Instruction TLB (Level 1 / Level 2) | Data TLB (Level 1 / Level 2)
Number of entries   | 32 / 128                            | 32 / 128
Associativity       | Full / Full                         | Full / Full
Penalty for L1 miss | 2 clock cycles                      | 4 clock cycles

The TLBs have to deal with the flexibility of the Itanium 2 microarchitecture regarding the supported memory page sizes, which can range from 4 KByte up to 4 GByte. The first level TLBs can only handle 4 KByte pages, but can work with larger pages by using 4 KByte segments of such a page. The L1 TLBs can, therefore, contain translations for 32 x 4 KByte = 128 KByte of memory space each. The L1 TLBs and the L1 caches are tightly coupled: an L1 cache line will be invalidated if a corresponding L1 TLB entry is evicted. The second level TLBs support all page sizes but suffer from OS limitations. SGI runs Linux on the Altix systems, which uses a fixed page size of 16 KByte for all processes. Hence, the address space for which translations can be placed into the L2 TLBs is limited to 128 x 16 KByte = 2 MByte. Beyond that point another Itanium 2 specific mechanism tries to avoid invoking the operating system: the hardware page walker (HPW).

3) Hardware Page Walking: When no address translation can be located in the TLBs, the hardware page walker will access the virtual hash page tables (VHPTs), a special data structure kept additionally to the L2 and L3 cache and in a portion of main memory. These tables contain the address translations for their corresponding caches. A memory access that, for example, misses the TLBs but whose cache line is held in the L2 cache will use the L2 VHPT. The best case penalties for this page walking are shown in table III.

TABLE III
BEST CASE HPW PENALTIES [1]

Event                      | Penalty in clock cycles
Hit in L2 VHPT             | 25
Miss in L2 VHPT, hit in L3 | 31
Miss in L2 and L3          | 20 + main memory latency

B. The SGI Altix 350 System Architecture

The SGI Altix series contains the small to medium range servers of the Altix 350 series and the large supercomputers of the Altix 3000 series. The system studied in this paper is an Altix 350 with four Itanium 2 (Madison) CPUs with 3 MBytes of L3 cache each. The system is set up of so-called modules of different types: base, extension and router modules. The base module contains the CPUs as well as the main memory. Its layout is shown in figure 3. The Altix 350 at hand is running Red Hat Enterprise Linux 3 with ProPack 3 (SP4); the Intel compilers were used for all measurements.

Fig. 3. SGI Altix 350 Base Module [2]. (Two CPUs on a shared front side bus connect to the SHUB, which links to the local main memory at 10.2 GB/s, to the I/O system at 2.4 GB/s with four PCI/PCI-X slots, and to other modules via NUMAlink.)
1) System Layout: The modules are combined to form larger systems using the two NUMAlink 4 connections available per base module. The connected modules form a ccNUMA shared memory environment, so that one running process can allocate all of the system's main memory. The system layout for the SGI Altix 350 at hand is shown in figure 4. The system contains 8 GByte of main memory.

2) Cache Coherency: Ensuring cache coherency in such large scale shared memory systems poses quite a challenge. SGI uses the SHUBs and NUMAlinks to communicate the coherency information. When the system is booted, the SHUB reserves a portion of its local memory to contain the addresses of all cache lines and a bit mask containing one bit for every CPU in the system. These bits are used to mark the processors sharing the cache line. The SHUB also listens to the cache snoop information transmitted by the Itanium 2 CPUs over the front side bus and manages the directory accordingly. When a CPU writes to a shared cache line, the SHUB will transmit that cache line to every CPU holding a copy [3].
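To picture this directory scheme, the following C fragment sketches one directory entry with a presence bit per CPU and the forwarding step on a write. It is purely illustrative: the field sizes, the names, and the send_line_to_cpu() helper are our assumptions, not SGI's actual implementation.

    #include <stdint.h>

    #define NUM_CPUS 4    /* the Altix 350 at hand: 2 modules x 2 CPUs */

    /* Illustrative directory entry kept in SHUB-local memory. */
    struct dir_entry {
        uint64_t line_addr;   /* address of the 128-byte cache line */
        uint32_t sharers;     /* bit i set: CPU i holds a copy      */
    };

    /* Hypothetical stand-in for a front side bus/NUMAlink transfer. */
    static void send_line_to_cpu(int cpu, uint64_t line_addr)
    {
        (void)cpu; (void)line_addr;
    }

    /* On a write by 'writer', forward the line to every other sharer. */
    static void directory_on_write(struct dir_entry *e, int writer)
    {
        for (int i = 0; i < NUM_CPUS; i++)
            if (i != writer && (e->sharers & (1u << i)))
                send_line_to_cpu(i, e->line_addr);
        e->sharers |= 1u << writer;
    }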

Fig. 4. SGI Altix 350 system layout [2]. (Two base modules, each with two CPUs on a front side bus, a SHUB, 10.2 GB/s local memory and 2.4 GB/s I/O, connected via NUMAlink.)

III. LATENCY

The time between the issue of a load instruction and the arrival of the requested data from one of the caches or the main memory is known as access latency. In the case of the Altix system it is (as shown earlier) the sum of the physical latency and the address translation time. Intel specifies in [1] the Itanium 2 latencies shown in table IV.

TABLE IV
ITANIUM 2 ACCESS LATENCIES [1]

Hierarchy level | Physical latency
Level 1 cache   | 1 clock cycle
Level 2 cache   | 5 clock cycles
Level 3 cache   | 12 clock cycles
Main memory     | system dependent (about 100 ns)

This section will establish the total access times for the three cache levels as well as the main memory latency with one and more than one process accessing the memory.

A. Measuring Algorithm

The exact measurement of cache and memory access times poses quite some problems for the performance analyst. One access is too fast to use a timer, and for multiple accesses one has to circumvent hard- and software optimizations that try to hide the latency. In this paper a pointer chasing algorithm was used to acquire the measurement data. Pointer chasing uses the data from one access to determine the address of the next access. In case of a random dispersal of the addresses within the allocated memory area, neither hardware nor software prefetching can determine the access pattern, and every access will encounter the full latency. By varying the size of the memory area used, one can determine all cache latencies as well as the main memory access time.

The used memory area is initialized using the following algorithm: Treat the area as a vector of pointers. Select a random position and have the first element point to that position. Remove the element from the list of available elements and pick a new random element. Have the last element point to this new element, and so on. Finally, have the last selected element in the vector point to the first element. One receives a kind of interwoven ring of pointers that allows random jumps through the allocated memory area.
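As an illustration, the initialization and the measurement loop can be sketched in C as follows. This is a minimal sketch of the idea just described (the names and structure are ours, not the exact benchmark source):

    #include <stdlib.h>

    /* Build the "interwoven ring" of pointers described above: every
       element points to another randomly chosen element, each element
       is visited exactly once, and the ring closes on the first one. */
    void **init_ring(void **area, size_t n)
    {
        size_t *avail = malloc(n * sizeof *avail);
        for (size_t i = 0; i < n; i++) avail[i] = i;

        size_t cur = 0, left = n - 1;   /* start the chain at element 0  */
        avail[0] = avail[left];         /* element 0 is no longer free   */
        while (left > 0) {
            size_t r = rand() % left;   /* pick a random free element    */
            size_t next = avail[r];
            avail[r] = avail[--left];   /* remove it from the free list  */
            area[cur] = &area[next];    /* current element points to it  */
            cur = next;
        }
        area[cur] = &area[0];           /* close the ring */
        free(avail);
        return area;
    }

    /* Chase the pointers. Each load depends on the previous one, so
       neither hardware nor software prefetching can hide the latency. */
    void *chase(void **p, long iters)
    {
        while (iters--) p = *p;
        return p;   /* returning p defeats dead-code elimination */
    }

Dividing the runtime of chase() by the number of iterations yields the average total access time t_a for the chosen memory size.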
B. Access on Separate Memory Segments

A parallel version of the pointer chasing algorithm was furthermore created using the MPI standard. The involved processes then all work in their own piece of main memory, and one can study their influence on each other regarding the access latency. They do not exchange any data using MPI messages, but do, however, use barriers to synchronize with each other.

1) Cache Latencies: When restricting the allocated memory area to fit within the processor cache, the cache latencies can be determined. Since all CPUs are then working out of their caches, they do not influence each other. Therefore, only the measurement results for one active process are shown in figure 5. They demonstrate the exceptionally low cache access times of the Itanium 2 processor.

Fig. 5. Cache access time for an Itanium 2 Madison 3M with 1.4 GHz (access time in clock cycles over the amount of memory used, in KByte).

The effects of the address translation can be clearly seen in the picture. While the L1 cache latency is, as expected, at 2 clock cycles for pointer chasing, the L2 cache latency increases from 5 to about 9 clock cycles: the L1 TLB cannot hold the translations for all data, and the 4 clock cycle penalty for accessing the L2 TLB is added to the access time. The L3 cache latency is somewhat larger than the value provided by Intel, again due to the L1 TLB misses. As the used memory area hits the capacity of the L2 TLB at 2 MByte, the access time rapidly increases: the hardware page walking adds to the address translation time, and the benefit of the Itanium 2's fast level three cache is gone. This is of specific importance since this result applies to all Itanium 2 CPUs: once the working set exceeds the TLB reach, the caches lose a lot of their performance.

2) Main Memory Latencies: Using the pointer chasing algorithm to determine the main memory latency yields somewhat contradictory results.

The process scheduling also influences process and data placement and has to be taken into account. Therefore, the processes were pinned to the CPUs using the dplace tool from SGI. The results gained with and without that tool are gathered in table V.¹

TABLE V
MEASURED MEMORY LATENCIES WITH 1 MBYTE OF USED MEMORY PER PROCESS

Without dplace:
Total # of processes | Latency for 1 / 2 / 3 / 4 active processes
1 | 110 ns
2 | 110 ns / 110 ns
3 | 110 ns / 144 ns / 126 ns
4 | 110 ns / 144 ns / 144 ns / 126 ns

With dplace:
Total # of processes | Latency for 1 / 2 / 3 / 4 active processes
1 | 110 ns
2 | 110 ns / 126 ns
3 | 110 ns / 126 ns / 126 ns
4 | 110 ns / 126 ns / 145 ns / 126 ns

The results raise a number of questions: Why is there a decrease for 4 processes when all of them are active, compared to keeping a number of them idle? Why does the usage of dplace increase some access times while it decreases others? And why does one only see three different latency values? One possible explanation is hidden in the system architecture. The values are all about 15 ns apart, which corresponds pretty well with the time for one NUMAlink hop. It seems that in case of the 126 ns at least one process is working with memory that is located in the other module (remote memory). In case of the 144 ns it could be that at least one process is in fact working with its local memory (located in its own module) but the data is somehow sent to the other module first, before coming back to the original module. This would result in two NUMAlink hops and would add about 30 ns to the initial latency.

¹ The access time increases to about 150 ns for 1 GByte of used memory space. Since the same effects are visible, the memory space was reduced to 1 MByte to reduce the measurement runtime.

IV. BANDWIDTH

While access latencies can be hidden by a number of hard- and software mechanisms, the available bandwidth cannot be increased by such techniques. Therefore, bandwidth is the limiting factor for most applications, especially in data intensive computing. The maximum data transmission rates for the SGI Altix are shown in table VI. This section will compare these values with the obtained measurement results and will discuss the scalability with more than one active process.

TABLE VI
MAXIMAL BANDWIDTH WITHIN SGI ALTIX 350

Connection                         | Bandwidth (1.4 GHz)
L2 cache                           | 22.4 GB/s read or write
L3 <-> L2                          | 44.8 GB/s read or write
Front side bus (within one module) | 6.4 GB/s read or write
SHUB <-> memory                    | 10.2 GB/s read or write
NUMAlink 4                         | 3.2 GB/s read + 3.2 GB/s write

A. Measuring Algorithm

The STREAM benchmark has established itself as a standard for the measurement of memory transfer rates. It uses a set of different vector operations (e.g. a vector triad) on double precision floating point numbers where no piece of data is reused. This worst-case scenario for scalar processors practically circumvents the caches, as it does not include any kind of data locality. The available bandwidth BW can be determined from the number of vectors used n_vector, the length of the vectors l_vector, the size of one element in the vector s_element, and the run time t_run as follows:

BW = (n_vector · l_vector · s_element) / t_run    (2)

Originally, the STREAM benchmark only measures the bandwidth for one fixed memory size. This was changed so that the bandwidth can be determined for a range of memory sizes. Furthermore, two self-selected vector operations,

Temp := Temp + A + B · C    (3)
A := A + B · C    (4)

have been selected to determine the available bandwidths for pure read (3) and combined read/write (4) operations.
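For illustration, operations (3) and (4) can be coded as the following C kernels (a sketch under our assumptions; the actual measurement kernels are part of BenchIT). Timing these loops and applying equation (2) yields the bandwidth; for (4), the store of A adds one more transferred vector to n_vector.

    #include <stddef.h>

    /* Operation (3): pure read. Only A, B and C are loaded; the scalar
       accumulator stays in a register, so nothing is written back. */
    double bw_read_only(const double *A, const double *B,
                        const double *C, size_t n)
    {
        double temp = 0.0;
        for (size_t i = 0; i < n; i++)
            temp += A[i] + B[i] * C[i];
        return temp;   /* returning temp keeps the loop from being removed */
    }

    /* Operation (4): combined read/write. A, B and C are loaded and
       the updated A is stored again. */
    void bw_read_write(double *A, const double *B,
                       const double *C, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            A[i] += B[i] * C[i];
    }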
The measurement routines have also been implemented to benchmark the accumulated bandwidth with more than one process accessing separate and shared memory segments.

B. Memory Bandwidth with Different Strides

A first benchmark, which focuses on cache bandwidth, measures the available bandwidth when accessing data with different strides. The memory range was varied from 32 KByte up to 10 MByte and the stride sizes from every element (8 Byte) to every 128th element (1 KByte). The results of the benchmark are displayed in figures 6 and 7.

The first figure underlines an important characteristic of the second level cache. The sudden drop in performance between strides of 128 Bytes and 256 Bytes points to bank conflicts within the L2 cache. As the bank width of that cache is 256 Bytes, all requests whose distance is a multiple of that width access the same cache bank and are, therefore, serialized. This cuts the available bandwidth in half. The drop in performance as the capacity of the L2 TLB is reached at 2 MByte can be observed as well.

The results for the combined read/write access point to a weakness of the compilers. The obtainable cache bandwidth for accessing every vector element stays below the bandwidth for accessing every second and fourth element. This is due to bad read/write interleaving and resulting bank conflicts, as the compiler assumes an access stride greater than one.
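A read-only kernel with a configurable stride, of the kind used for these measurements, might look like this (a sketch with our naming; the stride is given in bytes):

    #include <stddef.h>

    /* Sum every (stride/8)-th element of a working set of 'bytes' bytes.
       stride = 8 touches every double; any stride that is a multiple of
       256 Bytes hits the same L2 bank on every access and is serialized. */
    double strided_sum(const double *A, size_t bytes, size_t stride)
    {
        const size_t n = bytes / sizeof(double);
        const size_t step = stride / sizeof(double);
        double sum = 0.0;
        for (size_t i = 0; i < n; i += step)
            sum += A[i];
        return sum;
    }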

Fig. 6. Read-only bandwidth with different strides (strides of 8 to 1024 Bytes, over the amount of memory used, in KByte).

Fig. 7. Read/write bandwidth with different strides (strides of 8 to 1024 Bytes, over the amount of memory used, in KByte).

C. Access on Separate Memory Segments

For multi-processor shared memory systems it is always of interest to what degree the running processes influence each other as they access the main memory. To study that behavior, the presented algorithm was adapted as a parallel MPI program. All running processes were instructed to use their own memory segments, and no communication took place to exchange vector data. The processes were, however, synchronized with barriers, so that they were all working on the same vector sizes. Since one active process is enough to fill the front side (or system) bus of a base module, this experiment shows how well the Altix system handles memory access overload situations. The results of the benchmark are shown in figures 8 and 9.

Fig. 8. Read-only bandwidth with different # of processes (1 to 4 active processes, accumulated bandwidth over the amount of memory used per process, in KByte).

Fig. 9. Read/write bandwidth with different # of processes (1 to 4 active processes, accumulated bandwidth over the amount of memory used per process, in KByte).

At first, one can extract from the plots that the obtainable cache bandwidth is independent of the number of processes running on the system. Secondly, the cache bandwidth for read-only accesses is slightly below that for combined read/write accesses, which suggests that the read-only algorithm is not using the full L3 cache transfer capability. As introduced earlier, the L3 cache is capable of transmitting four double precision floating point numbers per clock cycle to the L2 cache. Those capabilities are not fully used by either version.

Another interesting discovery within the results is that the Itanium 2 is not capable of using the full 6.4 GB/s of front side bus bandwidth. The reason for such behavior can be too small memory request buffers within the CPU: the system bus cannot be saturated with memory requests by one CPU, and the performance is limited to some degree by the main memory latency as well. The read-only access reaches a transfer rate of 5.5 GB/s for one active CPU, the combined read/write access 4.7 GB/s. The lower performance for the combined accesses is caused by bus turnaround cycles on the system bus, since it is a unidirectional bus which requires a few clock cycles to change the transmission direction.
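The separate-segment runs just discussed follow the pattern sketched below: every rank sweeps its private buffer, barriers keep the ranks in lockstep, and the accumulated bandwidth follows from the runtime of the slowest rank. The sketch makes several assumptions (buffer size, read-only sweep, output format) and is not the original benchmark code.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        const size_t n = 1 << 22;                   /* 32 MByte per process (assumed) */
        double *A = malloc(n * sizeof *A);
        for (size_t i = 0; i < n; i++) A[i] = 1.0;  /* first touch: pages land locally */

        MPI_Barrier(MPI_COMM_WORLD);                /* all ranks start together */
        double t0 = MPI_Wtime(), sum = 0.0;
        for (size_t i = 0; i < n; i++) sum += A[i]; /* read-only sweep of own segment */
        MPI_Barrier(MPI_COMM_WORLD);                /* runtime is set by the slowest rank */
        double t = MPI_Wtime() - t0;

        if (rank == 0)                              /* accumulated read bandwidth */
            printf("%.2f GB/s (checksum %.0f)\n",
                   (double)nproc * n * sizeof(double) / t / 1e9, sum);
        free(A);
        MPI_Finalize();
        return 0;
    }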

Finally, the measurement data displays a somewhat surprising result: the accumulated bandwidth for three active processes drops below that of two active processes. This is due to load imbalances within the modules. While two processes can be placed by the scheduler so that each module contains one process (which then has the system bus all to itself), three processes require one module to use both CPUs. At that point those two CPUs have to share the system bus, thus limiting the available bandwidth per process to 3.2 GB/s. Since the runtime is taken when all processes have finished their work, the third process, which has the other module to itself, is held back by the other two. For three processes one then receives a theoretical bandwidth of 3 x 3.2 GB/s = 9.6 GB/s, versus 2 x 6.4 GB/s = 12.8 GB/s for two active processes.

D. Access on Shared Memory Segments

Within shared memory environments one can use multiple threads to work on data within the same address space. In this case, bandwidth is even more important, as the threads running on the different processors might access the same physical piece of main memory. When using OpenMP to share the work (in this case the vectors to compute), each thread will work on a part of the overall data. The number of threads used in this measurement was varied from one to four; the results are plotted in figures 10 and 11.

Fig. 10. Read-only bandwidth with different # of threads (1 to 4 threads, accumulated bandwidth over the total amount of memory used, in KByte).

Fig. 11. Read/write bandwidth with different # of threads (1 to 4 threads, accumulated bandwidth over the total amount of memory used, in KByte).

The cache expansion effect of shared memory computing can be observed well in both figures. While the performance drops for one active thread as it leaves the cache, two threads can actually postpone that drop as they divide the work between each other. Hence, they virtually double the available cache size. This effect can be observed for more threads as well. Additionally, the cache bandwidth also increases constantly when adding more threads. Interestingly, the accumulated memory bandwidth exceeds 6.4 GB/s when switching from two to three threads. This confirms the system design, which allows a transfer rate of 10.2 GB/s between the memory and the SHUB. The data for this experiment is located in only one module; the other module has to access that data remotely. The SHUB can then feed data to the local processors and the remote processor by saturating the connection to the memory. Therefore, adding a fourth thread does not result in any further performance improvements.
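The shared-segment measurement distributes one common set of vectors among the threads; with OpenMP this is a plain worksharing loop, sketched below (our simplification):

    #include <stddef.h>

    /* All threads share A, B and C; each thread updates its own slice,
       so t threads effectively combine their caches for one working set. */
    void shared_read_write(double *A, const double *B,
                           const double *C, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            A[i] += B[i] * C[i];
    }

With t threads each slice is only n/t elements long, which is exactly the cache expansion effect visible in the figures.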
E. Access with Varying Degree of Randomness

A final experiment was used to determine how the bandwidth develops as the randomness of the memory accesses increases. For that purpose a so-called gather code was produced. It accesses data indirectly over an index vector: the values of a vector A, which is accessed by using the index vector J, are summed up to generate a read-only access pattern. By arranging the elements in J sequentially and then interchanging a varying number of elements, one can generate the different degrees of randomness when accessing A. The experiment was run in parallel, where each process was working on its own memory segment; the results for 5 MByte of used memory space are displayed in figure 12.
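The gather code can be sketched as follows; J is the index vector, and the degree of randomness is set by the number of swapped index pairs (the swap construction is our assumption based on the description above):

    #include <stdlib.h>
    #include <stddef.h>

    /* Fill J with 0..n-1, then randomly interchange 'swaps' pairs:
       swaps = 0 yields sequential access, larger values approach
       a fully random access pattern. */
    void make_index(size_t *J, size_t n, size_t swaps)
    {
        for (size_t i = 0; i < n; i++) J[i] = i;
        while (swaps--) {
            size_t a = (size_t)rand() % n, b = (size_t)rand() % n;
            size_t t = J[a]; J[a] = J[b]; J[b] = t;
        }
    }

    /* Read-only gather: A is accessed indirectly through J. */
    double gather_sum(const double *A, const size_t *J, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += A[J[i]];
        return sum;
    }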

Fig. 12. Bandwidth with different # of processes (1 to 4 processes, accumulated bandwidth over the degree of randomness, in %).

The results are comparable to the ones received when accessing the memory with different strides: the higher the randomness, which corresponds to large distances between two accesses, the lower the obtained bandwidth. This behavior is not surprising, since the cache lines loaded from main memory are then only accessed once or twice before being evicted again, and only one eighth or one fourth of the main memory bandwidth is actually used. The usage of two processes doubles the received bandwidth. Adding more processes only leads to a slight increase in performance, as the front side busses of the two modules are already almost saturated.

V. CONCLUSION

The research presented in this paper has shown that in some cases the obtainable results for applications correspond with the best cases offered by the hard- and software. In most cases, however, the measured performance stays behind the processor's capabilities. This was to be expected and is in no way a surprising result, since it is the case for most modern superscalar microprocessors. It could, furthermore, be shown that the SGI Altix 350 usually can live up to its capabilities when a few basic principles are followed:

- Use the system evenly; try to avoid load imbalances, which occur when allocating an odd number of threads or processes.
- Bind processes to their CPUs using dplace. This avoids them being moved away from their data.
- Initialize memory pages by the process which will be working on the data. This will place the memory page (if possible) onto the module the process is running on and will, therefore, minimize remote memory accesses and network traffic.

The characteristics and limitations of a superscalar processor in general, and of the Itanium 2 specifically, also suggest keeping the following in mind while using the processor:

- Avoid access patterns that spill and reload cache lines.
- Avoid access patterns with a multiple of 256 Bytes between two accesses.
- Try not to use more than 2 MBytes of cache, as latency will rapidly increase beyond that point while bandwidth drops.

Overall, the scalability and sustainability of the obtained results encourage the usage of the Altix. The systems offer a unique amount of shared memory space and very good floating point performance at the same time. The main memory bandwidth is, however, the boundary that cannot be crossed by data intensive applications. The front side bus then becomes the bottleneck in feeding data to the two processors. Hence, more than two CPUs per module would show no performance gain, and SGI is well advised to reduce the number of Itanium 2 Montecito (dual core) CPUs per module in the next generation of Altix systems to one, to be capable of supplying them with data. Further research by the authors will evaluate whole applications with respect to their memory performance on the Altix. Additionally, code optimizations on those applications are planned for the large Altix system soon to be installed in Dresden.

ACKNOWLEDGMENT

All measurements presented in this paper have been done using BenchIT, a performance analysis environment developed at the Center for Information Services and High Performance Computing, Dresden.

REFERENCES

[1] Intel, "Itanium 2 Processor Reference Manual: For Software Development and Optimization," May 2004.
[2] SGI, "The SGI Altix 350 Server," January 2005.
[3] D. Lenoski et al., "The Stanford Dash Multiprocessor," IEEE Computer, vol. 25, no. 3, pp. 63-79, March 1992.


More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

anced computer architecture CONTENTS AND THE TASK OF THE COMPUTER DESIGNER The Task of the Computer Designer

anced computer architecture CONTENTS AND THE TASK OF THE COMPUTER DESIGNER The Task of the Computer Designer Contents advanced anced computer architecture i FOR m.tech (jntu - hyderabad & kakinada) i year i semester (COMMON TO ECE, DECE, DECS, VLSI & EMBEDDED SYSTEMS) CONTENTS UNIT - I [CH. H. - 1] ] [FUNDAMENTALS

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed Survey results CS 6354: Memory Hierarchy I 29 August 2016 1 2 Processor/Memory Gap Variety in memory technologies SRAM approx. 4 6 transitors/bit optimized for speed DRAM approx. 1 transitor + capacitor/bit

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

CS 614 COMPUTER ARCHITECTURE II FALL 2005

CS 614 COMPUTER ARCHITECTURE II FALL 2005 CS 614 COMPUTER ARCHITECTURE II FALL 2005 DUE : November 9, 2005 HOMEWORK III READ : - Portions of Chapters 5, 6, 7, 8, 9 and 14 of the Sima book and - Portions of Chapters 3, 4, Appendix A and Appendix

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information