John von Neumann Institute for Computing

Analyzing Cache Bandwidth on the Intel Core 2 Architecture
Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger

published in Parallel Computing: Architectures, Algorithms and Applications, C. Bischof, M. Bücker, P. Gibbon, G. R. Joubert, T. Lippert, B. Mohr, F. Peters (Eds.), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 365-372, 2007. Reprinted in: Advances in Parallel Computing, Volume 15, ISSN 0927-5452, ISBN 978-1-58603-796-3 (IOS Press), 2008.

© 2007 by John von Neumann Institute for Computing

Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.

http://www.fz-juelich.de/nic-series/volume38
Analyzing Cache Bandwidth on the Intel Core 2 Architecture

Robert Schöne, Wolfgang E. Nagel, and Stefan Pflüger

Center for Information Services and High Performance Computing
Technische Universität Dresden
01062 Dresden, Germany
E-mail: {robert.schoene, wolfgang.nagel, stefan.pflueger}@tu-dresden.de

Intel Core 2 processors are used in servers, desktops, and notebooks. They combine the Intel 64 Instruction Set Architecture with a new microarchitecture based on Intel Core and are proclaimed by their vendor as the world's best processors. In this paper, measured bandwidths between the computing cores and the different caches are presented. The STREAM benchmark 1 is one of the kernels most used by scientists to determine the memory bandwidth. For deeper insight, the STREAM benchmark was redesigned to obtain exact values for small problem sizes as well. This analysis gives hints for faster data access and compares performance results for standard and tuned routines on the Intel Core 2 Architecture.

1 Introduction

For analyzing the details of a computer architecture and its implementation, as well as software influences, a convenient performance measuring tool is necessary. For this kind of task, BenchIT 2,3 has been developed at the Center for Information Services and High Performance Computing at the Technische Universität Dresden. BenchIT implements several features this paper benefits from: a variable problem size for measured algorithms, remote measurement support, and easy comparison possibilities are some of them. The memory performance is latency and bandwidth bound. Since the memory bandwidth in modern computer systems does not grow as fast as the arithmetical performance, caches are essential for the performance of most applications. This work will show that the transfer rate is not only bound by hardware limitations but also depends on software, compiler, and compiler flags.
2 The Measured Systems

In this paper, an Intel Core 2 Duo, the so-called Woodcrest, is the reference object of analysis. A short overview is given in Table 1; more information can be obtained from the Intel homepage 4. Performance results for the other processors listed are presented in Section 4.

3 STREAM Benchmark Related Analysis

STREAM was first presented in 1991 and is a synthetic benchmark based on different routines which use one-dimensional fields of double precision floating point data. Thus the total performance is bound to several factors: first of all, the total bandwidth in all
parts of the system between the FPU and the highest memory level in which the data can be stored. This can be limited by transfer rates between the different memory levels, but also by the width of the result bus for data transfers within the cores. Secondly, there is the maximal floating point performance, which is high enough in most cases a.

3.1 STREAM Benchmark Overview

The benchmark consists of four different parts which are measured separately. Every part implements a vector operation on double precision floating point data. These parts are copy, scale, add, and triad. All operations have one resulting vector and up to two source vectors.

3.2 Implementation of Measuring Routines

The original STREAM benchmark is available as source code in C and FORTRAN, but also as a binary for several systems. A fragment of the C code is listed below.

#define N 1000000
#define OFFSET 0

double a[N+OFFSET];
double b[N+OFFSET];
double c[N+OFFSET];

/* copy */
#pragma omp parallel for
for (j = 0; j < N; j++)
    c[j] = a[j];

/* scale */
#pragma omp parallel for
for (j = 0; j < N; j++)
    b[j] = scalar * c[j];

/* add */
#pragma omp parallel for
for (j = 0; j < N; j++)
    c[j] = a[j] + b[j];

/* triad */
#pragma omp parallel for
for (j = 0; j < N; j++)
    a[j] = b[j] + scalar * c[j];

Listing 1. Fragment of the STREAM benchmark

3.3 First Performance Measurements

First measurements derived from the STREAM benchmark led to unsatisfying results. The timer granularity was not high enough to determine bandwidths for problem sizes which fit into the L1 cache; results for the L2 cache can be imprecise as well. In order to reduce these effects, a better timer has been used (read time stamp counter, rdtsc). Furthermore, the benchmark has been adapted to fit the BenchIT interface.
Figure 1. Measurement on Intel Xeon 5160 (Woodcrest), non-optimized, compiler flag -O3, 1 core

Using the compiler flag -O3 led to the performance results shown in Fig. 1. To benefit from special features of these processors, the compiler flag -xP can be used b. It adds support for SSE3 operations as well as all previous vectorization possibilities (e.g. MMX) which were added after the IA-32 definition. The usage of vector operations leads to a performance benefit of at least fifty percent, as can be seen in Fig. 2.

Figure 2. Measurement on Intel Xeon 5160 (Woodcrest), non-optimized, compiler flags -O3 -xP, 1 core

a An exception, for example, is Sun's UltraSPARC T1, which implements only one FPU for up to 8 cores.
b With compiler version 10.0, additional flags were introduced especially for the use with Core processors. These are -xO and -xT.
To parallelize the vector operations, STREAM uses OpenMP, which is supported by many compilers. When the flag -openmp is used in addition to those mentioned before, a performance benefit appears in the L2 cache. For small problem sizes, the influence of the parallelization overhead is too large to obtain exact performance results. The complete results can be seen in Fig. 3.

Figure 3. Measurement on Intel Xeon 5160 (Woodcrest), non-optimized, compiler flags -O3 -openmp, 2 cores

A performance comparison of different compiler flag combinations shows that when a loop is parallelized with OpenMP, the vectorizations are disabled. The compiler output also indicates that the LOOP WAS PARALLELIZED but not VECTORIZED.

3.4 Optimizations

As the previous results have shown, there are challenges which arise from overheads as well as from the inability to vectorize and OpenMP-parallelize code simultaneously. The overhead causes inaccurate measurements for small problem sizes and can be reduced easily: when repeating the functions, the runtime is extended by the number of repeats as well. It may be possible that the compiler removes or alters these repetitions, at least for copy and scale; this has been checked in all following measurements and did not occur. A repetition leads to other cache borders in the resulting figure. They are indicated by a shift, relative to the problem sizes, of thirty percent later for the copy and scale operations c. To combine SIMD and OpenMP parallelization, the loop is divided into two parallel parts: the first thread calculates the first half of the vectors, the second thread calculates the other fifty percent. When changing the loop, the timer is also moved into the OpenMP parallel region and the single vector operations are surrounded with barriers, which also reduces the overhead.
c This calculation is based on storing only two vectors of size N in the cache instead of three.
However, the resulting performance is not yet as high as possible. Previous measurements have shown that an alignment of 16 bytes helps SSE memory operations to complete faster. The compiler directive #pragma vector aligned can be written on top of loops to give a hint that all vectors within this loop are 16-byte aligned d. A normal memory allocation does not guarantee this alignment, therefore specific routines should be used. For these cases, Intel's C compiler allows the usage of the routine _mm_malloc(...) when the headers for SIMD support are included. The implementation and usage of these hints and routines achieves a better performance, but other negative effects on the L1 cache performance become visible. Looking at the results closely, it appears that the algorithm performs better on problem sizes which are multiples of 16. This fact can be explained easily: if the length of the vector is a multiple of 16, both cores compute on a part which is 64-byte aligned, which matches the cache line size. When only these cases are selected, the resulting performance is stable on all memory levels.

Figure 4. Measurement on Intel Xeon 5160 (Woodcrest), Triad, optimized, compiler flags -O3 -xP -openmp, 2 cores (optimization steps: reducing overhead, 16 byte alignment, 128 byte alignment)

As an example, the bandwidth for triad is shown for all optimization steps in Fig. 4. The speedup compared to a sequential execution is about 2 within the caches, no matter whether the two cores are on the same die (as in the results shown before) or on different dies.

4 Comparison to Other Dual Core Processors

After the performance has been optimized on the Intel Xeon 5160, those results are compared to previous x86 processors by the same vendor. These are an Intel Core Duo T2600

d This directive is also available in FORTRAN as !DEC$ VECTOR ALIGNED
and an Intel Xeon 5060. A short overview of some key properties is given in Table 1. The predecessors in desktop and mobile computing are based on different microarchitectures: whilst the Xeon 5060 is a representative of the Netburst era, the T2600 represents the Pentium M architecture used for mobile computing. Additionally, an AMD Opteron 285 processor has been tested.

                      Intel Xeon      Intel Core      Intel Xeon      AMD Opteron
                      5160            Duo T2600       5060            285
Codename              Woodcrest       Yonah           Dempsey         Italy
Compiler              icc 9.1-em64t   icc 9.1-32      icc 9.1-em64t   icc 9.1-em64t
Clockrate             3.0 GHz         2.167 GHz       3.2 GHz         2.6 GHz
L1 I-Cache per Core   32 KB           32 KB           12 kµops        64 KB
L1 D-Cache per Core   32 KB           32 KB           16 KB           64 KB
L2 Cache              4 MB shared     2 MB shared     2 * 2 MB        2 * 512 KB

Table 1. Overview of the measured systems

Figure 5. Measurement on different processors, Triad, compiler flags -O3 -xP -openmp, 2 cores

The results in Figs. 5 and 6 show that the Core 2 architecture outperforms the other processors by at least a factor of two. The main reason has its origin within the processor core: the result bus was widened to 128 bit, and the number of floating point operations that can be performed in one cycle was increased. Also, the transfer rate between L1 cache and core was widened, so that 32 bytes can be read and 32 bytes can be written per cycle.
Figure 6. Measurement on different processors, Copy, compiler flags -O3 -xP -openmp, 2 cores

5 Conclusion

The Intel Core 2 Duo processors have a very high bandwidth within the cores when memory is accessed linearly. This can be achieved by using highly optimizing compilers and architecture-specific flags. Compiler optimizations are quite restricted, however, and the user has to optimize manually to achieve reasonable results. When parallelizing loops with OpenMP, benefits from compiler flags may be lost, as has been shown. In addition to the optimizing flags, a memory alignment of 128 bytes and specific hints for the compiler like #pragma vector aligned provide the best performance in this case, significantly outperforming previous x86 processors.

Acknowledgement

This work could not have been done without the help of, and the access to several computer systems granted by, the Regionales Rechenzentrum Erlangen HPC-Group.
References

1. J. D. McCalpin, Memory bandwidth and machine balance in current high performance computers, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, (1995).
2. G. Juckeland, S. Börner, M. Kluge, S. Kölling, W. E. Nagel, S. Pflüger, H. Röding, S. Seidl, T. William, and R. Wloch, BenchIT - Performance Measurement and Comparison for Scientific Applications, Proc. ParCo2003, pp. 51-58, (2004). http://www.benchit.org/download/doc/parco2003_paper.pdf
3. R. Schöne, G. Juckeland, W. E. Nagel, S. Pflüger, and R. Wloch, Performance comparison and optimization: case studies using BenchIT, Proc. ParCo2005, G. Joubert et al., eds., pp. 877-884, (2006).
4. Intel Corporation, Intel Xeon Processor Website, http://www.intel.com/design/xeon/documentation.htm.