Analyzing Cache Bandwidth on the Intel Core 2 Architecture


John von Neumann Institute for Computing

Analyzing Cache Bandwidth on the Intel Core 2 Architecture
Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger

published in: Parallel Computing: Architectures, Algorithms and Applications, C. Bischof, M. Bücker, P. Gibbon, G. R. Joubert, T. Lippert, B. Mohr, F. Peters (Eds.), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 365-372, 2007. Reprinted in: Advances in Parallel Computing, Volume 15, ISSN 0927-5452, ISBN 978-1-58603-796-3 (IOS Press), 2008.

© 2007 by John von Neumann Institute for Computing. Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.

http://www.fz-juelich.de/nic-series/volume38

Analyzing Cache Bandwidth on the Intel Core 2 Architecture

Robert Schöne, Wolfgang E. Nagel, and Stefan Pflüger
Center for Information Services and High Performance Computing
Technische Universität Dresden
01062 Dresden, Germany
E-mail: {robert.schoene, wolfgang.nagel, stefan.pflueger}@tu-dresden.de

Intel Core 2 processors are used in servers, desktops, and notebooks. They combine the Intel 64 instruction set architecture with a new microarchitecture based on Intel Core and are proclaimed by their vendor to be the world's best processors. In this paper, measured bandwidths between the computing cores and the different caches are presented. The STREAM benchmark [1] is one of the kernels most frequently used by scientists to determine memory bandwidth. For deeper insight, the STREAM benchmark was redesigned to obtain exact values for small problem sizes as well. This analysis gives hints for faster data access and compares performance results of standard and tuned routines on the Intel Core 2 architecture.

1 Introduction

For analyzing the details of a computer architecture and its implementation, as well as software influences, a convenient performance measuring tool is necessary. For this kind of task, BenchIT [2,3] has been developed at the Center for Information Services and High Performance Computing at the Technische Universität Dresden. BenchIT implements several features this paper benefits from: a variable problem size for the measured algorithms, remote measurement support, and easy comparison of results.

Memory performance is bound by latency and bandwidth. Since the memory bandwidth in modern computer systems does not grow as fast as the arithmetical performance, caches are essential for the performance of most applications. This work will show that the transfer rate is not only bound by hardware limitations but also depends on the software, the compiler, and the compiler flags.

2 The Measured Systems

In this paper, an Intel Core 2 Duo, the so-called Woodcrest, is the reference object of analysis. A short overview is given in Table 1; more information can be obtained from the Intel homepage [4]. Performance results for the other processors listed there are presented in Section 4.

3 STREAM Benchmark Related Analysis

STREAM, first presented in 1991, is a synthetic benchmark based on different routines which use one-dimensional arrays of double precision floating point data. Thus the total performance is bound by several factors: first of all, the total bandwidth in all

parts of the system between the FPU and the highest memory level in which the data can be stored. This can be limited by the transfer rates between the different memory levels, but also by the width of the result bus for data transfers within the cores. Secondly, there is the maximal floating point performance, which is high enough in most cases (a).

3.1 STREAM Benchmark Overview

The benchmark consists of four different parts which are measured separately. Every part implements a vector operation on double precision floating point data. These parts are copy, scale, add, and triad. All operations have one resulting vector and up to two source vectors.

3.2 Implementation of Measuring Routines

The original STREAM benchmark is available as source code in C and FORTRAN, but also as a binary for several systems. A fragment of the C code is listed below.

#define N 1000000
#define OFFSET 0

double a[N+OFFSET];
double b[N+OFFSET];
double c[N+OFFSET];

#pragma omp parallel for
for (j = 0; j < N; j++)
    c[j] = a[j];                  /* copy  */
#pragma omp parallel for
for (j = 0; j < N; j++)
    b[j] = scalar * c[j];         /* scale */
#pragma omp parallel for
for (j = 0; j < N; j++)
    c[j] = a[j] + b[j];           /* add   */
#pragma omp parallel for
for (j = 0; j < N; j++)
    a[j] = b[j] + scalar * c[j];  /* triad */

Listing 1. Fragment of the STREAM benchmark

3.3 First Performance Measurements

First measurements derived from the STREAM benchmark led to unsatisfying results. The timer granularity was not high enough to determine bandwidths for problem sizes which fit into the L1 cache, and the results for the L2 cache can be imprecise as well. In order to reduce these effects, a more precise timer has been used (read time stamp counter, rdtsc). Furthermore, the benchmark has been adapted to fit the BenchIT interface.
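The exact timing code is not reproduced in the paper; a minimal sketch of an rdtsc-based measurement of the copy kernel might look as follows. The vector length, repeat count, and the fixed 3.0 GHz clock used for the tick-to-seconds conversion are illustrative assumptions, not values taken from the paper.

#include <stdint.h>
#include <stdio.h>

#define N       100000        /* illustrative vector length           */
#define REPEATS 1000          /* repetitions to stretch the runtime   */
#define CPU_HZ  3.0e9         /* assumed fixed core clock (Xeon 5160) */

static double a[N], c[N];

/* Read the time stamp counter; it ticks once per core clock cycle,
   which is far finer than the timer of the original STREAM code. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = rdtsc();
    for (int r = 0; r < REPEATS; r++)
        for (int j = 0; j < N; j++)   /* copy kernel */
            c[j] = a[j];
    uint64_t ticks = rdtsc() - t0;

    /* copy moves 2 * N * 8 bytes (one read, one write) per repetition;
       as discussed in Section 3.4, one must verify that the compiler
       does not elide the repetitions. */
    double bytes = 2.0 * (double)N * sizeof(double) * REPEATS;
    printf("copy: %.2f GB/s\n", bytes / (ticks / CPU_HZ) / 1.0e9);
    return 0;
}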

Figure 1. Measurement on Intel Xeon 5160 (Woodcrest), non-optimized, compiler flag -O3, 1 core. [Plot: bandwidth of copy, scale, add, and triad over the vector length N]

Using the compiler flag -O3 led to the performance results shown in Fig. 1. To benefit from special features of these processors, the compiler flag -xP can be used (b). It adds support for SSE3 operations as well as for all previous vectorization extensions (e.g. MMX) which were added after the IA-32 definition. The usage of vector operations leads to a performance benefit of at least fifty percent, as can be seen in Fig. 2.

Figure 2. Measurement on Intel Xeon 5160 (Woodcrest), non-optimized, compiler flags -O3 -xP, 1 core. [Plot: bandwidth of copy, scale, add, and triad over the vector length N]

(a) An exception, for example, is Sun's UltraSPARC T1, which implements only one FPU for up to 8 cores.
(b) With compiler version 10.0, additional flags were introduced especially for use with Core processors: -xO and -xT.

To parallelize the vector operations, STREAM uses OpenMP, which is supported by many compilers. When the flag -openmp is used in addition to those mentioned before, a performance benefit appears in the L2 cache. For small problem sizes, the influence of the parallelization overhead is too large to obtain exact performance results. The complete results can be seen in Fig. 3.

Figure 3. Measurement on Intel Xeon 5160 (Woodcrest), 2 cores active (OpenMP), non-optimized, compiler flags -O3 -openmp. [Plot: bandwidth of copy, scale, add, and triad over the vector length N]

A performance comparison of different compiler flag combinations shows that when a loop is parallelized with OpenMP, its vectorization is disabled. The compiler output also indicates that the LOOP WAS PARALLELIZED but not VECTORIZED.

3.4 Optimizations

As the previous results have shown, there are challenges arising from overheads as well as from the inability to vectorize and OpenMP-parallelize code simultaneously. The overhead causes inaccurate measurements for small problem sizes and can be reduced easily: when the functions are repeated, the runtime is extended by the number of repeats as well. The compiler could conceivably remove or alter these repetitions, at least for copy and scale; this has been checked in all following measurements and did not occur. A repetition leads to different cache borders in the resulting figure: the copy and scale curves shift to problem sizes about thirty percent larger (c).

To combine SIMD and OpenMP parallelization, the loop is divided into two parallel parts: the first thread calculates the first half of the vectors, the second thread the other half. With the loop changed this way, the timer is also moved into the OpenMP parallel region and the single vector operations are surrounded with barriers, which further reduces the overhead.

(c) This calculation is based on storing only two vectors of size N in cache instead of three.
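The restructured loop itself is not shown in the paper; the following sketch is one plausible form of the described split for the triad operation, assuming two threads, an even vector length, and the rdtsc timer sketched above. Because each thread runs a plain serial loop instead of an OpenMP worksharing loop, the compiler is free to vectorize it again.

#include <omp.h>
#include <stdint.h>

#define N 100000                       /* illustrative, assumed even */

double a[N], b[N], c[N];

static inline uint64_t rdtsc(void)     /* timer as sketched above */
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Each of the two threads works on its own contiguous half of the
   vectors, so the inner loop stays vectorizable; the timer sits inside
   the parallel region, fenced by barriers, so the fork/join overhead
   stays out of the measurement. */
uint64_t timed_triad(double scalar)
{
    uint64_t t0 = 0, t1 = 0;

    #pragma omp parallel num_threads(2) shared(t0, t1)
    {
        int half  = N / 2;
        int start = omp_get_thread_num() * half;

        #pragma omp barrier                  /* both threads ready    */
        #pragma omp master
        t0 = rdtsc();
        #pragma omp barrier                  /* start time taken      */

        for (int j = start; j < start + half; j++)
            a[j] = b[j] + scalar * c[j];     /* triad on this half    */

        #pragma omp barrier                  /* both halves finished  */
        #pragma omp master
        t1 = rdtsc();
    }
    return t1 - t0;
}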

However, the resulting performance is not yet as high as possible. Previous measurements have shown that an alignment of 16 bytes helps SSE memory operations to complete faster. The compiler directive #pragma vector aligned can be written on top of a loop as a hint that all vectors within this loop are 16-byte aligned (d). A normal memory allocation does not guarantee this alignment, so specific routines should be used; Intel's C compiler provides the routine _mm_malloc(...) when the headers for SIMD support are included. The implementation and usage of these hints and routines achieve a better performance, but other negative effects on the L1 cache performance become visible.

Looking at the results closely, it appears that the algorithm performs better for problem sizes which are multiples of 16. This fact can be explained easily: if the length of the vector is a multiple of 16, both cores compute on a part which is 64-byte aligned, which corresponds to the cache line size. When only these cases are selected, the resulting performance is stable on all memory levels.

Figure 4. Measurement on Intel Xeon 5160 (Woodcrest), triad, optimized, compiler flags -O3 -xP -openmp, 2 cores. [Plot: bandwidth over the vector length N for the optimization steps "reducing overhead", "16 byte alignment", and "128 byte alignment"]

As an example, the bandwidth for triad is shown for all optimization steps in Fig. 4. The speedup compared to a sequential execution is about 2 within the caches, no matter whether the two cores are on the same die (as in the results shown before) or on different dies.

(d) This directive is also available in FORTRAN as !DEC$ VECTOR ALIGNED
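A compact sketch of the allocation and alignment hint described above, again with illustrative sizes: the 128-byte request matches the alignment the paper found best once the loop is split across two cores (a multiple of the 64-byte cache line).

#include <xmmintrin.h>    /* _mm_malloc, _mm_free */

#define N 100000          /* illustrative; multiples of 16 behaved best */

int main(void)
{
    /* 16 bytes suffice for aligned SSE loads and stores; requesting
       128-byte alignment additionally keeps each thread's half of the
       vectors on cache line boundaries. */
    double *a = _mm_malloc(N * sizeof(double), 128);
    double *b = _mm_malloc(N * sizeof(double), 128);
    double *c = _mm_malloc(N * sizeof(double), 128);
    double scalar = 3.0;

    for (int j = 0; j < N; j++) {  /* initialize the source vectors */
        b[j] = 1.0;
        c[j] = 2.0;
    }

    /* tells icc that a, b, and c are 16-byte aligned within this loop */
    #pragma vector aligned
    for (int j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];   /* triad */

    _mm_free(a); _mm_free(b); _mm_free(c);
    return 0;
}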

4 Comparison to Other Dual Core Processors

After the performance has been optimized on the Intel Xeon 5160, the results are compared to previous x86 processors by the same vendor: an Intel Core Duo T2600 and an Intel Xeon 5060. A short overview of some key properties is given in Table 1. The predecessors in desktop and mobile computing are based on different microarchitectures: whilst the Xeon 5060 is a representative of the Netburst era, the T2600 represents the Pentium M architecture used for mobile computing. Additionally, an AMD Opteron 285 processor has been tested.

                       Intel Xeon      Intel Core      Intel Xeon      AMD Opteron
                       5160            Duo T2600       5060            285
  Codename             Woodcrest       Yonah           Dempsey         Italy
  Compiler             icc 9.1-em64t   icc 9.1-32      icc 9.1-em64t   icc 9.1-em64t
  Clockrate            3.0 GHz         2.167 GHz       3.2 GHz         2.6 GHz
  L1 I-Cache per core  32 KB           32 KB           12 k µops       64 KB
  L1 D-Cache per core  32 KB           32 KB           16 KB           64 KB
  L2 Cache             4 MB shared     2 MB shared     2 * 2 MB        2 * 512 KB

Table 1. Overview of the measured systems

Figure 5. Measurement on different processors, triad, compiler flags -O3 -xP -openmp, 2 cores. [Plot: bandwidth over the vector length N for Woodcrest, Yonah, Dempsey, and Italy]

The results in Figs. 5 and 6 show that the Core 2 architecture outperforms the other processors by at least a factor of two. The main reason has its origin within the processor core: the result bus was widened to 128 bit, and the number of floating point operations that can be performed in one cycle was increased. Also, the transfer rate between L1 cache and core was widened, so 32 bytes can be read and 32 bytes written per cycle.

Figure 6. Measurement on different processors, copy, compiler flags -O3 -xP -openmp, 2 cores. [Plot: bandwidth over the vector length N for Woodcrest, Yonah, Dempsey, and Italy]

5 Conclusion

The Intel Core 2 Duo processors achieve a very high bandwidth within the cores when memory is accessed linearly. This can be reached by using highly optimizing compilers and architecture-specific flags. Compiler optimizations alone are quite restricted, however, and the user has to optimize manually to achieve reasonable results. When parallelizing loops with OpenMP, the benefits from compiler flags may be lost, as has been shown. In addition to the optimizing flags, a memory alignment of 128 bytes and specific compiler hints like #pragma vector aligned provide the best performance in this case, significantly outperforming previous x86 processors.

Acknowledgement

This work could not have been done without the help of the HPC group of the Regionales Rechenzentrum Erlangen and the access granted to several of its computer systems.

References

1. J. D. McCalpin, Memory bandwidth and machine balance in current high performance computers, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, (1995).
2. G. Juckeland, S. Börner, M. Kluge, S. Kölling, W. E. Nagel, S. Pflüger, H. Röding, S. Seidl, T. William, and R. Wloch, BenchIT - Performance Measurement and Comparison for Scientific Applications, Proc. ParCo 2003, pp. 51-58, (2004). http://www.benchit.org/download/doc/parco2003_paper.pdf
3. R. Schöne, G. Juckeland, W. E. Nagel, S. Pflüger, and R. Wloch, Performance comparison and optimization: case studies using BenchIT, Proc. ParCo 2005, G. Joubert et al., eds., pp. 877-884, (2006).
4. Intel Corporation, Intel Xeon Processor Website, http://www.intel.com/design/xeon/documentation.htm