Analyzing Cache Bandwidth on the Intel Core 2 Architecture
John von Neumann Institute for Computing

Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger

published in Parallel Computing: Architectures, Algorithms and Applications, C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr, F. Peters (Eds.), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 38, 2007. Reprinted in: Advances in Parallel Computing, Volume 15 (IOS Press), 2008.

© 2007 by John von Neumann Institute for Computing. Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.
Robert Schöne, Wolfgang E. Nagel, and Stefan Pflüger
Center for Information Services and High Performance Computing
Technische Universität Dresden, 01062 Dresden, Germany
{robert.schoene, wolfgang.nagel, stefan.pflueger}@tu-dresden.de

Intel Core 2 processors are used in servers, desktops, and notebooks. They combine the Intel64 Instruction Set Architecture with a new microarchitecture based on Intel Core and are proclaimed by their vendor as the world's best processors. In this paper, measured bandwidths between the computing cores and the different caches are presented. The STREAM benchmark [1] is one of the kernels most widely used by scientists to determine memory bandwidth. For deeper insight, the STREAM benchmark was redesigned to deliver exact values for small problem sizes as well. This analysis gives hints for faster data access and compares performance results for standard and tuned routines on the Intel Core 2 architecture.

1 Introduction

Analyzing the details of a computer architecture and its implementation, as well as software influences, requires a convenient performance measuring tool. For this kind of task, BenchIT [2, 3] has been developed at the Center for Information Services and High Performance Computing at the Technische Universität Dresden. BenchIT implements several features this paper benefits from, among them a variable problem size for the measured algorithms, remote measurement support, and easy comparison of results.

Memory performance is bound by latency and bandwidth. Since the memory bandwidth in modern computer systems does not grow as fast as the arithmetic performance, caches are essential for the performance of most applications. This work will show that the transfer rate is not only bound by hardware limitations but also depends on software, compiler, and compiler flags.
2 The Measured Systems

In this paper, an Intel Core 2 Duo, the so-called Woodcrest, is the reference object of analysis. A short overview is shown in Table 1; more information can be obtained from the Intel homepage [4]. Performance results for the other processors listed are presented in Section 4.

3 STREAM Benchmark Related Analysis

STREAM was first presented in 1991 and is a synthetic benchmark based on different routines which use one-dimensional fields of double precision floating point data. Thus the total performance is bound by several factors. First of all, there is the total bandwidth in all parts of the system between the FPU and the highest memory level in which the data can be stored. This can be limited by transfer rates between the different memory levels, but also by the width of the result bus for data transfers within cores. Secondly, there is the maximal floating point performance, which is high enough in most cases [a].

3.1 STREAM Benchmark Overview

The benchmark consists of four different parts which are measured separately. Every part implements a vector operation on double precision floating point data. These parts are copy, scale, add, and triad. All operations have one resulting vector and up to two source vectors.

3.2 Implementation of Measuring Routines

The original STREAM benchmark is available as source code in C and FORTRAN, but also as a binary for several systems. A fragment of the C code is listed below.

    #define N      1     /* vector length; the remaining digits are lost in this transcription */
    #define OFFSET 0

    double a[N+OFFSET];
    double b[N+OFFSET];
    double c[N+OFFSET];

    /* copy */
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        c[j] = a[j];

    /* scale */
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        b[j] = scalar * c[j];

    /* add */
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        c[j] = a[j] + b[j];

    /* triad */
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];

Listing 1. Fragment of the STREAM Benchmark

3.3 First Performance Measurements

First measurements derived from the STREAM benchmark led to unsatisfying results. The timer granularity was not fine enough to determine bandwidths for problem sizes which fit into the L1 cache, and results for the L2 cache can be imprecise as well. In order to reduce these effects, a better timer has been used (the read time stamp counter instruction, rdtsc). Furthermore, the benchmark has been adapted to fit the BenchIT interface.
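The timer change described above can be sketched as follows. This is a minimal illustration, not the BenchIT kernel itself: the names read_ticks, compiler_barrier, and copy_bytes_per_tick are hypothetical, the __rdtsc intrinsic and the inline-assembly barrier assume GCC or Clang, and the non-x86 fallback to timespec_get is an addition for portability. The byte count follows the usual STREAM convention of one 8-byte load and one 8-byte store per copied element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time stamp counter (the rdtsc instruction named in the text). */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Assumed fallback for non-x86 hosts: nanoseconds from the C11 clock. */
static uint64_t read_ticks(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Keep the compiler from moving memory accesses across the timer reads. */
static inline void compiler_barrier(void) { __asm__ volatile("" ::: "memory"); }

/* Run the STREAM copy kernel once and return bytes moved per timer tick. */
double copy_bytes_per_tick(double *c, const double *a, size_t n) {
    assert(c != NULL && a != NULL);
    uint64_t start = read_ticks();
    compiler_barrier();
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
    compiler_barrier();
    uint64_t ticks = read_ticks() - start;
    /* copy reads one double and writes one double per element */
    double bytes = 2.0 * (double)sizeof(double) * (double)n;
    return ticks != 0 ? bytes / (double)ticks : 0.0;
}
```

On x86 the result is in bytes per TSC tick; multiplying by the tick frequency gives bytes per second. For vectors that fit into the L1 cache a single run is still very short, which is why repeating the kernel, as discussed in Section 3.4, matters.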
Figure 1. Measurement on Intel Xeon 5160 (Woodcrest), non-optimized, compiler flag -O3, 1 core. (Plot: bandwidth of Copy, Scale, Add, and Triad over N, the length of the vectors.)

Using the compiler flag -O3 led to the performance results shown in Fig. 1. To benefit from special features of these processors, the compiler flag -xP can be used [b]. It adds support for SSE3 operations as well as all previous vectorization possibilities (e.g. MMX) which were added after the IA-32 definition. The usage of vector operations leads to a performance benefit of at least fifty percent, which can be seen in Fig. 2.

Figure 2. Measurement on Intel Xeon 5160 (Woodcrest), optimizing compiler flags, non-optimized, compiler flags -O3 -xP -openmp. (Plot: bandwidth of Copy, Scale, Add, and Triad over N, the length of the vectors.)

[a] An exception, for example, is Sun's UltraSPARC T1, which implements only one FPU for up to 8 cores.
[b] With compiler version 10.0, additional flags were introduced especially for use with Core processors. These are -xO and -xT.
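What vectorization does to the triad loop can be pictured with SSE2 intrinsics. The sketch below is hand-written for illustration and is not the code the Intel compiler generates; the function names triad_scalar and triad_sse2 are hypothetical, and the scalar path is used for the remainder and on builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar triad as in Listing 1: a[j] = b[j] + scalar * c[j]. */
void triad_scalar(double *a, const double *b, const double *c,
                  double scalar, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Triad processing two doubles per instruction where SSE2 is available. */
void triad_sse2(double *a, const double *b, const double *c,
                double scalar, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    size_t j = 0;
#if defined(__SSE2__)
    const __m128d s = _mm_set1_pd(scalar);      /* scalar in both lanes */
    for (; j + 2 <= n; j += 2) {
        __m128d vb = _mm_loadu_pd(b + j);       /* unaligned 16-byte loads */
        __m128d vc = _mm_loadu_pd(c + j);
        _mm_storeu_pd(a + j, _mm_add_pd(vb, _mm_mul_pd(s, vc)));
    }
#endif
    for (; j < n; j++)                          /* remainder or non-SSE2 build */
        a[j] = b[j] + scalar * c[j];
}
```

Each 128-bit operation handles two elements, so the loop issues roughly half as many load, store, and arithmetic instructions as the scalar version.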
To parallelize the vector operations, STREAM uses OpenMP, which is supported by many compilers. When the flag -openmp is used in addition to those mentioned before, a performance benefit appears in the L2 cache. For small problem sizes, the influence of the parallelization overhead is too large to obtain exact performance results. The complete results can be seen in Fig. 3.

Figure 3. Measurement on Intel Xeon 5160 (Woodcrest), 2 cores active (OpenMP), non-optimized, compiler flags -O3 -openmp, 2 cores. (Plot: bandwidth of Copy, Scale, Add, and Triad over N, the length of the vectors.)

A performance comparison for different compiler flag combinations shows that when a loop is parallelized with OpenMP, vectorization is disabled. The compiler output also indicates that the LOOP WAS PARALLELIZED but not VECTORIZED.

3.4 Optimizations

As the previous results have shown, challenges arise from overheads as well as from the inability to vectorize and OpenMP-parallelize code simultaneously. The overhead causes inaccurate measurements for small problem sizes and can be reduced easily: when the functions are repeated, the runtime is extended by the number of repetitions as well, so the fixed overhead becomes small relative to the measured time. It may be possible that the compiler removes or alters these repetitions, at least for copy and scale. This has been checked in all following measurements and did not occur. A repetition leads to other cache borders in the resulting figure: for copy and scale operations, the borders are shifted to problem sizes that are thirty percent larger [c].

To combine SIMD and OpenMP parallelization, the loop is divided into two parallel parts. The first thread calculates the first half of the vectors, the second thread calculates the second half. Along with this change, the timer is moved into the OpenMP parallel region, where it surrounds the individual vector operations together with barriers, which also reduces the overhead.
[c] This calculation is based on storing only two vectors of size N in cache instead of three.
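The split into two parallel parts with the timer inside the parallel region can be sketched as below. This is an illustration under stated assumptions, not the authors' kernel: the function names are hypothetical, omp_get_wtime stands in for the rdtsc-based timer used in the paper, and the code falls back to a single thread when compiled without OpenMP support.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

static size_t thread_id(void) {
#ifdef _OPENMP
    return (size_t)omp_get_thread_num();
#else
    return 0;
#endif
}

static size_t thread_count(void) {
#ifdef _OPENMP
    return (size_t)omp_get_num_threads();
#else
    return 1;
#endif
}

static double wall_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / (double)CLOCKS_PER_SEC;
#endif
}

/* Plain loop over [begin, end): simple enough for the compiler to vectorize. */
static void triad_range(double *a, const double *b, const double *c,
                        double scalar, size_t begin, size_t end) {
    for (size_t j = begin; j < end; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Each thread works on one contiguous part of the vectors. Thread 0 takes
   the time between the two barriers and the elapsed seconds are returned. */
double triad_two_halves(double *a, const double *b, const double *c,
                        double scalar, size_t n) {
    double elapsed = 0.0;
    assert(a != NULL && b != NULL && c != NULL);
#pragma omp parallel num_threads(2) shared(elapsed)
    {
        size_t tid = thread_id();
        size_t nthreads = thread_count();
        size_t begin = n * tid / nthreads;
        size_t end = n * (tid + 1) / nthreads;
        double t0 = 0.0;
#pragma omp barrier
        if (tid == 0) t0 = wall_seconds();
        triad_range(a, b, c, scalar, begin, end);
#pragma omp barrier
        if (tid == 0) elapsed = wall_seconds() - t0;
    }
    return elapsed;
}
```

Because the threads already exist when the first barrier is reached, the cost of starting the parallel region stays outside the timed interval, and the inner loop remains an ordinary loop that the vectorizer can handle.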
However, the resulting performance is not yet as high as possible. Previous measurements have shown that an alignment of 16 bytes helps SSE memory operations to complete faster. The compiler directive #pragma vector aligned can be written on top of loops to give a hint that all vectors within this loop are 16-byte aligned [d]. A normal memory allocation does not guarantee this alignment; therefore specific routines should be used. For these cases Intel's C compiler allows the usage of the routine _mm_malloc(...) when including headers for SIMD support. The implementation and usage of these hints and routines achieve a better performance, but other negative effects on the L1 cache performance become visible.

Looking at the results closely, it appears that the algorithm performs better on problem sizes which are multiples of 16. This fact can be explained easily. If the length of the vector is a multiple of 16, both cores compute on a part which is 64-byte aligned, which matches the cache line size. When only these cases are selected, the resulting performance is stable on all memory levels.

Figure 4. Measurement on Intel Xeon 5160 (Woodcrest), Triad, optimized, compiler flags -O3 -xP -openmp, 2 cores. (Plot: bandwidth over N, the length of the vectors, for the steps reducing overhead, 16 byte alignment, and 128 byte alignment.)

As an example, the bandwidth for triad is shown for all optimization steps in Fig. 4. The speedup compared to a sequential execution is about 2 within the caches, no matter whether the two cores are on the same die (as in the results shown before) or on different dies. Speedup results can be seen in Fig.

[d] This directive is also available in FORTRAN as !DEC$ VECTOR ALIGNED.

4 Comparison to other Dual Core Processors

After the performance has been optimized on the Intel Xeon 5160, those results are compared to previous x86 processors by the same vendor. These are an Intel Core Duo T2600 and an Intel Xeon 5060. A short overview of some key properties is given in Table 1. The predecessors in desktop and mobile computing are based on different microarchitectures: whilst the Xeon 5060 is a representative of the Netburst era, the T2600 represents the Pentium M architecture used for mobile computing. Additionally, an AMD Opteron 285 processor has been tested.

                         Intel Xeon      Intel Core      Intel Xeon      AMD Opteron
                         5160            Duo T2600       5060            285
Codename                 Woodcrest       Yonah           Dempsey         Italy
Compiler                 icc 9.1-em64t   icc             icc 9.1-em64t   icc 9.1-em64t
Clockrate                3.0 GHz         GHz             3.2 GHz         2.6 GHz
L1 I-Cache (per core)    32 kB           32 kB           16 kB           64 kB
L1 D-Cache (per core)    32 kB           32 kB           12 kµops        64 kB
L2 Cache                 4 MB shared     2 MB shared     2 * 2 MB        2 * 512 kB

Table 1. Overview of the measured systems

Figure 5. Measurement on different processors, Triad, compiler flags -O3 -xP -openmp, 2 cores. (Plot: bandwidth over N, the length of the vectors, for Woodcrest, Yonah, Dempsey, and Italy.)

The results in Figs. 5 and 6 show that the Core 2 architecture outperforms the other processors by at least a factor of two. The main reason has its origin within the processor core. The result bus was widened to 128 bit, and the number of floating point operations that can be performed in one cycle was increased. Also, the transfer rate between L1 cache and core was widened, so 32 bytes can be read and 32 bytes can be written per cycle.
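The Woodcrest results compared here rely on the alignment hints introduced in Section 3.4, which can be sketched as follows. The sketch uses the C11 routine aligned_alloc as a portable stand-in for _mm_malloc, and the names alloc_aligned_doubles and triad_aligned are hypothetical. The #pragma vector aligned line is only emitted for the Intel compiler; other compilers see a plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n doubles on an `alignment`-byte boundary (a power of two).
   aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the memory with free(). */
double *alloc_aligned_doubles(size_t n, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t bytes = n * sizeof(double);
    size_t padded = (bytes + alignment - 1) / alignment * alignment;
    if (padded == 0)
        padded = alignment;
    return aligned_alloc(alignment, padded);
}

/* Triad with the alignment hint; valid only for 16-byte aligned vectors. */
void triad_aligned(double *a, const double *b, const double *c,
                   double scalar, size_t n) {
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)c % 16 == 0);
#if defined(__INTEL_COMPILER)
#pragma vector aligned
#endif
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

Calling alloc_aligned_doubles(n, 16) gives the 16-byte case and alloc_aligned_doubles(n, 128) the 128-byte case shown in Fig. 4. Memory obtained from _mm_malloc, by contrast, has to be released with _mm_free rather than free.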
Figure 6. Measurement on different processors, Copy, compiler flags -O3 -xP -openmp, 2 cores. (Plot: bandwidth over N, the length of the vectors, for Woodcrest, Yonah, Dempsey, and Italy.)

5 Conclusion

The Intel Core 2 Duo processors have a very high bandwidth within the cores when memory is accessed linearly. This can be achieved by using highly optimizing compilers and architecture-specific flags. Compiler optimizations are quite restricted, however, and the user has to optimize manually to achieve reasonable results. When parallelizing loops with OpenMP, benefits from compiler flags may be lost, as has been shown. In addition to the optimizing flags, a memory alignment of 128 bytes and specific hints for the compiler like #pragma vector aligned provide the best performance in this case, significantly outperforming previous x86 processors.

Acknowledgement

This work could not have been done without the help of the HPC group of the Regionales Rechenzentrum Erlangen and the access it granted to several computer systems.
References

1. J. D. McCalpin, Memory bandwidth and machine balance in current high performance computers, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, (1995).
2. G. Juckeland, S. Börner, M. Kluge, S. Kölling, W. E. Nagel, S. Pflüger, H. Röding, S. Seidl, T. William, and R. Wloch, BenchIT - Performance Measurement and Comparison for Scientific Applications, Proc. ParCo2003, (2004). paper.pdf
3. R. Schöne, G. Juckeland, W. E. Nagel, S. Pflüger, and R. Wloch, Performance comparison and optimization: case studies using BenchIT, Proc. ParCo2005, G. Joubert et al., eds., (2006).
4. Intel Corporation, Intel Xeon Processor Website.
August 26 Performance Analysis Performance of the AMD Opteron LS21 for IBM BladeCenter Douglas M. Pase and Matthew A. Eckl IBM Systems and Technology Group Page 2 Abstract In this paper we examine the
More informationTools and techniques for optimization and debugging. Andrew Emerson, Fabio Affinito November 2017
Tools and techniques for optimization and debugging Andrew Emerson, Fabio Affinito November 2017 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object,
More informationCMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 03, SPRING 2013
CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 03, SPRING 2013 TOPICS TODAY Moore s Law Evolution of Intel CPUs IA-32 Basic Execution Environment IA-32 General Purpose Registers
More informationTechnical Specifications and Hardware Requirements
Technical Specifications and Hardware Requirements Insight Legal Software Ltd. Westmead House, Westmead, Farnborough, Hampshire, GU14 7LP 01252 518939 info@insightlegal.co.uk www.insightlegal.co.uk VAT
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationMinimum Hardware and OS Specifications
Hardware and OS Specifications File Stream Document Management Software System Requirements for v4.5 NB: please read through carefully, as it contains 4 separate specifications for a Workstation PC, a
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationCharacterization of Native Signal Processing Extensions
Characterization of Native Signal Processing Extensions Jason Law Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78712 jlaw@mail.utexas.edu Abstract Soon if
More informationSix-Core AMD Opteron Processor
What s you should know about the Six-Core AMD Opteron Processor (Codenamed Istanbul ) Six-Core AMD Opteron Processor Versatility Six-Core Opteron processors offer an optimal mix of performance, energy
More informationTHE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June COMP3320/6464/HONS High Performance Scientific Computing
THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2014 COMP3320/6464/HONS High Performance Scientific Computing Study Period: 15 minutes Time Allowed: 3 hours Permitted Materials: Non-Programmable
More informationx86 Architectures; Assembly Language Basics of Assembly language for the x86 and x86_64 architectures
x86 Architectures; Assembly Language Basics of Assembly language for the x86 and x86_64 architectures topics Preliminary material a look at what Assembly Language works with - How processors work»a moment
More informationOur new HPC-Cluster An overview
Our new HPC-Cluster An overview Christian Hagen Universität Regensburg Regensburg, 15.05.2009 Outline 1 Layout 2 Hardware 3 Software 4 Getting an account 5 Compiling 6 Queueing system 7 Parallelization
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationApproaches to Performance Evaluation On Shared Memory and Cluster Architectures
Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Peter Strazdins (and the CC-NUMA Team), CC-NUMA Project, Department of Computer Science, The Australian National University
More informationISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH
ISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH Heike Jagode, Shirley Moore, Dan Terpstra, Jack Dongarra The University of Tennessee, USA [jagode shirley terpstra
More informationPerformance of Variant Memory Configurations for Cray XT Systems
Performance of Variant Memory Configurations for Cray XT Systems Wayne Joubert, Oak Ridge National Laboratory ABSTRACT: In late 29 NICS will upgrade its 832 socket Cray XT from Barcelona (4 cores/socket)
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationParallel Computer Architecture - Basics -
Parallel Computer Architecture - Basics - Christian Terboven 19.03.2012 / Aachen, Germany Stand: 15.03.2012 Version 2.3 Rechen- und Kommunikationszentrum (RZ) Agenda Processor
More informationOptimizing Digital Audio Cross-Point Matrix for Desktop Processors Using Parallel Processing and SIMD Technology
Optimizing Digital Audio Cross-Point Matrix for Desktop Processors Using Parallel Processing and SIMD Technology JIRI SCHIMME Department of Telecommunications EEC Brno University of Technology Purkynova
More informationLoad Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes
Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes Agnieszka Debudaj-Grabysz 1 and Rolf Rabenseifner 2 1 Silesian University of Technology, Gliwice, Poland; 2 High-Performance Computing
More informationParallel Processing. Parallel Processing. 4 Optimization Techniques WS 2018/19
Parallel Processing WS 2018/19 Universität Siegen rolanda.dwismuellera@duni-siegena.de Tel.: 0271/740-4050, Büro: H-B 8404 Stand: September 7, 2018 Betriebssysteme / verteilte Systeme Parallel Processing
More informationHow to write powerful parallel Applications
How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome
More informationIntroduction to tuning on many core platforms. Gilles Gouaillardet RIST
Introduction to tuning on many core platforms Gilles Gouaillardet RIST gilles@rist.or.jp Agenda Why do we need many core platforms? Single-thread optimization Parallelization Conclusions Why do we need
More informationThe Optimal CPU and Interconnect for an HPC Cluster
5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationAdvanced OpenMP Features
Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Vectorization 2 Vectorization SIMD =
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationVisualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference London, 2017 Agenda Vectorization is becoming more and more important What is
More informationPerformance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process
Performance evaluation It s an everyday process CS/COE0447: Computer Organization and Assembly Language Chapter 4 Sangyeun Cho Dept. of Computer Science When you buy food Same quantity, then you look at
More information