Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System

1 Center for Information Services and High Performance Computing (ZIH). Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System. Parallel Architectures and Compiler Technologies, Raleigh, September 16th 2009. Daniel Molka, Daniel Hackenberg, Robert Schöne, Matthias S. Müller

2 Outline. Motivation; Benchmark Design; Implementation; Intel Xeon X5570 (Nehalem-EP); Results (Latency, Bandwidth); Summary; Recent Developments and Future Work.

3 Motivation. [Diagram: two Nehalem quad-core processors (cores 0-3 and cores 4-7); each core has private L1 and L2 caches; each processor has a shared Level 3 cache, an integrated memory controller (IMC, 3 channels) with DDR3 modules A-C and D-F respectively, and QPI links; an I/O hub is attached.] Growing complexity of the memory subsystem: shared resources, NUMA systems. Not covered by existing latency and bandwidth benchmarks. More sophisticated benchmarks are required to understand the behavior of parallel applications.

5 Benchmark Design. [Diagrams: core 0 reads a cache line X that resides in another core's cache in Exclusive (E) or Modified (M) state, in an L3 cache, or in memory.] Data placement in arbitrary locations: access other cores' caches; access certain cache levels. Coherency state control: enforce certain coherency states; access each cache line only once during the measurement.

6 Implementation. Data placement: to access data of other cores, one thread is pinned to each core and each thread loads data into the caches of its core; to access certain cache levels, caches are optionally flushed. Coherency state control: Modified: write the data (invalidates other copies). Exclusive: enforce the Modified state, flush the caches (clflush), then read the data. Shared: enforce the Exclusive state, then read from another core.
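The three state-control recipes can be traced with a toy model. The Python sketch below is not the benchmark's code (which runs pinned threads and uses clflush); it only tracks the MESI state of a single cache line per core under simplified, assumed transition rules. The names Line and place are made up for illustration.

```python
class Line:
    """Toy MESI tracker for one cache line across several cores (illustrative only)."""

    def __init__(self, cores):
        # Every core starts without a copy ('I' = Invalid).
        self.state = {c: "I" for c in range(cores)}

    def write(self, core):
        # A write invalidates all other copies and leaves the writer in Modified.
        for c in self.state:
            self.state[c] = "I"
        self.state[core] = "M"

    def flush(self):
        # Stand-in for clflush: the line leaves all caches (dirty data is written back).
        for c in self.state:
            self.state[c] = "I"

    def read(self, core):
        if self.state[core] != "I":
            return  # read hit, state unchanged
        holders = [c for c, s in self.state.items() if s != "I"]
        if not holders:
            self.state[core] = "E"  # no other copy exists: Exclusive
        else:
            # Existing copies are downgraded to Shared (assumed simplification).
            for c in holders:
                self.state[c] = "S"
            self.state[core] = "S"


def place(target, cores=2, helper=1):
    """Bring the line into `target` state ('M', 'E' or 'S') on core 0."""
    line = Line(cores)
    line.write(0)              # Modified: write the data
    if target in ("E", "S"):
        line.flush()           # Exclusive: Modified + flush ...
        line.read(0)           # ... + read the data
    if target == "S":
        line.read(helper)      # Shared: Exclusive + read from another core
    return line.state
```

Calling place("S") walks through all three recipes in order, which mirrors how the Shared recipe builds on the Exclusive one and the Exclusive one on the Modified one.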

7 Implementation. Time Stamp Counter (rdtsc instruction): precise measurement of short durations; required to measure without cache line reuse. Assembler implementation of critical parts: measurement routines (including timestamps); synchronization of concurrently running threads. Memory allocation: NUMA aware; hugetlbfs support. BenchIT: framework to develop and run microbenchmarks.

9 Data Placement: Cache Level. [Plot: latency for data located in L1, L2, L3 and RAM.] Latency without cache flushes: mixture of effects from different cache levels. Latency with cache flushes: all cache levels and the memory latency are clearly visible.

11 Data Placement: Other Cores' Caches. [Plot regions: L1, L2, L3, RAM.] Data in local caches: performance for data that is not used by other cores. Data in other cores' caches: analyze the cache coherency protocol.

13 Coherency State Control. [Plot regions: L1, L2, L3, RAM.] Modified cache lines are transferred from the other core. Shared cache lines are transferred from the inclusive L3.

14 Benchmarks. Latency: pointer chasing; one thread loads data into its cache; the thread on core 0 performs the measurement on that data. Bandwidth between cores: consecutive reads or writes; one thread loads the data, core 0 measures the bandwidth. Bandwidth of concurrent accesses: all threads load their data into a certain cache level; the threads access the data concurrently; the earliest start timestamp and the latest stop timestamp are used to calculate the bandwidth.
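Two of these ideas can be sketched in a few lines. The Python below is an illustration only: Python cannot resolve cache latencies, so it shows just the construction of a pointer-chasing chain that touches every cache line exactly once, and the bandwidth formula based on the earliest start and latest stop timestamps. The function names are assumptions, not BenchIT identifiers.

```python
import random


def build_chase_chain(num_lines, seed=0):
    """Return next_index[] forming one random cycle over all cache lines.

    Following next_index from any start visits every line exactly once
    before returning to the start, so no line is reused within one pass.
    """
    order = list(range(num_lines))
    random.Random(seed).shuffle(order)
    next_index = [0] * num_lines
    for pos, line in enumerate(order):
        next_index[line] = order[(pos + 1) % num_lines]
    return next_index


def chase(next_index, start=0):
    """Walk the chain once; each step depends on the result of the previous one."""
    visited = 0
    i = start
    while True:
        i = next_index[i]
        visited += 1
        if i == start:
            return visited


def aggregate_bandwidth(bytes_per_thread, starts, stops):
    """Total bytes of all threads divided by (latest stop - earliest start)."""
    elapsed = max(stops) - min(starts)
    return sum(bytes_per_thread) / elapsed
```

Using the widest time window makes the aggregate figure conservative: a thread that starts late or finishes early cannot inflate the result.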

15 Test System Overview. Dual-socket Intel Xeon X5570, [...] GHz (Turbo Boost disabled). Quad-core (SMT disabled). 32 KiB L1I, 32 KiB L1D; 256 KiB L2; 8 MiB shared L3, inclusive of L1/L2, [...] GHz. 6x 2 GiB DDR3, [...] GB/s per socket. QuickPath Interconnect (QPI): 25.6 GB/s (12.8 GB/s per direction). [Diagram: the two-socket layout from the Motivation slide: two Nehalem quad-core processors with per-core L1/L2 caches, a shared Level 3 cache, an IMC (3 channels) with DDR3 modules A-C and D-F, QPI links and the I/O hub.]

16 Core Valid Bits. The L3 keeps track of which cores have a copy of a line; this is used to reduce core snoops. 1 bit set: the line is Exclusive or Modified, and the L3 copy might be outdated. 2 or more bits set: the line is Shared, and the L3 copy is valid. All bits 0: the L3 has the only copy.

17 Core Valid Bits: Silent Evictions. Silent eviction of unmodified cache lines (Exclusive line evicted from a core): a write back is not required, and the core valid bit remains unchanged. Explicit write back of modified data (Modified line evicted from a core): the L3 copy needs to be updated, which also resets the core valid bit.
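The snoop-filter rule and the effect of silent evictions can be expressed as a small model. This Python sketch follows the rules stated on the two slides; the class name L3Line and the treatment of a requester whose own bit is the only one set (no snoop) are assumptions for illustration, not documented hardware behavior.

```python
class L3Line:
    """Core valid bits of one line in the inclusive L3 (toy model)."""

    def __init__(self):
        self.valid_bits = set()  # ids of cores that may hold a copy

    def fill(self, core):
        # A core fetches the line: its core valid bit is set.
        self.valid_bits.add(core)

    def evict(self, core, modified):
        if modified:
            # Explicit write back updates the L3 copy and resets the bit.
            self.valid_bits.discard(core)
        # Silent eviction of an unmodified line: the bit stays set.

    def snoop_targets(self, requester):
        """Cores that must be snooped before the L3 can answer a read."""
        others = self.valid_bits - {requester}
        if len(self.valid_bits) == 1 and others:
            # Exactly one other core may hold a newer copy: snoop it.
            return others
        # No bit set (L3 has the only copy) or several bits set (Shared,
        # L3 copy valid): the L3 answers without snooping.
        return set()
```

In this model a line that was silently evicted still triggers a snoop of the former owner, while a written-back line does not, which is one plausible reading of why the two eviction paths are distinguished on the slide.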

19 Latency Results: Exclusive and Modified. Exclusive cache lines: L1: 4 cycles, L2: 10 cycles; L3: 38 cycles (13 ns); on-die transfer: 22 ns; remote access: 65 ns. Modified cache lines: identical for local access; on-die transfer: L1: 28 ns, L2: 26 ns, L3: 13 ns; remote access: >100 ns (write backs to memory).

22 Latency Results: Shared and Main Memory. Shared cache lines: the on-die transfer is faster than for Exclusive lines and equal to the local L3 latency. Shared lines are silently evicted as well: core valid bits set, cores not snooped. Main memory: local: 65 ns, remote: 106 ns; 41 ns for the access via QPI.
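The latency figures quoted on the slides can be cross-checked with simple arithmetic. The Python sketch below uses only numbers from the slides (38 cycles and 13 ns for L3, 65 ns local and 106 ns remote memory latency); the derived clock is approximate because the 13 ns value is rounded, and the function names are illustrative.

```python
def implied_clock_ghz(cycles, nanoseconds):
    """Clock frequency implied by a latency given in both cycles and ns."""
    return cycles / nanoseconds


def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in core cycles to nanoseconds."""
    return cycles / clock_ghz


clock = implied_clock_ghz(38, 13)    # roughly 2.9 GHz
l1_ns = cycles_to_ns(4, clock)       # L1 latency of 4 cycles in ns
l2_ns = cycles_to_ns(10, clock)      # L2 latency of 10 cycles in ns

local_ns, remote_ns = 65, 106
qpi_penalty_ns = remote_ns - local_ns    # extra cost of the QPI hop
numa_factor = remote_ns / local_ns       # remote vs. local memory latency
```

The difference of 41 ns matches the value the slide attributes to the access via QPI.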

25 Bandwidth of Transfers Between Cores (and Processors). Exclusive cache lines: L1: 45.6 GB/s, L2: 31.1 GB/s, L3: 26.2 GB/s; on-die transfer: 19.7 GB/s; remote: 9.2 GB/s (limited by QPI). Modified cache lines: faster on-die transfer from L3; rather slow from other cores; remote: 5.6 GB/s (write backs). Main memory: local: 10.1 GB/s; remote: 6.3 GB/s (below the QPI limit).

27 Bandwidth of Concurrent Accesses. Read bandwidth (exclusive data): L1/L2 scale well; L3 limit at 85 GB/s per socket; main memory: max. 23 GB/s per socket, 72% of the theoretical peak. Write bandwidth (modified data): L1/L2 scale well; L3 limit at 26 GB/s per socket; main memory: max. 12 GB/s per socket (write allocate).
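The memory figures on this slide are related by simple arithmetic. The Python sketch below derives the theoretical peak implied by "23 GB/s is 72% of peak" and estimates the memory traffic behind the write result, assuming that write allocate makes every written cache line cross the memory interface twice (one read to allocate it, one write back). That factor of two is the usual textbook model, not a number taken from the slides.

```python
def implied_peak_gbs(measured_gbs, fraction_of_peak):
    """Theoretical peak implied by a measured value and its share of the peak."""
    return measured_gbs / fraction_of_peak


def write_allocate_traffic_gbs(write_gbs):
    """Memory traffic if each written line is first read, then written back."""
    return 2 * write_gbs


peak = implied_peak_gbs(23, 0.72)            # just under 32 GB/s per socket
traffic = write_allocate_traffic_gbs(12)     # about 24 GB/s of actual traffic
```

Under this assumption the 12 GB/s write result corresponds to roughly the same memory traffic as the 23 GB/s read result.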

29 Bandwidth of Concurrent Accesses: Coherency Overhead. Coherency state control: Exclusive: cache lines are silently evicted. Modified: write back of higher-level caches required.

30 Summary. Benchmarks: unveil important undocumented performance data; measure properties that are not covered by existing benchmarks; data placement: analyze the performance of communication between cores; coherency state control: analyze the coherency protocol implementation. Nehalem performance: the inclusive L3 cache handles all coherency issues between cores on a die; core valid bits filter most unnecessary snoops; limited L3 write bandwidth.

31 Recent Developments and Future Work. Recent developments: experimental support for the Owned and Forward states; performance counter support (PAPI); experimental uncore performance counter support (perfmon2). Future work: support other architectures; analyze larger shared-memory HPC systems; measure the impact of AMD's HT Assist.

32 Thanks for your Attention. Benchmarks and BenchIT Framework available as open source. BenchIT available at [...]. Find x86 benchmarks at [...].

Detecting Memory-Boundedness with Hardware Performance Counters

Detecting Memory-Boundedness with Hardware Performance Counters Center for Information Services and High Performance Computing (ZIH) Detecting ory-boundedness with Hardware Performance Counters ICPE, Apr 24th 2017 (daniel.molka@tu-dresden.de) Robert Schöne (robert.schoene@tu-dresden.de)

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Quantifying power consumption variations of HPC systems using SPEC MPI benchmarks

Quantifying power consumption variations of HPC systems using SPEC MPI benchmarks Center for Information Services and High Performance Computing (ZIH) Quantifying power consumption variations of HPC systems using SPEC MPI benchmarks EnA-HPC, Sept 16 th 2010, Robert Schöne, Daniel Molka,

More information

Potentials and Limitations for Energy Efficiency Auto-Tuning

Potentials and Limitations for Energy Efficiency Auto-Tuning Center for Information Services and High Performance Computing (ZIH) Potentials and Limitations for Energy Efficiency Auto-Tuning Parco Symposium Application Autotuning for HPC (Architectures) Robert Schöne

More information

Modelling and Evaluating Performance of Atomic Operations

Modelling and Evaluating Performance of Atomic Operations Modelling and Evaluating Performance of Atomic Operations Hermann Schweizer hermannschweizer@hotmail.com January 11 215 Advisors: Prof. Dr. T. Hoefler, Maciej Besta Department of Computer Science, ETH

More information

Intel Architecture for HPC

Intel Architecture for HPC Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter

More information

Intel Core i7 Processor

Intel Core i7 Processor Intel Core i7 Processor Vishwas Raja 1, Mr. Danish Ather 2 BSc (Hons.) C.S., CCSIT, TMU, Moradabad 1 Assistant Professor, CCSIT, TMU, Moradabad 2 1 vishwasraja007@gmail.com 2 danishather@gmail.com Abstract--The

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Evaluating the Cost of Atomic Operations on Modern Architectures

Evaluating the Cost of Atomic Operations on Modern Architectures Evaluating the Cost of Atomic Operations on Modern Architectures Hermann Schweizer Dept. of Computer Science ETH Zurich hermannschweizer@hotmail.com Maciej Besta Dept. of Computer Science ETH Zurich maciej.besta@inf.ethz.ch

More information

Hardware-Efficient Parallelized Optimization with COMSOL Multiphysics and MATLAB

Hardware-Efficient Parallelized Optimization with COMSOL Multiphysics and MATLAB Hardware-Efficient Parallelized Optimization with COMSOL Multiphysics and MATLAB Frommelt Thomas* and Gutser Raphael SGL Carbon GmbH *Corresponding author: Werner-von-Siemens Straße 18, 86405 Meitingen,

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand

Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand Matthew Koop, Wei Huang, Ahbinav Vishnu, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

NUMA-aware Multicore Matrix Multiplication

NUMA-aware Multicore Matrix Multiplication Parallel Processing Letters c World Scientific Publishing Company NUMA-aware Multicore Matrix Multiplication WAIL Y. ALKOWAILEET Department of Computer Science (Systems), University of California, Irvine,

More information

Outline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work

Outline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009 Outline 1 Motivation 2 Theory of a non-blocking benchmark 3

More information

Intel Architecture for Software Developers

Intel Architecture for Software Developers Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software

More information

NUMA replicated pagecache for Linux

NUMA replicated pagecache for Linux NUMA replicated pagecache for Linux Nick Piggin SuSE Labs January 27, 2008 0-0 Talk outline I will cover the following areas: Give some NUMA background information Introduce some of Linux s NUMA optimisations

More information

High-Performance Key-Value Store on OpenSHMEM

High-Performance Key-Value Store on OpenSHMEM High-Performance Key-Value Store on OpenSHMEM Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline Background

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Effect of memory latency

Effect of memory latency CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable

More information

Map3D V58 - Multi-Processor Version

Map3D V58 - Multi-Processor Version Map3D V58 - Multi-Processor Version Announcing the multi-processor version of Map3D. How fast would you like to go? 2x, 4x, 6x? - it's now up to you. In order to achieve these performance gains it is necessary

More information

Philippe Thierry Sr Staff Engineer Intel Corp.

Philippe Thierry Sr Staff Engineer Intel Corp. HPC@Intel Philippe Thierry Sr Staff Engineer Intel Corp. IBM, April 8, 2009 1 Agenda CPU update: roadmap, micro-μ and performance Solid State Disk Impact What s next Q & A Tick Tock Model Perenity market

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

NUMA-aware OpenMP Programming

NUMA-aware OpenMP Programming NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC

More information

Modern CPU Architectures

Modern CPU Architectures Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes

More information

PARALLEL MEMORY ARCHITECTURE

PARALLEL MEMORY ARCHITECTURE PARALLEL MEMORY ARCHITECTURE Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 6 is due tonight n The last

More information

Binding Nested OpenMP Programs on Hierarchical Memory Architectures

Binding Nested OpenMP Programs on Hierarchical Memory Architectures Binding Nested OpenMP Programs on Hierarchical Memory Architectures Dirk Schmidl, Christian Terboven, Dieter an Mey, and Martin Bücker {schmidl, terboven, anmey}@rz.rwth-aachen.de buecker@sc.rwth-aachen.de

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Performance of the AMD Opteron LS21 for IBM BladeCenter

Performance of the AMD Opteron LS21 for IBM BladeCenter August 26 Performance Analysis Performance of the AMD Opteron LS21 for IBM BladeCenter Douglas M. Pase and Matthew A. Eckl IBM Systems and Technology Group Page 2 Abstract In this paper we examine the

More information

Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access

Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access S. Moreaud, B. Goglin and R. Namyst INRIA Runtime team-project University of Bordeaux, France Context Multicore architectures everywhere

More information

FinisTerrae: Memory Hierarchy and Mapping

FinisTerrae: Memory Hierarchy and Mapping galicia supercomputing center Applications & Projects Department FinisTerrae: Memory Hierarchy and Mapping Technical Report CESGA-2010-001 Juan Carlos Pichel Tuesday 12 th January, 2010 Contents Contents

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

Performance Impact of Resource Contention in Multicore Systems

Performance Impact of Resource Contention in Multicore Systems Performance Impact of Resource Contention in Multicore Systems R. Hood, H. Jin, P. Mehrotra, J. Chang, J. Djomehri, S. Gavali, D. Jespersen, K. Taylor, R. Biswas Commodity Multicore Chips in NASA HEC 2004:

More information

Server Sizing Joe Chang qdpma.com Jchang6 at yahoo

Server Sizing Joe Chang qdpma.com Jchang6 at yahoo Server Sizing 2018 Joe Chang qdpma.com Jchang6 at yahoo About Joe SQL Server consultant since 1999 Query Optimizer execution plan cost formulas (2002) True cost structure of SQL plan operations (2003?)

More information

28x 29x 30x [ 24x] 3.20GHz ( 133x24) CPU Clock Ratio CPU Frequency. CPU Host Clock Control [ Enable] CPU Host Frequency ( MHz ) 133

28x 29x 30x [ 24x] 3.20GHz ( 133x24) CPU Clock Ratio CPU Frequency. CPU Host Clock Control [ Enable] CPU Host Frequency ( MHz ) 133 Intel Core i7 is a brand new architecture featuring the QPI bus which replaces the FSB bus. So, how does this affect overclocking? The Core i7 processor s frequency is Bclk * CPU multiplier. For ex. Intel

More information

Analyzing Cache Bandwidth on the Intel Core 2 Architecture

Analyzing Cache Bandwidth on the Intel Core 2 Architecture John von Neumann Institute for Computing Analyzing Cache Bandwidth on the Intel Core 2 Architecture Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger published in Parallel Computing: Architectures, Algorithms

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Master s Thesis Nr. 10

Master s Thesis Nr. 10 Master s Thesis Nr. 10 Systems Group, Department of Computer Science, ETH Zurich Performance isolation on multicore hardware by Kaveh Razavi Supervised by Akhilesh Singhania and Timothy Roscoe November

More information

Lecture 19. Optimizing for the memory hierarchy NUMA

Lecture 19. Optimizing for the memory hierarchy NUMA Lecture 19 Optimizing for the memory hierarchy NUMA A Performance Puzzle When we fuse the ODE and PDE loops in A3, we slow down the code But, we actually observe fewer memory reads and cache misses! What

More information

The MPCAM Based Multi-core Processor Architecture: A Contention Free Architecture

The MPCAM Based Multi-core Processor Architecture: A Contention Free Architecture The MPCAM Based Multi-core Processor Architecture: A Contention Free Architecture ALLAM ABUMWAIS, Department of Computer Engineering, Near East University, LEFKOSA, CYPRUS E-mail: Allam.Abumwais@aauj.edu

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

Analysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth

Analysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared

More information

KeyStone II. CorePac Overview

KeyStone II. CorePac Overview KeyStone II ARM Cortex A15 CorePac Overview ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Intel Compiler. Advanced Technical Skills (ATS) North America. IBM High Performance Computing February 2010 Y. Joanna Wong, Ph.D.

Intel Compiler. Advanced Technical Skills (ATS) North America. IBM High Performance Computing February 2010 Y. Joanna Wong, Ph.D. Intel Compiler IBM High Performance Computing February 2010 Y. Joanna Wong, Ph.D. yjw@us.ibm.com 2/22/2010 Nehalem-EP CPU Summary Performance/Features: 4 cores 8M on-chip Shared Cache Simultaneous Multi-

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

Intel QuickPath Interconnect Architectural Features Supporting Scalable System Architectures

Intel QuickPath Interconnect Architectural Features Supporting Scalable System Architectures 2010 18th IEEE Symposium on High Performance Interconnects Intel QuickPath Interconnect Architectural Features Supporting Scalable System Architectures Dimitrios Ziakas, Allen Baum, Robert A. Maddox, Robert

More information

Cache Coherence and Atomic Operations in Hardware

Cache Coherence and Atomic Operations in Hardware Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some

More information

ARMageddon: Cache Attacks on Mobile Devices

ARMageddon: Cache Attacks on Mobile Devices ARMageddon: Cache Attacks on Mobile Devices Moritz Lipp, Daniel Gruss, Raphael Spreitzer, Clémentine Maurice, Stefan Mangard Graz University of Technology 1 TLDR powerful cache attacks (like Flush+Reload)

More information

DIAMOND RINGS ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS

DIAMOND RINGS ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS th August DIAMOND RINGS ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS Stefan Nürnberger, Randolf Rotta, Gabor Drescher, Daniel Danner, Jörg Nolte ACKNOWLEDGED EVENT PROPAGATION What does it do?

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

BlackjackBench: Portable Hardware Characterization

BlackjackBench: Portable Hardware Characterization BlackjackBench: Portable Hardware Characterization Anthony Danalis University of Tennessee Knoxville, TN, USA adanalis@eecs.utk.edu Jeffrey S. Vetter Oak Ridge National Lab. Oak Ridge, TN, USA vetter@ornl.gov

More information

Lect. 2: Types of Parallelism

Lect. 2: Types of Parallelism Lect. 2: Types of Parallelism Parallelism in Hardware (Uniprocessor) Parallelism in a Uniprocessor Pipelining Superscalar, VLIW etc. SIMD instructions, Vector processors, GPUs Multiprocessor Symmetric

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

CSC501 Operating Systems Principles. OS Structure

CSC501 Operating Systems Principles. OS Structure CSC501 Operating Systems Principles OS Structure 1 Announcements q TA s office hour has changed Q Thursday 1:30pm 3:00pm, MRC-409C Q Or email: awang@ncsu.edu q From department: No audit allowed 2 Last

More information

MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces

MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces Hye-Churn Jang Hyun-Wook (Jin) Jin Department of Computer Science and Engineering Konkuk University Seoul, Korea {comfact,

More information

EPYC VIDEO CUG 2018 MAY 2018

EPYC VIDEO CUG 2018 MAY 2018 AMD UPDATE CUG 2018 EPYC VIDEO CRAY AND AMD PAST SUCCESS IN HPC AMD IN TOP500 LIST 2002 TO 2011 2011 - AMD IN FASTEST MACHINES IN 11 COUNTRIES ZEN A FRESH APPROACH Designed from the Ground up for Optimal

More information

Impact of Dell FlexMem Bridge on Microsoft SQL Server Database Performance

Impact of Dell FlexMem Bridge on Microsoft SQL Server Database Performance Impact of Dell FlexMem Bridge on Microsoft SQL Server Database Performance A Dell Technical White Paper Dell Database Solutions Engineering Jisha J Leena Basanthi October 2010 THIS WHITE PAPER IS FOR INFORMATIONAL

More information

How to Optimize the Scalability & Performance of a Multi-Core Operating System. Architecting a Scalable Real-Time Application on an SMP Platform

How to Optimize the Scalability & Performance of a Multi-Core Operating System. Architecting a Scalable Real-Time Application on an SMP Platform How to Optimize the Scalability & Performance of a Multi-Core Operating System Architecting a Scalable Real-Time Application on an SMP Platform Overview W hen upgrading your hardware platform to a newer

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

Modelling Communications in Cache Coherent Systems. Sabela Ramos Garea Torsten Hoefler

Modelling Communications in Cache Coherent Systems. Sabela Ramos Garea Torsten Hoefler Modelling Communications in Cache Coherent Systems Sabela Ramos Garea Torsten Hoefler March 8, 2013 Contents A Communication Model for Cache-Coherent Systems 2 1 Development of Simple Cache Models 4 1.1

More information

Accessing Data on SGI Altix: An Experience with Reality

Accessing Data on SGI Altix: An Experience with Reality Accessing Data on SGI Altix: An Experience with Reality Guido Juckeland, Matthias S. Müller, Wolfgang E. Nagel, Stefan Pflüger Technische Universität Dresden Center for Information Services and High Performance

More information

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory

COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem

Cache coherence Programmers have no control over caches and when they get updated. x = 2; /* initially */ y0 eventually ends up = 2 y1 eventually

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

overview of shared address space systems; example: cache hierarchy of the Intel Core i7; cache coherency protocols: basic ideas, invalidate and

Overview: Shared Memory Hardware

overview of shared address space systems; example: cache hierarchy of the Intel Core i7; cache coherency protocols: basic ideas, invalidate and update protocols; false sharing

How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture

Dirk Schmidl, Christian Terboven, Andreas Wolf, Dieter an Mey, Christian Bischof IEEE Cluster 2010 / Heraklion September 21, 2010

Copyright 2017 Intel Corporation

Agenda Intel Xeon Scalable Platform Overview Architectural Enhancements 2 Platform Overview 3x16 PCIe* Gen3 2 or 3 Intel UPI 3x16 PCIe Gen3 Capabilities Details 10GbE Skylake-SP CPU OPA DMI Intel C620

Multiprocessor Cache Coherency. What is Cache Coherence?

CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar K. Panda Department of Computer Science & Engineering,

Introduction to parallel computers and parallel programming

Content A quick overview of modern parallel hardware Parallelism within a chip

Measuring Microarchitectural Details of Multi- and Many-Core Memory Systems through Microbenchmarking

ZHENMAN FANG, University of California Los Angeles SANYAM MEHTA, PEN-CHUNG YEW and ANTONIA ZHAI,

Shared Symmetric Memory Systems

Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

Chap. 4 Multiprocessors and Thread-Level Parallelism

Uniprocessor performance: Performance (vs. VAX-11/780). From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

Chapter 5. Multiprocessors and Thread-Level Parallelism

Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

Designing Next-Generation Data-Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen

Outline Architecture of Web-based Data Center Three-Stage framework to benefit

CIT 668: System Architecture. Computer Systems Architecture

Topics: 1. System Components 2. Bandwidth and Latency 3. Processor 4. Memory 5. Storage 6. Network 7. Operating System 8. Performance Implications

Performance Tools for Technical Computing

Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology

Application Performance on Dual Processor Cluster Nodes

by Kent Milfeld milfeld@tacc.utexas.edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys

Introducing OTF / Vampir / VampirTrace

Center for Information Services and High Performance Computing (ZIH) Zellescher Weg 12 Willers-Bau A115 Tel. +49 351-463 - 34049 (Robert.Henschel@zih.tu-dresden.de)

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand

Miao Luo, Hao Wang, & D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State

IFS RAPS14 benchmark on 2nd generation Intel Xeon Phi processor

D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24th 2016, Reading, UK Legal Disclaimer & Optimization

Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis

From theory to practice Theoretical (design) Practical (design) Practical (implementation) 2 From theory to

Performance of Variant Memory Configurations for Cray XT Systems

presented by Wayne Joubert Motivation Design trends are leading to non-power of 2 core counts for multicore processors, due to layout constraints

Intel QuickPath Interconnect Electrical Architecture Overview

Chapter 1 The art of progress is to preserve order amid change and to preserve change amid order Alfred North Whitehead The goal of this chapter

NUMA effects on multicore, multi socket systems

Iakovos Panourgias 9/9/2011 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2011 Abstract Modern multicore/multi socket

Introduction to cache memories

Course on: Advanced Computer Architectures Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Summary Main goal Spatial and temporal

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

mainakc@cse.iitk.ac.in 1 Setting Agenda Software: shared address space Hardware: shared memory multiprocessors Cache

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

Material from: Parallel Computer Organization and Design by Debois,

Processor Architecture

Shared Memory Multiprocessors M. Schölzel The Coherence Problem Caches may contain local copies of the same memory address; without proper coordination they work independently on their

Computer Architecture Memory hierarchies and caches

S Coudert and R Pacalet January 23, 2019 Outline Introduction Localities principles Direct-mapped caches Increasing block size Set-associative caches

Computer Architecture

18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

Parallel Computing. Prof. Marco Bertini

Modern CPUs Historical trends in CPU performance From Data processing in exascale class computer systems, C. Moore http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf