Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System

1 Center for Information Services and High Performance Computing (ZIH)
Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
Parallel Architectures and Compiler Technologies, Raleigh, September 16th 2009
Daniel Molka, Daniel Hackenberg, Robert Schöne, Matthias S. Müller

2 Outline
- Motivation
- Benchmark Design
- Implementation
- Intel Xeon X5570 (Nehalem-EP)
- Results: Latency, Bandwidth
- Summary
- Recent Developments and Future Work

3 Motivation
[Diagram: two Nehalem quad-core processors (cores 0-3 and cores 4-7); each core has private L1 and L2 caches; each processor has a shared Level 3 cache, an IMC (3 channels) with DDR3 channels A-C and D-F respectively, and QPI links; an I/O hub is attached]
- Growing complexity of memory subsystem
  - Shared resources
  - NUMA systems
- Not covered by existing latency and bandwidth benchmarks
- More sophisticated benchmarks required to understand behavior of parallel applications

4 Benchmark Design
[Diagram: core 0 reads cache line X; X may be placed in a core's cache, in an L3 cache, or in memory]
- Data placement in arbitrary location
  - Access other cores' caches
  - Access certain cache levels

5 Benchmark Design
[Diagram: core 0 reads cache line X; X may be placed in a core's cache, in an L3 cache, or in memory, in Exclusive (E) or Modified (M) state]
- Data placement in arbitrary location
  - Access other cores' caches
  - Access certain cache levels
- Coherency state control
  - Enforce certain coherency states
  - Access each cache line only once during measurement

6 Implementation
- Data placement
  - Access data of other cores
    - One thread pinned to each core
    - Threads load data into caches of corresponding core
  - Access certain cache levels
    - Optional cache flushes
- Coherency state control
  - Modified: write data (invalidates other copies)
  - Exclusive: enforce modified state + flush caches (clflush) + read data
  - Shared: enforce exclusive state + read from another core
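The three state-enforcement recipes on this slide can be traced with a toy model. The sketch below is illustrative only and is not the benchmark's code: it follows textbook MESI transitions for a single cache line, and the names `Line`, `write`, `flush`, `read`, and `enforce` are hypothetical helpers introduced here.

```python
# Toy MESI model for one cache line shared by a few cores (illustration only).
class Line:
    def __init__(self, cores):
        # "I" = invalid: initially no core holds the line.
        self.state = {c: "I" for c in range(cores)}

    def write(self, core):
        # A write invalidates all other copies; the writer ends up Modified.
        for c in self.state:
            self.state[c] = "I"
        self.state[core] = "M"

    def flush(self):
        # clflush-like step: the line is written back and dropped everywhere.
        for c in self.state:
            self.state[c] = "I"

    def read(self, core):
        if self.state[core] != "I":
            return  # read hit: state unchanged
        holders = [c for c, s in self.state.items() if s != "I"]
        if not holders:
            self.state[core] = "E"  # sole copy: Exclusive
        else:
            # Existing copies are downgraded to Shared (a Modified copy
            # would be written back first); the reader gets Shared too.
            for c in holders:
                self.state[c] = "S"
            self.state[core] = "S"


def enforce(target, owner=1, other=2, cores=4):
    """Apply the slide's recipe for target state 'M', 'E', or 'S'."""
    line = Line(cores)
    line.write(owner)              # Modified: write data
    if target in ("E", "S"):
        line.flush()               # flush caches ...
        line.read(owner)           # ... then read: Exclusive
    if target == "S":
        line.read(other)           # read from another core: Shared
    return line.state
```

For example, `enforce("S")` leaves the line Shared in cores 1 and 2 and invalid elsewhere; a measuring thread on core 0 would then read it from that known state.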

7 Implementation
- Time Stamp Counter (rdtsc instruction)
  - Precise measurement of short durations
  - Required to measure without cache line reuse
- Assembler implementation of critical parts
  - Measurement routines (including timestamps)
  - Synchronization of concurrently running threads
- Memory allocation
  - NUMA aware
  - Hugetlbfs support
- BenchIT
  - Framework to develop and run microbenchmarks

8 Data Placement: Cache Level
[Plot: latency, with regions labeled L1, L2, L3, RAM]
- Latency without cache flushes
  - Mixture of effects from different cache levels

9 Data Placement: Cache Level
[Plot: latency, with regions labeled L1, L2, L3, RAM]
- Latency without cache flushes
  - Mixture of effects from different cache levels
- Latency with cache flushes
  - All cache levels and memory latency clearly visible

10 Data Placement: Other Cores' Caches
[Plot: regions labeled L1, L2, L3, RAM]
- Data in local caches
  - Performance for data that is not used by other cores

11 Data Placement: Other Cores' Caches
[Plot: regions labeled L1, L2, L3, RAM]
- Data in local caches
  - Performance for data that is not used by other cores
- Data in other cores' caches
  - Analyze cache coherency protocol

12 Coherency State Control
[Plot: regions labeled L1, L2, L3, RAM]
- Modified cache lines transferred from other core

13 Coherency State Control
[Plot: regions labeled L1, L2, L3, RAM]
- Modified cache lines transferred from other core
- Shared cache lines transferred from inclusive L3

14 Benchmarks
- Latency
  - Pointer chasing
  - One thread loads data in its cache
  - Thread on core 0 performs measurement on that data
- Bandwidth between cores
  - Consecutive read or write
  - One thread loads data, core 0 measures bandwidth
- Bandwidth of concurrent accesses
  - All threads load their data into certain cache level
  - Threads access data concurrently
  - Earliest start timestamp and latest stop timestamp used to calculate bandwidth
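Two ideas from this slide can be sketched in a few lines: a pointer-chasing chain that touches every element exactly once, and the concurrent-bandwidth formula based on the earliest start and latest stop timestamp. This is a minimal illustration, not the benchmark itself (whose measurement routines are written in assembler); `build_chain`, `chase`, and `aggregate_bandwidth` are hypothetical names, and Python list indices stand in for cache-line addresses.

```python
import random


def build_chain(n, seed=0):
    """Return a list nxt where following nxt from slot 0 visits all n slots once."""
    order = list(range(1, n))
    random.Random(seed).shuffle(order)   # random order defeats simple prefetching
    order = [0] + order
    nxt = [0] * n
    for i in range(n):
        # Each slot stores the index of the next slot; the last links back to 0.
        nxt[order[i]] = order[(i + 1) % n]
    return nxt


def chase(nxt, start=0):
    """Walk the chain once around; every load depends on the previous one."""
    i, steps = nxt[start], 1
    while i != start:
        i = nxt[i]
        steps += 1
    return steps


def aggregate_bandwidth(bytes_per_thread, starts, stops):
    """Total bytes of all threads divided by (latest stop - earliest start)."""
    total = bytes_per_thread * len(starts)
    return total / (max(stops) - min(starts))
```

Because each load address comes from the previous load, the time per step approximates the access latency rather than the throughput. Dividing by the span from earliest start to latest stop gives a conservative bandwidth figure when threads do not start at exactly the same time.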

15 Test System Overview
- Dual socket Intel Xeon X5570, [...] GHz (Turbo Boost disabled)
  - Quadcore (SMT disabled)
  - 32 KiB L1I, 32 KiB L1D
  - 256 KiB L2
  - 8 MiB shared L3
    - Inclusive of L1/L2
    - [...] GHz
- 6x 2 GiB DDR3 [...]
  - [...] GB/s per socket
- Quick Path Interconnect (QPI)
  - 25.6 GB/s (12.8 per direction)
[Diagram: two Nehalem quad-core processors (cores 0-3 and cores 4-7); each core has private L1 and L2 caches; each processor has a shared Level 3 cache, an IMC (3 channels) with DDR3 channels A-C and D-F respectively, and QPI links; an I/O hub is attached]

16 Core Valid Bits
[Diagram: core valid bits in the L3 cache for lines in Exclusive (E) and Modified (M) state]
- L3 keeps track of which cores have a copy
  - Used to reduce core snoops
- 1 bit set
  - Line is exclusive or modified
  - L3 copy might be outdated
- 2 bits set
  - Line is shared
  - L3 copy is valid
- All bits 0
  - L3 has the only copy
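The decision table on this slide can be written as a small lookup. The sketch below is a hypothetical model, not Intel's implementation: `classify` is a name introduced here, the bits are packed into an integer with bit c standing for core c, and any count above one is treated as "shared", which is an assumption beyond the slide's "2 bits set".

```python
def classify(core_valid_bits):
    """Interpret the core valid bits of one L3 line.

    Returns (description, cores_to_snoop).
    """
    owners = [c for c in range(core_valid_bits.bit_length())
              if (core_valid_bits >> c) & 1]
    if not owners:
        # No core holds the line: the L3 can answer directly.
        return ("L3 has the only copy", [])
    if len(owners) == 1:
        # The single owner may have modified the line, so it must be snooped.
        return ("exclusive or modified; L3 copy might be outdated", owners)
    # Assumption: two or more bits set means shared, so the L3 copy is valid.
    return ("shared; L3 copy is valid", [])
```

For instance, `classify(0b0100)` reports that core 2 has to be snooped, while `classify(0b0011)` needs no snoop at all.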

17 Core Valid Bits: Silent Evictions
[Diagram: an Exclusive (E) line and a Modified (M) line, each evicted from a core to L3]
- Silent eviction of unmodified cache lines
  - Write back not required
  - Core valid bits remain unchanged
- Explicit write back of modified data
  - L3 copy needs to be updated
  - Also resets core valid bit
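The two eviction cases differ only in whether the L3 learns about the eviction. A minimal sketch under the same hypothetical bit-mask model (bit c stands for core c; `evict` is a name introduced here, not part of any real API):

```python
def evict(state, core_valid_bits, core):
    """Evict one line from a core's private caches.

    Returns (new_core_valid_bits, write_back_needed).
    """
    mask = 1 << core
    if not core_valid_bits & mask:
        raise ValueError("core does not hold the line")
    if state == "M":
        # Modified data must be written back; the L3 updates its copy
        # and resets the core valid bit.
        return core_valid_bits & ~mask, True
    if state in ("E", "S"):
        # Unmodified lines are dropped silently: no write back, and the
        # L3 keeps the (now stale) core valid bit.
        return core_valid_bits, False
    raise ValueError("unknown state: " + state)
```

After a silent eviction the bit stays set, so a line that only the L3 still holds can look as if a core owned it.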

18 Latency Results: Exclusive and Modified
- Exclusive cache lines
  - L1: 4 cycles, L2: 10 cycles
  - L3: 38 cycles (13 ns)
  - On-die transfer: 22 ns
  - Remote access: 65 ns

19 Latency Results: Exclusive and Modified
- Exclusive cache lines
  - L1: 4 cycles, L2: 10 cycles
  - L3: 38 cycles (13 ns)
  - On-die transfer: 22 ns
  - Remote access: 65 ns
- Modified cache lines
  - Identical for local access
  - On-die transfer: L1: 28 ns, L2: 26 ns, L3: 13 ns
  - Remote access: >100 ns (write backs to memory)
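The slide mixes cycles and nanoseconds. A short conversion sketch shows how the two relate; it assumes the X5570's nominal core clock of 2.93 GHz, a value that is not legible in this transcript and is therefore marked as an assumption in the code.

```python
CLOCK_GHZ = 2.93  # assumption: nominal X5570 core clock, Turbo Boost disabled


def cycles_to_ns(cycles, ghz=CLOCK_GHZ):
    """One cycle lasts 1/ghz nanoseconds."""
    return cycles / ghz


def ns_to_cycles(ns, ghz=CLOCK_GHZ):
    """Inverse conversion: nanoseconds to core cycles."""
    return ns * ghz
```

Under this assumption, 38 cycles come to roughly 13 ns, which matches the L3 figure on the slide, and the 65 ns remote access corresponds to about 190 core cycles.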

20 Latency Results: Shared and Main Memory
- Shared cache lines
  - On-die transfer
    - Faster than exclusive
    - Equal to local L3 latency

21 Latency Results: Shared and Main Memory
- Shared cache lines
  - On-die transfer
    - Faster than exclusive
    - Equal to local L3 latency
  - Silently evicted as well
    [Diagram: shared line evicted from cores; core valid bits set]
    - Cores not snooped

22 Latency Results: Shared and Main Memory
- Shared cache lines
  - On-die transfer
    - Faster than exclusive
    - Equal to local L3 latency
  - Silently evicted as well
    [Diagram: shared line evicted from cores; core valid bits set]
    - Cores not snooped
- Main memory
  - Local: 65 ns, remote: 106 ns
  - 41 ns for access via QPI

23 Bandwidth of Transfers Between Cores (and Processors)
- Exclusive cache lines
  - L1: 45.6, L2: 31.1, L3: 26.2 GB/s
  - On-die transfer: 19.7 GB/s
  - Remote: 9.2 GB/s (limited by QPI)

24 Bandwidth of Transfers Between Cores (and Processors)
- Exclusive cache lines
  - L1: 45.6, L2: 31.1, L3: 26.2 GB/s
  - On-die transfer: 19.7 GB/s
  - Remote: 9.2 GB/s (limited by QPI)
- Modified cache lines
  - Faster on-die transfer from L3
  - Rather slow from other cores
  - Remote: 5.6 GB/s (write backs)

25 Bandwidth of Transfers Between Cores (and Processors)
- Exclusive cache lines
  - L1: 45.6, L2: 31.1, L3: 26.2 GB/s
  - On-die transfer: 19.7 GB/s
  - Remote: 9.2 GB/s (limited by QPI)
- Modified cache lines
  - Faster on-die transfer from L3
  - Rather slow from other cores
  - Remote: 5.6 GB/s (write backs)
- Main memory
  - Local: 10.1 GB/s
  - Remote: 6.3 GB/s (below QPI limit)
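The remarks "limited by QPI" and "below QPI limit" can be put into numbers by relating each remote figure to the 12.8 GB/s per-direction QPI rate from the test system overview. The sketch assumes that the payload of a read stream travels in one QPI direction only; `qpi_utilization` is a hypothetical helper.

```python
QPI_PER_DIRECTION_GBS = 12.8  # per-direction QPI rate from the test system slide


def qpi_utilization(measured_gbs, link_gbs=QPI_PER_DIRECTION_GBS):
    """Fraction of the per-direction QPI rate used by a remote transfer."""
    return measured_gbs / link_gbs


# Remote figures quoted on the slide, in GB/s.
remote = {"exclusive": 9.2, "modified": 5.6, "main memory": 6.3}
shares = {name: qpi_utilization(gbs) for name, gbs in remote.items()}
```

Under that assumption, remote reads of exclusive lines reach about 72% of the per-direction rate, while remote main memory stays just below half of it.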

26 Bandwidth of Concurrent Accesses
- Read bandwidth (exclusive data)
  - L1/L2 scale well
  - L3 limit at 85 GB/s per socket
  - Main memory
    - Max. 23 GB/s per socket
    - 72% of theoretical peak

27 Bandwidth of Concurrent Accesses
- Read bandwidth (exclusive data)
  - L1/L2 scale well
  - L3 limit at 85 GB/s per socket
  - Main memory
    - Max. 23 GB/s per socket
    - 72% of theoretical peak
- Write bandwidth (modified data)
  - L1/L2 scale well
  - L3 limit at 26 GB/s per socket
  - Main memory
    - Max. 12 GB/s per socket
    - Write allocate
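Two of the main-memory figures invite a quick back-of-the-envelope check. The sketch below derives the theoretical peak implied by "23 GB/s is 72% of peak" and estimates the memory traffic behind the 12 GB/s write figure. The factor of two for write allocate (each written line is read from memory first and written back later) is the usual simplification and is an assumption here, not a number from the slides.

```python
def implied_peak(measured_gbs, fraction_of_peak):
    """Theoretical peak implied by a measured rate and its share of the peak."""
    return measured_gbs / fraction_of_peak


def write_allocate_traffic(write_gbs):
    """Estimated memory traffic for a write stream under write allocate.

    Assumption: every written line is first read, then written back,
    so the traffic is twice the application-visible write rate.
    """
    return 2 * write_gbs


peak = implied_peak(23, 0.72)          # GB/s per socket
traffic = write_allocate_traffic(12)   # GB/s per socket
```

With these inputs the implied peak is roughly 32 GB/s per socket, and the estimated traffic of the write test (24 GB/s) is close to the 23 GB/s measured for reads, which fits the "write allocate" remark.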

28 Bandwidth of Concurrent Accesses: Coherency Overhead
- Coherency state control
  - Exclusive: silently evict cache lines

29 Bandwidth of Concurrent Accesses: Coherency Overhead
- Coherency state control
  - Exclusive: silently evict cache lines
  - Modified: write back of higher level caches required

30 Summary
- Benchmarks
  - Unveil important undocumented performance data
  - Measure properties that are not covered by existing benchmarks
  - Data placement
    - Analyze performance of communication between cores
  - Coherency state control
    - Analyze coherency protocol implementation
- Nehalem performance
  - Inclusive L3 cache handles all coherency issues between cores on die
  - Core valid bits filter most unnecessary snoops
  - Limited L3 write bandwidth

31 Recent Developments and Future Work
- Recent developments
  - Experimental support for Owned and Forward state
  - Performance counter support (PAPI)
  - Experimental uncore performance counter support (perfmon2)
- Future work
  - Support other architectures
  - Analyze larger shared memory HPC systems
  - Measure impact of AMD's HT Assist

32 Thanks for your Attention
- Benchmarks and BenchIT Framework available as open source
  - BenchIT available at [...]
  - Find x86 benchmarks at [...]
