Design of Parallel and High-Performance Computing Fall 2017 Lecture: Cache Coherence & Memory Models


1 Design of Parallel and High-Performance Computing Fall 2017 Lecture: Cache Coherence & Memory Models Motivational video: Instructor: Torsten Hoefler & Markus Püschel TAs: Salvatore Di Girolamo

2 Sorry for missing last week's lecture! 2

3 To talk about some silliness in MPI-3 3

4 Stadt- und Klimawandel: Wie stellen wir uns den Herausforderungen? (Urban and climate change: how do we face the challenges?) Wednesday, 8 November 2016, ETH Zürich, main building. Information and registration:

5

6 A more complete view 6

7 Architecture Developments (timeline through >2020): distributed memory machines communicating through messages; large cache-coherent multicore machines communicating through coherent memory access and messages; large cache-coherent multicore machines communicating through coherent memory access and remote direct memory access; coherent and non-coherent manycore accelerators and multicores communicating through memory access and remote direct memory access; largely non-coherent accelerators and multicores communicating through remote direct memory access. Sources: various vendors 7

8 Computer Architecture vs. Physics. Physics (technological constraints): cost of data movement, capacity of DRAM cells, clock frequencies (constrained by the end of Dennard scaling), speed of light, melting point of silicon. Computer Architecture (design of the machine): power management, ISA / multithreading, SIMD widths. "Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints." Fred Brooks (IBM, 1962). Have converted many former power problems into cost problems. Credit: John Shalf (LBNL) 8

9 Low-Power Design Principles (2005). Core spectrum: Power5, Intel Core2, Intel Atom, Tensilica XTensa. Cubic power improvement with lower clock rate due to V^2 F; slower clock rates enable use of simpler cores; simpler cores use less area (lower leakage) and reduce cost; tailor the design to the application to REDUCE WASTE. Credit: John Shalf (LBNL) 9

10 Low-Power Design Principles (2005). Power5 (server): 120 W @ 1900 MHz, baseline. Intel Core2 sc (laptop): 15 W @ 1000 MHz, 4x more FLOPs/Watt than baseline. Intel Atom (handhelds): 0.625 W @ 800 MHz, 80x more. GPU core or XTensa/embedded: 0.09 W @ 600 MHz, 400x more (80x-120x sustained). Credit: John Shalf (LBNL) 10

11 Low-Power Design Principles (2005). Power5 (server): baseline. Intel Core2 sc (laptop): 15 W @ 1000 MHz, 4x more FLOPs/Watt than baseline. Intel Atom (handhelds): 0.625 W @ 800 MHz, 80x more. GPU core or XTensa/embedded: 0.09 W @ 600 MHz, 400x more (80x-120x sustained). Even if each simple core is 1/4th as computationally efficient as a complex core, you can fit hundreds of them on a single chip and still be 100x more power efficient. Credit: John Shalf (LBNL) 11

12 Heterogeneous Future (LOCs and TOCs). Big cores (very few): Latency Optimized Core (LOC), most energy efficient if you don't have lots of parallelism. Tiny cores (lots of them, ~0.23 mm x 0.2 mm): Throughput Optimized Core (TOC), most energy efficient if you DO have a lot of parallelism! Credit: John Shalf (LBNL) 12

13 Data movement: the wires. Energy efficiency of a copper wire: Power = Frequency * Length / cross-section-area. Wire efficiency does not improve as feature size shrinks; photonics could break through the bandwidth-distance limit. Energy efficiency of a transistor: Power = V^2 * Frequency * Capacitance, with Capacitance ~= area of the transistor. Transistor efficiency improves as you shrink it. Net result: moving data on wires is starting to cost more energy than computing on said data (hence the interest in Silicon Photonics). Credit: John Shalf (LBNL) 13

14 Pin Limits. Moore's law doesn't apply to adding pins to a package: 30%+ per year nominal Moore's Law vs. pins growing at ~1.5-3% per year at best. A 4000-pin package is aggressive; half of those pins would need to be for power and ground, and the remaining 2k pins run as differential pairs (about 1k signal links). Beyond 15 Gbps per pin, power/complexity costs hurt! 10 Gbps * 1k pins is ~1.2 TBytes/sec. 2.5D integration gives a boost in pin density, but it's a one-time boost (how much headroom?): 4 TB/sec? (maybe 8 TB/s with single-wire signaling?) Credit: John Shalf (LBNL) 14

15 Die Photos (3 classes of cores). [Die-photo figure; labels include the dual-core RISC-V vector processor (Rocket scalar cores with 16K L1I$ and 32K L1D$, Hwacha vector accelerators with 8KB L1VI$ and VRF, arbiter, coherence hub, 1MB SRAM array, FPGA FSB/HTIF) alongside larger dies; labeled dimensions range from ~20 mm and 8.4 mm dies down to a ~0.23 mm x 0.2 mm core.] Credit: John Shalf (LBNL) 15

16 Strip down the core. [Same die-photo figure with the core region stripped down; labeled dimensions include 4.5 mm, 2.7 mm, 3 mm, 1.1 mm, and 0.23 mm x 0.2 mm.] Credit: John Shalf (LBNL) 16

17 Actual Size. [Die photo of the Dual-Core RISC-V Vector Processor shown at actual size (about 3 mm x 6 mm), with Rocket scalar cores (16K L1I$, 32K L1D$), Hwacha vector accelerators (8KB L1VI$, VRF), arbiter, coherence hub, 1MB SRAM arrays, and FPGA FSB/HTIF; the backdrop is the first page of Lee et al., "A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators" (UC Berkeley / MIT), reporting 1.72 DMIPS/MHz for the scalar core and peak 16.7 double-precision GFLOPS/W at 0.65 V in a 45 nm SOI process.] Credit: John Shalf (LBNL) 17

18 Basic Stats. Core energy/area estimates: large core (area: ? mm^2): power 2.5 W, clock 2.4 GHz, E/op 651 pJ; RISC-V vector core: area 0.6 mm^2, power 0.3 W (<0.2 W), clock 1.3 GHz, E/op 150 (75) pJ; tiny core (0.23 mm x 0.2 mm): power 0.025 W, clock 1.0 GHz, E/op 22 pJ. Wire energy assumptions for 22 nm: 100 fJ/bit per mm, so a 64-bit operand costs roughly 6 pJ per mm and ~120 pJ for 20 mm. [Backdrop: die photo and first page of the RISC-V vector processor paper, as on the previous slide.] Credit: John Shalf (LBNL) 18

19 When does data movement dominate? Data movement cost vs. core energy (using the ~6 pJ/mm per 64-bit operand from the previous slide): large core (E/op 651 pJ, 2.5 W, 2.4 GHz): one compute op equals the data movement over 108 mm, energy ratio for 20 mm: 0.2x; RISC-V vector core (E/op 150 (75) pJ, 0.3 W (<0.2 W), 1.3 GHz, 0.6 mm^2): compute op equals data movement over 12 mm, energy ratio for 20 mm: 1.6x; tiny core (E/op 22 pJ, 0.025 W, 1.0 GHz, 0.23 mm x 0.2 mm): compute op equals data movement over 3.6 mm, energy ratio for 20 mm: 5.5x. [Backdrop: die photo and first page of Lee et al., "A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators".] Credit: John Shalf (LBNL) 19
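A quick sanity check of these numbers (a rough recomputation from the ~6 pJ/mm wire energy on the previous slide, not taken from the original sources): the break-even distance is roughly E/op divided by 6 pJ/mm, e.g. 651 pJ / 6 pJ/mm ≈ 108 mm, 75 pJ / 6 pJ/mm ≈ 12 mm, and 22 pJ / 6 pJ/mm ≈ 3.7 mm; the energy ratio for a 20 mm hop is roughly 120 pJ divided by E/op, e.g. 120/22 ≈ 5.5, 120/75 ≈ 1.6, and 120/651 ≈ 0.2. Moving an operand across a 20 mm die thus costs the tiny core several of its compute operations, while the large core barely notices.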

20 DPHPC Overview 20

21 Goals of this lecture Memory Trends Cache Coherence Memory Consistency 21

22 Memory-CPU gap widens. Measure processor speed as throughput: FLOP/s, IOP/s, etc.; Moore's law gives ~60% growth per year (source: Jack Dongarra). Today's architectures: POWER8: 338 dp GFLOP/s, 230 GB/s memory bandwidth; i7-5775C: 883 GFLOP/s, ~50 GB/s memory bandwidth. Trend: memory performance grows only ~10% per year (source: John McCalpin). 22
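To put the gap in concrete terms (a rough back-of-the-envelope from the numbers above, not from the cited sources): the POWER8 can stream about 230 / 338 ≈ 0.7 bytes from memory per double-precision flop, whereas the i7-5775C manages only about 50 / 883 ≈ 0.06 bytes per flop. Any kernel that needs much more memory traffic than that per flop is bandwidth-bound, which is exactly why caches and the coherence machinery discussed next matter so much.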

23 Issues (AMD Interlagos as example). How to measure bandwidth? Data sheet (often peak performance, may include overheads): frequency times bus width = 51 GiB/s. Microbenchmark performance: stride-1 access (32 MiB): 32 GiB/s; random access (8 B out of 32 MiB): 241 MiB/s. Why? Application performance: as observed (performance counters), typically somewhere in between stride-1 and random access. How to measure latency? Data sheet (often optimistic, or not provided): <100 ns. Random pointer chase: 110 ns with one core, 258 ns with 32 cores! 23
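A random pointer chase of the kind quoted above is easy to build yourself; the following C program is a minimal sketch (buffer size, shuffle, and timing choices are illustrative assumptions, not the benchmark used in the lecture):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (4 * 1024 * 1024)   /* 4M pointers of 8 bytes = 32 MiB working set */

int main(void) {
    void **buf = malloc(N * sizeof(void *));
    size_t *idx = malloc(N * sizeof(size_t));
    /* Build a random cyclic permutation so the hardware prefetcher cannot help. */
    for (size_t i = 0; i < N; i++) idx[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % N]];     /* each element points to the next */

    /* Chase the pointers: every load depends on the previous one, so total
       time divided by the number of loads approximates the memory latency. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = &buf[idx[0]];
    for (size_t i = 0; i < N; i++) p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg latency: %.1f ns (last pointer %p)\n", ns / N, (void *)p);
    free(idx); free(buf);
    return 0;
}

Printing the final pointer keeps the compiler from removing the loop; running one instance per core would reproduce the "one core vs. 32 cores" comparison on the slide.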

24 Conjecture: Buffering is a must! Two most common examples: Write Buffers Delayed write back saves memory bandwidth Data is often overwritten or re-read Caching Directory of recently used locations Stored as blocks (cache lines) 24

25 Cache Coherence Different caches may have a copy of the same memory location! Cache coherence Manages existence of multiple copies Cache architectures Multi level caches Shared vs. private (partitioned) Inclusive vs. exclusive Write back vs. write through Victim cache to reduce conflict misses 25

26 Exclusive Hierarchical Caches 26

27 Shared Hierarchical Caches 27

28 Shared Hierarchical Caches with MT 28

29 Caching Strategies (repeat). Remember: write-back? write-through? Cache coherence requirements: a memory system is coherent if it guarantees the following: write propagation (updates are eventually visible to all readers) and write serialization (all processors observe writes to the same location in the same order; for example, if one CPU writes X=1 and another writes X=2, no processor may read X=2 and later X=1 without an intervening write). Everything else: memory model issues (later). 29

30 Write Through Cache 1. CPU 0 reads X from memory loads X=0 into its cache 2. CPU 1 reads X from memory loads X=0 into its cache 3. CPU 0 writes X=1 stores X=1 in its cache stores X=1 in memory 4. CPU 1 reads X from its cache loads X=0 from its cache Incoherent value for X on CPU 1 CPU 1 may wait for update! Requires write propagation! 30

31 Write Back Cache 1. CPU 0 reads X from memory loads X=0 into its cache 2. CPU 1 reads X from memory loads X=0 into its cache 3. CPU 0 writes X=1 stores X=1 in its cache 4. CPU 1 writes X=2 stores X=2 in its cache 5. CPU 1 writes back cache line stores X=2 in memory 6. CPU 0 writes back cache line stores X=1 in memory Later store X=2 from CPU 1 lost Requires write serialization! 31

32 A simple (?) example Assume C99: struct twoint { int a; int b; } Two threads: Initially: a=b=0 Thread 0: write 1 to a Thread 1: write 1 to b Assume non-coherent write back cache What may end up in main memory? 32
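This thought experiment can be written down directly. Below is a minimal pthreads sketch (hypothetical: on today's coherent x86/ARM machines it always prints a=1 b=1, but under the non-coherent write-back caches assumed on the slide, either core's whole-line write-back could overwrite the other core's update, leaving a=1,b=0 or a=0,b=1 in main memory):

#include <pthread.h>
#include <stdio.h>

struct twoint { int a; int b; };      /* both fields share one cache line */
static struct twoint x = { 0, 0 };    /* initially a = b = 0 */

static void *t0(void *arg) { x.a = 1; return arg; }  /* Thread 0: write 1 to a */
static void *t1(void *arg) { x.b = 1; return arg; }  /* Thread 1: write 1 to b */

int main(void) {
    pthread_t th0, th1;
    pthread_create(&th0, NULL, t0, NULL);
    pthread_create(&th1, NULL, t1, NULL);
    pthread_join(th0, NULL);
    pthread_join(th1, NULL);
    /* With coherent caches the only possible memory result is a=1, b=1.
       Without coherence, each core may write back its whole (stale) line,
       so memory could end up as a=1,b=0 or a=0,b=1 depending on which
       write-back happens last. */
    printf("a=%d b=%d\n", x.a, x.b);
    return 0;
}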

33 Cache Coherence Protocol Programmer can hardly deal with unpredictable behavior! Cache controller maintains data integrity All writes to different locations are visible Fundamental Mechanisms Snooping Shared bus or (broadcast) network Directory-based Record information necessary to maintain coherence: E.g., owner and state of a line etc. 33

34 Fundamental CC mechanisms Snooping Shared bus or (broadcast) network Cache controller snoops all transactions Monitors and changes the state of the cache s data Works at small scale, challenging at large-scale E.g., Intel Broadwell Directory-based Record information necessary to maintain coherence E.g., owner and state of a line etc. Central/Distributed directory for cache line ownership Scalable but more complex/expensive E.g., Intel Xeon Phi KNC/KNL Source: Intel 34

35 Cache Coherence Parameters Concerns/Goals Performance Implementation cost (chip space, more important: dynamic energy) Correctness (Memory model side effects) Issues Detection (when does a controller need to act) Enforcement (how does a controller guarantee coherence) Precision of block sharing (per block, per sub-block?) Block size (cache line size?) 35

36 An Engineering Approach: Empirical start Problem 1: stale reads Cache 1 holds value that was already modified in cache 2 Solution: Disallow this state Invalidate all remote copies before allowing a write to complete Problem 2: lost update Incorrect write back of modified line writes main memory in different order from the order of the write operations or overwrites neighboring data Solution: Disallow more than one modified copy 36

37 Invalidation vs. update (I) Invalidation-based: On each write of a shared line, it has to invalidate copies in remote caches Simple implementation for bus-based systems: Each cache snoops Invalidate lines written by other CPUs Signal sharing for cache lines in local cache to other caches Update-based: Local write updates copies in remote caches Can update all CPUs at once Multiple writes cause multiple updates (more traffic) 37

38 Invalidation vs. update (II) Invalidation-based: Only write misses hit the bus (works with write-back caches) Subsequent writes to the same cache line are local Good for multiple writes to the same line (in the same cache) Update-based: All sharers continue to hit cache line after one core writes Implicit assumption: shared lines are accessed often Supports producer-consumer pattern well Many (local) writes may waste bandwidth! Hybrid forms are possible! 38

39 MESI Cache Coherence Most common hardware implementation of discussed requirements aka. Illinois protocol Each line has one of the following states (in a cache): Modified (M) Local copy has been modified, no copies in other caches Memory is stale Exclusive (E) No copies in other caches Memory is up to date Shared (S) Unmodified copies may exist in other caches Memory is up to date Invalid (I) Line is not in cache 39

40 Terminology Clean line: Content of cache line and main memory is identical (also: memory is up to date) Can be evicted without write-back Dirty line: Content of cache line and main memory differ (also: memory is stale) Needs to be written back eventually Time depends on protocol details Bus transaction: A signal on the bus that can be observed by all caches Usually blocking Local read/write: A load/store operation originating at a core connected to the cache 40

41 Transitions in response to local reads State is M No bus transaction State is E No bus transaction State is S No bus transaction State is I Generate bus read request (BusRd) May force other cache operations (see later) Other cache(s) signal sharing if they hold a copy If shared was signaled, go to state S Otherwise, go to state E After update: return read value 41

42 Transitions in response to local writes State is M No bus transaction State is E No bus transaction Go to state M State is S Line already local & clean There may be other copies Generate bus read request for upgrade to exclusive (BusRdX*) Go to state M State is I Generate bus read request for exclusive ownership (BusRdX) Go to state M 42

43 Transitions in response to snooped BusRd State is M Write cache line back to main memory Signal shared Go to state S (or E) State is E Signal shared Go to state S State is S Signal shared State is I Ignore 43

44 Transitions in response to snooped BusRdX State is M Write cache line back to memory Discard line and go to I State is E Discard line and go to I State is S Discard line and go to I State is I Ignore BusRdX* is handled like BusRdX! 44

45 MESI State Diagram (FSM) Source: Wikipedia 45

46 Small Exercise. Initially: all lines in I state. For each action, give the P1/P2/P3 state, the bus action, and where the data comes from: P1 reads x; P2 reads x; P1 writes x; P1 reads x; P3 writes x. 46

47 Small Exercise (solution). Initially: all lines in I state.
Action      | P1 state | P2 state | P3 state | Bus action | Data from
P1 reads x  | E        | I        | I        | BusRd      | Memory
P2 reads x  | S        | S        | I        | BusRd      | Cache
P1 writes x | M        | I        | I        | BusRdX*    | Cache
P1 reads x  | M        | I        | I        | -          | Cache
P3 writes x | I        | I        | M        | BusRdX     | Memory
47
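One way to double-check such traces is to code the transition rules from slides 41-44 directly. The sketch below is a hypothetical, drastically simplified single-line MESI simulator (three caches, one address, the bus modeled as a plain loop over the other caches, write-backs omitted); its main() replays the exercise trace and reproduces the table above:

#include <stdio.h>

#define NPROC 3

typedef enum { I, S, E, M } state_t;                  /* MESI states of line x */
static const char *sname[] = { "I", "S", "E", "M" };
static state_t line[NPROC];                           /* state of x in each cache, initially I */

/* Snooping: what every other cache does when it observes a bus transaction.
   Returns 1 if the snooping cache signals "shared" (it holds a copy).
   A line in M would also be written back to memory here; omitted in this sketch. */
static int snoop_busrd(int q) {
    if (line[q] != I) { line[q] = S; return 1; }
    return 0;                                         /* I: ignore */
}
static void snoop_busrdx(int q) { line[q] = I; }      /* invalidate on BusRdX / BusRdX* */

static void local_read(int p) {
    if (line[p] != I) { printf("P%d reads x : -       (hit, data from own cache)\n", p + 1); return; }
    int shared = 0;
    for (int q = 0; q < NPROC; q++)
        if (q != p) shared |= snoop_busrd(q);         /* broadcast BusRd */
    line[p] = shared ? S : E;                         /* E only if nobody signaled sharing */
    printf("P%d reads x : BusRd   (data from %s)\n", p + 1, shared ? "cache" : "memory");
}

static void local_write(int p) {
    const char *bus = "-";
    if (line[p] == S) bus = "BusRdX*";                /* upgrade: line present but shared */
    if (line[p] == I) bus = "BusRdX";                 /* read for exclusive ownership */
    if (line[p] == S || line[p] == I)
        for (int q = 0; q < NPROC; q++)
            if (q != p) snoop_busrdx(q);              /* invalidate all remote copies */
    line[p] = M;
    printf("P%d writes x: %s\n", p + 1, bus);
}

static void dump(void) {
    printf("            states:");
    for (int p = 0; p < NPROC; p++) printf(" P%d=%s", p + 1, sname[line[p]]);
    printf("\n");
}

int main(void) {
    /* Replay the exercise: P1 reads, P2 reads, P1 writes, P1 reads, P3 writes. */
    local_read(0);  dump();
    local_read(1);  dump();
    local_write(0); dump();
    local_read(0);  dump();
    local_write(2); dump();
    return 0;
}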

48 Optimizations? Class question: what could be optimized in the MESI protocol to make a system faster? 48

49 Related Protocols: MOESI (AMD) Extended MESI protocol Cache-to-cache transfer of modified cache lines Cache in M or O state always transfers cache line to requesting cache No need to contact (slow) main memory Avoids write back when another process accesses cache line Good when cache-to-cache performance is higher than cache-to-memory E.g., shared last level cache! 49

50 MOESI State Diagram Source: AMD64 Architecture Programmer's Manual 50

51 Related Protocols: MOESI (AMD) Modified (M): Modified Exclusive No copies in other caches, local copy dirty Memory is stale, cache supplies copy (reply to BusRd*) Owner (O): Modified Shared Exclusive right to make changes Other S copies may exist ("dirty sharing") Memory is stale, cache supplies copy (reply to BusRd*) Exclusive (E): Same as MESI (one local copy, up to date memory) Shared (S): Unmodified copy may exist in other caches Memory is up to date unless an O copy exists in another cache Invalid (I): Same as MESI 51

52 Related Protocols: MESIF (Intel) Modified (M): Modified Exclusive No copies in other caches, local copy dirty Memory is stale, cache supplies copy (reply to BusRd*) Exclusive (E): Same as MESI (one local copy, up to date memory) Shared (S): Unmodified copy may exist in other caches Memory is up to date Invalid (I): Same as MESI Forward (F): Special form of S state, other caches may have line in S Most recent requester of line is in F state Cache acts as responder for requests to this line 52

53 Multi-level caches Most systems have multi-level caches Problem: only last level cache is connected to bus or network Yet, snoop requests are relevant for inner-levels of cache (L1) Modifications of L1 data may not be visible at L2 (and thus the bus) L1/L2 modifications On BusRd check if line is in M state in L1 It may be in E or S in L2! On BusRdX(*) send invalidations to L1 Everything else can be handled in L2 If L1 is write through, L2 could remember state of L1 cache line May increase traffic though 53

54 Directory-based cache coherence Snooping does not scale Bus transactions must be globally visible Implies broadcast Typical solution: tree-based (hierarchical) snooping Root becomes a bottleneck Directory-based schemes are more scalable Directory (entry for each CL) keeps track of all owning caches Point-to-point update to involved processors No broadcast Can use specialized (high-bandwidth) network, e.g., HT, QPI 54

55 Basic Scheme. System with N processors P_i. For each memory block (size: cache line), maintain a directory entry with N presence bits (bit i set if the block is in the cache of P_i) and 1 dirty bit. First proposed by Censier and Feautrier (1978). 55
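In code, such a directory entry is just a presence bitmask plus a dirty flag; a minimal sketch (the type name and the 64-processor limit are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

/* One directory entry per memory block (cache-line sized). */
typedef struct {
    uint64_t presence;   /* bit i set: block is cached by processor P_i (N <= 64) */
    bool     dirty;      /* set: exactly one cache holds a modified copy          */
} dir_entry_t;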

56 Directory-based CC: Read miss. P_i intends to read, misses. If the dirty bit (in the directory) is off: read from main memory, set presence[i], supply data to the reader. If the dirty bit is on: recall the cache line from P_j (determined from presence[]), update memory, unset the dirty bit (block now shared), set presence[i], supply data to the reader. 56

57 Directory-based CC: Write miss. P_i intends to write, misses. If the dirty bit (in the directory) is off: send invalidations to all processors P_j with presence[j] turned on, unset the presence bits for all processors, set the dirty bit, set presence[i] (owner P_i). If the dirty bit is on: recall the cache line from the owner P_j, update memory, unset presence[j], set presence[i] (the dirty bit remains set), supply data to the writer. 57
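Slides 56 and 57 translate almost line by line into code. The following is a hedged sketch rather than the lecture's implementation: the directory entry from above is repeated for self-containment, and send_invalidation / recall_line / update_memory / supply_data are made-up stand-ins (here just printfs) for real interconnect transactions.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Directory entry, as in the sketch above. */
typedef struct { uint64_t presence; bool dirty; } dir_entry_t;

/* Stand-ins for interconnect messages (printouts only in this sketch). */
static void send_invalidation(int j, long b) { printf("  invalidate P%d, block %ld\n", j, b); }
static void recall_line(int j, long b)       { printf("  recall block %ld from P%d\n", b, j); }
static void update_memory(long b)            { printf("  write block %ld back to memory\n", b); }
static void supply_data(int i, long b)       { printf("  supply block %ld to P%d\n", b, i); }

/* Slide 56: P_i reads and misses. */
static void read_miss(dir_entry_t *d, int i, long block) {
    if (d->dirty) {                            /* a single owner P_j holds a modified copy */
        int j = __builtin_ctzll(d->presence);
        recall_line(j, block);
        update_memory(block);
        d->dirty = false;                      /* block is now shared */
    }
    d->presence |= 1ull << i;                  /* set presence[i] */
    supply_data(i, block);                     /* read from memory and supply to reader */
}

/* Slide 57: P_i writes and misses. */
static void write_miss(dir_entry_t *d, int i, long block) {
    if (!d->dirty) {
        for (int j = 0; j < 64; j++)           /* invalidate every current sharer */
            if (d->presence & (1ull << j)) send_invalidation(j, block);
        d->presence = 0;
        d->dirty = true;
    } else {
        int j = __builtin_ctzll(d->presence);  /* recall from the current owner */
        recall_line(j, block);
        update_memory(block);
        d->presence &= ~(1ull << j);           /* dirty bit remains set */
    }
    d->presence |= 1ull << i;                  /* P_i becomes the sole owner */
    supply_data(i, block);
}

int main(void) {
    dir_entry_t d = { 0, false };
    printf("P0 read miss:\n");  read_miss(&d, 0, 42);
    printf("P1 read miss:\n");  read_miss(&d, 1, 42);
    printf("P2 write miss:\n"); write_miss(&d, 2, 42);
    printf("P0 read miss:\n");  read_miss(&d, 0, 42);
    return 0;
}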

58 Discussion Scaling of memory bandwidth No centralized memory Directory-based approaches scale with restrictions Require presence bit for each cache Number of bits determined at design time Directory requires memory (size scales linearly) Shared vs. distributed directory Software-emulation Distributed shared memory (DSM) Emulate cache coherence in software (e.g., TreadMarks) Often on a per-page basis, utilizes memory virtualization and paging 59

59 Open Problems (for projects or theses) Tune algorithms to cache-coherence schemes What is the optimal parallel algorithm for a given scheme? Parameterize for an architecture Measure and classify hardware Read Maranget et al., "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models", and have fun! RDMA consistency is barely understood! GPU memories are not well understood! Huge potential for new insights! Can we program (easily) without cache coherence? How to fix the problems with inconsistent values? Compiler support (issues with arrays)? 60

60 Case Study: Intel Xeon Phi 61

61 Communication? Ramos, Hoefler: "Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi", HPDC'13 62

62 Invalid read: R_I = 278 ns. Local read: R_L = 8.6 ns. Remote read: R_R = 235 ns. Inspired by Molka et al.: "Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system" 63

63 Single-Line Ping Pong Prediction for both in E state: 479 ns Measurement: 497 ns (O=18) 64

64 Multi-Line Ping Pong. More complex due to prefetch. [Model equation figure; terms: number of CLs, startup overhead, amortization of startup, asymptotic fetch latency for each cache line (optimal prefetch!).] 65

65 Multi-Line Ping Pong E state: o=76 ns q=1,521ns p=1,096ns I state: o=95ns q=2,750ns p=2,017ns 66

66 DTD Contention E state: a=0ns b=320ns c=56.2ns 67


More information

Overview: Shared Memory Hardware

Overview: Shared Memory Hardware Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing

More information

10/19/17. You Are Here! Review: Direct-Mapped Cache. Typical Memory Hierarchy

10/19/17. You Are Here! Review: Direct-Mapped Cache. Typical Memory Hierarchy CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 Instructors: Krste Asanović & Randy H Katz http://insteecsberkeleyedu/~cs6c/ Parallel Requests Assigned to computer eg, Search

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro Lecture 7: M Wrap-Up, ache coherence Topics: handling M errors and writes, cache coherence intro 1 Optimizations for Writes (Energy, Lifetime) Read a line before writing and only write the modified bits

More information

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University 18-742 Parallel Computer Architecture Lecture 5: Cache Coherence Chris Craik (TA) Carnegie Mellon University Readings: Coherence Required for Review Papamarcos and Patel, A low-overhead coherence solution

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 Instructors: Krste Asanović & Randy H. Katz http://inst.eecs.berkeley.edu/~cs61c/ 10/19/17 Fall 2017 - Lecture #16 1 Parallel

More information

History. PowerPC based micro-architectures. PowerPC ISA. Introduction

History. PowerPC based micro-architectures. PowerPC ISA. Introduction PowerPC based micro-architectures Godfrey van der Linden Presentation for COMP9244 Software view of Processor Architectures 2006-05-25 History 1985 IBM started on AMERICA 1986 Development of RS/6000 1990

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable

More information

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes

More information

The Cache Write Problem

The Cache Write Problem Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Coherence Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. L5- Coherence Avoids Stale Data Multicores have multiple private caches for performance Need to provide the illusion

More information

Computer Architecture and Engineering CS152 Quiz #5 May 2th, 2013 Professor Krste Asanović Name: <ANSWER KEY>

Computer Architecture and Engineering CS152 Quiz #5 May 2th, 2013 Professor Krste Asanović Name: <ANSWER KEY> Computer Architecture and Engineering CS152 Quiz #5 May 2th, 2013 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 15 pages Notes: Not all questions are

More information

A More Sophisticated Snooping-Based Multi-Processor

A More Sophisticated Snooping-Based Multi-Processor Lecture 16: A More Sophisticated Snooping-Based Multi-Processor Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2014 Tunes The Projects Handsome Boy Modeling School (So... How

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Token Coherence. Milo M. K. Martin Dissertation Defense

Token Coherence. Milo M. K. Martin Dissertation Defense Token Coherence Milo M. K. Martin Dissertation Defense Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin Madison (C) 2003 Milo Martin Overview Technology and software

More information

Lecture 1: Introduction

Lecture 1: Introduction Lecture 1: Introduction ourse organization: 4 lectures on cache coherence and consistency 2 lectures on transactional memory 2 lectures on interconnection networks 4 lectures on caches 4 lectures on memory

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy The course has 4 modules Module 1 Instruction Set Architecture (ISA) Simple Pipelining and Hazards Module 2 Superscalar Architectures

More information

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Lecture 7: Implementing Cache Coherence. Topics: implementation details Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory

More information

CS 152 Laboratory Exercise 5

CS 152 Laboratory Exercise 5 CS 152 Laboratory Exercise 5 Professor: Krste Asanovic TA: Christopher Celio Department of Electrical Engineering & Computer Science University of California, Berkeley April 11, 2012 1 Introduction and

More information