Design of Parallel and High-Performance Computing Fall 2017 Lecture: Cache Coherence & Memory Models


1 Design of Parallel and High-Performance Computing Fall 2017 Lecture: Cache Coherence & Memory Models Motivational video: Instructor: Torsten Hoefler & Markus Püschel TAs: Salvatore Di Girolamo

2 Sorry for missing last week's lecture! 2

3 To talk about some silliness in MPI-3 3

4 Stadt- und Klimawandel: Wie stellen wir uns den Herausforderungen? (Urban and climate change: how do we face the challenges?) Wednesday, 8 November 2016, ETH Zürich, main building. Information and registration:

5

6 A more complete view 6

7 Architecture Developments (timeline through >2020): distributed memory machines communicating through messages; large cache-coherent multicore machines communicating through coherent memory access and messages; large cache-coherent multicore machines communicating through coherent memory access and remote direct memory access; coherent and non-coherent manycore accelerators and multicores communicating through memory access and remote direct memory access; largely non-coherent accelerators and multicores communicating through remote direct memory access. Sources: various vendors 7

8 Computer Architecture vs. Physics. Physics (technological constraints): cost of data movement, capacity of DRAM cells, clock frequencies (constrained by the end of Dennard scaling), speed of light, melting point of silicon. Computer Architecture (design of the machine): power management, ISA / multithreading, SIMD widths. "Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints." Fred Brooks (IBM, 1962). Have converted many former power problems into cost problems. Credit: John Shalf (LBNL) 8

9 Low-Power Design Principles (2005). Core spectrum: Power5, Intel Core2, Intel Atom, Tensilica XTensa. Cubic power improvement with lower clock rate due to V^2 F; slower clock rates enable use of simpler cores; simpler cores use less area (lower leakage) and reduce cost; tailor the design to the application to REDUCE WASTE. Credit: John Shalf (LBNL) 9

10 Low-Power Design Principles (2005). Power5 (server): 120 W @ 1900 MHz, baseline. Intel Core2 sc (laptop): 15 W @ 1000 MHz, 4x more FLOPs/Watt than baseline. Intel Atom (handhelds): 0.625 W @ 800 MHz, 80x more. GPU core or XTensa/embedded: 0.09 W @ 600 MHz, 400x more (80x-120x sustained). Credit: John Shalf (LBNL) 10

11 Low-Power Design Principles (2005). Power5 (server): baseline. Intel Core2 sc (laptop): 15 W @ 1000 MHz, 4x more FLOPs/Watt than baseline. Intel Atom (handhelds): 0.625 W @ 800 MHz, 80x more. GPU core or XTensa/embedded: 0.09 W @ 600 MHz, 400x more (80x-120x sustained). Even if each simple core is 1/4th as computationally efficient as a complex core, you can fit hundreds of them on a single chip and still be 100x more power efficient. Credit: John Shalf (LBNL) 11

12 Heterogeneous Future (LOCs and TOCs). Big cores (very few): Latency Optimized Core (LOC), most energy efficient if you don't have lots of parallelism. Tiny cores (lots of them, ~0.23 mm x 0.2 mm): Throughput Optimized Core (TOC), most energy efficient if you DO have a lot of parallelism! Credit: John Shalf (LBNL) 12

13 Data movement: the wires. Energy efficiency of a copper wire: Power = Frequency * Length / cross-section-area. Wire efficiency does not improve as feature size shrinks; photonics could break through the bandwidth-distance limit. Energy efficiency of a transistor: Power = V^2 * Frequency * Capacitance, with Capacitance ~= area of the transistor. Transistor efficiency improves as you shrink it. Net result: moving data on wires is starting to cost more energy than computing on said data (hence the interest in Silicon Photonics). Credit: John Shalf (LBNL) 13

14 Pin Limits. Moore's law doesn't apply to adding pins to a package: 30%+ per year nominal Moore's Law vs. pins growing at ~1.5-3% per year at best. A 4000-pin package is aggressive; half of those pins would need to be for power and ground, and the remaining 2k pins run as differential pairs (about 1k signal links). Beyond 15 Gbps per pin, power/complexity costs hurt! 10 Gbps * 1k pins is ~1.2 TBytes/sec. 2.5D integration gives a boost in pin density, but it's a one-time boost (how much headroom?): 4 TB/sec? (maybe 8 TB/s with single-wire signaling?) Credit: John Shalf (LBNL) 14

15 Die Photos (3 classes of cores). [Die-photo figure; labels include the dual-core RISC-V vector processor (Rocket scalar cores with 16K L1I$ and 32K L1D$, Hwacha vector accelerators with 8KB L1VI$ and VRF, arbiter, coherence hub, 1MB SRAM array, FPGA FSB/HTIF) alongside larger dies; labeled dimensions range from ~20 mm and 8.4 mm dies down to a ~0.23 mm x 0.2 mm core.] Credit: John Shalf (LBNL) 15

16 Strip down the core. [Same die-photo figure with the core region stripped down; labeled dimensions include 4.5 mm, 2.7 mm, 3 mm, 1.1 mm, and 0.23 mm x 0.2 mm.] Credit: John Shalf (LBNL) 16

17 Actual Size. [Die photo of the Dual-Core RISC-V Vector Processor shown at actual size (about 3 mm x 6 mm), with Rocket scalar cores (16K L1I$, 32K L1D$), Hwacha vector accelerators (8KB L1VI$, VRF), arbiter, coherence hub, 1MB SRAM arrays, and FPGA FSB/HTIF; the backdrop is the first page of Lee et al., "A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators" (UC Berkeley / MIT), reporting 1.72 DMIPS/MHz for the scalar core and peak 16.7 double-precision GFLOPS/W at 0.65 V in a 45 nm SOI process.] Credit: John Shalf (LBNL) 17

18 Basic Stats. Core energy/area estimates: large core (area: ? mm^2): power 2.5 W, clock 2.4 GHz, E/op 651 pJ; RISC-V vector core: area 0.6 mm^2, power 0.3 W (<0.2 W), clock 1.3 GHz, E/op 150 (75) pJ; tiny core (0.23 mm x 0.2 mm): power 0.025 W, clock 1.0 GHz, E/op 22 pJ. Wire energy assumptions for 22 nm: 100 fJ/bit per mm, so a 64-bit operand costs roughly 6 pJ per mm and ~120 pJ for 20 mm. [Backdrop: die photo and first page of the RISC-V vector processor paper, as on the previous slide.] Credit: John Shalf (LBNL) 18

19 When does data movement dominate? Data movement cost vs. core energy (using the ~6 pJ/mm per 64-bit operand from the previous slide): large core (E/op 651 pJ, 2.5 W, 2.4 GHz): one compute op equals the data movement over 108 mm, energy ratio for 20 mm: 0.2x; RISC-V vector core (E/op 150 (75) pJ, 0.3 W (<0.2 W), 1.3 GHz, 0.6 mm^2): compute op equals data movement over 12 mm, energy ratio for 20 mm: 1.6x; tiny core (E/op 22 pJ, 0.025 W, 1.0 GHz, 0.23 mm x 0.2 mm): compute op equals data movement over 3.6 mm, energy ratio for 20 mm: 5.5x. [Backdrop: die photo and first page of Lee et al., "A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators".] Credit: John Shalf (LBNL) 19
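A quick sanity check of these numbers (a rough recomputation from the ~6 pJ/mm wire energy on the previous slide, not taken from the original sources): the break-even distance is roughly E/op divided by 6 pJ/mm, e.g. 651 pJ / 6 pJ/mm ≈ 108 mm, 75 pJ / 6 pJ/mm ≈ 12 mm, and 22 pJ / 6 pJ/mm ≈ 3.7 mm; the energy ratio for a 20 mm hop is roughly 120 pJ divided by E/op, e.g. 120/22 ≈ 5.5, 120/75 ≈ 1.6, and 120/651 ≈ 0.2. Moving an operand across a 20 mm die thus costs the tiny core several of its compute operations, while the large core barely notices.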

20 DPHPC Overview 20

21 Goals of this lecture Memory Trends Cache Coherence Memory Consistency 21

22 Memory-CPU gap widens. Measure processor speed as throughput: FLOP/s, IOP/s, etc.; Moore's law gives ~60% growth per year (source: Jack Dongarra). Today's architectures: POWER8: 338 dp GFLOP/s, 230 GB/s memory bandwidth; i7-5775C: 883 GFLOP/s, ~50 GB/s memory bandwidth. Trend: memory performance grows only ~10% per year (source: John McCalpin). 22
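To put the gap in concrete terms (a rough back-of-the-envelope from the numbers above, not from the cited sources): the POWER8 can stream about 230 / 338 ≈ 0.7 bytes from memory per double-precision flop, whereas the i7-5775C manages only about 50 / 883 ≈ 0.06 bytes per flop. Any kernel that needs much more memory traffic than that per flop is bandwidth-bound, which is exactly why caches and the coherence machinery discussed next matter so much.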

23 Issues (AMD Interlagos as example). How to measure bandwidth? Data sheet (often peak performance, may include overheads): frequency times bus width = 51 GiB/s. Microbenchmark performance: stride-1 access (32 MiB): 32 GiB/s; random access (8 B out of 32 MiB): 241 MiB/s. Why? Application performance: as observed (performance counters), typically somewhere in between stride-1 and random access. How to measure latency? Data sheet (often optimistic, or not provided): <100 ns. Random pointer chase: 110 ns with one core, 258 ns with 32 cores! 23
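A random pointer chase of the kind quoted above is easy to build yourself; the following C program is a minimal sketch (buffer size, shuffle, and timing choices are illustrative assumptions, not the benchmark used in the lecture):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (4 * 1024 * 1024)   /* 4M pointers of 8 bytes = 32 MiB working set */

int main(void) {
    void **buf = malloc(N * sizeof(void *));
    size_t *idx = malloc(N * sizeof(size_t));
    /* Build a random cyclic permutation so the hardware prefetcher cannot help. */
    for (size_t i = 0; i < N; i++) idx[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % N]];     /* each element points to the next */

    /* Chase the pointers: every load depends on the previous one, so total
       time divided by the number of loads approximates the memory latency. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = &buf[idx[0]];
    for (size_t i = 0; i < N; i++) p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg latency: %.1f ns (last pointer %p)\n", ns / N, (void *)p);
    free(idx); free(buf);
    return 0;
}

Printing the final pointer keeps the compiler from removing the loop; running one instance per core would reproduce the "one core vs. 32 cores" comparison on the slide.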

24 Conjecture: Buffering is a must! Two most common examples: Write Buffers Delayed write back saves memory bandwidth Data is often overwritten or re-read Caching Directory of recently used locations Stored as blocks (cache lines) 24

25 Cache Coherence Different caches may have a copy of the same memory location! Cache coherence Manages existence of multiple copies Cache architectures Multi level caches Shared vs. private (partitioned) Inclusive vs. exclusive Write back vs. write through Victim cache to reduce conflict misses 25

26 Exclusive Hierarchical Caches 26

27 Shared Hierarchical Caches 27

28 Shared Hierarchical Caches with MT 28

29 Caching Strategies (repeat). Remember: write-back? write-through? Cache coherence requirements: a memory system is coherent if it guarantees the following: write propagation (updates are eventually visible to all readers) and write serialization (all processors observe writes to the same location in the same order; for example, if one CPU writes X=1 and another writes X=2, no processor may read X=2 and later X=1 without an intervening write). Everything else: memory model issues (later). 29

30 Write Through Cache 1. CPU 0 reads X from memory loads X=0 into its cache 2. CPU 1 reads X from memory loads X=0 into its cache 3. CPU 0 writes X=1 stores X=1 in its cache stores X=1 in memory 4. CPU 1 reads X from its cache loads X=0 from its cache Incoherent value for X on CPU 1 CPU 1 may wait for update! Requires write propagation! 30

31 Write Back Cache 1. CPU 0 reads X from memory loads X=0 into its cache 2. CPU 1 reads X from memory loads X=0 into its cache 3. CPU 0 writes X=1 stores X=1 in its cache 4. CPU 1 writes X=2 stores X=2 in its cache 5. CPU 1 writes back cache line stores X=2 in memory 6. CPU 0 writes back cache line stores X=1 in memory Later store X=2 from CPU 1 lost Requires write serialization! 31

32 A simple (?) example Assume C99: struct twoint { int a; int b; } Two threads: Initially: a=b=0 Thread 0: write 1 to a Thread 1: write 1 to b Assume non-coherent write back cache What may end up in main memory? 32
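This thought experiment can be written down directly. Below is a minimal pthreads sketch (hypothetical: on today's coherent x86/ARM machines it always prints a=1 b=1, but under the non-coherent write-back caches assumed on the slide, either core's whole-line write-back could overwrite the other core's update, leaving a=1,b=0 or a=0,b=1 in main memory):

#include <pthread.h>
#include <stdio.h>

struct twoint { int a; int b; };      /* both fields share one cache line */
static struct twoint x = { 0, 0 };    /* initially a = b = 0 */

static void *t0(void *arg) { x.a = 1; return arg; }  /* Thread 0: write 1 to a */
static void *t1(void *arg) { x.b = 1; return arg; }  /* Thread 1: write 1 to b */

int main(void) {
    pthread_t th0, th1;
    pthread_create(&th0, NULL, t0, NULL);
    pthread_create(&th1, NULL, t1, NULL);
    pthread_join(th0, NULL);
    pthread_join(th1, NULL);
    /* With coherent caches the only possible memory result is a=1, b=1.
       Without coherence, each core may write back its whole (stale) line,
       so memory could end up as a=1,b=0 or a=0,b=1 depending on which
       write-back happens last. */
    printf("a=%d b=%d\n", x.a, x.b);
    return 0;
}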

33 Cache Coherence Protocol Programmer can hardly deal with unpredictable behavior! Cache controller maintains data integrity All writes to different locations are visible Fundamental Mechanisms Snooping Shared bus or (broadcast) network Directory-based Record information necessary to maintain coherence: E.g., owner and state of a line etc. 33

34 Fundamental CC mechanisms Snooping Shared bus or (broadcast) network Cache controller snoops all transactions Monitors and changes the state of the cache s data Works at small scale, challenging at large-scale E.g., Intel Broadwell Directory-based Record information necessary to maintain coherence E.g., owner and state of a line etc. Central/Distributed directory for cache line ownership Scalable but more complex/expensive E.g., Intel Xeon Phi KNC/KNL Source: Intel 34

35 Cache Coherence Parameters Concerns/Goals Performance Implementation cost (chip space, more important: dynamic energy) Correctness (Memory model side effects) Issues Detection (when does a controller need to act) Enforcement (how does a controller guarantee coherence) Precision of block sharing (per block, per sub-block?) Block size (cache line size?) 35

36 An Engineering Approach: Empirical start Problem 1: stale reads Cache 1 holds value that was already modified in cache 2 Solution: Disallow this state Invalidate all remote copies before allowing a write to complete Problem 2: lost update Incorrect write back of modified line writes main memory in different order from the order of the write operations or overwrites neighboring data Solution: Disallow more than one modified copy 36

37 Invalidation vs. update (I) Invalidation-based: On each write of a shared line, it has to invalidate copies in remote caches Simple implementation for bus-based systems: Each cache snoops Invalidate lines written by other CPUs Signal sharing for cache lines in local cache to other caches Update-based: Local write updates copies in remote caches Can update all CPUs at once Multiple writes cause multiple updates (more traffic) 37

38 Invalidation vs. update (II) Invalidation-based: Only write misses hit the bus (works with write-back caches) Subsequent writes to the same cache line are local Good for multiple writes to the same line (in the same cache) Update-based: All sharers continue to hit cache line after one core writes Implicit assumption: shared lines are accessed often Supports producer-consumer pattern well Many (local) writes may waste bandwidth! Hybrid forms are possible! 38

39 MESI Cache Coherence Most common hardware implementation of discussed requirements aka. Illinois protocol Each line has one of the following states (in a cache): Modified (M) Local copy has been modified, no copies in other caches Memory is stale Exclusive (E) No copies in other caches Memory is up to date Shared (S) Unmodified copies may exist in other caches Memory is up to date Invalid (I) Line is not in cache 39

40 Terminology Clean line: Content of cache line and main memory is identical (also: memory is up to date) Can be evicted without write-back Dirty line: Content of cache line and main memory differ (also: memory is stale) Needs to be written back eventually Time depends on protocol details Bus transaction: A signal on the bus that can be observed by all caches Usually blocking Local read/write: A load/store operation originating at a core connected to the cache 40

41 Transitions in response to local reads State is M No bus transaction State is E No bus transaction State is S No bus transaction State is I Generate bus read request (BusRd) May force other cache operations (see later) Other cache(s) signal sharing if they hold a copy If shared was signaled, go to state S Otherwise, go to state E After update: return read value 41

42 Transitions in response to local writes State is M No bus transaction State is E No bus transaction Go to state M State is S Line already local & clean There may be other copies Generate bus read request for upgrade to exclusive (BusRdX*) Go to state M State is I Generate bus read request for exclusive ownership (BusRdX) Go to state M 42

43 Transitions in response to snooped BusRd State is M Write cache line back to main memory Signal shared Go to state S (or E) State is E Signal shared Go to state S State is S Signal shared State is I Ignore 43

44 Transitions in response to snooped BusRdX State is M Write cache line back to memory Discard line and go to I State is E Discard line and go to I State is S Discard line and go to I State is I Ignore BusRdX* is handled like BusRdX! 44

45 MESI State Diagram (FSM) Source: Wikipedia 45

46 Small Exercise. Initially: all lines in I state. For each action, give the P1/P2/P3 state, the bus action, and where the data comes from: P1 reads x; P2 reads x; P1 writes x; P1 reads x; P3 writes x. 46

47 Small Exercise (solution). Initially: all lines in I state.
Action      | P1 state | P2 state | P3 state | Bus action | Data from
P1 reads x  | E        | I        | I        | BusRd      | Memory
P2 reads x  | S        | S        | I        | BusRd      | Cache
P1 writes x | M        | I        | I        | BusRdX*    | Cache
P1 reads x  | M        | I        | I        | -          | Cache
P3 writes x | I        | I        | M        | BusRdX     | Memory
47
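One way to double-check such traces is to code the transition rules from slides 41-44 directly. The sketch below is a hypothetical, drastically simplified single-line MESI simulator (three caches, one address, the bus modeled as a plain loop over the other caches, write-backs omitted); its main() replays the exercise trace and reproduces the table above:

#include <stdio.h>

#define NPROC 3

typedef enum { I, S, E, M } state_t;                  /* MESI states of line x */
static const char *sname[] = { "I", "S", "E", "M" };
static state_t line[NPROC];                           /* state of x in each cache, initially I */

/* Snooping: what every other cache does when it observes a bus transaction.
   Returns 1 if the snooping cache signals "shared" (it holds a copy).
   A line in M would also be written back to memory here; omitted in this sketch. */
static int snoop_busrd(int q) {
    if (line[q] != I) { line[q] = S; return 1; }
    return 0;                                         /* I: ignore */
}
static void snoop_busrdx(int q) { line[q] = I; }      /* invalidate on BusRdX / BusRdX* */

static void local_read(int p) {
    if (line[p] != I) { printf("P%d reads x : -       (hit, data from own cache)\n", p + 1); return; }
    int shared = 0;
    for (int q = 0; q < NPROC; q++)
        if (q != p) shared |= snoop_busrd(q);         /* broadcast BusRd */
    line[p] = shared ? S : E;                         /* E only if nobody signaled sharing */
    printf("P%d reads x : BusRd   (data from %s)\n", p + 1, shared ? "cache" : "memory");
}

static void local_write(int p) {
    const char *bus = "-";
    if (line[p] == S) bus = "BusRdX*";                /* upgrade: line present but shared */
    if (line[p] == I) bus = "BusRdX";                 /* read for exclusive ownership */
    if (line[p] == S || line[p] == I)
        for (int q = 0; q < NPROC; q++)
            if (q != p) snoop_busrdx(q);              /* invalidate all remote copies */
    line[p] = M;
    printf("P%d writes x: %s\n", p + 1, bus);
}

static void dump(void) {
    printf("            states:");
    for (int p = 0; p < NPROC; p++) printf(" P%d=%s", p + 1, sname[line[p]]);
    printf("\n");
}

int main(void) {
    /* Replay the exercise: P1 reads, P2 reads, P1 writes, P1 reads, P3 writes. */
    local_read(0);  dump();
    local_read(1);  dump();
    local_write(0); dump();
    local_read(0);  dump();
    local_write(2); dump();
    return 0;
}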

48 Optimizations? Class question: what could be optimized in the MESI protocol to make a system faster? 48

49 Related Protocols: MOESI (AMD) Extended MESI protocol Cache-to-cache transfer of modified cache lines Cache in M or O state always transfers cache line to requesting cache No need to contact (slow) main memory Avoids write back when another process accesses cache line Good when cache-to-cache performance is higher than cache-to-memory E.g., shared last level cache! 49

50 MOESI State Diagram Source: AMD64 Architecture Programmer's Manual 50

51 Related Protocols: MOESI (AMD) Modified (M): Modified Exclusive No copies in other caches, local copy dirty Memory is stale, cache supplies copy (reply to BusRd*) Owner (O): Modified Shared Exclusive right to make changes Other S copies may exist ("dirty sharing") Memory is stale, cache supplies copy (reply to BusRd*) Exclusive (E): Same as MESI (one local copy, up to date memory) Shared (S): Unmodified copy may exist in other caches Memory is up to date unless an O copy exists in another cache Invalid (I): Same as MESI 51

52 Related Protocols: MESIF (Intel) Modified (M): Modified Exclusive No copies in other caches, local copy dirty Memory is stale, cache supplies copy (reply to BusRd*) Exclusive (E): Same as MESI (one local copy, up to date memory) Shared (S): Unmodified copy may exist in other caches Memory is up to date Invalid (I): Same as MESI Forward (F): Special form of S state, other caches may have line in S Most recent requester of line is in F state Cache acts as responder for requests to this line 52

53 Multi-level caches Most systems have multi-level caches Problem: only last level cache is connected to bus or network Yet, snoop requests are relevant for inner-levels of cache (L1) Modifications of L1 data may not be visible at L2 (and thus the bus) L1/L2 modifications On BusRd check if line is in M state in L1 It may be in E or S in L2! On BusRdX(*) send invalidations to L1 Everything else can be handled in L2 If L1 is write through, L2 could remember state of L1 cache line May increase traffic though 53

54 Directory-based cache coherence Snooping does not scale Bus transactions must be globally visible Implies broadcast Typical solution: tree-based (hierarchical) snooping Root becomes a bottleneck Directory-based schemes are more scalable Directory (entry for each CL) keeps track of all owning caches Point-to-point update to involved processors No broadcast Can use specialized (high-bandwidth) network, e.g., HT, QPI 54

55 Basic Scheme. System with N processors P_i. For each memory block (size: cache line), maintain a directory entry with N presence bits (bit i set if the block is in the cache of P_i) and 1 dirty bit. First proposed by Censier and Feautrier (1978). 55
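In code, such a directory entry is just a presence bitmask plus a dirty flag; a minimal sketch (the type name and the 64-processor limit are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

/* One directory entry per memory block (cache-line sized). */
typedef struct {
    uint64_t presence;   /* bit i set: block is cached by processor P_i (N <= 64) */
    bool     dirty;      /* set: exactly one cache holds a modified copy          */
} dir_entry_t;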

56 Directory-based CC: Read miss. P_i intends to read, misses. If the dirty bit (in the directory) is off: read from main memory, set presence[i], supply data to the reader. If the dirty bit is on: recall the cache line from P_j (determined from presence[]), update memory, unset the dirty bit (block now shared), set presence[i], supply data to the reader. 56

57 Directory-based CC: Write miss. P_i intends to write, misses. If the dirty bit (in the directory) is off: send invalidations to all processors P_j with presence[j] turned on, unset the presence bits for all processors, set the dirty bit, set presence[i] (owner P_i). If the dirty bit is on: recall the cache line from the owner P_j, update memory, unset presence[j], set presence[i] (the dirty bit remains set), supply data to the writer. 57
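Slides 56 and 57 translate almost line by line into code. The following is a hedged sketch rather than the lecture's implementation: the directory entry from above is repeated for self-containment, and send_invalidation / recall_line / update_memory / supply_data are made-up stand-ins (here just printfs) for real interconnect transactions.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Directory entry, as in the sketch above. */
typedef struct { uint64_t presence; bool dirty; } dir_entry_t;

/* Stand-ins for interconnect messages (printouts only in this sketch). */
static void send_invalidation(int j, long b) { printf("  invalidate P%d, block %ld\n", j, b); }
static void recall_line(int j, long b)       { printf("  recall block %ld from P%d\n", b, j); }
static void update_memory(long b)            { printf("  write block %ld back to memory\n", b); }
static void supply_data(int i, long b)       { printf("  supply block %ld to P%d\n", b, i); }

/* Slide 56: P_i reads and misses. */
static void read_miss(dir_entry_t *d, int i, long block) {
    if (d->dirty) {                            /* a single owner P_j holds a modified copy */
        int j = __builtin_ctzll(d->presence);
        recall_line(j, block);
        update_memory(block);
        d->dirty = false;                      /* block is now shared */
    }
    d->presence |= 1ull << i;                  /* set presence[i] */
    supply_data(i, block);                     /* read from memory and supply to reader */
}

/* Slide 57: P_i writes and misses. */
static void write_miss(dir_entry_t *d, int i, long block) {
    if (!d->dirty) {
        for (int j = 0; j < 64; j++)           /* invalidate every current sharer */
            if (d->presence & (1ull << j)) send_invalidation(j, block);
        d->presence = 0;
        d->dirty = true;
    } else {
        int j = __builtin_ctzll(d->presence);  /* recall from the current owner */
        recall_line(j, block);
        update_memory(block);
        d->presence &= ~(1ull << j);           /* dirty bit remains set */
    }
    d->presence |= 1ull << i;                  /* P_i becomes the sole owner */
    supply_data(i, block);
}

int main(void) {
    dir_entry_t d = { 0, false };
    printf("P0 read miss:\n");  read_miss(&d, 0, 42);
    printf("P1 read miss:\n");  read_miss(&d, 1, 42);
    printf("P2 write miss:\n"); write_miss(&d, 2, 42);
    printf("P0 read miss:\n");  read_miss(&d, 0, 42);
    return 0;
}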

58 Discussion Scaling of memory bandwidth No centralized memory Directory-based approaches scale with restrictions Require presence bit for each cache Number of bits determined at design time Directory requires memory (size scales linearly) Shared vs. distributed directory Software-emulation Distributed shared memory (DSM) Emulate cache coherence in software (e.g., TreadMarks) Often on a per-page basis, utilizes memory virtualization and paging 59

59 Open Problems (for projects or theses) Tune algorithms to cache-coherence schemes What is the optimal parallel algorithm for a given scheme? Parameterize for an architecture Measure and classify hardware Read Maranget et al., "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models", and have fun! RDMA consistency is barely understood! GPU memories are not well understood! Huge potential for new insights! Can we program (easily) without cache coherence? How to fix the problems with inconsistent values? Compiler support (issues with arrays)? 60

60 Case Study: Intel Xeon Phi 61

61 Communication? Ramos, Hoefler: "Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi", HPDC'13 62

62 Invalid read: R_I = 278 ns. Local read: R_L = 8.6 ns. Remote read: R_R = 235 ns. Inspired by Molka et al.: "Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system" 63

63 Single-Line Ping Pong Prediction for both in E state: 479 ns Measurement: 497 ns (O=18) 64

64 Multi-Line Ping Pong. More complex due to prefetch. [Model equation figure; terms: number of CLs, startup overhead, amortization of startup, asymptotic fetch latency for each cache line (optimal prefetch!).] 65

65 Multi-Line Ping Pong E state: o=76 ns q=1,521ns p=1,096ns I state: o=95ns q=2,750ns p=2,017ns 66

66 DTD Contention E state: a=0ns b=320ns c=56.2ns 67


More information

Overview: Shared Memory Hardware

Overview: Shared Memory Hardware Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing

More information

10/19/17. You Are Here! Review: Direct-Mapped Cache. Typical Memory Hierarchy

10/19/17. You Are Here! Review: Direct-Mapped Cache. Typical Memory Hierarchy CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 Instructors: Krste Asanović & Randy H Katz http://insteecsberkeleyedu/~cs6c/ Parallel Requests Assigned to computer eg, Search

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro Lecture 7: M Wrap-Up, ache coherence Topics: handling M errors and writes, cache coherence intro 1 Optimizations for Writes (Energy, Lifetime) Read a line before writing and only write the modified bits

More information

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University 18-742 Parallel Computer Architecture Lecture 5: Cache Coherence Chris Craik (TA) Carnegie Mellon University Readings: Coherence Required for Review Papamarcos and Patel, A low-overhead coherence solution

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 Instructors: Krste Asanović & Randy H. Katz http://inst.eecs.berkeley.edu/~cs61c/ 10/19/17 Fall 2017 - Lecture #16 1 Parallel

More information

History. PowerPC based micro-architectures. PowerPC ISA. Introduction

History. PowerPC based micro-architectures. PowerPC ISA. Introduction PowerPC based micro-architectures Godfrey van der Linden Presentation for COMP9244 Software view of Processor Architectures 2006-05-25 History 1985 IBM started on AMERICA 1986 Development of RS/6000 1990

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable

More information

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes

More information

The Cache Write Problem

The Cache Write Problem Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Coherence Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. L5- Coherence Avoids Stale Data Multicores have multiple private caches for performance Need to provide the illusion

More information

Computer Architecture and Engineering CS152 Quiz #5 May 2th, 2013 Professor Krste Asanović Name: <ANSWER KEY>

Computer Architecture and Engineering CS152 Quiz #5 May 2th, 2013 Professor Krste Asanović Name: <ANSWER KEY> Computer Architecture and Engineering CS152 Quiz #5 May 2th, 2013 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 15 pages Notes: Not all questions are

More information

A More Sophisticated Snooping-Based Multi-Processor

A More Sophisticated Snooping-Based Multi-Processor Lecture 16: A More Sophisticated Snooping-Based Multi-Processor Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2014 Tunes The Projects Handsome Boy Modeling School (So... How

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Token Coherence. Milo M. K. Martin Dissertation Defense

Token Coherence. Milo M. K. Martin Dissertation Defense Token Coherence Milo M. K. Martin Dissertation Defense Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin Madison (C) 2003 Milo Martin Overview Technology and software

More information

Lecture 1: Introduction

Lecture 1: Introduction Lecture 1: Introduction ourse organization: 4 lectures on cache coherence and consistency 2 lectures on transactional memory 2 lectures on interconnection networks 4 lectures on caches 4 lectures on memory

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy The course has 4 modules Module 1 Instruction Set Architecture (ISA) Simple Pipelining and Hazards Module 2 Superscalar Architectures

More information

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Lecture 7: Implementing Cache Coherence. Topics: implementation details Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory

More information

CS 152 Laboratory Exercise 5

CS 152 Laboratory Exercise 5 CS 152 Laboratory Exercise 5 Professor: Krste Asanovic TA: Christopher Celio Department of Electrical Engineering & Computer Science University of California, Berkeley April 11, 2012 1 Introduction and

More information