Design of Parallel and High-Performance Computing
Fall 2017 Lecture: Cache Coherence & Memory Models
Instructor: Torsten Hoefler & Markus Püschel
TAs: Salvatore Di Girolamo
Motivational video: https://www.youtube.com/watch?v=zjybff6pqeq

Sorry for missing last week's lecture! (Reason: to talk about some silliness in MPI-3.)

Announcement: "Stadt- und Klimawandel: Wie stellen wir uns den Herausforderungen?" (Urban and climate change: how do we meet the challenges?), Wednesday, 8 November 2016, ETH Zürich, main building (information and registration online).

A more complete view

Computer Architecture vs. Physics

Architecture developments:
- Distributed-memory machines communicating through messages
- Large cache-coherent multicore machines communicating through coherent memory access and messages
- Large cache-coherent multicore machines communicating through coherent memory access and remote direct memory access
- >2020: coherent and non-coherent manycore accelerators and multicores communicating through memory access and remote direct memory access
- Eventually: largely non-coherent accelerators and multicores communicating through remote direct memory access

Physics (technological constraints): cost of data movement, capacity of DRAM cells, clock frequencies (constrained by the end of Dennard scaling), speed of light, melting point of silicon.
Computer architecture (design of the machine): power management, ISA / multithreading, SIMD widths.

"Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints." -- Fred Brooks (IBM, 1962)

We have converted many former power problems into cost problems. (Sources: various vendors)

Low-Power Design Principles (2005)
- Cubic power improvement with lower clock rate due to V^2 * f (see the short derivation below)
- Slower clock rates enable the use of simpler cores
- Simpler cores use less area (lower leakage) and reduce cost
- Tailor the design to the application to REDUCE WASTE
- Power5 (server): 120 W @ 1900 MHz -- baseline
- Intel Core2 sc (laptop): 15 W @ 1000 MHz -- 4x more FLOPs/Watt than baseline
- Intel Atom (handhelds): 0.625 W @ 800 MHz -- 80x more
- GPU core or Tensilica XTensa (embedded): 0.09 W @ 600 MHz -- 400x more (80x-120x sustained)

Heterogeneous Future (LOCs and TOCs)
- Big cores (very few): Latency-Optimized Cores (LOCs) -- most energy-efficient if you don't have lots of parallelism
- Tiny cores (lots of them): Throughput-Optimized Cores (TOCs) -- most energy-efficient if you DO have a lot of parallelism
- Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds of them on a single chip and still be 100x more power-efficient.
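The "cubic power improvement" claim follows from the usual dynamic-power model for CMOS logic; a short sanity check (an illustrative derivation, not taken from the slides), assuming the supply voltage can be scaled roughly in proportion to the clock frequency:

```latex
% Dynamic power of switching CMOS logic, with V scaled proportionally to f:
P_{\mathrm{dyn}} = C\,V^2 f, \qquad V \propto f
\;\;\Longrightarrow\;\; P_{\mathrm{dyn}} \propto f^3 .
```

Halving the clock (and scaling the voltage with it) therefore cuts dynamic power by roughly 8x while throughput only halves, which is why many slower, simpler cores win on FLOPs per Watt.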

Yunsup Lee, Andrew Waterman, Rimas Avizienis, Henry Cook, Chen Sun, Vladimir Stojanović, Krste Asanović:
"A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators"
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA

Abstract: A 64-bit dual-core RISC-V processor with vector accelerators has been fabricated in a 45 nm SOI process. This is the first dual-core processor to implement the open-source RISC-V ISA designed at the University of California, Berkeley. In a standard 40 nm process, the RISC-V scalar core scores 10% higher in DMIPS/MHz than the Cortex-A5, ARM's comparable single-issue in-order scalar core, and is 49% more area-efficient. To demonstrate the extensibility of the RISC-V ISA, a custom vector accelerator is integrated alongside each single-issue in-order scalar core. The vector accelerator is 1.8x more energy-efficient than the IBM Blue Gene/Q processor and 2.6x more than the IBM Cell processor, both fabricated in the same process. The dual-core RISC-V processor achieves a maximum clock frequency of 1.3 GHz at 1.2 V and a peak energy efficiency of 16.7 double-precision GFLOPS/W at 0.65 V with an area of 3 mm².

Introduction: As we approach the end of conventional transistor scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Many proposed accelerators, such as those based on GPU architectures, require a drastic reworking of application software to make use of separate ISAs operating in memory spaces disjoint from the demand-paged virtual memory of the host CPU. RISC-V is a new, completely open, general-purpose instruction set architecture (ISA) developed at the University of California, Berkeley, designed to be flexible and extensible to better integrate new efficient accelerators close to the host cores. The open-source RISC-V software toolchain includes a GCC cross-compiler, an LLVM cross-compiler, a software ISA simulator, an ISA verification suite, a Linux port, and additional documentation, and is available at www.riscv.org. The RISC-V scalar core achieves 1.72 DMIPS/MHz, outperforming ARM's Cortex-A5 score of 1.57 DMIPS/MHz by 10% in a smaller footprint, demonstrating that high efficiency can be obtained without sacrificing a unified demand-paged virtual memory environment.

Chip architecture: Each core incorporates a 64-bit single-issue in-order scalar core (Rocket), a 64-bit vector accelerator, and their associated instruction and data caches. Rocket is a 6-stage single-issue in-order pipeline that executes the 64-bit scalar RISC-V ISA. The scalar datapath is fully bypassed but carefully designed to minimize the impact of long clock-to-output delays of compiler-generated SRAMs in the caches. A 64-entry branch target buffer, a 256-entry two-level branch predictor, and a return address stack together mitigate the performance impact of control hazards. Rocket implements an MMU that supports page-based virtual memory and is able to boot modern operating systems, including Linux. Both caches are virtually indexed and physically tagged, with parallel TLB lookups. The data cache is non-blocking, allowing the core to exploit memory-level parallelism. Rocket has an optional IEEE-compliant FPU, which can execute single- and double-precision floating-point operations, including fused multiply-add (FMA), with hardware support for denormals and other exceptional values.
(Fig. 1: backside chip micrograph, taken with the silicon handle removed, and processor block diagram -- 1 MB SRAM arrays, coherence hub, FPGA FSB/HTIF interface, two Rocket cores with vector accelerators. Fig. 2: Rocket scalar plus vector pipeline diagram; bypass paths omitted for simplicity.)

Die photos (3 classes of cores) -- strips down to the core, actual size, basic stats, energy/area estimates:
- Big core: power 2.5 W, clock 2.4 GHz, energy ~651 pJ per operation
- RISC-V core (Rocket): area 0.6 mm², power 0.3 W (<0.2 W), clock 1.3 GHz, energy ~150 (75) pJ per operation
- Tiny core: power 0.025 W, clock 1.0 GHz, energy ~22 pJ per operation

Data movement -- the wires
- Energy efficiency of a transistor: Power = V^2 * frequency * capacitance; capacitance scales with the area of the transistor. Transistor efficiency improves as you shrink it.
- Energy efficiency of a copper wire: Power = frequency * length / cross-section area. Wire efficiency does not improve as feature size shrinks, and wires limit the bandwidth-distance product.
- Net result: moving data on wires is starting to cost more energy than computing on said data (hence the interest in silicon photonics).

Pin limits
- Moore's law does not apply to adding pins to a package: ~30%+ per year nominal Moore's law vs. pin counts growing at ~1.5-3% per year at best.
- 4000 pins is an aggressive package; half of those would be needed for power and ground. Of the remaining 2k pins, run as differential pairs; beyond ~15 Gbps per pin, power/complexity costs hurt. 10 Gbps * 1k pins is ~1.2 TBytes/s.
- Photonics could break through; 2.5D integration gets a boost in pin density, but it is a one-time boost (how much headroom?) -- 4 TB/s? (maybe 8 TB/s with single-wire signaling?)

Wire energy (assumptions for 22 nm): 100 fJ/bit per mm, 64-bit operand -> roughly 6 pJ per mm, ~120 pJ for 20 mm.
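The wire-energy numbers follow directly from the stated assumptions; a quick check of the arithmetic (values rounded as on the slide):

```latex
E_{\text{wire}}(d) = 64\,\text{bit}\times 100\,\tfrac{\text{fJ}}{\text{bit}\cdot\text{mm}}\times d
                   = 6.4\,\tfrac{\text{pJ}}{\text{mm}}\times d,
\qquad
E_{\text{wire}}(1\,\text{mm})\approx 6\,\text{pJ},\quad
E_{\text{wire}}(20\,\text{mm})\approx 128\,\text{pJ}\;(\approx 120\,\text{pJ on the slide}).
```

Dividing a core's energy per operation by ~6 pJ/mm also reproduces the "compute op == x mm of data movement" distances used on the next slides.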

When does data movement dominate?
Energy per operation vs. energy to move a 64-bit operand (~6 pJ/mm at 22 nm):
- Big core (651 pJ/op): one compute op == ~108 mm of data movement; energy ratio for a 20 mm wire: 0.2x
- RISC-V core, Rocket (75 pJ/op): one compute op == ~12 mm of data movement; ratio for 20 mm: 1.6x
- Tiny core (22 pJ/op): one compute op == ~3.6 mm of data movement; ratio for 20 mm: 5.5x
(Source: Jack Dongarra)

Goals of this lecture (DPHPC overview): cache coherence; memory consistency (memory models).

Memory Trends
- Measure processor speed as throughput: FLOPS/s, IOPS/s, ...; Moore's law gave ~60% growth per year.
- Trend: memory performance grows only ~10% per year -- the memory-CPU gap widens. (Source: John McCalpin)
- Today's architectures: POWER8: 338 double-precision GFLOP/s, 230 GB/s memory bandwidth; Intel i7-5775C (Broadwell): 883 GFLOP/s, ~50 GB/s memory bandwidth.

Issues (AMD Interlagos as example)
How to measure bandwidth?
- Data sheet (often peak performance, may include overheads), e.g., frequency times bus width: 51 GiB/s
- Microbenchmark performance: stride-1 access (32 MiB): 32 GiB/s; random access (8 B out of 32 MiB): 241 MiB/s. Why?
- Application performance, as observed (performance counters): somewhere in between stride-1 and random access
How to measure latency?
- Data sheet (often optimistic, or not provided)
- Random pointer chase: <100 ns? Measured: 110 ns with one core, 258 ns with 32 cores! (A sketch of such a benchmark appears below.)

Conjecture: buffering is a must!
- Write buffers: delayed write-back saves memory bandwidth; data is often overwritten or re-read.
- Caching: a directory of recently used locations, stored as blocks (cache lines).
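A minimal sketch of the kind of microbenchmark behind the latency numbers above: a random pointer chase over a 32 MiB buffer, where every load depends on the previous one so the prefetcher cannot help. Buffer size, hop count, and the timing method are illustrative choices here, not the ones used for the Interlagos measurements.

```c
/* Pointer-chase latency microbenchmark (sketch).
 * Builds a random single-cycle permutation over a 32 MiB buffer; each load
 * depends on the previous one, so time per hop approximates memory latency. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (32u * 1024 * 1024 / sizeof(size_t))  /* 32 MiB of pointers */
#define HOPS (16u * 1024 * 1024)

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: one random cycle visiting every element.
     * rand() is good enough for a sketch. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;             /* j in [0, i) */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t h = 0; h < HOPS; h++) p = next[p]; /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg latency: %.1f ns (ignore: %zu)\n", ns / HOPS, p);
    free(next);
    return 0;
}
```

Printing the final pointer keeps the compiler from optimizing the chase away; replacing the random permutation with a stride-1 walk turns the same loop into the bandwidth-friendly case from the slide.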

Cache Coherence

Different caches may have a copy of the same memory location! Cache coherence manages the existence of multiple copies.

Cache architectures: multi-level caches; shared vs. private (partitioned); inclusive vs. exclusive; write-back vs. write-through; victim caches to reduce conflict misses. (Variants: exclusive hierarchical caches, shared hierarchical caches, shared hierarchical caches with multithreading.)

Caching strategies (repeat): remember write-back vs. write-through?

Cache coherence requirements -- a memory system is coherent if it guarantees the following:
- Write propagation: updates are eventually visible to all readers
- Write serialization: writes to the same location must be observed in order
Everything else: memory model issues (later).

Write-through cache example (requires write propagation!):
1. CPU 0 reads X from memory -- loads X=0 into its cache
2. CPU 1 reads X from memory -- loads X=0 into its cache
3. CPU 0 writes X=1 -- stores X=1 in its cache and in memory
4. CPU 1 reads X from its cache -- loads X=0 from its cache
Incoherent value for X on CPU 1; CPU 1 may wait for the update!
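The example above is a hardware-level scenario. As a software-level illustration of what write propagation buys the programmer, here is a minimal (hypothetical, not from the slides) C11 flag/data handshake: coherence guarantees that the store to the flag eventually becomes visible to the spinning reader, so the loop terminates.

```c
/* Write propagation from a programmer's view (illustrative sketch).
 * Thread 0 publishes data and sets a flag; thread 1 spins until it sees the
 * flag. Cache coherence propagates the store; the C11 atomics make the
 * example well-defined. Build with: cc -pthread ... */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;                 /* payload, published before the flag */
static atomic_int flag;

static void *producer(void *arg) {
    (void)arg;
    data = 42;                                              /* publish payload   */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* then set the flag */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                        /* spin: write propagation makes this terminate */
    printf("saw data = %d\n", data);                        /* prints 42         */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, producer, NULL);
    pthread_create(&t1, NULL, consumer, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```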

Write-back cache example (requires write serialization!):
1. CPU 0 reads X from memory -- loads X=0 into its cache
2. CPU 1 reads X from memory -- loads X=0 into its cache
3. CPU 0 writes X=1 -- stores X=1 in its cache
4. CPU 1 writes X=2 -- stores X=2 in its cache
5. CPU 1 writes back its cache line -- stores X=2 in memory
6. CPU 0 writes back its cache line -- stores X=1 in memory
The later store X=2 from CPU 1 is lost!

A simple (?) example
- Assume C99: struct twoint { int a; int b; }
- Two threads, initially a = b = 0: thread 0 writes 1 to a; thread 1 writes 1 to b
- Assume a non-coherent write-back cache: what may end up in main memory? (The example is spelled out as a program below.)

Cache Coherence Protocol
- The programmer can hardly deal with unpredictable behavior!
- The cache controller maintains data integrity: all writes to different locations are visible.

Fundamental coherence mechanisms:
- Snooping: shared bus or (broadcast) network; the cache controller snoops all transactions and monitors and changes the state of the cache's data. Works at small scale, challenging at large scale (e.g., Intel Broadwell).
- Directory-based: record the information necessary to maintain coherence, e.g., the owner and state of a line, in a central or distributed directory of cache-line ownership. Scalable but more complex/expensive (e.g., Intel Xeon Phi KNC/KNL). (Source: Intel)

Cache coherence parameters
- Concerns/goals: performance, implementation cost (chip space; more important: dynamic energy), correctness, memory-model side effects
- Issues: detection (when does a controller need to act?), enforcement (how does a controller guarantee coherence?), precision of block sharing (per block, per sub-block?), block size (cache-line size?)

An engineering approach: empirical start
- Problem 1: stale reads -- cache 1 holds a value that was already modified in cache 2. Solution: disallow this state; invalidate all remote copies before allowing a write to complete.
- Problem 2: lost update -- an incorrect write-back of a modified line writes main memory in a different order than the order of the write operations, or overwrites neighboring data. Solution: disallow more than one modified copy.
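The slide's struct twoint example, spelled out as a complete program (thread setup added; the sketch assumes POSIX threads). On a cache-coherent machine both updates survive; on the hypothetical non-coherent write-back cache of the slide, either core could write back its stale copy of the whole line and overwrite the other core's update, so memory could end up with a=1,b=0 or a=0,b=1.

```c
/* The struct twoint example from the slide (sketch; thread setup added).
 * a and b live in the same cache line. Coherence guarantees a == 1, b == 1
 * at the end; without it, one core's write-back could lose the other's update. */
#include <pthread.h>
#include <stdio.h>

struct twoint { int a; int b; };
static struct twoint x = { 0, 0 };      /* initially a = b = 0 */

static void *thread0(void *arg) { (void)arg; x.a = 1; return NULL; }
static void *thread1(void *arg) { (void)arg; x.b = 1; return NULL; }

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("a = %d, b = %d\n", x.a, x.b);   /* 1, 1 on a coherent machine */
    return 0;
}
```

Writes to distinct struct members are distinct memory locations in C11, so the program itself is race-free; the interesting behavior lives entirely in the (hypothetical) non-coherent cache hardware.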

Invalidation vs. update (I)
- Invalidation-based: on each write to a shared line, copies in remote caches have to be invalidated. Simple implementation for bus-based systems: each cache snoops, invalidates lines written by other CPUs, and signals sharing to other caches for lines it holds locally.
- Update-based: a local write updates the copies in remote caches. Can update all CPUs at once, but multiple writes cause multiple updates (more traffic).

Invalidation vs. update (II)
- Invalidation-based: only write misses hit the bus (works with write-back caches); subsequent writes to the same cache line are local. Good for multiple writes to the same line (in the same cache).
- Update-based: all sharers continue to hit the cache line after one core writes. Implicit assumption: shared lines are accessed often. Supports the producer-consumer pattern well, but many (local) writes may waste bandwidth!
- Hybrid forms are possible.

MESI Cache Coherence
The most common hardware implementation of the discussed requirements (aka the Illinois protocol). Each line has one of the following states in a cache:
- Modified (M): the local copy has been modified, no copies in other caches; memory is stale
- Exclusive (E): no copies in other caches; memory is up to date
- Shared (S): unmodified copies may exist in other caches; memory is up to date
- Invalid (I): the line is not in the cache

Terminology
- Clean line: the content of the cache line and main memory is identical (memory is up to date); can be evicted without write-back.
- Dirty line: the content of the cache line and main memory differ (memory is stale); needs to be written back eventually (when depends on protocol details).
- Bus transaction: a signal on the bus that can be observed by all caches; usually blocking.
- Local read/write: a load/store operation originating at a core connected to the cache.

Transitions in response to local reads
- State is M, E, or S: no bus transaction
- State is I: generate a bus read request (BusRd); this may force other cache operations (see later); other cache(s) signal sharing if they hold a copy; if sharing was signaled, go to state S, otherwise go to state E; after the update, return the read value

Transitions in response to local writes
- State is M: no bus transaction
- State is E: no bus transaction; go to state M
- State is S: the line is already local and clean, but there may be other copies; generate a bus read request for upgrade to exclusive (BusRdX*); go to state M
- State is I: generate a bus read request for exclusive ownership (BusRdX); go to state M
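The two transition tables can be restated compactly as code. This is an illustrative sketch only: the state and bus-request names follow the slides, while the function names and the way the "shared" signal is passed in are simplifications (in a real controller the signal is only observed after BusRd has been issued).

```c
/* MESI transitions for local (processor-side) reads and writes,
 * restating the tables above (illustrative sketch, not a protocol engine). */
typedef enum { M, E, S, I } mesi_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_RDX_STAR } bus_req_t;

/* Local read: only a miss (state I) generates bus traffic.
 * shared_signaled: did another cache announce it holds a copy? */
static bus_req_t local_read(mesi_t *state, int shared_signaled) {
    if (*state != I)
        return BUS_NONE;                        /* M, E, S: hit, no transaction  */
    *state = shared_signaled ? S : E;           /* I -> S or E after BusRd       */
    return BUS_RD;
}

/* Local write: always ends in M; S upgrades with BusRdX*, I fetches with BusRdX. */
static bus_req_t local_write(mesi_t *state) {
    switch (*state) {
    case M:             return BUS_NONE;        /* already exclusive and dirty   */
    case E: *state = M; return BUS_NONE;        /* silent upgrade                */
    case S: *state = M; return BUS_RDX_STAR;    /* invalidate other copies       */
    case I: *state = M; return BUS_RDX;         /* fetch for exclusive ownership */
    }
    return BUS_NONE;
}
```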

Transitions in response to a snooped BusRd
- State is M: write the cache line back to main memory, signal shared, go to state S (or E)
- State is E: signal shared, go to state S
- State is S: signal shared
- State is I: ignore

Transitions in response to a snooped BusRdX
- State is M: write the cache line back to memory, discard the line, and go to I
- State is E: discard the line and go to I
- State is S: discard the line and go to I
- State is I: ignore
BusRdX* is handled like BusRdX! (Both snoop cases are restated as code after the exercise below.)

MESI state diagram (FSM). (Source: Wikipedia)

Small exercise: initially, x is in state I in all caches. Fill in the resulting states, the bus action, and where the data comes from for the following access sequence (solution shown):

Action      | P1 state | P2 state | P3 state | Bus action | Data from
P1 reads x  | E        | I        | I        | BusRd      | Memory
P2 reads x  | S        | S        | I        | BusRd      | Cache
P1 writes x | M        | I        | I        | BusRdX*    | Cache
P1 reads x  | M        | I        | I        | -          | Cache
P3 writes x | I        | I        | M        | BusRdX     | Memory

Optimizations? Class question: what could be optimized in the MESI protocol to make a system faster?
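The snoop-side tables, restated the same way (again an illustrative sketch; write-back and the shared signal are reduced to flags):

```c
/* Snoop-side MESI transitions (illustrative sketch). */
typedef enum { M, E, S, I } mesi_t;

typedef struct {
    int write_back;     /* must write the dirty line back to memory */
    int signal_shared;  /* assert the "shared" line on the bus      */
} snoop_action_t;

/* Another cache issued BusRd for a line we may hold. */
static snoop_action_t snoop_bus_rd(mesi_t *state) {
    snoop_action_t a = { 0, 0 };
    switch (*state) {
    case M: a.write_back = 1; a.signal_shared = 1; *state = S; break;
    case E:                   a.signal_shared = 1; *state = S; break;
    case S:                   a.signal_shared = 1;             break;
    case I: /* not ours: ignore */                             break;
    }
    return a;
}

/* Another cache issued BusRdX (or BusRdX*): our copy must go away. */
static snoop_action_t snoop_bus_rdx(mesi_t *state) {
    snoop_action_t a = { 0, 0 };
    if (*state == M) a.write_back = 1;   /* only a dirty line needs write-back */
    *state = I;                          /* M, E, S all discard the line       */
    return a;
}
```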

Related protocols: MOESI (AMD)
An extended MESI protocol with cache-to-cache transfer of modified cache lines:
- A cache in the M or O state always transfers the cache line to the requesting cache; no need to contact (slow) main memory (sketched in code below).
- Avoids the write-back when another processor accesses the cache line.
- Good when cache-to-cache performance is higher than cache-to-memory, e.g., with a shared last-level cache.
MOESI state diagram. (Source: AMD64 Architecture Programmer's Manual)

MOESI states:
- Modified (M): "modified exclusive" -- no copies in other caches, local copy dirty; memory is stale, the cache supplies the copy (reply to BusRd*)
- Owner (O): "modified shared" -- exclusive right to make changes; other S copies may exist ("dirty sharing"); memory is stale, the cache supplies the copy (reply to BusRd*)
- Exclusive (E): same as MESI (one local copy, memory up to date)
- Shared (S): unmodified copies may exist in other caches; memory is up to date unless an O copy exists in another cache
- Invalid (I): same as MESI

Related protocols: MESIF (Intel)
- Modified (M), Exclusive (E), Invalid (I): same as MESI
- Shared (S): unmodified copies may exist in other caches; memory is up to date
- Forward (F): a special form of the S state; other caches may hold the line in S, but the most recent requester of the line is in F and acts as the responder for requests to this line

Multi-level caches
- Most systems have multi-level caches, but only the last-level cache is connected to the bus or network.
- Yet snoop requests are relevant for inner levels of the cache (L1), and modifications of L1 data may not be visible at L2 (and thus on the bus).
- L1/L2 handling: on BusRd, check whether the line is in the M state in L1 (it may be E or S in L2!); on BusRdX(*), send invalidations to L1; everything else can be handled in L2. If L1 is write-through, L2 could remember the state of the L1 cache lines, though this may increase traffic.

Directory-based cache coherence
- Snooping does not scale: bus transactions must be globally visible, which implies broadcast. The typical workaround, tree-based (hierarchical) snooping, makes the root a bottleneck.
- Directory-based schemes are more scalable: a directory (with an entry for each cache line) keeps track of all owning caches, and updates go point-to-point to the involved processors -- no broadcast. They can use a specialized (high-bandwidth) network, e.g., HT or QPI.
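The key MOESI point, that a cache in M or O supplies the line directly instead of main memory, fits in a few lines (illustrative sketch; the type and function names are made up for this example):

```c
/* Who supplies the data on a read miss under MOESI? A cache holding the line
 * in M or O replies to the BusRd directly (cache-to-cache transfer); if no
 * such owner exists, memory is up to date and supplies the line. Sketch only. */
#include <stddef.h>

typedef enum { MOESI_M, MOESI_O, MOESI_E, MOESI_S, MOESI_I } moesi_t;

enum { FROM_MEMORY = -1 };

/* Returns the index of the cache that supplies the line, or FROM_MEMORY. */
static int pick_supplier(const moesi_t state[], size_t ncaches) {
    for (size_t c = 0; c < ncaches; c++)
        if (state[c] == MOESI_M || state[c] == MOESI_O)
            return (int)c;          /* owner replies, no memory access needed */
    return FROM_MEMORY;             /* no M/O copy: memory is up to date      */
}
```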

Basic scheme (directory-based coherence)
- System with N processors P_i.
- For each memory block (size: one cache line), maintain a directory entry with N presence bits (bit i set if the block is in the cache of P_i) and 1 dirty bit.
- First proposed by Censier and Feautrier (1978).

Directory-based CC: read miss (P_i intends to read, misses; sketched in code below)
- If the dirty bit (in the directory) is off: read from main memory, set presence[i], supply the data to the reader.
- If the dirty bit is on: recall the cache line from P_j (determined via presence[]), update memory, unset the dirty bit (the block becomes shared), set presence[i], supply the data to the reader.

Directory-based CC: write miss (P_i intends to write, misses)
- If the dirty bit is off: send invalidations to all processors P_j with presence[j] set, unset the presence bits for all processors, set the dirty bit, set presence[i]; P_i is the owner.
- If the dirty bit is on: recall the cache line from the owner P_j, update memory, unset presence[j], set presence[i] (the dirty bit remains set), supply the data to the writer.

Discussion
- Scaling of memory bandwidth: no centralized memory.
- Directory-based approaches scale with restrictions: they require a presence bit for each cache (the number of bits is determined at design time), and the directory requires memory (its size scales linearly).
- Shared vs. distributed directory.
- Software emulation: distributed shared memory (DSM) emulates cache coherence in software (e.g., TreadMarks), often on a per-page basis, utilizing memory virtualization and paging.

Open problems (for projects or theses)
- Tune algorithms to cache-coherence schemes: what is the optimal parallel algorithm for a given scheme? Parameterize for an architecture.
- Measure and classify hardware.
- Read Maranget et al., "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models" -- and have fun!
- RDMA consistency is barely understood! GPU memories are not well understood! Huge potential for new insights!
- Can we program (easily) without cache coherence? How do we fix the problems with inconsistent values? Compiler support (issues with arrays)?
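The read-miss and write-miss handling above maps almost one-to-one onto code. A sketch under obvious simplifications: recall_line(), invalidate(), and supply_data() are placeholder stubs standing in for real coherence messages, and NPROC is fixed at design time, as the discussion notes.

```c
/* Directory entry and miss handling in the style of Censier/Feautrier,
 * following the read/write-miss descriptions above (illustrative sketch). */
#include <stdbool.h>

#define NPROC 64

typedef struct {
    bool present[NPROC];    /* presence bit per processor's cache */
    bool dirty;             /* one dirty bit per block            */
} dir_entry_t;

/* Placeholder stubs -- stand-ins for the actual coherence messages. */
static void recall_line(int owner, dir_entry_t *e) { (void)owner; (void)e; }
static void invalidate(int sharer, dir_entry_t *e) { (void)sharer; (void)e; }
static void supply_data(int requester)             { (void)requester; }

void read_miss(dir_entry_t *e, int i) {            /* P_i intends to read, misses  */
    if (e->dirty) {                                /* exactly one owner P_j exists */
        for (int j = 0; j < NPROC; j++)
            if (e->present[j]) recall_line(j, e);  /* write back, update memory    */
        e->dirty = false;                          /* block becomes shared         */
    }                                              /* else: read from main memory  */
    e->present[i] = true;
    supply_data(i);
}

void write_miss(dir_entry_t *e, int i) {           /* P_i intends to write, misses */
    if (!e->dirty) {
        for (int j = 0; j < NPROC; j++)            /* kill all shared copies       */
            if (e->present[j]) { invalidate(j, e); e->present[j] = false; }
        e->dirty = true;
    } else {
        for (int j = 0; j < NPROC; j++)            /* recall from the single owner */
            if (e->present[j]) { recall_line(j, e); e->present[j] = false; }
        /* dirty bit remains set */
    }
    e->present[i] = true;                          /* P_i becomes the owner        */
    supply_data(i);
}
```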

Case study: Intel Xeon Phi

Communication in a cache-coherent system?
Ramos, Hoefler: "Modeling Communication in Cache-Coherent SMP Systems -- A Case-Study with Xeon Phi", HPDC. Inspired by Molka et al.: "Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system".
- Measured single-line read costs: invalid read R_I = 278 ns, local read R_L = 8.6 ns, remote read R_R = 235 ns.

Single-line ping-pong
- Prediction for both lines in the E state: 479 ns; measurement: 497 ns (O = 18). (A simplified version is sketched below.)

Multi-line ping-pong
- More complex due to prefetching; the model captures the number of cache lines and the amortization of the startup cost: a startup overhead plus an asymptotic fetch latency per cache line (with optimal prefetching).
- E state: o = 76 ns, q = 1,521 ns, p = 1,096 ns; I state: o = 95 ns, q = 2,750 ns, p = 2,017 ns.

DTD (distributed tag directory) contention
- E state: a = 0 ns, b = 320 ns, c = 56.2 ns.
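A minimal sketch of a single-line ping-pong experiment in the spirit of the one above: two threads alternately write a flag in one cache line, so the line bounces between their caches on every round trip. Thread pinning, warm-up, choice of initial line state (E vs. I), and statistics are omitted, so results will differ from the Xeon Phi numbers.

```c
/* Single cache-line ping-pong (sketch). The line holding `turn` bounces
 * between the two threads' caches; time per round trip approximates the
 * coherence latency. Build with: cc -pthread ... */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static _Alignas(64) atomic_int turn = 0;    /* the bounced cache line */

static void *pong(void *arg) {
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&turn, memory_order_acquire) != 1)
            ;                                                  /* wait for ping */
        atomic_store_explicit(&turn, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, pong, NULL);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&turn, 1, memory_order_release); /* ping          */
        while (atomic_load_explicit(&turn, memory_order_acquire) != 0)
            ;                                                  /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: %.0f ns\n", ns / ROUNDS);
    return 0;
}
```

Pinning the two threads to cores on different sockets (or different Xeon Phi tiles) changes the measured round trip substantially, which is exactly the effect the model above parameterizes.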


MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Lecture 7: PCM, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

Lecture 7: PCM, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro Lecture 7: M, ache coherence Topics: handling M errors and writes, cache coherence intro 1 hase hange Memory Emerging NVM technology that can replace Flash and DRAM Much higher density; much better scalability;

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Coherence Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. L25-1 Coherence Avoids Stale Data Multicores have multiple private caches for performance Need to provide the illusion

More information

Processor Architecture

Processor Architecture Processor Architecture Shared Memory Multiprocessors M. Schölzel The Coherence Problem s may contain local copies of the same memory address without proper coordination they work independently on their

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data

More information

CSC 631: High-Performance Computer Architecture

CSC 631: High-Performance Computer Architecture CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 10: Memory Part II CSC 631: High-Performance Computer Architecture 1 Two predictable properties of memory references: Temporal Locality:

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Shared Symmetric Memory Systems

Shared Symmetric Memory Systems Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University 18-742 Parallel Computer Architecture Lecture 5: Cache Coherence Chris Craik (TA) Carnegie Mellon University Readings: Coherence Required for Review Papamarcos and Patel, A low-overhead coherence solution

More information

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro Lecture 7: M Wrap-Up, ache coherence Topics: handling M errors and writes, cache coherence intro 1 Optimizations for Writes (Energy, Lifetime) Read a line before writing and only write the modified bits

More information

Aleksandar Milenkovich 1

Aleksandar Milenkovich 1 Parallel Computers Lecture 8: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email

More information

Modern CPU Architectures

Modern CPU Architectures Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

BERKELEY PAR LAB. RAMP Gold Wrap. Krste Asanovic. RAMP Wrap Stanford, CA August 25, 2010

BERKELEY PAR LAB. RAMP Gold Wrap. Krste Asanovic. RAMP Wrap Stanford, CA August 25, 2010 RAMP Gold Wrap Krste Asanovic RAMP Wrap Stanford, CA August 25, 2010 RAMP Gold Team Graduate Students Zhangxi Tan Andrew Waterman Rimas Avizienis Yunsup Lee Henry Cook Sarah Bird Faculty Krste Asanovic

More information

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based) Lecture 8: Snooping and Directory Protocols Topics: split-transaction implementation details, directory implementations (memory- and cache-based) 1 Split Transaction Bus So far, we have assumed that a

More information

A More Sophisticated Snooping-Based Multi-Processor

A More Sophisticated Snooping-Based Multi-Processor Lecture 16: A More Sophisticated Snooping-Based Multi-Processor Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2014 Tunes The Projects Handsome Boy Modeling School (So... How

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Lecture 7: Implementing Cache Coherence. Topics: implementation details Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,

More information

Performance study example ( 5.3) Performance study example

Performance study example ( 5.3) Performance study example erformance study example ( 5.3) Coherence misses: - True sharing misses - Write to a shared block - ead an invalid block - False sharing misses - ead an unmodified word in an invalidated block CI for commercial

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Other consistency models

Other consistency models Last time: Symmetric multiprocessing (SMP) Lecture 25: Synchronization primitives Computer Architecture and Systems Programming (252-0061-00) CPU 0 CPU 1 CPU 2 CPU 3 Timothy Roscoe Herbstsemester 2012

More information

Course Administration

Course Administration Spring 207 EE 363: Computer Organization Chapter 5: Large and Fast: Exploiting Memory Hierarchy - Avinash Kodi Department of Electrical Engineering & Computer Science Ohio University, Athens, Ohio 4570

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Lecture 24: Board Notes: Cache Coherency

Lecture 24: Board Notes: Cache Coherency Lecture 24: Board Notes: Cache Coherency Part A: What makes a memory system coherent? Generally, 3 qualities that must be preserved (SUGGESTIONS?) (1) Preserve program order: - A read of A by P 1 will

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information