Trying to design a simple yet efficient L1 cache. Jean-François Nguyen

Size: px

Start display at page:

Download "Trying to design a simple yet efficient L1 cache. Jean-François Nguyen"

Leonard Jennings
5 years ago
Views:

1 Trying to design a simple yet efficient L1 cache Jean-François Nguyen 1

2 Background Minerva is a 32-bit RISC-V soft CPU It is described in plain Python using nmigen FPGA-friendly Designed for reasonable performance in embedded use cases This talk will discuss our approach to improve our current biggest performance bottleneck: memory access latency. 2

3 Minerva 3

4 MiSoC/LiteX Wishbone UART PCIe CSR bus I-bus ETH D-bus CPU... ROM SRAM SDRAM L2$ 4

Direct accesses to FPGA BRAM take 1 clock cycle, but are size

5 Memory access latency Memory accesses must stall the CPU pipeline until completion. Direct accesses to FPGA BRAM take 1 clock cycle, but are size constrained. XC7A35T has ~1.8M of BRAM Standard Wishbone transactions take 2 cycles (+1 to register). throughput = 3c, latency = 18c 5

6 Memory as a hierarchy Cost/bit 1c? L2 3c SDRAM ~25c... HDD + Access latency 6

7 Step 1: basic cache design 7

8 Goal: exploiting the memory hierarchy Provide the illusion of a large and fast memory Temporal locality: If an item is referenced, it will tend to be referenced again soon. Spatial locality: If an item is referenced, nearby items will tend to be referenced again soon. 8

9 Accessing a cache Address Tag Index Byte offset Index Valid Tag Data = Hit Data 9

10 Multiple words in a line Address Tag Index Word off. Byte off Index Valid Tag Data = Data Hit 10

11 Cache control FSM Flush Refill Check 11

12 Cache control FSM: Flush Flush flushreq Refill Flush the cache. This state is entered on reset. Stall the pipeline Walk over each line to clear the valid bit Check 12

13 Cache control FSM: Check Check for cache misses. Flush flushreq Check Refill ~hit In case of a miss, send a refill request and go to Refill. In case of a flush request, go to Flush. 13

14 Cache control FSM: Refill Operate cache refills. Flush flushreq Check Refill ~hit Stall the pipeline. Wait for data from memory Set data, tag and valid bits Resume execution 14

15 Handling write accesses Write-through Always write data to both memory and cache Cost-effective when paired with a write buffer Simple but slow Write-back Only write data to cache. Memory is updated on refills Cache contents are not consistent with memory Faster, but more complex 15

16 Step 2: Optimizations 16

17 Improving cache performance Average memory access time: Hit latency + miss rate * miss penalty Two strategies: Reduce miss rate Mitigate miss penalties 17

18 Associativity Associativity is the number of locations where a cache line can be placed. Direct-mapped A line can be placed in exactly one location. n-way associative A line can be placed in n locations. Fully associative A line can be placed in any location. 18

19 Example: 2-way associativity Address Tag Index Byte off Index V Tag Data V Tag Data = = Hit Data 19

20 Refill policy In a n-way associative cache, n lines can match a given index. In case of a refill, which line should we choose to replace? Multiple choices: pseudo-lru (Least Recently Used) MRU Pseudo-random... In a 2-way associative cache, Round-Robin is equivalent to LRU. 20

21 Early restart Flush flushreq Refill Stall the pipeline Wait for the requested word Un-stall the pipeline but continue refilling Check ~hit What happens if we miss again? 21

22 Early restart Flush flushreq Refill Stall the pipeline Wait for the requested word Un-stall the pipeline but continue refilling Check ~hit What happens if we miss again? Stall the pipeline again Wait for the requested word 22

23 Critical Word First We start the refill burst with the address of the requested word first. Wishbone incremental bursts can wrap around the address LSB. Lines of 4, 8 or 16 words are supported. 23

24 Conclusion In case of L1 hit: 1 cycle read latency Cost/bit L1 1c In case of L1 miss, assuming L2 hits: 2 cycle miss penalty N+1 cycles for a complete refill (N is number of words per line) Lots of tuning knobs: Words per line Number of lines Associativity (1 or 2 ways) L2 3c SDRAM ~25c... HDD + Access latency 24

25 Demo! 25

26 Thanks! Code: 26

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,