Cache Designs and Tricks. Kyle Eli, Chun-Lung Lim


1 Cache Designs and Tricks Kyle Eli, Chun-Lung Lim

2 Why is cache important? CPUs already perform computations on data faster than the data can be retrieved from main memory, and microprocessor execution speeds are increasing faster than DRAM access times. Cache is typically much faster than main memory. Multiple caches are used, each specialized to enhance a different aspect of program execution.

3 What is Cache? A cache is a collection of data duplicating original values stored or computed earlier. Implemented on- or off-chip in SRAM. Cached values are faster to fetch than the original values are to fetch or compute. Low latency. High bandwidth. Commonly organized into two or three levels.

4 Diagram of a CPU memory cache

5 Cache Associativity: direct-mapped cache, fully associative cache, set associative cache.

6 Direct-Mapped Cache Slots are treated as a large array, with index chosen using the bits of the address. Suffers from many collisions, causing the cache line to be repeatedly evicted even when there are many empty slots. Very simple, only one slot to check.
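The index selection described above can be sketched as follows. This is a minimal illustration, not a description of any particular processor: the line size and slot count (`LINE_SIZE`, `NUM_SLOTS`) and the helper name `split_address` are assumptions chosen for the example.

```python
LINE_SIZE = 64    # bytes per cache line (assumed for illustration)
NUM_SLOTS = 256   # slots in the cache (assumed), giving a 16 KiB cache


def split_address(addr):
    """Return (tag, index, offset) for a direct-mapped cache."""
    offset = addr % LINE_SIZE          # byte position inside the line
    line_number = addr // LINE_SIZE    # which memory line the byte is in
    index = line_number % NUM_SLOTS    # the single slot this line may occupy
    tag = line_number // NUM_SLOTS     # identifies which line is in the slot
    return tag, index, offset


a = 0x12345
b = a + LINE_SIZE * NUM_SLOTS  # exactly one cache size further on
print(split_address(a))
print(split_address(b))
```

Both addresses map to the same index with different tags, so in a direct-mapped cache each one evicts the other even if every other slot is empty.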

7 Fully Associative Cache Any slot can store the cache line. Obtains data by comparing tag bits of the address to tag bits of every slot and making sure the valid bit is set. Hardware is complex. Normally used in translation lookaside buffers.

8 Set Associative Cache Combination of fully-associative and direct-mapped schemes. Cache slots are grouped into sets. Finding a set is like the direct-mapped scheme. Finding a slot within the set is like the fully-associative scheme. Comparison hardware is needed only for the slots within the selected set. Fewer collisions because you have more slots to choose from, even when cache lines map to the same set.

9 Multi-level Cache There are three levels of cache commonly being used: One on-chip with the processor, referred to as the "Level 1" cache (L1) or primary cache. Another on-die cache is the "Level 2" cache (L2) or secondary cache. The L3 cache is generally much larger and implemented on a separate chip.

10 Multi-level Caches: new design decisions. Inclusive caches: data in the L1 cache may also be in the L2 cache. Example: Intel Pentium II, III, IV and most RISCs. Exclusive caches: data is guaranteed to be in at most one of the L1 and L2 caches. Example: AMD Athlon.
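A small simulation can make the collision argument concrete. The sketch below is a simplified model under stated assumptions: the function name `count_misses`, the 64-byte line size, and the use of LRU replacement inside each set are choices made for this example, not details taken from the slides.

```python
from collections import OrderedDict


def count_misses(addresses, num_sets, ways, line_size=64):
    """Simulate a set-associative cache with LRU replacement inside each set."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets      # pick the set, as in a direct-mapped cache
        tag = line // num_sets
        slots = sets[index]
        if tag in slots:             # compare against the slots of this set only
            slots.move_to_end(tag)   # mark as most recently used
        else:
            misses += 1
            if len(slots) == ways:
                slots.popitem(last=False)  # evict the least recently used slot
            slots[tag] = True
    return misses


# Two lines that collide, accessed alternately; both caches hold 8 lines total.
trace = [0, 512] * 10
print(count_misses(trace, num_sets=8, ways=1))  # direct-mapped
print(count_misses(trace, num_sets=4, ways=2))  # 2-way set associative
```

With one way per set, every access in this trace misses; with two ways, only the first access to each line misses.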

11 Athlon64 Cache Hierarchy

12 Cache Issues Latency: time for cache to respond to a request. Smaller caches typically respond faster. Bandwidth: number of bytes which can be read or written per second. Cost: expensive to implement. A large level 3 cache can generally cost in excess of $1000 to implement. Benefits depend on the application's access patterns.

13 Cache Issues (continued) Memory requests are satisfied from either the cache or main memory. Cache hit: occurs when the processor requests an address stored in the cache. The processor writes or reads directly to or from the cache. Cache miss: occurs when the processor requests an address that is not stored in the cache, so the request goes to main memory.

14 Caching Algorithm Caching algorithms are used to optimize cache management. Cache size is limited. An algorithm is used to decide which items to keep and which to discard to make room for new items. Cache algorithms: Least Recently Used (LRU), Least Frequently Used, Belady's Min.

15 Least Recently Used Discards the least recently used item first. Must keep track of which item was least recently used. Pseudo-LRU approximates this with only one bit per cache item.
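A minimal software sketch of exact LRU, assuming a fully associative cache of `capacity` items. The function name `lru_misses` is made up for this example, and real hardware typically uses an approximation such as Pseudo-LRU rather than an ordered structure like this.

```python
from collections import OrderedDict


def lru_misses(requests, capacity):
    """Count misses for a cache that evicts the least recently used item."""
    cache = OrderedDict()  # oldest entry first, most recently used last
    misses = 0
    for item in requests:
        if item in cache:
            cache.move_to_end(item)        # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # discard the least recently used
            cache[item] = None
    return misses


print(lru_misses(list("ABCABDAB"), capacity=3))
```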

16 Least Frequently Used Counts how often an item is used. Items used the least are discarded first.
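The counting scheme can be sketched as below. This is an illustrative model only: the name `lfu_misses` is hypothetical, counts are kept only while an item is cached, and ties are broken in favor of evicting the item inserted earliest, which is one of several reasonable choices.

```python
def lfu_misses(requests, capacity):
    """Count misses for a cache that evicts the least frequently used item."""
    counts = {}  # cached item -> number of uses since it was inserted
    misses = 0
    for item in requests:
        if item in counts:
            counts[item] += 1
        else:
            misses += 1
            if len(counts) >= capacity:
                # Lowest count loses; ties go to the earliest-inserted item.
                victim = min(counts, key=counts.get)
                del counts[victim]
            counts[item] = 1
    return misses


print(lfu_misses(list("AAABCA"), capacity=2))
```

In this trace the heavily used item A survives the request for C, whereas a plain LRU policy would evict A at that point because B was touched more recently.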

17 Belady's Min Optimal algorithm: discard the information that will not be needed for the longest time in the future. Cannot be implemented in hardware as it requires future knowledge. Used in simulations to judge the effectiveness of other algorithms.

18 Cache Optimization: Locality. Spatial locality: requested data is physically near previously used data. Temporal locality: requested data was recently used, or is frequently re-used.
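Because the whole request trace is known in a simulation, the policy is easy to sketch offline. The function name `belady_misses` is an assumption for this example, and the quadratic look-ahead is kept for clarity rather than speed.

```python
def belady_misses(requests, capacity):
    """Count misses when evicting the item whose next use is furthest away."""
    cache = set()
    misses = 0
    for i, item in enumerate(requests):
        if item in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            future = requests[i + 1:]

            def next_use(x):
                # Position of the next request for x, or infinity if never reused.
                return future.index(x) if x in future else float("inf")

            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses


trace = list("ABCABDAB")
print(belady_misses(trace, capacity=2))
```

On this trace with two slots the policy misses six times, while exact LRU misses on all eight requests, which is the kind of gap such simulations are used to measure.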

19 Optimization for Spatial Locality Spatial locality refers to accesses close to one another in position. Spatial locality is important to the caching system because contiguous cache lines are loaded from memory when the first piece of that line is loaded. Subsequent accesses within the same cache line are then practically free until the line is flushed from the cache. Spatial locality is not only an issue in the cache, but also within most main memory systems.
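The effect of loading a whole line on the first access can be sketched with a deliberately pessimistic model in which only the most recently loaded line stays cached. The 64-byte line, the 8-byte element size, and the name `misses_for` are assumptions for illustration.

```python
def misses_for(addresses, line_size=64):
    """Count misses assuming only the most recently loaded line is cached."""
    current_line = None
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line != current_line:  # first touch of a new line: load it
            misses += 1
            current_line = line
    return misses


ELEM = 8  # bytes per element (assumed)
sequential = [i * ELEM for i in range(1024)]  # neighbours share a line
strided = [i * 64 for i in range(1024)]       # every access lands in a new line

print(misses_for(sequential), misses_for(strided))
```

The sequential scan misses once per eight elements; the strided scan misses on every access even though it performs the same number of reads.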

20 Optimization for Spatial Locality Prefetch data in other cache lines.

21 Optimization for Temporal Locality Temporal locality refers to two accesses to a piece of memory within a small period of time. The shorter the time between the first and last access to a memory location, the less likely it is to be loaded from main memory or slower caches multiple times.

22 Optimization for Temporal Locality Re-use data which has been brought to cache as often as possible.

23 Optimization Techniques: prefetching, loop blocking, loop fusion, array padding, array merging.

24 Prefetching Many architectures include a prefetch instruction that is a hint to the processor that a value will be needed from memory soon. When the memory access pattern is well defined and the programmer knows many instructions ahead of time, prefetching will result in very fast access when the data is needed.

25 Prefetching (continued) It does no good to prefetch variables that will only be written to. The prefetch should be done as early as possible, because getting values from memory takes a LONG time: memory accesses may take 50 processor clock cycles or more. Prefetching too early, however, means that other accesses might flush the prefetched data from the cache.

26 Prefetching (continued) The compiler may be inserting prefetch instructions. These may be slower than manual prefetch. The CPU probably has a hardware prefetching feature. It can be dynamically driven by run-time data and is independent of manual prefetch.

27 Loop Blocking Reorder loop iterations so as to operate on all the data in a cache line at once, so that it needs to be brought in from memory only once. For instance, if an algorithm calls for iterating down the columns of an array in a row-major language, do multiple columns at a time. The number of columns should be chosen so that one row of the block fills a cache line.
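One common rule of thumb for "early, but not too early" is to prefetch far enough ahead that the memory latency is covered by the loop iterations in between. The sketch below only does that arithmetic; the function name `prefetch_distance` and the per-iteration cost of 10 cycles are assumptions for illustration, and the 50-cycle latency is the figure quoted above.

```python
import math


def prefetch_distance(memory_latency_cycles, cycles_per_iteration):
    """Iterations to look ahead so the data arrives by the time it is used."""
    return math.ceil(memory_latency_cycles / cycles_per_iteration)


distance = prefetch_distance(memory_latency_cycles=50, cycles_per_iteration=10)
# While iteration i is being processed, a prefetch would target i + distance.
print(distance)
```

A much larger distance than this increases the chance that the prefetched line is evicted before it is used.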
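The column example can be checked with a small simulation. Everything here is an assumed model for illustration: a row-major 64x64 array of 8-byte elements starting at address 0, 64-byte lines, and a tiny fully associative LRU cache of 8 lines simulated by the hypothetical helper `line_misses`.

```python
from collections import OrderedDict


def line_misses(addresses, line_size=64, num_lines=8):
    """Count misses in a tiny fully associative LRU cache (assumed model)."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)
            cache[line] = None
    return misses


ROWS, COLS, ELEM = 64, 64, 8
BLOCK = 64 // ELEM  # columns per block: one cache line holds 8 elements


def addr(r, c):
    return (r * COLS + c) * ELEM  # row-major layout


# Plain column-by-column traversal.
column_order = [addr(r, c) for c in range(COLS) for r in range(ROWS)]

# Blocked: walk down the rows, handling BLOCK adjacent columns at each row.
blocked_order = [addr(r, c)
                 for c0 in range(0, COLS, BLOCK)
                 for r in range(ROWS)
                 for c in range(c0, c0 + BLOCK)]

print(line_misses(column_order), line_misses(blocked_order))
```

In this model the plain column traversal misses on every one of the 4096 accesses, while the blocked traversal misses once per cache line, 512 times in total.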

28 Loop Fusion Combine loops that access the same data. Leads to a single load of each memory address.
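The transformation itself looks like the sketch below. Python is used here only to show the shape of the rewrite; the function names and the two computations are made up for the example, and the cache benefit is something to expect in compiled code operating on arrays larger than the cache.

```python
def unfused(a):
    """Two separate loops: every a[i] is read in two different passes."""
    n = len(a)
    doubled = [0.0] * n
    shifted = [0.0] * n
    for i in range(n):          # first pass over a
        doubled[i] = 2.0 * a[i]
    for i in range(n):          # second pass re-reads every a[i]
        shifted[i] = a[i] + 1.0
    return doubled, shifted


def fused(a):
    """One loop: each a[i] is read once and reused while it is still cached."""
    n = len(a)
    doubled = [0.0] * n
    shifted = [0.0] * n
    for i in range(n):
        x = a[i]
        doubled[i] = 2.0 * x
        shifted[i] = x + 1.0
    return doubled, shifted


data = [float(i) for i in range(10)]
print(unfused(data) == fused(data))
```

Fusion is only valid when the second loop does not depend on results that the first loop produces in later iterations.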

29 Array Padding Arrange accesses to avoid subsequent access to different data that may be cached in the same position.
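A simulated direct-mapped cache shows why a small pad helps. The cache geometry (64 slots of 64 bytes), the array sizes, and the names `direct_mapped_misses` and `interleaved` are all assumptions for this sketch.

```python
def direct_mapped_misses(addresses, line_size=64, num_slots=64):
    """Count misses in a direct-mapped cache (assumed 4 KiB model)."""
    slots = [None] * num_slots
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_slots
        if slots[index] != line:
            misses += 1
            slots[index] = line  # evict whatever was in the slot
    return misses


CACHE_BYTES = 64 * 64
ELEM = 8
N = 256  # elements per array (2 KiB each)


def interleaved(base_a, base_b):
    """Addresses touched by a loop that reads a[i] and then b[i]."""
    out = []
    for i in range(N):
        out.append(base_a + i * ELEM)
        out.append(base_b + i * ELEM)
    return out


unpadded = interleaved(0, CACHE_BYTES)     # b starts one cache size after a
padded = interleaved(0, CACHE_BYTES + 64)  # one line of padding before b

print(direct_mapped_misses(unpadded), direct_mapped_misses(padded))
```

Without padding, `a[i]` and `b[i]` always fall in the same slot and every access misses; with one line of padding, each line is loaded only once.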

30 Array Merging Merge arrays so that data that needs to be accessed at once is stored together
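The benefit can be sketched by counting how many cache lines one logical record touches in each layout. The three-field record, the 8-byte field size, the base addresses, and the helper `lines_touched` are assumptions for illustration.

```python
LINE = 64   # bytes per cache line (assumed)
ELEM = 8    # bytes per field (assumed)
N = 1000    # records


def lines_touched(addresses):
    """Number of distinct cache lines covered by the given byte addresses."""
    return len({a // LINE for a in addresses})


i = 500  # the record being accessed

# Separate arrays x, y, z laid out one after another in memory.
base_x, base_y, base_z = 0, N * ELEM, 2 * N * ELEM
separate = [base_x + i * ELEM, base_y + i * ELEM, base_z + i * ELEM]

# Merged: one array of 24-byte records holding x, y, z side by side.
merged = [i * 3 * ELEM, i * 3 * ELEM + ELEM, i * 3 * ELEM + 2 * ELEM]

print(lines_touched(separate), lines_touched(merged))
```

Since 24 does not divide 64, some merged records straddle two lines; padding the record to a power-of-two size would avoid that at the cost of extra space.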

31 Pitfalls and Gotchas Basically, the pitfalls of memory access patterns are the inverse of the strategies for optimization. There are also some gotchas that are unrelated to these techniques. The associativity of the cache. Shared memory. Sometimes an algorithm is just not cache friendly.

32 Problems From Associativity When this problem shows itself is highly dependent on the cache hardware being used. It does not exist in fully associative caches. The simplest case to explain is a direct-mapped cache. If the stride between addresses is a multiple of the cache size, only one cache position will be used.
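The stride effect in a direct-mapped cache reduces to simple index arithmetic. The cache geometry below (128 slots of 64 bytes, 8 KiB) and the helper name `slots_used` are assumptions for this sketch.

```python
LINE_SIZE = 64
NUM_SLOTS = 128
CACHE_BYTES = LINE_SIZE * NUM_SLOTS  # 8 KiB direct-mapped cache (assumed)


def slots_used(stride, count=64):
    """Distinct cache slots touched by `count` accesses `stride` bytes apart."""
    return len({(i * stride // LINE_SIZE) % NUM_SLOTS for i in range(count)})


print(slots_used(LINE_SIZE))         # one new line per access
print(slots_used(CACHE_BYTES // 2))  # half the cache size
print(slots_used(CACHE_BYTES))       # exactly the cache size
```

With a stride equal to the cache size, all 64 accesses compete for a single slot, so the rest of the cache goes unused.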

33 Shared Memory It is obvious that shared memory with high contention cannot be effectively cached. However it is not so obvious that unshared memory that is close to memory accessed by another processor is also problematic. When laying out data, complete cache lines should be considered a single location and should not be shared.
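The layout rule can be sketched by checking which cache line each processor's private data falls in. The 64-byte line, the 8-byte counters, and the helper `same_line` are assumptions for this example; it models only the address layout, not the coherence traffic itself.

```python
LINE_SIZE = 64     # bytes per cache line (assumed)
COUNTER_BYTES = 8  # size of each per-processor counter (assumed)


def same_line(addr_a, addr_b):
    """True if two byte addresses fall in the same cache line."""
    return addr_a // LINE_SIZE == addr_b // LINE_SIZE


# One private counter per processor, packed back to back.
packed = [p * COUNTER_BYTES for p in range(4)]

# The same counters, each placed at the start of its own cache line.
padded = [p * LINE_SIZE for p in range(4)]

print(same_line(packed[0], packed[1]))  # neighbours share a line
print(same_line(padded[0], padded[1]))  # each counter owns its line
```

In the packed layout, a write by one processor invalidates the line holding the other processors' counters, even though no counter is logically shared.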

34 Optimization Wrapup Only try after the best algorithm has been selected. Cache optimizations will not result in an asymptotic speedup. If the problem is too large to fit in memory or in memory local to a compute node, many of these techniques may be applied to speed up accesses to even more remote storage.

35 Recent Cache Architecture AMD Athlon 64 X2: 128kB 2-way set associative L1 (64kB data, 64kB instruction) per core; 1MB or 512kB full-speed 16-way set associative L2 cache per core. Intel Core (Yonah): 64kB L1 (32kB data, 32kB instruction) per core; 2MB full-speed 8-way set associative L2 cache, shared. Designed for power-saving: cache can be flushed to memory and cache ways can be deactivated.

36 Recent Cache Architecture SUN UltraSparc T1: 24kB 4-way set associative L1 (8kB data, 16kB instruction) per core; 3072kB full-speed 12-way set associative L2 cache, shared. IBM Power5: 96kB L1 (64kB 2-way set associative instruction, 32kB 4-way set associative data); 1.92MB full-speed 10-way set associative L2 cache, shared; 36MB half-speed 12-way set associative L3 cache, shared (off-die).

37 Recent Cache Architecture Sony/Toshiba/IBM Cell Broadband Engine: 9 cores. 1 POWER Processing Element (PPE): 64kB L1 (32kB 2-way set associative instruction, 32kB 4-way set associative data); 512kB full-speed 8-way set associative L2. 8 Synergistic Processing Elements (SPEs): 256kB Local Storage per core; no direct access to memory; can access any 128-bit word at L1 speed from local storage.

38 Specialized Cache Designs CAM-Tag Cache for Low-Power

39 Motivation Cache uses 30-60% of processor energy in embedded systems. Example: 43% for StrongArm-1. Many industrial low-power processors use CAM (content-addressable memory): ARM3, 64-way set-associative [Furber et al. 89]; StrongArm, 32-way set-associative [Santhanam et al. 98]; Intel XScale, 32-way set-associative 01. CAM: fast and energy-efficient.

40 Set-Associative RAM-tag Cache [Figure: tag, status, and data arrays for each way with per-way tag comparators; address split into tag, index, and offset.] Not energy-efficient: all ways are read out. A two-phase approach is more energy-efficient but has 2X latency.

41 Set-Associative RAM-tag Cache Sub-bank [Figure: sub-bank with address decoder, tag and data SRAM cells, tag comparator, offset decoders, and sense amps, driven by global (gwl) and local (lwl) wordlines and connected to the I/O bus.] Not energy-efficient: all ways are read out. A two-phase approach is more energy-efficient but has 2X latency. Sub-banking: 1 sub-bank = 1 way. Low-swing bitlines: only for reads; writes are performed full-swing. Wordline gating.

42 CAM-tag Cache [Figure: tag, status, and data arrays with a HIT? signal per entry; address fields labeled tag, bank, word, and offset.] Only one sub-bank activated. Associativity within sub-bank. Easy to implement high associativity.

43 CAM-tag Cache Sub-bank [Figure: CAM-tag array, offset decoders, SRAM cells, and sense amps, driven by global (gwl) and local (lwl) wordlines and connected to the I/O bus.] Only one sub-bank activated. Associativity within sub-bank. Easy to implement high associativity.

44 CAM-tag Cache Sub-bank Layout 1-KB Cache Sub-bank implemented in 0.25 µm CMOS technology 2x12x32 CAM Array 32x64 RAM Array 10% area overhead over RAM-tag cache

45 Delay Comparison [Figure: critical-path diagrams.] RAM-tag cache critical path stages: index bits, global wordline decoding (gwl), local wordline decoding (lwl), tag readout and data readout, tag comparison, decoded offset, data out. CAM-tag cache critical path stages: tag bits, tag bits broadcasting, tag comparison, local wordline decoding (lwl), data readout, decoded offset, data out. The two critical paths are within 3% of each other.

46 Hit Energy Comparison [Figure: hit energy per access (pJ) for an 8KB cache, by associativity and implementation (1-, 2-, 4-, and 8-way RAM; 8-, 16-, and 32-way CAM), for the benchmarks LZW, ijpeg, pegwit, perl, m88ksim, gcc, and their average.]

47 Total Access Energy (pegwit) [Figure: total energy per access (pJ) for an 8KB cache, for 1-, 2-, 4-, and 8-way RAM and 8-, 16-, and 32-way CAM configurations, with miss energy expressed in multiples of 32-bit read access energy, up to 1024X.] Pegwit: high miss rate for high associativity.

48 Total Access Energy (perl) [Figure: total energy per access (pJ) for an 8KB cache, for 1-, 2-, 4-, and 8-way RAM and 8-, 16-, and 32-way CAM configurations, with miss energy expressed in multiples of 32-bit read access energy, up to 1024X.] Perl: very low miss rate for high associativity.

49 References Wikipedia. UMD ndex.html. Michael Zhang and Krste Asanovic, "Highly-Associative Caches for Low Power Processors", MIT Laboratory for Computer Science, December 2000 (from Kool Chips Workshop). "Cache Designs and Tricks", Kevin Leung, Josh Gilkerson, Albert Kalim, Shaz Husain.

50 References (cont'd) Many academic studies on cache: [Albera, Bahar, 98] Power and performance trade-offs. [Amrutur, Horowitz, 98, 98, 00] Speed and power scaling. [Bellas, Hajj, Polychronopoulos, 99] Dynamic cache management. [Ghose, Kamble, 99] Power reduction through sub-banking, etc. [Inoue, Ishihara, Murakami, 99] Way predicting set-associative cache. [Kin, Gupta, Mangione-Smith, 97] Filter cache. [Ko, Balsara, Nanda, 98] Multilevel caches for RISC and CISC. [Wilton, Jouppi, 94] CACTI cache model.
