Computer and Digital System Architecture


1 Computer and Digital System Architecture EE/CpE-517-A Bruce McNair Stevens Institute of Technology - All rights reserved 1-1/68

2 Week 9 Hierarchical organization of system memory Furber Ch 10 Patterson & Hennessy Ch 5, 6 Stevens Institute of Technology - All rights reserved 1-2/68

3 Caching and memory organization analogy Desktop Stevens Institute of Technology - All rights reserved 1-3/68

4 Caching and memory organization analogy Local shelves Desktop Stevens Institute of Technology - All rights reserved 1-4/68

5 Caching and memory organization analogy Networks of libraries Local shelves Desktop Stevens Institute of Technology - All rights reserved 1-5/68

6 Caching and memory organization analogy Desktop: volume ~10s, access ~seconds-minutes. Local shelves: volume ~1,000s, access ~minutes-hours. Networks of libraries: volume ~1,000,000s, access ~days-weeks. Stevens Institute of Technology - All rights reserved 1-6/68

7 Memory cost/performance [Chart: access time vs. cost per GB. Disk: ~1-100 ms access at the lowest $/GB; DRAM: ~10-100 ns; SRAM: ~1-10 ns at the highest $/GB, up to ~$10,000/GB on the chart's cost axis.] Stevens Institute of Technology - All rights reserved 1-7/68

8 Memory hierarchy High-bandwidth, low-latency interface Low-bandwidth, high-latency interface Processor Small, fast memory Medium size/speed memory Increasing physical, electrical and logical separation Large, slow memory Stevens Institute of Technology - All rights reserved 1-8/68

9 Memory hierarchy Processor Small, fast memory How do you get the important data where you need it? Medium size/speed memory Large, slow memory Stevens Institute of Technology - All rights reserved 1-9/68

10 Typical memory hierarchy embedded systems ~100 bytes Processor registers Access in 1-3 nanoseconds ~8-32 KB RAM or RAM/cache Processor chip Access in ~10 nanoseconds ~10s-100s of MB Off-chip RAM Access in ~100 nanoseconds Stevens Institute of Technology - All rights reserved 1-10/68

11 Typical memory hierarchy general purpose systems Processor registers RAM or RAM/cache Processor chip Multi-level cache Off-chip RAM/cache Mass storage Stevens Institute of Technology - All rights reserved 1-11/68

12 On-chip RAM vs cache memory On-chip RAM (processor registers plus RAM on the processor chip): requires direct control of memory contents; predictable/deterministic behavior, so interrupt latency is easy to predict; lower power; cheaper. Cache (processor registers plus cache and RAM on the processor chip): requires complex support logic; handles the dynamic operation of a program without programmer intervention; unpredictable, nondeterministic behavior, so interrupt latency is hard to predict. Stevens Institute of Technology - All rights reserved 1-12/68

13 Cache architecture [Diagram: a processor with 16 registers issues addresses from 0000 0000 to FFFF FFFF; a single cache holds copies of both instructions and data fetched from a memory containing instructions and data.] Unified (single) cache Stevens Institute of Technology - All rights reserved 1-13/68

14 Cache architecture [Diagram: the processor issues separate instruction and data addresses; an instruction cache holds copies of instructions and a data cache holds copies of data, both backed by the same memory.] Separate (modified Harvard architecture) instruction and data cache Stevens Institute of Technology - All rights reserved 1-14/68

15 Cache performance [Timing diagram: the processor's fetch/execute stream normally hits in the cache; an occasional cache miss stalls the processor while memory is accessed.] Cache hit rate = (reads satisfied from the cache)/(total reads); it should be close to 100% and depends on cache size and organization Stevens Institute of Technology - All rights reserved 1-15/68

16 Cache performance [Timing diagram as on the previous slide: hits are serviced by the cache, misses go on to memory.] t_access = p_hit x t_cache + p_miss x (t_cache + t_memory) Cache hit rate = (reads satisfied from the cache)/(total reads); it should be close to 100% and depends on cache size and organization Stevens Institute of Technology - All rights reserved 1-16/68
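
A minimal sketch (not from the slides) that plugs assumed numbers into the access-time formula above; the hit rate and latencies are illustrative assumptions, not figures from the course.

#include <stdio.h>

int main(void)
{
    double p_hit    = 0.98;     /* assumed cache hit rate            */
    double t_cache  = 1.0;      /* assumed cache access time in ns   */
    double t_memory = 100.0;    /* assumed main-memory access in ns  */

    double p_miss   = 1.0 - p_hit;
    double t_access = p_hit * t_cache + p_miss * (t_cache + t_memory);

    printf("average access time = %.2f ns\n", t_access);   /* 3.00 ns with these numbers */
    return 0;
}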

17 Cache organization Direct mapped cache address: tag index tag RAM data RAM decoder compare mux hit data Typical mapping: index = block_address modulo (number of cache lines); the remaining upper address bits form the tag Stevens Institute of Technology - All rights reserved 1-17/68

18 Cache organization Direct mapped cache address: tag index. The tag is the portion of the address used for comparison with the cache (a shorthand reference to the data's address); the index is the portion of the address used to index the cache. Data is grouped into lines. The tag RAM lookup and the data RAM access proceed in parallel to increase speed; the tag RAM is much smaller than the data RAM, so it has faster access. [Diagram: decoder, tag RAM, data RAM, compare, mux, hit, data.] Stevens Institute of Technology - All rights reserved 1-18/68

19 Cache organization Direct mapped cache example address: tag index [Diagram: decoder, tag RAM, data RAM, compare, mux, hit, data.] Consider: 8 KB cache, 16-byte data lines, 512 lines. 32-bit address: - 4 bits within data line - 9 bits to select line - leaves 19 bits for tag - requires ~1 KB of tag RAM Stevens Institute of Technology - All rights reserved 1-19/68
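
A minimal sketch (not from the slides) of how the 32-bit address splits into tag, index, and byte-within-line fields for the 8 KB, 16-byte-line direct-mapped cache above; the example address is an arbitrary assumption.

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 4    /* 16 bytes per line  */
#define INDEX_BITS  9    /* 512 lines          */

int main(void)
{
    uint32_t addr   = 0x1234ABCDu;                                        /* arbitrary example */
    uint32_t offset =  addr & ((1u << OFFSET_BITS) - 1);                  /* byte within line  */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);   /* line select       */
    uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);                /* 19-bit tag        */

    printf("tag=0x%05X index=%u offset=%u\n", tag, index, offset);
    return 0;
}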

20 Cache organization Direct mapped cache address: tag index [Diagram: tag RAM, data RAM, decoder, compare, mux, hit, data.] One memory item takes one cache location; items whose addresses map to the same cache location contend for storage. Contention: addr2 = addr1 + (N x cache_size) Different physical addresses map to the same cache line even though their tags differ, so they evict each other Stevens Institute of Technology - All rights reserved 1-20/68

21 Cache organization Set-associative cache address: tag index decoder tag RAM data RAM compare mux Hit Data compare mux decoder tag RAM data RAM Stevens Institute of Technology - All rights reserved 1-21/68

22 Cache organization Set-associative cache address: tag index [Diagram: two tag RAM/data RAM banks, each with its own decoder, compare and mux, feeding Hit and Data.] Effectively two direct-mapped caches in parallel; contention between two data items sharing the same cache location is avoided by duplicating the cache. Stevens Institute of Technology - All rights reserved 1-22/68

23 Cache organization Set-associative cache example address: tag index [Diagram: two tag RAM/data RAM banks with compare, mux, Hit, Data.] Consider: 8 KB cache, 16-byte data lines, 256 lines in each half. 32-bit address: - 4 bits within data line - 8 bits to select line - leaves 20 bits for tag - requires ~1 KB of tag RAM Slightly longer access time due to the MUX delay, but better performance due to less contention Stevens Institute of Technology - All rights reserved 1-23/68
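
A minimal sketch (not from the slides) of a lookup in the 2-way set-associative cache above (8 KB, 16-byte lines, 256 sets); the data structure is an assumed software model, not the hardware implementation.

#include <stdint.h>
#include <stdbool.h>

#define WAYS        2
#define SETS        256
#define OFFSET_BITS 4
#define INDEX_BITS  8

struct cline { bool valid; uint32_t tag; uint8_t data[16]; };
static struct cline cache[SETS][WAYS];

/* Return the way holding the address, or -1 on a miss. */
int lookup(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);

    for (int way = 0; way < WAYS; way++)       /* the hardware compares both tag RAMs in parallel */
        if (cache[index][way].valid && cache[index][way].tag == tag)
            return way;                        /* hit */
    return -1;                                 /* miss */
}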

24 address: tag Cache organization Set-associative cache example index decoder tag RAM data RAM When retrieving data, information present in either half of cache can be accessed, but when writing data, which half of cache should be used? compare mux Hit Data compare mux decoder tag RAM data RAM Stevens Institute of Technology - All rights reserved 1-24/68

25 Cache organization Set-associative cache example address: tag index [Diagram: two tag RAM/data RAM banks with compare, mux, Hit, Data.] When retrieving data, information present in either half of the cache can be accessed, but when writing data, which half of the cache should be used? Options: Random assignment Least recently used Cyclic (round-robin) assign data alternately to each half of the cache Stevens Institute of Technology - All rights reserved 1-25/68
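
A minimal sketch (not from the slides) of the three victim-selection options listed above, for the 2-way cache modelled earlier; the per-set round-robin pointer and LRU state are assumed bookkeeping, not a particular hardware design.

#include <stdlib.h>

#define WAYS 2
#define SETS 256

static int rr_next[SETS];    /* round-robin pointer per set      */
static int lru_way[SETS];    /* least-recently-used way per set  */

int victim_random(void)        { return rand() % WAYS; }

int victim_round_robin(int set)
{
    int v = rr_next[set];
    rr_next[set] = (v + 1) % WAYS;   /* assign lines alternately to each half */
    return v;
}

int victim_lru(int set)        { return lru_way[set]; }

/* Call on every hit or fill so the LRU state stays current (2-way case). */
void touch(int set, int way)   { lru_way[set] = way ^ 1; }   /* the other way is now least recent */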

26 Cache organization Set-associative cache example [Diagram: the 2-way structure from the previous slides.] The set-associative cache shown was a 2-way set-associative cache. The same approach could be used for higher levels of associativity, but 4 ways is a typical limit. Stevens Institute of Technology - All rights reserved 1-26/68

27 Cache organization Fully-associative cache CAM: Content addressable memory the content comparator is built in, so it automatically performs a search to find the data identified by the tag. The tag CAM stores all address bits other than those used to address bytes within the line; a match selects the corresponding line in the data RAM. [Diagram: address split into tag and byte-within-line; tag CAM, data RAM, mux, hit, data.] Stevens Institute of Technology - All rights reserved 1-27/68
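
A minimal sketch (not from the slides) of what the tag CAM does, modelled as a sequential search in software (the real CAM compares every entry at once); the 64-entry size and 16-byte line are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

#define LINES       64
#define OFFSET_BITS 4     /* bits used to address bytes within a line (assumed) */

struct cam_entry { bool valid; uint32_t tag; };
static struct cam_entry cam[LINES];

/* Return the matching line number, or -1 on a miss. */
int cam_search(uint32_t addr)
{
    uint32_t tag = addr >> OFFSET_BITS;   /* all address bits except the byte-within-line bits */
    for (int i = 0; i < LINES; i++)
        if (cam[i].valid && cam[i].tag == tag)
            return i;
    return -1;
}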

28 Read vs write strategies Read data/instruction: [Flowchart: the processor asks the cache whether the information is present.] Cache hit: return the cached information and continue processing. Cache miss: retrieve the information from memory. Stevens Institute of Technology - All rights reserved 1-28/68

29 Read vs write strategies Write-through: the processor waits as data is written to memory; if a copy is cached, the cache is updated as well. Advantages: Simplest Disadvantages: Slowest Stevens Institute of Technology - All rights reserved 1-29/68

30 Read vs write strategies processor Write-through with buffered write: Processor writes immediately to write buffer Write buffer writes data back to memory Cache is updated if needed buffer cache memory Advantages: Does not slow processor down Disadvantages: Additional hardware buffer needed Stevens Institute of Technology - All rights reserved 1-30/68

31 Read vs write strategies processor cache Copy-back (write-back): Copy-back cache is not kept consistent with memory Write only updates cache Cache lines remember when they have been modified using a dirty bit If dirty cache line gets allocated to new data, old data gets copied back first memory Advantages: Fastest with minimal memory bandwidth utilization Disadvantages: Most complicated of the three Potential issues if data synchronization is lost Stevens Institute of Technology - All rights reserved 1-31/68
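
A minimal sketch (not from the slides) contrasting write-through with copy-back as described above; the line structure, the word-addressed memory[] stand-in, and the helper names are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

struct cline { bool valid, dirty; uint32_t tag; uint32_t data[4]; };   /* 16-byte line */

static uint32_t memory[1u << 20];                       /* word-addressed stand-in for main memory */

static void mem_write(uint32_t addr, uint32_t value)
{
    memory[(addr >> 2) & ((1u << 20) - 1)] = value;     /* write the word to memory */
}

static void cache_update(struct cline *l, uint32_t addr, uint32_t value)
{
    l->data[(addr >> 2) & 3] = value;                   /* update the word within the line */
}

/* Write-through: memory is always written; a cached copy, if present, is updated too. */
void write_through(struct cline *l, uint32_t addr, uint32_t value, bool cached)
{
    mem_write(addr, value);
    if (cached) cache_update(l, addr, value);
}

/* Copy-back: only the cache is written and the line is marked dirty; the line's contents
   are copied back to memory later, only when the dirty line is evicted for new data. */
void copy_back(struct cline *l, uint32_t addr, uint32_t value)
{
    cache_update(l, addr, value);
    l->dirty = true;
}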

32 Cache features and options
Organizational feature / Options:
- Cache-MMU relationship: Physical cache; Virtual cache
- Cache contents: Unified instruction and data cache; Separate instruction and data cache
- Associativity: Direct-mapped (RAM-RAM); Set-associative (RAM-RAM); Fully-associative (CAM-RAM)
- Replacement strategy: Round-robin; Random; LRU
- Write strategy: Write-through; Write-through with write buffer; Copy-back
Stevens Institute of Technology - All rights reserved 1-32/68

33 ARM-3 cache performance example Caching options / Relative performance: No cache: 1.0; Data-only cache: 1.13; Instruction-only cache: 1.95; Instruction and data cache: 2.5 Stevens Institute of Technology - All rights reserved 1-33/68

34 ARM-3 cache performance example Caching options / Relative performance: No cache: 1.0; Data-only cache: 1.13; Instruction-only cache: 1.95; Instruction and data cache: 2.5 The instruction cache has the maximum impact Stevens Institute of Technology - All rights reserved 1-34/68

35 ARM-3 performance [Chart: relative performance (up to ~2.5) vs. cache size in Kbytes (from ~1/4 KB upward) for fully-associative, 2-way set-associative, and direct-mapped caches.] Relative performance depends on cache size and type of cache Stevens Institute of Technology - All rights reserved 1-35/68

36 Performance improvement with associativity [Chart: performance and memory bandwidth requirement vs. associativity (ways).] Stevens Institute of Technology - All rights reserved 1-36/68

37 Performance improvement with associativity [Chart: performance and memory bandwidth requirement vs. associativity (ways).] Improved performance with any associativity; diminished improvement in performance with higher associativity Stevens Institute of Technology - All rights reserved 1-37/68

38 Performance improvement with associativity Virtually the same performance for a 64-way vs. a 256-way associative cache, so split the 256-way cache into four 64-way cache sections; the 3/4 of the sections that are inactive can be powered down [Chart: performance and memory bandwidth requirement vs. associativity (ways).] Stevens Institute of Technology - All rights reserved 1-38/68

39 ARM3 cache organization [Diagram: the virtual address, qualified by user/supervisor and enable signals, is decoded; four 64-entry tag CAMs are searched, and a match addresses a 1024 x 32-bit-word data RAM via byte addresses [9:0] (line select [9:4], word select [3:2], byte select [1:0]) to produce hit and data.] Stevens Institute of Technology - All rights reserved 1-39/68

40 ARM3 cache organization [Diagram as on the previous slide, annotated: part of the address selects 1 of the 4 CAM tag stores; the upper address bits (up to bit 31) are looked up in the CAM store; bits [3:2] select the word within the cache line; bits [1:0] select the byte within the word.] Stevens Institute of Technology - All rights reserved 1-40/68

41 Memory management in typical multitasking system [Diagram: task switching moves the active process between user processes and supervisory processes.] Stevens Institute of Technology - All rights reserved 1-41/68

42 Segmented memory management [Diagram: the processor's addresses pass through the MMU, which consults a per-process memory table to locate the process's segments in memory: user process 1 stack segment, user process 1 data segment, user process 1 code segment.] Stevens Institute of Technology - All rights reserved 1-42/68

43 Segmented MMU [Diagram: the processor issues a segment selector and a logical address; the selector indexes a segment descriptor table that supplies the segment's base and limit; the base is added to the logical address, and a comparison against the limit raises an access fault on overflow.] physical_address = segment_base_addr + logical_addr exception if (logical_addr > limit) Stevens Institute of Technology - All rights reserved 1-43/68
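
A minimal sketch (not from the slides) of the segment translation and limit check stated above; the descriptor layout is an assumed software model.

#include <stdint.h>
#include <stdbool.h>

struct segment_descriptor { uint32_t base; uint32_t limit; };

/* Returns true and fills *physical on success; false signals an access fault. */
bool segment_translate(const struct segment_descriptor *seg, uint32_t logical,
                       uint32_t *physical)
{
    if (logical > seg->limit)            /* exception if (logical_addr > limit) */
        return false;                    /* access fault                        */
    *physical = seg->base + logical;     /* physical = segment_base + logical   */
    return true;
}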

44 Issues with segmented memory management Changing mix of coresident processes: [Diagram: as processes come and go, user process 7's and user process 3's stack, data, and code segments end up scattered through memory, in contrast to the contiguous user process 1 stack, data, and code segments shown earlier.] The result is memory fragmentation Stevens Institute of Technology - All rights reserved 1-44/68

45 Paging memory management [Diagram: the processor's logical address is translated through a page directory and a page table to a page frame holding the data.] Stevens Institute of Technology - All rights reserved 1-45/68

46 Paging memory management [Diagram: logical address translated through a page directory and page table to a page frame.] With a single translation table and 4 KB pages, a 32-bit address has a 20-bit page number, so the table needs 2^20 entries of ~20 bits each, roughly 2.5 MB Stevens Institute of Technology - All rights reserved 1-46/68

47 Paging memory management [Diagram: logical address translated through a page directory and page table to a page frame.] With a single translation table and 4 KB pages, a 32-bit address has a 20-bit page number, so the table needs 2^20 entries of ~20 bits each, roughly 2.5 MB With two levels of paging, the top 10 bits of the address index the 1st-level page directory, the next 10 bits index a 2nd-level page table whose entry holds the physical page number, and the low 12 bits address the byte within the page Stevens Institute of Technology - All rights reserved 1-47/68

48 Paging memory management [Diagram: logical address translated through a page directory and page table to a page frame.] With two levels of paging, the top 10 bits of the address index the 1st-level page directory, the next 10 bits index a 2nd-level page table whose entry holds the physical page number, and the low 12 bits address the byte within the page With 32-bit directory and page-table entries and 4 KB tables, a minimal system with one directory and one page table (8 KB of tables) can manage 4 MB of memory, while a fully populated set of tables (~4 MB) can manage the entire 4 GB address space Stevens Institute of Technology - All rights reserved 1-48/68
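
A minimal sketch (not from the slides) of the two-level translation just described: 10 bits of page-directory index, 10 bits of page-table index, and a 12-bit offset within a 4 KB page. The table representation (a directory of table numbers indexing an array of page tables) is an assumed software model, not any particular MMU's entry format.

#include <stdint.h>

#define PAGE_SHIFT 12     /* 4 KB pages */

uint32_t walk(const uint32_t *directory, const uint32_t *const *tables, uint32_t vaddr)
{
    uint32_t dir_idx = (vaddr >> 22) & 0x3FF;        /* top 10 bits: page directory index */
    uint32_t tab_idx = (vaddr >> 12) & 0x3FF;        /* next 10 bits: page table index    */
    uint32_t offset  =  vaddr        & 0xFFF;        /* low 12 bits: byte within page     */

    uint32_t table_no = directory[dir_idx];          /* which 2nd-level table to use      */
    uint32_t frame    = tables[table_no][tab_idx];   /* physical page frame number        */

    return (frame << PAGE_SHIFT) | offset;           /* physical address                  */
}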

49 Virtual memory paging tables processor MMU memory storage memory access segment request Stevens Institute of Technology - All rights reserved 1-49/68

50 Virtual memory paging tables processor MMU memory storage memory access segment request (absent) Stevens Institute of Technology - All rights reserved 1-50/68

51 Virtual memory paging tables swapped out new processor MMU memory storage memory access segment request (absent) swap out segment Stevens Institute of Technology - All rights reserved 1-51/68

52 Virtual memory paging tables swapped out new new processor MMU memory storage memory access retry access swap in new segment Stevens Institute of Technology - All rights reserved 1-52/68

53 Virtual memory paging tables swapped out new new processor MMU memory storage memory access retry access Logical program size can be bigger than physical memory Stevens Institute of Technology - All rights reserved 1-53/68

54 Restartable instructions void main(void) { float x[10]; int y; z = subr(x, y); } Processor MMU cache memory request Stevens Institute of Technology - All rights reserved 1-54/68

55 Restartable instructions void main(void) { float x[10]; int y; z = subr(x, y); } Processor MMU cache memory exception Stevens Institute of Technology - All rights reserved 1-55/68

56 Restartable instructions void main(void) { float x[10]; int y; z = subr(x, y); } Processor MMU cache memory interrupt swap Stevens Institute of Technology - All rights reserved 1-56/68

57 Restartable instructions void main(void) { float x[10]; int y; z = subr(x, y); } Processor MMU cache memory start new task Stevens Institute of Technology - All rights reserved 1-57/68

58 Restartable instructions void main(void) { float x[10]; int y; z = subr(x, y); } Processor MMU cache memory segment ready swap segment Stevens Institute of Technology - All rights reserved 1-58/68

59 Restartable instructions void main(void) { float x[10]; int y; z = subr(x, y); } Processor MMU cache memory interrupt Stevens Institute of Technology - All rights reserved 1-59/68

60 Restartable instructions void main(void) { float x[10]; int y; z = subr(x, y); } Processor MMU cache memory restore state of previous process The processor must retain enough state (CPSR, stack, registers, etc.) to restore the running program following the segmentation fault Stevens Institute of Technology - All rights reserved 1-60/68

61 Tradeoffs with paging Advantages: Complete freedom in running arbitrary code size Transparency to memory limitations/control Disadvantages: Multiple memory accesses to retrieve data (page directory + page table) Stevens Institute of Technology - All rights reserved 1-61/68

62 Tradeoffs with paging Advantages: Complete freedom in running arbitrary code size Transparency to memory limitations/control Disadvantages: Multiple memory accesses to retrieve data (page directory + page table) Solution: Cache recently used page translations (Translation Look-aside Buffers) Stevens Institute of Technology - All rights reserved 1-62/68

63 Translation look-aside buffer [Diagram: the logical page number portion of the logical address is looked up in the TLB; on a hit, the stored physical page number replaces it and, combined with the page offset, forms the physical address.] Stevens Institute of Technology - All rights reserved 1-63/68
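
A minimal sketch (not from the slides) of the TLB idea: a small table of recent translations consulted before the page-table walk. The fully-associative organization, 32-entry size, and 4 KB page are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 32
#define PAGE_SHIFT  12

struct tlb_entry { bool valid; uint32_t vpn; uint32_t pfn; };   /* logical/physical page numbers */
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true and fills *paddr on a TLB hit; a miss falls back to the page-table walk. */
bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;              /* logical page number */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].pfn << PAGE_SHIFT) | (vaddr & 0xFFF);
            return true;                             /* hit */
        }
    }
    return false;                                    /* miss */
}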

64 Virtual and physical caches Virtual address processor MMU cache memory Virtual address Physical address Stevens Institute of Technology - All rights reserved 1-64/68

65 Virtual and physical caches Virtual address processor MMU cache memory Virtual address Physical address Cache access can start immediately after the processor produces an address; there is no need to activate the MMU if the data is in the cache. But the cache may contain synonyms: duplicate copies of the same data in the cache due to overlapping translations, and the processor may update only one of several synonyms. Stevens Institute of Technology - All rights reserved 1-65/68

66 Virtual and physical caches Physical address processor MMU cache memory Virtual address Physical address Stevens Institute of Technology - All rights reserved 1-66/68

67 Virtual and physical caches Physical address processor MMU cache memory Virtual address Physical address There are no problems with synonyms, since the cache contains copies of physical memory only. The MMU must be started up for all cache accesses, consuming power and potentially slowing cache access down. Stevens Institute of Technology - All rights reserved 1-67/68

68 Intel i7 cache architecture Per core: L1 instruction cache - 32 KB, 4-way set associative, LRU, 64-byte block, write-back; L1 data cache - 32 KB, 8-way set associative, LRU, 64-byte block, write-back; L2 unified cache - 256 KB, 8-way set associative, LRU, 64-byte block, write-back. Shared: L3 unified cache - 8 MB, 16-way set associative, 64-byte block, write-back. Memory (2^44-byte physical, 2^48-byte virtual address space) Stevens Institute of Technology - All rights reserved 1-68/68
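
A minimal sketch (not from the slides) working out the set count and address split for the 32 KB, 8-way, 64-byte-line L1 data cache listed above; treating the tag as all remaining bits of the 2^44-byte physical address is an assumption for illustration.

#include <stdio.h>

int main(void)
{
    int size_bytes = 32 * 1024;
    int ways       = 8;
    int line_bytes = 64;

    int sets        = size_bytes / (ways * line_bytes);   /* 32768 / 512 = 64 sets */
    int offset_bits = 6;                                   /* log2(64-byte line)    */
    int index_bits  = 6;                                   /* log2(64 sets)         */
    int tag_bits    = 44 - index_bits - offset_bits;       /* remaining physical-address bits (assumed) */

    printf("sets=%d offset=%d index=%d tag=%d\n", sets, offset_bits, index_bits, tag_bits);
    return 0;
}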
