The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Comp 411: Computer Organization, Fall 2006
Problem Set #10 - Solutions

Problem 1. Cache accounting

The diagram below illustrates a blocked, direct-mapped cache for a computer that uses 32-bit data words and 32-bit byte addresses. The block size is 4 words.

[Figure: a 16-row (Row 0 through Row 15) direct-mapped cache. Each row holds a valid bit V, a TAG field, and four 32-bit data words in columns labeled -00, -01, -10, and -11. The 32-bit address from the CPU splits into a tag, a row index, and a block offset; a comparator matches the stored tag against the address tag to produce the HIT signal.]

(A) What is the maximum number of words of data from main memory that can be stored in the cache at any one time? How many bits of the address are used to select which line of the cache is accessed? How many bits wide is the tag field? (5 pts)

A total of 4 x 16 = 64 words can be present in the cache at any given time. 4 bits of the address, <7:4>, are used to select the cache line, and the tag size is 24 bits.

(B) Briefly explain the purpose of the one-bit V field associated with each cache line. (5 pts)

The V bit indicates that the corresponding cache line contains valid data.

(C) Assume that memory location 0x3328C was present in the cache. Using the row and column labels from the figure, in what cache location(s) could we find the data from that memory location? What would the value(s) of the tag field(s) have to be for the cache row(s) in which the data appears? (5 pts)

Location 0x3328C would be found in word 3 (column -11) of row 8, and the tag field would contain 0x000332.
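As a sanity check, here is a small Python sketch (ours, not part of the original handout) that decodes a byte address into the tag, row, and word column for this cache, applied to the location from part (C) and the locations examined in part (D) below:

    # Decode a 32-bit byte address for the Problem 1 cache:
    # 16 rows, 4-word (16-byte) blocks, direct mapped.

    def decode(addr):
        word = (addr >> 2) & 0x3    # bits <3:2>: word column within the block
        row  = (addr >> 4) & 0xF    # bits <7:4>: cache row index
        tag  = addr >> 8            # bits <31:8>: 24-bit tag
        return tag, row, word

    for a in (0x3328C, 0x12368, 0x322F68, 0x2536038, 0x10F4):
        tag, row, word = decode(a)
        print(f"0x{a:07X}: tag=0x{tag:06X} row=0x{row:X} column=-{word:02b}")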

(D) Can data from locations 0x12368 and 0x322F68 be present in the cache at the same time? What about data from locations 0x2536038 and 0x10F4? Explain. (5 pts)

Locations 0x12368 and 0x322F68 map to the same cache line, 6, and therefore could not be in the cache simultaneously. Locations 0x2536038 and 0x10F4 could be in the cache simultaneously, since they map to different lines (3 and F, respectively).

(E) When an access causes a cache miss, how many words need to be fetched from memory to fill the appropriate cache location(s) and satisfy the request? (5 pts)

On a cache miss, 4 sequential words need to be fetched to fill the line.

Problem 2. Where are the Bits?

A popular processor is advertised as having a total on-chip cache size of 1,280 Kbytes. It is broken down into 256 Kbytes of Level 1 (L1) cache and 1024 Kbytes of Level 2 (L2) cache. The first-level cache is broken down into separate instruction and data caches, each with 128 Kbytes. All caches use a common block size of 64 bytes (sixteen 32-bit words). The L1 instruction and data caches are both 2-way set associative, and the unified L2 cache is 16-way set associative. Both L1 caches employ an LRU replacement strategy, while the L2 cache uses RANDOM replacement. The L1 data cache uses one dirty bit for each group of 4 words in a line (4 for each line). None of the caches employs a valid bit; instead, the reset handler is responsible for executing reads from a reserved region of memory to initialize the cache's contents.

(A) Determine which bits of a 32-bit address are used as tags, cache indices, and block selection for each of the three caches. (10 pts)

The block size for all caches is 64 bytes, or 16 words; thus the lower 4 + 2 = 6 address bits are used as the block offset. Each L1 cache is 128 Kbytes divided over 2 ways, making each way 64 Kbytes. This means that 16 - 6 = 10 bits are needed as the cache index, leaving the upper 16 bits as the cache tag. The L2 cache is 1024 Kbytes divided over 16 ways, once again making each way 64 Kbytes, which leads once again to a 10-bit index and a 16-bit tag. For all three caches:

    bits <31:16>: tag
    bits <15:6>:  cache index
    bits <5:0>:   block offset

(B) Compute how many actual bits are needed for each cache, accounting for tags, data, replacement, dirty, and any other overhead. (7 pts)

Each cache has 1024 sets (64 Kbytes per way / 64 bytes per block), a 16-bit tag per line, and 64 bytes/block x 8 bits/byte = 512 data bits per line. The L2 cache is assumed to carry the same 4 dirty bits per line as the L1 data cache; its RANDOM replacement needs no per-set state.

The L1 instruction cache has: 1024 x (1 LRU bit + 2 x (16 tag bits + 512 data bits)) = 1,082,368 bits
The L1 data cache has: 1024 x (1 LRU bit + 2 x (4 dirty bits + 16 tag bits + 512 data bits)) = 1,090,560 bits
The L2 cache has: 1024 x (16 x (4 dirty bits + 16 tag bits + 512 data bits)) = 8,716,288 bits

Assume both L1 caches have a 1-cycle access latency, the L2 cache requires 4 additional cycles, and main memory requires 20 cycles to load a line. A popular benchmark suite reports that this cache design achieves an instruction cache hit rate of 95% and a data cache hit rate of 80%. A hit rate of 98% has been reported for the L2 cache on the same benchmarks.
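The bit accounting in part (B) above can be reproduced with a short sketch (ours, under the same assumptions as the solution: 16-bit tags, 4 dirty bits per L2 line, and no storage cost for RANDOM replacement):

    # Overhead accounting for the three caches of Problem 2(B).

    SETS = 1024          # 64 KB per way / 64-byte blocks = 1024 sets
    DATA = 64 * 8        # 512 data bits per 64-byte line
    TAG  = 16            # 32 - 10 index bits - 6 offset bits

    l1_icache = SETS * (1 + 2 * (TAG + DATA))        # 1 LRU bit per set
    l1_dcache = SETS * (1 + 2 * (4 + TAG + DATA))    # plus 4 dirty bits/line
    l2_cache  = SETS * 16 * (4 + TAG + DATA)         # random: no LRU state

    print(l1_icache, l1_dcache, l2_cache)            # 1082368 1090560 8716288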

(C) What is the average effective access time for instruction fetches in this cache architecture (in cycles)? What is the average effective access time for data fetches (loads and stores) in this cache architecture (in cycles)? If you assume that about 20% of executed instructions access memory, what is the overall effective access time? (7 pts)

For instruction fetches: 0.95 x 1 + (1 - 0.95) x (0.98 x (1 + 4) + (1 - 0.98) x (1 + 4 + 20)) = 1.22 cycles
For data accesses: 0.80 x 1 + (1 - 0.80) x (0.98 x (1 + 4) + (1 - 0.98) x (1 + 4 + 20)) = 1.88 cycles
Overall (100 instruction fetches plus 20 data accesses per 100 instructions): 1.22 x 100/120 + 1.88 x 20/120 = 1.33 cycles

(D) The same company also sells a version of the chip with an L2 cache that is half the size (512 Kbytes). If in that chip the L2 hit rate is only 90% for the same benchmark, what is its overall effective access time in cycles? (6 pts)

For instruction fetches: 0.95 x 1 + (1 - 0.95) x (0.9 x (1 + 4) + (1 - 0.9) x (1 + 4 + 20)) = 1.3 cycles
For data accesses: 0.80 x 1 + (1 - 0.80) x (0.9 x (1 + 4) + (1 - 0.9) x (1 + 4 + 20)) = 2.2 cycles
Overall: 1.3 x 100/120 + 2.2 x 20/120 = 1.45 cycles

(E) Assuming that your on-chip memory budget is fixed at 1280 Kbytes, what might you suggest to improve the performance of this cache architecture? (5 pts)

The performance is largely dictated by the instruction access time, as seen from its contribution to the average access time. Therefore, the first place to start is the L1 instruction cache. Since the total cache size is fixed, increasing its hit ratio means either increasing associativity (carefully, since doing so might require more LRU bits) or increasing the block size (which would save some tag bits). Increasing the block size of the instruction cache is particularly attractive, since loading large blocks effectively prefetches instructions that are likely to be needed soon. The best choice is probably a combination of these two modifications. One last note: the I-cache hit rate is already quite high, meaning we may already have reached a point of diminishing returns. Thus, a small change to the D-cache might, in the end, provide a greater overall improvement. The only way to really know is to apply analysis based on instruction traces from the same benchmarks or applications.
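The access-time arithmetic in parts (C) and (D) is easy to script; the sketch below (ours) reproduces both cases:

    # Effective access time: L1 hit = 1 cycle, L2 adds 4, memory adds 20.

    def amat(l1_hit, l2_hit, l1=1, l2_extra=4, mem=20):
        miss = l2_hit * (l1 + l2_extra) + (1 - l2_hit) * (l1 + l2_extra + mem)
        return l1_hit * l1 + (1 - l1_hit) * miss

    for l2_hit in (0.98, 0.90):                    # parts (C) and (D)
        instr, data = amat(0.95, l2_hit), amat(0.80, l2_hit)
        overall = (100 * instr + 20 * data) / 120  # 20 data refs per 100 instrs
        print(f"L2 hit {l2_hit:.0%}: instr={instr:.2f} "
              f"data={data:.2f} overall={overall:.2f}")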

Problem 3. Trip Around the Old Block

Lee Hart is assigned the task of designing the cache architecture for an upcoming processor. He is given the following design constraints:

- Cache accesses, for both hits and misses, require 1 clock cycle.
- On cache misses, an entire cache block (multiple words) is fetched from main memory. Main memory accesses are not initiated until after the cache is inspected.
- A main-memory access consists of 2 cycles to send the address, 3 cycles of idle time for the main-memory access, and then 1 cycle per word to transfer it to the cache. Moreover, the processor stalls until the last word of the block has arrived. Thus, if the block size is B words (of 32 bits each), a cache miss costs 1 + 2 + 3 + B cycles.

The following table gives the average cache miss rates of a 1-Mbyte cache for various block sizes:

    Block size (B)    Miss ratio (m), %
    1                 3.2
    4                 1.6
    16                0.8
    64                0.4
    256               0.3

(A) Write an expression for the average memory access time for a 1-Mbyte cache and a B-word block size (in terms of the miss ratio m and B). (5 pts)

(1 - m) x 1 + m x (1 + 2 + 3 + B) = 1 + 5m + Bm

(B) What block size yields the best average memory access time? (5 pts)

B = 4. Evaluating the expression over the table: B = 1 gives 1.192 cycles, B = 4 gives 1.144, B = 16 gives 1.168, B = 64 gives 1.276, and B = 256 gives 1.783.

(C) If the main memory bus width is expanded to 128 bits (4 words), reducing the number of cycles for transferring the block proportionately, what is the optimal block size? (5 pts)

The expression becomes 1 + 5m + Bm/4, and the best block size becomes 16 (1.072 cycles, vs. 1.096 for B = 4 and 1.084 for B = 64).

(D) Suppose that expanding the memory bus to 128 bits increases the main-memory access latency by 6 clocks (for a total of 12 cycles of overhead before the first transfer); what then is the optimal block size? (5 pts)

The expression becomes (1 - m) x 1 + m x (1 + 2 + 9 + B/4) = 1 + 11m + Bm/4, and the best block size becomes 64 (1.108 cycles, vs. 1.120 for B = 16 and 1.225 for B = 256).

Problem 4. Approximate LRU Replacement

Immediately after the cache-replacement-algorithm lecture in Comp 411, Lori Acan has an epiphany. It dawns on her that it is possible to approximate LRU replacement in a 4-way set with only 3 bits per set: 2 fewer than an exact LRU implementation and only 1 bit more than the FIFO algorithm. Her scheme works as follows. She partitions the 4-way set into two halves and treats each half as a 2-way set, allotting one replacement bit for tracking the LRU line in each half. This requires 2 bits, one for each half. The third bit keeps track of the most recently used half of the set. On a cache miss, the least recently used line in the least recently used half is replaced.

(A) Simulate the behavior of this replacement algorithm on a 4-word fully associative cache using the instruction sequence given in lecture and repeated below, where $t3 = 0x4000. Assume the initial values of the replacement bits are 0,00. Estimate the miss rate for this code sequence in the steady state, noting the values of the 3 replacement bits on each cache access. (5 pts)

    Address   Instruction
    400:      lw   $t0,0($t3)
    404:      addi $t3,$t3,4
    408:      bne  $t0,$0,400
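Before the tabulated answer, here is a simulation sketch (ours, using word-block addresses: 0x100 to 0x102 for the three instruction words, 0x1000 + n for the data word of iteration n) of Lori's 3-bit scheme on this loop:

    # Lori's approximate LRU for one 4-line set: one bit marks the most
    # recently used half (lines 0-1 vs. 2-3); one bit per half marks its
    # least recently used line. Victim = LRU line of the LRU half.

    class PseudoLRU4:
        def __init__(self):
            self.mru_half = 0
            self.lru = [0, 0]             # LRU line (0 or 1) within each half
            self.tags = [None] * 4

        def touch(self, line):
            half, within = divmod(line, 2)
            self.mru_half = half
            self.lru[half] = 1 - within   # other line of the half is now LRU

        def access(self, tag):
            if tag in self.tags:                    # hit
                self.touch(self.tags.index(tag))
                return True
            half = 1 - self.mru_half                # miss: evict LRU of LRU half
            victim = 2 * half + self.lru[half]
            self.tags[victim] = tag
            self.touch(victim)
            return False

    cache, misses, refs = PseudoLRU4(), 0, 0
    for n in range(100):                            # 100 loop iterations
        for tag in (0x100, 0x1000 + n, 0x101, 0x102):
            refs += 1
            misses += not cache.access(tag)
    print(f"miss ratio {misses / refs:.4f}")        # 0.2575, approaching 1/4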

The miss ratio is ¼ in the steady state. Using word-block addresses (the three instruction words are 0x100, 0x101, and 0x102; the loaded data words are 0x1000, 0x1001, ...), the trace is:

    Addr    Line #  Miss?   Replacement bits
    100     2       M       1,01
    1000    0       M       0,11
    101     3       M       1,10
    102     1       M       0,00
    100     2               1,01
    1001    0       M       0,11
    101     3               1,10
    102     1               0,00
    100     2               1,01
    1002    0       M       0,11
    101     3               1,10
    102     1               0,00
    100     2               1,01
    1003    0       M       0,11
    101     3               1,10
    102     1               0,00

(B) Explain how Lori's approximate LRU algorithm can be used for any N-way set where N = 2^k. How many bits are required per set in these cases? (5 pts)

Halve the set recursively until each group contains only two lines. The first bit indicates which half of the set was most recently used; within each half, another bit indicates which quarter was most recently used; and so on. The bits at the last level each indicate which line of a pair is least recently used. On a miss, the victim is found by following these bits from the top, always descending into the less recently used group. There are log2(N) levels in total, so the number of bits needed per set is 1 + 2 + 4 + ... + N/2 = N - 1.

(C) Give a memory reference string for which Lori's 4-way LRU replacement algorithm differs from an exact LRU replacement. (5 pts)

If the replacement bits are 0,00, then the line pair (2, 3) was accessed less recently than the pair (0, 1), line 2 was accessed before line 3, and line 0 was accessed before line 1. Three true access orders are consistent with this state: (2, 3, 0, 1), (2, 0, 3, 1), and (0, 2, 3, 1). If the true access order was (0, 2, 3, 1), Lori's scheme replaces line 2 even though line 0 is the least recently used. The trace below shows such a divergence:

    Addr    Line #  Miss?   Replacement bits
    100     2       M       1,01
    101     0       M       0,11
    102     3       M       1,10
    103     1       M       0,00
    101     0               0,10
    102     3               1,10
    104     1       M       0,00

When the last miss happens, line 2 is the least recently used, but under Lori's algorithm the second least recently used line, line 1, is replaced.
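For comparison, a standalone sketch (ours) that replays the part (C) reference string under exact LRU shows the divergence directly:

    # Exact LRU on the 4(C) reference string: report which tags get evicted.

    def exact_lru_evictions(refs, ways=4):
        recency, evicted = [], []        # resident tags, least -> most recent
        for tag in refs:
            if tag in recency:
                recency.remove(tag)      # hit: refresh recency below
            elif len(recency) == ways:
                evicted.append(recency.pop(0))   # miss when full: evict true LRU
            recency.append(tag)
        return evicted

    refs = [0x100, 0x101, 0x102, 0x103, 0x101, 0x102, 0x104]
    print([hex(t) for t in exact_lru_evictions(refs)])
    # -> ['0x100']: exact LRU evicts 0x100 (in line 2), whereas Lori's
    #    scheme, per the trace above, evicts 0x103 (in line 1).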

(D) In the worst case, what will be the highest element in an ordered list of recent accesses that Lori's approximate LRU replacement algorithm would throw out? (5 pts)

The case given above is the worst case: at worst, the second least recently used line is thrown out. The third least recently used line is either the most recently used line of the less recently used half, or the least recently used line of the more recently used half; in neither case can it be chosen as the victim.
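This claim can also be checked by brute force. The sketch below (ours, reimplementing the 3-bit state from the 4(A) sketch inline) records the worst recency rank of any evicted line over random reference strings:

    # Brute-force check of 4(D): track the rank (1 = least recently used)
    # of each evicted tag in the true recency order of the resident tags.
    import random

    random.seed(411)
    worst_rank = 0
    for _ in range(2000):
        mru_half, lru = 0, [0, 0]        # Lori's 3 replacement bits
        tags = [None] * 4                # tag stored in each line
        resident = []                    # resident tags, least -> most recent
        for _ in range(40):
            tag = random.randrange(6)
            if tag in tags:
                line = tags.index(tag)                   # hit
            else:
                half = 1 - mru_half                      # miss: pLRU victim
                line = 2 * half + lru[half]
                if tags[line] is not None:
                    rank = resident.index(tags[line]) + 1
                    worst_rank = max(worst_rank, rank)
                    resident.remove(tags[line])
                tags[line] = tag
            mru_half, lru[line // 2] = line // 2, 1 - line % 2
            if tag in resident:
                resident.remove(tag)
            resident.append(tag)
    print("worst victim rank:", worst_rank)  # expect 2: never the 3rd least used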