Chapter 5 The Memory System

5.1. The block diagram is essentially the same as in Figure 5., except that 16 rows (of four 5 8 chips) are needed. Address lines A 8 are connected to all chips. Address lines A 9 are connected to a 4-bit decoder to select one of the 16 rows.

5.2. The minimum refresh rate is given by

$$\frac{50 \times 10^{-15} \times (4.5 - 3)}{9 \times 10^{-12}} = 8.33 \times 10^{-3}\ \text{s}$$

Therefore, each row has to be refreshed every 8 ms.

5.3. Control signals $M_{in}$ and $M_{out}$ are needed to control storing of data into the memory cells and to gate the data read from the memory onto the bus, respectively. A possible circuit is:

[Figure: Read/Write circuits and latches; labels $D_{in}$, $D_{out}$, $M_{in}$, $M_{out}$, Clk, D, Q, Data]

5.4. (a) It takes 5 + 8 = 13 clock cycles.

Total time = $\frac{13}{133 \times 10^{6}} = 0.098 \times 10^{-6}$ s = 98 ns

Latency = $\frac{5}{133 \times 10^{6}} = 0.038 \times 10^{-6}$ s = 38 ns

(b) It takes twice as long to transfer 64 bytes, because two independent 32-byte transfers have to be made. The latency is the same, i.e. 38 ns.
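The timing arithmetic in Problem 5.4 can be checked with a short sketch. The 5-cycle latency and 8-cycle burst come from the solution text; the 133 MHz clock is an assumption, chosen because it reproduces the 98 ns total quoted in the solution, and the names `CLOCK_HZ` and `cycles_to_ns` are illustrative only.

```python
# Sketch of the Problem 5.4 timing arithmetic.
# ASSUMPTION: a 133 MHz memory clock (not legible in the source text).
CLOCK_HZ = 133e6
LATENCY_CYCLES = 5   # cycles before the first word arrives (from the solution)
BURST_CYCLES = 8     # cycles to transfer the rest of the burst (from the solution)


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a number of clock cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


total_cycles = LATENCY_CYCLES + BURST_CYCLES
total_ns = cycles_to_ns(total_cycles)
latency_ns = cycles_to_ns(LATENCY_CYCLES)

print(f"{total_cycles} cycles, total {total_ns:.0f} ns, latency {latency_ns:.0f} ns")
```

Under these assumptions a second, independent burst doubles the total transfer time but leaves the latency unchanged, which is the point of part (b).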

5.5. A faster processor chip will result in increased performance, but the amount of increase will not be directly proportional to the increase in processor speed, because the cache miss penalty will remain the same if the main memory speed is not improved.

5.6. (a) Main memory address length is 16 bits. TAG field is 6 bits. BLOCK field is 3 bits (8 blocks). WORD field is 7 bits (128 words per block).

(b) The program words are mapped on the cache blocks as follows:

Cache block   Main memory words
0             0-127, 1024-1151
1             128-255, 1152-1279
2             256-383, 1280-1407
3             384-511, 1408-1535
4             512-639
5             640-767
6             768-895
7             896-1023

[Figure residue: the Start and End of the program and the loop boundaries were marked on this map.]

Hence, the sequence of reads from the main memory blocks into cache blocks is:

0, 1, 2, 3, 4, 5, 6, 7, 0, 1 (Pass 1); 0, 1, 0, 1 (Pass 2); ...; 0, 1, 0, 1 (Pass 9); 0, 1, 0, 1, 2, 3 (Pass 10)

As this sequence shows, both the beginning and the end of the outer loop use blocks 0 and 1 in the cache. They overwrite each other on each pass through the loop. Blocks 2 to 7 remain resident in the cache until the outer loop is completed.

The total time for reading the blocks from the main memory into the cache is therefore

$(10 + 4 \times 9 + 2) \times 128 \times 10\tau = 61{,}440\tau$

Executing the program out of the cache:

Outer loop − inner loop = $[(1200 - 22) - (239 - 164)] \times 10 \times 1\tau = 11{,}030\tau$

Inner loop = $(239 - 164) \times 200 \times 1\tau = 15{,}000\tau$

End section of program = $1500 - 1200 = 300 \times 1\tau$

Total execution time = $87{,}770\tau$

5.7. In the first pass through the loop, the Add instruction is stored at address 4 in the cache, and its operand (AC) at address 6. Then the operand is overwritten by the Decrement instruction. The BNE instruction is stored at address. In the second pass, the value 5D9 overwrites the BNE instruction, then BNE is read from the main memory and again stored in location. The contents of the cache, the number of words read from the main memory and from the cache, and the execution time for each pass are as shown below.

After pass No. | Cache contents | MM accesses | Cache accesses | Time

5E BNE 5D Add 4 4 τ 5D Dec 5E BNE 5D Add τ 5D Dec 5E BNE AA D7 5D Add τ 5D Dec Total 7 5 75 τ
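The direct-mapped address split and the block-loading time of Problem 5.6 can be sketched as follows. The field widths (6-bit tag, 3-bit block, 7-bit word) come from part (a). The cost of 10τ per word read from main memory and the block-read counts per pass (10, then 4 on each of 9 passes, then 2 more) are assumptions that reproduce the 44τ-ending total in the solution; `split_address` is a hypothetical helper name.

```python
# Sketch for Problem 5.6: direct-mapped address fields and block-load time.
WORD_BITS = 7    # 128 words per block (from part (a))
BLOCK_BITS = 3   # 8 cache blocks (from part (a))
WORDS_PER_BLOCK = 1 << WORD_BITS
MM_WORD_TAU = 10  # ASSUMPTION: main-memory word access costs 10 tau


def split_address(addr):
    """Return (tag, cache block, word) for a 16-bit word address."""
    word = addr & (WORDS_PER_BLOCK - 1)
    block = (addr >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
    tag = addr >> (WORD_BITS + BLOCK_BITS)
    return tag, block, word


# ASSUMPTION: block reads per pass as described in the lead-in.
block_reads = 10 + 4 * 9 + 2
load_time_tau = block_reads * WORDS_PER_BLOCK * MM_WORD_TAU

print(split_address(1500))   # word 1500 shares cache block 3 with words 384-511
print(load_time_tau)
```

Words 1024 and above wrap around to cache blocks 0, 1, 2 and 3, which is why the start and end of the loop keep evicting each other in blocks 0 and 1.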

5.8. All three instructions are stored in the cache after the first pass, and they remain in place during subsequent passes. In this case, there is a total of 6 read operations from the main memory and 6 from the cache. Execution time is 66 τ. Instructions and data are best stored in separate caches to avoid the data overwriting instructions, as in Problem 5.7.

5.9. (a) 4096 blocks of 128 words each require 12 + 7 = 19 bits for the main memory address.

(b) TAG field is 8 bits. SET field is 4 bits. WORD field is 7 bits.

5.10. (a) TAG field is bits. SET field is 4 bits. WORD field is 6 bits.

(b) Words 0, 1, 2, ..., 4351 occupy blocks 0 to 67 in the main memory (MM). After blocks 0, 1, 2, ..., 63 have been read from MM into the cache on the first pass, the cache is full. Because the replacement algorithm is LRU, MM blocks that occupy the first four sets of the 16 cache sets are always overwritten before they can be used on a successive pass. In particular, MM blocks 0, 16, 32, 48, and 64 continually displace each other in competing for the 4 block positions in cache set 0. The same thing occurs in cache set 1 (MM blocks 1, 17, 33, 49, 65), cache set 2 (MM blocks 2, 18, 34, 50, 66) and cache set 3 (MM blocks 3, 19, 35, 51, 67). MM blocks that occupy the last 12 sets (sets 4 through 15) are fetched once on the first pass and remain in the cache for the next 9 passes. On the first pass, all 68 blocks of the loop must be fetched from the MM. On each of the 9 successive passes, blocks in the last 12 sets of the cache (4 × 12 = 48) are found in the cache, and the remaining 20 (68 − 48) blocks must be fetched from the MM.

Improvement factor = Time without cache / Time with cache = $\frac{10 \times 68 \times 10\tau}{1 \times 68 \times 11\tau + 9(20 \times 11\tau + 48 \times 1\tau)} = 2.15$

5.11. This replacement algorithm is actually better on this particular large loop example. After the cache has been filled by the main memory blocks 0, 1, ..., 63 on the first pass, block 64 replaces block 48 in set 0. On the second pass, block 48 replaces block 32 in set 0.
On the third pass, block 32 replaces block 16, and on the fourth pass, block 16 replaces block 0. On the fifth pass, there are two replacements: 0 kicks out 64, and 64 kicks out 48. On the sixth, seventh, and eighth passes, there is only one replacement in set 0. On the ninth pass there are two replacements in set 0, and on the final pass there is one replacement. The situation is similar in sets 1, 2, and 3. Again, there is no contention in sets 4 through 15. In total, there are 11 replacements in set 0 in passes 2 through 10. The same is true in sets 1, 2, and 3. Therefore, the improvement factor is

$\frac{10 \times 68 \times 10\tau}{1 \times 68 \times 11\tau + 4 \times 11 \times 11\tau + (9 \times 68 - 44) \times 1\tau} = 3.8$
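The replacement behaviour described in Problems 5.10 and 5.11 can be reproduced with a small simulation. This is a sketch under stated assumptions: a 4-way set-associative cache with 16 sets, a loop of 68 blocks executed 10 times, and per-block costs of 10τ for a main-memory access, 1τ for a cache access, and 11τ for a miss. The second policy evicts the most recently used block, which matches the replacement sequence narrated in Problem 5.11; the function names are illustrative.

```python
# Sketch: LRU versus most-recently-used replacement on the 68-block loop.
NUM_SETS = 16
WAYS = 4
LOOP_BLOCKS = 68
PASSES = 10
MM_TAU = 10     # ASSUMPTION: cost of reading a block without a cache
CACHE_TAU = 1   # ASSUMPTION: cost of a cache hit


def count_misses(policy):
    """Count misses; each set is a list ordered least to most recently used."""
    sets = [[] for _ in range(NUM_SETS)]
    misses = 0
    for _ in range(PASSES):
        for block in range(LOOP_BLOCKS):
            lines = sets[block % NUM_SETS]
            if block in lines:
                lines.remove(block)          # hit: re-appended as most recent
            else:
                misses += 1
                if len(lines) == WAYS:       # set full: evict per policy
                    lines.pop(0 if policy == "LRU" else -1)
            lines.append(block)
    return misses


def improvement_factor(misses):
    """Time without cache divided by time with cache."""
    accesses = PASSES * LOOP_BLOCKS
    hits = accesses - misses
    with_cache = misses * (MM_TAU + CACHE_TAU) + hits * CACHE_TAU
    return accesses * MM_TAU / with_cache


for policy in ("LRU", "MRU"):
    m = count_misses(policy)
    print(policy, m, round(improvement_factor(m), 2))
```

With LRU the five blocks competing for each of sets 0 to 3 always miss, while evicting the most recently used block lets three of them survive from pass to pass, which is why the second factor is larger.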

5.12. For the first loop, the contents of the cache are as indicated in Figures 5. through 5.. For the second loop, they are as follows.

(a) Direct-mapped cache

Contents of data cache after pass: j = 9 i = i = i = 5 i = 7 i = 9 A(,8) A(,) A(,) A(,4) A(,6) A(,8) 4 A(,9) A(,) A(,) A(,5) A(,7) A(,9) 5 6 7

(b) Associative-mapped cache

4 5 6 7 Contents of data cache after pass: j = 9 i = i = 5 i = 9 A(,8) A(,8) A(,8) A(,6) A(,9) A(,9) A(,9) A(,7) A(,) A(,) A(,) A(,8) A(,) A(,) A(,) A(,9) A(,4) A(,5) A(,6) A(,7) A(,4) A(,5) A(,6) A(,7) A(,) A(,) A(,4) A(,5) A(,) A(,) A(,4) A(,5)

(c) Set-associative-mapped cache

Set Set Contents of data cache after pass: j = 9 i = A(,8) A(,) A(,9) A(,) A(,6) A(,) A(,7) A(,) i = 7 A(,6) A(,7) A(,4) A(,5) i = 9 A(,6) A(,7) A(,8) A(,9)

In all cases, all elements are overwritten before they are used in the second loop. This suggests that the LRU algorithm may not lead to good performance if used with arrays that do not fit into the cache. The performance can be improved by introducing some randomness in the replacement algorithm.

5.13. The two least-significant bits of an address, $A_{1-0}$, specify a byte within a 32-bit word. For a direct-mapped cache, bits $A_{4-2}$ specify the block. For a set-associative-mapped cache, bit $A_2$ specifies the set.

(a) Direct-mapped cache

4 5 6 7 Pass [] [4] [8] [4C] [F4] Contents of data cache after: Pass Pass Pass 4 [] [] [] [4] [8] [4C] [F4] [4] [8] [4C] [F4] [4] [8] [4C] [F4]

Hit rate = 33/48 = 0.69

(b) Associative-mapped cache

Contents of data cache after: Pass Pass Pass Pass 4 [] [] [] [] [4] [4] [4] [4] [4C] [C] [4C] 4 [F4] [F4] [F4] [F4] 5 [C] [4C] 6 7 [C] [4C] [C]

Hit rate = 21/48 = 0.44

(c) Set-associative-mapped cache

Contents of data cache after: Pass [] Pass [] Pass [] Pass 4 [] Set [8] [8] [8] [8] [4] [4] [4] [4] Set [4C] [F4] [F4] [4C] [F4] [F4] [4C] [4C]

Hit rate = 30/48 = 0.63

5.14. The two least-significant bits of an address, $A_{1-0}$, specify a byte within a 32-bit word. For a direct-mapped cache, bits $A_{4-2}$ specify the block. For a set-associative-mapped cache, bit $A_2$ specifies the set.

(a) Direct-mapped cache

Contents of data cache after: Pass [] [4] Pass Pass Pass 4 [] [] [] [4] [4] [4] [48] [48] [48] [48] [4C] [4C] [4C] [4C] [F4] [F4] [F4] [F4]

Hit rate = 37/48 = 0.77

(b) Associative-mapped cache

Contents of data cache after: Pass Pass Pass Pass 4 [] [] [] [] [4] [4] [4] [4] [48] [48] [4C] [4C] [F4] [F4] [F4] [F4] [48] [48] [4C] [4C]

Hit rate = 34/48 = 0.71

(c) Set-associative-mapped cache

Set Set Contents of data cache after: Pass Pass Pass Pass 4 [] [] [] [] [4] [4] [4] [4] [F4] [F4] [F4] [F4] [48] [48] [4C] [4C] [48] [48] [4C] [4C]

Hit rate = 34/48 = 0.71

5.15. The block size (number of words in a block) of the cache should be at least as large as $2^k$, in order to take full advantage of the multiple module memory when transferring a block between the cache and the main memory. Power of 2 multiples of $2^k$ work just as efficiently, and are natural because block size is $2^k$ for k bits in the word field.

5.16. Larger size: fewer misses if most of the data in the block are actually used; wasteful if much of the data are not used before the cache block is ejected from the cache. Smaller size: more misses.

5.17. For 16-word blocks the value of M is 1 + 8 + 3 × 4 + 4 = 25 cycles. Then

Time without cache / Time with cache = 4.04

In order to compare the 8-word and 16-word blocks, we can assume that two 8-word blocks must be brought into the cache for each 16-word block. Hence, the effective value of M is 2 × 17 = 34. Then

Time without cache / Time with cache = 3.3

Similarly, for 4-word blocks the effective value of M is 4(1 + 8 + 4) = 52 cycles. Then

Time without cache / Time with cache = 2.42

Clearly, interleaving is more effective if larger cache blocks are used.

5.18. The hit rates are $h_1 = h_2 = h$, with h = 0.95 for instructions and h = 0.9 for data. The average access time is computed as

$t_{ave} = hC_1 + (1 - h)hC_2 + (1 - h)^2 M$

(a) With interleaving M = 17. Then

$t_{ave} = 0.95 \times 1 + 0.05 \times 0.95 \times 10 + 0.0025 \times 17 + 0.3(0.9 \times 1 + 0.1 \times 0.9 \times 10 + 0.01 \times 17) = 2.0585$ cycles

(b) Without interleaving M = 38. Then $t_{ave}$ = 2.174 cycles.

(c) Without interleaving the average access takes 2.174/2.0585 = 1.056 times longer.

5.19. Suppose that it takes one clock cycle to send the address to the L2 cache, one cycle to access each word in the block, and one cycle to transfer a word from the L2 cache to the L1 cache. This leads to C = 6 cycles. (a) With interleaving M = + 8 + 4 =. Then t ave =.79 cycles. (b) Without interleaving M = + 8 + 4 + =. Then t ave =.86 cycles. (c) Without interleaving the average access takes.86/.79 =.9 times longer.

5.20. The analogy is good with respect to: relative sizes of toolbox, truck and shop versus L1 cache, L2 cache and main memory; relative access times; and relative frequency of use of tools in the storage places versus the data accesses in caches and the main memory. The analogy fails with respect to the facts that: at the start of a working day the tools placed into the truck and the toolbox are preselected based on the experience gained on previous jobs, while in the case of a new program that is run on a computer there is no relevant data loaded into the caches before execution begins;

most of the tools in the toolbox and the truck are useful in successive jobs, while the data left in a cache by one program are not useful for the subsequent programs; and tools displaced by the need to use other tools are never thrown away, while data in the cache blocks are simply overwritten if the blocks are not flagged as dirty.

5.21. Each 32-bit number comprises 4 bytes. Hence, each page holds 1024 numbers. There is space for 256 pages in the 1M-byte portion of the main memory that is allocated for storing data during the computation.

(a) Each column is one page; there will be 1024 page faults.

(b) Processing of entire columns, one at a time, would be very inefficient and slow. However, if only one quarter of each column (for all columns) is processed before the next quarter is brought in from the disk, then each element of the array must be loaded into the memory twice. In this case, the number of page faults would be 2048.

(c) Assuming that the computation time needed to normalize the numbers is negligible compared to the time needed to bring a page from the disk:

Total time for (a) is 1024 × 40 ms = 41 s

Total time for (b) is 2048 × 40 ms = 82 s

5.22. The operating system may increase the main memory pages allocated to a program that has a large number of page faults, using space previously allocated to a program with few page faults.

5.23. Continuing the execution of an instruction interrupted by a page fault requires saving the entire state of the processor, which includes saving all registers that may have been affected by the instruction as well as the control information that indicates how far the execution has progressed. The alternative of re-executing the instruction from the beginning requires a capability to reverse any changes that may have been caused by the partial execution of the instruction.

5.24. The problem is that a page fault may occur during intermediate steps in the execution of a single instruction.
The page containing the referenced location must be transferred from the disk into the main memory before execution can proceed. Since the time needed for the page transfer (a disk operation) is very long compared to instruction execution time, a context switch will usually be made. (A context switch consists of preserving the state of the currently executing program, and switching the processor to the execution of another program that is resident in the main memory.) The page transfer, via DMA, takes place while this other program executes. When the page transfer is complete, the original program can be resumed. Therefore, one of two features is needed in a system where the execution of an individual instruction may be suspended by a page fault. The first possibility

is to save the state of instruction execution. This involves saving more information (temporary programmer-transparent registers, etc.) than needed when a program is interrupted between instructions. The second possibility is to unwind the effects of the portion of the instruction completed when the page fault occurred, and then execute the instruction from the beginning when the program is resumed.

5.25. (a) The maximum number of bytes that can be stored on this disk is $24 \times 14000 \times 400 \times 512 = 68.8 \times 10^{9}$ bytes.

(b) The data transfer rate is $(400 \times 512 \times 7200)/60 = 24.58 \times 10^{6}$ bytes/s.

(c) Need 9 bits to identify a sector, 14 bits for a track, and 5 bits for a surface. Thus, a possible scheme is to use address bits $A_{8-0}$ for sector, $A_{22-9}$ for track, and $A_{27-23}$ for surface identification. Bits $A_{31-28}$ are not used.

5.26. The average seek time and rotational delay are 6 and 3 ms, respectively. The average data transfer rate from a track to the data buffer in the disk controller is 4 Mbytes/s. Hence, it takes 8K/4M =. ms to transfer a block of data.

(a) The total time needed to access each block is 9 +. = 9. ms. The portion of time occupied by seek and rotational delay is 9/9. =.97 = 97%.

(b) Only rotational delays are involved in 9% of the cases. Therefore, the average time to access a block is.9 +. 9 +. =.89 ms. The portion of time occupied by seek and rotational delay is.6/.89 =.9 = 9%.

5.27. (a) The rate of transfer to or from any one disk is 30 megabytes per second. Maximum memory transfer rate is $4/(10 \times 10^{-9}) = 400 \times 10^{6}$ bytes/s, which is 400 megabytes per second. Therefore, 13 disks can be simultaneously flowing data to/from the main memory.

(b) 8K/30M = 0.27 ms is needed to transfer 8K bytes to/from the disk. Seek and rotational delays are 6 ms and 3 ms, respectively. Therefore, 8K/4 = 2K words are transferred in 9.27 ms. But in 9.27 ms there are $(9.27 \times 10^{-3})/(0.01 \times 10^{-6}) = 927 \times 10^{3}$ memory (word) cycles available. Therefore, over a long period of time, any one disk steals only (2/927) × 100 = 0.2% of available memory cycles.

5.28.
The sector size should influence the choice of page size, because the sector is the smallest directly addressable block of data on the disk that is read or written as a unit. Therefore, pages should be some small integral number of sectors in size.

5.29. The next record, j, to be accessed after a forward read of record i has just been completed might be in the forward direction, with probability 0.5 (4 records distance to the beginning of j), or might be in the backward direction with probability 0.5 (6 records distance to the beginning of j plus 2 direction reversals). Time to scan over one record and an interrecord gap is the record time plus the gap time, which comes to 5.5 ms.

Therefore, average access and read time is

$0.5(4 \times 5.5) + 0.5(6 \times 5.5 + 2 \times 225) + 5.5 = 258$ ms

If records can be read while moving in both directions, average access and read time is

$0.5(4 \times 5.5) + 0.5(5 \times 5.5 + 225) + 5.5 = 142.75$ ms

Therefore, the average percentage gain is (258 − 142.75)/258 = 44.7%.

The major gain is because the records being read are relatively close together, and one less direction reversal is needed.
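The tape access times in Problem 5.29 can be recomputed with a short sketch. The 5.5 ms per record and the 0.5 probabilities come from the solution; the 225 ms direction-reversal time is an assumption, inferred because it is the value that makes both access times and the 44.7% gain agree. The function and constant names are illustrative only.

```python
# Sketch of the Problem 5.29 average access-and-read time.
RECORD_MS = 5.5       # time to pass one record plus one gap (from the solution)
REVERSAL_MS = 225.0   # ASSUMPTION: time for one direction reversal
P_FORWARD = 0.5       # probability the next record lies ahead (from the solution)


def avg_access_ms(forward_records, backward_records, reversals):
    """Expected time to reach the next record and then read it."""
    forward = forward_records * RECORD_MS
    backward = backward_records * RECORD_MS + reversals * REVERSAL_MS
    return P_FORWARD * forward + (1 - P_FORWARD) * backward + RECORD_MS


# Forward-only reading: a backward seek passes 6 records and reverses twice.
one_way = avg_access_ms(4, 6, 2)
# Reading in both directions: 5 records and a single reversal.
two_way = avg_access_ms(4, 5, 1)
gain_percent = (one_way - two_way) / one_way * 100

print(one_way, two_way, round(gain_percent, 1))
```

Almost all of the saving comes from the reversal that is avoided, since one reversal costs far more than scanning a record.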