ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II


ELEC 5200/6200 Computer Architecture and Design, Spring 2017
Lecture 7: Memory Organization Part II
Ujjwal Guin, Assistant Professor
Department of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849
http://www.auburn.edu/~uzg0005/
Adapted from Dr. Chen-Huan Chiang (Intel) and Prof. Vishwani D. Agrawal (Auburn University)
[Adapted from Computer Organization and Design, Patterson & Hennessy, 2014]

Designing a Computer
[Block diagram: Input and Output devices connect to the Datapath and Control, which together form the Central Processing Unit (CPU), or Processor.]

Types of Computer Memories
From the cover of: A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Upper Saddle River, New Jersey: Pearson Prentice Hall, 2006.

Random Access Memory (RAM)
[Figure: address bits enter an address decoder that selects a row of the cell array; read/write circuits connect the array to the data bits.]

Six-Transistor SRAM Cell
[Cell schematic: the bit is stored on a pair of cross-coupled inverters; the word line gates two access transistors connecting the cell to the complementary bit lines (bit and bit-bar).]

Dynamic RAM (DRAM) Cell
[Cell schematic: a single transistor, gated by the word line, connects a storage capacitor to the bit line.]
The single-transistor DRAM cell is Robert Dennard's 1967 invention.

Classical RAM Organization (~Square)
[Figure: a row decoder, driven by the row address, selects one word (row) line of the RAM cell array; bit (data) lines run to a column selector and I/O circuits, which the column address steers to a data bit or word. Each intersection is a 6-T SRAM cell or a 1-T DRAM cell.]
One memory row holds a block of data, so the column address selects the requested bit or word from that block.
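
As a concrete sketch of the row/column split, the snippet below decodes a cell address for a hypothetical 1024 x 1024 (1 Mbit) square array; the array size and function name are illustrative assumptions, not from the slides.

```c
#include <stdio.h>

/* Hypothetical 1024 x 1024 (1 Mbit) square array: a 20-bit cell
   address splits into a 10-bit row address for the row decoder and
   a 10-bit column address for the column selector. */
void decode(unsigned addr)
{
    unsigned row = (addr >> 10) & 0x3FF; /* selects one word (row) line */
    unsigned col = addr & 0x3FF;         /* selects the bit in that row */
    printf("addr %u -> row %u, col %u\n", addr, row, col);
}

int main(void)
{
    decode(2049);    /* row 2, col 1       */
    decode(1048575); /* row 1023, col 1023 */
    return 0;
}
```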

Classical DRAM Organization (~Square Planes)
[Figure: M identical bit planes. In each plane, a row decoder driven by the row address selects a word (row) line across the cell array's bit (data) lines; a column selector and I/O circuits, driven by the column address, deliver one data bit per plane. Each intersection is a 1-T DRAM cell.]
The column address selects the requested bit from the row in each plane.

Classical DRAM Operation
DRAM organization: N rows x N columns x M bits; reads or writes M bits at a time.
Each M-bit access requires a RAS/CAS cycle: the row address is latched by RAS (row address strobe), then the column address by CAS (column address strobe).
[Timing diagram: the 1st and 2nd M-bit accesses each take the full cycle time, with a row address on RAS followed by a column address on CAS before the M-bit output appears.]

Page Mode DRAM Operation
Page mode DRAM adds an N x M SRAM register to save a row. After a row is read into the SRAM register:
- Only CAS is needed to access other M-bit words on that row
- RAS remains asserted while CAS is toggled
[Timing diagram: the 1st M-bit access pays the full cycle time (row address, then column address); the 2nd, 3rd, and 4th accesses to the same row need only new column addresses.]

Synchronous DRAM (SDRAM) Operation
After a row is read into the SRAM register:
- Input CAS as the starting burst address, along with a burst length
- The SDRAM transfers a burst of data from a series of sequential addresses within that row
- A clock controls the transfer of successive words in the burst (300 MHz in 2004)
[Timing diagram: after the row address and the starting column address, the column address auto-increments (+1) and the 2nd, 3rd, and 4th M-bit words follow on successive clocks.]
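
The payoff of keeping a row in the SRAM register can be seen in a toy latency model; the function and cycle counts below are illustrative assumptions (chosen so that 17 + 8 = 25 matches the slide parameters used later), not vendor timings.

```c
/* Toy model: a full access pays RAS + CAS; a page-mode or SDRAM
   access to the already-open row pays only CAS. The cycle counts
   are assumed illustration values. */
enum { T_RAS = 17, T_CAS = 8 };

static int open_row = -1;      /* row currently in the SRAM register */

int access_cycles(int row)
{
    if (row == open_row)
        return T_CAS;          /* row-buffer hit: toggle CAS only */
    open_row = row;            /* latch the new row */
    return T_RAS + T_CAS;      /* full RAS/CAS cycle */
}
/* Four words from one row: (17+8) + 8 + 8 + 8 = 49 cycles, versus
   4 x 25 = 100 cycles if every access paid the full cycle. */
```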

Other SDRAM Architectures
Double Data Rate SDRAMs (DDR SDRAMs):
- "Double data rate" because they transfer data on both the rising and the falling edge of the clock
- The most widely used form of SDRAM
For DDR memory, the 2n-prefetch architecture means:
- The internal bus width is twice the external bus width, so the internal column-access frequency can be half the external data rate
- For users, 2n prefetch means that data accesses occur in pairs: a single READ fetches two data words, and for a single WRITE, two data words must be provided
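
Because DDR moves data on both clock edges, its peak bandwidth is two transfers per clock times the bus width; the helper below does that arithmetic. The example figures (200 MHz clock, 64-bit module) are assumptions for illustration and correspond to what is marketed as DDR-400 / PC-3200.

```c
/* Peak transfer rate for a DDR module: two transfers per clock
   (both edges) times the bus width in bytes. */
double ddr_peak_gbytes(double clock_mhz, int bus_bits)
{
    return clock_mhz * 1e6 * 2.0 * (bus_bits / 8) / 1e9;
}
/* ddr_peak_gbytes(200, 64) -> 3.2 GB/s (DDR-400 / PC-3200) */
```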

Other SDRAM Architectures (cont.)
[Figure: DDR4 bank groups. Source: https://www.synopsys.com/company/publications/dwtb/pages/dwtb-ddr4-bank-groups-2013q2.aspx]

Systems that Support Caches
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
[Figure: three CPU/cache/bus/memory configurations: (a) one-word-wide memory organization, (b) wide memory organization, with a multiplexor between the cache and the CPU, and (c) interleaved memory organization, with memory banks 0-3 on the bus.]

One-Word-Wide Organization
A one-word-wide bus and a one-word-wide memory. The bus contains both address and data lines: 32-bit data and a 32-bit address per cycle.
[Figure: on-chip CPU and cache, connected by the bus to memory.]

Wide Organization
Increase the bandwidth to memory by widening the memory and the buses between processor and memory, to allow parallel access to all the words of a block.
Logic between processor and cache consists of:
- A MUX on READs
- Control logic to update the appropriate words on WRITEs
[Figure: (b) wide memory organization, with a multiplexor between the cache and the CPU, contrasted with (c) interleaved memory organization.]

Interleaved Organization
Widen the memory but not the interconnection: for example, 4-way interleaving with a one-word-wide bus and a four-word-wide memory (four one-word banks).
Sending an address to several banks permits them all to read simultaneously, so the full DRAM latency is incurred only once.
[Figure: on-chip CPU and cache; the bus connects to banks 0-3.]

Interleaved Organization: Bank Conflict
A bank conflict occurs when two consecutive memory operations use the same bank; the memory stalls until the busy bank has completed the prior operation.
[Figure: on-chip CPU and cache; the bus connects to banks 0-3.]
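
A word-interleaved mapping sends consecutive word addresses to consecutive banks, which is why a four-word block read touches every bank exactly once; the sketch below (constants assumed, matching the four-bank figure) also shows when two accesses conflict.

```c
#define NBANKS     4   /* banks 0-3, as in the figure */
#define WORD_BYTES 4

/* Word interleaving: consecutive words live in consecutive banks. */
int bank_of(unsigned byte_addr)
{
    return (byte_addr / WORD_BYTES) % NBANKS;
}
/* bank_of(0)=0, bank_of(4)=1, bank_of(8)=2, bank_of(12)=3, then
   bank_of(16)=0 again: back-to-back accesses to addresses 0 and 16
   hit the same bank and stall until the first access completes. */
```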

Example: How a Memory System Affects Overall Performance
Assume that on a cache miss, reading DRAM takes:
- 1 clock cycle to send the address
- 25 clock cycles for the DRAM cycle time
- 8 clock cycles for the DRAM access time
- 1 clock cycle to return a word of data
Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per clock cycle.

Performance of One-Word-Wide Organization
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
- 1 cycle to send the address
- 25 cycles to read DRAM
- 1 cycle to return the data
- Total: 27 clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: 4/27 = 0.148 bytes per clock.

Performance of One-Word-Wide Organization (cont.)
What if the block size is four words?
- 1 cycle to send the 1st address
- 4 x 25 = 100 cycles to read DRAM (as soon as data is available, the address can be changed to access the next word, so the four reads proceed back to back)
- 4 x 1 = 4 cycles to return the data words
- Total: 105 clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/105 = 0.152 bytes per clock.

Performance of One-Word-Wide Organization (cont.)
What if the block size is four words and a fast page mode DRAM is used?
- 1 cycle to send the 1st address
- 25 + 3 x 8 = 49 cycles to read DRAM (a full cycle for the first word, then three 8-cycle page-mode accesses within the same row)
- 4 x 1 = 4 cycles to return the data words
- Total: 54 clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/54 = 0.296 bytes per clock.

Performance of Wide Organization
What if the cache block size is four words, and the CPU and main memory are two words wide?
- 1 cycle to send the 1st address
- 2 x 25 = 50 cycles to read DRAM (each access reads two words in parallel)
- 2 x 1 = 2 cycles to return the data
- Total: 53 clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/53 = 0.302 bytes per clock.

Performance of Wide Organization (cont.)
What if the cache block size is four words, and the CPU and main memory are four words wide?
- 1 cycle to send the 1st address
- 25 cycles to read DRAM (a single access reads all four words in parallel)
- 1 cycle to return the data
- Total: 27 clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/27 = 0.593 bytes per clock.

Performance of Interleaved Organization
For a block size of four words (all four banks read their words concurrently, so the 25-cycle DRAM latency is paid only once):
- 1 cycle to send the 1st address
- 25 cycles to read DRAM
- 4 x 1 = 4 cycles to return the data words
- Total: 30 clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/30 = 0.533 bytes per clock.
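
The five miss-penalty calculations above can be reproduced mechanically. The short program below is a sketch that encodes each organization using the slide parameters (1-cycle address send, 25-cycle DRAM cycle, 8-cycle page-mode access, 1 cycle per one-word bus transfer) and prints each penalty and bandwidth.

```c
#include <stdio.h>

/* Slide parameters: 1 cycle to send an address, 25-cycle DRAM cycle
   time, 8-cycle page-mode access, 1 cycle per one-word bus transfer. */
enum { ADDR = 1, DRAM = 25, PAGE = 8, BUS = 1, WORDS = 4 };

int main(void)
{
    int bytes = 4 * WORDS; /* 16-byte (four-word) block */
    struct { const char *name; int cycles; } org[] = {
        { "one-word-wide, 4-word block", ADDR + WORDS * DRAM + WORDS * BUS },
        { "one-word-wide, page mode   ", ADDR + DRAM + (WORDS - 1) * PAGE + WORDS * BUS },
        { "two-word-wide              ", ADDR + 2 * DRAM + 2 * BUS },
        { "four-word-wide             ", ADDR + DRAM + BUS },
        { "four-way interleaved       ", ADDR + DRAM + WORDS * BUS },
    };
    for (int i = 0; i < 5; i++)
        printf("%s: %3d cycles, %.3f bytes/clock\n",
               org[i].name, org[i].cycles, (double)bytes / org[i].cycles);
    return 0; /* prints 105, 54, 53, 27, and 30 cycles respectively */
}
```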

DRAM System Summary
It is important to match:
- Cache characteristics: caches access one block at a time (usually more than one word)
- DRAM characteristics: use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
- Memory-bus characteristics: make sure the memory bus can support the DRAM access rates and patterns
with the goal of increasing the memory-bus-to-cache bandwidth.

Virtual Memory Hardware Support

Review: The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
[Pyramid figure: Processor to L1$ in 4-8 byte (word) transfers; L1$ to L2$ in 8-32 byte (block) transfers; L2$ to Main Memory in 1-to-4-block transfers; Main Memory to Secondary Memory in 1,024+ byte transfers (disk sector = page). Access time and (relative) size of the memory at each level increase with distance from the processor.]
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.

Memory Hierarchy
- Registers to cache (1 or more levels): words transferred via load/store
- Cache to main memory (physical): blocks transferred automatically upon a cache miss
- Main memory to virtual memory (disk): pages transferred automatically upon a page fault

Cache Miss and Page Fault
[Figure: Processor, Cache, MMU, Main Memory, and Disk. The disk holds all data, organized in pages (~4 KB) accessed by physical addresses; pages are handled write-back, as in a cache. Main memory holds the cached pages and the page table.]
Cache miss: a required block is not found in the cache.
Page fault: a required page is not found in main memory.
A page fault in virtual memory is similar to a miss in a cache.

Virtual vs. Physical Address
The processor assumes a virtual memory addressing scheme:
- The disk is a virtual memory (large, slow)
- A block of data is called a virtual page
- An address is called a virtual (or logical) address (VA)
Main memory may have a different addressing scheme:
- Physical memory consists of the caches and main memory; an address there is called a physical address
The MMU translates virtual addresses to physical addresses:
- The complete address translation table is large and is kept in main memory
- The MMU contains a TLB (translation-lookaside buffer), which keeps a record of recent address translations

Two Programs Sharing Physical Memory
A program's address space is divided into pages (all of one fixed size) or segments (of variable sizes).
The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table.
[Figure: the virtual address spaces of Program 1 and Program 2 both map some pages into main memory and keep others on disk.]

Memory Hierarchy Example
32-bit address (byte addressing).
4 GB virtual memory (disk space), page size = 4 KB:
- Number of virtual pages = 4 x 2^30 / (4 x 2^10) = 1M
- Bits for virtual page number = log2(1M) = 20
1 GB physical main memory, page size = 4 KB:
- Number of physical pages = 1 x 2^30 / (4 x 2^10) = 256K
- Bits for physical page number = log2(256K) = 18
The page table contains 1M records specifying where each virtual page is located.
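
The same arithmetic, made executable as a check (sizes taken from the slide; this program only restates the divisions above):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t page   = 4ull << 10;          /* 4 KB page            */
    uint64_t vpages = (4ull << 30) / page; /* 4 GB virtual:  1M    */
    uint64_t ppages = (1ull << 30) / page; /* 1 GB physical: 256K  */
    printf("virtual pages:  %llu -> 20-bit virtual page number\n",
           (unsigned long long)vpages);
    printf("physical pages: %llu -> 18-bit physical page number\n",
           (unsigned long long)ppages);
    return 0;
}
```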

Address Translation
A virtual address (VA) is translated to a physical address (PA) by a combination of hardware and software.
[Figure: VA bits 31..12 hold the virtual page number and bits 11..0 the page offset; translation maps the virtual page number to a physical page number (PA bits 29..12), while the page offset passes through unchanged.]
So each memory request first requires an address translation from the virtual space to the physical space.
A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault.

Page Table
[Figure: a page table register holds the address of the page table in main memory. The table is indexed by the 1M virtual page numbers; each entry holds a valid bit, other flags (e.g., dirty bit, LRU reference bit), and the page location: a physical main memory page number if the page is resident, or a mark (-) for a page that resides on disk in virtual memory.]

32-bit Virtual Address (4 KB Page)
[Figure: 20-bit virtual page number | 10-bit word number within page | 2-bit byte offset. A virtual page contains 4 KB, i.e., 1K words of 32 bits (4 bytes) each.]

Virtual to Physical Address Translation
Virtual address: 20-bit virtual page number | 12-bit byte offset within page
Address translation replaces the 20-bit virtual page number with an 18-bit physical page number; the 12-bit byte offset within the page is unchanged.
Physical address: 18-bit physical page number | 12-bit byte offset within page
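
A minimal sketch of this translation in C, assuming 4 KB pages and a flat single-level table; `page_table` is a hypothetical name, and real entries also carry valid, dirty, and reference bits (see the next slides).

```c
#include <stdint.h>

#define PAGE_SHIFT  12     /* 4 KB pages       */
#define OFFSET_MASK 0xFFFu /* low 12 bits      */

/* Hypothetical single-level table: 1M entries, one per virtual page,
   each holding an 18-bit physical page number (valid entries only). */
extern uint32_t page_table[1u << 20];

uint32_t translate(uint32_t va)
{
    uint32_t vpn = va >> PAGE_SHIFT; /* 20-bit virtual page number   */
    uint32_t ppn = page_table[vpn];  /* 18-bit physical page number  */
    return (ppn << PAGE_SHIFT) | (va & OFFSET_MASK);
}
```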

Virtual Memory System
[Figure: the processor issues a virtual (logical) address (VA) to the MMU, the memory management unit with its TLB. The MMU produces a physical address used to access the cache (SRAM); on a miss, the physical address goes to the DRAM main memory, which also holds the page table. DMA (direct memory access) transfers pages between main memory and the disk.]

TLB: Translation-Lookaside Buffer
A processor request requires two accesses to main memory:
- Access the page table to get the physical address
- Access the physical address
The TLB acts as a cache of the page table:
- Holds recent virtual-to-physical page translations
- Eliminates one main memory access if the requested virtual page address is found in the TLB (a hit)
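
A fully associative lookup is one simple way to picture the TLB; the sketch below (entry count, struct layout, and names are assumptions) returns a cached translation on a hit and signals a miss otherwise.

```c
#include <stdint.h>

#define TLB_ENTRIES 64 /* assumed size, within the 16-512 range below */

struct tlb_entry { int valid; uint32_t vpn, ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 on a hit and writes the physical page number to *ppn;
   returns 0 on a miss, after which the page table in main memory
   must be accessed and the translation installed in the TLB. */
int tlb_lookup(uint32_t vpn, uint32_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return 1;
        }
    return 0;
}
```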

TLB Organization
[Figure: the TLB sits in front of the page table. Each TLB entry holds V (valid), D (dirty), and R (reference) bits, a tag, and a physical page address. On a TLB miss, the page table register locates the page table in main memory; each of its entries holds a valid bit, other flags (e.g., dirty bit, LRU reference bit), and the page location in physical main memory or in virtual memory (on disk).]

TLB Data
- V: valid bit
- D: dirty bit
- R: reference bit (LRU)
- Tag: index in page table
- Physical page address

Typical TLB Characteristics
- TLB size: 16-512 entries
- Block size: 1-2 page table entries of 4-8 bytes each
- Hit time: 0.5-1 clock cycle
- Miss penalty: 10-100 clock cycles
- Miss rate: 0.01%-1%

Integrating Virtual Memory, TLBs, and Caches

Intel IA-32 Memory Management
The memory management facilities of the IA-32 architecture are divided into two parts: segmentation and paging.
- Segmentation provides a mechanism for isolating individual code, data, and stack modules, so that multiple programs (or tasks) can run on the same processor without interfering with one another.
- Paging provides a mechanism for implementing a conventional demand-paged virtual-memory system, in which sections of a program's execution environment are mapped into physical memory as needed.
The processor uses two stages of address translation:
- Logical address to linear address (segmentation)
- Linear address to physical address (paging)
*Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1

Segmentation
Segmentation provides a mechanism for dividing the processor's addressable memory space (called the linear address space) into smaller protected address spaces called segments. All the segments in a system are contained in the processor's linear address space. To locate a byte in a particular segment, a logical address (far pointer) has to be provided.
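
In outline, the logical-to-linear step checks the far pointer's offset against the segment limit and then adds the segment base; the struct below is a simplified stand-in, not the exact IA-32 descriptor layout.

```c
#include <stdint.h>

/* Simplified segment descriptor: real IA-32 descriptors pack the
   base, limit, and attribute bits into a different binary layout. */
struct seg_desc { uint32_t base, limit; };

/* Logical (selector:offset) -> linear: the selector picks a
   descriptor; the offset is limit-checked and added to the base.
   Sets *ok = 0 where a real CPU would raise a #GP fault. */
uint32_t to_linear(const struct seg_desc *d, uint32_t offset, int *ok)
{
    if (offset > d->limit) { *ok = 0; return 0; }
    *ok = 1;
    return d->base + offset; /* linear address */
}
```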

Logical Address Translation

Paging
Paging (or linear-address translation) is the process of translating linear addresses so that they can be used to access memory or I/O devices. Paging translates each linear address to a physical address and determines, for each translation, what accesses to the linear address are allowed (the address's access rights).
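
For 32-bit IA-32 paging with 4 KB pages (no PSE/PAE), the linear address splits 10/10/12: bits 31:22 index the page directory, bits 21:12 the page table, and bits 11:0 give the byte offset. The sketch below walks those two levels; treating the table base fields as directly dereferenceable pointers is a simulation-style simplification of what the hardware does with physical addresses.

```c
#include <stdint.h>

#define PRESENT 0x1u /* bit 0 of a PDE/PTE */

/* Two-level walk for 32-bit paging with 4 KB pages. pgdir plays the
   role of the page directory that CR3 points at. Sets *ok = 0 where
   a real CPU would raise a page fault. */
uint32_t page_walk(const uint32_t *pgdir, uint32_t linear, int *ok)
{
    uint32_t pde = pgdir[linear >> 22];          /* bits 31:22 */
    if (!(pde & PRESENT)) { *ok = 0; return 0; }
    const uint32_t *pt = (const uint32_t *)(uintptr_t)(pde & ~0xFFFu);
    uint32_t pte = pt[(linear >> 12) & 0x3FFu];  /* bits 21:12 */
    if (!(pte & PRESENT)) { *ok = 0; return 0; }
    *ok = 1;
    return (pte & ~0xFFFu) | (linear & 0xFFFu);  /* frame | offset */
}
```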

Linear Address Translation

Segmentation and Paging

Complete Picture: IA-32 System-Level Registers and Data Structures

Next Class: Performance