ELEC 5200/6200 Computer Architecture and Design, Spring 2017
Lecture 7: Memory Organization, Part II
Ujjwal Guin, Assistant Professor, Department of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849, http://www.auburn.edu/~uzg0005/
Adapted from Dr. Chen-Huan Chiang (Intel) and Prof. Vishwani D. Agrawal (Auburn University) [Adapted from Computer Organization and Design, Patterson & Hennessy, 2014]
1/8/2017 ELEC 5200-001/6200-001 Lecture 7
Designing a Computer
Block diagram: Input and Output connect to the Central Processing Unit (CPU), or Processor, which comprises Control and Datapath.
Types of Computer Memories From the cover of: A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Upper Saddle River, New Jersey: Pearson Prentice Hall, 2006.
Random Access Memory (RAM)
Diagram: address bits feed an address decoder that selects a row of the cell array; read/write circuits transfer the data bits.
Six-Transistor SRAM Cell
Diagram: a word line selects the cell; a complementary pair of bit lines (bit, bit) reads or writes the stored value.
Dynamic RAM (DRAM) Cell
Diagram: a single-transistor DRAM cell at the intersection of a word line and a bit line. Robert Dennard's 1967 invention.
Classical RAM Organization (~Square)
A row decoder, driven by the row address, selects one word (row) line of the RAM cell array; each intersection of a word line with the bit (data) lines is a 6-T SRAM cell or a 1-T DRAM cell. The column selector & I/O circuits, driven by the column address, deliver the data bit or word.
One memory row holds a block of data, so the column address selects the requested bit or word from that block.
Classical DRAM Organization (~Square Planes)
The same structure, replicated as bit planes: the row decoder, driven by the row address, selects a word (row) line in each plane's RAM cell array (each intersection is a 1-T DRAM cell), and each plane's column selector & I/O circuits deliver one data bit.
The column address selects the requested bit from the row in each plane.
Classical DRAM Operation
DRAM organization: N rows x N columns x M bit planes; read or write M bits at a time.
Each M-bit access requires a RAS/CAS cycle: the row address is sent with RAS, then the column address with CAS.
For the 2nd M-bit access the full row address / column address sequence repeats, which sets the cycle time.
Page Mode DRAM Operation
Page mode DRAM adds an N x M SRAM register to save a row.
After a row is read into the SRAM register, only CAS is needed to access other M-bit words on that row: RAS remains asserted while CAS is toggled.
So the 1st M-bit access sends the row address and a column address; the 2nd, 3rd, and 4th M-bit accesses each send only a column address, shortening the cycle time.
Synchronous DRAM (SDRAM) Operation
After a row is read into the N x M SRAM register, the column address input with CAS is the starting burst address, sent along with a burst length.
The SDRAM transfers a burst of data from a series of sequential addresses within that row; a clock controls transfer of successive words in the burst (300 MHz in 2004).
So after the row address and one column address, successive M-bit outputs (2nd, 3rd, 4th) follow by incrementing the column address internally (+1).
Other SDRAM Architectures
Double Data Rate SDRAMs (DDR SDRAMs):
Double data rate because they transfer data on both the rising and falling edges of the clock; the most widely used form of SDRAM.
For DDR memory, the 2n-prefetch architecture means the internal bus width is twice the external bus width, so the internal column-access frequency can be half the external data rate.
For users, 2n prefetch means that data access occurs in pairs: a single READ fetches two data words, and for a single WRITE, two data words must be provided.
Other SDRAM Architectures (cont.)
https://www.synopsys.com/company/publications/dwtb/pages/dwtb-ddr4-bank-groups-2013q2.aspx
Memory Systems that Support Caches
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways. Three organizations, each a CPU with a cache on a bus to memory:
(a) one-word-wide memory organization; (b) wide memory organization, with a multiplexor between cache and memory; (c) interleaved memory organization, with banks 0-3 on the bus.
One-Word-Wide Organization
One-word-wide bus and one-word-wide memory; the bus contains both address and data lines (32-bit data & 32-bit address per cycle).
Diagram: on-chip CPU and cache, connected by the bus to memory.
Wide Organization
Increase the bandwidth to memory by widening the buses between processor and memory, to allow parallel access to all the words of a block.
Logic between processor and cache consists of a MUX on READs, and control logic to update the appropriate words on WRITEs.
Interleaved Organization
Widen the memory but not the interconnection. Example: 4-way interleaving, with a one-word-wide bus and four one-word-wide memory banks (banks 0-3).
Sending an address to several banks permits them all to read simultaneously, hence incurring the full DRAM latency only once.
Interleaved Organization: Bank Conflict
A bank conflict is two consecutive memory operations using the same bank; it causes the memory to stall until the busy bank has completed the prior operation.
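The round-robin assignment of consecutive words to banks can be sketched in a few lines (a minimal model, assuming word interleaving across 4 banks of 4-byte words):

```python
# Sketch of word-interleaved bank mapping (assumed: 4 banks, 4-byte words).
NUM_BANKS = 4
WORD_BYTES = 4

def bank_of(addr):
    """Bank that holds the word at byte address addr."""
    return (addr // WORD_BYTES) % NUM_BANKS

# Four consecutive words of a block land in four different banks,
# so their DRAM accesses can overlap:
print([bank_of(a) for a in (0x100, 0x104, 0x108, 0x10C)])  # [0, 1, 2, 3]

# A bank conflict: two accesses 16 bytes apart map to the same bank
# and must be serialized.
print(bank_of(0x100) == bank_of(0x110))  # True
```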
Example: how a memory system affects overall performance
Assume a cache miss that reads DRAM takes:
1 clock cycle to send the address
25 clock cycles for DRAM cycle time
8 clock cycles for DRAM access time
1 clock cycle to return a word of data
Memory-Bus to Cache bandwidth = number of bytes accessed from memory and transferred to cache/CPU per clock cycle.
Performance of One-Word-Wide Organization
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
1 cycle to send the address
25 cycles to read DRAM
1 cycle to return the data
27 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/27 = 0.148 bytes per clock.
Performance of One-Word-Wide Organization
What if the block size is four words?
1 cycle to send the 1st address
4 x 25 = 100 cycles to read DRAM (four sequential 25-cycle accesses; as soon as data is available, the address can be changed to access the next word)
4 x 1 = 4 cycles to return the data words
104 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/104 = 0.154 bytes per clock.
Performance of One-Word-Wide Organization
What if the block size is four words and a fast page mode DRAM is used?
1 cycle to send the 1st address
25 + 3 x 8 = 49 cycles to read DRAM (one 25-cycle access, then three 8-cycle page-mode accesses)
4 x 1 = 4 cycles to return the data words
54 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/54 = 0.296 bytes per clock.
Performance of Wide Organization
What if the cache block size is four words and the CPU and main memory width is two words?
1 cycle to send the 1st address
2 x 25 = 50 cycles to read DRAM (each 25-cycle access reads two words in parallel)
2 x 1 = 2 cycles to return the data
53 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/53 = 0.302 bytes per clock.
Performance of Wide Organization
What if the cache block size is four words and the CPU and main memory width is four words?
1 cycle to send the 1st address
25 cycles to read DRAM (one access reads all four words in parallel)
1 cycle to return the data
27 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/27 = 0.593 bytes per clock.
Performance of Interleaved Organization
For a block size of four words (banks 0-3 each perform their 25-cycle access in parallel):
1 cycle to send the 1st address
25 cycles to read DRAM
4 x 1 = 4 cycles to return the data words
30 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/30 = 0.533 bytes per clock.
DRAM System Summary
It's important to match the memory system to:
Cache characteristics: caches access one block at a time (usually more than one word).
DRAM characteristics: use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache.
Memory-bus characteristics: make sure the memory bus can support the DRAM access rates and patterns.
The goal: increase the Memory-Bus to Cache bandwidth.
Virtual Memory Hardware Support
Review: The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
With increasing distance from the processor in access time: Processor <-> L1$ (4-8 bytes per transfer, a word) <-> L2$ (8-32 bytes, a block) <-> Main Memory (1 to 4 blocks) <-> Secondary Memory (1,024+ bytes, a disk sector = page); the (relative) size of the memory grows at each level.
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.
Memory Hierarchy
Registers <-> Cache (1 or more levels): words transferred via load/store.
Cache <-> Main memory (physical): blocks transferred automatically upon cache miss.
Main memory <-> Virtual memory (disk): pages transferred automatically upon page fault.
Cache Miss and Page Fault
Path: Processor -> Cache -> MMU -> Main memory (cached pages, page table) -> Disk (all data, organized in pages of ~4 KB, accessed by physical addresses; write-back, same as in cache).
Cache miss: a required block is not found in cache.
Page fault: a required page is not found in main memory.
A page fault in virtual memory is similar to a miss in cache.
Virtual vs. Physical Address
The processor assumes a virtual memory addressing scheme: the disk is a virtual memory (large, slow); a block of data is called a virtual page; an address is called a virtual (or logical) address (VA).
Main memory may have a different addressing scheme: a physical memory address is called a physical address. The MMU translates virtual addresses to physical addresses.
The complete address translation table is large and is kept in main memory. The MMU contains a TLB (translation-lookaside buffer), which keeps a record of recent address translations.
Two Programs Sharing Physical Memory
A program's address space is divided into pages (all of one fixed size) or segments (of variable sizes).
The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table.
Diagram: Program 1's and Program 2's virtual address spaces both map pages into main memory, with the remaining pages on disk.
Memory Hierarchy Example
32-bit address (byte addressing); 4 GB virtual memory (disk space); page size = 4 KB.
Number of virtual pages = (4 x 2^30)/(4 x 2^10) = 1M; bits for virtual page number = log2(1M) = 20.
1 GB physical main memory; page size 4 KB.
Number of physical pages = (1 x 2^30)/(4 x 2^10) = 256K; bits for physical page number = log2(256K) = 18.
The page table contains 1M records specifying where each virtual page is located.
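The arithmetic above can be checked directly (a short sketch; the constant names are illustrative):

```python
# Re-derive the slide's page counts and address-field widths.
import math

PAGE_BYTES     = 4 * 2**10   # 4 KB page
VIRTUAL_BYTES  = 4 * 2**30   # 4 GB virtual memory (disk space)
PHYSICAL_BYTES = 1 * 2**30   # 1 GB physical main memory

virtual_pages  = VIRTUAL_BYTES // PAGE_BYTES    # 2**20 = 1M
physical_pages = PHYSICAL_BYTES // PAGE_BYTES   # 2**18 = 256K

vpn_bits    = int(math.log2(virtual_pages))     # 20-bit virtual page number
ppn_bits    = int(math.log2(physical_pages))    # 18-bit physical page number
offset_bits = int(math.log2(PAGE_BYTES))        # 12-bit byte offset in a page

print(virtual_pages, physical_pages, vpn_bits, ppn_bits, offset_bits)
```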
Address Translation
A virtual address is translated to a physical address by a combination of hardware and software.
Virtual address (VA): bits 31...12 hold the virtual page number, bits 11...0 the page offset. After translation, the physical address (PA) has the physical page number in bits 29...12 and the unchanged page offset in bits 11...0.
So each memory request first requires an address translation from the virtual space to the physical space. A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault.
Page Table
The page table register holds the address of the page table in main memory. The table has one entry per virtual page (1M virtual page numbers); each entry holds a valid bit, other flags (e.g., dirty bit, LRU reference bit), and the page location: a physical main-memory page number for resident pages, or a marker (-) for pages that are only in virtual memory (on disk).
32-bit Virtual Address (4 KB Page)
20-bit virtual page number, 10-bit word number within the page, 2-bit byte offset.
A virtual page contains 4 KB of data: 1K words of 32 bits (4 bytes) each.
Virtual to Physical Address Translation
Virtual address: 20-bit virtual page number + 12-bit byte offset within the page.
Address translation replaces the virtual page number with an 18-bit physical page number; the 12-bit byte offset within the page is unchanged.
Physical address: 18-bit physical page number + 12-bit byte offset within the page.
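The translation can be sketched as follows, with the page table modeled as a hypothetical Python dict mapping virtual page numbers to physical page numbers:

```python
# Sketch of virtual-to-physical translation for 4 KB pages.
OFFSET_BITS = 12  # log2(4 KB)

def translate(virtual_addr, page_table):
    vpn    = virtual_addr >> OFFSET_BITS              # 20-bit virtual page number
    offset = virtual_addr & ((1 << OFFSET_BITS) - 1)  # 12-bit offset, unchanged
    ppn = page_table[vpn]   # a missing entry is a page fault (KeyError here)
    return (ppn << OFFSET_BITS) | offset

page_table = {0x12345: 0x0ABC}   # hypothetical single mapping (18-bit PPN)
print(hex(translate(0x12345678, page_table)))  # 0xabc678
```

Only the page-number field changes; the byte offset passes through untouched, exactly as in the slide.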
Virtual Memory System
The processor issues a virtual or logical address (VA) to the MMU (memory management unit, with TLB), which produces a physical address.
The physical address goes to the cache (SRAM) for data and to the DRAM main memory, which holds the page table (page table entries, PTEs).
DMA (direct memory access) transfers pages between main memory and disk.
TLB: Translation-Lookaside Buffer
A processor request requires two accesses to main memory: access the page table to get the physical address, then access the physical address.
The TLB acts as a cache of the page table: it holds recent virtual-to-physical page translations, eliminating one main memory access if the requested virtual page address is found in the TLB (a hit).
TLB Organization
Each TLB entry holds a V (valid) bit, D (dirty) bit, R (reference) bit, a tag (the virtual page number, i.e., the index in the page table), and the physical page address.
The TLB caches a few of the page table's 1M entries; the page table itself (located by the page table register in main memory) still holds the valid bit, other flags (e.g., dirty bit, LRU reference bit), and the page locations for all pages, whether resident in physical main memory or on disk.
TLB Data
V: valid bit; D: dirty bit; R: reference bit (for LRU); Tag: index in page table; Physical page address.
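A fully associative TLB with these fields can be sketched as follows (the class and method names are illustrative, and the eviction policy here is deliberately crude rather than true LRU):

```python
# Sketch of a fully associative TLB in front of the page table.
class TLBEntry:
    def __init__(self, tag, ppn):
        self.valid = True   # V: entry holds a live translation
        self.dirty = False  # D: page written since loaded
        self.ref = True     # R: referenced recently (input to LRU)
        self.tag = tag      # virtual page number = index in page table
        self.ppn = ppn      # physical page address

class TLB:
    def __init__(self, size=16):
        self.size = size
        self.entries = {}   # tag -> TLBEntry

    def lookup(self, vpn, page_table):
        e = self.entries.get(vpn)
        if e and e.valid:               # TLB hit: page table not consulted
            e.ref = True
            return e.ppn
        ppn = page_table[vpn]           # TLB miss: one extra memory access
        if len(self.entries) >= self.size:
            # Crude eviction (oldest entry); real TLBs approximate LRU via R.
            self.entries.pop(next(iter(self.entries)))
        self.entries[vpn] = TLBEntry(vpn, ppn)
        return ppn

tlb = TLB(size=16)
page_table = {5: 9, 6: 7}          # hypothetical VPN -> PPN mappings
print(tlb.lookup(5, page_table))   # miss: fills the TLB, prints 9
print(tlb.lookup(5, {}))           # hit: page table not needed, prints 9
```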
Typical TLB Characteristics
TLB size: 16-512 entries
Block size: 1-2 page table entries of 4-8 bytes each
Hit time: 0.5-1 clock cycle
Miss penalty: 10-100 clock cycles
Miss rate: 0.01%-1%
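Combining these figures with the standard effective-access-time formula gives the expected translation cost; the specific values chosen within each range are assumptions for illustration:

```python
# Expected (average) address-translation cost from the figures above.
hit_time     = 1.0    # cycles (top of the 0.5-1 range)
miss_penalty = 40.0   # cycles (assumed point within the 10-100 range)
miss_rate    = 0.01   # 1% (worst case on the slide)

effective_cycles = hit_time + miss_rate * miss_penalty
print(effective_cycles)  # 1.4 cycles per translation on average
```

Even at the worst-case miss rate, the TLB keeps the average translation close to its hit time.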
Integrating Virtual Memory, TLBs, and Caches
Intel IA-32 Memory Management
The memory management facilities of the IA-32 architecture are divided into two parts: segmentation and paging.
Segmentation provides a mechanism for isolating individual code, data, and stack modules, so that multiple programs (or tasks) can run on the same processor without interfering with one another.
Paging provides a mechanism for implementing a conventional demand-paged virtual-memory system, in which sections of a program's execution environment are mapped into physical memory as needed.
The processor uses two stages of address translation: logical address to linear address (segmentation), then linear address to physical address (paging).
*Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1
Segmentation
Segmentation provides a mechanism for dividing the processor's addressable memory space (called the linear address space) into smaller protected address spaces called segments. All the segments in a system are contained in the processor's linear address space.
To locate a byte in a particular segment, a logical address (far pointer) has to be provided.
Logical Address Translation
Paging
Paging (or linear-address translation) is the process of translating linear addresses so that they can be used to access memory or I/O devices.
Paging translates each linear address to a physical address and determines, for each translation, what accesses to the linear address are allowed (the address's access rights).
Linear Address Translation
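For classic 32-bit (non-PAE) IA-32 paging, a linear address carries a 10-bit page-directory index, a 10-bit page-table index, and a 12-bit page offset; the field extraction can be sketched as follows (the function name is illustrative):

```python
# Field split for classic IA-32 32-bit (non-PAE) paging with 4 KB pages.
def split_linear(la):
    pde_index = (la >> 22) & 0x3FF   # bits 31:22 select a page-directory entry
    pte_index = (la >> 12) & 0x3FF   # bits 21:12 select a page-table entry
    offset    = la & 0xFFF           # bits 11:0 select the byte within the page
    return pde_index, pte_index, offset

print(split_linear(0xDEADBEEF))  # (890, 731, 3823)
```

The hardware walks the page directory with the first index, the selected page table with the second, and adds the offset to the resulting page frame to form the physical address.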
Segmentation and Paging
Complete Picture: IA-32 System-Level Registers and Data Structures
Next Class: Performance