Lecture: Memory, Multiprocessors. Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models

Similar documents
Lecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, intro to multi-thread programming models

Lecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols

Lecture: Memory Technology Innovations

Lecture: Memory Technology Innovations

Lecture: Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols

Lecture 15: DRAM Main Memory Systems. Today: DRAM basics and innovations (Section 2.3)

Lecture: Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols

DRAM Main Memory. Dual Inline Memory Module (DIMM)

Lecture 24: Memory, VM, Multiproc

Lecture 8: Virtual Memory. Today: DRAM innovations, virtual memory (Sections )

M2 Outline. Memory Hierarchy Cache Blocking Cache Aware Programming SRAM, DRAM Virtual Memory Virtual Machines Non-volatile Memory, Persistent NVM

Lecture 17: Multiprocessors. Topics: multiprocessor intro and taxonomy, symmetric shared-memory multiprocessors (Sections )

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University

Computer parallelism Flynn s categories

Lecture 24: Virtual Memory, Multiprocessors

EEM 486: Computer Architecture. Lecture 9. Memory

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

CS311 Lecture 21: SRAM/DRAM/FLASH

CS 320 February 2, 2018 Ch 5 Memory

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

Lecture 5: Scheduling and Reliability. Topics: scheduling policies, handling DRAM errors

Lecture 7: PCM, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Lecture 18: DRAM Technologies

Memory technology and optimizations ( 2.3) Main Memory

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory

COMPUTER ARCHITECTURES

Chap. 4 Multiprocessors and Thread-Level Parallelism

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Intro to Computer Architecture, Spring 2012 Midterm Exam II. Name:

Computer Systems Architecture

Chapter 5 Internal Memory

The Memory Hierarchy 1

Mainstream Computer System Components

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

CENG3420 Lecture 08: Memory Organization

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems

William Stallings Computer Organization and Architecture 6th Edition. Chapter 5 Internal Memory

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Organization. 5.1 Semiconductor Main Memory. William Stallings Computer Organization and Architecture 6th Edition

Brief Background in Fiber Optics

Computer Systems Laboratory Sungkyunkwan University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

Main Memory Systems. Department of Electrical Engineering Stanford University Lecture 5-1

An introduction to SDRAM and memory controllers. 5kk73

Computer Systems Architecture

CS/EE 6810: Computer Architecture

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

Internal Memory. Computer Architecture. Outline. Memory Hierarchy. Semiconductor Memory Types. Copyright 2000 N. AYDIN. All rights reserved.

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory

CENG4480 Lecture 09: Memory 1

The DRAM Cell. EEC 581 Computer Architecture. Memory Hierarchy Design (III) 1T1C DRAM cell

Spring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand

CS698Y: Modern Memory Systems Lecture-16 (DRAM Timing Constraints) Biswabandan Panda

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

CSE502: Computer Architecture CSE 502: Computer Architecture

Computer Organization. 8th Edition. Chapter 5 Internal Memory

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics

Lecture-14 (Memory Hierarchy) CS422-Spring

CSE502: Computer Architecture CSE 502: Computer Architecture

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:

LECTURE 5: MEMORY HIERARCHY DESIGN

Parallel Computing: Parallel Architectures Jin, Hai

Memories: Memory Technology

Intro to Computer Architecture, Spring 2012 Midterm Exam II

Introduction to cache memories

Phase Change Memory An Architecture and Systems Perspective

Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Introduction to memory system :from device to system

Let Software Decide: Matching Application Diversity with One- Size-Fits-All Memory

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Topic 21: Memory Technology

Topic 21: Memory Technology

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Memory systems. Memory technology. Memory technology Memory hierarchy Virtual memory

Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture Lecture 24: Memory Scheduling

EE414 Embedded Systems Ch 5. Memory Part 2/2

(Advanced) Computer Organization & Architechture. Prof. Dr. Hasan Hüseyin BALIK (5 th Week)

Chapter Seven Morgan Kaufmann Publishers

Lecture 5: Refresh, Chipkill. Topics: refresh basics and innovations, error correction

Developing a Prototyping Board for Emerging Memory

Copyright 2012, Elsevier Inc. All rights reserved.

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

ECE 571 Advanced Microprocessor-Based Design Lecture 17

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory

Transcription:

Lecture: Memory, Multiprocessors Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models 1

Refresh Every DRAM cell must be refreshed within a 64 ms window A row read/write automatically refreshes the row Every refresh command performs refresh on a number of rows, the memory system is unavailable during that time A refresh command is issued by the memory controller once every 7.8us on average 2

Problem 5 Consider a single 4 GB memory rank that has 8 banks. Each row in a bank has a capacity of 8KB. On average, it takes 40ns to refresh one row. Assume that all 8 banks can be refreshed in parallel. For what fraction of time will this rank be unavailable? How many rows are refreshed with every refresh command? 3

Problem 5 Consider a single 4 GB memory rank that has 8 banks. Each row in a bank has a capacity of 8KB. On average, it takes 40ns to refresh one row. Assume that all 8 banks can be refreshed in parallel. For what fraction of time will this rank be unavailable? How many rows are refreshed with every refresh command? The memory has 4GB/8KB = 512K rows There are 8K refresh operations in one 64ms interval. Each refresh operation must handle 512K/8K = 64 rows Each bank must handle 8 rows One refresh operation is issued every 7.8us and the memory is unavailable for 320ns, i.e., for 4% of time. 4

Address Mapping Policies Consecutive cache lines can be placed in the same row to boost row buffer hit rates Consecutive cache lines can be placed in different ranks to boost parallelism Example address mapping policies: row:rank:bank:channel:column:blkoffset row:column:rank:bank:channel:blkoffset 5

Reads and Writes A single bus is used for reads and writes The bus direction must be reversed when switching between reads and writes; this takes time and leads to bus idling Hence, writes are performed in bursts; a write buffer stores pending writes until a high water mark is reached Writes are drained until a low water mark is reached 6

Scheduling Policies FCFS: Issue the first read or write in the queue that is ready for issue First Ready - FCFS: First issue row buffer hits if you can Close page -- early precharge Stall Time Fair: First issue row buffer hits, unless other threads are being neglected 7

Error Correction For every 64-bit word, can add an 8-bit code that can detect two errors and correct one error; referred to as SECDED single error correct double error detect A rank is now made up of 9 x8 chips, instead of 8 x8 chips Stronger forms of error protection exist: a system is chipkill correct if it can handle an entire DRAM chip failure 8

Modern Memory System...... PROC.. 4 DDR4 channels 72-bit data channels 1200 MHz channels 1-2 DIMMs/channel 1-4 ranks/channel 9

Cutting-Edge Systems PROC SMB.... The link into the processor is narrow and high frequency The Scalable Memory Buffer chip is a router that connects to multiple DDR3 channels (wide and slow) Boosts processor pin bandwidth and memory capacity More expensive, high power 10

Problem 6 What is the boost in capacity and bandwidth provided by using an SMB? Assume that a DDR3 channel requires 64 data wires, 32 addr/cmd wires, and runs at a frequency of 800 MHz (DDR). Assume that the SMB connects to the processor with two 16-bit links that run at frequencies of 6.4 GHz (no DDR). Assume that two DDR3 channels can connect to an SMB. Assume that 50% of the downstream link s bandwidth is used for commands and addresses. 11

Problem 6 What is the boost in capacity and bandwidth provided by using an SMB? Assume that a DDR3 channel requires 64 data wires, 32 addr/cmd wires, and runs at a frequency of 800 MHz (DDR). Assume that the SMB connects to the processor with two 16-bit links that run at frequencies of 6.4 GHz (no DDR). Assume that two DDR3 channels can connect to an SMB. Assume that 50% of the downstream link s bandwidth is used for commands and addresses. The increase in processor read/write bw = (6.4GHz x 72) /(800MHz x 2 x 64) = 4.5x (for every 96 wires used by DDR3, you can have 3 32-bit links; each 32-bit link supports effectively 24 bits of read/write data) Increase in per-pin capacity = 4 DIMMs-per-32-pins / 12 2 DIMMs-per-96-pins = 6x

Future Memory Trends Processor pin count is not increasing High memory bandwidth requires high pin frequency High memory capacity requires narrow channels per DIMM 3D stacking can enable high memory capacity and high channel frequency (e.g., Micron HMC) 13

Future Memory Cells DRAM cell scaling is expected to slow down Emerging memory cells are expected to have better scaling properties and eventually higher density: phase change memory (PCM), spin torque transfer (STT-RAM), etc. PCM: heat and cool a material with elec pulses the rate of heat/cool determines if the material is crystalline/amorphous; amorphous has higher resistance (i.e., no longer using capacitive charge to store a bit) Advantages: non-volatile, high density, faster than Flash/disk Disadvantages: poor write latency/energy, low endurance 14

Silicon Photonics Game-changing technology that uses light waves for communication; not mature yet and high cost likely No longer relies on pins; a few waveguides can emerge from a processor Each waveguide carries (say) 64 wavelengths of light (dense wave division multiplexing DWDM) The signal on a wavelength can be modulated at high frequency gives very high bandwidth per waveguide 15

Multiprocs -- Memory Organization - I Centralized shared-memory multiprocessor or Symmetric shared-memory multiprocessor (SMP) Multiple processors connected to a single centralized memory since all processors see the same memory organization uniform memory access (UMA) Shared-memory because all processors can access the entire memory address space Can centralized memory emerge as a bandwidth bottleneck? not if you have large caches and employ fewer than a dozen processors 16

SMPs or Centralized Shared-Memory Processor Processor Processor Processor Caches Caches Caches Caches Main Memory I/O System 17

Multiprocs -- Memory Organization - II For higher scalability, memory is distributed among processors distributed memory multiprocessors If one processor can directly address the memory local to another processor, the address space is shared distributed shared-memory (DSM) multiprocessor If memories are strictly local, we need messages to communicate data cluster of computers or multicomputers Non-uniform memory architecture (NUMA) since local memory has lower latency than remote memory 18

Distributed Memory Multiprocessors Processor & Caches Processor & Caches Processor & Caches Processor & Caches Memory I/O Memory I/O Memory I/O Memory I/O Interconnection network 19

Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model Communication is implicit and hardware handles protection Hardware-controlled caching Message-passing: No cache coherence simpler hardware Explicit communication easier for the programmer to restructure code Sender can initiate data transfer 20

Title Bullet 21