EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems

EE382N (20): Computer Architecture - Parallelism and Locality
Fall 2011
Lecture 23: Memory Systems
Mattan Erez
The University of Texas at Austin
EE382N: Principles of Computer Architecture, Fall 2011 -- Lecture 23 (c) Mattan Erez, Jung Ho Ahn

Outline
- DRAM technology: organization and mechanism
- Memory system organization and design options
- New emerging trends
Most slides courtesy of Jung Ho Ahn, SNU

DRAM is Expensive
- $/bit is almost nothing: $2 x 10^-9
- Memory in a system is expensive: 10% to 50% of system cost
- Rule of thumb: the more memory the better
- Minimize $/bit

What is DRAM?
[Figure only]

DRAM Cell
[Figure only]

DRAM Cell
- Capacity density
- 3D Recessed Channel Array Transistor (capacitor on top): Samsung, Hynix, Elpida, Micron
- Trench capacitor: Infineon, Nanya, ProMOS, Winbond; IBM and Toshiba in embedded DRAM
[Figure labels: Hynix 80nm, Infineon 80nm]

DRAM Array
[Figure only]

Sense Amplifier
[Figure only]

DRAM Array
[Figure only]

DRAM Optimization
- The more memory the better: optimize capacity
- Also need to worry about power and drivers
- The result is a compromise in latency: small capacitors and large sub-arrays increase access time
- What about bandwidth?
- Bandwidth is expensive: $0.05 to $0.10 per package pin, and DDR2 requires 80 pins
- Secondary goal is optimizing BW/pin
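As a rough check on the bandwidth cost figures above, the sketch below multiplies the quoted per-pin cost range by the DDR2 pin count. It assumes that package cost scales linearly with pin count, which the slide does not state; `interface_cost` is a hypothetical helper introduced only for this illustration.

```python
# Figures quoted on the slide.
PINS_DDR2 = 80
COST_PER_PIN_USD = (0.05, 0.10)  # low and high estimate per package pin


def interface_cost(pins, cost_range):
    """Return the (low, high) pin cost in dollars, assuming linear scaling."""
    low, high = cost_range
    return pins * low, pins * high


low, high = interface_cost(PINS_DDR2, COST_PER_PIN_USD)
print(f"DDR2 pin cost per device: ${low:.2f} to ${high:.2f}")
```

Under this assumption the pins alone cost a few dollars per device, which is why BW/pin is worth optimizing.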

Massively parallel processor architecture
- ALU bandwidth demand: > 4000 GB/s
- DRAM bandwidth supply: < 200 GB/s
- BW demand of ALUs >> BW supply from DRAMs

Stream processor architecture
- LRF: local register file
- SRF: stream register file
- LRF and SRF provide a hierarchy of bandwidth and locality
- SRF decouples execution from memory
[Figure: LRFs, SRF, memory system, and DRAM]

Streaming Memory Systems (SMSs)
- Off-chip DRAMs need to meet the processor's bandwidth demands
- Multiple address-interleaved memory channels
- High bandwidth per channel
[Figure: PE 0 to PE 3 connected through the memory system to four memory channels, each with its own DRAM]

The performance of a streaming memory system is very sensitive to access patterns
1) Because of load imbalance between multiple memory channels
2) Because the performance of modern DRAMs is very sensitive to access patterns
[Figure: timing diagram (clock, request, data) for two request sequences to addresses (0, 0, 0) through (0, 0, 4), written as (Bank, Row, Column); blue marks writes and red marks reads. The first sequence issues act followed by five wr commands; the second issues act and then mixes wr and rd commands.]
Parallelism and locality are necessary for efficient DRAM usage
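The load-imbalance point can be illustrated with a small sketch. It assumes four channels with word-granularity address interleaving (`channel = address mod 4`); the slides do not give the actual interleaving function, and `channel_of` and `channel_load` are hypothetical names.

```python
from collections import Counter


def channel_of(addr, num_channels=4, interleave_words=1):
    """Map a word address to a channel (assumed low-order interleaving)."""
    return (addr // interleave_words) % num_channels


def channel_load(base, stride, count, num_channels=4, interleave_words=1):
    """Count how many of `count` strided accesses land on each channel."""
    hits = Counter(
        channel_of(base + i * stride, num_channels, interleave_words)
        for i in range(count)
    )
    return [hits.get(c, 0) for c in range(num_channels)]


print(channel_load(base=0, stride=1, count=16))  # unit stride
print(channel_load(base=0, stride=4, count=16))  # stride equal to channel count
```

With unit stride the 16 accesses spread evenly over the channels, while a stride equal to the channel count sends every access to one channel and leaves the other three idle.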

Memory system designs rely on the inherent parallelism/locality of memory accesses
A stream load or store operation yields a large number of related memory accesses.
Due to the blocked feature of accesses, an SMS can exploit
- Parallelism by generating multiple references per cycle per thread
- Parallelism by generating references from multiple threads
- Parallelism by both of the above
- Locality by generating entire references of a thread
It is important to understand how the interactions between these different factors affect performance.
[Figure labels: Stream A, Stream load B]

Outline
- DRAM technology: organization, mechanism, and trends
- Memory system organization and design options
- A few results
Most slides courtesy of Jung Ho Ahn, HP Labs

A DRAM chip contains multiple memory banks, where each bank is a 2-D array
Many shared resources on a DRAM chip:
- Row & column accesses are shared by the request path.
- Data is read & written through a shared data path.
- All banks share the request and data paths.
This sharing and the dynamic nature of DRAM result in strict access rules and timing constraints.
[Figure: a (bank, row, column) request enters banks 0 to n-1; each bank has a row decoder, memory array, sense amplifier, and column decoder; data leaves on the shared data path]

A DRAM follows rules and occupies resources to access a location
[Figure: DRAM chip with banks 0 to n-1 (row decoder, memory array, sense amplifier, column decoder) and shared request and data paths]
[Figure: resource utilization per cycle of the bank, request path, and data path for the precharge, activate, read, and write operations]
[Figure: simplified bank state diagram with act, rd, wr, and pre transitions, shown in turn for the activate, read, and precharge operations]

DRAM operation sequences for two memory read requests
Read (Bank, Row, Column)
1. (0, 0, 0) (0, 0, 1): different column
   act (0, 0, *), rd (0, *, 0), rd (0, *, 1)
2. (0, 0, 0) (1, 1, 0): different bank & row
   act (0, 0, *), rd (0, *, 0), act (1, 1, *), rd (1, *, 0)
3. (0, 0, 0) (0, 1, 1): different row & column
   act (0, 0, *), rd (0, *, 0), pre (0, 0, *), act (0, 1, *), rd (0, *, 1)
   Internal bank conflict: requires more commands and cycles
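The three cases above can be reproduced with a short sketch that tracks which row each bank holds in its sense amplifiers. It assumes all banks start precharged and ignores timing; `commands_for_reads` is a hypothetical helper, not part of any simulator mentioned in the lecture.

```python
def commands_for_reads(requests):
    """Return the DRAM commands needed for read requests issued in order.

    requests: list of (bank, row, column) tuples.
    Assumes every bank starts precharged (no open row).
    """
    open_row = {}  # bank -> row currently held in the sense amplifiers
    cmds = []
    for bank, row, col in requests:
        if open_row.get(bank) != row:
            if bank in open_row:
                # A different row is open in this bank: internal bank conflict.
                cmds.append(("pre", bank, open_row[bank]))
            cmds.append(("act", bank, row))
            open_row[bank] = row
        cmds.append(("rd", bank, col))
    return cmds


for pair in ([(0, 0, 0), (0, 0, 1)],
             [(0, 0, 0), (1, 1, 0)],
             [(0, 0, 0), (0, 1, 1)]):
    print(pair, "->", [c[0] for c in commands_for_reads(pair)])
```

The same-row pair needs three commands, the different-bank pair four, and the bank-conflict pair five, matching the sequences listed on the slide.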

Activate to activate time determines random access performance
- tRR: act to act between different banks
- tRC: act to act within the same bank
[Figure: clock and request timelines showing act, act, pre, act commands issued to banks 0 to n-1, each bank with its own row decoder, memory array, sense amplifier, and column decoder]
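A minimal sketch of how these two constraints bound the activate rate follows. The cycle counts (`t_rc=20`, `t_rr=4`) are made-up illustrative values, not figures from the lecture, and every other DRAM timing constraint is ignored; `schedule_activates` is a hypothetical helper.

```python
def schedule_activates(banks, t_rc, t_rr):
    """Earliest issue cycle of each activate when issued in the given order.

    t_rc: minimum act-to-act spacing within the same bank.
    t_rr: minimum act-to-act spacing between different banks.
    """
    last_act_any = None  # cycle of the most recent activate to any bank
    last_act_bank = {}   # bank -> cycle of its most recent activate
    times = []
    for bank in banks:
        t = 0
        if last_act_any is not None:
            t = max(t, last_act_any + t_rr)
        if bank in last_act_bank:
            t = max(t, last_act_bank[bank] + t_rc)
        times.append(t)
        last_act_any = t
        last_act_bank[bank] = t
    return times


print(schedule_activates([0, 1, 2, 3], t_rc=20, t_rr=4))  # spread over banks
print(schedule_activates([0, 0, 0, 0], t_rc=20, t_rr=4))  # same bank
```

Random accesses that spread across banks are limited by tRR, while accesses that keep returning to one bank are limited by the much longer tRC.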

Switches between read/write commands and data transfer require timing delay
- tDWR: write request to read request time
- tDRW: read request to write request time
- tRWBUB: Hi-Z between read and write data transfer
- tWRBUB: Hi-Z between write and read data transfer
[Figure: clock, request (wr, wr, rd, wr), and data timelines]
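The cost of read/write switches can be sketched by spacing column commands according to whether the direction changes. The delay values below (`t_col=1`, `t_dwr=4`, `t_drw=2`) are illustrative assumptions, not parameters from the lecture, and `issue_times` is a hypothetical helper that models only the request path.

```python
def issue_times(ops, t_col=1, t_dwr=4, t_drw=2):
    """Issue cycle of each column command in `ops` ("wr" or "rd").

    t_col: spacing between commands of the same type.
    t_dwr: write request to read request time.
    t_drw: read request to write request time.
    """
    if not ops:
        return []
    times = [0]
    for prev, cur in zip(ops, ops[1:]):
        if prev == cur:
            gap = t_col
        elif prev == "wr":
            gap = t_dwr
        else:
            gap = t_drw
        times.append(times[-1] + gap)
    return times


print(issue_times(["wr", "wr", "rd", "wr"]))  # two direction switches
print(issue_times(["wr", "wr", "wr", "rd"]))  # one direction switch
```

Grouping accesses of the same direction reduces the number of turnarounds, so the same four commands finish issuing earlier.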

DRAM parameter trends over various DRAM generations
[Figure: log-scale plot across SDRAM, DDR, DDR2, GDDR3, GDDR4, and XDR of (tCK) x 10 in ns, bits per column command in bits, and 1/(pin bandwidth) in ns/bit]
[Figure: log-scale plot across the same generations of tCK; tRC (act to act within the same bank); tRR (act to act between different banks); and tDWR (write cmd to read cmd time)]
Timing trends show that DRAM performance is very sensitive to the presented access patterns.

Outline
- DRAM technology: organization, mechanism, and trends
- Memory system organization and design options
- A few results
Most slides courtesy of Jung Ho Ahn, SNU

Streaming Memory Systems
- Bulk stream loads and stores
- Hierarchical control
- Expressive and effective addressing modes
- Can't afford to waste memory bandwidth
- Use hardware when performance is non-deterministic
- Strided access, gather, scatter
- Automatic SIMD alignment makes SIMD trivial (SIMD short-vector)
- Stream memory system helps the programmer and maximizes I/O throughput
[Figure labels: SRF, MEM, DRAM; record fields O x, O y, O z, H1 x, H1 y, H1 z, H2 x, H2 y, H2 z]

A streaming memory system consists of AGs, a cross-point switch, and MCs
- AG: address generator
- MC: memory channel
[Figure: PE 0 to PE 3; AGs (AG & RB); four memory channels, each with its own DRAM]

An AG translates a memory access thread into a sequence of individual memory requests
- AG: address generator
- Requests are written as [address, data]
- Record size: # of consecutive words per data record mapped to a PE
- Stride: address gap between consecutive records
[Figure: data words a to h held by PE 0 to PE 3 pass through an AG, which emits the requests [6, a] [10, b] [14, c] [18, d] [7, e] [11, f] [15, g] [19, h]]
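The address sequence in the figure is consistent with a base address of 6, a stride of 4, and a record size of 2, with the AG emitting the first word of every record before the second. Those parameter values and the emission order are inferred from the figure rather than stated on the slide, and `ag_addresses` is a hypothetical helper.

```python
def ag_addresses(base, stride, record_size, num_records):
    """Generate word addresses for a strided stream access.

    Emits word 0 of every record, then word 1 of every record, and so on
    (an ordering inferred from the slide's figure).
    """
    return [
        base + record * stride + word
        for word in range(record_size)
        for record in range(num_records)
    ]


addrs = ag_addresses(base=6, stride=4, record_size=2, num_records=4)
requests = list(zip(addrs, "abcdefgh"))  # pair each address with its data word
print(requests)
```

Each record of two consecutive words belongs to one PE, so words a and e (addresses 6 and 7) form the first record, b and f the second, and so on.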

Cross-point Switch
[Figure: the AG requests are routed as two groups, [10, b] [18, d] [11, f] [19, h] and [6, a] [14, c] [7, e] [15, g]]
On-chip and off-chip have different address spaces.
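The two groups in the figure are what one would get if addresses were interleaved across four channels in units of two words. That mapping is an inference from the figure's numbers, not something the slide states, and `channel_of` and `route` are hypothetical names.

```python
def channel_of(addr, num_channels=4, words_per_burst=2):
    """Assumed mapping: consecutive 2-word bursts rotate across channels."""
    return (addr // words_per_burst) % num_channels


def route(requests, num_channels=4, words_per_burst=2):
    """Split [address, data] requests into per-channel queues, keeping order."""
    queues = {c: [] for c in range(num_channels)}
    for addr, data in requests:
        queues[channel_of(addr, num_channels, words_per_burst)].append((addr, data))
    return queues


requests = [(6, "a"), (10, "b"), (14, "c"), (18, "d"),
            (7, "e"), (11, "f"), (15, "g"), (19, "h")]
queues = route(requests)
for channel, queue in queues.items():
    print(channel, queue)
```

Under this assumed mapping only two of the four channels receive requests for this access, which is the kind of channel load imbalance that the stride and record size of a stream can cause.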

A memory channel contains MSHRs, channel buffer entries, and a memory controller
- MSHR: miss status handling register
- A memory controller checks all the pending requests in a channel buffer and generates proper DRAM commands
[Figure: requests [10, b] [18, d] [11, f] [19, h] enter one memory channel (MSHRs, channel buffer, memory controller, DRAM) and appear there as [(0, 1), b*] [(0, 2), d*] [(0, 1), *f] [(0, 2), *h]]
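One reading of the figure's `b*` and `*f` notation is that the MSHRs merge words belonging to the same two-word DRAM burst into one entry. The sketch below follows that reading; the burst size, the four-channel interleaving, and the `coalesce` helper are assumptions inferred from the figure's numbers, and the key is only the channel-local burst index, not a full (row, column) address.

```python
def coalesce(requests, words_per_burst=2, num_channels=4):
    """Merge a channel's requests that fall into the same DRAM burst.

    Returns {channel-local burst index: [word slots]}, where an unfilled
    slot is None. Dicts preserve insertion order, so entries appear in
    arrival order.
    """
    entries = {}
    for addr, data in requests:
        burst = addr // words_per_burst
        local_index = burst // num_channels  # burst position within this channel
        slots = entries.setdefault(local_index, [None] * words_per_burst)
        slots[addr % words_per_burst] = data
    return entries


entries = coalesce([(10, "b"), (18, "d"), (11, "f"), (19, "h")])
print(entries)
```

Here the four word requests collapse into two burst entries, so the memory controller needs only two column accesses instead of four.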

The design space of a Streaming Memory System
- Address generator: # of AGs, AG width
- Memory channel: # of channel buffer entries, MAS (memory access scheduling) policy, channel-split configuration
[Figure: PE 0 to PE 3; AGs (AG & RB); four memory channels, each with its own DRAM]

AG design space
- A single wide AG vs. multiple narrow AGs
- Inter-thread vs. intra-thread parallelism
- Load balancing across MCs
- The AG width: # of accesses each AG can generate per cycle
[Figure: PE 0 to PE 3 served by two AGs, and PE 0 to PE 3 served by a single AG]

Memory controller in an MC determines the DRAM command per cycle based on the MAS policy
Per DRAM command cycle, the memory controller
1. Looks at the status of every bank
2. Finds an available command per pending access without violating timing and resource constraints
3. Selects the command to issue among all available commands based on the priority of the chosen policy

DRAM bank status
- bank 0: row 2 active
- bank 1: row 0 active
- bank 2: row 3 precharging
- bank 3: idle

Pending requests in channel buffer, with the available command for each
- (0, 0, 0) write: precharge
- (1, 0, 0) read: read
- (0, 2, 1) write: none shown
- (1, 0, 1) read: read
Read occurred in the previous cycle.
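Step 2 can be sketched as a function that maps one pending request and the current bank status to the command it could issue this cycle. This is a simplified model, assuming a read/write direction change blocks a column command for exactly one cycle and ignoring all other timing constraints; `available_command` and the tuple layouts are hypothetical.

```python
def available_command(request, bank_status, last_column_op):
    """Return the command a pending request could issue now, or None.

    request: (bank, row, column, op) with op "read" or "write".
    bank_status: bank -> (state, row), state in {"active", "precharging", "idle"}.
    last_column_op: column command issued in the previous cycle, or None.
    """
    bank, row, _col, op = request
    state, open_row = bank_status[bank]
    if state == "precharging":
        return None              # bank is busy
    if state == "idle":
        return "activate"        # must open the row first
    if open_row != row:
        return "precharge"       # a different row is open: close it
    if last_column_op is not None and last_column_op != op:
        return None              # assumed one-cycle read/write turnaround
    return op                    # row hit: issue the column command


bank_status = {0: ("active", 2), 1: ("active", 0),
               2: ("precharging", 3), 3: ("idle", None)}
pending = [(0, 0, 0, "write"), (1, 0, 0, "read"),
           (0, 2, 1, "write"), (1, 0, 1, "read")]
commands = [available_command(r, bank_status, last_column_op="read")
            for r in pending]
print(commands)
```

For the slide's example this yields precharge, read, no command, and read, where the write to the open row (0, 2, 1) is held back because a read occurred in the previous cycle.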

Memory controller in an MC determines the DRAM command per cycle based on the MAS policy
Scheduling policies
- Open: a row is precharged when there are no pending accesses to the row and there is at least one pending access to a different row in the same bank
- Closed: a row is precharged as soon as the last available reference to that row is performed

Pending requests in channel buffer (bank 0: row 0 is active)
- Case 1: (1, 0, 0) write, (1, 0, 0) read, (2, 0, 1) write, (2, 0, 1) read
- Case 2: (1, 0, 0) write, (1, 3, 0) read, (2, 5, 1) write, (0, 1, 1) read
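The two precharge rules can be written as a single predicate. The sketch treats "last available reference performed" as "no pending access to the open row remains in the channel buffer", which is a simplifying assumption; `should_precharge` is a hypothetical helper.

```python
def should_precharge(policy, bank, open_row, pending):
    """Decide whether to precharge `open_row` of `bank`.

    pending: list of (bank, row, column) requests in the channel buffer.
    policy: "open" or "closed".
    """
    same_row = any(b == bank and r == open_row for b, r, _c in pending)
    other_row = any(b == bank and r != open_row for b, r, _c in pending)
    if same_row:
        return False          # the open row still has work to do
    if policy == "closed":
        return True           # close as soon as the row has no references
    if policy == "open":
        return other_row      # close only when another row in the bank is needed
    raise ValueError(f"unknown policy: {policy}")


case1 = [(1, 0, 0), (1, 0, 0), (2, 0, 1), (2, 0, 1)]
case2 = [(1, 0, 0), (1, 3, 0), (2, 5, 1), (0, 1, 1)]
for name, pending in (("case 1", case1), ("case 2", case2)):
    print(name,
          "open:", should_precharge("open", 0, 0, pending),
          "closed:", should_precharge("closed", 0, 0, pending))
```

In case 1 nothing targets bank 0, so only the closed policy precharges row 0; in case 2 a request to row 1 of bank 0 is pending, so both policies precharge.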

MAS determines command order to serve pending memory accesses

| Algorithm | Window size | Reorder row commands | Reorder column commands | Precharging | Access selection |
|---|---|---|---|---|---|
| inorder | 1 | N/A | N/A | N/A | N/A |
| inorderla | ncb | Yes | No | Open | Column First |
| firstready | ncb | Yes | Yes | Open | N/A |
| opcol | ncb | Yes | Yes | Open | Column First |
| oprow | ncb | Yes | Yes | Open | Row First |
| clcol | ncb | Yes | Yes | Closed | Column First |
| clrow | ncb | Yes | Yes | Closed | Row First |

- The inorder policy processes pending requests one by one, effectively having a window size of 1.
- The inorderla policy looks ahead at other pending requests and generates row commands, not column commands.
- The firstready policy checks and processes pending requests one by one from the oldest until it finishes looking at all the CBEs.
- The remaining four policies reorder both row and column commands using open/closed precharging and column-first/row-first selection.
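The "Access selection" column can be sketched as a priority rule over the commands that are available in a cycle. The sketch assumes candidates arrive oldest first and that policies without a selection rule fall back to the oldest candidate; `POLICIES` and `select` are hypothetical names, and only the four fully reordering policies are encoded.

```python
ROW_COMMANDS = {"act", "pre"}

# (precharging, access selection) for the four fully reordering policies.
POLICIES = {
    "opcol": ("open", "column first"),
    "oprow": ("open", "row first"),
    "clcol": ("closed", "column first"),
    "clrow": ("closed", "row first"),
}


def select(candidates, access_selection):
    """Pick one command from an oldest-first list of (request_id, command).

    access_selection: "column first", "row first", or None (oldest wins).
    """
    if not candidates:
        return None
    if access_selection is None:
        return candidates[0]
    want_row = access_selection == "row first"
    for candidate in candidates:
        if (candidate[1] in ROW_COMMANDS) == want_row:
            return candidate  # oldest command of the preferred kind
    return candidates[0]      # preferred kind unavailable: take the oldest


candidates = [(0, "pre"), (1, "rd"), (2, "act")]
for name, (_precharging, selection) in POLICIES.items():
    print(name, select(candidates, selection))
```

Column-first policies favor finishing accesses to rows that are already open, while row-first policies favor opening new rows early to expose more bank parallelism.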

A new micro-architecture design for a memory channel: a channel-split configuration
- MSHRs and CBEs per AG
- Switch thread when the requests are completely drained
- Avoids resource monopolization, internal bank conflicts, and the read/write turnaround penalty
[Figure: memory channel with nAG sets of MSHRs and channel buffers feeding one memory controller and its DRAM]

Outline
- DRAM technology
- Impact on memory system
- Stream architecture review
- DRAM organization, mechanism, and trends
- Memory system organization and design options
- A few results
Most slides courtesy of Jung Ho Ahn, HP Labs

Six applications from multimedia and scientific domains are used for the study
- DEPTH: stereo depth encoder
- MPEG: MPEG-2 video encoder
- RTSL: graphics rendering pipeline
- QRD: complex matrix to upper triangular & orthogonal (QR decomposition)
- FEM: finite element method
- MOLE: n-body molecular dynamics

A cycle-accurate Imagine simulator is used for performance evaluation
Processor
- 1 GHz
- 8 processing elements
Memory system
- AG width: 4
- # of MCs: 4
- # of CBEs: 16
- CBEs for channel-split config: 16 per AG
- MAS policy: opcol
DRAM
- DRAM burst length: 4 words
- Peak DRAM BW: 4 GW/s
- # of internal DRAM banks: 8
- DRAM timing params: XDR
- Peak BW: 2
- # of commands for accessing an inactive row: 3

Throughput (GW/s) Throughput (GW/s) Memory system performance for representative configurations on six apps 64 4 20% 3.5 3 16% 2.5 2 1.5 1 0.5 0 4 3.5 3 2.5 2 1.5 1 0.5 DEPTH 12% 8% 4% 0% MOLE 20% 16% 12% 8% 4% (# of AG, burst length, MAS) FEM 2ag, burst32, inorder 2ag, burst4, inorder 2ag, burst4, opcol 1ag, burst4, opcol 2agcs, burst4, opcol read-write switch 0 DEPTH 0% row commands MPEG RTSL QRD EE382: Principles of Computer Architecture, Fall 2011 -- Lecture 23 (c) Mattan Erez, Jung Ho Ahn

The key memory system related characteristics of six applications 65 Average strided Average indexed Application record size (W) stream length (W) stride/ record record size (W) stream length (W) index range strided access read access DEPTH 1.96 1802 1.95 1 1170 1180 46.6% 63.0% MPEG 1 1515 1 1 1280 2309 90.1% 70.2% DEPTH & MPEG has small record, stride size, and index range EE382: Principles of Computer Architecture, Fall 2011 -- Lecture 23 (c) Mattan Erez, Jung Ho Ahn

Throughput (GW/s) Memory system performance for representative configurations on six apps 66 4 20% 3.5 3 16% 2.5 2 1.5 12% 8% (# of AG, burst length, MAS) 2ag, burst32, inorder 1 0.5 4% 2ag, burst4, inorder 0 DEPTH 0% MOLE FEM 2ag, burst4, opcol 1ag, burst4, opcol 2agcs, burst4, opcol read-write switch row commands MPEG RTSL QRD Small record, stride size and index range means high spatial ity between generated requests from an access thread EE382: Principles of Computer Architecture, Fall 2011 -- Lecture 23 (c) Mattan Erez, Jung Ho Ahn

The key memory system related characteristics of six applications 67 Average strided Average indexed Application record size (W) stream length (W) stride/ record record size (W) stream length (W) index range strided access read access DEPTH 1.96 1802 1.95 1 1170 1180 46.6% 63.0% MPEG 1 1515 1 1 1280 2309 90.1% 70.2% RTSL 4 1170 4 1 264 216494 65.1% 83.5% MOLE 1 480 1 9 3252 7190 9.9% 99.5% Streams with small record size and large index ranges lacks spatial ity between generated requests EE382: Principles of Computer Architecture, Fall 2011 -- Lecture 23 (c) Mattan Erez, Jung Ho Ahn

Memory system performance for representative configurations on six apps
[Figure: throughput (GW/s) per application for the five (# of AG, burst length, MAS) configurations, with read-write switch and row command percentages.]
Long bursts hurt memory system performance.
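One plausible contributor to this result is wasted transfer: when a request needs only a few words, a long burst still moves the whole burst. The sketch below estimates the useful fraction of transferred words for a single isolated record. It is a back-of-the-envelope model, not the simulator behind the chart; the function name, the assumption that bursts are aligned, and the assumption that burst length is counted in words are all illustrative.

```python
def burst_efficiency(record_size, burst_len, start=0):
    """Useful words / transferred words for one isolated record.

    Assumptions (hypothetical, for illustration only):
    - record_size and burst_len are measured in words
    - bursts are aligned to multiples of burst_len
    - no neighboring record reuses the extra words in the burst
    """
    first_burst = start // burst_len
    last_burst = (start + record_size - 1) // burst_len
    transferred = (last_burst - first_burst + 1) * burst_len
    return record_size / transferred


if __name__ == "__main__":
    for record in (1, 24):
        for burst in (4, 32):
            print(record, burst, burst_efficiency(record, burst))
```

Under these assumptions a 1-word record uses 1/4 of a 4-word burst but only 1/32 of a 32-word burst. Streams with high spatial locality lose less, because adjacent records can share a burst.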

The key memory system related characteristics of six applications

             Average strided                                    Average indexed
Application  record size (W)  stream length (W)  stride/record  record size (W)  stream length (W)  index range  strided access  read access
DEPTH        1.96             1802               1.95           1                1170               1180         46.6%           63.0%
MPEG         1                1515               1              1                1280               2309         90.1%           70.2%
RTSL         4                1170               4              1                264                216494       65.1%           83.5%
MOLE         1                480                1              9                3252               7190         9.9%            99.5%
QRD          115              1053               350            N/A              N/A                N/A          100%            69.0%
FEM          12.4             1896               12.4           24               3853               203342       48.8%           74.0%

Large record size means high spatial locality in generated requests.
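The relationship between record size, index range, and spatial locality can be illustrated with a rough proxy: the fraction of word requests that land in the same DRAM row as the previous request. The sketch below is an assumption-laden toy, not the methodology used for the table. The row size of 512 words, the uniformly random indices, the treatment of the index range as a span of word addresses, and the function name are all hypothetical.

```python
import random


def row_hit_fraction(record_size, index_range, num_records=2000,
                     row_words=512, seed=0):
    """Fraction of word requests that fall in the same row as the previous one.

    Assumptions (hypothetical): indices are uniformly random word addresses
    in [0, index_range), each record is record_size consecutive words, and a
    single bank with row_words words per row serves all requests.
    """
    rng = random.Random(seed)
    prev_row = None
    hits = 0
    total = 0
    for _ in range(num_records):
        base = rng.randrange(index_range)  # start word of a random record
        for w in range(record_size):
            row = (base + w) // row_words
            if prev_row is not None:
                total += 1
                hits += (row == prev_row)
            prev_row = row
    return hits / total


if __name__ == "__main__":
    # (record size, index range) pairs loosely echoing the table rows
    for name, rec, rng_span in (("DEPTH-like", 1, 1180),
                                ("RTSL-like", 1, 216494),
                                ("FEM-like", 24, 203342)):
        print(name, round(row_hit_fraction(rec, rng_span), 3))
```

In this toy model, a 1-word record over a large index range almost never reuses a row, while a 24-word record reuses the row for most of its words regardless of the range.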

Memory system performance for representative configurations on six apps
[Figure: throughput (GW/s) per application for the five (# of AG, burst length, MAS) configurations, with read-write switch and row command percentages.]
QRD and FEM perform similarly to DEPTH and MPEG.

Performance sensitivity to the size of MC buffers
[Figure: throughput (GW/s), 0 to 4, for DEPTH, MPEG, RTSL, MOLE, FEM, and QRD under the AG configurations 1ag, 2agcs, 2ag, 4agcs, and 4ag. Legend: ncb = 4, 16, 32, 64, and 256 entries.]



Performance sensitivity to the size of MC buffers rises as the number of AGs is increased
[Figure: throughput (GW/s), 0 to 4, for DEPTH, MPEG, RTSL, MOLE, FEM, and QRD under the AG configurations 1ag, 2agcs, 2ag, 4agcs, and 4ag. Legend: ncb = 4, 16, 32, 64, and 256 entries.]
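A toy model can suggest why buffer size matters more with more AGs: the scheduler can only reorder requests it can see, and interleaved streams from several AGs spread each stream's row hits further apart. The sketch below counts row activations for one bank under a simplified "open row first, else oldest" policy. It is not one of the MAS policies evaluated in the lecture, has no starvation guard, and all names (`row_activations`, `interleave`, `window`) are hypothetical.

```python
from collections import deque


def row_activations(requests, window):
    """Count row activations for a single bank.

    requests: row ids in arrival order.
    window:   number of buffered requests the scheduler can choose from
              (window=1 behaves like in-order service).
    Policy (simplified assumption): serve the oldest buffered request that
    hits the open row; if none hits, activate the row of the oldest request.
    """
    pending = deque(requests)
    buf = []
    open_row = None
    activations = 0
    while pending or buf:
        # refill the scheduling window from the arrival queue
        while pending and len(buf) < window:
            buf.append(pending.popleft())
        for i, row in enumerate(buf):
            if row == open_row:
                buf.pop(i)  # row hit: no new activation needed
                break
        else:
            open_row = buf.pop(0)  # row miss: activate the oldest request's row
            activations += 1
    return activations


def interleave(num_streams, length):
    """Round-robin arrivals from num_streams streams; stream s always uses row s."""
    return [s for _ in range(length) for s in range(num_streams)]


if __name__ == "__main__":
    for streams in (1, 2, 4):
        reqs = interleave(streams, 8)
        print(streams, [row_activations(reqs, w) for w in (1, 4, 16, 32)])
```

With a single stream the window size makes no difference in this model, whereas with several interleaved streams a one-entry window activates a row on every request.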

Reordering row and column commands is important, but the specific MAS policy is not
[Figure: throughput (GW/s), 0 to 4, for DEPTH and FEM under the AG configurations 1ag, 2agcs, and 2ag. Legend, MAS policies: inorder, inorderla, firstready, oprow, opcol, clrow, clcol.]

Emerging Memory Technology
- New non-volatile storage devices
  - Fine-grained access like DRAM, non-volatile like FLASH
  - No refresh (or almost no refresh)
  - Density equal to or better than DRAM
  - Potentially more scalable than DRAM
  - Higher latencies, especially for writes
  - Higher write energy
  - Possible endurance issues
  - Very similar array structure and interface design to DRAM
- New interface options
  - 3D integration
    - Memory on top of a processor
    - Memory cubes
  - Optical interconnect

Conclusion
- DRAM trends
  - Data BW increases rapidly while latency and cmd BW improve slowly
  - DRAM access granularity grows
- Throughput is very sensitive to access patterns
  - Locality must be exploited to minimize internal bank conflicts and read-write turnaround penalties
- Memory system design space
  - Number of AGs: inter-thread vs. intra-thread parallelism
  - Load balance across MCs: channel interleaving, multiple threads, and AG width
  - The amount of MC buffering determines the window size of MAS
- Design suggestions
  - A single wide AG exploits locality well
  - The channel-split mechanism exploits locality and balances load across multiple channels simultaneously, at the cost of additional hardware
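The load-balance point can be illustrated with a minimal sketch of channel interleaving. The mapping below, its default of 4 channels, and its 4-word interleave granularity are assumptions chosen for illustration; the lecture does not specify an address mapping.

```python
from collections import Counter


def channel_load(addresses, num_channels=4, interleave_words=4):
    """Requests per channel under a simple (hypothetical) interleaved mapping.

    Consecutive groups of interleave_words words rotate across channels:
    channel = (address // interleave_words) % num_channels
    """
    counts = Counter((a // interleave_words) % num_channels for a in addresses)
    return [counts.get(c, 0) for c in range(num_channels)]


if __name__ == "__main__":
    unit_stride = range(64)                   # 64 consecutive words
    bad_stride = range(0, 64 * 16, 16)        # 64 words, stride of 16 words
    print(channel_load(unit_stride))          # evenly spread across channels
    print(channel_load(bad_stride))           # every request hits one channel
```

Under this mapping a unit-stride stream spreads evenly, while a stride equal to the number of channels times the interleave granularity sends every request to the same channel.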