A Memory System Design Framework: Creating Smart Memories


1 A Memory System Design Framework: Creating Smart Memories
Amin Firoozshahian, Alex Solomatnikov (Hicamp Systems Inc.)
Ofer Shacham, Zain Asgar, Stephen Richardson, Christos Kozyrakis, Mark Horowitz (Stanford University)

2 An Era of Chip-Multiprocessors
- Single-thread performance scaling has stopped
- More processor cores on the same die
- Claim: scale performance, keep design complexity constant
- Examples shown: Intel Nehalem, Sun Rock, IBM Cell

3 Looking a Little More Closely
[Figure: Sun Rock]

4 Reality
- Replicated cores
- Incredibly complicated memory system
- Large amounts of logic
- Innovation is in the memory system
- Transactions, streaming, fast synchronization, security, etc.
- Never exactly the same
- Where all the bugs are!

5 ISA for Memory Systems
- Can we regularize the memory system hardware?
- Program it rather than design it?
- Benefits: reduce design time, patch errors, run-time tuning
- How can we do this?

6 Shared Memory System
- Resources: local memory (data, state bits), interconnect, controllers
- Operations: probing state bits, tracking requests, communication, data movements (spill / refill)
[Figure: processors (Proc) with caches ($) and cache controllers, an interconnect, and memory; labels: Msg, miss]

7 Streaming Memory System
- Resources: local memory, interconnect, controllers
- Operations: communication, data movements, tracking outstanding transfers
[Figure: processors (Proc) with local memories (Local Mem) and DMA engines, an interconnect, and memory]

8 Transactional Memory System
- Resources: local memory, more state bits, interconnect, controllers
- Operations: data movements, state checks / updates, communication
[Figure: processors (Proc) with caches ($), address FIFOs (Addr. FIFO), and commit controllers, an interconnect, and memory]

9 Commonalities
- Same resources and operations
- Different in: how the operations are sequenced, interpretation of state bits
- We need: flexible local storage and interconnect, programmable controllers

10 Local Memories
- Programmable memory mat: data array, state bits, PLA logic, comparator
- Accessed by address and opcode
- Returns data, state, and compare result
[Figure labels: State, Data, Opcode, Address, Update, Cmp]
[K. Mai et al., Architecture and Circuit Techniques for a Reconfigurable Memory Block, IEEE International Solid-State Circuits Conference, February
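The mat interface described on this slide (access by address and opcode, returning data, state, and a compare result) can be sketched as a toy software model. This is only an illustration under stated assumptions: the opcode names, the `MemoryMat` and `MatResult` classes, and the "overwrite state on request" rule standing in for the PLA logic are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Opcode(Enum):
    # Hypothetical opcodes; the slide only says a mat takes an address and an opcode.
    READ = auto()
    WRITE = auto()
    COMPARE = auto()


@dataclass
class MatResult:
    data: int      # word stored at the address after the access
    state: int     # state bits as they were before any update
    match: bool    # comparator output (only meaningful for COMPARE)


class MemoryMat:
    """Toy model of a memory mat: a data array plus per-entry state bits."""

    def __init__(self, entries: int) -> None:
        self.data = [0] * entries
        self.state = [0] * entries

    def access(self, address: int, opcode: Opcode, operand: int = 0,
               new_state: Optional[int] = None) -> MatResult:
        if opcode is Opcode.WRITE:
            self.data[address] = operand
        # Comparator: check the stored word against the operand (e.g. a tag).
        match = opcode is Opcode.COMPARE and self.data[address] == operand
        old_state = self.state[address]
        if new_state is not None:
            # Stand-in for the PLA logic: the state is simply overwritten.
            self.state[address] = new_state
        return MatResult(self.data[address], old_state, match)


# Usage: store a tag, mark the entry valid (state 1), then probe it.
tags = MemoryMat(entries=4)
tags.access(2, Opcode.WRITE, operand=0x1A, new_state=1)
hit = tags.access(2, Opcode.COMPARE, operand=0x1A)
miss = tags.access(2, Opcode.COMPARE, operand=0x2B)
```

In this sketch a cache tag lookup and a streaming buffer read would use the same `access` call and differ only in which opcode is issued and how the returned state bits are interpreted.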

11 Programmable Controllers
- Use an off-the-shelf processor? (FLASH, Typhoon, etc.)
- Too slow
- All the way to the L1 cache interface
- Our approach: micro-coded engines (functional units)
- Each class of operations in a separate engine

12 Programming
- A set of subroutines
- A set of basic operations
- Executed in a functional unit
- Each one calls next
- Link subroutines to each other
[Figure: Unit 1, Unit 2, and Unit 3 exchanging messages (Msg)]
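The programming model on this slide (subroutines that run in a functional unit and each call the next) can be sketched as a small dispatch loop. The subroutine names, the unit assignments, and the message format below are hypothetical, chosen only to show how linking subroutines yields different sequences from the same building blocks.

```python
# Each subroutine stands for a short sequence of basic operations executed in
# one functional unit. It returns the name of the next subroutine to call, or
# None when the request is finished. All names here are hypothetical.

def check_tags(msg, log):
    log.append("unit1:check_tags")
    return "fetch_line" if msg["miss"] else "reply"


def fetch_line(msg, log):
    log.append("unit2:fetch_line")
    return "reply"


def reply(msg, log):
    log.append("unit3:reply")
    return None


# The "program": a table that links subroutine names to their bodies.
SUBROUTINES = {
    "check_tags": check_tags,
    "fetch_line": fetch_line,
    "reply": reply,
}


def run(entry, msg):
    """Follow the chain of subroutines starting at `entry` for one message."""
    log = []
    nxt = entry
    while nxt is not None:
        nxt = SUBROUTINES[nxt](msg, log)
    return log
```

Changing the table or the next-subroutine choices changes the protocol without changing the units themselves, which is the sense in which the controller is programmed rather than designed.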

13 Microarchitecture
- A small pipeline
- Configuration ("program") memories
- Horizontal micro-code
- Decide what to do
- Decide how to proceed
[Figure label: To other units]

14 Organization
[Figure: controller organization. Blocks: Tracking, State Update, Data Movement, three DMA engines, MSHR, USHR, Interrupt, Line Buffers, Processor Interface (to/from processors), Network Interface (to/from network); the units connect to/from local storages]

15 Read Miss Example
[Figure: the Organization diagram annotated with the steps of a read miss. Step labels as extracted: Access Tags, Access Data, Evict, Read Line, Read Miss, WB, Miss, Spill]

16 Programming Complexity
Cache coherence
- Message types received by controller: 6
- From processor: cache miss, upgrade miss, prefetch
- From network: coherence request, refill, upgrade
- Subroutine types in Tracking unit: 11
Streaming
- Message types: 5 (direct access, gather, scatter, gather reply, scatter ack.)
- Subroutine types in Tracking unit: 9

17 Smart Memories
- 8-core CMP system
- ST 90nm-GP CMOS technology
- 5.5 ns cycle time (181 MHz)
- 2.9M gates, 55M transistors
- Die dimensions labeled 7.77 mm x 7.77 mm
[Figure: system, quad, and tile block diagrams. System: Quads, Memory Controllers, TX/RX. Quad: Tile 0 to Tile 3, Configurable Protocol Controller. Tile: Configurable Memory Mats, Configurable Xbar, Configurable Ld/St Unit, CPU 0 and CPU 1 (Data / Inst)]
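As a quick check, the quoted clock frequency follows from the quoted cycle time; the slide's 181 MHz is consistent with the result below if the value is truncated rather than rounded.

```latex
f = \frac{1}{T_{\text{cycle}}} = \frac{1}{5.5 \times 10^{-9}\,\text{s}} \approx 181.8\ \text{MHz}
```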

18 Status
- System bring-up
- System configuration
- JTAG tests
- Coherent shared memory tests
- Transactional tests (TCC)
- Streaming tests
- More testing in progress
- Planning for a 32-processor system
[Figure: Test Chip]

19 Evaluation
- Comparison with a hardwired controller
- But which one? You would claim I am cheating!
- Compare with an ideal controller
- Assume controller actions occur in zero time
- Account for external actions: data read/write, message send/receive
- Gives an upper bound
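The ideal-controller comparison can be sketched numerically: controller actions are charged zero cycles while external actions keep their cost, and the overhead is the relative difference. The cycle counts and function names below are hypothetical, for illustration only; they are not measurements from the talk.

```python
def request_latency(controller_cycles, external_cycles):
    """Latency of one request: controller actions plus external actions."""
    return sum(controller_cycles) + sum(external_cycles)


def overhead_percent(real, ideal):
    """Overhead of the real controller relative to the ideal one, in percent."""
    return 100.0 * (real - ideal) / ideal


# Hypothetical cycle counts, not data from the talk.
controller = [2, 3, 1]   # e.g. tag check, state update, bookkeeping
external = [4, 20, 4]    # e.g. data read/write, message send/receive

real = request_latency(controller, external)
# Ideal controller: same external actions, controller actions take zero time.
ideal = request_latency([0] * len(controller), external)
overhead = overhead_percent(real, ideal)
```

One reading of "gives an upper bound" is that no hardwired controller can beat the ideal one, so the overhead measured this way is at least as large as the overhead against any real hardwired design.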

20 Average Read Latency
[Figure: average read latency in cycles for a 32-processor system, real controllers vs. ideal controllers, for coherent shared memory, streaming, and transactions]

21 Execution Time
- Total average overhead: 15%
[Figure: average overhead (%) in execution time for coherent shared memory, streaming, and transactions]

22 Conclusion
- Strong similarity between memory systems: common resources and operations
- A framework for memory systems design: generate specific instances
- Modest performance overhead compared to ideal systems
