Leveraging Cache Coherence in Active Memory Systems


1 Leveraging Cache Coherence in Active Memory Systems
  Daehyun Kim, Mainak Chaudhuri, and Mark Heinrich
  Computer Systems Laboratory
  School of Electrical and Computer Engineering
  Cornell University

2 Outline
  Introduction
  Active Memory Techniques
  Data Coherence
  Memory Controller Architecture
  Simulation Results & Analysis
  Conclusions

3 Motivation
  Memory Wall
    Gap between Processor Speed and Memory Speed
    Solutions: Caching, Prefetching
  Data-Intensive Applications
    Example: Scientific Calculations, Multimedia Operations
    Characteristics: Huge Data Size, Poor Locality
  New approach: Active Memory Systems

4 Active Memory Systems
  Active Memory Controller
    Improving Cache Behavior by Address Re-mapping
    [Diagram: Processor, Cache, Memory Controller (Address Re-mapping), Memory]
  Active Memory Element
    Data Parallel Processing in Memory
    [Diagram: Memory made of four DRAM blocks, each paired with Logic]

5 Data Coherence
  Data Coherence
    Active Memory Controller: Accessing same data via different addresses
    Active Memory Element: Multiple processors in the memory system
  Conventional Approach
    Cache Flush
      Flush overhead
      Not transparent
      Hard to support multiprocessor systems
  Our Approach
    Leverage and Extend Cache Coherence Protocol
      Solves the above problems
      Supports more active memory techniques

6 Related Work
  Active Memory Controller Approach
    Impulse (University of Utah)
    Memory Forwarding (Carnegie Mellon University)
  Active Memory Element Approach
    Active Pages (University of California at Davis)
    DIVA (University of Southern California)
    FlexRAM (University of Illinois at Urbana-Champaign)
  Main Contribution of Our Study
    Transparent Address Re-mapping Techniques
    Multiprocessor Active Memory Systems
    New Classes of Active Memory Operations

7 Active Memory Techniques
  Matrix Transpose
  Sparse Matrix
  Linked List Linearization
  Parallel Reduction

8 Matrix Transpose
  Example Code
    for i=0 to N-1
      for j=0 to N-1
        x += A[i][j];

    for i=0 to N-1
      for j=0 to N-1
        x += A[j][i];
  Problem
    Column-wise Accesses of Matrix A
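
To make the column-wise access problem concrete, the following is a minimal sketch that counts misses in a toy cache for the two loop orders. The cache model (fully associative, LRU, 4 lines of 8 elements) and the function name count_misses are assumptions chosen for illustration; they are not the configuration simulated in the talk.

```python
from collections import OrderedDict

def count_misses(n, order, line_elems=8, cache_lines=4):
    """Count misses of a tiny fully associative LRU cache while visiting
    every element of an n x n row-major matrix.
    order="row" visits A[i][j]; order="col" visits A[j][i]."""
    cache = OrderedDict()  # cache line number -> None, kept in LRU order
    misses = 0
    for i in range(n):
        for j in range(n):
            r, c = (i, j) if order == "row" else (j, i)
            line = (r * n + c) // line_elems  # element index -> cache line
            if line in cache:
                cache.move_to_end(line)        # hit: mark most recently used
            else:
                misses += 1
                cache[line] = None
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return misses

row_misses = count_misses(16, "row")  # one miss per cache line
col_misses = count_misses(16, "col")  # every access misses in this toy setup
print(row_misses, col_misses)
```

In this toy setup the column-wise loop touches 16 different lines per column, more than the cache holds, so no line survives until it is needed again.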

9 Matrix Transpose
  Active Memory Optimization
    for i=0 to N-1
      for j=0 to N-1
        x += A[i][j];

    A' = Transpose(A);
    for i=0 to N-1
      for j=0 to N-1
        x += A'[i][j];
  Data Coherence
    A[i][j] and A'[j][i]
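
The re-mapping behind the transposed matrix (written A' here) can be sketched as a pure address calculation: an access to A'[i][j] is redirected to the storage of A[j][i]. The function name remap_transpose, the two base addresses, and the 8-byte element size are hypothetical values for illustration only; the talk does not specify how the controller lays out the re-mapped region.

```python
def remap_transpose(addr, base_a, base_at, n, elem_size=8):
    """Translate the address of A'[i][j] (re-mapped region starting at
    base_at) into the address of A[j][i] (real storage starting at base_a).
    Both matrices are assumed to be n x n and row-major."""
    index = (addr - base_at) // elem_size
    i, j = divmod(index, n)                   # position inside A'
    return base_a + (j * n + i) * elem_size   # same data, stored as A[j][i]

n = 4
base_a, base_at = 0x1000, 0x2000  # hypothetical base addresses

# Walking row 1 of A' touches exactly the addresses of column 1 of A.
row1_of_at = [remap_transpose(base_at + (1 * n + j) * 8, base_a, base_at, n)
              for j in range(n)]
col1_of_a = [base_a + (j * n + 1) * 8 for j in range(n)]
print(row1_of_at == col1_of_a)
```

Because the processor sees row 1 of A' as consecutive addresses, the controller can assemble one cache line from elements that are scattered across several lines of A, which is also why the two views must be kept coherent.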

10 Linked List Linearization
  Example Code
    ptr = A;
    while ptr != NULL
      sum += ptr->data;
      ptr = ptr->next;
  Problem
    Traversal of Linked List A

11 Linked List Linearization
  Active Memory Optimization
    A' = Linearize(A);
    ptr = A';
    while ptr != NULL
      sum += ptr->data;
      ptr = ptr->next;
  Data Coherence
    i-th Element of A and i-th Element of A'
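
As a software analogy for linearization, the sketch below copies the nodes of a list, in traversal order, into consecutive slots of an array and relinks the copies. The Node class and the linearize function are hypothetical; in the active memory system the controller presents the linearized view through address re-mapping rather than by copying, so this only illustrates the resulting layout.

```python
class Node:
    """Singly linked list node with the two fields used on the slide."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def linearize(head):
    """Return copies of the nodes in traversal order; slot k of the result
    stands for the k-th consecutive memory location."""
    slots = []
    ptr = head
    while ptr is not None:
        slots.append(Node(ptr.data))
        ptr = ptr.next
    for k in range(len(slots) - 1):
        slots[k].next = slots[k + 1]  # the k-th element points to slot k+1
    return slots

# Build the list 3 -> 1 -> 4, linearize it, and run the slide's traversal.
a = Node(3, Node(1, Node(4)))
a_lin = linearize(a)
total = 0
ptr = a_lin[0]
while ptr is not None:
    total += ptr.data
    ptr = ptr.next
print(total)
```

The traversal code is unchanged; only the placement of the nodes differs, so consecutive elements can share a cache line.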

12 Data Coherence Problem
  [Diagram: Processor cache holding lines C1 (Dirty) and C2 (Shared), each labeled D; Memory lines C0, C1, C2, and C3, each labeled D]

13 Solution to Data Coherence Problem
  Solution
    Mutual Exclusion
    Only one data element among the mapped ones can be cached at a time
  Implementation
    Extension of Cache Coherence Protocol
    Memory controller invalidates any other cache lines that are mapped to the requested line
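
The mutual-exclusion rule can be sketched as a small model of the memory controller: before granting a line, it invalidates every cached line that maps to the same data. The class name, the mapping table, and the line names C1, C2, and C' are illustrative assumptions, and writing back a Dirty line before invalidating it is assumed standard protocol behavior that the slide does not spell out.

```python
class MemoryControllerModel:
    """Toy model: tracks which lines the processor cache holds and enforces
    mutual exclusion among lines that map to the same data."""
    def __init__(self, mapping):
        self.mapping = mapping  # line -> set of lines mapped to the same data
        self.cached = {}        # line -> "Shared" or "Dirty"
        self.log = []           # actions taken, in order

    def request(self, line, state):
        for other in sorted(self.mapping.get(line, ())):
            if other in self.cached:
                old = self.cached.pop(other)
                # Assumption: a Dirty line is written back before invalidation.
                action = "writeback+invalidate" if old == "Dirty" else "invalidate"
                self.log.append((action, other))
        self.cached[line] = state

# C' (a line of the re-mapped matrix) shares data with C1 and C2.
mapping = {"C'": {"C1", "C2"}, "C1": {"C'"}, "C2": {"C'"}}
ctrl = MemoryControllerModel(mapping)
ctrl.request("C1", "Dirty")
ctrl.request("C2", "Shared")
ctrl.request("C'", "Dirty")  # forces C1 and C2 out of the cache
print(ctrl.cached, ctrl.log)
```

C1 and C2 can be cached together because they are not mapped to each other; only the request for C' triggers invalidations.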

14 Example: Mutual Exclusion
  [Diagram: Processor cache holding lines C0 (Shared), C1 (Dirty), and C2 (Shared), each labeled D; Memory lines C0, C1, C2, and C3, each labeled D]

15 Active Memory Controller Architecture
  [Block diagram with the following components]
    PI, NI
    Dispatch Unit
    AMPU (Active Memory Processor Unit), with Instruction Cache and Data Cache
    AMDU (Active Memory Data Unit), with Data Buffer
    Memory Interface, Memory Cells
    Send Unit

16 Example: Matrix Transpose
    for i=0 to N-1
      for j=0 to N-1
        x += A[i][j];

    A' = Transpose(A);
    for i=0 to N-1
      for j=0 to N-1
        x += A'[i][j];

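17 Example: Matrix Transpose
  [Diagram: Cache; Memory Controller; Memory with lines C0, C1, C2, C7 and matrices A and A']

18 Example: Matrix Transpose
  [Diagram: Cache holding C1 (Dirty) and C2 (Shared); Memory Controller; Memory with lines C0, C1, C2, C7 and matrices A and A']

19 Example: Matrix Transpose
  [Diagram: Cache holding C1 (Dirty) and C2 (Shared); Memory Controller; Memory with lines C0, C1, C2, C, C7 and matrices A and A']

20 Example: Matrix Transpose
  [Diagram: Cache holding C1 (Dirty), C (Dirty), and C2 (Shared); Memory Controller; Memory with lines C0, C1, C2, C, C7 and matrices A and A']

21 Example: Matrix Transpose
  [Diagram: Cache holding C (Dirty); Memory Controller; Memory with lines C0, C1, C2, C, C7 and matrices A and A']

22 Example: Matrix Transpose
  [Diagram: Cache holding C (Dirty); Memory Controller; Memory with lines C0, C1, C2, C, C7 and matrices A and A']

23 Example: Matrix Transpose
  [Diagram: Cache holding C0 and C (Dirty); Memory Controller; Memory with lines C0, C1, C2, C, C7 and matrices A and A']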

24 Simulation Environment
  Simulator
    Main Processor: MIPS-based ISA, 2 GHz
    Instruction Cache, Data Cache: 2-way set associative, 32 KB
    Unified Second-level Cache: 2-way set associative, 512 KB
    TLB: Fully associative, 64 entries
    Memory Latency: 125 ns
    Invalidation-based Bitvector Protocol
    User-level Cache Flush
  Applications
    Matrix Transpose: SPLASH-2 FFT, FFTW, Transpose
    Sparse Matrix: Conjugate Gradient, SMVP
    Linked List Linearization: Health, MST, Traverse
    Parallel Reduction: MMM, Sparse Flow, Reduction
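
For a sense of scale, the listed memory latency can be converted into processor cycles at the listed clock rate. This is plain arithmetic on the two figures above; it assumes the 125 ns latency is measured against the 2 GHz processor clock and says nothing about how the simulator models contention or overlap.

```latex
\[
125\,\text{ns} \times 2\,\text{GHz}
  = \left(125 \times 10^{-9}\,\text{s}\right) \times \left(2 \times 10^{9}\,\tfrac{\text{cycles}}{\text{s}}\right)
  = 250\ \text{cycles}
\]
```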

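25 Matrix Transpose
  [Chart: speedup of FFT, FFTW, and Transpose under the Normal, AM, AM+Prefetch, and Flush configurations]

26 Sparse Matrix
  [Chart: speedup of Conjugate Gradient and SMVM under the Normal, AM, AM+Prefetch, and Flush configurations]

27 Linked List Linearization
  [Chart: speedup of Health, MST, and Traverse under the Normal, AM, and AM+Prefetch configurations]

28 SMP - FFT
  [Chart: speedup versus number of processors for the Normal and AM configurations]

29 SMP Parallel Reduction
  [Chart: speedup of MMM, Sparse Flow, and Reduction under the Normal and AM configurations]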

30 Conclusions
  Future Work
    Multi-node Active Memory Systems
    Active Memory Elements
  Summary
    Data Coherence Problem in Active Memory Systems
    Cache Coherence Active Memory Architecture
      Transparent Address Re-mapping Techniques
      Multiprocessor Active Memory Systems
      New Classes of Active Memory Operations
    Simulation Results: 1.3 to 7.7 Speedup


ACTIVE memory systems provide a promising approach IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 2, FEBRUARY 2004 1 Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems Daehyun Kim, Mainak Chaudhuri, Student Member, IEEE, Mark