SPINTRONIC MEMORY ARCHITECTURE


1 SPINTRONIC MEMORY ARCHITECTURE Anand Raghunathan, Integrated Systems Laboratory, School of ECE, Purdue University, with Rangharajan Venkatesan, Shankar Ganesh Ramasubramanian, Ashish Ranjan, and Kaushik Roy. 7th NCN-NEEDS Summer School, Spintronics: Science, Circuits, and Systems

2 AGENDA Background Memory architecture Designing caches with STT-MRAM Exploring DWM for on-chip memory

3 DEMAND FOR ON-CHIP MEMORY Over 50% of chip area is devoted to memory. Multi-cores and the growing processor-memory gap accelerate the demand for on-chip memory. [Figures: cache trends in Intel microprocessors (cache transistors in millions, and % of chip transistors in cache); on-chip memory in SoCs]

4 STORAGE HIERARCHIES Fundamental tradeoff between speed and capacity Would like to have both Solution: Organize memory/storage in a hierarchical manner Source: Avnet

5 MICROPROCESSOR ON-CHIP MEMORY HIERARCHY Microprocessors use multiple levels of on-chip cache to hide memory latency and improve bandwidth. This exploits two key properties of memory accesses. Temporal locality: if a location is referenced, it is likely to be referenced again in the near future. Spatial locality: if a location is referenced, it is likely that locations near it will be referenced in the near future. [Figures: typical on-chip memory hierarchy; Intel Core i7 Mobile CPU cache hierarchy]

6 CACHE BASICS A cache holds a copy of some subset of data from the next lower level of the hierarchy Each cache line consists of an address tag and a data block Cache address = Tag + Index + Offset Source: MIT OCW, Course 6.823

7 CACHE BASICS Cache access: given an address, check if the tag is in the cache. If yes, cache hit: return data from the cache. If no, cache miss: retrieve data from the next level of the hierarchy. Where should the data be written in the cache? What if data is already present at that location? Average access time = Hit time + Miss rate * Miss penalty. Source: MIT OCW, Course 6.823

8 CACHE BASICS Caches help hide memory latency and increase throughput. Why? LITTLE'S LAW: the long-term average number of customers in a stable system, L, equals the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; expressed algebraically: L = λW.
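Little's Law can be read as a concurrency requirement for the memory system: to sustain a given request rate at a given latency, that many requests must be in flight. A minimal illustration (the numbers below are hypothetical, not from the slides):

```python
# Little's Law: L = lambda * W.
# L = average number of requests in flight, lambda = arrival rate, W = latency.

def outstanding_requests(arrival_rate_per_ns: float, latency_ns: float) -> float:
    """L = lambda * W."""
    return arrival_rate_per_ns * latency_ns

# 1 request/ns at 100 ns average memory latency -> 100 requests in flight.
# A cache that serves most requests in ~2 ns drastically cuts the required
# concurrency, which is how caches raise throughput, not just lower latency.
print(outstanding_requests(1.0, 100.0))  # 100.0
print(outstanding_requests(1.0, 2.0))    # 2.0
```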

9 CACHE BASICS Mapping of locations (addresses) may be direct (many-to-one) or associative (many-to-many) Direct-mapped 2-way associative Why associativity? Source: Wikipedia

10 CACHE BASICS A cache consists of a tag array, data array, and control logic. A tag array lookup indicates whether an access is a hit or a miss. The tag and data arrays may be accessed serially or in parallel. [Figure: organization of a direct-mapped cache] Source: MIT OCW, Course 6.823; Wikipedia

11 CACHE BASICS Various design choices present a rich tradeoff space Cache Size Cache Block Associativity Replacement policy Inclusivity Write policy Average access time = Hit time + Miss rate * Miss penalty
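The average access time formula on this slide composes across hierarchy levels: the L1 miss penalty is itself the average access time of L2. A small sketch with hypothetical latencies and miss rates:

```python
# Average access time = Hit time + Miss rate * Miss penalty (AMAT).
# All numbers below are illustrative, not measurements from the slides.

def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    return hit_time + miss_rate * miss_penalty

# Two-level hierarchy: the L1 miss penalty is the L2 average access time.
l2 = amat(hit_time=10.0, miss_rate=0.2, miss_penalty=100.0)  # 30.0 ns
l1 = amat(hit_time=1.0, miss_rate=0.05, miss_penalty=l2)     # 2.5 ns
print(l1, l2)
```

This is why a larger but slower L2 (e.g., STT-MRAM) can still pay off: it lowers the L1 miss penalty seen by the processor as long as its own miss rate drops enough.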

12 THE CASE FOR SPINTRONIC ON-CHIP MEMORIES The combination of endurance, speed, non-volatility and density make spintronic memories promising for on-chip applications Source: Toshiba Corp., IEDM 2012

13 DESIGNING CACHES WITH STT-MRAM [Figure: STT-MRAM bit-cell based on a magnetic tunnel junction with a fixed layer, tunneling oxide, and free layer]

14 STT-MRAM CACHE CHARACTERISTICS Iso-capacity comparison of 1MB SRAM and STT-MRAM caches. [Figure: SRAM vs. STT-MRAM compared on area, read latency, write latency, leakage, read energy, and write energy, with each metric marked as adverse or favorable for STT-MRAM]

15 WHERE IS STT-MRAM USED IN THE CACHE HIERARCHY? [Figure: CPU with an SRAM L1 cache (small, optimized for latency) and a Spin-Transfer Torque MRAM (STT-MRAM) L2 cache (large, optimized for density)] Due to its high write latency, STT-MRAM is suitable for the lower levels of the cache hierarchy. Architectural techniques are required to address inefficient writes.

16 ARCHITECTURAL OPTIMIZATIONS FOR STT-MRAM
Hybrid cache (X. Wu et al. ISCA 2009; A. Jadidi et al. ISLPED 2011)
Reduce write intensity: write biasing (M. Rasquinha et al. ISLPED 2010); partial line update (S. P. Park et al. DAC 2012)
Volatile STT-MRAM (C. W. Smullen et al. HPCA 2011; A. Jog et al. DAC 2012)
Write asymmetry aware architecture (K. Kwon et al. TVLSI 2014)

17 HYBRID SRAM/STT-MRAM L2 CACHE Observation: writes are concentrated in a small portion of the address space. Hybrid cache: split the cache into a write-intensive region (SRAM) and a read-intensive region (STT-MRAM), with a control policy deciding which region a cache block resides in. [Figure: conventional 8-way cache (Way0-Way7) vs. hybrid cache with an SRAM write region (Way0-Way1) and an STT-MRAM read region (Way2-Way7)] X. Wu et al. ISCA 2009

18 HYBRID SRAM/STT-MRAM L2 CACHE Aims to combine the benefits of SRAM and STT-MRAM. The large part of the cache uses STT-MRAM: high density, low leakage. A small part uses SRAM to store write-intensive cache blocks: efficient writes. [Figure: leakage, density, and write characteristics of SRAM, STT-MRAM, and the hybrid cache]

19 ARCHITECTURAL OPTIMIZATIONS FOR STT-MRAM
Hybrid cache (X. Wu et al. ISCA 2009; A. Jadidi et al. ISLPED 2011)
Reduce write intensity: write biasing (M. Rasquinha et al. ISLPED 2010); partial line update (S. P. Park et al. DAC 2012)
Volatile STT-MRAM (C. W. Smullen et al. HPCA 2011; A. Jog et al. DAC 2012)
Write asymmetry aware architecture (K. Kwon et al. TVLSI 2014)

20 REDUCING WRITE INTENSITY: WRITE BIASING Writes to the STT-MRAM L2 come from dirty block evictions from the L1 cache and from writes on misses. Typical eviction policies (LRU, FIFO) do not distinguish between clean and dirty blocks, yet only dirty blocks need to be written to L2! Write biasing: modify the eviction policy to increase the residency of dirty blocks in L1.

21 REDUCING WRITE INTENSITY: WRITE BIASING
Traditional eviction policy (LRU), status stack of a 4-way set-associative cache: the last-accessed block is always inserted at the top of stack (TOS). Example, from A B C D (highest to lowest priority): Load C -> C A B D; Store E -> E C A B.
Write biasing policy: newly written blocks are inserted at position K (here K=2) instead of TOS; blocks with repeated writes are promoted to TOS, so they are prioritized over recently accessed clean blocks. Example, from A B C D:
Load C -> C A B D (promote to TOS)
Store E -> C A E B (insert at position K)
Store E -> E C A B (promote to TOS)
Store G -> E C G A (insert at position K)
Store G -> G E C A (promote to TOS)
Load A -> G E A C (promote only to position K, as all blocks above K are dirty)
M. Rasquinha et al. ISLPED 2010
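The stack trace above can be sketched in code. This is a hedged reconstruction of the write-biased insertion policy (after M. Rasquinha et al., ISLPED 2010), not the exact published mechanism; the class name, the choice K=2, and the clean-load promotion rule are our simplifications, tuned to reproduce the slide's example:

```python
# Write-biased LRU sketch. stack[0] is top-of-stack (TOS); the last entry is
# the eviction victim. First-touch stores insert at position K instead of TOS;
# repeated writes promote to TOS; a clean load is promoted only to position K
# when every block above K is dirty.

K = 2  # insertion position for newly written blocks (illustrative)

class WriteBiasedSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.stack = []      # recency stack of block ids
        self.dirty = set()

    def access(self, block, is_store):
        if block in self.stack:                    # hit
            self.stack.remove(block)
            if is_store or block in self.dirty:
                pos = 0                            # repeated write: to TOS
            else:
                # clean load: promote to TOS unless all blocks above K are dirty
                pos = 0 if any(b not in self.dirty for b in self.stack[:K]) else K
            self.stack.insert(pos, block)
        else:                                      # miss
            if len(self.stack) >= self.ways:
                self.dirty.discard(self.stack.pop())   # evict LRU victim
            self.stack.insert(K if is_store else 0, block)
        if is_store:
            self.dirty.add(block)

s = WriteBiasedSet()
for b in ["D", "C", "B", "A"]:
    s.access(b, is_store=False)   # warm up: stack is A B C D, all clean
s.access("E", is_store=True)      # first write inserted at position K
print(s.stack)                    # ['A', 'B', 'E', 'C']: E stays below clean TOS blocks
```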

22 ARCHITECTURAL OPTIMIZATIONS FOR STT-MRAM
Hybrid cache (X. Wu et al. ISCA 2009; A. Jadidi et al. ISLPED 2011)
Reduce write intensity: write biasing (M. Rasquinha et al. ISLPED 2010); partial line update (S. P. Park et al. DAC 2012)
Volatile STT-MRAM (C. W. Smullen et al. HPCA 2011; A. Jog et al. DAC 2012)
Write asymmetry aware architecture (K. Kwon et al. TVLSI 2014)

23 REDUCING WRITE INTENSITY Observation: most bits in dirty cache blocks are unchanged, so writing the whole block upon eviction performs redundant writes. Solution: partial line update. [Figure: redundant bit writes to L2 across SPEC 2006 benchmarks] P. Zhou et al. ICCAD 2009

24 PARTIAL LINE UPDATE SCHEME Conventional cache: when a dirty cache line is evicted from the L1 cache, the entire 64B line is written to the L2 cache. Partial line update: track dirty data at a finer granularity (four 16B partial lines per 64B line) using additional history bits in the L1 tag array; when a line is evicted, only the dirty parts are written to the L2 cache. [Figure: conventional L1-to-L2 write path (full 64B line) vs. partial line update (only modified 16B partial lines written to the MRAM L2)] S. P. Park et al. DAC 2012
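The bookkeeping behind partial line update is small. A minimal sketch (after S. P. Park et al., DAC 2012); the class and method names are ours, and the 64B line / 16B sub-block sizes follow the slide:

```python
# Partial line update: the L1 tag entry carries one dirty bit per 16B
# sub-block of a 64B line, so an eviction writes only modified sub-blocks
# to the STT-MRAM L2 instead of the whole line.

SUBBLOCK = 16
LINE = 64
NUM_SUB = LINE // SUBBLOCK   # 4 partial lines per 64B line

class L1Line:
    def __init__(self):
        self.sub_dirty = [False] * NUM_SUB

    def write(self, offset, length):
        # mark every sub-block the store touches as dirty
        first = offset // SUBBLOCK
        last = (offset + length - 1) // SUBBLOCK
        for sub in range(first, last + 1):
            self.sub_dirty[sub] = True

    def bytes_written_on_evict(self):
        # bytes actually sent to L2 instead of the full 64B line
        return sum(self.sub_dirty) * SUBBLOCK

line = L1Line()
line.write(offset=4, length=8)    # touches only sub-block 0
line.write(offset=40, length=4)   # touches only sub-block 2
print(line.bytes_written_on_evict())  # 32 bytes instead of 64
```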

25 ARCHITECTURAL OPTIMIZATIONS FOR STT-MRAM
Hybrid cache (X. Wu et al. ISCA 2009; A. Jadidi et al. ISLPED 2011)
Reduce write intensity: write biasing (M. Rasquinha et al. ISLPED 2010); partial line update (S. P. Park et al. DAC 2012)
Volatile STT-MRAM (C. W. Smullen et al. HPCA 2011; A. Jog et al. DAC 2012)
Write asymmetry aware architecture (K. Kwon et al. TVLSI 2014)

26 VOLATILE STT-MRAM The lifetime of cache blocks in caches is small. [Figure: lifetimes of cache blocks in the L2 cache] Relax retention time to improve write efficiency. Retention time vs. write latency for an STT-MRAM L2 cache: 10 years -> 11 ns; 1 sec -> 6 ns; 10 ms -> 3 ns. A. Jog et al. DAC 2012

27 VOLATILE STT-MRAM ARCHITECTURE Reducing retention time makes the cache volatile, requiring refresh or write-back for dirty blocks. A 2-bit state machine per block determines block status over its lifetime (10 ms retention time), with 4 states: S0, initial state (just written); S1, intermediate state; S2, diminishing state (about to become invalid); S3, invalid state. A counter pulse of width T advances the state; a write (W) returns the block to S0. A. Jog et al. DAC 2012

28 VOLATILE STT-MRAM ARCHITECTURE A 2-bit counter per block in the tag array stores the block status (S0, S1, S2, S3). For blocks in the S2 (diminishing) state: refresh, if more recently used (MRU); write back, if less recently used (LRU). [Figure: tag array with per-way block states, showing refresh of MRU ways and write-back of LRU ways] A. Jog et al. DAC 2012
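The two slides above can be sketched together. This is a hedged reconstruction of the 2-bit block-status scheme (after A. Jog et al., DAC 2012); the exact timing of the refresh/write-back decision relative to the counter pulse is our simplification:

```python
# Volatile STT-MRAM block status: a periodic counter pulse advances every
# valid block S0 -> S1 -> S2; on reaching the diminishing state S2, an MRU
# block is refreshed (rewritten, back to S0) and an LRU block is written
# back and invalidated (S3) before its data decays. A write resets to S0.

S0, S1, S2, S3 = 0, 1, 2, 3   # fresh, intermediate, diminishing, invalid

class Block:
    def __init__(self):
        self.state, self.dirty = S3, False

    def write(self):
        self.state, self.dirty = S0, True

    def tick(self, is_mru):
        """One counter pulse T; returns the action taken, if any."""
        if self.state == S3:
            return None                  # invalid blocks need no action
        self.state += 1
        if self.state == S2:
            if is_mru:
                self.state = S0          # refresh: rewrite the data in place
                return "refresh"
            self.state = S3              # write back dirty data, invalidate
            self.dirty = False
            return "writeback"
        return None
```

In the real design the "MRU" input comes from the cache's existing replacement state, so the scheme adds only the 2-bit counter per block.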

29 ARCHITECTURAL OPTIMIZATIONS FOR STT-MRAM
Hybrid cache (X. Wu et al. ISCA 2009; A. Jadidi et al. ISLPED 2011)
Reduce write intensity: write biasing (M. Rasquinha et al. ISLPED 2010); partial line update (S. P. Park et al. DAC 2012)
Volatile STT-MRAM (C. W. Smullen et al. HPCA 2011; A. Jog et al. DAC 2012)
Write asymmetry aware architecture (K. Kwon et al. TVLSI 2014)

30 ASYMMETRIC WRITE ARCHITECTURE WITH REDUNDANT BLOCKS (AWARE) Observation: the write latency of STT-MRAM bit-cells is asymmetric (AP->P is ~3X faster than P->AP; encoding 0: AP, 1: P). Idea: provide slow and fast writes, and increase the frequency of fast writes, which involve only AP->P switching. Add redundant blocks (RBLs) to the cache data array, pre-set to the AP state. Fast write (5.5 ns, AP->P): write into a clean RBL and swap it with the data block (DBL). Slow write (16.5 ns, P->AP): when no clean RBL is available, update the DBL in place and clean all RBLs that share the word line. K. Kwon et al. TVLSI 2014
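The fast/slow write decision can be sketched as a small state machine per word line. This is a hedged simplification of AWARE (after K. Kwon et al., TVLSI 2014): the class name and single-RBL-per-word-line configuration are ours; the 5.5 ns / 16.5 ns figures are the slide's examples:

```python
# AWARE sketch: RBLs pre-set to the all-AP state absorb writes using only
# fast AP->P transitions; when the RBL pool is exhausted, a slow in-place
# write occurs and the word line's RBLs are cleaned back to AP as a side
# effect, re-enabling fast writes.

FAST_NS, SLOW_NS = 5.5, 16.5

class WordLine:
    def __init__(self, num_rbls=1):
        self.num_rbls = num_rbls
        self.clean_rbls = num_rbls   # RBLs currently pre-set to AP

    def write(self):
        if self.clean_rbls > 0:
            self.clean_rbls -= 1           # fast write into a clean RBL,
            return FAST_NS                 # which then swaps roles with the DBL
        self.clean_rbls = self.num_rbls    # slow write; RBLs cleaned to AP
        return SLOW_NS

wl = WordLine()
print(wl.write())  # 5.5  (fast: clean RBL available)
print(wl.write())  # 16.5 (slow: pool exhausted; cleaning restores the RBL)
print(wl.write())  # 5.5
```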

31 SUMMARY STT-MRAM is a promising candidate for lower-level caches High density, non-volatility and low leakage Key challenges: High write energy, high write latency, asymmetric writes Suitable architectural optimizations can significantly mitigate the impact of inefficient writes

32 EXPLORING DOMAIN WALL MEMORY FOR ON-CHIP CACHES [Figure: ferromagnetic wire with a fixed-layer MTJ; read/write0 and write1 currents, and shift-left/shift-right currents]

33 BACKGROUND: DOMAIN WALL MEMORY (DWM) Structure: a ferromagnetic wire and a Magnetic Tunnel Junction (MTJ); data is stored in the magnetic domains of the ferromagnetic wire (free layer, over the MTJ's fixed layer). Operations: shift, read, and write. Read/write operations are performed using the MTJ; bits are shifted along the ferromagnetic wire by applying a current pulse.

34 DWM HISTORY Initially proposed for secondary storage and storage-class memory applications; recent efforts explore its potential for on-chip memory and logic. [Timeline: concept (S. Parkin et al., Science); prototypes (NEC, IBM); applications: accelerator memory (NanoArch 2011), re-configurable logic (TMag. 2011), general-purpose caches (ISLPED 2012, DATE 2013, DAC 2013)]

35 DWM: BENEFITS AND CHALLENGES DWM offers excellent density; variable access latency is a unique challenge. [Figure: access time (1 ns to 1 ms) vs. cell area per bit (F^2) for SRAM, DRAM, STT-MRAM, FeRAM, PCRAM, FLASH-NOR, FLASH-NAND, and DWM, annotated with idle power (low/high) and write energy (low/medium/high); data from S. Parkin et al. Science 2008]

36 BASIC DWM BIT-CELL DWM bit-cell structure: a DWM device with shift ports and a read/write (RW) port. The shift ports are minimum sized; the RW port is large. Bit-cell area is dominated by the access transistors. [Figure: schematic and layout of the DWM cell, with shift transistors TS1/TS2 on the shift word lines (SWL) and RW transistors TRW1/TRW2 on the MTJ] R. Venkatesan et al. ISLPED 2012

37 DWM: LOGICAL VIEW A DWM tape stores multiple bits, with a tape head controlled by a shift controller. Head status is stored to track the current location of the tape head. Access latency is variable, giving a tradeoff between density and access latency. [Figure: shift controller and head status logic driving a tape head across an 8-location DWM tape (0x0-0x7); density vs. latency tradeoff] R. Venkatesan et al. ISLPED 2012
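The variable access latency above has a simple model: an access must first shift the tape until the target domain sits under the head. A minimal sketch; the cost parameters are illustrative, not device numbers from the slides:

```python
# DWM tape model: access latency = (shift distance from current head
# position) * per-shift cost + read/write cost. The head position after
# an access is exactly what the "head status" on the slide tracks.

class DWMTape:
    def __init__(self, bits_per_tape=8, shift_ns=0.5, rw_ns=1.0):
        self.bits = bits_per_tape
        self.shift_ns, self.rw_ns = shift_ns, rw_ns
        self.head = 0                      # current head position (head status)

    def access(self, pos):
        shifts = abs(pos - self.head)      # latency depends on the last access
        self.head = pos                    # head stays where the access ended
        return shifts * self.shift_ns + self.rw_ns

t = DWMTape()
print(t.access(0x5))  # 5 shifts from position 0 -> 3.5
print(t.access(0x6))  # spatial locality: only 1 shift -> 1.5
```

More bits per tape raises density (one RW port amortized over more domains) but stretches the worst-case shift distance, which is exactly the density vs. latency tradeoff on the slide.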

38 DWM BIT-CELL OPTIMIZATIONS Shift-based write: fast and energy-efficient writes performed by shifting domain walls from fixed domains into the free domain. Varying bits per tape (1bitDWM vs. MultibitDWM): tradeoff between latency and area. Read-only ports: accelerate performance-critical reads (MultibitDWM with read-only ports). [Figure: bit-cell schematics, and a comparison of SRAM, STT-MRAM, and DWM on area, read/write latency, leakage, and read/write energy] R. Venkatesan et al. DATE 2013; S. Fukami et al. VLSI Symp. 2009

39 TAPECACHE: DWM-BASED CACHE FOR GENERAL-PURPOSE PROCESSORS Motivation: in a 1MB SRAM cache, both area and energy are dominated by the data array. [Figure: area/energy distribution between tag and data arrays] TapeCache design: L1 caches use 1bitDWM; a hybrid L2 cache pairs a 1bitDWM-based tag array with a MultibitDWM-based data array. R. Venkatesan et al. ISLPED 2012; R. Venkatesan et al. DATE 2013

40 TAPECACHE: DWM TAPE CLUSTERS Bit-interleaved DWM Tape Cluster (DTC) organization: each bit of a cache block is stored in a different DWM tape, so tape i holds bit i of blocks 0 through N. This allows a parallel read of all bits in a cache block and amortizes the head control circuitry across the cluster. [Figure: M tapes of N bits each, bit-interleaved across cache blocks] R. Venkatesan et al. ISLPED 2012

41 TAPECACHE DATA ARRAY ORGANIZATION Data array: randomly addressable, built from bit-interleaved DTCs of MultibitDWM bit-cells. Head status array: stores the current head location for each DTC. Shift control logic: determines the number of shifts required to access a block. The index bits are split into two parts: decode bits and shift bits. [Figure: address path through the tag array, decoder, head status array, DTCs, shift control logic, sense amps, and output drivers] R. Venkatesan et al. ISLPED 2012

42 TAPECACHE MANAGEMENT POLICY
Tape head selection. Static: each cache block is statically assigned a tape head. Dynamic: select the tape head nearest to the required cache block.
Tape head update. Eager: restore the tape head to a default location after each access. Lazy: update the head status to track the tape head location, exploiting spatial locality.
Pre-shifting: predict the bit likely to be accessed next and position the tape head in advance.
[Figure: 8-location tape (0x0-0x7) with two tape heads and the head status array]
R. Venkatesan et al. ISLPED 2012; R. Venkatesan et al. DATE 2013
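Dynamic head selection and lazy update compose naturally. A minimal sketch (after R. Venkatesan et al., ISLPED 2012); the two-head configuration and starting positions are illustrative:

```python
# Dynamic selection: pick the head with the fewest shifts to the target.
# Lazy update: leave the head where the access ended (tracked in the head
# status array) instead of restoring a default position, so nearby accesses
# pay few shifts.

class TwoHeadTape:
    def __init__(self, head_positions=(0x0, 0x4)):
        self.heads = list(head_positions)

    def access(self, addr):
        # dynamic selection: nearest head wins
        h = min(range(len(self.heads)), key=lambda i: abs(self.heads[i] - addr))
        shifts = abs(self.heads[h] - addr)
        self.heads[h] = addr               # lazy update of head status
        return shifts

tape = TwoHeadTape()
print(tape.access(0x5))  # head at 0x4 is nearest: 1 shift
print(tape.access(0x1))  # head at 0x0 is nearest: 1 shift
```

An eager policy would instead reset each head to its default slot after the access, trading extra shift operations for a bounded worst-case distance on the next access.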

43 DEVICE-TO-ARCHITECTURE SIMULATION FRAMEWORK Architectural evaluation: multi-core systems with L1/L2 caches and spintronic (STT-MRAM/DWM) cache levels, simulated in SimpleScalar/gem5. Cache design: DWM-CACTI models the tag array, data array, and DTCs. Device level: domain wall engineering via device simulation (C. Augustine et al. IEDM 2011).

44 EXPERIMENTAL SETUP System configuration. Processor core: Alpha pipeline, issue width 4, 3GHz. Functional units: integer, 8 ALUs and 4 multipliers; floating point, 2 ALUs and 2 multipliers. I/D-cache: 32 KB, direct-mapped, 32B line size. Unified L2 cache: 1MB, 4-way set-associative, 128B line size, 4 banks. Technology node: 32nm. Benchmarks: SPEC2K6. Evaluated energy, area and performance using iso-capacity replacement.

45 SPINTRONIC CACHES FOR GPGPUS GPUs are everywhere, from mobile to servers, with huge parallelism and high memory bandwidth requirements. On-chip memory in GPUs increases with each generation, and memory consumes a significant fraction of GPU area and power; energy is dominated by leakage and reads. [Figure: on-chip memory size (MB) by year for Nvidia (G80, GT200, GF104, GK104, GK110) and AMD (Radeon 7970, Radeon 290X) GPUs]

46 STAG: OVERVIEW Challenge: accesses from multiple SMs are interleaved at the L2, and consecutive accesses from an SM belong to different warps, resulting in low locality. Proposal: warp-ID based prediction, which exploits intra-warp locality using a history table (warp ID, address, stride, confidence); and a shift-aware prefetch buffer (SaPB) design, which eliminates contention from different SMs. [Figure: GPGPU with streaming multiprocessors (warp scheduler, register file, cores, shared memory, L1 I/D, texture and constant caches) sharing a DWM L2 cache built from a 1bitDWM tag array and a MultibitDWM data array] R. Venkatesan et al. ISCA 2014
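The warp-ID based prediction can be sketched as a stride predictor whose history table is indexed by warp ID, so each warp's stream is untangled from the interleaved access order. This is a hedged simplification (after R. Venkatesan et al., ISCA 2014); the table format and confidence rule are ours:

```python
# Per-warp stride prediction: track (last address, stride, confidence) per
# warp ID. Indexing by warp ID recovers regular strides that are invisible
# in the raw interleaved sequence of accesses from many warps and SMs.

class WarpStridePredictor:
    def __init__(self):
        self.table = {}   # warp_id -> (last_addr, stride, confidence)

    def train_and_predict(self, warp_id, addr):
        last, stride, conf = self.table.get(warp_id, (addr, 0, 0))
        new_stride = addr - last
        # confidence rises only when the same nonzero stride repeats
        conf = conf + 1 if new_stride == stride and stride != 0 else 0
        self.table[warp_id] = (addr, new_stride, conf)
        # issue a prefetch target only once the stride is confirmed
        return addr + new_stride if conf >= 1 else None

p = WarpStridePredictor()
p.train_and_predict(127, 0x1000)                    # first access: no prediction
p.train_and_predict(127, 0x1080)                    # stride learned, unconfirmed
print(hex(p.train_and_predict(127, 0x1100)))        # 0x1180 predicted
```

In STAG the predicted address feeds the shift-aware prefetch buffer, so the tape can be pre-shifted before the demand access arrives.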

47 STAG: RESULTS Simulation setup: GPGPU-Sim with Rodinia and Parboil benchmarks. Baseline system configuration. Processor: 16 SMs with 32 cores/SM, 32KB register file, 48KB shared memory. L1 caches: I-cache 4KB, 64B blocks, 4 ways; D-cache 16KB, 128B blocks, 4 ways; constant cache 4KB, 64B blocks, 2 ways; texture cache 16KB, 64B blocks, 16 ways. L2 cache: 768 KB, 128B blocks, 16-way associative, 6 banks. Results [Figure: normalized IPC and energy for STT-MRAM, All-1bitDWM, STAG, and DWM-Ideal]: compared to iso-area SRAM, 26% IPC improvement and 4X energy reduction; compared to iso-area STT-MRAM, an IPC improvement and 3.6X energy reduction. R. Venkatesan et al. ISCA 2014

48 SUMMARY While initially proposed for secondary storage, DWM is also a promising candidate for on-chip memories Excellent density, in addition to non-volatility and low leakage Key challenge: Variable latency due to shift operations Suitable bit-cell design, cache architecture and management policies can eliminate the impact on performance

49 REFERENCES: STT-MRAM
1. S. P. Park, S. Gupta, N. Mojumder, A. Raghunathan, and K. Roy, Future cache design using STT MRAMs for improved energy efficiency: devices, circuits and architecture, in Proc. DAC, 2012.
2. K. Kwon, S. H. Choday, Y. Kim, and K. Roy, AWARE (Asymmetric Write Architecture With REdundant Blocks): A High Write Speed STT-MRAM Cache Architecture, IEEE TVLSI, 2014.
3. M. Rasquinha, D. Choudhary, S. Chatterjee, S. Mukhopadhyay, and S. Yalamanchili, An energy efficient cache design using Spin Torque Transfer (STT) RAM, in Proc. ISLPED, 2010.
4. X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie, Hybrid cache architecture with disparate memory technologies, in Proc. ISCA, 2009.
5. A. Jadidi, M. Arjomand, and H. S. Azad, High-endurance and performance efficient design of hybrid cache architectures through adaptive line replacement, in Proc. ISLPED, 2011.
6. C. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. Stan, Relaxing non-volatility for fast and energy-efficient STT-RAM caches, in Proc. HPCA, 2011.
7. A. Jog, A. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. Das, Cache revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs, in Proc. DAC, 2012.

50 REFERENCES: DWM
1. R. Venkatesan, V. Kozhikkottu, C. Augustine, A. Raychowdhury, K. Roy, and A. Raghunathan, TapeCache: A High Density, Energy Efficient Cache based on Domain Wall Memory, in Proc. ISLPED, 2012.
2. R. Venkatesan, M. Sharad, K. Roy, and A. Raghunathan, DWM-TAPESTRI - An Energy Efficient All-Spin Cache using Domain wall Shift based Writes, in Proc. DATE, 2013.
3. R. Venkatesan, M. Sharad, V. J. Kozhikottu, C. Augustine, A. Raychowdhury, K. Roy, and A. Raghunathan, Cache Design with Domain Wall Memory, IEEE Transactions on Computers (accepted for publication).
4. R. Venkatesan, S. Ramasubramanium, S. Venkataramani, K. Roy, and A. Raghunathan, STAG: Spintronic-Tape Architecture for GPGPU Cache Hierarchies, in Proc. ISCA, 2014.
5. M. Sharad, R. Venkatesan, A. Raghunathan, and K. Roy, Multi-level Magnetic RAM using Domain wall Shift for Energy-Efficient, High-Density Caches, in Proc. ISLPED.
6. R. Venkatesan, M. Sharad, K. Roy, and A. Raghunathan, Dense, Energy-efficient All-Spin Cache Hierarchy using Shift based Writes and Multi-Level Storage, ACM Journal on Emerging Technologies in Computing Systems (accepted for publication).

51 THANK YOU! Questions?

This material is based upon work supported in part by Intel Corporation /DATE13/ c 2013 EDAA



More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Review: Major Components of a Computer Processor Devices Control Memory Input Datapath Output Secondary Memory (Disk) Main Memory Cache Performance

More information

Caches. Samira Khan March 23, 2017

Caches. Samira Khan March 23, 2017 Caches Samira Khan March 23, 2017 Agenda Review from last lecture Data flow model Memory hierarchy More Caches The Dataflow Model (of a Computer) Von Neumann model: An instruction is fetched and executed

More information

Revolutionizing Technological Devices such as STT- RAM and their Multiple Implementation in the Cache Level Hierarchy

Revolutionizing Technological Devices such as STT- RAM and their Multiple Implementation in the Cache Level Hierarchy Revolutionizing Technological s such as and their Multiple Implementation in the Cache Level Hierarchy Michael Mosquera Department of Electrical and Computer Engineering University of Central Florida Orlando,

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Phase Change Memory An Architecture and Systems Perspective

Phase Change Memory An Architecture and Systems Perspective Phase Change Memory An Architecture and Systems Perspective Benjamin C. Lee Stanford University bcclee@stanford.edu Fall 2010, Assistant Professor @ Duke University Benjamin C. Lee 1 Memory Scaling density,

More information

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors , July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

More information

1/19/2009. Data Locality. Exploiting Locality: Caches

1/19/2009. Data Locality. Exploiting Locality: Caches Spring 2009 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Data Locality Temporal: if data item needed now, it is likely to be needed again in near future Spatial: if data item needed now, nearby

More information

Memory Hierarchy Y. K. Malaiya

Memory Hierarchy Y. K. Malaiya Memory Hierarchy Y. K. Malaiya Acknowledgements Computer Architecture, Quantitative Approach - Hennessy, Patterson Vishwani D. Agrawal Review: Major Components of a Computer Processor Control Datapath

More information

ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture

ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture Jie Zhang, Miryeong Kwon, Changyoung Park, Myoungsoo Jung, Songkuk Kim Computer Architecture and Memory

More information

Architectural Aspects in Design and Analysis of SOTbased

Architectural Aspects in Design and Analysis of SOTbased Architectural Aspects in Design and Analysis of SOTbased Memories Rajendra Bishnoi, Mojtaba Ebrahimi, Fabian Oboril & Mehdi Tahoori INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Couture: Tailoring STT-MRAM for Persistent Main Memory. Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung

Couture: Tailoring STT-MRAM for Persistent Main Memory. Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung Couture: Tailoring STT-MRAM for Persistent Main Memory Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung Executive Summary Motivation: DRAM plays an instrumental role in modern

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

The Memory Hierarchy & Cache

The Memory Hierarchy & Cache Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

The Engine. SRAM & DRAM Endurance and Speed with STT MRAM. Les Crudele / Andrew J. Walker PhD. Santa Clara, CA August

The Engine. SRAM & DRAM Endurance and Speed with STT MRAM. Les Crudele / Andrew J. Walker PhD. Santa Clara, CA August The Engine & DRAM Endurance and Speed with STT MRAM Les Crudele / Andrew J. Walker PhD August 2018 1 Contents The Leaking Creaking Pyramid STT-MRAM: A Compelling Replacement STT-MRAM: A Unique Endurance

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Cache Memory Configurations and Their Respective Energy Consumption

Cache Memory Configurations and Their Respective Energy Consumption Cache Memory Configurations and Their Respective Energy Consumption Dylan Petrae Department of Electrical and Computer Engineering University of Central Florida Orlando, FL 32816-2362 Abstract When it

More information

Advanced Computer Architecture

Advanced Computer Architecture ECE 563 Advanced Computer Architecture Fall 2009 Lecture 3: Memory Hierarchy Review: Caches 563 L03.1 Fall 2010 Since 1980, CPU has outpaced DRAM... Four-issue 2GHz superscalar accessing 100ns DRAM could

More information

CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today s Lecture. The Big Picture: Where are We Now?

CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today s Lecture. The Big Picture: Where are We Now? cps 14 memory.1 RW Fall 2 CPS11 Computer Organization and Programming Lecture 13 The System Robert Wagner Outline of Today s Lecture System the BIG Picture? Technology Technology DRAM A Real Life Example

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections ) Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case

More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

CS311 Lecture 21: SRAM/DRAM/FLASH

CS311 Lecture 21: SRAM/DRAM/FLASH S 14 L21-1 2014 CS311 Lecture 21: SRAM/DRAM/FLASH DARM part based on ISCA 2002 tutorial DRAM: Architectures, Interfaces, and Systems by Bruce Jacob and David Wang Jangwoo Kim (POSTECH) Thomas Wenisch (University

More information

Computer Architecture and System Software Lecture 09: Memory Hierarchy. Instructor: Rob Bergen Applied Computer Science University of Winnipeg

Computer Architecture and System Software Lecture 09: Memory Hierarchy. Instructor: Rob Bergen Applied Computer Science University of Winnipeg Computer Architecture and System Software Lecture 09: Memory Hierarchy Instructor: Rob Bergen Applied Computer Science University of Winnipeg Announcements Midterm returned + solutions in class today SSD

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy The course has 4 modules Module 1 Instruction Set Architecture (ISA) Simple Pipelining and Hazards Module 2 Superscalar Architectures

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

Question?! Processor comparison!

Question?! Processor comparison! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 5.1-5.2!! (Over the next 2 lectures)! Lecture 18" Introduction to Memory Hierarchies! 3! Processor components! Multicore processors and programming! Question?!

More information

Energy-Efficient Spin-Transfer Torque RAM Cache Exploiting Additional All-Zero-Data Flags

Energy-Efficient Spin-Transfer Torque RAM Cache Exploiting Additional All-Zero-Data Flags Energy-Efficient Spin-Transfer Torque RAM Cache Exploiting Additional All-Zero-Data Flags Jinwook Jung, Yohei Nakata, Masahiko Yoshimoto, and Hiroshi Kawaguchi Graduate School of System Informatics, Kobe

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required

More information

Lecture 17 Introduction to Memory Hierarchies" Why it s important " Fundamental lesson(s)" Suggested reading:" (HP Chapter

Lecture 17 Introduction to Memory Hierarchies Why it s important  Fundamental lesson(s) Suggested reading: (HP Chapter Processor components" Multicore processors and programming" Processor comparison" vs." Lecture 17 Introduction to Memory Hierarchies" CSE 30321" Suggested reading:" (HP Chapter 5.1-5.2)" Writing more "

More information

Let!s go back to a course goal... Let!s go back to a course goal... Question? Lecture 22 Introduction to Memory Hierarchies

Let!s go back to a course goal... Let!s go back to a course goal... Question? Lecture 22 Introduction to Memory Hierarchies 1 Lecture 22 Introduction to Memory Hierarchies Let!s go back to a course goal... At the end of the semester, you should be able to......describe the fundamental components required in a single core of

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

CENG3420 Lecture 08: Memory Organization

CENG3420 Lecture 08: Memory Organization CENG3420 Lecture 08: Memory Organization Bei Yu byu@cse.cuhk.edu.hk (Latest update: February 22, 2018) Spring 2018 1 / 48 Overview Introduction Random Access Memory (RAM) Interleaving Secondary Memory

More information

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed Survey results CS 6354: Memory Hierarchy I 29 August 2016 1 2 Processor/Memory Gap Variety in memory technologies SRAM approx. 4 6 transitors/bit optimized for speed DRAM approx. 1 transitor + capacitor/bit

More information

AC-DIMM: Associative Computing with STT-MRAM

AC-DIMM: Associative Computing with STT-MRAM AC-DIMM: Associative Computing with STT-MRAM Qing Guo, Xiaochen Guo, Ravi Patel Engin Ipek, Eby G. Friedman University of Rochester Published In: ISCA-2013 Motivation Prevalent Trends in Modern Computing:

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Memory hierarchy Outline

Memory hierarchy Outline Memory hierarchy Outline Performance impact Principles of memory hierarchy Memory technology and basics 2 Page 1 Performance impact Memory references of a program typically determine the ultimate performance

More information

CS 6354: Memory Hierarchy I. 29 August 2016

CS 6354: Memory Hierarchy I. 29 August 2016 1 CS 6354: Memory Hierarchy I 29 August 2016 Survey results 2 Processor/Memory Gap Figure 2.2 Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time

More information

Computer Architecture

Computer Architecture Computer Architecture Lecture 7: Memory Hierarchy and Caches Dr. Ahmed Sallam Suez Canal University Spring 2015 Based on original slides by Prof. Onur Mutlu Memory (Programmer s View) 2 Abstraction: Virtual

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2 Instructors: John Wawrzynek & Vladimir Stojanovic http://insteecsberkeleyedu/~cs61c/ Typical Memory Hierarchy Datapath On-Chip

More information

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications

More information

ECE468 Computer Organization and Architecture. Memory Hierarchy

ECE468 Computer Organization and Architecture. Memory Hierarchy ECE468 Computer Organization and Architecture Hierarchy ECE468 memory.1 The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Input Datapath Output Today s Topic:

More information

Memory Hierarchy and Caches

Memory Hierarchy and Caches Memory Hierarchy and Caches COE 301 / ICS 233 Computer Organization Dr. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals Presentation Outline

More information

Architecting the last-level cache for GPUs using STT-MRAM nonvolatile memory

Architecting the last-level cache for GPUs using STT-MRAM nonvolatile memory CHAPTER Architecting the last-level cache for GPUs using STT-MRAM nonvolatile memory 20 M.H. Samavatian 1, M. Arjomand 1, R. Bashizade 1, H. Sarbazi-Azad 1,2 Sharif University of Technology, Tehran, Iran

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

Register File Organization

Register File Organization Register File Organization Sudhakar Yalamanchili unless otherwise noted (1) To understand the organization of large register files used in GPUs Objective Identify the performance bottlenecks and opportunities

More information

Loadsa 1 : A Yield-Driven Top-Down Design Method for STT-RAM Array

Loadsa 1 : A Yield-Driven Top-Down Design Method for STT-RAM Array Loadsa 1 : A Yield-Driven Top-Down Design Method for STT-RAM Array Wujie Wen, Yaojun Zhang, Lu Zhang and Yiran Chen University of Pittsburgh Loadsa: a slang language means lots of Outline Introduction

More information

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1 Memory Hierarchy Maurizio Palesi Maurizio Palesi 1 References John L. Hennessy and David A. Patterson, Computer Architecture a Quantitative Approach, second edition, Morgan Kaufmann Chapter 5 Maurizio

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

Caches 3/23/17. Agenda. The Dataflow Model (of a Computer)

Caches 3/23/17. Agenda. The Dataflow Model (of a Computer) Agenda Caches Samira Khan March 23, 2017 Review from last lecture Data flow model Memory hierarchy More Caches The Dataflow Model (of a Computer) Von Neumann model: An instruction is fetched and executed

More information

15-740/ Computer Architecture, Fall 2011 Midterm Exam II

15-740/ Computer Architecture, Fall 2011 Midterm Exam II 15-740/18-740 Computer Architecture, Fall 2011 Midterm Exam II Instructor: Onur Mutlu Teaching Assistants: Justin Meza, Yoongu Kim Date: December 2, 2011 Name: Instructions: Problem I (69 points) : Problem

More information

Area, Power, and Latency Considerations of STT-MRAM to Substitute for Main Memory

Area, Power, and Latency Considerations of STT-MRAM to Substitute for Main Memory Area, Power, and Latency Considerations of STT-MRAM to Substitute for Main Memory Youngbin Jin, Mustafa Shihab, and Myoungsoo Jung Computer Architecture and Memory Systems Laboratory Department of Electrical

More information

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 19: Main Memory Prof. Onur Mutlu Carnegie Mellon University Last Time Multi-core issues in caching OS-based cache partitioning (using page coloring) Handling

More information

A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers

A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers Soohyun Yang and Yeonseung Ryu Department of Computer Engineering, Myongji University Yongin, Gyeonggi-do, Korea

More information

Memory systems. Memory technology. Memory technology Memory hierarchy Virtual memory

Memory systems. Memory technology. Memory technology Memory hierarchy Virtual memory Memory systems Memory technology Memory hierarchy Virtual memory Memory technology DRAM Dynamic Random Access Memory bits are represented by an electric charge in a small capacitor charge leaks away, need

More information