Revolutionizing Technological Devices such as STT-RAM and Their Multiple Implementations in the Cache Level Hierarchy


Revolutionizing Technological Devices such as STT-RAM and Their Multiple Implementations in the Cache Level Hierarchy

Michael Mosquera
Department of Electrical and Computer Engineering
University of Central Florida
Orlando, FL 32816-2362

Abstract — Many devices are currently being tested to replace the antiquated SRAM, static random access memory, which has been in use for decades. New technological devices such as STT-RAM, eDRAM, and even PRAM are being introduced and are presently being tested to replace SRAM. Not only are these devices positioned to eventually replace SRAM, but testing is also being conducted to determine the cache level at which each device should be placed for maximum efficiency and data-retrieval performance, whether that be cache Level 1, Level 2, or Level 3. Although most devices such as STT-RAM have been placed at Level 3 of the cache hierarchy, new designs now incorporate these devices at Level 1 or Level 2.

Keywords — STT-RAM, SRAM, eDRAM, PRAM, Volatile, Non-Volatile, Cache, Level 1, Level 2, Level 3, LLC, Associativity, Protocol, write instruction, read instruction

I. INTRODUCTION

As with many technological advancements of recent years, developments in computer system processors are being realized. Within these processors exists a memory called cache: a section of memory located inside the CPU that stores data to be retrieved on request by the CPU. Compared to memory located on the motherboard, cache is incredibly fast, and its placement admits a variety of multi-level configurations: Level 1, Level 2, and Level 3. When optimizing a processor's cache, there are two levers: the capacity of an individual cache level can be increased to store additional data, or the hierarchy itself can be restructured, from two-level up to three-level cache organizations.

Multi-level caches are vital for data-retrieval performance. When one cache level reaches its maximum capacity, data can be placed in the next level (for example, Level 2) rather than pushed all the way out to main memory, and the miss penalty also decreases with multi-level caches [13]. As well as having multiple cache-level organizations, there exist several cache-associativity approaches for placing data blocks within the cache: full associativity, set associativity, and direct mapping, each with its own advantages and disadvantages [13].

Memory is another very important aspect of computer systems. Many technological devices exist for storing data, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Spin Transfer Torque RAM (STT-RAM), and embedded DRAM (eDRAM) [13]. While every device listed is used for storing data, they differ in that some are volatile while others are non-volatile. Memory components such as SRAM, DRAM, and eDRAM are volatile: they lose their data without a constant supply of voltage. Devices such as STT-RAM are non-volatile: there is no data leakage or loss when the voltage source is removed.

Depending on the cache associativity, the manner in which data is stored and accessed differs from method to method. In a direct-mapped cache, each block from memory maps to exactly one line in the cache, whereas in a set-associative cache a specific set of cache lines can hold a given block from memory.
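As a concrete illustration of the direct-mapped scheme just described, the sketch below splits an address into its tag, index, and offset fields. The geometry (32 KB cache, 64-byte lines) is an assumed example, not a configuration taken from this paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal sketch of direct-mapped address decomposition.
 * Assumed example geometry: 32 KB cache, 64-byte lines,
 * giving 512 lines, a 9-bit index, and a 6-bit offset. */
#define LINE_SIZE   64u   /* bytes per cache line */
#define NUM_LINES   512u  /* 32 KB / 64 B         */
#define OFFSET_BITS 6u    /* log2(LINE_SIZE)      */
#define INDEX_BITS  9u    /* log2(NUM_LINES)      */

int main(void) {
    uint32_t addr = 0x0040A1C4u;               /* example memory address */

    uint32_t offset = addr & (LINE_SIZE - 1u); /* byte within the line   */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1u);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* Each memory block maps to exactly one line (the index); the tag
     * stored with that line identifies which block currently occupies it. */
    printf("tag=0x%X index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```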
Set-associative mapping is regulated but flexible, while direct mapping is fixed: one block for one cache line. Direct mapping works by taking the data at a specific memory address and using a tag to determine where the desired block is positioned in the cache [13]. Set-associative mapping works by placing memory blocks into a limited number of cache lines, where, depending on the set size, one block may occupy any of one, two, or more candidate lines.

Although many computer systems differ in cache-level organization, the cache structure itself remains consistent. Just as memory cells contain data, within the cache exist cache lines, which store the data brought in from memory. Cache lines are segmented to contain the specific information needed by the processor: both the tags and the data. The tags are essential for determining the location in main memory from which the block of data was retrieved [13]. The cache operates and functions much like main memory; the difference is that main memory sits significantly lower in the memory hierarchy, while the cache is located within the CPU chip. This on-chip placement gives the cache significant speed for data retrieval at the cost of capacity, whereas main memory offers capacity at the cost of speed [13].
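The tag-matching role of cache lines described above can be sketched in code. The following is a minimal, assumed 4-way set-associative lookup; the structure names and geometry are illustrative, not taken from any design surveyed here. A block may reside in any way of its set, so each tag in the set is compared to detect a hit.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed example geometry: 32 KB, 4-way, 64-byte lines -> 128 sets. */
#define WAYS     4u
#define SETS     128u  /* 32 KB / (64 B * 4 ways) */
#define OFF_BITS 6u    /* log2(64)                */
#define SET_BITS 7u    /* log2(SETS)              */

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[64]; /* block copied from main memory */
};

static struct line cache[SETS][WAYS];

/* Returns true on a hit: tag match in any valid way of the set. */
static bool lookup(uint32_t addr) {
    uint32_t set = (addr >> OFF_BITS) & (SETS - 1u);
    uint32_t tag = addr >> (OFF_BITS + SET_BITS);
    for (uint32_t w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;  /* hit: data already on chip */
    return false;         /* miss: fetch block from the next level */
}

int main(void) {
    /* The cache starts empty (zero-initialized), so this misses. */
    printf("0x1000 -> %s\n", lookup(0x1000u) ? "hit" : "miss");
    return 0;
}
```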

In the figure, cache placement is shown to be within the CPU chip, and contained in the cache are the segmented sections, or cells, known as cache lines. The data placed into the cache lines is retrieved from main memory, as displayed in the figure below. Whenever data retrieval is in progress, each time a segment of data from memory is located in the cache, a hit occurs. Whenever a block of memory cannot be located within any of the cache levels, whether Level 1, Level 2, or Level 3, a miss occurs. With a high hit ratio, data retrieval is significantly faster, since the data already exists within the cache and can be delivered with considerably low latency. With a high miss ratio, data that cannot be located within the cache must be retrieved from main memory and placed into the cache lines before the CPU can access it; this entire process of locating and relocating data delays retrieval and decreases overall speed.

In the upcoming sections of the paper, advancements of the past decade in technological devices such as Spin Transfer Torque RAM and embedded DRAM will be discussed, along with other devices. Cache-level configurations will also be discussed, as well as optimizations that have taken place within each cache level, particularly Level 2 and, most importantly, Level 3.

II. LITERATURE REVIEW

While adding cache memory to a computer system can be an excellent way to increase data-access speed while decreasing retrieval time, certain issues have arisen, such as cache coherence [2]. The issue occurs when multiple levels of cache contain data that is altered at one level; precautions must be taken to ensure the modification propagates through all levels of cache to maintain consistency [2].

With the immense enhancements in cache-level structures, new hybrid devices are being designed that can sustain heavily memory-oriented tasks. One of these hybrids, known as ASTRO, focuses on retrieving instructions stored in the main memory of the system [3]. Not only does ASTRO retrieve instructions, but it also reduces energy dissipation compared to other devices [3]. Enhancements are being made not only in cache-level structures but also in energy preservation through minimal dissipation [4].

With issues surfacing in the use of devices such as SRAM, substitutions are being made, such as STT-RAM. Spin Transfer Torque RAM, now the device being implemented, unlike SRAM does not leak power but preserves its contents, while also being a non-volatile device. The Last Level Cache, also known as the LLC, is placed lower in the memory hierarchy while still maintaining remarkable speeds, yet problems arise when the CPU must wait for data to be retrieved from the cache lines [5]. Solutions to these processor-related issues include replacing the SRAM with STT-RAM, Spin Transfer Torque Random Access Memory [5]. STT-RAM is specifically used for data storage in the CPU's last-level cache, which can also lead to a decrease in the miss ratio, reducing the time to return data from the cache directly to the processor [5]. As specified earlier, STT-RAM has shown remarkable improvements at cache Level 3 when compared to cache Level 1 and cache Level 2.
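To make the hit/miss-ratio effect discussed above concrete, the sketch below evaluates the standard average-memory-access-time formula (hit time plus miss rate times miss penalty) for a few miss rates. The latencies are assumed round numbers for illustration, not measurements from the works surveyed here.

```c
#include <stdio.h>

/* Back-of-the-envelope illustration of why a high hit ratio speeds up
 * retrieval: average access time = hit_time + miss_rate * miss_penalty.
 * Both latencies below are assumptions, not cited measurements. */
int main(void) {
    const double hit_time     = 2.0;   /* ns: on-chip cache access   */
    const double miss_penalty = 100.0; /* ns: fetch from main memory */

    for (int i = 0; i <= 3; i++) {
        double miss_rate = 0.02 + 0.06 * i; /* 2%, 8%, 14%, 20% */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("miss rate %4.0f%% -> average access %5.1f ns\n",
               miss_rate * 100.0, amat);
    }
    return 0;
}
```

Even a modest rise in miss ratio, from 2% to 20%, multiplies the average access time several times over, which is exactly the delay the multi-level hierarchy is meant to contain.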
Not only is the overall area reduced, but this improvement leads to a lower number of misses while retaining information without any loss of data, at the cost of increased delay in data retrieval [7]. Although STT-RAM has shown significant results when placed at cache Level 3, certain factors inhibit its placement at any other cache level [7]. One of these factors is the excessive number of read and write instructions directed at the device, which can cause overheating and inaccurate placement of blocks within the pertaining cache lines. As shown in the evaluated configuration, when STT-RAM was placed in the Level 1 cache, write instructions proved considerably slower than with SRAM in the Level 1 cache, together with an increase in read and write energy consumption [7].

With new discoveries being made in cache-level organization, new technological devices are coming to light for the purpose of replacing SRAM in the cache levels [8]. These include STT-RAM, mentioned consistently throughout this paper, but also eDRAM [8]. Many cache levels are now being tested with these various device types, optimizing them for maximum efficiency. Although some prove faster for data and instruction retrieval, these devices show downsides as well. Downsides for STT-RAM include increased energy use, while eDRAM requires periodic refresh to retain the correct data blocks retrieved from memory and prevent data corruption [8].

With many options presented for replacing devices such as SRAM with STT-RAM or eDRAM, alternatives exist. Rather than complete replacement, these alternatives mix two device types and are known as hybrids [10]. The hybrids incorporate a newer classification of device known as non-volatile RAM. Not only do the hybrids offer greater storage than single-type devices, they also do not require as much energy as DRAM [10].
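As a loose illustration of how a hybrid cache might steer blocks between its two device types, the sketch below assumes a hypothetical per-block write counter: write-heavy blocks go to the SRAM ways, where writes are cheap, and read-mostly blocks go to the non-volatile ways. The threshold, names, and policy are illustrative assumptions, not the mechanism of [10] or any other cited work.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical write-intensity policy for a hybrid cache: a few SRAM
 * ways absorb write-heavy blocks, while dense, low-leakage non-volatile
 * ways (e.g., STT-RAM) hold read-mostly data. */
#define WRITE_THRESHOLD 4u /* assumed cutoff for "write-heavy" */

enum region { SRAM_WAYS, NVM_WAYS };

struct block_stats {
    uint32_t writes; /* writes observed since insertion */
};

/* Pick the region for a block on its next (re)placement. */
static enum region place(const struct block_stats *b) {
    return (b->writes >= WRITE_THRESHOLD) ? SRAM_WAYS : NVM_WAYS;
}

int main(void) {
    struct block_stats hot  = { .writes = 9 }; /* write-heavy block */
    struct block_stats cold = { .writes = 1 }; /* read-mostly block */
    printf("hot  -> %s\n", place(&hot)  == SRAM_WAYS ? "SRAM" : "NVM");
    printf("cold -> %s\n", place(&cold) == SRAM_WAYS ? "SRAM" : "NVM");
    return 0;
}
```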

As discussed throughout the paper, STT-RAM is currently being tested to replace the antiquated SRAM which, although offering fast data access, is outpaced on certain instructions by other devices that also provide a tremendous increase in storage capacity and happen to be non-volatile [11]. One issue that arises with STT-RAM is the quantity of refresh instructions executed in the cache, which can cause excessive loss of energy [11]. A solution that reduces the quantity of refresh instructions is cache-coherence-enabled adaptive refresh, which minimizes refresh instructions and thereby decreases energy loss [11].

III. DATA ANALYSIS

[Figure: read cache latency (nsec) for the evaluated devices; the x-axis lists device name and capacity per cache level.] The first figure depicts the read cache latency, in nanoseconds, for each pertaining technological device shown on the x-axis. Read latency is shown for multiple cache-level configurations, with the device name and capacity listed.

[Figure: read energy consumption (nJ) per device.] The second graph displays the read energy consumption, measured in nanojoules, for various devices such as STT-RAM, eDRAM, SRAM, and others.

IV. CONCLUSION

Throughout the paper, many technological devices were mentioned, many of which are currently undergoing examination to replace the long-lasting static RAM, or SRAM. Some of these new devices include STT-RAM, which offers increased storage capacity but also increased energy use. Other devices, such as eDRAM, or embedded dynamic RAM, are also being examined and require certain mechanisms to retain data reliably without corruption. Not only are some of these new devices replacing SRAM, they are also being tested in cache levels other than Level 3, such as Level 2 or Level 1. As more testing is done, solutions that decrease energy loss and minimize excessive write instructions will continue to surface, as current testing is already proving.

REFERENCES

[1] N. Khoshavi, X. Chen, J. Wang, and R. F. DeMara, "Bit-Upset Vulnerability Factor for eDRAM Last Level Cache Immunity Analysis," in Proceedings of the 17th International Symposium on Quality Electronic Design (ISQED 2016), Santa Clara, CA, USA, March 15-16, 2016.
[2] S. E. Crawford and R. F. DeMara, "Cache coherence in a multiport memory environment," in Proceedings of the Second International Conference on Massively Parallel Computing Systems (MPCS-95), pp. 632-642, Ischia, Italy, May 2-6, 1995.
[3] M. Lin et al., "ASTRO: Synthesizing application-specific reconfigurable hardware traces to exploit memory-level parallelism," Microprocessors and Microsystems, vol. 39, no. 7, 2015, pp. 553-564.
[4] X. Chen, N. Khoshavi, J. Zhou, D. Huang, R. F. DeMara, J. Wang, W. Wen, and Y. Chen, "AOS: Adaptive Overwrite Scheme for Energy-Efficient MLC STT-RAM Cache," in Proceedings of the 53rd Design Automation Conference (DAC), Austin, TX, USA, 2016.
[5] N. Khoshavi, X. Chen, J. Wang, and R. F. DeMara, "Read-Tuned STT-RAM and eDRAM Cache Hierarchies for Throughput and Energy Enhancement," arXiv preprint, 2016.
[6] A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. R. Das, "Cache Revive: Architecting Volatile STT-RAM Caches for Enhanced Performance in CMPs," in Proceedings of the 49th Annual Design Automation Conference (DAC), 2012, pp. 243-252.
[7] Z. Sun, X. Bi, H. H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, and W. Wu, "Multi Retention Level STT-RAM Cache Designs with a Dynamic Refresh Scheme," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011, pp. 329-338.
[8] M.-T. Chang, P. Rosenfeld, S.-L. Lu, and B. Jacob, "Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM," in Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA), 2013, pp. 143-154.
[9] M. R. Jokar, M. Arjomand, and H. Sarbazi-Azad, "Sequoia: A High-Endurance NVM-Based Cache Architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016.
[10] Y. Joo and S. Park, "A Hybrid PRAM and DRAM Cache Architecture for Extending the Lifetime of PRAM Caches," IEEE Computer Architecture Letters, vol. 12, no. 2, 2013, pp. 55-58.
[11] J. Li et al., "Low-Energy Volatile STT-RAM Cache Design Using Cache-Coherence-Enabled Adaptive Refresh," ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 19, no. 1, 2013, article 5.
[12] Y. Zhang et al., "Read Performance: The Newest Barrier in Scaled STT-RAM," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 6, 2015, pp. 1170-1174.
[13] R. F. DeMara, Module 11, Memory Hierarchy, 2016.

TABLE I. PROCESSOR PARAMETERS FOR THE TECHNIQUES BELOW

For each technique: number of cores and frequency, then, for Level 1 (L1, instruction (I) or data (D)), Level 2 (L2), and Level 3 (L3) or Last Level Cache (LLC): capacity and organization, number of cache lines (# of CL), and protocol.

Khoshavi [1]: 8 cores, 3 GHz. L1: 32KB 8-way SRAM, 512 CL, MESI. L2: 512KB 8-way SRAM, 8192 CL, MESI. L3/LLC: 96MB 16-way eDRAM, ~1.5M CL, WB.
Sun [7]: 4 cores, 2 GHz. L1: 32KB 4-way SRAM, 512 CL, N/A. L2: 256KB 8-way SRAM, 4096 CL, N/A. L3/LLC: 4MB 16-way STT-RAM, 65536 CL, N/A.
Jokar [9]: 4 cores, 3 GHz. L1 (D): 32KB 8-way, 512 CL, MOESI. L2: 2MB 8-way STT-RAM, 32768 CL, MOESI. L3/LLC: 8MB 8-way ReRAM, 131072 CL, MOESI.
Zhang [12]: 16 cores, 3.5 GHz. L1: 32KB 4-way SRAM, 512 CL, MESI. L2: 256KB 8-way SRAM, 4096 CL, N/A. L3/LLC: 16MB 16-way SRAM, 262144 CL, N/A.
Chang [8]: 8 cores, 2 GHz. L1: 32KB 8-way (type N/A), 512 CL, MESI. L2: 256KB 8-way (type N/A), 4096 CL, MESI. L3/LLC: 32MB 16-way (type N/A), 524288 CL, WB.
Chen [4]: 4 cores, 3.3 GHz. L1: 32KB 8-way SRAM, 512 CL, WB. L2: 4MB 8-way STT-RAM, 65536 CL, WB. L3/LLC: N/A.
Khoshavi [5]: N/A cores, 3 GHz. L1: 32KB 8-way SRAM, 512 CL, WB. L2: N/A 8-way, N/A CL, WB. L3/LLC: 96MB 16-way eDRAM, ~1.5M CL, WB.
Jog [6]: N/A cores, 2 GHz. L1: 32KB 4-way SRAM, 512 CL, WB. L2: 1MB 16-way SRAM, 16384 CL, N/A. L3/LLC: N/A.
Li [11]: 16 cores, 2 GHz. L1: 32KB 2-way STT-RAM, 512 CL, WB. L2: N/A. L3/LLC: 8MB 16-way STT-RAM, 131072 CL, WB.
Joo [10]: 1 core, 2 GHz. L1: 32KB SRAM (associativity N/A), 512 CL, WB. L2: 8MB 16-way Hybrid, 131072 CL, WB. L3/LLC: N/A.

CL = cache line. The # of CL values are computed from the capacity, assuming the cache line size is always 64 bytes. Protocol = {Write Back (WB), Write Through (WT), MESI, MOESI, Not Available (N/A)}.
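The table's note asks the reader to compute the # of CL columns by hand; the sketch below does that arithmetic under the table's own 64-byte-line assumption (the helper name is ours). Note that 96MB / 64B works out to 1,572,864 lines, i.e., roughly 1.5M.

```c
#include <stdio.h>

/* Reproduces the "# of CL" columns of Table I using the table's own
 * rule: number of cache lines = capacity / 64-byte cache line. */
static unsigned long cache_lines(unsigned long capacity_bytes) {
    return capacity_bytes / 64ul;
}

int main(void) {
    printf("32KB  L1  -> %lu lines\n", cache_lines(32ul << 10));  /* 512     */
    printf("512KB L2  -> %lu lines\n", cache_lines(512ul << 10)); /* 8192    */
    printf("8MB   LLC -> %lu lines\n", cache_lines(8ul << 20));   /* 131072  */
    printf("96MB  LLC -> %lu lines\n", cache_lines(96ul << 20));  /* 1572864, i.e. ~1.5M */
    return 0;
}
```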