Virtualized ECC: Flexible Reliability in Memory Systems

1 Virtualized ECC: Flexible Reliability in Memory Systems
Doe Hyun Yoon (Advisor: Mattan Erez)
Electrical and Computer Engineering, The University of Texas at Austin

2 Motivation
Reliability concerns are growing
- Memory failure rates are increasing: shrinking semiconductor design rules and lower Vdd, more and more memory cells, higher capacity and a larger number of nodes per system
- Industry consistently requires high reliability levels: chipkill and even stronger protection
Traditional solutions apply ECC uniformly across memory locations
- Not all applications/data need the same error tolerance level
- Cost increases with the tolerance level
- Hard to meet the required level of reliability at low cost

3 Virtualized ECC
[Figure: uniform ECC attaches fixed redundancy to every data block; Virtualized ECC keeps the redundant information in the memory hierarchy and can assign stronger ECC per data block.]
- Store redundant information within the memory hierarchy
- Dynamic mapping between data and ECC
- Flexible protection level and access granularity
- Two-tiered protection
- Finer access granularity

4 Outline
- Motivation
- Traditional Memory Protection
- Virtualizing Redundant Information
- Memory Mapped ECC (MME)
- Virtualized and Flexible ECC (V-ECC)
- Adaptive Granularity Memory System (AGMS)
- Conclusions and Future Work

5 Traditional Memory Protection
Uniform ECC [Figure: L2 cache, memory bus, and main memory each carry dedicated, aligned ECC resources.]
- Simple and transparent to the programmer
- Dedicated and aligned HW resources waste storage and bandwidth
- Better reliability comes at higher cost
- Design-time decision fixes the error tolerance level and the access granularity
Users always pay for the cost of memory protection
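For concreteness, the fixed cost of uniform ECC can be worked out directly. The short C snippet below (not from the talk) computes the storage/bandwidth overhead of a standard ECC DIMM, which dedicates 8 check bits to every 64 data bits; that 12.5% is paid on every access whether or not the data needs that protection level.

```c
/* Back-of-the-envelope cost of uniform ECC on a standard ECC DIMM:
 * 8 check bits accompany every 64 data bits on every transfer. */
#include <stdio.h>

int main(void) {
    const double data_bits = 64.0, check_bits = 8.0;
    printf("storage/bandwidth overhead: %.1f%%\n",
           100.0 * check_bits / data_bits);   /* prints 12.5% */
    return 0;
}
```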

6 Virtualizing Redundant Information
- Store all or part of the redundant information within the memory hierarchy, minimizing resources dedicated to redundancy
- Two-tiered protection: a Tier-1 error code (low-cost, low-complexity, covers the common case) and a Tier-2 error code (strong error correction, but rarely accessed)
- Decouple data and ECC storage for flexibility in memory protection: adaptive error tolerance levels and access granularities
Users pay for the error tolerance they need, rather than overpay for what they might need
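To make the two-tier idea concrete, here is a minimal C sketch of the read path, assuming T1EC is a cheap detect-only code checked on every access and T2EC is a strong code kept in (cacheable) memory and touched only on a detected error. The parity check and the `t2ec_store` array are stand-ins chosen so the example runs; they are not the actual codes or structures from the papers.

```c
/* Minimal sketch of a two-tiered read path. Parity and the flat
 * t2ec_store[] are illustrative stand-ins only. */
#include <stdint.h>
#include <stdio.h>

static uint8_t parity64(const uint8_t d[64]) {   /* stand-in T1EC: 1-bit parity */
    uint8_t p = 0;
    for (int i = 0; i < 64; i++) p ^= d[i];
    return p;
}

static uint64_t t2ec_store[1024];                /* hypothetical memory-mapped T2EC region */

static int read_line(uint64_t line_idx, uint8_t data[64], uint8_t stored_t1ec) {
    if (parity64(data) == stored_t1ec)
        return 0;                                /* common case: T2EC never touched */
    uint64_t t2ec = t2ec_store[line_idx];        /* rare case: fetch the Tier-2 code */
    (void)t2ec;                                  /* ...strong correction would run here */
    return 1;
}

int main(void) {
    uint8_t line[64] = {0};
    uint8_t t1ec = parity64(line);
    line[3] ^= 1;                                /* inject a single-bit error */
    printf("tier-2 path taken: %d\n", read_line(0, line, t1ec));
    return 0;
}
```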

7 Examples of Virtualizing ECC
- Last-level cache protection with memory-mapped ECC: Memory Mapped ECC (MME) [ISCA 09], ECC FIFO [SC 09]
- Chipkill-level flexible main memory protection: Virtualized and Flexible ECC (V-ECC) [ASPLOS 10]
- Adaptive memory access granularity with ECC: Adaptive Granularity Memory System (AGMS) [ongoing]

8 Memory Mapped ECC [ISCA 09]

9 Memory Mapped ECC [ISCA 09]
[Figure: an 8-way LLC stores a tag, 64B of data, and a small T1EC per line on chip; the T2EC lives in DRAM.]
T2EC is memory mapped and cacheable

10 Memory Mapped ECC (Cont'd)
Last-level cache protection mechanism with two-tiered protection
- Tier-1 error code (T1EC): low-overhead on-chip error code
- Tier-2 error code (T2EC): strong, memory-mapped error-correcting code
- The LLC is dynamically and transparently partitioned into data and T2EC
- Fixed, one-to-one mapping between physical cache lines and memory-mapped T2EC
Area saving: 15%; power saving: 9%
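A minimal sketch of the "fixed, one-to-one mapping" idea, assuming a reserved DRAM region and an 8B T2EC entry per cache line (both values are illustrative, not the ISCA'09 hardware): because the T2EC location is pure arithmetic on (set, way), no lookup table is needed.

```c
/* Sketch of the fixed cache-line -> memory-mapped T2EC mapping.
 * T2EC_BASE, entry size, and cache geometry are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define T2EC_BASE   0x80000000ULL   /* hypothetical reserved DRAM region */
#define T2EC_BYTES  8u              /* T2EC entry per 64B cache line     */
#define NUM_WAYS    8u

static uint64_t t2ec_addr(uint32_t set, uint32_t way) {
    uint64_t line_index = (uint64_t)set * NUM_WAYS + way;
    return T2EC_BASE + line_index * T2EC_BYTES;   /* one-to-one, no table lookup */
}

int main(void) {
    printf("T2EC for (set 5, way 2) lives at 0x%llx\n",
           (unsigned long long)t2ec_addr(5, 2));
    return 0;
}
```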

11 Virtualized and Flexible ECC [ASPLOS 10]

12 Virtualized and Flexible ECC [ASPLOS 10]
Main memory protection mechanism
- Virtualize redundant information within the memory hierarchy
- Augment Virtual Memory (VM) with a dynamic mapping between data and ECC
Flexible memory protection
- A single hardware design can provide different tolerance levels
- Allows adaptive tuning of reliability levels
- Enables protection even for non-ECC DIMMs

13 Virtualized and Flexible ECC (Cont'd)
[Figure: virtual pages i, j, and k map to physical frames as usual; each physical frame is then mapped to an ECC page sized by its protection level, e.g., no T2EC page for low protection, a T2EC page for chipkill, and a larger T2EC page for double chipkill. Data and T1EC stay together; T2EC lives in separate ECC pages.]

14 Virtualized ECC Operation
- Read: fetch data and T1EC; no need to fetch the T2EC in most cases
- Write: update data, T1EC, and T2EC
- T2ECs of consecutive data lines map to one T2EC line; a T2EC line is written back when evicted
- ECC Address Translation Unit: fast PA-to-EA (ECC address) translation
[Figure: example reads and writes across DRAM Rank 0 and Rank 1; each rank holds data and T1EC, while the T2EC for one rank's data is stored in the other rank.]
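The address arithmetic behind the ECC Address Translation Unit can be sketched as follows, assuming 4KB pages, 64B lines, and the 4B chipkill-correct T2EC from the table on the next slides. The `ecc_page_base` lookup stands in for the per-page data-to-ECC mapping that V-ECC keeps alongside virtual memory; all constants are illustrative. With a 4B T2EC per 64B line, 16 consecutive data lines share one 64B T2EC line, which is what amortizes T2EC write-backs.

```c
/* Sketch of PA -> ECC-address (EA) translation; constants are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES  64u
#define T2EC_BYTES   4u     /* chipkill-correct entry per 64B line */

static uint64_t ecc_page_base(uint64_t pa) {   /* stand-in for the per-page data->ECC map */
    (void)pa;
    return 0x40000000ULL;                      /* hypothetical ECC page address */
}

static uint64_t pa_to_ea(uint64_t pa) {
    uint64_t line_in_page = (pa & 0xFFFull) / LINE_BYTES;   /* 4KB pages assumed */
    return ecc_page_base(pa) + line_in_page * T2EC_BYTES;
}

int main(void) {
    printf("data lines sharing one 64B T2EC line: %u\n", LINE_BYTES / T2EC_BYTES);
    printf("EA for PA 0x200: 0x%llx\n", (unsigned long long)pa_to_ea(0x200));
    return 0;
}
```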

15 Performance Impact of V-ECC
- Increased data miss rate: T2EC lines in the LLC reduce the effective LLC size
- Increased traffic due to T2EC write-backs: one-way write-back traffic, not on the critical path

16 Chipkill Correct
Single Device-error Correct and Double Device-error Detect
- Can tolerate one DRAM device failure and detect a second DRAM failure
Traditional chipkill requires x4 DRAMs; V-ECC enables x8
- Two-tiered error code: a simpler T1EC relaxes module design constraints and enables more energy-efficient x8 configurations
- The T2EC used for error correction is virtualized

17 Flexible Error Protection
A single HW design with V-ECC can provide Chipkill-Detect, Chipkill-Correct, and Double Chipkill-Correct by using a different T2EC for different pages

  Protection level           T2EC per 64B
  Chipkill-Detect            0B
  Chipkill-Correct           4B
  Double Chipkill-Correct    8B

- Maximize performance/power efficiency with Chipkill-Detect
- Stronger protection at the cost of additional T2EC accesses
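The protection-level table above translates directly into a tiny lookup. The sketch below (illustrative names and types, not the actual V-ECC tables) shows how a per-page level would select how much T2EC storage and traffic each 64B line costs.

```c
/* Per-page protection level -> T2EC bytes per 64B line, from the slide's table. */
#include <stdio.h>

enum prot_level { CHIPKILL_DETECT, CHIPKILL_CORRECT, DOUBLE_CHIPKILL_CORRECT };

static unsigned t2ec_bytes_per_line(enum prot_level lvl) {
    static const unsigned table[] = { 0, 4, 8 };   /* 0B, 4B, 8B per 64B */
    return table[lvl];
}

int main(void) {
    for (int l = CHIPKILL_DETECT; l <= DOUBLE_CHIPKILL_CORRECT; l++)
        printf("level %d: %u B T2EC per 64B line (%.1f%% extra storage)\n",
               l, t2ec_bytes_per_line(l), 100.0 * t2ec_bytes_per_line(l) / 64);
    return 0;
}
```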

18 V-ECC x8 Results (normalized to baseline x4 chipkill)
[Figure: normalized execution time and normalized EDP for Chipkill-Detect, Chipkill-Correct, and Double Chipkill-Correct across SPEC 2006 (bzip2, hmmer, mcf, libquantum, omnetpp, milc, lbm, sphinx3), PARSEC (canneal, dedup, fluidanimate, freqmine), STREAM, and GUPS, plus the average.]

19 V-ECC with Non-ECC DIMMs (normalized to baseline x4 chipkill)
[Figure: normalized execution time and normalized EDP for No protection, Chipkill-Detect, Chipkill-Correct, and Double Chipkill-Correct across SPEC 2006, PARSEC, STREAM, and GUPS, plus the average.]

20 Adaptive Granularity Memory System

21 Adaptive Granularity Memory System
Bandwidth wall
- More and more cores/threads per chip, but off-chip bandwidth scaling is limited by pins and power
Fixed, coarse access granularity
- 64B or 128B data blocks with 12.5% redundancy overhead
- Works well for most applications with spatial locality
- Inefficient for applications with low spatial locality: off-chip bandwidth is wasted on unneeded data within a block
- Off-chip bandwidth could be used better with fine-grained blocks, but the cache hierarchy, DRAM, and uniform ECC prevent fine-grained accesses

22 Coarse- and Fine-Grained Memory Access
Coarse-grained access
- Lower ECC/control overhead; generally works well
- But poor throughput when there is little spatial locality
Fine-grained access
- Higher ECC/control overhead
- Better throughput for applications with poor spatial locality
[Figure: one wide data block with a single ECC word versus many small data blocks, each carrying its own ECC.]
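The trade-off can be made concrete with rough bandwidth accounting. The snippet below assumes, purely for illustration, a 64B block with 8B of ECC in the coarse case and an 8B sub-block paying its own 8B of ECC/control in the fine-grained case, and compares the useful fraction when the application only touches 8B; the numbers are not measured results from the talk.

```c
/* Rough bandwidth accounting for coarse vs. fine-grained access when
 * only 8B of a block is useful. All transfer sizes are illustrative. */
#include <stdio.h>

int main(void) {
    const double useful          = 8.0;          /* bytes the app actually touches */
    const double coarse_transfer = 64.0 + 8.0;   /* one 64B block plus its ECC      */
    const double fine_transfer   = 8.0 + 8.0;    /* one 8B sub-block plus its ECC   */

    printf("coarse-grained useful fraction: %.1f%%\n", 100.0 * useful / coarse_transfer);
    printf("fine-grained   useful fraction: %.1f%%\n", 100.0 * useful / fine_transfer);
    return 0;
}
```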

23 Design Examples
- Wide channel for coarse-grained access
- Narrow multi-channel for fine-grained access
- AGMS with virtualized ECC supports both fine-grained and coarse-grained accesses, with the ECC kept in ECC pages

24 AGMS Design
- Augment Virtual Memory (VM) to identify fine-grained pages: 1 bit per PTE
- Sectored caches to manage 8B data blocks
- Memory controller can handle multiple access granularities
- Sub-ranked DRAM to enable an 8B access granularity; requires higher address/command bandwidth
[Figure: cores with L1/L2 caches share a sectored last-level cache and a memory controller driving a sub-ranked DDR DRAM module (sub-ranks SR0-SR8 of x8 devices behind register/demux chips on the address bus).]
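A small sketch of how the one extra PTE bit might steer access granularity (field names and layout are hypothetical, not the actual AGMS page-table format): the bit marks a page as fine-grained, and the memory controller issues 8B sub-ranked requests for it instead of 64B bursts.

```c
/* Sketch of a granularity bit in the PTE steering request size.
 * The PTE layout and request struct are illustrative only. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t pfn        : 52;   /* physical frame number */
    uint64_t flags      : 11;   /* usual permission bits */
    uint64_t fine_grain : 1;    /* the 1 bit per PTE added for fine-grained pages */
} pte_t;

typedef struct { uint64_t addr; uint32_t bytes; } mem_req_t;

static mem_req_t make_request(pte_t pte, uint64_t vaddr) {
    mem_req_t r;
    r.addr  = ((uint64_t)pte.pfn << 12) | (vaddr & 0xFFF);
    r.bytes = pte.fine_grain ? 8u : 64u;   /* 8B sub-ranked access vs. 64B burst */
    return r;
}

int main(void) {
    pte_t p = { .pfn = 0x12345, .flags = 0, .fine_grain = 1 };
    mem_req_t r = make_request(p, 0x7f0);
    printf("request: addr=0x%llx size=%uB\n", (unsigned long long)r.addr, r.bytes);
    return 0;
}
```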

25 Preliminary Results
[Figure: 4-core and 8-core system throughput for coarse-grained (CG), adaptive granularity (AG), and adaptive granularity with ECC (AG+ECC) configurations across mcf, omnetpp, libquantum, bzip2, mst, em3d, SSCA2, linked list, and GUPS workloads and several mixes.]

26 Conclusions and Future Work
Virtualized ECC
- Store all or part of the redundant information within the memory hierarchy
- Two-tiered protection minimizes resources dedicated to redundancy
- Flexibility in error tolerance level and access granularity
Future work
- GPU memory protection
- Non-volatile memory protection
- Adaptive/tunable reliability
- Generic meta-data storage
