Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
|
|
- Laura Bruce
- 6 years ago
- Views:
Transcription
1 Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1
2 Please find the power point presentation in: 2
3 3
4 4
5 Communication vs. Computation ~200X Keckler Micro 2011 Improving cache utilization is critical for energy-efficiency!
6 Compressed Cache: Compress and Compact Blocks + Higher effective cache size + Small area overhead + Higher system performance + Lower system energy Previous work limit compression effectiveness: - Limited number of tags - High internal fragmentation - Energy expensive re-compaction
7 Decoupled Compressed Cache (DCC) Saving system energy by improving LLC utilization through cache compression. Non-Contiguous Decoupled Super-Blocks Sub-Blocks Previous work limit compression effectiveness: - Limited number of tags - High Internal Fragmentation - Energy expensive re-compaction 7
8 Decoupled Compressed Cache (DCC) Saving system energy by improving LLC utilization through cache compression. Outperform 2X LLC 1.08X LLC area 14% higher performance 12% lower energy 8
9 Outline Motivation Compressed caching Our Proposals: Decoupled compressed cache Experimental Results Conclusions 9
10 Uncompressed Caching A fixed one-to-one tag/data mapping Tags Data 10
11 Compressed Caching Compact Add more compressed Compress tags to increase cache blocks, effective blocks. to make capacity. room. Tags Data 11
12 Compression (1) Compression: how to compress blocks? There are different compression algorithms. Not the focus of this work. But, which algorithm matters! 64 bytes 20 bytes Compressor 12
13 Compression Potentials We use C-PACK+Z for the rest of the talk! 1.5 Cycles to Decompress Compression Algorithm Compression Ratio = Original Size / Compressed Size High compression ratio potentially large normalized effective cache capacity. 13
14 Compaction (2) Compaction: how to store and find blocks? Critical to achieve the compression potentials. Fixed Sized Compressed Cache (FixedC) [Kim 02, WMPI, Yang Micro 02] This work focuses on compaction. Internal Fragmentation! Tags Data 14
15 Compaction (2) Compaction: how to store and find blocks? Variable Sized Compressed Cache (VSC) [Alameldeen, ISCA 2002] Tags Data Sub-block 15
16 Previous Compressed Caches Potential: B 10B (Limit 1) Limited Tag/Metadata Normalized Effective Capacity = LLC Number High Area of Overhead Valid Blocks by Adding / MAX 4X/more Number Tags of (Uncompressed) Blocks (Limit 2) Internal Fragmentation Low Cache Capacity Utilization 16
17 (Limit 3) Energy-Expensive Re-Compaction VSC requires energy-expensive re-compaction. B Update needs B 2 sub-blocks Tags Data 3X higher LLC dynamic energy! 17
18 Outline Motivation Compressed caching Our Proposals: Decoupled compressed cache Experimental Results Conclusions 18
19 Decoupled Compressed Cache (1) Exploiting Spatial Locality Low Area Overhead (2) Decoupling tag/data mapping Eliminate energy expensive re-compaction Reduce internal fragmentation (3) Co-DCC: Dynamically co-compacting super-blocks Further reduce internal fragmentation 19
20 (1) Exploiting Spatial Locality Neighboring blocks co-reside in LLC. 89% 20
21 (1) Exploiting Spatial Locality DCC tracks LLC blocks at Super-Block granularity. Super Tags Tags Data 2X Tags 4X Tags Super-Block Tag Q state A state B state C state D Quad (Q): A, B, C, D Singleton (S): E Up to 4X blocks with low area overheads! 21
22 (2) Decoupling tag/data mapping DCC decouples mapping to eliminate re-compaction. Super Tags Flexible Update Allocation B Quad (Q): A, B, C, D Singleton (S): E 22
23 (2) Decoupling tag/data mapping Back pointers identify the owner block of each sub-block. Super Tags Back Pointers Data Tag ID Blk ID Quad (Q): A, B, C, D Singleton (S): E 23
24 (3) Co-compacting super-blocks Co-DCC dynamically co-compacts super-blocks. Reducing internal fragmentation A sub-block Quad (Q): A, B, C, D 24
25 Outline Motivation Compressed caching Our Proposals: Decoupled compressed cache Experimental Results Conclusions 25
26 Experimental Methodology Integrated DCC with AMD Bulldozer Cache. We model the timing and allocation constraints of sequential regions at LLC in detail. No need for an alignment network. Verilog implementation and synthesis of the tag match and sub-block selection logic. One additional cycle of latency due to sub-block selection. 26
27 Experimental Methodology Full-system simulation with a simulator based on GEMS. Wide range of applications with different level of cache sensitivities: Commercial workloads: apache, jbb, oltp, zeus Spec-OMP: ammp, applu, equake, mgrid, wupwise Parsec: blackscholes, canneal, freqmine Spec 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astarbwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactusbzip, omnetpp-lbm Cores Eight OOO cores, 3.2 GHz L1I$/L1D$ Private, 32-KB, 8-way L2$ Private, 256-KB, 8-way L3$ Shared, 8-MB, 16-way, 8 banks Main Memory 4GB, 16 Banks, 800 MHz bus frequency DDR3 27
28 Effective LLC Capacity Normalized LLC Area 2 1 FixedC Baseline 2X Baseline VSC DCC Co-DCC Normalized Effective LLC Capacity Components FixedC/VSC-2X DCC Co-DCC Tag Array 6.3% 2.1% 11.3% Back Pointer Array 0 4.4% 5.4% (De-)Compressors 1.8% 1.8% 1.8% Total Area Overhead 8.1% 8.3% 18.5% 28
29 (Co-)DCC Performance (Co-)DCC boost system performance significantly. 29
30 (Co-)DCC Energy Consumption (Co-)DCC reduce system energy by reducing number of accesses to the main memory. 30
31 Summary Analyze the limits of compressed caching Limited number of tags Internal fragmentation Energy-expensive re-compaction Decoupled Compressed Cache Improving performance and energy of compressed caching Decoupled super-blocks Non-contiguous sub-blocks Co-DCC further reduces internal fragmentation Practical designs [details in the paper] 31
32 Backup (De-)Compression overhead DCC data array organization with AMD Bulldozer DCC Timing DCC Lookup Applications Co-DCC design LLC effective capacity LLC miss rate Memory dynamic energy LLC dynamic energy 32
33 (De-)Compression Overhead Parameters Compressor Decompressor Pipeline Depth 6 2 Latency (cycles) 16 9 Area (mm 2 ) Power Consumption (mw)
34 DCC Data Array Organization AMD Bulldozer SR3 A0.3 C3.0 B Phase Flop SR 2 A0.2 B1.1 A Phase Flop SR 1 A0.1 B1.0 B Phase Flop SR0 A0.0 C3.1 A Phase Flop A0: uncompressed; B1 and C2 are compressed to 2 sub-blocks N Set Addr Read Data 4 SR0 Addr 4 SR1 Addr 4 SR2 Addr 4 SR3 Addr 34
35 DCC Timing 35
36 DCC Lookup Super Tags 1. Access Super Tags and Back Pointers in parallel 2. Find the matched Back Pointers 3. Read corresponding sub-blocks and decompress 1 Q 0 S Back Pointers Read C Data Quad (Q): A, B, C, D Singleton (S): E 36
37 Applications Sensitive to Cache Latency Sensitive to Cache Capacity and Latency Cache Insensitive Sensitive to Cache Capacity Spec2006 (m1-m8) bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm 37
38 Co-DCC Design Sub-block 7 A2.1 A: <A2,A1,A0> Sub-block 6 Sub-block 5 Sub-blocks 4-2 A2.0 A1 A0.1 A1-Begin Sub-block 1 Sub-block 0 A0.0 A2.2 A0-Begin A-END Super- Block Tag Cstate3 Comp3 Begin3 Cstate2 Comp2 3b 1b 7b 3b1b 7b Begin2 Cstate1 Comp1 Begin1 Cstate0 Comp0 Begin0 3b 1b 7b 3b1b 7b END 7b Tag ID 1b Sharers 4b 38
39 LLC Effective Cache Capacity 4.0 Norm LLC Effective Capacity
40 LLC Miss Rate 1.00 Norm LLC Miss Rate
41 Memory Dynamic Energy 1.00 Norm Memory Dynamic Energy
42 LLC Dynamic Energy 6 Norm LLC Dynamic Energy
Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti Computer Sciences Department University of Wisconsin-Madison somayeh@cs.wisc.edu David
More informationYet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache
Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache SOMAYEH SARDASHTI, University of Wisconsin-Madison ANDRE SEZNEC, IRISA/INRIA DAVID A WOOD, University of Wisconsin-Madison Cache
More informationSkewed Compressed Cache
Skewed Compressed Cache Somayeh Sardashti, André Seznec, David A. Wood To cite this version: Somayeh Sardashti, André Seznec, David A. Wood. Skewed Compressed Cache. MICRO - 47th Annual IEEE/ACM International
More informationYet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache
Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache Somayeh Sardashti, André Seznec, David A. Wood To cite this version: Somayeh Sardashti, André Seznec, David A. Wood. Yet Another
More informationMemory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez
Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection
More informationChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality
ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationFlexible Cache Error Protection using an ECC FIFO
Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead
More informationEnergy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012
Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso
More informationThe Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!!
The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!! 1 2 3 Modern CMPs" Intel e5 2600 (2013)! SLLC" AMD Orochi (2012)! SLLC"
More informationPractical Data Compression for Modern Memory Hierarchies
Practical Data Compression for Modern Memory Hierarchies Thesis Oral Gennady Pekhimenko Committee: Todd Mowry (Co-chair) Onur Mutlu (Co-chair) Kayvon Fatahalian David Wood, University of Wisconsin-Madison
More informationA Comparison of Capacity Management Schemes for Shared CMP Caches
A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip
More informationVirtualized ECC: Flexible Reliability in Memory Systems
Virtualized ECC: Flexible Reliability in Memory Systems Doe Hyun Yoon Advisor: Mattan Erez Electrical and Computer Engineering The University of Texas at Austin Motivation Reliability concerns are growing
More informationEFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES
EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University
More informationDecoupled Zero-Compressed Memory
Decoupled Zero-Compressed Julien Dusser julien.dusser@inria.fr André Seznec andre.seznec@inria.fr Centre de recherche INRIA Rennes Bretagne Atlantique Campus de Beaulieu, 3542 Rennes Cedex, France Abstract
More informationVirtualized and Flexible ECC for Main Memory
Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC
More informationImproving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.
Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses
More informationImproving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.
Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses
More informationData Criticality in Network-On-Chip Design. Joshua San Miguel Natalie Enright Jerger
Data Criticality in Network-On-Chip Design Joshua San Miguel Natalie Enright Jerger Network-On-Chip Efficiency Efficiency is the ability to produce results with the least amount of waste. Wasted time Wasted
More informationExploi'ng Compressed Block Size as an Indicator of Future Reuse
Exploi'ng Compressed Block Size as an Indicator of Future Reuse Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons, Michael A. Kozuch Execu've Summary In a compressed
More informationImpact of Cache Coherence Protocols on the Processing of Network Traffic
Impact of Cache Coherence Protocols on the Processing of Network Traffic Amit Kumar and Ram Huggahalli Communication Technology Lab Corporate Technology Group Intel Corporation 12/3/2007 Outline Background
More informationLightweight Memory Tracing
Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for
More informationMicro-sector Cache: Improving Space Utilization in Sectored DRAM Caches
Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Mainak Chaudhuri Mukesh Agrawal Jayesh Gaur Sreenivas Subramoney Indian Institute of Technology, Kanpur 286, INDIA Intel Architecture
More informationRethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization
Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,
More informationEnergy-centric DVFS Controlling Method for Multi-core Platforms
Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To
More informationCOP: To Compress and Protect Main Memory
COP: To Compress and Protect Main Memory David J. Palframan Nam Sung Kim Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin Madison palframan@wisc.edu, nskim3@wisc.edu,
More informationThesis Contributions (Cont.) Question: Does compression help CMP performance? Contribution #3: Evaluate CMP cache and link compression
Using to Improve Chip Multiprocessor Alaa R. Alameldeen Dissertation Defense Wisconsin Multifacet Project University of Wisconsin-Madison http://www.cs.wisc.edu/multifacet Thesis Contributions (Cont.)
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationNear-Threshold Computing: How Close Should We Get?
Near-Threshold Computing: How Close Should We Get? Alaa R. Alameldeen Intel Labs Workshop on Near-Threshold Computing June 14, 2014 Overview High-level talk summarizing my architectural perspective on
More informationChip-Multithreading Systems Need A New Operating Systems Scheduler
Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems
More information15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical
More informationSpatial Memory Streaming (with rotated patterns)
Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationOptimizing Replication, Communication, and Capacity Allocation in CMPs
Optimizing Replication, Communication, and Capacity Allocation in CMPs Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar School of ECE Purdue University Motivation CMP becoming increasingly important
More informationLeveraging ECC to Mitigate Read Disturbance, False Reads Mitigating Bitline Crosstalk Noise in DRAM Memories and Write Faults in STT-RAM
1 MEMSYS 2017 DSN 2016 Leveraging ECC to Mitigate ead Disturbance, False eads Mitigating Bitline Crosstalk Noise in DAM Memories and Write Faults in STT-AM Mohammad Seyedzadeh, akan. Maddah, Alex. Jones,
More informationSpeeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns
March 12, 2018 Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns Wen Wen Lei Zhao, Youtao Zhang, Jun Yang Executive Summary Problems: performance and reliability of write operations
More informationTowards Energy-Proportional Datacenter Memory with Mobile DRAM
Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University
More informationResource-Conscious Scheduling for Energy Efficiency on Multicore Processors
Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe
More informationRegister Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure
Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York
More informationHybrid Cache Architecture (HCA) with Disparate Memory Technologies
Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement:
More informationAS the processor-memory speed gap continues to widen,
IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 7, JULY 2004 843 Design and Optimization of Large Size and Low Overhead Off-Chip Caches Zhao Zhang, Member, IEEE, Zhichun Zhu, Member, IEEE, and Xiaodong Zhang,
More informationarxiv: v1 [cs.ar] 10 Apr 2017
Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, Onur Mutlu, Srinivas Devadas CSAIL, MIT, Intel Labs, ETH Zurich {yxy, devadas}@mit.edu,
More informationAB-Aware: Application Behavior Aware Management of Shared Last Level Caches
AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering
More informationDFPC: A Dynamic Frequent Pattern Compression Scheme in NVM-based Main Memory
DFPC: A Dynamic Frequent Pattern Compression Scheme in NVM-based Main Memory Yuncheng Guo, Yu Hua, Pengfei Zuo Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology Huazhong
More informationSWAP: EFFECTIVE FINE-GRAIN MANAGEMENT
: EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1
More informationJosé F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2
CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko Advisors: Todd C. Mowry and Onur Mutlu Computer Science Department, Carnegie Mellon
More informationReactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches
Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches Nikos Hardavellas Michael Ferdman, Babak Falsafi, Anastasia Ailamaki Carnegie Mellon and EPFL Data Placement in Distributed
More informationEnergy Models for DVFS Processors
Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July
More informationA Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach
A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Onur Mutlu Chita R. Das Executive summary Problem: Current day NoC designs are agnostic to application requirements
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationScalable Dynamic Task Scheduling on Adaptive Many-Cores
Introduction: Many- Paradigm [Our Definition] Scalable Dynamic Task Scheduling on Adaptive Many-s Vanchinathan Venkataramani, Anuj Pathania, Muhammad Shafique, Tulika Mitra, Jörg Henkel Bus CES Chair for
More informationPerformance analysis of Intel Core 2 Duo processor
Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 27 Performance analysis of Intel Core 2 Duo processor Tribuvan Kumar Prakash Louisiana State University and Agricultural
More informationMemory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB
Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar
More informationFootprint-based Locality Analysis
Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.
More informationExploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer
More informationOpenPrefetch. (in-progress)
OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),
More informationTHE DYNAMIC GRANULARITY MEMORY SYSTEM
THE DYNAMIC GRANULARITY MEMORY SYSTEM Doe Hyun Yoon IIL, HP Labs Michael Sullivan Min Kyu Jeong Mattan Erez ECE, UT Austin MEMORY ACCESS GRANULARITY The size of block for accessing main memory Often, equal
More informationEnergy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes
Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes Alaa R. Alameldeen Ilya Wagner Zeshan Chishti Wei Wu Chris Wilkerson Shih-Lien Lu Intel Labs Overview Large caches and memories
More informationAddressing End-to-End Memory Access Latency in NoC-Based Multicores
Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu
More informationLow-Complexity Reorder Buffer Architecture*
Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower
More informationHeap Compression for Memory-Constrained Java
Heap Compression for Memory-Constrained Java CSE Department, PSU G. Chen M. Kandemir N. Vijaykrishnan M. J. Irwin Sun Microsystems B. Mathiske M. Wolczko OOPSLA 03 October 26-30 2003 Overview PROBLEM:
More informationEvaluation of a High Performance Code Compression Method
Evaluation of a High Performance Code Compression Method Charles Lefurgy, Eva Piccininni, and Trevor Mudge Advanced Computer Architecture Laboratory Electrical Engineering and Computer Science Dept. The
More informationAn Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks
An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing
More informationUsing Transparent Compression to Improve SSD-based I/O Caches
Using Transparent Compression to Improve SSD-based I/O Caches Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr
More informationChapter-5 Memory Hierarchy Design
Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or
More informationBandwidth Adaptive Snooping
Two classes of multiprocessors Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet Project Computer Sciences Department University of Wisconsin
More informationPageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on. Blaise-Pascal Tine Sudhakar Yalamanchili
PageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on Blaise-Pascal Tine Sudhakar Yalamanchili Outline Background: Memory Security Motivation Proposed Solution Implementation Evaluation Conclusion
More informationSandbox Based Optimal Offset Estimation [DPC2]
Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset
More informationFrequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches
Technical Report 15, Computer Sciences Dept., UW-Madison, April 24 Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches Alaa R. Alameldeen and David A. Wood Computer Sciences
More informationReducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University
Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck
More informationImproving Cache Performance using Victim Tag Stores
Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com
More informationAre we ready for high-mlp?
Are we ready for high-mlp?, James Tuck, Josep Torrellas Why MLP? overlapping long latency misses is a very effective way of tolerating memory latency hide latency at the cost of bandwidth 2 Motivation
More informationSEESAW: Set Enhanced Superpage Aware caching
SEESAW: Set Enhanced Superpage Aware caching http://synergy.ece.gatech.edu/ Set Associativity Mayank Parasar, Abhishek Bhattacharjee Ω, Tushar Krishna School of Electrical and Computer Engineering Georgia
More informationDatabase Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:
Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive
More informationThe Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation
Noname manuscript No. (will be inserted by the editor) The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Karthik T. Sundararajan Timothy M. Jones Nigel P. Topham Received:
More informationMyths and Realities: The Performance Impact of Garbage Collection
Myths and Realities: The Performance Impact of Garbage Collection Tapasya Patki February 17, 2011 1 Motivation Automatic memory management has numerous software engineering benefits from the developer
More informationDEMM: a Dynamic Energy-saving mechanism for Multicore Memories
DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to
More informationARCHITECTURAL SUPPORT FOR A VARIABLE GRANULARITY CACHE MEMORY SYSTEM
ARCHITECTURAL SUPPORT FOR A VARIABLE GRANULARITY CACHE MEMORY SYSTEM by Snehasish Kumar B.Tech, Biju Patnaik University of Technology, 2010 a Thesis submitted in partial fulfillment of the requirements
More informationCS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationNightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems
NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science
More informationInitial Results on the Performance Implications of Thread Migration on a Chip Multi-Core
3 rd HiPEAC Workshop IBM, Haifa 17-4-2007 Initial Results on the Performance Implications of Thread Migration on a Chip Multi-Core, P. Michaud, L. He, D. Fetis, C. Ioannou, P. Charalambous and A. Seznec
More informationThe Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory
The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*
More informationColumn Stores vs. Row Stores How Different Are They Really?
Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background
More informationSCORPIO: 36-Core Shared Memory Processor
: 36- Shared Memory Processor Demonstrating Snoopy Coherence on a Mesh Interconnect Chia-Hsin Owen Chen Collaborators: Sunghyun Park, Suvinay Subramanian, Tushar Krishna, Bhavya Daya, Woo Cheol Kwon, Brett
More informationEfficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching
Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching Chang Hyun Park, Taekyung Heo, Jaehyuk Huh School of Computing, KAIST {changhyunpark, tkheo}@calab.kaist.ac.kr, and
More informationEvaluating the Reuse Cache for mobile processors. Lokesh Jindal, Urmish Thakker, Swapnil Haria CS-752 Fall 2014 University of Wisconsin-Madison
Evaluating the Reuse Cache for mobile processors Lokesh Jindal, Urmish Thakker, Swapnil Haria CS-752 Fall 2014 University of Wisconsin-Madison 1 Executive Summary Problem : Mobile SOCs Area is money! Cache
More informationA DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU
A DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU PRESENTED BY ROMAN SHOR Overview Technics of data reduction in storage systems:
More informationMemory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology
Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast
More informationExploration of Cache Coherent CPU- FPGA Heterogeneous System
Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based
More informationEnabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM
Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM Seyed Mohammad Seyedzadeh, Alex K. Jones, Rami Melhem Computer Science Department, Electrical and Computer Engineering
More informationEmerging NVM Memory Technologies
Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement
More informationEnabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM
Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM Seyed Mohammad Seyedzadeh, Alex K. Jones, Rami Melhem Computer Science Department, Electrical and Computer Engineering
More informationImproving DRAM Performance by Parallelizing Refreshes with Accesses
Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes
More informationCombining Local and Global History for High Performance Data Prefetching
Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu
More informationSpeculative Versioning Cache: Unifying Speculation and Coherence
Speculative Versioning Cache: Unifying Speculation and Coherence Sridhar Gopal T.N. Vijaykumar, Jim Smith, Guri Sohi Multiscalar Project Computer Sciences Department University of Wisconsin, Madison Electrical
More informationRelative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review
Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India
More informationSpring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand
Cache Design Basics Nima Honarmand Storage Hierarchy Make common case fast: Common: temporal & spatial locality Fast: smaller, more expensive memory Bigger Transfers Registers More Bandwidth Controlled
More information