Practical Data Compression for Modern Memory Hierarchies

Size: px
Start display at page:

Download "Practical Data Compression for Modern Memory Hierarchies"

Transcription

1 Practical Data Compression for Modern Memory Hierarchies Thesis Oral Gennady Pekhimenko Committee: Todd Mowry (Co-chair) Onur Mutlu (Co-chair) Kayvon Fatahalian David Wood, University of Wisconsin-Madison Doug Burger, Microsoft Michael Kozuch, Intel Presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2 Performance and Energy Efficiency Energy efficiency Applications today are data-intensive Memory Caching Databases Graphics 2

3 Computation vs. Communication Modern memory systems are bandwidth constrained Data movement is very costly Integer operation: ~1 pj Floating operation: ~20 pj Low-power memory access: ~1200 pj Implications ½ bandwidth of modern mobile phone memory exceeds power budget Transfer less or keep data near processing units 3

4 Data Compression across the System Processor Cache Memory Network Disk 4

5 Software vs. Hardware Compression Software vs. Hardware Layer Disk Cache/Memory Latency milliseconds nanoseconds Algorithms Dictionary-based Arithmetic Existing dictionary-based algorithms are too slow for main memory hierarchies 5

6 Key Challenges for Compression in Memory Hierarchy Fast Access Latency Practical Implementation and Low Cost High Compression Ratio 6

7 Thesis Statement It is possible to develop a new set of designs for data compression within modern memory hierarchies that is: Fast Simple Effective in saving storage space and consumed bandwidth so that the resulting improvements in performance, cost, and energy efficiency will make it attractive to implement in future systems 7

8 Contributions of This Dissertation Base-Delta-Immediate (BDI) Compression algorithm with low latency and high compression ratio Compression-Aware Management Policies (CAMP) that incorporate compressed block size into cache management decisions Linearly Compressed Pages (LCP) framework for efficient main memory compression Toggle-Aware Bandwidth compression mechanisms for energy-efficient bandwidth compression 8

9 Practical Data Compression in Memory Processor 1. Cache Compression Cache 2. Compression and Cache Replacement Memory 3. Memory Compression Disk 4. Bandwidth Compression 9

10 PACT Cache Compression 10

11 Background on Cache Compression CPU Hit (3-4 cycles) L1 Cache Decompression Uncompressed Hit (~15 cycles) L2 Cache Data Compressed Key requirement: Low decompression latency 11

12 Key Data Patterns in Real Applications Zero Values: initialization, sparse matrices, NULL pointers 0x x x x Repeated Values: common initial values, adjacent pixels 0x000000C0 0x000000C0 0x000000C0 0x000000C0 Narrow Values: small values stored in a big data type 0x000000C0 0x000000C8 0x000000D0 0x000000D8 Other Patterns: pointers to the same memory region 0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 12

13 libquantum lbm mcf tpch17 sjeng omnetpp tpch2 sphinx3 xalancbmk bzip2 tpch6 leslie3d apache gromacs astar gobmk soplex gcc hmmer wrf h264ref zeusmp cactusadm GemsFDTD Arith.Mean Cache Coverage (%) How Common Are These Patterns? SPEC2006, databases, web workloads, 2MB L2 cache Other Patterns include Narrow Values 100% 80% 60% 40% 20% 0% Zero Repeated Values Other Patterns 43% 43% of the cache lines belong to key patterns 13

14 Key Data Patterns in Real Applications Zero Values: initialization, sparse matrices, NULL pointers 0x x x x Repeated Values: common initial values, adjacent pixels 0x000000C0 0x000000C0 0x000000C0 0x000000C0 Narrow Values: small values stored in a big data type 0x000000C0 0x000000C8 0x000000D0 0x000000D8 Other Patterns: pointers to the same memory region 0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 14

15 Key Data Patterns in Real Applications Low Dynamic Range: Differences between values are significantly smaller than the values themselves Low Latency Decompressor Low Cost and Complexity Compressor Compressed Cache Organization 15

16 Key Idea: Base+Delta (B+Δ) Encoding 4 bytes 32-byte Uncompressed Cache Line 0xC04039C0 0xC04039C8 0xC04039D0 0xC04039F8 0xC04039C0 Base 0x00 0x08 0x10 0x38 12-byte Compressed Cache Line 1 byte 1 byte 1 byte 20 bytes saved Effective: good compression ratio 16

17 B+Δ Decompressor Design Compressed Cache Line B 0 Δ 0 Δ 1 Δ 2 Δ 3 B 0 B 0 B 0 B V 0 V 1 V 2 V 3 Vector addition V 0 V 1 V 2 V 3 Uncompressed Cache Line Fast Decompression: 1-2 cycles 17

18 Can We Get Higher Compression Ratio? Uncompressible cache line (with a single base): 0x09A x x09A4A838 Key idea - use more bases 0x B More cache lines can be compressed Unclear how to find these bases efficiently Higher overhead (due to additional bases) struct A { int* next; int count;}; 18

19 Compression Ratio B+Δ with Multiple Arbitrary Bases # of bases is fixed bases empirically the best option 19

20 How to Find Two Bases Efficiently? 1. First base - first element in the cache line Base+Delta part 2. Second base - implicit base of 0 Immediate part Base-Delta-Immediate (BΔI) Compression 20

21 Set 0 BΔI Cache Organization Tag Storage: Conventional 2-way cache with 32-byte cache lines Set 0 Data Storage: 32 bytes Set 1 Tag 0 Tag 1 Set 1 Data 0 Data 1 Way 0 Way 1 Way 0 Way 1 BΔI: 4-way cache with 8-byte segmented data Tag Storage: Set 0 Set 1 Tag 0 Tag 1 Tag 2 Tag 3 Set 0 C 8 bytes Set 1 S 0 S 1 S 2 S 3 S 4 S 5 S 6 S 7 C - Compr. encoding bits Way 0 Way 1 Way 2 Way 3 Twice as Tags many tags map to multiple adjacent segments 21

22 Methodology Simulator x86 event-driven simulator (MemSim [Seshadri+, PACT 12]) Workloads SPEC2006 benchmarks, TPC, Apache web server 1 4 core simulations for 1 billion representative instructions System Parameters L1/L2/L3 cache latencies from CACTI BDI (1-cycle decompression) 4GHz, x86 in-order core, cache size (1MB - 16MB) 22

23 Comparison Summary Prior Work vs. BΔI Comp. Ratio Decompression 5-9 cycles 1-2 cycles Compression cycles 1-9 cycles Average performance of a twice larger cache 23

24 Processor Cache 1. Cache Compression 2. Compression and Cache Replacement Memory 3. Memory Compression Disk 4. Bandwidth Compression 24

25 HPCA Compression and Cache Replacement 25

26 Cache Management Background Not only about size Cache management policies are important Insertion, promotion and eviction Belady s OPT Replacement Optimal for fixed-size caches fixed-size blocks Assumes perfect future knowledge Cache: Access stream: X A Y B C X Y Z A 26

27 Block Size Can Indicate Reuse Sometimes there is a relation between the compressed block size and reuse distance data structure compressed block size reuse distance This relation can be detected through the compressed block size Minimal overhead to track this relation (compressed block information is a part of design) 27

28 Code Example to Support Intuition int A[N]; double B[16]; for (int i=0; i<n; i++) { int idx = A[i]; for (int j=0; j<n; j++) { sum += B[(idx+j)%16]; } } // small indices: compressible // FP coefficients: incompressible long reuse, compressible short reuse, incompressible Compressed size can be an indicator of reuse distance 28

29 Block Size Can Indicate Reuse Reuse Distance (# of memory accesses) Block size, bytes Different sizes have different dominant reuse distances 29

30 Compression-Aware Management Policies (CAMP) CAMP SIP: Size-based Insertion Policy MVE: Minimal-Value Eviction compressed block size data Probability of reuse The Value benefits = structure are on-par with cache compression - 2X additional increase Compressed in capacity block reuse size distance 30

31 Processor Cache 1. Cache Compression 2. Compression and Cache Replacement Memory 3. Memory Compression Disk 4. Bandwidth Compression 31

32 MICRO Main Memory Compression 32

33 Challenges in Main Memory Compression 1. Address Computation 2. Mapping and Fragmentation 33

34 Address Computation Uncompressed Page Cache Line (64B) L 0 L 1 L 2... L N-1 Address Offset (N-1)*64 Compressed Page Address Offset L 0 L 1 L 2... L N-1 0??? 34

35 Mapping and Fragmentation Virtual Page (4KB) Virtual Address Physical Page (? KB) Fragmentation Physical Address 35

36 Shortcomings of Prior Work Compression Mechanisms Compression Ratio Address Comp. Latency Decompression Latency Complexity and Cost IBM MXT [IBM J.R.D. 01] 64 cycles 36

37 Shortcomings of Prior Work Compression Mechanisms Compression Ratio Address Comp. Latency Decompression Latency Complexity and Cost IBM MXT [IBM J.R.D. 01] Robust Main Memory Compression [ISCA 05] 5 cycles 37

38 Shortcomings of Prior Work Compression Mechanisms Compression Ratio Address Comp. Latency Decompression Latency Complexity and Cost IBM MXT [IBM J.R.D. 01] Robust Main Memory Compression [ISCA 05] Linearly Compressed Pages: Our Proposal 38

39 Linearly Compressed Pages (LCP): Key Idea Uncompressed Page (4KB: 64*64B) 64B 64B 64B 64B... 64B 4:1 Compression... LCP sacrifices some compression ratio in favor of design simplicity Compressed Data (1KB) LCP effectively solves challenge 1: address computation 39

40 LCP: Key Idea (2) Uncompressed Page (4KB: 64*64B) 64B 64B 64B 64B... 64B 4:1 Compression... M E Exception Storage Compressed Data (1KB) Metadata (64B):? (compressible) 40

41 LCP Framework Overview Page Table entry extension compression type and size OS support for multiple page sizes 4 memory pools (512B, 1KB, 2KB, 4KB) Handling uncompressible data Hardware support memory controller logic metadata (MD) cache PTE 41

42 Physical Memory Layout Page Table PA 0 PA 1 offset 4KB 2KB 2KB 1KB 1KB 1KB 1KB 512B 512B B 4KB PA

43 LCP Optimizations Metadata cache Avoids additional requests to metadata Memory bandwidth reduction: 64B 64B 64B 64B 1 transfer instead of 4 Zero pages and zero cache lines Handled separately in TLB (1-bit) and in metadata (1-bit per cache line) 43

44 Summary of the Results Prior Work vs. LCP Comp. Ratio Performance -4% +14% Energy Consumption 6% 5% 44

45 Processor Cache Memory 1. Cache Compression 2. Compression and Cache Replacement 3. Memory Compression Disk 4. Bandwidth Compression 45

46 HPCA 2016 CAL Energy-Efficient Bandwidth Compression 46

47 Energy Efficiency: Bit Toggles How energy is spent in data transfers: Previous data: 0011 New data: Energy = C*V 2 Bit Toggles Energy: 1 1 Energy of data transfers (e.g., NoC, DRAM) is proportional to the bit toggle count 47

48 Excessive Number of Bit Toggles Uncompressed Cache Line 0x00003A00 0x8001D000 0x00003A01 0x8001D008 XOR = Flit 0 Flit 1 # Toggles = 2 Compressed Cache Line 0x5 0x3A00 0x7 8001D000 0x5 0x3A01 5 3A D D XOR D A02 1 = Flit 0 Flit 1 # Toggles = 31 0x7 8001D008 48

49 FPC BDI BDI+FPC Fibonacci C-Pack FPC BDI BDI+FPC Fibonacci C-Pack FPC BDI BDI+FPC Fibonacci C-Pack Normalized Bit Toggle # Effect of Compression on Bit Toggles Discrete Mobile Open-Source Compression significantly increases bit toggle count 49

50 Energy Control Bit toggle count: compressed vs. uncompressed Use a heuristic (Energy X Delay or Energy X Delay 2 metric) to estimate the trade-off Take bandwidth utilization into account Throttle compression when it is not beneficial 50

51 Methodology Simulator: GPGPU-Sim 3.2.x and in-house simulator Workloads: NVIDIA apps (discrete and mobile): 221 apps Open-source (Lonestar, Rodinia, MapReduce): 21 apps System parameters (Fermi): 15 SMs, 32 threads/warp 48 warps/sm, registers, 32KB Shared Memory Core: 1.4GHz, GTO scheduler, 2 schedulers/sm Memory: 177.4GB/s BW, GDDR5 Cache: L1-16KB; L2-768KB 51

52 Normalized Bit Toggle Count FPC BDI BDI+FPC Fibonacci C-Pack FPC BDI BDI+FPC Fibonacci C-Pack FPC BDI BDI+FPC Fibonacci C-Pack Effect of EC on Bit Toggle Count Without EC With EC Discrete Mobile Open-Source EC significantly reduces the bit toggle count Works for different compression algorithms 52

53 FPC BDI BDI+FPC Fibonacci C-Pack FPC BDI BDI+FPC Fibonacci C-Pack FPC BDI BDI+FPC Fibonacci C-Pack Compression Ratio Effect of EC on Compression Ratio Without EC With EC Discrete Mobile Open-Source EC preserves most of the benefits of compression 53

54 Acknowledgments Todd Mowry and Onur Mutlu Phil Gibbons and Mike Kozuch Kayvon Fatahalian, David Wood, and Doug Burger SAFARI and LBA group members Collaborators at CMU, MSR, NVIDIA and GaTech CALCM and PDL Deb Cavlovich Family and friends 54

55 Conclusion Data stored in memory hierarchies has significant redundancy Inefficient usage of existing limited resources Simple and efficient mechanisms for hardwarebased data compression On-chip caches Main memory On-chip/off-chip interconnects Our mechanisms improve performance, cost and energy efficiency 55

56 Practical Data Compression for Modern Memory Hierarchies Thesis Oral Gennady Pekhimenko Committee: Todd Mowry (Co-chair) Onur Mutlu (Co-chair) Kayvon Fatahalian David Wood, University of Wisconsin-Madison Doug Burger, Microsoft Michael Kozuch, Intel Presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,

More information

Exploi'ng Compressed Block Size as an Indicator of Future Reuse

Exploi'ng Compressed Block Size as an Indicator of Future Reuse Exploi'ng Compressed Block Size as an Indicator of Future Reuse Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons, Michael A. Kozuch Execu've Summary In a compressed

More information

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko Advisors: Todd C. Mowry and Onur Mutlu Computer Science Department, Carnegie Mellon

More information

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce

More information

Computer Architecture: (Shared) Cache Management

Computer Architecture: (Shared) Cache Management Computer Architecture: (Shared) Cache Management Prof. Onur Mutlu Carnegie Mellon University (small edits and reorg by Seth Goldstein) Readings Required Qureshi et al., A Case for MLP-Aware Cache Replacement,

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

Spatial Memory Streaming (with rotated patterns)

Spatial Memory Streaming (with rotated patterns) Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;

More information

18-740/640 Computer Architecture Lecture 14: Memory Resource Management I. Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 10/26/2015

18-740/640 Computer Architecture Lecture 14: Memory Resource Management I. Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 10/26/2015 18-740/640 Computer Architecture Lecture 14: Memory Resource Management I Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 10/26/2015 Required Readings Required Reading Assignment: Qureshi and Patt,

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1 Please find the power point presentation

More information

A Fast Instruction Set Simulator for RISC-V

A Fast Instruction Set Simulator for RISC-V A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.

More information

Scalable Many-Core Memory Systems Optional Topic 4: Cache Management

Scalable Many-Core Memory Systems Optional Topic 4: Cache Management Scalable Many-Core Memory Systems Optional Topic 4: Cache Management Prof. Onur Mutlu http://www.ece.cmu.edu/~omutlu onur@cmu.edu HiPEAC ACACES Summer School 2013 July 15-19, 2013 What Will You Learn in

More information

Practical Data Compression for Modern Memory Hierarchies

Practical Data Compression for Modern Memory Hierarchies Practical Data Compression for Modern Memory Hierarchies Gennady G. Pekhimenko CMU-CS-16-116 July 2016 Computer Science Department School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed 5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

Footprint-based Locality Analysis

Footprint-based Locality Analysis Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

Lightweight Memory Tracing

Lightweight Memory Tracing Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for

More information

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Onur Mutlu Chita R. Das Executive summary Problem: Current day NoC designs are agnostic to application requirements

More information

Perceptron Learning for Reuse Prediction

Perceptron Learning for Reuse Prediction Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in

More information

Thesis Defense Lavanya Subramanian

Thesis Defense Lavanya Subramanian Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Thesis Defense Lavanya Subramanian Committee: Advisor: Onur Mutlu Greg Ganger James Hoe Ravi Iyer (Intel)

More information

Energy Models for DVFS Processors

Energy Models for DVFS Processors Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

COP: To Compress and Protect Main Memory

COP: To Compress and Protect Main Memory COP: To Compress and Protect Main Memory David J. Palframan Nam Sung Kim Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin Madison palframan@wisc.edu, nskim3@wisc.edu,

More information

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012 Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science

More information

Flexible Cache Error Protection using an ECC FIFO

Flexible Cache Error Protection using an ECC FIFO Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead

More information

Near-Threshold Computing: How Close Should We Get?

Near-Threshold Computing: How Close Should We Get? Near-Threshold Computing: How Close Should We Get? Alaa R. Alameldeen Intel Labs Workshop on Near-Threshold Computing June 14, 2014 Overview High-level talk summarizing my architectural perspective on

More information

Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses

Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand, Onur Mutlu, Phillip B. Gibbons, Michael A.

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

OpenPrefetch. (in-progress)

OpenPrefetch. (in-progress) OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),

More information

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories SAFARI Technical Report No. 2-5 (December 6, 2) : A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon hanbinyoon@cmu.edu Justin Meza meza@cmu.edu

More information

Sandbox Based Optimal Offset Estimation [DPC2]

Sandbox Based Optimal Offset Estimation [DPC2] Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset

More information

DFPC: A Dynamic Frequent Pattern Compression Scheme in NVM-based Main Memory

DFPC: A Dynamic Frequent Pattern Compression Scheme in NVM-based Main Memory DFPC: A Dynamic Frequent Pattern Compression Scheme in NVM-based Main Memory Yuncheng Guo, Yu Hua, Pengfei Zuo Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology Huazhong

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 38 Performance 2008-04-30 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in

More information

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing

More information

Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache

Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache SOMAYEH SARDASHTI, University of Wisconsin-Madison ANDRE SEZNEC, IRISA/INRIA DAVID A WOOD, University of Wisconsin-Madison Cache

More information

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti Computer Sciences Department University of Wisconsin-Madison somayeh@cs.wisc.edu David

More information

Power Control in Virtualized Data Centers

Power Control in Virtualized Data Centers Power Control in Virtualized Data Centers Jie Liu Microsoft Research liuj@microsoft.com Joint work with Aman Kansal and Suman Nath (MSR) Interns: Arka Bhattacharya, Harold Lim, Sriram Govindan, Alan Raytman

More information

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University

More information

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Overview Emerging memories such as PCM offer higher density than

More information

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory

More information

THE DYNAMIC GRANULARITY MEMORY SYSTEM

THE DYNAMIC GRANULARITY MEMORY SYSTEM THE DYNAMIC GRANULARITY MEMORY SYSTEM Doe Hyun Yoon IIL, HP Labs Michael Sullivan Min Kyu Jeong Mattan Erez ECE, UT Austin MEMORY ACCESS GRANULARITY The size of block for accessing main memory Often, equal

More information

Bias Scheduling in Heterogeneous Multi-core Architectures

Bias Scheduling in Heterogeneous Multi-core Architectures Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that

More information

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation!

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Xiangyao Yu 1, Christopher Hughes 2, Nadathur Satish 2, Onur Mutlu 3, Srinivas Devadas 1 1 MIT 2 Intel Labs 3 ETH Zürich 1 High-Bandwidth

More information

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language

More information

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

Addressing End-to-End Memory Access Latency in NoC-Based Multicores Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu

More information

Data Prefetching by Exploiting Global and Local Access Patterns

Data Prefetching by Exploiting Global and Local Access Patterns Journal of Instruction-Level Parallelism 13 (2011) 1-17 Submitted 3/10; published 1/11 Data Prefetching by Exploiting Global and Local Access Patterns Ahmad Sharif Hsien-Hsin S. Lee School of Electrical

More information

QoS-Aware Memory Systems (Wrap Up) Onur Mutlu July 9, 2013 INRIA

QoS-Aware Memory Systems (Wrap Up) Onur Mutlu July 9, 2013 INRIA QoS-Aware Memory Systems (Wrap Up) Onur Mutlu onur@cmu.edu July 9, 2013 INRIA Slides for These Lectures Architecting and Exploiting Asymmetry in Multi-Core http://users.ece.cmu.edu/~omutlu/pub/onur-inria-lecture1-

More information

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Mainak Chaudhuri Mukesh Agrawal Jayesh Gaur Sreenivas Subramoney Indian Institute of Technology, Kanpur 286, INDIA Intel Architecture

More information

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Transparent Offloading and Mapping () Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O Connor, Nandita Vijaykumar,

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011) Lecture 6: Texture Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today: texturing! Texture filtering - Texture access is not just a 2D array lookup ;-) Memory-system implications

More information

198:231 Intro to Computer Organization. 198:231 Introduction to Computer Organization Lecture 14

198:231 Intro to Computer Organization. 198:231 Introduction to Computer Organization Lecture 14 98:23 Intro to Computer Organization Lecture 4 Virtual Memory 98:23 Introduction to Computer Organization Lecture 4 Instructor: Nicole Hynes nicole.hynes@rutgers.edu Credits: Several slides courtesy of

More information

Predicting Performance Impact of DVFS for Realistic Memory Systems

Predicting Performance Impact of DVFS for Realistic Memory Systems Predicting Performance Impact of DVFS for Realistic Memory Systems Rustam Miftakhutdinov Eiman Ebrahimi Yale N. Patt The University of Texas at Austin Nvidia Corporation {rustam,patt}@hps.utexas.edu ebrahimi@hps.utexas.edu

More information

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery Wrong Path Events and Their Application to Early Misprediction Detection and Recovery David N. Armstrong Hyesoon Kim Onur Mutlu Yale N. Patt University of Texas at Austin Motivation Branch predictors are

More information

Virtual Memory. Motivation:

Virtual Memory. Motivation: Virtual Memory Motivation:! Each process would like to see its own, full, address space! Clearly impossible to provide full physical memory for all processes! Processes may define a large address space

More information

Emerging NVM Memory Technologies

Emerging NVM Memory Technologies Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko Adwait Jog Abhishek Bhowmick Rachata Ausavarungnirun

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection

More information

Managing GPU Concurrency in Heterogeneous Architectures

Managing GPU Concurrency in Heterogeneous Architectures Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures

More information

Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache

Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache Somayeh Sardashti, André Seznec, David A. Wood To cite this version: Somayeh Sardashti, André Seznec, David A. Wood. Yet Another

More information

Virtual Memory. Physical Addressing. Problem 2: Capacity. Problem 1: Memory Management 11/20/15

Virtual Memory. Physical Addressing. Problem 2: Capacity. Problem 1: Memory Management 11/20/15 Memory Addressing Motivation: why not direct physical memory access? Address translation with pages Optimizing translation: translation lookaside buffer Extra benefits: sharing and protection Memory as

More information

Architecture of Parallel Computer Systems - Performance Benchmarking -

Architecture of Parallel Computer Systems - Performance Benchmarking - Architecture of Parallel Computer Systems - Performance Benchmarking - SoSe 18 L.079.05810 www.uni-paderborn.de/pc2 J. Simon - Architecture of Parallel Computer Systems SoSe 2018 < 1 > Definition of Benchmark

More information

A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps

A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps Nandita Vijaykumar Gennady Pekhimenko Adwait Jog Saugata Ghose Abhishek Bhowmick Rachata Ausavarungnirun Chita Das Mahmut Kandemir

More information

CS 6354: Memory Hierarchy I. 29 August 2016

CS 6354: Memory Hierarchy I. 29 August 2016 1 CS 6354: Memory Hierarchy I 29 August 2016 Survey results 2 Processor/Memory Gap Figure 2.2 Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time

More information

Computer Architecture. Introduction

Computer Architecture. Introduction to Computer Architecture 1 Computer Architecture What is Computer Architecture From Wikipedia, the free encyclopedia In computer engineering, computer architecture is a set of rules and methods that describe

More information

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Executive Summary Different memory technologies have different

More information

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Daniel A. Jiménez Department of Computer Science and Engineering Texas A&M University ABSTRACT Last-level caches mitigate the high latency

More information

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language

More information

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University

More information

CS 537 Lecture 6 Fast Translation - TLBs

CS 537 Lecture 6 Fast Translation - TLBs CS 537 Lecture 6 Fast Translation - TLBs Michael Swift 9/26/7 2004-2007 Ed Lazowska, Hank Levy, Andrea and Remzi Arpaci-Dussea, Michael Swift Faster with TLBS Questions answered in this lecture: Review

More information

Virtualized and Flexible ECC for Main Memory

Virtualized and Flexible ECC for Main Memory Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC

More information

Multi-level Translation. CS 537 Lecture 9 Paging. Example two-level page table. Multi-level Translation Analysis

Multi-level Translation. CS 537 Lecture 9 Paging. Example two-level page table. Multi-level Translation Analysis Multi-level Translation CS 57 Lecture 9 Paging Michael Swift Problem: what if you have a sparse address space e.g. out of GB, you use MB spread out need one PTE per page in virtual address space bit AS

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory

More information

An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip

An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip Technical Report CSE-11-7 Monday, June 13, 211 Asit K. Mishra Department of Computer Science and Engineering The Pennsylvania

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

Multiperspective Reuse Prediction

Multiperspective Reuse Prediction ABSTRACT Daniel A. Jiménez Texas A&M University djimenezacm.org The disparity between last-level cache and memory latencies motivates the search for e cient cache management policies. Recent work in predicting

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

Using Transparent Compression to Improve SSD-based I/O Caches

Using Transparent Compression to Improve SSD-based I/O Caches Using Transparent Compression to Improve SSD-based I/O Caches Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

Best-Offset Hardware Prefetching

Best-Offset Hardware Prefetching Best-Offset Hardware Prefetching Pierre Michaud March 2016 2 BOP: yet another data prefetcher Contribution: offset prefetcher with new mechanism for setting the prefetch offset dynamically - Improvement

More information

Virtual Memory. CS 3410 Computer System Organization & Programming

Virtual Memory. CS 3410 Computer System Organization & Programming Virtual Memory CS 3410 Computer System Organization & Programming These slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer. Where are we now and

More information

CS162 - Operating Systems and Systems Programming. Address Translation => Paging"

CS162 - Operating Systems and Systems Programming. Address Translation => Paging CS162 - Operating Systems and Systems Programming Address Translation => Paging" David E. Culler! http://cs162.eecs.berkeley.edu/! Lecture #15! Oct 3, 2014!! Reading: A&D 8.1-2, 8.3.1. 9.7 HW 3 out (due

More information

Virtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. April 12, 2018 L16-1

Virtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. April 12, 2018 L16-1 Virtual Memory Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L16-1 Reminder: Operating Systems Goals of OS: Protection and privacy: Processes cannot access each other s data Abstraction:

More information

Skewed Compressed Cache

Skewed Compressed Cache Skewed Compressed Cache Somayeh Sardashti, André Seznec, David A. Wood To cite this version: Somayeh Sardashti, André Seznec, David A. Wood. Skewed Compressed Cache. MICRO - 47th Annual IEEE/ACM International

More information