RAS$Directed,Instruc1on,Prefetching, (RDIP),

Size: px
Start display at page:

Download "RAS$Directed,Instruc1on,Prefetching, (RDIP),"

Transcription

1 RAS$Directed,Instruc1on,Prefetching, (RDIP), Aasheesh,Kolli*,Ali,Saidi,Thomas,F.,Wenisch*,, *,University,of,Michigan,,,,,,,,,,,,,,,ARM,!!! MICRO'46!12/10/2013! 1!

2 Why,instruc1on,prefetching?, 1.6 Relative Performance No Prefetcher (NoP) Better Ideal 50% 16% 0.8 gem5 HD-teraread HD-wdcnt ssj MC-friendfeed MC-microblog GMEAN Poor!I$!behavior!affects!modern!server!workloads! [Spracklen! 05][Ferdman! 08]![Ferdman! 11]! Cache!size!constraints!!!Prefetching!necessary! 2!

3 Why,another,prefetcher?, Next'2'line!(N2L)! +!Low!overhead! Modest!benefits,!ineffecZve!at!disconZnuiZes! ProacZve!InstrucZon!Fetch!(PIF)![Ferdman! 11]!! +!Best!performing!academic!proposal! Storage!overhead!(!>!200kB!per!core)! Design!complexity! Our Goal: Low overhead, high accuracy prefetcher 3!

4 Contribu1ons, I$!misses!'!program!context!correlaZon! Program!contexts!are!repeZZve,!predictable! RAS!succinctly!captures!program!context! RAS-Directed Instruction Prefetching (RDIP) RDIP achieves 11.5% increase in performance with only 64kB overhead 4!

5 5! Outline, Design,overview, RAS!signature!generaZon! Timely!prefetching! Results! Conclusions!

6 RDIP,design,overview, I$!misses!correlate!to!program!context! Program!contexts!are!predictable! RAS!state!represents!program!contexts! 1. Represent,program,context,using,a,RAS,signature, 2. Map,cache,misses,to,signatures, 3. Prefetch,upon,next,occurrence,of,signature,, 6!

7 7! RDIP,design,challenges, 1. Hash!RAS!contents!to!generate!program!context!signatures! Challenge:!Accurately!represent!program!contexts! 2. Record!cache!misses!associated!with!signature!in!Miss$Table$ Challenge:!Minimize!storage! 3. Prefetch!upon!signature!change!based!on!Miss$Table$ Challenge:!Ensure!Zmely!prefetches!!!

8 8! Outline, Design!overview! RAS,signature,genera1on, Timely!prefetching! Results! Conclusions!

9 Genera1ng,program,context,signatures, Use!contents!of!RAS!to!represent!contexts! Cannot!use!enZre!RAS!!need!compact!signatures! Signatures!must!differenZate!traversals!up!&!down!call!stack!! Call:!XOR!contents!of!RAS!aSer,push!onto!RAS,!append,0, Return:!XOR!contents!of!RAS!before,pop!from!RAS,!append,1,! 9!

10 10! Example,,, Call: XOR contents of RAS after push onto RAS, append 0 Return: XOR contents of RAS before pop from RAS, append 1 Dynamic,Instruc1ons, A:funcX{,,,,,,,,B:funcY{, },,,,,,,,,,,},,,,,,,,.,,,,,,,,C:funcY{,,,,,,,,,,,}, RAS,Signature, (A)0, (A B)0,, (A B)1,, (A C)0,, (A C)1,, RAS C B A

11 11! Outline, Design!overview! RAS!signature!generaZon! Timely,prefetching, Results! Conclusions!

12 Timely,prefetching, Miss$Table$stores!signature'misses!pairs! Prefetches!issued!by!looking!up!Miss$Table$ If!misses!tagged!with!current!signature?!!!Too!late!!! Execution Stream Signatures S 1 Prefetch Requests Time S 2 S 3 Late prefetches! Mapping misses to previous signature! Timely prefetches 12!

13 13! Issuing,prefetch,requests, Miss Table Current Signature S 1 Lookup S 2 S 1 Previous Signature Prefetch Queue To L2$

14 Misses tagged with previous signature!timely prefetching! 14! Upda1ng,miss,table, Miss Table Current Signature S 2 Signature Change! S 1 Update Previous Signature From L2$ Miss Buffer

15 15! Outline, Design!overview! RAS!signature!generaZon! Timely!prefetching! Results, Conclusions!

16 Experimental,setup, gem5! Core:!2GHz!OoO,!8'wide!commit,!16'entry!RAS! I$Cache:,32KB/2'way/64B,!2!cycles! D$Cache:,64KB/2'way/64B,!3!cycles! L2:!2MB/8'way/64B,!24!cycles! Workloads! gem5!!gem5!running!a!spec!benchmark!(twolf)! HD$teraread,,!Hadoop:!Big!data!MapReduce!job! HD$wdcnt,,Hadoop:!Word!count!!!!!!! ssj,,tests!java!performance!in!specpower!!!!!!!! MC$friendfeed!!Memcached:! Facebook 'like!app! MC$microblog!!Memcached:! Twirer 'like!app!!! 16!

17 I$,misses,,signature,correla1on, 1 Miss Coverage gem5 HD-wdcnt HD-teraread ssj Better 0 MC-friendfeed MC-microblog Misses per Signature Strong,correla1on,between,misses,&,signatures, 17!

18 RDIP,prac1cal,design, Summary!of!takeaways!from!sensiZvity!studies:! RAS!size!for!signature!generaZon!!!4!top!entries! Miss!Table!!!4K!entries!(4'way!associaZve)! Entry!size!!!16B!(compacZon!technique![Ferdman! 11])! Max.!27!misses! Total Hardware Overhead = 64kB Please see the paper for a detailed analysis 18!

19 Percentage Coverage,and,erroneous,prefetches, Erroneous Prefetches Misses Prefetch Hits Better Better 50 0 N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP gem5 HD-teraread HD-wdcnt ssj MC-friendfeed MC-microblog Coverage: PIF ~ RDIP > N2L Erroneous prefetches: PIF > N2L > RDIP 19!

20 RDIP achieves 98% of the performance of PIF, with 3X storage reduction. 20! Relative Performance Performance, N2L PIF RDIP Ideal Better 0.8 gem5 HD-teraread HD-wdcnt ssj MC-friendfeed MC-microblog GMEAN Performance increase: N2L 5%, PIF 13%, RDIP 11.5%, Ideal 16%

21 Conclusions, I$!misses!!program!contexts!'!RAS!signatures! RDIP!performs!comparably!to!PIF!with!3X! storage!reduczon! 21!

22 22! Thank,You!, Ques1ons?,

RDIP: Return-address-stack Directed Instruction Prefetching

RDIP: Return-address-stack Directed Instruction Prefetching RDIP: Return-address-stack Directed Instruction Prefetching Aasheesh Kolli University of Michigan akolli@umich.edu Ali Saidi ARM ali.saidi@arm.com Thomas F. Wenisch University of Michigan twenisch@umich.edu

More information

Computer Architecture: Optional Homework Set

Computer Architecture: Optional Homework Set Computer Architecture: Optional Homework Set Black Board due date: Hard Copy due date: Monday April 27 th, at Midnight. Tuesday April 28 th, during Class. Exercise 1: (50 Points) Patterson and Hennessy

More information

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation Qisi Wang Hui-Shun Hung Chien-Fu Chen Outline Data Prefetching Exist Data Prefetcher Stride Data Prefetcher Offset Prefetcher

More information

An Accurate and Detailed Prefetching. Simulation Framework for gem5

An Accurate and Detailed Prefetching. Simulation Framework for gem5 An Accurate and Detailed ing Simulation Framework for gem5 Martí Torrents, Raúl Martínez, and Carlos Molina martit@ac.upc.edu Computer Architecture Department UPC BarcelonaTech ing Reduce memory latency

More information

Improving Bandwidth Efficiency of Peer-to-Peer Storage

Improving Bandwidth Efficiency of Peer-to-Peer Storage Improving Bandwidth Efficiency of Peer-to-Peer Storage Patrick Eaton, Emil Ong, John Kubiatowicz University of California, Berkeley http://oceanstore.cs.berkeley.edu/ P2P Storage: Promise vs.. Reality

More information

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University

More information

Orchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

Orchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das Parallelize your code! Launch more threads! Multi- threading

More information

A Comprehensive Analytical Performance Model of DRAM Caches

A Comprehensive Analytical Performance Model of DRAM Caches A Comprehensive Analytical Performance Model of DRAM Caches Authors: Nagendra Gulur *, Mahesh Mehendale *, and R Govindarajan + Presented by: Sreepathi Pai * Texas Instruments, + Indian Institute of Science

More information

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems : A Writeback-Aware LLC Management for PCM-based Main Memory Systems Bahareh Pourshirazi *, Majed Valad Beigi, Zhichun Zhu *, and Gokhan Memik * University of Illinois at Chicago Northwestern University

More information

STEPS Towards Cache-Resident Transaction Processing

STEPS Towards Cache-Resident Transaction Processing STEPS Towards Cache-Resident Transaction Processing Stavros Harizopoulos joint work with Anastassia Ailamaki VLDB 2004 Carnegie ellon CPI OLTP workloads on modern CPUs 6 4 2 L2-I stalls L2-D stalls L1-I

More information

Portland State University ECE 587/687. Caches and Prefetching

Portland State University ECE 587/687. Caches and Prefetching Portland State University ECE 587/687 Caches and Prefetching Copyright by Alaa Alameldeen and Haitham Akkary 2008 Impact of Cache Misses Cache misses are very expensive Blocking cache: severely reduce

More information

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Memory Hierarchy Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

CS399 New Beginnings. Jonathan Walpole

CS399 New Beginnings. Jonathan Walpole CS399 New Beginnings Jonathan Walpole Virtual Memory (1) Page Tables When and why do we access a page table? - On every instruction to translate virtual to physical addresses? Page Tables When and why

More information

The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!!

The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!! The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!! 1 2 3 Modern CMPs" Intel e5 2600 (2013)! SLLC" AMD Orochi (2012)! SLLC"

More information

A high-level model of embedded flash energy consumption

A high-level model of embedded flash energy consumption A high-level model of embedded flash energy consumption To appear in CASES, ESWEEK 2014 James Pallister, PhD Student Kerstin Eder, Primary PhD Supervisor Simon Hollis, PhD Supervisor Jeremy Bennett, Embecosm

More information

ECE 5730 Memory Systems

ECE 5730 Memory Systems ECE 5730 Memory Systems Spring 2009 Command Scheduling Disk Caching Lecture 23: 1 Announcements Quiz 12 I ll give credit for #4 if you answered (d) Quiz 13 (last one!) on Tuesday Make-up class #2 Thursday,

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Stride- and Global History-based DRAM Page Management

Stride- and Global History-based DRAM Page Management 1 Stride- and Global History-based DRAM Page Management Mushfique Junayed Khurshid, Mohit Chainani, Alekhya Perugupalli and Rahul Srikumar University of Wisconsin-Madison Abstract To improve memory system

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 Techniques to Reduce Cache Misses Victim caches Better replacement policies pseudo-lru, NRU Prefetching, cache

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Spatial Memory Streaming (with rotated patterns)

Spatial Memory Streaming (with rotated patterns) Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;

More information

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010 Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic

Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic Fall 2011 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Reading: Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000) If memory

More information

Figure 1: Organisation for 128KB Direct Mapped Cache with 16-word Block Size and Word Addressable

Figure 1: Organisation for 128KB Direct Mapped Cache with 16-word Block Size and Word Addressable Tutorial 12: Cache Problem 1: Direct Mapped Cache Consider a 128KB of data in a direct-mapped cache with 16 word blocks. Determine the size of the tag, index and offset fields if a 32-bit architecture

More information

SEESAW: Set Enhanced Superpage Aware caching

SEESAW: Set Enhanced Superpage Aware caching SEESAW: Set Enhanced Superpage Aware caching http://synergy.ece.gatech.edu/ Set Associativity Mayank Parasar, Abhishek Bhattacharjee Ω, Tushar Krishna School of Electrical and Computer Engineering Georgia

More information

5 Solutions. Solution a. no solution provided. b. no solution provided

5 Solutions. Solution a. no solution provided. b. no solution provided 5 Solutions Solution 5.1 5.1.1 5.1.2 5.1.3 5.1.4 5.1.5 5.1.6 S2 Chapter 5 Solutions Solution 5.2 5.2.1 4 5.2.2 a. I, J b. B[I][0] 5.2.3 a. A[I][J] b. A[J][I] 5.2.4 a. 3596 = 8 800/4 2 8 8/4 + 8000/4 b.

More information

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

PageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on. Blaise-Pascal Tine Sudhakar Yalamanchili

PageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on. Blaise-Pascal Tine Sudhakar Yalamanchili PageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on Blaise-Pascal Tine Sudhakar Yalamanchili Outline Background: Memory Security Motivation Proposed Solution Implementation Evaluation Conclusion

More information

ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial

ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial Roxana Rusitoru Systems Research Engineer, ARM 1 Motivation & background Goal: Why: Who: 2 HPC-oriented

More information

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont.   History Table. Correlating Prediction Table Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.

More information

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving

More information

data block 0, word 0 block 0, word 1 block 1, word 0 block 1, word 1 block 2, word 0 block 2, word 1 block 3, word 0 block 3, word 1 Word index cache

data block 0, word 0 block 0, word 1 block 1, word 0 block 1, word 1 block 2, word 0 block 2, word 1 block 3, word 0 block 3, word 1 Word index cache Taking advantage of spatial locality Use block size larger than one word Example: two words Block index tag () () Alternate representations Word index tag block, word block, word block, word block, word

More information

Hash Table Design and Optimization for Software Virtual Switches

Hash Table Design and Optimization for Software Virtual Switches Hash Table Design and Optimization for Software Virtual Switches P R E S E N T E R : R E N WA N G Y I P E N G WA N G, S A M E H G O B R I E L, R E N WA N G, C H A R L I E TA I, C R I S T I A N D U M I

More information

CACHE OPTIMIZATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

CACHE OPTIMIZATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah CACHE OPTIMIZATION Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 3 will be released on Oct. 31 st This

More information

Introduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem

Introduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem Introduction Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: Increase computation power Make the best use of available bandwidth We study the bandwidth

More information

1/19/2009. Data Locality. Exploiting Locality: Caches

1/19/2009. Data Locality. Exploiting Locality: Caches Spring 2009 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Data Locality Temporal: if data item needed now, it is likely to be needed again in near future Spatial: if data item needed now, nearby

More information

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering

More information

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors Jeffrey Dean Jamey Hicks Carl Waldspurger William Weihl George Chrysos Digital Equipment Corporation 1 Motivation

More information

DeTail Reducing the Tail of Flow Completion Times in Datacenter Networks. David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, Randy Katz

DeTail Reducing the Tail of Flow Completion Times in Datacenter Networks. David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, Randy Katz DeTail Reducing the Tail of Flow Completion Times in Datacenter Networks David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, Randy Katz 1 A Typical Facebook Page Modern pages have many components

More information

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors : Optimizations for Improving Fetch Bandwidth of Future Itanium Processors Marsha Eng, Hong Wang, Perry Wang Alex Ramirez, Jim Fung, and John Shen Overview Applications of for Itanium Improving fetch bandwidth

More information

CS 1114: Implementing Search. Last time. ! Graph traversal. ! Two types of todo lists: ! Prof. Graeme Bailey.

CS 1114: Implementing Search. Last time. ! Graph traversal. ! Two types of todo lists: ! Prof. Graeme Bailey. CS 1114: Implementing Search! Prof. Graeme Bailey http://cs1114.cs.cornell.edu (notes modified from Noah Snavely, Spring 2009) Last time! Graph traversal 1 1 2 10 9 2 3 6 3 5 6 8 5 4 7 9 4 7 8 10! Two

More information

THE DYNAMIC GRANULARITY MEMORY SYSTEM

THE DYNAMIC GRANULARITY MEMORY SYSTEM THE DYNAMIC GRANULARITY MEMORY SYSTEM Doe Hyun Yoon IIL, HP Labs Michael Sullivan Min Kyu Jeong Mattan Erez ECE, UT Austin MEMORY ACCESS GRANULARITY The size of block for accessing main memory Often, equal

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

Boomerang: a Metadata-Free Architecture for Control Flow Delivery

Boomerang: a Metadata-Free Architecture for Control Flow Delivery : a Metadata-Free Architecture for Control Flow Delivery Rakesh Kumar Cheng-Chieh Huang Boris Grot Vijay Nagarajan Institute of Computing Systems Architecture University of Edinburgh {rakesh.kumar, cheng-chieh.huang,

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2019 Caches and the Memory Hierarchy Assigned February 13 Problem Set #2 Due Wed, February 27 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

CS152 Computer Architecture and Engineering

CS152 Computer Architecture and Engineering CS152 Computer Architecture and Engineering Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 http://inst.eecs.berkeley.edu/~cs152/fa16 The problem sets are intended to help

More information

Data Cache Prefetching with Perceptron Learning

Data Cache Prefetching with Perceptron Learning Data Cache Prefetching with Perceptron Learning Haoyuan Wang School of Automation Huazhong University of Science and Technology Wuhan, China Email: lkwywhy@hust.edu.cn Zhiwei Luo School of Automation Huazhong

More information

Parallel Assertion Processing using Memory Snapshots

Parallel Assertion Processing using Memory Snapshots Parallel Assertion Processing using Memory Snapshots Junaid Haroon Siddiqui Muhammad Faisal Iqbal Derek Chiou UCAS5 26 April 2009 Motivation Importance of Assertions Parallel Assertion Processing Memory

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

CS 333 Introduction to Operating Systems. Class 11 Virtual Memory (1) Jonathan Walpole Computer Science Portland State University

CS 333 Introduction to Operating Systems. Class 11 Virtual Memory (1) Jonathan Walpole Computer Science Portland State University CS 333 Introduction to Operating Systems Class 11 Virtual Memory (1) Jonathan Walpole Computer Science Portland State University Virtual addresses Virtual memory addresses (what the process uses) Page

More information

Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding

Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, *Gunjae Koo, Won Woo Ro, *Murali Annavaram Yonsei University *University of Southern

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,

More information

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections ) Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case

More information

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers 1 ASPLOS 2016 2-6 th April Amro Awad (NC State University) Pratyusa Manadhata (Hewlett Packard Labs) Yan Solihin (NC

More information

Outline. Spanner Mo/va/on. Tom Anderson

Outline. Spanner Mo/va/on. Tom Anderson Spanner Mo/va/on Tom Anderson Outline Last week: Chubby: coordina/on service BigTable: scalable storage of structured data GFS: large- scale storage for bulk data Today/Friday: Lessons from GFS/BigTable

More information

A best-offset prefetcher

A best-offset prefetcher A best-offset prefetcher Pierre Michaud 2 nd data prefetching championship, june 2015 DPC2 rules core L1 DPC2 simulator L2 MSHR prefetcher L3 DRAM 2 DPC2 rules DPC2 simulator core L1 L2 MSHR physical address

More information

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15 Examples of Physical Query Plan Alternatives Selected Material from Chapters 12, 14 and 15 1 Query Optimization NOTE: SQL provides many ways to express a query. HENCE: System has many options for evaluating

More information

Reconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj

Reconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj Reconfigurable and Self-optimizing Multicore Architectures Presented by: Naveen Sundarraj 1 11/9/2012 OUTLINE Introduction Motivation Reconfiguration Performance evaluation Reconfiguration Self-optimization

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google* 정학수, 최주영 1 Outline Introduction Design Overview System Interactions Master Operation Fault Tolerance and Diagnosis Conclusions

More information

Best-Offset Hardware Prefetching

Best-Offset Hardware Prefetching Best-Offset Hardware Prefetching Pierre Michaud March 2016 2 BOP: yet another data prefetcher Contribution: offset prefetcher with new mechanism for setting the prefetch offset dynamically - Improvement

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

CMU : Advanced Computer Architecture Handout 5: Cache Prefetching Competition ** Due 10/11/2005 **

CMU : Advanced Computer Architecture Handout 5: Cache Prefetching Competition ** Due 10/11/2005 ** 1. Introduction CMU 18-741: Advanced Computer Architecture Handout 5: Cache Prefetching Competition ** Due 10/11/2005 ** Architectural innovations along with accelerating processor speeds have led to a

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Chapter 9 Pipelining. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Chapter 9 Pipelining. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Chapter 9 Pipelining Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Basic Concepts Data Hazards Instruction Hazards Advanced Reliable Systems (ARES) Lab.

More information

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 16: Prefetching Wrap-up Prof. Onur Mutlu Carnegie Mellon University Announcements Exam solutions online Pick up your exams Feedback forms 2 Feedback Survey Results

More information

Parsing Scheme (+ (* 2 3) 1) * 1

Parsing Scheme (+ (* 2 3) 1) * 1 Parsing Scheme + (+ (* 2 3) 1) * 1 2 3 Compiling Scheme frame + frame halt * 1 3 2 3 2 refer 1 apply * refer apply + Compiling Scheme make-return START make-test make-close make-assign make- pair? yes

More information

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK]

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK] Lecture 17 Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] SRAM / / Flash / RRAM / HDD SRAM / / Flash / RRAM/ HDD SRAM

More information

A Structure-Independent Approach for Fault Detection Hardware Implementations of the Advanced Encryption Standard

A Structure-Independent Approach for Fault Detection Hardware Implementations of the Advanced Encryption Standard A Structure-Independent Approach for Fault Detection Hardware Implementations of the Advanced Encryption Standard Presented by: Mehran Mozaffari Kermani Department of Electrical and Computer Engineering

More information

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees Pandian Raju 1, Rohan Kadekodi 1, Vijay Chidambaram 1,2, Ittai Abraham 2 1 The University of Texas at Austin 2 VMware Research

More information

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache Is the cache indexed with virtual or physical address? To index with a physical address, we

More information

Warming up Storage-level Caches with Bonfire

Warming up Storage-level Caches with Bonfire Warming up Storage-level Caches with Bonfire Yiying Zhang Gokul Soundararajan Mark W. Storer Lakshmi N. Bairavasundaram Sethuraman Subbiah Andrea C. Arpaci-Dusseau Remzi H. Arpaci-Dusseau 2 Does on-demand

More information

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip Codesign for Tiled Manycore Systems Mingyu Wang and Zhaolin Li Institute of Microelectronics, Tsinghua University, Beijing 100084,

More information

LabMax-Pro PC data collection

LabMax-Pro PC data collection LabMax-Pro PC data collection Table of Contents 1. The Question 2. Setting up data logging ahead of time 3. Exporting data after it s already been collected 4. Importing and exporting data from the LabMax-Pro

More information

Annotated Memory References: A Mechanism for Informed Cache Management

Annotated Memory References: A Mechanism for Informed Cache Management Annotated Memory References: A Mechanism for Informed Cache Management Alvin R. Lebeck, David R. Raymond, Chia-Lin Yang Mithuna S. Thottethodi Department of Computer Science, Duke University http://www.cs.duke.edu/ari/ice

More information

Dynamic load-balancing algorithm for a decentralized gpu-cpu cluster

Dynamic load-balancing algorithm for a decentralized gpu-cpu cluster Dynamic load-balancing algorithm for a decentralized gpu-cpu cluster Mert Değirmenci Department of Computer Engineering, Middle East Technical University Related work http://www2.computer.org/portal/web/csdl/doi/10.1109/icpads.2004.13161144

More information

F21 Microprocessor Preliminary specifications 9/98

F21 Microprocessor Preliminary specifications 9/98 F21 contains a CPU, a memory interface processor, two analog I/O coprocessors, an active message serial network coprocessor, and a parallel I/O port on a small custom VLSI CMOS chip. CPU 0 operand stack

More information

1. Creates the illusion of an address space much larger than the physical memory

1. Creates the illusion of an address space much larger than the physical memory Virtual memory Main Memory Disk I P D L1 L2 M Goals Physical address space Virtual address space 1. Creates the illusion of an address space much larger than the physical memory 2. Make provisions for

More information

Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India

Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India Outline of Presentation MPEG-2 to H.264 Transcoding Need for a multiprocessor

More information

This presentation covers Gen Z Buffer operations. Buffer operations are used to move large quantities of data between components.

This presentation covers Gen Z Buffer operations. Buffer operations are used to move large quantities of data between components. This presentation covers Gen Z Buffer operations. Buffer operations are used to move large quantities of data between components. 1 2 Buffer operations are used to get data put or push up to 4 GiB of data

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Module Outline. CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies.

Module Outline. CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies. M6 Memory Hierarchy Module Outline CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies. Events on a Cache Miss Events on a Cache Miss Stall the pipeline.

More information

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation!

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Xiangyao Yu 1, Christopher Hughes 2, Nadathur Satish 2, Onur Mutlu 3, Srinivas Devadas 1 1 MIT 2 Intel Labs 3 ETH Zürich 1 High-Bandwidth

More information

Loose-Ordering Consistency for Persistent Memory

Loose-Ordering Consistency for Persistent Memory Loose-Ordering Consistency for Persistent Memory Youyou Lu 1, Jiwu Shu 1, Long Sun 1, Onur Mutlu 2 1 Tsinghua University 2 Carnegie Mellon University Summary Problem: Strict write ordering required for

More information

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache 2 Virtually Indexed Caches 24-bit virtual address, 4KB page size 12 bits offset and 12 bits

More information

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs The 34 th IEEE International Conference on Computer Design Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs Shin-Ying Lee and Carole-Jean Wu Arizona State University October

More information

Removing the I/O Bottleneck in Enterprise Storage

Removing the I/O Bottleneck in Enterprise Storage Removing the I/O Bottleneck in Enterprise Storage WALTER AMSLER, SENIOR DIRECTOR HITACHI DATA SYSTEMS AUGUST 2013 Enterprise Storage Requirements and Characteristics Reengineering for Flash removing I/O

More information

Congestion Control in TCP

Congestion Control in TCP Congestion Control in TCP Antonio Carzaniga Faculty of Informatics University of Lugano May 6, 2005 Outline Intro to congestion control Input rate vs. output throughput Congestion window Congestion avoidance

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

Domino Temporal Data Prefetcher

Domino Temporal Data Prefetcher Read Miss Coverage Appears in Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA) Temporal Data Prefetcher Mohammad Bakhshalipour 1, Pejman Lotfi-Kamran, and

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes

Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes Alaa R. Alameldeen Ilya Wagner Zeshan Chishti Wei Wu Chris Wilkerson Shih-Lien Lu Intel Labs Overview Large caches and memories

More information

UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. M.Sc. in Computational Science & Engineering

UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. M.Sc. in Computational Science & Engineering COMP60081 Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE M.Sc. in Computational Science & Engineering Fundamentals of High Performance Execution Wednesday 16 th January 2008 Time: 09:45

More information

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction Looking for Instruction Level Parallelism (ILP) Branch Prediction We want to identify and exploit ILP instructions that can potentially be executed at the same time. Branches are 15-20% of instructions

More information

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.

More information

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed 5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid

More information