Detecting Global Stride Locality in Value Streams
1 Detecting Global Stride Locality in Value Streams
Huiyang Zhou, Jill Flanagan, Thomas M. Conte
TINKER Research Group, Department of Electrical & Computer Engineering, North Carolina State University
2 Introduction
This paper is about predicting values.
A code example (from the benchmark parser, function is_con_word):
    lw  $v0[2], 0($v1[3])
    sw  $v0[2], 0($s8[30])
    lw  $v0[2], 0($s8[30])    // the instruction to be predicted
    bne $v0[2], $zero[0],
    lw  $v0[2], 0($s8[30])
a0  lw  $v1[3], 12($v0[2])
a8  sw  $v1[3], 0($s8[30])
b0  j
Local value history from the value profile: xx528, xx840, 0, xx792, 0, xx720, 0, xx816, xx768, xx744, xx696, xx624, xx672, ...
3 Hard-to-Predict Values
[Figure: value sequence; axes: value vs. occurrence]
Hard to predict with (local) value predictors:
- DFCM [Goeman et al.]: 2% accuracy
- (local) stride [Lipasti et al.]: 4% accuracy
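To make the local stride baseline concrete, here is a minimal sketch of a last-stride predictor applied to one instruction's own value sequence. It is a simplification for illustration only: the function name `local_stride_accuracy` and both input sequences are hypothetical, and real stride predictors typically add confidence counters and table indexing that are omitted here.

```python
def local_stride_accuracy(values):
    """Predict each value as last + (last - second_last); return the fraction correct."""
    correct = 0
    predictions = 0
    for i in range(2, len(values)):
        stride = values[i - 1] - values[i - 2]   # most recent local stride
        predicted = values[i - 1] + stride
        predictions += 1
        correct += (predicted == values[i])
    return correct / predictions if predictions else 0.0


# A regular sequence is fully predictable; an irregular one is not.
print(local_stride_accuracy([4, 8, 12, 16, 20]))      # 1.0
print(local_stride_accuracy([5, 12, 7, 6, 15, 10]))   # 0.0
```

The second sequence has no repeating local stride, which is the situation the slide describes: the instruction's own history carries too little information.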
4 Global Value Locality
Another look at the code:
    lw  $v0[2], 0($v1[3])     // the correlated load
    sw  $v0[2], 0($s8[30])
    lw  $v0[2], 0($s8[30])    // the instruction to be predicted
    bne $v0[2], $zero[0],
    lw  $v0[2], 0($s8[30])
a0  lw  $v1[3], 12($v0[2])    // the correlated load
a8  sw  $v1[3], 0($s8[30])
b0  j
5 Global Stride Locality
Control flow graph:
    lw  $v0[2], 0($v1[3])     // the correlated load
    sw  $v0[2], 0($s8[30])
a0  lw  $v1[3], 12($v0[2])    // the correlated load
a8  sw  $v1[3], 0($s8[30])
b0  j
    lw  $v0[2], 0($s8[30])    // the instruction to be predicted
    bne $v0[2], $zero[0],
[...]% prediction accuracy is possible if we can use the results of correlated instructions.
6 Global Value History
Global value history: if the values produced by the dynamic instruction stream are labeled $x_1, x_2, \ldots, x_N$, then the ordered data sequence $(x_1, x_2, \ldots, x_N)$ is the global value history of order (i.e., length) N.
Local value history: the value sequence produced by the instruction to be predicted itself.
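The two definitions can be contrasted with a small sketch that splits one dynamic trace into its global history and its per-instruction local histories. The trace format (pairs of PC label and produced value), the function name `split_histories`, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict


def split_histories(trace):
    """trace: iterable of (pc, value) pairs in dynamic program order."""
    global_history = []                  # every produced value, in order
    local_history = defaultdict(list)    # values grouped by producing instruction
    for pc, value in trace:
        global_history.append(value)
        local_history[pc].append(value)
    return global_history, dict(local_history)


# Hypothetical trace: two static instructions, "a" and "d", executed three times each.
trace = [("a", 1), ("d", 5), ("a", 8), ("d", 12), ("a", 3), ("d", 7)]
global_history, local_history = split_histories(trace)
print(global_history)       # [1, 5, 8, 12, 3, 7]
print(local_history["d"])   # [5, 12, 7]
```

A local predictor for "d" sees only [5, 12, 7]; a global predictor also sees the interleaved values from "a".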
7 Global Value Locality
Similar to local value predictability [Sazeides and Smith]
Global context locality
- Previous instruction (PI) predictor [Nakra et al.]
- Dynamic dataflow-inherited speculative context (DDISC) predictor [Thomas & Franklin]
Global computational locality
- $x_N = a_{N-1} x_{N-1} + a_{N-2} x_{N-2} + \cdots + a_1 x_1 + a_0$
- Similar formulation to the perceptron branch predictor [Jimenez & Lin]
8 Global Stride Locality
Global stride locality: $x_N = x_{N-k} + a_0$, where k and $a_0$ are run-time variables.
The previous example: $x_N = x_{N-\ldots} + \ldots$
    lw  $v0[2], 0($v1[3])     // the correlated load
    sw  $v0[2], 0($s8[30])
a0  lw  $v1[3], 12($v0[2])    // the correlated load
a8  sw  $v1[3], 0($s8[30])
b0  j
    lw  $v0[2], 0($s8[30])    // the instruction to be predicted
    bne $v0[2], $zero[0],
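As an offline illustration of the relation x_N = x_{N-k} + a_0, the sketch below searches a global value history for a distance k and offset a_0 that hold at every dynamic instance of one instruction. The function name `find_global_stride`, the sample history, and the exhaustive search are assumptions for illustration; they are not how a hardware predictor would detect the relation.

```python
def find_global_stride(history, positions, max_k):
    """Return (k, a0) such that history[n] == history[n - k] + a0 at every
    usable position n, or None if no distance up to max_k works."""
    for k in range(1, max_k + 1):
        usable = [n for n in positions if n - k >= 0]
        if len(usable) < 2:
            continue                      # need at least two instances to confirm
        diffs = {history[n] - history[n - k] for n in usable}
        if len(diffs) == 1:               # the same offset at every instance
            return k, diffs.pop()
    return None


# Hypothetical global history; the instruction of interest sits at indices 2, 5, 8.
history = [1, 40, 5, 8, 41, 12, 3, 42, 7]
positions = [2, 5, 8]
print(find_global_stride(history, positions, max_k=4))   # (2, 4)
```

The instruction's local history here is (5, 12, 7), which has no constant stride, yet each value equals the value two positions earlier in the global history plus 4.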
9 Global Stride Locality In A Program
Global stride locality in the code:
- Define (e.g., load ra, rb, rc): the load value is hard to predict.
- Explicit use (e.g., add rx, ra, #constant): the destination of the add can be predicted using the result of the define instruction (ra).
- Explicit use (e.g., sub rx, ra, rd): the destination of the sub can be predicted if rd has strong repeating patterns.
- Implicit use (e.g., load rx, ry, rz): implicit use through memory (e.g., spilling and filling; reloading).
10 Global Stride Locality In A Program (cont.)
Global stride locality embedded in the data structure:
    struct string_list {
        struct string_list *next;
        char *string;
    };
Near-constant stride between the two addresses (*next, *string) if the two fields (next, string) are allocated and referenced in the same order. [Serrano & Wu]
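The allocation-order argument can be illustrated with a toy bump allocator. Everything in this sketch is an assumption: the sizes, the base address, the fixed-length string buffers, and the helper names `allocate_list` and `field_values`. Real allocators and variable-length strings would make the stride only approximately constant, as the slide says.

```python
NODE_SIZE = 16     # assumed size of struct string_list (two 8-byte pointers)
STRING_SIZE = 32   # assumed fixed-size string buffer, for illustration only


def allocate_list(n, base=0x1000):
    """Bump-allocate n (node, string) pairs in order; return their addresses."""
    addr = base
    pairs = []
    for _ in range(n):
        node_addr = addr
        addr += NODE_SIZE
        string_addr = addr
        addr += STRING_SIZE
        pairs.append((node_addr, string_addr))
    return pairs


def field_values(pairs):
    """Values loaded from the next and string fields of every node but the last."""
    return [(pairs[i + 1][0], pairs[i][1]) for i in range(len(pairs) - 1)]


pairs = allocate_list(4)
diffs = [next_ptr - string_ptr for next_ptr, string_ptr in field_values(pairs)]
print(diffs)   # [32, 32, 32]
```

Because each node's string is allocated right before the following node, the value loaded from `next` is always a fixed offset from the value loaded from `string` in this toy layout.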
11 Outline
- Introduction of global stride locality
- GDiff predictor
- Value delay
- Speculative value queue
- Hybrid value queue
- Potential uses of the gdiff predictor
- Summary
12 The GDiff Value Predictor
Two major structures:
- Prediction table: indexed by PC; holds the stored differences and the distance.
- Global value queue (GVQ): holds the results of completing instructions.
13 Example
A code example:
    a: load r1, r2, #20
    b:
    c:
    d: add r3, r1, #4
    loop
Instruction a produces (1, 8, 3, 2, ...).
Instruction d generates (5, 12, 7, 6, ...).
14 How GDiff Works
[Figure: three-step walkthrough of the example. Step 1: the differences between d's result and the values in the global value queue are calculated and stored (update). Step 2: on the next update, a calculated difference (4) matches the stored difference; distance = 2. Step 3: the prediction is 3 + 4 = 7.]
Instruction a produces (1, 8, 3, 2, ...).
Instruction d generates (5, 12, 7, 6, ...).
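15 The GDiff Value Predictor
[Figure: predictor datapath. The PC indexes the prediction table (diff1, diff2, ..., diffn, and distance). On prediction, the distance selects a global value queue entry through a mux and the selected stored difference is added to it. On update, the completing instruction's result is compared (=?) against the stored differences to detect a match, and the result enters the global value queue.]
$x_N = x_{N-k} + a_0$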
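The following is a minimal software sketch of the structure described above, assuming a PC-keyed dictionary in place of the prediction table and a most-recent-first queue so that distance is a 0-based index. The class name `GDiff`, the policy of clearing the distance when no difference matches, and the values used for instructions b and c are assumptions; only the sequences for a (1, 8, 3, 2) and d (5, 12, 7, 6) come from the slides.

```python
from collections import deque


class GDiff:
    def __init__(self, queue_size=8):
        self.gvq = deque(maxlen=queue_size)   # global value queue, newest at index 0
        self.table = {}                        # pc -> (stored diffs, distance or None)

    def predict(self, pc):
        diffs, distance = self.table.get(pc, ([], None))
        if distance is None or distance >= len(self.gvq):
            return None                        # no correlation learned yet
        return self.gvq[distance] + diffs[distance]

    def update(self, pc, value):
        new_diffs = [value - v for v in self.gvq]
        old_diffs, _ = self.table.get(pc, ([], None))
        # The first position where the new difference equals the stored one.
        distance = next(
            (i for i, (new, old) in enumerate(zip(new_diffs, old_diffs)) if new == old),
            None,
        )
        self.table[pc] = (new_diffs, distance)
        self.gvq.appendleft(value)


a_vals, d_vals = [1, 8, 3, 2], [5, 12, 7, 6]
b_vals, c_vals = [100, 205, 317, 90], [7, 50, 21, 64]   # hypothetical filler values

predictor = GDiff(queue_size=4)
d_predictions = []
for a, b, c, d in zip(a_vals, b_vals, c_vals, d_vals):
    for pc, value in (("a", a), ("b", b), ("c", c)):
        predictor.update(pc, value)
    d_predictions.append(predictor.predict("d"))
    predictor.update("d", d)

print(d_predictions)   # [None, None, 7, 6]
```

After two updates of d, the difference 4 at index 2 (instruction a, behind c and b) repeats, so the third prediction is 3 + 4 = 7, matching the walkthrough on the previous slide.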
16 Results: Exploiting Global Stride Value Locality
[Figure: value prediction accuracy (y-axis 30% to 90%) of stride, DFCM, and gdiff (queue size = 8) for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, and the average]
Based on trace simulation (sim-profile), predicting all value-producing instructions.
Stride, gdiff, and DFCM L1: 8K-entry PC-indexed tagless table; DFCM L2: 64K-entry table.
18 Value Delay
When the prediction of an instruction is being made, the correlated instruction has not finished...
A. lw  $v0[2], 0($v1[3])     // the correlated load
B. sw  $v0[2], 0($s8[30])
C. lw  $v0[2], 0($s8[30])    // the instruction to be predicted
D. bne $v0[2], $zero[0],
[Timeline: A dispatch, A issue, A execute, A complete, with C dispatch marked along the time axis. A's value is needed at C dispatch; A's value is available at A complete.]
19 Value Delay Impact
Model value delay as T values: a prediction cannot use the results of the previous T instructions.
[Figure: prediction accuracy (y-axis 30% to 90%) for different value delays (queue size = 8), with T = 0, 2, 4, 8, 16, for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, and the average]
Trace simulation using sim-profile.
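The delay model can be sketched as a trace-driven loop in which the newest T results are hidden from both prediction and update. This is an illustrative approximation, not the simulator used for the results: the function name `gdiff_accuracy`, the trace format, the sample values, and the choice to report accuracy only over predictions actually made are all assumptions.

```python
def gdiff_accuracy(trace, queue_size, delay):
    """Fraction of correct gdiff-style predictions when the newest `delay`
    results are not yet visible to the predictor."""
    values = []    # all results so far, in program order
    table = {}     # pc -> (stored diffs, distance or None)
    correct = total = 0
    for pc, value in trace:
        visible = values[:max(0, len(values) - delay)]
        window = visible[::-1][:queue_size]          # newest visible value first
        diffs, distance = table.get(pc, ([], None))
        if distance is not None and distance < len(window):
            total += 1
            correct += (window[distance] + diffs[distance] == value)
        new_diffs = [value - v for v in window]
        match = next(
            (i for i, (n, o) in enumerate(zip(new_diffs, diffs)) if n == o), None
        )
        table[pc] = (new_diffs, match)
        values.append(value)
    return correct / total if total else 0.0


# Hypothetical trace: "d" always produces the value of the preceding "a" plus 4.
a_vals = [1, 8, 3, 2, 11, 6, 20, 9]
trace = [item for a in a_vals for item in (("a", a), ("d", a + 4))]
print(gdiff_accuracy(trace, queue_size=4, delay=0))
print(gdiff_accuracy(trace, queue_size=4, delay=1))
```

With delay = 0 every prediction made for d is correct; with delay = 1 the value produced by a is hidden when d is predicted, so the correlation cannot be used and accuracy drops.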
21 Reducing the Value Delay
[Figure: execution pipeline (Fetch, Dispatch, Issue, RegRead, Execution, Write Back, Retire) alongside the GDiff predictor (PC, GDiff prediction table, global value queue, prediction); the global value queue is updated out of order.]
Gdiff with speculative value queue (SGVQ)
22 Results of SGVQ
[Figure: prediction accuracy and coverage (y-axis 30% to 100%) of the gdiff predictor with SGVQ and the local stride predictor; series: gdiff accuracy, L_stride accuracy, gdiff coverage, L_stride coverage; benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, and the average]
Worse than the simple local stride predictor. Why?
23 Problems with Speculative GDiff
Although the value delay is reduced, the results arrive out of order.
Completion order:
    a: add r3, r4, r5
    b: load r5, r6, 28
    c: sub r2, r3, 4
    (cache miss)
    b: load r5, r6, 28
    a: add r3, r4, r5
    c: sub r2, r3, 4
    (cache hit)
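A short sketch shows why out-of-order updates hurt: the position of a correlated result in the queue depends on the completion order, so a learned distance stops being stable. The helper `distance_to` and the most-recent-first, 0-based indexing are assumptions for illustration; the two completion orders are the ones listed on the slide.

```python
def distance_to(producer, consumer, completion_order):
    """Index of the producer's result in a most-recent-first queue at the
    moment the consumer completes."""
    seen = []
    for name in completion_order:
        if name == consumer:
            return seen[::-1].index(producer)
        seen.append(name)
    raise ValueError("consumer not found in completion order")


# c (sub r2, r3, 4) uses r3, which a (add r3, r4, r5) produces.
print(distance_to("a", "c", ["a", "b", "c"]))   # 1
print(distance_to("a", "c", ["b", "a", "c"]))   # 0
```

The same static pair (a, c) is seen at distance 1 in one completion order and at distance 0 in the other, so a distance learned under one cache outcome mispredicts under the other.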
25 The Hybrid GDiff Value Predictor
[Figure: execution pipeline (Fetch, Dispatch, Issue, RegRead, Execution, Write Back, Retire) alongside the GDiff predictor (PC, GDiff prediction table, global value queue, prediction; update is out of order) and a PC-indexed local stride predictor that supplies a local stride prediction.]
Key: construct the global value history in order (fetch/dispatch order); combine with a different value predictor.
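One way to read the key idea is sketched below: each instruction reserves a queue slot in dispatch order, the slot initially holds a value supplied by another predictor, and the actual result overwrites it on completion. The slides do not spell out this mechanism, so the class `InOrderValueQueue`, its methods, the unbounded list, and the sample values are all assumptions made for illustration.

```python
class InOrderValueQueue:
    """Global value history kept in dispatch order, not completion order."""

    def __init__(self):
        self.slots = []   # one entry per dispatched instruction (unbounded here)

    def dispatch(self, speculative_value):
        """Reserve the next slot, filled with another predictor's guess."""
        self.slots.append(speculative_value)
        return len(self.slots) - 1

    def complete(self, slot, actual_value):
        """Overwrite the guess with the real result, whenever it arrives."""
        self.slots[slot] = actual_value

    def recent(self, n):
        """The n most recently dispatched values, newest first."""
        return self.slots[::-1][:n]


queue = InOrderValueQueue()
slot_a = queue.dispatch(10)   # hypothetical local stride prediction for a
slot_b = queue.dispatch(20)   # hypothetical local stride prediction for b
queue.complete(slot_b, 21)    # b completes first (e.g., a cache hit)
queue.complete(slot_a, 10)    # a completes later
print(queue.recent(2))        # [21, 10]
```

The order of values in the queue is fixed at dispatch, so the distance between correlated instructions does not change with completion order.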
26 Results of Hybrid GDiff
[Figure: prediction accuracy and coverage (y-axis 30% to 100%) of gdiff and local predictors; series: gdiff accuracy, L_stride accuracy, L_context accuracy, gdiff coverage, L_stride coverage, L_context coverage; benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, and the average]
(Local context uses DFCM.)
27 Potential Uses of GDiff
Predicting load addresses
- Form the global value history only with load addresses.
- Predict the next address based on previous addresses.
Predicting addresses of missing loads only
- Form the global value history only with addresses of missing loads.
- Predict the next address based on previous addresses.
Predicting values to break true data dependence chains
And ...
28 Predicting Addresses Using Hybrid GDiff
[Figure: predictability of the load address stream (y-axis 0% to 100%); series: ls_accu, gs_accu, markov_accu, ls_cov, gs_cov, markov_cov; benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, and the average]
The 4-way, 256K-entry Markov address predictor uses the tag for confidence.
29 Predicting Addresses of Missing Loads
[Figure: predictability of addresses of missing loads (y-axis 0% to 100%); series: ls_m_accu, gs_m_accu, markov_m_accu, ls_m_cov, gs_m_cov, markov_m_cov; benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, and the average]
30 Using GDiff to Break Data Dependencies
[Figure: speedups of gdiff and local value prediction (y-axis 0% to 60%); series: local_stride, local_context, gdiff (HGVQ); benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, and the harmonic mean (H_mean)]
4/64 issue out-of-order model.
31 Summary
- Global stride locality
- The gdiff value predictor
- Value delay issue
- Reduce the value delay
- In-order global value history
- Potential uses:
  - What to put into the global value queue (i.e., global value history)
  - How to utilize the prediction
32 Thank you and Questions?
More informationDual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window
Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationPMPM: Prediction by Combining Multiple Partial Matches
1 PMPM: Prediction by Combining Multiple Partial Matches Hongliang Gao Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {hgao, zhou}@cs.ucf.edu Abstract
More informationSpeculative Multithreaded Processors
Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads
More informationBloom Filtering Cache Misses for Accurate Data Speculation and Prefetching
Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering
More informationRegister Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure
Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York
More informationShengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota
Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)
More informationAn Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors
An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group
More informationPMPM: Prediction by Combining Multiple Partial Matches
Journal of Instruction-Level Parallelism 9 (2007) 1-18 Submitted 04/09/07; published 04/30/07 PMPM: Prediction by Combining Multiple Partial Matches Hongliang Gao Huiyang Zhou School of Electrical Engineering
More informationSpeculative Multithreaded Processors
Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads
More informationTHE EFFECTIVENESS OF GLOBAL DIFFERENCE VALUE PREDICTION AND MEMORY BUS PRIORITY SCHEMES FOR SPECULATIVE PREFETCH UGUR GUNAL
Abstract GUNAL, UGUR. The Effectiveness of Global Difference Value Prediction And Memory Bus Priority Schemes for Speculative Prefetch. (Under the direction of Dr. Thomas M. Conte.) Processor clock speeds
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationLow-Complexity Reorder Buffer Architecture*
Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower
More informationDynamic Code Value Specialization Using the Trace Cache Fill Unit
Dynamic Code Value Specialization Using the Trace Cache Fill Unit Weifeng Zhang Steve Checkoway Brad Calder Dean M. Tullsen Department of Computer Science and Engineering University of California, San
More informationDesign of Experiments - Terminology
Design of Experiments - Terminology Response variable Measured output value E.g. total execution time Factors Input variables that can be changed E.g. cache size, clock rate, bytes transmitted Levels Specific
More informationPOSH: A TLS Compiler that Exploits Program Structure
POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign
More information15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical
More informationMultithreaded Value Prediction
Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationDual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science, University of Central Florida zhou@cs.ucf.edu Abstract Current integration trends
More informationFocalizing Dynamic Value Prediction to CPU s Context
Lucian Vintan, Adrian Florea, Arpad Gellert, Focalising Dynamic Value Prediction to CPUs Context, IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4 (July), p. 457-536, ISSN: 1350-2387,
More informationLocality-Based Information Redundancy for Processor Reliability
Locality-Based Information Redundancy for Processor Reliability Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov, zhou}@cs.ucf.edu
More informationCombining Local and Global History for High Performance Data Prefetching
Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu
More informationMesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors
: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors Marsha Eng, Hong Wang, Perry Wang Alex Ramirez, Jim Fung, and John Shen Overview Applications of for Itanium Improving fetch bandwidth
More informationA Scheme of Predictor Based Stream Buffers. Bill Hodges, Guoqiang Pan, Lixin Su
A Scheme of Predictor Based Stream Buffers Bill Hodges, Guoqiang Pan, Lixin Su Outline Background and motivation Project hypothesis Our scheme of predictor-based stream buffer Predictors Predictor table
More informationPrecise Instruction Scheduling
Journal of Instruction-Level Parallelism 7 (2005) 1-29 Submitted 10/2004; published 04/2005 Precise Instruction Scheduling Gokhan Memik Department of Electrical and Computer Engineering Northwestern University
More informationKoji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency
Lock and Unlock: A Data Management Algorithm for A Security-Aware Cache Department of Informatics, Japan Science and Technology Agency ICECS'06 1 Background (1/2) Trusted Program Malicious Program Branch
More informationChip-Multithreading Systems Need A New Operating Systems Scheduler
Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems
More informationTradeoff between coverage of a Markov prefetcher and memory bandwidth usage
Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end
More informationBumyong Choi Leo Porter Dean Tullsen University of California, San Diego ASPLOS
Bumyong Choi Leo Porter Dean Tullsen University of California, San Diego ASPLOS 2008 1 Multi-core proposals take advantage of additional cores to improve performance, power, and/or reliability Moving towards
More informationFuture Execution: A Hardware Prefetching Technique for Chip Multiprocessors
Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Ilya Ganusov and Martin Burtscher Computer Systems Laboratory Cornell University {ilya, burtscher}@csl.cornell.edu Abstract This
More informationSoftware-Only Value Speculation Scheduling
Software-Only Value Speculation Scheduling Chao-ying Fu Matthew D. Jennings Sergei Y. Larin Thomas M. Conte Abstract Department of Electrical and Computer Engineering North Carolina State University Raleigh,
More informationWish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution
Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department
More informationUnderstanding Scheduling Replay Schemes
Appears in Proceedings of the 0th International Symposium on High Performance Computer Architecture (HPCA-0), February 2004. Understanding Scheduling Replay Schemes Ilhyun Kim and Mikko H. Lipasti Department
More informationExecution-based Prediction Using Speculative Slices
Appearing in the 28th Annual International Symposium on Computer Architecture (ISCA 2001), July, 2001. Execution-based Prediction Using Speculative Slices Craig Zilles and Gurindar Sohi Computer Sciences
More informationFused Two-Level Branch Prediction with Ahead Calculation
Journal of Instruction-Level Parallelism 9 (2007) 1-19 Submitted 4/07; published 5/07 Fused Two-Level Branch Prediction with Ahead Calculation Yasuo Ishii 3rd engineering department, Computers Division,
More informationA Hybrid Adaptive Feedback Based Prefetcher
A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,
More informationAbstract GUPTA, NIKHIL. Slipstream-Based Steering for Clustered Microarchitectures (Under the direction of Dr. Eric Rotenberg)
Abstract GUPTA, NIKHIL Slipstream-Based Steering for Clustered Microarchitectures (Under the direction of Dr. Eric Rotenberg) To harvest increasing levels of ILP while maintaining a fast clock, clustered
More informationI R I S A P U B L I C A T I O N I N T E R N E N o REVISITING THE PERCEPTRON PREDICTOR ANDRÉ SEZNEC ISSN
I R I P U B L I C A T I O N I N T E R N E 1620 N o S INSTITUT DE RECHERCHE EN INFORMATIQUE ET SYSTÈMES ALÉATOIRES A REVISITING THE PERCEPTRON PREDICTOR ANDRÉ SEZNEC ISSN 1166-8687 I R I S A CAMPUS UNIVERSITAIRE
More informationExploring Wakeup-Free Instruction Scheduling
Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University Outline Motivation Case study: Cyclone Towards high-performance
More informationWidth-Partitioned Load Value Predictors
Journal of Instruction-Level Parallelism 5(2003) 1-23 Submitted 9/03; published 11/03 Width-Partitioned Load Value Predictors Gabriel H. Loh College of Computing Georgia Institute of Technology Atlanta,
More informationDynamically Controlled Resource Allocation in SMT Processors
Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona
More informationImplicitly-Multithreaded Processors
Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract
More informationData Access History Cache and Associated Data Prefetching Mechanisms
Data Access History Cache and Associated Data Prefetching Mechanisms Yong Chen 1 chenyon1@iit.edu Surendra Byna 1 sbyna@iit.edu Xian-He Sun 1, 2 sun@iit.edu 1 Department of Computer Science, Illinois Institute
More informationBreaking Cyclic-Multithreading Parallelization with XML Parsing. Simone Campanoni, Svilen Kanev, Kevin Brownell Gu-Yeon Wei, David Brooks
Breaking Cyclic-Multithreading Parallelization with XML Parsing Simone Campanoni, Svilen Kanev, Kevin Brownell Gu-Yeon Wei, David Brooks 0 / 21 Scope Today s commodity platforms include multiple cores
More informationECE404 Term Project Sentinel Thread
ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache
More informationJosé F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2
CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationExploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)
Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J Lilja lilja@eceumnedu Acknowledgements! Graduate students
More informationHigh Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas
Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths yesoon Kim José. Joao Onur Mutlu Yale N. Patt igh Performance Systems Group
More informationDynamic Memory Scheduling
Dynamic : Computer Architecture Dynamic Scheduling Overview Lecture Overview The Problem Conventional Solutions In-order Load/Store Scheduling Blind Dependence Scheduling Conservative Dataflow Scheduling
More information{ssnavada, nkchoudh,
Criticality-driven Superscalar Design Space Exploration Sandeep Navada, Niket K. Choudhary, Eric Rotenberg Department of Electrical and Computer Engineering North Carolina State University {ssnavada, nkchoudh,
More informationSpatial Memory Streaming (with rotated patterns)
Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;
More informationA Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST
A Cost Effective Spatial Redundancy with Data-Path Partitioning Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST 1 Outline Introduction Data-path Partitioning for a dependable
More informationDecoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor
Decoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor Hari Kannan, Michael Dalton, Christos Kozyrakis Computer Systems Laboratory Stanford University Motivation Dynamic analysis help
More informationAutomatic Selection of Compiler Options Using Non-parametric Inferential Statistics
Automatic Selection of Compiler Options Using Non-parametric Inferential Statistics Masayo Haneda Peter M.W. Knijnenburg Harry A.G. Wijshoff LIACS, Leiden University Motivation An optimal compiler optimization
More informationThreshold-Based Markov Prefetchers
Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this
More informationMemory Prefetching for the GreenDroid Microprocessor. David Curran December 10, 2012
Memory Prefetching for the GreenDroid Microprocessor David Curran December 10, 2012 Outline Memory Prefetchers Overview GreenDroid Overview Problem Description Design Placement Prediction Logic Simulation
More informationEfficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness
Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma
More informationReducing Latencies of Pipelined Cache Accesses Through Set Prediction
Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Aneesh Aggarwal Electrical and Computer Engineering Binghamton University Binghamton, NY 1392 aneesh@binghamton.edu Abstract With the
More informationOutline. Speculative Register Promotion Using Advanced Load Address Table (ALAT) Motivation. Motivation Example. Motivation
Speculative Register Promotion Using Advanced Load Address Table (ALAT Jin Lin, Tong Chen, Wei-Chung Hsu, Pen-Chung Yew http://www.cs.umn.edu/agassiz Motivation Outline Scheme of speculative register promotion
More informationHierarchical Scheduling Windows
Hierarchical Scheduling Windows Edward Brekelbaum, Jeff Rupley II, Chris Wilkerson, Bryan Black Microprocessor Research, Intel Labs (formerly known as MRL) {edward.brekelbaum, jeffrey.p.rupley.ii, chris.wilkerson,
More informationThe Predictability of Computations that Produce Unpredictable Outcomes
This is an update of the paper that appears in the Proceedings of the 5th Workshop on Multithreaded Execution, Architecture, and Compilation, pages 23-34, Austin TX, December, 2001. It includes minor text
More informationAries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX
Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554
More informationPerformance Oriented Prefetching Enhancements Using Commit Stalls
Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of
More informationMany Cores, One Thread: Dean Tullsen University of California, San Diego
Many Cores, One Thread: The Search for Nontraditional Parallelism University of California, San Diego There are some domains that feature nearly unlimited parallelism. Others, not so much Moore s Law and
More informationCost-Effective, High-Performance Giga-Scale Checkpoint/Restore
Cost-Effective, High-Performance Giga-Scale Checkpoint/Restore Andreas Moshovos Electrical and Computer Engineering Department University of Toronto moshovos@eecg.toronto.edu Alexandros Kostopoulos Physics
More informationRegister Integration: A Simple and Efficient Implementation of Squash Reuse
Appears in the Proceedings of the 33rd Annual International Symposium on Microarchitecture (MICRO-33), Dec. 10-13, 2000. Register Integration: A Simple and Efficient Implementation of Squash Reuse Amir
More informationExploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors
Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors Behnam Robatmili Sibi Govindan Doug Burger Stephen W. Keckler beroy@cs.utexas.edu sibi@cs.utexas.edu dburger@microsoft.com skeckler@nvidia.com
More informationPunctual Coalescing. Fernando Magno Quintão Pereira
Punctual Coalescing Fernando Magno Quintão Pereira Register Coalescing Register coalescing is an op7miza7on on top of register alloca7on. The objec7ve is to map both variables used in a copy instruc7on
More informationImplicitly-Multithreaded Processors
Appears in the Proceedings of the 30 th Annual International Symposium on Computer Architecture (ISCA) Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University
More informationAbstract RAMRAKHYANI, PRAKASH SHYAMLAL
Abstract RAMRAKHYANI, PRAKASH SHYAMLAL Dynamic Pipeline Scaling (Under the direction of Dr. Eric Rotenberg) The classic problem of balancing power and performance continues to exist, as technology progresses.
More informationSimulation Of Computer Systems. Prof. S. Shakya
Simulation Of Computer Systems Prof. S. Shakya Purpose & Overview Computer systems are composed from timescales flip (10-11 sec) to time a human interacts (seconds) It is a multi level system Different
More informationDCache Warn: an I-Fetch Policy to Increase SMT Efficiency
DCache Warn: an I-Fetch Policy to Increase SMT Efficiency Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona 1-3,
More informationReverse State Reconstruction for Sampled Microarchitectural Simulation
Reverse State Reconstruction for Sampled Microarchitectural Simulation Paul D. Bryan Michael C. Rosier Thomas M. Conte Center for Efficient, Secure and Reliable Computing (CESR) North Carolina State University,
More informationCluster Assignment Strategies for a Clustered Trace Cache Processor
Cluster Assignment Strategies for a Clustered Trace Cache Processor Ravi Bhargava and Lizy K. John Technical Report TR-033103-01 Laboratory for Computer Architecture The University of Texas at Austin Austin,
More informationUsing Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation
Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Houman Homayoun + houman@houman-homayoun.com ABSTRACT We study lazy instructions. We define lazy instructions as those spending
More informationCopyright by Mary Douglass Brown 2005
Copyright by Mary Douglass Brown 2005 The Dissertation Committee for Mary Douglass Brown certifies that this is the approved version of the following dissertation: Reducing Critical Path Execution Time
EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table
Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.
Analysis of the TRIPS Prototype Block Predictor
Appears in the 9 International Symposium on Performance Analysis of Systems and Software Analysis of the TRIPS Prototype Block Predictor Nitya Ranganathan Doug Burger Stephen W. Keckler Department of Computer
2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set
2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set Hyesoon Kim M. Aater Suleman Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering University of Texas
ABSTRACT. instructions, called the dynamic instruction window, to exploit instruction-level
ABSTRACT GANDHI, JAYNEEL. FabFetch: A Synthesizable RTL Model of a Pipelined Instruction Fetch Unit for Superscalar Processors. (Under the direction of Eric Rotenberg). High-performance superscalar processors
Exploiting Partial Operand Knowledge
Exploiting Partial Operand Knowledge Brian R. Mestan IBM Microelectronics IBM Corporation - Austin, TX bmestan@us.ibm.com Mikko H. Lipasti Department of Electrical and Computer Engineering University of
Characterization of Repeating Data Access Patterns in Integer Benchmarks
Characterization of Repeating Data Access Patterns in Integer Benchmarks Erik M. Nystrom Roy Dz-ching Ju Wen-mei W. Hwu enystrom@uiuc.edu roy.ju@intel.com w-hwu@uiuc.edu Abstract Processor speeds continue
Looking for limits in branch prediction with the GTL predictor
Looking for limits in branch prediction with the GTL predictor André Seznec IRISA/INRIA/HIPEAC Abstract The geometric history length predictors, GEHL [7] and TAGE [8], are among the most storage effective
Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB
Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Wann-Yun Shieh * and Hsin-Dar Chen Department of Computer Science and Information Engineering Chang Gung University, Taiwan
Eric Rotenberg Jim Smith North Carolina State University Department of Electrical and Computer Engineering
Control Independence in Trace Processors Jim Smith North Carolina State University Department of Electrical and Computer Engineering ericro@ece.ncsu.edu Introduction High instruction-level parallelism
Fall 2011 Prof. Hyesoon Kim
Fall 2011 Prof. Hyesoon Kim 1 1.. 1 0 2bc 2bc BHR XOR index 2bc 0x809000 PC 2bc McFarling 93 Predictor size: 2^(history length)*2bit predict_func(pc, actual_dir) { index = pc xor BHR taken = 2bit_counters[index]
Wide Instruction Fetch
Wide Instruction Fetch Fall 2007 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs470 block_ids Trace Table pre-collapse trace_id History Br. Hash hist. Rename Fill Table
Reducing Reorder Buffer Complexity Through Selective Operand Caching
Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003 Reducing Reorder Buffer Complexity Through Selective Operand Caching Gurhan Kucuk Dmitry Ponomarev
The Limits of Speculative Trace Reuse on Deeply Pipelined Processors
The Limits of Speculative Trace Reuse on Deeply Pipelined Processors Maurício L. Pilla, Philippe O. A. Navaux Computer Science Institute UFRGS, Brazil {pilla,navaux}@inf.ufrgs.br Amarildo T. da Costa IME,
Superscalar Organization
Superscalar Organization Nima Honarmand Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average
Eric Rotenberg Karthik Sundaramoorthy, Zach Purser
Karthik Sundaramoorthy, Zach Purser Dept. of Electrical and Computer Engineering North Carolina State University http://www.tinker.ncsu.edu/ericro ericro@ece.ncsu.edu Many means to an end Program is merely
The Predictability of Computations that Produce Unpredictable Outcomes
The Predictability of Computations that Produce Unpredictable Outcomes Tor Aamodt Andreas Moshovos Paul Chow Department of Electrical and Computer Engineering University of Toronto {aamodt,moshovos,pc}@eecg.toronto.edu
Cache Optimization by Fully-Replacement Policy
American Journal of Embedded Systems and Applications 2016; 4(1): 7-14 http://www.sciencepublishinggroup.com/j/ajesa doi: 10.11648/j.ajesa.20160401.12 ISSN: 2376-6069 (Print); ISSN: 2376-6085 (Online)
Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic
Fall 2011 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Reading: Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000) If memory
15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 16: Prefetching Wrap-up Prof. Onur Mutlu Carnegie Mellon University Announcements Exam solutions online Pick up your exams Feedback forms 2 Feedback Survey Results
Optimizations Enabled by a Decoupled Front-End Architecture
Optimizations Enabled by a Decoupled Front-End Architecture Glenn Reinman† Brad Calder† Todd Austin‡ †Department of Computer Science and Engineering, University of California, San Diego ‡Electrical
Understanding The Effects of Wrong-path Memory References on Processor Performance
Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend