Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)


1 Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J. Lilja, lilja@ece.umn.edu

2 Acknowledgements
- Graduate students (who did the real work): Ying Chen, Resit Sendag, Joshua Yi
- Faculty collaborator: Douglas Hawkins (School of Statistics)
- Funders: National Science Foundation, IBM, HP/Compaq, Minnesota Supercomputing Institute

3 Problem #1
- Speculative execution is becoming more popular
  - Branch prediction
  - Value prediction
  - Speculative multithreading
- Potentially higher performance
- What about the impact on the memory system?
  - Pollute the cache/memory hierarchy?
  - Lead to more misses?

4 Problem #2
- Computer architecture research relies on simulation
- Simulation is slow
  - Years to simulate the SPEC CPU2000 benchmarks
- Simulation can be wildly inaccurate
  - Did I really mean to build that system?
- Results are difficult to reproduce
- Need statistical rigor

5 Outline (Part 1)
- The Superthreaded Architecture
- The Wrong Execution Cache (WEC)
- Experimental Methodology
- Performance of the WEC [Chen, Sendag, Lilja, IPDPS, 2003]

6 Hard-to-Parallelize Applications
- Early exit loops
- Pointers and aliases
- Complex branching behaviors
- Small basic blocks
- Small loop counts
Hard to parallelize with conventional techniques

7 Introduce "Maybe" Dependences
- Data dependence? Pointer aliasing?
  - Yes
  - No
  - Maybe
- Maybe allows aggressive compiler optimizations
  - When in doubt, parallelize
- Run-time check to correct wrong assumptions

8 Thread Pipelining Execution Model
Each thread unit executes four pipelined stages:
- CONTINUATION: values needed to fork the next thread
- TARGET STORE: forward addresses of maybe dependences
- COMPUTATION: forward addresses and computed data as needed
- WRITE-BACK
Successive threads i, i+1, and i+2 overlap: each thread forks its successor after its CONTINUATION stage and syncs with it as addresses and data are forwarded.

9 The Superthreaded Architecture
[Figure: multiple thread units, each a superscalar core with its own registers, PC, execution unit, and communication/dependence buffer; the units share an instruction cache and a data cache, with the dependence buffers linked by a communication ring]

10 Wrong Path Execution Within a Superscalar Core
[Figure: loads Ld A and Ld B execute speculatively along the predicted path; when the prediction result is wrong, execution resumes on the correct path (CP), while loads Ld C, Ld D, and Ld E issued down the wrong path (WP) constitute wrong path execution; some wrong-path loads are never ready to be executed]

11 Wrong Thread Execution
[Figure: sequential regions alternate with parallel regions of thread units TU0-TU3; on an ABORT, the successor threads are marked as wrong threads (WTH); when the next parallel region begins, all wrong threads from the previous parallel region are killed, and a wrong thread can also kill itself]

12 How Could Wrong Thread Execution Help Improve Performance?

for (i = 0; i < 10; i++) {
    for (j = 0; j < i; j++) {
        x = y[j];
    }
}

With the inner loop parallelized across four thread units (TU1-TU4):
- When i = 4, the correct iterations reference y[0]..y[3], while the wrong threads run ahead and touch y[4], y[5]
- When i = 5, the iterations reference y[0]..y[4], data the wrong threads have already brought close to the processor, while the new wrong threads touch y[5], y[6]

13 Operation of the WEC
- Wrong execution (wrong thread or wrong path), L1 data cache miss:
  - WEC hit: update the LRU info for the WEC
  - WEC miss: bring the block from the next level of memory into the WEC
- Correct execution, L1 data cache miss:
  - WEC hit: swap the victim block and the WEC block
  - WEC miss: bring the block from the next level of memory into the L1 data cache, put the victim block into the WEC, and prefetch the next line into the WEC
- L1 data cache hit: update the LRU info for the L1 data cache
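The decision flow above can be sketched as a small procedure. This is a minimal Python model, not the hardware: it assumes unit-sized blocks, ignores LRU bookkeeping and the next-line prefetch, and `l1`, `wec`, and `next_level_fetch` are hypothetical stand-ins for the real structures.

```python
def handle_load(addr, wrong_execution, l1, wec, next_level_fetch):
    """Simplified WEC decision flow: wrong-execution misses fill the WEC,
    never the L1 data cache, so mis-speculated loads cannot pollute it."""
    if addr in l1:                          # L1 data cache hit
        return "l1_hit"
    if wrong_execution:                     # wrong-path or wrong-thread load
        if addr in wec:
            return "wec_hit"                # (real WEC would update LRU info)
        wec.add(next_level_fetch(addr))     # bring the block into the WEC only
        return "wec_fill"
    if addr in wec:                         # correct execution, WEC hit:
        wec.discard(addr)                   # swap the block into the L1
        l1.add(addr)
        return "wec_swap"
    l1.add(next_level_fetch(addr))          # normal fill; a real WEC would also
    return "l1_fill"                        # take the L1 victim and prefetch
```

The point of the structure is visible in the two branches: a wrong-execution fill later turns into a cheap `wec_swap` when correct execution touches the same block.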

14 Processor Configurations for Simulations
SIMCA (the SIMulator for the superthreaded Architecture). Simulated configurations:
- orig: baseline
- vc: baseline + victim cache
- wth-wp-vc: wrong thread + wrong path + victim cache
- wth-wp-wec: wrong thread + wrong path + wrong execution cache, with prefetch into the WEC
- nlp: baseline + next-line prefetch

15 Parameters for Each Thread Unit
- Issue rate: 8 instructions/cycle per thread unit
- Branch target buffer: 4-way associative, 1024 entries
- Speculative memory buffer: fully associative, 128 entries
- Round-trip memory latency: 200 cycles
- Fork delay: 4 cycles
- Unidirectional communication ring: 2 requests/cycle bandwidth
- Load/store queue: 64 entries
- Reorder buffer: 64 entries
- INT ALUs, INT multiply/divide units: 8, 4
- FP adders, FP multiply/divide units: 4, 4
- WEC: 8 entries (same block size as the L1 cache), fully associative
- L1 data cache: distributed, 8 KB, 2-way associative, 64-byte blocks
- L1 instruction caches: distributed, 32 KB, 2-way associative, 64-byte blocks
- L2 cache: unified, 512 KB, 4-way associative, 128-byte blocks

16 Characteristics of the Parallelized SPEC 2000 Benchmarks
- 175.vpr (INT, SPEC test input): 86% parallelized
- 164.gzip (INT, MinneSPEC large): 57% parallelized
- 181.mcf (INT, MinneSPEC large): 36% parallelized
- 197.parser (INT, MinneSPEC medium): 7% parallelized
- 183.equake (FP, MinneSPEC large): 3% parallelized
- 177.mesa (FP, SPEC test): 73% parallelized
Transformations applied per benchmark: loop coalescing, loop unrolling, and statement reordering to increase overlap

17 Performance of the Superthreaded Architecture for the Parallelized Portions of the Benchmarks
[Chart and table: speedup vs. number of thread units (1, 2, 4, 8, 16) for 175.vpr, 164.gzip, 181.mcf, 183.equake, 197.parser, and 177.mesa, plus the average; the table lists the issue rate, reorder buffer size, INT/FP function units, and L1 data cache size (KB) of the baseline configuration at each TU count]

18 Performance of the wth-wp-wec Configuration on Top of the Parallel Execution
[Chart: relative speedup (%) of the 2TU, 4TU, 8TU, and 16TU wth-wp-wec configurations over the corresponding orig configurations, for each benchmark and the average]

19 Performance Improvements Due to the WEC
[Chart: relative speedup (%) of the vc, wth-wp-vc, wth-wp-wec, and nlp configurations for 175.vpr, 164.gzip, 181.mcf, 197.parser, 183.equake, 177.mesa, and the average]

20 Sensitivity to L1 Data Cache Size
[Chart: relative speedup (%) of wth-wp-wec over orig across L1 data cache sizes, for each benchmark and the average]

21 Sensitivity to WEC Size Compared to a Victim Cache
[Chart: relative speedup (%) of wth-wp-vc and wth-wp-wec at two cache sizes, for each benchmark and the average]

22 Sensitivity to WEC Size Compared to Next-Line Prefetching (NLP)
[Chart: relative speedup (%) of nlp and wth-wp-wec at two buffer sizes, for each benchmark and the average]

23 Additional Loads and Reduction of Misses
[Chart: percentage of additional loads and percentage reduction of misses for each benchmark and the average]

24 Conclusions for the WEC
- Allow loads to continue executing even after they are known to be incorrectly issued
  - Do not let them change state
- 45.5% average reduction in the number of misses
  - 9.7% average improvement on top of the parallel execution
  - 4% average improvement over a victim cache
  - 5.6% average improvement over next-line prefetching
- Cost
  - 4% additional loads
  - Minor hardware complexity

25 Typical Computer Architecture Study
1. Find an interesting problem/performance bottleneck
   - E.g., memory delays
2. Invent a clever idea for solving it
   - This is the hard part
3. Implement the idea in a processor/system simulator
   - This is the part grad students usually like best
4. Run simulations on n standard benchmark programs
   - This is time-consuming and boring
5. Compare performance with and without your change
   - Execution time, clocks per instruction (CPI), etc.

26 Problem #2 Simulation in Computer Architecture Research
- Simulators are an important tool for computer architecture research and design
  - Low cost
  - Faster than building a new system
  - Very flexible

27 Performance Evaluation Techniques Used in ISCA Papers
[Chart: fraction of papers (0-100%) using each technique: measurement, modeling, SimpleScalar simulation, other simulators, other. Some papers used more than one evaluation technique]

28 Simulation is Very Popular, But...
- Current simulation methodology is not
  - Formal
  - Rigorous
  - Statistically based
- Never enough simulations
  - Design a new processor based on a few seconds of actual execution time
- What are benchmark programs really exercising?

29 An Example: Sensitivity Analysis
- Which parameters should be varied? Which held fixed?
- What range of values should be used for each variable parameter?
- What values should be used for the constant parameters?
- Are there interactions between the variable and fixed parameters?
- What is the magnitude of those interactions?

30 Let's Introduce Some Statistical Rigor
- Decreases the number of errors
  - Modeling
  - Implementation
  - Set-up
  - Analysis
- Helps find errors more quickly
- Provides greater insight
  - Into the processor
  - Into the effects of an enhancement
- Provides objective confidence in results
- Provides statistical support for conclusions

31 Outline (Part 2)
- A statistical technique for
  - Examining the overall impact of an architectural change
  - Classifying benchmark programs
  - Ranking the importance of processor/simulation parameters
  - Reducing the total number of simulation runs
[Yi, Lilja, Hawkins, HPCA, 2003]

32 A Technique to Limit the Number of Simulations
- Plackett and Burman designs (1946)
  - Multifactorial designs
  - Originally proposed for mechanical assemblies
- Effects of main factors only
  - Logically minimal number of experiments to estimate the effects of m input parameters (factors)
  - Ignores interactions
- Requires O(m) experiments
  - Instead of O(2^m) or O(v^m)

33 Plackett and Burman Designs
- PB designs exist only in sizes that are multiples of 4
- Requires X experiments for m parameters
  - X = next multiple of 4 > m
- PB design matrix
  - Rows = configurations; columns = parameters' values in each configuration
  - High/low = +1/-1
  - First row = from the P&B paper
  - Subsequent rows = circular right shift of the preceding row
  - Last row = all (-1)

34-39 PB Design Matrix
[Worked example, built up over several slides: an 8-configuration design matrix for seven input parameters (factors) A-G, with +1/-1 entries, a measured response for each configuration, and the computed effect of each parameter (e.g., 65 for one parameter, -45 for another)]

40 PB Design
- Only the magnitude of an effect is important
  - The sign is meaningless
- In the example, the most to least important effects:
  - [C, D, E], F, G, A, B
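The matrix construction and effect computation above can be made concrete. A short Python sketch: the generator row is the standard one usually cited for the N = 8 Plackett-Burman design, and the response values here are synthetic, chosen so that parameter A dominates and B matters a little.

```python
def pb_design():
    """N=8 Plackett-Burman design: circular right shifts of the generator
    row for the first seven configurations, plus a final all-low row."""
    gen = [+1, +1, +1, -1, +1, -1, -1]          # first row, per P&B (1946)
    rows = [[gen[(j - k) % 7] for j in range(7)] for k in range(7)]
    rows.append([-1] * 7)                       # last row = all (-1)
    return rows

def effects(design, response):
    """Effect of parameter j = sum over configs of (+/-1 entry) * response."""
    m = len(design[0])
    return [sum(row[j] * y for row, y in zip(design, response))
            for j in range(m)]

X = pb_design()
# synthetic responses: parameter A (column 0) dominates, B (column 1) is minor
y = [50 + 10 * row[0] + 2 * row[1] for row in X]
print(effects(X, y))   # only |effect| matters when ranking parameters
```

Because the eight rows form an orthogonal design (columns are mutually orthogonal and sum to zero), the computed effects isolate A and B exactly: the A and B columns pick up 80 and 16, and every other effect is 0.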

41 Case Study #1: Determine the most significant parameters in a processor simulator

42 Determine the Most Significant Processor Parameters
- Problem
  - So many parameters in a simulator
  - How to choose parameter values?
  - How to decide which parameters are most important?
- Approach
  - Choose reasonable upper/lower bounds
  - Rank parameters by impact on total execution time

43 Simulation Environment
- SimpleScalar simulator
  - sim-outorder 3.0
- Selected SPEC 2000 benchmarks
  - gzip, vpr, gcc, mesa, art, mcf, equake, parser, vortex, bzip2, twolf
- MinneSPEC reduced input sets
- Compiled with gcc (PISA) at -O3

44 Functional Unit Values
[Table: low and high values for each functional-unit parameter: the numbers of INT and FP ALUs and multiply/divide units, and the latency and throughput of each operation class (INT ALU, INT multiply, INT divide, FP ALU, FP multiply, FP divide, FP square root); the INT divide and FP multiply/divide/sqrt throughputs are set equal to the corresponding latencies]

45 Memory System Values, Part I
- L1 I-cache: size 4 KB to 128 KB, 1-way to 8-way associative, 16- to 64-byte blocks, LRU replacement, latency 4 cycles down to 1 cycle
- L1 D-cache: size 4 KB to 128 KB, 1-way to 8-way associative, 16- to 64-byte blocks, LRU replacement, latency 4 cycles down to 1 cycle
- L2 cache: size 256 KB to 8192 KB, 1-way to 8-way associative, 64- to 256-byte blocks

46 Memory System Values, Part II
- L2 cache: LRU replacement, latency 20 cycles down to 5 cycles
- Memory latency (first): 200 cycles down to 50 cycles; memory latency (next) = 0.02 * memory latency (first)
- Memory bandwidth: 4 to 32 bytes
- I-TLB: 32 to 256 entries, 4 KB to 4096 KB pages, 2-way to fully associative, latency 80 cycles down to 30 cycles
- D-TLB: 32 to 256 entries, page size same as the I-TLB, 2-way to fully associative, latency same as the I-TLB

47 Processor Core Values
- Fetch queue: 4 to 32 entries
- Branch predictor: 2-level to perfect
- Branch mispredict penalty: 10 cycles down to 2 cycles
- RAS: 4 to 64 entries
- BTB: 16 to 512 entries, 2-way to fully associative
- Speculative branch update: in commit to in decode
- Decode/issue width: 4-way
- ROB: 8 to 64 entries
- LSQ: 0.25 * ROB to 1.0 * ROB entries
- Memory ports: 1 to 4

48 Determining the Most Significant Parameters
1. Run simulations to find the response, with the input parameters set to their high/low (on/off) values according to the PB design matrix

49 Determining the Most Significant Parameters
2. Calculate the effect of each parameter across the configurations

50 Determining the Most Significant Parameters
3. For each benchmark, rank the parameters in descending order of effect (1 = most important)
[Table: ranks of parameters A, B, C, ... for each benchmark]

51 Determining the Most Significant Parameters
4. For each parameter, average its ranks across the benchmarks
[Table: per-benchmark ranks and the average rank for each parameter]

52 Most Significant Parameters
[Table: per-benchmark (gcc, gzip, art, ...) and average ranks for the top parameters, in order: ROB entries, L2 cache latency, branch predictor accuracy, number of integer ALUs, L1 D-cache latency, L1 I-cache size, L2 cache size, L1 I-cache block size, memory latency (first), LSQ entries, speculative branch update]

53 General Procedure
- Determine upper/lower bounds for the parameters
- Simulate the configurations to find each response
- Compute the effect of each parameter across the configurations
- Rank the parameters for each benchmark based on the effects
- Average the ranks across benchmarks
- Focus on the top-ranked parameters for subsequent analysis
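The ranking and averaging steps of this procedure reduce to a few lines of code. A minimal Python sketch with made-up effect values for three hypothetical benchmarks:

```python
def rank_parameters(effect_by_param):
    """Rank parameters by |effect|, descending (1 = most important);
    the sign of an effect is meaningless in a PB analysis."""
    order = sorted(effect_by_param, key=lambda p: -abs(effect_by_param[p]))
    return {p: i + 1 for i, p in enumerate(order)}

def average_ranks(per_benchmark_effects):
    """Average each parameter's rank across benchmarks (low = important)."""
    ranks = [rank_parameters(e) for e in per_benchmark_effects.values()]
    return {p: sum(r[p] for r in ranks) / len(ranks) for p in ranks[0]}

# hypothetical PB effects for three benchmarks
effects_data = {
    "bench1": {"A": 65, "B": -45, "C": 3},
    "bench2": {"A": -70, "B": 20, "C": 10},
    "bench3": {"A": 40, "B": -50, "C": 30},
}
avg = average_ranks(effects_data)   # A ends up most important overall
```

Averaging ranks rather than raw effects keeps one benchmark with huge absolute effects from drowning out the others.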

54 Case Study #2: Determine the big-picture impact of a system enhancement

55 Determining the Overall Effect of an Enhancement
- Problem:
  - Performance analysis is typically limited to single metrics
    - Speedup, power consumption, miss rate, etc.
  - Simple analysis discards a lot of good information

56 Determining the Overall Effect of an Enhancement
- Find the most important parameters without the enhancement
  - Using Plackett and Burman
- Find the most important parameters with the enhancement
  - Again using Plackett and Burman
- Compare the parameter ranks

57 Example: Instruction Precomputation
- Profile to find the most common operations (opcode plus operand values)
- Insert the results of those common operations into a table when the program is loaded into memory
- Query the table when an instruction is issued
- Don't execute the instruction if its result is already in the table
- Reduces contention for function units
[Yi, Sendag, Lilja, Euro-Par, 2002]
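The mechanism can be sketched in software. This is a Python model of the idea, not the hardware design: `build_table` stands in for the profiling pass, the dictionary stands in for the precomputation table queried at issue, and the trace and table size are hypothetical.

```python
from collections import Counter

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def build_table(profile_trace, size):
    """Preload the results of the most frequent (op, a, b) combinations."""
    common = Counter(profile_trace).most_common(size)
    return {(op, a, b): OPS[op](a, b) for (op, a, b), _ in common}

def execute(trace, table):
    """Skip the functional unit whenever the result is already tabled."""
    hits, results = 0, []
    for op, a, b in trace:
        if (op, a, b) in table:
            results.append(table[(op, a, b)])   # precomputed: no FU needed
            hits += 1
        else:
            results.append(OPS[op](a, b))       # execute normally
    return results, hits

trace = [("add", 0, 1)] * 5 + [("mul", 2, 2)] * 3 + [("add", 3, 4)]
table = build_table(trace, size=2)              # hypothetical 2-entry table
results, hits = execute(trace, table)           # 8 of the 9 ops hit the table
```

The hit count is what relieves function-unit contention: every hit is an instruction that never occupies an ALU.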

58 The Effect of Instruction Precomputation
[Table: the average rank of each top parameter (ROB entries, L2 cache latency, branch predictor accuracy, number of integer ALUs, L1 D-cache latency, L1 I-cache size, L2 cache size, L1 I-cache block size, memory latency (first), LSQ entries) before and after the enhancement, and the difference]


61 Case Study #3: Benchmark program classification

62 Benchmark Classification
- By application type
  - Scientific and engineering applications
  - Transaction processing applications
  - Multimedia applications
- By use of processor function units
  - Floating-point code
  - Integer code
  - Memory-intensive code
- Etc., etc.

63 Another Point of View
- Classify by overall impact on the processor
- Define: two benchmark programs are similar if they stress the same components of a system to similar degrees
- How to measure this similarity?
  - Use a Plackett and Burman design to find the ranks
  - Then compare the ranks

64 Similarity Metric
- Use the rank of each parameter as the elements of a vector
- For benchmark program X, let
  - X = (x1, x2, ..., x(n-1), xn)
  - x1 = rank of parameter 1
  - x2 = rank of parameter 2
  - ...

65 Vector Defines a Point in n-Space
[Figure: two points (x1, x2, x3) and (y1, y2, y3) plotted against axes Param #1, Param #2, Param #3, separated by distance D]

66 Similarity Metric
- Euclidean distance between points:
  D = [(x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2]^(1/2)

67 Most Significant Parameters
[Table repeated from slide 52: per-benchmark (gcc, gzip, art) and average ranks for ROB entries, L2 cache latency, branch predictor accuracy, number of integer ALUs, L1 D-cache latency, L1 I-cache size, L2 cache size, L1 I-cache block size, memory latency (first), LSQ entries, speculative branch update]

68 Distance Computation
- Rank vectors (one rank per parameter), e.g.:
  - gcc = (4, ..., 5, 8, ...)
  - gzip = (..., 4, ..., 3, ...)
  - art = (..., 4, 7, 9, ...)
- Euclidean distances, element by element:
  - D(gcc, gzip) = [(4 - ...)^2 + (... - 4)^2 + (5 - ...)^2 + ...]^(1/2)
  - D(gcc, art) and D(gzip, art) computed the same way
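The computation in code, using hypothetical rank vectors (the names match the slide's benchmarks, but these five-element ranks are invented for illustration):

```python
import math

def distance(x, y):
    """Euclidean distance between two parameter-rank vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# hypothetical top-5 parameter ranks for three benchmarks
ranks = {
    "gcc":  (4, 1, 5, 8, 2),
    "gzip": (1, 4, 2, 3, 5),
    "art":  (2, 4, 7, 9, 1),
}
d_gcc_gzip = distance(ranks["gcc"], ranks["gzip"])
d_gcc_art = distance(ranks["gcc"], ranks["art"])
# the pair with the smaller distance stresses the system more similarly,
# so they belong in the same benchmark group
```

A dendrogram or grouping then falls out of clustering on this distance matrix.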

69 Euclidean Distances for Selected Benchmarks
[Table: pairwise Euclidean distances among gcc, gzip, art, and mcf; the diagonal entries are 0]

70 Dendrogram of Distances Showing (Dis-)Similarity

71 Final Benchmark Groupings
- Group I: gzip, mesa
- Group II: vpr-Place, twolf
- Group III: vpr-Route, parser, bzip2
- Group IV: gcc, vortex
- Group V: art
- Group VI: mcf
- Group VII: equake
- Group VIII: ammp

72 Conclusion
- Multifactorial (Plackett and Burman) design
  - Requires only O(m) experiments
  - Determines the effects of the main factors only
  - Ignores interactions
- Logically minimal number of experiments to estimate the effects of m input parameters
- Powerful technique for obtaining a big-picture view of a lot of simulation data

73 Conclusion
- Demonstrated for
  - Ranking the importance of simulation parameters
  - Finding the overall impact of a processor enhancement
  - Classifying benchmark programs
- Current work: comparing simulation strategies
  - Reduced input sets (e.g., MinneSPEC)
  - Sampling (e.g., SimPoint, statistical sampling)

74 Goals
- Develop/understand tools for interpreting large quantities of data
- Increase insights into processor design
- Improve rigor in computer architecture research


More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science, University of Central Florida zhou@cs.ucf.edu Abstract Current integration trends

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

A Quick SimpleScalar Tutorial. [Prepared from SimpleScalar Tutorial v2, Todd Austin and Doug Burger.]

A Quick SimpleScalar Tutorial. [Prepared from SimpleScalar Tutorial v2, Todd Austin and Doug Burger.] A Quick SimpleScalar Tutorial [Prepared from SimpleScalar Tutorial v2, Todd Austin and Doug Burger.] What is SimpleScalar? SimpleScalar is a tool set for high-performance simulation and research of modern

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers

Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers Joseph J. Sharkey, Dmitry V. Ponomarev Department of Computer Science State University of New York Binghamton, NY 13902 USA

More information

CS 614 COMPUTER ARCHITECTURE II FALL 2005

CS 614 COMPUTER ARCHITECTURE II FALL 2005 CS 614 COMPUTER ARCHITECTURE II FALL 2005 DUE : November 9, 2005 HOMEWORK III READ : - Portions of Chapters 5, 6, 7, 8, 9 and 14 of the Sima book and - Portions of Chapters 3, 4, Appendix A and Appendix

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Dynamically Controlled Resource Allocation in SMT Processors

Dynamically Controlled Resource Allocation in SMT Processors Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona

More information

A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST

A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST A Cost Effective Spatial Redundancy with Data-Path Partitioning Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST 1 Outline Introduction Data-path Partitioning for a dependable

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Miss Penalty: Read Priority over Write on Miss Write buffers may offer RAW

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Dynamic Memory Dependence Predication

Dynamic Memory Dependence Predication Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Evaluating Benchmark Subsetting Approaches

Evaluating Benchmark Subsetting Approaches Evaluating Benchmark Subsetting Approaches Joshua J. Yi, Resit Sendag, Lieven Eeckhout, Ajay Joshi, David J. Lilja, and Lizy K. John Networking and omputing Systems Group Freescale Semiconductor, Inc.

More information

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology

More information

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines Assigned April 7 Problem Set #5 Due April 21 http://inst.eecs.berkeley.edu/~cs152/sp09 The problem sets are intended

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Superscalar Organization

Superscalar Organization Superscalar Organization Nima Honarmand Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average

More information

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Pipelining to Superscalar

Pipelining to Superscalar Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel

More information

The Predictability of Computations that Produce Unpredictable Outcomes

The Predictability of Computations that Produce Unpredictable Outcomes This is an update of the paper that appears in the Proceedings of the 5th Workshop on Multithreaded Execution, Architecture, and Compilation, pages 23-34, Austin TX, December, 2001. It includes minor text

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

ROB: head/tail. exercise: result of processing rest? 2. rename map (for next rename) log. phys. free list: X11, X3. PC log. reg prev.

ROB: head/tail. exercise: result of processing rest? 2. rename map (for next rename) log. phys. free list: X11, X3. PC log. reg prev. Exam Review 2 1 ROB: head/tail PC log. reg prev. phys. store? except? ready? A R3 X3 no none yes old tail B R1 X1 no none yes tail C R1 X6 no none yes D R4 X4 no none yes E --- --- yes none yes F --- ---

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Floating Point/Multicycle Pipelining in DLX

Floating Point/Multicycle Pipelining in DLX Floating Point/Multicycle Pipelining in DLX Completion of DLX EX stage floating point arithmetic operations in one or two cycles is impractical since it requires: A much longer CPU clock cycle, and/or

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson

More information

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types

More information

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name:

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: SOLUTION Notes: CS 152 Computer Architecture and Engineering CS 252 Graduate Computer Architecture Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: I am taking CS152 / CS252 This is a closed

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Characterizing and Comparing Prevailing Simulation Techniques

Characterizing and Comparing Prevailing Simulation Techniques Characterizing and Comparing Prevailing Simulation Techniques Joshua J. Yi 1, Sreekumar V. Kodakara 2, Resit Sendag 3, David J. Lilja 2, Douglas M. Hawkins 4 1 - Networking and Computing Systems Group

More information

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading CSE 502 Graduate Computer Architecture Lec 11 Simultaneous Multithreading Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,

More information

THE SPEC CPU2000 benchmark suite [12] is a commonly

THE SPEC CPU2000 benchmark suite [12] is a commonly IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 11, NOVEMBER 2007 1549 Speed versus Accuracy Trade-Offs in Microarchitectural Simulations Joshua J. Yi, Member, IEEE, Resit Sendag, Member, IEEE, David J. Lilja,

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information