José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

Size: px
Start display at page:

Download "José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2"

Transcription

1 CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas

2 MOTIVATION Problem: Limited processor resources Goal: More efficient use by aggressive recycling Opportunity: Resources reserved until retirement Solution: Decouple recycling from retirement : ed Early Resource Recycling in Out-of-order Microprocessors Slide 2/41

3 CURRENT PROCESSORS Q Entry STQ Entry Register Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

4 CURRENT PROCESSORS Q Entry STQ Entry Register Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

5 CURRENT PROCESSORS Q Entry Add STQ Entry Register Conventional ROB Tail (newest) Add Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

6 CURRENT PROCESSORS Q Entry STQ Entry Add Register Conventional ROB Tail (newest) Add Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

7 CURRENT PROCESSORS Q Entry STQ Entry Add ST Register ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

8 CURRENT PROCESSORS Q Entry STQ Entry Add ST Register ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

9 CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

10 CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

11 CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

12 CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

13 CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

14 CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

15 PROPOSAL: EARLY RECYCLING Decouple resource recycling from instruction retirement Recycle when no longer needed Targeted resources: Load queue entries Store queue entries Physical registers Potential when targeted resources are unlimited: 1.12 speedup in SPECint speedup in SPECfp2000 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 4/41

16 EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41

17 EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41

18 EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41

19 EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41

20 CONSTRAINT: PRECISE STATE Events that require precise state: Branch mispredictions Memory replay traps Exceptions Interrupts Early recycling makes this difficult E.g. recycled registers no longer available E.g. memory may be overwriten prematurely : ed Early Resource Recycling in Out-of-order Microprocessors Slide 6/41

21 PROPOSAL: POINT OF NO RETURN Classify events according to frequency: Frequent: branch mispredictions, memory replays Infrequent: exceptions Split ROB into two logical sets of instructions Irreversible set Subject to infrequent events only recovery Reversible set: Subject to frequent and infrequent events Conventional ROB recovery mechanisms Boundary: Point of No Return (PNR) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 7/41

22 CONVENTIONAL VS CHERRY ROB Reversible Conventional ROB Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 8/41

23 CONVENTIONAL VS CHERRY ROB Reversible Conventional ROB Reversible Irreversible Head (oldest) ROB Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 8/41

24 CONVENTIONAL VS CHERRY ROB Reversible Conventional ROB Reversible Irreversible Head (oldest) ROB Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 8/41

25 OUTLINE Motivation Overview Implementation ing Exception and interrupt handling Early resource recycling Results : ed Early Resource Recycling in Out-of-order Microprocessors Slide 9/41

26 TIMELINE Timeline Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

27 TIMELINE Timeline Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

28 TIMELINE Timeline Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

29 TIMELINE Timeline Conventional ROB Exception : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

30 TIMELINE Rollback Timeline Conventional ROB Exception : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

31 TIMELINE Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

32 TIMELINE Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

33 ENTERING CHERRY MODE Backup architectural registers Can be done in a handful of cycles Allow PNR to race ahead of ROB head Updates to cache hierarchy set Volatile bit Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 11/41

34 PERIODIC CHECKPOINTING Done regularly to bound re-execution overhead Stop early recycling to create checkpoint (collapse) Freeze PNR (stop recycling) Once ROB head catches up with PNR: Clear Volatile bits from cache Backup architectural registers Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 12/41

35 PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Reversible Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

36 PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Reversible Exception Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

37 PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Reversible Exception Irreversible Restore from Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

38 PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Rollback Timeline Conventional ROB Exception : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

39 PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

40 PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

41 PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Reversible Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41

42 PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Exception Reversible Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41

43 PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Exception Reversible Point of No Return & Head : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41

44 PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Exception Reversible Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41

45 INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Timeline : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

46 INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

47 INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

48 INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Process Interrupt Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

49 INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Process Interrupt Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

50 OUTLINE Motivation Overview Implementation ing Exception and interrupt handling Early resource recycling Results : ed Early Resource Recycling in Out-of-order Microprocessors Slide 16/41

51 DEFINING THE PNRS Each resource defines its own PNR Based on three ROB pointers: U b : oldest unresolved branch U s : oldest store with unresolved address U l : oldest load with unresolved address B S L S B L Tail (newest) Us Ul Ub Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 17/41

52 EARLY Q ENTRY RECYCLING Us L2 S0 L1 LQQ Head (oldest) L2 L1 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 18/41

53 EARLY Q ENTRY RECYCLING Us L2 S0 L1 LQQ Head (oldest) L2 L1 ST replay check : ed Early Resource Recycling in Out-of-order Microprocessors Slide 18/41

54 EARLY Q ENTRY RECYCLING Constraint: Must detect memory replay traps older(u s ): ST- replay trap Not affected by branches PNR Q = U s : ed Early Resource Recycling in Out-of-order Microprocessors Slide 19/41

55 EARLY REGISTER RECYCLING Constraint: Must not contain potentially useful data older(u s ): Free of ST- replay trap older(u b ): Free of branch misprediction PNR Reg = oldest(u S, U B ) Of course register must be dead (paper) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 20/41

56 EARLY RECYCLING SUMMARY Resource PNR Value Q entries U S STQ entries oldest(u L, U S, U B ) Registers oldest(u S, U B ) PNR Q PNR STQ PNR Reg B S L S B L Tail (newest) Us Ul Ub Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 21/41

57 EVALUATION Execution-driven simulation SPEC 2000 applications SGI MIPSPro compiler Compare against: Processor with unlimited resources Three aggressive base processor designs (paper) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 22/41

58 PROCESSOR MODEL Frequency: 3.2GHz Fetch/issue/commit width: 8/8/12 I. window/rob size: 128/384 Int/FP registers : 192/128 Ld/St units: 2/2 Int/FP/branch units: 7/5/3 Ld/St queue entries: 32/32 MSHRs: 24 Cache L1 L2 Bus & Memory Size: 32KB 512KB RT: 2 cycles 10 cycles Assoc: 4-way 8-way Line size: 64B 128B Ports: 4 1 checkpoint frequency: 5µs Up to 1 taken branch/cycle RAS: 32 entries BTB: 4K entries, 4-way assoc. Branch predictor: Hybrid with speculative update Bimodal size: 8K entries Two-level size: 64K entries FSB frequency: 400MHz FSB width: 128bit Memory: 4-channel Rambus DRAM bandwidth: 6.4GB/s Memory RT: 120ns : ed Early Resource Recycling in Out-of-order Microprocessors Slide 23/41

59 OVERALL RESULTS Unlimited Speedup bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average 1.06 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 24/41

60 OVERALL RESULTS Unlimited Speedup Speedup bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average applu apsi art equake mesa mgrid swim wupwise Average 1.26 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 24/41

61 OVERALL RESULTS Unlimited Speedup Speedup bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average applu apsi art equake mesa mgrid swim wupwise Average : ed Early Resource Recycling in Out-of-order Microprocessors Slide 24/41

62 PNR ADVANCE PNR advance as a % of ROB use Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41

63 PNR ADVANCE PNR advance as a % of ROB use Q Entry 77.4% SPECint Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41

64 PNR ADVANCE PNR advance as a % of ROB use Q Entry 77.4% SPECint STQ Entry 19.7% Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41

65 PNR ADVANCE PNR advance as a % of ROB use Q Entry 77.4% SPECint STQ Entry 19.7% Register 28.7% Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41

66 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41 PNR ADVANCE PNR advance as a % of ROB use 78.3% 63.4% 67.7% SPECfp Register 28.7% STQ Entry 19.7% (oldest) Head (newest) Tail Q Entry 77.4% SPECint

67 SUMMARY Decoupling resource recycling from retirement Split ROB into two logical sets Use of ROB+checkpointing simultaneously Early recycling mechanism: Q, STQ, Registers Integration with speculative multithreading (paper) Speedup over baseline 1.06 for SPECint for SPECfp2000 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 26/41

68 QUESTIONS? José F. Martínez 1 Jose Renau 2 Michael C. Huang 3 Milos Prvulovic 2 Josep Torrellas : ed Early Resource Recycling in Out-of-order Microprocessors Slide 27/41

69 BACKUP SLIDES : ed Early Resource Recycling in Out-of-order Microprocessors Slide 28/41

70 RECYCLING PNRS Resource PNR Value Q entries (uniprocessor) U S Q entries (multiprocessor) oldest(u L, U S ) STQ entries oldest(u L, U S, U B ) Registers (uniprocessor) oldest(u S, U B ) Registers (multiprocessor) oldest(u L, U S, U B ) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 29/41

71 EARLY STQ ENTRY RECYCLING Constraint: Must not overwrite potentially useful data older(u l ): Unresolved may need old data older(u s ): No older is replayable older(u b ): Free of branch misprediction PNR STQ = oldest(u L, U S, U B ) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 30/41

72 CHECKPOINT FREQUENCY T o = c k + p e c e (1) p e = T c T e (2) c e = s T c 2 + (s 1) (3) T o T c = c k T c + ( s 1 ) Tc (4) 2 T e T c = c k T e s 1 2 (5) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 31/41

73 CHECKPOINT SELECTION 5 4 Te=200µs 400µs 600µs 800µs 1000µs To/Tc (%) Tc (µs) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 32/41

74 SPECINT2000 PARAMETERS SPECint Apps Used Irreversible Set Collapse ROB (% of Used ROB) Step (%) Reg LQ SQ (Cycles) bzip crafty gcc gzip mcf parser perlbmk vortex vpr Average : ed Early Resource Recycling in Out-of-order Microprocessors Slide 33/41

75 SPECFP2000 PARAMETERS SPECfp Apps Used Irreversible Set Collapse ROB (% of Used ROB) Step (%) Reg LQ SQ (Cycles) applu apsi art equake mesa mgrid swim wupwise Average : ed Early Resource Recycling in Out-of-order Microprocessors Slide 34/41

76 REGISTER ACCESS TIME Access Time (ps) Number of Registers : ed Early Resource Recycling in Out-of-order Microprocessors Slide 35/41

77 OTHER CONSIDERATIONS (PAPER) Exception on instruction older than youngest PNR: Roll back to checkpoint Each PNR recycles at its own pace Not all PNRs critical, e.g. PNR Q Early register recycling: Must count in-flight consumers (Moudgill et al 1993) Optimal interval between checkpoints : ed Early Resource Recycling in Out-of-order Microprocessors Slide 36/41

78 CHERRY PARAMETERS checkpoint frequency: 5µs overhead: 60 cycles (18.75ns) Avg time between exceptions: 800µs (not modeled) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 37/41

79 INGORING U b Speedup bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average Unlimited Pass Branches : ed Early Resource Recycling in Out-of-order Microprocessors Slide 38/41

80 PERFECT ANCH PREDICTION Speedup Unlimited Pass Branches bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average : ed Early Resource Recycling in Out-of-order Microprocessors Slide 39/41

81 INTERRUPT COLLAPSING Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 40/41

82 INTERRUPT COLLAPSING Interrupt Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 40/41

83 INTERRUPT COLLAPSING Interrupt Point of No Return & Head : ed Early Resource Recycling in Out-of-order Microprocessors Slide 40/41

84 EQUIVALENT SYSTEM Resource Baseline SPECint2000 SPECfp2000 Q entries STQ entries Int Registers FP Registers : ed Early Resource Recycling in Out-of-order Microprocessors Slide 41/41

Cherry: Checkpointed Early Resource Recycling in Out-of-order Microprocessors Λ

Cherry: Checkpointed Early Resource Recycling in Out-of-order Microprocessors Λ Cherry: Checkpointed Early Resource Recycling in Out-of-order Microprocessors Λ José F. Martínez Jose Renau y Michael C. Huang z Milos Prvulovic y Josep Torrellas y Computer ystems Laboratory, Cornell

More information

CAVA: Using Checkpoint-Assisted Value Prediction to Hide L2 Misses

CAVA: Using Checkpoint-Assisted Value Prediction to Hide L2 Misses CAVA: Using Checkpoint-Assisted Value Prediction to Hide L2 Misses Luis Ceze, Karin Strauss, James Tuck, Jose Renau and Josep Torrellas University of Illinois at Urbana-Champaign {luisceze, kstrauss, jtuck,

More information

CAVA: Hiding L2 Misses with Checkpoint-Assisted Value Prediction

CAVA: Hiding L2 Misses with Checkpoint-Assisted Value Prediction CAVA: Hiding L2 Misses with Checkpoint-Assisted Value Prediction Luis Ceze, Karin Strauss, James Tuck, Jose Renau, Josep Torrellas University of Illinois at Urbana-Champaign June 2004 Abstract Modern superscalar

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Are we ready for high-mlp?

Are we ready for high-mlp? Are we ready for high-mlp?, James Tuck, Josep Torrellas Why MLP? overlapping long latency misses is a very effective way of tolerating memory latency hide latency at the cost of bandwidth 2 Motivation

More information

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical

More information

Impact of Cache Coherence Protocols on the Processing of Network Traffic

Impact of Cache Coherence Protocols on the Processing of Network Traffic Impact of Cache Coherence Protocols on the Processing of Network Traffic Amit Kumar and Ram Huggahalli Communication Technology Lab Corporate Technology Group Intel Corporation 12/3/2007 Outline Background

More information

Design of Experiments - Terminology

Design of Experiments - Terminology Design of Experiments - Terminology Response variable Measured output value E.g. total execution time Factors Input variables that can be changed E.g. cache size, clock rate, bytes transmitted Levels Specific

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Precise Instruction Scheduling

Precise Instruction Scheduling Journal of Instruction-Level Parallelism 7 (2005) 1-29 Submitted 10/2004; published 04/2005 Precise Instruction Scheduling Gokhan Memik Department of Electrical and Computer Engineering Northwestern University

More information

Exploring Wakeup-Free Instruction Scheduling

Exploring Wakeup-Free Instruction Scheduling Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University Outline Motivation Case study: Cyclone Towards high-performance

More information

Low-Complexity Reorder Buffer Architecture*

Low-Complexity Reorder Buffer Architecture* Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower

More information

POSH: A TLS Compiler that Exploits Program Structure

POSH: A TLS Compiler that Exploits Program Structure POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign

More information

Narrow Width Dynamic Scheduling

Narrow Width Dynamic Scheduling Journal of Instruction-Level Parallelism 9 (2007) 1-23 Submitted 10/06; published 4/07 Narrow Width Dynamic Scheduling Erika Gunadi Mikko H. Lipasti Department of Electrical and Computer Engineering 1415

More information

Are We Ready for High Memory-Level Parallelism?

Are We Ready for High Memory-Level Parallelism? Are We Ready for High Memory-Level Parallelism? Luis Ceze, James Tuck and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign Email: {luisceze,jtuck,torrella}@cs.uiuc.edu

More information

Many Cores, One Thread: Dean Tullsen University of California, San Diego

Many Cores, One Thread: Dean Tullsen University of California, San Diego Many Cores, One Thread: The Search for Nontraditional Parallelism University of California, San Diego There are some domains that feature nearly unlimited parallelism. Others, not so much Moore s Law and

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Dynamic Memory Dependence Predication

Dynamic Memory Dependence Predication Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is

More information

Deconstructing Commit

Deconstructing Commit Deconstructing Commit Gordon B. Bell and Mikko H. Lipasti Dept. of Elec. and Comp. Engr. Univ. of Wisconsin-Madison {gbell,mikko}@ece.wisc.edu Abstract Many modern processors execute instructions out of

More information

A Complexity-Effective Out-of-Order Retirement Microarchitecture

A Complexity-Effective Out-of-Order Retirement Microarchitecture 1 A Complexity-Effective Out-of-Order Retirement Microarchitecture S. Petit, J. Sahuquillo, P. López, R. Ubal, and J. Duato Department of Computer Engineering (DISCA) Technical University of Valencia,

More information

Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification

Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification Alok Garg, M. Wasiur Rashid, and Michael Huang Department of Electrical & Computer Engineering University

More information

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)

More information

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Aneesh Aggarwal Electrical and Computer Engineering Binghamton University Binghamton, NY 1392 aneesh@binghamton.edu Abstract With the

More information

Thread-Level Speculation on a CMP Can Be Energy Efficient

Thread-Level Speculation on a CMP Can Be Energy Efficient Thread-Level Speculation on a CMP Can Be Energy Efficient Jose Renau Karin Strauss Luis Ceze Wei Liu Smruti Sarangi James Tuck Josep Torrellas ABSTRACT Dept. of Computer Engineering, University of California

More information

Thread-Level Speculation on a CMP Can Be Energy Efficient

Thread-Level Speculation on a CMP Can Be Energy Efficient Thread-Level Speculation on a CMP Can Be Energy Efficient Jose Renau Karin Strauss Luis Ceze Wei Liu Smruti Sarangi James Tuck Josep Torrellas ABSTRACT Dept. of Computer Engineering, University of California

More information

The Validation Buffer Out-of-Order Retirement Microarchitecture

The Validation Buffer Out-of-Order Retirement Microarchitecture The Validation Buffer Out-of-Order Retirement Microarchitecture S. Petit, J. Sahuquillo, P. López, and J. Duato Department of Computer Engineering (DISCA) Technical University of Valencia, Spain Abstract

More information

Address-Indexed Memory Disambiguation and Store-to-Load Forwarding

Address-Indexed Memory Disambiguation and Store-to-Load Forwarding Copyright c 2005 IEEE. Reprinted from 38th Intl Symp Microarchitecture, Nov 2005, pp. 171-182. This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted.

More information

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors Ilya Ganusov and Martin Burtscher Computer Systems Laboratory Cornell University {ilya, burtscher}@csl.cornell.edu Abstract This

More information

Implementation Issues of Slackened Memory Dependence Enforcement

Implementation Issues of Slackened Memory Dependence Enforcement Implementation Issues of Slackened Memory Dependence Enforcement Technical Report Alok Garg, M. Wasiur Rashid, and Michael Huang Department of Electrical & Computer Engineering University of Rochester

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

Kilo-instructions Processors

Kilo-instructions Processors Kilo-instructions Processors Speaker: Mateo Valero UPC Team: Adrián Cristal Pepe Martínez Josep Llosa Daniel Ortega and Mateo Valero IBM Seminar on Compilers and Architecture Haifa November 11 th 2003

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

FINE-GRAIN STATE PROCESSORS PENG ZHOU A DISSERTATION. Submitted in partial fulfillment of the requirements. for the degree of DOCTOR OF PHILOSOPHY

FINE-GRAIN STATE PROCESSORS PENG ZHOU A DISSERTATION. Submitted in partial fulfillment of the requirements. for the degree of DOCTOR OF PHILOSOPHY FINE-GRAIN STATE PROCESSORS By PENG ZHOU A DISSERTATION Submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY (Computer Science) MICHIGAN TECHNOLOGICAL UNIVERSITY

More information

Fast Branch Misprediction Recovery in Out-of-order Superscalar Processors

Fast Branch Misprediction Recovery in Out-of-order Superscalar Processors Fast Branch Misprediction Recovery in Out-of-order Superscalar Processors Peng Zhou Soner Önder Steve Carr Department of Computer Science Michigan Technological University Houghton, Michigan 4993-295 {pzhou,soner,carr}@mtuedu

More information

Kilo-instruction Processors, Runahead and Prefetching

Kilo-instruction Processors, Runahead and Prefetching Kilo-instruction Processors, Runahead and Prefetching Tanausú Ramírez 1, Alex Pajuelo 1, Oliverio J. Santana 2 and Mateo Valero 1,3 1 Departamento de Arquitectura de Computadores UPC Barcelona 2 Departamento

More information

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information

Fast branch misprediction recovery in out-oforder superscalar processors

Fast branch misprediction recovery in out-oforder superscalar processors See discussions, stats, and author profiles for this publication at: https://wwwresearchgatenet/publication/2223623 Fast branch misprediction recovery in out-oforder superscalar processors CONFERENCE PAPER

More information

Dynamic Scheduling with Narrow Operand Values

Dynamic Scheduling with Narrow Operand Values Dynamic Scheduling with Narrow Operand Values Erika Gunadi University of Wisconsin-Madison Department of Electrical and Computer Engineering 1415 Engineering Drive Madison, WI 53706 Submitted in partial

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies Administrivia CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) HW #3, on memory hierarchy, due Tuesday Continue reading Chapter 3 of H&P Alan Sussman als@cs.umd.edu

More information

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Reducing Reorder Buffer Complexity Through Selective Operand Caching Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003 Reducing Reorder Buffer Complexity Through Selective Operand Caching Gurhan Kucuk Dmitry Ponomarev

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors Walid El-Reedy Electronics and Comm. Engineering Cairo University, Cairo, Egypt walid.elreedy@gmail.com Ali A. El-Moursy Electrical

More information

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J Lilja lilja@eceumnedu Acknowledgements! Graduate students

More information

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering

More information

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain Software-assisted Cache Mechanisms for Embedded Systems by Prabhat Jain Bachelor of Engineering in Computer Engineering Devi Ahilya University, 1986 Master of Technology in Computer and Information Technology

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in

More information

L1 Data Cache Decomposition for Energy Efficiency

L1 Data Cache Decomposition for Energy Efficiency L1 Data Cache Decomposition for Energy Efficiency Michael Huang, Joe Renau, Seung-Moon Yoo, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/flexram Objective Reduce

More information

Scalable Cache Miss Handling for High Memory-Level Parallelism

Scalable Cache Miss Handling for High Memory-Level Parallelism Appeared in IEEE/ACM International Symposium on Microarchitecture, December 6 (MICRO 6) Scalable Cache Miss Handling for High Memory-Level Parallelism James Tuck, Luis Ceze, and Josep Torrellas University

More information

Dynamically Controlled Resource Allocation in SMT Processors

Dynamically Controlled Resource Allocation in SMT Processors Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering The University

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 9: Limits of ILP, Case Studies Lecture Outline Speculative Execution Implementing Precise Interrupts

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Instruction Commit The End of the Road (um Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable Only actions following

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

Using a Serial Cache for. Energy Efficient Instruction Fetching

Using a Serial Cache for. Energy Efficient Instruction Fetching Using a Serial Cache for Energy Efficient Instruction Fetching Glenn Reinman y Brad Calder z y Computer Science Department, University of California, Los Angeles z Department of Computer Science and Engineering,

More information

Paceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking

Paceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking Paceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking Brian Greskamp and Josep Torrellas University of Illinois at Urbana Champaign http://iacoma.cs.uiuc.edu Abstract

More information

Speculative Parallelization in Decoupled Look-ahead

Speculative Parallelization in Decoupled Look-ahead Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer Engineering University of Rochester, Rochester, NY Motivation Single-thread

More information

Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1

Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1 I Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL Inserting Prefetches IA-32 Execution Layer - 1 Agenda IA-32EL Brief Overview Prefetching in Loops IA-32EL Prefetching in

More information

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard

More information

Speculative Parallelization in Decoupled Look-ahead

Speculative Parallelization in Decoupled Look-ahead International Conference on Parallel Architectures and Compilation Techniques Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer

More information

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding HANDLING MEMORY OPS 9 Dynamically Scheduling Memory Ops Compilers must schedule memory ops conservatively Options for hardware: Hold loads until all prior stores execute (conservative) Execute loads as

More information

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor Liqiang He Inner Mongolia University Huhhot, Inner Mongolia 010021 P.R.China liqiang@imu.edu.cn

More information

Analysis of the TRIPS Prototype Block Predictor

Analysis of the TRIPS Prototype Block Predictor Appears in the 9 nternational Symposium on Performance Analysis of Systems and Software Analysis of the TRPS Prototype Block Predictor Nitya Ranganathan oug Burger Stephen W. Keckler epartment of Computer

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract

More information

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency Lock and Unlock: A Data Management Algorithm for A Security-Aware Cache Department of Informatics, Japan Science and Technology Agency ICECS'06 1 Background (1/2) Trusted Program Malicious Program Branch

More information

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Appears in the Proceedings of the 30 th Annual International Symposium on Computer Architecture (ISCA) Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University

More information

Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance

Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance M. Wasiur Rashid, Edwin J. Tan, and Michael C. Huang Department of Electrical & Computer Engineering University of Rochester

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u

More information

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 Reminder: Lab Assignments Lab Assignment 6 Implementing a more

More information

Use-Based Register Caching with Decoupled Indexing

Use-Based Register Caching with Decoupled Indexing Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

Core Fusion: Accommodating Software Diversity in Chip Multiprocessors

Core Fusion: Accommodating Software Diversity in Chip Multiprocessors Core Fusion: Accommodating Software Diversity in Chip Multiprocessors Engin İpek Meyrem Kırman Nevin Kırman José F. Martínez Computer Systems Laboratory Cornell University Ithaca, NY 4853 USA Submitted

More information

Memory Ordering: A Value-Based Approach

Memory Ordering: A Value-Based Approach Memory Ordering: A Value-Based Approach Harold W. Cain Computer Sciences Dept. Univ. of Wisconsin-Madison cain@cs.wisc.edu Mikko H. Lipasti Dept. of Elec. and Comp. Engr. Univ. of Wisconsin-Madison mikko@engr.wisc.edu

More information

BOLT: Energy-Efficient Out-of-Order Latency-Tolerant Execution

BOLT: Energy-Efficient Out-of-Order Latency-Tolerant Execution Appears in Proceedings of HPCA-16 (2010) BOLT: Energy-Efficient Out-of-Order Latency-Tolerant Execution Andrew Hilton and Amir Roth Department of Computer and Information Science, University of Pennsylvania

More information

An Architectural Characterization Study of Data Mining and Bioinformatics Workloads

An Architectural Characterization Study of Data Mining and Bioinformatics Workloads An Architectural Characterization Study of Data Mining and Bioinformatics Workloads Berkin Ozisikyilmaz Ramanathan Narayanan Gokhan Memik Alok ChoudharyC Department of Electrical Engineering and Computer

More information

Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers

Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers Joseph J. Sharkey, Dmitry V. Ponomarev Department of Computer Science State University of New York Binghamton, NY 13902 USA

More information

PowerPC 620 Case Study

PowerPC 620 Case Study Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance

More information

Dynamic Speculative Precomputation

Dynamic Speculative Precomputation In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department

More information

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs University of Maryland Technical Report UMIACS-TR-2008-13 Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs Wanli Liu and Donald Yeung Department of Electrical and Computer Engineering

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

Power-Efficient Approaches to Reliability. Abstract

Power-Efficient Approaches to Reliability. Abstract Power-Efficient Approaches to Reliability Niti Madan, Rajeev Balasubramonian UUCS-05-010 School of Computing University of Utah Salt Lake City, UT 84112 USA December 2, 2005 Abstract Radiation-induced

More information

MEMORY ORDERING: A VALUE-BASED APPROACH

MEMORY ORDERING: A VALUE-BASED APPROACH MEMORY ORDERING: A VALUE-BASED APPROACH VALUE-BASED REPLAY ELIMINATES THE NEED FOR CONTENT-ADDRESSABLE MEMORIES IN THE LOAD QUEUE, REMOVING ONE BARRIER TO SCALABLE OUT- OF-ORDER INSTRUCTION WINDOWS. INSTEAD,

More information

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn A Cross-Architectural Interface for Code Cache Manipulation Kim Hazelwood and Robert Cohn Software-Managed Code Caches Software-managed code caches store transformed code at run time to amortize overhead

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

Cluster Assignment Strategies for a Clustered Trace Cache Processor

Cluster Assignment Strategies for a Clustered Trace Cache Processor Cluster Assignment Strategies for a Clustered Trace Cache Processor Ravi Bhargava and Lizy K. John Technical Report TR-033103-01 Laboratory for Computer Architecture The University of Texas at Austin Austin,

More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

Accelerating and Adapting Precomputation Threads for Efficient Prefetching

Accelerating and Adapting Precomputation Threads for Efficient Prefetching In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA 2007). Accelerating and Adapting Precomputation Threads for Efficient Prefetching Weifeng Zhang Dean M.

More information

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors : Optimizations for Improving Fetch Bandwidth of Future Itanium Processors Marsha Eng, Hong Wang, Perry Wang Alex Ramirez, Jim Fung, and John Shen Overview Applications of for Itanium Improving fetch bandwidth

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths yesoon Kim José. Joao Onur Mutlu Yale N. Patt igh Performance Systems Group

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

An Optimized Front-End Physical Register File with Banking and Writeback Filtering

An Optimized Front-End Physical Register File with Banking and Writeback Filtering An Optimized Front-End Physical Register File with Banking and Writeback Filtering Miquel Pericàs, Ruben Gonzalez, Adrian Cristal, Alex Veidenbaum and Mateo Valero Technical University of Catalonia, University

More information

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Noname manuscript No. (will be inserted by the editor) The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Karthik T. Sundararajan Timothy M. Jones Nigel P. Topham Received:

More information