José F. Martínez 1, Jose Renau 2, Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2
CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING
José F. Martínez 1, Jose Renau 2, Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2
MOTIVATION
Problem: Limited processor resources
Goal: More efficient use through aggressive recycling
Opportunity: Resources are reserved until retirement
Solution: Decouple recycling from retirement
Cherry: Checkpointed Early Resource Recycling in Out-of-order Microprocessors
CURRENT PROCESSORS
[Figure: conventional ROB from head (oldest) to tail (newest); in-flight instructions (Add, ST) occupy LDQ entries, STQ entries, and registers.]
PROPOSAL: EARLY RECYCLING
Decouple resource recycling from instruction retirement
Recycle resources when they are no longer needed
Targeted resources:
  Load queue entries
  Store queue entries
  Physical registers
Potential when targeted resources are unlimited:
  1.12 speedup in SPECint2000
  [value missing] speedup in SPECfp2000
EARLY RECYCLING
Decouple resource recycling from retirement
[Figure: conventional ROB from head (oldest) to tail (newest) holding Add and ST instructions, with their LDQ entries, STQ entries, and registers.]
CONSTRAINT: PRECISE STATE
Events that require precise state:
  Branch mispredictions
  Memory replay traps
  Exceptions
  Interrupts
Early recycling makes this difficult:
  E.g. recycled registers are no longer available
  E.g. memory may be overwritten prematurely
PROPOSAL: POINT OF NO RETURN
Classify events according to frequency:
  Frequent: branch mispredictions, memory replays
  Infrequent: exceptions
Split ROB into two logical sets of instructions:
  Irreversible set:
    Subject to infrequent events only
    Checkpoint recovery
  Reversible set:
    Subject to frequent and infrequent events
    Conventional ROB recovery mechanisms
Boundary: Point of No Return (PNR)
CONVENTIONAL VS CHERRY ROB
[Figure: the conventional ROB is entirely Reversible up to its head (oldest); the Cherry ROB is split at the Point of No Return into an Irreversible set next to the head (oldest) and a Reversible set.]
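The split between the two sets can be sketched in a few lines. This is only an illustration: `RobEntry`, `split_rob`, and the sequence numbers are hypothetical names and values, and the sketch assumes the PNR points at the oldest instruction that is still reversible.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    seq: int  # program-order sequence number; smaller means older
    op: str   # mnemonic, used only for display


def split_rob(rob, pnr_seq):
    """Split ROB entries at the Point of No Return.

    Assumption: entries strictly older than pnr_seq form the
    Irreversible set; the entry at pnr_seq and everything younger
    form the Reversible set.
    """
    irreversible = [e for e in rob if e.seq < pnr_seq]
    reversible = [e for e in rob if e.seq >= pnr_seq]
    return irreversible, reversible


# Hypothetical ROB contents, oldest first.
rob = [RobEntry(10, "add"), RobEntry(11, "ld"),
       RobEntry(12, "br"), RobEntry(13, "st")]
irreversible, reversible = split_rob(rob, pnr_seq=12)
print([e.op for e in irreversible], [e.op for e in reversible])
```

In a conventional ROB the same function would always be called with `pnr_seq` equal to the head, so the Irreversible set would be empty.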
OUTLINE
Motivation
Overview
Implementation
  Checkpointing
  Exception and interrupt handling
  Early resource recycling
Results
TIMELINE
[Figure: execution timeline with the labels Conventional ROB, Exception, Rollback, Re-exec, and Conventional ROB.]
ENTERING CHERRY MODE
Backup architectural registers
  Can be done in a handful of cycles
Allow PNR to race ahead of ROB head
Updates to cache hierarchy set Volatile bit
PERIODIC CHECKPOINTING
Done regularly to bound re-execution overhead
Stop early recycling to create checkpoint (collapse):
  Freeze PNR (stop recycling)
  Once ROB head catches up with PNR:
    Clear Volatile bits from cache
    Backup architectural registers
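The collapse sequence can be sketched as a toy state machine. Everything here is an assumption made for illustration: `ToyCore`, its fields, and the one-instruction-per-step retirement are hypothetical and far simpler than real hardware.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCore:
    head: int = 0                 # seq of the oldest un-retired instruction
    pnr: int = 0                  # seq at the Point of No Return
    pnr_frozen: bool = False
    arch_regs: dict = field(default_factory=dict)
    volatile_lines: set = field(default_factory=set)
    checkpoint: dict = field(default_factory=dict)

    def begin_collapse(self):
        # Step 1: freeze the PNR, which stops early recycling.
        self.pnr_frozen = True

    def retire_one(self):
        # Toy retirement: the head advances by one instruction per call.
        if self.head < self.pnr:
            self.head += 1

    def try_finish_collapse(self):
        # Step 2: only once the ROB head has caught up with the frozen PNR,
        # clear the Volatile bits and back up the architectural registers.
        if not (self.pnr_frozen and self.head == self.pnr):
            return False
        self.volatile_lines.clear()
        self.checkpoint = dict(self.arch_regs)
        self.pnr_frozen = False   # recycling may resume
        return True


core = ToyCore(head=3, pnr=6, arch_regs={"r1": 7}, volatile_lines={0x40, 0x80})
core.begin_collapse()
steps = 0
while not core.try_finish_collapse():
    core.retire_one()
    steps += 1
print(steps, core.checkpoint, core.volatile_lines)
```

The number of steps spent waiting for the head to reach the PNR is the collapse cost in this toy model.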
PRECISE EXCEPTION HANDLING
Exception in Irreversible set:
  Roll back to checkpoint
  Temporarily disable recycling (for some time)
  Re-execute in conventional OOO mode (w/o Cherry)
  Allow exception to re-occur
[Figure: an exception raised in the Irreversible set, between the head (oldest) and the Point of No Return; state is restored from the checkpoint, and the timeline shows Rollback followed by Re-exec with the conventional ROB.]
PRECISE EXCEPTION HANDLING
Exception in Reversible set:
  Disable resource recycling (freeze PNR)
  Process exception
  Take new checkpoint; re-enter Cherry mode
[Figure: an exception raised in the Reversible set; the PNR stays frozen until the head reaches it (Point of No Return & Head), leaving only a Reversible set ahead of the head (oldest).]
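The two exception cases can be sketched as a single dispatch function. This is a simplified illustration: `handle_exception` and the sequence numbers are hypothetical, a single PNR is assumed, and the returned strings simply restate the steps on the slides.

```python
def handle_exception(exc_seq, pnr_seq):
    """Return the recovery steps for an exception at sequence number exc_seq.

    Assumption: instructions strictly older than pnr_seq are in the
    Irreversible set; all others are in the Reversible set.
    """
    if exc_seq < pnr_seq:
        # Resources may already have been recycled, so precise state
        # can only be rebuilt from the checkpoint.
        return [
            "roll back to checkpoint",
            "temporarily disable recycling",
            "re-execute in conventional OOO mode",
            "allow exception to re-occur",
        ]
    # Conventional ROB recovery still works for this instruction.
    return [
        "disable resource recycling (freeze PNR)",
        "process exception",
        "take new checkpoint",
        "re-enter Cherry mode",
    ]


irreversible_case = handle_exception(exc_seq=4, pnr_seq=9)
reversible_case = handle_exception(exc_seq=12, pnr_seq=9)
print(irreversible_case[0], "|", reversible_case[0])
```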
INTERRUPT HANDLING
Interrupts are asynchronous: delay handling
  Disable resource recycling (freeze PNR)
  Process interrupt
  Take new checkpoint and re-enter Cherry mode
[Figure: timeline showing the interrupt arrival and the later point where the interrupt is processed.]
OUTLINE
Motivation
Overview
Implementation
  Checkpointing
  Exception and interrupt handling
  Early resource recycling
Results
DEFINING THE PNRS
Each resource defines its own PNR
Based on three ROB pointers:
  U_b: oldest unresolved branch
  U_s: oldest store with unresolved address
  U_l: oldest load with unresolved address
[Figure: ROB holding branches (B), stores (S), and loads (L) between tail (newest) and head (oldest), with the pointers U_s, U_l, and U_b marked.]
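The three pointers can be sketched as a scan over the ROB. The names `Inst` and `oldest_unresolved`, the resolved flags, and the choice to return the tail when nothing is unresolved are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    seq: int        # program-order sequence number; smaller means older
    kind: str       # "branch", "load", "store", or "other"
    resolved: bool  # branch outcome known, or memory address computed


def oldest_unresolved(rob, kind, tail_seq):
    """Return the seq of the oldest unresolved instruction of `kind`.

    Assumption: if no such instruction exists, the pointer sits at the
    ROB tail (tail_seq), i.e. it constrains nothing currently in flight.
    """
    for inst in sorted(rob, key=lambda i: i.seq):
        if inst.kind == kind and not inst.resolved:
            return inst.seq
    return tail_seq


# Hypothetical ROB, oldest first: L B S L S B, echoing the slide's figure.
rob = [
    Inst(0, "load", True),
    Inst(1, "branch", False),
    Inst(2, "store", True),
    Inst(3, "load", False),
    Inst(4, "store", False),
    Inst(5, "branch", True),
]
tail_seq = 6
u_b = oldest_unresolved(rob, "branch", tail_seq)
u_l = oldest_unresolved(rob, "load", tail_seq)
u_s = oldest_unresolved(rob, "store", tail_seq)
print(u_b, u_l, u_s)
```

Real hardware would track these pointers incrementally rather than rescanning the ROB; the scan is used here only because it is easy to read.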
EARLY LDQ ENTRY RECYCLING
Constraint: Must detect memory replay traps
  older(U_s): Free of ST-LD replay trap
  Not affected by branches
PNR_LDQ = U_s
[Figure: ROB containing loads L1 and L2 and store S0, with U_s marked ahead of the head (oldest), next to the LDQ entries for L1 and L2; the store performs the ST-LD replay check.]
EARLY REGISTER RECYCLING
Constraint: Must not contain potentially useful data
  older(U_s): Free of ST-LD replay trap
  older(U_b): Free of branch misprediction
PNR_Reg = oldest(U_s, U_b)
Of course, the register must be dead (see paper)
EARLY RECYCLING SUMMARY
Resource      PNR value
LDQ entries   U_s
STQ entries   oldest(U_l, U_s, U_b)
Registers     oldest(U_s, U_b)
[Figure: ROB holding branches (B), stores (S), and loads (L) between tail (newest) and head (oldest), with the pointers U_s, U_l, U_b and the resulting PNR_LDQ, PNR_STQ, and PNR_Reg marked.]
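The summary table translates directly into code. In this sketch `oldest` is modeled as `min` over sequence numbers (smaller means older), and the pointer values are hypothetical.

```python
def oldest(*seqs):
    # With program-order sequence numbers, the oldest pointer is the smallest.
    return min(seqs)


def uniprocessor_pnrs(u_l, u_s, u_b):
    """Per-resource Point of No Return, following the summary table."""
    return {
        "LDQ": u_s,
        "STQ": oldest(u_l, u_s, u_b),
        "Reg": oldest(u_s, u_b),
    }


# Hypothetical pointer values: oldest unresolved branch at 1,
# load at 3, store at 4.
pnrs = uniprocessor_pnrs(u_l=3, u_s=4, u_b=1)
print(pnrs)
```

Because the STQ PNR is the oldest of all three pointers, it can never be younger than the register PNR, which in turn can never be younger than the LDQ PNR.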
EVALUATION
Execution-driven simulation
SPEC 2000 applications
SGI MIPSPro compiler
Compare against:
  Processor with unlimited resources
  Three aggressive base processor designs (paper)
PROCESSOR MODEL
Processor:
  Frequency: 3.2GHz
  Fetch/issue/commit width: 8/8/12
  I. window/ROB size: 128/384
  Int/FP registers: 192/128
  Ld/St units: 2/2
  Int/FP/branch units: 7/5/3
  Ld/St queue entries: 32/32
  MSHRs: 24
  Checkpoint frequency: 5µs
Branch prediction:
  Up to 1 taken branch/cycle
  RAS: 32 entries
  BTB: 4K entries, 4-way assoc.
  Branch predictor: Hybrid with speculative update
  Bimodal size: 8K entries
  Two-level size: 64K entries
Cache       L1         L2
Size        32KB       512KB
RT          2 cycles   10 cycles
Assoc       4-way      8-way
Line size   64B        128B
Ports       4          1
Bus & Memory:
  FSB frequency: 400MHz
  FSB width: 128bit
  Memory: 4-channel Rambus
  DRAM bandwidth: 6.4GB/s
  Memory RT: 120ns
OVERALL RESULTS
[Figure: two bar charts of speedup per application, including the Unlimited configuration. SPECint2000: bzip2, crafty, gcc, gzip, mcf, parser, perlbmk, vortex, vpr, Average (callout: 1.06). SPECfp2000: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, Average (callout: 1.26).]
PNR ADVANCE
PNR advance as a % of ROB use
Resource    SPECint   SPECfp
LDQ entry   77.4%     78.3%
STQ entry   19.7%     63.4%
Register    28.7%     67.7%
SUMMARY
Decoupling resource recycling from retirement:
  Split ROB into two logical sets
  Use of ROB and checkpointing simultaneously
Early recycling mechanism: LDQ, STQ, Registers
Integration with speculative multithreading (paper)
Speedup over baseline:
  1.06 for SPECint2000
  1.26 for SPECfp2000
QUESTIONS?
José F. Martínez 1, Jose Renau 2, Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2
BACKUP SLIDES
RECYCLING PNRS
Resource                       PNR value
LDQ entries (uniprocessor)     U_s
LDQ entries (multiprocessor)   oldest(U_l, U_s)
STQ entries                    oldest(U_l, U_s, U_b)
Registers (uniprocessor)       oldest(U_s, U_b)
Registers (multiprocessor)     oldest(U_l, U_s, U_b)
EARLY STQ ENTRY RECYCLING
Constraint: Must not overwrite potentially useful data
  older(U_l): Unresolved LD may need old data
  older(U_s): No older LD is replayable
  older(U_b): Free of branch misprediction
PNR_STQ = oldest(U_l, U_s, U_b)
CHECKPOINT FREQUENCY
\begin{align}
T_o &= c_k + p_e\,c_e \tag{1}\\
p_e &= \frac{T_c}{T_e} \tag{2}\\
c_e &= \frac{s\,T_c}{2} + (s-1)\,\frac{T_c}{2} \tag{3}\\
\frac{T_o}{T_c} &= \frac{c_k}{T_c} + \left(s-\frac{1}{2}\right)\frac{T_c}{T_e} \tag{4}\\
T_c &= \sqrt{\frac{c_k\,T_e}{s-\frac{1}{2}}} \tag{5}
\end{align}
CHECKPOINT SELECTION
[Figure: overhead To/Tc (%) versus checkpoint interval Tc (µs), one curve for each of Te = 200µs, 400µs, 600µs, 800µs, and 1000µs.]
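The overhead model can be evaluated numerically. This sketch assumes equations (4) and (5) read T_o/T_c = c_k/T_c + (s - 1/2)·T_c/T_e and T_c = sqrt(c_k·T_e/(s - 1/2)). The checkpoint cost (60 cycles, 18.75ns) and the time between exceptions (800µs) come from the CHERRY PARAMETERS slide; the value s = 1.1 is purely an assumption, since the slides do not state it, and the function names are hypothetical.

```python
import math


def overhead_ratio(t_c, c_k, t_e, s):
    """Relative overhead T_o / T_c for a checkpoint interval t_c (eq. 4)."""
    return c_k / t_c + (s - 0.5) * t_c / t_e


def optimal_interval(c_k, t_e, s):
    """Checkpoint interval that minimizes T_o / T_c (eq. 5)."""
    return math.sqrt(c_k * t_e / (s - 0.5))


c_k = 0.01875   # checkpoint cost in µs (60 cycles at 3.2GHz)
t_e = 800.0     # average time between exceptions in µs
s = 1.1         # ASSUMPTION: not given on the slides

t_opt = optimal_interval(c_k, t_e, s)
ratio_at_5us = overhead_ratio(5.0, c_k, t_e, s)
print(f"optimal T_c ~ {t_opt:.2f} us, overhead at 5 us ~ {100 * ratio_at_5us:.2f}%")
```

Under these assumptions the optimum lands at about 5µs, which is the checkpoint frequency listed in the parameters, with an overhead below 1% of execution time.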
SPECINT2000 PARAMETERS
[Table: one row per SPECint application (bzip2, crafty, gcc, gzip, mcf, parser, perlbmk, vortex, vpr) plus Average. Columns: Used ROB (%); Irreversible Set (% of Used ROB) for Reg, LQ, and SQ; Collapse Step (Cycles). The numeric values are not preserved.]
SPECFP2000 PARAMETERS
[Table: one row per SPECfp application (applu, apsi, art, equake, mesa, mgrid, swim, wupwise) plus Average. Columns: Used ROB (%); Irreversible Set (% of Used ROB) for Reg, LQ, and SQ; Collapse Step (Cycles). The numeric values are not preserved.]
REGISTER ACCESS TIME
[Figure: access time (ps) versus number of registers.]
OTHER CONSIDERATIONS (PAPER)
Exception on instruction older than youngest PNR: Roll back to checkpoint
Each PNR recycles at its own pace
  Not all PNRs critical, e.g. PNR_LDQ
Early register recycling: Must count in-flight consumers (Moudgill et al. 1993)
Optimal interval between checkpoints
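The consumer-counting requirement can be illustrated with a small check. The conditions below are one plausible reading of "the register must be dead" and are assumptions, not the paper's exact rules; `PhysReg`, its fields, and `can_recycle_early` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhysReg:
    produced: bool                    # producer has written its value
    pending_consumers: int            # in-flight readers that have not read yet
    superseded_by_seq: Optional[int]  # seq of the instruction redefining the
                                      # same logical register, if any


def can_recycle_early(reg, pnr_reg_seq):
    """ASSUMED deadness test for early register recycling.

    The register is treated as dead when its value has been produced,
    no in-flight consumer still needs it, and the instruction that
    redefines the logical register is older than PNR_Reg.
    """
    return (
        reg.produced
        and reg.pending_consumers == 0
        and reg.superseded_by_seq is not None
        and reg.superseded_by_seq < pnr_reg_seq
    )


dead = PhysReg(produced=True, pending_consumers=0, superseded_by_seq=7)
still_read = PhysReg(produced=True, pending_consumers=2, superseded_by_seq=7)
print(can_recycle_early(dead, pnr_reg_seq=9),
      can_recycle_early(still_read, pnr_reg_seq=9))
```

The counter would be incremented when a consumer is renamed and decremented when it reads the register, which is the bookkeeping the slide attributes to Moudgill et al.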
CHERRY PARAMETERS
Checkpoint frequency: one checkpoint every 5µs
Checkpoint overhead: 60 cycles (18.75ns)
Avg time between exceptions: 800µs (not modeled)
IGNORING U_b
[Figure: speedup per SPECint application (bzip2, crafty, gcc, gzip, mcf, parser, perlbmk, vortex, vpr) and Average, with series including Unlimited and Pass Branches.]
PERFECT BRANCH PREDICTION
[Figure: speedup per SPECint application (bzip2, crafty, gcc, gzip, mcf, parser, perlbmk, vortex, vpr) and Average, with series including Unlimited and Pass Branches.]
INTERRUPT COLLAPSING
[Figure: when an interrupt arrives, the Point of No Return stops advancing and the head (oldest) catches up until both coincide (Point of No Return & Head).]
EQUIVALENT SYSTEM
[Table: rows for LDQ entries, STQ entries, Int registers, and FP registers. Columns: Baseline, SPECint2000, SPECfp2000. The numeric values are not preserved.]
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in
More informationL1 Data Cache Decomposition for Energy Efficiency
L1 Data Cache Decomposition for Energy Efficiency Michael Huang, Joe Renau, Seung-Moon Yoo, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/flexram Objective Reduce
More informationScalable Cache Miss Handling for High Memory-Level Parallelism
Appeared in IEEE/ACM International Symposium on Microarchitecture, December 6 (MICRO 6) Scalable Cache Miss Handling for High Memory-Level Parallelism James Tuck, Luis Ceze, and Josep Torrellas University
More informationDynamically Controlled Resource Allocation in SMT Processors
Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona
More informationMicroarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors
Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering The University
More informationWish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution
Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 9: Limits of ILP, Case Studies Lecture Outline Speculative Execution Implementing Precise Interrupts
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Instruction Commit The End of the Road (um Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable Only actions following
More informationPerformance Oriented Prefetching Enhancements Using Commit Stalls
Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of
More informationUsing a Serial Cache for. Energy Efficient Instruction Fetching
Using a Serial Cache for Energy Efficient Instruction Fetching Glenn Reinman y Brad Calder z y Computer Science Department, University of California, Los Angeles z Department of Computer Science and Engineering,
More informationPaceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking
Paceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking Brian Greskamp and Josep Torrellas University of Illinois at Urbana Champaign http://iacoma.cs.uiuc.edu Abstract
More informationSpeculative Parallelization in Decoupled Look-ahead
Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer Engineering University of Rochester, Rochester, NY Motivation Single-thread
More informationInserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1
I Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL Inserting Prefetches IA-32 Execution Layer - 1 Agenda IA-32EL Brief Overview Prefetching in Loops IA-32EL Prefetching in
More informationReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard
More informationSpeculative Parallelization in Decoupled Look-ahead
International Conference on Parallel Architectures and Compilation Techniques Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer
More informationHANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding
HANDLING MEMORY OPS 9 Dynamically Scheduling Memory Ops Compilers must schedule memory ops conservatively Options for hardware: Hold loads until all prior stores execute (conservative) Execute loads as
More informationBoost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor
Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor Liqiang He Inner Mongolia University Huhhot, Inner Mongolia 010021 P.R.China liqiang@imu.edu.cn
More informationAnalysis of the TRIPS Prototype Block Predictor
Appears in the 9 nternational Symposium on Performance Analysis of Systems and Software Analysis of the TRPS Prototype Block Predictor Nitya Ranganathan oug Burger Stephen W. Keckler epartment of Computer
More informationImplicitly-Multithreaded Processors
Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract
More informationKoji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency
Lock and Unlock: A Data Management Algorithm for A Security-Aware Cache Department of Informatics, Japan Science and Technology Agency ICECS'06 1 Background (1/2) Trusted Program Malicious Program Branch
More informationSpeculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois
Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow
More informationEfficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero
Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15
More informationImplicitly-Multithreaded Processors
Appears in the Proceedings of the 30 th Annual International Symposium on Computer Architecture (ISCA) Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University
More informationExploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance
Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance M. Wasiur Rashid, Edwin J. Tan, and Michael C. Huang Department of Electrical & Computer Engineering University of Rochester
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u
More information18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012
18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 Reminder: Lab Assignments Lab Assignment 6 Implementing a more
More informationUse-Based Register Caching with Decoupled Indexing
Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register
More informationMultithreaded Value Prediction
Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single
More informationItanium 2 Processor Microarchitecture Overview
Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs
More informationCore Fusion: Accommodating Software Diversity in Chip Multiprocessors
Core Fusion: Accommodating Software Diversity in Chip Multiprocessors Engin İpek Meyrem Kırman Nevin Kırman José F. Martínez Computer Systems Laboratory Cornell University Ithaca, NY 4853 USA Submitted
More informationMemory Ordering: A Value-Based Approach
Memory Ordering: A Value-Based Approach Harold W. Cain Computer Sciences Dept. Univ. of Wisconsin-Madison cain@cs.wisc.edu Mikko H. Lipasti Dept. of Elec. and Comp. Engr. Univ. of Wisconsin-Madison mikko@engr.wisc.edu
More informationBOLT: Energy-Efficient Out-of-Order Latency-Tolerant Execution
Appears in Proceedings of HPCA-16 (2010) BOLT: Energy-Efficient Out-of-Order Latency-Tolerant Execution Andrew Hilton and Amir Roth Department of Computer and Information Science, University of Pennsylvania
More informationAn Architectural Characterization Study of Data Mining and Bioinformatics Workloads
An Architectural Characterization Study of Data Mining and Bioinformatics Workloads Berkin Ozisikyilmaz Ramanathan Narayanan Gokhan Memik Alok ChoudharyC Department of Electrical Engineering and Computer
More informationInstruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers
Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers Joseph J. Sharkey, Dmitry V. Ponomarev Department of Computer Science State University of New York Binghamton, NY 13902 USA
More informationPowerPC 620 Case Study
Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance
More informationDynamic Speculative Precomputation
In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department
More informationProbabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs
University of Maryland Technical Report UMIACS-TR-2008-13 Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs Wanli Liu and Donald Yeung Department of Electrical and Computer Engineering
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationPower-Efficient Approaches to Reliability. Abstract
Power-Efficient Approaches to Reliability Niti Madan, Rajeev Balasubramonian UUCS-05-010 School of Computing University of Utah Salt Lake City, UT 84112 USA December 2, 2005 Abstract Radiation-induced
More informationMEMORY ORDERING: A VALUE-BASED APPROACH
MEMORY ORDERING: A VALUE-BASED APPROACH VALUE-BASED REPLAY ELIMINATES THE NEED FOR CONTENT-ADDRESSABLE MEMORIES IN THE LOAD QUEUE, REMOVING ONE BARRIER TO SCALABLE OUT- OF-ORDER INSTRUCTION WINDOWS. INSTEAD,
More informationA Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn
A Cross-Architectural Interface for Code Cache Manipulation Kim Hazelwood and Robert Cohn Software-Managed Code Caches Software-managed code caches store transformed code at run time to amortize overhead
More informationFeedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers
Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research
More informationCluster Assignment Strategies for a Clustered Trace Cache Processor
Cluster Assignment Strategies for a Clustered Trace Cache Processor Ravi Bhargava and Lizy K. John Technical Report TR-033103-01 Laboratory for Computer Architecture The University of Texas at Austin Austin,
More informationAries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX
Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554
More informationAccelerating and Adapting Precomputation Threads for Efficient Prefetching
In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA 2007). Accelerating and Adapting Precomputation Threads for Efficient Prefetching Weifeng Zhang Dean M.
More informationMesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors
: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors Marsha Eng, Hong Wang, Perry Wang Alex Ramirez, Jim Fung, and John Shen Overview Applications of for Itanium Improving fetch bandwidth
More informationHigh Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas
Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths yesoon Kim José. Joao Onur Mutlu Yale N. Patt igh Performance Systems Group
More informationSpeculative Multithreaded Processors
Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads
More informationAn Optimized Front-End Physical Register File with Banking and Writeback Filtering
An Optimized Front-End Physical Register File with Banking and Writeback Filtering Miquel Pericàs, Ruben Gonzalez, Adrian Cristal, Alex Veidenbaum and Mateo Valero Technical University of Catalonia, University
More informationThe Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation
Noname manuscript No. (will be inserted by the editor) The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Karthik T. Sundararajan Timothy M. Jones Nigel P. Topham Received:
More information