Pentium Pro Case Study ECE/CS 752 Fall 2017

Size: px
Start display at page:

Download "Pentium Pro Case Study ECE/CS 752 Fall 2017"

Transcription

1 Pentium Pro Case Study ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti

2 Pentium Pro Case Study Microarchitecture Order 3 Superscalar Out of Order execution Speculative execution In order completion Design Methodology Performance Analysis Retrospective

3 Goals of P6 Microarchitecture IA-32 Compliant Performance (Frequency - IPC) Validation Die Size Schedule Power

4 BTB/ICU Fetch 4 Cycles P6 The Big Picture BAC/Rename Decode 2 Cycles Allocation Dispatch 2 Cycles Reservation Station (20) Cycles ROB (40x157) RRF 2 cyc AGU1 AGU0 MOB DCU IEU1 JEU IEU0 Fadd Fmul Imul Div

5 Memory Hierarchy Main Memory PCI 64 bit ICache (8KB) CPU BIU DCache (8Kb) 16 bytes L2 Cache (256Kb) Level 1 instruction and data caches 2 cycle access time Level 2 unified cache 6 cycle access time Separate level 2 cache and memory address/data bus

6 Inst. Rotator Inst. Buf Instruction Fetch L2 Cache (256Kb) Stream Buffer ICache (8Kb) Victim Cache Inst. Inst. Length Decoder Length Marks Prediction Marks Instruction TLB Physical Addr. Fetch Address Other Fetch Requests Next Addr. Logic Branch Target Buffer (512) To Decode Prediction Instruction Data Mux Instruction Data 16 bytes 16 bytes + marks 2 cycle Branch Target

7 Stream Buffer ICache (8 Kb) Instruction Cache Unit Bus Interface Unit Victim Cache Lower 12 bits Upper 20 bits Fetch Address Lower 12 bits Data Mux Instruction Tag Array Instruction Data Hit/Miss ITLB

8 Branch Target Buffer Way 0 Way 1 Way 3 PHT 128 Sets Fetch Addr. Tag Br. Offset 4-bit BHR 4-bit BHR spec. Target Addr. Fetch Addr. Tag Br. Offset 4-bit BHR 4-bit BHR spec. Target Addr. Fetch Addr. Tag Br. Offset 4-bit BHR 4-bit BHR spec. Target Addr. 16 entries/set Tag Compare Fetch Address Return Stack Prediction Control Logic Prediction & Target Addr. Pattern History Table (PHT) is not speculatively updated A speculative Branch History Register (BHR) and prediction state is maintained Uses speculative prediction state if it exist for that branch

9 Br. History Speculative History Branch Prediction Algorithm 0010 Br. Pred. Current prediction updates the speculative history prior to the next instance of the branch instruction Branch History Register (BHR) is updated during branch execution Branch recovery flushes front end and drains the execution core 1 Branch Execution 0101 State Machine 10 Spec. Pred Pattern Table 1 0 Branch mis prediction resets the speculative branch history state to match BHR

10 Instruction Decode 1 Macro-Instruction Bytes from IFU Instruction Buffer 16 bytes To Next Address Calc. urom Decoder 0 Decoder 1 Decoder 2 4 uops 1 uop 1 uop Branch Address Calc. uop Queue (6) Up to 3 uops Issued to dispatch Branch instruction detection Branch address calculation Static prediction and branch always execution One branch decode per cycle (break on branch)

11 Instruction Decode 2 Macro-Instruction Bytes from IFU Instruction Buffer 16 bytes To Next Address Calc. urom Decoder 0 Decoder 1 Decoder 2 4 uops 1 uop 1 uop Branch Address Calc. uop Queue (6) Up to 3 uops Issued to dispatch Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re filled Macro instructions must shift from decoder 2 to decoder 1 to decoder 0

12 What is a uop? Small two-operand instruction - Very RISC like. IA-32 instruction add (eax),(ebx) MEM(eax) <- MEM(eax) + MEM(ebx) Uop decomposition: ld guop0, (eax) guop0 <- MEM(eax) ld guop1, (ebx) guop1 <- MEM(ebx) add guop0,guop1 guop0 <- guop0 + guop1 sta eax std guop0 MEM(eax) <- guop0

13 Instruction Dispatch uop Queue (6) Renaming Dispatch Buffer (3) Mux Logic Retirement Info To Reservation Station Allocator 2 cycles Register Renaming Allocation requirements 3 or none Reorder buffer entries Reservation station entry Load buffer or store buffer entry Dispatch buffer probably dispatches all 3 uops before re fill

14 Real Register File (RRF) EAX EBX 8 ECX FST0 FST1 GuoP0 GuoP1 IuoP(0-3) CC/Events Register Renaming 1 Similar to Tomasulo s Algorithm Uses ROB entry number as tags The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register Execution results are stored in the ROB Integer RAT EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 FST7 Reorder Buffer (ROB)

15 Real Register File (RRF) EAX EBX 8 ECX FST0 FST1 GuoP0 GuoP1 IuoP(0-3) CC/Events Register Renaming Example Dispatching: add eax, ebx add eax, ecx fxch f0, f1 Integer RAT EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 FST7 Reorder Buffer (ROB) Comp Alloc Completing: sub eax, ecx sub Shen, Lipasti

16 Challenges to Register Renaming Integer RAT Real Register File (RRF) EAX EBX 8 ECX FST0 FST1 GuoP0 GuoP1 IuoP(0-3) CC/Events 8-bit code mov mov add add AL, #data1 AH, #data2 AL, #data3 AL, #data4 EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 FST7 Reorder Buffer (ROB) Byte addressable registers 39

17 RS bypass Out of Order Execution Engine Reservation Station (20) Cycles AGU1 AGU0 IEU1 IEU0 MOB JEU Fadd Fmul DCU (8Kb) Imul Div In order branch issue and execution In order load/store issue to address generation units Instruction execution and result bus scheduling Is the reservation station truly centralized & what is binding?

18 Reservation Station From Dispatch Queue Port 4 Port 3 Port 2 Port 1 Port 0 Cycle 1 Cycle 2 To Execution Units Cycle 1 Order checking Operand availability Cycle 2 Writeback bus scheduling

19 Memory Ordering Buffer (MOB) AGU0 AGU1 R/S Load Buffer (16) Conflict Logic Store Address Buffer (12) Store Data Buffer (12) 2 cycle Data Cache Unit (8Kb) MOB Bypass Control Logic 2 cycle Load Data Result Load buffer retains loads until completed, for coherency checking Store forwarding out of store buffers 2 cycle latency through MOB Store Coloring Load instructions are tagged by the last store

20 Instruction Completion Handles all exception/interrupt/trap conditions Handles branch recovery OOO core drains out right path instructions, commits to RRF In parallel, front end starts fetching from target/fall through However, no renaming is allowed until OOO core is drained After draining is done, RAT is reset to point to RRF Avoids checkpointing RAT, recovering to intermediate RAT state Commits execution results to the architectural state in order Retirement Register File (RRF) Must handle hazards to RRF (writes/reads in same cycle) Must handle hazards to RAT (writes/reads in same cycle) Atomic IA 32 instruction completion uops are marked as 1st or last in sequence exception/interrupt/trap boundary 2 cycle retirement

21 Pentium Pro Design Methodology 1 Performance Model Start 1990 Finish 4Q 1994 Microarchitecture 1 year Conception 1 year Silicon Debug Behavioral Model 0.75 years Structural Model Logic Design 1.75 years Circuit Design Layout

22 Observability Pentium Pro Performance Analysis On chip event counters Dynamic analysis Benchmark Suite BAPco Sysmark32 32 bit Windows NT applications Winstone97 32 bit Windows NT applications Some SPEC95 benchmarks

23 Performance Run Times User-Mode Processor Cycles Unhalted User-Mode Processor Cycles 1.2E+11 1E+11 Total of 27.5 billion cycles 8E+10 6E+10 4E+10 2E+10 0 avs li go Pwave MSVC Corel excel word paradox WordPro PowerPnt SYSexcel PhotoShop PageMaker SYSword7 SYSflw96 compress SYSCorel6 SYSexcel7 microstation SYSpowerpnt7 SYSParadox7 SYSpagemaker6

24 Performance IPC vs. upc 1.6 Instructions and and UOPs Retired Uops Per Cycle retired per cycle Inst retired UOPS retired 2 uops/instruction # Retired 0.4 Per Cycle WS-AVS SPEC-Li WS-Excel WS-Word SPEC-Go WS-PWave SYS-Excel SYS-Word SYS-FlW96 SYS-Excel WS-MSVC++ WS-Paradox WS-WordPro SYS-Paradox WS-PhotoShop WS-PowerPnt WS-Corel Draw WS-PageMaker SYS-PowerPnt SYS-CorelDraw SYS-Pagemaker SPEC-Compress WS-Microstation

25 Performance IPC vs. upc 100 Inst. Inst. or UOPS Uops retired retired per cycle per cycle AVS PageMaker Paradox PowerPoint WordPro Instructions were Retired 20 Percent 10 of Cycles where n=0-3 UOPs or 0 3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst. 2 Inst. 1 Inst. 0 Inst.

26 Performance Cache Misses 2.5 Cache Cache Misses Per Cycle Per Cycle IFU Misses / Cycle DCU Misses / Cycle L2 Misses / Cycle IL1-0.51% DL1-1.15% L2-0.25% 1 % Unhalted Cycles WS-AVS SPEC-Li WS-Excel WS-Word WS-PWave SYS-ExcelSYS-Word SYS-FlW96 SYS-Excel WS-MSVC++ WS-Paradox WS-PowerPnt WS-WordPro SYS-Paradox WS-PhotoShop WS-CorelDraw WS-Microstation WS-PageMaker SYS-PowerPnt SYS-CorelDraw SYS-Pagemaker SPEC-Compress SPEC-Go

27 Performance Branch Prediction 16% Rate Branch Mispredict Rate 14% 12% 6.8% avg 10% 8% 6% 4% % Branches Mispredicted 2% 0% 45% WS-AVS SPEC-Li WS -Excel WS-Word SYS-Excel SYS-Word SYS-Excel SPEC-Go WS-PWave SYS-FlW96 WS-MSVC++ WS-Paradox WS-WordPro WS-PowerPnt SYS-Paradox WS-PhotoShop WS-CorelDraw SYS-PowerPnt WS-Microstation WS-PageMa ker SYS- CorelDraw SYS-Pagemaker SPEC-Compress Rate BTB Miss Rate 40% 35% 30% 25% 20% 15% 10% % Branches that Miss in BTB 5% 0% WS-AVS SPEC-Li WS-Excel WS-Word SYS-Excel SYS-Word SYS-Excel SPEC-Go WS-PWave SYS-FlW96 WS -MS VC++ WS-Paradox WS- PowerPnt WS-WordPro SYS-Paradox WS-P hotows-coreldraw Shop SYS-PowerPnt WS-Microstation WS-PageMaker SYS-CorelDraw SYS-Pagemaker SPEC-Compress

28 IA 32 Compliant Conclusions Performance (Frequency IPC) ISpec FSpec SPECint SPECfp95 Validation Die Size Fabable Schedule 1 year late Power

29 Retrospective Most commercially successful microarchitecture in history Evolution Pentium II/III, Xeon, etc. Derivatives with on chip L2, ISA extensions, etc. Replaced by Pentium 4 as flagship in 2001 High frequency, deep pipeline, extreme speculation Resurfaced as Pentium M in 2003 Initially a response to Transmeta in laptop market Pentium 4 derivative (90nm Prescott) delayed, slow, hot Core Duo, Core 2 Duo, Core i7 replaced Pentium 4 Shen, Lipasti 29

30 Microarchitectural Updates Pentium M (Banias), Core Duo (Yonah) Micro op fusion (also in AMD K7/K8) Multiple uops in one: (add eax,[mem] => ld/alu), sta/std These uops decode/dispatch/commit once, issue twice Better branch prediction Loop count predictor Indirect branch predictor Slightly deeper pipeline (12 stages) Extra decode stage for micro op fusion Extra stage between issue and execute (for RS/PLRAM read) Data capture reservation station (payload RAM) Clock gated for 32 (int), 64 (fp), and 128 (SSE) operands Shen, Lipasti 30

31 Microarchitectural Updates Core 2 Duo (Merom) 64 bit ISA from AMD K8 Macro op fusion Merge uops from two x86 ops E.g. cmp, jne => cmpjne 4 wide decoder (Complex + 3x Simple) Peak x86 decode throughput is 5 due to macro op fusion Loop buffer Loops that fit in 18 entry instruction queue avoid fetch/decode overhead Even deeper pipeline (14 stages) Larger reservation station (32), instruction window (96) Memory dependence prediction Shen, Lipasti 31

32 Microarchitectural Updates Nehalem (Core i7/i5/i3) RS size 36, ROB 128 Loop cache up to 28 uops L2 branch predictor L2 TLB I$ and D$ now 32K, L2 back to 256K, inclusive L3 up to 8M Simultaneous multithreading RAS now renamed (repaired) 6 issue, 48 load buffers, 32 store buffers New system interface (QPI) finally dropped front side bus Integrated memory controller (up to 3 channels) New STTNI instructions for string/text handling Shen, Lipasti 32

33 Microarchitectural Updates Sandybridge/Ivy Bridge (2 nd 3 rd generation Core i7) On chip integrated graphics (GPU) Decoded uop cache up to 1.5K uops, handles loops, but more general 54 entry RS, 168 entry ROB Physical register file: 144 FP, 160 integer 256 bit AVX units: 8 DPFLOP/cycle, 16 SPFLOP/cycle 2 general AGUs enable 2 ld/cycle, 2 st/cycle or any combination, 2x128 bit load path from L1 D$ Shen, Lipasti 33

34 Microarchitectural Updates Haswell/Broadwell/Skylake: wider & deeper 8 wide issue (up from 6 wide) 4 th integer ALU, third AGU, second branch unit 60 entry RS, 192 entry ROB 72 entry load queue/42 entry store queue Physical register file: 168 FP, 168 integer Doubled FP throughput (32 SP/16 DP) Load/store bandwidth to L1 doubled (64B/32B) TSX (transactional memory) Integrated voltage regulator Shen, Lipasti 34

How to write powerful parallel Applications

How to write powerful parallel Applications How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome

More information

The Pentium II/III Processor Compiler on a Chip

The Pentium II/III Processor Compiler on a Chip The Pentium II/III Processor Compiler on a Chip Ronny Ronen Senior Principal Engineer Director of Architecture Research Intel Labs - Haifa Intel Corporation Tel Aviv University January 20, 2004 1 Agenda

More information

Computer Architecture. Out-of-order Execution

Computer Architecture. Out-of-order Execution Computer Architecture Out-of-order Execution By Yoav Etsion With acknowledgement to Dan Tsafrir, Avi Mendelson, Lihu Rappoport, and Adi Yoaz 1 The need for speed: Superscalar Remember our goal: minimize

More information

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers

More information

ECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison

ECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based on notes by John P. Shen Limitations of Scalar Pipelines

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

Pentium IV-XEON. Computer architectures M

Pentium IV-XEON. Computer architectures M Pentium IV-XEON Computer architectures M 1 Pentium IV block scheme 4 32 bytes parallel Four access ports to the EU 2 Pentium IV block scheme Address Generation Unit BTB Branch Target Buffer I-TLB Instruction

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 9: Limits of ILP, Case Studies Lecture Outline Speculative Execution Implementing Precise Interrupts

More information

ECE/CS 552: Introduction to Superscalar Processors

ECE/CS 552: Introduction to Superscalar Processors ECE/CS 552: Introduction to Superscalar Processors Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Limitations of Scalar Pipelines

More information

Superscalar Processor

Superscalar Processor Superscalar Processor Design Superscalar Architecture Virendra Singh Indian Institute of Science Bangalore virendra@computer.orgorg Lecture 20 SE-273: Processor Design Superscalar Pipelines IF ID RD ALU

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in

More information

Wide Instruction Fetch

Wide Instruction Fetch Wide Instruction Fetch Fall 2007 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs470 edu/courses/eecs470 block_ids Trace Table pre-collapse trace_id History Br. Hash hist. Rename Fill Table

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Pentium 4 Processor Block Diagram

Pentium 4 Processor Block Diagram FP FP Pentium 4 Processor Block Diagram FP move FP store FMul FAdd MMX SSE 3.2 GB/s 3.2 GB/s L D-Cache and D-TLB Store Load edulers Integer Integer & I-TLB ucode Netburst TM Micro-architecture Pipeline

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office

More information

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog) Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis

More information

Intel Architecture for Software Developers

Intel Architecture for Software Developers Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

The P6 Architecture: Background Information for Developers

The P6 Architecture: Background Information for Developers The P6 Architecture: Background Information for Developers 1995, Intel Corporation P6 CPU Overview CPU Dynamic Execution 133MHz core frequency 8K L1 caches APIC AA AA A AA A A A AA AA AA AA A A AA AA A

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u

More information

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste

More information

Page 1. Review: Dynamic Branch Prediction. Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400)

Page 1. Review: Dynamic Branch Prediction. Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400) CS252 Graduate Computer Architecture Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400) April 4, 2001 Prof. David A. Patterson Computer Science 252 Spring 2001 Lec

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

BOBCAT: AMD S LOW-POWER X86 PROCESSOR ARCHITECTURES FOR MULTIMEDIA SYSTEMS PROF. CRISTINA SILVANO LOW-POWER X86 20/06/2011 AMD Bobcat Small, Efficient, Low Power x86 core Excellent Performance Synthesizable with smaller number of custom arrays

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Jim Keller. Digital Equipment Corp. Hudson MA

Jim Keller. Digital Equipment Corp. Hudson MA Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Instruction Commit The End of the Road (um Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable Only actions following

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II CS252 Spring 2017 Graduate Computer Architecture Lecture 8: Advanced Out-of-Order Superscalar Designs Part II Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Computer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013

Computer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013 18-447 Computer Architecture Lecture 13: State Maintenance and Recovery Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information

Limitations of Scalar Pipelines

Limitations of Scalar Pipelines Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline

More information

Datapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale

Datapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale Datapoint 2200 IA-32 Nicholas FitzRoy-Dale At the forefront of the computer revolution - Intel Difficult to explain and impossible to love - Hennessy and Patterson! Released 1970! 2K shift register main

More information

Case Study IBM PowerPC 620

Case Study IBM PowerPC 620 Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,

More information

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer

More information

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Superscalar Processors Ch 14

Superscalar Processors Ch 14 Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency? Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

Register Data Flow. ECE/CS 752 Fall Prof. Mikko H. Lipasti University of Wisconsin-Madison

Register Data Flow. ECE/CS 752 Fall Prof. Mikko H. Lipasti University of Wisconsin-Madison Register Data Flow ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin-Madison Register Data Flow Techniques Register Data Flow Resolving Anti-dependences Resolving Output Dependences Resolving

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

Precise Exceptions and Out-of-Order Execution. Samira Khan

Precise Exceptions and Out-of-Order Execution. Samira Khan Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all instructions take the same amount of time for execution Idea: Have multiple different functional units that take

More information

Advanced Processor Architecture

Advanced Processor Architecture Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Effectiveness and Limitations of Embedded Counter Based Performance Analysis. Carey K. Kloss

Effectiveness and Limitations of Embedded Counter Based Performance Analysis. Carey K. Kloss CARNEGIE Department of Electrical MELLON and Computer Engineering~ Effectiveness and Limitations of Embedded Counter Based Performance Analysis Carey K. Kloss 1997 Effectiveness and Limitations of Embedded

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Pipelining to Superscalar

Pipelining to Superscalar Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel

More information

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L11: Speculative Execution I Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab3 due today 2 1 Overview Branch penalties limit performance

More information

CS 152, Spring 2012 Section 8

CS 152, Spring 2012 Section 8 CS 152, Spring 2012 Section 8 Christopher Celio University of California, Berkeley Agenda More Out- of- Order Intel Core 2 Duo (Penryn) Vs. NVidia GTX 280 Intel Core 2 Duo (Penryn) dual- core 2007+ 45nm

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 19 Advanced Processors III 2006-11-2 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

CISC, RISC and post-risc architectures

CISC, RISC and post-risc architectures Microprocessor architecture and instruction execution CISC, RISC and Post-RISC architectures Instruction Set Architecture and microarchitecture Instruction encoding and machine instructions Pipelined and

More information

CS 152, Spring 2011 Section 8

CS 152, Spring 2011 Section 8 CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 22 Advanced Processors III 2005-4-12 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Ted Hong and David Marquardt www-inst.eecs.berkeley.edu/~cs152/

More information

XT Node Architecture

XT Node Architecture XT Node Architecture Let s Review: Dual Core v. Quad Core Core Dual Core 2.6Ghz clock frequency SSE SIMD FPU (2flops/cycle = 5.2GF peak) Cache Hierarchy L1 Dcache/Icache: 64k/core L2 D/I cache: 1M/core

More information

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types

More information

Superscalar Organization

Superscalar Organization Superscalar Organization ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin-Madison Stage Phase Function performed CPU, circa 1986 IF φ 1 Translate virtual instr. addr. using TLB φ 2 Access

More information

" # " $ % & ' ( ) * + $ " % '* + * ' "

 #  $ % & ' ( ) * + $  % '* + * ' ! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

PowerPC 620 Case Study

PowerPC 620 Case Study Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance

More information

EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture

EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture Prof. Sherief Reda School of Engineering Brown University Material from: Mostly from Modern Processor Design by Shen and

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information