EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Similar documents
EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

The Processor: Instruction-Level Parallelism

Processor (IV) - advanced ILP. Hwansoo Han

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

COMPUTER ORGANIZATION AND DESI

EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture

Superscalar Processors Ch 14

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Real instruction set architectures. Part 2: a representative sample

Superscalar Processors

EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture Prof. Sherief Reda School of Engineering Brown University

Advanced Instruction-Level Parallelism

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

E0-243: Computer Architecture

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Intel released new technology call P6P

Advanced Computer Architecture

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

Pentium IV-XEON. Computer architectures M

Datapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Dynamic Scheduling. CSE471 Susan Eggers 1

The Pentium II/III Processor Compiler on a Chip

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H

Instruction Level Parallelism

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)

Handout 2 ILP: Part B

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

EC 513 Computer Architecture

5008: Computer Architecture

Super Scalar. Kalyan Basu March 21,

EN164: Design of Computing Systems Topic 03: Instruction Set Architecture Design

Chapter 4 The Processor (Part 4)

Microarchitecture Overview. Performance

Getting CPI under 1: Outline

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW

IF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4

Pipeline issues. Pipeline hazard: RaW. Pipeline hazard: RaW. Calcolatori Elettronici e Sistemi Operativi. Hazards. Data hazard.

CMSC 611: Advanced Computer Architecture

Lecture-13 (ROB and Multi-threading) CS422-Spring

Computer Science 146. Computer Architecture

Chapter 06: Instruction Pipelining and Parallel Processing

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

Hardware-based Speculation

CS425 Computer Systems Architecture

RAID 0 (non-redundant) RAID Types 4/25/2011

Exploitation of instruction level parallelism

Chapter 4 The Processor 1. Chapter 4D. The Processor

CS 152, Spring 2011 Section 8

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Hardware-Based Speculation

Microarchitecture Overview. Performance

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Lec 25: Parallel Processors. Announcements

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

Instruction-Level Parallelism and Its Exploitation

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Advanced processor designs

Advanced issues in pipelining

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

Advanced Processor Architecture

Lecture 26: Parallel Processing. Spring 2018 Jason Tang

November 7, 2014 Prediction

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-Based Speculation

Computer Systems Architecture

Superscalar Processors

Computer Architecture. Out-of-order Execution

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction

Superscalar Machines. Characteristics of superscalar processors

Metodologie di Progettazione Hardware-Software

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

Processor Architecture

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

COSC4201 Instruction Level Parallelism Dynamic Scheduling

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 152, Spring 2012 Section 8

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

EE 4980 Modern Electronic Systems. Processor Advanced

Transcription:

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring 2012 [ material from Patterson & Hennessy, 4 th ed & Inside the Machine by J. Stokes] 1

Superscalar and VLIW reminder 2

Superscalar processors CPU decides whether to issue 0, 1, 2, each cycle Avoiding structural and data hazards Avoids the need for compiler scheduling Though it may still help Code semantics ensured by the CPU No need to re-compile existing code Static superscalar (in order execution) and dynamic superscalar (out of order execution) 3

Challenges in superscalar processor design Challenging code snippets: DIV.F $f0, $f1, $f2 ADD.F $f0, $f1, $f3 DIV.F $f0, $f1, $f2 ADD.F $f2, $f0, $f1 ADD.F $f1, $f4, $f3 DIV.F $f0, $f1, $f2 ADD.F $f2, $f0, $f1 ADD.F $f2, $f4, $f5 DIV.F $f0, $f1, $f2 ADD.F $f2, $f0, $f1 ADD.F $f4, $f4, $f3 DIV.F $f0, $f1, $f2 BEQ $f0, $0, L L: ADD $t0, $t1, $t2 4

Summary of challenges 1. Structural hazards on RF à increase number of ports à RF size grows squarely with number of port 2. Large stall penalties for RAW hazards because of deep pipelines 3. WAR problems because out-of-order execution 4. WAW name dependency hazards because different instructions take different number of cycles and because of out-of-order execution 5. Speculation problems arising from branch prediction 6. Out of order execution problems (e.g., WAW, exceptions / interrupts) 5

Possible solutions 1. Structural hazards on RF à increase number of ports à RF size grows squarely with number of port à split register file into integer RF and floating point RF 2. Large stall penalties for RAW hazards à out of order execution 3. WAR / WAW problems à register renaming 4. Speculation problems à execute out-of-order but commit in order 5. Exceptions/interrupts à execute out-of-order but commit in order 6

Superscalar in x86 architecture First introduced by Intel in 1976 with the 16-bit processor 8086. Continues with us to this date. Instructions variable length (1-17 bytes) Small number of registers (mainly 8) Instructions kept on expanding with new processor releases 7

x86 assembly language examples # move the 4 bytes in memory at the address contained in EBX into EAX mov eax, [ebx] # move the 4 bytes of data at address ESI+4*EBX into EDX mov edx, [esi+4*ebx] # eax = eax + ebx add eax, ebx # add ebx to memory content at address memory_location add memory_location, ebx # add 10 the data at address EAX add [eax], 10 # push content of eax into top of stack push eax # compare two registers, register memory or memory memory cmp eax, ebx cmp [eax], [ebx] # conditional jumps jl, jeq, jg, jge, jle, jne Instructions vary in length from 1 to 17 bytes with an average of 3 bytes / instruction 8

Pentium Introduced 1993 Technology 0.8 micron Transistors: 3.1 million Speed: 66 Mhz # stages: 5 Issue width: 2-way static First superscalar (static) design from Intel 9

Pentium Fetch and Decode IF ID1 ID2 EX WB Pentium five-stage pipeline Fetch stage: Instructions are fetched in a buffer, boundaries detected and marked. Marked instructions are aligned and sent to decode stage. Decode stage 1 (ID1) does typical decoding and checks if the instruction is a branch (and does static branch prediction for 1 st encounters and branch prediction using BHT and BTB for 2 nd encounters) Decode stage 2 (ID2) does address calculation to support complex memory addressing modes of x86 10

Pentium backend Two single-stage asymmetric integer ALUs One three-stage floating point Control unit examines pairs of instructions and statically issues them simultaneously if there are no hazards and can be supported by both ALUs; otherwise, only one is issued. Results of two instructions are put back in order when completed and written back to register file or memory 11

Pentium processor in perspective CISC ISA bloated support (instruction fetch, instruction decode and complex instruction support) almost occupied 30% of die area RISC competitors (from IBM, MIPS and Sun) required much less overhead for fetch and decode à used the available area for larger on-chip memory (caches) and more execution units) 12

Pentium Pro P6 microarchitecture Introduced 1995 Technology: 0.5 0.35 micron Transistors: 5.5 million Speed: 150-200 Mhz #stages: 11 Issue width: 5-way dynamic Intel Core architecture is a direct descendant of P6 microarchitecture. First Intel x86 superscalar with dynamic scheduler (i.e., out-of-order execution) Instruction memory Data memory load store aka reorder buffer (ROB) 13

Fetch / decode (in-order) IF1 IF2 IF3 ID1 ID2 RAT ROB DIS EX RET1 RET2 Pentium Pro 11-stage pipeline 1. IF1: fetch instruction from memory into a buffer 2. IF2: mark boundaries of instruction and dynamic branch prediction 3. IF3: Align instructions for decoders 4. ID1: Translate complex x86 instructions into internal simpler uop RISC instructions 5. ID2: uop RISC instructions enters into queue simple decoder Instruction memory 16-byte fetch buffer 16-byte fetch buffer simple decoder complex decoder 6-entry uop queue 14

RAT and ROB stages IF1 IF2 IF3 ID1 ID2 RAT ROB DIS EX RET1 RET2 Pentium Pro 11-stage pipeline 6. RAT: registers are renamed into 40 internal registers to eliminate false dependencies like WAW and WAR. A Register Alias Table (RAT) holds the mapping between architectural registers and hidden registers 7. The ROB is a circular buffer with 40 entries and holds the RISC uops in strict program order 6-entry uop queue RAT logical à physical ROB 40-entry circular buffer 15

Dispatch and execute unit (out-of-order) IF1 IF2 IF3 ID1 ID2 RAT ROB DIS EX RET1 RET2 Pentium Pro 11-stage pipeline 8. DIS: Dynamic scheduler checks instructions in ROB and dispatch (or issue) them to the reservation stations of the execution units if (1) the execution unit is available (no structural hazard) and (2) values of its operands are available (no RAW hazard). 9. EX: integer pipelined 1-cycle throughput. Floating pipelined for most instructions. DIV and SQRT take 38-69 cycles to finish. 16

Retire unit (in-order) IF1 IF2 IF3 ID1 ID2 RAT ROB DIS EX RET1 RET2 10. RET1: Results from EX stages are written back to ROB buffer(s) 11. RET2: Results from ROB buffer are committed in order to register file as long as retiring instruction is no longer speculative 17

Pentium II / III P6 microarchitecture Introduced 1997 Technology: 0.35 micron Transistors: 7.5 million Speed: 233-300 Mhz Similar microarchitecture to Pentium Pro Added MMX support for Single-Instruction Multiple Data (SIMD) integer ALU Increased on chip-memory size P6 microarchitecture Introduced: 1999 Technology: 0.25 0.18 micron Transistors: 9.5 million Speed: 600 Mhz 1.3 Ghz Added SSE support for Single- Instruction Multiple Data (SIMD) singleprecision floating point ALU Increased on chip-memory size 18

Limits to superscalar processors Programs have real dependencies that limit ILP Some dependencies are hard to eliminate e.g., pointer aliasing Some parallelism is hard to expose Limited window size during instruction issue Memory delays and limited bandwidth Hard to keep pipelines full Speculation helps if done well, but heavy penalties for incorrect branch predictions HW complexity increase power consumption & die area 19