Pentium Pro Case Study ECE/CS 752 Fall 2017

Size: px

Start display at page:

Download "Pentium Pro Case Study ECE/CS 752 Fall 2017"

Jared Reed
6 years ago
Views:

1 Pentium Pro Case Study ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti

2 Pentium Pro Case Study Microarchitecture Order 3 Superscalar Out of Order execution Speculative execution In order completion Design Methodology Performance Analysis Retrospective

3 Goals of P6 Microarchitecture IA-32 Compliant Performance (Frequency - IPC) Validation Die Size Schedule Power

4 BTB/ICU Fetch 4 Cycles P6 The Big Picture BAC/Rename Decode 2 Cycles Allocation Dispatch 2 Cycles Reservation Station (20) Cycles ROB (40x157) RRF 2 cyc AGU1 AGU0 MOB DCU IEU1 JEU IEU0 Fadd Fmul Imul Div

5 Memory Hierarchy Main Memory PCI 64 bit ICache (8KB) CPU BIU DCache (8Kb) 16 bytes L2 Cache (256Kb) Level 1 instruction and data caches 2 cycle access time Level 2 unified cache 6 cycle access time Separate level 2 cache and memory address/data bus

6 Inst. Rotator Inst. Buf Instruction Fetch L2 Cache (256Kb) Stream Buffer ICache (8Kb) Victim Cache Inst. Inst. Length Decoder Length Marks Prediction Marks Instruction TLB Physical Addr. Fetch Address Other Fetch Requests Next Addr. Logic Branch Target Buffer (512) To Decode Prediction Instruction Data Mux Instruction Data 16 bytes 16 bytes + marks 2 cycle Branch Target

7 Stream Buffer ICache (8 Kb) Instruction Cache Unit Bus Interface Unit Victim Cache Lower 12 bits Upper 20 bits Fetch Address Lower 12 bits Data Mux Instruction Tag Array Instruction Data Hit/Miss ITLB

8 Branch Target Buffer Way 0 Way 1 Way 3 PHT 128 Sets Fetch Addr. Tag Br. Offset 4-bit BHR 4-bit BHR spec. Target Addr. Fetch Addr. Tag Br. Offset 4-bit BHR 4-bit BHR spec. Target Addr. Fetch Addr. Tag Br. Offset 4-bit BHR 4-bit BHR spec. Target Addr. 16 entries/set Tag Compare Fetch Address Return Stack Prediction Control Logic Prediction & Target Addr. Pattern History Table (PHT) is not speculatively updated A speculative Branch History Register (BHR) and prediction state is maintained Uses speculative prediction state if it exist for that branch

9 Br. History Speculative History Branch Prediction Algorithm 0010 Br. Pred. Current prediction updates the speculative history prior to the next instance of the branch instruction Branch History Register (BHR) is updated during branch execution Branch recovery flushes front end and drains the execution core 1 Branch Execution 0101 State Machine 10 Spec. Pred Pattern Table 1 0 Branch mis prediction resets the speculative branch history state to match BHR

10 Instruction Decode 1 Macro-Instruction Bytes from IFU Instruction Buffer 16 bytes To Next Address Calc. urom Decoder 0 Decoder 1 Decoder 2 4 uops 1 uop 1 uop Branch Address Calc. uop Queue (6) Up to 3 uops Issued to dispatch Branch instruction detection Branch address calculation Static prediction and branch always execution One branch decode per cycle (break on branch)

11 Instruction Decode 2 Macro-Instruction Bytes from IFU Instruction Buffer 16 bytes To Next Address Calc. urom Decoder 0 Decoder 1 Decoder 2 4 uops 1 uop 1 uop Branch Address Calc. uop Queue (6) Up to 3 uops Issued to dispatch Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re filled Macro instructions must shift from decoder 2 to decoder 1 to decoder 0

12 What is a uop? Small two-operand instruction - Very RISC like. IA-32 instruction add (eax),(ebx) MEM(eax) <- MEM(eax) + MEM(ebx) Uop decomposition: ld guop0, (eax) guop0 <- MEM(eax) ld guop1, (ebx) guop1 <- MEM(ebx) add guop0,guop1 guop0 <- guop0 + guop1 sta eax std guop0 MEM(eax) <- guop0

13 Instruction Dispatch uop Queue (6) Renaming Dispatch Buffer (3) Mux Logic Retirement Info To Reservation Station Allocator 2 cycles Register Renaming Allocation requirements 3 or none Reorder buffer entries Reservation station entry Load buffer or store buffer entry Dispatch buffer probably dispatches all 3 uops before re fill

14 Real Register File (RRF) EAX EBX 8 ECX FST0 FST1 GuoP0 GuoP1 IuoP(0-3) CC/Events Register Renaming 1 Similar to Tomasulo s Algorithm Uses ROB entry number as tags The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register Execution results are stored in the ROB Integer RAT EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 FST7 Reorder Buffer (ROB)

15 Real Register File (RRF) EAX EBX 8 ECX FST0 FST1 GuoP0 GuoP1 IuoP(0-3) CC/Events Register Renaming Example Dispatching: add eax, ebx add eax, ecx fxch f0, f1 Integer RAT EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 FST7 Reorder Buffer (ROB) Comp Alloc Completing: sub eax, ecx sub Shen, Lipasti

16 Challenges to Register Renaming Integer RAT Real Register File (RRF) EAX EBX 8 ECX FST0 FST1 GuoP0 GuoP1 IuoP(0-3) CC/Events 8-bit code mov mov add add AL, #data1 AH, #data2 AL, #data3 AL, #data4 EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 FST7 Reorder Buffer (ROB) Byte addressable registers 39

17 RS bypass Out of Order Execution Engine Reservation Station (20) Cycles AGU1 AGU0 IEU1 IEU0 MOB JEU Fadd Fmul DCU (8Kb) Imul Div In order branch issue and execution In order load/store issue to address generation units Instruction execution and result bus scheduling Is the reservation station truly centralized & what is binding?

18 Reservation Station From Dispatch Queue Port 4 Port 3 Port 2 Port 1 Port 0 Cycle 1 Cycle 2 To Execution Units Cycle 1 Order checking Operand availability Cycle 2 Writeback bus scheduling

19 Memory Ordering Buffer (MOB) AGU0 AGU1 R/S Load Buffer (16) Conflict Logic Store Address Buffer (12) Store Data Buffer (12) 2 cycle Data Cache Unit (8Kb) MOB Bypass Control Logic 2 cycle Load Data Result Load buffer retains loads until completed, for coherency checking Store forwarding out of store buffers 2 cycle latency through MOB Store Coloring Load instructions are tagged by the last store

20 Instruction Completion Handles all exception/interrupt/trap conditions Handles branch recovery OOO core drains out right path instructions, commits to RRF In parallel, front end starts fetching from target/fall through However, no renaming is allowed until OOO core is drained After draining is done, RAT is reset to point to RRF Avoids checkpointing RAT, recovering to intermediate RAT state Commits execution results to the architectural state in order Retirement Register File (RRF) Must handle hazards to RRF (writes/reads in same cycle) Must handle hazards to RAT (writes/reads in same cycle) Atomic IA 32 instruction completion uops are marked as 1st or last in sequence exception/interrupt/trap boundary 2 cycle retirement

21 Pentium Pro Design Methodology 1 Performance Model Start 1990 Finish 4Q 1994 Microarchitecture 1 year Conception 1 year Silicon Debug Behavioral Model 0.75 years Structural Model Logic Design 1.75 years Circuit Design Layout

22 Observability Pentium Pro Performance Analysis On chip event counters Dynamic analysis Benchmark Suite BAPco Sysmark32 32 bit Windows NT applications Winstone97 32 bit Windows NT applications Some SPEC95 benchmarks

23 Performance Run Times User-Mode Processor Cycles Unhalted User-Mode Processor Cycles 1.2E+11 1E+11 Total of 27.5 billion cycles 8E+10 6E+10 4E+10 2E+10 0 avs li go Pwave MSVC Corel excel word paradox WordPro PowerPnt SYSexcel PhotoShop PageMaker SYSword7 SYSflw96 compress SYSCorel6 SYSexcel7 microstation SYSpowerpnt7 SYSParadox7 SYSpagemaker6

24 Performance IPC vs. upc 1.6 Instructions and and UOPs Retired Uops Per Cycle retired per cycle Inst retired UOPS retired 2 uops/instruction # Retired 0.4 Per Cycle WS-AVS SPEC-Li WS-Excel WS-Word SPEC-Go WS-PWave SYS-Excel SYS-Word SYS-FlW96 SYS-Excel WS-MSVC++ WS-Paradox WS-WordPro SYS-Paradox WS-PhotoShop WS-PowerPnt WS-Corel Draw WS-PageMaker SYS-PowerPnt SYS-CorelDraw SYS-Pagemaker SPEC-Compress WS-Microstation

25 Performance IPC vs. upc 100 Inst. Inst. or UOPS Uops retired retired per cycle per cycle AVS PageMaker Paradox PowerPoint WordPro Instructions were Retired 20 Percent 10 of Cycles where n=0-3 UOPs or 0 3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst. 2 Inst. 1 Inst. 0 Inst.

26 Performance Cache Misses 2.5 Cache Cache Misses Per Cycle Per Cycle IFU Misses / Cycle DCU Misses / Cycle L2 Misses / Cycle IL1-0.51% DL1-1.15% L2-0.25% 1 % Unhalted Cycles WS-AVS SPEC-Li WS-Excel WS-Word WS-PWave SYS-ExcelSYS-Word SYS-FlW96 SYS-Excel WS-MSVC++ WS-Paradox WS-PowerPnt WS-WordPro SYS-Paradox WS-PhotoShop WS-CorelDraw WS-Microstation WS-PageMaker SYS-PowerPnt SYS-CorelDraw SYS-Pagemaker SPEC-Compress SPEC-Go

27 Performance Branch Prediction 16% Rate Branch Mispredict Rate 14% 12% 6.8% avg 10% 8% 6% 4% % Branches Mispredicted 2% 0% 45% WS-AVS SPEC-Li WS -Excel WS-Word SYS-Excel SYS-Word SYS-Excel SPEC-Go WS-PWave SYS-FlW96 WS-MSVC++ WS-Paradox WS-WordPro WS-PowerPnt SYS-Paradox WS-PhotoShop WS-CorelDraw SYS-PowerPnt WS-Microstation WS-PageMa ker SYS- CorelDraw SYS-Pagemaker SPEC-Compress Rate BTB Miss Rate 40% 35% 30% 25% 20% 15% 10% % Branches that Miss in BTB 5% 0% WS-AVS SPEC-Li WS-Excel WS-Word SYS-Excel SYS-Word SYS-Excel SPEC-Go WS-PWave SYS-FlW96 WS -MS VC++ WS-Paradox WS- PowerPnt WS-WordPro SYS-Paradox WS-P hotows-coreldraw Shop SYS-PowerPnt WS-Microstation WS-PageMaker SYS-CorelDraw SYS-Pagemaker SPEC-Compress

28 IA 32 Compliant Conclusions Performance (Frequency IPC) ISpec FSpec SPECint SPECfp95 Validation Die Size Fabable Schedule 1 year late Power

29 Retrospective Most commercially successful microarchitecture in history Evolution Pentium II/III, Xeon, etc. Derivatives with on chip L2, ISA extensions, etc. Replaced by Pentium 4 as flagship in 2001 High frequency, deep pipeline, extreme speculation Resurfaced as Pentium M in 2003 Initially a response to Transmeta in laptop market Pentium 4 derivative (90nm Prescott) delayed, slow, hot Core Duo, Core 2 Duo, Core i7 replaced Pentium 4 Shen, Lipasti 29

30 Microarchitectural Updates Pentium M (Banias), Core Duo (Yonah) Micro op fusion (also in AMD K7/K8) Multiple uops in one: (add eax,[mem] => ld/alu), sta/std These uops decode/dispatch/commit once, issue twice Better branch prediction Loop count predictor Indirect branch predictor Slightly deeper pipeline (12 stages) Extra decode stage for micro op fusion Extra stage between issue and execute (for RS/PLRAM read) Data capture reservation station (payload RAM) Clock gated for 32 (int), 64 (fp), and 128 (SSE) operands Shen, Lipasti 30

31 Microarchitectural Updates Core 2 Duo (Merom) 64 bit ISA from AMD K8 Macro op fusion Merge uops from two x86 ops E.g. cmp, jne => cmpjne 4 wide decoder (Complex + 3x Simple) Peak x86 decode throughput is 5 due to macro op fusion Loop buffer Loops that fit in 18 entry instruction queue avoid fetch/decode overhead Even deeper pipeline (14 stages) Larger reservation station (32), instruction window (96) Memory dependence prediction Shen, Lipasti 31

32 Microarchitectural Updates Nehalem (Core i7/i5/i3) RS size 36, ROB 128 Loop cache up to 28 uops L2 branch predictor L2 TLB I$ and D$ now 32K, L2 back to 256K, inclusive L3 up to 8M Simultaneous multithreading RAS now renamed (repaired) 6 issue, 48 load buffers, 32 store buffers New system interface (QPI) finally dropped front side bus Integrated memory controller (up to 3 channels) New STTNI instructions for string/text handling Shen, Lipasti 32

33 Microarchitectural Updates Sandybridge/Ivy Bridge (2 nd 3 rd generation Core i7) On chip integrated graphics (GPU) Decoded uop cache up to 1.5K uops, handles loops, but more general 54 entry RS, 168 entry ROB Physical register file: 144 FP, 160 integer 256 bit AVX units: 8 DPFLOP/cycle, 16 SPFLOP/cycle 2 general AGUs enable 2 ld/cycle, 2 st/cycle or any combination, 2x128 bit load path from L1 D$ Shen, Lipasti 33

Microarchitectural Updates Haswell/Broadwell/Skylake: wider & deeper 8 wide issue (up from 6 wide) 4 th integer ALU, third AGU, second branch unit 60 entry RS, 192 entry ROB 72 entry load queue/42

34 Microarchitectural Updates Haswell/Broadwell/Skylake: wider & deeper 8 wide issue (up from 6 wide) 4 th integer ALU, third AGU, second branch unit 60 entry RS, 192 entry ROB 72 entry load queue/42 entry store queue Physical register file: 168 FP, 168 integer Doubled FP throughput (32 SP/16 DP) Load/store bandwidth to L1 doubled (64B/32B) TSX (transactional memory) Integrated voltage regulator Shen, Lipasti 34

How to write powerful parallel Applications

How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome