Pentium Pro Case Study ECE/CS 752 Fall 2017
|
|
- Jared Reed
- 6 years ago
- Views:
Transcription
1 Pentium Pro Case Study ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti
2 Pentium Pro Case Study Microarchitecture Order 3 Superscalar Out of Order execution Speculative execution In order completion Design Methodology Performance Analysis Retrospective
3 Goals of P6 Microarchitecture IA-32 Compliant Performance (Frequency - IPC) Validation Die Size Schedule Power
4 BTB/ICU Fetch 4 Cycles P6 The Big Picture BAC/Rename Decode 2 Cycles Allocation Dispatch 2 Cycles Reservation Station (20) Cycles ROB (40x157) RRF 2 cyc AGU1 AGU0 MOB DCU IEU1 JEU IEU0 Fadd Fmul Imul Div
5 Memory Hierarchy Main Memory PCI 64 bit ICache (8KB) CPU BIU DCache (8Kb) 16 bytes L2 Cache (256Kb) Level 1 instruction and data caches 2 cycle access time Level 2 unified cache 6 cycle access time Separate level 2 cache and memory address/data bus
6 Inst. Rotator Inst. Buf Instruction Fetch L2 Cache (256Kb) Stream Buffer ICache (8Kb) Victim Cache Inst. Inst. Length Decoder Length Marks Prediction Marks Instruction TLB Physical Addr. Fetch Address Other Fetch Requests Next Addr. Logic Branch Target Buffer (512) To Decode Prediction Instruction Data Mux Instruction Data 16 bytes 16 bytes + marks 2 cycle Branch Target
7 Stream Buffer ICache (8 Kb) Instruction Cache Unit Bus Interface Unit Victim Cache Lower 12 bits Upper 20 bits Fetch Address Lower 12 bits Data Mux Instruction Tag Array Instruction Data Hit/Miss ITLB
8 Branch Target Buffer Way 0 Way 1 Way 3 PHT 128 Sets Fetch Addr. Tag Br. Offset 4-bit BHR 4-bit BHR spec. Target Addr. Fetch Addr. Tag Br. Offset 4-bit BHR 4-bit BHR spec. Target Addr. Fetch Addr. Tag Br. Offset 4-bit BHR 4-bit BHR spec. Target Addr. 16 entries/set Tag Compare Fetch Address Return Stack Prediction Control Logic Prediction & Target Addr. Pattern History Table (PHT) is not speculatively updated A speculative Branch History Register (BHR) and prediction state is maintained Uses speculative prediction state if it exist for that branch
9 Br. History Speculative History Branch Prediction Algorithm 0010 Br. Pred. Current prediction updates the speculative history prior to the next instance of the branch instruction Branch History Register (BHR) is updated during branch execution Branch recovery flushes front end and drains the execution core 1 Branch Execution 0101 State Machine 10 Spec. Pred Pattern Table 1 0 Branch mis prediction resets the speculative branch history state to match BHR
10 Instruction Decode 1 Macro-Instruction Bytes from IFU Instruction Buffer 16 bytes To Next Address Calc. urom Decoder 0 Decoder 1 Decoder 2 4 uops 1 uop 1 uop Branch Address Calc. uop Queue (6) Up to 3 uops Issued to dispatch Branch instruction detection Branch address calculation Static prediction and branch always execution One branch decode per cycle (break on branch)
11 Instruction Decode 2 Macro-Instruction Bytes from IFU Instruction Buffer 16 bytes To Next Address Calc. urom Decoder 0 Decoder 1 Decoder 2 4 uops 1 uop 1 uop Branch Address Calc. uop Queue (6) Up to 3 uops Issued to dispatch Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re filled Macro instructions must shift from decoder 2 to decoder 1 to decoder 0
12 What is a uop? Small two-operand instruction - Very RISC like. IA-32 instruction add (eax),(ebx) MEM(eax) <- MEM(eax) + MEM(ebx) Uop decomposition: ld guop0, (eax) guop0 <- MEM(eax) ld guop1, (ebx) guop1 <- MEM(ebx) add guop0,guop1 guop0 <- guop0 + guop1 sta eax std guop0 MEM(eax) <- guop0
13 Instruction Dispatch uop Queue (6) Renaming Dispatch Buffer (3) Mux Logic Retirement Info To Reservation Station Allocator 2 cycles Register Renaming Allocation requirements 3 or none Reorder buffer entries Reservation station entry Load buffer or store buffer entry Dispatch buffer probably dispatches all 3 uops before re fill
14 Real Register File (RRF) EAX EBX 8 ECX FST0 FST1 GuoP0 GuoP1 IuoP(0-3) CC/Events Register Renaming 1 Similar to Tomasulo s Algorithm Uses ROB entry number as tags The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register Execution results are stored in the ROB Integer RAT EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 FST7 Reorder Buffer (ROB)
15 Real Register File (RRF) EAX EBX 8 ECX FST0 FST1 GuoP0 GuoP1 IuoP(0-3) CC/Events Register Renaming Example Dispatching: add eax, ebx add eax, ecx fxch f0, f1 Integer RAT EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 FST7 Reorder Buffer (ROB) Comp Alloc Completing: sub eax, ecx sub Shen, Lipasti
16 Challenges to Register Renaming Integer RAT Real Register File (RRF) EAX EBX 8 ECX FST0 FST1 GuoP0 GuoP1 IuoP(0-3) CC/Events 8-bit code mov mov add add AL, #data1 AH, #data2 AL, #data3 AL, #data4 EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 FST7 Reorder Buffer (ROB) Byte addressable registers 39
17 RS bypass Out of Order Execution Engine Reservation Station (20) Cycles AGU1 AGU0 IEU1 IEU0 MOB JEU Fadd Fmul DCU (8Kb) Imul Div In order branch issue and execution In order load/store issue to address generation units Instruction execution and result bus scheduling Is the reservation station truly centralized & what is binding?
18 Reservation Station From Dispatch Queue Port 4 Port 3 Port 2 Port 1 Port 0 Cycle 1 Cycle 2 To Execution Units Cycle 1 Order checking Operand availability Cycle 2 Writeback bus scheduling
19 Memory Ordering Buffer (MOB) AGU0 AGU1 R/S Load Buffer (16) Conflict Logic Store Address Buffer (12) Store Data Buffer (12) 2 cycle Data Cache Unit (8Kb) MOB Bypass Control Logic 2 cycle Load Data Result Load buffer retains loads until completed, for coherency checking Store forwarding out of store buffers 2 cycle latency through MOB Store Coloring Load instructions are tagged by the last store
20 Instruction Completion Handles all exception/interrupt/trap conditions Handles branch recovery OOO core drains out right path instructions, commits to RRF In parallel, front end starts fetching from target/fall through However, no renaming is allowed until OOO core is drained After draining is done, RAT is reset to point to RRF Avoids checkpointing RAT, recovering to intermediate RAT state Commits execution results to the architectural state in order Retirement Register File (RRF) Must handle hazards to RRF (writes/reads in same cycle) Must handle hazards to RAT (writes/reads in same cycle) Atomic IA 32 instruction completion uops are marked as 1st or last in sequence exception/interrupt/trap boundary 2 cycle retirement
21 Pentium Pro Design Methodology 1 Performance Model Start 1990 Finish 4Q 1994 Microarchitecture 1 year Conception 1 year Silicon Debug Behavioral Model 0.75 years Structural Model Logic Design 1.75 years Circuit Design Layout
22 Observability Pentium Pro Performance Analysis On chip event counters Dynamic analysis Benchmark Suite BAPco Sysmark32 32 bit Windows NT applications Winstone97 32 bit Windows NT applications Some SPEC95 benchmarks
23 Performance Run Times User-Mode Processor Cycles Unhalted User-Mode Processor Cycles 1.2E+11 1E+11 Total of 27.5 billion cycles 8E+10 6E+10 4E+10 2E+10 0 avs li go Pwave MSVC Corel excel word paradox WordPro PowerPnt SYSexcel PhotoShop PageMaker SYSword7 SYSflw96 compress SYSCorel6 SYSexcel7 microstation SYSpowerpnt7 SYSParadox7 SYSpagemaker6
24 Performance IPC vs. upc 1.6 Instructions and and UOPs Retired Uops Per Cycle retired per cycle Inst retired UOPS retired 2 uops/instruction # Retired 0.4 Per Cycle WS-AVS SPEC-Li WS-Excel WS-Word SPEC-Go WS-PWave SYS-Excel SYS-Word SYS-FlW96 SYS-Excel WS-MSVC++ WS-Paradox WS-WordPro SYS-Paradox WS-PhotoShop WS-PowerPnt WS-Corel Draw WS-PageMaker SYS-PowerPnt SYS-CorelDraw SYS-Pagemaker SPEC-Compress WS-Microstation
25 Performance IPC vs. upc 100 Inst. Inst. or UOPS Uops retired retired per cycle per cycle AVS PageMaker Paradox PowerPoint WordPro Instructions were Retired 20 Percent 10 of Cycles where n=0-3 UOPs or 0 3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst. 2 Inst. 1 Inst. 0 Inst.
26 Performance Cache Misses 2.5 Cache Cache Misses Per Cycle Per Cycle IFU Misses / Cycle DCU Misses / Cycle L2 Misses / Cycle IL1-0.51% DL1-1.15% L2-0.25% 1 % Unhalted Cycles WS-AVS SPEC-Li WS-Excel WS-Word WS-PWave SYS-ExcelSYS-Word SYS-FlW96 SYS-Excel WS-MSVC++ WS-Paradox WS-PowerPnt WS-WordPro SYS-Paradox WS-PhotoShop WS-CorelDraw WS-Microstation WS-PageMaker SYS-PowerPnt SYS-CorelDraw SYS-Pagemaker SPEC-Compress SPEC-Go
27 Performance Branch Prediction 16% Rate Branch Mispredict Rate 14% 12% 6.8% avg 10% 8% 6% 4% % Branches Mispredicted 2% 0% 45% WS-AVS SPEC-Li WS -Excel WS-Word SYS-Excel SYS-Word SYS-Excel SPEC-Go WS-PWave SYS-FlW96 WS-MSVC++ WS-Paradox WS-WordPro WS-PowerPnt SYS-Paradox WS-PhotoShop WS-CorelDraw SYS-PowerPnt WS-Microstation WS-PageMa ker SYS- CorelDraw SYS-Pagemaker SPEC-Compress Rate BTB Miss Rate 40% 35% 30% 25% 20% 15% 10% % Branches that Miss in BTB 5% 0% WS-AVS SPEC-Li WS-Excel WS-Word SYS-Excel SYS-Word SYS-Excel SPEC-Go WS-PWave SYS-FlW96 WS -MS VC++ WS-Paradox WS- PowerPnt WS-WordPro SYS-Paradox WS-P hotows-coreldraw Shop SYS-PowerPnt WS-Microstation WS-PageMaker SYS-CorelDraw SYS-Pagemaker SPEC-Compress
28 IA 32 Compliant Conclusions Performance (Frequency IPC) ISpec FSpec SPECint SPECfp95 Validation Die Size Fabable Schedule 1 year late Power
29 Retrospective Most commercially successful microarchitecture in history Evolution Pentium II/III, Xeon, etc. Derivatives with on chip L2, ISA extensions, etc. Replaced by Pentium 4 as flagship in 2001 High frequency, deep pipeline, extreme speculation Resurfaced as Pentium M in 2003 Initially a response to Transmeta in laptop market Pentium 4 derivative (90nm Prescott) delayed, slow, hot Core Duo, Core 2 Duo, Core i7 replaced Pentium 4 Shen, Lipasti 29
30 Microarchitectural Updates Pentium M (Banias), Core Duo (Yonah) Micro op fusion (also in AMD K7/K8) Multiple uops in one: (add eax,[mem] => ld/alu), sta/std These uops decode/dispatch/commit once, issue twice Better branch prediction Loop count predictor Indirect branch predictor Slightly deeper pipeline (12 stages) Extra decode stage for micro op fusion Extra stage between issue and execute (for RS/PLRAM read) Data capture reservation station (payload RAM) Clock gated for 32 (int), 64 (fp), and 128 (SSE) operands Shen, Lipasti 30
31 Microarchitectural Updates Core 2 Duo (Merom) 64 bit ISA from AMD K8 Macro op fusion Merge uops from two x86 ops E.g. cmp, jne => cmpjne 4 wide decoder (Complex + 3x Simple) Peak x86 decode throughput is 5 due to macro op fusion Loop buffer Loops that fit in 18 entry instruction queue avoid fetch/decode overhead Even deeper pipeline (14 stages) Larger reservation station (32), instruction window (96) Memory dependence prediction Shen, Lipasti 31
32 Microarchitectural Updates Nehalem (Core i7/i5/i3) RS size 36, ROB 128 Loop cache up to 28 uops L2 branch predictor L2 TLB I$ and D$ now 32K, L2 back to 256K, inclusive L3 up to 8M Simultaneous multithreading RAS now renamed (repaired) 6 issue, 48 load buffers, 32 store buffers New system interface (QPI) finally dropped front side bus Integrated memory controller (up to 3 channels) New STTNI instructions for string/text handling Shen, Lipasti 32
33 Microarchitectural Updates Sandybridge/Ivy Bridge (2 nd 3 rd generation Core i7) On chip integrated graphics (GPU) Decoded uop cache up to 1.5K uops, handles loops, but more general 54 entry RS, 168 entry ROB Physical register file: 144 FP, 160 integer 256 bit AVX units: 8 DPFLOP/cycle, 16 SPFLOP/cycle 2 general AGUs enable 2 ld/cycle, 2 st/cycle or any combination, 2x128 bit load path from L1 D$ Shen, Lipasti 33
34 Microarchitectural Updates Haswell/Broadwell/Skylake: wider & deeper 8 wide issue (up from 6 wide) 4 th integer ALU, third AGU, second branch unit 60 entry RS, 192 entry ROB 72 entry load queue/42 entry store queue Physical register file: 168 FP, 168 integer Doubled FP throughput (32 SP/16 DP) Load/store bandwidth to L1 doubled (64B/32B) TSX (transactional memory) Integrated voltage regulator Shen, Lipasti 34
How to write powerful parallel Applications
How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome
More informationThe Pentium II/III Processor Compiler on a Chip
The Pentium II/III Processor Compiler on a Chip Ronny Ronen Senior Principal Engineer Director of Architecture Research Intel Labs - Haifa Intel Corporation Tel Aviv University January 20, 2004 1 Agenda
More informationComputer Architecture. Out-of-order Execution
Computer Architecture Out-of-order Execution By Yoav Etsion With acknowledgement to Dan Tsafrir, Avi Mendelson, Lihu Rappoport, and Adi Yoaz 1 The need for speed: Superscalar Remember our goal: minimize
More informationEN164: Design of Computing Systems Lecture 24: Processor / ILP 5
EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationReorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)
Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers
More informationECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison
ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based on notes by John P. Shen Limitations of Scalar Pipelines
More informationEN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design
EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown
More informationPentium IV-XEON. Computer architectures M
Pentium IV-XEON Computer architectures M 1 Pentium IV block scheme 4 32 bytes parallel Four access ports to the EU 2 Pentium IV block scheme Address Generation Unit BTB Branch Target Buffer I-TLB Instruction
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 9: Limits of ILP, Case Studies Lecture Outline Speculative Execution Implementing Precise Interrupts
More informationECE/CS 552: Introduction to Superscalar Processors
ECE/CS 552: Introduction to Superscalar Processors Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Limitations of Scalar Pipelines
More informationSuperscalar Processor
Superscalar Processor Design Superscalar Architecture Virendra Singh Indian Institute of Science Bangalore virendra@computer.orgorg Lecture 20 SE-273: Processor Design Superscalar Pipelines IF ID RD ALU
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More information15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011
5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in
More informationWide Instruction Fetch
Wide Instruction Fetch Fall 2007 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs470 edu/courses/eecs470 block_ids Trace Table pre-collapse trace_id History Br. Hash hist. Rename Fill Table
More informationAdvanced Computer Architecture
Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I
More informationPentium 4 Processor Block Diagram
FP FP Pentium 4 Processor Block Diagram FP move FP store FMul FAdd MMX SSE 3.2 GB/s 3.2 GB/s L D-Cache and D-TLB Store Load edulers Integer Integer & I-TLB ucode Netburst TM Micro-architecture Pipeline
More informationSuperscalar Processors
Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationCMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago
CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office
More informationAnnouncements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)
Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationThe P6 Architecture: Background Information for Developers
The P6 Architecture: Background Information for Developers 1995, Intel Corporation P6 CPU Overview CPU Dynamic Execution 133MHz core frequency 8K L1 caches APIC AA AA A AA A A A AA AA AA AA A A AA AA A
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u
More informationCS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism
CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste
More informationPage 1. Review: Dynamic Branch Prediction. Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400)
CS252 Graduate Computer Architecture Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400) April 4, 2001 Prof. David A. Patterson Computer Science 252 Spring 2001 Lec
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationCS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25
CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem
More informationBOBCAT: AMD S LOW-POWER X86 PROCESSOR
ARCHITECTURES FOR MULTIMEDIA SYSTEMS PROF. CRISTINA SILVANO LOW-POWER X86 20/06/2011 AMD Bobcat Small, Efficient, Low Power x86 core Excellent Performance Synthesizable with smaller number of custom arrays
More informationCS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines
CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per
More informationComputer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationJim Keller. Digital Equipment Corp. Hudson MA
Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Instruction Commit The End of the Road (um Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable Only actions following
More informationE0-243: Computer Architecture
E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation
More informationCS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II
CS252 Spring 2017 Graduate Computer Architecture Lecture 8: Advanced Out-of-Order Superscalar Designs Part II Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationComputer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013
18-447 Computer Architecture Lecture 13: State Maintenance and Recovery Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationEE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University
EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More information6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU
1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high
More informationLimitations of Scalar Pipelines
Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline
More informationDatapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale
Datapoint 2200 IA-32 Nicholas FitzRoy-Dale At the forefront of the computer revolution - Intel Difficult to explain and impossible to love - Hennessy and Patterson! Released 1970! 2K shift register main
More informationCase Study IBM PowerPC 620
Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,
More informationLecture 12 Branch Prediction and Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer
More informationPortland State University ECE 587/687. The Microarchitecture of Superscalar Processors
Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationCS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming
CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at
More informationModule 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.
Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationSuperscalar Processors Ch 14
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationSuperscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationRegister Data Flow. ECE/CS 752 Fall Prof. Mikko H. Lipasti University of Wisconsin-Madison
Register Data Flow ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin-Madison Register Data Flow Techniques Register Data Flow Resolving Anti-dependences Resolving Output Dependences Resolving
More informationComputer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013
18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed
More informationPrecise Exceptions and Out-of-Order Execution. Samira Khan
Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all instructions take the same amount of time for execution Idea: Have multiple different functional units that take
More informationAdvanced Processor Architecture
Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationHigh-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs
High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:
More informationCISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions
CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy
More informationAdvanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationKeywords and Review Questions
Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain
More informationDYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING
DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationEffectiveness and Limitations of Embedded Counter Based Performance Analysis. Carey K. Kloss
CARNEGIE Department of Electrical MELLON and Computer Engineering~ Effectiveness and Limitations of Embedded Counter Based Performance Analysis Carey K. Kloss 1997 Effectiveness and Limitations of Embedded
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More informationPipelining to Superscalar
Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel
More informationAnnouncements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory
ECE4750/CS4420 Computer Architecture L11: Speculative Execution I Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab3 due today 2 1 Overview Branch penalties limit performance
More informationCS 152, Spring 2012 Section 8
CS 152, Spring 2012 Section 8 Christopher Celio University of California, Berkeley Agenda More Out- of- Order Intel Core 2 Duo (Penryn) Vs. NVidia GTX 280 Intel Core 2 Duo (Penryn) dual- core 2007+ 45nm
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 19 Advanced Processors III 2006-11-2 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last
More informationItanium 2 Processor Microarchitecture Overview
Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs
More informationCISC, RISC and post-risc architectures
Microprocessor architecture and instruction execution CISC, RISC and Post-RISC architectures Instruction Set Architecture and microarchitecture Instruction encoding and machine instructions Pipelined and
More informationCS 152, Spring 2011 Section 8
CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia
More informationInstruction Level Parallelism
Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 22 Advanced Processors III 2005-4-12 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Ted Hong and David Marquardt www-inst.eecs.berkeley.edu/~cs152/
More informationXT Node Architecture
XT Node Architecture Let s Review: Dual Core v. Quad Core Core Dual Core 2.6Ghz clock frequency SSE SIMD FPU (2flops/cycle = 5.2GF peak) Cache Hierarchy L1 Dcache/Icache: 64k/core L2 D/I cache: 1M/core
More informationComputer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types
More informationSuperscalar Organization
Superscalar Organization ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin-Madison Stage Phase Function performed CPU, circa 1986 IF φ 1 Translate virtual instr. addr. using TLB φ 2 Access
More information" # " $ % & ' ( ) * + $ " % '* + * ' "
! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File
More informationArchitectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1
Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural
More informationPowerPC 620 Case Study
Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance
More informationEN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture
EN2910A: Advanced Computer Architecture Topic 03: Superscalar core architecture Prof. Sherief Reda School of Engineering Brown University Material from: Mostly from Modern Processor Design by Shen and
More informationHardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)
More informationMulti-threaded processors. Hung-Wei Tseng x Dean Tullsen
Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti
ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real
More informationGetting CPI under 1: Outline
CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more
More information