Next Generation Itanium Processor Overview. Gary Hammond; Sam Naffziger, Lead Circuit Architect, Microprocessor Technology Lab, HP Corporation


Next Generation Itanium Processor Overview. Gary Hammond, Principal Architect, Enterprise Platform Group, Intel Corporation. Sam Naffziger, Lead Circuit Architect, Microprocessor Technology Lab, HP Corporation. August 27-30, 2001. Page 1

Agenda
- Program Goals
- Processor Enhancements
- Processor Structures: Pipelines, Front-End, Functional Units, Caches
- System Bus
- Reliability Features
- Summary
Page 2

Program Goals. McKinley Program Goals
- Builds on the Itanium processor's EPIC design philosophy
- 100% Itanium Processor Family binary compatibility
- World-class performance: high performance for commercial servers; high integer and floating-point performance
- Support for mission-critical applications: robust error recovery and containment
Page 3

Program Goals. Architecture Philosophy
- Builds on the Itanium processor's EPIC features to achieve higher instruction-level parallelism
  - Prefetch/branch/cache hints, speculation, predication, register stacking: same as the Itanium processor
- Provides performance scaling plus binary compatibility with Itanium-based applications/OSes
- Full hardware scoreboard design to resolve register hazards between instruction groups
Same programming model as the Itanium processor; no code changes required. Page 4

Processor Enhancements. McKinley Optimizations
Improved dynamic properties:
- Target production frequency is 1 GHz
- Reduced L1, L2, L3 latencies
- L3 cache has been incorporated on die
- Increased L2 cache capacity
- Improved FSB bandwidth
- Lower branch prediction penalties
McKinley provides significant speedups on existing Itanium processor binaries. Page 5

Processor Enhancements. McKinley Optimizations
Reduced execution paths:
- More parallelism/resources: more integer and multimedia units and more memory ports
- Short latencies: fully bypassed functional units; very low L1D/L2/L3 cache latencies; low-latency FP execution
- Many more ways to issue/execute 6 insts/clk
McKinley provides performance headroom for re-optimized binaries. Page 6

Processor Enhancements. Micro-Architectural Enhancements

                 Itanium Processor                 McKinley
System Bus       64 bits wide                      128 bits wide
                 133 MHz / 266 MT/s, 2.1 GB/s      200 MHz / 400 MT/s, 6.4 GB/s
Core             2 bundles per clock, 800 MHz      2 bundles per clock, 1 GHz
                 4 integer units                   6 integer units
                 2 loads or stores per clock       2 loads and 2 stores per clock
                 9 issue ports                     11 issue ports
Caches           L1: 2x16KB, 2-clock latency       L1: 2x16KB, 1-clock latency
                 L2: 96K, 12-clock latency         L2: 256K, 5-clock latency
                 L3: 4MB external (BSB), 20 clk,   L3: 3MB on die, 12 clk,
                     11.7 GB/s bandwidth               32 GB/s bandwidth
Addressing       44-bit physical addressing        50-bit physical addressing
                 50-bit virtual addressing         64-bit virtual addressing
                 Maximum page size of 256MB        Maximum page size of 4GB
Page 7
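The bus-bandwidth figures in the table follow directly from bus width and transfer rate; a quick sketch of the arithmetic:

```python
# Peak front-side-bus bandwidth = (data-bus width in bytes) x (transfer rate).
def fsb_bandwidth_gb_s(width_bits, mega_transfers_per_s):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes, as in the slides)."""
    return (width_bits / 8) * (mega_transfers_per_s * 1e6) / 1e9

itanium = fsb_bandwidth_gb_s(64, 266)    # 64-bit bus at 266 MT/s
mckinley = fsb_bandwidth_gb_s(128, 400)  # 128-bit bus at 400 MT/s
print(f"Itanium:  {itanium:.1f} GB/s")   # ~2.1 GB/s
print(f"McKinley: {mckinley:.1f} GB/s")  # 6.4 GB/s
```

Doubling the width and moving from 266 MT/s to 400 MT/s accounts for the roughly 3x bandwidth increase claimed.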

Processor Enhancements. Architectural Changes
Beneficial to compilers:
- Improved data/control speculation support
  - ALAT is fully associative to minimize thrashing
  - Processor directly vectors to recovery code for reduced speculation costs
- 64-bit long branch instruction
Beneficial to OS and system designs:
- Full 64-bit virtual addressing; full 2**24 virtual address spaces
- 4GB virtual pages reduce TLB pressure
- 50-bit physical addressing allows very large memory/IO spaces
These changes provide more flexibility to compiler, OS, and system designs. Page 8

McKinley Microarchitecture. Full Chip Block Diagram
- Pipelines
- Front-end and branch prediction
- Functional units
- Caches
- System bus
Page 9

Processor Structure. McKinley Block Diagram
[Block diagram: ECC-protected L3 cache; quad-port ECC-protected L2 cache; L1 instruction cache and fetch/prefetch engine with ITLB; branch prediction; instruction queue (8 bundles); 11 issue ports (B B B M M M M I I F F); scoreboard, predicates, NaTs, exceptions; branch and predicate registers; 128 integer registers; 128 FP registers; register stack engine / re-mapping; branch units; integer and MM units; floating point units; quad-port L1 data cache and DTLB; ALAT; IA-32 decode and control; ECC-protected bus controller]
Page 10

Processor Structure. Micro-Architecture Comparison

                         Sun UltraSPARC III*         McKinley Processor
System bus bandwidth     2.4 GB/s                    6.4 GB/s
On-die cache             96K L1*                     L3 = 3MB, L2 = 256K, L1 = 16K+16K
Issue ports              4                           11
On-die registers         96 registers                328 registers
Execution units          2 integer, 1 branch,        6 integer, 3 branch,
                         2 FP/VIS, 1 load/store      2 FP, 1 SIMD, 2 load, 2 store
Core frequency           900 MHz                     1 GHz
Instructions / clock     4                           6
Architecture             RISC                        EPIC

*8 MB external L2 cache. Source: www.sun.com; The SPARC Architecture Manual (Prentice Hall). Page 11

Processor Pipelines. McKinley Pipelines
Core pipeline: IPG ROT EXP REN REG EXE DET WB. FPU: FP1 FP2 FP3 FP4 WB. L2: L2N L2I L2A L2M L2D L2C L2W.
- IPG: IP generate, L1I cache (6 inst) and TLB access
- ROT: instruction rotate and buffer (6 inst)
- EXP: expand, port assignment and routing
- REN: integer and FP register rename (6 inst)
- REG: integer and FP register file read (6)
- EXE: ALU execute (6), L1D cache and TLB access + L2 cache tag access (4)
- DET: exception detect, branch correction
- WB: writeback, integer register update
- FP1-WB: FP FMAC pipeline (2) + register write
- L2N-L2I: L2 queue nominate/issue (4)
- L2A-L2W: L2 access, rotate, correct, write (4)
Short 8-stage in-order main pipeline; in-order issue, out-of-order completion. Reduced branch misprediction penalties. Fully interlocked: no way-prediction or flush/replay mechanism. Pipelines are designed for very low latency. Page 12
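The "fully interlocked" scoreboard behavior above (in-order issue, out-of-order completion) can be sketched as a toy model; the instruction names and latencies are illustrative, not McKinley's actual values:

```python
# Toy scoreboard for an in-order-issue machine: each instruction stalls until
# its source registers' results are ready, then completes after its own
# latency, so completion may be out of program order.
def issue_times(program):
    """program: list of (dest, srcs, latency). Returns per-instruction issue cycles."""
    ready = {}          # register -> cycle its value becomes available
    cycle = 0
    issues = []
    for dest, srcs, latency in program:
        # Interlock: wait until every source operand has been produced.
        cycle = max([cycle] + [ready.get(r, 0) for r in srcs])
        issues.append(cycle)
        ready[dest] = cycle + latency
        cycle += 1      # in-order issue: later instructions cannot pass this one
    return issues

prog = [("r1", [], 1), ("r2", ["r1"], 2), ("r3", ["r2"], 1), ("r4", [], 1)]
print(issue_times(prog))  # [0, 1, 3, 4]: r3 stalls on r2; r4 waits behind r3
```

Note that the independent `r4` still issues after `r3`: in-order issue means a stalled instruction blocks everything behind it, which is why the compiler-driven scheduling EPIC relies on matters so much.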

Processor Pipeline. McKinley Issue Ports
Issue ports:
- 4 Mem/ALU/Multimedia
- 2 Integer/ALU/Multimedia
- 2 FMAC
- 3 Branch
4 memory ports:
- Integer: 2 loads AND 2 stores per clock
- FP: 2 FP load-pairs AND 2 stores per clock to feed 2 FMACs
L1 data cache: two load ports and two store ports (1-cycle latency). L1 instruction cache delivers two instruction bundles. Six arithmetic logic units.
Substantial performance headroom for FP and integer kernels. Page 13

Processor Pipeline. McKinley Dispersal Matrix
[Matrix of bundle-template pairs, with MII, MLI, MMI, MFI, MMF, MIB*, MBB, BBB, MMB*, MFB* on each axis (* = hint in first bundle), marking which pairings are a possible McKinley full issue and which are a possible full issue on both the Itanium processor and McKinley]
McKinley allows more compiler dispersal options. Page 14
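As a rough illustration of why some template pairs can fully issue and others cannot, here is a toy check against McKinley's port counts (4 memory, 2 integer, 2 FP, 3 branch). This is only a sketch: real dispersal also depends on the instruction types within each slot, slot ordering, and hints, so actual matrix entries differ.

```python
# Toy dispersal check: can two 3-slot bundle templates fully issue in one
# clock, counting only slot types against McKinley's issue-port counts?
PORTS = {"M": 4, "I": 2, "F": 2, "B": 3}

def can_full_issue(t1, t2):
    need = {"M": 0, "I": 0, "F": 0, "B": 0}
    for slot in t1 + t2:
        unit = "I" if slot == "L" else slot  # treat movl's L slot as an I slot here
        need[unit] += 1
    return all(need[u] <= PORTS[u] for u in need)

print(can_full_issue("MFI", "MFI"))  # True in this model: 2M + 2F + 2I all fit
print(can_full_issue("MII", "MII"))  # False in this model: 4 I-slots, 2 I ports
```

The six-wide machine is not symmetric: memory slots are plentiful (4) while pure integer slots are the scarce resource (2), which is one reason template choice matters to the compiler.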

Processor Pipeline. McKinley Unit Latencies

Producing \ Consuming        Integer   Multimedia   Load/Store Address   Store Data
ALU (mem/integer ports)      1         2            1                    1
ALU (integer-only ports)     1         2            1                    1
Multimedia                   3         2            3                    3
Integer loads (L1D hit)      1         2            2                    1

Short latencies and full bypasses improve performance for re-optimized code. Page 15

Processor Pipeline. Floating Point Latencies

Operation                        Latency
FP load (4) (L2 cache hit)       6
FMAC, FMISC (2)                  4
FP -> Int (getf)                 5
Int -> FP (setf)                 6
Fcmp to branch                   2
Fcmp to qualifying predicate     2

Short latencies mean performance upside for re-optimized FP code. Page 16
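The FMAC latency and unit count above determine how much independent work a re-optimized FP loop needs in flight, a standard software-pipelining calculation:

```python
# To keep every FP unit busy each cycle, a reduction needs one independent
# accumulator chain per (unit x latency) slot; fewer chains leave bubbles.
def accumulators_needed(units, latency_cycles):
    return units * latency_cycles

# 2 FMACs with a 4-cycle latency: unroll a dot product into 8 partial sums,
# then combine them after the loop.
print(accumulators_needed(2, 4))  # 8
```

This is why the slides emphasize short FP latencies: a 4-cycle FMAC needs only 8 parallel chains to saturate the machine, where a longer-latency design would demand more unrolling and more registers.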

Processor Front-End. Branch Prediction
Zero-clock branch prediction; 2-level branch prediction hierarchy:
- L1IBR: Level 1 branch cache, part of the L1 I-cache; 1K trigger predictions + 0.5K target addresses
- L2B: Level 2 branch cache (12K histories)
- PHT: pattern history table (16K counters)
Reduced prediction penalties:
- IP-relative branch w/ correct prediction: 0 cycles
- IP-relative branch w/ wrong target: 1 cycle
- Return branch w/ correct prediction: 1 cycle
- Last branch in counted loop prediction: 0 cycles
- Branch misprediction: 6 cycles
Reduced branch penalties speed up existing code. Page 17
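A pattern history table like the one above can be sketched as a branch-history register indexing 2-bit saturating counters. The sizes and indexing here are illustrative, not McKinley's actual design:

```python
# Minimal 2-level predictor sketch: recent outcomes (the history register)
# select a 2-bit saturating counter in a pattern history table (PHT).
class PHTPredictor:
    def __init__(self, history_bits=4):
        self.history = 0
        self.mask = (1 << history_bits) - 1
        self.pht = [2] * (1 << history_bits)  # 2-bit counters, init weakly taken

    def predict(self):
        return self.pht[self.history] >= 2    # counter >= 2 predicts taken

    def update(self, taken):
        c = self.pht[self.history]
        self.pht[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | taken) & self.mask

p = PHTPredictor()
correct = 0
for taken in [1, 1, 1, 0] * 8:   # a loop taken 3 times, then the exit, repeated
    correct += p.predict() == bool(taken)
    p.update(taken)
print(f"{correct}/32 correct")   # near-perfect once the pattern is learned
```

With enough history bits, each distinct history uniquely determines the next outcome of a short repeating pattern, which is how history-based predictors capture loop exits that a single counter per branch would always miss.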

Processor Front-End. Instruction Prefetching
Streaming prefetching:
- Initiated by br.many (hint on branch instruction)
- CPU prefetches ahead of the sequential execution stream
- Streaming prefetch is cancelled by: a predicted-taken branch in the front-end; a branch misprediction on the back-end; software cancelling the prefetch with a brp instruction
Branch prefetching hints:
- Initiated by brp.few, brp.many, or mov_to_br
- One-time prefetch for the target
- Two hint prefetches can be initiated per cycle
Software-initiated instruction prefetching improves performance by lowering instruction fetch penalties. Page 18

Processor Caches. McKinley Caches

                        L1I           L1D              L2               L3
Size                    16K           16K              256K             3M on die
Line size               64B           64B              128B             128B
Ways                    4             4                8                12
Replacement             LRU           NRU              NRU              NRU
Latency (load to use)   I-fetch: 1    INT: 1, FP: NA   INT: 5, FP: 6    12
Write policy            -             WT (RA)          WB (WA+RA)       WB (WA)
Bandwidth               R: 32 GB/s    R: 16 GB/s,      R: 32 GB/s,      R: 32 GB/s,
                                      W: 16 GB/s       W: 32 GB/s       W: 32 GB/s

All caches are physically indexed, pipelined, and non-blocking: scoreboarded registers allow continued execution until load use. Page 19
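The load-to-use latencies in the table feed directly into an average-memory-access-time estimate. The hit rates and the ~150-clock memory latency below are illustrative assumptions, not McKinley data; only the 1/5/12-clock cache latencies come from the table:

```python
# Average memory access time for a multi-level hierarchy: each level serves
# the fraction of accesses that reach it and hit; the rest fall through.
def amat(levels, memory_latency):
    """levels: list of (latency, hit_rate) checked in order."""
    time, miss = 0.0, 1.0
    for latency, hit_rate in levels:
        time += miss * hit_rate * latency
        miss *= (1 - hit_rate)
    return time + miss * memory_latency

# Assumed hit rates: 90% L1D, 95% L2, 80% L3; assumed 150-clk memory latency.
clocks = amat([(1, 0.90), (5, 0.95), (12, 0.80)], memory_latency=150)
print(f"{clocks:.3f} clocks per access")  # ~1.573 under these assumptions
```

Under these assumed rates the hierarchy averages well under 2 clocks per access, which shows why the slides stress the drop in L2 latency (12 to 5 clocks) and the on-die L3: the second and third levels dominate the miss-side terms.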

Processor Caches. L1D (1-Clock Integer Cache)
High performance: 16 GB/s; 2 loads AND 2 stores per clock
Write-through: all stores are pushed to the L2; FP loads force a miss, FP stores invalidate
True dual-ported read access: no load conflicts
Pseudo-dual-ported store write access:
- 2 store-coalescing buffers per port hold data until the L1D update
- Store-to-load forwarding
A one-clock data cache provides a significant performance benefit. Page 20
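Store-to-load forwarding, mentioned above, means a load checks pending stores before reading the cache. A minimal sketch, assuming word-sized, exactly-matching addresses (real hardware also handles partial overlaps and does this as an associative match, not a scan):

```python
# Toy store buffer with store-to-load forwarding and write-through drain.
class StoreBuffer:
    def __init__(self):
        self.pending = []  # (address, data), oldest first

    def store(self, addr, data):
        self.pending.append((addr, data))

    def load(self, addr, cache):
        for a, d in reversed(self.pending):   # youngest matching store wins
            if a == addr:
                return d
        return cache.get(addr, 0)

    def drain(self, cache):                   # push buffered stores to the cache
        for a, d in self.pending:
            cache[a] = d
        self.pending.clear()

cache = {0x100: 7}
sb = StoreBuffer()
sb.store(0x100, 42)
print(sb.load(0x100, cache))  # 42: forwarded from the buffer, not the stale cache
```

Without forwarding, a load after a recent store to the same address would have to stall until the buffered store drained; with it, the common store-then-load idiom keeps the 1-clock load latency.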

Processor Caches. L2 and L3 Cache
L2: 256KB, 32 GB/s, 5 clk
- Data array is pseudo-4-ported: 16 banks of 16KB each
- Non-blocking, out-of-order L2 queue (32 entries): holds all in-flight loads/stores; out-of-order service smooths over load/store/bank conflicts and fills
- Can issue/retire 4 stores/loads per clock
- Can bypass the L2 queue (5, 7, 9 clk bypass) if there are no address or bank conflicts in the same issue group and no prior ops in the L2 queue want access to the L2 data arrays
L3: large 3MB, 32 GB/s, 12 clk cache on die; single-ported, full cache line transfers
Large on-die L2 and L3 caches provide significant performance potential. Page 21
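The bank-conflict condition gating the L2 bypass can be sketched as follows. The slides only say the 256KB array is built from 16 banks of 16KB; the 16-byte interleaving granularity below is an assumption for illustration:

```python
# Toy bank-conflict check for a 16-bank data array, assuming consecutive
# 16-byte chunks rotate round-robin across banks.
NUM_BANKS = 16
CHUNK = 16  # bytes per bank before moving to the next bank (assumed)

def bank(addr):
    return (addr // CHUNK) % NUM_BANKS

def conflict_free(addrs):
    """True if all accesses in one issue group hit distinct banks."""
    banks = [bank(a) for a in addrs]
    return len(set(banks)) == len(banks)

print(conflict_free([0x000, 0x010, 0x020, 0x030]))  # True: four distinct banks
print(conflict_free([0x000, 0x100]))                # False: both map to bank 0
```

Sequential accesses spread cleanly across banks, while strided accesses whose stride is a multiple of `NUM_BANKS * CHUNK` pile onto one bank and fall back to the out-of-order L2 queue instead of the fast bypass.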

Processor Caches. TLBs
2-level TLB hierarchy:
- DTC/ITC (32/32 entries, fully associative, 0.5 clk): small, fast translation caches tied to L1D/L1I; key to achieving very fast 1-clk L1D and L1I cache accesses
- DTLB/ITLB (128/128 entries, fully associative, 1 clk): all architected page sizes (4K to 4GB); supports up to 64/64 ITRs/DTRs; a TLB miss starts the hardware page walker
Small, fast TLBs enable low-latency caches. Page 22
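The two-level lookup above (tiny first-level translation cache, larger second-level TLB, hardware page walk on a full miss) can be sketched like this; the sizes, eviction policy, page table, and fixed 4KB-page assumption are all illustrative:

```python
# Sketch of a two-level TLB: hit in the small L1 cache, fall back to the
# larger L2 TLB, and walk the page table on a miss, refilling both levels.
PAGE = 4096

class TwoLevelTLB:
    def __init__(self, page_table, l1_size=32, l2_size=128):
        self.page_table = page_table            # virtual page -> physical page
        self.l1, self.l2 = {}, {}
        self.l1_size, self.l2_size = l1_size, l2_size

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE)
        if vpn in self.l1:                      # fast path: first-level hit
            return self.l1[vpn] * PAGE + offset, "L1"
        if vpn in self.l2:                      # second-level hit; refill L1
            ppn, level = self.l2[vpn], "L2"
        else:                                   # full miss: hardware page walk
            ppn, level = self.page_table[vpn], "walk"
            if len(self.l2) >= self.l2_size:
                self.l2.pop(next(iter(self.l2)))  # crude FIFO-ish eviction
            self.l2[vpn] = ppn
        if len(self.l1) >= self.l1_size:
            self.l1.pop(next(iter(self.l1)))
        self.l1[vpn] = ppn
        return ppn * PAGE + offset, level

tlb = TwoLevelTLB({0: 5, 1: 9})
print(tlb.translate(0x10))   # first access: page walk
print(tlb.translate(0x10))   # repeat: first-level hit
```

The design point the slide makes is that only the tiny first level needs to match the 0.5-clock cache-access budget; capacity lives in the slower second level.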

System Bus. System Bus Enhancements
Extension of the Itanium processor bus:
- Same protocol with minor extensions
- Increased to 6.4 GB/s bandwidth: 200 MHz frequency, 400 MT/s data, 128-bit data bus
- Bus is non-blocking and out of order: most transactions can be deferred for later service
- Buffering: 18 bus requests per CPU are allowed to be outstanding; 16 read-line + 6 write-line + two 128-byte WC buffers
McKinley significantly extends the system bus performance level. Page 23
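The outstanding-request count matters because, by Little's law, it bounds the bandwidth a single CPU can sustain on an out-of-order bus. The 150 ns round-trip latency below is an illustrative assumption; the 18 requests and 128-byte lines come from the slide:

```python
# Little's law for memory traffic: sustained bandwidth is limited by
# (requests in flight) x (bytes per request) / (round-trip latency).
def sustained_gb_s(outstanding, line_bytes, latency_ns):
    return outstanding * line_bytes / latency_ns  # bytes/ns == GB/s

print(f"{sustained_gb_s(18, 128, 150):.2f} GB/s")  # 15.36 GB/s ceiling
```

With 18 in-flight 128-byte requests, the concurrency ceiling sits comfortably above the 6.4 GB/s wire rate under this assumed latency, so the bus, not the buffering, is the limit.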

System Bus. New Bus Transactions
- L3 cast-outs (normally silent L3 replacements, E->I, S->I): reduce snoop traffic in directory-based systems; backward inquiry for L2/L1 coherency
- Memory read current: non-destructive (non-coherent) snoop of CPU lines; used in high-bandwidth graphics-based systems
- Cache cleanse: writes all modified lines to memory (M->E); used in fault-tolerant systems, invoked via PAL
McKinley provides several new bus transactions to improve performance/reliability. Page 24

Reliability Features. Error Features
- Error detection on all major arrays
  - Parity coverage on L1D, L1I, and TLBs
  - ECC on L2 and L3: double-bit detection, single-bit correction
  - Out-of-path repair; all errors are fully contained
- Bus is covered with parity/ECC: double-bit detection, single-bit correction on transmission
- Error isolation (end-to-end error detection)
  - From memory: a unique FSB 2xECC syndrome encoding can tolerate additional single-bit errors in transmission
  - An error is not reported until referenced by a consuming process
McKinley provides extensive error detection/correction/containment. Page 25
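The "single-bit correction, double-bit detection" ECC above is conventionally built from a Hamming code plus an overall parity bit. As a toy illustration of the syndrome mechanism, here is the classic Hamming(7,4) code; real cache ECC uses much wider codes (commonly 8 check bits over a 64-bit word) and adds the extra parity bit for double-bit detection:

```python
# Hamming(7,4): parity bits at positions 1, 2, 4 each cover the positions
# whose index has that bit set. A nonzero syndrome names the flipped bit.
def hamming74_encode(d):              # d: list of 4 data bits
    code = [0] * 8                    # positions 1..7 (index 0 unused)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]                   # 7-bit codeword

def hamming74_correct(bits):          # bits: 7-bit codeword, maybe corrupted
    code = [0] + list(bits)
    syndrome = 0
    for p in (1, 2, 4):
        parity = 0
        for i in range(1, 8):
            if i & p:
                parity ^= code[i]
        if parity:
            syndrome |= p
    if syndrome:                      # syndrome == position of the bad bit
        code[syndrome] ^= 1
    return [code[3], code[5], code[6], code[7]], syndrome

cw = hamming74_encode([1, 0, 1, 1])
cw[4] ^= 1                            # inject a single-bit error at position 5
print(hamming74_correct(cw))          # ([1, 0, 1, 1], 5): data recovered
```

The key property the slide relies on is that correction happens from the syndrome alone, so a scrub or out-of-path repair can fix a failing bit without stalling the consuming access.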

Reliability Features. Thermal Management
Programmable fail-safe thermal trip: McKinley will reduce power consumption:
- Power consumption reduced to ~60% of peak; execution rate dropped to 1 instruction per clock
- Corrected Machine Check notification posted to the OS
- Full-speed execution resumes when the temperature drops
- Never invoked in properly designed and operating cooling systems, even on worst-case power code
McKinley provides a thermal fail-safe mechanism in the event of a cooling failure. Page 26

Summary
McKinley builds on and extends the Itanium processor family to meet the needs of the most demanding enterprise and technical computing environments:
- Enhanced McKinley features are a result of the extensible Itanium architecture
- McKinley is compatible with Itanium processor software
Major enhancements include:
- Increased frequency
- Enhanced micro-architecture: more execution units and issue ports
- Efficient data handling: higher bandwidth and reduced latencies
Estimates: McKinley-based systems will deliver ~1.5X-2X performance improvement over today's Itanium-based systems. Page 27