IBM's POWER5 Micro Processor Design and Methodology

IBM's POWER5 Micro Processor Design and Methodology
Ron Kalla
IBM Systems Group

Outline
- POWER5 Overview
- Design Process
- Power

POWER Server Roadmap
- 2001, POWER4: 180 nm; two 1.3 GHz cores; shared L2; distributed switch
  - Chip multiprocessing (distributed switch, shared L2)
  - Dynamic LPARs (16)
- 2002-3, POWER4+: 130 nm; two 1.7 GHz cores; shared L2; distributed switch
  - Reduced size
  - Lower power
  - Larger L2
  - More LPARs (32)
- 2004*, POWER5: 130 nm; two "> GHz" cores; shared L2; distributed switch
  - Simultaneous multi-threading
  - Sub-processor partitioning
  - Dynamic firmware updates
  - Enhanced scalability, parallelism
  - High throughput performance
  - Enhanced memory subsystem
- 2005*, POWER5+: 90 nm; two ">> GHz" cores; shared L2; distributed switch
- 2006*, POWER6: 65 nm; ultra high frequency cores; L2 caches; advanced system features
- Autonomic Computing Enhancements
* Planned to be offered by IBM. All statements about IBM's future direction and intent are subject to change or withdrawal without notice and represent goals and objectives only.

POWER5
- Technology: 130 nm lithography, Cu, SOI
- 389 mm², 276M transistors
- Dual processor core
- 8-way superscalar
- Simultaneous multithreaded (SMT) core
  - Up to 2 virtual processors per real processor

Multi-threading Evolution
[Figure: four panels showing which thread occupies each execution unit (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) in each cycle.]
- Panels: Single Thread; Fine Grain Threading; Coarse Grain Threading; Simultaneous Multi-Threading
- Legend: Thread 0 Executing; Thread 1 Executing; No Thread Executing
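The four panels can be imitated with a toy slot-filling model. The sketch below is only an illustration under stated assumptions: the per-cycle demand sets, the function names `fill_slots` and `utilization`, the rule that a coarse-grain switch happens when the active thread has nothing to issue, and the rule that thread 0 wins unit conflicts under SMT are all hypothetical and are not taken from the slides.

```python
from typing import List, Set

# Execution units named on the slide, in the order shown there.
UNITS = ["FX0", "FX1", "FP0", "FP1", "LS0", "LS1", "BRX", "CRL"]


def fill_slots(mode: str, demand0: List[Set[str]], demand1: List[Set[str]]) -> List[str]:
    """Return one string per cycle; each character is '0', '1', or '.' for one unit.

    demand0/demand1 list, per cycle, the units each thread could use (toy input).
    """
    rows = []
    active = 0  # thread that currently owns the pipeline (coarse-grain mode only)
    for cycle, (d0, d1) in enumerate(zip(demand0, demand1)):
        if mode == "single":
            owner = {u: "0" for u in d0}
        elif mode == "fine":
            t = cycle % 2  # alternate threads every cycle
            owner = {u: str(t) for u in (d0, d1)[t]}
        elif mode == "coarse":
            d = (d0, d1)[active]
            owner = {u: str(active) for u in d}
            if not d:  # assumed rule: switch threads after a stalled cycle
                active = 1 - active
        elif mode == "smt":
            owner = {u: "1" for u in d1}
            owner.update({u: "0" for u in d0})  # assumed rule: thread 0 wins conflicts
        else:
            raise ValueError(f"unknown mode: {mode}")
        rows.append("".join(owner.get(u, ".") for u in UNITS))
    return rows


def utilization(rows: List[str]) -> float:
    """Fraction of unit-cycles occupied by either thread."""
    used = sum(ch != "." for row in rows for ch in row)
    return used / (len(rows) * len(UNITS))


if __name__ == "__main__":
    d0 = [{"FX0", "LS0"}, set(), {"FP0", "FP1"}, {"FX0"}]
    d1 = [{"FX1", "LS1"}, {"FX0", "BRX"}, {"LS0"}, {"FX0", "CRL"}]
    for mode in ("single", "fine", "coarse", "smt"):
        rows = fill_slots(mode, d0, d1)
        print(mode, rows, round(utilization(rows), 3))
```

In this toy input only the SMT mode lets both threads occupy units in the same cycle, which is the distinction the figure draws.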

Thread Priority
- Instances when unbalanced execution is desirable
  - No work for opposite thread
  - Thread waiting on lock
  - Software-determined non-uniform balance
  - Power management
- Solution: control instruction decode rate
  - Software/hardware controls 8 priority levels for each thread
[Figure: Thread 0 IPC and Thread 1 IPC plotted against "Thread 1 Priority - Thread 0 Priority". Horizontal axis ticks: 0,7; -5; -3; -1; 0; 1; 3; 5; 7,0; 1,1. The 0,7 and 7,0 points are labeled Single Thread Mode; the 1,1 point is labeled Power Save Mode.]
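The slide says only that priority is enforced by controlling the decode rate. As a hedged sketch, the function below uses the rule given in published POWER5 descriptions, which is an assumption here and not stated on the slide: with priorities X and Y, compute R = 2^(|X - Y| + 1), and the lower-priority thread receives 1 of every R decode cycles. The function name `decode_shares`, the mode labels, and the handling of the special cases are hypothetical; combinations the slide does not show are left unmodeled.

```python
from fractions import Fraction
from typing import Tuple


def decode_shares(p0: int, p1: int) -> Tuple[str, Fraction, Fraction]:
    """Return (mode, share of decode cycles for thread 0, share for thread 1).

    Assumed rule (not from the slide): R = 2 ** (|p0 - p1| + 1) and the
    lower-priority thread gets 1/R of the decode cycles.
    """
    if not (0 <= p0 <= 7 and 0 <= p1 <= 7):
        raise ValueError("priorities must be in 0..7")
    if (p0, p1) == (1, 1):
        # Slide labels 1,1 as Power Save Mode; the overall throttling is not modeled.
        return ("power-save", Fraction(1, 2), Fraction(1, 2))
    if p0 == 0 and p1 >= 2:
        return ("single-thread", Fraction(0), Fraction(1))
    if p1 == 0 and p0 >= 2:
        return ("single-thread", Fraction(1), Fraction(0))
    if p0 < 2 or p1 < 2:
        raise ValueError("combination not modeled in this sketch")
    r = 2 ** (abs(p0 - p1) + 1)
    low = Fraction(1, r)
    high = 1 - low
    if p0 >= p1:
        return ("smt", high, low)
    return ("smt", low, high)


if __name__ == "__main__":
    for pair in [(4, 4), (6, 2), (2, 7), (0, 7), (1, 1)]:
        print(pair, decode_shares(*pair))
```

With equal priorities the rule gives each thread half of the decode cycles, which matches the balanced middle of the plotted curve.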

Terminology
- PowerPC Addresses: Effective > (SLB) > Virtual > (Page Table) > Real > (LPAR) > Physical
- Instruction Execution: I-fetch > Decode > Dispatch > Issue > Finish > Complete
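The address chain above can be sketched as three table lookups. This is a toy model under stated assumptions: the 256 MB segment size, 4 KB page size, 16 MB LPAR block size, the dictionary-based tables, and the function name `translate` are all illustrative choices, not details given on the slide.

```python
from typing import Dict

SEGMENT_BITS = 28  # assumed 256 MB segments
PAGE_BITS = 12     # assumed 4 KB pages
BLOCK_BITS = 24    # assumed 16 MB LPAR memory blocks (hypothetical granularity)


def translate(ea: int,
              slb: Dict[int, int],
              page_table: Dict[int, int],
              lpar_map: Dict[int, int]) -> int:
    """Effective > (SLB) > Virtual > (Page Table) > Real > (LPAR) > Physical."""
    # Effective to virtual: the SLB maps an effective segment ID to a virtual segment ID.
    esid = ea >> SEGMENT_BITS
    vsid = slb[esid]
    va = (vsid << SEGMENT_BITS) | (ea & ((1 << SEGMENT_BITS) - 1))

    # Virtual to real: the page table maps a virtual page number to a real page number.
    vpn = va >> PAGE_BITS
    rpn = page_table[vpn]
    ra = (rpn << PAGE_BITS) | (va & ((1 << PAGE_BITS) - 1))

    # Real to physical: the partition's real memory blocks map to physical blocks.
    block = lpar_map[ra >> BLOCK_BITS]
    return (block << BLOCK_BITS) | (ra & ((1 << BLOCK_BITS) - 1))


if __name__ == "__main__":
    slb = {0x1: 0xABC}
    page_table = {(0xABC << 16) | 0x2: 0x55}
    lpar_map = {0x0: 0x3}
    ea = (0x1 << 28) | (0x2 << 12) | 0x34
    print(hex(translate(ea, slb, page_table, lpar_map)))
```

A missing entry raises `KeyError` here, standing in for the SLB miss or page fault that real hardware would report.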

Multithreaded Instruction Flow in Processor Pipeline
[Figure: pipeline stage diagram above a block diagram of the instruction flow.]
- Pipeline stages: Instruction Fetch (IF, IC, BP); Group Formation and Instruction Decode (D0, D1, D2, D3, Xfer, GD); Out-of-Order Processing (MP, ISS, RF, then EX for BR and FX, EA/DC/Fmt for LD/ST, F6 for FP, followed by WB and Xfer); CP. Branch Redirects and Interrupts & Flushes are marked on the diagram.
- Blocks labeled in the diagram: Program Counter; Instruction Cache; Instruction Translation; Alternate; Branch History Tables; Instruction Buffer 0; Instruction Buffer 1; Branch Prediction; Thread Priority; Return Stack; Target Cache; Group Formation, Instruction Decode, Dispatch; Shared Register Mappers; Shared Issue Queues; Dynamic Instruction Selection; Read Shared Register Files; Shared Execution Units (LSU0, FXU0, LSU1, FXU1, FPU0, FPU1, BXU, CRL); Write Shared Register Files; Group Completion; Data Translation; Store Queue; Data Cache; L2 Cache
- Legend: Shared by two threads; Resource used by thread 0; Resource used by thread 1

Resource Sizes
- Analysis done to optimize every micro-architectural resource size
  - GPR/FPR rename pool size
  - I-fetch buffers
  - Reservation Station
  - SLB/TLB/ERAT
  - I-cache/D-cache
- Many workloads examined
- Associativity also examined
[Figure: IPC versus Number of GPR Renames (about 50 to 130), with ST and SMT curves. Results based on simulation of an online transaction processing application. Vertical axis does not originate at 0.]

Single Thread Operation
- Advantageous for execution unit limited applications
  - Floating or fixed point intensive workloads
- Execution unit limited applications provide minimal performance leverage for SMT
  - Extra resources necessary for SMT provide higher performance benefit when dedicated to a single thread
- Determined dynamically on a per processor basis
[Figure: IPC on Matrix Multiply for POWER4+, POWER5 ST, and POWER5 SMT.]

Modifications to POWER4 System Structure
[Figure: block diagrams of the system structure. Labeled blocks: processors (P), L2, L3, Fab Ctl, Mem Ctl, Memory.]

16-way Building Block
[Figure: a Book built from MCMs. Labeled blocks: Book, MCM, POWER5, L3, Memory, I/O.]

POWER5 Multi-chip Module
- 95 mm × 95 mm
- Four POWER5 chips
- Four cache chips
- 4,491 signal I/Os
- 89 layers of metal

64-way SMP Interconnection
- Interconnection exploits enhanced distributed switch
- All chip interconnections operate at half processor frequency and scale with processor frequency

Design Process
Concept Phase (~10 people / 4 months)
- Competitive assessment
- Customer requirements
- Technology assessment
  - Try to match technology introduction with products
- Area/power estimates
- Cost
- Schedule
- Staffing
- Lots of executive/technical reviews

Design Process (cont.)
High Level Design Phase (~50 people / 6 months)
- Micro-architecture
- Cycle time closed
  - Cross section of critical paths
- Power closed
  - Area/power budgets established
- Arrays architected
- All unit I/O defined; timing contract in place
- Performance model meets concept phase targets
- HLD exit review: everything is now in place to effectively engage the full team

Design Process (cont.)
Implementation Phase (~200 people / 12-18 months)
- Write VHDL
- Schematic entry for circuits
- Simulation
  - Block > unit > core > chip > system
- Integration process begins
  - RLM and custom macros > unit > core > chip
- Design rule checking (can't trust humans)
- Bulk simulation
- Check, recheck, and check again
- RIT (Release Information Tape)

Bring-up
100s of people, 12-18 months
- Starts at wafer test
- AVPs
- Random test programs
- Boot operating system
- OS based testing
- Formal testing (new groups of people)
- POWER (will discuss later)
- Performance testing
- Additional RITs if necessary
- Release to mfg
- GA (Hurray)
- Field support

Power
Micro-Processor Power
- Complex equation
- DC Power (IDDQ)
  - Voltage
  - Temperature
  - Speed of chip (PSRO: Processor Speed Ring Oscillator)
- AC Power
  - Function of C V² f × Switching Factor
  - Workload dependent
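The AC power relation above can be turned into a small calculation. This is a sketch only: the function names `ac_power` and `chip_power` and every numeric value in the example are hypothetical and are not POWER5 measurements. DC (leakage) power is passed in as a plain number because the slide lists its dependencies (voltage, temperature, chip speed) without giving a formula.

```python
def ac_power(c_farads: float, vdd_volts: float, freq_hz: float,
             switching_factor: float) -> float:
    """AC (switching) power in watts: C * V^2 * f * switching factor."""
    return c_farads * vdd_volts ** 2 * freq_hz * switching_factor


def chip_power(dc_watts: float, c_farads: float, vdd_volts: float,
               freq_hz: float, switching_factor: float) -> float:
    """Total power as DC (leakage) power plus AC power."""
    return dc_watts + ac_power(c_farads, vdd_volts, freq_hz, switching_factor)


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the formula.
    c, f, sf = 1e-9, 1.5e9, 0.2
    for vdd in (1.2, 1.3):
        print(f"Vdd={vdd} V  AC power={ac_power(c, vdd, f, sf):.3f} W")
    # AC power scales with the square of the supply voltage at fixed C, f, and activity.
    print(ac_power(c, 1.3, f, sf) / ac_power(c, 1.2, f, sf))
```

Because the switching factor depends on the workload, the same chip at the same voltage and frequency can draw noticeably different AC power from one program to the next.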

GR DD2.0 Fmax vs. 1.2 V, 25 C PSRO
[Figure: unguardbanded Fmax at 1.3 V, 85 C. Axes: Fmax Frequency (MHz) and PSRO (ps/stage). Each point is a part; arrows mark the direction of faster parts.]

POWER5 Leakage Power vs. PSRO
[Figure: 9s2 leakage power at 1.3 V and 85 C. Axes: Leakage Power (W) and PSRO (ps/stage).]

POWER5 AC Power vs. Frequency
[Figure: 9s2 AC power at 1.3 V and 85 C. Axes: AC Power (W) and Frequency (MHz).]

Chip Power vs. Frequency
[Figure: chip power at 1.3 V, 85 C. Axes: Power (W) and Frequency (MHz).]

Other SMT Considerations
- Power management
  - SMT increases execution unit utilization
  - Dynamic power management does not impact performance
- Debug tools / lab bring-up
  - Instruction tracing
  - Hang detection
  - Forward progress monitor
- Performance monitoring
- Serviceability