POWER6 Processor and Systems


Jim McInnes (jimm@ca.ibm.com), Compiler Optimization, IBM Canada Toronto Software Lab

Role
- Technical leader in the Compiler Optimization Team
- Focal point to the hardware development team
- Member of the Power ISA Architecture Board
- For each new microarchitecture, I help direct the design toward helpful features
- Design and deliver specific compiler optimizations to enable hardware exploitation

POWER5 Chip Overview
- High-frequency dual-core chip
- 8 execution units: 2 LS, 2 FP, 2 FX, 1 BR, 1 CR
- 1.9 MB on-chip shared L2: point of coherency, 3 slices
- On-chip directory and controller
- On-chip memory controller
Technology & Chip Stats
- 130 nm lithography, Cu, SOI
- 276M transistors, 389 mm² die
- I/Os: 2313 signal, 3057 power/ground

POWER6 Chip Overview
- Ultra-high-frequency dual-core chip
- 8 execution units: 2 LS, 2 FP, 2 FX, 1 BR, 1 VMX
- 2 x 4 MB on-chip L2: point of coherency, 4 quads
- On-chip directory and controller
- Two on-chip memory controllers
Technology & Chip Stats
- CMOS 65 nm lithography, SOI Cu
- 790M transistors, 341 mm² die
- I/Os: 1953 signal, 5399 power/ground
- Full error checking and recovery (RU)
[Die floorplan labels: per-core IFU, SDU, LSU, FXU, FPU, VMX, RU; four L2 quads with L2 controllers; memory controllers (MC), fabric bus controller (FBC), GX controller (GXC)]

POWER4 / POWER5 / POWER6
[Block diagrams comparing the three generations: dual cores with shared L2 control and a distributed switch (enhanced on POWER5), memory controllers moving on-chip, AltiVec control, GX bus controllers and GX+ bridge; POWER6 shows a 4 MB L2 per core and a fabric bus controller]

POWER6 I/O: Speeds and Feeds
- DRAM memory connected by up to 4 channels of 533-800 MHz DDR2 DIMMs, through Nova buffer chips (channels run at 4x DRAM frequency)
- Memory port (1 of 2): 8B read (4x2B), 4B write (4x1B)
- Off-node fabric buses (2 pairs): 4B/bc or 8B/bc per unidirectional pair
- On-node fabric buses (3 pairs): 2B/bc or 8B/bc per unidirectional pair
- L2 reload: 32B/pc; total L2 on chip: 8 MB
- L3: 32 MB (2 x 16 MB); buses: 16B/bc read, 16B/bc write (split into two, 8 and 8); may be a single 32 MB chip with 8B buses
- I/O interface: 4B read, 4B write at 2-8:1 the pc
- Buses scale at 2:1 with core frequency (pc = processor clock, bc = bus clock, 2 pc = 1 bc)

POWER6 Core Pipelines
[Pipeline diagram]

POWER5 vs. POWER6 Pipeline Comparison
POWER5 pipe stages (22 FO4):
- Stages: pre-decode, I-cache, xmit, IBUF, group form, decode, group xfer, group dispatch, rename, issue, RF, AG/Ex, D-cache, fmt, writeback, finish, completion
- 3-4 cycle redirect; 12-cycle branch resolution
- Latencies: 2-cycle Fx-Fx, 6-cycle FP-FP, 5-cycle LD-FP, 4-cycle load-load, 2-cycle load-Fx
POWER6 pipe stages (13 FO4):
- Stages: I-cache, xmit, rotate, IBUF0, assembly, IBUF1, pre-dispatch, group dispatch, RF, AG/RF, D-cache/Ex, D-cache, fmt1, fmt2, writeback, completion, checkpoint
- 5-cycle redirect; 14-cycle branch resolution
- Latencies: 1-cycle Fx-Fx, 3-cycle load-load/fx, 6-cycle FP to local FP use, 8-cycle FP to remote FP use, 0-cycle LD to FP use

Side-by-Side Comparisons
            POWER5+                                  POWER6
Style       General out-of-order execution           Mostly in-order, with special-case out-of-order execution
Units       2 FX, 2 LS, 2 FP, 1 BR, 1 CR             2 FX, 2 LS, 2 FP, 1 BR/CR, 1 VMX
Threading   2 SMT threads; alternate ifetch;         2 SMT threads; priority-based dispatch;
            alternate dispatch (up to 5 insns)       simultaneous dispatch from two threads (up to 7 insns)

Side-by-Side Comparisons (continued)
                                      POWER5+                 POWER6
L1 ICache capacity, associativity     64 KB, 2-way            64 KB, 4-way
L1 DCache capacity, associativity     32 KB, 4-way            64 KB, 8-way
L2 (point of coherency)
  Capacity, line size                 1.9 MB, 128 B line      2 x 4 MB, 128 B line each
  Associativity, replacement          10-way, LRU             8-way, LRU
Off-chip cache
  Capacity, line size                 36 MB, 256 B line       32 MB, 128 B line
  Associativity, replacement          12-way, LRU             16-way, LRU
Memory maximum                        4 TB                    8 TB
Memory bus                            2x DRAM frequency       4x DRAM frequency

Latency Profiles
[Chart: access time (ns, 0-160) vs. working-set size (MB, 0.1-1000, log scale) for P5 @ 1.9 GHz and P6 @ 4.7 GHz, from the L2 region out to local memory]

Processor Core Functional Features
- PowerPC AS architecture
- Simultaneous multithreading (2 threads)
- Decimal floating point (extension to PowerPC ISA)
- 48-bit real address support
- Virtualization support (1024 partitions)
- Concurrent support of 4 page sizes (4K, 64K, 16M, 16G)
- New storage key support
- Bi-endian support
- Dynamic power management
- Single-bit error detection on all dataflow
- Robust error recovery (R unit)
- CPU sparing support (dynamic CPU swapping)
- Common core for i, p, and Blade

Achieving High Frequency: the POWER6 13-FO4 Challenge
Circuit design:
- 1 FO4 = delay of 1 inverter that drives 4 receivers
- 1 logical gate = 2 FO4
- 1 cycle = latch + function + wire = 3 FO4 + function + 4 FO4, so function = 6 FO4 = 3 gates
Integration:
- It takes 6 cycles to send a signal across the core
- Communication between units takes 1 cycle using good wire
- Control across a 64-bit dataflow takes a cycle
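The cycle-budget arithmetic above can be written out as a small sketch. The constants simply restate the slide's numbers (13 FO4 per cycle, 3 FO4 latch overhead, 4 FO4 wire, 2 FO4 per gate); the helper names are my own, not IBM terminology.

```c
/* Illustrative restatement of the 13-FO4 cycle budget from the slide.
 * A cycle must fit latch overhead + useful logic + wire delay. */
enum { CYCLE_FO4 = 13, LATCH_FO4 = 3, WIRE_FO4 = 4, GATE_FO4 = 2 };

/* FO4 left over for actual logic after latch and wire are paid for. */
int logic_fo4_budget(void) { return CYCLE_FO4 - LATCH_FO4 - WIRE_FO4; }

/* How many gate levels fit in that logic budget. */
int gate_levels(void) { return logic_fo4_budget() / GATE_FO4; }
```

Evaluating these gives 6 FO4 of logic, i.e. about 3 gate levels per cycle, which is why so little work fits in each POWER6 stage.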

Sacrifices associated with high frequency
- In-order design instead of out-of-order
- Longer pipelines
- FP-compare latency
- FX multiply done in the FPU
- Reduced memory bandwidth, especially store-queue draining bandwidth
- Mitigating this: the SMT implementation is improved

In-Order vs. Out-of-Order
[Diagram: both designs fetch and dispatch in order into instruction queues feeding the functional units and store queue (STQ). With out-of-order execution, operands must be available at issue from the instruction queue; with in-order execution, operands must already be available at dispatch. Example sequence: Load R2, ...; Addi R2,R2,1; Load R3, ...]

P5 vs. P6 instruction groups
Both machines are superscalar: instructions are formed into groups and dispatched together. There are two big differences between P5 and P6:
- On P5 there are few restrictions on what types of instructions, and how many, can be grouped together; on P6 each group must fit in the available resources (2 FP, 2 FX, 2 LDST, ...)
- On P5, instructions are placed in queues to wait for their operands to be ready; on P6, dispatch must wait until all operands are ready for the whole group
On P6, cycles are shorter, but there are more stall cycles and more partial instruction groups.
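The P6 resource constraint can be sketched with a toy group-former. This is a simplification I wrote for illustration, assuming only the per-unit caps mentioned in the slides (2 FX, 2 LS, 2 FP, 1 BR); the real dispatch rules have many more restrictions.

```c
typedef enum { FX, LS, FP, BR, NCLASS } iclass;

/* Toy sketch of P6-style group formation: a group closes as soon as the
 * next instruction needs a unit slot that is already full. */
int count_groups(const iclass *insns, int n) {
    const int cap[NCLASS] = { 2, 2, 2, 1 };  /* 2 FX, 2 LS, 2 FP, 1 BR */
    int used[NCLASS] = { 0 };
    int groups = (n > 0);
    for (int i = 0; i < n; i++) {
        if (used[insns[i]] == cap[insns[i]]) {   /* slot full: start new group */
            groups++;
            for (int c = 0; c < NCLASS; c++) used[c] = 0;
        }
        used[insns[i]]++;
    }
    return groups;
}

/* Three back-to-back FX ops overflow the 2-slot FX cap: 2 groups. */
int groups_three_fx(void) { iclass s[] = { FX, FX, FX };     return count_groups(s, 3); }

/* A mixed sequence fits in one group. */
int groups_mixed(void)    { iclass s[] = { FX, LS, FP, BR }; return count_groups(s, 4); }
```

The second case shows why a compiler scheduling for P6 tries to interleave instruction classes: the same four instructions dispatch in one group rather than spilling into partial groups.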

Strategic NOP insertion
- ORI 1,1,0 terminates a dispatch group early
- Suppose at cycle 10 the current group contains FMA, FMA, LFL, all ready to go, and the only other available instruction is an LFL that will not be ready until cycle 15
- Including that LFL holds up the first three; this could be harmless or catastrophic
- Sometimes NOPs are inserted to fix this
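The slide's example can be modeled directly: since a P6 group dispatches only when every member is ready, a group's dispatch cycle is the maximum of its members' ready cycles. The function and case names below are illustrative, not real tooling.

```c
/* A group dispatches when its slowest member is ready. */
int group_dispatch_cycle(const int *ready, int n) {
    int latest = 0;
    for (int i = 0; i < n; i++)
        if (ready[i] > latest) latest = ready[i];
    return latest;
}

/* Combined group: FMA, FMA, LFL ready at cycle 10 plus an LFL ready at 15.
 * All four stall until cycle 15. */
int combined_case(void) {
    int ready[] = { 10, 10, 10, 15 };
    return group_dispatch_cycle(ready, 4);
}

/* Group split by an ORI 1,1,0 nop: the first three dispatch at cycle 10,
 * and the late LFL goes in its own group at 15. */
int split_first_case(void) {
    int ready[] = { 10, 10, 10 };
    return group_dispatch_cycle(ready, 3);
}
```

The NOP costs a dispatch slot, but it frees five cycles for the three ready instructions, which is the trade-off the compiler weighs.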

FP compares
- An FP compare has an 11-cycle latency to its use
- This is a significant delay; the best defence is for the programmer to work around it by arranging lots of work between a compare and where its result is used
- Several branch-free sequences are available for such code
- The compiler will in some cases try to replace FP compares with FX compares
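One of the branch-free shapes alluded to above is a floating-point select: when the consumer of the compare is a data choice rather than a control-flow decision, the Power compiler can map the ternary below to the branch-free fsel instruction, so the 11-cycle compare latency can overlap independent work instead of gating a branch. This is a minimal sketch of the pattern, not the compiler's actual output.

```c
/* Branch-free select: x >= 0 ? a : b.
 * On Power this form is a candidate for the fsel instruction,
 * avoiding a branch fed by a long-latency fp compare. */
double select_ge0(double x, double a, double b) {
    return (x >= 0.0) ? a : b;
}
```

For example, a clamp like `y = select_ge0(y, y, 0.0)` replaces an if/else whose branch would have to wait on the compare result.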

Store queue optimizations
- On POWER6, stores are placed into a queue to await completion
- This queue can drain 1 store every other cycle
- If the next store in the queue is to storage consecutive with the one ahead of it, both can go in that cycle
- A 2x8-byte FP store is provided to make this more convenient
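A tiny model makes the pairing rule concrete. The two-cycle beat and the consecutive-address pairing come from the slide; everything else (queue representation, test addresses) is illustrative.

```c
/* Toy model of the POWER6 store-queue drain rule: one beat every 2 cycles,
 * but a store to storage consecutive with the one ahead of it drains in
 * the same beat. */
int drain_cycles(const unsigned long *addr, const int *size, int n) {
    int cycles = 0;
    for (int i = 0; i < n; ) {
        cycles += 2;                         /* one drain beat = 2 cycles */
        if (i + 1 < n && addr[i + 1] == addr[i] + (unsigned long)size[i])
            i += 2;                          /* consecutive pair: both go */
        else
            i += 1;
    }
    return cycles;
}

/* Four consecutive 8-byte stores pair up: 2 beats = 4 cycles. */
int drain_consecutive(void) {
    unsigned long a[] = { 0, 8, 16, 24 }; int sz[] = { 8, 8, 8, 8 };
    return drain_cycles(a, sz, 4);
}

/* Four scattered stores drain one at a time: 4 beats = 8 cycles. */
int drain_scattered(void) {
    unsigned long a[] = { 0, 64, 128, 192 }; int sz[] = { 8, 8, 8, 8 };
    return drain_cycles(a, sz, 4);
}
```

This is why the compiler tries to emit stores to adjacent storage back to back, and why the paired 2x8-byte FP store helps: it guarantees the consecutive pattern.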

Simultaneous Multithreading (SMT)
- P6 operates in two modes: ST (single thread) and SMT (multithreaded)
- In SMT mode two independent threads execute simultaneously, possibly from the same parallel program
- Instructions from both threads can dispatch in the same group, subject to unit availability
- This is highly profitable on P6: it is a good way to fill otherwise empty machine cycles and achieve better resource usage

FPU details
- Two binary floating-point units and one decimal floating-point unit
- 7-stage FPU with bypass in 6 stages locally and 8 stages to the other FPU
- Interleaves execution of independent arithmetic instructions with divide/sqrt
- Handles fixed-point multiply/divide (pipelined)
- Always runs at or behind the other units
- 10-group-entry queues (group = 2 LDF and 2 FP/STF instructions); allows pipelined issue of dependent store instructions
- 10-group-entry load target buffer, allowing speculative FP load execution
- Multiple interrupt modes: ignore, imprecise non-recoverable, imprecise recoverable, precise

FPU: more detail
[Pipeline diagram. FP load path: dispatch, xmit, RF (GPRs), agen, CA, fmt, xmit into the load target buffer and FP registers. FP arithmetic/store path: issue queue (isq), dependency check, select, steer, issue, RF, then shift / align-multiply / add / normalize / round to writeback; store data goes to the LSU]

Linpack HPC Performance
[Bar chart: GFLOPS (0-250) for Power5+ vs. Power6 at 1-way, 8-way, and 16-way]

Datastream Prefetching in POWER6
- Load-Look-Ahead (LLA) mode: triggered on a data cache miss to hide latency
- Datastream prefetching with no filtering
- Sixteen (16) prefetch request queues (PRQs)
- Prefetch 2 lines ahead to L1, up to 24 to L2, demand paced
- Speculative L2 prefetch for the next line on a cache miss
- Depth/aggressiveness control via SPR and dcbt instructions; dscr_ctl() / dscrctl change the O/S default
- Store prefetching into L2 (with hardware detection)
- Full complement of line and stream touches for both loads and stores
- Sized for dual-thread performance (8 PRQs/thread)
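The line and stream touches mentioned above are exposed to software through the dcbt family; portably, the same intent can be expressed with GCC/Clang's `__builtin_prefetch`, which on Power typically lowers to a dcbt. The prefetch distance of 16 elements below is an arbitrary illustrative choice, not a POWER6 tuning recommendation.

```c
#include <stddef.h>

/* Sketch of a software-touched streaming reduction. The builtin's arguments
 * are (address, rw: 0 = read, locality: 3 = keep in cache). */
double sum_with_touches(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);  /* touch ahead of the loads */
        s += a[i];
    }
    return s;
}

/* Tiny usage example: sum 32 ones. */
double demo(void) {
    double a[32];
    for (int i = 0; i < 32; i++) a[i] = 1.0;
    return sum_with_touches(a, 32);
}
```

In practice the POWER6 hardware stream detector would pick this access pattern up on its own; explicit touches matter most for irregular strides the hardware cannot predict.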

Memory Physical Organization
[Diagram: the memory controller (MC) drives 4 channels (1B/2B links per channel); each channel reaches one DIMM in each of 4 slots; each DIMM has up to 4 ranks]
- DIMMs populated in quads
- Single NOVA buffer per DIMM (or pair of DIMMs)
- DIMMs may be 1/2/4 ranks
- Actual bandwidth depends on: DIMM speed and type; # of ranks/DIMMs/channel; DIMM channel bandwidth; MC read/write bandwidth (from/to buffers); pattern of reads/writes presented to the MC

POWER5 vs. POWER6 bus architecture
[Diagram: POWER5 chips are connected by separate data buses and address buses; POWER6 chips use combined buses]

POWER6 Eight-Processor Node
- Four POWER6 chips per node, connected by on-node X/Y/Z buses
- Building block for larger systems: a 64-way system has 8 nodes
- Eight off-node SMP links fully connect 8 nodes in a 64-way
- Additional link for RAS
- Can be used to build a 128-core SMP

POWER6 64-way high-end system
[System diagram]

POWER6 32-way
[Diagram: 16 POWER6 chips (0-15) arranged in four quads (Quad 0-3), connected by XYZ buses within a quad and A-/B-buses between quads; GX I/O controllers (IOC) attach to the interconnect]



Snoop Filtering: Broadcast Localized Adaptive-Scope Transaction Protocol (BLAST)
- The most significant microarchitectural enhancement to the P6 nest
- Coherence broadcasts may be speculatively limited to a subset (node, or possibly chip) of the SMP
- A NUMA-directory-like combination of cache states and domain-state information in memory provides the ability to resolve coherence without a global broadcast
- Provides significant leverage for software that exhibits node- or chip-level processor/memory affinity
- Several system designs were significantly cost-reduced by cutting the bandwidth of the chip interconnect

Snoop Filtering: Local/System/Double Pump
- Send the request to the entire system (System Pump): longer latency within the local node, shorter latency to remote nodes, and burns address bandwidth on all nodes
- Send the request to the local node only (Local Pump): better latency if the data is found locally, and uses only local XYZ address bandwidth; if not found, the request is re-sent to the entire system (Double Pump), adding 100-200 pclks of latency (depends on system type)
- A local predictor uses thread ID, request type, and memory address
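The pump choice is an expected-latency bet, which can be sketched as back-of-the-envelope arithmetic: a local pump pays the retry penalty only on misses, so it wins when the local-hit probability is high enough. All latency numbers in the example are made-up placeholders, not POWER6 measurements.

```c
/* Expected latency of choosing a local pump first:
 * hit locally -> t_local; miss -> local attempt + retry penalty + system pump. */
double expected_local_pump(double p_local_hit, double t_local,
                           double t_system, double t_retry) {
    return p_local_hit * t_local
         + (1.0 - p_local_hit) * (t_local + t_retry + t_system);
}
```

With placeholder values (local 50 pclks, system 300, retry 150), a 90% local-hit rate gives an expected latency well under the flat system-pump cost, which is the regime the predictor is trying to identify per thread, request type, and address.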

Tracking off-node copies
- T/Te/Tn/Ten cache states: e = not dirty, n = no off-node copies
- In/Ig: an on- or off-node request took the line away; castout of the Ig state goes to the memory directory bit
- Memory directory: an extra bit per cache line in memory tells whether off-node copies exist; the MC updates it when sourcing data off-node; this allows the node pump to work (the MC can provide data); from Ig or the memory directory we always know whether off-node copies exist
- Double pump: no on-node cache has any information about the line; if the memory directory says off-node copies exist, data is returned marked not valid and a second pump is required

POWER6 Summary
- Extends POWER leadership for both technical and commercial computing
- Approximately twice the frequency of P5+, with similar instruction pipeline length and dependency latencies: 6-cycle FP-to-FP use, same as P5+
- Mostly in-order, but with OOO-like features (LTB, LLA)
- 5-instruction dispatch group from a thread, up to 7 for both threads, with one branch at any position
- Significant improvement in the cache/memory latency profile: more cache with higher associativity; lower latency to all levels of cache and to memory
- Enhanced datastream prefetching with adjustable depth, store-stream prefetching, and load-look-ahead data prefetching
- Advanced simultaneous multithreading gives an added dimension to chip-level performance
- Predictive subspace snooping for vastly increased SMP coherence throughput
- Continued technology exploitation: CMOS 65 nm Cu SOI; all buses scale with processor frequency
- Enhanced power management
- Advanced server virtualization
- Advanced RAS: robust error detection, hardware checkpoint and recovery