SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers

Similar documents
SPARC64 X: Fujitsu s New Generation 16 core Processor for UNIX Server

SPARC64 VII Fujitsu s Next Generation Quad-Core Processor

The Future of SPARC64 TM

Fujitsu High Performance CPU for the Post-K Computer

Intel released new technology call P6P

POWER7: IBM's Next Generation Server Processor

POWER7: IBM's Next Generation Server Processor

Intel Enterprise Processors Technology

XT Node Architecture

POWER9 Announcement. Martin Bušek IBM Server Solution Sales Specialist

Itanium 2 Processor Microarchitecture Overview

EC 513 Computer Architecture

Overview of the SPARC Enterprise Servers

Jim Keller. Digital Equipment Corp. Hudson MA

White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10

Power 7. Dan Christiani Kyle Wieschowski

Multiple Instruction Issue. Superscalars

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation

Gemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc.

Power Technology For a Smarter Future

A 1.5GHz Third Generation Itanium Processor

PowerPC TM 970: First in a new family of 64-bit high performance PowerPC processors

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

The Tofu Interconnect 2

CS425 Computer Systems Architecture

A 400Gbps Multi-Core Network Processor

Case Study IBM PowerPC 620

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Niagara-2: A Highly Threaded Server-on-a-Chip. Greg Grohoski Distinguished Engineer Sun Microsystems

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Open Innovation with Power8

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

Lecture 4: RISC Computers

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

Fujitsu s new supercomputer, delivering the next step in Exascale capability

Dynamic Scheduling. CSE471 Susan Eggers 1

Intel Architecture for HPC

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Agenda. What is the Itanium Architecture? Terminology What is the Itanium Architecture? Thomas Siebold Technology Consultant Alpha Systems Division

The Processor: Instruction-Level Parallelism

Sam Naffziger. Gary Hammond. Next Generation Itanium Processor Overview. Lead Circuit Architect Microprocessor Technology Lab HP Corporation

1. PowerPC 970MP Overview

Venezia: a Scalable Multicore Subsystem for Multimedia Applications

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Like scalar processor Processes individual data items Item may be single integer or floating point number. - 1 of 15 - Superscalar Architectures

Preliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

Vector Engine Processor of SX-Aurora TSUBASA

PowerPC 740 and 750

The Pentium II/III Processor Compiler on a Chip

Processor (IV) - advanced ILP. Hwansoo Han

Superscalar Processor

Announcement. ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining. Edward Suh Computer Systems Laboratory

Lecture 4: RISC Computers

Limitations of Scalar Pipelines

Goro Watanabe. Bill King. OOW 2013 The Best Platform for Big Data and Oracle Database 12c. EVP Fujitsu R&D Center North America

KeyStone II. CorePac Overview

Fujitsu HPC Roadmap Beyond Petascale Computing. Toshiyuki Shimizu Fujitsu Limited

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

Lecture 9: Multiple Issue (Superscalar and VLIW)

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Fujitsu s Approach to Application Centric Petascale Computing

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

AMD Opteron TM & PGI: Enabling the Worlds Fastest LS-DYNA Performance

Each Milliwatt Matters

A 1-GHz Configurable Processor Core MeP-h1

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA

All About the Cell Processor

Super Scalar. Kalyan Basu March 21,

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

AMD s Next Generation Microprocessor Architecture

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

Lecture 4: Instruction Set Architecture

Portland State University ECE 588/688. Cray-1 and Cray T3E

Putting it all Together: Modern Computer Architecture

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Simultaneous Multithreading (SMT)

Pipelining and Vector Processing

Multiple Choice Type Questions

Kaisen Lin and Michael Conley

Intel Architecture for Software Developers

SA-1500: A 300 MHz RISC CPU with Attached Media Processor*

Hardware-Based Speculation

Advanced Parallel Programming I

CS425 Computer Systems Architecture

THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research

EPYC VIDEO CUG 2018 MAY 2018

A Superscalar RISC Processor with 160 FPRs for Large Scale Scientific Processing

CS 152 Computer Architecture and Engineering

EECS 322 Computer Architecture Superpipline and the Cache

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Pentium IV-XEON. Computer architectures M

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Puey Wei Tan. Danny Lee. IBM zenterprise 196

Transcription:

X: Fujitsu s New Generation 16 Processor for the next generation UNIX servers August 29, 2012 Takumi Maruyama Processor Development Division Enterprise Server Business Unit Fujitsu Limited All Rights Reserved,Copyright FUJITSU LIMITED 2012

X 2 Agenda Fujitsu Processor Development History TM X Design concept SWoC (Software on Chip) Processor chip overview u-architecture Performance Summary

High Performance Technology High Reliability Technology Fujitsu Processor Development Virtual Machine Architecture Software On Chip HPC-ACE System On Chip Hardware Barrier Multi-core Multi-thread L2$ on Die Non-Blocking $ O-O-O Execution Super-Scalar Single-chip CPU Store Ahead Branch History Prefetch $ ECC Register/ALU Parity Instruction Retry $ Dynamic Degradation RC/RT/History Tr=10M CMOS Al 350nm GS8600 II Tr=30M CMOS Al 250nm / 220nm GS8800B GS8800 GP ~1999 Processor Tr=190M0 130nm Tr=46M 180nm Tr=30M 180nm / 150nm 2000~2003 Tr=190M 130nm X 3 GS8900 GP Tr=400M 90nm V GS21 600 Tr=540M 90nm V + Tr=760M 45nm Tr=600M 65nm VI GS21 900 VII Tr=500M 90nm Tr=1B 40nm VIIIfx IXfx GS21 Mainframe :Technology generation 2004~2007 2008~2011 Tr=2.95B 28nm X 2012~

X 4 X Design Concept Combine UNIX and HPC FJ processor features to realize an extremely high throughput UNIX processor. VII/VII+ (UNIX processor) feature High CPU frequency (up-to 3GHz) Multicore/Multithread Scalability : up-to 64sockets VIIIfx (HPC processor) feature HPC-ACE: Innovative ISA extensions to SPARC-V9 High Memory B/W: peak 64GB/s, Embedded Memory Controller Add new features vital to current and future UNIX servers Virtual Machine Architecture Software On Chip Embedded IOC (PCI-GEN3 controller) Direct CPU-CPU interconnect

X 5 Software on Chip 1/2 HW for SW Accelerates specific software function with HW The targets Decimal operation (IEEE754 decimal and NUMBER) Cypher operation (AES/DES) Database acceleration HW implementation The HW engines for SWoC are implemented in FPU To fully utilize 128 FP registers & software pipelining Implemented as instructions rather than dedicated co-processor to maximize flexibility of SW. Avoid complication due to CISC type instructions Various RISC type instructions are newly defined, instead. 18 insts. for Decimal, and 10 insts. for Cypher operation

Software on Chip 2/2 Decimal Instructions Supported data type IEEE754 DPD(Densely Packed Decimal) 8B fixed length NUMBER Variable length (max 21Byte) Instructions Both DPD/NUMBER instructions are defined as 8B operation (add/sub/mul/div/cmp) on FP registers Fd[rs1] To maximize performance with reasonable HW cost When the data length is > 8byte, multiple such instructions will be used. An instruction for special byte-shift on FP registers is newly added to support unaligned NUMBER 0 1...1 0 1...1 0...0 X 6 0 and 0 0 Fd[rd] 0 Fd[rs2] and 0 0

X Chip Overview DDR3 Interface MAC Architecture Features 16 cores x 2 threads SWoC (Software on Chip) Shared 24 MB L2$ Embedded Memory and IO Controller SERDES PCI GEN3 L2 Cache Data L2 Cache Control MAC L2 Cache Data SERDES Inter-CPU 28nm CMOS 23.5mm x 25.0mm 2,950M transistors 1,500 signal pins 3GHz DDR3 Interface Performance (peak) 288GIPS/382GFlops 102GB/s memory throughput X 7

X 8 X spec L1I$ L1$ control Instruction Control L1D$ Execution Unit Register File Instruction Set Architecture Branch Prediction Integer Execution Units FP Execution Units SPARC-V9/JPS HPC-ACE VM SWoC 4K BRHIS 16K PHT 156 GPR x 2 + 64 GUB ALU/SHIFT x2 ALU/AGEN x2 MULT/DIVIDE x1 128 FPR x 2 + 64 FUB FMA x4, FDIV x2 IMA/Logic x4 Decimal x1 / Cypher x2 L1$ L1I$ 64KB/4way L1D$ 64KB/4way

CPU u-architecture enhancements from TM VII+ Deeper pipeline to increase Frequency Better Branch Prediction Scheme Various Queue-size and #Floating point register increase Richer execution Units, including 2EX + 2EAG 2EX + 2EX/EAG 2FMA 4FMA to support 2way-SIMD SWoC engine (Decimal and Cypher) More aggressive O-O-O execution of load and store Multi-banked 2port L1-Cache System On Chip #core and L2$ size (4core/12MB 16core/24MB) Memory Controller, IO Controller, and CPU-CPU I/F are all embedded to increase performance and reduce cost. X 9

Fetch (4 stages) TM VII/VII+ Pipeline Issue Dispatch Reg.-Read (2 stages) (4 stages) Execute Memory (L1$: 3 stages) Commit (2 stages) L1 I$ 64KB 2Way Branch Target Address 8Kentry Decode & Issue RSA 10Entry RSE 8x2Entry RSF 8x2Entry GPR 156Registers x2 GUB 32Registers FPR 64Registers x2 EAGA EAGB EXA EXB FLA FLB Fetch Port 16Entry Store Port 16Entry Store Buffer 16Entry L1 D$ 64KB 2Way CSE 64Entry PC x2 Control Registers x2 RSBR 10Entry FUB 48Registers L2$ 6MB/12MB 12Way 4-core System Bus Interface X 10

Fetch (4 stages) TM X Pipeline Issue Dispatch Reg.-Read (4 stages) (5 stages) Execute Memory (L1$: 3 stages) Commit (2 stages) L1 I$ 64KB 4Way Branch Target Address 4Kentry Pattern History Table 16Kentry Decode & Issue RSA 24Entry RSE 24Entry RSF 20Entry RSBR 16Entry GPR GPR 156Registers 156Registers GUB 64Registers FPR FPR 128Registers 128Registers FUB 64Registers EAGA EXC EAGB EXD EXA EXB FLA Decimal Cypher FLB FLC Cypher FLD Fetch Port 32Entry Store Port 24Entry Write Buffer 10Entry L1 D$ 64KB 4Way CSE 96Entry PC PC Control Control Registers Registers L2$ 24MB 24Way 16-core Router CPU-CPU I/F Memory Controller IO Controller PCI-GEN3 DIMM X 11

Execution units enhancements (Ex.) Integer Execution Unit 2EX + 2EAG 2EX + 2EX/EAG 2 4W GPR 4 integer instructions can be executed per cycle (sustained) EXA EXB EXC EXD Update GUB Commit GPR Load Store Unit Aggressive load/store O-O-O execution: Execute load without waiting for preceding store address calculation. Multi-banked 2port L1-cache to execute 2 load or 1 load+1 store in parallel Doubled L1$ bus size Doubled L1$ associativity (2 4way) Increase L1-cache throughput and hit-rate L1$ 16B store L1 cache 2R/1W. (banked) 2R/1W 16B load x2 X 12

TM X interconnects TM VII/VII+ interconnects (SPARC Enterprise M8000) CPU CPU CPU SC SC SC MAC MAC MAC DDR2 DIMMs DDR2 DIMMs DDR2 DIMMs TM VII/VII+ interconnects 4 CPU require 8 additional LSIs to be connected with DIMM DIMM i/f: 4.35GB/s (STREAMtriad) CPU SC MAC DDR2 DIMMs TM X interconnects CPU CPU CPU 14.5GB/s CPU 102GB/s DDR3 DIMMs DDR3 DIMMs DDR3 DIMMs DDR3 DIMMs TM X interconnects No additional LSIs to be connected with DIMM DIMM i/f: 65.6GB/s (STREAMtriad) CPU i/f: 14.5GB/s x 5ports (peak) 3 ports: glueless 4way CPU interconnect 2 ports: > 4way CPU X 13

High Speed Transceivers (SerDes) CPU-CPU glue-less communication links 14.5Gb/s x 8 lanes bi-directional serial interface, 5 ports Embedded equalizer circuit enables long distance signal transmission Embedded adaptive control logic optimizes equalizer parameters automatically depending on the various system configurations PCI Express ports 8Gb/s x 8 lanes (Gen 3), 2 ports RX RX PLL TX Tx 14.5Gb/s x 8lanes SerDes Built-in SerDes provides peak 88.5GB/s x2 (up/down) total throughput X 14

Reliability, Availability, Serviceability Units Cache (Tag) Cache (Data) Register ALU Cache dynamic degradation Error detection and correction scheme ECC Duplicate & Parity ECC Parity ECC (INT/FP) Parity(Others) Parity/Residue Yes HW Instruction Retry Yes History Yes TM X RAS diagram Green: 1bit error Correctable Yellow: 1bit error Detectable Gray: 1bit error harmless New RAS features from VII/VII+ Floating-point registers are ECC protected #checkers increased to ~53,000 to identify a failure point more precisely Guarantees Data Integrity X 15

Hardware Instruction Retry Instruction Retry Fetch Execute Commit Fetch Execute Commit 1. Error 2. Flush 3. Single step execution 4. Update of SW visible resources IBF IWR CSE X PC IBF IWR CSE PC ALU EAGA/B EXA/B FLA/B RSE,RSF RSA RSBR GUB,FUB GPR,FPR Memory SW visible resources ALU EAGA/B EXA/B FLA/B RSE,RSF RSA RSBR GUB,FUB GPR,FPR Memory SW visible resources 5. Back to normal execution after the re-executed Instruction gets committed without an error. When an error is detected, Hardware re-execute the instruction automatically to remove the transient error by itself. X 16

X 17 TM X Performance @3GHz Relative to TM VII+@2.86GHz Hardware measured results 98x SWoC 15x 7x 8x TM X realizes 7x INT/FP/JVM throughput and 15x memory throughput of TM VII+ The INT/FP/JVM result is with un-tuned Compiler/JVM. SWoC of TM X results in max 98x throughput. The NUMBER score is for scalar. Expect to be much better for vector data.

TM X CPI (Cycle Per Instruction) Example Lower Performance TM VII+ v.s. TM X INT (single thread) Hardware measured results Shorter memory latency Large L2$ Improved Branch prediction 2 4way L1$ Increased throughput of L1$ 2EX+2EAG 2EX+2EX/EAG 2 4W GPR Higher Performance 4 integer execution units and write port increase of GPR (integer register) improves overall performance. Memory latency reduction, Large L2$, branch prediction, and L1$ improvement also contribute to the high performance dramatically. X 18

X 19 Summary TM X is Fujitsu s10 th SPARC processor which has been designed to be used for Fujitsu s next generation UNIX server. TM X integrates 16cores + 24MB L2 cache with over 100GB/s(peak) memory B/W. TM X keeps strong RAS features. TM X chip is up and running in the lab. It has shown 7 times throughput of TM VII+ w/o compiler tuning. SWoC is effective to accelerate specific software functions Fujitsu will continue to develop TM series.

TM X Abbreviations IB: Instruction Buffer RSA: Reservation Station for Address generation RSE: Reservation Station for Execution RSF: Reservation Station for Floating-point RSBR: Reservation Station for Branch GUB: General Update Buffer FUB: Floating point Update Buffer GPR: General Purpose Register FPR: Floating Point Register CSE: Commit Stack Entry X 20