
Niagara-2: A Highly Threaded Server-on-a-Chip
Greg Grohoski, Distinguished Engineer, Sun Microsystems
August 22, 2006

Authors
Jama Barreh, Jeff Brooks, Robert Golla, Greg Grohoski, Rick Hetherington, Paul Jordan, Mark Luttrell, Chris Olson, Manish Shah
Page 2

Agenda
Chip overview
Sparc core
> Execution units
> Power
> RAS
Summary
Page 3

Niagara-2 Chip Goals
Double throughput versus UltraSparc T1
> Maintain Sparc binary compatibility
> http://opensparc.sunsource.net/nonav/index.html
Improve throughput/watt
Improve single-thread performance
Integrate important SOC components
> Networking
> Cryptography
Page 4

Niagara-2 Chip Overview
8 Sparc cores, 8 threads each
Shared 4 MB L2, 8 banks, 16-way associative
Four dual-channel FBDIMM memory controllers
Two 10 Gb/1 Gb Ethernet ports with onboard packet classification and filtering
One x8 PCI Express 1.0 port
711 signal I/Os, 1831 total
Page 5

Glossary
CCX - Crossbar
CCU - Clock control
DMU/PEU - PCI Express
EFU - eFuse (redundancy)
ESR - Ethernet SERDES
FSR - FBDIMM SERDES
L2B - L2 write-back buffers
L2D - L2 data
L2T - L2 tags
MCU - Memory controller
MIO - Miscellaneous I/O
PSR - PCI Express SERDES
RDP/TDS/RTX/MAC - Ethernet
SII/SIO - I/O datapath in/out to memory
SPC - Sparc core
TCU - Test control unit
Page 6

Sparc Core Goals
>2x throughput of UltraSparc T1
Improve integer and floating-point performance
Extend cryptographic support
> Support relevant ciphers and hashes
> Enable free encryption
Optimum throughput/area and throughput/watt
> Doubling cores vs. increasing threads/core
> Utilization of execution units
Page 7

Sparc Core Block Diagram
[Block diagram: IFU, EXU0, EXU1, LSU, FGU, SPU, TLU, and MMU/HWTW, connected through a gasket to the crossbar/L2]
IFU - Instruction Fetch Unit
> 16 KB I$, 32 B lines, 8-way set-associative
> 64-entry fully-associative ITLB
EXU0/1 - Integer Execution Units
> 4 threads share each unit
> Each executes one integer instruction/cycle
LSU - Load/Store Unit
> 8 KB D$, 16 B lines, 4-way set-associative
> 128-entry fully-associative DTLB
FGU - Floating-point/Graphics Unit
SPU - Stream Processing Unit
> Cryptographic acceleration
TLU - Trap Logic Unit
> Updates machine state, handles exceptions and interrupts
MMU - Memory Management Unit
> Hardware tablewalk (HWTW)
> 8 KB, 64 KB, 4 MB, and 256 MB pages
Page 8
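
As a quick check on the L1 parameters above, the sketch below (added here, not part of the original slides) derives the set counts and address-bit splits implied by the stated capacities, line sizes, and associativities; it uses the GCC/Clang __builtin_ctz intrinsic.

    #include <stdio.h>

    /* Derive set count and address-bit split from a cache's capacity,
     * line size, and associativity. The parameters below come straight
     * from the slide: 16 KB / 32 B / 8-way I$ and 8 KB / 16 B / 4-way D$. */
    static void describe(const char *name, unsigned size, unsigned line, unsigned ways)
    {
        unsigned sets = size / (line * ways);
        unsigned offset_bits = __builtin_ctz(line);   /* byte-within-line bits */
        unsigned index_bits  = __builtin_ctz(sets);   /* set-index bits        */
        printf("%s: %u sets, %u offset bits, %u index bits\n",
               name, sets, offset_bits, index_bits);
    }

    int main(void)
    {
        describe("I$ (16 KB, 32 B lines, 8-way)", 16 * 1024, 32, 8);  /* 64 sets  */
        describe("D$ ( 8 KB, 16 B lines, 4-way)",  8 * 1024, 16, 4);  /* 128 sets */
        return 0;
    }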

Core Pipeline
8 stages for integer operations:
> Fetch, Cache, Pick, Decode, Execute, Memory, Bypass, Writeback
> 3-cycle load-use latency
> Memory stage: translation, tag/data access
> Bypass stage: late select, formatting
12 stages for floating-point operations:
> Fetch, Cache, Pick, Decode, Execute, FX1, FX2, FX3, FX4, FX5, FB, FW
> 6-cycle latency for dependent FP ops
> Longer pipeline for divide/sqrt
Page 9
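
One way to read the 3-cycle load-use number: a load generates its address in Execute, performs translation and tag/data access in Memory, and finishes late select and formatting in Bypass, so a dependent instruction can enter Execute no earlier than three cycles behind the load. A small illustrative sketch of that counting (added here, not from the slides):

    #include <stdio.h>

    /* Integer pipeline stages as listed on the slide. */
    enum { F, C, P, D, E, M, B, W, NSTAGES };
    static const char *stage_name[NSTAGES] =
        { "Fetch", "Cache", "Pick", "Decode", "Execute", "Memory", "Bypass", "Writeback" };

    int main(void)
    {
        /* A load enters Execute (address generation) at cycle 0; its data comes
         * back through Memory and Bypass (late select + formatting), so it can
         * be forwarded to a consumer's Execute stage at cycle 3. That spacing
         * is the 3-cycle load-use latency quoted on the slide. */
        int load_execute_cycle = 0;
        int data_ready_cycle   = load_execute_cycle + (B - E) + 1;   /* after Bypass */

        for (int s = E; s <= B; s++)
            printf("cycle %d: load in %s\n", load_execute_cycle + (s - E), stage_name[s]);
        printf("earliest dependent Execute: cycle %d (load-use = %d cycles)\n",
               data_ready_cycle, data_ready_cycle - load_execute_cycle);
        return 0;
    }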

Integer/LSU Pipeline
[Pipeline diagram: Fetch and Cache stages in the IFU feed per-thread instruction buffers (IB0-3, IB4-7); each thread group (TG0, TG1) has its own Pick, Decode, and Execute stages; Memory, Bypass, and Writeback stages complete the pipeline through the LSU]
Page 10

IFU Block Diagram
[Block diagram: fetch address generation, the 16 KB 8-way I$, the ITLB, and cache-miss logic (to the L2) feed two sets of instruction buffers (4 threads x 8 entries each); Pick 0/Decode 0 serve thread group 0 (threads 0-3) and Pick 1/Decode 1 serve thread group 1 (threads 4-7), issuing to EXU0 and EXU1]
Instruction cache and fetch/pick/decode logic for 8 threads
Fetches up to 4 instructions from the I$ per cycle
> Threads are either in a ready or a wait state
> Wait states: TLB miss, cache miss, instruction buffer full
> Least-recently-fetched thread selected among ready threads
> One instruction buffer per thread
No branch prediction
> Predict not-taken, 5-cycle penalty
Limited I$ miss prefetching
Page 11
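
A minimal sketch of the least-recently-fetched-among-ready policy described above; the data structure and names are illustrative, not taken from the Niagara-2 design (the pick stage on the next slide applies the same rotation idea within each thread group):

    #include <stdio.h>
    #include <stdbool.h>

    #define NTHREADS 8

    /* Least-recently-fetched selection among ready threads: keep threads in a
     * queue ordered by how long ago they last fetched, pick the first ready
     * one, and move it to the back. Illustrative only -- not the actual RTL. */
    static int lru_order[NTHREADS] = {0, 1, 2, 3, 4, 5, 6, 7};   /* front = least recent */

    static int pick_fetch_thread(const bool ready[NTHREADS])
    {
        for (int i = 0; i < NTHREADS; i++) {
            int t = lru_order[i];
            if (!ready[t])
                continue;                   /* waiting: ITLB miss, I$ miss, buffer full */
            for (int j = i; j < NTHREADS - 1; j++)   /* move to most-recent position */
                lru_order[j] = lru_order[j + 1];
            lru_order[NTHREADS - 1] = t;
            return t;
        }
        return -1;                          /* no thread ready this cycle */
    }

    int main(void)
    {
        bool ready[NTHREADS] = {true, false, true, true, true, true, true, true};
        for (int cycle = 0; cycle < 4; cycle++)
            printf("cycle %d: fetch thread %d\n", cycle, pick_fetch_thread(ready));
        return 0;
    }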

IFU Block Diagram (continued)
[Same block diagram as the previous slide]
Fetch is decoupled from Pick
Threads are divided into 2 groups of 4 threads each
One instruction from each thread group is picked each cycle
> Least-recently-picked thread within a thread group, among ready threads
> Wait states: dependency, D$ miss, DTLB miss, divide/sqrt, ...
> Gives priority to non-speculative threads (e.g. threads not speculating on a load hit)
Decode resolves conflicts
> Each thread group picks independently of the other
> Both thread groups may pick a load/store or FGU instruction in the same cycle
Page 12
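
Because the two thread groups pick independently but share one LSU and one FGU, decode must detect cycles in which both groups picked an instruction for the same shared unit. The sketch below only shows the detection; how the hardware chooses which group to hold is not stated on the slide and is left as an assumption:

    #include <stdio.h>

    /* Instruction classes relevant to the shared LSU and FGU. */
    typedef enum { CLASS_EXU, CLASS_LSU, CLASS_FGU, CLASS_NONE } iclass_t;

    /* Decode-stage conflict check: each thread group picks independently, so
     * both may pick a load/store (or FGU op) in the same cycle even though
     * there is only one LSU and one FGU. Which group is stalled, and how, is
     * an assumption left out of this sketch. */
    static int conflict(iclass_t tg0, iclass_t tg1)
    {
        return (tg0 == tg1) && (tg0 == CLASS_LSU || tg0 == CLASS_FGU);
    }

    int main(void)
    {
        printf("LSU vs LSU: %s\n", conflict(CLASS_LSU, CLASS_LSU) ? "hold one group" : "issue both");
        printf("LSU vs EXU: %s\n", conflict(CLASS_LSU, CLASS_EXU) ? "hold one group" : "issue both");
        return 0;
    }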

Threaded Execution
[Pipeline diagram: the Integer/LSU pipeline from the earlier slide, annotated with thread numbers to show different threads occupying different stages in the same cycle, e.g. thread 2 in Fetch, thread 6 in Cache, threads 0 and 5 in the two Pick stages, and so on]
Page 13

EXU
[Block diagram: each EXU contains the integer register file (IRF), register window management logic (RML), bypass logic (BYP), shifter (SHFT), and ALU, exchanging operands and results with the LSU and FGU]
Executes integer operations
> Also some graphics operations
Each EXU contains state for 4 threads
> Integer register file (IRF) holds 8 register windows per thread
> Window management logic (RML)
> Adder, shifter
Generates addresses for load/store operations
Page 14
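
For reference, the 8-windows-per-thread figure implies the following register-count arithmetic for a standard SPARC windowed register file (globals excluded, since the slide does not cover them); a worked example added here, not from the slides:

    #include <stdio.h>

    /* SPARC register-window arithmetic: each window exposes 8 ins, 8 locals,
     * and 8 outs, but the outs of window n are the ins of window n+1 (the
     * window file is circular), so each window contributes 16 unique
     * registers. With 8 windows per thread, that is 128 windowed registers
     * per thread in the IRF, plus the global registers (not covered here). */
    int main(void)
    {
        const int windows = 8;
        const int unique_per_window = 8 /* locals */ + 8 /* outs */;
        printf("windowed registers per thread: %d\n", windows * unique_per_window);
        return 0;
    }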

LSU
[Block diagram: the DTLB translates VA to PA while the D$ tags (DTAG) and data array (DCA) are accessed; the store buffer (STB) is compared against load addresses for RAW bypass and supplies store data for D$ update; load hits bypass data back to the IRF; load misses enter the Load Miss Queue (LMQ) and, with stores, go through the gasket to the crossbar/L2 (PCX), which returns fill data]
One load or store per cycle
> D$ is store-through (write-through)
> D$ fills proceed in parallel with stores
Load Miss Queue (LMQ)
> One pending load miss per thread
> D$ allocates on load misses, updates on store hits
Store buffer (STB) holds 8 stores per thread
> Stores to the same L2 cache line are pipelined to the L2
Arbitrates between load misses and stores for the crossbar
> Fairness algorithm
Page 15
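
A minimal per-thread sketch of the two queueing rules above: one outstanding load miss per thread in the LMQ and eight buffered stores per thread in the STB. Structure and field names are illustrative only:

    #include <stdio.h>
    #include <stdbool.h>

    #define STB_ENTRIES 8        /* 8 stores per thread, per the slide */

    /* Minimal per-thread LSU bookkeeping: one outstanding load miss per
     * thread (LMQ) and an 8-entry store buffer (STB). Illustrative only --
     * the real structures also track addresses, data, and crossbar arbitration. */
    struct thread_lsu {
        bool lmq_valid;          /* a load miss is already outstanding */
        int  stb_count;          /* stores waiting to drain to the L2  */
    };

    static bool issue_load_miss(struct thread_lsu *t)
    {
        if (t->lmq_valid)
            return false;        /* thread waits for the outstanding miss */
        t->lmq_valid = true;
        return true;
    }

    static bool issue_store(struct thread_lsu *t)
    {
        if (t->stb_count == STB_ENTRIES)
            return false;        /* store buffer full: thread waits */
        t->stb_count++;
        return true;
    }

    int main(void)
    {
        struct thread_lsu t = { false, 0 };
        printf("first load miss accepted:  %d\n", issue_load_miss(&t));  /* 1 */
        printf("second load miss accepted: %d\n", issue_load_miss(&t));  /* 0 */
        for (int i = 0; i < STB_ENTRIES + 1; i++)
            printf("store %d accepted: %d\n", i, issue_store(&t));       /* last is 0 */
        return 0;
    }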

MMU
Hardware tablewalk of up to 4 page tables
Each page table supports one page size
Three search modes:
> Sequential - search the page tables in order
> Burst - search the page tables in parallel
> Prediction - predict which page table to search based upon the VA
> Two-bit predictor orders the first two page table searches
Up to 8 pending misses
> One ITLB or DTLB miss per thread
Page 16
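
A hedged sketch of the prediction mode: a two-bit saturating counter, indexed here by a few VA bits, chooses which of the first two page tables to walk first. The table size, index hash, and update policy are assumptions for illustration; the slide only states that a two-bit predictor orders the first two searches:

    #include <stdio.h>
    #include <stdint.h>

    /* Two-bit saturating predictor choosing which of the first two page
     * tables to walk first, indexed by a slice of the VA. Sizes and update
     * policy are illustrative assumptions. */
    #define PRED_ENTRIES 64
    static uint8_t pred[PRED_ENTRIES];          /* 0-1 => table A first, 2-3 => table B first */

    static unsigned pred_index(uint64_t va) { return (va >> 13) & (PRED_ENTRIES - 1); }

    static int predict_first_table(uint64_t va)
    {
        return pred[pred_index(va)] >= 2;       /* 0 = table A, 1 = table B */
    }

    static void update_predictor(uint64_t va, int table_that_hit)
    {
        uint8_t *c = &pred[pred_index(va)];
        if (table_that_hit && *c < 3) (*c)++;   /* saturate toward table B */
        if (!table_that_hit && *c > 0) (*c)--;  /* saturate toward table A */
    }

    int main(void)
    {
        uint64_t va = 0x7f3200004000ull;
        printf("first guess: table %c\n", predict_first_table(va) ? 'B' : 'A');
        update_predictor(va, 1);                /* the walk actually hit in table B */
        update_predictor(va, 1);
        printf("after training: table %c\n", predict_first_table(va) ? 'B' : 'A');
        return 0;
    }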

FGU
[Block diagram: the FGU register file (8 threads x 32 x 64 b, 2R/2W) supplies rs1/rs2 operands to add, multiply, divide/sqrt, and VIS 2.0 datapaths across pipeline stages FX1-FX5 and FB; the unit also accepts integer/crypto sources and load/store data and produces integer/crypto results]
Fully pipelined (except divide/sqrt)
> Divide/sqrt proceed in parallel with add or multiply operations of other threads
FGU also performs integer multiply, divide, and population count
Multiplier enhancements for modular arithmetic operations
> Built-in accumulator
> XOR (carry-less) multiply
Page 17
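
The XOR multiply called out above is a carry-less multiplication, i.e. polynomial multiplication over GF(2), the core primitive for binary-field elliptic-curve arithmetic. The sketch below shows the operation in software as an illustration of the math, not of the hardware datapath:

    #include <stdio.h>
    #include <stdint.h>

    /* Carry-less ("XOR") multiply: partial products are combined with XOR
     * instead of addition, i.e. polynomial multiplication over GF(2). This
     * is the operation the enhanced FGU multiplier provides for binary-field
     * elliptic-curve arithmetic; software illustration only. */
    static void clmul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
    {
        uint64_t h = 0, l = 0;
        for (int i = 0; i < 64; i++) {
            if ((b >> i) & 1) {
                l ^= a << i;
                if (i) h ^= a >> (64 - i);
            }
        }
        *hi = h;
        *lo = l;
    }

    int main(void)
    {
        uint64_t hi, lo;
        clmul64(0x3, 0x3, &hi, &lo);            /* (x+1)*(x+1) = x^2 + 1 = 0x5 over GF(2) */
        printf("0x3 clmul 0x3 = 0x%016llx%016llx\n",
               (unsigned long long)hi, (unsigned long long)lo);
        return 0;
    }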

SPU
[Block diagram: a 160 x 64 b, 2R/1W modular-arithmetic (MA) scratchpad supplies rs1/rs2 sources to the FGU and receives MA results from the FGU; MA execution control, a DMA engine, a hash engine, and cipher engines exchange store data, addresses, and data with the L2]
Cryptographic coprocessor
> Runs in parallel with the core at the same frequency
Two independent sub-units:
> Modular Arithmetic
  > RSA, binary- and integer-polynomial elliptic curve (ECC)
  > Shares the FGU multiplier
> Ciphers / Hashes
  > RC4, DES/3DES, AES-128/192/256
  > MD5, SHA-1, SHA-256
> Designed to achieve wire speed on both 10 Gb Ethernet ports
DMA engine shares the crossbar port with the core
Page 18
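
The modular-arithmetic sub-unit's workhorse operation for RSA is modular exponentiation. A toy software illustration of that math follows (64-bit operands and textbook toy keys; the hardware operates on multi-kilobit values held in the MA scratchpad):

    #include <stdio.h>
    #include <stdint.h>

    /* Modular exponentiation by square-and-multiply: the basic operation the
     * SPU's modular-arithmetic unit accelerates for RSA. Illustration of the
     * math only -- not the hardware algorithm. */
    static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
    {
        return (uint64_t)(((unsigned __int128)a * b) % m);   /* GCC/Clang 128-bit type */
    }

    static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m)
    {
        uint64_t result = 1 % m;
        base %= m;
        while (exp) {
            if (exp & 1)
                result = mulmod(result, base, m);
            base = mulmod(base, base, m);
            exp >>= 1;
        }
        return result;
    }

    int main(void)
    {
        /* Tiny RSA-style round trip with textbook toy keys: n = 3233, e = 17, d = 2753. */
        uint64_t c = powmod(42, 17, 3233);
        printf("cipher = %llu, plain = %llu\n",
               (unsigned long long)c, (unsigned long long)powmod(c, 2753, 3233));
        return 0;
    }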

Core Power Management
Minimal speculation
> Next-sequential I$ line prefetch
> Predict branches not-taken
> Predict loads hit in the D$
> Hardware tablewalk search control
Extensive clock gating
> Datapath
> Control blocks
> Arrays
External power throttling
> Adds wait states at the decode stage
Page 19
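
A duty-cycle sketch of what "adding wait states at decode" can look like; the particular level-to-ratio encoding is an assumption, since the slide gives no details:

    #include <stdio.h>

    /* Duty-cycle sketch of decode-stage power throttling: a throttle level
     * chooses how many wait (stall) cycles are inserted per decoded cycle.
     * The encoding here is an assumption; the slide only says external
     * throttling adds wait states at decode. */
    static int decode_allowed(int cycle, int waits_per_decode)
    {
        return (cycle % (waits_per_decode + 1)) == 0;   /* 1 decode, then N waits */
    }

    int main(void)
    {
        int throttle = 3;                               /* 1 decode every 4 cycles */
        for (int cycle = 0; cycle < 8; cycle++)
            printf("cycle %d: %s\n", cycle,
                   decode_allowed(cycle, throttle) ? "decode" : "wait state");
        return 0;
    }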

Core Reliability and Serviceability
Extensive RAS features
> Parity protection on I$ and D$ tags and data, ITLB, DTLB CAM and data, modular-arithmetic memory, store buffer address
> ECC on the integer RF, floating-point RF, store buffer data, trap stack, and other internal arrays
Combination of hardware and software correction flows
> Hardware re-fetch for I$, D$ errors
> Software recovery for other errors
> A thread, group of threads, or physical core can be taken offline if its error rate is too high
Page 20
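
Parity protection stores one extra bit per protected word so that any single-bit flip is detected on read; a minimal illustration (word width and parity polarity are not specified on the slide):

    #include <stdio.h>
    #include <stdint.h>

    /* Even parity over a protected word: store one extra bit so that any
     * single-bit flip is detected on read. Width and polarity are
     * illustrative; the slide does not specify them. */
    static unsigned parity_bit(uint64_t word)
    {
        return (unsigned)__builtin_parityll(word);       /* 1 if an odd number of 1 bits */
    }

    int main(void)
    {
        uint64_t tag = 0x1ACEDBEEFULL;
        unsigned stored = parity_bit(tag);

        uint64_t corrupted = tag ^ (1ULL << 17);          /* single-bit upset */
        printf("clean word ok:     %d\n", parity_bit(tag) == stored);        /* 1 */
        printf("corrupted word ok: %d\n", parity_bit(corrupted) == stored);  /* 0 -> re-fetch or trap */
        return 0;
    }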

Summary
Niagara-2 combines all major server functions on one chip
>2x throughput and throughput/watt vs. UltraSparc T1
Greatly improved floating-point performance
Significantly improved integer performance
Embedded wire-speed cryptographic acceleration
Enables a new generation of power-efficient, fully-secure datacenters
Page 21

Thank you... gregory.grohoski@sun.com
Page 22