Niagara2: A Highly Threaded Server-on-a-Chip. Robert Golla Principal Architect Sun Microsystems


1 Niagara2: A Highly Threaded Server-on-a-Chip
Robert Golla, Principal Architect, Sun Microsystems
October 10, 2006

2 Contributors
Jama Barreh, Jeff Brooks, William Bryg, Bruce Chang, Robert Golla, Greg Grohoski, Rick Hetherington, Paul Jordan, Mark Luttrell, Mark McPherson, Shimon Muller, Chris Olson, Bikram Saha, Manish Shah, Michael Wong

3 Agenda
- Chip Overview
- Throughput Computing
- Sparc core
- Crossbar
- L2 cache
- Networking
- PCI-Express
- Power
- Status
- Summary


4 Niagara2 Chip Overview
- 8 Sparc cores, 8 threads each
- Shared 4 MB L2, 8 banks, 16-way associative
- Four dual-channel FBDIMM memory controllers
- Two 10/1 Gb Enet ports
- One PCI-Express x8 1.0A port
- 342 mm^2 die size in 65 nm
- 711 signal I/O, 1831 total
[Figure: die floorplan showing SPARC cores 0-7, L2 data banks 0-7, L2 tag arrays 0-7, the crossbar in the center, and the memory, network, and I/O blocks around the edges.]

5 Niagara2 Chip Overview
- Full 8x9 crossbar switch
  - Connects every core to every L2 bank and vice-versa
  - Supports 8 byte writes from a core to a bank
  - Supports 16 byte reads from a bank to core
  - One port for core to read/write I/O
- System interface unit connects networking and I/O to memory
[Figure: block diagram. Eight Sparc cores connect through the 8x9 cache crossbar to L2 banks 0-7 and to I/O. Memory controllers 0-3 each drive FBDIMM channels. The NIU (Ethernet, 2x10/1 GE) and PCI-EX attach through the System Interface Unit.]

6 Throughput Computing
- For a single thread
  - Memory is THE bottleneck to improving performance
  - Commercial server workloads exhibit poor memory locality
  - Only a modest throughput speedup is possible by reducing compute time
  - Conventional single-thread processors optimized for ILP have low utilizations
- With many threads
  - It's possible to find something to execute every cycle
  - Significant throughput speedups are possible
  - Processor utilization is much higher
[Figure: timelines of compute time and memory latency against time, for a single thread and for multiple threads.]
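The utilization argument on this slide can be sketched with a very simple model. This is an illustration only, assuming each thread alternates a fixed burst of compute with a fixed memory stall on a single-issue pipeline; the function names and the 10/90 cycle counts are hypothetical and do not come from the presentation.

```python
def pipeline_utilization(threads, compute_cycles, memory_cycles):
    """Fraction of cycles in which a single-issue pipeline has work to do.

    Each thread is modeled as repeating `compute_cycles` of useful work
    followed by `memory_cycles` of stall; other threads fill the stalls.
    """
    per_thread = compute_cycles / (compute_cycles + memory_cycles)
    return min(1.0, threads * per_thread)


def single_thread_speedup(compute_cycles, memory_cycles, compute_scale):
    """Throughput gain for one thread when only compute time is scaled."""
    before = compute_cycles + memory_cycles
    after = compute_cycles * compute_scale + memory_cycles
    return before / after


# Hypothetical workload: 10 compute cycles per 90-cycle memory stall.
for n in (1, 4, 8, 16):
    print(n, "threads ->", pipeline_utilization(n, 10, 90))

# Halving compute time for a single thread barely helps in this model.
print("speedup from 2x faster compute:", single_thread_speedup(10, 90, 0.5))
```

In this model, halving compute time for one thread gains only about 5 percent, while adding threads raises utilization roughly linearly until the pipeline saturates.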

7 Engineering Solutions
- Design Problem
  - Double UltraSparc T1's throughput and throughput/watt
  - Improve UltraSparc T1's FP single-thread and throughput performance
  - Minimize required area for these improvements
- Considered doubling number of UltraSparc T1 cores
  - 16 cores of 4 threads each
  - Takes too much die area
  - No area left for improving FP performance

8 Engineering Solutions
- Probabilistic Modelling
  - Generate synthetic traces for each thread with an instruction/miss profile that matches TPC-C
  - Schedule ready threads to run on some number of execution units
  - End simulation once simulated distributions are close to actual distributions
- Works very well for simple scalar cores running lots of threads on transactional workloads
  - Within 10 percent of a detailed cycle accurate simulator
  - Detailed cycle accurate simulator not available at beginning of the project
[Figure: "Multithread_performance_vs._threads (12/2002)", plotting total IPC and relative throughput performance, with UltraSparc T1 and Niagara2 marked.]
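A minimal sketch of this style of probabilistic model, under stated assumptions: every instruction misses with a fixed probability and then stalls its thread for a fixed latency, and the run ends after a fixed number of cycles rather than on the convergence test the slide describes. The function name, parameters, and numbers are hypothetical; the team's actual model is not shown in the slides.

```python
import random


def simulate_ipc(threads, exec_units, miss_prob, miss_latency,
                 cycles=20000, seed=1):
    """Estimate instructions per cycle for a simple multithreaded core."""
    rng = random.Random(seed)          # seeded so runs are repeatable
    ready_at = [0] * threads           # first cycle each thread may issue again
    last_issued = [-1] * threads       # used to favor long-waiting threads
    issued = 0
    for cycle in range(cycles):
        ready = [t for t in range(threads) if ready_at[t] <= cycle]
        ready.sort(key=lambda t: last_issued[t])
        for t in ready[:exec_units]:   # at most one thread per execution unit
            issued += 1
            last_issued[t] = cycle
            if rng.random() < miss_prob:
                ready_at[t] = cycle + 1 + miss_latency  # synthetic miss
            else:
                ready_at[t] = cycle + 1
    return issued / cycles


# Hypothetical profile: 5% of instructions miss, 100-cycle miss latency.
for n in (1, 4, 8):
    print(n, "threads, 1 unit :", simulate_ipc(n, 1, 0.05, 100))
print(8, "threads, 2 units:", simulate_ipc(8, 2, 0.05, 100))
```

Sweeping thread count and execution units in a model like this is one way to compare design points before a cycle-accurate simulator exists.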

9 Engineering Solutions
- Decided to increase the number of threads per core and increase execution bandwidth
  - 8 threads per core x 8 cores = 64 threads total
  - 2 EXUs per core
  - More than doubles UltraSparc T1's throughput
  - Doubling threads is more area efficient than doubling cores
- Integrate FGU into core pipeline
  - 6 cycle FP latency
  - Threads running FP are non-blocking
- Enhance Niagara2's cryptography
  - Added more ciphers
  - Enhanced existing public key support

10 Throughput Changes
Niagara2 throughput changes vs. UltraSparc T1
- Add instruction buffers after L1 instruction cache for each thread
- Add new pipe stage: pick
  - Choose 2 threads out of 8 to execute each cycle
- Increase execution units from 1 to 2
- Increase set associativity of L1 instruction cache to 8
- Increase size of fully associative DTLB from 64 to 128 entries
- Increase L2 banks from 4 to 8
  - 15 percent performance loss with only 4 banks and 64 threads
- Increase threads from 4 to 8

11 Sparc Core Block Diagram
- IFU: Instruction Fetch Unit
  - 16 KB I$, 32B lines, 8-way SA
  - 64-entry fully-associative ITLB
- EXU0/1: Integer Execution Units
  - 4 threads share each unit
  - 8 register windows/thread
  - 160 IRF entries/thread
- LSU: Load/Store Unit
  - 8 threads share LSU
  - 8KB D$, 16B lines, 4-way SA
  - 128-entry fully-associative DTLB
- FGU: Floating-Point/Graphics Unit
  - 8 threads share FGU
  - 32 FRF entries/thread
- SPU: Stream Processing Unit
  - Cryptographic coprocessor
- TLU: Trap Logic Unit
  - Updates machine state, handles exceptions and interrupts
- MMU: Memory Management Unit
  - Hardware tablewalk (HWTW)
  - 8KB, 64KB, 4MB, 256MB pages
[Figure: core floorplan with blocks labeled SPU, TLU, EXU0, FGU, IFU, EXU1, LSU, MMU/HWTW, Gasket, and Crossbar/L2.]

12 Core Pipeline
- 8 stage integer pipeline: Fetch, Cache, Pick, Decode, Execute, Mem, Bypass, W
  - 3-cycle load-use penalty
  - Memory (data translation, access tag/data array)
  - Bypass (late way select, data formatting, data forwarding)
- 12 stage floating-point pipeline: Fetch, Cache, Pick, Decode, Execute, Fx1, Fx2, Fx3, Fx4, Fx5, FB, FW
  - 6-cycle latency for dependent FP ops
  - Longer pipeline for divide/sqrt

13 Integer/LSU Pipeline
- Instruction cache is shared by all 8 threads
  - Least-recently-fetched algorithm used to select next thread to fetch
- Each thread is written into thread-specific instruction buffer
  - Decouples fetch from pick
- Each thread statically assigned to one of 2 thread groups
- Pick chooses 1 ready thread each cycle within each thread group
  - Picking within each thread group is independent of the other
  - Least-recently-picked algorithm used to select next thread to execute
- Decode resolves resource hazards not handled during pick
[Figure: pipeline diagram. The IFU fetch stages feed instruction buffers IB0-3 (thread group 0) and IB4-7 (thread group 1); each group has its own pick, decode, execute, and later stages, and the LSU is shared.]
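The least-recently-picked selection described on this slide can be sketched as follows. This is an illustrative software model, not the hardware implementation: the class name, the list-based ordering, and the assignment of threads 0-3 and 4-7 to the two groups (suggested by the IB0-3 and IB4-7 labels) are assumptions.

```python
class ThreadGroupPicker:
    """Picks one ready thread per cycle, favoring the least recently picked."""

    def __init__(self, thread_ids):
        # Front of the list is the least recently picked thread.
        self.order = list(thread_ids)

    def pick(self, ready):
        """Return the chosen thread id, or None if no thread is ready."""
        for tid in self.order:
            if tid in ready:
                # Move the chosen thread to the back: it is now most recent.
                self.order.remove(tid)
                self.order.append(tid)
                return tid
        return None


# Two groups pick independently, giving up to 2 threads per cycle.
group0 = ThreadGroupPicker(range(0, 4))
group1 = ThreadGroupPicker(range(4, 8))

ready_threads = {1, 2, 5, 7}
for cycle in range(3):
    print(cycle, group0.pick(ready_threads), group1.pick(ready_threads))
```

Keeping one picker per group mirrors the slide's point that picking in one thread group is independent of the other.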

14 Integer/LSU Pipeline
- Threads are interleaved between pipeline stages with very few restrictions
  - Any thread can be at fetch or cache stage
  - Threads are split into 2 thread groups before pick stage
- Load/store and floating-point units are shared between all 8 threads
  - Up to 1 thread from either thread group can be scheduled on a shared unit
[Figure: the Integer/LSU pipeline diagram annotated with an example thread number at each stage.]

15 Stream Processing Unit
- Cryptographic coprocessor
  - One per core
  - Runs in parallel w/core at same frequency
- Two independent sub-units
  - Modular Arithmetic Unit
    - RSA, binary and integer polynomial elliptic curve (ECC)
    - Shares FGU multiplier
  - Cipher/hash Unit
    - RC4, DES/3DES, AES-128/192/256
    - MD5, SHA-1, SHA-256
- Designed to achieve wire-speed on both 10Gb Ethernet ports
  - Facilitates wirespeed encryption and decryption
- DMA engine shares core's crossbar port
[Figure: SPU block diagram. MA sources go to the FGU and the multiply result returns from the FGU; an MA scratchpad (160x64b, 2R/1W) feeds MA execution through rs1 and rs2; the hash engine, cipher engines, and DMA engine exchange store data, addresses, and data with the L2.]

16 Crossbar
- Connects 8 cores to 8 L2 banks and I/O
- Non-blocking, pipelined switch
  - 8 load/store requests and 8 data returns can be done at the same time
- Divided into 2 parts
  - PCX: processor to cache
  - CPX: cache to processor
- Arbitration for a target is required
  - Priority given to oldest requestor to maintain fairness and order
  - Three cycle arbitration protocol: request, arbitrate, and then grant
[Figure: Sparc cores 0-7 connected through per-bank multiplexers to L2 banks 0-7. PCX carries ~90 GB/s of writes; CPX carries ~180 GB/s of reads.]
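The two bandwidth figures on this slide are consistent with the 8-byte write and 16-byte read widths given on slide 5. A small worked check, assuming all 8 bank ports transfer once per cycle (an assumption; the slides do not state the clock frequency, so the sketch derives the clock the figures imply):

```python
BANKS = 8          # crossbar ports to L2 banks (from the slides)
WRITE_BYTES = 8    # bytes per core-to-bank transfer (from the slides)
READ_BYTES = 16    # bytes per bank-to-core transfer (from the slides)


def implied_clock_ghz(bandwidth_gb_per_s, ports, bytes_per_transfer):
    """Clock rate at which `ports` transfers per cycle give the bandwidth."""
    return bandwidth_gb_per_s / (ports * bytes_per_transfer)


print("write path:", implied_clock_ghz(90, BANKS, WRITE_BYTES), "GHz")
print("read path: ", implied_clock_ghz(180, BANKS, READ_BYTES), "GHz")
```

Both directions imply the same clock of roughly 1.4 GHz, which is why the read figure is exactly twice the write figure.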

17 L2 Cache
- 4 MB L2 cache
  - 16 way set associative
  - 8 banks
  - 64 byte line size
- L2 cache is write-back, write-allocate
  - L1 data cache is write-through
  - Support for partial stores
- Coherency is managed by the L2 cache
  - Directories maintained for all 16 L1 caches
- Data transfers between the L2 and a core are done in 16 byte packets
[Figure: L2 bank block diagram. PCX requests enter an input queue and arbiter, then look up the tag array and valid array. Hits read the data array; misses go to the miss buffer and issue a 64B memory read, with the line returned through the fill buffer. Evictions leave through the write-back buffer as 64B memory writes. I/O requests use a 64B I/O write buffer. Results and directory invalidation packets return over CPX through the output queue in 16B packets.]
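The cache parameters on this slide fix the number of sets in each bank. The sketch below works that out and shows one possible way to split an address; which address bits select the bank and set is an assumption for illustration, since the slides do not give the mapping.

```python
CACHE_BYTES = 4 * 1024 * 1024   # 4 MB L2 (from the slides)
BANKS = 8
WAYS = 16
LINE_BYTES = 64

# 4 MB / (8 banks * 16 ways * 64 B) = 512 sets per bank.
SETS_PER_BANK = CACHE_BYTES // (BANKS * WAYS * LINE_BYTES)


def split_address(addr):
    """Split a physical address into (tag, set_index, bank, offset).

    Assumption: the bits just above the line offset pick the bank, so
    consecutive cache lines are spread across consecutive banks.
    """
    offset = addr % LINE_BYTES
    line = addr // LINE_BYTES
    bank = line % BANKS
    set_index = (line // BANKS) % SETS_PER_BANK
    tag = line // (BANKS * SETS_PER_BANK)
    return tag, set_index, bank, offset


print("sets per bank:", SETS_PER_BANK)
print(split_address(0x40))    # next line after address 0 lands in bank 1
```

Interleaving consecutive lines across banks is a common choice because it spreads sequential accesses over all crossbar ports.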

18 Integrated Networking
- Integrate networking for better overall performance
  - All network data is sourced from and destined to main memory
  - Integration minimizes impact of memory
- Get networking closer to memory to reduce latency
  - Able to take full advantage of higher memory bandwidth
  - Eliminates inherent inefficiencies of I/O protocol translation
[Figure: the crossbar and L2 connect to the FBDIMMs at 42 GB/s read and 21 GB/s write. The NIU (10 GE Ethernet) and PCI-Ex (2.5 GHz) attach through the System Interface Unit. Pipelined memory accesses tolerate relaxed ordering.]

19 Networking Features
- Line Rate Packet Classification (~30M pkt/s)
  - Based on Layer 1/2/3/4 of the protocol stack
- Multiple DMA Engines
  - Matches DMAs to threads
  - Binding flexibility between DMAs and ports
  - 16 transmit + 16 receive DMA channels
- Virtualization Support
  - Supports up to 8 partitions
  - Interrupts may be bound to different hardware threads
- Dual Ethernet ports
  - 2 dual-speed MACs (10G/1G) with integrated serdes
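A line-rate classification figure of roughly 30 million packets per second matches what two 10 Gb/s Ethernet ports deliver with minimum-size frames. The check below uses standard Ethernet framing (64-byte minimum frame, 8-byte preamble, 12-byte inter-frame gap); the function name is hypothetical, and reading the slide's figure as the two-port minimum-frame rate is an interpretation rather than something the slides state.

```python
MIN_FRAME_BYTES = 64             # minimum Ethernet frame
PREAMBLE_AND_GAP_BYTES = 8 + 12  # preamble/SFD plus inter-frame gap


def line_rate_pps(link_bits_per_s, frame_bytes=MIN_FRAME_BYTES):
    """Packets per second on one link when frames are sent back to back."""
    wire_bits = (frame_bytes + PREAMBLE_AND_GAP_BYTES) * 8
    return link_bits_per_s / wire_bits


per_port = line_rate_pps(10e9)
print("one 10 Gb/s port :", round(per_port / 1e6, 2), "Mpkt/s")
print("two 10 Gb/s ports:", round(2 * per_port / 1e6, 2), "Mpkt/s")
```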

20 PCI-Express
- PCI-Express operates at 2.5 Gb/s per lane per direction
  - Point-to-point, dual-simplex chip interconnect
- Transfers are in packets with headers and max data payloads from 128B to 512B
- IOMMU supports I/O virtualization and process device isolation by using PCIE's BDF#
- MSI Support
  - Event queue accumulates MSIs
  - Allows many MSIs to be serviced upon an interrupt
- Total I/O bandwidth is 3-4 GB/s with max payload sizes of 128B to 512B
[Figure: PCI-Express unit block diagram. A 350 MHz domain holds DMA/PIO, cache lines, TLP packets, data management, IOMMU, and interrupt logic. A 250 MHz PCI Express core contains the transaction layer and the data link and physical layer, with XMIT SR and RCV SR paths (128b and 32b) and two x8 serdes (16b) running at 2.5 Gb/s.]
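The 3-4 GB/s figure can be approximately reproduced from the lane rate. The sketch assumes the 8b/10b line coding used by first-generation PCI-Express, sums both directions of the x8 link, and charges a hypothetical 24 bytes of framing, header, and CRC per packet; the overhead value and the decision to count both directions are assumptions, not numbers from the slides.

```python
LANES = 8
LANE_BITS_PER_S = 2.5e9      # per lane, per direction (from the slides)
TLP_OVERHEAD_BYTES = 24      # assumption: framing + sequence + header + CRC


def raw_bytes_per_s_per_direction():
    """Payload-capable bytes per second after 8b/10b coding, one direction."""
    return LANES * LANE_BITS_PER_S * 8 / 10 / 8


def effective_bytes_per_s(payload_bytes, overhead=TLP_OVERHEAD_BYTES):
    """Both directions combined, scaled by per-packet payload efficiency."""
    efficiency = payload_bytes / (payload_bytes + overhead)
    return 2 * raw_bytes_per_s_per_direction() * efficiency


for payload in (128, 256, 512):
    gb = effective_bytes_per_s(payload) / 1e9
    print(payload, "B payload ->", round(gb, 2), "GB/s")
```

Larger maximum payloads amortize the fixed per-packet overhead, which is why the slide quotes a bandwidth range tied to the payload range.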

21 Power Management
- Limit speculation
  - Sequential prefetch of instruction cache lines
  - Predict conditional branches as not-taken
  - Predict loads hit in the data cache
  - Hardware tablewalk search control
- Extensive clock gating
  - Datapath
  - Control blocks
  - Arrays
- Power throttling
  - 3 external power throttle pins
  - Inject stall cycles into the decode stage based on state of these pins
  - If power_throttle_pins[2:0]==n then n stalls in window of 8, n is 0-7
  - Affects all threads
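The throttling rule (n stalls in every window of 8 decode cycles) can be modeled in a few lines. Where the stalls fall inside the window is not stated on the slide; placing them at the start of the window is an assumption made only to keep the sketch simple, and the function names are hypothetical.

```python
WINDOW = 8  # decode cycles per throttle window (from the slides)


def stall_pattern(pins):
    """Return 8 booleans, True where decode is stalled, for pin value 0-7."""
    if not 0 <= pins <= 7:
        raise ValueError("power_throttle_pins[2:0] must be in 0..7")
    # Assumption: the n stall cycles occupy the first n slots of the window.
    return [slot < pins for slot in range(WINDOW)]


def issue_fraction(pins):
    """Fraction of decode cycles still available to the threads."""
    return 1 - sum(stall_pattern(pins)) / WINDOW


for n in range(8):
    print(n, stall_pattern(n), issue_fraction(n))
```

Because the stalls are injected at decode, the reduction applies to every thread on the core, matching the "affects all threads" note.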

22 Niagara2 System Status
- First silicon arrived at the end of May
- Booted Solaris in 5 days
- Current systems are fully operational
- Expect systems to ship in 2H2007

23 Summary
- Niagara2 combines all major server functions on one chip
  - Integrated networking
  - Integrated PCI-Express
  - Embedded wire-speed cryptography
- Niagara2 has improved performance vs. UltraSparc T1
  - Better integer throughput and throughput/watt (2x)
  - Improved integer single-thread performance (1.4x)
  - Better floating-point throughput (10x)
  - Better floating-point single-thread performance (5x)
- Enables new generation of power-efficient, fully-secure datacenters

24 Thank you...
