Snoop-Based Multiprocessor Design III: Case Studies

Todd C. Mowry, CS 41, March

Case Studies of Bus-based Machines
  SGI Challenge, with Powerpath
  SUN Enterprise, with Gigaplane
  The two take very different positions on the design issues discussed above

Overview
  For each system:
    Bus design
    Processor and memory system
    Input/output system
    Microbenchmark memory access results
  Application performance and scaling (SGI Challenge)

SGI Challenge Overview
  [Figure: (a) a four-processor board; (b) machine organization: R4400 CPUs and caches, interleaved memory (1 GB maximum), and an I/O subsystem (VME-4, SCSI-, graphics, HPPI) on the Powerpath-2 bus (5 data, 4 address, 47. MHz)]
  3 R4400 (peak .7 GFLOPS, 4 per board) or 1 R (peak 5.4 GFLOPS, per board)
  -way interleaved memory (up to 1 GB)
  4 I/O buses of 3 MB/s each
  1. GB/s bus, 47. MHz, 1 slots, 39 signals
  Bytes lines (1 + 4 cycles)
  Split-transaction with up to outstanding reads
  All transactions take five cycles

SUN Enterprise Overview
  [Figure: CPU/Mem cards (processors P with caches $, plus a memory controller) and I/O cards, each attached through a bus interface/switch to the Gigaplane bus (5 data, 41 address, 3 MHz)]
  Up to 3 UltraSPARC processors (peak 9 GFLOPs)
  Gigaplane has peak bandwidth .7 GB/s; up to 3 GB memory
  1 slots, for processing or I/O boards
  CPUs and 1 GB memory per board
  Memory is distributed, unlike Challenge, but the protocol treats it as centralized
  Each I/O board has 4-bit 5 MHz SBUSes
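The bus figures above are linked by simple arithmetic: if every transaction moves one cache line in a fixed number of cycles, sustained bandwidth is the line size divided by the transaction time. The sketch below illustrates that relation. The function name, the reading of "1 + 4 cycles" as one overhead cycle plus four data cycles, and the example values (64-byte line, 100 MHz clock) are assumptions chosen for illustration, not parameters of either machine.

```python
def bus_bandwidth_mb_s(line_bytes, overhead_cycles, data_cycles, clock_mhz):
    """Sustained bandwidth (MB/s, 10^6 bytes) when each transaction moves one line.

    A transaction occupies overhead_cycles + data_cycles bus cycles, and the
    bus completes clock_mhz * 10^6 cycles per second.
    """
    cycles_per_transaction = overhead_cycles + data_cycles
    # bytes per transaction * transactions per microsecond = bytes per microsecond
    return line_bytes * clock_mhz / cycles_per_transaction


if __name__ == "__main__":
    # Hypothetical example values, not taken from the slides.
    print(bus_bandwidth_mb_s(line_bytes=64, overhead_cycles=1,
                             data_cycles=4, clock_mhz=100.0))
```

The same function shows why a wider bus or a faster clock raises bandwidth, while extra overhead cycles per transaction lower it.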

Bus Design Issues
  Multiplexed versus non-multiplexed (separate addr and data lines)
  Wide versus narrow data buses
  Bus clock rate
    Affected by signaling technology, length, number of slots...
  Split transaction versus atomic
  Flow control strategy

SGI Powerpath-2 Bus
  Non-multiplexed, 5-data/4-address, 47. MHz, o/s requests
  Wide => more interface chips so higher latency, but more bw at slower clock
  Large block size also calls for wider bus
  Uses Illinois MESI protocol (cache-to-cache sharing)
  More detail in chapter
  [Figure: bus state diagram with five states in order: 1. Arbitration, 2. Resolution, 3. Address, 4. Decode, 5. Acknowledge; transitions labeled "At least one requestor" and "No requestors"]

Bus Timing
  [Figure: two back-to-back five-cycle transactions (Arb, Rslv, Addr, Decode, Ack), with rows for command, arbitration (including urgent arbitration), acknowledgment, data (D), data resource ID, and inhibit signals]

Processor and Memory Systems
  [Figure: four R4400 processors, each with a CC chip, sharing one A chip and four D-chip slices (slice 1 to slice 4)]
  4 R4400 processors per board share A and D chips
  A chip has address bus interface, request table, control logic
  CC chip per processor has duplicate set of tags
  Processor requests go from CC chip to A chip to bus
  4 bit-sliced D chips interface CC chip to Powerpath-2
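The five-state bus cycle described above (arbitration, resolution, address, decode, acknowledge) can be sketched as a small state machine. This is an illustration only: the assumption that the "No requestors" edge leads from resolution back to arbitration is one reading of the diagram, and the names `BusState`, `next_state`, and `trace` are hypothetical.

```python
from enum import Enum


class BusState(Enum):
    ARBITRATION = 1
    RESOLUTION = 2
    ADDRESS = 3
    DECODE = 4
    ACKNOWLEDGE = 5


def next_state(state, have_requestor):
    """Advance the bus by one cycle.

    Assumption: with no requestor, the bus idles between arbitration and
    resolution; with at least one, it walks through all five states.
    """
    if state is BusState.ARBITRATION:
        return BusState.RESOLUTION
    if state is BusState.RESOLUTION:
        return BusState.ADDRESS if have_requestor else BusState.ARBITRATION
    if state is BusState.ADDRESS:
        return BusState.DECODE
    if state is BusState.DECODE:
        return BusState.ACKNOWLEDGE
    return BusState.ARBITRATION  # ACKNOWLEDGE wraps around to a new cycle


def trace(requestor_flags):
    """Return the state names visited, one per cycle, starting in ARBITRATION."""
    state = BusState.ARBITRATION
    visited = []
    for flag in requestor_flags:
        visited.append(state.name)
        state = next_state(state, flag)
    return visited


if __name__ == "__main__":
    print(trace([True] * 5))   # a full five-cycle transaction
    print(trace([False] * 4))  # an idle bus
```

Because every transaction takes the same five cycles in this model, all bus interfaces can track the state in lockstep without extra handshaking.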

Memory Access Latency
  5 ns access time from address on bus to data on bus
  But overall latency seen by processor is 1 ns!
    3 ns for request to get from processor to bus: down through cache hierarchy, CC chip and A chip
    4 ns later, data gets to D chips: 3 bus cycles to address phase of request transaction, to access main memory, 5 to deliver data across bus to D chips
    3 ns more for data to get to processor chip: up through D chips, CC chip, and 4-bit wide interface to processor chip, load data into primary cache, restart pipeline

Challenge I/O Subsystem
  [Figure: I/O board: peripheral interfaces (SCSI, VME, HPPI, graphics) attach through personality ASICs to the I/O bus (3 MB/s), which connects through the system bus interface (address map, data path) to the system address bus and system data bus (1. GB/s)]
  Multiple I/O cards on system bus, each has 3 MB/s bus
  Personality ASICs connect these to devices (standard and graphics)
  Proprietary 4-bit multiplexed address/data bus, same clock as system bus
  Split read transactions, up to 4 per device
  Pipelined, but centralized arbitration, with several transaction lengths
  Translation via mapping RAM in system bus interface
  Why the decouplings? (Why not connect directly to system bus?)
  I/O board acts like a processor to memory system

Challenge Memory System Performance
  Read microbenchmark with various strides and array sizes
  [Figure: read time (ns) versus stride (bytes), one curve per array size; regions labeled TLB, MEM, and L]
  Ping-pong flag-spinning microbenchmark: round-trip time . µs

Sun Gigaplane Bus
  Non-multiplexed, split-transaction, 5-data/41-address, 3.5 MHz
    Plus 3 ECC lines, 7 tag, 1 arbitration, etc. Total 3.
  Cards plug in on both sides: per side
  1 outstanding transactions, up to 7 from each board
    Designed for multiple outstanding transactions per processor
  Emphasis on reducing latency, unlike Challenge
    Speculative arbitration if address bus not scheduled from prev. cycle
    Else regular 1-cycle arbitration, and 7-bit tag assigned in next cycle
  Snoop result associated with request phase (5 cycles later)
  Main memory can stake claim to data bus 3 cycles into this, and start memory access speculatively
    Two cycles later, asserts tag bus to inform others of coming transfer
  MOESI protocol (owned state for cache-to-cache sharing)
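The read microbenchmark mentioned above walks an array of a given size at a given stride and reports the average time per read; plotting that time against stride for several array sizes exposes cache, memory, and TLB effects. The sketch below shows only the structure of such a benchmark. It is not the code used for the slides, the names `access_indices` and `stride_read_ns` are hypothetical, and in Python the interpreter overhead dominates the timings, so a real measurement would use C or assembly.

```python
import time
from array import array

ELEM_BYTES = 8  # size of one array('d') element


def access_indices(array_bytes, stride_bytes):
    """Element indices touched by one pass over the array at the given stride."""
    n_elems = array_bytes // ELEM_BYTES
    step = max(1, stride_bytes // ELEM_BYTES)
    return range(0, n_elems, step)


def stride_read_ns(array_bytes, stride_bytes, repeats=3):
    """Best-of-repeats average time per read, in nanoseconds."""
    n_elems = array_bytes // ELEM_BYTES
    data = array("d", bytes(n_elems * ELEM_BYTES))  # zero-filled doubles
    indices = access_indices(array_bytes, stride_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter_ns()
        total = 0.0
        for i in indices:
            total += data[i]  # the read being timed
        elapsed = time.perf_counter_ns() - start
        best = min(best, elapsed / len(indices))
    return best


if __name__ == "__main__":
    for size in (4 * 1024, 64 * 1024, 1024 * 1024):
        for stride in (8, 64, 4096):
            print(size, stride, round(stride_read_ns(size, stride), 1))
```

Sweeping both parameters matters: a small array stays cache-resident at any stride, while a large array with a stride beyond the line size misses on every read.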

Gigaplane Bus Timing
  [Figure: timing of overlapped read transactions (Rd A, Rd B) showing arbitration, snoop state (Share, ~Own, Own), status (OK, Cancel), and alternating address (A) and data (D) activity]

Enterprise Processor and Memory System
  [Figure: processor board with two UltraSparc processors, each with a UDB, plus memory (1 7-bit SIMMS), an address controller, and a data controller (crossbar), attached through a Gigaplane connector]
  procs per board, external L caches, mem banks with x-bar
  Lines buffered through UDB to drive internal 1.3 GB/s UPA
  Wide path to memory so full 4-byte line in 1 mem cycle ( cyc)
  Addr controller adapts proc and bus protocols, does cache coherence
    Its tags keep a subset of states needed by bus (e.g. no M/E distinction)

Enterprise I/O System
  [Figure: I/O board with two SysIO ASICs driving 5 MHz 4-bit SBUSes (FiberChannel module (), SBUS slots, 1/1 Ethernet, fast wide SCSI), plus an address controller and a data controller (crossbar), attached through a Gigaplane connector]
  I/O board has same bus interface ASICs as processor boards
  But internal bus half as wide, and no memory path
  Only cache block sized transactions, like processing boards
    Uniformity simplifies design
    ASICs implement single-block cache, follows coherence protocol
  Two independent 4-bit, 5 MHz SBUSes
    One for two dedicated FiberChannel modules connected to disk
    One for Ethernet and fast wide SCSI
    Can also support three SBUS interface cards for arbitrary peripherals
  Performance and cost of I/O scale with no. of I/O boards

Memory Access Latency
  3 ns read miss latency
  11 cycle min bus protocol at 3.5 MHz is 13 ns of this time
  Rest is path through caches and the DRAM access
  TLB misses add 34 ns
  [Figure: read time (ns) versus stride (bytes), one curve per array size]
  Ping-pong microbenchmark is 1.7 µs round-trip (5 mem accesses)
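The latency breakdown above converts a minimum protocol length in bus cycles to nanoseconds and attributes the remainder of the read miss latency to the cache path and the DRAM access. The sketch below shows that arithmetic. The 11-cycle protocol length comes from the text; the 100 MHz clock and the 250 ns total are hypothetical round numbers chosen for illustration, and the function names are made up.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Duration of a number of bus cycles, in nanoseconds."""
    return cycles * 1000.0 / clock_mhz


def non_bus_latency_ns(total_miss_ns, protocol_cycles, clock_mhz):
    """Part of a read miss not spent in the bus protocol
    (path through the caches plus the DRAM access)."""
    return total_miss_ns - cycles_to_ns(protocol_cycles, clock_mhz)


if __name__ == "__main__":
    # Hypothetical clock and total latency, not the machine's values.
    print(cycles_to_ns(11, 100.0))               # bus protocol share
    print(non_bus_latency_ns(250.0, 11, 100.0))  # caches + DRAM share
```

Separating the two shares makes it clear how much of a miss a faster bus clock could remove and how much is fixed by the memory and cache path.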

Applications (Challenge)
  [Figure: one curve per workload: LU (n = 1,4 and n = ,4), Barnes-Hut (1-K and 5-K particles), Raytrace (balls, car), Ocean (n = 13 and n = 1,4), Radiosity (room, large room), Radix (1-M and 4-M keys)]
  Problem in Ocean with small problem: communication and barrier cost
  Problem in Radix: contention on bus due to very high traffic
    Also leads to high imbalances and barrier wait time

Application Scaling under Other Models
  [Figure: panels plotting work (instructions) against number of bodies and against number of points per grid, with curves labeled Naive TC, Naive MC, TC, MC, and PC]
