NOW Handout Page 1. Recap: Gigaplane Bus Timing. Scalability

Size: px

Start display at page:

Download "NOW Handout Page 1. Recap: Gigaplane Bus Timing. Scalability"

Helen Wilson
6 years ago
Views:

1 Recap: Gigaplane Bus Timing Address Rd A Rd B Scalability State Arbitration 1 4,5 2 Share ~Own 6 Own 7 A D A D A D A D A D A D A D A D CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley Status Data OK D 0 D 1 Cancel 3/3/99 CS258 S99 2 Enterprise rocessor and emory System 2 procs per board, external L2 caches, 2 mem banks with x-bar Data lines buffered through UDB to drive internal 1.3 GB/s UA bus Wide path to memory so full 64-byte line in 1 mem cycle (2 bus cyc) Addr controller adapts proc and bus protocols, does cache coherence its tags keep a subset of states needed by bus (e.g. no /E distinction) emory (16 72-bit SIS) L 2 UDB s UltraSparc L 2 UDB s UltraSparc FiberChannel module (2) SBUS 25 Hz SysIO SBUS slots SysIO 10/100 Ethernet 64 Fast wide SCSI Enterprise I/O System I/O board has same bus interface ASICs as processor boards But internal bus half as wide, and no memory path Only cache block sized transactions, like processing boards Uniformity simplifies design ASICs implement single-block cache, follows coherence protocol Two independent 64-bit, 25 Hz Sbuses One for two dedicated FiberChannel modules connected to disk One for Ethernet and fast wide SCSI Can also support three SBUS interface cards for arbitrary peripherals erformance and cost of I/O scale with no. of I/O boards D-tags Address controller Address controller Data controller (crossbar) Data controller (crossbar) Data 288 Control Address Data 288 Control Address 3/3/99 CS258 S99 3 Gigaplane connector Gigaplane connector 3/3/99 CS258 S99 4 Limited Scaling of a Bus Workstations in a LAN? Characteristic hysical Length Number of Connections aximum Bandwidth Interface to Comm. medium Global Order rotection Trust OS comm. abstraction Bus ~ 1 ft fixed fixed memory inf arbitration Virt -> physical total single HW Bus: each level of the system design is grounded in the scaling limits at the layers below and assumptions of close coupling between components 3/3/99 CS258 S99 5 Characteristic Bus LAN hysical Length ~ 1 ft K Number of Connections fixed many aximum Bandwidth fixed??? Interface to Comm. medium memory inf peripheral Global Order arbitration??? rotection Virt -> physical OS Trust total none OS single independent comm. abstraction HW SW No clear limit to physical scaling, little trust, no global order, consensus difficult to achieve. Independent failure and restart 3/3/99 CS258 S99 6 NOW Handout age 1 CS258 S99 1

2 Scalable achines What are the design trade-offs for the spectrum of machines between? specialize or commodity nodes? capability of node-to-network interface supporting programming models? Bandwidth Scalability S S S S Typical switches Bus Crossbar What does scalability mean? avoids inherent design limits on resources bandwidth increases with latency does not cost increases slowly with What fundamentally limits bandwidth? single set of wires ust have many independent wires Connect modules through switches Bus vs Network? ultiplexers 3/3/99 CS258 S99 7 3/3/99 CS258 S99 8 Dancehall Organization Generic Distributed emory Org. Scalable network Scalable network CA Network bandwidth? Bandwidth demand? independent processes? communicating processes? Latency? 3/3/99 CS258 S99 9 Network bandwidth? Bandwidth demand? independent processes? communicating processes? Latency? 3/3/99 CS258 S99 10 Key roperty Large number of independent communication paths between nodes => allow a large number of concurrent transactions using different wires initiated independently no global arbitration effect of a transaction only visible to the nodes involved effects propagated through additional transactions Latency Scaling T(n) = Overhead + Channel + Routing Delay Overhead? Channel (n) = n/b --- BW at bottleneck RoutingDelay(h,n) 3/3/99 CS258 S /3/99 CS258 S99 12 NOW Handout age 2 CS258 S99 2

3 Typical example max distance: log n number of switches: α n log n overhead = 1 us, BW = 64 B/s, 200 ns per hop ipelined T 64 (128) = 1.0 us us + 6 hops * 0.2 us/hop = 4.2 us T 1024 (128) = 1.0 us us + 10 hops * 0.2 us/hop = 5.0 us Store and Forward T 64 sf (128) = 1.0 us + 6 hops * ( ) us/hop = 14.2 us T 64 sf (1024) = 1.0 us + 10 hops * ( ) us/hop = 23 us Cost Scaling cost(p,m) = fixed cost + incremental cost (p,m) Bus Based S? Ratio of processors : memory : network : I/O? arallel efficiency(p) = Speedup() / Costup(p) = Cost(p) / Cost(1) Cost-effective: speedup(p) > costup(p) Is super-linear speedup 3/3/99 CS258 S /3/99 CS258 S99 14 Cost Effective? hysical Scaling Speedup = /(1+ log) Costup = Chip-level integration Board-level System level rocessors 2048 processors: 475 fold speedup at 206x cost 3/3/99 CS258 S /3/99 CS258 S99 16 ncube/2 achine Organization C-5 achine Organization Basic module Diagnostics network Control network Data network 1024 Nodes rocessing partition rocessing Control I/O partition partition processors interface U I-Fetch Operand & decode DA channels Router Hypercube network configuration SARC FU Data Control networks network Execution unit 64-bit integer IEEE floating point Single-chip node SRA NI BUS Vector unit Vector unit Entire machine synchronous at 40 Hz 3/3/99 CS258 S /3/99 CS258 S99 18 NOW Handout age 3 CS258 S99 3

System Level Integration Realizing rogramming odels CAD Database Scientific modeling arallel applications General interconnection network formed from 8-port switches ower 2 CU IB S-2 node L 2 emory

4 System Level Integration Realizing rogramming odels CAD Database Scientific modeling arallel applications General interconnection network formed from 8-port switches ower 2 CU IB S-2 node L 2 emory bus emory 4-way interleaved controller icrochannel bus NIC I/O DA i860 NI ultiprogramming Shared address essage passing Data parallel rogramming models Compilation or library Communication abstraction User/system boundary Operating systems support Communication hardware hysical communication medium Hardware/software boundary 3/3/99 CS258 S /3/99 CS258 S99 20 Network Transaction rimitive Output buffer node Communication network Serialized data packet Input buffer node one-way transfer of information from a source output buffer to a dest. input buffer causes some action at the destination occurrence is not directly visible at source deposit data, state change, reply 3/3/99 CS258 S99 21 Bus Transactions vs Net Transactions Issues: protection check V->?? format wires flexible output buffering reg, FIFO?? media arbitration global local destination naming and routing input buffering limited many source action completion detection 3/3/99 CS258 S99 22 Shared Address Space Abstraction Consistency is challenging A=1; flag=1; while (flag==0); print A; (1) Initiate memory access (2) Address translation Load r [Global address] (4) Request transaction Read request emory emory emory A:0 flag:0->1 (5) Remote memory access Wait Read request emory access Delay 1: A=1 3: load A (6) Reply transaction Read response Read response (a) 2: flag=1 Interconnection network (7) Complete memory access 2 3 fixed format, request/response, simple action (b) 1 Congested path 3/3/99 CS258 S /3/99 CS258 S99 24 NOW Handout age 4 CS258 S99 4

5 Synchronous essage assing Asynch. essage assing: Optimistic (1) Initiate send (2) Address translation on src Send dest, local VA, len Recv src, local VA, len (1) Initiate send (2) Address translation Send ( dest, local VA, len) (4) Send-ready request Send-rdy req (4) Send data (5) Remote check for posted receive (assume success) Wait check (5) Remote check for posted receive; on fail, allocate data buffer Data-xfer req match Allocate buffer (6) Reply transaction Recv-rdy reply (7) Bulk data transfer VA Dest VA or ID Recv src, local VA, len Data-xfer req Storage??? 3/3/99 CS258 S /3/99 CS258 S99 26 Asynch. sg assing: Conservative Active essages (1) Initiate send (2) Address translation on dest (4) Send-ready request (5) Remote check for posted receive (assume fail); record send-ready Send dest, local VA, len Send-rdy req Return and compute check Request handler handler Reply (6) Receive-ready request Recv src, local VA, len (7) Bulk data reply VA Dest VA or ID Recv-rdy req Data-xfer reply User-level analog of network transaction Action is small user function Request/Reply ay also perform memory-to-memory transfer 3/3/99 CS258 S /3/99 CS258 S99 28 Common Challenges Input buffer overflow N-1 queue over-commitment => must slow sources reserve space per source (credit)» when available for reuse? Ack or Higher level Refuse input when full» backpressure in reliable network» tree saturation» deadlock free» what happens to traffic not bound for congested dest? Reserve ack back channel drop packets 3/3/99 CS258 S99 29 Challenges (cont) Fetch Deadlock For network to remain deadlock free, nodes must continue accepting messages, even when cannot source them what if incoming transaction is a request?» Each may generate a response, which cannot be sent!» What happens when internal buffering is full? logically independent request/reply networks physical networks virtual channels with separate input/output queues bound requests and reserve input buffer space K(-1) requests + K responses per node service discipline to avoid fetch deadlock? NACK on input buffer full NACK delivery? 3/3/99 CS258 S99 30 NOW Handout age 5 CS258 S99 5

6 Summary Scalability physical, bandwidth, latency and cost level of integration Realizing rogramming odels network transactions protocols safety» N-1» fetch deadlock Next: Communication Architecture Design Space how much hardware interpretation of the network transaction? 3/3/99 CS258 S99 31 NOW Handout age 6 CS258 S99 6

Outline. Limited Scaling of a Bus

Outline. Limited Scaling of a Bus Outline Scalability physical, bandwidth, latency and cost level of integration Realizing rogramming Models network transactions protocols safety input buffer problem: N-1 fetch deadlock Communication Architecture