Outline. Limited Scaling of a Bus

Size: px

Start display at page:

Download "Outline. Limited Scaling of a Bus"

Dorothy Paul
5 years ago
Views:

1 Outline Scalability physical, bandwidth, latency and cost level of integration Realizing rogramming Models network transactions protocols safety input buffer problem: N-1 fetch deadlock Communication Architecture Design Space how much hardware interpretation of the network transaction? 2 Limited Scaling of a Bus Characteristic hysical Length Number of Connections Maximum Bandwidth Interface to Comm. medium Global Order rotection Trust OS comm. abstraction Bus ~ 1 ft fixed fixed memory inf arbitration Virt -> physical total single HW Bus: each level of the system design is grounded in the scaling limits at the layers below and assumptions of close coupling between components 3 1

2 Workstations in a LAN? Characteristic Bus LAN hysical Length ~ 1 ft KM Number of Connections fixed many Maximum Bandwidth fixed??? Interface to Comm. medium memory inf peripheral Global Order arbitration??? rotection Virt -> physical OS Trust total none OS single independent comm. abstraction HW SW No clear limit to physical scaling, little trust, no global order, consensus difficult to achieve. Independent failure and restart 4 Scalable Machines What are the design trade-offs for the spectrum of machines between? specialize or commodity nodes? capability of node-to-network interface supporting programming models? What does scalability mean? avoids inherent design limits on resources bandwidth increases with latency does not cost increases slowly with 5 2

3 Bandwidth Scalability Typical switches Bus S S S S Crossbar M M M M M M M M Multiplexers What fundamentally limits bandwidth? single set of wires Must have many independent wires Connect modules through switches Bus vs Network Switch? 6 Dancehall M Organization M M M Scalable network Switch Switch Switch $ $ $ $ Network bandwidth? Bandwidth demand? independent processes? communicating processes? Latency? 7 3

4 Generic Distributed Memory Org. Scalable network Switch Switch Switch M CA $ Network bandwidth? Bandwidth demand? independent processes? communicating processes? Latency? 8 Key roperty Large number of independent communication paths between nodes => allow a large number of concurrent transactions using different wires initiated independently no global arbitration effect of a transaction only visible to the nodes involved effects propagated through additional transactions 9 4

5 Latency Scaling T(n) = Overhead + Channel Time + Routing Delay Overhead? Channel Time(n) = n/b --- BW at bottleneck RoutingDelay(h,n) 10 Typical example max distance: log n number of switches: α n log n overhead = 1 us, BW = 64 MB/s, 200 ns per hop ipelined T 64 (128) = 1.0 us us + 6 hops * 0.2 us/hop = 4.2 us T 1024 (128) = 1.0 us us + 10 hops * 0.2 us/hop = 5.0 us Store and Forward T 64 sf (128) = 1.0 us + 6 hops * ( ) us/hop = 14.2 us T 1024 sf (128) = 1.0 us + 10 hops * ( ) us/hop = 23 us 11 5

6 Cost Scaling cost(p,m) = fixed cost + incremental cost (p,m) Bus Based SM? Ratio of processors : memory : network : I/O? arallel efficiency(p) = Speedup() / Costup(p) = Cost(p) / Cost(1) Cost-effective: speedup(p) > costup(p) Is super-linear speedup 12 Cost Effective? Speedup = /(1+ log) Costup = rocessors 2048 processors: 475 fold speedup at 206x cost 13 6

hysical Scaling Chip-level integration Board-level System level 14 ncube/2 Machine Organization Basic module 1024 Nodes DRAM interface MMU I-Fetch & decode DMA

7 hysical Scaling Chip-level integration Board-level System level 14 ncube/2 Machine Organization Basic module 1024 Nodes DRAM interface MMU I-Fetch & decode DMA channels Router Operand $ Execution unit 64-bit integer IEEE floating point Single-chip node Hypercube network configuration Entire machine synchronous at 40 MHz 15 7

CM-5 Machine Organization Diagnostics network Control network Data network M M rocessingı partition rocessingı Controlı partition processors I/O partition SARC FU Dataı networks Controlı network $ı

8 CM-5 Machine Organization Diagnostics network Control network Data network M M rocessingı partition rocessingı Controlı partition processors I/O partition SARC FU Dataı networks Controlı network $ı ctrl $ı SRAM NI MBUS DRAMı ctrl Vectorı unit DRAMı ctrl DRAMı ctrl Vectorı unit DRAMı ctrl DRAM DRAM DRAM DRAM Board-Level Integration Example 16 System Level Integration IBM S-2 ower 2 CU IBM S-2 node L 2 $ Memory bus General inter connection network formed fr om 8-port switches Memory controller 4-way interleaved DRAM MicroChannel bus NIC I/O i860 DMA NI DRAM 17 8

9 Network Transaction rimitive Communication Network serialized msg output buffer input buffer Source Node Destination Node one-way transfer of information from a source output buffer to a dest. input buffer causes some action at the destination occurrence is not directly visible at source deposit data, state change, reply 20 Bus Transactions vs Net Transactions Issues: protection check V->?? format wires flexible output buffering reg, FIFO?? media arbitration global local destination naming and routing simple complex input buffering limited many source action completion detection transaction ordering delivery guarantees 21 9

10 Shared Address Space Abstraction Source Destination (1) Initiate memory access (2) Address translation (3) Local/remote check (4) Request transaction Load r [Global address] Read request Read request (5) Remote memory access Wait Memory access (6) Reply transaction (7) Complete memory access Read response Read response Time Fundamentally a two-way request/response protocol writes have an acknowledgement Issues fixed or variable length (bulk) transfers remote virtual or physical address, where is action performed? deadlock avoidance and input buffer full coherent? consistent? 22 The Fetch Deadlock roblem Even if a node cannot issue a request, it must sink network transactions. Incoming transaction may be a request, which will generate a response. Closed system (finite buffering) 23 10

11 Consistency A=1; flag=1; while (flag==0); print A; Memory Memory Memory A:0 flag:0->1 Delay 1: A=1 3: load A 2: flag=1 Interconnection network (a) 2 3 (b) 1 Congested path write-atomicity violated without caching 24 Key roperties of SAS Abstraction Source and destination data addresses are specified by the source of the request a degree of logical coupling and trust no storage logically outside the application address space(s) may employ temporary buffers for transport Operations are fundamentally request response Remote operation can be performed on remote memory logically does not require intervention of the remote processor 25 11

12 Message passing Bulk transfers Complex synchronization semantics more complex protocols More complex action Synchronous Send completes after matching recv and source data sent Receive completes after data transfer complete from matching send Asynchronous Send completes after send buffer may be reused 26 Synchronous Message assing Source Destination (1) Initiate send Recv src, local VA, len (2) Address translation on src (3) Local/remote check Send dest, local VA, len (4) Send-ready request Send-rdy req (5) Remote check for posted receive (assume success) Wait Tag check rocessor Action? (6) Reply transaction Recv-rdy reply (7) Bulk data transfer Source VA Dest VA or ID Data-xfer req Time Constrained programming model. Deterministic! What happens when threads added? Destination contention very limited. User/System boundary? 27 12

13 Asynch. Message assing: Optimistic (1) Initiate send (2) Address translation (3) Local /remote check (4) Send data (5) Remote check for posted receive; on fail, allocate data buffer Source Send ( dest, local VA, len) Data-xfer req Destination Tag match Allocate buffer Time Recv src, local VA, len More powerful programming model Wildcard receive => non-deterministic Storage required within msg layer? 28 Asynch. Msg assing: Conservative Source Destination (1) Initiate send (2) Address translation on dest (3) Local/remote check (4) Send-ready request Send dest, local VA, len Send-rdy req (5) Remote check for posted receive (assume fail); record send-ready Return and compute Tag check (6) Receive-ready request (7) Bulk data reply Source VA Dest VA or ID Recv src, local VA, len Recv-rdy req Where is the buffering? Data-xfer reply Time Contention control? Receiver initiated protocol? Short message optimizations 29 13

14 Key Features of Msg assing Abstraction Source knows send data address, dest. knows receive data address after handshake they both know both Arbitrary storage outside the local address spaces may post many sends before any receives non-blocking asynchronous sends reduces the requirement to an arbitrary number of descriptors fine print says these are limited too Fundamentally a 3-phase transaction includes a request / response can use optimisitic 1-phase in limited Safe cases credit scheme 30 Active Messages Request handler handler Reply User-level analog of network transaction transfer data packet and invoke handler to extract it from the network and integrate with on-going computation Request/Reply Event notification: interrupts, polling, events? May also perform memory-to-memory transfer 31 14

15 Common Challenges Input buffer overflow N-1 queue over-commitment => must slow sources reserve space per source when available for reuse? Ack or Higher level Refuse input when full (credit) backpressure in reliable network tree saturation deadlock free what happens to traffic not bound for congested dest? Reserve ack back channel drop packets Utilize higher-level semantics of programming model 32 Challenges (cont) Fetch Deadlock For network to remain deadlock free, nodes must continue accepting messages, even when cannot source msgs what if incoming transaction is a request? Each may generate a response, which cannot be sent! What happens when internal buffering is full? logically independent request/reply networks physical networks virtual channels with separate input/output queues bound requests and reserve input buffer space K(-1) requests + K responses per node service discipline to avoid fetch deadlock? NACK on input buffer full NACK delivery? 33 15

16 Challenges in Realizing rog. Models in the Large One-way transfer of information No global knowledge, nor global control barriers, scans, reduce, global-or give fuzzy global state Very large number of concurrent transactions Management of input buffer resources many sources can issue a request and over-commit destination before any see the effect Latency is large enough that you are tempted to take risks optimistic protocols large transfers dynamic allocation Many many more degrees of freedom in design and engineering of these system 34 Summary Scalability physical, bandwidth, latency and cost level of integration Realizing rogramming Models network transactions protocols safety input buffer problem: N-1 fetch deadlock Communication Architecture Design Space how much hardware interpretation of the network transaction? 35 16

17 Network Transaction rocessing Scalable Network Message Output rocessing checks translation formating scheduling M CA Communication Assist Node Architecture M CA Input rocessing checks translation buffering action Key Design Issue: How much interpretation of the message? How much dedicated processing in the Comm. Assist? 36 Increasing HW Support, Specialization, Intrusiveness, erformance (???) Spectrum of Designs None: hysical bit stream blind, physical DMA ncube, isc,... User/System User-level port CM-5, *T User-level handler J-Machine, Monsoon,... Remote virtual address rocessing, translation aragon, Meiko CS-2 Global physical address roc + Memory controller R3, BBN, T3D Cache-to-cache Cache controller Dash, KSR, Flash 37 17

18 Net Transactions: hysical DMA Data Dest DMA channels Addr Length Rdy Cmd Status, interrupt Addr Length Rdy Memory Memory DMA controlled by regs, generates interrupts hysical => OS initiates transfers Send-side sender auth dest addr construct system envelope around user data in kernel area Receive must receive into system buffer, since no interpretation inca 38 ncube Network Interface Input ports Output ports Switch Addr DMA Addr Addr Addr Addr channels Length Length Addr Length Memory bus Memory rocessor independent DMA channel per link direction leave input buffers always open segmented messages routing interprets envelope dimension-order routing on hypercube bit-serial with 36 bit cut-through Os 16 ins 260 cy 13 us Or cy 15 us - includes interrupt 39 18

19 Conventional LAN Network Interface Host Memory NIC trncv NIC Controller Data TX RX addr len DMA Addr Len Status Next Addr Len Status Next Addr Len Status Next Addr Len Status Next mem bus roc IO Bus Addr Len Status Next Addr Len Status Next 40 User Level orts Data User/system Dest Mem Status, interrupt Mem initiate transaction at user level deliver to user without OS intervention network port in user space User/system flag in envelope protection check, translation, routing, media access in src CA user/sys check in dest CA, interrupt on system 41 19

20 User Level Network ports Virtual address space Net output port Net input port Status rocessor Registers rogram counter Appears to user as logical message queues plus status What happens if no user pop? 42 Example: CM-5 Input and output FIFO for each network Diagnostics network Control network Data network 2 data networks tag per message index NI mapping table context switching? SARC $ı ctrl rocessingı partition FU $ı SRAM M M rocessingı Controlı partition processors Dataı Controlı networks network NI I/O partition *T integrated NI on chip iwar also MBUS Vectorı Vectorı DRAMı unit DRAMı DRAMı unit DRAMı ctrl ctrl ctrl ctrl DRAM DRAM DRAM DRAM Os 50 cy 1.5 us Or 53 cy 1.6 us interrupt 10us 43 20

User Level Handlers User/system Data Address Dest Mem Mem Hardware support to vector to address specified in message message ports in registers

21 User Level Handlers User/system Data Address Dest Mem Mem Hardware support to vector to address specified in message message ports in registers 44 iwar Host Interface unit Nodes integrate communication with computation on systolic basis Msg data direct to register Stream into memory 47 21

NOW Handout Page 1. Recap: Gigaplane Bus Timing. Scalability

NOW Handout Page 1. Recap: Gigaplane Bus Timing. Scalability Recap: Gigaplane Bus Timing 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Address Rd A Rd B Scalability State Arbitration 1 4,5 2 Share ~Own 6 Own 7 A D A D A D A D A D A D A D A D CS 258, Spring 99 David E. Culler