Scalable Multiprocessors

Parallel Computer Organization and Design: Lecture 7
Per Stenström, 2008; Sally A. McKee, 2009
2/17/2009

Slide 1: Scalable Multiprocessors
- What is a scalable design? (7.1)
- Realizing programming models (7.2)
- Scalable communication architectures (SCAs)
  - Message-based SCAs (7.3-7.5)
  - Shared-memory based SCAs (7.6)

Slide 2: Scalability Goals (P is the number of processors)
- Bandwidth: scale linearly with P
- Latency: short and independent of P
- Cost: low fixed cost and scale linearly with P

Example: a bus-based multiprocessor
- Bandwidth: constant
- Latency: short and constant
- Cost: high for infrastructure and then linear

Slide 3: Organizational Issues
[Figure: dance-hall memory organization and distributed memory organization, each built around a scalable network; in the distributed organization the nodes attach to the network through a CA]
- Network composed of switches for performance and cost
- Many concurrent transactions allowed
- Distributed memory can bring down bandwidth demands
- Bandwidth scaling: no global arbitration and ordering; broadcast bandwidth is fixed and expensive

Slide 4: Scaling Issues
- Latency scaling: T(n) = Overhead + Channel Time + Routing Delay
  - Channel Time is a function of bandwidth
  - Routing Delay is a function of the number of hops in the network
- Cost scaling: Cost(p,m) = Fixed cost + Incremental Cost(p,m)
- A design is cost-effective if speedup(p,m) > costup(p,m)
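The two scaling models on slide 4 can be tried out numerically. The following is a minimal sketch, not part of the lecture. It assumes that channel time is message size divided by channel bandwidth, that routing delay is the hop count times a per-hop delay, that the incremental cost is linear in p (the memory parameter m is left out), and that costup(p) = cost(p) / cost(1). All function names and numbers are hypothetical.

```python
def latency(n_bytes, overhead, bandwidth, hops, per_hop_delay):
    """T(n) = Overhead + Channel Time + Routing Delay."""
    channel_time = n_bytes / bandwidth      # assumption: n / B
    routing_delay = hops * per_hop_delay    # assumption: linear in hop count
    return overhead + channel_time + routing_delay


def cost(p, fixed_cost, cost_per_node):
    """Cost(p) = Fixed cost + Incremental Cost(p), assuming a linear increment."""
    return fixed_cost + p * cost_per_node


def is_cost_effective(speedup, p, fixed_cost, cost_per_node):
    """Cost-effective if speedup(p) > costup(p), with costup(p) = cost(p) / cost(1)."""
    costup = cost(p, fixed_cost, cost_per_node) / cost(1, fixed_cost, cost_per_node)
    return speedup > costup


t = latency(n_bytes=1024, overhead=2.0, bandwidth=512.0, hops=4, per_hop_delay=0.25)
print(t)
print(is_cost_effective(speedup=20, p=32, fixed_cost=100.0, cost_per_node=10.0))
```

With a high fixed cost, costup grows much more slowly than p, so a speedup well below p can still be cost-effective.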

Slide 5: Physical Scaling
- Chip, board, and system-level partitioning has a big impact on scaling
- However, there is little consensus
[Figure: machine organization with diagnostics, control, and data networks connecting processing partitions, control processors, and an I/O partition; node detail shows SPARC, FPU, SRAM with controller, NI, bus, and DRAM controllers with vector units and DRAM]

Slide 6: Network Transaction Primitives
- Primitives to implement the programming model on a scalable machine
- One-way transfer between source and destination
- Resembles a bus transaction but is much richer in variety
- Examples:
  - A message send transaction
  - A write transaction in a SAS machine
[Figure: a serialized msg travels through the communication network from the output buffer of the source node to the input buffer of the destination node]

Slide 7: Bus vs. Network Transactions

Design issue                 Bus transactions            Network transactions
Protection                   V->P address translation    Done at multiple points
Format                       Fixed                       Flexible
Output buffering             Simple                      Support flexible in format
Media arbitration            Global                      Distributed
Destination name & routing   Direct                      Via several switches
Input buffering              One source                  Several sources
Action                       Response                    Rich diversity
Completion detection         Simple                      Response transaction
Transaction ordering         Global order                No global order

Slide 8: SAS Transactions
(1) Initiate memory access
(2) Address translation
(3) Local/remote check
(4) Request transaction
(5) Remote memory access
(6) Reply transaction
(7) Complete memory access
[Figure: time line. The source executes Load r <- [Global address] and sends a read request, then waits; the destination performs the memory access and returns a read response.]

Issues:
- Fixed or variable size transfers
- Deadlock avoidance and input buffer full
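The seven steps of the SAS read transaction can be sketched as a small simulation. This is an illustrative sketch only: the `Node` class, the block-wise mapping of global addresses to home nodes, and the trace strings are assumptions made for the example and do not come from the lecture.

```python
class Node:
    def __init__(self, node_id, words_per_node):
        self.node_id = node_id
        self.words_per_node = words_per_node
        self.memory = [0] * words_per_node

    def translate(self, global_addr):
        """(2) Address translation: global address -> (home node, local offset)."""
        return divmod(global_addr, self.words_per_node)

    def handle_read_request(self, offset):
        """(5) Remote memory access; the value is the payload of reply (6)."""
        return self.memory[offset]


def load(nodes, src_id, global_addr, trace):
    """(1) Initiate a memory access from node src_id."""
    src = nodes[src_id]
    home, offset = src.translate(global_addr)
    if home == src_id:                                # (3) local/remote check
        trace.append("local access")
        return src.memory[offset]
    trace.append(f"read request {src_id}->{home}")    # (4) request transaction
    value = nodes[home].handle_read_request(offset)   # source waits meanwhile
    trace.append(f"read response {home}->{src_id}")   # (6) reply transaction
    return value                                      # (7) complete memory access


nodes = [Node(i, words_per_node=4) for i in range(2)]
nodes[1].memory[2] = 42
trace = []
print(load(nodes, src_id=0, global_addr=6, trace=trace), trace)
```

The sketch is sequential, so the wait at the source is implicit in the function call; a real machine overlaps it with other work or stalls the processor.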

Slide 9: Sequential Consistency
  P1: A=1; flag=1;
  P2: while (flag==0); print A;
[Figure: (a) three memory modules on an interconnection network, holding A: 0 and flag: 0->1; the accesses are 1: A=1, 2: flag=1, 3: load A, and the write to A is delayed. (b) the same accesses, with write 1 on a congested path.]

Issues:
- Writes need acks to signal completion
- SC may cause extreme waiting times

Slide 10: Message Passing
- Multiple flavors of synchronization semantics
- Blocking versus non-blocking
  - Blocking send/recv returns when the operation completes
  - Non-blocking returns immediately (a probe function tests completion)
- Synchronous
  - Send completes after the matching receive has executed
  - Receive completes after the data transfer from the matching send completes
- Asynchronous (buffered, in MPI terminology)
  - Send completes as soon as the send buffer may be reused
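The reason writes need acks on slide 9 can be shown with a toy timing model. This is a hedged sketch, not the lecture's own example: the function name, the delay values, and the assumption that P2 loads A at the instant it sees flag == 1 are all made up for illustration.

```python
def value_seen_by_reader(wait_for_ack, delay_to_A=10, delay_to_flag=1):
    """Return the value of A that P2 prints after it sees flag == 1."""
    a_arrives = delay_to_A                   # P1 issues A=1 at time 0
    if wait_for_ack:
        # P1 stalls until the write has reached memory and the ack has come back.
        flag_issued = 2 * delay_to_A
    else:
        flag_issued = 1                      # next instruction, no waiting
    flag_arrives = flag_issued + delay_to_flag
    # P2 leaves its spin loop when flag=1 reaches memory, then loads A.
    return 1 if a_arrives <= flag_arrives else 0


print(value_seen_by_reader(wait_for_ack=False))  # write to A still in flight
print(value_seen_by_reader(wait_for_ack=True))   # write to A completed first
```

The same model also shows the cost named on the slide: with acks, the write to flag is held back for a full round trip over the congested path.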

Slide 11: Synchronous Protocol
(1) Initiate send
(2) Address translation on src
(3) Local/remote check
(4) Send-ready request
(5) Remote check for posted receive (assume success)
(6) Reply transaction
(7) Bulk data transfer (source VA to dest VA or ID)
[Figure: time line. The source executes Send dest, local VA, len, issues a Send-rdy req, and waits; the destination has posted Recv src, local VA, len, performs the tag check, and returns a Recv-rdy reply; the source then issues the Data-xfer req.]

Alternative: keep the match table at the sender, enabling a two-phase receive-initiated protocol.

Slide 12: Asynchronous Optimistic Protocol
(1) Initiate send
(2) Address translation
(3) Local/remote check
(4) Send data
(5) Remote check for posted receive; on fail, allocate data buffer
[Figure: time line. The source executes Send (dest, local VA, len) and issues a Data-xfer req; the destination performs the tag match and allocates a buffer; Recv src, local VA, len is executed later.]

Issues:
- Copying overhead at the receiver from the temp buffer to user space
- Huge buffer space at the receiver to cope with the worst case
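Step (5) of the asynchronous optimistic protocol, and the copying overhead it can cause, can be sketched as follows. This is a minimal illustration under assumptions: the `Receiver` class, the use of a tag as the match key, and the `copies` counter are hypothetical and not taken from the lecture.

```python
class Receiver:
    def __init__(self):
        self.posted = {}      # tag -> user buffer (a list) posted by recv
        self.temp = {}        # tag -> temporary buffer for unexpected data
        self.copies = 0       # extra temp-buffer-to-user copies performed

    def recv(self, tag, user_buf):
        """Post a receive; if the data already arrived, copy it out of the temp buffer."""
        if tag in self.temp:
            user_buf.extend(self.temp.pop(tag))
            self.copies += 1
        else:
            self.posted[tag] = user_buf

    def on_data_xfer(self, tag, data):
        """Step (5): tag match against posted receives; on fail, allocate a buffer."""
        if tag in self.posted:
            self.posted.pop(tag).extend(data)   # delivered straight to user space
        else:
            self.temp[tag] = list(data)         # unexpected message is buffered


r = Receiver()
early, late = [], []
r.recv("a", early)             # receive posted before the data arrives
r.on_data_xfer("a", [1, 2])
r.on_data_xfer("b", [3])       # data arrives before the receive is posted
r.recv("b", late)
print(early, late, r.copies)
```

Only the second message pays for the extra copy, and the temp dictionary is unbounded here, which is exactly the worst-case buffer space problem the slide lists.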

Slide 13: Asynchronous Robust Protocol
(1) Initiate send
(2) Address translation on dest
(3) Local/remote check
(4) Send-ready request
(5) Remote check for posted receive (assume fail); record send-ready
(6) Receive-ready request
(7) Bulk data reply (source VA to dest VA or ID)
[Figure: time line. The source executes Send dest, local VA, len, issues a Send-rdy req, then returns and computes; the destination performs the tag check; when Recv src, local VA, len is executed, the destination issues a Recv-rdy req and the source answers with a Data-xfer reply.]

Note: after the handshake, the send and recv buffer addresses are known, so the data transfer can be performed with little overhead.

Slide 14: Active Messages
[Figure: a request invokes a handler at the destination; the reply invokes a handler at the source]
- User-level analog of network transactions: transfer a data packet and invoke a handler to extract it from the network and integrate it with the ongoing computation
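The request/reply handler pattern of active messages can be sketched in a few lines. This is an assumption-laden toy, not an implementation of any real active-message layer: the network is a single in-process queue, and the `Node` class, handler names, and message tuple layout are invented for the example.

```python
from collections import deque


class Node:
    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network    # shared queue standing in for the network
        self.counter = 0
        self.replies = []

    def send(self, dest, handler_name, payload):
        """Inject a message that names the handler to run at the destination."""
        self.network.append((dest, handler_name, self.node_id, payload))

    def request_handler(self, src, payload):
        """Runs at the destination: integrate the data, then send a reply."""
        self.counter += payload
        self.send(src, "reply_handler", self.counter)

    def reply_handler(self, src, payload):
        """Runs back at the requester: extract the reply from the network."""
        self.replies.append(payload)


def run(nodes, network):
    """Deliver messages in order, invoking the named handler on arrival."""
    while network:
        dest, handler_name, src, payload = network.popleft()
        getattr(nodes[dest], handler_name)(src, payload)


network = deque()
nodes = [Node(0, network), Node(1, network)]
nodes[0].send(1, "request_handler", 5)
run(nodes, network)
print(nodes[1].counter, nodes[0].replies)
```

The handlers do no buffering of their own: each one pulls the payload out of the message and folds it into node state immediately.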

Slide 15: Challenges Common to SAS and MP
- Input buffer overflow: how to signal that buffer space is exhausted
  - Solutions:
    - ACK at protocol level
    - Back-pressure flow control
    - Special ACK path or drop packets (requires time-out)
- Fetch deadlock (revisited): a request often generates a response that can form dependence cycles in the network
  - Solutions:
    - Two logically independent request/response networks
    - NACK requests at the receiver to free space

Slide 16: Spectrum of Designs
- None, physical bit stream
  - Blind, physical DMA: nCUBE, iPSC, ...
- User/System
  - User-level port: CM-5, *T
  - User-level handler: J-Machine, Monsoon, ...
- Remote virtual address
  - Processing, translation: Paragon, Meiko CS-2
- Global physical address
  - Proc + memory controller: RP3, BBN, T3D
- Cache-to-cache
  - Cache controller: Dash, KSR, Flash
Down the list: increasing HW support, specialization, intrusiveness, performance (???)
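The protocol-level ACK/NACK solution to input buffer overflow can be sketched as a bounded queue with sender retries. This is a simplified illustration under assumptions: the `InputBuffer` class, the retry policy (the receiver drains one message after each NACK), and the message names are hypothetical, and the sketch does not model fetch deadlock.

```python
from collections import deque


class InputBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, msg):
        """Accept msg and ACK, or NACK when buffer space is exhausted."""
        if len(self.queue) >= self.capacity:
            return "NACK"
        self.queue.append(msg)
        return "ACK"

    def drain_one(self):
        """Receiver consumes the oldest buffered message, if any."""
        return self.queue.popleft() if self.queue else None


def send_all(messages, buf):
    """Send every message, retrying NACKed ones after the receiver drains one."""
    delivered, nacks = [], 0
    pending = deque(messages)
    while pending:
        if buf.offer(pending[0]) == "ACK":
            pending.popleft()
        else:
            nacks += 1
            delivered.append(buf.drain_one())  # receiver makes progress
    while buf.queue:
        delivered.append(buf.drain_one())
    return delivered, nacks


print(send_all(["m0", "m1", "m2", "m3"], InputBuffer(capacity=2)))
```

Nothing is dropped and order is preserved, at the price of one extra round trip per NACK.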

Slide 17: Architectures
[Figure: node architecture. Each node attaches to the scalable network through a CA (communication assist) and exchanges messages with other nodes.]
- Output processing: checks, translation, formatting, scheduling
- Input processing: checks, translation, buffering, action
- Design tradeoff: how much processing in the CA vs. the processor (P), and how much interpretation of the network transaction
  - Physical DMA (7.3)
  - User-level access (7.4)
  - Dedicated message processing (7.5)

Slide 18: Physical DMA
[Figure: two nodes, each with processor and memory; DMA channels with Addr, Length, and Rdy registers; Cmd, Status, and interrupt lines; messages carry Data and Dest]
- Example: nCUBE/2, IBM SP1
- Node processor packages messages in user/system mode
- DMA used to copy between network and system buffers
- Problem: no way to distinguish between user and system messages, which results in much overhead because the node processor must be involved

Slide 19: User-Level Access
[Figure: two nodes, each with processor and memory; messages carry Data, a User/system flag, and Dest; Status and interrupt lines]
- Example: CM-5
- Network interface mapped into user space
- Communication assist does protection checks, translation, etc.
- No intervention by the kernel except for interrupts

Slide 20: Dedicated Message Processing
[Figure: two nodes on a network, each with Mem, NI, a processor (P) at user level, and a message processor (MP) at system level; messages carry dest]
- The MP:
  - Interprets msg
  - Supports msg operations
  - Off-loads P with a clean msg abstraction
- Issues: P/MP communicate via shared memory; coherence traffic can be a bottleneck due to all the concurrent actions

Slide 21: Shared Physical Address Space
[Figure: nodes on a scalable network, each with a pseudo memory and a pseudo processor]
- Remote read/write performed by pseudo processors
- Cache coherence issues treated in Ch. 8
