EE382 Processor Design. Illinois

EE382 Processor Design, Winter 1998
Chapter 8 Lectures: Multiprocessors, Part II
EE 382 Processor Design, Winter 98/99, Michael Flynn

[Figure slide: Illinois]

[Figure slide: Write-invalidate]

Synchronization/coherency
- Synchronization: a means to ensure that multiple processors have the same (coherent) view of critical values in memory
- Coherency: the property that ensures that the value returned by a read is the value of the latest write
- Consistency: the degree to which (or the part of memory over which) coherency is maintained

Consistency of memory ops
- Sequential consistency (strong ordering)
  - all memory ops execute in some sequential order
  - memory ops of each processor appear in program order
- Processor consistency (buffered writes)
  - LD sequences appear in program order, as do ST sequences, but a LD may pass (complete before) an earlier ST
  - different processors may see different op orders
  - requires explicit synchronization

Weak consistency
- Other forms are possible, e.g. weak ordering
  - all pending memory ops are completed before a synchronization op (forced completion is called a fence op)
  - synch ops are completed before any other memory ops
  - synch ops are sequentially consistent

Outline
- Partitioning
- Granularity
- Overhead and efficiency
- Multi-threaded MP
- Shared Bus
- Coherency
- Synchronization
- Consistency
- Scalable MP
- Cache directories
- Interconnection networks
- Trends and tradeoffs
- Additional References
  - Hennessy and Patterson, CAQA, Chapter 8
  - Culler, Singh, Gupta, Parallel Computer Architecture: A Hardware/Software Approach, /~culler/book.alpha/index.html

Scalable MP
- Bandwidth of a single bus limits scalability
- Can use two (or more) buses for even/odd cache lines
  - Extends system size incrementally at substantial cost
- Use low-degree MP on a shared bus as a cluster with scalable interconnect

Coherency for Scalable MP
- Maintain a single, coherent memory address space
- There is no longer a shared bus accessed by all processors for synchronization and communication through memory
- Use a directory to track processors using memory lines
  - central directory: with memory module
  - distributed directory: with individual caches
- Shared lines can be invalidated or updated on write
- 4 possible protocols: CD-INV, CD-UP, DD-INV, DD-UP
- CD-INV and DD-INV (Scalable Coherent Interface) are most common

[Figure slide: Central Directory]

Central Directory
- Typically uses a bit vector stored with each line in memory
  - Each bit indicates whether the corresponding cluster has cached a copy of the line
  - Various optimizations to reduce storage overhead are possible
- Used in Stanford DASH/FLASH, MIT Alewife, SGI Origin
- When a processor needs to write a line it does not own
  - It requests the line from memory
  - CD sends invalidates to all caches that hold the line
  - All relevant caches invalidate the line and acknowledge
  - Requesting processor is allowed to take ownership and modify the line

Distributed Directory (Part I)
- Linked list used to keep track of caches holding a line
  - Singly- or doubly-linked (SCI) lists used
  - Pointer to head of list is stored with line in memory
- Used in IEEE-SCI and Sequent NUMA-Q
- When a processor (P) needs to write a line it does not own
  - If P holds a shared copy in its cache, P removes itself from the linked list of caches for the line
  - P notifies the memory of its intention to write the line and becomes the head of the list
  - P sends an invalidation signal to the next cache on the list
  - The next cache invalidates the line and returns an acknowledge to P along with a pointer to the next cache on the list
  - When all the caches have been invalidated, P can take ownership and write the line
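The central-directory write sequence above (request, invalidate sharers, collect acknowledges, take ownership) can be sketched as a toy model. This is only an illustration under stated assumptions: the class name `CentralDirectory`, its methods, and the use of a Python integer as the presence bit vector are hypothetical and are not taken from DASH, FLASH, Alewife, or Origin.

```python
class CentralDirectory:
    """Toy central directory: one presence bit vector per memory line.

    Hypothetical model for illustration; real directories also track
    line state, pending requests, and message ordering.
    """

    def __init__(self, num_clusters):
        self.num_clusters = num_clusters
        self.presence = {}  # line address -> int used as a bit vector
        self.owner = {}     # line address -> cluster holding write ownership

    def read(self, cluster, line):
        # Set the requester's presence bit; in this toy model a read by
        # any cluster also drops exclusive ownership of the line.
        self.presence[line] = self.presence.get(line, 0) | (1 << cluster)
        self.owner.pop(line, None)

    def write(self, cluster, line):
        # Find every other cluster whose presence bit is set.
        bits = self.presence.get(line, 0)
        sharers = [c for c in range(self.num_clusters)
                   if (bits >> c) & 1 and c != cluster]
        # Sending invalidates and collecting acknowledges is modeled by
        # clearing all presence bits except the requester's.
        self.presence[line] = 1 << cluster
        self.owner[line] = cluster
        return sharers  # clusters that were invalidated


directory = CentralDirectory(num_clusters=4)
for c in (0, 2, 3):
    directory.read(c, 0x40)
print(directory.write(2, 0x40))  # clusters invalidated before cluster 2 owns the line
```

The bit vector costs one bit per cluster per line, which is the storage overhead the slide's "various optimizations" aim to reduce.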

[Figure slide: Distributed Directory (Part II)]

Distributed Directory (Part II)
- Performance issues
  - Linked lists generally short for shared data being modified
  - When data is shared, important to minimize synchronization and communication overhead
- Queue on Lock Bit (QOLB)
  - Hardware maintains queue of caches waiting on lock
  - Software spins on shadow copy of line in local cache
  - Lock and data stored in same cache line
  - Single line transfer required for each processor to synchronize/communicate
  - An Analysis of Synchronization Mechanisms in Shared-Memory Multiprocessors, Woest and Goodman
- Efficient algorithms can be quite complex
  - FLASH uses programmable protocol processor

Interconnect Networks
- Each network node consists of processor, cache, and part of global memory
  - May also include a switch (direct). For indirect networks, switches are removed from the nodes
- Networks may be static (fixed links between nodes) or dynamic (switches configure path)
- Only direct-static and indirect-dynamic are commonly used

[Figure slide: Interconnect Networks. Panels: Direct, Indirect]

Static, Direct Networks
- Includes ring, linear array, star, mesh, ...
- We consider only hypertorus (k,n) topologies
  - n dimensions, k elements per dimension
  - k-ary n-cubes with end-around connection
- Terms
  - distance: smallest number of links/hops between 2 nodes
  - diameter: largest distance between 2 nodes
  - number of nodes: $N = k^n$ for a (k,n) network

[Figure slide: Static, Direct Networks. Panels: Linear Array, Grid (2D Mesh), Ring, 2D torus]
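The terms distance, diameter, and number of nodes can be made concrete for a (k,n) hypertorus with a short sketch. It assumes bidirectional links and end-around connections in every dimension; the function names and the coordinate-tuple representation of a node are hypothetical choices for illustration.

```python
def torus_distance(a, b, k):
    """Smallest number of hops between nodes a and b in a k-ary n-cube.

    a and b are coordinate tuples of length n. In each dimension the
    message may go either way around the ring, so take the shorter way.
    """
    return sum(min((x - y) % k, (y - x) % k) for x, y in zip(a, b))


def torus_diameter(k, n):
    """Largest distance between any two nodes: k // 2 hops per dimension."""
    return n * (k // 2)


def num_nodes(k, n):
    """N = k^n nodes in a (k,n) network."""
    return k ** n


print(torus_distance((0, 0), (3, 1), k=4))  # wraparound makes 0 -> 3 a single hop
print(torus_diameter(4, 2), num_nodes(4, 2))
```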

Links (Channels) and Nodes
- Link characteristics
  - cycle time: $T_{ch} = 1/\mathrm{BW}$ of a link wire
  - width of link: $w$ = number of wires in the link
  - directionality: unidirectional or bidirectional links
- Node buffering (static networks)
  - Store and Forward
  - Wormhole (cut-through) routing

[Figure slide: Links (Channels) and Nodes. Panels: Store and Forward, Wormhole]

Communication Latency for Static Network
- Assume a (k,n) network with dimensional closure and bidirectional links. If a message has $H$ header bits and $\ell$ payload bits, the number of channel cycles to transmit the message over one link is $(\ell + H)/w$.
- If the distance between source and destination nodes is $d$ links and $h = H/w$, then
  $T_{store\text{-}and\text{-}forward} = T_{ch}\,[\,d\,(\ell + H)/w\,] = T_{ch}\,[\,d\,(\ell/w) + d\,h\,]$
- For wormhole routing, once a message header is received at a node the message proceeds to an output channel and is transmitted, so
  $T_{wormhole} = T_{ch}\,[\,d\,h + (\ell/w)\,]$
- Note: both formulas above refer to communication latency in the absence of contention (i.e., no queuing delay).

Dynamic, Indirect Networks
- Switches are separate from the nodes and centralized as a MIN (Multistage Interconnection Network)
- A switch is a k x k crossbar with no storage
- An N-node (1 channel/node) network has $(N/k)\,w$ switches per stage
- Minimum number of stages to connect N to N is $\lceil \log_k N \rceil$
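The two contention-free latency formulas for the static network can be compared numerically with a small sketch. Results are in channel cycles (multiply by the channel cycle time to get seconds); the function and parameter names are hypothetical, and the sample values are made up for illustration.

```python
def store_and_forward_cycles(d, payload_bits, header_bits, w):
    """Channel cycles for store-and-forward: the whole message
    (payload + header) is retransmitted on each of the d links."""
    return d * (payload_bits + header_bits) / w


def wormhole_cycles(d, payload_bits, header_bits, w):
    """Channel cycles for wormhole routing: only the header pays the
    per-hop cost; the payload streams behind it once."""
    h = header_bits / w
    return d * h + payload_bits / w


# Illustrative values only: 4 hops, 200-bit payload, 8-bit header, 8 wires.
print(store_and_forward_cycles(4, 200, 8, 8))
print(wormhole_cycles(4, 200, 8, 8))
```

For d = 1 the two expressions coincide; the gap grows linearly with distance because store-and-forward multiplies the payload term by d.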

[Figure slide: Dynamic, Indirect Networks. Panels: Multi-Stage Network, Crossbar Switch]

Baseline Dynamic Network
- Destination node address sets switch routing for each stage
- In the simpler baseline network, messages can block
  - No storage in the switch
- Cost of a baseline network is $w \times (N/k) \times \lceil \log_k N \rceil$, in k x k switches
- Assume each switch has a delay of one channel cycle, $T_{ch}$
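The baseline network cost expression can be evaluated with a short sketch. The function names are hypothetical; the stage count is computed with integer arithmetic rather than a floating-point logarithm to avoid rounding problems at exact powers of k, and N is assumed to be a multiple of k.

```python
def min_stages(N, k):
    """Smallest number of stages s with k**s >= N, i.e. ceil(log_k N)."""
    stages, reach = 0, 1
    while reach < N:
        reach *= k
        stages += 1
    return stages


def baseline_switch_count(N, k, w):
    """Cost in k x k switches: w * (N/k) switches per stage, times the
    number of stages. Assumes N is a multiple of k."""
    return w * (N // k) * min_stages(N, k)


print(min_stages(1024, 2), baseline_switch_count(1024, 2, 1))
```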

[Figure slide: Baseline Dynamic Network]

Other Dynamic Networks
- Other MIN configurations include additional stages and switches for less blocking (redundant paths) but more cost
- Dynamic networks generally have Uniform Memory Access (UMA)
  - Equal time to access any part of memory
- Can optimize for memory local to processor
- Static networks are generally NUMA

[Figure slide: Other Dynamic Networks]

Network Tradeoffs
Direct Networks
  + Enables placement for communication affinity (NUMA)
  + Low incremental costs for small systems and expansion
  - Requires closely-coupled processor/switch design
  - High-dimensional networks have inefficient mapping to physical wiring

Network Tradeoffs
Indirect Networks
  + Can be built from standard processors and switches
  - Large fixed cost in switches, even for small systems
Trend is toward direct networks with low dimensionality

Dynamic Network Analysis
- Time to transmit message without contention ($T_c$), with $\ell$ the message payload in bits
  - $n$ is the number of stages
  - $T_c = n + (\ell/w) + 1$ (for $h = 1$)
  - usually $n + (\ell/w) \gg 1$, so $T_c = n + (\ell/w)$ network cycles
- Model contention with $M_B/D/1$
  - $p = \delta/k$ (going to k inputs)
  - $\delta = \rho$ (probability that processor is sending a message)
  - $\rho = m \times (\ell/w)$ (service time $= \ell/w$)
  - $m$ = prob(a particular node makes a request in a cycle)

Dynamic Network Analysis
- Queueing delay: $T_{dynamic} = T_c + T_w$, with $\ell$ the message payload in bits
  - $T_w = \rho\,(\ell/w)\,(1 - 1/k)\,/\,(2(1 - \rho))$
  - $T_c = n + \ell/w$
- All expressed in network cycles ($T_{ch}$)

Static Network Analysis
- For a static (k,n) network, let $k_d$ be the average number of network hops for a message to transit a single dimension
  - for a bidirectional network with closure, $k_d = k/4$ (k even)
- Time to transmit message without contention ($T_c$)
  - $T_c = n \times k_d + (\ell/w)$ in network cycles (for $h = 1$)
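The dynamic-network delay model can be evaluated with a short sketch. It uses the approximation $T_c = n + \ell/w$ and the occupancy $\rho = m\,(\ell/w)$ from the dynamic-network slides; the function and parameter names are hypothetical, and the sample request rate is made up for illustration.

```python
def dynamic_network_delay(n_stages, k, payload_bits, w, m):
    """Return (T_c, T_w, T_dynamic) in network cycles for a MIN built
    from k x k switches, following the slide formulas."""
    service = payload_bits / w          # service time, payload/w cycles
    rho = m * service                   # channel occupancy
    if not 0 <= rho < 1:
        raise ValueError("occupancy rho must be in [0, 1)")
    t_c = n_stages + service            # contention-free transit time
    t_w = rho * service * (1 - 1 / k) / (2 * (1 - rho))  # queueing delay
    return t_c, t_w, t_c + t_w


# Illustrative values only: 10 stages of 2x2 switches, 200-bit payload,
# 8-wire channels, request probability 0.02 per node per cycle.
print(dynamic_network_delay(10, 2, 200, 8, 0.02))
```

The waiting time grows without bound as the occupancy approaches 1, so the model only applies below saturation.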

Static Network Analysis
- Model contention with M/G/1 for k large (k > 8) and M/D/1 for k smaller, with $\ell$ the message payload in bits
  - $\lambda = m\,n\,k_d$ ($n k_d$ is the average number of hops for a message)
  - $\mu = 2nw/\ell$ (each node has 2n channels)
  - $\rho = m\,k_d\,(\ell/2w)$
- For M/G/1: $T_w = \frac{\rho}{1-\rho}\,(\ell/w)\,\frac{k_d - 1}{k_d^2}\,(1 + 1/n)$
- For M/D/1: $T_w = \frac{\rho}{2(1-\rho)}\,(\ell/w)$

Static vs. Dynamic Network Example
- N = 1024 processing elements
- $\ell$ = 200 bits
- Pins per switch = 64 (fan-in + fan-out)
[Figure: plots with panels "With locality" and "Without locality"]
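The static-network model can be evaluated the same way. This sketch assumes a bidirectional (k,n) network with closure, so $k_d = k/4$, and selects M/G/1 for k > 8 and M/D/1 otherwise, as the slide suggests. The function name, parameter names, and the sample request rate are hypothetical.

```python
def static_network_delay(k, n, payload_bits, w, m):
    """Return (T_c, T_w, T_c + T_w) in network cycles for a (k,n)
    static network, following the slide formulas."""
    k_d = k / 4                              # average hops per dimension
    service = payload_bits / w               # payload/w cycles
    rho = m * k_d * service / 2              # rho = lambda / mu
    if not 0 <= rho < 1:
        raise ValueError("occupancy rho must be in [0, 1)")
    t_c = n * k_d + service                  # contention-free transit time
    if k > 8:                                # M/G/1 model
        t_w = (rho / (1 - rho)) * service * ((k_d - 1) / k_d**2) * (1 + 1 / n)
    else:                                    # M/D/1 model
        t_w = (rho / (2 * (1 - rho))) * service
    return t_c, t_w, t_c + t_w


# Illustrative values only: 32-ary 2-cube (N = 1024), 200-bit payload,
# 8-wire channels, request probability 0.005 per node per cycle.
print(static_network_delay(32, 2, 200, 8, 0.005))
```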

Bisection Width
- Bisection width is the minimum number of wires cut when a network is divided into two equal halves
- If links (rather than nodes) dominate cost, then network comparisons should be based on equivalent bisection width, B
  - For static (k,n): $B(k,n) = 2wN/k$
  - For dynamic with k x k = 2 x 2: $B = wn$
- So higher-dimensional static networks have shorter virtual latency (number of hops) than lower-dimensional networks, but the planar (or even 3D) realization of physical wiring reduces performance
  - w is reduced for the same number of interconnect layers
  - wires are longer/slower

Hotspots and Combining
- Network traffic (especially synchronization) may be directed to a single location in memory, creating a hotspot
- Hotspots can be mitigated by adding logic to the switch
  - Fetch and Add instructions directed to a hotspot can be combined, and later the fetched result updated and split in the switch
- t = fraction of references going to hotspot
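The static bisection-width formula $B(k,n) = 2wN/k$ can be used to compare (k,n) networks at equal wiring cost, as the Bisection Width slide recommends. The sketch below computes B for a given channel width and, going the other way, the channel width a topology can afford for a fixed B. Function names and sample values are hypothetical.

```python
def static_bisection_width(k, n, w):
    """B(k,n) = 2*w*N/k wires, with N = k**n nodes."""
    return 2 * w * k**n // k


def channel_width_for_bisection(B, k, n):
    """Channel width w that a (k,n) network gets from a wire budget B."""
    return B * k / (2 * k**n)


# Illustrative values only: two 1024-node networks with the same B.
B = static_bisection_width(32, 2, 8)          # 32-ary 2-cube, w = 8
print(B, channel_width_for_bisection(B, 4, 5))  # 4-ary 5-cube gets narrower links
```

With the wire budget held fixed, the higher-dimensional network ends up with narrower channels, which is the "w is reduced" effect listed on the slide.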

Multiprocessing Summary
- Multi-Threaded
  - Area of research and potential future practical application
  - Driven by diminishing returns of single-threaded performance/cost and emerging programming environments
- Shared-Bus
  - Established mainstream technology for all but the most cost-sensitive applications
  - Building block for scalable MP
- Scalable
  - Technology in stages of advanced research and early adoption
  - Static, direct networks with low dimensionality are winning
- Massively Parallel
  - Remains the holy grail
