8 PARALLEL COMPUTER ARCHITECTURES
Figure 8-1. (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.
Figure 8-2. (a) A multicomputer with 16 CPUs, each with its own private memory, joined by a message-passing interconnection network. (b) The bit-map image of Fig. 8-1 split up among the 16 memories.
Figure 8-3. Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.
Figure 8-4. Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.
Figure 8-5. An interconnection network in the form of a four-switch square grid. Only two of the CPUs are shown.
Figure 8-6. Store-and-forward packet switching.
Figure 8-7. Deadlock in a circuit-switched interconnection network.
Figure 8-8. Real programs (here an N-body problem, Awari, and skyline matrix inversion) achieve less than the perfect linear speedup indicated by the dotted line.
Figure 8-9. (a) A program has an inherently sequential part and a potentially parallelizable part. (b) Effect of running part of the program in parallel, with 1 CPU active for the fraction f and n CPUs active for the fraction 1 - f, giving a total time of ft + (1 - f)t/n.
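The run time in Fig. 8-9(b) yields Amdahl's law. A minimal sketch (the function name is ours), assuming f is the inherently sequential fraction of the run time and n is the number of CPUs:

```python
def speedup(f, n):
    """Amdahl's law.  Sequential time t becomes f*t + (1 - f)*t/n on
    n CPUs, so the speedup is t divided by that: n / (1 + (n - 1)*f)."""
    return n / (1 + (n - 1) * f)
```

With a perfectly parallel program (f = 0) the speedup is n; with even a 10 percent sequential part, 16 CPUs deliver only a speedup of 6.4.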
Figure 8-10. (a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.
Figure 8-11. Computational paradigms. (a) Pipeline. (b) Phased computation. (c) Divide and conquer. (d) Replicated worker with a work queue.
Physical (hardware)   Logical (software)   Examples
Multiprocessor        Shared variables     Image processing as in Fig. 8-1
Multiprocessor        Message passing      Message passing simulated with buffers in memory
Multicomputer         Shared variables     DSM, Linda, Orca, etc. on an SP/2 or a PC network
Multicomputer         Message passing      PVM or MPI on an SP/2 or a network of PCs

Figure 8-12. Combinations of physical and logical sharing.
Instruction streams   Data streams   Name   Examples
1                     1              SISD   Classical Von Neumann machine
1                     Multiple       SIMD   Vector supercomputer, array processor
Multiple              1              MISD   Arguably none
Multiple              Multiple       MIMD   Multiprocessor, multicomputer

Figure 8-13. Flynn's taxonomy of parallel computers.
Figure 8-14. A taxonomy of parallel computers. SIMD machines include vector processors and array processors; MIMD machines divide into shared-memory multiprocessors (UMA: bus or switched; COMA; NUMA: CC-NUMA or NC-NUMA) and message-passing multicomputers (MPP: grid or hypercube; COW).
Figure 8-15. A vector ALU.
Operation             Examples
Ai = f1(Bi)           f1 = cosine, square root
Scalar = f2(A)        f2 = sum, minimum
Ai = f3(Bi, Ci)       f3 = add, subtract
Ai = f4(scalar, Bi)   f4 = multiply Bi by a constant

Figure 8-16. Various combinations of vector and scalar operations.
Step   Name                  Values
1      Fetch operands        1.082 × 10^12   9.212 × 10^11
2      Adjust exponent       1.082 × 10^12   0.9212 × 10^12
3      Execute subtraction   0.1608 × 10^12
4      Normalize result      1.608 × 10^11

Figure 8-17. Steps in a floating-point subtraction.
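The arithmetic of Fig. 8-17 can be replayed directly (values are the figure's own):

```python
# Floating-point subtraction of Fig. 8-17: 1.082 x 10^12 - 9.212 x 10^11.
a = 1.082e12
b = 9.212e11        # step 2 rewrites this as 0.9212 x 10^12
diff = a - b        # step 3: 1.0820 - 0.9212 = 0.1608 (x 10^12)
# step 4 normalizes 0.1608 x 10^12 to 1.608 x 10^11
```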
                     Cycle
Step                 1      2      3      4      5      6      7
Fetch operands       B1,C1  B2,C2  B3,C3  B4,C4  B5,C5  B6,C6  B7,C7
Adjust exponent             B1,C1  B2,C2  B3,C3  B4,C4  B5,C5  B6,C6
Execute operation                  B1+C1  B2+C2  B3+C3  B4+C4  B5+C5
Normalize result                          B1+C1  B2+C2  B3+C3  B4+C4

Figure 8-18. A pipelined floating-point adder.
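The schedule in Fig. 8-18 can be generated mechanically. In this sketch (names are ours), one operand pair enters the 4-stage adder per cycle, so element i (1-based) occupies stage s (0-based) during cycle i + s, and n additions finish after n + 3 cycles instead of 4n:

```python
STAGES = ("Fetch operands", "Adjust exponent",
          "Execute operation", "Normalize result")

def schedule(n):
    """Map each stage name to {cycle: element handled in that cycle}."""
    return {name: {i + s: i for i in range(1, n + 1)}
            for s, name in enumerate(STAGES)}
```

For n = 7 this reproduces the table: cycle 7 finds the Execute stage working on element 5, and the last result normalizes at cycle 10.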
Figure 8-19. Registers and functional units of the Cray-1: 8 24-bit address (A) registers with 64 24-bit holding (B) registers, 8 64-bit scalar (S) registers with 64 64-bit holding (T) registers, and 8 64-bit vector registers of 64 elements each. The functional units comprise address units (add, multiply), scalar integer units (add, shift, Boolean, population count), scalar/vector floating-point units (add, multiply, reciprocal), and vector integer units (add, shift, Boolean).
Figure 8-20. (a) Two CPUs writing and two CPUs reading a common memory word. (b)-(d) Three possible ways the two writes and four reads might be interleaved in time.
Figure 8-21. Weakly consistent memory uses synchronization operations to divide time into sequential epochs.
Figure 8-22. Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.
Action       Local request               Remote request
Read miss    Fetch data from memory
Read hit     Use data from local cache
Write miss   Update data in memory
Write hit    Update cache and memory     Invalidate cache entry

Figure 8-23. The write-through cache coherence protocol. The empty boxes indicate that no action is taken.
Figure 8-24. The MESI cache coherence protocol. (a) CPU 1 reads block A (Exclusive). (b) CPU 2 reads block A (both copies Shared). (c) CPU 2 writes block A (Modified). (d) CPU 3 reads block A (copies Shared again). (e) CPU 2 writes block A (Modified). (f) CPU 1 writes block A (Modified).
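The six panels of Fig. 8-24 can be replayed with a small simulator. This is our own simplified sketch of MESI for a single block shared by snooping caches (no real bus is modeled; function names are ours):

```python
# States: 'M'odified, 'E'xclusive, 'S'hared, 'I'nvalid.
def read(caches, cpu):
    if caches[cpu] != 'I':                    # read hit: no bus traffic
        return
    others = [c for c in caches if c != cpu]
    if any(caches[c] != 'I' for c in others):
        for c in others:                      # an M holder writes back;
            if caches[c] != 'I':              # all valid copies drop to S
                caches[c] = 'S'
        caches[cpu] = 'S'
    else:
        caches[cpu] = 'E'                     # only cached copy: Exclusive

def write(caches, cpu):
    if caches[cpu] in ('E', 'M'):
        caches[cpu] = 'M'                     # silent upgrade, no bus traffic
        return
    for c in caches:                          # from S or I: invalidate others
        if c != cpu:
            caches[c] = 'I'
    caches[cpu] = 'M'
```

Running the figure's six events in order (CPU 1 read, CPU 2 read, CPU 2 write, CPU 3 read, CPU 2 write, CPU 1 write) reproduces the states shown in panels (a) through (f).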
Figure 8-25. (a) An 8 × 8 crossbar switch. (b) An open crosspoint. (c) A closed crosspoint.
Figure 8-26. The Sun Enterprise 10000 symmetric multiprocessor: a 16 × 16 crossbar switch (Gigaplane-XB) connects boards holding four UltraSPARC CPUs and 4 GB of memory each, with four address buses for snooping; the transfer unit is a 64-byte cache block.
Figure 8-27. (a) A 2 × 2 switch. (b) A message format (Module, Address, Opcode, Value).
Figure 8-28. An omega switching network with three stages of 2 × 2 switches connecting eight CPUs to eight memories.
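A common convention for an omega network, assumed in this sketch of ours, is destination-tag routing: at stage i each 2 × 2 switch examines bit i of the destination module number (most significant bit first) and routes the message to its upper output on a 0 bit, lower output on a 1 bit:

```python
def route(dest, stages=3):
    """Return the output ('upper'/'lower') taken at each of the
    log2(n) stages to reach memory module dest (0 .. 2**stages - 1)."""
    bits = format(dest, f"0{stages}b")   # destination as a bit string
    return ["upper" if b == "0" else "lower" for b in bits]
```

For memory 6 (binary 110) the message goes to the lower output at the first two stages and the upper output at the last.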
Figure 8-29. A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design.
Figure 8-30. (a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into an 8-bit node field, an 18-bit block field, and a 6-bit offset. (c) The directory at node 36.
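The address split of Fig. 8-30(b) amounts to masking and shifting. A sketch (field names taken from the figure, function name ours):

```python
def split_address(addr):
    """Split a 32-bit address into (node, block, offset) fields of
    8, 18, and 6 bits respectively, as in Fig. 8-30(b)."""
    offset = addr & 0x3F             # low 6 bits
    block = (addr >> 6) & 0x3FFFF    # next 18 bits
    node = (addr >> 24) & 0xFF       # high 8 bits
    return node, block, offset
```

With 2^18 blocks of 64 bytes per node, each node homes 16 MB of the shared address space.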
Figure 8-31. (a) The DASH architecture: 16 clusters of CPUs with caches on snooping local buses, joined by a nonsnooping intercluster bus. (b) A DASH directory. Each cluster's directory records, for every memory block homed there, one presence bit per cluster and a state (uncached, shared, or modified).
Figure 8-32. The NUMA-Q multiprocessor: quad boards, each with four Pentium Pro CPUs and up to 4 GB of RAM plus a snooping bus interface, joined by an SCI ring through an IQ board holding the directory controller, 32 MB of cache RAM, and the data pump.
Figure 8-33. SCI chains all the holders of a given cache line together in a doubly-linked list, using back and forward pointers plus state and tag fields in each cache directory entry, rooted in a local memory table at the home node. In this example, a line is shown cached at three nodes (4, 9, and 22).
Figure 8-34. A generic multicomputer. Each node contains a CPU, memory, disk and I/O, and a communication processor on a local interconnect; the communication processors attach to a high-performance interconnection network.
Figure 8-35. The Cray Research T3E. Each node holds an Alpha CPU, memory, and a shell of control logic plus E-registers; the nodes are connected by a full-duplex 3D torus, with a GigaRing channel for networks, disks, and tapes.
Figure 8-36. The Intel/Sandia Option Red system. (a) The Kestrel board, with two pairs of Pentium Pro CPUs and 64 MB of memory on 64-bit local buses, plus I/O and a network interface. (b) The interconnection network.
Figure 8-37. Scheduling a COW. (a) FIFO. (b) Without head-of-line blocking. (c) Tiling. The shaded areas indicate idle CPUs.
Figure 8-38. (a) Three computers on an Ethernet. (b) An Ethernet switch.
Figure 8-39. Sixteen CPUs connected by four ATM switches. Two virtual circuits are shown.
Figure 8-40. A globally shared virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (a) The initial situation. (b) After CPU 0 references page 10. (c) After CPU 1 references page 10, here assumed to be a read-only page.
("abc", 2, 5)
("matrix-1", 1, 6, 3.14)
("family", "is sister", "Carolyn", "Elinor")

Figure 8-41. Three Linda tuples.
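The Linda tuple space can be approximated in a few lines. This is our own sketch, not Linda itself: out() deposits a tuple, and a simplified non-blocking in() withdraws the first tuple matching a template, where actual values must be equal and formal parameters (represented here by the type objects int, float, str) match any value of that type:

```python
class TupleSpace:
    def __init__(self):
        self.tuples = []

    def out(self, *tup):
        """Deposit a tuple into the tuple space."""
        self.tuples.append(tup)

    def _matches(self, template, tup):
        if len(template) != len(tup):
            return False
        return all(isinstance(v, t) if isinstance(t, type) else t == v
                   for t, v in zip(template, tup))

    def in_(self, *template):
        """Withdraw and return the first matching tuple, or None."""
        for tup in self.tuples:
            if self._matches(template, tup):
                self.tuples.remove(tup)
                return tup
        return None
```

For example, after out("abc", 2, 5), the template ("abc", int, int) withdraws that tuple. Real Linda's in blocks until a match appears, which this sketch omits.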
Object implementation stack;
    top: integer;                               # stack pointer
    stack: array [integer 0..N-1] of integer;   # storage for the stack

    operation push(item: integer);              # function returning nothing
    begin
        stack[top] := item;                     # push item onto the stack
        top := top + 1;                         # increment the stack pointer
    end;

    operation pop(): integer;                   # function returning an integer
    begin
        guard top > 0 do                        # suspend if the stack is empty
            top := top - 1;                     # decrement the stack pointer
            return stack[top];                  # return the top item
        od;
    end;
begin
    top := 0;                                   # initialization
end;

Figure 8-42. A simplified Orca stack object, with internal data and two operations.
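The guarded pop of Fig. 8-42 maps naturally onto a condition variable. A Python analogue of ours (not Orca itself): each operation runs under the object's lock, mirroring Orca's one-operation-at-a-time semantics, and the guard becomes a wait loop:

```python
import threading

class Stack:
    def __init__(self):
        self._data = []
        self._cond = threading.Condition()   # lock + wait queue

    def push(self, item):
        with self._cond:
            self._data.append(item)
            self._cond.notify()              # wake a popper blocked on the guard

    def pop(self):
        with self._cond:
            while not self._data:            # guard: suspend while empty
                self._cond.wait()
            return self._data.pop()
```

A thread calling pop() on an empty stack blocks until some other thread pushes, just as an Orca process suspends until the guard top > 0 becomes true.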