Parallel Computer Architecture II


1 Parallel Computer Architecture II
Stefan Lang
Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg
INF 368, Room 532, D-692 Heidelberg
phone: 622/
WS 5/6
Stefan Lang (IWR) Simulation on High-Performance Computers WS 5/6 / 37

2 Parallel Computer Architecture II
- Multiprocessor architectures
- Message passing
- Network topologies
- Example architectures
- Routing
- TOP 500
- TOP2 Architectures

3 Classification of MIMD Architectures
- Physical memory arrangement: shared memory, distributed memory
- Address space: global, local
- Programming model: shared address space, message passing
- Communication structure: memory coupling, message coupling
- Synchronization: semaphores, barriers
- Latency treatment: latency hiding, latency minimization

4 Distributed Memory: MP
[Figure: processors (P), each with cache (C) and memory (M), attached to a connection network (CN)]
- Multiprocessors have a local address space: each processor can access only its associated memory.
- Interaction with other processors takes place exclusively by sending messages.
- Processors, memory and cache are standard components: the price advantage of high unit volumes is fully exploited.
- The connection network ranges from Fast Ethernet to Infiniband.
- Approach with the highest scalability: IBM BlueGene > K processors

5 Distributed Memory: Message Passing
- Processes communicate data between distributed address spaces.
- Explicit message passing is necessary.
- Send/receive operations

6 A Generic Parallel Computer Architecture
[Figure: compute nodes (CPU, MEM, network interface NI, I/O) and an I/O node, each attached through a router (R) to the connection network; data wiring between the routers]
Generic approach of a scalable parallel computer with distributed memory.

7 Scalability Parameters
Parameters that drive the scalability of a computing system:
- Bandwidth [MB/s]
- Latency [µs]
- Costs [$]
- Physical size [m^2, m^3]
- Power consumption [W]
- Fault tolerance / recovery abilities
A truly scalable architecture should avoid any hard limits!

8 Influence of Parallel Software?
[Figure from Culler, Singh, Gupta: Parallel Computer Architecture]
Even if the parallel architecture scales, scalable software is a prerequisite for scalable numerical computation.

9 Message Passing
- A memory block (of variable length) is to be copied from one memory to another.
- More precisely: from the address space of one process to the address space of another process (running on a different processor).
- The connection network is packet oriented. Each message is subdivided into packets of fixed length (e.g. 32 byte to 4 kbyte).
- Packet layout: head, payload data, tail. Head: target processor; tail: checksum.
- Communication protocol: acknowledgment of whether a packet has arrived intact, flow control.
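The packet scheme described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Packet` class, the sequence number in the head, the use of CRC32 as the checksum, and the function names are all hypothetical and not taken from the lecture. The last packet is left shorter instead of being padded to the fixed length.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Packet:
    target: int     # head: target processor
    seq: int        # head: sequence number (assumption, used for reassembly)
    payload: bytes  # payload data, at most packet_size bytes
    checksum: int   # tail: CRC32 of the payload (assumed checksum type)


def packetize(message: bytes, target: int, packet_size: int = 32) -> list:
    """Split a message into packets of at most packet_size payload bytes."""
    packets = []
    for seq, start in enumerate(range(0, len(message), packet_size)):
        chunk = message[start:start + packet_size]
        packets.append(Packet(target, seq, chunk, zlib.crc32(chunk)))
    return packets


def reassemble(packets) -> bytes:
    """Verify every checksum and rebuild the message in sequence order."""
    parts = []
    for p in sorted(packets, key=lambda p: p.seq):
        if zlib.crc32(p.payload) != p.checksum:
            # A real protocol would withhold the acknowledgment here
            # so that the sender retransmits the packet.
            raise ValueError(f"packet {p.seq} arrived corrupted")
        parts.append(p.payload)
    return b"".join(parts)
```

A receiver would call `reassemble(packetize(data, target=3))` and get `data` back; a payload modified in transit triggers the `ValueError` branch.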

10 Message Passing
Layer model (hierarchy of protocols):
- Each process (Π_i, Π_j) sits on top of its own stack: Operating System, Network Driver, Network Interface.
- The two network interfaces are joined by the connection network.
Model of transmission time: $t_{mess}(n) = t_s + n\,t_b$
- $t_s$: setup time (latency)
- $t_b$: time per byte
- $1/t_b$: bandwidth, dependent on the protocol
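To get a feeling for the transmission time model, it can be evaluated directly. The sketch below assumes a hypothetical network with 10 µs setup time and 1 ns per byte; these values and the helper `effective_bandwidth` are illustrative, not measurements from the lecture.

```python
def t_mess(n, t_s, t_b):
    """Transmission time of an n-byte message: t_s + n * t_b."""
    return t_s + n * t_b


def effective_bandwidth(n, t_s, t_b):
    """Bytes per second actually achieved for an n-byte message."""
    return n / t_mess(n, t_s, t_b)


# Hypothetical parameters: 10 microseconds setup, 1 ns per byte (1 GB/s peak).
T_S = 10e-6
T_B = 1e-9

for n in (100, 10_000, 1_000_000):
    print(n, t_mess(n, T_S, T_B), effective_bandwidth(n, T_S, T_B))
```

Under this model a message of n = t_s / t_b bytes reaches exactly half of the peak bandwidth 1/t_b, so short messages are dominated by the setup time.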

11 Network Topology I
[Figure: (a) fully connected, (b) star, (c) array, (d) hypercube, (e) torus, (f) folded torus]
- Hypercube: a hypercube of dimension d has $2^d$ processors. Processor p is connected to processor q if their binary representations differ in exactly one bit.
- Network node: earlier (before 99) this was the processor itself; nowadays it is a dedicated communication processor.
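The hypercube connection rule maps directly onto bit operations: flipping one bit of a processor number gives a neighbour. A minimal sketch, with hypothetical function names:

```python
def hypercube_neighbors(p, d):
    """Neighbours of processor p in a d-dimensional hypercube (flip each bit once)."""
    return [p ^ (1 << bit) for bit in range(d)]


def are_connected(p, q):
    """True if p and q differ in exactly one bit."""
    diff = p ^ q
    # diff is a power of two exactly when a single bit is set.
    return diff != 0 and (diff & (diff - 1)) == 0


# Processor 5 (binary 101) in a 3-dimensional hypercube with 2**3 = 8 processors.
print(hypercube_neighbors(5, 3))
```

Each processor has exactly d neighbours, which is the node degree log2 N of the hypercube.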

12 Network Topology II
Reference parameters (N: number of nodes):
Network topology | Node degree K | Wire count L | Diameter D | Bisection bandwidth B | Symmetry
Full connectivity | $N-1$ | $N(N-1)/2$ | $1$ | $(N/2)^2$ | yes
Star | $N-1$ | $N-1$ | $2$ | $N/2$ | no
2D Grid | $4$ | $2N - 2\sqrt{N}$ | $2(\sqrt{N}-1)$ | $\sqrt{N}$ | no
3D Torus | $6$ | $3N$ | $3\sqrt[3]{N}/2$ | $2\sqrt[3]{N^2}$ | yes
Hypercube ($n = \log_2 N$) | $\log_2 N$ | $nN/2$ | $\log_2 N$ | $N/2$ | yes
k-ary n-cube ($N = k^n$) | $2n$ | $nN$ | $n\,k/2$ | $2k^{n-1}$ | yes
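The reference parameters of a topology can be checked by building the graph and measuring it. The sketch below does this for a k-ary n-cube (a torus with k nodes per dimension) and compares against the standard values: node degree 2n, wire count nN, diameter n times floor(k/2). The function names are hypothetical, and the code assumes k >= 3 so that the two neighbours in each dimension are distinct.

```python
from collections import deque
from itertools import product


def torus_neighbors(node, k):
    """Neighbours of `node` (a tuple of n coordinates) in a k-ary n-cube, k >= 3."""
    result = []
    for dim in range(len(node)):
        for step in (-1, 1):
            coords = list(node)
            coords[dim] = (coords[dim] + step) % k  # wrap-around link
            result.append(tuple(coords))
    return result


def measure(k, n):
    """Return (node degree, wire count, diameter) of a k-ary n-cube."""
    nodes = list(product(range(k), repeat=n))
    degree = len(set(torus_neighbors(nodes[0], k)))
    # Every wire is seen from both of its end points, hence the division by 2.
    wires = sum(len(set(torus_neighbors(v, k))) for v in nodes) // 2
    # Breadth-first search from one node; the torus is symmetric, so the
    # largest distance from any single node equals the diameter.
    dist = {nodes[0]: 0}
    queue = deque([nodes[0]])
    while queue:
        v = queue.popleft()
        for w in torus_neighbors(v, k):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return degree, wires, max(dist.values())


k, n = 4, 3
N = k ** n
print(measure(k, n), (2 * n, n * N, n * (k // 2)))
```

The symmetry column of the table is what makes the single breadth-first search sufficient; for a 2D grid or a star the search would have to start from every node.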

13 Store & Forward Routing
Store-and-forward routing:
- A message of length n is subdivided into packets of length N.
- Pipelining on packet level: each packet is stored completely in a network node before it is forwarded.
[Figure: source memory M, a chain of network nodes NN up to NNd, target memory M]

14 Store & Forward Routing
[Figure: transmission of a packet from source to target over time, with a routing step at each intermediate node]

15 Store & Forward Routing
Run time:
$$t_{SF}(n, N, d) = t_s + d\,(t_h + N t_b) + \left(\frac{n}{N} - 1\right)(t_h + N t_b) = t_s + t_h\left(d + \frac{n}{N} - 1\right) + t_b\,\bigl(n + N(d-1)\bigr)$$
- $t_s$: time needed on the source and target computers until the network is instructed to transmit the message, and until the receiving process is informed. This is the software share of the protocol.
- $t_h$: time needed to transmit the first byte of a message from one network node to the next (node latency, hop time).
- $t_b$: time to transmit one byte from network node to network node.
- $d$: number of hops to reach the target node.
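The two forms of the store-and-forward run time can be compared numerically. The sketch below assumes that n is a multiple of N; the parameter values are hypothetical and only serve to show that both expressions agree.

```python
def t_sf(n, N, d, t_s, t_h, t_b):
    """Store-and-forward time: first packet crosses d hops, the remaining
    n/N - 1 packets follow in a pipeline, one packet time apart."""
    packet_time = t_h + N * t_b
    return t_s + d * packet_time + (n / N - 1) * packet_time


def t_sf_expanded(n, N, d, t_s, t_h, t_b):
    """The same run time, sorted by t_h and t_b contributions."""
    return t_s + t_h * (d + n / N - 1) + t_b * (n + N * (d - 1))


# Hypothetical values: 4 KiB message, 64-byte packets, 5 hops.
params = dict(n=4096, N=64, d=5, t_s=10e-6, t_h=50e-9, t_b=1e-9)
print(t_sf(**params), t_sf_expanded(**params))
```

The term N(d-1) t_b shows the cost of buffering whole packets: every additional hop adds a full packet transfer time.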

16 Cut-Through Routing
Cut-through routing or wormhole routing:
- Packets are not buffered; each word (a so-called flit) is routed immediately to the next network node.
[Figure: transmission of a packet from source to target over time]
- Run time: $t_{CT}(n, N, d) = t_s + t_h d + t_b n$
- Time for a short message (n = N): $t_{CT} = t_s + d\,t_h + N t_b$. Because $d\,t_h \ll t_s$ (hardware!), this is nearly independent of distance.
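The claim that cut-through routing is nearly distance independent can be illustrated with numbers. The parameter values below are hypothetical (software setup time in the microsecond range, hop time in the nanosecond range); the packet length N does not appear on the right-hand side of the formula, so the sketch omits it from the signature.

```python
def t_ct(n, d, t_s, t_h, t_b):
    """Cut-through time: setup, d hop latencies, then n bytes streamed once."""
    return t_s + t_h * d + t_b * n


# Hypothetical values: 10 us setup, 50 ns per hop, 1 ns per byte.
T_S, T_H, T_B = 10e-6, 50e-9, 1e-9

near = t_ct(64, 1, T_S, T_H, T_B)   # short message, 1 hop
far = t_ct(64, 10, T_S, T_H, T_B)   # same message, 10 hops
print(near, far, far / near)
```

With these assumptions ten times the distance costs only a few percent more time, because the hop term d * t_h stays small compared with t_s.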

17 Deadlock
In a packet-transmitting network there is a danger of store-and-forward deadlock:
[Figure: nodes A, B, C and D holding packets labelled "to A", "to B", "to C" and "to D" that block one another in a cycle]

18 Deadlock
Together with cut-through routing: deadlock-free dimension-order routing.
Example, 2D grid:
- Partition the network into +x, -x, +y and -y networks, each with individual buffers.
- A message is routed first in the sender row, then in the receiver column.
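Dimension-order routing on a 2D grid can be sketched as follows. Nodes are assumed to be addressed by hypothetical (x, y) coordinates, where moving inside a row changes x and moving inside a column changes y; each hop is tagged with the sub-network (+x, -x, +y or -y) whose buffers it would use.

```python
def dimension_order_route(src, dst):
    """Return the hops from src to dst as (sub-network, node) pairs.

    The message first travels along x inside the sender row and only then
    along y inside the receiver column.
    """
    x, y = src
    hops = []
    # Phase 1: correct the x coordinate (sender row).
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        hops.append(("+x" if step > 0 else "-x", (x, y)))
    # Phase 2: correct the y coordinate (receiver column).
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        hops.append(("+y" if step > 0 else "-y", (x, y)))
    return hops


print(dimension_order_route((0, 0), (2, 1)))
```

In this scheme a message never turns from a y network back into an x network, so the buffer dependencies between the four sub-networks cannot close into a cycle.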

19 Multi-Processor Performance
[Figure from Culler, Singh, Gupta: Parallel Computer Architecture]

20 Message Passing Architectures I
Intel Paragon:
- First machine with parallel Unix
- Process migration, gang scheduling

21 Message Passing Architectures II
IBM SP2:
- Compute nodes are RS/6000 workstations
- Switching network

22 Message Passing Architectures III
Cray T3E:
- High package density
- A single system-wide clock
- Virtual shared memory

23 Top500
Top500 benchmark:
- The LINPACK benchmark is used to evaluate the systems.
- Benchmark performance does not reflect the overall performance of a system.
- The benchmark indicates the performance when solving a dense system of linear equations.
- Very regular problem: high achievable performance (near peak performance).

24 Top of Top500

25 Top500 key facts
- Entry barrier is a performance of 6.8 TFlop/s
- Mean energy consumption of Top is 4.9 MW: .8-2 GFlops/W
- 4 systems need more than MW
- Accumulated performance is 23.4 PFlop/s (74.2 PFlop/s)
- Top minimal performance is 72.6 TFlop/s (5.9 TFlop/s)
- 2 Petaflops systems
- Top5 minimal performance is 2.97 TFlop/s (9.29 TFlop/s)
- Processor type: Intel SandyBridge, AMD Opteron, IBM Power
- % of the systems have processors with 6 or more cores
- Infiniband (28) and Gigabit Ethernet (27) networks dominate
- Architecture: 8% Cluster, 2% MPP, % SIMD/SMP

26 Top500 Continent + Architecture Type

27 Top500 CoresPerSocket + Interconnect

28 IBM Blue Gene/Q Architecture

29 IBM Blue Gene/Q Networks
9834 nodes are connected by three distinct, integrated networks:
- 3-dimensional torus
  - Virtual cut-through hardware routing to maximize efficiency
  - 2.8 Gb/s on all 2 node links (total of 4.2 GB/s per node)
  - Communication backbone
  - 34 TB/s total torus interconnect bandwidth
- Global tree
  - One-to-all or all-all broadcast functionality
  - Arithmetic operations implemented in tree
  - ~.4 GB/s of bandwidth from any node to all other nodes
  - Latency of tree traversal less than usec
- Ethernet
  - Incorporated into every node ASIC
  - Disk I/O
  - Host control, booting and diagnostics

30 Cray RedStorm

31 Cray RedStorm Configuration
[Figure labels: Normally Unclassified, Switchable Nodes, Normally Classified, Disconnect Cabinets]

32 Cray RedStorm Building

33 Cray RedStorm Compute Node
- CPUs: AMD Opteron
- DRAM: (or 2) Gbyte or more
- ASIC: NIC + router
- Six links to other nodes in X, Y, and Z
ASIC = Application Specific Integrated Circuit, or a custom chip

34 Cray RedStorm Cabinet

35 Blue Gene L vs Red Storm
BGL 36 TF version, Red Storm TF version
Quantity | Blue Gene L | Red Storm | Ratio
Node speed | 5.6 GF | 5.6 GF | (x)
Node memory | GB | 2 (-8 GB) | (4x)
Network latency | 7 us | 2 us | (2/7x)
Network link bw | .28 GB/s | 6. GB/s | (22x)
BW Bytes/Flops | .5 | . | (22x)
Bi-Section B/F | .6 | .38 | (24x)
#nodes/problem | 4, | , | (/4x)

36 Cray XT-5 Jaguar Architecture

37 Cray XT-5 Jaguar IO-Configuration
