BlueGene/L. Computer Science, University of Warwick. Source: IBM


Victoria Garrison
5 years ago

1 BlueGene/L Source: IBM 1

2 BlueGene/L networking: The BlueGene system employs several network types. Central is the torus interconnection network: a 3D torus with wrap-around links. Each node connects to six neighbours over bidirectional links, and routing is performed in hardware. Each link runs at 1.4 Gbit/s in each direction, giving 1.4 x 6 x 2 = 16.8 Gbit/s aggregate bandwidth per node. 2
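
The wrap-around connectivity and the bandwidth arithmetic on this slide can be illustrated with a minimal Python sketch. The function names and the 8 x 8 x 8 torus size are hypothetical, chosen only for illustration; only the 1.4 Gbit/s link speed, the six links and the factor of two for bidirectionality come from the slide.

```python
# Illustrative sketch only: names and the 8x8x8 size are assumptions.

def torus_neighbours(node, dims):
    """Return the six neighbours of `node` in a 3D torus of size `dims`.

    The modulo operation implements the wrap-around links.
    """
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

LINK_GBITS = 1.4      # per-link, per-direction bandwidth (from the slide)
LINKS_PER_NODE = 6    # six torus neighbours (from the slide)
DIRECTIONS = 2        # links are bidirectional (from the slide)

def aggregate_bandwidth_gbits():
    """Aggregate per-node torus bandwidth in Gbit/s."""
    return LINK_GBITS * LINKS_PER_NODE * DIRECTIONS

if __name__ == "__main__":
    # A corner node still has six neighbours thanks to wrap-around.
    print(torus_neighbours((0, 0, 0), (8, 8, 8)))
    print(round(aggregate_bandwidth_gbits(), 1))
```

Note that a corner node such as (0, 0, 0) wraps to coordinate 7 in each dimension, which is what distinguishes a torus from a plain mesh.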

3 BlueGene/L: The other three networks are: (1) a binary combining tree, used for global operations (reductions, sums, products, barriers, etc.) with low latency (2 µs); (2) a Gigabit Ethernet I/O network, which supports file I/O; (3) a diagnostic and control network, used for booting nodes and monitoring processors. Each chip has all four network interfaces (torus, tree, I/O, diagnostics). Note that specialised networks are used for different purposes, which is quite different from many other HPC cluster architectures. 3
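
To show why a combining tree suits global operations, here is a small software model of a pairwise tree reduction. It is a sketch, not a description of the BlueGene/L hardware: the function name `tree_reduce` is hypothetical, and the point is only that N values are combined in roughly log2(N) levels rather than N - 1 sequential steps.

```python
import operator

def tree_reduce(values, op):
    """Combine values pairwise, level by level, like a binary combining tree.

    Returns (result, number_of_levels). Assumes `values` is non-empty.
    """
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Combine adjacent pairs; an unpaired last element is carried up.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels

if __name__ == "__main__":
    total, steps = tree_reduce(range(1, 9), operator.add)
    print(total, steps)  # prints: 36 3
```

With eight inputs the sum is complete after three levels, which is the behaviour that keeps reduction latency low as the node count grows.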

4 BlueGene/L Message Passing: The BlueGene team put a good deal of effort into developing an efficient MPI implementation to reduce latency in the software stack. Using the MPICH code base as a starting point, the MPI library was enhanced to exploit the machine architecture, for example by using the combining tree for reductions and broadcasts. Reading paper: Filtering Failure Logs for a BlueGene/L Prototype 4

5 ASCI Q: The Q supercomputing system at Los Alamos National Laboratory (LANL). A product of the Advanced Simulation and Computing (ASCI) program. Used for simulation and computational modelling. Now No. 90 (last year No. 40) in the Top500 supercomputer list. 5

6 ASCI Q: Classical cluster architecture. SMPs (AlphaServer ES45s from HP), each with four EV Ghz CPUs with 16 MB cache, are grouped into a segment. The whole system has three segments, which can operate independently or as a single system. Aggregate 60 TeraFLOPS capability, 33 terabytes of memory and 664 TB of global storage. Interconnection uses the Quadrics dual-rail switch interconnect (QSNet), a high-bandwidth (250 MB/s per rail), low-latency (5 µs) network. Top500 list: 6

7 Earth Simulator Built by NEC, located in the Earth Simulator Centre in Japan Used for running global climate models to evaluate the effects of global warming The fastest supercomputer from Now No.30 in the Top500 supercomputer list 7

8 Earth Simulator: 640 nodes, each with 8 vector processors and 16 GB of memory; two nodes are installed in each cabinet. In total: 5120 processors (NEC SX-5), 10 terabytes of memory, 700 terabytes of disk storage and 1.6 petabytes of tape storage. Computing capacity: 36 TFlop/s. Networking: crossbar interconnection (very expensive), with 16 GB/s bandwidth between any two nodes and 5 µs latency. Dual-level parallelism: OpenMP within a node, MPI between nodes. Physical installation: the machine resides on the 3rd floor, cables on the 2nd, and power generation and cooling on the 1st and ground floors. 8

9 UK systems: Cambridge PowerEdge. 576 Dell PowerEdge 1950 compute servers. Computing capability: 28 TFlop/s. Each server has two dual-core Intel Xeon 5160 processors at 3 GHz and 8 GB of memory. InfiniBand network: bandwidth 10 Gbit/s, latency 7 µs. 60 terabytes of disk storage. 9

10 Cluster Networks Introduction: Communication has a significant impact on application performance, so interconnection networks have a vital role in cluster systems. As usual, the driver is performance: an increase in compute power typically demands a proportional improvement in communication services, i.e. lower latency and higher bandwidth. 10

11 Cluster Networks: Issues with cluster interconnections are similar to those with normal networks. Latency and bandwidth: latency = sender overhead + switching overhead + (message size / bandwidth) + receiver overhead. Topology type (bus, ring, torus, hypercube, etc.). Routing and switching. Direct (point-to-point) or indirect connections. Balance between performance and cost. NIC (Network Interface Card) capabilities. Physical media (wiring density, reliability). 11
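
The latency formula on this slide translates directly into code. The sketch below is a minimal illustration: the function name and the example figures (1 µs for each overhead, a 1000-byte message, 250 MB/s bandwidth) are assumptions chosen for the arithmetic, not measurements of any system described here.

```python
def message_latency(size_bytes, bandwidth_bytes_per_s,
                    sender_overhead_s, switching_overhead_s,
                    receiver_overhead_s):
    """Latency = sender overhead + switching overhead
                 + (message size / bandwidth) + receiver overhead."""
    transfer_time_s = size_bytes / bandwidth_bytes_per_s
    return (sender_overhead_s + switching_overhead_s
            + transfer_time_s + receiver_overhead_s)

if __name__ == "__main__":
    # Hypothetical figures: 1000-byte message, 250 MB/s, 1 µs per overhead.
    latency = message_latency(1000, 250e6, 1e-6, 1e-6, 1e-6)
    print(f"{latency * 1e6:.1f} µs")
```

For small messages the fixed overheads dominate, while for large messages the size / bandwidth term dominates, which is why both latency and bandwidth matter when comparing interconnects.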

12 Interconnection Topologies: In standard LANs we have two general structures. Shared network (bus): as used by classic Ethernet networks. All messages are broadcast, and each processor listens to every message. Requires complex access control (e.g. CSMA/CD). Collisions can occur, requiring back-off policies and retransmissions. Suitable when the offered load is low, but inappropriate for high-performance applications; there is very little reason to use this form of network today. Switched network: permits point-to-point communication between sender and receiver. Fast internal transport provides high aggregate bandwidth, and multiple messages can be sent simultaneously. 12

13 Interconnection Topologies: For switched networks, first consider node connectivity. Each node usually has one link (connection) to the switch, so if the node is a 2-way SMP, both processors compete for its capacity. This can be improved by allowing multiple links per SMP. Useful quantities for switched networks: Scalability: how the number of switches grows with the number of nodes. Degree: the number of links to/from a node. Diameter: the length of the shortest path between the two furthest-apart nodes. Bisection width: the minimum number of links that must be cut to divide the topology into two independent networks of the same size (+/- one node). This is essentially a measure of bottleneck bandwidth: the higher it is, the better the network performs under load. 13

14 Interconnection Topologies: Crossbar switch. Low latency and high throughput. Switch scalability is poor: O(N^2). Lots of wiring. 14

15 Interconnection Topologies: Linear Arrays and Rings. Consider networks with switch scaling costs better than O(N^2). In one dimension, we have simple linear arrays: a direct topology (number of switches : number of nodes = 1:1) with O(N) switches. These can wrap around to make a ring, or 1D torus. High overall bandwidth, but latency is high, so 2D/3D Cartesian applications will perform poorly on this network. 15

16 Interconnection Topologies: 2D Meshes. Can wrap around as a 2D torus. Switch scaling: O(N). Average degree: 4 (as node count increases). Diameter: O(2n^(1/2)). Bisection width: O(n^(1/2)). 16
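
The effect of wrap-around on diameter can be checked by brute force. The following sketch (all names hypothetical) builds a k x k grid, optionally with torus wrap-around, and measures the diameter with breadth-first search. It is meant as a way to sanity-check the scaling figures on small cases, not as an efficient algorithm.

```python
from collections import deque

def grid_neighbours(node, k, wrap):
    """Yield the neighbours of `node` in a k x k mesh (wrap=False) or torus (wrap=True)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if wrap:
            yield nx % k, ny % k
        elif 0 <= nx < k and 0 <= ny < k:
            yield nx, ny

def eccentricity(start, k, wrap):
    """Longest shortest-path distance from `start`, found by breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in grid_neighbours(node, k, wrap):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return max(dist.values())

def diameter(k, wrap):
    """Diameter of the k x k mesh or torus: the largest eccentricity of any node."""
    return max(eccentricity((x, y), k, wrap)
               for x in range(k) for y in range(k))

if __name__ == "__main__":
    print(diameter(4, wrap=False))  # 4 x 4 mesh: prints 6
    print(diameter(4, wrap=True))   # 4 x 4 torus: prints 4
```

For N = 16 nodes the mesh diameter is 2(N^(1/2) - 1) = 6 and the torus diameter is N^(1/2) = 4, in line with the O(2N^(1/2)) and O(N^(1/2)) scaling.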

17 Interconnection Topologies: Hypercubes (or binary n-cubes). K dimensions, with N = 2^K switches. Diameter: O(log2 N). Good bisection width: O(N/2). 17
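
A common way to reason about hypercubes is to label each node with a K-bit integer, so that neighbours differ in exactly one bit. The sketch below uses that labelling (the function names are hypothetical) to show why the degree is K and the diameter is K = log2 N.

```python
def hypercube_neighbours(node, k):
    """Neighbours of `node` in a k-dimensional hypercube: flip each of the k bits."""
    return [node ^ (1 << bit) for bit in range(k)]

def hop_distance(a, b):
    """Minimum hops between nodes a and b: the number of differing bits."""
    return bin(a ^ b).count("1")

def hypercube_diameter(k):
    """Largest hop distance between any two of the 2**k nodes (brute force)."""
    n = 1 << k
    return max(hop_distance(a, b) for a in range(n) for b in range(n))

if __name__ == "__main__":
    print(hypercube_neighbours(0b000, 3))  # prints [1, 2, 4]
    print(hypercube_diameter(3))           # prints 3
```

The furthest pair of nodes are bitwise complements of each other, so the worst case is one hop per dimension.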

18 Interconnection Topologies: Binary Tree. Indirect topology (there is not a 1:1 ratio between switch and node counts). Scaling: n = 2^d processor nodes (where d = depth), 2n - 1 switches. Degree: 3. Diameter: O(2 log n). Bisection width: O(1). 18

19 Interconnection Topologies: Fat trees. Similar in diameter to a binary tree. Bisection width (which equates to the bottleneck) is greatly improved due to additional dimensions. For a fat quad-tree (such as that used by Quadrics QSNet), bisection bandwidth scales linearly with size. 19

20 Interconnection Topologies: Summary of topologies (N = number of nodes):

Topology    Degree        Diameter    Bisection
1D Array    2             N-1         1
1D Ring     2             N/2         2
2D Mesh     4             2N^(1/2)    N^(1/2)
2D Torus    4             N^(1/2)     2N^(1/2)
Hypercube   n = log2(N)   n           N/2

There are others: we saw a 3D torus in the BlueGene/L section, for instance. 20

21 Switching: Operational modes. Store-and-forward: each switch receives an entire packet before forwarding it to the next switch; useful in a non-dedicated environment (i.e. a LAN). Buffer size is usually finite, so packets may be dropped under heavy load. It also imposes a larger in-switch latency, but can detect errors in packets. Wormhole routing (also called cut-through switching): the packet is divided into small flits (flow units). The switch examines the first flit (the header), which contains the destination address, sets up a circuit and forwards the flit immediately. Subsequent flits of the message are forwarded as they arrive (near wire speed). This reduces latency and buffer overhead; messaging occurs at a speed close to that of the processors being directly connected. Less error detection. 21
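
The latency difference between the two modes can be estimated with a simplified model. The sketch below makes several assumptions that are not from the slides: all links have the same bandwidth, switch processing time and contention are ignored, and in cut-through mode each switch delays the message only by the time needed to receive one header flit. The function names and example figures are hypothetical.

```python
def store_and_forward_latency(packet_bytes, bandwidth_bytes_per_s, links):
    """Each link carries the whole packet before the next link can start."""
    return links * (packet_bytes / bandwidth_bytes_per_s)

def cut_through_latency(packet_bytes, flit_bytes, bandwidth_bytes_per_s, links):
    """The packet is serialised once; each intermediate switch adds one flit time."""
    switches = links - 1
    return (packet_bytes / bandwidth_bytes_per_s
            + switches * (flit_bytes / bandwidth_bytes_per_s))

if __name__ == "__main__":
    # Hypothetical figures: 1500-byte packet, 8-byte flits,
    # 125 MB/s links (1 Gbit/s), path of 4 links through 3 switches.
    saf = store_and_forward_latency(1500, 125e6, links=4)
    ct = cut_through_latency(1500, 8, 125e6, links=4)
    print(f"store-and-forward: {saf * 1e6:.2f} µs")
    print(f"cut-through:       {ct * 1e6:.2f} µs")
```

Under these assumptions store-and-forward latency grows linearly with the number of hops, whereas cut-through latency is nearly independent of path length, which matches the slide's claim that messaging approaches directly-connected speed.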

22 Cluster Network Technologies: The performance of (affordable) interconnects has increased dramatically in the past 5 years. Current choices include: Gigabit Ethernet; Myricom's Myrinet; SCI (Dolphin); Quadrics QSNet; InfiniBand. 22

23 Cluster Network Technologies: Gigabit Ethernet. The technology has matured and now offers very good performance at a very low cost. Small 1000BaseT switches are cheap, available at below $100/port; larger switches are still quite expensive (from $1000). Achievable bandwidth is about 90% of peak (compared to about 35% for shared-bus Ethernet). Latency performance is moderate: many Ethernet switches are designed for general LANs (store-and-forward), where latency reduction is not necessarily the primary incentive (latency is of the order of ms). Zero-copy, OS-bypass message passing can be supported with a programmable NIC and direct memory access. 23


24 Cluster Network Technologies: Myrinet. Uses fibre optic cable. Uses a fat-tree structure that can accommodate large numbers of nodes. Low latency (7-10 µs) with a peak bandwidth of 1800+ Mbps. Provides zero-copy message passing and can offload packet processing to the NIC. Uses cut-through/wormhole switching to reduce latency. More expensive than Ethernet, but the current market leader in HPC. Figure: (a) twisted pair cable in Ethernet; (b) fibre optic cable. 24

25 Cluster Network Technologies: SCI (Scalable Coherent Interface), offering 500 MB/s per port. Nodes can be arranged in a 2D (or 3D) wrapped mesh without switches. Messages not intended for the recipient pass straight through intelligent NICs. Low latency (~4 µs) and high peak bandwidth (1800 Mbps). The wrapped mesh topology is well suited to applications that decompose data onto a Cartesian grid. 25

26 Cluster Network Technologies: Quadrics. The product of a strategic partnership between Quadrics and Compaq (used in ASCI Q). Very low latency of 2-5 µs, due to fast interconnects and a highly tuned software stack (MPI libraries); bandwidth is about 2 Gbps. Uses a quad-tree arrangement. Overall scaling is known to be good as the number of nodes increases. 26

27 Cluster Network Technologies: InfiniBand, by Intel. Initially developed for cluster networks, InfiniBand may eventually be used internally (for inter-processor communication) and for I/O subsystems. Basic link speed of 2.5 Gb/s. Cut-through/wormhole switches are used. Current installations achieve latencies of less than 7 µs, but this is expected to improve. 27
