Cluster Network Products

Emma Cummings

Cluster interconnects include, among others: Gigabit Ethernet, Myrinet, Quadrics and InfiniBand.

Interconnects in Top500 list, 11/2009

Interconnects in Top500 list, 11/2008

Cluster Network Technologies
Gigabit Ethernet:
The technology has matured and now offers very good performance at a very low cost.
Latency is moderate: many Ethernet switches are designed for general LANs (store-and-forward), where latency reduction is not necessarily the primary goal (latency is of the order of ms).
Zero-copy, OS-bypass message passing can be supported with a programmable NIC and direct memory access.


Cluster Network Technologies
Myrinet:
Uses fibre optic cable.
Uses a fat-tree structure.
Low latency (7-10 µs) with a peak bandwidth of 4 Gbps.
Provides zero-copy message passing and can offload packet processing to the NIC.
Uses cut-through (wormhole) switching to reduce latency.
More expensive than Ethernet.
Figure: (a) Twisted pair cable in Ethernet; (b) Fibre optic cable.
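The latency difference between store-and-forward and cut-through switching can be illustrated with a simple model. The sketch below assumes a packet crossing a path of equal-bandwidth links and ignores propagation and queuing delay. The function names and the example figures (1500-byte packet, 64-byte header, 1 Gbit/s links, 4 links) are illustrative assumptions, not values from the slides.

```python
def store_and_forward_latency(packet_bits, link_bps, links):
    """Every switch receives the whole packet before forwarding it,
    so the full transmission time is paid once per link."""
    return links * packet_bits / link_bps


def cut_through_latency(packet_bits, header_bits, link_bps, links):
    """A switch forwards as soon as the header has arrived, so only the
    header time is paid at each intermediate switch; the body is pipelined."""
    return (links - 1) * header_bits / link_bps + packet_bits / link_bps


if __name__ == "__main__":
    packet = 1500 * 8   # bits (assumed packet size)
    header = 64 * 8     # bits (assumed header size)
    bandwidth = 1e9     # bits per second (assumed link speed)
    for name, t in [
        ("store-and-forward", store_and_forward_latency(packet, bandwidth, 4)),
        ("cut-through", cut_through_latency(packet, header, bandwidth, 4)),
    ]:
        print(f"{name}: {t * 1e6:.3f} microseconds")
```

Under this model the store-and-forward cost grows with the packet size at every hop, while the cut-through cost grows only with the header size.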

Zero copy protocol

Cluster Network Technologies
Quadrics:
Product of a strategic partnership between Quadrics and Compaq (used in ASCI Q).
Uses a fat quad-tree topology.
Very low latency of 2-5 µs, due to fast interconnects and a highly tuned software stack (MPI libraries); bandwidth is about 2 Gbps.

Cluster Network Technologies
InfiniBand:
By Intel.
Basic link speed of 2.5 Gb/s.
Cut-through (wormhole) switches are used.
Current installations achieve latencies of less than 7 µs, and this is being improved.

Example Clusters

BlueGene/L
No. 1 in Top500 list from
Source: IBM


BlueGene/L networking
The BlueGene system employs various network types.
Central is the torus interconnection network: a 3D torus with wrap-around.
Each node connects to six neighbours (bidirectional links).
Routing is performed in hardware.
Each link runs at 1.4 Gbit/s, giving 1.4 x 6 x 2 = 16.8 Gbit/s aggregate bandwidth per node.
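The wrap-around neighbour rule and the aggregate bandwidth figure can be checked with a short sketch. The function names and the 8 x 8 x 8 torus size used in the example are assumptions chosen for illustration. Only the six-neighbour topology and the 1.4 Gbit/s link speed come from the slide.

```python
def torus_neighbours(coord, dims):
    """Return the six neighbours of `coord` in a 3D torus of size `dims`.
    The modulo operation implements the wrap-around links."""
    neighbours = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbours.append(tuple(n))
    return neighbours


def aggregate_bandwidth_gbits(link_gbits=1.4, links=6, directions=2):
    """Per-node aggregate bandwidth: link speed x six links x two directions."""
    return link_gbits * links * directions


if __name__ == "__main__":
    # A corner node of an assumed 8 x 8 x 8 torus wraps to coordinate 7.
    print(torus_neighbours((0, 0, 0), (8, 8, 8)))
    print(round(aggregate_bandwidth_gbits(), 1), "Gbit/s")
```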

BlueGene/L
Other three networks:
Binary combining tree
  Used for collective/global operations: reductions, sums, products, barriers etc.
  Low latency (2 µs)
Gigabit Ethernet I/O network
  Supports file I/O
  An I/O node is responsible for performing I/O operations for 128 processors
Diagnostic and control network
  Booting nodes, monitoring processors
Each chip has the above four network interfaces (torus, tree, I/O, diagnostics).
Note that specialised networks are used for different purposes, which is quite different from many other HPC cluster architectures.

BlueGene/L
Message passing:
The BlueGene project devoted a good deal of effort to developing an efficient MPI implementation that reduces latency in the software stack.
Using the MPICH code base as a starting point, the MPI library was enhanced to exploit the machine architecture, for example by using the combining tree for reductions and broadcasts.
Reading paper: Filtering Failure Logs for a BlueGene/L Prototype
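Why a combining tree suits reductions can be seen by simulating one in ordinary Python. This is only a sketch of the idea, not the BlueGene/L MPI implementation: `tree_reduce` is a hypothetical helper that combines values pairwise, level by level, and reports how many levels were needed.

```python
import operator


def tree_reduce(values, op=operator.add):
    """Combine `values` pairwise, level by level, as a binary combining
    tree would. Returns (result, number_of_levels)."""
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    levels = 0
    while len(level) > 1:
        # Combine neighbouring pairs; an unpaired last value moves up unchanged.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels


if __name__ == "__main__":
    total, levels = tree_reduce(range(1, 9))
    print(total, levels)
```

With n contributions the number of levels grows as log2(n), whereas forwarding every value to a single node one at a time would take n - 1 steps.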

ASCI Q
The Q supercomputing system at Los Alamos National Laboratory (LANL).
Product of the Advanced Simulation and Computing (ASCI) program.
Used for simulation and computational modelling.
No. 2 in the Top500 supercomputer list in 2002.

ASCI Q
Classical cluster architecture.
SMPs (AlphaServer ES45s from HP) are put in one segment.
  Each with four EV Ghz CPUs with 16 MB cache.
The whole system has 3 segments.
  The three segments can operate independently or as a single system.
  Aggregate 60 TeraFLOPS capability.
  33 Terabytes of memory.
  664 TB of global storage.
Interconnection using the Quadrics switch interconnect (QSNet).
  High bandwidth (250 MB/s) and low latency (5 µs) network.
Top500 list:

Earth Simulator
Built by NEC, located in the Earth Simulator Centre in Japan.
Used for running global climate models to evaluate the effects of global warming.
No. 1 from

Earth Simulator
640 nodes, each with 8 vector processors and 16 GB memory.
Two nodes are installed in one cabinet.
In total:
  5120 processors (NEC SX-5)
  10 TeraByte memory
  700 TeraByte of disk storage and 1.6 PetaByte of tape storage
Computing capacity: 36 TFlop/s
Networking: crossbar interconnection (very expensive)
  Bandwidth: 16 GB/s between any two nodes
  Latency: 5 µs
Dual-level parallelism: OpenMP within a node, MPI between nodes.
Physical installation: the machine resides on the 3rd floor; cables on the 2nd; power generation and cooling on the 1st and ground floors.

UK systems
Cambridge PowerEdge
576 Dell PowerEdge 1950 compute servers.
Computing capability: 28 TFlop/s
Each server has two dual-core Intel Xeon 5160 processors (3 GHz) and 8 GB of memory.
InfiniBand network
  Bandwidth: 10 Gbit/s
  Latency: 7 µs
60 TeraByte of disk storage.

Cluster Workload Management
Goal: maximising the delivery of resources to jobs, given job requirements and local policy restrictions.
Three parties:
  Users: supply the job requirements
  Administrators: describe local use policies
  Workload management software: monitors the state of the cluster, schedules the jobs and tracks resource usage
Some or all of the following activities are performed:
  Queuing
  Scheduling
  Monitoring
  Resource management
  Accounting

Queuing
Job submission usually consists of two primary parts:
  Resource requirements (e.g. the amount of memory, the number of CPUs needed)
  Job description (e.g. job name, the location of the required input files)
Once submitted, jobs are held in the queue until matching resources are available.

Scheduling
Determining at what time a job should be put into execution, and on which resources.
There are a variety of metrics to measure scheduling performance:
  System-oriented metrics (e.g. throughput, utilisation, average response time of all jobs)
  User-oriented metrics (e.g. response time of a job submitted by a user)
These metrics can contradict each other, so a balance has to be struck between them.
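The metrics named above can be computed from a log of completed jobs. The sketch below assumes a hypothetical `JobRecord` with submit, start and end times, and takes "response time" to mean completion time minus submission time. Both are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    user: str
    submit: float  # time the job entered the queue
    start: float   # time the job began executing
    end: float     # time the job finished
    cpus: int


def system_metrics(jobs, total_cpus):
    """System-oriented metrics over the whole job log."""
    makespan = max(j.end for j in jobs) - min(j.submit for j in jobs)
    busy = sum((j.end - j.start) * j.cpus for j in jobs)
    return {
        "throughput": len(jobs) / makespan,             # jobs per time unit
        "utilisation": busy / (total_cpus * makespan),  # fraction of CPU time used
        "avg_response": sum(j.end - j.submit for j in jobs) / len(jobs),
    }


def user_response_times(jobs, user):
    """User-oriented metric: response time of each job owned by `user`."""
    return [j.end - j.submit for j in jobs if j.user == user]


if __name__ == "__main__":
    log = [JobRecord("alice", 0, 0, 10, 2), JobRecord("bob", 0, 10, 20, 4)]
    print(system_metrics(log, total_cpus=4))
    print(user_response_times(log, "bob"))
```

In this log the system is 75% utilised, yet bob's job waits 10 time units before starting, which shows how a system-oriented metric can look healthy while a user-oriented one does not.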

Monitoring
Providing information to administrators, users and the scheduling system on the status of jobs and resources.
The method of collection may differ between workload management systems, but the general purposes are the same.

Resource management
Handling the details of:
  Starting a job under the identity of the user
  Stopping a job
  Cleaning up the mess left behind after the job either completes or is aborted
  Removing or adding resources
In a batch system, jobs are put into execution in such a way that users need not be present during execution.
In an interactive system, users have to be present to supply arguments or information during the execution of the jobs.

Accounting
Accounting for which users are using what resources, and for how long.
Collecting resource usage data (e.g. job owner, resources requested by the job, total amount of resources consumed by the job).
Accounting data can be used for:
  Producing system usage and user usage reports
  Tuning the scheduling policy
  Calculating future resource allocations
  Anticipating future resource requirements by users
  Determining areas for improvement within the cluster
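A per-user usage report of the kind listed above can be produced by aggregating job records. The record layout (owner, CPUs, runtime in hours) and the choice of CPU-hours as the unit are assumptions made for this sketch. Real workload managers define their own accounting log formats.

```python
from collections import defaultdict


def usage_report(records):
    """Sum CPU-hours per job owner.
    `records` is an iterable of (owner, cpus, runtime_hours) tuples."""
    totals = defaultdict(float)
    for owner, cpus, hours in records:
        totals[owner] += cpus * hours
    return dict(totals)


if __name__ == "__main__":
    log = [("alice", 4, 1.0), ("bob", 2, 3.0), ("alice", 1, 2.0)]
    for owner, cpu_hours in sorted(usage_report(log).items()):
        print(f"{owner}: {cpu_hours} CPU-hours")
```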

PBS
PBS, the Portable Batch System, is a flexible workload management and job scheduling system.
Originally developed at NASA.
Different versions of PBS:
  OpenPBS
  PBSpro
  Torque
Three key system daemons:
  pbs_server: runs on the head node; is the centre of PBS
  pbs_mom: runs on the compute nodes; actually places the job into execution
  pbs_sched: schedules jobs

PBS
PBS job submission script:
    #!/bin/sh
    #PBS -l walltime=1:00:00
    #PBS -l mem=400mb
    #PBS -l ncpus=4
    cd ${HOME}/PBS/test
    mpirun -np 4 myprogram
Submitting a job:
    % qsub myscriptfile
Inquiring about the status of a job:
    % qstat
Deleting a job:
    % qdel
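When many similar jobs have to be submitted, the script text can be generated rather than written by hand. The sketch below only builds the script as a string, using the same three `#PBS -l` directives as the slide. `make_pbs_script` and its parameters are hypothetical names, and nothing here calls PBS itself.

```python
def make_pbs_script(walltime, mem, ncpus, workdir, command):
    """Return the text of a job script with the resource directives
    shown in the slide (walltime, mem, ncpus)."""
    lines = [
        "#!/bin/sh",
        f"#PBS -l walltime={walltime}",
        f"#PBS -l mem={mem}",
        f"#PBS -l ncpus={ncpus}",
        f"cd {workdir}",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the example script from the slide.
    print(make_pbs_script("1:00:00", "400mb", 4,
                          "${HOME}/PBS/test", "mpirun -np 4 myprogram"), end="")
```

The resulting text would be saved to a file and passed to qsub as in the slide.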

Maui
By the Maui High Performance Computing Center and other partners.
A job scheduler that can interact with a number of different resource managers (e.g. PBS).
Maui is an external scheduler: it does not include a resource manager, but rather extends the capabilities of existing resource managers.
  The underlying resource manager continues to maintain responsibility for managing nodes and tracking jobs.
  Maui uses the APIs of other resource managers (e.g. PBS) to obtain system information.
  Maui controls the decisions of when, where, and how jobs will run.

Scheduling Policies
The simplest policy: First-Come First-Served (FCFS).
Jobs are initiated in the same order as they are submitted.
Does not require prior knowledge about tasks (e.g. runtime).
Problem: jobs can block other jobs from starting, despite there being no performance benefit to either user.
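The blocking problem is easy to reproduce in a small simulation. The sketch below is a minimal FCFS model under stated assumptions: all jobs are queued at time 0 in submission order, each job is a hypothetical (name, cpus, runtime) tuple, and the cluster is a single pool of identical CPUs.

```python
import heapq


def fcfs_schedule(jobs, total_cpus):
    """Start jobs strictly in submission order.
    `jobs` is a list of (name, cpus, runtime); returns {name: start_time}."""
    now, free = 0.0, total_cpus
    running = []  # min-heap of (end_time, cpus) for jobs in execution
    starts = {}
    for name, cpus, runtime in jobs:
        if cpus > total_cpus:
            raise ValueError(f"{name} needs more CPUs than the cluster has")
        # Wait for running jobs to finish until the next job in line fits.
        while free < cpus:
            end, released = heapq.heappop(running)
            now = max(now, end)
            free += released
        starts[name] = now
        free -= cpus
        heapq.heappush(running, (now + runtime, cpus))
    return starts


if __name__ == "__main__":
    queue = [("A", 2, 10), ("B", 4, 5), ("C", 1, 3)]
    print(fcfs_schedule(queue, total_cpus=4))
```

On this assumed 4-CPU cluster, job C needs only one CPU and two are idle while A runs, yet C cannot start until B, which is ahead of it in the queue, has started and finished.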

First-Come First-Served

Backfilling
The problem with FCFS is that idle time (the sum of unused processing intervals) can be significant.
One improvement is to backfill: a job further back in the queue is allowed to start if it does not delay the first job in the queue.
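The backfilling rule can be sketched as a small simulation. The assumptions are: all jobs are queued at time 0, each job is a hypothetical (name, cpus, runtime) tuple with a unique name, runtime estimates are exact, and only the first job in the queue is protected from delay. The variable names `shadow` and `extra` are chosen for this sketch.

```python
import heapq


def backfill_schedule(jobs, total_cpus):
    """Start jobs in queue order, letting later jobs jump ahead only if
    they do not delay the first waiting job. Returns {name: start_time}."""
    queue = list(jobs)
    if any(cpus > total_cpus for _, cpus, _ in queue):
        raise ValueError("a job needs more CPUs than the cluster has")
    now, free = 0.0, total_cpus
    running = []  # min-heap of (end_time, cpus)
    starts = {}

    def start(job):
        nonlocal free
        name, cpus, runtime = job
        starts[name] = now
        free -= cpus
        heapq.heappush(running, (now + runtime, cpus))

    while queue:
        # Start jobs from the head of the queue while they fit.
        while queue and queue[0][1] <= free:
            start(queue.pop(0))
        if not queue:
            break
        # Earliest time the head job can start (shadow) and the CPUs
        # that will still be spare at that moment (extra).
        head_cpus, avail = queue[0][1], free
        for end, cpus in sorted(running):
            avail += cpus
            if avail >= head_cpus:
                shadow, extra = end, avail - head_cpus
                break
        # Backfill: a later job may start now if it ends before the shadow
        # time, or if it only uses CPUs the head job will not need.
        for job in queue[1:]:
            _, cpus, runtime = job
            ends_in_time = now + runtime <= shadow
            if cpus <= free and (ends_in_time or cpus <= extra):
                start(job)
                queue.remove(job)
                if not ends_in_time:
                    extra -= cpus
        # Advance time to the next job completion.
        end, cpus = heapq.heappop(running)
        now, free = end, free + cpus
    return starts


if __name__ == "__main__":
    queue = [("A", 2, 10), ("B", 4, 5), ("C", 1, 3)]
    print(backfill_schedule(queue, total_cpus=4))
```

With an assumed 4-CPU cluster, job B must wait for A to finish at time 10. Job C needs one CPU for 3 time units, so it can run in the idle CPUs from time 0 without pushing back B's start.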

Backfilling

Backfilling
Advantages:
  Utilisation is improved.
Disadvantages:
  Information about the job execution time is required.
  User estimates are usually inaccurate.
  What to do if a job overruns is a policy decision. Many administrators choose to terminate a job if it exceeds its allocated execution time; otherwise some users may deliberately underestimate the job length to get an earlier start time.

Backfilling: a problem if the predicted runtime is wrong

Scheduling Policies
Reservation:
User-based quality of service (QoS) is an increasingly important scheduling metric.
In addition to normal scheduling, reservation services can be used to plan resource allocation.
Users are able to set up a reserved block of processing capability that they can use at some point in the future.
The task management system agrees to the reservation.
Users are subsequently able to run jobs within their reservation quota.
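One way a task management system might decide whether to agree to a reservation is a capacity check against the reservations it has already granted. The sketch below is an assumed admission rule, not a description of any particular scheduler: a reservation is a hypothetical (start, end, cpus) tuple covering the half-open interval [start, end), and a request is accepted only if the total reserved CPUs never exceed the cluster size.

```python
def can_reserve(existing, request, total_cpus):
    """Return True if `request` fits alongside the `existing` reservations.
    Each reservation is (start, end, cpus) over the interval [start, end)."""
    start, end, cpus = request
    # Reserved capacity only rises at reservation start times, so it is
    # enough to check the request start and any starts inside the interval.
    points = {start} | {s for s, _, _ in existing if start <= s < end}
    for t in points:
        used = sum(c for s, e, c in existing if s <= t < e)
        if used + cpus > total_cpus:
            return False
    return True


if __name__ == "__main__":
    granted = [(0, 10, 2), (5, 15, 1)]
    print(can_reserve(granted, (8, 12, 2), total_cpus=4))
    print(can_reserve(granted, (10, 12, 3), total_cpus=4))
```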

Coursework seminars
A reminder that the allocation of coursework seminar groups is on my homepage.
Start your coursework as early as possible.
Make sure you consult your seminar tutors if you have problems with your coursework.


Reduces latency and buffer overhead.
Messaging occurs at a speed close to that of the processors being directly connected.
Less error detection.

Switching
Operational modes:
Store-and-forward: each switch receives an entire packet before it forwards it on to the next switch. Useful in a general-purpose network (i.e. a LAN). usually, there is a