Commodity Cluster Computing

1 Commodity Cluster Computing Ralf Gruber, EPFL-SIC/CAPA/Swiss-Tx, Lausanne

2 Commodity Cluster Computing
1. Introduction
2. Characterisation of nodes, parallel machines, applications
3. Network topologies
4. Swiss-T1 machine: Cluster computing works!
5. Integrated cluster solutions
6. Resource management systems
7. High performance storage and archiving systems
8. Questions

3 1. Introduction
1.1. Evolution of HPC machines
1.2. Evolution of algorithms

4 Hardware History (Courtesy: Michel Deville)
[Table: Machine, Year, Speed (flops, then Mflops, then Mflops/proc); values not shown]
Machines, in order: Mark I, IBM, IBM Stretch, CDC, CDC, CRAY, CRAY, CRAY YMP, CRAY C, Intel Paragon, TM/CM, CRAY T3D, CRAY T3E, NEC SX, Swiss-T

5 Hardware History
[Figure: speed (flops) versus year, compared with Moore's law]
Moore's law: Performance doubled every 18 months over 52 years
Speed: Measured with Poisson solver
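The growth implied by this doubling rate can be checked with a few lines of arithmetic. The sketch below assumes an exact doubling every 18 months sustained over the full 52 years; the function name `moore_factor` is illustrative and not from the slides.

```python
def moore_factor(years, doubling_period_years=1.5):
    """Growth factor after `years` if performance doubles every period."""
    doublings = years / doubling_period_years
    return 2.0 ** doublings


# Assumption: a clean doubling every 1.5 years for 52 years.
factor = moore_factor(52)
print(f"doublings: {52 / 1.5:.1f}, growth factor: {factor:.2e}")
```

Under that assumption the accumulated speed-up is of order 10^10.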

6 Processor performance evolution

7 [Diagram: disciplines contributing to scientific computing and numerical simulation] (Courtesy: Michel Deville)
Numerical Analysis: Convergence, Errors. Systems solution. Space and time discretization: FD, FE, FV, spectral methods. Symbolic algebra
Functional Analysis, Statistics, Partial differential equations, Stochastic equations, etc.
Algorithms: complexity, accuracy. Mesh Generation, CAD. Solvers: direct / iterative methods
Numerical Math., Applied Math., Scientific Computing, Numerical Simulation
Mechanical Eng., Biolog., Processing, Nuclear...
Computer Science, User Physics
Visualization. Architecture: vector, parallel, scalar, cluster. Systems, Compilers. Data management. Parallelisation: MPI. Programming, Performance
Interpretation of numerical experiment
Geophysics, Astrophysics, Weather forecast, Global Change, Plasma physics, Aerodynamics, Hydrodynamics, MHD, Rheology, Physiological fluids, Materials processing, Molten metals

8 Poisson Equation
Δu = f
Pressure equation in fluid mechanics
Thermal conduction
Electrostatic potential
Diffusion in chemistry
Diffusion in neutronics
Darcy equation for porous media
Solidification equation
MHD equilibrium
Mesh generation

9 History of the Algorithms (Courtesy: Michel Deville)
Method                      Year   Complexity
Gaussian elimination        1947   N**7
Sub-optimal SOR Iteration          N**5
Optimal SOR Iteration              N**4 log N
Cyclic Reduction                   N**3 log N
Multigrid                          N**3
Classical // Multigrid             N**3 (log N) / P
Parallel Gaussian           1985   N**7 / (P/4)
Improved // Multigrid              N**3 (log N) / P

10 History of the Algorithms
Method                      Year   Complexity
Gaussian elimination        1947   N**7
Improved // Multigrid              N**3 (log N) / P
Gaussian/Multigrid          G/M    N**4 P / 30 log N
Example: N=100, log N = 7, P=1000
G/M =
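The example on this slide can be evaluated numerically. The sketch below reads the flattened ratio as G/M = N**4 * P / (30 * log N); that grouping, the meaning of the constant 30, and the function name are assumptions, so the result is only an order-of-magnitude illustration.

```python
def gauss_over_multigrid(n, p, log_n, constant=30.0):
    """Ratio G/M = N**4 * P / (constant * log N).

    Assumption: the slide's "N 4 P / 30 log N" groups as shown here.
    """
    return n**4 * p / (constant * log_n)


# Values from the slide's example: N=100, log N=7, P=1000.
ratio = gauss_over_multigrid(n=100, p=1000, log_n=7)
print(f"G/M ~ {ratio:.2e}")
```

With this reading, parallel multigrid comes out several hundred million times cheaper than Gaussian elimination for the quoted problem size.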

11 2. Characterisations
2.1. Single compute nodes
2.2. Parallel machines
2.3. Applications
2.4. Tailor clusters to applications

12 In a box: V_mac values
V_mac = R [Mflop/s] / M [Mword/s]
Table: V_mac values for Alpha and boxes and NEC SX-4
[Table: Machine, N, R, M, V_mac; rows: Alpha server, DS, DS, NEC SX; values not shown]

13 Application in a box: V_alg and R_alg
V_alg = Operations (Ops) / Memory accesses (LS)
Example: y = y + a * x (SAXPY): Ops = 2, LS = 3 (2 loads + 1 store), V_alg = 2 / 3
Matrix*matrix multiply and add: V_alg = n / 2
R_alg = min (R, R * V_alg / V_mac) = min (R, M * V_alg)
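The bound on achievable performance defined on this slide can be sketched in a few lines. The peak rate `R` and memory bandwidth `M` below are hypothetical values chosen for illustration, not measurements from the slides; only the formulas and the SAXPY and matrix-multiply intensities come from the text.

```python
def v_mac(r_mflops, m_mwords):
    """Machine balance: peak rate R [Mflop/s] over memory bandwidth M [Mword/s]."""
    return r_mflops / m_mwords


def r_alg(r_mflops, m_mwords, v_alg):
    """Achievable rate: min(R, M * V_alg), which equals min(R, R * V_alg / V_mac)."""
    return min(r_mflops, m_mwords * v_alg)


# Hypothetical node: R = 1000 Mflop/s, M = 250 Mword/s (assumed values).
R, M = 1000.0, 250.0

saxpy_rate = r_alg(R, M, 2 / 3)     # SAXPY: 2 operations per 3 memory accesses
matmul_rate = r_alg(R, M, 100 / 2)  # matrix multiply with n = 100: V_alg = n / 2

print(f"V_mac = {v_mac(R, M):.1f}")
print(f"SAXPY  R_alg = {saxpy_rate:.1f} Mflop/s (memory bound)")
print(f"MATMUL R_alg = {matmul_rate:.1f} Mflop/s (compute bound)")
```

An algorithm is memory bound whenever V_alg < V_mac, which is why SAXPY stays far below peak on this hypothetical node while the matrix multiply reaches it.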

14 Benchmark: MATMUL algorithm, V_alg = 1
[Table: Machine, N, R, R_alg, R [Mflop/s]; rows for T0, T0(Dual) and T1; values not shown]
T0: Alpha workstation (500 MHz)
T0(Dual): Alpha server 1200 (533 MHz)
T1: Alpha server DS20e (500 MHz)
R: Measured performance

15 Benchmark: SAXPY algorithm, V_alg = 2/3
[Table: Machine, N, R, R_alg (r), R; rows for T0, T0(Dual) and T1; values not shown]
T0: Alpha workstation (500 MHz)
T0(Dual): Alpha server 1200 (533 MHz)
T1: Alpha server DS20e (500 MHz)
R: Measured performance in Mflop/s

16 Where do the Flops go? Who Cares About the Memory Hierarchy?
[Figure: Processor-DRAM memory gap (latency); performance versus time]
CPU (µproc, "Moore's Law"): 60%/yr. (2X/1.5yr)
DRAM: 9%/yr. (2X/10 yrs)
Processor-Memory Performance Gap: grows 50% / year
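The rates quoted on this slide can be cross-checked against each other. The sketch below assumes simple compound growth at the stated yearly rates; the variable names are illustrative. It shows that the slide's doubling times and gap growth are rounded figures.

```python
cpu_rate = 0.60   # CPU performance growth per year, from the slide
dram_rate = 0.09  # DRAM performance growth per year, from the slide

# Growth over the doubling periods quoted on the slide.
cpu_growth_1_5_years = (1 + cpu_rate) ** 1.5
dram_growth_10_years = (1 + dram_rate) ** 10

# Yearly growth of the ratio CPU / DRAM performance.
gap_growth = (1 + cpu_rate) / (1 + dram_rate) - 1

print(f"CPU over 1.5 years: x{cpu_growth_1_5_years:.2f}")
print(f"DRAM over 10 years: x{dram_growth_10_years:.2f}")
print(f"gap growth per year: {gap_growth:.1%}")
```

The gap growth comes out slightly below the 50% per year quoted, which is consistent with rounding on the slide.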

17 Examples of machines
Table: «Effective performance» on different machines
[Table: Machine, Type, Nproc, Peak, Eff perf; numeric values not shown]
Swiss-T1: Cluster
Baby T1: Cluster
Origin2K: NUMA
NEC SX4: Vector
Gravitor: Beowulf
* Effective performance measured with MATMULT, * estimated.

18 Cluster: γ_mac value
γ_mac = effective performance / effective bandwidth
γ_mac = N * R [Mflop/s] * <d> / C [Mword/s]
Table: The γ_mac values for Swiss-T0, Swiss-T0(Dual) and Swiss-T1 for MATMUL
Columns: Machine (network), N, R, %, N * R, C, <d>, γ_mac
T0 (Bus): * 400 * 4 *
T0(Dual) (Bus): 8* * 1000 * 4 *
Baby T1 (Switch): 6* * 2400 * 90* 1 27
T1(local) (Switch): 4* * 1600 * 60 * 1 27
T1(global) (Switch): 32* * * 400*
T1 (Fast Ethernet): 32* * 12800* 40*
* measured (SAXPY and Parkbench)
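The two complete rows of this table can be reproduced from the definition. The sketch below takes N * R, C and <d> for Baby T1 and T1(local) from the table; the function and parameter names are illustrative.

```python
def gamma_mac(total_rate_mflops, bandwidth_mwords, avg_distance=1.0):
    """gamma_mac = N * R [Mflop/s] * <d> / C [Mword/s]."""
    return total_rate_mflops * avg_distance / bandwidth_mwords


# Rows with all entries present in the table: (N * R, C, <d>).
baby_t1 = gamma_mac(2400, 90, 1)
t1_local = gamma_mac(1600, 60, 1)

print(f"Baby T1:   gamma_mac = {baby_t1:.1f}")
print(f"T1(local): gamma_mac = {t1_local:.1f}")
```

Both values round to the 27 listed in the last column.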

19 Machine performances: V_mac, γ_mac
[Figure: machine classes placed in the (V_mac, γ_mac) plane; γ_mac axis values 100 and 400, V_mac axis values 1 and 10]
Clustered SMP/NUMA (MPI)
Fast Ethernet or bus based commodity (MPI)
HP Switch based commodity cluster (MPI)
Virtual shared memory NUMA (Threads, OpenMP, MPI)
Shared memory SMP (Threads, OpenMP, MPI)
Vector machines/single processors

20 γ_alg of algorithms
γ_alg = Operations (Ops) / Communications (Comm)
Material sciences (3D Fourier analysis): γ_alg ~ 45. Beowulf insufficient, Swiss-T1 just about right
Crash analysis (3D non-linear FE): γ_alg ~ 600. Beowulf sufficient
Embarrassingly parallel: γ_alg > 1000. Data Traffic Machine (Web, DB, DM, Sequencing)

21 FFT performance comparison (γ_alg = 45 fixed), January 2001
O3K: 800 Mflop/s
SP3: 1500 Mflop/s
T3E: 1200 Mflop/s
T1: 1000 Mflop/s

22 FFT speedup (γ_mac)
[Figure: speedup versus number of processors for γ_alg = 45 fixed; curves: ideal, γ_mac = 20, γ_mac = 40, γ_mac = 80]

23 Scalability
[Figure: S versus p, with axis value 32; curves: O(1), O(p^α), "Administrative curve"]

24 Adequacy condition
γ_mac < γ_alg
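The adequacy condition can be applied to the numbers given on the earlier slides. The sketch below uses γ_mac = 27 for the switch-based Swiss-T1 and the γ_alg values of about 45 and 600 quoted for the two applications. The Fast Ethernet value is derived from N * R = 12800 and C = 40 with <d> = 1, and that <d> is an assumption because the table entry is missing.

```python
def is_adequate(gamma_mac, gamma_alg):
    """A machine suits an algorithm when gamma_mac < gamma_alg."""
    return gamma_mac < gamma_alg


machines = {
    "Swiss-T1 (switch)": 27.0,
    # Assumption: <d> = 1 for the Fast Ethernet configuration.
    "T1 (Fast Ethernet)": 12800 / 40 * 1.0,
}
applications = {
    "3D Fourier analysis": 45.0,
    "Crash analysis": 600.0,
}

verdicts = {
    (machine, app): is_adequate(g_mac, g_alg)
    for machine, g_mac in machines.items()
    for app, g_alg in applications.items()
}

for (machine, app), ok in verdicts.items():
    print(f"{machine:20s} {app:22s} {'adequate' if ok else 'inadequate'}")
```

Under these assumptions the Fast Ethernet configuration fails the condition for the Fourier analysis but passes it for crash analysis, in line with the remarks about Beowulf-class machines on the γ_alg slide.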

25 3. Network topologies
3.1. Network comparison
3.2. Fat Tree
3.3. Circulant graphs

26 Different Architectures (Courtesy: Jack Dongarra)
Classes: SIMD, Vector Processors, MIMD, Shared Memory, Distributed Shared Memory, Distributed Memory, Cluster of Processors (COW, NOW, Beowulf, Swiss-T1)
NEC SX: Vector Computer
Cray J90/T90: Vector Computer
SGI Origin: Distributed Shared Memory
HP-Convex: Distributed Shared Memory
SUN Enterprise: Shared Memory
Compaq Wildfire: Shared Memory
Cray T3E: Distributed Memory
IBM SP: Distributed Memory

27 Comparison of Network architectures
[Figure: maximum distance versus number of PUs; curves: D-Torus, 3D-Torus, 2-Ring, Ring, Fat-tree]

28 Fat-tree/Crossbars 16x16
N=8, P=8, N*P=64 PUs, X=12, BiW=32, L=64

29 Circulant graphs/crossbars 12x12
K=2 (1/3): N=8, P=8, X=8, BiW=8, L=16
K=3 (1/3/5): N=11, P=6, X=11, BiW=18, L=33
K=4 (1/3/5/7): N=16, P=4, X=16, BiW=32, L=64
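The three circulant configurations can be built and measured with a short script. The sketch below assumes that "(1/3)", "(1/3/5)" and "(1/3/5/7)" denote the jump distances of a circulant graph on N nodes, and it models only the node-level graph, ignoring the P boxes per node and the crossbar ports. All function names are illustrative.

```python
from collections import deque


def circulant_graph(n, jumps):
    """Undirected circulant graph: node v is linked to v + j and v - j (mod n)."""
    adjacency = {v: set() for v in range(n)}
    for v in range(n):
        for j in jumps:
            for w in ((v + j) % n, (v - j) % n):
                if w != v:
                    adjacency[v].add(w)
                    adjacency[w].add(v)
    return adjacency


def link_count(adjacency):
    """Number of undirected links (each link is stored at both endpoints)."""
    return sum(len(neighbours) for neighbours in adjacency.values()) // 2


def max_and_mean_distance(adjacency):
    """Breadth-first search from node 0; by symmetry every node sees the same distances."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    others = [d for v, d in dist.items() if v != 0]
    return max(others), sum(others) / len(others)


# K -> (N, jump distances), as read from the slide.
configs = {2: (8, (1, 3)), 3: (11, (1, 3, 5)), 4: (16, (1, 3, 5, 7))}

results = {}
for k, (n, jumps) in configs.items():
    graph = circulant_graph(n, jumps)
    d_max, d_mean = max_and_mean_distance(graph)
    results[k] = (link_count(graph), d_max, d_mean)
    print(f"K={k}: N={n}, links={results[k][0]}, max distance={d_max}, mean distance={d_mean:.2f}")
```

Under this reading the link counts agree with the L values on the slide (16, 33 and 64), which supports interpreting L as N * K.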

30 Fat-tree/Circulant graphs
Table: Comparison of Fat-tree and circulant graph architectures
[Table: columns Parameter, Fat-tree (crossbar 16x), Circulant graph K=2 (1/3), Circulant graph K=3 (1/3/5), Circulant graph K=4 (1/3/5/7) (crossbar 12x); rows N, P, N*P, D, Dm, BiW, L, w, T=wP; values not shown]
N: Number of computing nodes
P: Number of boxes per node
N*P: Total number of boxes
D: Maximum distance between two nodes
Dm: Average distance between two nodes (load for a point-to-point operation)
BiW: Bisectional width
L: Number of links
w: Load factor for an all-to-all communication operation
T: Number of steps, or load, to perform an all-to-all operation

31 Beowulf Cluster
[Diagram: PCs running Linux, connected through a Fast Ethernet switch]

32 Swiss-T1 FE Cluster
[Diagram: DS20 nodes running Compaq Tru64 Unix, connected through a Fast Ethernet switch]

33 Swiss-T1, 64 Processors
Fast Ethernet/Linpack: 39% peak performance
[Figure: Mflop/s versus matrix order]

What are Clusters? Why Clusters? - a Short History
What are Clusters?
Our definition: A parallel machine built of commodity components and running commodity software.
A cluster consists of nodes with one or more processors (CPUs), memory that is shared by