An evaluation of the Performance and Scalability of a Yellowstone Test-System in 5 Benchmarks

Bethany Hampton

1 An evaluation of the Performance and Scalability of a Yellowstone Test-System in 5 Benchmarks
WRF Model
NAS Parallel Benchmark
Intel MPI Benchmark
My own personal benchmark
HPC Challenge Benchmark

2 Abstract
The purpose of this presentation is to demonstrate the strengths and weaknesses of Yakkydoodle and Erebus and to provide several preliminary benchmark results. Although results of the well-known hardware benchmarks will be published upon completion in October of 2012 (if the build goes as currently scheduled), some other meteorology-based benchmarks will also be used. Benchmarks that replicate conditions similar to those in atmospheric-science operations include matrix multiplication, Fast Fourier Transforms, and the well-known WRF benchmark. Different situations will be taken into account, and the scalability of the benchmarks will be evaluated.

3 Project Goals
Examine the performance of systems similar to Yellowstone, in particular compared to other systems
Evaluate scalability (performance increase relative to the number of cores)
Present preliminary benchmark results
Diagnose bottlenecks
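The scalability goal above (performance increase relative to the number of cores) is commonly expressed as speedup and parallel efficiency. The following is a minimal sketch of that arithmetic, not the method used in the presentation. The function names and the timing values are hypothetical placeholders, not measurements from Yakkydoodle or Erebus.

```python
def speedup(t_base, t_parallel):
    """Speedup S = T_base / T_parallel for the same problem."""
    if t_base <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_base / t_parallel


def efficiency(t_base, t_parallel, scale):
    """Parallel efficiency E = S / scale, where scale is the ratio of
    cores (or nodes) used in the parallel run to those in the base run."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return speedup(t_base, t_parallel) / scale


# Placeholder run times in seconds, keyed by node count.
# These are illustrative values only, NOT results from the slides.
runtimes = {1: 100.0, 2: 60.0, 4: 40.0}
base = runtimes[1]
for nodes, seconds in sorted(runtimes.items()):
    s = speedup(base, seconds)
    e = efficiency(base, seconds, nodes)
    print(f"{nodes} node(s): speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency close to 1 means the run scales almost linearly; values well below 1 suggest a bottleneck such as communication or memory contention.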

4 Description of Yakkydoodle
4X Intel i7 Xeon GHz (same CPUs as Yellowstone)
2 dual server motherboards (nodes) with 2 CPUs per board, connected by an Ethernet network
Each CPU is capable of running 16 threads, for a maximum of 64 threads

5 Description of Erebus
Intel i7 Xeon E GHz (same CPUs as Yellowstone)
84 nodes with 2 CPUs per board, connected by an InfiniBand network
Each CPU is capable of running 16 threads for a maximum of 64 threads

6 NAS Parallel Benchmark
Designed by NASA to evaluate the performance of parallel supercomputers
Derived from computational fluid dynamics (CFD) applications; consists of five kernels and three pseudo-applications
BT - Block Tri-diagonal solver
SP - Scalar Penta-diagonal solver
LU - Lower-Upper Gauss-Seidel solver (linear system of equations)

7 Yakkydoodle NAS-B Parallel Benchmark
[Chart: BT-MZ-B, LU-MZ-B, and SP-MZ-B results vs. number of nodes]

8 HPC Challenge
Designed to evaluate 7 key measures that represent frequently used numerical operation patterns in scientific computing
FFT: Measures the floating-point rate of execution of a double-precision complex one-dimensional Discrete Fourier Transform (DFT), which is widely used in important scientific applications such as climate modeling, solid-earth geophysics (including seismic analysis), and the design of new materials
Random Access: Measures the rate of random integer updates of memory (in giga-updates per second)
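To make the Random Access measure concrete, here is a simplified single-process sketch of the access pattern: XOR updates at pseudo-random positions in a power-of-two table, reported in giga-updates per second. It is not the HPC Challenge implementation. The function name `random_access_gups`, the use of Python's `random` module as the update stream, and the table size are assumptions made for illustration.

```python
import random
import time


def random_access_gups(table, n_updates, seed=0):
    """Apply n_updates XOR updates at pseudo-random table positions and
    return the measured rate in giga-updates per second (GUPS).

    The table length must be a power of two so that a bit mask can be
    used to pick the index (an assumption of this sketch).
    """
    size = len(table)
    if size == 0 or size & (size - 1) != 0:
        raise ValueError("table length must be a power of two")
    mask = size - 1
    rng = random.Random(seed)  # stand-in for the benchmark's random stream
    start = time.perf_counter()
    for _ in range(n_updates):
        value = rng.getrandbits(64)
        table[value & mask] ^= value  # read-modify-write at a random slot
    elapsed = max(time.perf_counter() - start, 1e-12)  # avoid divide by zero
    return n_updates / elapsed / 1e9


if __name__ == "__main__":
    table = list(range(1 << 16))
    print(f"{random_access_gups(table, 200_000):.6f} GUPS")
```

Because XOR is its own inverse, running the same update stream twice restores the original table, which gives a simple correctness check for the sketch.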

9 Yakkydoodle vs. Similar Clusters - HPC Challenge
[Table: PTRANS (GB/s), HPL (GFlops), G-FFTE (GB/s), and RandomAccess for Xeon W5560 x8, Xeon X2500, Yakkydoodle, Cray T3E, 4X Athlon X2, and 4X Q6600]

10 Intel MPI Benchmark
A set of benchmarks aimed at measuring the most important MPI functions
The majority of the time is spent in MPI_Bcast (Mellanox, July 2012)
This is because processes often end up waiting for pending blocking and non-blocking sends and receives to complete
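One way to see why broadcast cost depends on the number of processes is to write out a communication schedule. The sketch below models a binomial-tree broadcast, a common textbook scheme in which the set of ranks holding the data doubles each round. Whether the MPI library used on Yakkydoodle or Erebus follows this scheme is an assumption; real implementations choose among several algorithms. The function name is hypothetical, and no MPI library is called.

```python
def binomial_broadcast_schedule(n_procs):
    """Return a list of rounds for a binomial-tree broadcast from rank 0.

    Each round is a list of (source, destination) pairs. In every round,
    each rank that already holds the data sends to one rank that does not,
    so the number of informed ranks doubles until all n_procs have it.
    """
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    rounds = []
    informed = 1  # only rank 0 holds the data at the start
    while informed < n_procs:
        sends = [(src, src + informed)
                 for src in range(informed)
                 if src + informed < n_procs]
        rounds.append(sends)
        informed *= 2
    return rounds


if __name__ == "__main__":
    for step, sends in enumerate(binomial_broadcast_schedule(8)):
        print(f"round {step}: {sends}")
```

In this model the number of rounds grows with the logarithm of the process count, and every receiver waits until its sender has the data, which is consistent with the waiting behavior described above.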

11 Intel MPI Benchmark BCAST Function
[Chart: number of processes vs. 1 MB broadcast time; axes: Time (microseconds), Nodes; series: Yakkydoodle, Erebus]
Chart of 1 MB broadcast time (milliseconds) using 1 and 2 nodes on Yakkydoodle and 1, 2, and 3 nodes on Erebus (lower is better).

12 Matrix Multiplication Time
[Chart series: nodes, 2 processes; 1 node, 4 processes; 2 nodes, 2 processes; 2 nodes, 4 processes]
Increasing the number of nodes leads to a greater ability to handle multitasking

13 Matrix Multiplication
[Chart: number of iterations vs. calculation time, 2 variables; series: OpenMP time, 1 core; MPI time, 2 cores]
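The matrix multiplication benchmark itself is not shown in the slides. As a rough single-process illustration of timing calculation time against the number of iterations, a naive version might look like the sketch below. The function names, matrix size, and fill pattern are all hypothetical, and it does not use OpenMP or MPI.

```python
import time


def matmul(a, b):
    """Naive triple-loop product of two matrices stored as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(len(a))]


def time_matmul(n, iterations):
    """Return the seconds taken to multiply two n x n matrices
    `iterations` times in a row."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    for _ in range(iterations):
        matmul(a, b)
    return time.perf_counter() - start


if __name__ == "__main__":
    for iterations in (1, 2, 4, 8):
        print(f"{iterations} iterations: {time_matmul(40, iterations):.4f} s")
```

On a single core the time should grow roughly linearly with the iteration count; a parallel version would be compared against this baseline.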

14 WRF Benchmark (Version 3.4)
WRF compiled with GNU Fortran and GCC el6 and NetCDF 4.0
Benchmarked on 1 node and 2 nodes on Yakkydoodle in 3 different cases
Scaling was better for larger grid sizes and longer runs

15 Yakkydoodle WRF Benchmark (Version 3.0), Non-Parallel Version
[Chart: CPU time (s) and total run time for em_b_wave, em_real, and em_hill2d]

16 Yakkydoodle WRF-3.4 Parallel Benchmark
[Chart: CONUS 12 km (Oct. 2001), em_b_wave (ideal case, 41x81), and Eastern US Jan. 2000 (64x71); series: node runtime, 2 node runtime]

17 Erebus WRF-3.4 Parallel Benchmark
[Chart: CONUS 12 km (Oct. 2001), em_b_wave (ideal case, 41x81 km), and Eastern US January 2000 (64x71 km); series: one node runtime, two node runtime]

18 Conclusions
Yakkydoodle seemed to perform well in FFT
Scalability was best for larger grid sizes and longer runs
Problem size tends to be more scalable, as the number of zones can be increased in longer runs
Faster network speeds led to better benchmark results on Erebus in the Intel MPI Benchmark and in WRF when using two nodes
Bottlenecks occur when a high number of processes are executed simultaneously on only one node

19 Special Thanks
Jeph Cowan, Shawn Needham, Nate Rini, and Irfan Elahi
Rich Loft
Kristin Mooney, Linda Yellin
SIPARCS 2012 Interns
UCAR Help Desk
UCAR, NCAR
