MM5 Modeling System
Performance Research and Profiling
March 2009
Note
- The following research was performed under the HPC Advisory Council activities
- AMD, Dell, Mellanox HPC Advisory Council Cluster Center
- For more information please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com
MM5
- Fifth-Generation NCAR/Penn State Mesoscale Model
  - Designed to simulate or predict mesoscale and regional-scale atmospheric circulation
- Mesoscale meteorology is the study of weather systems smaller than synoptic-scale systems but larger than microscale and storm-scale cumulus systems
  - Horizontal dimensions generally range from around 5 kilometers to several hundred kilometers
  - Examples: sea breezes, squall lines, and mesoscale convective complexes
- http://www.mmm.ucar.edu/mm5/
- Parallel version of MM5 with MPI enabled
  - Supports execution of the model on distributed-memory (DM) parallel machines (see the sketch below)
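The DM-parallel version decomposes the horizontal grid across MPI processes, so each process must exchange halo (ghost-cell) rows with its neighbors as the simulation advances. A minimal sketch of that nearest-neighbor exchange pattern, assuming a simple 1-D decomposition with made-up dimensions (illustrative only, not MM5's actual code):

```c
/* Minimal 1-D halo exchange sketch: each rank owns NLOC grid rows plus one
 * ghost row above and below, refreshed from its neighbors. Illustrative
 * only -- not MM5's actual decomposition or code. */
#include <mpi.h>
#include <stdio.h>

#define NLOC 64   /* rows owned per rank (assumed) */
#define NX   136  /* points per row (assumed)      */

int main(int argc, char **argv)
{
    double grid[NLOC + 2][NX];  /* rows 0 and NLOC+1 are ghost rows */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NLOC + 2; i++)
        for (int j = 0; j < NX; j++)
            grid[i][j] = rank;  /* placeholder field values */

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Send my first owned row up; receive my lower ghost row from below. */
    MPI_Sendrecv(grid[1],        NX, MPI_DOUBLE, up,   0,
                 grid[NLOC + 1], NX, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my last owned row down; receive my upper ghost row from above. */
    MPI_Sendrecv(grid[NLOC],     NX, MPI_DOUBLE, down, 1,
                 grid[0],        NX, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0) printf("halo exchange complete on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
```

Exchanges of this kind are the small-to-medium MPI_Send/MPI_Recv traffic that dominates the profiles later in this report.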
Objectives
- The presented research was done to provide best practices for
  - MM5 performance benchmarking
  - Interconnect performance comparisons
  - Ways to increase MM5 productivity
  - Understanding MM5 communication patterns
Test Cluster Configuration
- Dell PowerEdge SC 1435 24-node cluster
- Quad-Core AMD Opteron 2382 ("Shanghai") CPUs
- Mellanox InfiniBand ConnectX DDR HCAs
- Mellanox InfiniBand DDR switch
- Memory: 16GB DDR2 800MHz per node
- OS: RHEL5U2 with OFED 1.4 InfiniBand software stack
- MPI: Platform MPI 5.6.4, MVAPICH-1.1.0
- Application: MM5 Version 3
- Benchmark workload: T3A benchmark test case from NCAR
Mellanox InfiniBand Solutions
- Industry standard
  - Hardware, software, cabling, management
  - Designed for clustering and storage interconnect
- Performance
  - 40Gb/s node-to-node
  - 120Gb/s switch-to-switch
  - 1us application latency
  - Most aggressive roadmap in the industry
- Reliable with congestion management
- Efficient RDMA and transport offload
  - Kernel bypass; CPU focuses on application processing
- Scalable for Petascale computing and beyond
- End-to-end quality of service
- Virtualization acceleration
- I/O consolidation, including storage
[Chart: "The InfiniBand Performance Gap is Increasing" - bandwidth roadmap for InfiniBand (up to 80Gb/s 4X and 240Gb/s 12X) versus Ethernet and Fibre Channel; InfiniBand delivers the lowest latency]
Quad-Core AMD Opteron Processor
- Performance
  - Quad-core, enhanced CPU IPC
  - 4x 512K L2 cache, 6MB L3 cache
- Direct Connect Architecture
  - HyperTransport Technology: up to 24 GB/s peak per processor
- Floating point
  - 128-bit FPU per core, 4 FLOPS/clk peak per core
- Integrated memory controller
  - Up to 12.8 GB/s with DDR2-800 MHz or DDR2-667 MHz
- Scalability
  - 48-bit physical addressing
- Compatibility
  - Same power/thermal envelopes as 2nd/3rd generation AMD Opteron processors
[Diagram: Direct Connect Architecture - processors linked by 8 GB/s HyperTransport to each other, PCI-E bridges, I/O hub, USB/PCI, and dual-channel registered DDR2 memory]
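A quick back-of-the-envelope from these figures, assuming the 2.6 GHz clock of the Opteron 2382 used in this cluster: peak floating-point throughput per node is 2 sockets x 4 cores x 4 FLOPS/clk x 2.6 GHz ≈ 83.2 Gflop/s, and the 12.8 GB/s memory figure corresponds to DDR2-800's 800 MT/s x 8 bytes x 2 channels per integrated controller.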
Dell PowerEdge Servers - Helping Simplify IT
- System structure and sizing guidelines
  - 24-node cluster built with Dell PowerEdge SC 1435 servers
  - Servers optimized for High Performance Computing environments
  - Building-block foundations for best price/performance and performance/watt
- Dell HPC solutions
  - Scalable architectures for high performance and productivity
  - Dell's comprehensive HPC services help manage the lifecycle requirements
  - Integrated, tested and validated architectures
- Workload modeling
  - Optimized system size, configuration and workloads
  - Test-bed benchmarks
  - ISV applications characterization
  - Best practices and usage analysis
MM5 Benchmark Results - Interconnect
- Input data: T3A
  - 36KM resolution, grid size 112x136, 33 vertical levels
  - 81-second time step, 3-hour forecast
- InfiniBand DDR delivers higher performance at every cluster size
  - Up to 46% versus GigE and 30% versus 10GigE
[Chart: MM5 Benchmark Results - T3A; Mflop/s versus number of nodes (4-22) for GigE, 10GigE, and InfiniBand-DDR, Platform MPI; higher is better]
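For scale: at an 81-second time step, the 3-hour forecast works out to 3 x 3600 / 81 ≈ 133 model time steps.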
MM5 Benchmark Results - Productivity
- InfiniBand DDR increases productivity by allowing multiple jobs to run simultaneously
  - Providing the productivity required for weather simulations
- Two cases are presented
  - A single job over the entire system
  - Four jobs, each on two cores per CPU per server
- Four jobs per node increase productivity by up to 35%
  - Multiple MPI jobs per node work well with Platform MPI thanks to its dynamic binding of processes to cores (see the sketch below)
[Chart: MM5 Benchmark Results - T3A, InfiniBand DDR; Mflop/s versus number of nodes (4-20) for 1 job per node and 4 jobs per node, Platform MPI; higher is better]
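Platform MPI performs this binding automatically. As a rough illustration of the underlying mechanism on Linux, a minimal sketch using the standard sched_setaffinity call (not Platform MPI's actual implementation; the core number here is arbitrary):

```c
/* Minimal sketch of pinning the calling process to one CPU core on Linux.
 * Platform MPI does this (and more) automatically; this only illustrates
 * the underlying OS mechanism. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* pid 0 = apply to the calling process */
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
    if (pin_to_core(0) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 0\n");
    return 0;
}
```

Binding keeps each rank close to its NUMA node's memory, which is why it matters on a Direct Connect Architecture system like this one.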
MM5 Performance Results - MPI
- Platform MPI demonstrates higher performance than MVAPICH
  - Up to 25% higher performance
  - Platform MPI's advantage increases with cluster size
[Chart: MM5 Benchmark Results - T3A; Mflop/s versus number of nodes (4-22) for MVAPICH and Platform MPI, single job on each node; higher is better]
MM5 Profiling - MPI Functions
- MPI_Send and MPI_Recv are the most heavily used MPI functions in MM5 (a sketch of how such counts can be gathered follows)
[Chart: MM5 MPI Profiling; number of messages per MPI function (MPI_Bcast, MPI_Comm_size, MPI_Gather, MPI_Irecv, MPI_Keyval_free, MPI_Recv, MPI_Send, MPI_Wait) at 4, 8, 12, 20, and 22 nodes]
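Profiles like this are commonly gathered by interposing on the MPI profiling (PMPI) interface. A minimal call-counting sketch, assuming it is compiled into a library and linked ahead of the MPI library (the tool actually used in this study is not identified, so this is illustrative only):

```c
/* Minimal PMPI interposition sketch: count MPI_Send/MPI_Recv calls per
 * rank and report at MPI_Finalize. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

static long send_calls = 0, recv_calls = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    send_calls++;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Recv(void *buf, int count, MPI_Datatype type,
             int src, int tag, MPI_Comm comm, MPI_Status *status)
{
    recv_calls++;
    return PMPI_Recv(buf, count, type, src, tag, comm, status);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld sends, %ld recvs\n", rank, send_calls, recv_calls);
    return PMPI_Finalize();
}
```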
MM5 Profiling - Data Transferred
- Most outgoing MPI messages are smaller than 128B
[Chart: MM5 MPI Profiling (MPI_Send); number of messages per size bin ([0-128B), [128B-1K), [1K-8K), [8K-256K)) at 4, 8, 12, 20, and 22 nodes]
MM5 Profiling - Data Transferred
- Most received MPI messages fall between 128B and 1K
- The total number of medium-size messages increases with cluster size (the sketch below shows how such size histograms can be collected)
[Chart: MM5 MPI Profiling (MPI_Recv); number of messages per size bin ([0-128B), [128B-1K), [1K-8K), [8K-256K)) at 4, 8, 12, 20, and 22 nodes]
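The earlier PMPI sketch extends naturally to these size histograms: bucket each message by its byte size, using the bin edges from the charts above (again illustrative, not the study's actual tool):

```c
/* PMPI sketch extending the call counter to a message-size histogram for
 * MPI_Send, using the bins shown in these charts. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

static long bins[4];  /* [0,128B) [128B,1K) [1K,8K) [8K,256K) */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    PMPI_Type_size(type, &type_size);          /* bytes per element */
    long bytes = (long)count * type_size;

    if      (bytes < 128)        bins[0]++;
    else if (bytes < 1024)       bins[1]++;
    else if (bytes < 8 * 1024)   bins[2]++;
    else if (bytes < 256 * 1024) bins[3]++;

    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d sends: <128B=%ld <1K=%ld <8K=%ld <256K=%ld\n",
           rank, bins[0], bins[1], bins[2], bins[3]);
    return PMPI_Finalize();
}
```

The same wrapper shape applied to MPI_Recv yields the receive-side histogram.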
MM5 Profiling Summary - Interconnect
- MM5 was profiled to determine its networking dependency
- Majority of data is transferred between compute nodes
  - Small to medium message sizes
  - Data transferred increases with cluster size
- Most used message sizes
  - <128B messages for MPI_Send
  - 128B-1KB and 8K-256K for MPI_Recv
- Total number of messages increases with cluster size
- Interconnect effects on MM5 performance
  - Interconnect latency and throughput in the <256KB message range (see the ping-pong sketch below)
  - Interconnect latency and throughput become critical as cluster size grows
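Interconnect sensitivity in the <256KB range is typically quantified with a two-rank ping-pong test. A minimal sketch of the standard technique (not part of the original study; sweep `bytes` across the range of interest):

```c
/* Minimal MPI ping-pong sketch: measure latency and bandwidth at one
 * message size between ranks 0 and 1. Standard technique, not part of
 * the original study. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int bytes = 1024, iters = 1000;  /* one point in the sweep */
    char *buf = malloc(bytes);
    int rank;

    memset(buf, 0, bytes);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)  /* half a round trip = one-way latency */
        printf("%dB: %.2f us one-way, %.2f MB/s\n", bytes,
               dt / iters / 2 * 1e6, bytes / (dt / iters / 2) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```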
MM5 Performance with Power Management
- Test scenario
  - 24 servers, 4 processes per node, 2 processes per CPU (socket)
- Similar performance with power management enabled or disabled
  - Only 1.4% performance degradation
[Chart: MM5 Benchmark Results - T3A, InfiniBand DDR; Gflop/s with power management enabled versus disabled; higher is better]
MM5 Benchmark Power Consumption
- Power management reduces total system power consumption by 5.1%
[Chart: MM5 Benchmark - T3A power consumption, InfiniBand DDR; Wh/iteration with power management enabled versus disabled]
MM5 Benchmark Power Cost Savings
- Power management saves $673/year for the 24-node cluster
- As cluster size increases, bigger savings are expected
- $/year = total power consumption per year (kWh) * $0.20
- For more information: http://enterprise.amd.com/downloads/svrpwrusecompletefinal.pdf
[Chart: MM5 Benchmark - T3A power cost comparison, 24-node cluster, InfiniBand DDR; $/year with power management enabled versus disabled, roughly a 7% difference]
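Working backwards from the slide's formula, and assuming year-round operation (8,760 hours): $673 / $0.20 per kWh ≈ 3,365 kWh saved per year, an average draw reduction of about 384W across the cluster, or roughly 16W per node.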
Conclusions
- MM5 is widely used weather simulation software
- MM5 performance and productivity rely on
  - Scalable HPC systems and interconnect solutions
  - Low-latency, high-throughput interconnect technology
  - NUMA-aware application behavior for fast access to memory
- Sensible job distribution can dramatically improve productivity
  - Increasing the number of jobs per day while maintaining fast run times
- Interconnect comparison shows InfiniBand DDR delivers superior performance and productivity
  - Versus GigE and 10GigE, due to its throughput and latency advantages
- Power management provides a 5.1% saving in power consumption per 24-node system
  - $673 power savings per year for the 24-node cluster
  - Power savings increase with cluster size
Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. Mellanox undertakes no duty and assumes no obligation to update or correct any information presented herein.