Evaluation of Parallel I/O Performance and Energy with Frequency Scaling on Cray XC30
Suren Byna and Brian Austin
Lawrence Berkeley National Laboratory
Energy Efficiency at Exascale
- A design goal for future exascale systems: power consumption of less than 20 MW
- Dynamic voltage and frequency scaling (DVFS) is a method to vary the power delivered to processors
  o Lowering the frequency saves CPU power
  o When CPUs have to compute in low-power states, DVFS may degrade performance
- Several efforts aim to avoid or reduce performance degradation under DVFS
Large-scale Scientific Simulations
- Large-scale scientific simulations use a significant portion of supercomputers, e.g., VPIC and Flash
  [Figure: VPIC simulation on Hopper]
- They produce large amounts of data, e.g., 10 trillion particles with 8 properties each: ~300 TB
DVFS during I/O Phases
- Simulations interleave computation and I/O phases
- Parallel I/O is often a collective operation writing to a single shared file
- If processors are idle or not computing during I/O, can they be put in a low-power state?
- What is the impact of DVFS during I/O phases?
Parallel I/O
- Parallel I/O software stack: application, high-level I/O libraries and data models, I/O middleware, parallel file system
- Each layer offers options for performance optimization
- Complex inter-dependencies among layers
- Stack (top to bottom); see the sketch below for how an application drives it:
  Application
  High-level I/O library (HDF5, NetCDF, etc.)
  I/O middleware (MPI-IO, POSIX)
  I/O optimization layer (redirection, forwarding, etc.)
  Parallel file system
  Storage hardware
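To make the middleware layer concrete, here is a minimal sketch (not from the slides) of a collective write at the MPI-IO level, the layer that HDF5/H5Part sit on top of; the file name and per-rank element count are illustrative.

/* Sketch: each rank writes a contiguous block of doubles to one shared file
 * with a collective MPI-IO call. Counts and file name are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                        /* elements per rank */
    double *buf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) buf[i] = rank + i * 1e-6;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank r owns the r-th block of the file; the _all call is collective,
     * letting the MPI-IO layer aggregate and align requests for Lustre. */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}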
Experimental Setup
- NERSC Edison
  o Cray XC30
  o Each node has two 2.4 GHz 12-core Intel Ivy Bridge CPUs
  o 64 GB DDR3 DRAM
  o Cray Aries interconnect
- File system
  o Sonexion 1600 appliance with Lustre
  o 144 OSTs with 72 GB/s peak I/O bandwidth
  o 32 MB stripe size
Applications
- VPIC-IO: I/O kernel from a plasma physics simulation
  o VPIC is developed at LANL; the HDF5 I/O was developed at LBNL
  o H5Part and HDF5 I/O
  o Each MPI process writes data for 8 million particles
  o Each particle has 8 properties
  o Each property is stored as a 1D HDF5 dataset (see the sketch below)
- VORPAL-IO: I/O kernel from an accelerator physics simulation
  o VORPAL is developed at Tech-X; the I/O kernel was extracted at LBNL
  o H5Block and HDF5 I/O
  o Each process writes a 3D block of 100 x 100 x 60
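A hedged sketch of the VPIC-IO write pattern described above: one shared HDF5 file with one 1D dataset per particle property, each rank writing a contiguous hyperslab collectively. It uses the HDF5 C API directly rather than H5Part, and the property names, element type, and file name are illustrative.

/* Sketch of the VPIC-IO pattern: one 1D HDF5 dataset per particle property,
 * written collectively by all ranks. Names and types are illustrative. */
#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>

#define NPROPS 8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t np_local = 8 * 1024 * 1024;          /* 8M particles per rank */
    hsize_t np_global = np_local * (hsize_t)nprocs;
    hsize_t offset    = np_local * (hsize_t)rank;
    const char *props[NPROPS] = {"x","y","z","px","py","pz","id1","id2"};

    float *buf = malloc(np_local * sizeof(float));
    for (hsize_t i = 0; i < np_local; i++) buf[i] = (float)(offset + i);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);            /* HDF5 over MPI-IO */
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("particles.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);           /* collective writes */
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    for (int p = 0; p < NPROPS; p++) {
        hid_t fspace = H5Screate_simple(1, &np_global, NULL);
        hid_t dset   = H5Dcreate(file, props[p], H5T_NATIVE_FLOAT, fspace,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hid_t mspace = H5Screate_simple(1, &np_local, NULL);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, NULL,
                            &np_local, NULL);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, fspace, dxpl, buf);
        H5Sclose(mspace); H5Dclose(dset); H5Sclose(fspace);
    }

    H5Pclose(dxpl); H5Fclose(file); H5Pclose(fapl);
    free(buf);
    MPI_Finalize();
    return 0;
}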
Measurements
- I/O time
  o Maximum time across all processes
  o Includes file open, close, and write times
  o Ran each experiment 5 times and selected the best-performing run
- Energy and power measurements
  o Cray Power Management (PM) counters: /sys/cray/pm_counters/cpu_{energy,power}
  o PMON: a small library developed to obtain the elapsed power/energy and aggregate it across all nodes involved in running a job
- Setting the CPU frequency for an I/O kernel run, e.g.: aprun --p-state=1800000 -n 2048 exec args
- The power and energy measurements cover the compute nodes only, not the I/O subsystem
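A minimal sketch of this measurement approach (not the authors' PMON library): read the node-level Cray PM counter before and after the I/O phase and sum the per-node deltas with MPI. Selecting one rank per node via MPI_Comm_split_type and the helper name are illustrative.

/* Sketch: sample /sys/cray/pm_counters/cpu_energy around an I/O phase and
 * sum the per-node deltas over MPI. One rank per node reads the counter. */
#include <mpi.h>
#include <stdio.h>

static unsigned long long read_counter(const char *path)
{
    unsigned long long v = 0;
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%llu", &v); fclose(f); }     /* counter value in joules */
    return v;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, world_rank;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    const char *path = "/sys/cray/pm_counters/cpu_energy";
    unsigned long long e0 = (node_rank == 0) ? read_counter(path) : 0;

    /* ... I/O phase under test goes here ... */
    MPI_Barrier(MPI_COMM_WORLD);

    unsigned long long e1 = (node_rank == 0) ? read_counter(path) : 0;
    unsigned long long local = e1 - e0, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (world_rank == 0)
        printf("CPU energy over the I/O phase: %llu J (all nodes)\n", total);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

Such a harness would then be launched at each target frequency, e.g. aprun --p-state=1800000 -n 2048 ./io_kernel, as described above.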
Scaling Tests
- Weak scaling:
  Cores   VPIC data size   VORPAL data size
  2K      512 GB           1.6 TB
  4K      1 TB             3.2 TB
  8K      2 TB             6.4 TB
  16K     4 TB             12.8 TB
- Strong scaling:
  Cores: 2K, 4K, 8K
  VPIC data size: 1 TB (fixed)
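As a consistency check on the weak-scaling sizes (assuming each of the 8 particle properties is stored as a 4-byte value, i.e., 32 bytes per particle): 2,048 processes x 8 Mi particles x 32 bytes = 512 GiB for VPIC at 2K cores, doubling with the core count up to 4 TiB at 16K cores.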
Weak-scaling Power Consumption
[Plot: power (in thousands of Watts) vs. P-state (frequency in GHz, 1.2 through the default) for VPIC and VORPAL at 2K, 4K, 8K, and 16K cores; lower is better]
Weak-scaling Energy Consumption
[Plot: energy consumption vs. P-state; lower is better]
Weak-scaling I/O Rate
[Plot: I/O rate (GB/s) vs. P-state (frequency in GHz) for VPIC and VORPAL at 2K, 4K, 8K, and 16K cores; higher is better]
Significant I/O performance degradation at low frequencies
Weak-scaling Energy Efficiency
[Plot: energy efficiency, measured as I/O rate per Watt (MB/s/W), vs. P-state (frequency in GHz) for VPIC and VORPAL at 2K, 4K, 8K, and 16K cores; higher is better]
Weak-scaling Energy Savings and Efficiency Improvement
- Lowest energy consumption compared to energy at the default frequency
- Highest energy efficiency compared to efficiency at the default frequency
- The best frequency is 2.2 GHz for 5 of the 8 configurations; the others peak at 1.8 and 1.4 GHz
Strong-scaling Energy and Energy Efficiency
[Plots: energy and energy efficiency vs. P-state, with the default frequency marked]
Best frequency: 2K cores -> 2.4 GHz; 4K cores -> 2.2 GHz; 8K cores -> 1.8 GHz
Strong-scaling Trends
[Plot: factor of increase from 2K to 8K cores for energy, power, I/O rate, and energy efficiency vs. P-state (frequency in GHz)]
- Power increases by 4X from 2K to 8K cores
- Energy efficiency decreases by 50%
Observations & Unknowns Decreased I/O rate with frequency? o CPU and node activity during I/O phase o Using fewer cores per node or pinning fewer cores to perform I/O o MPI-IO in independent mode o I/O performance variation o Fine grain power state settings Some cores at high frequency and some at lower I/O phase energy consumption with new memory and storage hierarchy? o Node level NVM o Burst buffers
Thanks!
- Advanced Scientific Computing Research (ASCR) for funding the Power-aware Data Management project
  o Program Manager: Lucy Nowell
  o Project PI: Hank Childs