Evaluation of Parallel I/O Performance and Energy with Frequency Scaling on Cray XC30
Suren Byna and Brian Austin
Lawrence Berkeley National Laboratory

Energy efficiency at Exascale
- A design goal for future exascale systems is power consumption below 20 MW
- Dynamic voltage and frequency scaling (DVFS) is a method to deliver a variable amount of energy to processors
- Lowering the frequency saves CPU power
- When CPUs have to compute in low-power states, DVFS may degrade performance
- Several efforts aim to avoid or reduce the performance degradation caused by DVFS

Large-scale Scientific Simulations
- Large-scale scientific simulations (e.g., VPIC, Flash) use a significant portion of supercomputer time
- They produce large amounts of data: a VPIC run on Hopper with 10 trillion particles and 8 properties per particle writes ~300 TB

DVFS during I/O phases
- Simulations interleave computation and I/O phases
- Parallel I/O is often a collective operation writing to a single file
- If processors are idle or not computing during I/O, can they be placed in a low-power state?
- What is the impact of DVFS during I/O phases?

Parallel I/O
- Parallel I/O software stack, top to bottom:
  Application
  High-level I/O library (HDF5, NetCDF, etc.)
  I/O Middleware (MPI-IO, POSIX)
  I/O optimization layer (redirection, forwarding, etc.)
  Parallel File System
  Storage Hardware
- Options for performance optimization exist at every layer (see the sketch below)
- Complex inter-dependencies among layers
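
As an illustration of middleware-level tuning in this stack, here is a minimal sketch, not taken from the slides, of opening a file through MPI-IO with ROMIO-style hints that influence collective buffering and Lustre striping. The hint names are common ones, but their availability and effect depend on the MPI implementation and file system.

```c
/* Minimal sketch (not from the slides): performance-tuning hooks at the
 * I/O-middleware layer of the stack above.  The ROMIO/Cray MPI-IO hint
 * names used here are common ones, but availability and defaults depend
 * on the MPI implementation and file system. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* collective buffering for writes */
    MPI_Info_set(info, "striping_factor", "144");     /* Lustre stripe count (OSTs) */
    MPI_Info_set(info, "striping_unit", "33554432");  /* 32 MB stripe size, in bytes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes (MPI_File_write_at_all), or writes issued
     * indirectly through a high-level library such as HDF5, go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```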

Experimental Setup
NERSC Edison:
- Cray XC30
- Each node has two 2.4 GHz 12-core Intel Ivy Bridge CPUs
- 64 GB DDR3 DRAM per node
- Cray Aries interconnect
File system:
- Sonexion 1600 appliance with Lustre
- 144 OSTs with 72 GB/s peak I/O bandwidth
- 32 MB stripe size

Applications
VPIC-IO, an I/O kernel from a plasma physics simulation:
- VPIC is developed at LANL; its HDF5 I/O was developed at LBNL
- Uses H5Part and HDF5 for I/O
- Each MPI process writes data for 8 million particles
- Each particle has 8 properties
- Each property is stored as a 1D HDF5 dataset (see the sketch below)
VORPAL-IO, an I/O kernel from accelerator physics:
- VORPAL is developed at TechX; the I/O kernel was extracted at LBNL
- Uses H5Block and HDF5 for I/O
- Each process writes a 3D block of 100 x 100 x 60
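
Below is a minimal sketch of the VPIC-IO write pattern described above, written directly against HDF5 with the MPI-IO driver rather than through the actual H5Part wrapper; the dataset names, placeholder values, single-precision type, and exact particle count are assumptions for illustration.

```c
/* Minimal sketch of the VPIC-IO write pattern (illustration only, not the
 * actual kernel): each MPI process writes its particles' 8 properties as
 * contiguous regions of shared 1D HDF5 datasets, using collective MPI-IO.
 * Dataset names and placeholder data are hypothetical. */
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

#define NPARTICLES (8 * 1024 * 1024)  /* ~8 million particles per process (assumed count) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Create the shared file through the MPI-IO (parallel HDF5) driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("particles.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Each rank owns one contiguous slab of every 1D dataset. */
    hsize_t global_dims[1] = { (hsize_t)NPARTICLES * nprocs };
    hsize_t offset[1] = { (hsize_t)NPARTICLES * rank };
    hsize_t count[1]  = { NPARTICLES };

    float *buf = malloc(NPARTICLES * sizeof(float));
    for (hsize_t i = 0; i < NPARTICLES; i++) buf[i] = (float)rank;  /* placeholder values */

    const char *props[8] = { "x", "y", "z", "ux", "uy", "uz", "id1", "id2" };  /* hypothetical names */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* collective write phase */

    for (int p = 0; p < 8; p++) {
        hid_t filespace = H5Screate_simple(1, global_dims, NULL);
        hid_t dset = H5Dcreate2(file, props[p], H5T_NATIVE_FLOAT, filespace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
        hid_t memspace = H5Screate_simple(1, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);
        H5Sclose(memspace);
        H5Dclose(dset);
        H5Sclose(filespace);
    }

    H5Pclose(dxpl);
    free(buf);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```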

Measurements
I/O time:
- Maximum time across all processes
- Includes file open, close, and write times
- Ran each experiment 5 times and selected the best-performing run
Energy and power measurements:
- Cray Power Management (PM) counters: /sys/cray/pm_counters/cpu_{energy,power}
- Developed a small library (PMON) to obtain the elapsed power/energy and aggregate it from all nodes involved in running a job (a sketch of this approach follows below)
Frequency control:
- The frequency for an I/O kernel run is set with aprun, e.g.: aprun --p-state=1800000 -n 2048 exec args
- The power and energy measurements cover compute nodes only, not the I/O subsystem
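
The following is a hedged sketch of how such a measurement helper might read the PM counter files named above and aggregate the elapsed energy across a job with MPI. It is an illustrative assumption, not the authors' library; a real implementation would reduce over one rank per node and handle the counter-file units explicitly.

```c
/* Sketch of reading Cray PM counters around a measured phase and summing
 * the elapsed energy across the job (illustration only).  Assumes the
 * cpu_energy counter reports cumulative Joules and that one designated
 * rank per node performs the read. */
#include <mpi.h>
#include <stdio.h>

/* Read the leading integer from a pm_counters file such as
 * /sys/cray/pm_counters/cpu_energy or /sys/cray/pm_counters/cpu_power. */
static long long read_pm_counter(const char *path) {
    long long value = 0;
    FILE *f = fopen(path, "r");
    if (f) {
        fscanf(f, "%lld", &value);
        fclose(f);
    }
    return value;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    long long e_start = read_pm_counter("/sys/cray/pm_counters/cpu_energy");

    /* ... the I/O phase under test runs here ... */

    long long e_end = read_pm_counter("/sys/cray/pm_counters/cpu_energy");
    long long node_energy = e_end - e_start;

    /* Sum per-node elapsed energy over all nodes in the job. */
    long long total_energy = 0;
    MPI_Reduce(&node_energy, &total_energy, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("Aggregate compute-node CPU energy: %lld J\n", total_energy);

    MPI_Finalize();
    return 0;
}
```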

Scaling tests
Weak scaling:
  Number of cores   VPIC data size   VORPAL data size
  2K                512 GB           1.6 TB
  4K                1 TB             3.2 TB
  8K                2 TB             6.4 TB
  16K               4 TB             12.8 TB
Strong scaling:
  Number of cores: 2K, 4K, 8K
  VPIC data size: 1 TB (fixed)
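
A quick consistency check on the weak-scaling VPIC sizes, under the assumption (not stated on the slide) that each of the 8 properties is a 4-byte single-precision value and that "8 million" particles per process means $8 \times 2^{20}$:

\[
2048 \ \text{processes} \times 8{,}388{,}608 \ \text{particles} \times 8 \ \text{properties} \times 4 \ \text{bytes} \approx 512 \ \text{GB},
\]

and the volume doubles with each doubling of the core count, matching the table.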

Weak-scaling Power Consumption
[Figure: compute-node power (thousands of Watts, 0 to 200) versus P-state frequency (1.2 GHz to 2.4 GHz plus the default setting) for VPIC and VORPAL at 2K, 4K, 8K, and 16K cores; lower is better]

Weak-scaling Energy Consumption
[Figure: compute-node energy versus P-state frequency for the same configurations; lower is better]

Weak-scaling I/O Rate
[Figure: I/O rate (GB/s, 0 to 50) versus P-state frequency for VPIC and VORPAL at 2K through 16K cores; higher is better]
Significant I/O performance degradation at low frequencies

Weak-scaling Energy Efficiency
[Figure: I/O rate per Watt (MB/s/W, 0 to 1.8) versus P-state frequency for VPIC and VORPAL at 2K through 16K cores; higher is better]
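
The plotted metric can be read as data moved per unit of compute-node energy; written out (an interpretation consistent with the units on the slide, not an equation shown there):

\[
\text{energy efficiency} \;=\; \frac{\text{I/O rate}}{\text{average power}} \;=\; \frac{D / t_{\text{I/O}}}{E / t_{\text{I/O}}} \;=\; \frac{D}{E} \quad \left[\frac{\text{MB}}{\text{J}} = \frac{\text{MB/s}}{\text{W}}\right],
\]

where $D$ is the data volume written, $t_{\text{I/O}}$ the I/O time, and $E$ the compute-node energy over the I/O phase. For a fixed write volume, the frequency that maximizes efficiency is therefore the one that minimizes energy.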

Weak-scaling Energy Savings and Efficiency Improvement
- Compared the lowest energy consumption across P-states to the energy at the default frequency
- Compared the highest energy efficiency across P-states to the default energy efficiency
- The best P-state was 2.2 GHz for 5 of the 8 configurations; the remaining configurations peaked at 1.8 GHz and 1.4 GHz

Strong Scaling Energy and Energy Efficiency
[Figure: energy and energy efficiency versus P-state frequency at 2K, 4K, and 8K cores, with the default setting marked; lower energy and higher efficiency are better]
Best frequency per scale: 2.4 GHz at 2K cores, 2.2 GHz at 4K cores, 1.8 GHz at 8K cores

Strong Scaling Trends
[Figure: factor of increase from 2K to 8K cores (0x to 4.5x) for energy, power, I/O rate, and energy efficiency, versus P-state frequency]
Power increases by 4X from 2K to 8K cores; energy efficiency decreases by 50%
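
The two reported factors jointly imply that the aggregate I/O rate only roughly doubles from 2K to 8K cores; as a worked check:

\[
\frac{\text{efficiency}_{8\text{K}}}{\text{efficiency}_{2\text{K}}} \;=\; \frac{\text{rate}_{8\text{K}} / \text{power}_{8\text{K}}}{\text{rate}_{2\text{K}} / \text{power}_{2\text{K}}} \;\approx\; \frac{2}{4} \;=\; 0.5 .
\]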

Observations & Unknowns Decreased I/O rate with frequency? o CPU and node activity during I/O phase o Using fewer cores per node or pinning fewer cores to perform I/O o MPI-IO in independent mode o I/O performance variation o Fine grain power state settings Some cores at high frequency and some at lower I/O phase energy consumption with new memory and storage hierarchy? o Node level NVM o Burst buffers

Thanks!
- Advanced Scientific Computing Research (ASCR) for funding the Power-aware Data Management project
- Program Manager: Lucy Nowell
- Project PI: Hank Childs