Tuning I/O Performance for Data Intensive Computing. Nicholas J. Wright, njwright@lbl.gov


1 Tuning I/O Performance for Data Intensive Computing. Nicholas J. Wright, njwright@lbl.gov

2 NERSC - National Energy Research Scientific Computing Center. Mission: accelerate the pace of scientific discovery by providing high performance computing, information, data, and communications services for all DOE Office of Science (SC) research. The production computing facility for DOE SC: over 3,000 users, 400 projects, 500 codes. Berkeley Lab Computing Sciences Directorate: Computational Research Division (CRD), ESnet, NERSC.

3 NERSC Systems for Science. Large-Scale Computing Systems: Franklin (NERSC-5), a Cray XT4 with 9,532 compute nodes and 38,128 cores; ~25 Tflop/s on applications, 356 Tflop/s peak. Hopper (NERSC-6), a Cray XE6; Phase 1: Cray XT5, 668 nodes, 5,344 cores; Phase 2: >1 Pflop/s peak (late 2010 delivery). Clusters (4 Tflops total): Carver, an IBM iDataPlex cluster; PDSF (HEP/NP), a ~1K core throughput cluster; Magellan cloud testbed, an IBM iDataPlex cluster; GenePool (JGI), a ~5K core throughput cluster. NERSC Global Filesystem (NGF): uses IBM's GPFS; 1.5 PB capacity, 5.5 GB/s of bandwidth. HPSS Archival Storage: 40 PB capacity, 4 tape libraries, 150 TB disk cache. Analytics: Euclid (512 GB shared memory); Dirac GPU testbed (48 nodes).

4 Data Trends at NERSC

5 Disruptive Hardware Technologies for Data-Intensive Computing. Memory capacity per flop is decreasing. Solid state storage offers high-bandwidth, low-latency flash. Flash storage testbeds: ~ TB in the NERSC Global Filesystem (NGF) for metadata acceleration; 6 TB as local SSD in the Magellan cloud testbed for data analytics, local read-only data, and local temp storage.

6 Flash Device Evaluation - Bandwidth. [Chart: read and write bandwidth (MB/s) for TMS RamSan-620 (450 GB), Virident tachIOn (400 GB), Fusion-io ioDrive Duo (single slot, 160 GB), Intel X25-M (160 GB), and OCZ Colossus (250 GB).]

7 Flash Device Evaluation - IOPS. [Chart: peak read and peak write IOPS (thousands) for the same five devices: TMS RamSan-620, Virident tachIOn, Fusion-io ioDrive Duo, Intel X25-M, and OCZ Colossus.]
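The two charts above summarize exactly the kind of measurement a small microbenchmark produces. A minimal sketch in C, assuming a pre-existing test file (the path, block sizes, and operation count are illustrative, and without O_DIRECT the results may reflect the page cache rather than the device): one sequential pass measures read bandwidth in MB/s, then random 4 KB reads estimate IOPS.

    /* flashbench.c - hypothetical sketch of a bandwidth + IOPS microbenchmark */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    #define SEQ_BLOCK (1 << 20)   /* 1 MB blocks for the bandwidth pass */
    #define RND_BLOCK 4096        /* 4 KB blocks for the IOPS pass      */
    #define RND_OPS   100000

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/tmp/testfile";
        int fd = open(path, O_RDONLY); /* add O_DIRECT to bypass the page cache */
        if (fd < 0) { perror("open"); return 1; }
        off_t fsize = lseek(fd, 0, SEEK_END);
        if (fsize < SEQ_BLOCK) { fprintf(stderr, "test file too small\n"); return 1; }
        char *buf = malloc(SEQ_BLOCK);

        /* sequential read bandwidth, as in the MB/s chart */
        double t0 = now();
        off_t off = 0;
        while (off + SEQ_BLOCK <= fsize && pread(fd, buf, SEQ_BLOCK, off) > 0)
            off += SEQ_BLOCK;
        printf("sequential read: %.1f MB/s\n", (off / 1e6) / (now() - t0));

        /* random 4 KB reads, as in the IOPS chart */
        long nblocks = fsize / RND_BLOCK;
        t0 = now();
        for (long i = 0; i < RND_OPS; i++)
            pread(fd, buf, RND_BLOCK, (off_t)(rand() % nblocks) * RND_BLOCK);
        printf("random read: %.0f IOPS\n", RND_OPS / (now() - t0));

        free(buf);
        close(fd);
        return 0;
    }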

8 Understanding I/O Performance Using IPM

9 What is Integrated Performance Monitoring? IPM provides a performance profile of a batch job. [Diagram: a batch job's inputs (input_23, job_23) and outputs (output_23) feed an IPM profile (profile_23); instrumented layers include MPI, PAPI, OpenMP, CUDA, and I/O, where the type, duration, size, and filename of each call are recorded.]
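IPM requires no changes to application source. A minimal sketch of the kind of job it profiles, assuming a standard MPI environment (the file names are placeholders): the monitoring is attached at link or load time (site-specific, e.g. relinking against the IPM library or preloading it), after which each MPI call and each I/O call below appears in the profile with its type, duration, size, and filename.

    /* iojob.c - hypothetical sketch of a job IPM would profile unmodified */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char buf[1 << 20];                 /* 1 MB payload per rank */
        memset(buf, rank & 0xff, sizeof buf);

        char fname[64];
        snprintf(fname, sizeof fname, "output_%d.dat", rank);
        FILE *fp = fopen(fname, "w");      /* I/O layer records the open... */
        if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);
        fwrite(buf, 1, sizeof buf, fp);    /* ...and each write's size and duration */
        fclose(fp);

        MPI_Barrier(MPI_COMM_WORLD);       /* MPI layer records time in barrier */
        MPI_Finalize();
        return 0;
    }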

10 IPM I/O Tracing at Scale. [Figure: per-rank timeline (rank vs. time) of open, read, and write events.]

11 Performance Events to Ensembles. [Figure: histograms (count vs. MB/s and seconds) of 512 MB write performance on two file systems, scratch and scratch2, with modes near the peak rate R and at R/2 and R/4.] IOR 1024-way Franklin experiment: 5x 512 MB writes, with barrier; the task that arrives last at the barrier defines the performance of the phase. We observe the high variability common in I/O performance (identical writes); however, a histogram of the performance distribution can be very revealing. A run with a second file system shows different details but similar statistics. 3 modes: ideal behavior, filled local system buffer, intra-node contention. Performance modes are more constant than random individual operations. This is a transition from mechanistic analysis of isolated events to analysis of ensembles; it resembles the strategy of statistical physics, whereby large numbers of interacting systems can be described by the properties of their ensemble distributions, such as moments, distribution splitting, and line-widths.
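The ensemble view is easy to reproduce. A minimal sketch, assuming MPI and a POSIX filesystem (the file names, 8 MB write size, and 10-bin histogram are illustrative, not the 512 MB Franklin experiment): every rank performs an identical write, then rank 0 gathers the per-rank times and prints the distribution, whose modes, spread, and slowest member carry information that a single average would hide.

    /* ensemble.c - sketch of ensemble analysis of identical per-rank writes */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBYTES (8 << 20)   /* 8 MB per rank (illustrative) */
    #define NBINS  10

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char *buf = malloc(NBYTES);
        memset(buf, rank & 0xff, NBYTES);
        char fname[64];
        snprintf(fname, sizeof fname, "ior_like_%d.dat", rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        FILE *fp = fopen(fname, "w");
        if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);
        fwrite(buf, 1, NBYTES, fp);
        fclose(fp);
        double dt = MPI_Wtime() - t0;
        MPI_Barrier(MPI_COMM_WORLD);   /* the slowest task defines the phase */

        double *times = NULL;
        if (rank == 0) times = malloc(nprocs * sizeof(double));
        MPI_Gather(&dt, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            double tmin = times[0], tmax = times[0], sum = 0;
            for (int i = 0; i < nprocs; i++) {
                if (times[i] < tmin) tmin = times[i];
                if (times[i] > tmax) tmax = times[i];
                sum += times[i];
            }
            int hist[NBINS] = {0};
            for (int i = 0; i < nprocs; i++) {
                int b = (int)((times[i] - tmin) / (tmax - tmin + 1e-12) * NBINS);
                hist[b >= NBINS ? NBINS - 1 : b]++;
            }
            printf("mean %.3f s, min %.3f, max %.3f (max defines the phase)\n",
                   sum / nprocs, tmin, tmax);
            for (int b = 0; b < NBINS; b++)
                printf("bin %d: %d tasks\n", b, hist[b]);
            free(times);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }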

12 Global Cloud System Resolving Climate Modeling. Cloud statistical parameterization is a leading source of errors in climate modeling - it impacts solar and terrestrial radiation, precipitation, etc. Individual cloud physics is fairly well understood, but parameterization of mesoscale cloud statistics performs poorly. Currently cloud systems are much smaller than model grid cells (unresolved), and direct simulation of cloud systems in global models requires exascale. Direct cloud system simulation was named a top priority by the 1st UN WMO Modeling Summit.

13 High Resolution GCRM. [Figure: surface altitude (feet) rendered at increasing model resolution.] 200 km: typical resolution of IPCC AR4 models. 25 km: upper limit of climate models with cloud parameterizations. 1 km: cloud system resolving models are a transformational change.

14 GCRM I/O Optimization. [Figure: before and after plots of data and metadata write rates (MB/s, sec/MB, and count vs. time).] Collecting, buffering, and aggregating data from 1,024 tasks down to 8 improves performance by 60% (reduced contention, I/O server queue depth, etc.).
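A minimal sketch of that aggregation pattern, assuming MPI (the 128:1 ratio, 1 MB chunks, and file names are illustrative, not the GCRM values): tasks forward their data to a small number of aggregator ranks, and only the aggregators touch the filesystem, each issuing one large contiguous write.

    /* aggregate.c - sketch of many-to-few write aggregation */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK (1 << 20)   /* 1 MB per task (illustrative) */
    #define RATIO 128         /* tasks per aggregator (illustrative) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* split tasks into groups of RATIO; local rank 0 is the aggregator */
        MPI_Comm agg;
        MPI_Comm_split(MPI_COMM_WORLD, rank / RATIO, rank, &agg);
        int arank, asize;
        MPI_Comm_rank(agg, &arank);
        MPI_Comm_size(agg, &asize);

        char *mine = malloc(CHUNK);
        memset(mine, rank & 0xff, CHUNK);

        char *all = NULL;
        if (arank == 0) all = malloc((size_t)asize * CHUNK);
        MPI_Gather(mine, CHUNK, MPI_BYTE, all, CHUNK, MPI_BYTE, 0, agg);

        if (arank == 0) {              /* only aggregators touch the filesystem */
            char fname[64];
            snprintf(fname, sizeof fname, "agg_%d.dat", rank / RATIO);
            FILE *fp = fopen(fname, "w");
            if (fp) {
                fwrite(all, 1, (size_t)asize * CHUNK, fp); /* one large write */
                fclose(fp);
            }
            free(all);
        }
        free(mine);
        MPI_Comm_free(&agg);
        MPI_Finalize();
        return 0;
    }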

15 GCRM I/O Optimization. [Figure: before and after plots of data and metadata write rates (MB/s, sec/MB, and count vs. time).] Using HDF5 library calls, writes are padded and aligned to 1 MB boundaries. The worst-case per-task rate is now 1 MB/sec, and metadata performance also improves. Overall improvement: a 50% reduction compared with the original.
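A minimal sketch of the alignment idea, assuming HDF5 1.8+ (the threshold, alignment value, file name, and dataset layout are illustrative; the actual GCRM tuning involved further settings described in the paper cited two slides below): H5Pset_alignment on the file access property list makes HDF5 place any object above a size threshold on a 1 MB boundary, so large writes start at stripe-friendly offsets.

    /* aligned.c - sketch of 1 MB alignment via the HDF5 file access plist */
    #include <hdf5.h>
    #include <stdlib.h>

    int main(void)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        /* align any object >= 64 KB on a 1 MB boundary (values illustrative,
           chosen to match a typical Lustre stripe width) */
        H5Pset_alignment(fapl, 64 * 1024, 1024 * 1024);

        hid_t file = H5Fcreate("aligned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        hsize_t dims[1] = {1 << 20};
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, space,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        float *buf = malloc(dims[0] * sizeof(float));
        for (hsize_t i = 0; i < dims[0]; i++) buf[i] = (float)i;
        H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        free(buf);
        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }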

16 GCRM I/O Optimization. [Figure: before and after plots of data and metadata write rates (MB/s, sec/MB, and count vs. time).] Aggregate the <3 KB metadata writes into a single 1 MB write, deferred until file close. This removes the large gaps caused by serialized writing on task 0. Total runtime decreased by a total of 4x over baseline.
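A minimal sketch of that deferral strategy in plain C (the helper names and 1 MB flush size are hypothetical; GCRM does this through HDF5 rather than a hand-rolled buffer): small records accumulate in memory, so the filesystem sees a few large writes, the last one at close, instead of thousands of tiny ones.

    /* metabuf.c - sketch of buffering small writes, flushed at close */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define FLUSH_SIZE (1 << 20)    /* accumulate up to 1 MB before writing */

    struct meta_buf { FILE *fp; char *buf; size_t used; };

    static struct meta_buf *meta_open(const char *path)
    {
        struct meta_buf *m = malloc(sizeof *m);
        m->fp = fopen(path, "w");
        m->buf = malloc(FLUSH_SIZE);
        m->used = 0;
        return m;
    }

    /* append each small record in memory (assumes len < FLUSH_SIZE) */
    static void meta_write(struct meta_buf *m, const void *rec, size_t len)
    {
        if (m->used + len > FLUSH_SIZE) {     /* buffer full: one big write */
            fwrite(m->buf, 1, m->used, m->fp);
            m->used = 0;
        }
        memcpy(m->buf + m->used, rec, len);
        m->used += len;
    }

    /* deferred flush: one large write at file close */
    static void meta_close(struct meta_buf *m)
    {
        if (m->used) fwrite(m->buf, 1, m->used, m->fp);
        fclose(m->fp);
        free(m->buf);
        free(m);
    }

    int main(void)
    {
        struct meta_buf *m = meta_open("metadata.dat");
        char rec[64];
        for (int i = 0; i < 10000; i++) {     /* many small records... */
            int n = snprintf(rec, sizeof rec, "record %d\n", i);
            meta_write(m, rec, (size_t)n);    /* ...but few actual writes */
        }
        meta_close(m);
        return 0;
    }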

17 GCRM and Chombo Benchmarks. HDF5 is a popular I/O library: 75 NERSC projects use HDF5, and it provides a high-level object storage model. Its performance had declined because HDF5 had not kept up with the evolution of parallel filesystems (e.g., Lustre). Tuning HDF5 and MPI-IO produced the 5-10x improvements seen; the changes are part of HDF5 1.8.6. [Chart: original vs. optimized HDF5 I/O performance (MB/s) for GCRM, Chombo, and Vorpal/OSIRIS.] Reference: Tuning HDF5 for Lustre File Systems. Mark Howison, Quincey Koziol, David Knaak, John Mainzer, John Shalf. Cluster 2010.
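Much of this tuning is exposed to applications through MPI-IO hints. A minimal sketch, assuming an MPI-IO implementation that honors these standard ROMIO/Cray hint keys on Lustre (the values, and whether a given key applies on a particular system, are site-specific and illustrative only): an MPI_Info object carries striping and collective-buffering hints into MPI_File_open, and aligned collective writes follow.

    /* hints.c - sketch of Lustre-oriented MPI-IO hints */
    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "48");    /* Lustre stripe count */
        MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MB stripe size    */
        MPI_Info_set(info, "romio_cb_write", "enable"); /* collective buffering */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "hinted.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        char buf[1 << 20];
        memset(buf, rank & 0xff, sizeof buf);
        MPI_Offset off = (MPI_Offset)rank * sizeof buf; /* 1 MB-aligned offsets */
        MPI_File_write_at_all(fh, off, buf, sizeof buf, MPI_BYTE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }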

18 Summary. Many NERSC users are solving data-intensive problems today! IPM enables low-overhead, scalable collection of performance information about MPI, OpenMP, and I/O to provide a holistic overview of an application's performance. By lowering the barrier to performance measurement, we hope to enable understanding of the performance of whole workloads.

19 Acknowledgments. IPM team: David Skinner, Karl Fuerlinger, Andrew Uselton & Katherine Yelick, NERSC & LBNL. I/O work: Andrew Uselton, Mark Howison, Noel Keen, David Skinner, John Shalf, Karen Karavanic & Lenny Oliker. Funding from NSF under grant.
