HDF5 I/O Performance. HDF and HDF-EOS Workshop VI December 5, 2002


Goal of this talk Give an overview of the HDF5 Library tuning knobs for sequential and parallel performance 2

Challenging task The HDF5 library has to perform well on a variety of UNIX workstations (SGI, Intel, HP, Sun), Windows, Cray, DOE supercomputers (IBM SP, Intel TFLOPS), and Linux clusters (Compaq, Intel); on a variety of file systems (GPFS, PVFS, Unix FS); and with a variety of MPI-IO implementations Other tasks: efficient memory and file space management Applications differ (access patterns, many small objects vs. few large objects, parallel vs. sequential, etc.) 3

Outline Sequential performance Tuning knobs File level Data transfer level Memory management File space management: Fill values and storage allocation Chunking Compression Caching Compact storage 4

Outline Parallel performance Tuning knobs Data alignment MPI-IO hints HDF5 Split Driver h5perf benchmark 5

Tuning knobs Sequential Performance 6

Two Sets of Tuning Knobs File level knobs Apply to the entire file Data transfer level knobs Apply to individual dataset read or write 7

File Level Knobs H5Pset_meta_block_size H5Pset_cache 8

H5Pset_meta_block_size Sets the minimum block size allocated for metadata aggregation The aggregated block is usually written in a single write operation Default is 2KB Pro: a larger block size reduces the number of I/O requests Con: could create holes in the file and make the file bigger 9

H5Pset_meta_block_size When to use: the file is open for a long time, many objects are created, many operations are performed on those objects (so metadata becomes interleaved with raw data), or a lot of new metadata (attributes) is added 10
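
A minimal C sketch of how this knob might be applied; the file name my.h5 and the 64KB block size are illustrative, not from the talk:

    #include <hdf5.h>

    int main(void)
    {
        /* The metadata block size is set on a file access property list */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_meta_block_size(fapl, 64 * 1024);   /* raise the 2KB default to 64KB (example value) */

        hid_t file = H5Fcreate("my.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        /* ... create groups, datasets, attributes ... */
        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }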

H5Pset_cache Sets: The number of elements (objects) in the metadata cache The number of elements, the total number of bytes, and the preemption policy value (default is 0.75) in the raw data chunk cache 11

Preemption policy: H5Pset_cache (cont.) Chunks are stored in a list with the most recently accessed chunk at the end and the least recently accessed chunks at the beginning X*100% of the list is searched for a fully read/written chunk; X is called the preemption value, where X is between 0 and 1 If such a chunk is found it is evicted from the cache; otherwise the first chunk in the list is evicted 12

The right value of X H5Pset_cache (cont.) May improve I/O performance by controlling the preemption policy A value of 0 forces eviction of the oldest chunk in the cache A value of 1 forces a search of the whole list for a chunk that is unlikely to be accessed again The best value depends on the application access pattern 13
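
A minimal sketch of setting both caches with H5Pset_cache; the slot count, the 16MB chunk cache size, and the file name are illustrative values, and 0.75 is the default preemption value:

    #include <hdf5.h>

    int main(void)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        /* mdc_nelmts: metadata cache elements (ignored by newer releases)
           rdcc_nslots: number of chunk slots in the raw data chunk cache
           rdcc_nbytes: total chunk cache size, here 16MB (example value)
           rdcc_w0:     preemption value X, here the 0.75 default */
        H5Pset_cache(fapl, 10000, 521, 16 * 1024 * 1024, 0.75);

        hid_t file = H5Fopen("my.h5", H5F_ACC_RDWR, fapl);  /* hypothetical file */
        /* ... chunked dataset I/O ... */
        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }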

Data Transfer Level Knobs H5Pset_buffer H5Pset_sieve_buf_size 14

H5Pset_buffer Sets size of the internal buffers used during data transfer Default is 1 MB Pro: Bigger size improves performance Con: Library uses more memory 15

H5Pset_buffer When to use: Datatype conversion Data gathering-scattering (e.g. a checkerboard dataspace selection) 16
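
A minimal sketch with an illustrative 4MB buffer; H5Pset_buffer applies to a dataset transfer property list that is then passed to H5Dread/H5Dwrite (the dataset and dataspace handles in the commented call are hypothetical):

    #include <hdf5.h>

    int main(void)
    {
        /* The conversion buffer size is set on a data transfer property list */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_buffer(dxpl, 4 * 1024 * 1024, NULL, NULL);  /* 4MB instead of the 1MB default (example value) */

        /* ... open file and dataset, then pass dxpl to the transfer call, e.g.:
           H5Dwrite(dataset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf); */

        H5Pclose(dxpl);
        return 0;
    }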

H5Pset_sieve_buf_size Sets the size of the data sieve buffer Default is 64KB The sieve buffer is a buffer in memory that holds part of the dataset's raw data During I/O operations data is gathered into the buffer first, then one big I/O request is issued 17

H5Pset_sieve_buf_size Pro: a bigger buffer reduces the number of I/O requests issued for raw data access Con: the library uses more memory When to use: Data scattering-gathering (e.g. checkerboard) Interleaved hyperslabs 18
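
A minimal sketch with an illustrative 1MB sieve buffer set on the file access property list; the file name is hypothetical:

    #include <hdf5.h>

    int main(void)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_sieve_buf_size(fapl, 1024 * 1024);   /* 1MB instead of the 64KB default (example value) */

        hid_t file = H5Fopen("my.h5", H5F_ACC_RDWR, fapl);  /* hypothetical file */
        /* ... partial reads/writes of contiguous datasets benefit from the larger sieve ... */
        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }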

HDF5 Application Memory Management H5garbage_collect() Memory used by an HDF5 application may grow as more and more objects are created and then released The function walks through all the garbage collection routines of the library, freeing any unused memory When to use: the application creates, opens, and releases a substantial number of objects What counts as substantial is application and platform dependent 19

HDF5 File Space Management H5Pset_alloc_time Sets the time of data storage allocation for a new dataset: Early (when the dataset is created) or Late (when the dataset is first written) H5Pset_fill_time Sets the time when fill values are written to a dataset: when space is allocated, or Never (avoids unnecessary writes) 20
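
A minimal sketch combining the two calls on a dataset creation property list; the file name, dataset name, and dimensions are illustrative:

    #include <hdf5.h>

    int main(void)
    {
        hsize_t dims[2] = {1024, 1024};
        hid_t file  = H5Fcreate("my.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT); /* hypothetical file */
        hid_t space = H5Screate_simple(2, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_LATE);   /* allocate space at first write */
        H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);    /* skip writing fill values */

        /* 2002-era call; HDF5 1.8+ spells this
           H5Dcreate2(file, "data", H5T_NATIVE_INT, space, H5P_DEFAULT, dcpl, H5P_DEFAULT) */
        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, space, dcpl);

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
        return 0;
    }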

Chunking and Compression Chunked storage Provides better partial access to the dataset Space is allocated as data is written Con: storage overhead; may degrade performance if the chunk cache is not set up properly Compression (GZIP, SZIP in the HDF5 1.5 release) Saves space Users may easily plug in their own compression method Con: may take a lot of time Data shuffling (in the HDF5 1.5 release) Helps compression algorithms 21
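
A minimal sketch of creating a chunked, GZIP-compressed dataset; the file name, dataset shape, chunk shape, and compression level are illustrative:

    #include <hdf5.h>

    int main(void)
    {
        hsize_t dims[2]  = {1024, 1024};
        hsize_t chunk[2] = {64, 64};          /* example chunk shape */

        hid_t file  = H5Fcreate("chunked.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT); /* hypothetical file */
        hid_t space = H5Screate_simple(2, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);         /* chunked layout; space allocated as chunks are written */
        H5Pset_deflate(dcpl, 6);              /* GZIP compression, level 6 (example value) */

        /* HDF5 1.8+: H5Dcreate2(file, "data", H5T_NATIVE_INT, space, H5P_DEFAULT, dcpl, H5P_DEFAULT) */
        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, space, dcpl);

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
        return 0;
    }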

Data shuffling See Kent Yang's poster Not a compression; a change of the byte order in a stream of data Example: values 1, 23, 43 Hexadecimal form: 0x01 0x17 0x2B On a big-endian machine: 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x17 0x00 0x00 0x00 0x2B After shuffling: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x17 0x2B 22

Byte stream before shuffling: 00 00 00 01 00 00 00 17 00 00 00 2B After shuffling: 00 00 00 00 00 00 00 00 00 01 17 2B 23

Chunking and compression benchmark Write one 4-byte integer dataset 256x256x1024 (256MB) Using chunks of 256x16x1024 (16MB) Random integers between 0 and 255 Tests with Compression on/off Chunk cache size 16MB to 256MB Data shuffling 24

Chunking Benchmark Time Definitions Total time: time to open the file, write the dataset, close the dataset, and close the file Write time: time to write the whole dataset Average chunk time: total time / number of chunks 25

Performance improvement
Release version      Total time (open-write-close), s   Average time to write a 16MB chunk, s
1.4.4 release        8.6950                             0.4809
1.4.5 pre-release    4.5711                             0.2447

Effect of Caching (H5Pset_cache)
Compression   Chunk cache   Total time, s   Write time, s
No            16MB          5.607           5.43            (file size 268.4MB)
No            256MB         5.79            3.63
Yes           16MB          672.58          630.89          (file size 102.9MB)
Yes           256MB         674.66          3.48

Effect of data shuffling (H5Pset_shuffle + H5Pset_deflate)
File size   Total time, s   Write time, s
102.9MB     671.049         629.45          (deflate only)
67.34MB     83.353          78.268          (shuffle + deflate)
Compression combined with shuffling provides a better compression ratio and better I/O performance

Effect of chunk caching and data shuffling (H5Pset_cache + H5Pset_shuffle + H5Pset_deflate)
Chunk cache   Total time, s   Write time, s
16MB          83.353          78.268
128MB         82.942          43.257
256MB         82.972          3.476
Caching improves chunk write time
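
A minimal sketch combining the three knobs above: a larger chunk cache on the file access property list plus the shuffle and deflate filters on the dataset creation property list (all sizes and names are illustrative):

    #include <hdf5.h>

    int main(void)
    {
        /* Larger raw data chunk cache (256MB, example value) */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_cache(fapl, 10000, 521, 256 * 1024 * 1024, 0.75);
        hid_t file = H5Fcreate("shuffled.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl); /* hypothetical file */

        hsize_t dims[2]  = {1024, 1024};
        hsize_t chunk[2] = {64, 64};
        hid_t space = H5Screate_simple(2, dims, NULL);

        /* Shuffle is added to the filter pipeline before deflate so it can help compression */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        H5Pset_shuffle(dcpl);
        H5Pset_deflate(dcpl, 6);

        /* HDF5 1.8+: H5Dcreate2(file, "data", H5T_NATIVE_INT, space, H5P_DEFAULT, dcpl, H5P_DEFAULT) */
        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, space, dcpl);

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file); H5Pclose(fapl);
        return 0;
    }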

Compact storage Store small objects (e.g. a 4KB dataset) in the file C code example: plist = H5Pcreate(H5P_DATASET_CREATE); H5Pset_layout(plist, H5D_COMPACT); H5Pset_alloc_time(plist, H5D_ALLOC_TIME_EARLY); dataset = H5Dcreate(file, ..., plist); Raw data is stored in the dataset header, so metadata and raw data are written/read in one I/O operation Faster write and read 30
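
An expanded, self-contained version of the code sketch above; the file name, dataset name, and 4KB size are illustrative:

    #include <hdf5.h>

    int main(void)
    {
        hsize_t dims[1] = {1024};             /* 1024 ints = 4KB, small enough for compact layout */
        hid_t file  = H5Fcreate("compact.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT); /* hypothetical file */
        hid_t space = H5Screate_simple(1, dims, NULL);

        hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_layout(plist, H5D_COMPACT);                 /* raw data stored in the dataset header */
        H5Pset_alloc_time(plist, H5D_ALLOC_TIME_EARLY);    /* header space reserved at creation */

        /* HDF5 1.8+: H5Dcreate2(file, "small", H5T_NATIVE_INT, space, H5P_DEFAULT, plist, H5P_DEFAULT) */
        hid_t dataset = H5Dcreate(file, "small", H5T_NATIVE_INT, space, plist);

        int buf[1024] = {0};
        H5Dwrite(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        H5Dclose(dataset); H5Pclose(plist); H5Sclose(space); H5Fclose(file);
        return 0;
    }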

Compact storage benchmark Create a file with N 4KB datasets using regular and compact storage (100 < N < 35000) Measure the average time needed to write/read a dataset in a file with N datasets Benchmark run on Linux 2.2.18 i686 with 960MB memory; the gettimeofday function was used to measure time 31

Writing a dataset [Chart: average time to write a dataset (in seconds, 0 to 0.0006) vs. number of 4KB datasets in the file (100 to 32100), for regular storage and compact storage] 32

Reading the latest written dataset [Chart: read time (in seconds, 0 to 0.025) vs. number of 4KB datasets in the file (100 to 32100), for regular storage and compact storage] 33

Parallel Performance Tuning knobs h5perf benchmark 34

PHDF5 Implementation Layers [Diagram: parallel user applications on top of the parallel HDF5 + MPI library, which sits on the MPI-IO parallel I/O layer, which in turn sits on the parallel file systems: O2K Unix I/O, SP GPFS, TFLOPS PFS] 35

File Level Knobs (Parallel) H5Pset_alignment H5Pset_fapl_split H5Pset_fapl_mpio 36

H5Pset_alignment Sets two parameters Threshold: minimum size of object for alignment to take effect (default 1 byte) Alignment: allocate the object at the next multiple of alignment (default 1 byte) Example: (threshold, alignment) = (1024, 4K) All objects of 1024 or more bytes start at a 4KB boundary 37

H5Pset_alignment Benefits In general, the default (no alignment) is good for single process serial access since the OS already manages buffering. For some parallel file systems such as GPFS, an alignment of the disk block size improves I/O speeds. 38
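
A minimal sketch using the (1024, 4K) example from the slide above; the file name is illustrative, and on a GPFS-like system the alignment would typically be set to the file system block size:

    #include <hdf5.h>

    int main(void)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        /* Objects of 1KB or more start on a 4KB boundary */
        H5Pset_alignment(fapl, 1024, 4096);

        hid_t file = H5Fcreate("aligned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl); /* hypothetical file */
        /* ... create datasets ... */
        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }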

H5Pset_fapl_split HDF5 splits the file into two files A metadata file for metadata A raw data file for raw data (array data) Significant I/O improvement if the metadata file is stored on a Unix file system (good for many small I/Os) and the raw data file is stored on a parallel file system (good for large I/Os) 39
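
A minimal sketch of selecting the split driver; the .meta and .raw extensions and the file name are illustrative, and each part can be given its own file access property list (H5P_DEFAULT here):

    #include <hdf5.h>

    int main(void)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        /* Metadata goes to <name>.meta, raw data to <name>.raw (hypothetical extensions) */
        H5Pset_fapl_split(fapl, ".meta", H5P_DEFAULT, ".raw", H5P_DEFAULT);

        hid_t file = H5Fcreate("split_example", H5F_ACC_TRUNC, H5P_DEFAULT, fapl); /* hypothetical name */
        /* ... */
        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }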

Write speeds: standard vs. split-file HDF5 Results for the ASCI Red machine at Sandia National Laboratory; each process writes 10MB of array data [Chart: write speed in MB/sec (4 to 20) vs. number of processes (2, 4, 8, 16) for standard HDF5 write (one file), split-file HDF5 write, and MPI I/O write (one file)] 40

I/O Hints via H5Pset_fapl_mpio MPI-IO hints can be passed to the MPI-IO layer via the Info parameter of H5Pset_fapl_mpio Examples: Telling ROMIO to use 2-phase I/O speeds up collective I/O on the ASCI Red machine at Sandia National Laboratory Setting IBM_largeblock_io=true speeds up GPFS write speeds 41
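
A minimal sketch of passing the IBM_largeblock_io hint through an MPI_Info object to H5Pset_fapl_mpio; the file name is illustrative and a parallel HDF5 build is assumed:

    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* MPI-IO hints travel in an MPI_Info object; this is the GPFS hint from the slide */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "IBM_largeblock_io", "true");

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);   /* requires a parallel HDF5 build */

        hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl); /* hypothetical file */
        /* ... collective dataset I/O ... */
        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }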

2-Phase I/O [Diagram: processes p0-p5 hold interleaved pieces of the file; the data is funneled through processes p0 and p1 to disk] Interleaving 42

2-Phase I/O [Diagram: data from processes p0-p5 is aggregated by p0 and p1 before being written to disk] Aggregation (available in ROMIO 1.2.4) is useful for filling I/O buffers and for moving data to processors that have better connectivity 43

Effects of I/O Hints: IBM_largeblock_io GPFS on the ASCI Blue machine at Livermore National Laboratory; 4 nodes, 16 tasks, total data size 1024MB, I/O buffer size 1MB [Chart: bandwidth in MB/sec (0 to 400) for MPI-IO and PHDF5, 16-task write and 16-task read, with IBM_largeblock_io=false vs. IBM_largeblock_io=true] 44

Parallel I/O Benchmark Tool h5perf Tests I/O performance Comes with the HDF5 binaries Writes datasets into the file by hyperslabs Variables: number of datasets; number of processes; number of bytes per process per dataset to write/read; threshold for data alignment; size of the transfer buffer (memory buffer) and of the block per process; collective vs. independent; interleaved blocks vs. contiguous blocks 45

Parallel I/O Benchmark Tool Four kinds of API: Parallel HDF5; MPI-IO; native parallel (e.g., GPFS, PVFS); POSIX (open, close, lseek, read, write) Provides a standard approach to measuring and comparing performance results 46

Parallel I/O Tuning Challenging task Performance varies from platform to platform Complex access patterns Many layers can be involved, e.g. SAF-HDF5-MPIO-GPFS 47

Parallel I/O Tuning Example of transfer buffer effect h5perf run on NERSC IBM SP 4 processes, 4 nodes, 1MB file, 1 dataset, 256KB data per process to write Maximum achieved write speed in MB/sec 48

Parallel I/O Tuning Example of the transfer buffer effect on the SP2 [Chart: write and open-write-close speeds in MB/sec (0 to 300) for POSIX, MPI-IO, and PHDF5 with 128KB and 256KB transfer buffer sizes] 49

Parallel I/O Tuning Example of the transfer buffer effect on the SP2 [Chart: read and open-read-close speeds in MB/sec (0 to 180) for POSIX, MPI-IO, and PHDF5 with 128KB and 256KB transfer buffer sizes] 50

Parallel I/O Tuning Example of the transfer buffer effect on the SGI [Chart: write and open-write-close speeds in MB/sec (0 to 160) for POSIX, MPI-IO, and PHDF5 with 128KB and 256KB transfer buffer sizes] 51

Summary results for Blue h5perf run on the ASCI IBM SP Blue machine 1 to 4 processes per node, 16 nodes, 256KB data per process to write/read, 256KB transfer size, 256KB block size Varied: number of tasks per node (1-4); number of datasets (50, 100, 200); independent or collective calls 52

HDF5 collective write results
Number of datasets   Speed at 1-2 tasks per node (MB/sec)   Speed at 3-4 tasks per node (MB/sec)   MPI-IO best result (MB/sec)
50                   20-30                                  50-60                                  68.47 (3 TPN)
100                  40-60                                  50-60                                  66.24 (3 TPN)
200                  40-60                                  25-45                                  52.66 (2 TPN)

HDF5 independent write results
Number of datasets   Speed at 1-2 tasks per node (MB/sec)   Speed at 3-4 tasks per node (MB/sec)   MPI-IO best result (MB/sec)
50                   20-35                                  20-35                                  32.14 (2 TPN)
100                  20-35                                  20-35                                  31.09 (4 TPN)
200                  20-35                                  35-60                                  61.06 (3 TPN)

Read performance
Mode          PHDF5 speed (MB/sec)   MPI-IO speed (MB/sec)
Collective    75-200                 335-1800
Independent   100-650                330-2935

Future Parallel HDF5 Features Flexible PHDF5 Reduces the need for collective calls Sets aside a process to coordinate independent calls Estimated release date: end of 2002 56

Useful Parallel HDF Links Parallel HDF information site http://hdf.ncsa.uiuc.edu/parallel_hdf/ Parallel HDF mailing list hdfparallel@ncsa.uiuc.edu Parallel HDF5 tutorial available at http://hdf.ncsa.uiuc.edu/hdf5/doc/tutor 57