LANL Data Release and I/O Forwarding Scalable Layer (IOFSL): Background and Plans


LANL Data Release and I/O Forwarding Scalable Layer (IOFSL): Background and Plans
Intelligent Storage Workshop, May 14, 2008
Gary Grider HPC-DO, James Nunez HPC-5, John Bent HPC-5
LA-UR-08-0252 / LA-UR-07-7449

Outline
LANL Data Release: what's available and where
FASTOS I/O Forwarding Scalable Layer: motivation and where we're at

Publicly Available Data from LANL
Available data:
Nine years of supercomputer operational failure data
Usage records (job size, processors/machines used, duration, time, etc.)
Disk failure data for a single supercomputer
Machine layout information (building, room, rack location in room, node location in rack, hot/cold rows, etc.)
I/O traces of an MPI-IO based synthetic application
Soon to be released:
More disk failure, node, and scratch file system data for several supercomputers
File system statistics from hundreds of workstations
Simulation science application I/O traces

Machine Layout Information Example

Machine 8, bldg 1, room 1,RackPosition,RackPosition,Position in Rack,Row facing
NODE NUM,East/West,North/South,Vertical Position,Single row cluster
N#,1 to 26,28 to 35,"1 to 37, top to bottom",
1,23,28,1,rear to N/Hot
2,23,28,2,rear to N/Hot
3,23,28,3,rear to N/Hot
4,23,28,4,rear to N/Hot
5,23,28,5,rear to N/Hot
6,23,28,6,rear to N/Hot
7,23,28,7,rear to N/Hot
8,23,28,8,rear to N/Hot
160,23,34,20,rear to N/Hot
161,23,34,21,rear to N/Hot
162,23,34,22,rear to N/Hot
163,23,34,23,rear to N/Hot
164,23,34,24,rear to N/Hot

File Systems Statistics Survey
Based on the CMU/Panasas File System Statistics Survey (fsstats)
Purpose:
Develop a better understanding of file system components
Gather and build a large database of static file tree statistics
Usage:
Parallel statistics collection with LANL's MPI File Tree Walk
Database query interface
LANL data:
Production file system statistics from LANL supercomputers and testbeds
Statistics from backups of hundreds of workstations lab-wide
Output: histograms and statistics on file size, capacity used, directory size, and file name size
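To make the kind of statistics concrete, here is a minimal single-node sketch (my own code, not fsstats or LANL's MPI File Tree Walk) that walks a directory tree and builds a power-of-two file-size histogram, roughly the shape of data the survey reports:

    /* Hypothetical single-node sketch; fsstats and LANL's MPI File Tree Walk are
       parallel tools that collect far more (capacity used, directory size,
       file name length, ...).  This only builds a power-of-two file-size histogram. */
    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/stat.h>

    #define NBUCKETS 64                       /* power-of-two file-size buckets */

    static uint64_t bucket_count[NBUCKETS];
    static uint64_t nfiles;

    static int visit(const char *path, const struct stat *sb,
                     int typeflag, struct FTW *ftwbuf)
    {
        (void)path; (void)ftwbuf;
        if (typeflag == FTW_F) {              /* count regular files only */
            uint64_t size = (uint64_t)sb->st_size;
            int b = 0;
            while ((size >> b) > 1 && b < NBUCKETS - 1)
                b++;                          /* b = floor(log2(size)); 0- and 1-byte files land in bucket 0 */
            bucket_count[b]++;
            nfiles++;
        }
        return 0;                             /* keep walking */
    }

    int main(int argc, char **argv)
    {
        const char *root = (argc > 1) ? argv[1] : ".";
        if (nftw(root, visit, 64, FTW_PHYS) != 0) {
            perror("nftw");
            return 1;
        }
        printf("histogram,file size\ncount,%llu,items\n", (unsigned long long)nfiles);
        printf("bucket min,bucket max,count\n");
        for (int b = 0; b < NBUCKETS; b++)
            if (bucket_count[b])
                printf("%llu,%llu,%llu\n",
                       b ? (1ULL << b) : 0ULL, (1ULL << (b + 1)) - 1,
                       (unsigned long long)bucket_count[b]);
        return 0;
    }

Compile it with any C compiler and point it at a directory tree; the parallel survey performs the same kind of accumulation across many tree-walk ranks and over many more metrics.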

File Systems Statistics Survey - Example

histogram,file size
count,2400,items
average,195.763750,kb
min,0,kb
max,18629,kb
bucket min,bucket max,count,percent,cumulative pct
0,2,14,0.005833,0.005833
2,4,20,0.008333,0.014167

histogram,filename length
count,48175,items
average,19.143830,chars
min,0,chars
max,164,chars
bucket min,bucket max,count,percent,cumulative pct
0,7,4150,0.086144,0.086144
8,15,24112,0.500509,0.586653
16,23,10512,0.218204,0.804857

Tracing Mechanisms and Trace Data
Desired attributes in a trace mechanism:
Minimum overhead
Bandwidth preserving
Method used: Linux ltrace/strace
Information collected:
Time stamps to detect node clock skew and drift
Library and system calls and summary information
Listing of directories
Trace availability:
Available now: traces of the open-source benchmark (MPI-IO based synthetic)
Real physics application I/O traces

MPI-Based Synthetic I/O Trace - Example

Timing information:
# Barrier before benchmark_call
7: cadillac113.ccstar.lanl.gov (10378) Entered barrier at 1159808385.170918
7: cadillac113.ccstar.lanl.gov (10378) Exited barrier at 1159808385.173167
3: cadillac117.ccstar.lanl.gov (11335) Entered barrier at 1159808385.166396
3: cadillac117.ccstar.lanl.gov (11335) Exited barrier at 1159808385.168893
5: cadillac115.ccstar.lanl.gov (10373) Entered barrier at 1159808385.168842

Traced application data:
10:59:47.092996 MPI_File_open(92, 0x80675c0, 37, 0x80675a8, 0xbfdfe5e4 <unfinished...>
10:59:47.093718 SYS_statfs64(0x80675c0, 84, 0xbfdfe410, 0xbfdfe410, 0xbd3ff4) = 0 <0.011131>
10:59:47.108352 SYS_open("/panfs/REALM1/scratch/johnbent/O"..., 32832, 0600) = 3 <0.000745>
10:59:47.109189 SYS_close(3) = 0 <0.000063>
10:59:47.109310 SYS_open("/panfs/REALM1/scratch/johnbent/O"..., -2147450814, 0600) = 3 <0.000564>
10:59:47.110912 <... MPI_File_open resumed> ) = 0 <0.017855>
10:59:47.110955 MPI_Wtime(0x8063830, 0xb7fee8dc, 0xbfdfe568,

MPI-Based Benchmark Trace - Example Summary Information

# SUMMARY COUNT OF TRACED CALL(S)
# Function Name      Number of Calls   Total time (s)
# ===================================================
MPIO_Wait            2                 0.000118
MPI_Barrier          29                2.156431
MPI_File_close       2                 0.108482
MPI_File_delete      1                 0.102532

# SUMMARY COUNT OF CALLS WITHIN 1 MPI_File_delete CALL(S)
# Function Name      Number of Calls   Total time (s)
# ===================================================
SYS_ipc              8                 0.000140
SYS_statfs64         1                 0.009227
SYS_unlink           1                 0.092411

Links to Data & Codes
Machine failure/usage/event/location and disk failure data sets: http://institutes.lanl.gov/data/
Traces of the MPI-IO based synthetic: http://institutes.lanl.gov/data/tdata/
MPI-IO based synthetic & MPI File Tree Walk: http://institutes.lanl.gov/data/software/
File Systems Statistics Survey (fsstats) code: http://www.pdsi-scidac.org/fsstats/
Contact e-mail: jnunez@lanl.gov

Outline
LANL Data Release: what's available and where
FASTOS I/O Forwarding Scalable Layer: motivation and where we're at

What are all the enemies that are conspiring against us?
CPU and disk trends
Our machine and application size trends
Parallel file system solution and industry trends
Applications' alignment to parallel file systems' scalable capabilities

Disk Trends: Mind the Increasing BW Gap
[Charts: disk capacity improvements, data rate, and data rate per unit capacity across disk generations: RAMAC, Fuji Eagle 690M-36k, Seagate Elite 1.4G-54k, Seagate Barracuda-4 4.3G-72k, Seagate Barracuda 18G-72k, Seagate Cheetah 73G-10k]

Disk Trends: Mind the Increasing Agility Gap
[Charts: disk capacity improvements, seek rate, and seek rate per unit capacity across the same disk generations, RAMAC through Seagate Cheetah 73G-10k]

Why Mind the Ever-Widening Gaps?
Industry trends:
Processors are still on a steep processing power curve via multi-core
Disk capacity is on a steep curve as well
Disk BW and agility growth is at a slower pace
Memory per processor core is going down
Sweet-spot block sizes for disks/RAIDs are getting bigger
Consequences:
We have to add disk units at a faster pace than we add processor units to stay balanced
We have to gather more, smaller, less-aligned pieces of data from more cores and realign/aggregate them for more disk devices

The Hero Calculations vs. MTTI
Recall that apps compute, then checkpoint, within the MTTI of the machine they are running on
We target < 5% of runtime for checkpointing if possible
For hero apps, between 2010 and 2020, 50% or more of application run time will be spent in checkpointing
In the exascale era it could even be worse
Will we eventually get to just mirroring the application itself?
How long can we continue to address the widening gap?
Courtesy: DOE SciDAC PDSI (Garth Gibson)
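As a rough illustration of the trade-off this slide describes (my working, using Young's checkpoint-interval approximation, not material from the deck): let delta be the time to write one checkpoint (memory footprint divided by file system bandwidth), M the MTTI, and tau the compute time between checkpoints. Then:

    % Symbols are my own, not the deck's:
    %   \delta : time to write one checkpoint (memory footprint / file system bandwidth)
    %   M      : mean time to interrupt (MTTI)
    %   \tau   : compute time between checkpoints
    \tau_{\mathrm{opt}} \approx \sqrt{2\,\delta M},
    \qquad
    \text{checkpoint overhead} \approx \frac{\delta}{\tau_{\mathrm{opt}} + \delta}
      = \frac{\delta}{\sqrt{2\,\delta M} + \delta}

Under this approximation, keeping the overhead below the 5% target requires roughly delta <= 0.005 M, i.e. dumping memory in about half a percent of the MTTI; as memories grow and MTTI shrinks faster than file system bandwidth improves, the overhead fraction climbs, which is the trend behind the 50%-plus projection above.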

The Parallel File System Industry
It is very good to be able to say "the parallel file system industry"; in the mid-90s we had none
Now, with some help, we have multiple solutions
These solutions scale well on large-block, well-aligned parallel workloads, but is that where our workloads are headed?
They naturally gravitate toward the high-volume, mid-range supercomputing market; the ultra high end is problematic and not profitable
High metadata rates are fraught with trade-offs, but there are workloads that need them

N-to-N example
[Diagram: processes 0, 1, and 2 each write their own data segments (labeled T, P, and H per process) to their own files: file0, file1, file2]

N-to-1 strided example
[Diagram: the T, P, and H segments from processes 0, 1, and 2 are interleaved into a single shared file (all T segments, then all P segments, then all H segments), so each process's data is strided across the file]
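As a concrete (and purely illustrative) sketch of this N-to-1 strided pattern, the MPI-IO fragment below has every rank write fixed-size records into one shared file at rank-strided offsets; the file name, record size, and record count are made up for this example.

    /* Hypothetical sketch of an N-to-1 strided write; names and sizes are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const MPI_Offset recsize = 64 * 1024;   /* per-rank record; often small/unaligned in practice */
        char *buf = malloc((size_t)recsize);
        for (MPI_Offset i = 0; i < recsize; i++)
            buf[i] = (char)rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared_checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Record k from rank r lands at offset (k * nprocs + r) * recsize,
           so every rank's data is strided across the shared file. */
        for (int k = 0; k < 4; k++) {
            MPI_Offset off = ((MPI_Offset)k * nprocs + rank) * recsize;
            MPI_File_write_at(fh, off, buf, (int)recsize, MPI_CHAR, MPI_STATUS_IGNORE);
        }

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

With millions of ranks and shrinking per-core memory, these per-rank records get smaller and less aligned, which is exactly the pattern the next slide calls out as hard for parallel file systems.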

Where are our workloads headed?
N processes to N files: with millions of cores, this is a metadata rate nightmare
N processes to 1 file, strided: with millions of cores and each core's memory getting smaller, this is a small/unaligned I/O nightmare
Our machines will always be pushing the edge on size
So our workloads are not well suited to how file systems can easily scale (large, aligned, parallel), and our machines will not be where the dollars are

One Strategy to Help With All the Enemies
CPU and disk trends
Our machine and app size trends
Parallel file system solution and industry trends
Apps' alignment to parallel file systems' scalable capabilities
IOFSL

IOFSL is Similar to Collective Buffering in Middleware
[Diagram: processes 1-4 each hold four blocks destined for four RAID groups; collective buffering exchanges the blocks among the processes so that each one writes a single contiguous, aligned chunk of the parallel file to its RAID group]
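For comparison, here is a minimal sketch (mine, not from the deck) of what the middleware form of this looks like to an application: a collective write with a strided file view, which lets the MPI-IO layer (for example ROMIO's two-phase collective buffering) perform the exchange and aggregation shown above using the compute processes themselves. The file name and block sizes are illustrative.

    /* Hypothetical collective-buffering example; file name and block sizes are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int blocksize = 1 << 20;          /* 1 MiB per block, one block per "RAID group" */
        const int nblocks   = 4;                /* four blocks per process, as in the diagram */
        char *buf = malloc((size_t)nblocks * blocksize);
        memset(buf, rank, (size_t)nblocks * blocksize);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "parallel_file.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* File view: rank r owns block r of every row, so its data is strided in the file. */
        MPI_Datatype filetype;
        MPI_Type_vector(nblocks, blocksize, blocksize * nprocs, MPI_CHAR, &filetype);
        MPI_Type_commit(&filetype);
        MPI_File_set_view(fh, (MPI_Offset)rank * blocksize, MPI_CHAR, filetype,
                          "native", MPI_INFO_NULL);

        /* Collective write: the MPI-IO layer may exchange data among the compute
           processes (two-phase I/O) so each aggregator issues large, aligned writes. */
        MPI_File_write_all(fh, buf, nblocks * blocksize, MPI_CHAR, MPI_STATUS_IGNORE);

        MPI_Type_free(&filetype);
        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

The IOFSL slides that follow make the same transformation, but move the exchange off the compute processes onto separate I/O hardware so it can run asynchronously.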

IOFSL is Similar to Collective Buffering in Middleware, Except We Reintroduce Hardware Async/Offload
[Diagram: the same block aggregation as the previous slide, but each compute process hands its blocks to hardware async I/O, which performs the exchange and writes the contiguous chunks to RAID groups 1-4 of the parallel file]
In middleware we used the compute processes for aggregation/alignment; in IOFSL we use separate hardware (I/O nodes, or I/O offload hardware on compute nodes) to make the aggregation completely async.

We can also re-introduce async alignment
[Diagram: the same flow, with the hardware async I/O stage also realigning each process's blocks before they are written to the RAID groups of the parallel file]
In middleware we used the compute processes for aggregation/alignment; in IOFSL we use separate hardware (I/O nodes, or I/O offload hardware on compute nodes) to make the alignment completely async.

We can also re-introduce access methods (delay striding until read, if ever)
[Diagram: the same flow, but each hardware async I/O stage writes its blocks to the parallel file in the order received rather than restriding them first]
Requires an access method over the file (like a log structure, etc.)
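One way to picture the "access method over the file" mentioned above is a simple write log plus index, sketched below entirely under my own assumptions (the record layout and names are invented, not IOFSL's): data is appended in arrival order, and an index entry records where each logical extent landed, so restriding can be deferred until read time, if it is ever needed.

    /* Hypothetical log-structured layout; all names and fields are illustrative, not IOFSL's. */
    #include <stdint.h>
    #include <stdio.h>

    /* One index entry: maps a logical extent of the application's (strided) file
       onto the physical location where it was appended in arrival order. */
    struct log_index_entry {
        uint64_t logical_offset;   /* offset in the logical file */
        uint64_t length;           /* extent length in bytes */
        uint64_t physical_offset;  /* where the bytes actually sit in the append-only data log */
        uint32_t writer_rank;      /* which compute process produced the extent */
        uint32_t epoch;            /* checkpoint/generation number */
    };

    /* Append an extent to the data log and record it in the index log.  Striding is
       deferred: a reader (or a later rewrite pass) consults the index to reassemble
       the logical file, if that is ever needed. */
    static int log_append(FILE *datalog, FILE *indexlog,
                          uint64_t logical_offset, const void *buf, uint64_t len,
                          uint32_t writer_rank, uint32_t epoch)
    {
        struct log_index_entry e;
        e.logical_offset  = logical_offset;
        e.length          = len;
        e.physical_offset = (uint64_t)ftell(datalog);
        e.writer_rank     = writer_rank;
        e.epoch           = epoch;

        if (fwrite(buf, 1, (size_t)len, datalog) != len) return -1;
        if (fwrite(&e, sizeof e, 1, indexlog) != 1) return -1;
        return 0;
    }

    int main(void)
    {
        FILE *dlog = fopen("extents.log", "wb");
        FILE *ilog = fopen("extents.idx", "wb");
        if (!dlog || !ilog) return 1;

        char payload[4096] = {0};
        /* e.g. rank 7 wrote 4 KiB destined for logical offset 7*4096 in epoch 0 */
        int rc = log_append(dlog, ilog, 7 * 4096, payload, sizeof payload, 7, 0);

        fclose(dlog);
        fclose(ilog);
        return rc ? 1 : 0;
    }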

So what did that buy us?
It re-introduces complete async offload: we move from "compute 95%, dump to disk 5%" to "compute 95%, offload 5%, dump to disk 95%", which buys us roughly 20X the time for the dump to disk
Aggregation/realignment happens memory to memory
The file system doesn't have to see all the compute-side parallelism, for both data ops and metadata ops
It makes our cutting-edge machines look more like mid-range machines, mating better with parallel file system offerings
It converts all traffic to what scales well on parallel file systems, optimizing access patterns
It allows us to re-introduce access methods, which could delay striding until read
It allows us to architect a very parallel-aware protocol between compute memory and async I/O memory (something other than POSIX), moving us closer to HECEWG goals without the Linux kernel entanglements
It allows us to begin to introduce much more parallel-aware semantics
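The 20X figure follows from simple arithmetic on the time budget quoted above (my working, not the deck's):

    % Without offload, each cycle is 95% compute plus a 5% synchronous dump, so the
    % dump must finish inside a 0.05T window.  With hardware offload the dump can
    % drain during the next compute phase, a window of roughly 0.95T:
    \frac{0.95\,T}{0.05\,T} = 19 \approx 20\times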

IOFSL
DOE Office of Science and NNSA joint funded: LANL, Sandia, ANL, ORNL, OSC
Mission: design, build, and distribute a scalable, unified high-end computing I/O forwarding software layer that would be adopted and supported by DOE Office of Science and NNSA
Goal: provide function shipping at the file system interface level (without requiring middleware) that enables asynchronous coalescing and I/O without jeopardizing determinism for computation (see the sketch after this slide)
Work being planned:
Leverage existing BlueGene / Red Storm solutions: ZeptoOS already has I/O forwarding (ANL); libsysio has a good remote scatter/gather model (SNL)
Other work to be leveraged: UIUC LBIO (log-structured access methods), IBM access methods and channel programming of the past, etc.
VFS-level aggregation, no application code changes needed
Open source, flexible (I/O node memory, I/O offload on compute nodes, etc.)
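Purely to illustrate what function shipping at the file system interface level with a remote scatter/gather model could look like, here is a hedged sketch of a forwarded write request; none of these types, field names, or semantics come from IOFSL, ZeptoOS, or libsysio.

    /* Hypothetical wire format for a forwarded write; not IOFSL's actual protocol. */
    #include <stdint.h>

    enum iofwd_op {                /* operations shipped from compute node to I/O node */
        IOFWD_OPEN  = 1,
        IOFWD_WRITE = 2,
        IOFWD_READ  = 3,
        IOFWD_CLOSE = 4
    };

    struct iofwd_extent {          /* one scatter/gather element */
        uint64_t file_offset;      /* where the bytes belong in the logical file */
        uint64_t length;           /* number of bytes */
        uint64_t mem_addr;         /* source address in compute-node memory (e.g. for an RDMA pull) */
    };

    struct iofwd_write_req {
        uint32_t opcode;           /* IOFWD_WRITE */
        uint32_t rank;             /* originating compute process */
        uint64_t file_handle;      /* handle returned by a prior IOFWD_OPEN */
        uint32_t extent_count;     /* number of scatter/gather entries that follow */
        uint32_t flags;            /* e.g. "may be coalesced", "no commit to disk required yet" */
        /* followed on the wire by extent_count struct iofwd_extent entries; the I/O
           node pulls the data, coalesces/realigns it, and replies once it is buffered,
           so the compute process can resume immediately */
    };

Because such a reply would only acknowledge buffering on the I/O hardware, the compute side keeps its determinism while the actual parallel file system writes proceed asynchronously behind it.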

IOFSL: Other Possibilities
Others are interested in helping:
Considering adding current-generation non-volatile storage (NAND flash)
DARPA is possibly interested in adding next-generation non-volatile storage with better write-wear characteristics
Later, after the new parallel I/O paradigm/access methods/semantics have settled some, we could consider NRE to parallel file system vendors to pull this into their products, giving DOE an exit strategy
It is important to understand that IOFSL does not fundamentally solve the increasing-gap issue, but it does provide a tool for helping to manage it.

Questions?