LANL Data Release and I/O Forwarding Scalable Layer (IOFSL): Background and Plans
Intelligent Storage Workshop, May 14, 2008
Gary Grider (HPC-DO), James Nunez (HPC-5), John Bent (HPC-5)
LA-UR-08-0252 / LA-UR-07-7449
Outline
- LANL Data Release
  - What's available and where
- FASTOS I/O Forwarding Scalable Layer
  - Motivation
  - Where we're at
Publicly Available Data from LANL
Available data:
- Nine years of supercomputer operational failure data
- Usage records (job size, processors/machines used, duration, time, etc.)
- Disk failure data for a single supercomputer
- Machine layout information (building, room, rack location in room, node location in rack, hot/cold rows, etc.)
- I/O traces of an MPI-IO based synthetic application
Soon to be released data:
- More disk failure, node, and scratch file system data for several supercomputers
- File system statistics from hundreds of workstations
- Simulation science application I/O traces
Machine Layout Information Example
Machine 8, bldg 1, room 1
,RackPosition,RackPosition,Position in Rack,Row facing
NODE NUM,East/West,North/South,Vertical Position,Single row cluster
N#,1 to 26,28 to 35,"1 to 37, top to bottom",
1,23,28,1,rear to N/Hot
2,23,28,2,rear to N/Hot
3,23,28,3,rear to N/Hot
4,23,28,4,rear to N/Hot
5,23,28,5,rear to N/Hot
6,23,28,6,rear to N/Hot
7,23,28,7,rear to N/Hot
8,23,28,8,rear to N/Hot
160,23,34,20,rear to N/Hot
161,23,34,21,rear to N/Hot
162,23,34,22,rear to N/Hot
163,23,34,23,rear to N/Hot
164,23,34,24,rear to N/Hot
File Systems Statistics Survey
Based on the CMU/Panasas File System Statistics Survey (fsstats)
Purpose:
- Develop a better understanding of file system components
- Gather and build a large DB of static file tree statistics
Usage:
- Parallel statistics collection with LANL's MPI File Tree Walk
- DB query interface
LANL data:
- Production file system statistics from LANL supercomputers & testbeds
- Hundreds of statistics from backups of workstations lab-wide
Output: histograms and statistics on
- File size
- Capacity used
- Directory size
- File name size
File Systems Statistics Survey - Example
histogram,file size
count,2400,items
average,195.763750,kb
min,0,kb
max,18629,kb
bucket min,bucket max,count,percent,cumulative pct
0,2,14,0.005833,0.005833
2,4,20,0.008333,0.014167

histogram,filename length
count,48175,items
average,19.143830,chars
min,0,chars
max,164,chars
bucket min,bucket max,count,percent,cumulative pct
0,7,4150,0.086144,0.086144
8,15,24112,0.500509,0.586653
16,23,10512,0.218204,0.804857
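The bucketed output above is straightforward to reproduce. The sketch below is not the fsstats code itself; the function name and bucket edges are illustrative, but it produces the same count/percent/cumulative-percent rows as the sample.

```python
def bucketize(values, edges):
    """Histogram values into [lo, hi] buckets, where `edges` lists the
    ascending inclusive upper bound of each bucket, and report count,
    percent, and cumulative percent per bucket (fsstats-style rows).
    Values above the last edge are ignored in this simple sketch."""
    counts = [0] * len(edges)
    for v in values:
        for i, hi in enumerate(edges):
            if v <= hi:
                counts[i] += 1
                break
    total = len(values)
    rows, cum, lo = [], 0.0, 0
    for hi, c in zip(edges, counts):
        pct = c / total
        cum += pct
        rows.append((lo, hi, c, pct, cum))
        lo = hi + 1
    return rows

# Tiny example: four file sizes (kb) into buckets 0-2, 3-4, 5-15
for row in bucketize([1, 3, 3, 10], [2, 4, 15]):
    print(row)
```

The real survey data, of course, comes from walking production file trees; this only shows the bucketing step.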
Tracing Mechanisms and Trace Data
Desired attributes in a trace mechanism:
- Minimum overhead
- Bandwidth preserving
Method used: Linux ltrace/strace
Information collected:
- Time stamps to detect node clock skew and drift
- Library and system calls and summary information
- Listing of directories
Trace availability:
- Available now: traces of open source benchmark (MPI-IO based synthetic)
- Real physics application I/O traces
MPI-Based Synthetic I/O Trace - Example
Timing Information:
# Barrier before benchmark_call
7: cadillac113.ccstar.lanl.gov (10378) Entered barrier at 1159808385.170918
7: cadillac113.ccstar.lanl.gov (10378) Exited barrier at 1159808385.173167
3: cadillac117.ccstar.lanl.gov (11335) Entered barrier at 1159808385.166396
3: cadillac117.ccstar.lanl.gov (11335) Exited barrier at 1159808385.168893
5: cadillac115.ccstar.lanl.gov (10373) Entered barrier at 1159808385.168842
Traced Application Data:
10:59:47.092996 MPI_File_open(92, 0x80675c0, 37, 0x80675a8, 0xbfdfe5e4 <unfinished ...>
10:59:47.093718 SYS_statfs64(0x80675c0, 84, 0xbfdfe410, 0xbfdfe410, 0xbd3ff4) = 0 <0.011131>
10:59:47.108352 SYS_open("/panfs/REALM1/scratch/johnbent/O"..., 32832, 0600) = 3 <0.000745>
10:59:47.109189 SYS_close(3) = 0 <0.000063>
10:59:47.109310 SYS_open("/panfs/REALM1/scratch/johnbent/O"..., -2147450814, 0600) = 3 <0.000564>
10:59:47.110912 <... MPI_File_open resumed> ) = 0 <0.017855>
10:59:47.110955 MPI_Wtime(0x8063830, 0xb7fee8dc, 0xbfdfe568,
MPI-Based Benchmark Trace - Example Summary Information
# SUMMARY COUNT OF TRACED CALL(S)
# Function Name      Number of Calls   Total time (s)
# ===================================================
MPIO_Wait            2                 0.000118
MPI_Barrier          29                2.156431
MPI_File_close       2                 0.108482
MPI_File_delete      1                 0.102532

# SUMMARY COUNT OF CALLS WITHIN 1 MPI_File_delete CALL(S)
# Function Name      Number of Calls   Total time (s)
# ===================================================
SYS_ipc              8                 0.000140
SYS_statfs64         1                 0.009227
SYS_unlink           1                 0.092411
Links to Data & Codes
- Machine failure/usage/event/location & disk failure data sets: http://institutes.lanl.gov/data/
- Traces of MPI-IO based synthetic: http://institutes.lanl.gov/data/tdata/
- MPI-IO based synthetic & MPI File Tree Walk: http://institutes.lanl.gov/data/software/
- File Systems Statistics Survey (fsstats) code: http://www.pdsi-scidac.org/fsstats/
Contact e-mail: jnunez@lanl.gov
Outline
- LANL Data Release
  - What's available and where
- FASTOS I/O Forwarding Scalable Layer
  - Motivation
  - Where we're at
What are all the enemies that are conspiring against us?
- CPU and disk trends
- Our machine and application size trends
- Parallel file system solution and industry trends
- Applications' alignment to parallel file systems' scalable capabilities
Disk Trends: Mind the Increasing BW Gap
[Charts: disk capacity improvement, data rate, and data rate per unit capacity across successive disk generations (Ramac; Fuji Eagle 690M-36k; Seagate Elite 1.4G-54k; Seagate Barracuda-4 4.3G-72k; Seagate Barracuda 18G-72k; Seagate Cheetah 73G-10k). Capacity climbs steeply while data rate per unit capacity falls.]
Disk Trends: Mind the Increasing Agility Gap
[Charts: disk capacity improvement, seek rate, and seek rate per unit capacity for the same disk generations (Ramac through Seagate Cheetah 73G-10k). Seek rate per unit capacity falls as capacity grows.]
Why Mind the Ever-Widening Gaps?
Industry trends:
- Processors are still on a steep processing-power curve via multi-core
- Disk capacity is on a steep curve as well
- Disk BW and agility are growing at a slower pace
- Memory per processor core is going down
- Sweet-spot block sizes for disks/RAIDs are getting bigger
Consequences:
- We have to add disk units at a faster pace than we add processor units to stay balanced
- We have to gather more, smaller, and less-aligned pieces of data from more cores and realign/aggregate them for more disk devices
The Hero Calculations vs. MTTI
- Recall that apps compute, then checkpoint, within the MTTI of the machine they are running on. We target < 5% of runtime for checkpointing if possible.
- For hero apps, between 2010 and 2020, 50% or more of application run time will be spent checkpointing
- In the exascale era it could be even worse
  - Will we eventually get to just mirroring the application itself?
- How long can we continue to address the widening gap?
Courtesy: DOE SciDAC PDSI (Garth Gibson)
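The trend on this slide is back-of-the-envelope arithmetic: a full checkpoint must move the machine's memory to disk within each MTTI window, so the overhead fraction is checkpoint time over MTTI. The sketch below uses purely illustrative numbers (memory size, aggregate disk bandwidth, MTTI are assumptions, not LANL measurements).

```python
# Back-of-the-envelope checkpoint-overhead sketch; all inputs are
# illustrative assumptions, not measured LANL data.
def checkpoint_fraction(mem_bytes, disk_bw_bytes_s, mtti_s):
    """Fraction of wallclock spent dumping memory to disk when one
    full-memory checkpoint must complete per MTTI window."""
    t_ckpt = mem_bytes / disk_bw_bytes_s  # seconds to dump all memory
    return t_ckpt / mtti_s

TB = 1 << 40
GB_S = 1 << 30

# 100 TB of memory, 50 GB/s aggregate disk BW, 4-hour MTTI:
# dump takes 2048 s of a 14400 s window, ~14% of runtime.
print(checkpoint_fraction(100 * TB, 50 * GB_S, 4 * 3600))  # ~0.14

# Memory grows faster than disk BW, so doubling memory while BW
# grows only 50% pushes the fraction toward the 50% the slide warns of.
print(checkpoint_fraction(200 * TB, 75 * GB_S, 4 * 3600))  # ~0.19
```

The point is the ratio's direction, not the specific values: as capacity outpaces bandwidth, the fraction only climbs.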
The Parallel File System Industry
- It is very good to be able to say "the parallel file system industry"; in the mid-90s we had none
- Now, with some help, we have multiple solutions
- These solutions scale well on large-block, well-aligned parallel workloads
  - But is that where our workloads are headed?
- These solutions naturally gravitate toward the high-volume market (mid-range supercomputing)
  - But the ultra high end is problematic and not profitable
- High metadata rates are fraught with trade-offs
  - But there are workloads that need high metadata rates
N-to-N example
[Diagram: processes 0, 1, and 2 each hold three segments (T, P, and H). Each process writes its own file: process 0 writes file0, process 1 writes file1, process 2 writes file2, with that process's T, P, and H segments stored contiguously.]
N-to-1 strided example
[Diagram: the same three processes write into a single shared file laid out variable by variable: all three T segments first (T P0, T P1, T P2), then all three P segments, then all three H segments, so each process's writes are strided through the file.]
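The strided layout in the diagram can be sketched as an offset calculation: with N processes and V variables per process, each variable's segments from all ranks are stored together, so one rank's writes land far apart in the file. Segment size and variable count here are illustrative assumptions.

```python
# Sketch of the N-to-1 strided layout in the diagram: the file holds
# all T segments, then all P segments, then all H segments, so each
# rank writes nvars small, widely separated pieces.
def strided_offsets(rank, nprocs, seg_bytes, nvars=3):
    """File offset of each of this rank's variable segments."""
    return [v * nprocs * seg_bytes + rank * seg_bytes
            for v in range(nvars)]

# Rank 1 of 3, 100-byte segments: its T, P, and H segments land at
# offsets 100, 400, and 700 in the shared file.
print(strided_offsets(rank=1, nprocs=3, seg_bytes=100))  # [100, 400, 700]
```

With millions of ranks and shrinking per-core memory, `seg_bytes` gets small while the stride (`nprocs * seg_bytes`) gets huge, which is exactly the small/unaligned-I/O problem the next slide describes.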
Where are our workloads headed?
- N processes to N files: with millions of cores, this is a metadata-rate nightmare
- N processes to 1 file, strided: with millions of cores and each core's memory getting smaller, this is a small/unaligned I/O nightmare
- Our machines will always be pushing the edge on size
- So our workloads are not well suited to how file systems can easily scale (large, aligned, parallel), and our machines will not be where the $s are
One Strategy to Help With All the Enemies
- CPU and disk trends
- Our machine and app size trends
- Parallel file system solution and industry trends
- Apps' alignment to parallel file systems' scalable capabilities
=> IOFSL
IOFSL is Similar to Collective Buffering in Middleware
[Diagram: processes 1-4 each hold blocks destined for four file regions (process p holds blocks p1..p4). Collective buffering exchanges them so each process writes one contiguous region of the parallel file (11 21 31 41 | 12 22 32 42 | 13 23 33 43 | 14 24 34 44), one region per RAID group 1-4.]
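The exchange in the figure can be sketched as a simple transpose: each process contributes one block per file region, and after the exchange each aggregator holds one contiguous region to write. This is an illustrative data-shuffle sketch, not the MPI-IO collective buffering implementation itself.

```python
# Sketch of the collective-buffering exchange in the figure.
def aggregate(per_process_blocks):
    """per_process_blocks[p][r] is process p's block for file region r.
    Returns one contiguous block list per aggregator: aggregator r
    gathers region r's blocks from every process (a transpose)."""
    nregions = len(per_process_blocks[0])
    return [[blocks[r] for blocks in per_process_blocks]
            for r in range(nregions)]

# The 4x4 block layout from the slide: process 1 holds 11..14, etc.
procs = [[11, 12, 13, 14],
         [21, 22, 23, 24],
         [31, 32, 33, 34],
         [41, 42, 43, 44]]
print(aggregate(procs))  # [[11, 21, 31, 41], [12, 22, 32, 42], ...]
```

Each aggregated list maps to one large aligned write against one RAID group, which is the access pattern parallel file systems scale well on.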
IOFSL is Similar to Collective Buffering in Middleware, Except We Reintroduce Hardware Async/Offload
[Diagram: the same exchange as the previous slide, but the aggregated regions pass through hardware async I/O units sitting between the compute processes and the parallel file's RAID groups 1-4.]
In middleware we used the compute processes for aggregation/alignment; in IOFSL we use separate hardware (I/O nodes, or I/O offload hardware on compute nodes) to make the aggregation completely async.
We Can Also Reintroduce Async Alignment
[Diagram: each process's blocks pass through its hardware async I/O unit, which realigns them before they reach the parallel file's RAID groups 1-4.]
In middleware we used the compute processes for aggregation/alignment; in IOFSL we use separate hardware (I/O nodes, or I/O offload hardware on compute nodes) to make the alignment completely async.
We Can Also Reintroduce Access Methods (Delay Striding Until Read, If Ever)
[Diagram: each process's blocks flow through its hardware async I/O unit into the parallel file in arrival order rather than strided order.]
Requires an access method over the file (like a log structure, etc.)
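A log-structured access method of the kind this slide alludes to can be sketched in a few lines: every write is appended in arrival order alongside an index entry recording where it logically belongs, and the strided logical layout is reconstructed only if the file is ever read. This is an illustrative sketch, not the IOFSL or LBIO implementation.

```python
# Minimal log-structured file sketch: writes are pure appends (fast,
# aligned, no seeks); the index defers striding until read time.
class LogStructuredFile:
    def __init__(self):
        self.log = bytearray()   # data in arrival (write) order
        self.index = []          # (logical_offset, length, log_offset)

    def write(self, logical_offset, data):
        self.index.append((logical_offset, len(data), len(self.log)))
        self.log += data         # append only, regardless of offset

    def read(self, logical_offset, length):
        buf = bytearray(length)
        for lo, ln, pos in self.index:   # replay in order: last write wins
            a = max(lo, logical_offset)
            b = min(lo + ln, logical_offset + length)
            if a < b:
                buf[a - logical_offset:b - logical_offset] = \
                    self.log[pos + (a - lo):pos + (b - lo)]
        return bytes(buf)
```

Out-of-order and overlapping writes from many ranks all become sequential appends on the write path; the read path pays the reconstruction cost, which for checkpoints that are rarely read back may be never.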
So what did that buy us?
- It reintroduces complete async offload
  - Move from "compute 95%, dump to disk 5%" to "compute 95%, offload 5%, dump to disk 95%"
  - This buys us 20X the time for the dump to disk
- Aggregation/realignment happens memory to memory
- The file system doesn't have to see all the compute-side parallelism, for both data ops and metadata ops
  - It makes our cutting-edge machines look more like mid-range machines, mating better with parallel file system offerings
  - It converts all traffic to what scales well on parallel file systems, optimizing access patterns
- It allows us to reintroduce access methods, which could delay striding until read
- It allows us to architect a very parallel-aware protocol between compute memory and async I/O memory
  - Something other than POSIX moves us closer to HECEWG without the Linux kernel entanglements
  - Allows us to begin to introduce much more parallel-aware semantics
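The 20X figure on this slide is simple arithmetic: when the app hands its data to the offload layer in the same 5% window it used to spend blocked on disk, the disk-side dump can overlap the next 95% compute phase, so the write window grows by roughly 95/5.

```python
# Arithmetic behind the slide's "20X" claim (percentages taken from
# the slide; nothing here is measured data).
compute_pct, offload_pct = 95, 5

sync_dump_window = offload_pct    # app blocked on disk for 5% of cycle
async_dump_window = compute_pct   # dump overlaps the next compute phase

print(async_dump_window / sync_dump_window)  # 19.0
```

A 19x (roughly 20X) longer write window means the same checkpoint can be absorbed by far fewer, slower, or better-utilized disks.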
IOFSL
- DOE Office of Science and NNSA joint funded: LANL, Sandia, ANL, ORNL, OSC
- Mission: design, build, and distribute a scalable, unified high-end computing I/O forwarding software layer that would be adopted and supported by DOE Office of Science and NNSA
- Goal: provide function shipping at the file system interface level (without requiring middleware) that enables asynchronous coalescing and I/O without jeopardizing determinism for computation
- Work being planned will leverage existing BlueGene / Red Storm solutions
  - ZeptoOS has I/O forwarding already (ANL)
  - LibSysIO has a good remote scatter/gather model (SNL)
- Other work to be leveraged
  - UIUC LBIO (log-structured access methods)
  - IBM access methods and channel programming of the past
  - Etc.
- VFS-level aggregation: no app code changes needed
- Open source, flexible (I/O node memory, I/O offload on compute nodes, etc.)
IOFSL - Other Possibilities
- Others are interested in helping
  - Considering adding current-generation non-volatile storage (NAND/flash)
  - DARPA is possibly interested in adding next-generation non-volatile storage, which has better write-wear characteristics
- Later, after the new parallel I/O paradigm/access methods/semantics have settled some, we could consider NRE to parallel file system vendors to pull this into their products, giving DOE an exit strategy
- It is important to understand that IOFSL does not fundamentally solve the increasing-gap issue, but it does provide a tool for helping to manage it.
Questions?