LANL Data Release and I/O Forwarding Scalable Layer (IOFSL): Background and Plans


1 LANL Data Release and I/O Forwarding Scalable Layer (IOFSL): Background and Plans
Intelligent Storage Workshop, May 14, 2008
Gary Grider, HPC-DO; James Nunez, HPC-5; John Bent, HPC-5
LA-UR /LA-UR

2 Outline
- LANL Data Release
  - What's available and where
- FASTOS I/O Forwarding Scalable Layer
  - Motivation
  - Where we're at

3 Publicly Available Data from LANL
Available data (supercomputers):
- Nine years of computer operational failure data
- Usage records (job size, processors/machines used, duration, time, etc.)
- Disk failure data for a single supercomputer
- Machine layout information (building, room, rack location in room, node location in rack, hot/cold rows, etc.)
- I/O traces of an MPI-IO based synthetic application
Soon-to-be-released data:
- More disk failure data (node and scratch file system) for several supercomputers
- File systems statistics information from hundreds of workstations
- Simulation science application I/O traces

4 Machine Layout Information Example
Machine 8, bldg 1, room 1 (single-row cluster)
NODE NUM,RackPosition East/West,RackPosition North/South,Position in Rack (vertical),Row facing
N#,1 to 26,28 to 35,"1 to 37, top to bottom",
1,23,28,1,rear to N/Hot
2,23,28,2,rear to N/Hot
3,23,28,3,rear to N/Hot
4,23,28,4,rear to N/Hot
5,23,28,5,rear to N/Hot
6,23,28,6,rear to N/Hot
7,23,28,7,rear to N/Hot
8,23,28,8,rear to N/Hot
160,23,34,20,rear to N/Hot
161,23,34,21,rear to N/Hot
162,23,34,22,rear to N/Hot
163,23,34,23,rear to N/Hot
164,23,34,24,rear to N/Hot
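
For readers who grab the released files, a minimal parsing sketch, assuming the layout data follows the comma-separated form shown above; the field names and the three header rows are this sketch's assumptions, not an official schema:

```python
import csv

def read_layout(path):
    """Parse a machine-layout CSV like the sample above.

    Assumes a title row, a column-name row, and a value-range row
    ("N#,1 to 26,...") followed by one row per node; the field names
    here are illustrative, not an official schema.
    """
    nodes = []
    with open(path, newline="") as f:
        rows = csv.reader(f)
        next(rows)  # title row, e.g. "Machine 8, bldg 1, room 1 ..."
        next(rows)  # column-name row
        next(rows)  # value-range row
        for row in rows:
            if len(row) < 5:
                continue  # skip blank or truncated lines
            node, ew, ns, pos, facing = row[:5]
            nodes.append({
                "node": int(node),
                "east_west": int(ew),
                "north_south": int(ns),
                "rack_position": int(pos),  # 1 = top of rack
                "row_facing": facing,       # e.g. "rear to N/Hot"
            })
    return nodes
```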

5 File Systems Statistics Survey
Based on the CMU/Panasas File System Statistics Survey (fsstats)
Purpose:
- Develop a better understanding of file system components
- Gather and build a large DB of static file tree statistics
Usage:
- Parallel statistics collection with LANL's MPI File Tree Walk
- DB query interface
LANL data:
- Production file system statistics from LANL supercomputers and testbeds
- Hundreds of statistics from backups of workstations lab-wide
Output: histograms and statistics on
- File size
- Capacity used
- Directory size
- File name size
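
A single-node sketch of the kind of statistic the survey gathers, assuming power-of-two size buckets; this is illustrative Python, not the fsstats code or LANL's MPI File Tree Walk:

```python
import os

def size_histogram(root):
    """Walk a file tree and bucket file sizes into power-of-two bins,
    in the spirit of fsstats (a sketch, not the fsstats tool itself).
    Returns {bucket_max_kb: count}; the real survey also covers
    capacity used, directory size, and file name length.
    """
    hist = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                kb = os.path.getsize(os.path.join(dirpath, name)) // 1024
            except OSError:
                continue  # file vanished or unreadable; skip it
            bucket = 1
            while bucket < kb:
                bucket *= 2  # buckets: <=1 KB, <=2 KB, <=4 KB, ...
            hist[bucket] = hist.get(bucket, 0) + 1
    return hist

if __name__ == "__main__":
    for bucket, count in sorted(size_histogram(".").items()):
        print(f"<= {bucket} KB: {count}")
```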

6 File Systems Statistics Survey - Example
histogram,file size
count,2400,items
average,,kb
min,0,kb
max,18629,kb
bucket min,bucket max,count,percent,cumulative pct
0,2,14,,
,4,20,,
histogram,filename length
count,48175,items
average,,chars
min,0,chars
max,164,chars
bucket min,bucket max,count,percent,cumulative pct
0,7,4150,,
,15,24112,,
,23,10512,,

7 Tracing Mechanisms and Trace Data
Desired attributes in a trace mechanism:
- Minimum overhead
- Bandwidth preserving
Method used: Linux ltrace/strace
Information collected:
- Time stamps, to detect node clock skew and drift
- Library and system calls, plus summary information
- Listing of directories
Trace availability:
- Available now: traces of an open-source benchmark (MPI-IO based synthetic)
- Real physics application I/O traces
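
A minimal sketch of collecting a comparable trace with ltrace; the flags are standard GNU ltrace options, but the wrapper, traced command, and output path are this sketch's assumptions, not LANL's actual harness:

```python
import subprocess

def trace_run(cmd, out="app.ltrace"):
    """Run a command under ltrace, capturing library and system calls
    with timestamps and per-call durations (cf. the attributes above).
    Flags: -tt absolute microsecond timestamps, -T time spent in each
    call, -S trace system calls too, -f follow child processes.
    """
    return subprocess.run(["ltrace", "-tt", "-T", "-S", "-f",
                           "-o", out] + cmd).returncode

# Hypothetical usage: trace one rank of an MPI-IO benchmark
# trace_run(["./mpi_io_synthetic", "--size", "1m"], out="rank0.ltrace")
```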

8 MPI-Based Synthetic I/O Trace - Example
Timing information:
# Barrier before benchmark_call
7: cadillac113.ccstar.lanl.gov (10378) Entered barrier at
: cadillac113.ccstar.lanl.gov (10378) Exited barrier at
: cadillac117.ccstar.lanl.gov (11335) Entered barrier at
: cadillac117.ccstar.lanl.gov (11335) Exited barrier at
: cadillac115.ccstar.lanl.gov (10373) Entered barrier at
Traced application data:
10:59: MPI_File_open(92, 0x80675c0, 37, 0x80675a8, 0xbfdfe5e4 <unfinished ...>
10:59: SYS_statfs64(0x80675c0, 84, 0xbfdfe410, 0xbfdfe410, 0xbd3ff4) = 0 < >
10:59: SYS_open("/panfs/REALM1/scratch/johnbent/O"..., 32832, 0600) = 3 < >
10:59: SYS_close(3) = 0 < >
10:59: SYS_open("/panfs/REALM1/scratch/johnbent/O"..., , 0600) = 3 < >
10:59: <... MPI_File_open resumed> ) = 0 < >
10:59: MPI_Wtime(0x , 0xb7fee8dc, 0xbfdfe568,

9 MPI-Based Benchmark Trace - Example Summary Information
# SUMMARY COUNT OF TRACED CALL(S)
# Function Name        Number of Calls    Total time (s)
# =======================================================
MPIO_Wait
MPI_Barrier
MPI_File_close
MPI_File_delete
# SUMMARY COUNT OF CALLS WITHIN 1 MPI_File_delete CALL(S)
# Function Name        Number of Calls    Total time (s)
# =======================================================
SYS_ipc
SYS_statfs
SYS_unlink

10 Links to Data & Codes
- Machine failure/usage/event/location & disk failure data sets
- Traces of the MPI-IO based synthetic benchmark
- MPI-IO based synthetic benchmark & MPI File Tree Walk codes
- File Systems Statistics Survey (fsstats) code
- Contact

11 Outline
- LANL Data Release
  - What's available and where
- FASTOS I/O Forwarding Scalable Layer
  - Motivation
  - Where we're at

12 What are all the enemies that are conspiring against us?
- CPU and disk trends
- Our machine and application size trends
- Parallel file system solution and industry trends
- Applications' alignment to parallel file systems' scalable capabilities

13 Disk Trends: Mind the Increasing BW Gap
[Charts: disk capacity improvements, data rate, and data rate per unit capacity across disk generations - Ramac, Fuji Eagle 690M-36k, Seagate Elite 1.4G-54k, Seagate Barracuda-4 4.3G-72k, Seagate Barracuda 18G-72k, Seagate Cheetah 73G-10k. Capacity grows far faster than data rate, so data rate per unit capacity falls.]

14 Disk Trends: Mind the Increasing Agility Gap
[Charts: disk capacity improvements, seek rate, and seek rate per unit capacity across the same disk generations (Ramac through Seagate Cheetah 73G-10k). Seek rate improves far more slowly than capacity, so seek rate per unit capacity falls.]

15 Why Mind the Ever-Widening Gaps?
Industry trends:
- Processors are still on a steep processing-power curve via multi-core
- Disk capacity is on a steep curve as well
- Disk BW and agility growth is at a slower pace
- Memory per processor core is going down
- Sweet-spot block sizes for disks/RAIDs are getting bigger
Consequences:
- We have to add disk units at a faster pace than we add processor units to stay balanced.
- We have to gather more, smaller, and less-aligned pieces of data from more cores and realign/aggregate them for more disk devices.

16 The Hero Calculations vs. MTTI
- Recall that apps compute, then checkpoint, within the MTTI of the machine they are running on.
- We target < 5% of runtime for checkpointing if possible.
- For hero apps, between 2010 and ..., ...% or more of application run time will be spent in checkpointing.
- In the exascale era it could be even worse.
- Will we eventually get to just mirroring the application itself?
- How long can we continue to address the widening gap?
Courtesy: DOE SciDAC PDSI (Garth Gibson)
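
To make the checkpoint/MTTI tension concrete, a back-of-the-envelope sketch; the machine numbers below are invented for illustration, not LANL data:

```python
def checkpoint_fraction(mem_bytes, fs_bw_bytes_per_s, mtti_s):
    """Fraction of wall time spent checkpointing when a full-memory
    checkpoint is written synchronously once per MTTI interval.
    Back-of-the-envelope only; ignores restart cost and
    optimal-interval effects.
    """
    t_ckpt = mem_bytes / fs_bw_bytes_per_s  # time to dump all memory
    return t_ckpt / mtti_s

# Illustrative (made-up) numbers: 100 TB of memory,
# a 100 GB/s file system, and a 1-day MTTI.
frac = checkpoint_fraction(100e12, 100e9, 86400.0)
print(f"{frac:.1%} of runtime spent checkpointing")  # ~1.2%
# As memory grows faster than file system BW and MTTI shrinks,
# this fraction climbs past the 5% target noted above.
```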

17 The Parallel File System Industry
- It is very good to be able to say "the parallel file system industry"; in the mid-90s we had none.
- Now, with some help, we have multiple solutions.
- These solutions scale well on large-block, parallel, well-aligned workloads.
  - But is that where our workloads are headed?
- These solutions naturally gravitate toward the high volume (mid-range supercomputing).
  - But the ultra high end is problematic and not profitable.
- High metadata rates are fraught with trade-offs.
  - But there are workloads that need high metadata rates.

18 N-to-N example
[Diagram: processes 0, 1, and 2 each hold three data blocks (T, P, and H per process) and each writes them contiguously to its own file - file0, file1, file2.]

19 N-to-1 strided example
[Diagram: processes 0, 1, and 2 each hold T, P, and H blocks, and all write into one shared file in which the blocks are interleaved (strided) by type: T P0, T P1, T P2, then P P0, P P1, P P2, then H P0, H P1, H P2.]
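
A minimal mpi4py sketch contrasting the two patterns above; the three-block T/P/H layout, the block size, and the file names are placeholders, not LANL's benchmark:

```python
from mpi4py import MPI  # assumes mpi4py; run with mpirun -n <N>

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()
BLOCK = 1 << 20                                # placeholder: 1 MiB
blocks = [bytearray(BLOCK) for _ in range(3)]  # stand-ins for T, P, H

# N-to-N: every process writes its own file -> N file creates,
# a metadata-rate nightmare at millions of cores.
fh = MPI.File.Open(MPI.COMM_SELF, f"out.{rank}",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
for b in blocks:
    fh.Write(b)
fh.Close()

# N-to-1 strided: all processes share one file; block i of rank r
# lands at offset (i * nprocs + r) * BLOCK -> small, interleaved
# writes that parallel file systems handle poorly when unaligned.
fh = MPI.File.Open(comm, "out.shared",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
for i, b in enumerate(blocks):
    fh.Write_at((i * nprocs + rank) * BLOCK, b)
fh.Close()
```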

20 Where are our workloads headed?
- N processes to N files: with millions of cores, this is a metadata-rate nightmare.
- N processes to 1 file, strided: with millions of cores and each core's memory getting smaller, this is a small/unaligned-I/O nightmare.
- Our machines will always be pushing the edge on size.
- So our workloads are not well suited to how file systems can easily scale (large, aligned, parallel), and our machines will not be where the $s are.

21 One Strategy to Help With All the Enemies
- CPU and disk trends
- Our machine and app size trends
- Parallel file system solution and industry trends
- Apps' alignment to parallel file systems' scalable capabilities
-> IOFSL

22 IOFSL is Similar to Collective Buffering in Middleware
[Diagram: processes 1-4 write through a collective-buffering layer into one parallel file striped across RAID groups 1-4.]
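
For reference, middleware collective buffering is usually requested through MPI-IO hints; a mpi4py sketch using ROMIO's hint names (the values here are illustrative):

```python
from mpi4py import MPI  # sketch: ROMIO hint names, illustrative values

comm = MPI.COMM_WORLD
info = MPI.Info.Create()
info.Set("romio_cb_write", "enable")    # turn on collective buffering
info.Set("cb_nodes", "4")               # aggregate on 4 ranks
info.Set("cb_buffer_size", "16777216")  # 16 MiB aggregation buffers

fh = MPI.File.Open(comm, "out.collective",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY, info)
buf = bytearray(1 << 20)
# Write_at_all is collective: ranks ship data to aggregator ranks,
# which issue large aligned writes -- but the aggregators are compute
# processes, so computation still waits on the I/O.
fh.Write_at_all(comm.Get_rank() * len(buf), buf)
fh.Close()
```

As the next slides show, IOFSL keeps this aggregation pattern but moves it onto separate hardware so it no longer blocks computation.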

23 IOFSL is Similar to Collective Buffering in Middleware, Except We Reintroduce HDWR Async/Offload
[Diagram: compute processes 1-4 hand data to HDWR async I/O stages, which write one parallel file striped across RAID groups 1-4.]
In middleware we used the compute processes for aggregation/alignment; in IOFSL we use separate hardware (I/O nodes, or I/O offload hardware on compute nodes) to make the aggregation completely async.

24 We Can Also Reintroduce Async Alignment
[Diagram: same structure - compute processes 1-4, HDWR async I/O stages, one parallel file across RAID groups 1-4.]
In middleware we used the compute processes for aggregation/alignment; in IOFSL we use separate hardware (I/O nodes, or I/O offload hardware on compute nodes) to make the alignment completely async.

25 We Can Also Reintroduce Access Methods (Delay Striding Until Read, If Ever)
[Diagram: same structure - compute processes 1-4, HDWR async I/O stages, one parallel file across RAID groups 1-4.]
Requires an access method over the file (like log structure, etc.).
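
To make the aggregation/alignment idea concrete, a toy sketch of what such an async stage does memory to memory; this is not IOFSL code, and the stripe size and interfaces are invented for illustration:

```python
STRIPE = 1 << 20  # invented stripe size; real value comes from the FS

def aggregate(writes, stripe=STRIPE):
    """Coalesce small unaligned (offset, data) writes into full,
    stripe-aligned extents -- a toy model of what an async I/O node
    does between compute memory and the parallel file system.
    (A real stage would track dirty ranges instead of zero-filling.)
    """
    extents = {}  # stripe index -> bytearray
    for offset, data in writes:
        pos = offset
        for byte in data:
            idx, off = divmod(pos, stripe)
            buf = extents.setdefault(idx, bytearray(stripe))
            buf[off] = byte
            pos += 1
    # Emit one large aligned write per touched stripe
    return [(i * stripe, bytes(b)) for i, b in sorted(extents.items())]

# Many small unaligned writes in, few large aligned writes out:
small = [(100, b"a" * 10), (STRIPE - 5, b"b" * 10), (2 * STRIPE + 7, b"c" * 3)]
for off, buf in aggregate(small):
    print(f"aligned write: offset={off} len={len(buf)}")
```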

26 So what did that buy us?
- It reintroduces complete async offload.
  - Move from "compute 95%, dump to disk 5%" to "compute 95%, offload 5%, dump to disk 95%".
  - This buys us ~20X the time for the dump to disk.
- Aggregation/realignment happen memory to memory.
- The file system doesn't have to see all the compute-side parallelism, for either data ops or metadata ops.
- It makes our cutting-edge machines look more like mid-range machines, mating better with parallel file system offerings.
- It converts all traffic to what scales well on parallel file systems, optimizing access patterns.
- It allows us to reintroduce access methods, which could delay striding until read.
- It allows us to architect a very parallel-aware protocol between compute memory and async I/O memory (something other than POSIX moves us closer to HECEWG without the Linux kernel entanglements).
- It allows us to begin to introduce much more parallel-aware semantics.
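
The ~20X figure is simple arithmetic; a sketch using the slide's illustrative percentages:

```python
# Synchronous checkpointing: the disk dump must fit in the 5% of
# wall time not spent computing.
compute_frac = 0.95
sync_dump_window = 1.0 - compute_frac  # 0.05 of runtime

# With async offload, compute hands data to the I/O layer in that
# same 5%, but the dump to disk then overlaps the next compute
# phase, stretching its window to ~95% of runtime.
async_dump_window = compute_frac       # 0.95 of runtime

print(round(async_dump_window / sync_dump_window, 1))  # 19.0 -> ~20X
```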

27 IOFSL
- DOE Office of Science and NNSA joint funded: LANL, Sandia, ANL, ORNL, OSC
- Mission: design, build, and distribute a scalable, unified high-end computing I/O forwarding software layer that would be adopted and supported by the DOE Office of Science and NNSA
- Goal: provide function shipping at the file system interface level (without requiring middleware) that enables asynchronous coalescing and I/O without jeopardizing determinism for computation
- Work being planned will leverage existing BlueGene / Red Storm solutions:
  - ZeptoOS has I/O forwarding already (ANL)
  - LibSysIO has a good remote scatter/gather model (SNL)
- Other work to be leveraged:
  - UIUC LBIO (log-structured access methods)
  - IBM access methods and channel programming of the past
  - Etc.
- VFS-level aggregation: no app code changes needed
- Open source, flexible (I/O node memory, I/O offload on compute nodes, etc.)

28 IOFSL - Other Possibilities
- Others are interested in helping.
- Considering adding current-generation non-volatile storage (NAND/flash).
- DARPA is possibly interested in adding next-generation non-volatile storage, which has better write-wear characteristics.
- Later, after the new parallel I/O paradigm/access methods/semantics have settled some, we could consider NRE to parallel file system vendors to pull this into their products, giving DOE an exit strategy.
- It is important to understand that IOFSL does not fundamentally solve the increasing-gap issue, but it does provide a tool for helping to manage it.

29 Questions?
