Parallel File Systems
John White, Lawrence Berkeley National Lab
Topics
- Defining a File System
- Our Specific Case for File Systems
- Parallel File Systems
- A Survey of Current Parallel File Systems
- Implementation
What Is a File System?
Simply, a method for ensuring:
- A Unified Access Method to Data
- Organization (in a technical sense)
- Data Integrity
- Efficient Use of Hardware
The HPC Application (Our Application)
- Large Node Count
- High-IO Code (small file operations)
- High-Throughput Code (large files, fast)
- You Can Never Provide Too Much Capacity
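The two code profiles above stress a file system very differently. A minimal sketch (illustrative sizes and hypothetical helper names, not anything from the talk) of the two POSIX access patterns:

```python
# High-IO code issues many small operations; high-throughput code issues
# few large ones. Same bytes on disk, very different metadata/RPC load.

def write_small_ops(path: str, n_records: int = 1000, record: bytes = b"x" * 64):
    """Many small writes: one buffered write call per record."""
    with open(path, "wb") as f:
        for _ in range(n_records):
            f.write(record)

def write_large_op(path: str, n_records: int = 1000, record: bytes = b"x" * 64):
    """One large write: the same data in a single call."""
    with open(path, "wb") as f:
        f.write(record * n_records)
```

On a parallel file system the second pattern is generally the one that approaches the advertised aggregate throughput; the first is dominated by per-operation overhead.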
What's the Problem With Tradition?
- NFS/CIFS/AFP/NAS is slow:
  - Single point of contact for both data and metadata
  - Protocol overhead
  - File-based locking
- We want parallelism from the application all the way to disk
- We need a single namespace
- We need truly massive aggregate throughput (stop thinking MB/s)
- Bottlenecks are inherent to the architecture
- Most importantly:
Researchers Just Don't Care
- They want their data available everywhere
- They hate transferring data (this bears repeating)
- Their code wants the data several cycles ago
- If they have to learn new IO APIs, they commonly won't use them, period
- An increasing number aren't aware their code is inefficient
Performance in Aggregate: A Specific Case
- File system capable of 5 GB/s
- Researcher running an analysis of past stock ticker data
  - 10 independent processes per node, 10+ nodes, sometimes 1000+ processes
- Was running into performance issues
- In reality, the code was hitting 90% of the file system's peak performance
  - 100s of processes choking each other
- Efficiency is key
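The arithmetic behind this case is worth spelling out: aggregate bandwidth is fixed, so adding processes divides it rather than multiplying it. A sketch with the 5 GB/s figure from the example (the even-share assumption is a best case; real contention is worse):

```python
# Best-case even split of a fixed aggregate bandwidth across N processes.
AGGREGATE_GBPS = 5.0  # file system peak from the example above

def per_process_share(gbps_total: float, nprocs: int) -> float:
    """Ideal even share of aggregate bandwidth per process, in GB/s."""
    return gbps_total / nprocs

for nprocs in (10, 100, 1000):
    share_mbps = per_process_share(AGGREGATE_GBPS, nprocs) * 1024
    print(f"{nprocs:>5} processes -> ~{share_mbps:.1f} MB/s each (best case)")
```

At 1000 processes each one sees only a few MB/s even when the file system is running at peak, which is exactly how "hitting 90% of peak" and "performance issues" coexist.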
Parallel File Systems
A file system that provides access to massive amounts of data at large client counts:
- Simultaneous client access at sub-file levels
- Striping at sub-file levels
- Massive scalability
- A method to aggregate large numbers of disks
Popular Parallel File Systems: Lustre
- Purchased by Intel
- Support offerings from Intel, Whamcloud, and numerous vendors
- Object based
- Growing feature list:
  - Information Lifecycle Management
  - Wide-area mounting support
  - Data replication and metadata clustering planned
- Open source
- Large and growing install base, vibrant community
- Open compatibility
Popular Parallel File Systems: GPFS
- IBM; born around 1993 as the Tiger Shark multimedia file system
- Support direct from the vendor
- AIX, Linux, some Windows
- Ethernet and InfiniBand support
- Wide-area support
- ILM
- Distributed metadata and locking
- Mature storage pool support
- Replication
Licensing Landscape
- GPFS (a story of a huge feature set at a huge cost)
  - Binary IBM licensing
  - Per core or site-wide
- Lustre
  - Open
  - Paid licensing available, tied to support offerings
Striping Files
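The idea of striping can be made concrete with a little arithmetic: a file is cut into fixed-size stripes laid out round-robin across storage targets (Lustre OSTs, for example), so any byte offset maps to a specific target. A minimal sketch, with illustrative stripe size and count rather than any system's defaults:

```python
# Sub-file striping: stripe i of a file lives on target (i mod stripe_count).
# Clients can then read/write different stripes of one file in parallel.

def locate(offset: int, stripe_size: int = 1 << 20, stripe_count: int = 4):
    """Map a byte offset to (target index, byte offset within that target)."""
    stripe_index = offset // stripe_size           # which stripe overall
    target = stripe_index % stripe_count           # round-robin target choice
    # Bytes this target already holds from earlier full rounds:
    prior_full_rounds = stripe_index // stripe_count
    target_offset = prior_full_rounds * stripe_size + offset % stripe_size
    return target, target_offset

print(locate(0))                    # -> (0, 0): byte 0 is on target 0
print(locate(1 << 20))              # -> (1, 0): stripe 1 starts on target 1
print(locate(5 * (1 << 20) + 10))   # -> (1, 1048586): stripe 5, second round
```

Because consecutive stripes land on different targets, a large sequential read fans out across all of them at once, which is where the aggregate throughput comes from.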
SAN
- All nodes have access to the storage fabric and all LUNs
Direct Connect
- A separate storage cluster hosts the file system and exports it via a common fabric
Berkeley Research Computing: Savio Scratch File System
- Current:
  - Lustre 2.5
  - 210 TB of DDN 9900
  - ~10 GB/s ideal throughput
  - Accessible on all nodes
- Future:
  - Lustre 2.5 or GPFS 4.1
  - ~1 PB+ capacity
  - ~20 GB/s throughput
  - Vendor yet to be determined
Berkeley Research Computing: Access Methods
- Available on every node:
  - POSIX
  - MPI-IO
- Data transfer:
  - Globus Online
    - Ideal for large transfers
    - Restartable
    - Tuned for large networks and long distances
    - Easy-to-use graphical interface online
  - SCP/SFTP
    - Well known
    - Suitable for quick-and-dirty transfers
Current Technological Landscape
- Tiered storage (storage pools): for multiple storage needs within a single namespace
  - SSD/FC for jobs and metadata (Tier 0)
  - SATA for capacity (Tier 1)
  - Tape for long-term/archival (Tier 2)
- ILM: basically, perform actions on data per a rule set
  - Migration to tape
  - Fast Tier 0 storage use case
  - Purge policies
  - Replication
  - Dangers of metadata operations
  - Long-term storage
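An ILM rule set can be pictured as a predicate run over file metadata. A hedged sketch of one such rule, "flag files not modified in N days as purge candidates" (a userspace walk for illustration only; real deployments let the file system's policy engine do this scan, since walking metadata at scale is exactly the danger noted above):

```python
# Illustrative purge-policy rule: select files whose mtime is older than
# a cutoff. Function and parameter names here are ours, not any product's.
import os
import time

def purge_candidates(root: str, max_age_days: float):
    """Yield paths under root not modified within max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                yield path
```

The same rule shape covers the other ILM actions listed above: swap "yield for purge" for "migrate to the tape tier" or "replicate to another pool" and the policy machinery is otherwise identical.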
Further Information
- Berkeley Research Computing: http://research-it.berkeley.edu/brc
- HPCS at LBNL: http://scs.lbl.gov/
- Email: jwhite@lbl.gov