Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments
Matthew Woitaszek (matthew.woitaszek@colorado.edu)
LCI HPC Revolution 2005, 26 April 2005
Collaborators
Organizations: National Center for Atmospheric Research (NCAR); University of Colorado, Boulder (CU)
Researchers: Jason Cope, Michael Oberg, Henry Tufo
Outline
- Motivation
- Parallel Filesystem Products and Experiences: PVFS2, Lustre, GPFS, TerraGrid FS
- Results: Single-Node Performance; Parallel Bandwidth and Metadata Performance
- Future Work
Related Work (LLNL, NCSA, SDSC)
- Directly involved with Cluster File Systems and Lustre; goal: create a filesystem not restricted to specific hardware
- Exploring the breadth of available parallel filesystems and integration with mass storage and TeraGrid systems; examined GPFS, GFS, Panasas, SamFS, SGI CXFS, an ADIC solution, Lustre, and IBRIX
- LCI 2004: examined PVFS, Lustre, and GPFS on IA-64, focusing on a homogeneous architecture with fibre-channel-equipped storage servers
Motivation: NCAR Storage Systems
[Diagram] Supercomputers and clusters (with local working storage), visualization systems (with local working storage), archival storage (tape silo system with an archive management and disk cache controller), and a grid gateway (GridFTP server, DataMover server), all sharing a storage cluster with a shared filesystem
Motivation: Current CU Boulder Systems
- NFS servers: home directories, shared software, working space
- Compute clusters: Xeon cluster (64), PPC970 cluster (28)
Motivation: Future CU Boulder Systems
[Diagram] A CU storage cluster serving the NCAR experimental platforms and the CU compute clusters: Xeon cluster (64), PPC970 cluster (28), Opteron cluster (128)
Parallel Filesystem Features
Typical desired features:
- Availability (no downtime)
- Reliability (no data loss)
- Performance (no waiting)
- Scalability (no limits)
- Affordability (no cost)
We're starting small, but we want our filesystem to grow with us: two storage servers at the present time, with support for expansion and external connectivity.
Research Objectives
- Find a high-performance parallel filesystem with a minimum of specialized hardware: commodity servers with directly attached disk, filesystem access over Ethernet
- Support a heterogeneous cluster client environment (servers: Xeon, Xeon EM64T; clients: PPC970, Xeon, and eventually Opteron)
- Examine features and requirements: functionality, performance, administrative overhead
Filesystems Overview and Experience
We examined cluster-based filesystems: PVFS2, Lustre, GPFS, and TerraFS.
We did not examine SAN solutions:
- SAN solutions require expensive hardware
- NCAR uses SANs as collective storage among hosts, but not between supercomputers
- A separate NCAR team is evaluating SANs
At SC2004, neither GPFS nor Lustre supported heterogeneous Xeon and PPC970 client environments.
Experience: PVFS2
Installation and configuration:
- Compile the kernel module on clients only; no restrictions
- Two storage servers and one metadata server
- Very easy to install and configure; worked on our original systems with no kernel changes
A stable and reliable parallel filesystem in our environment.
Experience: Lustre
Our experience came in phases:
- Phase 1: Trying to build our own kernel patches
- Phase 2: Using the pre-built Lustre kernels
- Phase 3: Using a custom Lustre kernel
Final configuration:
- Required changing the Xeon cluster to SLES 9
- Custom Lustre PPC-enabled kernels built on SLES 9
- Two object storage targets and one metadata server
The final phase worked on all systems in our environment: very reliable on Xeon, less reliable and with performance variances on PPC970.
Experience: GPFS
Installation and configuration:
- Compile the kernel module on all machines; very restricted
- Quick and pleasant out-of-box experience
- Exceptionally well documented and robust
Final configuration:
- Required removing LVM on the storage servers
- Required changing the Xeon cluster to SLES 9
- Two NSD storage servers
Worked on all of the clusters in our environment.
Experience: TerraGrid (TerraFS)
Components:
- iSCSI initiator on the clients
- Cache-coherent iSCSI target daemon
- Linux md (multi-device) software
- Kernel-level ext2-derivative filesystem
[Diagram] Client stack: Linux VFS, the TerraFS daemon/filesystem, Linux md, and the iSCSI initiator, with SCSI commands carried over iSCSI to the target daemon
Word sense disambiguation: the official product name is TerraGrid, frequently abbreviated as TerraFS.
Experience: TerraFS
Installation and configuration:
- Initial install performed by TerraScale engineers, then replicated on additional nodes
- Documentation and software differ slightly (md RAID support)
Final configuration:
- Required a custom TerraScale-built kernel
- Two storage targets, no metadata server
The final configuration worked only on the Xeon cluster; it generates lots of error messages in failure conditions, and there is no current support for PPC970.
Table of Administrator Pain and Agony

                               GPFS 2.3              Lustre 1.4.0          PVFS2               TerraFS
Intel x86-64 metadata server   Not Used              Restricted SLES .141  No Change           Not Used
Intel Xeon storage server      Restricted SLES .111  Restricted SLES .141  No Change           No Change
PPC970 client                  Restricted SLES .111  Restricted SLES .141  All (Module Only)   N/A
Intel Xeon client              Restricted SLES .111  Restricted SLES .141  All (Module Only)   Custom 2.4.26 Patch

(Our Xeon cluster is only 2.5 years old.)
- GPFS required a commercial OS and a specific kernel version
- Lustre required a commercial OS and a specific kernel patch
- TerraFS required a custom kernel
- Systems already running SuSE required less effort
- The original goal was to fit the filesystem into the environment
Performance: Experimental Setup
- Storage servers: dual Xeon 3.06 GHz, 2.5 GB RAM, SCSI-320 disk array, 4 x 400 GB LVM partitions
- Metadata server (optional): dual Xeon EM64T 3.4 GHz, 8 GB RAM
- Clients: Xeon cluster (7 or 14 nodes), PPC970 cluster (14 or 27 nodes)
- Network: core switch, dual 1 Gbps trunked/bonded links to the servers
Other impromptu independent variables:
- Impact of Linux channel bonding on the servers (PVFS2, Lustre)
- Impact of Linux Logical Volume Management (Lustre)
One disclaimer: GPFS was not run with LVM.
Performance Results I: Single-Node Bandwidth
CU workload characteristics: the clusters are utilized as a compute farm, and 75% of jobs are serial (33% of compute time).
We used iozone to measure single-node performance; a sketch of what such a test exercises follows.
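The published numbers come from iozone itself; as a rough illustration of what the single-node test measures, here is a minimal C sketch that times a sequential write and read of one large file through a mounted filesystem. The file size, record size, and mount-point path are illustrative assumptions, not the iozone parameters used in the study.

```c
/* Minimal single-node sequential bandwidth sketch (the study used iozone;
 * this only illustrates the shape of such a test). Sizes and the path
 * below are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define FILE_SIZE (512L * 1024 * 1024)   /* 512 MB test file (assumed) */
#define REC_SIZE  (1024 * 1024)          /* 1 MB records (assumed)     */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    const char *path = "/mnt/pfs/bw_test.dat";   /* hypothetical mount point */
    char *buf = malloc(REC_SIZE);
    memset(buf, 'x', REC_SIZE);

    /* Sequential write */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    double t0 = now();
    for (long done = 0; done < FILE_SIZE; done += REC_SIZE)
        write(fd, buf, REC_SIZE);
    fsync(fd);                                    /* push data out of the page cache */
    close(fd);
    printf("write: %.1f MB/s\n", FILE_SIZE / (now() - t0) / (1024 * 1024));

    /* Sequential read; the result is optimistic if the file still fits
     * in the client's page cache. */
    fd = open(path, O_RDONLY);
    t0 = now();
    while (read(fd, buf, REC_SIZE) > 0)
        ;
    close(fd);
    printf("read:  %.1f MB/s\n", FILE_SIZE / (now() - t0) / (1024 * 1024));

    free(buf);
    return 0;
}
```

iozone additionally varies record and file sizes and reports many access patterns; the results in the following figures come from its output rather than from a sketch like this.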
[Figure] Single-node read performance: NFS, PVFS2, Lustre, TerraFS, GPFS, local disk
[Figure] Single-node write performance: NFS, PVFS2, Lustre, TerraFS, GPFS, local disk
Performance Results II: Aggregate Bandwidth
NCAR caggreio benchmark (used by NCAR for previous procurements):
- Writes 30 x 128 MB files, a separate file per process
- Does not measure concurrent-writer performance
- Measures average aggregate bandwidth: each process runs independently and is timed, and the average time is used to produce a bandwidth figure (see the sketch below)
We also examined channel bonding variants: no improvement for Lustre, substantial improvement for PVFS2.
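The caggreio source is not reproduced here; the following MPI C sketch only mirrors the structure described above: each process writes its own 128 MB file, every process is timed independently, and the average per-process time is converted into an aggregate rate. The output directory and record size are assumptions.

```c
/* Aggregate-bandwidth sketch patterned on the caggreio description above
 * (not the NCAR benchmark itself). Paths and sizes are assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FILE_MB  128                 /* one 128 MB file per process */
#define REC_SIZE (1024 * 1024)

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char path[256];
    snprintf(path, sizeof(path), "/mnt/pfs/caggre_%03d.dat", rank); /* hypothetical path */

    char *buf = malloc(REC_SIZE);
    memset(buf, 'x', REC_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);              /* start all writers together */
    double t0 = MPI_Wtime();
    FILE *fp = fopen(path, "wb");
    for (int i = 0; i < FILE_MB; i++)
        fwrite(buf, 1, REC_SIZE, fp);
    fclose(fp);
    double elapsed = MPI_Wtime() - t0;

    /* Average the independent per-process times, then report an aggregate rate. */
    double sum;
    MPI_Reduce(&elapsed, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        double avg = sum / nprocs;
        printf("aggregate write: %.1f MB/s (%d procs, avg time %.2f s)\n",
               nprocs * FILE_MB / avg, nprocs, avg);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Because each process is timed independently and the times are averaged, this reports average aggregate bandwidth rather than the behavior of truly concurrent writers to a shared file, matching the caveat on the slide.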
[Figure] Xeon cluster aggregate read rate
[Figure] Xeon cluster aggregate write rate
[Figure] PPC970 cluster aggregate read rate
[Figure] PPC970 cluster aggregate write rate
Performance Results III: Metadata Testing
NCAR metarates benchmark (used by NCAR for previous procurements):
- Writes 10,000 files per task
- Places files in a single shared directory or in unique per-task directories
- Measures the average file creation rate (see the sketch below)
No GPFS results on the PPC970 cluster: GPFS was functional and was tested, but we were unable to select balanced nodes for testing.
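As with caggreio, the metarates source is not reproduced here; this MPI C sketch only follows the description above: each task creates 10,000 empty files, either in one shared directory or in a per-task directory, and the average creation rate is reported. The paths and the command-line switch are assumptions.

```c
/* File-creation-rate sketch patterned on the metarates description above
 * (not the NCAR benchmark itself). Paths and the flag are assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define NFILES 10000

int main(int argc, char **argv)
{
    int rank, nprocs;
    int unique_dirs = (argc > 1);      /* any argument selects per-task directories */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char dir[256], path[320];
    if (unique_dirs) {
        snprintf(dir, sizeof(dir), "/mnt/pfs/meta_%03d", rank);   /* hypothetical paths */
        mkdir(dir, 0755);
    } else {
        snprintf(dir, sizeof(dir), "/mnt/pfs/meta_shared");
        if (rank == 0)
            mkdir(dir, 0755);
    }

    MPI_Barrier(MPI_COMM_WORLD);       /* directories exist; start all tasks together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < NFILES; i++) {
        snprintf(path, sizeof(path), "%s/f_%03d_%05d", dir, rank, i);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        close(fd);
    }
    double elapsed = MPI_Wtime() - t0;

    double sum;
    MPI_Reduce(&elapsed, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        double avg = sum / nprocs;
        printf("average creation rate: %.0f files/s per task (%d tasks)\n",
               NFILES / avg, nprocs);
    }
    MPI_Finalize();
    return 0;
}
```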
[Figure] Metadata creation rate, same directory: NFS, PVFS2, Lustre, TerraFS, GPFS
[Figure] Metadata creation rate, unique directories: NFS, PVFS2, Lustre, TerraFS, GPFS
[Figure] GPFS metarates: file creations per second with a unique directory for each task
Linux Logical Volume Management (LVM)
There's always something:
- GPFS was the last system we tested, and GPFS cannot run on top of LVM devices
- We had used LVM with every other filesystem
- Lustre and GPFS demonstrated close bandwidth results
Conclusion: did Linux Logical Volume Management affect Lustre's performance?
- LVM has no statistically significant impact on Lustre reads (the 95% confidence intervals overlap; see below)
- Xeon cluster writes are faster without LVM on the servers
- PPC970 cluster writes are inconclusive
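For reference, the overlap comparison above corresponds to a standard two-sided 95% confidence interval on the mean bandwidth of n repeated runs (a conventional t-based interval; the slide does not state the exact form used):

    x̄ ± t(0.025, n−1) · s / √n

where x̄ is the sample mean, s the sample standard deviation, and t(0.025, n−1) the Student-t critical value. If the intervals for the runs with and without LVM overlap, the difference is not treated as statistically significant.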
Future Work
- Production filesystem installation: dedicate 2/3 of the server space to GPFS, reserve 1/3 for Lustre, and subject the filesystem to our user community
- MPI-IO concurrent-write performance testing
- Wide-area network filesystem access
- Examine higher-performance and heterogeneous interconnects (InfiniBand, 10 Gbps Ethernet, Gigabit Ethernet); a single-network solution is not possible
Desired Features in a Production Filesystem
- Remain responsive even in failure conditions: a filesystem failure should not interrupt standard UNIX commands used by administrators; ls -la /mnt or df should not hang the console; zombie processes should respond to kill -s 9
- Support clean normal and abnormal termination: support both service start and shutdown commands, and provide an "emergency stop" feature
- Never hang the Linux reboot command: cut losses and let the administrators fix things
Conclusions
- Heterogeneous client support is a recent feature; expect full out-of-box capabilities in the next calendar year with GPFS, PVFS2, and Lustre
- Specific kernel dependencies and custom kernel patch implementations are a substantial inconvenience
- Parallel filesystem selection depends on individual site requirements and capabilities: increased cost (operating system support contracts), decreased research flexibility, and delays when applying security patches
Looking forward to Steve Woods' Lustre presentation.
Acknowledgements
- Cluster File Systems (Lustre): Jeffrey Denworth, Phil Schwan, and Jacob Berkman
- IBM (GPFS): Ray Paden, Gautam Shah, Barry Bolding, and Rajiv Bendale
- NCAR: Bill Anderson, Pam Gillman, George Fuentes, and Rich Loft
- TerraScale Technologies (TerraFS): Tim Wilcox and Dave Jensen
- University of Colorado, Boulder: Theron Voran
Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments
Questions?
Matthew Woitaszek (matthew.woitaszek@colorado.edu)