Beyond Petascale
Roger Haskin, Manager, Parallel File Systems, IBM Almaden Research Center
GPFS Research and Development
- The GPFS product originated at the IBM Almaden Research Center
- Research continues to be involved in prototyping and developing new GPFS features and related technology
GPFS Parallel File System
- GPFS: IBM's parallel cluster file system, based on the shared-disk (SAN) model
  - Cluster: fabric-interconnected nodes (IP, SAN, ...)
  - Shared disk: all data and metadata on fabric-attached block storage
  - Parallel: data and metadata flow from all of the nodes to all of the disks in parallel, under control of a distributed lock manager (a sketch of the striping idea follows this slide)
- Runs on pSeries, IA32, IA64; AIX and Linux
- Installed on many Top 500 computers, including ASCI White, ASCI Blue, Blue Horizon, and others
- Applications include HPC, scalable file and Web servers, digital libraries, video streaming, OLAP, financial data management, and engineering design
- Present customers are approaching file systems of a petabyte
- New applications will drive further demand for petascale storage
- Technical challenges must be overcome
[Diagram: GPFS file system nodes connected by a switching fabric (system or storage area network) to shared disks (SAN-attached or network block device)]
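To make the shared-disk striping concrete, here is a minimal sketch, assuming an illustrative 256 KB block size and eight shared disks; it is not GPFS code, just the round-robin block mapping that lets data from many nodes flow to all disks at once.

```python
# Minimal sketch (not GPFS code): round-robin striping of a file's blocks
# across N shared disks, the basic idea behind parallel data flow from
# all nodes to all disks. Block size and disk count are illustrative.

BLOCK_SIZE = 256 * 1024          # illustrative block size
NUM_DISKS = 8                    # shared disks visible to every node

def block_location(file_offset: int):
    """Map a file offset to (disk index, block number on that disk)."""
    block_no = file_offset // BLOCK_SIZE
    return block_no % NUM_DISKS, block_no // NUM_DISKS

def blocks_for_range(offset: int, length: int):
    """List the (disk, block) pairs a read/write of [offset, offset+length) touches."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return [block_location(b * BLOCK_SIZE) for b in range(first, last + 1)]

if __name__ == "__main__":
    # A 4 MB write starting at offset 0 spreads over all eight disks,
    # so many nodes writing disjoint ranges keep every disk busy.
    print(blocks_for_range(0, 4 * 1024 * 1024))
```

Because every node computes the same mapping, nodes writing disjoint byte ranges hit different disks concurrently; the distributed lock manager only has to serialize conflicting ranges.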
Future Trends
- Clusters growing larger and cheaper
  - More nodes (Linux), larger SMP nodes (AIX)
  - More storage, lower-cost storage, larger files and file systems (multiple petabytes)
  - More sites with multiple clusters
- Storage fabrics growing larger (more ports)
- Variety of storage fabric technologies
  - Fibre Channel, cluster switch, 1000bT, 10000bT, InfiniBand, ...
  - Technology blurs the distinction among SAN, LAN, and WAN
- Why can't I use whatever fabric I have for storage?
- Why do I need separate fabrics for storage and communication?
- Why can't I share files across my entire SAN like I can with Ethernet?
- Why can't I access storage over long distances?
Data Access Beyond the Cluster
- Problem: in a large data center, multiple clusters and other nodes need to share data over the SAN
- Solution: eliminate the notion of a fixed cluster
  - Control nodes for administration, lock management, recovery, ...
  - File access from client nodes (a sketch of the mount handshake follows this slide)
    - Client nodes authenticate to control nodes to mount a file system
    - Client nodes are trusted to enforce access control
    - Clients still directly access disk data and metadata
- Issues:
  - Scalability (10,000 nodes)
  - Can no longer base fault tolerance on other clustering software
  - Administration (parallel?)
[Diagram: Cluster 1, Cluster 2, and a visualization system reach control nodes over IP and shared storage over the SAN]
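The mount handshake can be sketched as follows. This is an illustrative outline under assumed names (ControlNode, MountGrant, the credential table), not GPFS's actual protocol or API: the client authenticates to a control node, receives the file system's disk map and a lease, and then performs data and metadata I/O directly on the shared disks.

```python
# Illustrative sketch only (not GPFS's actual protocol or API): the shape of
# a client mount handshake in the shared-disk model. All names are made up.

from dataclasses import dataclass
from typing import List

@dataclass
class MountGrant:
    fs_name: str
    disk_map: List[str]      # block devices the client will access directly
    lease_seconds: int       # client must renew before this expires

class ControlNode:
    def __init__(self):
        self.trusted_keys = {"client-42": "secret"}   # hypothetical credential store
        self.filesystems = {"/gpfs/fs1": ["/dev/sdb", "/dev/sdc", "/dev/sdd"]}

    def mount(self, client_id: str, key: str, fs_name: str) -> MountGrant:
        # The control node authenticates the client node; after that the
        # client is *trusted* to enforce access control on its own users.
        if self.trusted_keys.get(client_id) != key:
            raise PermissionError("client not authorized to mount")
        return MountGrant(fs_name, self.filesystems[fs_name], lease_seconds=300)

if __name__ == "__main__":
    grant = ControlNode().mount("client-42", "secret", "/gpfs/fs1")
    # Data and metadata I/O now goes straight to grant.disk_map over the SAN;
    # only locking, recovery, and admin traffic involve the control nodes.
    print(grant)
```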
File I/O over Fibre Channel SAN
- SDSC TeraGrid cluster
  - 128 IA64 compute nodes
  - 48 Sun StorEdge RAID LUNs
  - 14 TB file system
  - 4 Brocade Fibre Channel switches
- Flow control would help further scaling
[Chart: "Throughput - GPFS at SDSC": write and read throughput (KB/sec) vs. number of nodes, up to ~120 nodes]
GPFS and TeraGrid
- TeraGrid
  - NCSA, SDSC, ANL, Caltech, PSC, ...
  - Shared computing grid
  - 40+ GB/s backbone
- Goal: sharing data over the backbone
  - GPFS data center solution, scaled over the WAN
  - IP for storage access adds 10-60 ms, but under load, storage latency is much higher than this anyway! (a bandwidth-delay sketch follows this slide)
- Additional issues:
  - Decentralized administration (UIDs)
  - Globus security
  - Single name space
- Joint work in progress
  - Pluggable access control
  - Name space
  - Technology demo at SC03
[Diagram: SDSC, NCSA, and SC03 show-floor clusters, each with its own SAN and NSD servers, mount the /SDSC, /NCSA, and /Sc2003 GPFS file systems both locally over the SAN and remotely over the SCinet WAN; visualization runs on the show floor]
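A rough bandwidth-delay calculation shows why parallel, pipelined I/O is what hides WAN latency; the 10 Gb/s link, 60 ms round trip, and 256 KB request size below are assumptions for illustration, not measured numbers.

```python
# Back-of-the-envelope sketch (illustrative numbers, not measured SC03 data):
# how much data must be in flight to keep a WAN storage link busy despite
# added latency, i.e. the bandwidth-delay product that parallel, pipelined
# GPFS I/O has to cover.

def in_flight_bytes(link_gbit_per_s: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes that must be outstanding to fill the pipe."""
    return (link_gbit_per_s * 1e9 / 8) * (rtt_ms / 1e3)

def requests_needed(link_gbit_per_s: float, rtt_ms: float, request_bytes: int) -> int:
    """Concurrent requests of a given size needed to cover that product."""
    bdp = in_flight_bytes(link_gbit_per_s, rtt_ms)
    return max(1, round(bdp / request_bytes))

if __name__ == "__main__":
    # Assumed: a 10 Gb/s WAN link, 60 ms round trip, 256 KB I/O requests.
    print(in_flight_bytes(10, 60) / 1e6, "MB in flight")               # ~75 MB
    print(requests_needed(10, 60, 256 * 1024), "concurrent requests")  # ~286
```

With many nodes each keeping several requests outstanding, that depth of pipelining is easy to reach, which is why the added 10-60 ms need not cost throughput.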
File I/O over WAN
- SDSC SC03 demo cluster
  - 40 IA64 compute nodes
  - 60 Sun StorEdge RAID controllers
  - 75 TB file system
  - 4 Brocade Fibre Channel switches
  - 16 IA64 NSD servers, 1 FC and 1 GE each
  - 10 Gb SCinet WAN link
- Inconsistent write results because the link was being shared
- GPFS I/O parallelism successfully hides WAN latency
- TCP flow control appears adequate to prevent throughput fall-off
[Chart: "gpfsperf read and write 10G file on N nodes - /sdsc": write and read throughput (KB/sec) vs. number of nodes, up to 36 nodes]
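For a feel of how such a test is structured, here is a gpfsperf-style throughput probe in miniature; it is a sketch, not the real gpfsperf tool, and the mount point /gpfs/fs1, file size, and worker count are assumptions.

```python
# A minimal gpfsperf-style throughput probe (a sketch, not the actual gpfsperf
# tool): several processes write disjoint ranges of one shared file and the
# aggregate rate is reported. Path, size, and worker count are assumptions.

import os, time
from multiprocessing import Pool

PATH = "/gpfs/fs1/perf.tmp"       # hypothetical GPFS mount point
FILE_SIZE = 1 << 30               # 1 GiB total, for a quick run
WORKERS = 8
CHUNK = 4 << 20                   # 4 MiB I/O requests

def write_range(args):
    start, length = args
    with open(PATH, "r+b") as f:
        f.seek(start)
        remaining = length
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(b"\0" * n)
            remaining -= n

if __name__ == "__main__":
    with open(PATH, "wb") as f:   # preallocate by extending the file
        f.truncate(FILE_SIZE)
    per_worker = FILE_SIZE // WORKERS
    ranges = [(i * per_worker, per_worker) for i in range(WORKERS)]
    t0 = time.time()
    with Pool(WORKERS) as pool:
        pool.map(write_range, ranges)
    secs = time.time() - t0
    print(f"{FILE_SIZE / secs / 1e6:.1f} MB/s aggregate write")
    os.remove(PATH)
```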
Issues for Future Storage
- Tertiary storage
  - Tape $/MB/sec >> disk $/MB/sec
  - Petabyte file systems present problems for backup, archive, and HSM
  - Disk cost and longevity are not yet quite sufficient to replace tape
- Low-cost storage
  - ATA drives are much cheaper than server (SCSI, FC) drives, BUT
  - ATA drives are NOT the same as server drives
    - MTBF specified at low duty cycle
    - Vibration sensitivity
    - ATA hard error rate 10x that of server drives (a worked example follows this slide)
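A worked example of what a 10x hard error rate means at petabyte scale, using assumed, era-typical unrecoverable error rates of one per 10^14 bits for ATA and one per 10^15 bits for server drives.

```python
# Illustrative arithmetic (spec values assumed, roughly typical for the era):
# why a 10x worse hard (unrecoverable) error rate matters at petabyte scale.

ATA_BER = 1e-14      # assumed: one unrecoverable error per 10^14 bits (ATA)
SERVER_BER = 1e-15   # assumed: one per 10^15 bits (SCSI/FC server drives)

def expected_hard_errors(bytes_read: float, ber: float) -> float:
    """Expected unrecoverable read errors when reading this much data."""
    return bytes_read * 8 * ber

if __name__ == "__main__":
    petabyte = 1e15
    print("ATA:   ", expected_hard_errors(petabyte, ATA_BER), "errors per PB read")     # ~80
    print("Server:", expected_hard_errors(petabyte, SERVER_BER), "errors per PB read")  # ~8
```

At these assumed rates, simply reading a petabyte file system end to end on ATA drives would be expected to hit dozens of hard errors, which is why low-cost storage needs stronger end-to-end protection than the drives alone provide.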