DVS, GPFS and External Lustre at NERSC: How It's Working on Hopper
Tina Butler, Rei Chi Lee, Gregory Butler
CUG 2011, 05/25/11
NERSC is the Primary Computing Center for the DOE Office of Science
NERSC serves a large population: approximately 3,000 users, 400 projects, and 500 codes.
- Focus on unique resources
- Expert consulting and other services
- High-end computing and storage systems
NERSC is known for excellent services and a diverse workload.
[Chart: 2010 allocations by science area -- Physics, Chemistry, Fusion, Materials, Math + CS, Climate, Lattice Gauge, Astrophysics, Combustion, Life Sciences, Other]
NERSC Systems
Large-Scale Computing Systems
- Franklin (NERSC-5): Cray XT4; 9,532 compute nodes, 38,128 cores; ~25 Tflop/s on applications, 356 Tflop/s peak
- Hopper (NERSC-6): Cray XE6; Phase 1: Cray XT5, 668 nodes, 5,344 cores; Phase 2: Cray XE6, 6,384 nodes, 153,216 cores, 1.28 Pflop/s peak
Clusters (140 Tflop/s total)
- Carver: IBM iDataPlex cluster
- PDSF (HEP/NP): ~1K-core throughput cluster
- Magellan: cloud testbed, IBM iDataPlex cluster
- GenePool (JGI): ~5K-core throughput cluster
Analytics
- Euclid (512 GB shared memory)
- Dirac: GPU testbed (48 nodes)
NERSC Global Filesystem (NGF)
- Uses IBM's GPFS; 1.5 PB capacity, 10 GB/s of bandwidth
HPSS Archival Storage
- 40 PB capacity, 4 tape libraries, 150 TB disk cache
Lots of users, multiple systems, lots of data
By the end of the '90s it was becoming increasingly clear that data management was a major issue. Users were generating larger and larger data sets and copying their data to multiple systems for pre- and post-processing, wasting both time and storage space. NERSC needed to help users be more productive.
Global Unified Parallel Filesystem
In 2001 NERSC began the GUPFS project:
- High performance
- High reliability
- Highly scalable
- Center-wide shared namespace
- Assess emerging storage, fabric, and filesystem technology
- Deploy across all production systems
NERSC Global Filesystem (NGF)
- First production in 2005, using GPFS
- Multi-cluster support
- Shared namespace
- Separate data and metadata partitions
- Shared lock manager
- Filesystems served over Fibre Channel and Ethernet
- Partitioned server space through private NSD servers
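The multi-cluster, shared-namespace arrangement above relies on GPFS remote cluster mounts. A minimal sketch of how an owning cluster authorizes a client cluster and how the client mounts the remote filesystem is shown below; the cluster names, contact nodes, key files, and device names are hypothetical placeholders, not NERSC's actual configuration.

    # On the owning (NGF) cluster: generate and exchange keys, authorize the client
    mmauth genkey new
    mmauth add client.example.org -k client_id_rsa.pub      # hypothetical client cluster and key file
    mmauth grant client.example.org -f /dev/project_fs      # grant access to this filesystem

    # On the accessing (client) cluster: register the remote cluster and filesystem, then mount
    mmremotecluster add ngf.example.org -n ngfnsd1,ngfnsd2 -k ngf_id_rsa.pub
    mmremotefs add project_fs -f /dev/project_fs -C ngf.example.org -T /project
    mmmount project_fs -a                                    # mount on all nodes of the client cluster

In this model each client system is its own GPFS cluster that mounts the NGF filesystems remotely, which is what gives every platform the same namespace without a single shared cluster configuration.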
NERSC Global Filesystem (NGF)
[Diagram: NGF architecture. NGF servers and NGF disk on the NGF SAN, connected over the Ethernet network, Fibre Channel, and InfiniBand to Franklin (with its own SAN and disk), Hopper with its external login nodes and external filesystem, PDSF/Planck, Dirac, Carver/Magellan, and Euclid, with private NSD (pnsd) servers fronting the client systems.]
NGF Configuration
- NSD servers are commodity 8-core servers
- 26 private NSD servers: 8 for Hopper, 14 for Carver, 8 for Planck (PDSF)
- Storage is heterogeneous: DDN 9900 for data LUNs; HDS 2300 for data and metadata LUNs; Engenio and Sun storage have also been used
- Fabric is heterogeneous: FC-8 and 10 GbE for data transport; Ethernet for control/metadata traffic
NGF Filesystems
- Collaborative (/project): 873 TB, ~ GB/s, served over FC-8; 4 DDN 9900s
- Scratch (/global/scratch): 873 TB, ~ GB/s, served over FC-8; 4 DDN 9900s
- User homes (/global/u1, /global/u2): 40 TB, ~3-5 GB/s, served over Ethernet; HDS 2300
- Common areas (/global/common, syscommon): ~5 TB, ~3-5 GB/s, served over Ethernet; HDS 2300
NGF /project
[Diagram: the /project filesystem (870 TB, ~ GB/s, with a 730 TB increase planned for July 2011) is served by GPFS servers and private NSD (pnsd) servers over Fibre Channel (8x4 FC8, 20x2 FC4, 2x2 FC4, 4x2 FC4) and InfiniBand to Franklin (via DVS over SeaStar), Hopper (via DVS over Gemini), the DTNs, Euclid, Dirac, the SGNs, PDSF/Planck, and Carver/Magellan.]
NGF global scratch
[Diagram: the /global/scratch filesystem (870 TB, ~ GB/s, no capacity increase planned) is served by private NSD (pnsd) servers over Fibre Channel (8x4 FC8, 2x2 FC4, 4x2 FC4) and InfiniBand to Franklin, Hopper (via DVS over Gemini), the DTNs, Euclid, Dirac, the SGNs, PDSF/Planck, and Carver/Magellan.]
NGF global homes
[Diagram: the /global/homes filesystem (40 TB, with a 40 TB increase planned for July 2011) is served by GPFS servers over Ethernet (4x 1-10 Gb) to Franklin, Hopper, the DTNs, Euclid, the SGNs, PDSF, Dirac, Carver, Planck, and Magellan.]
Hopper Configuration
[Diagram: Hopper I/O configuration. Main-system service nodes (DVS, DVS/DSL, LNET, MOM) connect to NGF through pnsd servers (8 shown) backed by GPFS storage and metadata, to the NERSC 10 GbE LAN to HPSS, and to the external login servers. The external Lustre hardware includes LSI 3992 RAID 1+0 storage, 2 MDS plus 2 spares, 4 esDM servers, and 52 OSSes behind a QDR switch fabric, with an FC switch fabric to the LUNs.]
DVS on Hopper
- 16 DVS servers for the NGF filesystems, InfiniBand-connected to the private NSD servers
- GPFS remote cluster serving compute and MOM nodes
- 2 DVS nodes dedicated to the MOM nodes
- Cluster-parallel mode
- 32 DVS/DSL servers on repurposed compute nodes, load-balanced for the shared root
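To make the DVS projection concrete, a compute-node /etc/fstab entry for a DVS-projected filesystem might look roughly like the sketch below. This is only an illustration under assumed names: the mount points, DVS server node names, and option values are hypothetical and are not Hopper's actual settings.

    # Cluster-parallel DVS mount of an NGF filesystem (illustrative values only)
    /dvs/project   /project   dvs   path=/dvs/project,nodename=c0-0c0s1n0:c0-0c0s2n0:c0-0c0s3n0,maxnodes=1

    # Load-balanced, read-only DVS mount, as used for a DSL shared root (illustrative values only)
    /dvs/shared    /dsl       dvs   path=/dvs/shared,nodename=c0-0c1s0n0:c0-0c1s1n0,loadbalance,ro

In cluster-parallel mode each file is handled by one of the listed DVS servers, so different files spread across the servers, while the loadbalance option distributes read-only clients across servers with failover.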
[Figure: pnsd servers to /global/scratch (idle system). Write and read bandwidth in MB/s (roughly 9,000-11,000 MB/s) versus number of I/O processes (1, 2, 4, 8, 16) for 1 MB, 4 MB, 8 MB, and 16 MB block sizes.]
[Figure: pnsd servers to /global/scratch (busy system). Write and read bandwidth in MB/s (0-9,000 MB/s axis) versus number of I/O processes (1, 2, 4, 8, 16) for 1 MB, 4 MB, 8 MB, and 16 MB block sizes.]
[Figure: DVS servers to /global/scratch (idle system). Write and read bandwidth in MB/s (roughly 9,200-11,400 MB/s axis) versus number of I/O processes (1, 2, 4, 8) at a 4 MB block size.]
[Figure: Hopper compute nodes to /global/scratch (idle system). Write and read bandwidth in MB/s versus number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 3072).]
[Figure: Hopper compute nodes to /global/scratch (busy system). Write and read bandwidth in MB/s (0-8,000 MB/s axis) versus number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 1536, 3072).]
Hopper Filesystems: External Lustre
- 2 local scratch filesystems
- 2+ PB of user storage, 70 GB/s aggregate bandwidth
- External nodes
- 26 LSI 7900 storage units
- 52 OSSes with 6 OSTs per OSS
- 4 MDS with failover
- 56 LNET routers
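Because a single job's bandwidth on these filesystems depends heavily on how many OSTs its files are striped across, striping is usually the first thing to check. A minimal sketch using the standard lfs utility follows; the paths and stripe counts are hypothetical examples rather than NERSC recommendations.

    # Show the striping of an existing file or directory
    lfs getstripe /scratch/myrun          # hypothetical path

    # Stripe new files created in this directory across 8 OSTs
    lfs setstripe -c 8 /scratch/myrun

    # Create a large shared file striped across all available OSTs
    lfs setstripe -c -1 /scratch/myrun/shared_output

Wider striping helps a large shared file approach the aggregate bandwidth, while keeping small file-per-process output on a few OSTs avoids unnecessary metadata and lock overhead.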
[Figure: IOR, 2,880 MPI tasks, MPI-IO shared file, aggregate bandwidth. Write and read bandwidth in MB/s (0-60,000 MB/s axis) versus block size (10,000; 1,000,000; 1,048,576 bytes).]
[Figure: IOR, 2,880 MPI tasks, file per process, aggregate bandwidth. Write and read bandwidth in MB/s (roughly 64,000-73,000 MB/s axis) versus block size (10,000; 1,000,000; 1,048,576 bytes).]
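For context, the two IOR results above correspond to the two standard ways of running IOR: a single shared file through the MPI-IO backend, and one file per process. The sketch below shows representative invocations only; the block and transfer sizes, paths, and other parameters used for the measurements above are not reproduced here, so the values shown are placeholders.

    # Shared-file run through the MPI-IO backend: all 2880 tasks access one file
    aprun -n 2880 ./IOR -a MPIIO -w -r -b 1m -t 1m -o /scratch/ior_shared

    # File-per-process run (-F): each task writes and reads its own file
    aprun -n 2880 ./IOR -a MPIIO -F -w -r -b 1m -t 1m -o /scratch/ior_fpp

The gap between the two modes visible in the figures above is the shared-file I/O issue revisited in the conclusions.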
[Figure: Hopper compute nodes to /scratch (Lustre). Write and read bandwidth in MB/s (0-40,000 MB/s axis) versus number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 1536, 3072).]
Conclusions
- The mix of dedicated external Lustre and shared NGF filesystems works well for user workflows, with mostly good performance.
- Shared-file I/O is an issue for both Lustre and DVS-served filesystems.
- Cray and NERSC are working together on DVS and shared-file I/O issues through the Center of Excellence.
Acknowledgments
This work was supported by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy.