DVS, GPFS and External Lustre at NERSC: How It's Working on Hopper
Tina Butler, Rei Chi Lee, Gregory Butler
05/25/11, CUG 2011

NERSC is the Primary Computing Center for the DOE Office of Science
- NERSC serves a large population: approximately 3,000 users, 400 projects, 500 codes
- Focus on unique resources: expert consulting and other services; high-end computing and storage systems
- NERSC is known for excellent services and a diverse workload
[Pie chart: 2010 allocations by science area - Physics, Chemistry, Fusion, Materials, Math + CS, Climate, Lattice Gauge, Astrophysics, Combustion, Life Sciences, Other]

NERSC Systems
Large-Scale Computing Systems
- Franklin (NERSC-5): Cray XT4; 9,532 compute nodes, 38,128 cores; ~25 Tflop/s on applications; 356 Tflop/s peak
- Hopper (NERSC-6): Cray XE6; Phase 1: Cray XT5, 668 nodes, 5,344 cores; Phase 2: Cray XE6, 6,384 nodes, 153,216 cores, 1.28 Pflop/s peak
Clusters (140 Tflop/s total)
- Carver: IBM iDataPlex cluster
- PDSF (HEP/NP): ~1K-core throughput cluster
- Magellan: cloud testbed, IBM iDataPlex cluster
- GenePool (JGI): ~5K-core throughput cluster
NERSC Global Filesystem (NGF)
- Uses IBM's GPFS; 1.5 PB capacity; 10 GB/s of bandwidth
HPSS Archival Storage
- 40 PB capacity; 4 tape libraries; 150 TB disk cache
Analytics
- Euclid (5 GB shared memory)
- Dirac GPU testbed (48 nodes)

Lots of users, multiple systems, lots of data
By the end of the 1990s it was becoming increasingly clear that data management was a huge issue. Users were generating larger and larger data sets and copying their data to multiple systems for pre- and post-processing. This wasted both time and space, and NERSC needed to help users be more productive.

Global Unified Parallel Filesystem
In 2001 NERSC began the GUPFS project:
- High performance
- High reliability
- Highly scalable
- Center-wide shared namespace
- Assess emerging storage, fabric, and filesystem technology
- Deploy across all production systems

NERSC Global Filesystem (NGF)
- First production in 2005, using GPFS
- Multi-cluster support
- Shared namespace
- Separate data and metadata partitions
- Shared lock manager
- Filesystems served over Fibre Channel and Ethernet
- Partitioned server space through private NSDs
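As an illustration of the multi-cluster arrangement just described, the sketch below shows roughly how a client GPFS cluster is authorized to mount a filesystem owned by a storage cluster. The cluster names, contact nodes, key file paths, and filesystem names are hypothetical, and the exact command options should be checked against the GPFS documentation for the installed release; by default the sketch only prints what it would run.

#!/usr/bin/env python3
"""Rough sketch of a GPFS multi-cluster setup (hypothetical names throughout)."""
import subprocess

def run(cmd, dry_run=True):
    # Print the GPFS administration command; execute it only when dry_run is False.
    print("+", " ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)

def owning_cluster_side():
    # Run on the storage (owning) cluster, e.g. the NGF NSD servers.
    run(["mmauth", "genkey", "new"])                      # generate this cluster's key pair
    run(["mmauth", "add", "client.nersc.gov",             # hypothetical client cluster name
         "-k", "/tmp/client.pub"])                        # public key received from the client cluster
    run(["mmauth", "grant", "client.nersc.gov",
         "-f", "ngf_project", "-a", "ro"])                # allow read-only access to one filesystem

def client_cluster_side():
    # Run on the accessing (client) cluster, e.g. a compute system's service nodes.
    run(["mmremotecluster", "add", "ngf.nersc.gov",       # hypothetical owning cluster name
         "-n", "nsd1,nsd2", "-k", "/tmp/ngf.pub"])        # contact nodes and the owner's public key
    run(["mmremotefs", "add", "project",                  # local device name on the client cluster
         "-f", "ngf_project", "-C", "ngf.nersc.gov",
         "-T", "/project"])                               # mount point on the client cluster

if __name__ == "__main__":
    # Each half must be run (as root) on the appropriate cluster.
    owning_cluster_side()
    client_cluster_side()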

NERSC Global Filesystem (NGF)
[Diagram: NGF architecture - NGF servers and NGF disk on the NGF SAN, an Ethernet network for control, and IB-attached pNSD servers connecting Franklin (with its own SAN and disk), Hopper with its external login nodes and external filesystem, PDSF/Planck, Carver/Magellan, Euclid, and Dirac]

NGF Configuration
- NSD servers are commodity 28 core servers
- 26 private NSD servers: 8 for Hopper, 14 for Carver, 8 for Planck (PDSF)
- Storage is heterogeneous: DDN 9900 for data LUNs; HDS 2300 for data and metadata LUNs; Engenio and Sun storage has also been used
- Fabric is heterogeneous: FC-8 and 10 GbE for data transport; Ethernet for control/metadata traffic

NGF Filesystems
- Collaborative - /project: 873 TB, ~ GB/s, served over FC-8; 4 DDN 9900s
- Scratch - /global/scratch: 873 TB, ~ GB/s, served over FC-8; 4 DDN 9900s
- User homes - /global/u1, /global/u2: 40 TB, ~3-5 GB/s, served over Ethernet; HDS 2300
- Common area - /global/common, syscommon: ~5 TB, ~3-5 GB/s, served over Ethernet; HDS 2300

NGF /project
[Diagram: /project filesystem, 870 TB (~ GB/s), with a 730 TB increase planned for July 2011; served by GPFS servers and pNSD servers over FC (e.g. 8x4xFC8, 20x2xFC4, 4x2xFC4) and IB links to Franklin, Hopper (via DVS over Gemini), the DTNs, Euclid, Dirac, the SGNs, PDSF/Planck, and Carver/Magellan]

NGF global scratch
[Diagram: /global/scratch, 870 TB (~ GB/s), no increase planned; served by pNSD servers over FC and IB links to Franklin, Hopper (via DVS over Gemini), the DTNs, Euclid, Dirac, the SGNs, PDSF/Planck, and Carver/Magellan]

NGF global homes
[Diagram: /global/homes, 40 TB, with a 40 TB increase planned for July 2011; served by GPFS servers over Ethernet (4x1-10Gb) to Franklin, Hopper, the DTNs, Euclid, the SGNs, PDSF, Dirac, Carver, Planck, and Magellan]

Hopper Configuration
[Diagram: Hopper system configuration - the main system with DVS, DVS/DSL, LNET, and MOM nodes; pNSD servers connecting to NGF GPFS storage and GPFS metadata; external login servers and 4 esDM servers on the NERSC 10GbE LAN to HPSS; and the external Lustre filesystem with an LSI 3992 (RAID 1+0) for metadata, 2 MDS plus 2 spares, 52 OSSes, and their LUNs behind a QDR IB switch fabric and FC switch fabric]

DVS on Hopper
- 16 DVS servers for NGF filesystems
  - IB-connected to private NSD servers
  - GPFS remote cluster serving compute and MOM nodes
  - 2 DVS nodes dedicated to the MOMs
  - Cluster parallel mode
- 32 DVS DSL servers on repurposed compute nodes
  - Load-balanced for the shared root
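To see which filesystems a Cray compute or MOM node reaches through DVS, one can inspect the node's mount table. The short sketch below is illustrative only: it assumes DVS projections appear in /proc/mounts with filesystem type "dvs" and a "nodename=..." option naming the DVS servers, which should be verified on the actual system since option names can differ between DVS releases.

#!/usr/bin/env python3
"""List DVS-projected filesystems on a node (illustrative sketch)."""

def dvs_mounts(mounts_path="/proc/mounts"):
    # Walk the mount table and yield every mount whose filesystem type is "dvs".
    with open(mounts_path) as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype != "dvs":
                continue
            opts = dict(
                opt.split("=", 1) if "=" in opt else (opt, "")
                for opt in options.split(",")
            )
            # "nodename" (assumed option name) lists the DVS server nodes backing
            # this projection, separated by colons.
            servers = [s for s in opts.get("nodename", "").split(":") if s]
            yield mountpoint, device, servers

if __name__ == "__main__":
    for mountpoint, device, servers in dvs_mounts():
        print(f"{mountpoint} (from {device}) via {len(servers)} DVS server(s)")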

pNSD servers to /global/scratch (idle)
[Chart: read and write bandwidth (roughly 9,000-11,000 MB/s) vs. number of I/O processes (1, 2, 4, 8, 16) for 1 MB, 4 MB, 8 MB, and 16 MB block sizes]

pNSD servers to /global/scratch (busy)
[Chart: read and write bandwidth (up to roughly 9,000 MB/s) vs. number of I/O processes (1, 2, 4, 8, 16) for 1 MB, 4 MB, 8 MB, and 16 MB block sizes]

DVS servers to /global/scratch (idle)
[Chart: read and write bandwidth (roughly 9,200-11,400 MB/s) vs. number of I/O processes (1, 2, 4, 8) at a 4 MB block size]

Hopper compute nodes to /global/scratch (idle)
[Chart: read and write bandwidth (up to roughly 10,000 MB/s) vs. number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 3072)]

Hopper compute nodes to /global/scratch (busy)
[Chart: read and write bandwidth (up to roughly 8,000 MB/s) vs. number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 1536, 3072)]

Hopper Filesystems
External Lustre:
- 2 local scratch filesystems
- 2+ PB of user storage
- 70 GB/s aggregate bandwidth
External nodes:
- 26 LSI 7900s
- 52 OSSes with 6 OSTs per OSS
- 4 MDS with failover
- 56 LNET routers
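As a quick back-of-the-envelope check on these figures: with 52 OSSes behind a 70 GB/s aggregate, each OSS must sustain roughly 1.35 GB/s, spread over its 6 OSTs. The inputs below come straight from the slide; the per-server and per-target shares are simple division, not measured values.

# Back-of-the-envelope shares of the external Lustre bandwidth (slide figures).
aggregate_gbs = 70          # GB/s aggregate across the two scratch filesystems
num_oss = 52                # object storage servers
osts_per_oss = 6            # object storage targets per OSS

num_ost = num_oss * osts_per_oss            # 312 OSTs in total
per_oss = aggregate_gbs / num_oss           # ~1.35 GB/s each OSS must sustain
per_ost = aggregate_gbs / num_ost           # ~0.22 GB/s per OST

print(f"{num_ost} OSTs; ~{per_oss:.2f} GB/s per OSS; ~{per_ost:.2f} GB/s per OST")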

IOR, 2880 MPI Tasks, MPI-IO - Aggregate
[Chart: aggregate read and write bandwidth (up to roughly 60,000 MB/s) vs. block size (10000, 1000000, 1048576 bytes)]

IOR, 2880 MPI Tasks, File Per Processor - Aggregate
[Chart: aggregate read and write bandwidth (roughly 64,000-73,000 MB/s) vs. block size (10000, 1000000, 1048576 bytes)]
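The two IOR modes compared above - a single shared file written through MPI-IO versus one file per process - can be reproduced with runs along the lines of the sketch below. The task count matches the slides, but the block and transfer sizes, the test file path, and the aprun launcher invocation are assumptions to be adjusted for the actual system; consult the IOR documentation for the exact option set of the installed version.

#!/usr/bin/env python3
"""Launch the two IOR modes from the slides (illustrative sketch only)."""
import subprocess

NTASKS = 2880                      # MPI tasks, as in the slides
TESTFILE = "/scratch/iortest"      # hypothetical path on the Lustre scratch filesystem

def run_ior(extra_args):
    # aprun is the ALPS launcher used on Hopper; -n sets the number of PEs.
    cmd = ["aprun", "-n", str(NTASKS), "IOR",
           "-w", "-r",              # do both a write and a read phase
           "-b", "1m",              # per-task block size (assumed value)
           "-t", "1m",              # transfer size (assumed value)
           "-o", TESTFILE] + extra_args
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_ior(["-a", "MPIIO"])        # single shared file through MPI-IO
    run_ior(["-a", "POSIX", "-F"])  # file-per-process through POSIX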

Hopper compute nodes to /scratch (Lustre)
[Chart: read and write bandwidth (up to roughly 40,000 MB/s) vs. number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 1536, 3072)]

Conclusions
The mix of dedicated external Lustre and shared NGF filesystems works well for user workflows, with mostly good performance. Shared-file I/O is an issue for both Lustre and DVS-served filesystems. Cray and NERSC are working together on DVS and shared-file I/O issues through the Center of Excellence.
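For readers unfamiliar with the distinction the conclusion draws, the sketch below contrasts the two patterns at the application level: every rank writing its block into one shared file through MPI-IO, versus every rank writing its own file. It is a generic mpi4py illustration, not code from this study; the file names and sizes are arbitrary.

#!/usr/bin/env python3
"""Shared-file vs. file-per-process writes (generic mpi4py illustration)."""
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
block = np.full(1 << 20, rank % 256, dtype=np.uint8)   # 1 MiB of data per rank (arbitrary size)

# Pattern 1: one shared file, each rank writing at its own offset via MPI-IO.
# This is the pattern that stresses shared-file I/O on Lustre and DVS.
fh = MPI.File.Open(comm, "shared_file.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * block.nbytes, block)            # collective write at a rank-specific offset
fh.Close()

# Pattern 2: one file per process, plain POSIX I/O, no coordination needed.
with open(f"fpp_file.{rank:05d}.dat", "wb") as f:
    f.write(block.tobytes())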

Acknowledgments
This work was supported by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy.
