The Spider Center-Wide File System

The Spider Center-Wide File System. Presented by Feiyi Wang, Ph.D., Technology Integration Group, National Center for Computational Sciences. Galen Shipman (Group Lead), Dave Dillow, Sarp Oral, James Simmons, Ross Miller, Jason Hill. HPC Advisory Workshop, Changsha, China, October 28, 2009.

National Center for Computational Sciences systems:
- Jaguar XT5: 18,688 nodes, 149,504 cores, 300+ TB memory, 1.38 PFLOPS (hex-core upgrade: 224,256 compute cores)
- Jaguar XT4: 7,832 nodes, 31,328 cores, 62 TB memory, 263 TFLOPS
- Lens: Linux cluster for data analysis and visualization
- Smoky: development cluster
- Eugene: IBM Blue Gene/P
- HPSS: 25 PB mass storage system

What is Spider?
- Spider is a center-wide file system at NCCS
- Spider is the world's largest and fastest Lustre file system
Spider history (2005-2009): concept and commissioning, LNET router development, small-scale production, storage system evaluation, initial deployment, and production. wolf-0 was the first incarnation, used for testing and evaluation.

Why Spider?
- Eliminate per-compute-system storage acquisitions
- Eliminate file system islands
- Mitigate configuration and maintenance overhead
- Remain accessible during system maintenance windows

Building Block: Scalable Cluster (SC). Each building block couples 280 1 TB disks in 5 disk trays to a DDN couplet (2 controllers), served by 4 Dell OSS nodes through a 24-port Flextronics IB switch, with IB uplinks to the Cisco core switch; an SC groups three such building blocks. There are 16 SC units on the floor, with 2 racks for each SC.

Host-to-Controller Redundancy. (Diagram: logical and physical connections between host groups A and B and the DDN couplet through the 24-port Flextronics IB switches A, B, and C.)

Hardware Redundancy at Controller

Spider: A System at Scale
- 13,440 1 TB drives
- Over 10.7 PB of RAID 6 capacity (cross-checked in the sketch below)
- 192 storage servers with over 3 TB of memory (Lustre OSS)
- Available to many compute systems through a high-speed network: over 3,000 IB ports, over 5 kilometers of cable
- Over half a megawatt of power for disks
- Over 26,000 client mounts for I/O
- Peak I/O performance: 240 GB/second
Current status: in production use on all major NCCS computing platforms.
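As a rough cross-check of the capacity figure, here is a minimal back-of-the-envelope sketch; the 8+2 RAID 6 tier layout is an assumption used for illustration, not something stated on the slide.

# Back-of-the-envelope capacity check, assuming 8 data + 2 parity
# drives per RAID 6 tier (assumed layout, not from the slide).
drives = 13_440            # 1 TB drives in the full Spider system
data_fraction = 8 / 10     # usable fraction of an 8+2 RAID 6 tier
usable_tb = drives * 1 * data_fraction
print(usable_tb)           # 10752.0 TB, i.e. a bit over 10.7 PB as quoted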

Challenges
- From direct-attached mode to routed mode
- Lessons from server-side client statistics
- Hard bounce, client eviction, and I/O shadowing
- Journaling
- Topology-aware I/O and fine-grained routing
- Reliability modeling

Direct-Attached vs. Routed Configuration. We ran the direct-attached and routed configurations side by side as two file systems, addressing scalability issues first without bringing in the added complexity of the routed configuration. (Diagram: requests/responses to the DDNs vs. requests/responses to the OSSes.)

Server-Side Client Statistics
Problem: during our full-scale test, out-of-memory (OOM) conditions occurred.
Tracing: the server side maintains client statistics in a 64 KB buffer for each client, for each OST/MDT.
Solutions: move server-side stats to the client side; reduce the memory required per client for stats to 4 KB (see the sketch below).
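A minimal sketch of why 64 KB per client per OST exhausts server memory at this scale; the client count comes from the Jaguar XT5 slide, while the number of OSTs per OSS is an assumed illustrative value.

# Illustrative memory estimate for per-client statistics on one OSS.
# 'osts_per_oss' is an assumed value, used only for illustration.
clients = 18_000             # roughly the Jaguar XT5 client count
osts_per_oss = 7             # assumption: OSTs hosted by a single OSS
buf_old = 64 * 1024          # original 64 KB buffer per client per OST
buf_new = 4 * 1024           # reduced 4 KB per-client footprint

print(clients * osts_per_oss * buf_old / 2**30)   # ~7.7 GiB per OSS before
print(clients * osts_per_oss * buf_new / 2**30)   # ~0.48 GiB per OSS after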

Survive a Bounce: no single point of failure; transparency is key. (Diagram: Spider connected over SION to the XT5, XT4, login nodes, the Everest Powerwall, the remote visualization cluster, the end-to-end cluster, the application development cluster, and the 25 PB data archive; 192x, 48x, and 192x link counts shown.)

Client Eviction
Jaguar XT5 has over 18K clients. A hardware event such as a link failure will result in a reboot, with 18K clients evicted.
Problem: each client eviction requires a synchronous write (an update to the last_rcvd log file), done by every OST for each evicted client.
Solution: make the synchronous write asynchronous; batch and parallelize the eviction process if necessary (sketched below).
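A minimal sketch of the batched, asynchronous idea; this is not Lustre code, and names such as mark_evicted, flush_last_rcvd, and the batch size are hypothetical.

# Hypothetical illustration: rather than one synchronous last_rcvd update
# per evicted client, accumulate evictions and issue one asynchronous
# write per batch.
from typing import List

BATCH = 1024  # hypothetical batch size

def evict_clients(ost, evicted: List[int]) -> None:
    for i in range(0, len(evicted), BATCH):
        batch = evicted[i:i + BATCH]
        ost.mark_evicted(batch)                  # update in-memory state now
        ost.flush_last_rcvd(batch, sync=False)   # one async write per batch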

Demonstration of Surviving a Bounce. (Plot: aggregate backend throughput for half of the Spider system.)

File System Journaling Overhead. A file system benchmark using IOR showed roughly 25% overhead. Analysis result: a large number of 4 KB writes appeared in addition to the expected 1 MB writes, so even sequential writes exhibit random I/O behavior. Problem: ldiskfs journaling.

Two Solutions: Hardware-Based and Software-Based
- Hardware-based (RamSan-400): 3,292 MB/s, or 58.7% of our baseline performance
- Software-based (asynchronous journaling): 5,223 MB/s, or 93% of our baseline performance (the implied baseline is worked out below)
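A quick reading of these percentages, for context; the implied baseline bandwidth below is inferred from the quoted numbers, not stated on the slide.

# Infer the implied baseline bandwidth from the quoted percentages.
hw_bw, hw_frac = 3292, 0.587     # RamSan-400 result and its fraction of baseline
sw_bw, sw_frac = 5223, 0.93      # async journaling result and its fraction

baseline = hw_bw / hw_frac
print(baseline)                  # ~5608 MB/s implied baseline
print(sw_bw / baseline)          # ~0.93, consistent with the quoted 93%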

Application Performance. For an I/O-intensive application such as GTC running at scale, when I/O time dominates, the overall runtime reduction with asynchronous journaling is up to 50%. (Plot: histogram of I/O request sizes observed on a DDN couplet.) Asynchronous journaling decreases the number of small I/O requests substantially, from 64% to 26.5%.

Intelligent LNET Routing. Potential network congestion points: the SeaStar2 torus network, the LNET routers, and the Cisco IB switch.

Strategies to Minimize Contention
- Pair routers and object storage servers on the same line card (crossbar): as long as routers only talk to OSSes on the same line card, contention in the fat tree is eliminated. This required small changes to OpenSM (no longer needed with the new wiring scheme).
- Optimize the placement of physical I/O nodes within the torus network.
- Assign clients to nearby routers to minimize contention (see the sketch after this list).
- Allocate objects to the nearest OST, managed by an OSS that is topologically close to the client.
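A minimal sketch of the "assign clients to nearby routers" idea, using hop (Manhattan) distance with wrap-around on a 3D torus; the torus dimensions and coordinates below are illustrative values, not Jaguar's actual layout.

# Hypothetical illustration: assign each client to its nearest LNET router
# by hop distance on a 3D torus (wrap-around in every dimension).
def torus_distance(a, b, dims):
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def nearest_router(client, routers, dims):
    return min(routers, key=lambda r: torus_distance(client, r, dims))

dims = (25, 32, 24)                                 # illustrative torus size
routers = [(0, 0, 0), (12, 16, 0), (24, 31, 23)]    # illustrative router spots
print(nearest_router((11, 15, 2), routers, dims))   # -> (12, 16, 0)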

Performance Results (xdd vs. IOR, placed vs. non-placed). (Plots: backend throughput bypassing the SeaStar torus is congestion-free on the IB fabric; end-to-end runs show torus congestion.)

Reliability Modeling (1). Spider is a system at scale; what is the quantitative expectation of its reliability?
JBOD (just a bunch of disks): assuming each drive operates 24 hours a day, 365 days per year, with an MTTF of 1.2 million hours, we can calculate the annualized failure rate. The Spider system has 48 couplets, each with 280 1 TB disks; with a scale factor of 4, we can then estimate the expected number of disk failures per year (see the sketch below).
RAID 6: triple disk failures.
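The slide's formulas did not survive extraction; the sketch below is an illustrative reconstruction of the JBOD arithmetic using only the quantities the slide does state (24x365 operation, 1.2 million hour MTTF, 48 x 280 drives, scale factor 4). The resulting numbers are reconstructed here, not figures taken from the slide.

# Illustrative annualized-failure-rate arithmetic from the quoted MTTF.
hours_per_year = 24 * 365        # 8,760 hours of operation per year
mttf = 1_200_000                 # datasheet MTTF in hours
afr = hours_per_year / mttf      # ~0.73% per drive per year

drives = 48 * 280                # 13,440 drives across 48 couplets
scale = 4                        # field-vs-datasheet scale factor (from slide)
print(drives * afr)              # ~98 expected drive failures/year at datasheet rate
print(drives * afr * scale)      # ~392 expected drive failures/year with scaling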

Reliability Modeling (2). Double disk failures with a bit error; system failures with peripheral components. The baseboard contributes 50% of the failure rate. We expect 0.64 failed couplets in the first year and 2.42 failed couplets by the second year.

Conclusion
- For a system at this scale, we can't just pick up the phone and order one: it took close collaboration with vendors (Cray, Sun, DDN, and Cisco) and the community at large. The Lustre Center of Excellence (LCE) proved to be very helpful.
- No problem is small when you scale it up.
- As we are ahead of the curve, we are plowing the road for the community: driving Lustre scalability through collaborations with key stakeholders, including two Lustre scalability workshops held this year at ORNL.

Thank you! Contact information: Feiyi Wang, Technology Integration Team. Phone: (865) 574-7209. Email: fwang2@ornl.gov