An Overview of Fujitsu's Lustre Based File System

An Overview of Fujitsu's Lustre Based File System Shinji Sumimoto, Fujitsu Limited, Apr. 12, 2011 For Maximizing CPU Utilization by Minimizing File IO Overhead

Outline: Target System Overview, Goals of Fujitsu's Cluster File System (FEFS), IO System Architecture, Fujitsu's File System Overview 1

Target System: K computer. RIKEN and Fujitsu are jointly developing the K computer, to be installed at the RIKEN AICS (Advanced Institute for Computational Science), Kobe, by 2012. Fujitsu is now developing a cluster file system called FEFS for the K computer. (Photos: miniature system model and the first 8 racks of the K computer.) 2

K computer: System Configuration (figure from "Current Status of Next Generation Supercomputer in Japan" by Mitsuo Yokokawa (RIKEN), WPSE2010) 3

K computer: Compute Nodes (L2$: 6MB; figure from "Current Status of Next Generation Supercomputer in Japan" by Mitsuo Yokokawa (RIKEN), WPSE2010) 4

K computer: CPU Features (figure from "Current Status of Next Generation Supercomputer in Japan" by Mitsuo Yokokawa (RIKEN), WPSE2010) 5

Goals of Fujitsu's Cluster File System: FEFS. FEFS (Fujitsu Exabyte File System), for peta-scale and exa-scale supercomputers, will achieve: Extremely Large: extra-large volume (100PB~1EB), massive numbers of clients (100k~1M) and servers (1k~10k). High Performance: single-stream (~GB/s) and parallel IO (~TB/s) throughput, reduced file open latency (~10k ops), and avoidance of IO interference among jobs. High Reliability and High Availability: file service continues even while some part of the system is broken down. FEFS is based on the Lustre file system and optimized to draw out maximum hardware performance by minimizing file IO overhead. 6

Design Challenges of FEFS for the K computer. How should we realize high speed and redundancy together? There is a design trade-off. How do we realize ultra-high scalability? Over 1K IO servers, accessed by over 100K compute nodes. How do we avoid I/O conflicts between jobs? Storage devices should be dedicated to each job, while I/O operations come from multiple users and multiple jobs. How do we keep robustness and high data integrity? Eliminate single points of failure. To meet these challenges, we have introduced an Integrated Layered File System with Lustre extensions. 7

Integrated Layered Cluster File System. Incompatible requirements are reconciled by introducing a layered file system. Local File System (/work): high-speed FS (~1TB/s) dedicated to jobs, for compute-node exclusive use and MPI-IO. Global File System (/data): high-capacity, redundant FS shared by multiple servers and systems (other HPC systems and thousands of users via login servers), used for time-sharing operations (e.g. ls -l). Files move between the two layers by staging under the control of the job scheduler. 8

File IO Architecture Optimized for Scalable File IO Operation. Achieves scalable storage volume and performance while eliminating IO conflicts between components. IO Zoning technology for the Local File System: file IO is separated among jobs and processed by the IO node located at Z=0; the Z link of the Tofu interconnect is used as the file IO path. (Diagram: compute nodes -> Tofu interconnect -> IO nodes with local disks (Local File System) -> QDR IB switch -> global file servers -> PCIe/FC RAID global disks (Global File System).) 9

File System: IO Use Cases. Local File System (/work): ~100k compute nodes, with bandwidths of ~1TB/s for parallel IO (MPI-IO) and ~10GB/s for single-stream and master-slave IO to shared files of ~1PB. Global File System (/data): reached through ~4k IO nodes from ~100k compute nodes, other HPC systems, and login servers (about 1000 users); used for input/output file transfer, interactive jobs, coupled jobs, and TSS access to ~32G files, i.e. a huge number and variety of file accesses. 10
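
The "Parallel IO (MPI-IO)" pattern above, many compute-node ranks writing disjoint blocks of one shared file on /work, can be sketched with standard MPI-IO calls. The path /work/shared_file and the 16 MiB block size below are illustrative assumptions, not FEFS specifics.

    /* Minimal MPI-IO sketch: every rank writes its own block of one shared
     * file, the "Parallel IO (MPI-IO)" pattern on the local file system.
     * The /work path and 16 MiB block size are illustrative assumptions. */
    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK (16 * 1024 * 1024)    /* 16 MiB per rank (assumed) */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = malloc(BLOCK);            /* data this rank contributes */

        /* All ranks open the same file on the job-local file system. */
        MPI_File_open(MPI_COMM_WORLD, "/work/shared_file",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes at a disjoint offset: rank * BLOCK. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                              MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }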

FEFS Overview. Lustre-file-system based (currently based on Lustre 1.8.1.1) and optimized for the IO system architecture: an integrated layered cluster file system that draws out the hardware's performance and reliability. Main features: ultra-high file IO and MPI-IO performance (~1TB/s); low impact of file IO on job execution time; huge file system capacity (100PB~); scalability of performance and capacity (~4K OSSs); high reliability (continuous service and data integrity); high usability (easy system operation and management); dedicated resource allocation to avoid interference among jobs. 11

Lustre Extensions of FEFS. Several functions are extended for our requirements (Target: Issues -> Extensions):
Large Scale FS: file size, number of files, number of OSSs, etc. -> file size > 1PB (up to 8EB); number of files: 8 exa; number of OSSs: thousands.
Performance: TSS response, metadata access performance -> common: upgraded hardware specification (communication, CPU, file cache, disk), reduced software bottlenecks, MDS distribution (a dedicated file system allocated to each job), fairness among users (QoS scheduling for users); Local File System: IO separation among jobs (IO Zoning: IO processed by the IO nodes just below the compute nodes); Global File System and TSS: priority scheduling.
Availability: recovery sequence -> recovery sequences with hardware monitoring support. 12

Requirements for FEFS Lustre Extension (1/2). Features (Current Lustre -> 2012 Goals):
System limits: max file system size 64PB -> 100PB; max file size 320TB -> 1PB; max #files 4G -> 32G; max OST size 16TB -> 100TB; max stripe count 160 -> 10k; max ACL entries 32 -> 8191.
Node scalability: max #OSSs 1020 -> 10k; max #OSTs 8150 -> 10k; max #clients 128K -> 1M.
Block size of ldiskfs (backend file system): 4KB -> up to 512KB. Patch-less server: NA -> supported. 13
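
One reason the stripe-count limit matters (160 today versus 10k as a 2012 goal) is that a single shared file can only be spread across as many OSTs as its stripe count allows. The sketch below shows how a file could be pre-created with a wide stripe using liblustreapi's llapi_file_create; the path, parameters, and header name are assumptions (the header differs between Lustre 1.8 and 2.x), and this is not FEFS-specific code.

    /* Hedged sketch: pre-create a file striped over many OSTs with
     * liblustreapi, so one large shared file can draw on the parallel
     * bandwidth of many servers.  Link with -llustreapi; in Lustre 1.8
     * the header is <lustre/liblustreapi.h> instead. */
    #include <stdio.h>
    #include <lustre/lustreapi.h>

    int main(void)
    {
        /* 4 MiB stripes, starting OST chosen by the MDS (-1),
         * stripe over every available OST (-1), default RAID0 layout (0). */
        int rc = llapi_file_create("/work/wide_striped_file",
                                   4 << 20, -1, -1, 0);
        if (rc)
            fprintf(stderr, "llapi_file_create failed: %d\n", rc);
        return rc ? 1 : 0;
    }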

Requirements for FEFS Lustre Extension (2/2), cont'd. Features (Current Lustre -> 2012 Goals): big-endian support: NA -> supported; quota OST storage limit: <= 4TB -> no limitation; directory quota: NA -> supported; InfiniBand bonding: NA -> supported; arbitrary OST assignment: NA -> supported; QoS: NA -> supported. 14

FEFS Technical Issues: MDS and OSS scalability, IO Zoning, QoS, staging (K computer specific), and other issues. 15

MDS and OSS Scalability. Locality is very important for increasing scalability: locality of storage, OSS, interconnect, and server. Our strategy is to exploit this locality as much as possible. MDS scalability: dividing file systems; clustered MDS (future, Lustre 2.x based). OSS scalability over 1,000 servers: minimizing OSS server response time, avoiding interconnect congestion, and minimizing storage (RAID) response time. 16

IO Zoning: IO Separation among Jobs. Issue: sharing disk volumes and network links among jobs causes IO performance degradation because of conflicts. Our approach: separate disk volumes and network links among jobs as much as possible. (Diagram: with IO conflict, Job A and Job B share an IO node and its local disk; without IO conflict, each job's file IO stays on its own IO node and local disk.) 17
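
The slide does not spell out how a compute node finds "the IO node located at Z=0" of its own zone; the toy sketch below only models that idea by collapsing a node's (x, y, z) coordinate onto z = 0, and is an illustrative assumption rather than FEFS's actual routing.

    /* Illustrative assumption only: model "file IO is processed by the IO
     * node located at Z=0" by mapping a compute node's (x, y, z) position
     * to the z = 0 node of the same (x, y) column, so jobs in different
     * columns never share an IO node or its local disk. */
    #include <stdio.h>

    struct coord { int x, y, z; };

    static struct coord io_node_for(struct coord compute_node)
    {
        struct coord io = compute_node;
        io.z = 0;                      /* IO node sits at Z = 0 of the column */
        return io;
    }

    int main(void)
    {
        struct coord c = { 3, 7, 5 };  /* hypothetical compute node */
        struct coord io = io_node_for(c);
        printf("compute (%d,%d,%d) -> IO node (%d,%d,%d)\n",
               c.x, c.y, c.z, io.x, io.y, io.z);
        return 0;
    }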

SP-QoS: Selectable Policy QoS. The system manager is able to select either Fair Share QoS or Best Effort QoS. 18

Fair Share QoS prevents any single user from monopolizing file IO resources. Without Fair Share QoS, a user issuing huge IO (User A) takes most of the file server's IO bandwidth, which is not fair to other users (User B). With Fair Share QoS, each user's maximum IO usage rate is limited, so the bandwidth is shared fairly among users. 19

Best Effort QoS provides fair sharing among users while still letting a single node exploit the available IO bandwidth: a single client can use up to the maximum bandwidth of a single server or of multiple servers, and the IO bandwidth is shared when multiple clients access the servers. 20
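
The slides do not say how the "limit maximum IO usage rate" of Fair Share QoS is enforced; a per-user token bucket is one common way to model such a cap, and the sketch below is purely an illustration under that assumption. The names and numbers are not FEFS internals.

    /* Illustration only: a per-user token bucket that caps IO bandwidth,
     * one simple way to model the "limit maximum IO usage rate" behaviour
     * of Fair Share QoS.  All names and numbers are assumptions. */
    #include <stdio.h>

    struct bucket {
        double tokens;    /* bytes the user may still issue right now */
        double rate;      /* refill rate: the user's bandwidth cap (bytes/s) */
        double burst;     /* upper bound on accumulated tokens (bytes) */
    };

    /* Refill for 'dt' seconds, then ask whether an IO of 'bytes' may go. */
    static int io_allowed(struct bucket *b, double dt, double bytes)
    {
        b->tokens += b->rate * dt;
        if (b->tokens > b->burst)
            b->tokens = b->burst;
        if (b->tokens < bytes)
            return 0;                 /* over the cap: queue or throttle */
        b->tokens -= bytes;
        return 1;
    }

    int main(void)
    {
        /* Hypothetical user capped at 1 GB/s with a 256 MB burst. */
        struct bucket user = { 0.0, 1e9, 256e6 };
        printf("first 64 MB: %s\n",
               io_allowed(&user, 0.1, 64e6) ? "allowed" : "throttled");
        printf("next 256 MB: %s\n",
               io_allowed(&user, 0.0, 256e6) ? "allowed" : "throttled");
        return 0;
    }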

Staging: System Controlled File Transfer (K computer specific). Goal: minimize IO conflicts by controlling copying speed and IO timing. Stage-in: from the Global File System to the Local File System. Stage-out: from the Local File System to the Global File System. Staging directives are written by the user in the job script, e.g. transferring the a.out file to rank 0-15 nodes: #XXX -I 0-15@./a.out %r :./". Staging timing: pre-staging is controlled by the job scheduler, and stage-out is processed during job execution. (Timeline: stage-in, job execution, and stage-out of successive jobs are overlapped.) 21

High Reliability and High Availability. Issue: keeping system reliability and availability against failures. Our approach: fully duplexed hardware architecture for 24-hour, 365-day operation with no single point of failure: duplexed IB and FC paths and IO servers, and data robustness using RAID disks (MDS: RAID1+0, OSS (IO node): RAID5). Servers and communication links are switched dynamically by software when they fail, so file service does not stop on an IO node, FC, or IB failure or during maintenance. (Diagram: running IO nodes/OSSes in a rack fail over to a stand-by node; user and system RAIDs are reached over duplexed FC and IB paths between the Tofu interconnect and the Global File System.) 22

Other Issues. Kernel-independent Lustre servers (MDS, OSS): the current implementation depends on kernel source code, especially the ext file system. LNET, OSS, and MDS configuration for hundreds of thousands of clients and servers: automatic configuration and checking of connectivity among servers are needed. Data integrity: an fsck of the whole file system is impractical beyond a petabyte. 23

Experimental Evaluation Results of FEFS. IOR results (POSIX IO, 864 OSTs, 10000 clients): POSIX write 96GiB/s, POSIX read 148GiB/s. Command line used: /mnt/client/ior/ior -F -C -i 3 -b 1g -t 16m -o. Max Write: 95990.84 MiB/sec (100653.69 MB/sec); Max Read: 147962.87 MiB/sec (155150.31 MB/sec). QoS results on a PC cluster (Best Effort and Fair Share for two users; User A: 19-node job, User B: 1-node job; User B creates/removes 10000 files): create files: 4.089 sec (single user, w/o QoS), 10.125 sec (multi-user, w/o QoS), 3.932 sec (multi-user, w/ QoS); remove files: 4.235 sec, 14.018 sec, 5.534 sec, respectively. 24
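
The IOR output above reports each rate twice because MiB/s (2^20 bytes per second) and MB/s (10^6 bytes per second) differ by a factor of 1.048576; the small check below reproduces the reported pairs.

    /* Quick arithmetic check of the IOR numbers above: MiB/s (2^20 bytes/s)
     * converted to MB/s (10^6 bytes/s) differs by a factor of 1.048576. */
    #include <stdio.h>

    int main(void)
    {
        const double mib = 1048576.0, mb = 1e6;
        double write_mibs = 95990.84, read_mibs = 147962.87;

        printf("write: %.2f MiB/s = %.2f MB/s\n", write_mibs, write_mibs * mib / mb);
        printf("read:  %.2f MiB/s = %.2f MB/s\n", read_mibs,  read_mibs  * mib / mb);
        return 0;   /* prints ~100653.69 and ~155150.31 MB/s, matching the slide */
    }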

Summary and Future Work. We gave an overview of FEFS, Fujitsu's Lustre-based file system for the K computer being developed by RIKEN and Fujitsu: high-speed file IO and MPI-IO with low impact on job execution time; huge file system capacity, with capacity and speed scalable by adding hardware; high reliability (service continuity, data integrity) and usability (easy operation and management). Future work: rebase onto a newer version of Lustre (1.8.5, 2.x). 25

26 Copyright 2011 FUJITSU LIMITED