Lustre 2.5 Performance Evaluation: Performance Improvements with Large I/O Patches, Metadata Improvements, and Metadata Scaling with DNE


Transcription:

Lustre 2.5 Performance Evaluation: Performance Improvements with Large I/O Patches, Metadata Improvements, and Metadata Scaling with DNE. Hitoshi Sato *1, Shuichi Ihara *2, Satoshi Matsuoka *1 (*1 Tokyo Institute of Technology, *2 DataDirect Networks Japan)

Tokyo Tech's TSUBAME2.0 (Nov. 1, 2010): The Greenest Production Supercomputer in the World. [Slide graphic with TSUBAME2.0 development highlights: 32nm/40nm process technology, aggregate memory bandwidth, network bandwidth and bisection bandwidth, and maximum power figures.]

TSUBAME2 System Overview. Storage: 11PB (7PB HDD, 4PB Tape, ~200TB SSD). Computing nodes: 17.1 PFlops (SFP), 5.76 PFlops (DFP), 224.69 TFlops (CPU), ~100TB memory, ~200TB local SSD.
Thin nodes: HP ProLiant SL390s G7, 1408 nodes (32 nodes x 44 racks). CPU: Intel Westmere-EP 2.93GHz, 6 cores x 2 = 12 cores/node. GPU: NVIDIA Tesla K20X, 3 GPUs/node. Mem: 54GB (96GB). Local SSD: 60GB x 2 = 120GB (120GB x 2 = 240GB).
Medium nodes: HP ProLiant DL580 G7, 24 nodes. CPU: Intel Nehalem-EX 2.0GHz, 8 cores x 2 = 32 cores/node. GPU: NVIDIA Tesla S1070, NextIO vCORE Express 2070. Mem: 128GB. SSD: 120GB x 4 = 480GB.
Fat nodes: HP ProLiant DL580 G7, 10 nodes. CPU: Intel Nehalem-EX 2.0GHz, 8 cores x 2 = 32 cores/node. GPU: NVIDIA Tesla S1070. Mem: 256GB (512GB). SSD: 120GB x 4 = 480GB.
Interconnect: full-bisection optical QDR InfiniBand network. Core switches: 12x Voltaire Grid Director 4700 (324 IB QDR ports each). Edge switches: 179x Voltaire Grid Director 4036 (36 IB QDR ports each). Edge switches with 10GbE ports: 6x Voltaire Grid Director 4036E (34 IB QDR ports, 2x 10GbE each).
Storage systems: GPFS#1-#4 on SFA10K #1 (/data0), #2 (/data1), #3 (/work0), #4 (/work1, /gscr0); SFA10K #5 (GPFS + Tape, Lustre); SFA10K #6 (home, iSCSI). Global Work Space #1-#3: 2.4PB HDD + 3.6PB parallel file system volumes. Home volumes: 1.2PB, served via cNFS/clustered Samba with GPFS and NFS/CIFS/iSCSI by BlueARC.

TSUBAME2 System Overview (continued): the same system diagram annotated by I/O workload. The node-local SSDs handle fine-grained R/W I/O (checkpoints, temporary files), which also appears against part of the work volumes. The parallel file system volumes (Global Work Space #1-#3, 2.4PB HDD + 3.6PB) serve mostly-read I/O (data-intensive applications, parallel workflows, parameter surveys) and backup. The home volumes (1.2PB) provide home storage for the computing nodes and cloud-based campus storage services (NFS/CIFS/iSCSI by BlueARC, cNFS/clustered Samba with GPFS).

Interim Upgrade from TSUBAME2.0 to 2.5 (Sept. 10th, 2013). Upgraded TSUBAME2.0's GPUs from NVIDIA Fermi M2050 to Kepler K20X: 3 GPUs x 1408 nodes = 4224 GPUs; system SFP/DFP peak went from 4.8PF/2.4PF to 17PF/5.7PF. Towards TSUBAME3.0: TSUBAME-KFC, a TSUBAME3.0 prototype system with advanced cooling for next-generation supercomputers; 40 compute nodes are oil-submerged, with 160 NVIDIA Kepler K20X, 80 Ivy Bridge Xeons, and FDR InfiniBand. Storage design is also a matter!

Existing Storage Problems: small I/O and I/O contention. The PFS has a throughput-oriented design with a master-worker configuration, so metadata references become a performance bottleneck and IOPS are low (on the order of 10K-100K IOPS). That IOPS budget is shared across all compute nodes, and I/O contention from multiple applications leads to I/O performance degradation. [Lustre I/O monitoring chart: small I/O operations from a user's application drive PFS operation down over time until it recovers.]

Requirements for a Next-Generation I/O Sub-System (1): Basic Requirements. TB/sec sequential bandwidth for traditional HPC applications, and extremely high IOPS to cope with applications that generate massive amounts of small I/O (example: graph processing). Some application workflows have different I/O requirements for each workflow component. Reduction/consolidation of PFS I/O resources is needed while still achieving TB/sec performance: we cannot simply use a larger number of I/O servers and drives to reach ~TB/s throughput. There are many constraints (space, power, budget, etc.), so performance per Watt and performance per rack unit is crucial.

Requirements for a Next-Generation I/O Sub-System (2): New Approaches. TSUBAME2.0 pioneered the use of local flash storage as a high-IOPS alternative to an external PFS, and tiered and hybrid storage environments that combine (node-)local flash with an external PFS. Industry status: high-performance, high-capacity flash (and other new semiconductor devices) is becoming available at reasonable cost, along with new approaches/interfaces for using high-performance devices (e.g. NVM Express).

Key Requirements of the I/O System. Electrical power and space are still challenging: reduce the number of OSSs/OSTs while keeping high I/O performance and large storage capacity. How do we maximize Lustre performance? Understand the new types of flash storage devices: do they bring benefits with Lustre, and how should they be used? Does placing the MDT on SSD help?

Lustre 2.x Performance Evaluation. Maximize OSS/OST performance for large sequential I/O: single-OSS and single-OST performance, 1MB vs. 4MB RPCs, and not only peak performance but also sustained performance. Small I/O to a shared file: 4KB random read access to Lustre. Metadata performance: how much does the CPU matter, and do SSDs help?
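As a concrete illustration of this test plan, the sequential part could be driven by a small script along the following lines. This is a minimal sketch, assuming an IOR/MPI installation, a /mnt/lustre client mount, and a hostfile listing the client nodes; none of these names are taken from the slides.

    import subprocess

    MOUNT = "/mnt/lustre/ior_test"   # assumed Lustre test directory
    HOSTFILE = "clients.txt"         # assumed MPI hostfile listing the client nodes

    def run_ior(nprocs, xfersize, blocksize="16g"):
        """One IOR run: POSIX API, file-per-process (-F), write then read, fsync at close (-e)."""
        cmd = [
            "mpirun", "-np", str(nprocs), "-hostfile", HOSTFILE,
            "ior", "-a", "POSIX", "-F", "-w", "-r", "-e",
            "-t", xfersize, "-b", blocksize,
            "-o", MOUNT + "/file",
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        # Sweep transfer sizes (1MB vs. 4MB) and process counts to capture
        # both peak and sustained sequential throughput.
        for xfer in ("1m", "4m"):
            for nprocs in (28, 56, 112, 224):
                run_ior(nprocs, xfer)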

Throughput per OSS Server. [Charts: single-OSS write and read throughput (MB/s) vs. number of clients (1, 2, 4, 8, 16), comparing no bonding, active-active LNET bonding, and active-active bonding with CPU binding; reference lines mark single-FDR LNET bandwidth (6.0 GB/s) and dual-FDR LNET bandwidth (12.0 GB/s).]
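For readers unfamiliar with the "active-active LNET bonding" compared in these charts, the sketch below shows one common way to expose an OSS's two FDR ports as separate LNET networks and split clients between them. Interface names, network numbers, and file paths are assumptions, not the actual TSUBAME configuration.

    # Sketch only: "active-active" LNET setup using two LNET networks,
    # one per OSS InfiniBand port.

    SERVER_LNET = 'options lnet networks="o2ib0(ib0),o2ib1(ib1)"\n'   # OSS: both ports active
    CLIENT_A_LNET = 'options lnet networks="o2ib0(ib0)"\n'            # half of the clients
    CLIENT_B_LNET = 'options lnet networks="o2ib1(ib0)"\n'            # the other half

    def write_lnet_conf(text, path="/etc/modprobe.d/lustre.conf"):
        """Write the LNET module options (takes effect after reloading lnet/lustre)."""
        with open(path, "w") as f:
            f.write(text)

With each client group routed over a different OSS port, both ports carry traffic concurrently and the aggregate can approach the dual-FDR line in the chart; the "CPU bind" variant presumably pins each port's interrupts and Lustre service threads to the CPU socket closest to that HCA.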

Performance Comparison (1MB vs. 4MB RPC, IOR FPP, asyncio). 4 x OSS, 28 x NL-SAS, 14 clients and up to 224 processes. [Charts: IOR write and read throughput (MB/s, FPP, xfersize=4MB) at 28, 56, 112, and 224 processes, 1MB RPC vs. 4MB RPC.]
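The 4MB RPC size used here is not a Lustre default; on builds carrying the large-I/O patches it is enabled with the usual lctl tunables, roughly as sketched below. The parameter names are standard Lustre tunables, but the filesystem name and the chosen values are illustrative assumptions.

    import subprocess

    FSNAME = "lustre"   # assumed filesystem name

    def lctl_set(param):
        """Apply one lctl tunable (runtime only; not persistent across remounts)."""
        subprocess.run(["lctl", "set_param", param], check=True)

    def tune_oss():
        # On each OSS: allow 4MB bulk RPCs on the OST backend (brw_size is in MB).
        lctl_set("obdfilter." + FSNAME + "-OST*.brw_size=4")

    def tune_client():
        # On each client: send up to 4MB per RPC (1024 x 4KB pages) and give the
        # OSC enough dirty cache and RPCs in flight to keep large RPCs streaming.
        lctl_set("osc." + FSNAME + "-OST*.max_pages_per_rpc=1024")
        lctl_set("osc." + FSNAME + "-OST*.max_dirty_mb=64")
        lctl_set("osc." + FSNAME + "-OST*.max_rpcs_in_flight=16")

tune_oss() would be run on each OSS and tune_client() on each client before the IOR sweep; matching the client-side max_pages_per_rpc to the server-side brw_size is what lets each bulk RPC carry 4MB.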

Lustre Performance Degradation with a Large Number of Threads: OSS Standpoint. 1 x OSS, 8 x NL-SAS, 2 clients and up to 96 processes. [Charts: IOR (FPP, 1MB) write and read throughput (MB/s) vs. number of processes, RPC=4MB vs. RPC=1MB; chart annotations: 3%, 4.5x, 55%.]

Lustre Performance Degradation with a Large Number of Threads: SSDs vs. Nearline Disks. IOR (FPP, 1MB) to a single OST, 2 clients and up to 64 processes. [Charts: write and read performance per OST as a percentage of peak (maximum for each dataset = 100%) vs. number of processes, for 7.2K SAS (1MB RPC), 7.2K SAS (4MB RPC), and SSD (1MB RPC).]

Lustre 2.4 4KB Random Read Performance (with 10 SSDs). 2 x OSS, 10 x SSD (2 x RAID5), 16 clients. Large random files are created (FPP and SSF), then accessed with 4KB random reads. [Chart: Lustre 4K random read IOPS for FPP(POSIX), FPP(MPIIO), SSF(POSIX), and SSF(MPIIO), from 1 node/32 processes up to 16 nodes/512 processes.]
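A hedged sketch of how such a 4KB random-read phase can be reproduced with IOR's random-offset mode is shown below; the flags are standard IOR options, while the paths, hostfile, and process layout are assumptions.

    import subprocess

    HOSTFILE = "clients.txt"           # assumed MPI hostfile
    TESTFILE = "/mnt/lustre/randfile"  # assumed path of the pre-created large file(s)

    def random_read(nprocs, api="POSIX", shared=False):
        """4KB random reads (-z) against existing files; -k keeps the files in place."""
        cmd = [
            "mpirun", "-np", str(nprocs), "-hostfile", HOSTFILE,
            "ior", "-a", api, "-r", "-z", "-k",
            "-t", "4k", "-b", "1g",
            "-o", TESTFILE,
        ]
        if not shared:
            cmd.append("-F")           # file per process (FPP); omit -F for a single shared file (SSF)
        subprocess.run(cmd, check=True)

    # e.g. 16 nodes x 32 processes each, covering the four chart series
    for api in ("POSIX", "MPIIO"):
        for shared in (False, True):
            random_read(512, api=api, shared=shared)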

Conclusions: Lustre with SSDs. SSD pools in Lustre: SSDs change the behavior of pools, since seek latency is no longer an issue. Application scenarios: very consistent performance, independent of the I/O size or the number of concurrent I/Os, and very high random access rates to a single shared file (millions of IOPS in a large file system with enough clients). With SSDs the bottleneck is no longer the device, but the RAID array (RAID stack) and the file system. Very high random read IOPS in Lustre is possible, but only if the metadata workload is limited (i.e. random I/O to a single shared file). We will investigate more benchmarks of Lustre on SSD.
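Since the conclusions mention SSD pools, the sketch below shows the standard Lustre OST-pool mechanism such a setup would rely on; the pool name, OST indices, and directory are illustrative assumptions.

    import subprocess

    FSNAME = "lustre"                  # assumed filesystem name

    def sh(cmd):
        subprocess.run(cmd, check=True)

    # On the MGS node: define a pool and add the SSD-backed OSTs to it.
    sh(["lctl", "pool_new", FSNAME + ".ssd"])
    sh(["lctl", "pool_add", FSNAME + ".ssd", FSNAME + "-OST[0000-0001]"])

    # On a client: new files under this directory stripe only over the SSD pool.
    sh(["lfs", "setstripe", "-p", "ssd", "-c", "-1", "/mnt/lustre/ssd_scratch"])

Any file created under that directory then allocates its objects only on the SSD-backed OSTs, which is one way a mixed HDD/SSD file system can keep the two device classes separate.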

Metadata Performance Improvements. Very significant improvements since Lustre 2.3: increased performance for both unique- and shared-directory metadata, and almost linear scaling for most metadata operations with DNE. [Charts: ops/sec for directory creation/stat/removal and file creation/stat/removal. Left: Lustre 1.8.8 vs. 2.4.2, metadata operations to a shared directory. Right: single MDS vs. dual MDS (DNE), same storage resources with MDS resources simply added, metadata operations to unique directories. * 1 x MDS, 16 clients, 768 Lustre exports (multiple mount points).]
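To make the DNE comparison concrete, the sketch below shows how benchmark directories can be spread over two MDTs with DNE phase 1 and exercised with mdtest; the MDT indices, paths, and item counts are assumptions rather than the exact configuration used here.

    import subprocess

    def sh(cmd):
        subprocess.run(cmd, check=True)

    # DNE phase 1: place one benchmark directory on each MDT (lfs mkdir -i <mdt_index>).
    sh(["lfs", "mkdir", "-i", "0", "/mnt/lustre/md0"])
    sh(["lfs", "mkdir", "-i", "1", "/mnt/lustre/md1"])

    # Unique-directory run: -u gives every process its own working subdirectory;
    # dropping -u makes all processes operate in a single shared directory instead.
    sh(["mpirun", "-np", "64", "-hostfile", "clients.txt",
        "mdtest", "-n", "10000", "-i", "3", "-u",
        "-d", "/mnt/lustre/md0@/mnt/lustre/md1"])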

Metadata Performance: CPU Impact. [Charts: file creation ops/sec in a unique directory and in a shared directory vs. number of processes, comparing 3.6GHz and 2.4GHz CPUs.] Metadata performance is highly dependent on CPU performance: even limited variation in CPU frequency or memory speed can have a significant impact on metadata performance.

Conclusions: Metadata Performance. Significant improvements for both shared- and unique-directory metadata performance since Lustre 2.3, across all metadata workloads and especially removals. SSDs add to performance, but the impact is limited (and probably due mostly to latency rather than the IOPS capability of the device). Important considerations for metadata benchmarks: they are very sensitive to the underlying hardware, and client limitations matter when considering metadata performance and scaling. Only directory/file stats scale with the number of processes per node; other metadata workloads do not appear to scale with the number of processes on a single node.

Summary. Lustre today: significant increases in object storage and metadata performance, and much better spindle efficiency (which is now reaching the physical drive limit). Client performance, not the server side, is quickly becoming the limitation for small-file and random performance. Consider alternative scenarios: a PFS combined with fast local devices (or even local instances of the PFS), and a PFS combined with a global acceleration/caching layer.