ZEST Snapshot Service: A Highly Parallel Production File System
PSC Advanced Systems Group, Pittsburgh Supercomputing Center

Design Motivation
- To optimize science utilization of the machine
  - Maximize the compute time / wall time ratio
  - Maximize aggregate TB/sec/$
  - Without compromising R.A.S.
  - Scalable up to large I/O clusters
- To ensure fault tolerance
  - Application recovery following HW/SW failures
  - Checkpoint I/O
- In exchange for:
  - Temporarily suspending some POSIX semantics while retaining identical calling sequences (transparency)
  - Delayed availability for reading

Design Features
- Client-side parity calculation (see the sketch below)
  - Leverages compute node CPU rather than I/O node CPU
  - Performed in a transparent intercept library
- Sequential I/O
  - Aggregate many small RPCs into fewer large RPCs
  - Disk block allocator takes the next available block
  - Both are effective latency-hiding strategies
- Asynchronous fragment reconstruction
  - Reassembly metadata is carried with the data blocks
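The client-side parity step amounts to XOR-ing the data chunks of a parity group together on the compute node before they are shipped over LNET. The following is a minimal C sketch of that idea, with an assumed chunk size and a hypothetical function name; it illustrates the technique and is not the Zest source.

    /* XOR a parity group's data chunks into one parity chunk on the
     * compute node, so the I/O node spends no CPU on parity generation.
     * (Hypothetical name and chunk size, for illustration only.) */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define CHUNK_SIZE 65536                 /* assumed chunk size */

    void client_make_parity(const uint8_t *const chunks[], size_t nchunks,
                            uint8_t parity[CHUNK_SIZE])
    {
        memset(parity, 0, CHUNK_SIZE);       /* start from the XOR identity */
        for (size_t c = 0; c < nchunks; c++)
            for (size_t i = 0; i < CHUNK_SIZE; i++)
                parity[i] ^= chunks[c][i];   /* fold each data chunk in */
    }

Because XOR is its own inverse, any single lost chunk in the group can later be rebuilt by XOR-ing the surviving chunks with the parity chunk, which is the kind of reconstruction the server's parity threads perform asynchronously.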

Software Components

[Diagram: data path from the compute nodes (Zest client generating parity chunks) through LNET routers on the SIO nodes to the Zest I/O nodes, where in-memory RAID vectors (queues 0-3) feed disk shelves (by color); buffers progress through Incoming, Partial Parity, Full Parity, and Completed states, and syncer threads stage completed data out to an external Lustre FS.]

- Client library (on compute node)
  - Transparent intercept
  - Parity calculation
- LNET router (on SIO node)
  - Routes messages from the HSN to the external IB network
- ZEST daemon (on Zest I/O node)
  - Communications threads: LNET to memory
  - I/O threads: all I/O to/from each internal disk (see the sketch below)
  - Syncer threads: stage data out to the external PFS
  - Parity threads: regenerate lost data buffers
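The one-thread-per-disk design can be pictured as a small producer/consumer loop: communications threads fill buffers from LNET and push them onto a per-disk queue, and that disk's dedicated I/O thread drains the queue, appending each buffer at the next available block so the spindle streams sequentially. The sketch below is a self-contained illustration under assumed structures, names, and sizes, not the Zest API.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define QLEN  16                    /* assumed queue depth */
    #define BUFSZ 4096                  /* assumed buffer size */

    /* Per-disk queue of filled buffers; comms threads produce,
     * the disk's I/O thread consumes. */
    struct disk_queue {
        char bufs[QLEN][BUFSZ];
        unsigned head, tail;            /* tail - head = buffers queued */
        pthread_mutex_t mtx;
        pthread_cond_t  nonempty, nonfull;
        FILE *disk;                     /* stands in for one raw spindle */
    };

    /* Called by a communications thread once an incoming buffer is full. */
    void push_buffer(struct disk_queue *q, const char *data)
    {
        pthread_mutex_lock(&q->mtx);
        while (q->tail - q->head == QLEN)
            pthread_cond_wait(&q->nonfull, &q->mtx);
        memcpy(q->bufs[q->tail % QLEN], data, BUFSZ);
        q->tail++;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->mtx);
    }

    /* One of these threads per physical disk: drain the queue and
     * append at the next available block, keeping the disk streaming. */
    void *io_thread(void *arg)
    {
        struct disk_queue *q = arg;
        char local[BUFSZ];
        for (;;) {
            pthread_mutex_lock(&q->mtx);
            while (q->head == q->tail)
                pthread_cond_wait(&q->nonempty, &q->mtx);
            memcpy(local, q->bufs[q->head % QLEN], BUFSZ);
            q->head++;
            pthread_cond_signal(&q->nonfull);
            pthread_mutex_unlock(&q->mtx);
            fwrite(local, 1, BUFSZ, q->disk);   /* sequential append, no seeks */
        }
        return NULL;
    }

This pairs with the next-available block allocator and the reassembly metadata carried with the data blocks described on the previous slide.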

Scalable Unit Hardware

[Diagram: one scalable unit; service and I/O nodes connect over 1 GB/s (SDR) IB links to the dual quad-core Zest I/O node, which uses PCIe (4 GB/s) and 1.2 GB/s SAS links to reach disk drive shelves of SATA drives (75 MB/s each).]

- No single point of failure
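As a rough sanity check (an inference from the figures on this slide, not a number stated in the deck), the per-drive and per-link rates are consistent with about sixteen drives behind each SAS link:

    1.2 GB/s per SAS link / 75 MB/s per SATA drive ≈ 16 drives per link

which also lines up with the 1.2 GB/s raw spindle speed used as the baseline in the component tests on the next slide.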

Benchmarking Tests
- Standard benchmarking applications (no source modifications required; transparent intercept)
  - IOR (used for the following scaling plots)
  - SPIOBENCH
  - Bonnie++
  - FIO (PSC-developed, highly parallel and configurable suite)
- Zest component tests (sustained)
  - I/O threads: 1.1 GB/sec (91% of the 1.2 GB/sec raw spindle speed)
  - RPC threads: 1.1 GB/sec (27% of 4 GB/sec over IB)

Benchmarking Goals
1. Measure client scaling
2. Measure maximum write bandwidth
3. Compare Zest to Lustre
All of the following measurements use the IOR client application against a single server (identical hardware).

Zest Performance (Linux / SDP)
[IOR scaling plot]
- Good client scaling
- 750 MB/sec

Zest Performance (XT3 / IPoIB)
[IOR scaling plot]
- Excellent client scaling to 1K clients
- Highly consistent per-write performance
- IPoIB-limited: 150 MB/sec

Lustre Performance (RAID0 / o2ib)
[IOR scaling plot]
- For illustration only (maximum bandwidth configuration)
- 16 OSTs, no redundancy
- 430 MB/sec

Lustre Performance (RAID5 / o2ib)
[IOR scaling plot]
- What you get without hardware RAID
- 1 OST: Linux software RAID
- 100 MB/sec

Storage System Write Efficiency
Efficiency = Aggregate Application BW / Aggregate Raw Spindle Speed
- This matters because per-drive cost is a substantial portion of a storage subsystem's cost.
- Zest: highest server efficiency
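Plugging in the sustained component-test numbers quoted earlier for a single Zest server gives roughly:

    Efficiency ≈ 1.1 GB/s (sustained to disk) / 1.2 GB/s (aggregate raw spindle speed) ≈ 92%

which is consistent with the >90% peak server efficiency reported on the dashboard slide.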

Storage System Performance/Cost
Performance/Cost = Aggregate Application BW / Cost of the Storage Subsystem
- Zest: highest TB/sec/$ (no per-server overhead)

The Zest Dashboard
- Server efficiency (aggregate): >90% peak, 80% sustained
- All-disk bandwidth: sustained near maximum
- Ingest buffers: remained empty throughout the benchmarks

zestiondctl
A shell tool that provides interactive reporting and control of Zest primitives, including:
- Controls: logging, syncing, disk online/offline
- Performance: disk I/O, thread activity, communication rates
- Resources: memory structures, inodes, multilock lists, requests, parity groups

ZESTIONDCTL(8)          Pittsburgh Supercomputing Center          ZESTIONDCTL(8)

NAME
    zestiondctl - zestion daemon runtime control

SYNOPSIS
    zestiondctl [-H] [-h table] [-L listspec] [-i iostat] [-M mlistspec]
                [-p paramspec] [-S socket] [-s showspec]

DESCRIPTION
    The zestiondctl utility examines and manipulates zestiond(8) behavior.

Zest Configuration (Linux / SDP)
[Diagram: Linux clients -> SDP -> InfiniBand switch -> Zest server (RAID3) -> SAS disk shelves]

Zest Configuration (XT3 / IPoIB)
[Diagram: Cray XT3 -> routers -> IPoIB -> InfiniBand switch -> Zest server (RAID3) -> SAS disk shelves]

Lustre Configuration (RAID0 / o2ib)
[Diagram: Cray XT3 -> routers -> o2ib -> InfiniBand switch -> Lustre server (RAID0) -> SAS disks]

Lustre Configuration (RAID5 / o2ib)
[Diagram: Cray XT3 -> routers -> o2ib -> InfiniBand switch -> Lustre over Linux software RAID5 -> SAS disks]