SDSC's Data Oasis Gen II: ZFS, 40GbE, and Replication


SDSC's Data Oasis Gen II: ZFS, 40GbE, and Replication. Rick Wagner, HPC Systems Manager, San Diego Supercomputer Center

Comet: HPC for the long tail of science. (iPhone panorama photograph of one of the two server rows.)

Comet is a response to NSF's solicitation (13-528) to expand the use of high-end resources to a much larger and more diverse community: support the entire spectrum of NSF communities... promote a more comprehensive and balanced portfolio... include research communities that are not users of traditional HPC systems. The long tail of science needs HPC.

99% of jobs run on NSF's HPC resources in 2012 used <2,048 cores (about one rack on Comet), yet consumed >50% of the total core-hours across NSF resources. HPC for the 99%.

Comet Will Serve the 99%: island architecture, a mix of node types, and virtualized HPC clusters.

Comet: System Characteristics
Total peak flops ~2.1 PF; Dell primary integrator; Intel Haswell processors w/ AVX2; Mellanox FDR InfiniBand.
1,944 standard compute nodes (46,656 cores): dual CPUs, each 12-core, 2.5 GHz; 128 GB DDR4 2133 MHz DRAM; 2 x 160 GB SSDs (local disk)
36 GPU nodes: same as standard nodes plus two NVIDIA K80 cards, each with dual Kepler3 GPUs
4 large-memory nodes (June 2015): 1.5 TB DDR4 1866 MHz DRAM; four Haswell processors per node
Hybrid fat-tree topology: FDR (56 Gbps) InfiniBand; rack-level (72 nodes, 1,728 cores) full bisection bandwidth; 4:1 oversubscription cross-rack
Performance Storage (Aeon): 7.6 PB, 200 GB/s; Lustre; Scratch & Persistent Storage segments
Durable Storage (Aeon): 6 PB, 100 GB/s; Lustre; automatic backups of critical data
Home directory storage; gateway hosting nodes; virtual image repository
100 Gbps external connectivity to Internet2 & ESnet

Comet Network Architecture: InfiniBand for compute, Ethernet for storage.
(Network diagram.) 27 racks of 72 Haswell nodes (320 GB node-local storage each), plus the 36 GPU and 4 large-memory nodes, are wired with 7 x 36-port FDR switches per rack as a full fat-tree, with 4:1 oversubscription between racks. Mid-tier FDR 36-port switches connect the racks and 4 gateway hosts to the core InfiniBand (2 x 108-port switches). 36 IB-Ethernet bridges (4 x 18-port each) link the IB fabric to two Arista 40GbE switches serving the login, data mover, home file system, and VM image repository nodes, with research and education network access to Internet2 through a Juniper 100 Gbps router.
Performance Storage: 7.7 PB, 200 GB/s, 32 storage servers (64 x 40GbE, 128 x 10GbE).
Durable Storage: 6 PB, 100 GB/s, 64 storage servers.
Additional support components (10 GbE Ethernet management network) not shown for clarity.

File System Breakdown
Performance Storage:
  Scratch: 3.8 PB; 2 MDS, MDT; 16 OSS (Gen II)
  Project Space: 3.8 PB; 2 MDS, MDT; 16 OSS (Gen II)
Durable Storage:
  Replica: 6 PB; 8 MDS, 4 MDT; 64 OSS (Gen I)
(The diagram also shows the Comet clients and the Robinhood server.)

Replication & Migration Durable Storage Reuse current Data Oasis servers as slightly stale replica Replicates allocate project space Think disaster recovery not backup Not accessible to users Goal is full sync within 1 week Exclude changes less than 1-2 days old Using Robinhood Comet Robinhood Scratch Project Space Durable Storage (Replica)

Replication & Migration Migration Building up Durable Storage requires consolidating several production file systems Some need to be migrated to Gen II servers You can t rsync a petabyte Ask me about our replication tool! AMQP-based Replication Tool TSCC Trestles Gordon Project Space Scratch Scratch Old Project Space Consolidate Scratch Durable Storage (Replica)

Lustre & ZFS Performance Goal: 6.25 GB/s per server ZFS (array) performance Lustre performance LNET performance Resolutions: Linux Kernel 3.10 (better IO) ZFS Version including large block (1024k recordsize) support Tuning for prefetch zfs_vdev_cache_size zfs_vdev_cache_max Lustre Patch to utilize zero copy buffers LNET SMP affinity for network tasks Free cores for NIC IRQ handling Hardware Attach both 40GbE NICs to same NUMA domain

Aeon OSS: Ground-Up Design for ZFS
2 x LSI 9206-16e HBA (to JBOD)
1 x LSI 9207-8i HBA (to front of chassis)
2 x MCX313A-BCBT 40GbE NICs
2 x Xeon E5-2650v2 @ 2.60 GHz (8 cores/processor)
128 GB DDR3 1867 MHz DRAM
60 x ST4000NM0034 4 TB drives: 4K native, 512 emulation (ashift=12); 10 in the chassis, 50 in the JBOD; rated 225 MB/s peak BW each
2 servers per chassis; multipath for chassis and JBODs
(An illustrative zpool layout follows.)
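The deck shows 6 OSTs per OSS backed by RAIDZ2 zpools, so one plausible arrangement is ten 4 TB drives per RAIDZ2 vdev (8 data + 2 parity). A hedged sketch for one OST; pool and device names are placeholders, not the production configuration:

  # ashift=12 matches the drives' 4K-native sectors despite 512-byte emulation.
  zpool create -o ashift=12 ost0-zfs raidz2 \
      /dev/mapper/mpatha /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd \
      /dev/mapper/mpathe /dev/mapper/mpathf /dev/mapper/mpathg /dev/mapper/mpathh \
      /dev/mapper/mpathi /dev/mapper/mpathj
  # Large-block recordsize mentioned above (requires the large-block ZFS build).
  zfs set recordsize=1M ost0-zfs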

OSS and Native ZFS Performance
(Diagram: on socket 0, both NICs, the 9207-8i, and one 9206-16e HBA serving OST0-OST3; on socket 1, the second 9206-16e HBA serving OST4 and OST5.)

ZFS Performance: Read
[root@panda-mds-19-5 tests]# cat readdata.sh
#!/bin/bash
zpool iostat 60 > /share/apps/bin/wombat/tests/results/zpool-read-$(hostname -a).txt &
iostatpid=$!
for ost in 0 1 2 3 4 5
do
    for blob in 0 1 2 3
    do
        dd of=/dev/null if=/mnt/ost${ost}-zfs/randblob${blob} bs=16M count=$((8 * 1024)) &
    done
done
for p in $(pidof dd)
do
    wait $p
done
kill $iostatpid

ZFS Performance: Read (zpool iostat plots): 9 GB/s.

ZFS Performance: Write
[root@panda-mds-19-5 tests]# cat gendata.sh
#!/bin/bash
# mkrandfile generates 128GB files using drand48
mkrand=/share/apps/bin/wombat/tests/mkrandfile
zpool iostat 60 > /share/apps/bin/wombat/tests/results/zpool-write-$(hostname -a).txt &
iostatpid=$!
for ost in 0 1 2 3 4 5
do
    for blob in 0 1 2 3
    do
        $mkrand /mnt/ost${ost}-zfs/randblob${blob} &
    done
done
for p in $(pidof mkrandfile)
do
    wait $p
done
kill $iostatpid

ZFS Performance: Write (zpool iostat plots): 8 GB/s, 6.5 GB/s, and 4.2 GB/s.

perf Counters Striped zpool http://users.sdsc.edu/~rpwagner/perf-kernel-6ost-dd-stripe.svg

perf Counters: RAIDZ2 zpool (z_wr_int x 16, z_wr_iss x 12, txg_sync at 0.24%). http://users.sdsc.edu/~rpwagner/perf-kernel-6ost-dd-raidz2.svg

perf Counters: Striped vs. RAIDZ2
Striped: checksum 3.3% of total (x12); z_wr_int 0.5% of total (x16)
RAIDZ2: parity 1.9% of total (x12); z_wr_int 1.4% of total (x16); checksum 1.2% of total (x12)
http://users.sdsc.edu/~rpwagner/perf-kernel-6ost-dd-raidz2.svg
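The linked kernel SVGs appear to be perf-based flame graphs. As a point of reference only, a profile like this is commonly captured along the following lines; the use of the FlameGraph scripts is an assumption, since the deck only links the resulting SVGs:

  # Sample all CPUs with call stacks for 60 seconds while the dd write test runs.
  perf record -F 99 -a -g -- sleep 60
  # Fold the stacks and render the SVG (stackcollapse-perf.pl and flamegraph.pl
  # are from the FlameGraph toolkit, assumed here).
  perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf-kernel-6ost-dd-raidz2.svg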

ZFS & AVX >>> dt_striped = 100 >>> stuff = dt_striped - (16*0.5 + 12*3.3) >>> dt_raidz2 = stuff + 16*1.4 + 12*3.3*(1 + 1.9/1.2) >>> slowdown = dt_striped/dt_raidz2 >>> print "%0.2f GB/s" % (slowdown*8) 4.52 GB/s Not far off the 4.2 GB/s observed.

ZFS & AVX >>> dt_striped = 100 >>> stuff = dt_striped - (16*0.5 + 12*3.3) >>> dt_raidz2 = stuff + 16*1.4 + 12*3.3*(1 + 1.9/1.2) >>> slowdown = dt_striped/dt_raidz2 >>> print "%0.2f GB/s" % (slowdown*8) 4.52 GB/s Not far off the 4.2 GB/s observed. Help is on the way! Work started on AVX(2) optimizations for checksums Hoping to see this extended to parity https://github.com/zfsonlinux/zfs/pull/2351

Challenge: 64 x 40 Gb/s bidirectional; single LID per gateway; Proxy-ARP performance; LNET.
Resolutions:
Proxy-ARP: 1 VLAN per gateway; 4 VLANs map to the default IB partition
IB-Ethernet Bridges: core Aristas route between IP subnets; 4 IP addresses per client
IB Subnet: explicit routing from leaf and spine switches to the gateways; max 12 links per rack outbound
LNET: single network, tcp
Hardware: migrated gateway IB connections to the mid-tier switches

IB-Ethernet Bridges
gw1: VLAN 441, 192.168.0.0/21
gw2: VLAN 442, 192.168.8.0/21
gw3: VLAN 443, 192.168.16.0/21
gw4: VLAN 444, 192.168.24.0/21
All four VLANs map onto the default IB partition.
9 servers (1 MDS, 8 OSS) per VLAN
Each client has 4 IPoIB aliases (ib0:1, ib0:2, ib0:3, ib0:4)
Single LNET NID using tcp(ib0:1)
Allows gateways to handle ARP requests appropriately (a client-side sketch follows)
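A hedged client-side sketch of what the four aliases and the single NID look like; the host addresses are hypothetical, only the subnets and interface labels come from the slide:

  # One IPoIB alias per gateway VLAN/subnet.
  ip addr add 192.168.1.100/21  dev ib0 label ib0:1   # gw1 subnet
  ip addr add 192.168.9.100/21  dev ib0 label ib0:2   # gw2 subnet
  ip addr add 192.168.17.100/21 dev ib0 label ib0:3   # gw3 subnet
  ip addr add 192.168.25.100/21 dev ib0 label ib0:4   # gw4 subnet
  # /etc/modprobe.d/lnet.conf on the client: the single LNET NID rides on ib0:1.
  #   options lnet networks="tcp(ib0:1)"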

40GbE NICs & NUMA OSS Socket 0 Socket 1 NIC NIC HBA HBA HBA Critical to have both NICs in same NUMA domain (think bonding & tcp) Good to leave some cores free for IRQ handling irqbalance and setting affinity not helpful See LU-6228 for a discussion [root@wombat-oss-20-1 ~]# for i in lnet libcfs ksocklnd > do > echo $i.conf > cat /etc/modprobe.d/$i.conf > done lnet.conf options lnet networks="tcp(bond0)" libcfs.conf options libcfs cpu_pattern="0[2-7] 1[8-15]" options ost oss_io_cpts="[0,1]" oss_cpts="[1]" ksocklnd.conf options ksocklnd nscheds=6 peer_credits=48 credits=1024 https://jira.hpdd.intel.com/browse/lu-6228

LNET Performance: excellent.
LNET Self Test: 8 clients, 1 OSS; concurrency=16, check=simple, size=1m
Clients read at 10 GB/s and write at 8 GB/s (an lst sketch follows).
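For reference, a minimal LNET self-test session along these lines might look like the following; the NIDs and group membership are hypothetical, and only the concurrency, check, and size parameters come from the slide:

  #!/bin/bash
  export LST_SESSION=$$
  lst new_session read_test
  lst add_group clients 192.168.1.[10-17]@tcp   # 8 client NIDs (assumed)
  lst add_group servers 192.168.1.3@tcp         # 1 OSS NID (assumed)
  lst add_batch bulk
  lst add_test --batch bulk --concurrency 16 --from clients --to servers \
      brw read check=simple size=1M
  lst run bulk
  lst stat clients servers & sleep 60; kill $!
  lst end_session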

Putting it All Together: Wombat
IOR with 12 clients per OSS: 108 GB/s total, 7.2 GB/s average per OSS; individual OSS go to 8+ GB/s.
Fine print: this should have been 16 servers at 115+ GB/s. I screwed up my striping and didn't discover it until making this plot (see the setstripe note below).
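For context, the kind of striping mistake mentioned in the fine print is usually avoided by forcing the IOR target directory to stripe across every OST; a hedged sketch with a hypothetical path, not the exact commands from the run:

  # Set the directory default so new IOR files stripe across all available OSTs.
  lfs setstripe -c -1 /wombat/ior-test
  # Verify the default stripe count on the directory.
  lfs getstripe -d /wombat/ior-test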

ZFS & Lustre Metadata MDS Same base hardware as OSS No JBOD Dual 10GbE NICs Intel Xeon E5-2670v2 (2.50GHz, 10 c) MDT Configuration 12 SSDs per MDS Intel SSD DC S3500 Series 300GB RAID10 Stripe of over 5 mirrored pairs, 2 global spares DNE Failover MDS0 mdt0 mgs MDS1 mdt1

On MDS0: ZFS & Lustre Metadata zfs set recordsize=4096 wombat-mdt0 zfs create wombat-mdt0/wombat-mdt0 mkfs.lustre --reformat --mdt --mgs --failover=<nid of MDS1> \ --fsname=wombat --backfstype=zfs --index=0 wombat-mdt0/wombat-mdt0 On MDS1: zfs set recordsize=4096 wombat-mdt1 zfs create wombat-mdt1/wombat-mdt1 mkfs.lustre --reformat --mdt --failnode=<nid of MDS0> \ --mgsnode=<nid of MDS0>:<NID of MDS1> --fsname=wombat \ --backfstype=zfs --index=1 wombat-mdt1/wombat-mdt1

mdtest (Lustre): 1152 tasks, 48 nodes, 1.8M files/directories; mdtest -I 25 -z 5 -b 2. (Bar chart, Ops/s on a log scale; per-operation rates of 62547, 55585, 26744, 7889, 8260, 5151, 2943, 2371, and 285 ops/s.)

mdtest (ZFS): 16 tasks, 1 node, 250k files/directories; mdtest -I 250 -z 5 -b 2. (Bar chart, Ops/s on a log scale; per-operation rates of 581005, 445394, 279796, 24509, 14375, 18055, 15049, 23507, and 323 ops/s.)
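For reference, the Lustre run above would have been launched along these lines; the MPI launcher options and target directory are assumptions, while the mdtest flags come from the slides:

  # 1152 MPI tasks across 48 nodes (24 tasks/node), 25 iterations,
  # a directory tree of depth 5 with branching factor 2.
  mpirun -np 1152 -npernode 24 mdtest -I 25 -z 5 -b 2 -d /wombat/mdtest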

(Combined bar chart comparing the Lustre and native ZFS mdtest rates above, Ops/s on a log scale.)

Lustre Stack Notes
Linux 3.10.65 (kernel.org)
SPL: GitHub master
ZFS: GitHub master and pull 2865 (https://github.com/behlendorf/zfs/tree/largeblock)
Lustre: master (~v2.6.92) and the following patches:
LU-4820 osd: drop memcpy in zfs osd (you need this!)
LU-5278 echo: request pages in batches
LU-6038 osd-zfs: Avoid redefining KM_SLEEP
LU-6038 osd-zfs: sa_spill_alloc()/sa_spill_free() compat
LU-6152 osd-zfs: ZFS large block compat
LU-6155 osd-zfs: dbuf_hold_impl() called without the lock

This work was supported by the National Science Foundation, award ACI-1341698.