Lustre architecture for LCLS@SLAC
Riccardo Veraldi, for the LCLS IT Team


LCLS Experimental Floor

LCLS Parameters

LCLS Physics
LCLS has already had a significant impact on many areas of science, including:
- Resolving the structures of macromolecular protein complexes that were previously inaccessible
- Capturing bond formation in the elusive transition state of a chemical reaction
- Revealing the behavior of atoms and molecules in the presence of strong fields
- Probing extreme states of matter

LCLS Data Challenges
From the beginning, the LCLS data systems group has faced these challenges:
1. Ability to read out, event-build and store multi-GB/s data streams
2. Capability for experimenters to analyze data on the fly (real time)
3. Flexibility to accommodate new user-supplied equipment
4. Capacity to store and analyze PB-scale data sets
5. Changing analysis software/algorithms implemented by non-expert users (weekly!)

LCLS Data Flow: Current
(Diagram.) In each DAQ system (one instance per instrument, 7 total), readout nodes R1...RN write into a data cache layer. In the online system (one instance per experimental hall, 2 total), data movers feed a fast feedback layer (Gluster) serving fast feedback processing nodes and online monitoring nodes, with users connected over 1 Gb/s Ethernet. Data movers then carry the DAQ traffic over InfiniBand QDR to the offline layer (Lustre) of the offline system (shared across the LCLS instruments), which hosts the offline processing nodes, irods traffic, automatic tape archiving, and DTNs connected to ESNET through the SLAC router for on-demand, user-driven transfers.

Current LCLS Data Systems Architecture
(Diagram.) Data acquisition (AMO DAQ, SXR DAQ, XPP DAQ, ...) writes into high-bandwidth cluster storage (5 petabyte), with XTC-to-HDF5 translation, calibration, an irods file manager, databases, and web apps & services alongside. Analysis farms (60 teraflop) process the data, HPSS provides 2+ PB of tape, and data transfer nodes connect to ESNET for automatic, on-demand and user-directed transfers.

Data Collection Statistics (Oct 2009 - May 2014)
(Chart; y-axis in terabytes.)

LCLS Data Strategy: Drivers
LCLS-II upgrade: The high repetition rate (1 MHz) and, above all, the potentially very high data throughput (100 GB/s) generated by LCLS-II will require a major upgrade of the data acquisition system and increased data processing capabilities.
Fast feedback: Experience has shown that capable real-time analysis is critical to the users' ability to take informed decisions during an LCLS experiment. Powerful fast feedback capabilities (on minute or faster timescales) reduce the time required to complete an experiment, improve the overall quality of the data, and increase the success rate of experiments.
Time to science: Sophisticated analysis frameworks can significantly reduce the time between experiment and publication, improving productivity.
LCLS science community, no user left behind: Most of the advanced algorithms for analysis of LCLS science data have been developed by external groups with enough resources to dedicate to a leading-edge computing effort. Smaller groups with good ideas may be hindered in their ability to conduct science by not having access to these advanced algorithms. LCLS support for externally developed algorithms and, possibly, development of in-house algorithms for some specific science domains would alleviate this problem.

Infrastructure Challenges (2)
Data processing: We expect that LCLS-II will require peta- to exascale HPC. Deploying and maintaining very large processing capacity at SLAC would require a significant increase in the capabilities of the existing LCLS and/or SLAC IT groups.
Data network: SLAC recently upgraded its connection to ESNET from 10 Gb/s to 100 Gb/s. The primary reason for upgrading this link is to gain the ability to offload part of the LCLS science data processing to NERSC while keeping up with the DAQ. The 100 Gb/s link will need to be upgraded to 1 Tb/s for LCLS-II.
Data format: The translation step from XTC (the DAQ format) to HDF5 (the users' format) will become a bottleneck in the future, and LCLS-II should adopt a single data format. HDF5 is the de-facto standard for storing science data at light source facilities. In order to effectively replace XTC at LCLS, a couple of critical features are required: the ability to read while writing, and the ability to consolidate multiple writers into a consistent virtual data set.

LCLS-II Data Throughput, Data Storage and Data Processing Estimates
Examples, LCLS-II 2020 (a rough sanity check of these products follows after the table below):
- 1 x 16 Mpixel ePix @ 360 Hz = 12 GB/s
- 10K-point fast digitizers @ 100 kHz = 2 GB/s
- 2 x 4 Mpixel ePix @ 5 kHz = 80 GB/s
- Distributed diagnostics in the 1-10 GB/s range
Example, LCLS-II 2025:
- 3 beamlines x 2 x 4 Mpixel ePix @ 100 kHz = 4.8 TB/s
Data parameters scaling between LCLS-I and LCLS-II:

Parameter           | LCLS-I      | LCLS-II 2020 | LCLS-II 2025
Average throughput  | 0.1-1 GB/s  | 2-20 GB/s    | 20 GB/s - 1.2 TB/s
Peak throughput     | 5 GB/s      | 100 GB/s     | 4.8 TB/s
Peak processing     | 50 TFLOPS   | 1 PFLOPS     | 60 PFLOPS
Data storage        | 5 PB        | 100 PB       | 6 EB
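The detector-rate examples above are simple pixels x bytes x frame-rate products. A minimal sanity check, assuming 2 bytes per pixel (an assumption, not stated on the slide):

    # 2020 example: 16 Mpixel ePix at 360 Hz, assuming 2 bytes/pixel (hypothetical pixel depth)
    echo $(( 16 * 1000 * 1000 * 2 * 360 )) bytes/s        # ~11.5e9 B/s, i.e. ~12 GB/s
    # 2025 example: 3 beamlines x 2 x 4 Mpixel at 100 kHz, same 2 bytes/pixel assumption
    echo $(( 3 * 2 * 4 * 1000 * 1000 * 2 * 100000 )) bytes/s   # 4.8e12 B/s = 4.8 TB/s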

Data Analytics at the Exascale for Free Electron Lasers
$10M over 4 years: 40% SLAC (LCLS, CS), 20% LANL, 40% LBL (CAMERA, MBIB, NERSC)
Focus areas: high data throughput experiments; single particle imaging; the LCLS data analysis framework; infrastructure:
- Algorithmic improvements with IOTA (Integration Optimization, Triage and Analysis) and ray tracing, using Serial Femtosecond Crystallography as the example test case
- Algorithmic advances with M-TIP (Multi-Tiered Iterative Phasing)
- Porting psana to supercomputer architectures, changing the parallelization technology to allow scaling from hundreds of cores (now) to hundreds of thousands of cores
- Data flow from SLAC to NERSC over ESnet
Sauter, Brewster - LBNL/MBIB; Zwart, Donatelli, Sethian - LBNL/CAMERA; Aiken - Stanford/SLAC CS; Shipman - LANL; O'Grady - SLAC/LCLS; Perazzo - SLAC/LCLS; Skinner - LBNL/NERSC; Guok - LBNL/ESnet

Data Systems Architecture: Evolution
(Diagram.) The DAQ systems (AMO DAQ, SXR DAQ, XPP DAQ, ...) write into flash-based storage (1 PB) at SLAC, which feeds a fast feedback analysis farm as well as the SLAC analysis farm, tape, calibration, databases and web apps & services. Data transfer nodes move data automatically, on demand or as directed by users over ESNET to NERSC, which provides an additional analysis farm, storage and tape.

LCLS Network needs and border links

Core Technologies
- InfiniBand wherever possible (i.e. over short distances); a minimal LNet configuration sketch follows below
- NVRAM devices and NVMe over fabrics
- 100 Gb/s and 400 Gb/s Ethernet between the experimental halls and the data center(s)
- Many-core CPUs (KNL); see the NERSC slides for future exascale architectures
- HDF5 as the data format
- SDN
- Python as the programming language, with C/C++ kernels
- Main open question: file system technology. Lustre? Object storage? Other?
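Lustre clients and servers reach each other over the InfiniBand fabric through LNet. A minimal sketch of bringing up an o2ib LNet interface with lnetctl, assuming the interface name ib0 and the network name o2ib0 (illustrative values, not taken from the slides):

    # Load LNet and add an o2ib (InfiniBand) network on interface ib0
    modprobe lnet
    lnetctl lnet configure
    lnetctl net add --net o2ib0 --if ib0
    # Show the resulting NID (e.g. <ip>@o2ib0) that servers and clients will use
    lnetctl net show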

Lustre architecture

Lustre @ LCLS
- Analysis FS: 5 PB on spinning disk, Lustre/ldiskfs, 8 MDS, 21 OSS
- FFB FS: 100 TB on SSD (Intel SSDSC2BB480G6, 480 GB), Lustre/ZFS, 2 MDS, 16 OSS
A striping sketch for the analysis file system follows below.
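Streaming multi-GB/s detector files benefit from striping across OSTs. A minimal sketch with lfs; the directory path, stripe count and stripe size are illustrative assumptions, not the production settings:

    # Stripe new files in this (hypothetical) experiment directory across 8 OSTs, 4 MB stripes
    lfs setstripe -c 8 -S 4M /lustre/exp/xpptut15
    # Inspect the layout that was applied and the per-OST free space
    lfs getstripe /lustre/exp/xpptut15
    lfs df -h /lustre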

Lustre scalability and performance
- Client scalability, known production usage: 50,000+ clients, many in the 10,000 to 20,000 range
- Client performance, theoretical range: single client 90% of network bandwidth; aggregate 10 TB/s
- Client performance, known production usage: single client 4.5 GB/s (FDR IB); aggregate 2.5 TB/s

Lustre with ZFS
- Extreme performance at scale
- Integrated security
- SW management stack (raidz, raidz2, raidz3)
- Data integrity and recovery
- Snapshot support (from Lustre 2.10.x)
- Open source and extensible

ZFS unique features
- Reliability: data is always consistent on disk; silent data corruption is detected and corrected; smart rebuild strategy
- Compression
- Snapshot support built into Lustre: consistent snapshots across all the storage targets without stopping the file system (Lustre 2.10.x)
- Hybrid storage pools: data is tiered automatically across DRAM, SSD/NVMe and HDD, accelerating random and small-file read performance
- Manageability: powerful storage pool management makes it easy to assemble and maintain Lustre storage targets from individual devices
A sketch of the compression and snapshot commands follows below.
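As a concrete illustration of two of these features, the sketch below enables lz4 compression on a ZFS-backed OST dataset and takes a file-system-wide Lustre snapshot with lctl (Lustre 2.10+). The pool, dataset and snapshot names are assumptions, not the LCLS production names:

    # Enable lz4 compression on a hypothetical ZFS-backed OST dataset
    zfs set compression=lz4 ffb21-ost1/ost1
    # Take a consistent snapshot across all targets of the ffb21 file system (run on the MGS)
    lctl snapshot_create -F ffb21 -n before_run42
    lctl snapshot_list   -F ffb21
    # Mount the snapshot read-only for inspection, then unmount it
    lctl snapshot_mount  -F ffb21 -n before_run42
    lctl snapshot_umount -F ffb21 -n before_run42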

LCLS Fast Feedback nodes: Lustre/ZFS implementation
Two Lustre/ZFS clusters (a build sketch for one OSS follows below):
- NEH, ffb21, 50 TB: 1 MDS (oVirt VM), 6 OSS, 3 OST per OSS (18 OST in total), 24x 480 GB Intel SSDs and 3 RAIDZ zpools per OSS
- FEH, ffb11, 50 TB: 1 MDS (oVirt VM), 6 OSS, 3 OST per OSS (18 OST in total), 24x 480 GB Intel SSDs and 3 RAIDZ zpools per OSS
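A minimal sketch of how one such OSS could be built: an 8-disk raidz1 zpool per OST, with a ZFS-backed Lustre OST created on top of it. The disk IDs, index and MGS NID are placeholders, not the exact LCLS procedure:

    # One OST on one OSS: 8-disk raidz1 pool, then a ZFS-backed Lustre OST on it
    zpool create -o ashift=12 ffb21-ost1 raidz1 \
        /dev/disk/by-id/ata-INTEL_SSD_1 /dev/disk/by-id/ata-INTEL_SSD_2 \
        /dev/disk/by-id/ata-INTEL_SSD_3 /dev/disk/by-id/ata-INTEL_SSD_4 \
        /dev/disk/by-id/ata-INTEL_SSD_5 /dev/disk/by-id/ata-INTEL_SSD_6 \
        /dev/disk/by-id/ata-INTEL_SSD_7 /dev/disk/by-id/ata-INTEL_SSD_8
    mkfs.lustre --ost --backfstype=zfs --fsname=ffb21 --index=0 \
        --mgsnode=172.21.0.1@o2ib0 ffb21-ost1/ost1
    mkdir -p /mnt/lustre/ffb21-ost1
    mount -t lustre ffb21-ost1/ost1 /mnt/lustre/ffb21-ost1
    # Repeat for ost2 and ost3 on this OSS (3 OSTs, 24 disks per OSS)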

LCLS Fast Feedback nodes: Lustre/ZFS cluster architecture
(Diagram.) Six OSS (OSS1-OSS6), each serving three ZFS-pool OSTs (OST1-OST18), plus one MDS; Lustre clients reach the servers over TCP and InfiniBand. A client mount sketch follows below.
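For completeness, a sketch of how a client could mount one of these file systems over the InfiniBand LNet; the MGS NID and mount point are assumptions:

    # Mount the ffb21 file system on a client over o2ib (placeholder NID and path)
    mkdir -p /ffb21
    mount -t lustre 172.21.0.1@o2ib0:/ffb21 /ffb21
    # Sanity checks: LNet connectivity to the MGS and per-OST free space
    lctl ping 172.21.0.1@o2ib0
    lfs df -h /ffb21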

OSS ZFS configuration

    # zpool status
      pool: ffb21-ost1
     state: ONLINE
      scan: none requested
    config:

            NAME                                         STATE     READ WRITE CKSUM
            ffb21-ost1                                   ONLINE       0     0     0
              raidz1-0                                   ONLINE       0     0     0
                ata-intel_ssdsc2bb48g6_phwa61744gu48fgn  ONLINE       0     0     0
                ata-intel_ssdsc2bb48g6_phwa617466t48fgn  ONLINE       0     0     0
                ata-intel_ssdsc2bb48g6_phwa61744my48fgn  ONLINE       0     0     0
                ata-intel_ssdsc2bb48g6_phwa61924a48fgn   ONLINE       0     0     0
                ata-intel_ssdsc2bb48g6_phwa61745wb48fgn  ONLINE       0     0     0
                ata-intel_ssdsc2bb48g6_phwa617464c48fgn  ONLINE       0     0     0
                ata-intel_ssdsc2bb48g6_phwa61744le48fgn  ONLINE       0     0     0
                ata-intel_ssdsc2bb48g6_phwa61744ce48fgn  ONLINE       0     0     0

    errors: No known data errors

Conclusions
For the future we need very fast flash-based storage:
- By 2020, up to 20 GB/s average (100 GB/s peak)
- By 2025, up to 1 TB/s average (5 TB/s peak)
We are looking forward to Lustre/ZFS over NVMe, and to Lustre alternatives (BeeGFS/ZFS?) if we cannot achieve our goal.
At the moment, and through 2020, Lustre can fit our needs. Intel is improving Lustre release after release, and especially Lustre/ZFS.