High-Performance Lustre with Maximum Data Assurance


Silicon Graphics International Corp.
900 North McCarthy Blvd., Milpitas, CA 95035

Disclaimer and Copyright Notice: The information presented here is meant to be general discussion material only. SGI does not represent or warrant that its products, solutions, or services as set forth in this document will ensure that the reader is in compliance with any laws or regulations.

© 2015 Silicon Graphics International. All rights reserved.

TABLE OF CONTENTS

1.0 Introduction
2.0 The Lustre File System
    2.1 Metadata Management
    2.2 Scale-Out Object Storage
    2.3 Data Assurance Through Integrated T10 PI Validation
    2.4 Simple and Standard Client Access to Data
3.0 T10 PI End-to-End Assurance
4.0 A Building Block Approach
5.0 Benchmark Process and Results
    5.1 IOR POSIX Buffered Sequential I/O Results
    5.2 IOR POSIX Buffered Random I/O Results
    5.3 IOR POSIX Direct I/O Sequential Results
    5.4 IOR POSIX DIO Random Results
6.0 Conclusion

1.0 Introduction

In High-Performance Computing (HPC), there is a strong correlation between the compute power of the solution and the ability of the underlying data storage system to deliver the needed data for processing. As processor power increases, the goal of system architects is to design systems with an appropriate balance of data storage, data movement, and data computing power, and to do so in a manner that optimizes the overall processing output of the system at a given price point.

Lustre storage solutions based on an optimized combination of SGI servers and NetApp storage arrays provide an excellent storage foundation for HPC researchers, universities, and enterprises that need to deploy a high-throughput, scale-out, commercially supported, and cost-effective parallel file system storage solution. These SGI-delivered storage solutions use Intel Enterprise Edition for Lustre software, a commercially hardened and supported version of Lustre, the leading HPC open source parallel file system. Additionally, by leveraging industry-leading data assurance protocols such as T10 PI, the SGI-NetApp Lustre storage solutions are able to deliver the highest levels of data assurance and protection throughout the end-to-end data path as storage volumes grow and the potential for undetected bit errors increases. The result is a scale-out HPC storage solution capable of providing reliability and performance, based on an architecture that allows easy future scaling of both capacity and performance.

This white paper provides a brief overview of the Lustre file system and configuration information for a scale-out SGI Lustre solution architecture that leverages NetApp-based block storage. The solution overview is followed by performance analysis and conclusions obtained through structured benchmark tests.

2.0 The Lustre File System

Lustre is a parallel file system that delivers high performance through a scale-out approach that divides the workload among numerous processing nodes. While the processing power of numerous data storage servers is available, the system presents a traditional file system namespace that can be leveraged by hundreds or thousands of compute nodes using traditional file-based data access methods. A Lustre installation is made up of three key elements: the metadata management system, the object storage subsystem that handles the actual file/data storage, and the compute nodes from which data/file access is performed.

2.1 Metadata Management

The metadata management system is made up of a Metadata Target (MDT) and a corresponding Metadata Server (MDS). The MDT stores the actual metadata for the file system, including elements such as file names, file timestamps, access permissions, and information about the storage location of the data objects for any given file within the object storage system. Within Lustre, the MDS is the server that services requests for file system operations and manages the MDT. More recent versions of Lustre include a scalable metadata capability that allows request loads to be spread across multiple servers, and in most deployments the MDS is configured within a high-availability (HA) environment to ensure ongoing availability of the file system in the event of a server or component failure.
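To make the MDT/MDS relationship concrete, the sketch below shows how a metadata target might be formatted and mounted on a metadata server. It is a minimal, hypothetical example: the file system name (lfs01), device path, and server NID are placeholders and are not taken from the configuration described in this paper.

    # Format a combined MGS/MDT on the metadata server (hypothetical device and fsname)
    mkfs.lustre --fsname=lfs01 --mgs --mdt --index=0 \
        --servicenode=mds01@o2ib /dev/mapper/mdt0

    # Mounting the target starts the MDS service for the file system
    mkdir -p /lustre/mdt0
    mount -t lustre /dev/mapper/mdt0 /lustre/mdt0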

2.2 Scale-Out Object Storage

The object storage system is where the scale-out attribute of Lustre comes into play. It is made up of some number of Object Storage Servers (OSS), which manage the storage and retrieval of data, and some number of Object Storage Targets (OST), which are the locations where the actual data is written and read by the OSS. Lustre deployments typically include numerous OSS nodes and multiple OST storage destinations, and this scale-out attribute of Lustre allows object storage building blocks to be defined so that additional capacity and/or throughput may be added to the system through incremental building block elements. In general, administrators increase the number of OSS nodes in order to increase the data transfer bandwidth that the storage system can support on the network. OST storage is configured to meet both the capacity requirements of the overall system and the data throughput/performance requirements of the OSS nodes.

Within scale-out file systems (often referred to as parallel file systems) like Lustre, high performance is achieved by having the system stripe data across multiple storage locations (OSTs) so that file read/write operations benefit from the throughput of many storage devices working in parallel. The result is a system that can deliver throughput at levels far exceeding the capabilities of any single device or node. A sketch of the corresponding client-side commands follows Section 2.4 below.

2.3 Data Assurance Through Integrated T10 PI Validation

The data presented in this white paper looks at the performance of a single pair of highly available OSS nodes within the storage cluster. Additionally, the performance data presented is based on an SGI-and-NetApp Lustre configuration that leverages the T10 PI data assurance protocol in order to deliver extremely high levels of data validation and assurance. Later sections of this document provide further information on T10 PI and the value it delivers in highly scalable storage solutions.

2.4 Simple and Standard Client Access to Data

The Lustre storage solution includes client software that enables access to the scale-out Lustre storage system through a standard file system interface. This standard presentation allows client applications and tools to immediately leverage Lustre-based data storage with no additional work or testing required.
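As an illustration of the striping and client-access behavior described above, the commands below show how a client might mount the file system and control striping for a directory with the lfs tool. This is a hedged sketch: the MGS NID, file system name, mount point, and stripe parameters are illustrative and are not drawn from the benchmark configuration in this paper.

    # Mount the Lustre file system on a compute node (hypothetical MGS NID and fsname)
    mount -t lustre mds01@o2ib:/lfs01 /mnt/lustre

    # Stripe new files in this directory across all available OSTs with a 1 MiB stripe size
    lfs setstripe -c -1 -S 1m /mnt/lustre/results

    # Inspect the striping layout that files in the directory will inherit
    lfs getstripe /mnt/lustre/results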

3.0 T10 PI End-to-End Assurance

T10 Protection Information (T10 PI) is an important standard that reflects the storage and data management industry's commitment to end-to-end data integrity validation. By validating data at numerous points within the I/O flow, T10 PI prevents silent data corruption, ensuring that invalid, incomplete, or incorrect data will never overwrite good data. Without T10 PI, data corruption events may slip through the cracks and result in numerous negative outcomes, including system downtime, lost revenue, or lack of compliance with regulatory standards.

Protection Information (PI) adds an extra eight bytes of information to the 512-byte sectors typical of enterprise hard drives, increasing the sector size to 520 bytes. These eight bytes of metadata consist of guard (GRD), application (APP), and reference (REF) tags that are used to verify the 512 bytes of data in the sector. Complementing PI, DIX is a technology that specifies how I/O controllers can exchange this metadata with a host operating system. The combination of DIX (data integrity between application and I/O controller) and PI (data integrity between I/O controller and disk drive) delivers end-to-end protection against silent corruption of data in flight between a sender and a receiver.

SGI Lustre solutions are able to implement end-to-end T10 PI in order to deliver an integrated data protection capability. With the SGI IS5600i using T10 PI end to end, organizations are assured that their data is protected from the time it leaves the server until the time it is next read and accessed. After the 8-byte PI field is set by the HBA during the data write process, that PI field is rechecked by the array twice as it crosses through the controller, and then verified one more time by the disk drive as it is written to storage media. During a read operation, the disk drive re-verifies the PI data before returning it to the controller, which performs two additional checks on the way to final verification.

SGI understands the importance of data and the integrity of that data within high-performance computing (HPC) environments, and has therefore focused on the implementation, validation, and promotion of Protection Information (PI) technology to provide customers with end-to-end data confidence.
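On a Linux host, the kernel exposes the integrity capability negotiated between a DIX-capable HBA and a PI-formatted drive through sysfs. The read-only checks below are one way to confirm that protection information is being generated and verified; this is a sketch under stated assumptions (a modern kernel and a hypothetical device name sdX), and the exact attribute set can vary between kernel versions.

    # Report the protection format the block layer has registered for the device
    cat /sys/block/sdX/integrity/format          # e.g. T10-DIF-TYPE1-CRC

    # A value of 1 means the kernel generates PI on writes and verifies it on reads
    cat /sys/block/sdX/integrity/write_generate
    cat /sys/block/sdX/integrity/read_verify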

4.0 A Building Block Approach

While the deployment of Lustre solutions involves a variety of solution components and servers, predictable high-performance results can be achieved by leveraging configurations that have been pre-validated, documented, and benchmarked. This document presents configuration details and associated performance results based on extensive SGI and NetApp configuration validation work; customers can leverage this work to deploy solutions with excellent performance and the highest levels of data assurance based on the integrated T10 PI features built into the solution.

For this document, SGI introduces the concept of a Scalable Storage Unit (SSU), comprising two Lustre OSS nodes connected to an SGI IS5600i storage array (based on technology from NetApp). The purpose of this SSU-based approach is to create a Lustre scale-out building block that can be replicated as needed to scale throughput and capacity. The overall test configuration and dual-OSS SSU is shown in the following diagram.
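The SSU concept maps naturally onto how Lustre object storage is provisioned: each additional building block contributes a set of OSTs registered against the same management server. The sketch below shows how the OSTs of one hypothetical SSU might be formatted and started from an OSS node; the volume names, OST indexes, and MGS NID are placeholders, not the actual IS5600i volume layout used in the benchmark.

    # Format the OST volumes presented by one SSU's storage array (hypothetical devices and indexes)
    for i in 0 1 2 3; do
        mkfs.lustre --fsname=lfs01 --ost --index=$i \
            --mgsnode=mds01@o2ib /dev/mapper/ssu0_ost$i
    done

    # Mounting each target brings it into the file system and makes it available for striping
    mkdir -p /lustre/ost0
    mount -t lustre /dev/mapper/ssu0_ost0 /lustre/ost0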

Benchmark test configuration:

Server Function       | Lustre MDS Server                          | Lustre OSS Servers                                      | Lustre Clients
Hostname              | MDS01                                      | OSS 1-2                                                 | I/O Benchmark Lustre Clients
SGI Platform          | SGI CH-C1104-GP2 Highland Server           | SGI CH-C1104-GP2 Highland Server                        | SGI ICE X Cluster
Processor Type        | Intel Xeon E5-2690 v3, 2.60GHz, 30MB Cache | Intel Xeon E5-2690 v3, 2.60GHz, 30MB Cache              | Intel Xeon E5-2690 v3, 2.60GHz, 30MB Cache
Number of Nodes       | 1                                          | 2                                                       | 64
Total Cores per Node  | 24                                         | 24                                                      | 24
Memory & Memory Speed | 128 GB, 2133MHz                            | 128 GB, 2133MHz                                         | 128 GB, 2133MHz
Local Storage         | 1x SATA 1TB 7.2K RPM 3Gb/s Drive           | 1x SATA 1TB 7.2K RPM 3Gb/s Drive                        | Diskless Blades
Network Interconnect  | IB FDR 4x, 56Gb/s Bandwidth, <1usec Latency| IB FDR 4x, 56Gb/s Bandwidth, <1usec Latency             | IB FDR 4x, 56Gb/s Bandwidth, <1usec Latency
OS                    | RHEL v6.5, Mellanox OFED v2.3              | RHEL v6.5, Mellanox OFED v2.3                           | SLES11 SP3, Mellanox OFED v2.3
Lustre Software       | Intel Enterprise Edition for Lustre 2.2 (Lustre 2.5.x) | Intel Enterprise Edition for Lustre 2.2 (Lustre 2.5.x) | Intel Enterprise Edition for Lustre 2.2 (Lustre 2.5.x)
SGI Storage Platform  | SGI IS5600 (16G FC Interface)              | SGI IS5600i w/ 6Gb SAS, T10 PI Data Assurance Enabled   | -
Storage Enclosure     | 24-Bay Enclosure (only 12 drives used)     | 1x 60-Bay Storage Controller + 1x 60-Bay Expansion      | -
Drive Type            | 4x 200GB 6Gb/s SAS Enterprise SSD          | 120x 6TB 7.2K RPM 6Gb/s NL-SAS                          | -
RAID Protection       | RAID10, Write Cache Mirror Enabled         | RAID6 (8+2), 128K Segment Size, WCM Enabled, DA Enabled | -

5.0 Benchmark Process and Results

This report summarizes the results of the IOR I/O benchmarks. Included in this report are the details of the benchmark environment, the commands used, and the results achieved while performing the I/O benchmarks on an SGI IS5600i storage array with two OSS servers based on Intel Enterprise Edition for Lustre software.

IOR is an industry-standard I/O benchmark used for benchmarking parallel file systems. The IOR application spends approximately 96% of its runtime in I/O, 1% in CPU and memory bandwidth, and 3% in MPI communications. I/O performance is therefore determined by the performance of the proposed storage and interconnects rather than by the processor speed or memory bandwidth of the Lustre client.

To capture end-to-end data protection using T10 PI, the SGI Lustre OSS servers had two Emulex LightPulse 16Gb Fibre Channel (T10 PI) HBAs installed, and the IS5600i storage array was configured with Data Assurance enabled to prevent silent data corruption. The Emulex BlockGuard Data Integrity (offload) feature was enabled in the kernel module configuration file lpfc.conf. All testing completed successfully, and the results show that no performance impact was introduced by enabling the T10 PI assurance elements.
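The paper notes that BlockGuard was enabled through the lpfc kernel module configuration but does not reproduce the file. As an illustration only, a typical way to enable T10 PI offload on the Emulex lpfc driver is a modprobe options file like the one below; the exact parameter set used in the benchmark is not stated in this document.

    # /etc/modprobe.d/lpfc.conf -- illustrative; enables BlockGuard (T10 PI) offload in the lpfc driver
    options lpfc lpfc_enable_bg=1

    # Reload the driver (or rebuild the initramfs and reboot) for the option to take effect
    modprobe -r lpfc && modprobe lpfc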

5.1 IOR POSIX Buffered Sequential I/O Results

The chart in Figure 1 shows the throughput results of a scaling benchmark from a single Lustre client up to 64 Lustre clients with 24 I/O threads per node. The aggregate file size (block size) was 192GB per client, which represents 1.5x the Lustre client physical memory, in order to mitigate the influence of the buffer cache.

Figure 1: IOR Buffered Sequential I/O Results
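The exact IOR command lines are not reproduced in this transcription, so the invocation below is only a representative sketch of a buffered, sequential, file-per-process run consistent with the description above (24 tasks per client node, 8 GiB per task for an aggregate of 192 GiB per client). The MPI launcher, host file, and output path are assumptions.

    # Buffered sequential IOR run: 24 tasks per client node, file-per-process, 1 MiB transfers
    CLIENT_NODES=64   # scaled from 1 to 64 in the benchmark
    mpirun -np $((CLIENT_NODES * 24)) --hostfile clients.txt \
        IOR -a POSIX -w -r -e -F -t 1m -b 8g -o /mnt/lustre/ior_seq/testfile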

5.2 IOR POSIX Buffered Random I/O Results

The chart in Figure 2 shows the Buffered Random I/O throughput results. As discussed previously, the aggregate file size (block size) was 192GB per client, which represents 1.5x the Lustre client physical memory, in order to mitigate the influence of the buffer cache.

Figure 2: IOR Buffered Random I/O Results

5.3 IOR POSIX Direct I/O Sequential Results

The chart in Figure 3 shows the throughput results of Direct I/O using sequential file access. Scaling is from a single Lustre client to 64 Lustre clients. For the direct I/O benchmarks, the aggregate file size was reduced to 96GB per client, since direct I/O requests bypass the Linux kernel buffer cache.

Figure 3: IOR Direct I/O Sequential Results
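For the direct I/O runs in Sections 5.3 and 5.4, IOR provides dedicated flags; the sketch below shows how the buffered sequential invocation above might be adapted (a smaller 4 GiB block per task for roughly 96 GiB per client, -B to use O_DIRECT, and -z for the random-offset variant). As before, this is an assumed command line, not the one used for the published results.

    # Direct I/O, random-offset IOR run (O_DIRECT bypasses the client page cache)
    CLIENT_NODES=64
    mpirun -np $((CLIENT_NODES * 24)) --hostfile clients.txt \
        IOR -a POSIX -w -r -F -B -z -t 1m -b 4g -o /mnt/lustre/ior_dio/testfile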

5.4 IOR POSIX DIO Random Results

The chart in Figure 4 shows the throughput results of Direct I/O using random file access. Scaling is from a single Lustre client to 64 Lustre clients. For the direct I/O benchmarks, the aggregate file size was reduced to 96GB per client, since direct I/O requests bypass the Linux kernel buffer cache.

Figure 4: IOR Direct I/O Random Results

6.0 Conclusion

Based on the IOR benchmarks characterizing I/O performance, SGI concludes that Lustre performance is workload dependent, but that Lustre is an excellent parallel file system for supporting light to heavy I/O application workloads. For data protection, SGI uses industry-standard T10 PI data assurance technology to provide end-to-end data integrity in the SGI Lustre storage solution based on Intel Enterprise Edition for Lustre software.

A dual-OSS configuration combined with an SGI IS5600i storage array with 120 drives supports up to 6GB/sec; SGI defines this storage building block as a Scalable Storage Unit (SSU). A configuration of two SSUs increases throughput to 12GB/sec, and throughput above 100GB/sec sequential and 60/50 GB/sec random write/read, respectively, can be achieved through the straightforward addition of Scalable Storage Units as defined in this white paper.

Global Sales and Support: sgi.com

© 2015 Silicon Graphics International Corp. All rights reserved. SGI, SGI ICE, SGI UV, Rackable, NUMAlink, Performance Suite, Accelerate, ProPack, OpenMP and the SGI logo are registered trademarks of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. Intel, the Intel logo, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation in the U.S. and/or other countries. Linux is a registered trademark of Linus Torvalds in several countries. All other trademarks mentioned herein are the property of their respective owners. 06112015 4565