Deep Learning Performance and Cost Evaluation


Micron 5210 ION Quad-Level Cell (QLC) SSDs vs. 7200 RPM HDDs in Centralized NAS Storage Repositories

A Technical White Paper

Rene Meyer, Ph.D., AMAX Corporation
Publish date: October 25, 2018

Contents

Abstract
Introduction and Problem Statement
Tool Setup
Storage Drives Compared
Test Environment and System Configuration
  Figure 2: Server Specifications
Results
  Table 3: Local drive performance comparison
  Table 4: Performance comparison of local and remote attached SSD vs. HDD RAID5 storage arrays
  Read Bandwidth Comparison
Conclusion

Abstract

Near-instant access to training data is critical for most Deep Learning (DL) workloads to ensure that training durations are not negatively affected by data transfer times. Common practice is to use local NVMe SSDs as a data cache, while the storage backbone is a traditional NAS solution based on spinning media; data is streamed from the back end onto the local NVMe drive for processing close to the CPU. Growing data sets, deeper DL models with more hidden layers, larger input data, and the need to share training data sets across multiple users and models have created demand for a shared storage solution that outperforms HDD NAS. In this white paper, we present a performance and cost comparison between a 64TB all-flash NAS array built on the industry's first quad-level cell (QLC) enterprise SATA SSD, the Micron 5210 ION, and a 7200 RPM HDD-based array. Of particular interest was how a Mellanox InfiniBand EDR remote-attached QLC all-flash NAS array performs under DL-specific workloads, and how the remote solution compares to local NVMe flash. Compared to widely adopted triple-level cell (TLC) SSDs, QLC SSDs offer a lower cost per GB at the cost of reduced write endurance. We find 11x better performance for the QLC flash array over the HDD array in DL-specific synthetic FIO tests, as well as a significant acceleration of real-world DL applications, in some cases up to 40%. Test results are on par with local NVMe implementations, suggesting that NAS storage solutions built on read-intensive QLC SSDs deliver substantial, cost-effective performance gains over HDD-based solutions for shared centralized training data repositories. Given the read-intensive nature of deep learning use cases, where write endurance is not a concern, QLC SSD solutions are an attractive alternative to TLC SSD-based solutions.

Introduction and Problem Statement

The availability of GPGPU compute accelerators, open DL frameworks and models, and extensive data sets and databases, combined with high industry demand for data intelligence solutions, has triggered exponential growth in the Artificial Intelligence (AI) and DL segments of the IT market. Common examples are image/video stream recognition and analysis, speech-to-text, speech synthesis, and super-resolution, among many others. Applications include autonomous vehicles, automatic translation, social media, personal assistants, sports and training analytics, broadcasting, security and surveillance, threat detection, stock market prediction algorithms, consumer analysis, and countless others.

Deep learning specific workloads can be classified into two categories: (i) model development and training, and (ii) model deployment and inference. Model development and training requires very high compute and storage resources. Compute is commonly GPGPU-accelerated, and fast access to storage is realized through local NVMe or SATA SSDs. Inference, on the other hand, can run on less specialized hardware and only relies on GPU accelerators when there is a high number of concurrent users or when real-time analysis is required.

Underpinning most training and development related deep learning workloads is an extremely read-intensive IO profile: collections of data have to be fed into the GPGPUs hundreds of times (once per training epoch) to complete a model training, and successive iterations on model variants are trained against the same data sets to improve accuracy. Experts note that the read-to-write ratio can be as high as 5000:1, an abrupt departure from the traditional (pre-AI) data center read-to-write ratio of 4:1.

While GPU compute nodes are typically linked together via a high-speed fabric, in many cases 100GbE or EDR IB, the training data resides on an HDD NAS solution that is typically connected via 1GbE or 10GbE. To overcome the performance limitation of HDD NAS storage, training data sets are temporarily staged on a local SSD (SATA or NVMe). Increasing deployment sizes, growing data sets, larger models, parallel model training, and multi-user environments all favor a setup that enables fast access to all data in the centralized repository. Flash-based enterprise storage appliances provide fast access to data, but this has traditionally come at a significant price premium. New QLC enterprise SSDs offer a compelling means to reduce costs, as QLC NAND stores 33% more bits per cell and delivers read performance similar to traditional TLC-based SSDs. Because the Micron 5210 ION SATA SSD family is targeted as an HDD replacement option, this test compares the results of a QLC SSD deployment to an HDD deployment, while also putting the results in context with TLC-based NVMe all-flash, which carries a significant cost premium.
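To make the read-intensive profile concrete, a rough back-of-the-envelope calculation illustrates how quickly reads dominate writes during training. The dataset size, epoch count, and checkpoint size below are illustrative assumptions, not measurements from this study:

# Back-of-the-envelope storage traffic for one training run.
# All inputs are illustrative assumptions, not values measured in this study.
dataset_gb = 150        # training set on shared storage (~ImageNet scale)
epochs = 100            # full passes over the data set
checkpoint_gb = 0.5     # model checkpoint written per epoch (assumed)

total_read_gb = dataset_gb * epochs        # every epoch re-reads the full set
total_write_gb = checkpoint_gb * epochs    # writes are essentially checkpoints

print(f"reads:  {total_read_gb:,.0f} GB")                            # 15,000 GB
print(f"writes: {total_write_gb:,.0f} GB")                           # 50 GB
print(f"read:write ratio = {total_read_gb / total_write_gb:.0f}:1")  # 300:1

With sparser checkpointing, or when the same data set feeds many model variants in parallel, the ratio climbs toward the thousands-to-one regime cited above.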

Tool Setup

To study the performance difference between 7200 RPM SAS HDDs and QLC enterprise SATA SSDs in the context of DL workloads, we evaluate three configurations: individual drives, local RAID5 volumes consisting of 16 drives, and remote-attached NFS RAID5 16-drive volumes connected via an EDR InfiniBand fabric. For comparison, we also include a local NVMe drive. An FIO test specific to DL workloads and a CNTK training run are used as benchmarks. Note that while representative of the platforms and workloads tested in this study, the results presented may not be representative of all DL workloads.

Storage Drives Compared

Datasheet Specification            SSD: Micron 5210 ION     HDD: SAS 7200 RPM
Form Factor                        2.5-inch (7mm height)    3.5-inch (25.4mm height)
Interface                          6Gb/s SATA               12Gb/s SAS
Capacity                           3.84TB                   8TB
Sequential Read (128KB transfer)   540 MB/s                 215 MB/s
Sequential Write (128KB transfer)  350 MB/s                 215 MB/s
Random Read (4KB transfer)         83,000 IOPS              203 IOPS*
Random Write (4KB transfer)        7,000 IOPS               273 IOPS*
DWPD (5-year drive lifespan)       < 0.1                    N/A

Table 1: Drive specifications

*Note: HDD random IOPS performance specifications are not published in HDD datasheets; these values were measured using the industry-standard SNIA Performance Test Specification v1.1 IOPS test.

Test Environment and System Configuration

The test environment consists of a GPU compute server and a storage server, both equipped with a Mellanox EDR IB network adapter, a JBOD attached to the storage server, and an InfiniBand EDR fabric. Flash drives are located in the front of the 24-bay storage server with a passive backplane and are directly connected through four 4-port HD mini-SAS connectors to an internal 16-port HBA. HDD drives are hosted by a 60-drive JBOD and attached to the storage node through an external 4-port 12Gb/s SAS HBA. Details of the system configurations and software stacks are given below. In this study, a 100G Mellanox EDR InfiniBand fabric was chosen to maximize network bandwidth and minimize latency.

Component         GPU Compute Server                  NAS Storage Server
CPU               2x Intel 6146 3.2GHz Scalable       2x Intel E5-2680 v3
DRAM              24x 16GB DDR4 2666 MHz              8x 32GB DDR4 2666 MHz
GPU               8x Nvidia V100 16GB                 N/A
SSD               1x Micron 9200 PRO 3.84TB           16x Micron 5210 3.84TB
HDD               N/A                                 16x 7200 RPM SAS 4TB
HBA               N/A                                 1x 9300-16i (SSD), 1x 9300-8e (HDD)
Networking        Mellanox EDR 100G IB                Mellanox EDR 100G IB
Switch            1x Mellanox 36-port EDR InfiniBand  1x Mellanox 36-port EDR InfiniBand
Operating system  Ubuntu 16.04                        Ubuntu 16.04
Test              CNTK                                N/A
FIO version       3.2                                 N/A
NFS version       NFS 4.2 plus RDMA                   N/A

Figure 2: Server specifications

The FIO job file is listed below. A block size of 128KB is chosen as a representative size for, e.g., the ImageNet training database.

FIO script:

[dl-read]            ; job section header (required by fio; name is arbitrary)
rw=randread
ioengine=libaio
iodepth=32
bs=128k
numjobs=1
size=1000g
direct=1
group_reporting

Results

We first evaluate how the individual drives perform under deep learning specific workloads. The Micron 5210 ION SATA SSD outperforms the 7200 RPM HDD by more than a factor of 14 in both read bandwidth (479MB/s vs. 33MB/s) and read IOPS (3747 IOPS vs. 258 IOPS). For comparison, we add a locally attached Micron 9200 NVMe drive to the test as a reference for a best-practice implementation. The NVMe drive's read bandwidth, 3291MB/s, and read IOPS, 25100, are about a factor of 7 higher than the SATA SSD and a factor of 100 higher than the SAS HDD. We will later compare the performance of the local NVMe drive to HDD- and SSD-based remote-attached shared storage solutions.
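The bandwidth and IOPS figures in this section come from FIO runs of the job listed above. For repeatability, a small wrapper along the following lines can run the job and extract the read metrics from FIO's JSON output. This is a minimal sketch; it assumes fio 3.x is on the PATH and that the job file above is saved as dl-read.fio:

import json
import subprocess

# Run the DL-profile FIO job and report read bandwidth and IOPS.
result = subprocess.run(
    ["fio", "--output-format=json", "dl-read.fio"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

for job in report["jobs"]:
    read = job["read"]
    bw_mib_s = read["bw"] / 1024   # fio reports bandwidth in KiB/s
    print(f"{job['jobname']}: {bw_mib_s:.0f} MiB/s, {read['iops']:.0f} IOPS")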

DL training results shown in Table 3 refer to the duration of the first training epoch, in which the data is loaded from the storage media prior to training. The results indicate the impact of the storage subsystem on training duration and show how low drive performance can negatively affect training times. Training times of subsequent epochs may vary depending on whether data is partially or fully cached in memory or on a local drive.

Single Drive Eval (local)             Micron 9200 NVMe 4TB  Micron 5210 SATA 4TB  HDD 7200 RPM SAS 8TB, JBOD
DL training, local drive (sec/epoch)  850                   930                   5536
BW, FIO, read, 128k random (MB/s)     3291                  479                   33
IOPS, FIO, read, 128k random          25100                 3747                  258
Ave. latency (ms)                     1.3                   -                     -

Table 3: Local drive performance comparison

Next, we evaluate the performance of the local storage subsystems featuring 16x Micron 5210 SSDs and 16x 7200 RPM HDDs in a RAID5 configuration. 4TB of the 8TB total drive capacity is used to form the HDD RAID volume to match the capacity of the SSD volume. Both storage solutions have a raw capacity of around 64TB and a usable capacity of 53TB. The 16x SSD volume delivers a random read bandwidth of 3075MB/s and 24600 IOPS at an average latency of 1.3ms. This is about a factor of 8 better than the spinning-disk solution and on par with the local NVMe drive.

When the storage solutions are accessed through the EDR InfiniBand fabric via the NFS protocol, performance degrades slightly, to 2934MB/s for the SSD-based volume and 359MB/s for the HDD-based volume. The observed transfer rates are well within the bandwidth of the IB fabric; the degradation is expected to result from the overhead of the NFS software stack. Read IOPS performance is 22400 IOPS for the SSD volume and 2872 IOPS for the HDD volume. The average latency of the SSD volume increases from 1.3ms to 1.4ms. Results obtained from the FIO benchmark tests for both the local and remote 16-drive configurations are summarized in Table 4 below.
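Before turning to the combined results in Table 4, note that the quoted raw and usable capacities follow directly from the RAID5 geometry. A quick check, assuming the quoted 53TB usable figure is reported in binary units as most filesystem tools do:

# RAID5 capacity check for the 16-drive volumes (both arrays use a
# ~4TB-per-drive data footprint; the 8TB HDDs are partitioned down to match).
drives = 16
drive_tb = 3.84                                    # decimal terabytes per drive
raw_tb = drives * drive_tb                         # 61.4 TB ("around 64TB")
after_parity_tb = raw_tb * (drives - 1) / drives   # RAID5: one drive of parity
usable_tib = after_parity_tb * 1e12 / 2**40        # as reported by e.g. df -h

print(f"raw:          {raw_tb:.1f} TB")            # 61.4 TB
print(f"after parity: {after_parity_tb:.1f} TB")   # 57.6 TB
print(f"usable:       {usable_tib:.1f} TiB")       # ~52.4 TiB, i.e. the quoted 53TB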

                                           Micron 5210 SATA 4TB  HDD 7200 RPM SAS 8TB, JBOD
16x Drive Eval (local, RAID5)
BW, FIO, read, 128k random (MB/s)          3075                  370
IOPS, FIO, read, 128k random               24600                 2959
Ave. latency (ms)                          1.3                   10.8

16x Drive Eval (remote, NFS, 100G, RAID5)
DL training (sec/epoch)                    858                   1248
BW, FIO, read, 128k random (MB/s)          2934                  359
IOPS, FIO, read, 128k random               22400                 2872
Ave. latency (ms)                          1.4                   11.1
Estimated Costs (System Level)             $302/TB raw*          $53/TB raw**

Table 4: Performance comparison of local and remote attached SSD vs. HDD RAID5 storage arrays
*2U Storage Node, 24x 8TB drives, $0.22/GB
**1U Head Node, 60x 8TB JBOD

In the studied configuration, the SSD-based remote storage array exceeds the performance of a local SATA SSD and is on par with a local NVMe solution. The test results indicate that commonly used 1GbE, 10GbE, and even 25GbE storage networks may not be sufficient to fully support the performance of the SSD-based storage solution. Given these results, we recommend integrating the storage solution into an EDR InfiniBand or 100GbE compute fabric, or attaching it via a separate EDR InfiniBand/100GbE storage fabric.

[Figure 1: Read Bandwidth Comparison (random, 128k). Bandwidth summary of local and remote attached SSD vs. HDD RAID5 storage arrays; bars compare the Micron 9200 NVMe SSD, Micron 5210 SATA SSD, and 7200 RPM SAS HDD across three configurations: Single Drive (local), 16x Drive RAID5 Volume (local), and 16x Drive RAID5 Volume (remote, NFS). Bandwidth axis: 0-3500 MB/s.]
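The fabric recommendation follows from simple line-rate arithmetic. The sketch below compares theoretical link ceilings (before protocol overhead) against the measured NFS read throughput from Table 4:

# Compare common storage-network line rates against the measured
# 2934 MB/s NFS read throughput of the 16x QLC SSD array (Table 4).
measured_mb_s = 2934

for name, gbit_s in [("1GbE", 1), ("10GbE", 10), ("25GbE", 25),
                     ("100GbE / EDR IB", 100)]:
    ceiling_mb_s = gbit_s * 1000 / 8   # theoretical payload ceiling
    headroom = ceiling_mb_s / measured_mb_s
    print(f"{name:>16}: {ceiling_mb_s:7.0f} MB/s ceiling "
          f"({headroom:4.2f}x measured array throughput)")

Even 25GbE leaves only about 7% headroom before protocol overhead, while 1GbE and 10GbE cannot carry the array's throughput at all, which is why a 100Gb/s-class fabric is recommended.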

From a cost and performance perspective, the Micron 5210 based NFS shared storage solution is well positioned between a lower-performing legacy HDD-based storage solution and expensive enterprise-grade all-flash storage appliances.

Conclusion

Our cost and performance comparison of new QLC SSDs vs. HDDs in a NAS array shows that QLC SSDs deliver the high performance expected of SSDs at a price point between that of an HDD NAS and that of a traditional all-flash array built on more expensive TLC (triple-level cell) technology. While TLC SSDs deliver more endurance and greater random write performance, the read/write ratio of most DL workloads may not justify the price premium, which is why this test was conducted on QLC SSDs. After comparing a 64TB all-flash NAS array built on new QLC SSDs (Micron 5210 ION) to a 64TB 7200 RPM HDD-based solution, we find:

- The Micron 5210 ION QLC-based storage array is well suited for DL workloads, as these workloads are read-intensive (very little write IO traffic) and benefit from high read performance. While many SSDs offer similar read performance, the Micron 5210's use of QLC NAND lowers the cost per GB, making it a more attractive option for this use.
- The QLC SSD NAS solution performs significantly better than a traditional HDD-based NAS solution, both during normal operation and, in single-bit parity RAID mode, during the rebuild of a degraded volume.
- Depending on the current setup and the particular workload, the QLC SSD-based NAS array can reduce training times, increase GPU utilization, increase development efficiency, and streamline development processes through fast centralized shared storage.
- We recommend integrating SSD NAS storage solutions into a 100G high-speed compute fabric instead of attaching them via a separate 1GbE/10GbE/25GbE storage network. The bandwidth realized by high-speed compute fabric attachment enables GPU compute servers to fully benefit from the performance of the storage solution.
- The performance of the evaluated 64TB QLC SSD storage solution is on par with local NVMe storage. Depending on the size of the deployment, we suggest reducing the capacity of the local NVMe drive, or even replacing local storage with remote-attached storage.
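As a closing illustration, the cost figures from Table 4 can be folded together with the measured bandwidth into a rough throughput-per-dollar comparison. This is a sketch using the ~64TB raw capacity of the tested arrays; actual system pricing will vary:

# Rough bandwidth-per-dollar comparison using the Table 4 figures.
raw_tb = 64   # approximate raw capacity of each tested array

systems = {
    "QLC SSD NAS": {"cost_per_tb": 302, "bw_mb_s": 2934},
    "HDD NAS":     {"cost_per_tb": 53,  "bw_mb_s": 359},
}

for name, s in systems.items():
    total_cost = s["cost_per_tb"] * raw_tb          # estimated system cost
    per_kusd = s["bw_mb_s"] / (total_cost / 1000)   # MB/s per $1,000
    print(f"{name:>12}: ${total_cost:,} system, {per_kusd:.0f} MB/s per $1,000")

On these estimates, the QLC array delivers roughly 8x the bandwidth at under 6x the cost, so it comes out ahead on throughput per dollar as well as absolute performance.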