Exploring System Challenges of Ultra-Low Latency Solid State Drives

Exploring System Challenges of Ultra-Low Latency Solid State Drives. Sungjoon Koh, Changrim Lee, Miryeong Kwon, and Myoungsoo Jung. Computer Architecture and Memory Systems Lab

Executive Summary

Motivation: ultra-low latency (ULL) SSDs are emerging, but have not been characterized so far.

Contributions:
- Characterizing the performance behaviors of a ULL SSD.
- Studying several system-level challenges of the current storage stack.

Key Observations:
- The ULL SSD minimizes I/O interference when reads and writes are interleaved.
- NVMe queue mechanisms need to be optimized for ULL SSDs.
- The polling-based I/O completion routine isn't effective for current NVMe SSDs.

Architectural Change of SSD

[Block diagram: a SATA SSD hangs off the ICH (South Bridge) behind the MCH (North Bridge) and DRAM, while an NVMe SSD attaches directly to the CPU over PCI Express, giving direct access and high bandwidth.]

Evolution of SSDs

- SATA SSD: 0.5 GB/s reads, 0.5 GB/s writes.
- NVMe SSD: 2.4 GB/s reads, 1.2 GB/s writes.

Bandwidth has almost reached the maximum, but latency is still long (far from DRAM). A new flash memory, called Z-NAND, targets this latency gap.

New Flash Memory

- Existing 3D NAND: 45-120 μs reads, 660-5,000 μs writes.
- Z-NAND [1]: SLC-based 3D NAND with 48 stacked word-line layers, 64 Gb capacity, 2 kB pages; 3 μs reads (15-20x faster), 100 μs writes (6-7x faster).

Z-NAND-based drives are built as Z-SSDs.

Characterization Categories

Performance analysis:
- Average latency.
- Long-tail latency.
- Bandwidth.
- I/O interference impact.

Polling vs. interrupt:
- Overall latency comparison.
- CPU utilization analysis.
- Memory requirement.
- Five-nines (99.999th-percentile) latency.

Evaluation Settings

- OS: Linux 4.14.10.
- CPU: Intel Core i7-4790K (4 cores, 4.00 GHz).
- Memory: 16 GB DDR4 DRAM.
- SSDs: ULL SSD = Z-SSD prototype (800 GB); NVMe SSD = Intel SSD 750 series (400 GB).
- Benchmark: Flexible I/O Tester (FIO v2.99).
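
As a rough stand-in for one FIO data point, the sketch below times synchronous 4KB random reads through the Linux block stack with O_DIRECT. It is a minimal approximation of the talk's methodology, not its actual scripts; the device path and request count are placeholders.

```c
/* Minimal 4KB random-read latency probe. Device path and request
 * count are hypothetical placeholders, not from the original talk. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLK   4096
#define COUNT 1000

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT); /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLK, BLK)) return 1;  /* O_DIRECT needs alignment */

    struct timespec t0, t1;
    double total_us = 0;
    for (int i = 0; i < COUNT; i++) {
        off_t off = (off_t)(rand() % (1 << 20)) * BLK; /* random 4KB offset */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pread(fd, buf, BLK, off) != BLK) { perror("pread"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }
    printf("avg 4KB read latency: %.1f us\n", total_us / COUNT);
    close(fd);
    free(buf);
    return 0;
}
```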

Performance Analysis

Overview

The host issues interleaved 4KB reads and writes to the NVMe driver/controller while increasing the queue depth, and measures:
1. Average latency and long-tail latency.
2. Bandwidth.
3. Read latency under a read/write intermixed workload.

Average Latency of ULL SSD

[Figure: average latency (μs) vs. I/O depth for sequential/random reads and writes, NVMe SSD vs. ULL SSD.]

The ULL SSD lowers average latency by up to 5.1x for writes and 1.8x for reads. Its device-level read latency is about 11 μs: tR = 3 μs in the flash array plus 8 μs for the 4KB DMA. The Split-DMA and super-channel techniques, described next, attack the DMA term.
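
In equation form, the slide's breakdown of a 4KB read over a single channel is:

```latex
t_{\mathrm{read}} \approx t_R + t_{\mathrm{DMA}} = 3\,\mu\mathrm{s} + 8\,\mu\mathrm{s} = 11\,\mu\mathrm{s}
```

Since tR is already only 3 μs, the DMA transfer dominates, which is exactly what Split-DMA targets.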

Split-DMA & Super-Channel

Reference: Cheong, Woosung et al., "A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time," ISSCC, 2018.

The Z-SSD's split-DMA engine splits each 4KB request into two 2KB transfers and pushes them down a pair of channels bound into a super-channel (channels 0/1, 2/3, 4/5). Because the two halves move in parallel, the transfer time drops to tDMA = 4 μs.
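
A back-of-envelope model of that effect, using only the slide's numbers (the halving model is an illustrative simplification, not controller firmware):

```c
/* Split-DMA over a super-channel: a 4KB transfer that takes ~8 us on
 * one channel is split into two 2KB halves moved in parallel. */
#include <stdio.h>

int main(void)
{
    double t_dma_4k_one_channel = 8.0;             /* us, from the slide */
    double t_dma_2k = t_dma_4k_one_channel / 2.0;  /* half the payload   */
    double t_dma_super_channel = t_dma_2k;         /* two halves overlap */

    printf("single channel : %.1f us\n", t_dma_4k_one_channel);
    printf("super-channel  : %.1f us\n", t_dma_super_channel); /* 4.0 us */
    return 0;
}
```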

Long-tail Latency of ULL SSD

[Figure: 99.999th-percentile latency (ms) vs. I/O depth for sequential/random reads and writes on both SSDs.]

The NVMe SSD's tail latency grows with I/O depth due to resource conflicts: an insufficient internal buffer and internal tasks. The ULL SSD keeps its tail short through split-DMA and the suspend/resume technique.

Suspend/Resume DMA Technique

Reference: Cheong, Woosung et al., "A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time," ISSCC, 2018.

Without it, a read issued to way 2 while a write DMA occupies the channel for way 1 must wait: even after tR elapses, the read's data-out cannot start until the write's data-out finishes. With suspend/resume [1], the controller suspends the in-flight write DMA, streams the read data out as soon as tR completes, and then resumes the write. This reduces read latency and improves QoS.
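
The toy timeline below quantifies the benefit using the slide's numbers; the scheduling model is an illustrative simplification under assumed transfer times, not Z-SSD firmware.

```c
/* Read arrives at t=0 while a write DMA holds the channel for 8 us.
 * The flash-array read (tR) proceeds off-channel either way; only the
 * read's data-out competes for the channel. */
#include <stdio.h>

int main(void)
{
    double write_dma_left = 8.0; /* us of write data-out still on the channel */
    double t_r            = 3.0; /* us, flash array read time (off-channel)   */
    double read_dma       = 4.0; /* us, 4KB read data-out via super-channel   */

    /* Baseline: the read's data-out waits for the whole write DMA. */
    double wait     = write_dma_left > t_r ? write_dma_left : t_r;
    double read_lat = wait + read_dma;                  /* 8 + 4 = 12 us */

    /* Suspend/resume: the write DMA is paused as soon as read data is
     * ready, then resumed afterwards. */
    double read_lat_sr = t_r + read_dma;                /* 3 + 4 =  7 us */

    printf("read latency, write blocks channel : %.0f us\n", read_lat);
    printf("read latency, suspend/resume       : %.0f us\n", read_lat_sr);
    return 0;
}
```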

I/O Interference

I/O interference between reads and writes is a great performance bottleneck of conventional SSDs, which suffer significant read-latency degradation in intermixed workloads. How about the ULL SSD?

[Figure: average read latency (μs) vs. write fraction (0-80%) for the NVMe SSD and the ULL SSD.]

While the NVMe SSD's read latency climbs steeply as the write fraction grows, the ULL SSD's stays almost constant (27-37 μs) thanks to suspend/resume [1]. Since file-system flush operations and metadata writes are always intermixed with user requests, the ULL SSD can be applied to a real-life storage stack without performance degradation.

Queue Analysis

[Figure: normalized bandwidth vs. I/O depth for both SSDs.]

- NVMe SSD: delivers only 50% of its maximum bandwidth until the queue holds more than 100 entries, because long write latency forces I/O request rescheduling within the queue. It therefore requires a rich queue mechanism; light ones (e.g., NCQ) are not sufficient.
- ULL SSD: reaches almost its maximum bandwidth with only 6 entries thanks to its short write latency. It is well aligned with light queue mechanisms (e.g., NCQ), so NVMe's queueing needs to be lightened.
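
The 6-entry figure is consistent with Little's law: the queue depth needed to saturate a device is its bandwidth-delay product divided by the request size. A quick check with the slide's ULL numbers, assuming they apply to 4KB reads:

```c
/* Little's law: QD = bandwidth x latency / request size. */
#include <stdio.h>

int main(void)
{
    double bw      = 2.4e9;   /* B/s, ULL SSD read bandwidth (slide)  */
    double latency = 11e-6;   /* s, per-request read latency (slide)  */
    double reqsize = 4096.0;  /* B, 4KB requests                      */

    printf("queue depth to saturate: %.1f\n", bw * latency / reqsize);
    /* => ~6.4 outstanding requests, matching the ~6 entries observed */
    return 0;
}
```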

Polling vs. Interrupt

Two different I/O completion methods.

Interrupt / Polling

Systems with short waiting times adopt a polling-based waiting strategy even though it incurs substantial overhead; spin locks and network message passing, for example, poll rather than sleep. Polling is already implemented in the NVMe storage stack, but is it really needed for current NVMe SSDs?

Interrupt / Polling

- Interrupt: the host submits the request, context-switches away, and sleeps. When the SSD controller finishes the command it raises an IRQ; the ISR wakes the process, and a final context switch completes the request.
- Polling: the host submits the request and spins on "done?" until the controller completes it, with no sleep/wake context switches.

For a low-latency SSD, command execution is short, so the context switches become the larger portion of total latency, and polling's potential gain grows.
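
On Linux, an application can request the polled completion path per read via preadv2() with the RWF_HIPRI flag (kernel 4.6+/glibc 2.26+, and effective only when the nvme driver's poll queues are enabled). A minimal sketch, with the device path as a placeholder:

```c
/* Polled read: RWF_HIPRI asks the kernel to busy-poll the NVMe
 * completion queue instead of sleeping for an interrupt. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT); /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;
    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

    /* Complete this read by polling rather than by interrupt. */
    ssize_t n = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
    if (n < 0) perror("preadv2");
    else printf("read %zd bytes via polled I/O\n", n);

    close(fd);
    free(buf);
    return 0;
}
```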

Overall Performance

Does polling-based I/O pay off on today's devices?

[Figure: average latency (μs) for 4KB-32KB reads and writes, interrupt vs. polling, on the NVMe SSD and the ULL SSD.]

- NVMe SSD: polling decreases latency by only 0.9% for reads and 8.2% for writes; polling-based I/O services are not effective for current NVMe SSDs.
- ULL SSD: polling decreases latency by 7.5% for reads and 13.2% for writes; future lower-latency SSDs can achieve remarkable performance improvements with a polling-based I/O completion routine.

System Challenges

Polling-based I/O services incur significant system-level overheads, which need to be addressed.

[Figure: 99.999th-percentile latency, CPU utilization, and memory-bound fraction for 4KB-32KB requests, polling vs. interrupt.]

- CPU utilization: a polling host core spins at nearly 100% while it waits, whereas an interrupt-driven core releases the CPU until the ISR wakes it.
- Memory bound (the fraction of pipeline slots that could be stalled on loads/stores): polling is highly memory bound because the core keeps loading the completion queue and spin-locks to synchronize the SQ/CQ head and tail doorbell pointers with the NVMe controller.
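
To see where the memory traffic comes from, here is a pared-down, host-side completion poll in the style of the NVMe spec's phase-tag protocol (structures abridged for illustration; not the Linux driver's actual code):

```c
/* The host spins on the next completion-queue entry, re-loading its
 * status word and testing the phase tag until the device flips it. */
#include <stdint.h>

struct nvme_cqe {            /* 16-byte NVMe completion entry (abridged) */
    uint32_t result;
    uint32_t rsvd;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cid;
    uint16_t status;         /* bit 0 is the phase tag */
};

/* Spin until the entry's phase tag matches the expected phase. */
static inline void poll_cqe(volatile struct nvme_cqe *cqe, uint16_t phase)
{
    while ((cqe->status & 1) != phase)
        ;                    /* every spin is another load from memory */
}
```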

Conclusion

Motivation: ultra-low latency (ULL) SSDs are emerging, but have not been characterized so far.

Contributions:
- Characterizing the performance behaviors of a ULL SSD.
- Studying several system-level challenges of the current storage stack.

Key Insights:
- ULL SSDs can be effectively applied to a real-life storage stack (read/write mixed workloads).
- NVMe queue mechanisms need to be optimized for ULL SSDs.
- The polling-based I/O completion routine isn't effective for current NVMe SSDs.

Thank you Q&A