Exploring System Challenges of Ultra-Low Latency Solid State Drives
Sungjoon Koh, Changrim Lee, Miryeong Kwon, and Myoungsoo Jung
Computer Architecture and Memory Systems Lab
Executive Summary
Motivation. Ultra-low latency (ULL) SSDs are emerging, but their behavior has not yet been well characterized.
Contributions.
- Characterizing the performance behaviors of a ULL SSD.
- Studying several system-level challenges of the current storage stack.
Key Observations.
- The ULL SSD minimizes I/O interference when reads and writes are interleaved.
- NVMe queue mechanisms need to be optimized for ULL SSDs.
- The polling-based I/O completion routine isn't effective for current NVMe SSDs.
Architectural Change of SSD
<Diagram: a SATA SSD attaches through the ICH (South Bridge), while an NVMe SSD attaches to the CPU directly over PCI Express, gaining direct access and high bandwidth, similar to DRAM through the MCH (North Bridge).>
Evolution of SSDs
SATA SSD (Read: 0.5 GB/s, Write: 0.5 GB/s) → NVMe SSD (Read: 2.4 GB/s, Write: 1.2 GB/s).
Bandwidth almost reaches the maximum performance, but latency is still long (far from DRAM).
Next change: a new flash memory, called Z-NAND.
New Flash Memory
Existing 3D NAND: Read 45-120 μs, Write 660-5,000 μs.
Z-NAND [1]: SLC-based 3D NAND, 48 stacked word-line layers, 64 Gb capacity, 2 kB page size.
- Read: 3 μs (15-20x faster). Write: 100 μs (6-7x faster).
Z-NAND-based drives are called Z-SSDs.
Characterization Categories Performance Analysis. - Average latency. - Long-tail latency. - Bandwidth. - I/O interference impact. Polling vs. Interrupt - Overall latency comparison. - CPU utilization analysis. - Memory requirement. - Five-nines latency.
Evaluation Settings
- OS: Linux 4.14.10
- CPU: Intel Core i7-4790K (4 cores, 4.00 GHz)
- Memory: DDR4 DRAM (16 GB)
- SSDs: ULL SSD: Z-SSD prototype (800 GB); NVMe SSD: Intel SSD 750 Series (400 GB)
- Benchmark: Flexible I/O Tester (FIO v2.99)
<Our testbed w/ Z-SSDs>
Performance Analysis
Overview
The host submits 4 KB reads and writes through the request queue, the NVMe driver, and the NVMe controller to the SSD, while increasing the queue depth. We measure:
1. Average latency and long-tail latency.
2. Bandwidth.
3. Read latency under a read/write intermixed workload.
Average Latency of ULL SSD
<Plots: average latency (μs) vs. I/O depth (2-16) for sequential/random reads and writes, NVMe SSD vs. ULL SSD.>
- The ULL SSD cuts average latency by up to 5.1x (reads) and 1.8x (writes) compared to the NVMe SSD.
- Device read latency is about 11 μs: although tR is only 3 μs, a 4 KB DMA transfer takes 8 μs.
- Z-SSD reduces the DMA cost with Split-DMA and super-channels.
Split-DMA & Super-Channel
The Z-SSD's split-DMA engine splits a 4 KB request into two 2 KB transfers and moves them in parallel over a super-channel that pairs two physical channels (e.g., channels 0/1, 2/3, 4/5), cutting tDMA from 8 μs to 4 μs.
Reference: Cheong, Woosung, et al., "A flash memory controller for 15 μs ultra-low-latency SSD using high-speed 3D NAND flash with 3 μs read time," ISSCC, 2018.
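The latency arithmetic above can be sketched as follows; tR and the 8 μs/4 μs DMA figures come from the slides, while the per-kilobyte DMA rate and the function name are illustrative assumptions:

```python
# Sketch of the Split-DMA latency math. Assumptions: a linear DMA cost
# (2 us per KB, so 4 KB takes 8 us on one channel) and an ideal even
# split across channels; `read_latency_us` is a hypothetical helper.

T_READ_US = 3.0        # Z-NAND tR (3 us, per the slides)
DMA_US_PER_KB = 2.0    # assumed DMA rate: 4 KB in 8 us on one channel

def read_latency_us(request_kb: float, channels: int) -> float:
    """Device read latency: tR plus the DMA time, with the transfer
    split evenly across `channels` (the Split-DMA / super-channel idea)."""
    dma = (request_kb * DMA_US_PER_KB) / channels
    return T_READ_US + dma

# Single-channel DMA: 3 + 8 = 11 us (the ~11 us device latency above).
print(read_latency_us(4, channels=1))   # 11.0
# Split over a 2-channel super-channel: 3 + 4 = 7 us (tDMA = 4 us).
print(read_latency_us(4, channels=2))   # 7.0
```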
Long-tail Latency of ULL SSD
<Plots: 99.999th-percentile latency (ms) vs. I/O depth (2-16) for SeqRd/RndRd/SeqWr/RndWr, ULL SSD vs. NVMe SSD.>
- NVMe SSD: long tails (several ms) caused by resource conflicts, an insufficient internal buffer, and internal tasks.
- ULL SSD: tails stay short thanks to Split-DMA and the suspend/resume DMA technique.
Suspend/Resume DMA Technique
Without it, a read command issued to Way 2 must wait until the write DMA occupying the channel for Way 1 finishes before its tR and data-out can proceed. With suspend/resume [1], the controller suspends the in-flight write DMA, services the read immediately, and then resumes the write, reducing read latency and improving QoS.
Reference: Cheong, Woosung, et al., "A flash memory controller for 15 μs ultra-low-latency SSD using high-speed 3D NAND flash with 3 μs read time," ISSCC, 2018.
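A toy timeline makes the benefit concrete; besides tR, all constants and names below are illustrative assumptions, not measured Z-SSD values:

```python
# Toy sketch of the suspend/resume DMA idea. The write-DMA residency and
# suspend cost are assumed numbers; `read_service_time` is hypothetical.

T_READ_US = 3.0       # Z-NAND tR (per the slides)
WRITE_DMA_US = 100.0  # assumed remaining write DMA occupying the channel
SUSPEND_US = 1.0      # assumed cost to suspend and later resume the write

def read_service_time(suspend_resume: bool) -> float:
    """Time until a read on another way completes while a write DMA
    holds the shared channel."""
    if suspend_resume:
        # The write DMA is suspended, the read is served, then it resumes.
        return SUSPEND_US + T_READ_US
    # Otherwise the read waits for the whole write DMA to drain first.
    return WRITE_DMA_US + T_READ_US

print(read_service_time(False))  # 103.0
print(read_service_time(True))   # 4.0
```

Even with a nontrivial suspend cost, the read no longer pays for the entire in-flight write, which is why both the average and the tail read latency drop.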
I/O Interference
I/O interference is a great performance bottleneck of conventional SSDs: file-system flush operations and metadata writes are intermixed with user requests [1]. How about the ULL SSD?
<Plot: average read latency (μs) vs. write fraction (0-80%), NVMe SSD vs. ULL SSD.>
- NVMe SSD: significant read-performance degradation in intermixed workloads.
- ULL SSD: read latency remains almost constant (27-37 μs), thanks to suspend/resume.
- The ULL SSD can therefore be applied to a real-life storage stack without performance degradation.
Queue Analysis
<Plots: normalized bandwidth vs. I/O depth for SeqRd/RndRd/SeqWr/RndWr, NVMe SSD (depth 50-250) vs. ULL SSD (depth 4-20).>
- NVMe SSD: reaches only 50% of its maximum bandwidth at an I/O depth of 100, requiring more than 100 queue entries; I/O request rescheduling within the queue also lengthens write latency. Light queue mechanisms (e.g., NCQ) are not sufficient; it requires a rich queue mechanism.
- ULL SSD: reaches almost its maximum bandwidth with only 6 entries and shows short write latency. It is well aligned with light queue mechanisms (e.g., NCQ); the NVMe queue mechanism needs to be lightened.
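Little's Law (in-flight requests = throughput x latency) gives the intuition behind these queue-depth requirements. The sketch below uses assumed per-request latencies, chosen only to illustrate the trend, and a hypothetical helper name:

```python
# Little's Law sketch: the queue depth needed to saturate a drive is its
# target bandwidth times its per-request latency, divided by the request
# size. The latency inputs below are assumptions, not measured values.

def queue_depth_to_saturate(bandwidth_gbps: float, latency_us: float,
                            request_kb: float = 4.0) -> float:
    """Queue depth needed to sustain `bandwidth_gbps` (GB/s) given a
    per-request latency in microseconds and a request size in KB."""
    bytes_per_us = bandwidth_gbps * 1e9 / 1e6      # GB/s -> bytes/us
    reqs_per_us = bytes_per_us / (request_kb * 1024)
    return reqs_per_us * latency_us                # Little's Law: L = X * W

# Slow drive (2.4 GB/s, assumed ~200 us at depth): needs 100+ entries.
print(round(queue_depth_to_saturate(2.4, 200)))  # 117
# Fast drive (assumed 3.2 GB/s at ~12 us): a handful of entries suffice.
print(round(queue_depth_to_saturate(3.2, 12)))   # 9
```

This is why a ULL SSD saturates with a shallow queue: at microsecond latencies, only a few requests ever need to be in flight.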
Polling vs. Interrupt Two different I/O completion methods
Interrupt / Polling
Systems with short waiting times adopt a polling-based waiting strategy, even though it incurs significant overhead; for example, spin locks and network message passing both poll. Polling is already implemented in the NVMe storage stack. But is it really needed for current NVMe SSDs?
Interrupt / Polling
Interrupt: the host submits a request, context-switches, and sleeps while the SSD executes the command; the NVMe controller (1) finishes and (2) raises an IRQ, and the ISR (3) wakes the process, which context-switches again and completes the request.
Polling: the host submits the request and spins, repeatedly asking "done?", then completes the request without any sleep/wake context switches.
The shorter the SSD's latency, the larger the portion of total time those context switches occupy, so a low-latency SSD gains more from polling.
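The two completion paths can be mimicked with threads. This is a minimal sketch, not the NVMe driver's actual code: the timer-based "device" thread and the function names are hypothetical stand-ins.

```python
# Contrast of the two completion paths: "interrupt" sleeps on an event
# until the device signals it (CPU released, scheduler wake-up needed);
# "polling" busy-checks a completion flag (no context switch, CPU spins).
# Device latency is simulated with a timer thread.

import threading
import time

def device(done_event: threading.Event, delay_s: float) -> None:
    time.sleep(delay_s)   # simulated SSD command execution
    done_event.set()      # "raise IRQ" / post the completion

def complete_io(polling: bool, delay_s: float = 0.01) -> bool:
    done = threading.Event()
    threading.Thread(target=device, args=(done, delay_s)).start()
    if polling:
        while not done.is_set():  # spin: no context switch, burns CPU
            pass
    else:
        done.wait()               # sleep: CPU released, woken on completion
    return done.is_set()

print(complete_io(polling=True))   # True
print(complete_io(polling=False))  # True
```

Both paths finish the I/O; the difference is purely in how the waiting time is spent, which is exactly what the CPU-utilization results later in the talk measure.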
Overall Performance
<Plots: average latency (μs), interrupt vs. polling, reads and writes, 4 KB-32 KB requests, NVMe SSD vs. ULL SSD.>
- NVMe SSD: polling decreases latency by only 0.9% (read) and 8.2% (write). Polling-based I/O services are not effective for current NVMe SSDs.
- ULL SSD: polling decreases latency by 7.5% (read) and 13.2% (write). Future lower-latency SSDs can achieve remarkable performance improvement with a polling-based I/O completion routine.
System Challenges
Polling-based I/O services incur significant system-level overheads.
- CPU utilization: the polling core spins on the SQ/CQ head and tail pointers (with a spin lock for pointer synchronization), so it always runs at 100% CPU utilization, whereas an interrupt-driven core releases the CPU until completion.
- Memory bound (the fraction of slots where the pipeline could be stalled on loads/stores): polling repeatedly checks the CQ and updates the head/tail doorbells in the NVMe controller's memory space, causing frequent memory accesses and a high memory bound (ULL writes, 4 KB-32 KB).
<Plots: CPU utilization (%) and memory bound (%), polling vs. interrupt, 4 KB-32 KB.>
These overheads need to be addressed.
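The queue-pair state that the polling core keeps touching can be modeled as a pair of rings; the sketch below is heavily simplified (real NVMe queues use phase tags, fixed-size entries, and MMIO doorbell writes), and the class and method names are hypothetical:

```python
# Toy model of an NVMe submission/completion queue pair: the host
# enqueues commands ("rings the SQ tail doorbell"), the controller
# consumes them and posts completions, and the host polls the CQ.

from collections import deque
from typing import Optional

class ToyNVMeQueuePair:
    def __init__(self) -> None:
        self.sq = deque()   # submission queue (tail advanced by the host)
        self.cq = deque()   # completion queue (head advanced by the host)

    def submit(self, cmd: str) -> None:
        self.sq.append(cmd)  # enqueue + "ring the SQ tail doorbell"

    def controller_step(self) -> None:
        if self.sq:          # controller fetches one command...
            self.cq.append(self.sq.popleft() + ":done")  # ...and completes it

    def poll_completion(self) -> Optional[str]:
        # A polling loop calls this repeatedly, re-loading the CQ on every
        # iteration -- the frequent memory access behind the high memory
        # bound measured above.
        return self.cq.popleft() if self.cq else None

qp = ToyNVMeQueuePair()
qp.submit("read-4k")
assert qp.poll_completion() is None  # nothing completed yet: keep spinning
qp.controller_step()
print(qp.poll_completion())          # read-4k:done
```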
Conclusion
Motivation. Ultra-low latency (ULL) SSDs are emerging, but their behavior has not yet been well characterized.
Contributions.
- Characterizing the performance behaviors of a ULL SSD.
- Studying several system-level challenges of the current storage stack.
Key Insights.
- ULL SSDs can be effectively applied to a real-life storage stack (read/write mixed workloads).
- NVMe queue mechanisms need to be optimized for ULL SSDs.
- The polling-based I/O completion routine isn't effective for current NVMe SSDs.
Thank you Q&A