Developing Low Latency NVMe Systems for Hyperscale Data Centers
Prepared by Engling Yeo
Santa Clara, CA 95054
Date: 08/04/2017
Quality of Service
IOPS, throughput, latency: short, predictable read latencies. Limit the maximum latency.
Contrast: tape storage. IBM engineers achieve 201 Gb/in²; 200 PB fits easily into a truck (5 ft x 15 ft, stacked 100 thick); drive 3 hours to San Francisco. Throughput: 18 TB/s, 4.5 G-IOPS.
[Diagram: host IO path: Application → (Virtual) File System → Block IO → NVMe Driver (user space / kernel device driver) → PCIe Root Complex → NVMe SSDs → NAND media]
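A quick back-of-the-envelope check of the tape-truck figures (assuming 4 kB accesses for the IOPS number):

\[
\frac{200\ \mathrm{PB}}{3\ \mathrm{h}} = \frac{200\times 10^{15}\ \mathrm{B}}{10\,800\ \mathrm{s}} \approx 18.5\ \mathrm{TB/s},
\qquad
\frac{18.5\ \mathrm{TB/s}}{4\ \mathrm{kB}} \approx 4.6\times 10^{9}\ \mathrm{IOPS}
\]

Enormous throughput, but with a 3-hour "latency" per access, which is why QoS here is framed in terms of latency rather than raw bandwidth.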
Hyperscale Storage Directions
Worldwide data generated: in 2010, 70% of storage was on mobile/PC; by 2025, 50% of storage will be on hyperscale, with 40% of data for data mining, machine learning, and IoT.
Factors affecting growth:
  Cost / capacity
  Mean time between failures
  Power concerns
  Security
  Configurability: key management, firmware update, sanitization and life cycles
  Control over the stack: build vs. buy infrastructure
  Performance
[Chart: zettabytes generated worldwide: 3 ZB in 2010, 16 ZB in 2016, 163 ZB projected for 2025]
Latency Benchmarks of Several Enterprise PCIe SSDs
[Figure: latency benchmarks of several enterprise PCIe SSDs, courtesy AnandTech, June 2014]
Typical Read Latency of an NVMe System
Typical read latencies for a 4 kB read access:
  Controller PCIe and NVMe frontend HW            1 µs
  Firmware interpretation of the NVMe command     2 µs
  FTL cache miss; DDR access                      3 µs
  tR, TLC array read                            100 µs
  Transfer of 4 kB @ 800 MB/s                     6 µs
  ECC decoding                                    6 µs
  Gen3 x4 PCIe transfer and NVMe completion       4 µs
  Total                                        ~122 µs
Compare this latency with DDR4, e.g. ~200 ns.
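Summing the components above (the slide's own numbers) confirms the total and puts the DDR4 comparison in scale:

\[
1 + 2 + 3 + 100 + 6 + 6 + 4 = 122\ \mu\mathrm{s} \approx 600 \times 200\ \mathrm{ns}
\]

Roughly 80% of those 122 µs is the TLC array read itself, which motivates the next slide.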
Hardware Challenges
Percentage of latency attributable to the media:
  SLC: 50%
  MLC: 70%
  TLC: 80%
Amdahl's Law: what can the controller design do? (worked example below)
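As a worked example of Amdahl's Law applied to these fractions: if the controller path is the fraction f of total read latency that can be sped up by a factor s, the overall speedup is bounded by

\[
S = \frac{1}{(1-f) + f/s} \le \frac{1}{1-f}
\]

For TLC, the media accounts for 80% of the latency, so f = 0.2 and the bound is 1/0.8 = 1.25: even an infinitely fast controller shaves at most 20% off the average read. That is why the controller work below targets the maximum (tail) latency rather than the already media-bound average.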
Low Latency NVMe Controller: Firmware
Instead of optimizing best-case latencies, focus on reducing the maximum latency:
  Garbage collection
  Data cache / user data
Configurable FTLs to adapt dynamically to workloads.
Hybrid HW-SW implementation of FTLs.
Trade off dramatic swings in latency against more frequent context switches (see the sketch below).
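A minimal sketch of that trade-off, assuming hypothetical firmware hooks (none of these functions come from a real controller SDK): garbage collection is broken into small, preemptible steps so a pending host read never waits behind an entire block relocation.

#include <stdbool.h>

typedef struct { unsigned lba; void *buf; } host_read_t;

/* Hypothetical firmware hooks (illustration only, not a real SDK). */
bool host_read_pending(void);
bool host_read_dequeue(host_read_t *r);
void service_host_read(const host_read_t *r);
bool gc_needed(void);
void gc_copy_one_page(void);          /* one small, preemptible GC step */

/* Main firmware loop: GC advances only in small steps, and every
 * step yields to pending host reads, so a read never waits behind
 * an entire block relocation. */
void controller_main_loop(void)
{
    for (;;) {
        host_read_t r;
        while (host_read_dequeue(&r))
            service_host_read(&r);    /* host latency comes first */

        if (gc_needed() && !host_read_pending())
            gc_copy_one_page();       /* bounded background work */
    }
}

The price is exactly what the slide names: many more context switches between GC and host work, in exchange for a bounded worst case.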
Low Latency NVMe Controller: Hardware
  Configurable memories to support the hybrid FTL
  Rapid context switching
  Speculative processing
  Flexibility to issue and maintain control over massively parallel channels / CEs / LUNs / planes (see the addressing sketch below)
[Block diagram: PCIe host interface, NVMe front end, CPU with TCMs, security block, SRAM, FTL accelerator, ECC, DDR controller with external DDR, and NAND interfaces driving NAND CH0 through CHn]
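One way to make that parallelism explicit in the controller data path is to carry every dimension in the physical address itself; the field widths below are illustrative assumptions, not taken from the talk or any particular NAND part.

#include <stdint.h>

/* Physical NAND address carrying every parallel dimension.
 * Field widths are illustrative assumptions (16 channels, 8 CEs,
 * 4 LUNs, 4 planes, ...), not from the talk or a datasheet. */
typedef struct {
    uint32_t channel : 4;
    uint32_t ce      : 3;
    uint32_t lun     : 2;
    uint32_t plane   : 2;
    uint32_t block   : 12;
    uint32_t page    : 9;
} nand_addr_t;

/* Two operations that differ in channel, CE, LUN, or plane can
 * usually proceed concurrently; only operations that collide on the
 * same plane must queue behind tR / tPROG / tBERS. */
static inline int can_overlap(nand_addr_t a, nand_addr_t b)
{
    return a.channel != b.channel || a.ce != b.ce ||
           a.lun != b.lun || a.plane != b.plane;
}

A scheduler can then use can_overlap() to keep independent dies busy while a slow tR completes elsewhere.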
Error Correction
LDPC has higher decoding latencies? Not exactly; the retries dominate:
  Read:       tR 100 µs + transfer 6 µs + decode 3-6 µs → FAIL
  1st retry:  tR 100 µs + transfer 6 µs + decode 3-6 µs → FAIL
  2nd retry:  tR 100 µs + transfer 6 µs + decode 3-6 µs → PASS
Read retries are typically a >100 µs penalty each, and soft-LDPC decoding also requires read retries.
Take advantage of orthogonal channels / CEs / LUNs / planes: parallel reads can recover the error frame with significantly reduced latency (see the sketch below).
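A sketch of the parallel-recovery idea, assuming the failed frame belongs to a parity stripe spread across independent dies (the stripe-with-parity layout is an assumption; the slide only states that parallel reads can recover the frame): all surviving members are read concurrently, so recovery costs roughly one extra tR instead of several serial retries.

#include <stddef.h>
#include <stdint.h>

#define FRAME_BYTES 4096
#define STRIPE_DIES 8                     /* illustrative stripe width */

/* Hypothetical NAND-interface hooks: reads issued to different dies
 * overlap in time because the dies are independent. */
void nand_read_async(int die, uint8_t *dst);
void nand_wait_all(void);

/* Rebuild the frame that failed LDPC decode on one die from the
 * surviving members of its parity stripe. */
void recover_frame(int failed_die, uint8_t out[FRAME_BYTES])
{
    static uint8_t peer[STRIPE_DIES][FRAME_BYTES];

    for (int d = 0; d < STRIPE_DIES; d++)
        if (d != failed_die)
            nand_read_async(d, peer[d]);  /* all issued in parallel */
    nand_wait_all();                      /* ~one tR, not N serial retries */

    for (size_t i = 0; i < FRAME_BYTES; i++) {  /* XOR reconstructs the loss */
        uint8_t x = 0;
        for (int d = 0; d < STRIPE_DIES; d++)
            if (d != failed_die)
                x ^= peer[d][i];
        out[i] = x;
    }
}

With the stripe reads overlapping, the added latency is on the order of one tR plus the XOR pass, versus 300+ µs for three serial retries at the figures above.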
Flash Interface Controller
  Respect the well-documented tR, tPROG, and tBERS times
  Poll less, transfer more: "Stop asking. The data is NOT ready!" (see the sketch below)
  Know when to suspend/abort the more time-consuming tasks
  Out-of-order execution
[Figure courtesy Wu, Virginia Commonwealth University]
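A minimal sketch of "poll less": wait out the datasheet tR before the first status poll so the flash bus stays free for transfers on other dies. The timing constants and helper functions are illustrative assumptions, not from the talk.

#include <stdbool.h>
#include <stdint.h>

#define T_R_TYP_US   90u   /* typical TLC array read time, assumed  */
#define POLL_STEP_US  5u   /* coarse poll interval near the tail    */

/* Hypothetical platform / controller hooks. */
void delay_us(uint32_t us);
bool nand_status_ready(int die);
void nand_start_data_transfer(int die);

/* "Poll less, transfer more": sleep through the datasheet tR before
 * the first status poll, instead of hammering the status register
 * (and the flash bus) the moment the read is issued. */
void read_wait_and_transfer(int die)
{
    delay_us(T_R_TYP_US);              /* bus stays free for other dies */

    while (!nand_status_ready(die))    /* short tail of polls near tR max */
        delay_us(POLL_STEP_US);

    nand_start_data_transfer(die);     /* spend the bus on data, not polls */
}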
Latency is Key to QoS
  Always respect Amdahl's Law
  Context switching
  Control your maximum latency
  Identify your latency bottleneck, and go WIDE
THANK YOU GOKE US RESEARCH LABORATORY 4655 Old Ironsides Dr, #350 Santa Clara, CA 95054 WWW.GOKEUSLAB.COM
Abstract
Hyperscale data centers need extremely low latency storage systems to provide predictable, high performance across a wide variety of applications at reasonable cost. To be commercially viable, they need a multi-tiered memory system consisting of DRAM for high speed, low-latency non-volatile memory (such as 3D XPoint) for larger amounts of key data, and the more traditional non-volatile NAND flash for mass storage. The realization of such systems involves hardware, software, and driver challenges. The result must be fully scalable, low-power, and capable of handling the most challenging big data applications.