Enhancing SSD Control of NVMe Devices for Hyperscale Applications
Luca Bert, Seagate
Chris Petersen, Facebook
Agenda
- Introduction & overview (Luca)
- Problem statement & proposed solution (Chris)
- SSD implications and prototype work (Luca)
- Test results and analysis (All)
- Conclusion (All)
Problem Statement & Proposed Solution
NAND Flash Trends
[Chart: M.2 SSD capacity (GB) by year, 2006-2022, growing from roughly 2 GB to 16 TB]
SSD Capacity = Shared Resource
[Diagram: applications A, B, C, and D sharing a single SSD]
Noisy Neighbors
[Chart: request latency over time for Applications A, B, and C, showing latency spikes caused by neighboring workloads]
Introducing I/O Determinism
- New NVMe feature called I/O Determinism
- SSD is configured as multiple NVM Sets (e.g. A, B, C)
- NVM Sets are QoS-isolated regions: a write to Set A does not impact a read to Set B or C
- One or more namespaces are allocated from an NVM Set
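The Set/namespace relationship above can be sketched as a small model. This is an illustrative sketch only, not an NVMe API; the class and function names are assumptions made for the example:

```python
from dataclasses import dataclass, field

@dataclass
class NvmSet:
    """A QoS-isolated region of the SSD (illustrative model, not an NVMe API)."""
    set_id: int
    capacity_gb: int
    namespaces: list = field(default_factory=list)  # sizes of allocated namespaces

def create_namespace(nvm_set, size_gb):
    """Allocate a namespace from exactly one NVM Set.
    I/O to namespaces in one Set does not interfere with I/O in another Set."""
    used = sum(nvm_set.namespaces)
    if used + size_gb > nvm_set.capacity_gb:
        raise ValueError("NVM Set out of capacity")
    nvm_set.namespaces.append(size_gb)
    return len(nvm_set.namespaces)  # namespace index within the Set

# A 2 TB drive carved into 4 isolated Sets of 512 GB each
sets = [NvmSet(set_id=i, capacity_gb=512) for i in range(4)]
create_namespace(sets[0], 256)   # heavy writes land here...
create_namespace(sets[1], 256)   # ...without disturbing reads here
```

The key invariant the model captures is that a namespace belongs to exactly one Set, so the noisy-neighbor blast radius is bounded by Set membership.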
SSD Implications and Prototype Work: SSD Data Layout Impact
Key Hurdles to Address
- Physical resource segregation: dies have to be allocated to NVM Sets
- Localization of the globals:
  - Data: maps, journals, logs, and metadata are global and need to be redesigned to be local to NVM Sets
  - Processes: the same applies to GC and other background processes
  - Resources: OP, cache, memory, etc. can be local or global
Baseline: Conventional SSD with 4 Namespaces
[Diagram: 16 dies striped across 8 shared ONFi channels; NS1-NS4 spread across all dies, with global maps and OP]
Pros:
- Simplest configuration
- No logical-to-physical affiliation
- Most balanced model
Cons:
- I/O pattern conflicts
IOD: Vertical Sets (VS)
[Diagram: 16 dies, 8 channels; NS1-NS4 each on its own dedicated channels and dies, with per-namespace maps and OP]
Pros:
- A few dedicated channels for each Set
- No conflict in channel or die
- Every Set runs on its own parallel resources
Cons:
- Only 1/4 of the bandwidth possible per Set (limited by the number of dedicated channels)
IOD: Horizontal Sets (HS)
[Diagram: shared channels; NS1-NS4 each on its own rows of dies spanning all channels, with per-namespace maps and OP]
Pros:
- All channels are utilized by all Sets
- Full bandwidth possible with a single active Set (or shared bandwidth with all Sets)
- No conflict in die
Cons:
- No dedicated channels or queues per Set
IOD: Mixed Sets (MS)
[Diagram: two groups of shared channels; NS1-NS4 with per-namespace maps and OP]
Pros:
- A few dedicated channels for each group of Sets
- No conflict in die
- Groups of Sets run on parallel resources
Cons:
- Only 1/2 of the bandwidth possible per Set (limited by the number of dedicated/shared channels)
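Under the simplifying assumption that peak per-Set bandwidth is limited only by the number of channels a Set can reach (8 channels, 4 Sets), the 1/4 and 1/2 figures on the layout slides reduce to channel arithmetic. This is a sketch of that arithmetic, not a drive model:

```python
# Peak bandwidth a single active Set can reach in each layout, assuming
# 8 ONFi channels, 4 Sets, and channel count as the only bandwidth limiter.
CHANNELS, SETS, BW_PER_CHANNEL = 8, 4, 1.0  # bandwidth in arbitrary units

def peak_bw_per_set(reachable_channels):
    """Peak bandwidth for one active Set given the channels it can use."""
    return reachable_channels * BW_PER_CHANNEL

vertical = peak_bw_per_set(CHANNELS // SETS)   # 2 dedicated channels -> 1/4 of full BW
horizontal = peak_bw_per_set(CHANNELS)         # all channels shared -> full BW when alone
mixed = peak_bw_per_set(CHANNELS // 2)         # 4 channels per group -> 1/2 of full BW

print(vertical, horizontal, mixed)  # 2.0 8.0 4.0
```

The trade-off visible here is the same one the pros/cons lists describe: dedicating channels buys isolation at the cost of peak bandwidth per Set.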
Areas Not Covered in Prototype Work
- Read and write paths are HW-assisted: there is no complete separation between them, so conflicts do happen
- Cache memory is shared across Vertical Sets: conflicts happen, but are limited by the nature of the test
- Reads and writes have the same priority: no priority is given to any specific I/O
- No erase suspend
Test Results and Analysis
Configuration Under Test
- Facebook OCP Leopard system: 2-socket Intel Xeon v4 CPU, 32 GB DRAM
- Seagate Nytro XP7102 (2 TB eMLC), 128 dies on 8 channels, ONFi @ 400 MT/s
- Tested with Baseline (no IOD), Vertical Sets, and Horizontal Sets
- Writes: 32 KB sequential @ QD4
- Reads: 4 KB random @ QD8
The Noisy Neighbor
- One Set under heavy writes, the other under heavy reads
- Write: 32 KB sequential write with QD=4
- Read: 4 KB random read with QD=8
- Expected outcome: minimal to no impact on reads from the noisy writes
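The deck does not include the actual job files; a plausible fio reconstruction of the two concurrent workloads might look like the following, where the device names for the two namespaces are assumptions:

```ini
; noisy writer on the namespace backed by one Set (device name assumed)
[writer]
filename=/dev/nvme0n1
rw=write
bs=32k
iodepth=4
ioengine=libaio
direct=1

; latency-sensitive reader on the namespace backed by another Set (device name assumed)
[reader]
filename=/dev/nvme0n2
rw=randread
bs=4k
iodepth=8
ioengine=libaio
direct=1
```

fio runs the two jobs concurrently by default, which reproduces the writer-vs-reader contention the test measures.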
WR on NS1 + RD on NS2
[Chart 1: average read latency over 30 min (1800 seconds) for BS, HS, and VS, with zoom-in area marked]
[Chart 2: max read latency over 30 min (1800 seconds) for BS, HS, and VS]
WR on NS1 + RD on NS2
[Chart: average read latency, zoomed in]
- Baseline: AVG = 1.33 ms
- HS: AVG = 0.19 ms (7x improvement)
- VS: AVG = 0.14 ms (9.5x improvement)
WR on NS1 + RD on NS2
[Chart: max read latency, zoomed in]
- Baseline: AVG = 55.7 ms
- HS: AVG = 4.79 ms (11x improvement)
- VS: AVG = 3.41 ms (16x improvement)
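The improvement factors quoted on the last two slides follow directly from the reported averages; a quick check:

```python
# Reported averages (ms) from the noisy-neighbor run
avg_lat = {"BS": 1.33, "HS": 0.19, "VS": 0.14}   # average read latency
max_lat = {"BS": 55.7, "HS": 4.79, "VS": 3.41}   # max read latency

def improvement(baseline, value):
    """How many times lower a latency is versus the baseline."""
    return baseline / value

for name, lat in avg_lat.items():
    print(f"avg {name}: {improvement(avg_lat['BS'], lat):.1f}x")
for name, lat in max_lat.items():
    print(f"max {name}: {improvement(max_lat['BS'], lat):.1f}x")
# avg: HS 7.0x, VS 9.5x; max: HS 11.6x, VS 16.3x
```

The max-latency ratios (11.6x and 16.3x) match the slide's rounded "11x" and "16x" figures.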
WR on NS1 + RD on NS2
[Chart: nines probability (100% down to 0.0001%) of random 4 KB read latency (us, 0-7000) for BS, HS, and VS at QD8; combined workload: one Set 32 KB SW (QD=4), one Set 4 KB RR (QD=8)]
WR on NS1 + RD on NS2
[Chart: average and P99 read latency (ms, 0-8) vs. queue depth (1-128) for BS, HS, and VS]
Conclusions
Conclusions
- Average values: IOD is behaving largely as expected
- Max values: there is more under the hood to address, but even the prototype shows huge advantages
- IOD is all about die separation, but the controller, DRAM, ONFi, and other resources play key roles under specific conditions
- Most of all, the improvements are so high that there may be other usage models beyond hyperscalers
Thank You! Questions?
Visit Seagate booth #505 to learn about Seagate's portfolio of SSDs, flash solutions, and system-level products for every segment.
www.seagate.com/nytro