MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices


MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices. Arash Tavakkol, Juan Gómez-Luna, and Mohammad Sadrosadati, ETH Zürich; Saugata Ghose, Carnegie Mellon University; Onur Mutlu, ETH Zürich and Carnegie Mellon University. This paper is included in the Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST '18), February 12-15, 2018, Oakland, CA, USA. ISBN 978-1-931971-42-3. Open access to the Proceedings of the 16th USENIX Conference on File and Storage Technologies is sponsored by USENIX.

MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices

Arash Tavakkol, Juan Gómez-Luna, Mohammad Sadrosadati, Saugata Ghose, Onur Mutlu
ETH Zürich    Carnegie Mellon University

Abstract

Solid-state drives (SSDs) are used in a wide array of computer systems today, including in datacenters and enterprise servers. As the I/O demands of these systems continue to increase, manufacturers are evolving SSD architectures to keep up with this demand. For example, manufacturers have introduced new high-bandwidth interfaces to replace the conventional SATA host interface protocol. These new interfaces, such as the NVMe protocol, are designed specifically to enable the high amounts of concurrent I/O bandwidth that SSDs are capable of delivering.

While modern SSDs with sophisticated features such as the NVMe protocol are already on the market, existing SSD simulation tools have fallen behind, as they do not capture these new features. We find that state-of-the-art SSD simulators have three shortcomings that prevent them from accurately modeling the performance of real off-the-shelf SSDs. First, these simulators do not model critical features of new protocols (e.g., NVMe), such as their use of multiple application-level queues for requests and the elimination of OS intervention for I/O request processing. Second, these simulators often do not accurately capture the impact of advanced SSD maintenance algorithms (e.g., garbage collection), as they do not properly or quickly emulate steady-state conditions that can significantly change the behavior of these algorithms in real SSDs. Third, these simulators do not capture the full end-to-end latency of I/O requests, which can incorrectly skew the results reported for SSDs that make use of emerging non-volatile memory technologies. By not accurately modeling these three features, existing simulators report results that deviate significantly from real SSD performance.
In this work, we introduce a new simulator, called MQSim, that accurately models the performance of both modern SSDs and conventional SATA-based SSDs. MQSim faithfully models new high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency of requests in modern SSDs. We validate MQSim, showing that it reports performance results that are only 6%-18% apart from the measured actual performance of four real state-of-the-art SSDs. We show that by modeling critical features of modern SSDs, MQSim uncovers several real and important issues that were not captured by existing simulators, such as the performance impact of inter-flow interference. We have released MQSim as an open-source tool, and we hope that it can enable researchers to explore directions in new and different areas.

1 Introduction

Solid-state drives (SSDs) are widely used in today's computer systems. Due to their high throughput, low response time, and decreasing cost, SSDs have replaced traditional magnetic hard disk drives (HDDs) in many datacenters and enterprise servers, as well as in consumer devices. As the I/O demand of both enterprise and consumer applications continues to grow, SSD architectures are rapidly evolving to deliver improved performance. For example, a major innovation has been the introduction of new host interfaces to the SSD. In the past, many SSDs made use of the Serial Advanced Technology Attachment (SATA) protocol [67], which was originally designed for HDDs. Over time, SATA has proven to be inefficient for SSDs, as it cannot enable the fast I/O accesses and millions of I/O operations per second (IOPS) that contemporary SSDs are capable of delivering. New protocols such as NVMe [63] overcome these barriers, as they are designed specifically for the high throughput available in SSDs. NVMe enables high throughput and low latency for I/O requests through its use of the multi-queue SSD (MQ-SSD) concept.
While SATA exposes only a single request port to the OS, MQ-SSD protocols provide multiple request queues to directly expose applications to the SSD device controller. This allows (1) an application to bypass OS intervention for I/O request processing, and (2) the SSD controller to schedule I/O requests based on how busy the SSD's resources are. As a result, the SSD can make higher-performance I/O request scheduling decisions.

As SSDs and their associated protocols evolve to keep pace with changing system demands, the research community needs simulation tools that reliably model these new features. Unfortunately, state-of-the-art SSD simulators do not model a number of key properties of modern SSDs that are already on the market. We evaluate several real modern SSDs, and find that state-of-the-art simulators do not capture three features that are critical to accurately model modern SSD behavior.

First, these simulators do not correctly model the multi-queue approach used in modern SSD protocols. Instead, they implement only the single-queue approach used in HDD-based protocols such as SATA. As a result, existing simulators do not capture (1) the high amount of request-level parallelism and (2) the lack of OS intervention in modern SSDs.

Second, many simulators do not adequately model steady-state behavior within a reasonable amount of simulation time. A number of fundamental SSD maintenance algorithms, such as garbage collection [ 3, 3], are not executed when an SSD is new (i.e., no data has been written to the drive). As a result, manufacturers design these maintenance algorithms to work best when an SSD reaches the steady-state operating point (i.e., after all of the pages within the SSD have been written to at least once) [7]. However, simulators that cannot capture steady-state behavior (within a reasonable

simulation time) perform these maintenance algorithms on a new SSD. As such, many existing simulators do not adequately capture algorithm behavior under realistic conditions, and often report unrealistic SSD performance results (as we discuss in Section 3.2).

Third, these simulators do not capture the full end-to-end latency of performing I/O requests. Existing simulators capture only the part of the request latency that takes place during intra-SSD operations. However, many emerging high-speed non-volatile memories greatly reduce the latency of intra-SSD operations, and, thus, the uncaptured parts of the latency now make up a significant portion of the overall request latency. For example, in Intel Optane SSDs, which make use of 3D XPoint memory [9, 5], the overhead of processing a request and transferring data over the system I/O bus (e.g., PCIe) is much higher than the memory access latency []. By not capturing the full end-to-end latency, existing simulators do not report the true performance of SSDs with new and emerging memory technologies.

Based on our evaluation of real modern SSDs, we find that these three features are essential for a simulator to capture. Because existing simulators do not model these features adequately, their results deviate significantly from the performance of real SSDs. Our goal in this work is to develop a new SSD simulator that can faithfully model the features and performance of both modern multi-queue SSDs and conventional SATA-based SSDs.

To this end, we introduce MQSim, a new simulator that provides an accurate and flexible framework for evaluating SSDs. MQSim addresses the three shortcomings we found in existing simulators, by (1) providing detailed models of both conventional (e.g., SATA) and modern (e.g., NVMe) host interfaces; (2) accurately and quickly modeling steady-state SSD behavior; and (3) measuring the full end-to-end latency of a request, from the time an application enqueues a request to the time the request response arrives at the host.
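To see why steady-state modeling matters, consider a toy model of when GC first becomes active (illustrative Python; the page counts and GC threshold below are invented, and MQSim's actual preconditioning is fast and automatic rather than the naive full-drive write replay sketched here):

```python
def precondition(ftl, num_logical_pages):
    """Naive preconditioning: write every logical page once, so that
    later writes become out-of-place updates and GC eventually triggers.
    (For illustration only; replaying a full-drive write like this is
    exactly the slow approach that fast preconditioning avoids.)"""
    for lpa in range(num_logical_pages):
        ftl.write(lpa)

class CountingFTL:
    """Tracks free physical pages; GC activates once the free-page
    count falls below the GC threshold."""
    def __init__(self, physical_pages, gc_threshold):
        self.free_pages = physical_pages
        self.gc_threshold = gc_threshold
        self.mapped = set()

    def write(self, lpa):
        self.free_pages -= 1   # every write consumes one free page
        self.mapped.add(lpa)

    def gc_active(self):
        return self.free_pages < self.gc_threshold

# Made-up capacity: 100 physical pages, GC threshold of 10 free pages.
ftl = CountingFTL(physical_pages=100, gc_threshold=10)
assert not ftl.gc_active()          # fresh out-of-the-box: GC never runs
precondition(ftl, num_logical_pages=93)
```

After preconditioning, GC is active for all subsequent writes, which is the regime in which sustained (steady-state) performance must be measured.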
To allow MQSim to adapt easily to future SSD developments, we employ a modular design for the simulator. Our modular approach allows users to easily modify the implementation of a single component (e.g., I/O scheduler, address mapping) without the need to change other parts of the simulator. We provide two execution modes for MQSim: (1) standalone execution, and (2) integrated execution with the gem5 full-system simulator [].

We validate the performance reported by MQSim using several real SSDs. We find that the response time results reported by MQSim are very close to the response times of the real SSDs, with low average and maximum error for real storage workload traces.

By faithfully modeling the major features found in modern SSDs, MQSim can uncover several issues that existing simulators are unable to demonstrate. One such issue is the performance impact of inter-flow interference in modern MQ-SSDs. For two or more concurrent flows (i.e., streams of I/O requests from multiple applications), there are three major sources of interference: (1) the write cache, (2) the mapping table, and (3) the I/O scheduler. Using MQSim, we find that inter-flow interference leads to significant unfairness (i.e., the interference slows down each flow unequally) in modern SSDs. This is a major concern, as fairness is a first-class design goal in modern computing platforms [, 7, 9, 3, 37, 56 6, 66, 73 76,,, ]. Unfairness reduces the predictability of the I/O latency and throughput for each flow, and can allow a malicious flow to deny or delay I/O service to other, benign flows.

We have made MQSim available as an open source tool to the research community []. We hope that MQSim enables researchers to explore directions in several new and different areas.
We make the following key contributions in this work:

- We use real off-the-shelf SSDs to show that state-of-the-art SSD simulators do not adequately capture three important properties of modern SSDs: (1) the multi-queue model used by modern host interface protocols such as NVMe, (2) steady-state SSD behavior, and (3) the end-to-end I/O request latency.
- We introduce MQSim, a simulator that accurately models both modern NVMe-based and conventional SATA-based SSDs. To our knowledge, MQSim is the first publicly-available SSD simulator to faithfully model the NVMe protocol.
- We validate the results reported by MQSim against several real state-of-the-art multi-queue SSDs.
- We demonstrate how MQSim can uncover important issues in modern SSDs that existing simulators cannot capture, such as the impact of inter-flow interference on fairness and system performance.

2 Background

In this section, we provide a brief background on multi-queue SSD (MQ-SSD) devices. First, we discuss the internal organization of an MQ-SSD (Section 2.1). Next, we discuss host interface protocols commonly used by SSDs (Section 2.2). Finally, we discuss how the SSD flash translation layer (FTL) handles requests and performs maintenance tasks (Section 2.3).

2.1 SSD Internals

Modern MQ-SSDs are typically built using NAND flash memory chips. NAND flash memory [, ] supports read and write operations at the granularity of a flash page (typically several kilobytes). Inside the NAND flash chips, multiple pages are grouped together into a flash block, which is the granularity at which erase operations take place. Flash writes can take place only to pages that are erased (i.e., free). To minimize the write latency, MQ-SSDs perform out-of-place updates (i.e., when a logical page is updated, its data is written to a different, free physical page, and the logical-to-physical mapping is updated). This avoids the need to erase the old physical page during a write operation. Instead, the old page is marked as invalid, and a garbage collection procedure [ 3, 3] reclaims invalid physical pages in the background.
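The out-of-place update scheme described above can be sketched in a few lines (a minimal Python illustration; the SimpleFTL class and its page counts are invented for this sketch and are not MQSim code):

```python
class SimpleFTL:
    """Toy flash translation layer illustrating out-of-place updates.

    A write to an already-mapped logical page never erases in place:
    it claims a new free physical page and marks the old one invalid,
    leaving reclamation to a background garbage collector."""

    def __init__(self, num_physical_pages):
        self.free = list(range(num_physical_pages))  # free physical pages
        self.l2p = {}         # logical-to-physical mapping table
        self.invalid = set()  # physical pages awaiting garbage collection

    def write(self, lpa):
        if lpa in self.l2p:
            # Out-of-place update: invalidate the old physical page.
            self.invalid.add(self.l2p[lpa])
        ppa = self.free.pop(0)   # allocate a free physical page
        self.l2p[lpa] = ppa
        return ppa

    def read(self, lpa):
        return self.l2p[lpa]     # address translation on the read path

ftl = SimpleFTL(num_physical_pages=8)
ftl.write(lpa=0)   # first write of logical page 0
ftl.write(lpa=0)   # update: data goes to a new physical page
```

Note that an update leaves the old physical page behind as garbage, which is precisely what GC later reclaims.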
Figure 1 shows the internal organization of an MQ-SSD. The components inside the MQ-SSD are divided into two groups: (1) the back end, which includes the memory devices; and (2) the front end, which includes the control and management units.

[Figure 1: Organization of an MQ-SSD. As highlighted in the figure, our MQSim simulator captures several aspects of MQ-SSDs not modeled by existing simulators: (1) a detailed host-to-device data transmission model and multi-queue request processing, (2) a detailed request processing delay model with support for multi-queue-aware caching and address mapping, and (3) fast and efficient preconditioning.]

The memory devices (e.g., NAND flash memory [, ], phase-change memory [], STT-MRAM [], 3D XPoint [9]) in the back end are organized in a highly-hierarchical manner to maximize I/O concurrency. The back end contains multiple independent bus channels, which connect the memory devices to the front end. Each channel connects to one or more memory chips. For a NAND flash memory based SSD, each NAND flash chip is typically divided into multiple dies, where each die can independently execute memory commands. All of the dies within a chip share a common communication interface. Each die is made up of one or more planes, which are arrays of flash cells. Each plane contains multiple blocks. Multiple planes within a single die can execute memory operations in parallel only if each plane is executing the same command on the same address offset within the plane.

In an MQ-SSD, the front end includes three major components [7]. (1) The host interface logic (HIL) implements the protocol used to communicate with the host (Section 2.2). (2) The flash translation layer (FTL) manages flash resources and processes I/O requests (Section 2.3). (3) The flash chip controllers (FCCs) send commands to and transfer data to/from the memory chips in the back end.
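The back-end hierarchy described above (channels connect to chips, chips contain dies, dies contain planes) and the multi-plane execution constraint can be captured in a small sketch (illustrative Python; the types and counts are hypothetical, not MQSim's actual model):

```python
from dataclasses import dataclass, field

@dataclass
class Plane:
    blocks: int = 2048   # each plane is an array of blocks (made-up count)

@dataclass
class Die:
    planes: list = field(default_factory=lambda: [Plane(), Plane()])

    def can_multi_plane(self, cmds):
        """Planes in one die run in parallel only when every plane
        executes the same command at the same in-plane address offset."""
        ops = {(c["op"], c["offset"]) for c in cmds}
        return len(ops) == 1

@dataclass
class Chip:
    # All dies in a chip share one communication interface.
    dies: list = field(default_factory=lambda: [Die(), Die()])

@dataclass
class Channel:
    chips: list = field(default_factory=lambda: [Chip()])

backend = [Channel(), Channel()]   # independent bus channels
die = backend[0].chips[0].dies[0]
ok = die.can_multi_plane([{"op": "read", "offset": 7},
                          {"op": "read", "offset": 7}])    # allowed
bad = die.can_multi_plane([{"op": "read", "offset": 7},
                           {"op": "write", "offset": 7}])  # not allowed
```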
The front end contains on-board DRAM, which is used by the three components to cache application data and store data structures for flash management.

2.2 Host Interface Logic

The HIL plays a critical role in leveraging the internal parallelism of the NAND flash memory to provide higher I/O performance to the host. The SATA protocol [67] is commonly used for conventional SSDs, due to widespread support for SATA on enterprise and client systems. SATA employs Native Command Queuing (NCQ), which allows the SSD to concurrently execute I/O requests. NCQ allows the SSD to schedule multiple I/O requests based on which back end resources are currently idle [9, 5].

The NVM Express (NVMe) protocol [63] was designed to alleviate the bottlenecks of SATA [9], and to enable scalable, high-bandwidth, and low-latency communication over the PCIe bus. When an application issues an I/O request in NVMe, it bypasses the I/O stack in the OS and the block layer queue, and instead directly inserts the request into a submission queue (SQ in Figure 1) dedicated to the application. The SSD then selects a request from the SQ, performs the request, and inserts the request's job completion information (e.g., ack, read data) into the request completion queue (CQ) for the corresponding application. NVMe has already been widely adopted in modern SSD products [3, 6, 79, 5, 6].

2.3 Flash Translation Layer

The FTL executes on a microprocessor within the SSD, performing I/O requests and flash management procedures [, ]. Handling an I/O request in the FTL requires four steps for an SSD using NVMe. First, when the HIL selects a request from the SQ, it inserts the request into a device-level queue. Second, the HIL breaks the request down into multiple flash transactions, where each transaction is at the granularity of a single page. Next, the FTL checks if the request is a write. If it is, and the MQ-SSD supports write caching, the write cache management unit stores the data for each transaction in the write cache space within DRAM, and asks the HIL to prepare a response.
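The NVMe submission/completion flow from Section 2.2 can be illustrated with a toy queue pair (Python; this is a simplification of the doorbell-based ring buffers in the NVMe specification, not a faithful implementation):

```python
from collections import deque

class QueuePair:
    """Toy per-application NVMe queue pair: one submission queue (SQ)
    and one completion queue (CQ), used directly by the application
    with no OS-level I/O scheduler in between."""
    def __init__(self):
        self.sq = deque()   # requests the application has enqueued
        self.cq = deque()   # completions posted by the device

class ToyDevice:
    def process_one(self, qp):
        """Fetch a request from the SQ, 'perform' it, and post its
        completion information (e.g., ack, read data) into the CQ."""
        if not qp.sq:
            return
        req = qp.sq.popleft()
        qp.cq.append(("ack", req["op"], req["lba"]))

qp = QueuePair()
qp.sq.append({"op": "read", "lba": 42})   # application enqueues directly
ToyDevice().process_one(qp)               # device consumes SQ, fills CQ
```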
Otherwise, the FTL translates the logical page address (LPA) of the transaction into a physical page address (PPA), and enqueues the transaction into the corresponding chip-level queue. There are separate queues for reads (RDQ) and for writes (WRQ). The transaction scheduling unit (TSU) resolves resource contention among the pending transactions in the chip-level queues, and sends transactions that can be performed to the corresponding FCC [, 7]. Finally, when all transactions for a request finish, the FTL asks the HIL to prepare a response, which is then delivered to the host.

The address translation module of the FTL plays a key role in implementing out-of-place updates. When a transaction writes to an LPA, a page allocation scheme assigns the LPA to a free PPA. The LPA-to-PPA mapping is recorded in a mapping table, which is stored within the non-volatile memory and cached in DRAM (to reduce the latency of mapping lookups) []. When a transaction reads from an LPA, the module searches for the LPA's mapping and retrieves the PPA.

The FTL is also responsible for memory wearout management (i.e., wear-leveling) and garbage collection (GC) [ 3, 3]. GC is triggered when the number of free pages drops below a threshold. The GC procedure reclaims invalidated pages by selecting a candidate block with a high number of invalid pages, moving any valid pages in the block into a free block, and then erasing the candidate block. Any read and write transactions

generated during GC are inserted into dedicated read (GC-RDQ) and write (GC-WRQ) queues. This allows the transaction scheduling unit to schedule GC-related requests during idle periods.

3 Simulation Challenges for Modern MQ-SSDs

In this section, we compare the capabilities of state-of-the-art SSD simulators to the common features of modern SSD devices. As shown in Figure 1, we identify three significant features of modern SSDs that are not supported by current simulation tools: (1) multi-queue support, (2) fast modeling of steady-state behavior, and (3) proper modeling of the end-to-end request latency. While some of these features are also present in some conventional SSDs, their lack of support in existing simulators is more critical when we evaluate modern and emerging MQ-SSDs, resulting in large deviations between simulation results and measured performance.

3.1 Multi-Queue Support

A fundamental difference of a modern MQ-SSD from a conventional SSD is its use of multiple queues that directly expose the device controller to applications [9]. For conventional SSDs, the OS I/O scheduler coordinates concurrent accesses to the storage devices and ensures fairness for co-running applications [66, 6]. MQ-SSDs eliminate the OS I/O scheduler, and are themselves responsible for fairly servicing I/O requests from concurrently-running applications and for guaranteeing high responsiveness. Exposing application-level queues to the storage device enables the use of many optimized management techniques in the MQ-SSD controller, which can provide high performance and a high level of both fairness and responsiveness. This is mainly because the device controller can make better scheduling decisions than the OS, as the device controller knows the current status of the SSD's internal resources. We investigate how the performance of a flow changes when the flow is concurrently executed with other flows on real MQ-SSDs.
We conduct a set of experiments where we control the intensity of synthetic workloads that run on four new off-the-shelf MQ-SSDs (see Table 1 and Appendix A). In each experiment, there are two flows, Flow-1 and Flow-2, where each flow always keeps its I/O queue full with only sequential read accesses of the same average request size. We control the intensity of a flow by adjusting its I/O queue depth: a deeper I/O queue results in a more intensive flow. We hold the I/O queue depth of Flow-1 constant in all experiments, and we sweep eight different values for the I/O queue depth of Flow-2. (We assume that each I/O flow uses a separate I/O queue.)

To quantify the I/O service fairness of each device, we measure the average slowdown of each executed flow, and then use the slowdown to calculate fairness using Equation 1. We define the slowdown of a flow f as S_f = RT_f^shared / RT_f^alone, where RT_f^shared is the response time of f when it is run concurrently with other flows, and RT_f^alone is the response time of f when it runs alone. Fairness (F) is calculated as [, 56, 5]:

    F = MIN{S_f} / MAX{S_f}    (1)

According to the above definition, 0 < F <= 1. Lower F values indicate higher differences between the minimum and maximum slowdowns of all concurrently-running flows, which we say is more unfair to the flow that is slowed down the most. Higher F values are desirable.

Figure 2 depicts the slowdown, normalized throughput (IOPS), and fairness results when we execute Flow-1 and Flow-2 concurrently on our four target MQ-SSDs (which we call SSD-A, SSD-B, SSD-C, and SSD-D). The x-axes in all of the plots in Figure 2 represent the queue depth (i.e., the flow intensity) of Flow-2 in the experiments. For each SSD, we show three plots from left to right: (1) the slowdown and normalized throughput of Flow-1, (2) the slowdown and normalized throughput of Flow-2, and (3) fairness.
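The slowdown and fairness metrics of Equation 1 are straightforward to compute (a small Python helper; the response-time values below are made-up numbers for illustration):

```python
def slowdown(rt_shared, rt_alone):
    """S_f = RT_f^shared / RT_f^alone for one flow f."""
    return rt_shared / rt_alone

def fairness(slowdowns):
    """F = MIN{S_f} / MAX{S_f}; F lies in (0, 1], and higher is fairer."""
    return min(slowdowns) / max(slowdowns)

# Hypothetical response times (in us): Flow-1 suffers most interference.
s1 = slowdown(rt_shared=400.0, rt_alone=100.0)  # S_1 = 4.0
s2 = slowdown(rt_shared=120.0, rt_alone=100.0)  # S_2 = 1.2
f = fairness([s1, s2])                          # 1.2 / 4.0 = 0.3
```

When both flows slow down equally, F is 1; the more unequally interference is distributed, the closer F gets to 0.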
[Figure 2: Performance of Flow-1 (left) and Flow-2 (center), and fairness (right), when the two flows are concurrently executed with different intensities on four real MQ-SSDs.]

We make four major observations from Figure 2. First, in SSD-A, SSD-B, and SSD-C, the throughput of Flow-2 increases substantially, proportionately with its queue depth. Aside from the maximum bandwidth available from the SSD, there is no limit on the throughput of each I/O flow. Second, Flow-1 is slowed down significantly due to interference from Flow-2 when the I/O queue depth of Flow-2 is much greater than that of Flow-1. Third, for SSD-A, SSD-B, and SSD-C, the slowdown of Flow-2 becomes almost negligible (i.e., its

value approaches 1) as the intensity of Flow-2 increases. Fourth, SSD-D limits the maximum throughput of each flow, and thus the negative impact of Flow-2 on the performance of Flow-1 is well controlled. Further experiments with a higher number of flows reveal that one flow cannot exploit more than a quarter of the full I/O bandwidth of SSD-D, indicating that SSD-D has some level of internal fairness control. In contrast, one flow can unfairly exploit the full I/O capabilities of the other three SSDs.

We conclude that (1) the relative intensity of each flow significantly impacts the throughput delivered to each flow; and (2) MQ-SSDs with fairness controls, such as SSD-D, perform differently from MQ-SSDs without fairness controls when the relative intensities of concurrently-running flows differ. Thus, to accurately model the performance of MQ-SSDs, an SSD simulator needs to model multiple queues and enable multiple concurrently-running flows.

3.2 Steady-State Behavior

SSD performance evaluation standards explicitly state that SSD performance should be reported in the steady state [7]. As a consequence, preconditioning (i.e., quickly reaching steady state) is an essential requirement for SSD device performance evaluation, in order to ensure that results are collected in the steady state. This policy is important for three reasons. First, garbage collection (GC) activities are invoked only when the device has performed a certain number of writes, which causes the number of free pages in the SSD to drop below the GC threshold. GC activities interfere with user I/O activity and can significantly affect sustained device performance. However, a fresh out-of-the-box (FOB) device is unlikely to execute GC. Hence, performance results on an FOB device are unrealistic, as they do not account for GC [7]. Second, the steady-state benefits of the write cache may be lower than its short-term benefits, particularly for write-heavy workloads.
More precisely, in the steady state, the write cache is filled with application data and warmed up, and it is highly likely that no free slot can be allocated to new write requests. This leads to cache evictions and increased flash write traffic in the back end [33]. Third, the physical data placement of currently-running applications is highly dependent on the device usage history and the data placement of previous processes. For example, which physical pages are currently free in the SSD depends on how previous I/O requests wrote to and invalidated physical pages. (Based on the SNIA definition [7], a device is in the steady state if its performance variation is limited to a deterministic range.) As a result, channel- and chip-level parallelism in SSDs is limited in the steady state.

Although a number of works do successfully precondition and simulate steady-state behavior, many previous studies have not explored the effect of steady-state behavior on their proposals. Instead, their simulations start with an FOB SSD, and never reach steady state (e.g., the point when each physical page of the SSD has been written to at least once). Most well-known storage traces are not large enough to fill the entire storage space of a modern SSD. Figure 3 shows the total write volume of popular storage workloads [6, 53 55, 6]. We observe that most of the workloads have a total write volume that is much smaller than the storage capacity of most SSDs, with an average write volume of 6 GB. Even for the few workloads that are large enough to fill the SSD, it is time consuming for many existing simulators to simulate each I/O request and reach steady state (see Section 5). Therefore, it is crucial to have a simulator that enables efficient and high-performance steady-state simulation of SSDs.

3.3 Real End-to-End Latency

Request latency is a critical factor of MQ-SSD performance, since it affects how long an application stalls on an I/O request.
The end-to-end latency of an I/O request, from the tme t s nserted nto the host submsson queue to the tme the response s sent back from the MQ-SSD devce to the completon queue, ncludes seven dfferent parts, as we show n Fgure. Exstng smulaton tools model only some parts of the end-to-end latency, whch are usually consdered to be the domnant parts of the end-to-end latency [3, 6, 7, 35, 3]. Fgure a llustrates the end-to-end latency dagram for a small kb read request n a typcal NAND flashbased MQ-SSD. It ncludes I/O job enqueung n the submsson queue (SQ), host-to-devce I/O job transfer over the PCIe bus, address translaton and transacton schedulng n the FTL 3, read command and address transfer to the flash chp, flash chp read 5, read data transfer over the Open NAND Flash Interface (ONFI) [65] bus 6, and devce-to-host read data transfer over the PCIe bus 7. Steps 5 and 6 are assumed to be the most tme-consumng parts n the end-to-end request processng. Consderng typcal latency values for an kb page read operaton, the I/O job nserton (< µs, as measured on our real SSDs), the FTL request processng on a multcore processor ( µs) [7] (assumng a mappng table cache ht), and the I/O job and data transfer over the PCIe bus ( µs) [, 6] make neglgble contrbutons compared to the flash read (5- µs) [9, 5, 5, 69] and the ONFI NV-DDR [65] flash transfer ( µs). However, the above assumpton s unrealstc due to two major reasons. Frst, for some I/O requests, FTL re- 7 9 src- src- src- src- src- rsrsch- rsrch- rsrch- prxy- prxy- Mean webdev-3 webdev- webdev- webdev- web-3 web- web- web- usr- ts- usr- usr- stg- stg- src- Fgure 3: Total amount of data wrtten by commonly-used storage workloads [6, 53 55, 6]. USENIX Assocaton th USENIX Conference on Fle and Storage Technologes 53

7 I/O job Xfer over PCIe Request 3 processng Read request Xfer to chp 5 Flash read (TFlash Read) Enqueue I/O job n the SQ NAND flash Chp Hghest contrbuton to end-to-end latency Response data Xfer over PCIe 6 ONFI data Xfer (TONFI Xfer) tme 7 In summary, a detaled, realstc model of end-to-end latency s key for accurate smulaton of modern SSD devces wth () multple I/O flows that can potentally lead to a sgnfcant ncrease n CMT (cached mappng table) msses, and () very-fast NVM technologes such as 3D XPont that greatly reduce raw memory read/wrte latences. Exstng smulaton tools do not provde accurate performance results for such devces. Modelng a Modern MQ-SSD wth MQSm (a) NAND flash memory User Applcaton Host Memory MQ-SSD HIL MQ-SSD Frmware 3D Xpont Chp I/O job Xfer over PCIe Hghest contrbuton to end-to-end latency Request Read request processng 3 Xfer to chp 5 6 tme Fast data Xfer (TFast Xfer) 7 Response data Xfer over PCIe 3D Xpont read (T3DXpont Read) Enqueue I/O job n the SQ (b) 3D XPont memory Fgure : Tmng dagram for a kb read request n (a) NAND-flash and (b) 3D XPont MQ-SSDs. quest processng may not always be neglgble, and can even become comparable to the flash read access tme. For example, pror work [6] shows that f the FTL uses page-level address mappng, then a workload wthout localty ncurs a large number of msses n the cached mappng table (CMT). In case of a mss n the CMT, the user read operaton stalls untl the mappng data s read from the SSD back end and transferred to the front end []. Ths can lead to a substantal ncrease n the latency of Step 3 n Fgure a, whch can become even longer than the combned latency of Steps 5 and 6. In an MQ-SSD, as a greater number of I/O flows execute concurrently, there s more contenton for the CMT, leadng to a larger number of CMT msses. 
Second, as shown n Fgure b, cuttng-edge nonvolatle memory technologes, such as 3D XPont [7, 9,, ], dramatcally reduce the access and data transfer tmes of the MQ-SSD back end, by as much as three orders of magntude compared to that of NAND flash [5,,, 3]. The total latency of the 3D XPont read and transfer (< µs) contrbutes less than % to the end-to-end I/O request processng latency (< µs) [7, ]. In ths case, a conventonal smulaton tool would be naccurate, as t does not model the major steps contrbutng to the end-to-end latency. To our knowledge, there s no SSD modelng tool that supports mult-queue I/O executon, fast and effcent modelng of the SSD s steady-state behavor, and a full end-to-end request latency estmaton. In ths work, we present MQSm, a new smulaton framework that s developed from scratch to support all of these three mportant features that are requred for accurate performance modelng and desgn space exploraton of modern MQ-SSDs. Although manly desgned for MQ-SSD smulaton, MQSm also supports smulaton of the conventonal SATA-based SSDs that mplement natve command queung (NCQ). Our new smulator models all of the components shown n Fgure, whch exst n modern SSDs. Table provdes a quck comparson between MQSm and prevous SSD smulators. MQSm s a dscrete-event smulator wrtten n C++ and s released under the permssve MIT Lcense []. Fgure 5 depcts a hgh-level vew of MQSm s man components and ther nteracton. In ths secton, we brefly descrbe these components and explan ther novel features wth respect to the prevous smulators. Front end Back end Host Interface Request Fetch Unt Input Stream Manager Data Cache Manager FTL Address Mappng Unt Cached Mappng Table Flash Block Manager GC and WL Unt NVM Chp MQ-SSD Frmware NVM Channel MQ-SSD HIL NVM PHY Host Memory Transacton Schedulng Unt (TSU) User Applcaton Fgure 5: Hgh-level vew of MQSm components.. SSD Back End Model MQSm provdes a smple yet detaled model of the flash memory chps. 
It considers three major latency components of the SSD back end: (1) address and command transfer to the memory chip; (2) flash memory read/write execution for different technologies that store 1, 2, or 3 bits per cell; and (3) data transfer to/from the memory chips.

Table: A quick comparison between MQSim and existing SSD modeling tools.
  Multi-queue support: MQSim: multi-queue scheduling and prioritization; existing tools: not supported.
  Preconditioning: MQSim: fast and automatic (enabled by default); existing tools: manual, optional, and long execution time.
  End-to-end latency: MQSim: detailed model of the end-to-end latency; existing tools: missing some constant- or variable-latency components.
  Built-in implementation of SSD components: MQSim: all major components that exist in modern SSDs; existing tools: implementation is missing for some major components.

MQSim's flash model considers the constraints of die- and plane-level parallelism, and advanced command execution [65]. One important new feature of MQSim is that it can be configured or easily modified to simulate new NVM chips (e.g., those that do not need erase-before-write). Because the NVM chip communication interface is decoupled from the chip's internal implementation of the memory operations, one can modify the NVM chip in MQSim without the need to change the implementation of the other MQSim components.

Another new feature of MQSim is that it decouples the sizes of read and write operations. This feature helps exploit the large page sizes of modern flash memory chips to enable better write performance, while preventing the negative effects of large page sizes on read performance. For flash chip writes, the operation is always page-sized. MQSim's data cache controller can delay writes to eliminate write-back of partially-updated logical pages (where the update size is smaller than the physical page size). When a partially-updated logical page must be written back to the flash storage, the unchanged sub-pages (sectors) of the logical page are first read from the physical page that stores the page's data. Then, the unchanged and updated pieces of the page are merged. In the last step, the entire page is written to a new free physical page. For flash chip reads, the operation can be smaller than the physical page size. When a read operation finishes, only the data pieces that are requested by the I/O request are transferred from the flash chips to the SSD controller, avoiding the data transfer overhead of large physical pages.

4.2 SSD Front End Model

The front end model of MQSim includes all of the basic components of a modern SSD controller and provides many new features that do not exist in previous SSD modeling tools.
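The read-merge-write handling of partially-updated logical pages described above can be illustrated with a minimal Python sketch (MQSim itself is written in C++; the names and the sector count here are illustrative assumptions, not MQSim's actual interfaces):

```python
# Sketch of write-back for a partially-updated logical page: unchanged
# sectors are read from the old physical page, merged with the updated
# sectors, and the full page is programmed to a new free physical page.
SECTORS_PER_PAGE = 8  # assumed sectors per physical page

def merge_partial_write(old_page, updated_sectors):
    """old_page: sector list read from the old physical page.
    updated_sectors: {sector index: new data} from the write request.
    Returns (full page to program, sectors that had to be read back)."""
    read_back = [i for i in range(SECTORS_PER_PAGE) if i not in updated_sectors]
    new_page = list(old_page)                # sectors read from flash
    for i, data in updated_sectors.items():  # overlay the updated sectors
        new_page[i] = data
    return new_page, read_back

old = ["old%d" % i for i in range(SECTORS_PER_PAGE)]
page, read_back = merge_partial_write(old, {0: "new0", 3: "new3"})
print(page)       # ['new0', 'old1', 'old2', 'new3', 'old4', 'old5', 'old6', 'old7']
print(read_back)  # [1, 2, 4, 5, 6, 7]
```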
4.2.1 Host Interface Model

The host interface component of MQSim provides both NVMe multi-queue (MQ) and SATA native command queue models for a modern SSD. To our knowledge, MQSim is the first modeling tool that supports MQ I/O request processing. A request fetch unit within the host interface of MQSim fetches and schedules application I/O requests from the different input queues. The NVMe host interface provides users with a parameter, called QueueFetchSize, that can be used to tune the behavior of the request fetch unit in order to accurately model the behavior of real MQ-SSDs. This parameter defines the maximum number of I/O requests from each submission queue (SQ) that can be serviced concurrently in the MQ-SSD. More precisely, at any given time, the number of I/O requests that are fetched from a host SQ into the device-level queue is always less than or equal to QueueFetchSize. This parameter has a large impact on the multi-flow request processing characteristics of an MQ-SSD discussed in Section 3 (i.e., on the maximum achievable throughput per I/O flow and the probability of inter-flow interference). Appendix A.3 analyzes the effect of this parameter on performance. MQSim also models the different priority classes for host-side request queues that are part of the NVMe standard specification [63].

4.2.2 Data Cache Manager

MQSim has a data cache manager component that implements a DRAM-based cache with the least-recently-used (LRU) replacement policy. The DRAM cache can be configured to cache (1) recently-written data (the default mode), (2) recently-read data, or (3) both recently-written and recently-read data. A new feature of MQSim's cache manager, compared to previous SSD modeling tools, is that it implements a DRAM access model that accounts for contention among concurrent accesses to the DRAM chips and for the latency of DRAM commands. The DRAM cache models in MQSim can be extended to make use of detailed and fast DRAM simulators, such as Ramulator [39], to perform detailed studies of the effect of DRAM cache performance on overall MQ-SSD performance.
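The default write-caching mode described above can be sketched with a minimal LRU structure (a Python sketch under assumed names; MQSim's real cache manager additionally models DRAM timing and contention):

```python
from collections import OrderedDict

class WriteCache:
    """Minimal LRU cache of recently-written logical pages. An evicted
    page is returned so the caller can flush it to the back end."""
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # LPA -> data, ordered oldest-first

    def write(self, lpa, data):
        evicted = None
        if lpa in self.pages:
            self.pages.pop(lpa)  # rewrite: refresh LRU position
        elif len(self.pages) == self.capacity:
            evicted = self.pages.popitem(last=False)  # evict the LRU page
        self.pages[lpa] = data
        return evicted

cache = WriteCache(capacity_pages=2)
cache.write(0, "a")
cache.write(1, "b")
cache.write(0, "a2")               # LPA 0 becomes most recently used
evicted = cache.write(2, "c")      # capacity reached: LPA 1 is evicted
print(evicted)                     # (1, 'b')
```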
We leave this integration to future work.

4.2.3 FTL Components

MQSim implements all the main FTL components, including (1) the address translation unit, (2) the garbage collection (GC) and wear-leveling (WL) unit, and (3) the transaction scheduling unit. MQSim provides different options for each of these components, including state-of-the-art address translation strategies [7], GC candidate block selection algorithms [3, 5, 9], and transaction scheduling schemes [3, 7]. MQSim also implements several state-of-the-art GC and flash management mechanisms, including preemptible GC I/O scheduling, intra-plane data movement from one physical page to another using copyback read and write command pairs [7], and program/erase suspension [7] to reduce the interference of GC operations with application I/O requests. One novel feature of MQSim is that all of its FTL components support multi-flow (i.e., multi-input-queue) request processing. For example, the address mapping unit can partition the cached mapping table space among the concurrently running flows. This inherent support for multi-queue-aware request processing facilitates the design space exploration of performance isolation and QoS schemes for MQ-SSDs.

4.3 Modeling End-to-End Latency

In addition to the flash operation and internal data transfer latencies (steps 3, 4, 5, and 6 of I/O request processing), there is a mix of variable and constant latencies that MQSim models to determine the end-to-end request latency.

Variable Latencies. These are the variable request processing times in the FTL that result from contention for the cached mapping table and the DRAM write cache. Depending on the request type (read or write) and the request's logical address, the request processing time in the FTL includes some of the following items: (1) the time required to read/write from/to the data cache, and (2) the time to fetch mapping data from flash storage in case of a miss in the cached address mapping table.

Constant Latencies. These include the times required to transmit the I/O job information, the entire user data, and the I/O completion information over the PCIe bus, as well as the firmware (FTL) execution time on the controller's microprocessor. The PCIe transmission latencies are calculated based on a simple packet latency model provided by Xilinx that considers: (1) the PCIe communication bandwidth, (2) the payload and header sizes of the PCIe Transaction Layer Packets (TLPs), (3) the size of the NVMe management data structures, and (4) the size of the application data. The firmware execution time is estimated using both a CPU and a cache latency model.

4.4 Modeling Steady-State Behavior

The basic assumption of MQSim is that all simulations should be executed when the modeled device is in steady state. To model steady-state behavior, MQSim, by default, automatically executes a preconditioning function before starting the actual simulation process. This function performs preconditioning in a short time (e.g., within minutes when running tpcc [53] on a large MQ-SSD), without the need to execute additional I/O requests. During preconditioning, all available physical pages of the modeled SSD are transitioned to either a valid or an invalid state, based on a steady-state valid/invalid page distribution model (only very few flash blocks are assumed to remain free and are added to the free block pool). MQSim pre-processes the input trace to extract the LPA (logical page address) access characteristics of the application I/O requests in the trace, and then uses the extracted information as input to the valid/invalid page distribution model.
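A minimal Python sketch of this preconditioning step (with an arbitrary uniform stand-in for the trace-driven valid/invalid page distribution model; the geometry constants are illustrative assumptions):

```python
import random

PAGES_PER_BLOCK = 256   # illustrative geometry
NUM_BLOCKS = 64
FREE_BLOCKS = 2         # only very few blocks remain in the free pool

def precondition(valid_fraction, seed=0):
    """Transition every page of every non-free block to valid or invalid.
    valid_fraction stands in for the distribution model that MQSim
    derives from the trace's LPA access characteristics."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(NUM_BLOCKS - FREE_BLOCKS):
        valid = sum(rng.random() < valid_fraction
                    for _ in range(PAGES_PER_BLOCK))
        blocks.append({"valid": valid, "invalid": PAGES_PER_BLOCK - valid})
    return blocks

blocks = precondition(valid_fraction=0.7)
# After preconditioning, non-free blocks hold a mix of valid and
# invalid pages, so GC behaves as it would on an aged device.
print(len(blocks), blocks[0])
```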
In addition, input trace characteristics, such as the average write arrival rate and the distribution of write addresses, are used to warm up the write cache.

4.5 Execution Modes

MQSim provides two modes of operation: (1) standalone mode, where it is fed a real disk trace or a synthetic workload, and (2) integrated mode, where it is fed disk requests from an execution-driven engine (e.g., gem5).

5 Comparison with Previous Simulators

The increasing use of SSDs in modern computing systems has boosted interest in SSD design space exploration. To this end, several simulators have been developed in recent years. The table below summarizes the features of MQSim and popular existing SSD modeling tools. The table also shows the average error rates for the performance of real storage workloads reported by each simulator, compared to the performance measured on four real MQ-SSDs (see Appendix A for our methodology). Existing tools either do not model some major components of modern SSDs or provide very simplistic component models that lead to unrealistic I/O request latency estimation. In contrast, MQSim provides detailed implementations of all of the major components of modern SSDs. MQSim is written in C++ and has 13K lines of code (LOC). Next, we discuss the main advantages of MQSim compared to the previous tools.

[Table: Comparison of MQSim with previous SSD modeling tools. For each simulator (MQSim, SSDModel, FlashSim, SSDSim, NANDFlashSim, VSSIM, WiscSim, and SimpleSSD [35]), the table lists: the HIL protocol supported (NVMe, SATA) and queueing model (MQ, NCQ); the execution mode (standalone, integrated with a full-system simulator, or SSD emulation for a real system); support for fast and accurate preconditioning of the modeled SSD to enable accurate steady-state results; the end-to-end latency components modeled (flash/NVM read and write timing, NVM data transfer, FTL request processing overhead, accurate write cache access latency, and host-to-device/device-to-host data transfer delay); the front-end components implemented (page-level and hybrid address mapping, GC, write cache, FTL transaction scheduling unit, FTL wear-leveling unit, and built-in support for multi-queue-aware request processing in the FTL); the lines of source code; and the simulation error (%) against four real devices (SSD-A through SSD-D).]

Host Interface Logic. As the table shows, most existing simulators assume a very simplistic HIL model with no explicit management mechanism for the I/O request queues. This leads to an unrealistic SSD model with respect to the requirements of both the NVMe and SATA protocols. As we mention in Section 3, the concurrent execution of I/O flows presents many challenges for performance predictability and fairness in MQ-SSDs. No existing simulator implements NVMe and multi-queue I/O request management, and hence none accurately models the behavior of MQ-SSDs. Also, except for WiscSim, we find that no existing simulator implements an accurate model of the SATA protocol and NCQ request processing. This leads to unrealistic SATA device simulation, as NCQ-based I/O scheduling plays a key role in the performance of real SSD devices [5, 6].

Steady-State Simulation. To our knowledge, accurate and fast steady-state behavior modeling is not provided by most existing SSD modeling tools. Among the tools listed in the table, only SSDSim provides a function, called make_aged, to change the status of a set of physical pages to valid before starting the actual execution of an input trace. This simple method cannot accurately replicate the steady-state behavior of an SSD for two reasons. First, after the execution of make_aged, each physical block contains only valid pages or only free pages. This is far from the steady-state status of blocks in real devices, where each non-free block has a mix of valid and invalid pages. Second, the steady-state status of the data cache is not modeled, i.e., the simulation starts with a completely empty write cache. In general, it is possible to bring these simulators to steady state. However, they have no fast preconditioning support, and preconditioning must be performed by executing traces. Preconditioning an existing simulator requires users to generate traces with a large enough number of I/O requests, and can significantly slow down the simulator, especially when a high-capacity SSD is modeled. For example, our studies with SSDSim show that such preconditioning can increase the simulation time substantially when a large SSD is modeled.

Detailed End-to-End Latency Model. As described in Section 3.3, the end-to-end latency of an application I/O request includes several different components. The table shows that latency modeling in existing simulators is mainly focused on the latency of the flash chip operation and the SSD-internal data transfer.
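The point can be made concrete with a small sketch that composes an end-to-end latency from the constant and variable components described earlier (all numbers below are illustrative assumptions, not measurements):

```python
def end_to_end_latency_us(host_xfer_us, ftl_proc_us, cache_us,
                          flash_op_us, internal_xfer_us):
    """Constant host-side/firmware costs + variable FTL/cache time
    + back-end flash operation and internal data transfer."""
    return host_xfer_us + ftl_proc_us + cache_us + flash_op_us + internal_xfer_us

# For a NAND read, the back end dominates, so a back-end-only model
# is a reasonable approximation...
nand = end_to_end_latency_us(4, 3, 0, 75, 10)
# ...but for a 3D XPoint-like read, the back end is a small fraction of
# the total, so ignoring the other components badly skews the result.
fast = end_to_end_latency_us(4, 3, 0, 0.1, 0.5)
print(nand, fast)  # 92 7.6
```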
As we explain in Section 3.3, focusing only on these components yields an unrealistic model of the end-to-end I/O request processing latency, even for a conventional SSD. To study the accuracy of the existing tools in modeling real devices, we create models of the four real SSDs in each simulator, and execute three real traces, i.e., tpcc, tpce, and exchange. We exclude the simulators that do not support trace-based execution. The four rightmost columns of the comparison table show the average error rate of each simulator in modeling the performance (i.e., read and write latency) of these four real devices. The error rates of the four evaluated simulators are almost one order of magnitude higher than that of MQSim. We believe that these high error rates are due to four major reasons: (1) the lack of a write cache or inaccurate modeling of the write cache access latency, (2) the lack of built-in support for steady-state modeling, (3) incomplete modeling of the request processing latency in the FTL, and (4) the lack of modeling of the host-to-device communication latency.

(The increase in simulation time noted above depends on the access pattern, intensity, and mix of I/O requests (read vs. write) of the workload.)

6 Research Directions Enabled by MQSim

MQSim is a flexible simulation tool that enables different studies on both modern and conventional SSD devices. In this section, we discuss two new research directions enabled by MQSim, which could not be explored easily using existing simulation tools. First, we use MQSim to perform a detailed analysis of inter-flow interference in a modern MQ-SSD (Section 6.1). We explain how sharing different internal resources of an MQ-SSD, such as the write cache, cached mapping table, and back end resources, can introduce fairness issues. Second, we explain how the full-system simulation mode of MQSim can enable detailed application-level studies (Section 6.2).

6.1 Design Space Exploration of Fairness and QoS Techniques for MQ-SSDs

As we describe earlier, fairness and QoS should be considered first-class design criteria for modern datacenter SSDs.
MQSim provides an accurate framework to study inter-flow interference, and thus enables the design of interference-aware MQ-SSD management algorithms for sharing the internal MQ-SSD resources. As we show in Section 3, concurrently running two I/O flows can lead to disproportionate slowdowns for each flow, greatly degrading fairness and proportional progress. This is particularly important in high-end SSD devices, which provide higher throughput per I/O flow, as we show in Appendix A.3. We find that this inter-flow interference is mainly the result of contention that takes place at three locations in an MQ-SSD: (1) the write cache in the front end, (2) the cached mapping table (CMT) in the front end, and (3) the storage resources in the back end. In this section, we use MQSim to explore the impact of these three points of contention on performance and fairness, which cannot be explored accurately using existing simulators.

6.1.1 Methodology

MQ-SSD Configuration. The table below lists the specification of the MQ-SSD that we model in MQSim for our contention studies.

[Table 3: Configuration of the simulated SSD. SSD organization: PCIe 3.0 (NVMe) host interface; user capacity in GB; write cache and CMT sizes in MB; number of channels and chips per channel; QueueFetchSize. Flash communication interface: ONFI 3.x (NV-DDR), 333 MT/s. Flash microarchitecture: page size in KB with per-page metadata, 256 pages per block, and multiple blocks per plane and planes per die. Flash access parameters: read latency 75 µs; program and erase latencies as listed.]

Metrics. To measure performance, we use the weighted speedup (WS) [7] of the average response time (RT), which represents the overall efficiency and system-level throughput provided by an MQ-SSD during the concurrent execution of multiple flows:

    WS = Σ_i ( RT_alone,i / RT_shared,i )    (1)

where RT_alone and RT_shared are defined in Section 3. To demonstrate the effect of inter-flow interference on fairness, we report slowdown and fairness (F) metrics, also as defined in Section 3.

6.1.2 Contention at the Write Cache

One point of contention among concurrently-running flows in an MQ-SSD is the write cache. For flows with low to moderate write intensity (where the average depth of the I/O queue is small), or with high spatial locality, the write cache decreases the response time of write requests by eliminating the need for a request to wait until its write completes in the underlying memory. For flows with high write intensity or with highly-random accesses, the write requests quickly fill up the limited capacity of the write cache, causing significant cache thrashing and limiting the decrease in write request response time. Such flows not only fail to benefit from the write cache themselves, but also prevent other, lower-write-intensity flows from benefiting from it, leading to a large performance loss for the lower-write-intensity flows.

To understand how contention at the write cache affects system performance and fairness, we perform a set of experiments in which we run two flows, Flow-1 and Flow-2, both of which perform only random-access write requests with the same average request size. We set Flow-1 to have a moderate write intensity by limiting its queue depth. We vary the queue depth of Flow-2 up to 256 requests, to control the write intensity of the flow. In order to isolate the effect of write cache interference in our experiments, we (1) assign each flow a dedicated subset of the back end resources (i.e., Flow-1 uses Channels 1-4, and Flow-2 uses Channels 5-8), to avoid introducing any interference in the back end; and (2) use a perfect CMT, in which all address translation requests hit, to avoid interference due to limited CMT capacity.
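The metrics used in this section can be computed as in the sketch below (weighted speedup and slowdown as defined above; for fairness we assume the common min-over-max slowdown definition):

```python
def weighted_speedup(rt_alone, rt_shared):
    """WS = sum over flows of RT_alone / RT_shared."""
    return sum(a / s for a, s in zip(rt_alone, rt_shared))

def fairness(rt_alone, rt_shared):
    """F = min slowdown / max slowdown, where slowdown = RT_shared / RT_alone.
    F == 1.0 means all flows slow down equally (perfect fairness)."""
    sd = [s / a for a, s in zip(rt_alone, rt_shared)]
    return min(sd) / max(sd)

rt_alone  = [100.0, 100.0]  # average response time (us), each flow alone
rt_shared = [400.0, 110.0]  # response times when running concurrently
ws = weighted_speedup(rt_alone, rt_shared)
f  = fairness(rt_alone, rt_shared)
print(round(ws, 3), f)  # 1.159 0.275
```

A disproportionate slowdown of one flow (4x vs. 1.1x here) drives F far below 1.0 even when the weighted speedup looks acceptable, which is why both metrics are reported.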
Figure 6a shows the slowdown of each flow when the two flows run concurrently, compared to when each flow runs alone. Figure 6b shows the fairness and performance of the system when the two flows run concurrently.

[Figure 6: Impact of write cache contention. (a) Slowdown of Flow-1 (left) and Flow-2 (right). (b) Fairness (left) and system performance, measured as weighted speedup (right).]

We make four key observations from the figures. First, Flow-1 is slowed down significantly when Flow-2 has a high write intensity (i.e., a large queue depth), indicating that at high write intensities, Flow-2 induces write cache thrashing. Second, the slowdown of Flow-2 is negligible, because of the low write intensity of Flow-1. Third, fairness degrades greatly, as a result of the write cache contention, when Flow-2 has a high write intensity. Fourth, write cache contention causes an MQ-SSD to be inefficient at concurrently running multiple I/O flows, as the weighted speedup is reduced by more than 5% when Flow-2 has a high write intensity compared to when it has a low write intensity.

We conclude that write cache contention leads to unfairness and overall performance degradation for concurrently-running flows when one flow has a high write intensity. In these cases, the high-write-intensity flow (1) does not benefit from the write cache, and (2) prevents other, lower-write-intensity flows from taking advantage of the write cache, even though the other flows would otherwise benefit from it. This motivates the need for fair write cache management algorithms for MQ-SSDs that take inter-flow interference and flow write intensity into account.

6.1.3 Contention at the Cached Mapping Table

As we discuss in Section 3.3, address translation can noticeably increase the end-to-end latency of an I/O request, especially for read requests. We find that for I/O flows with random access patterns, the cached mapping table (CMT) miss rate is high due to poor reuse of address translation mappings, which causes the I/O requests generated by the flow to stall for long periods of time during address translation.
This is not true for I/O flows with sequential accesses, for which the CMT miss rate remains low due to spatial locality. However, when two I/O flows run concurrently, where one flow has a random access pattern and the other has a sequential access pattern, the poor locality of the flow with the random access pattern may cause both flows to experience high CMT miss rates.

To understand how contention at the CMT affects system performance and fairness, we perform a set of experiments in which we concurrently run two flows that issue read requests with the same average request size. In these experiments, Flow-1 has a fully-sequential access pattern, while Flow-2 has a random access pattern for a fraction of the total execution time and a sequential access pattern for the remaining time. We vary the randomness (i.e., the fraction of the execution time with a random access pattern) of Flow-2. To isolate the effects of CMT contention, we assign Flow-1 to Channels 1-4 in the back end, and assign Flow-2 to Channels 5-8. Figure 7a shows the slowdown and change in CMT hit rate of each flow when Flow-1 and Flow-2 run concurrently.


Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs

Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs Utlty-Based Acceleraton of Multthreaded Applcatons on Asymmetrc CMPs José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt ECE Department The Unversty of Texas at Austn Austn, TX, USA {joao, patt}@ece.utexas.edu

More information

Verification by testing

Verification by testing Real-Tme Systems Specfcaton Implementaton System models Executon-tme analyss Verfcaton Verfcaton by testng Dad? How do they know how much weght a brdge can handle? They drve bgger and bgger trucks over

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011 9/8/2 2 Outlne Appendx C: The Bascs of Logc Desgn TDT4255 Computer Desgn Case Study: TDT4255 Communcaton Module Lecture 2 Magnus Jahre 3 4 Dgtal Systems C.2: Gates, Truth Tables and Logc Equatons All sgnals

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Goals and Approach Type of Resources Allocation Models Shared Non-shared Not in this Lecture In this Lecture

Goals and Approach Type of Resources Allocation Models Shared Non-shared Not in this Lecture In this Lecture Goals and Approach CS 194: Dstrbuted Systems Resource Allocaton Goal: acheve predcable performances Three steps: 1) Estmate applcaton s resource needs (not n ths lecture) 2) Admsson control 3) Resource

More information

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to:

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to: 4.1 4.2 Motvaton EE 457 Unt 4 Computer System Performance An ndvdual user wants to: Mnmze sngle program executon tme A datacenter owner wants to: Maxmze number of Mnmze ( ) http://e-tellgentnternetmarketng.com/webste/frustrated-computer-user-2/

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

A QoS-aware Scheduling Scheme for Software-Defined Storage Oriented iscsi Target

A QoS-aware Scheduling Scheme for Software-Defined Storage Oriented iscsi Target A QoS-aware Schedulng Scheme for Software-Defned Storage Orented SCSI Target Xanghu Meng 1,2, Xuewen Zeng 1, Xao Chen 1, Xaozhou Ye 1,* 1 Natonal Network New Meda Engneerng Research Center, Insttute of

More information

A fair buffer allocation scheme

A fair buffer allocation scheme A far buffer allocaton scheme Juha Henanen and Kalev Klkk Telecom Fnland P.O. Box 228, SF-330 Tampere, Fnland E-mal: juha.henanen@tele.f Abstract An approprate servce for data traffc n ATM networks requres

More information

A New Transaction Processing Model Based on Optimistic Concurrency Control

A New Transaction Processing Model Based on Optimistic Concurrency Control A New Transacton Processng Model Based on Optmstc Concurrency Control Wang Pedong,Duan Xpng,Jr. Abstract-- In ths paper, to support moblty and dsconnecton of moble clents effectvely n moble computng envronment,

More information

If you miss a key. Chapter 6: Demand Paging Source:

If you miss a key. Chapter 6: Demand Paging Source: ADRIAN PERRIG & TORSTEN HOEFLER ( -6- ) Networks and Operatng Systems Chapter 6: Demand Pagng Source: http://redmne.replcant.us/projects/replcant/wk/samsunggalaxybackdoor If you mss a key after yesterday

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing EDA222/DIT161 Real-Tme Systems, Chalmers/GU, 2014/2015 Lecture #8 Real-Tme Systems Real-Tme Systems Lecture #8 Specfcaton Professor Jan Jonsson Implementaton System models Executon-tme analyss Department

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution Dynamc Voltage Scalng of Supply and Body Bas Explotng Software Runtme Dstrbuton Sungpack Hong EE Department Stanford Unversty Sungjoo Yoo, Byeong Bn, Kyu-Myung Cho, Soo-Kwan Eo Samsung Electroncs Taehwan

More information

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction Chapter 2 Desgn for Testablty Dr Rhonda Kay Gaede UAH 2 Introducton Dffcultes n and the states of sequental crcuts led to provdng drect access for storage elements, whereby selected storage elements are

More information

Design and Analysis of Algorithms

Design and Analysis of Algorithms Desgn and Analyss of Algorthms Heaps and Heapsort Reference: CLRS Chapter 6 Topcs: Heaps Heapsort Prorty queue Huo Hongwe Recap and overvew The story so far... Inserton sort runnng tme of Θ(n 2 ); sorts

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

Space-Optimal, Wait-Free Real-Time Synchronization

Space-Optimal, Wait-Free Real-Time Synchronization 1 Space-Optmal, Wat-Free Real-Tme Synchronzaton Hyeonjoong Cho, Bnoy Ravndran ECE Dept., Vrgna Tech Blacksburg, VA 24061, USA {hjcho,bnoy}@vt.edu E. Douglas Jensen The MITRE Corporaton Bedford, MA 01730,

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

Multiple Sub-Row Buffers in DRAM: Unlocking Performance and Energy Improvement Opportunities

Multiple Sub-Row Buffers in DRAM: Unlocking Performance and Energy Improvement Opportunities Multple Sub-Row Buffers n DRAM: Unlockng Performance and Energy Improvement Opportuntes ABSTRACT Nagendra Gulur Texas Instruments (Inda) nagendra@t.com Mahesh Mehendale Texas Instruments (Inda) m-mehendale@t.com

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

an assocated logc allows the proof of safety and lveness propertes. The Unty model nvolves on the one hand a programmng language and, on the other han

an assocated logc allows the proof of safety and lveness propertes. The Unty model nvolves on the one hand a programmng language and, on the other han UNITY as a Tool for Desgn and Valdaton of a Data Replcaton System Phlppe Quennec Gerard Padou CENA IRIT-ENSEEIHT y Nnth Internatonal Conference on Systems Engneerng Unversty of Nevada, Las Vegas { 14-16

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Shared Running Buffer Based Proxy Caching of Streaming Sessions

Shared Running Buffer Based Proxy Caching of Streaming Sessions Shared Runnng Buffer Based Proxy Cachng of Streamng Sessons Songqng Chen, Bo Shen, Yong Yan, Sujoy Basu Moble and Meda Systems Laboratory HP Laboratores Palo Alto HPL-23-47 March th, 23* E-mal: sqchen@cs.wm.edu,

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

Routing in Degree-constrained FSO Mesh Networks

Routing in Degree-constrained FSO Mesh Networks Internatonal Journal of Hybrd Informaton Technology Vol., No., Aprl, 009 Routng n Degree-constraned FSO Mesh Networks Zpng Hu, Pramode Verma, and James Sluss Jr. School of Electrcal & Computer Engneerng

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

A New Token Allocation Algorithm for TCP Traffic in Diffserv Network

A New Token Allocation Algorithm for TCP Traffic in Diffserv Network A New Token Allocaton Algorthm for TCP Traffc n Dffserv Network A New Token Allocaton Algorthm for TCP Traffc n Dffserv Network S. Sudha and N. Ammasagounden Natonal Insttute of Technology, Truchrappall,

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams Self-Tunng, Bandwdth-Aware Montorng for Dynamc Data Streams Navendu Jan, Praveen Yalagandula, Mke Dahln, Yn Zhang Mcrosoft Research HP Labs The Unversty of Texas at Austn Abstract We present, a self-tunng,

More information

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) ,

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) , VRT012 User s gude V0.1 Thank you for purchasng our product. We hope ths user-frendly devce wll be helpful n realsng your deas and brngng comfort to your lfe. Please take few mnutes to read ths manual

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information

Overview. Basic Setup [9] Motivation and Tasks. Modularization 2008/2/20 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION

Overview. Basic Setup [9] Motivation and Tasks. Modularization 2008/2/20 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION Overvew 2 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION Introducton Mult- Smulator MASIM Theoretcal Work and Smulaton Results Concluson Jay Wagenpfel, Adran Trachte Motvaton and Tasks Basc Setup

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

ADRIAN PERRIG & TORSTEN HOEFLER ( -6- ) Networks and Operatng Systems Chapter 6: Demand Pagng Page Table Structures Page table structures Page table structures Problem: smple lnear table s too bg Problem:

More information

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization What s a Computer Program? Descrpton of algorthms and data structures to acheve a specfc ojectve Could e done n any language, even a natural language lke Englsh Programmng language: A Standard notaton

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT Bran J. Wolf, Joseph L. Hammond, and Harlan B. Russell Dept. of Electrcal and Computer Engneerng, Clemson Unversty,

More information

#4 Inverted page table. The need for more bookkeeping. Inverted page table architecture. Today. Our Small Quiz

#4 Inverted page table. The need for more bookkeeping. Inverted page table architecture. Today. Our Small Quiz ADRIAN PERRIG & TORSTEN HOEFLER Networks and Operatng Systems (-6-) Chapter 6: Demand Pagng http://redmne.replcant.us/projects/replcant/wk/samsunggalaxybackdoor () # Inverted table One system-wde table

More information

Sample Solution. Advanced Computer Networks P 1 P 2 P 3 P 4 P 5. Module: IN2097 Date: Examiner: Prof. Dr.-Ing. Georg Carle Exam: Final exam

Sample Solution. Advanced Computer Networks P 1 P 2 P 3 P 4 P 5. Module: IN2097 Date: Examiner: Prof. Dr.-Ing. Georg Carle Exam: Final exam Char of Network Archtectures and Servces Department of Informatcs Techncal Unversty of Munch Note: Durng the attendance check a stcker contanng a unque QR code wll be put on ths exam. Ths QR code contans

More information

WITH rapid improvements of wireless technologies,

WITH rapid improvements of wireless technologies, JOURNAL OF SYSTEMS ARCHITECTURE, SPECIAL ISSUE: HIGHLY-RELIABLE CPS, VOL. 00, NO. 0, MONTH 013 1 Adaptve GTS Allocaton n IEEE 80.15.4 for Real-Tme Wreless Sensor Networks Feng Xa, Ruonan Hao, Je L, Naxue

More information

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface.

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface. IDC Herzlya Shmon Schocken Assembler Shmon Schocken Sprng 2005 Elements of Computng Systems 1 Assembler (Ch. 6) Where we are at: Human Thought Abstract desgn Chapters 9, 12 abstract nterface H.L. Language

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks www.hurray.sep.pp.pt Techncal Report -GAME: An Implct GTS Allocaton Mechansm n IEEE 802.15.4 for Tme- Senstve Wreless Sensor etworks Ans Koubaa Máro Alves Eduardo Tovar TR-060706 Verson: 1.0 Date: Jul

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

A HIERARCHICAL SIMULATION FRAMEWORK FOR APPLICATION DEVELOPMENT ON SYSTEM-ON-CHIP ARCHITECTURES. Vaibhav Mathur and Viktor K.

A HIERARCHICAL SIMULATION FRAMEWORK FOR APPLICATION DEVELOPMENT ON SYSTEM-ON-CHIP ARCHITECTURES. Vaibhav Mathur and Viktor K. A HIERARCHICAL SIMULATION FRAMEWORK FOR APPLICATION DEVELOPMENT ON SYSTEM-ON-CHIP ARCHITECTURES Vabhav Mathur and Vktor K. Prasanna Department of EE-Systems Unversty of Southern Calforna Los Angeles, CA

More information

Quantifying Performance Models

Quantifying Performance Models Quantfyng Performance Models Prof. Danel A. Menascé Department of Computer Scence George Mason Unversty www.cs.gmu.edu/faculty/menasce.html 1 Copyrght Notce Most of the fgures n ths set of sldes come from

More information