Accelerating Storage with RDMA
Max Gurtovoy, Mellanox Technologies
2018 Storage Developer Conference EMEA. Mellanox Technologies. All Rights Reserved.
What is RDMA?
- Remote Direct Memory Access: provides the ability to perform a direct memory access (DMA) from one computer into another without involving either one's OS/CPU.
- Created in 1999 (implementations: InfiniBand, RoCE, iWARP).
- Main characteristics:
  - High bandwidth
  - Low latency
  - Zero copy (CPU offload)
  - Hardware-based data transfers
  - Kernel bypass
  - Direct access to HW for user-level applications
  - QoS
  - Asynchronous transactions
RDMA primitives
- QP (Queue Pair): send and receive queues, with various transport services, used for posting work requests to the HW:
  - RC (Reliable Connected) ~= TCP
  - UD (Unreliable Datagram) ~= UDP
  - UC (Unreliable Connected)
  - RD (Reliable Datagram): defined by the spec but not yet implemented
- CQ (Completion Queue): used for reporting work request completions to the host
- MR (Memory Region): describes a memory area, with the relevant permissions, accessible for RDMA from the device
- PD (Protection Domain): provides an association between QPs/MRs/MWs for enabling and controlling HCA access to host memory
- Programming model: Verbs
RDMA operations
- Messaging:
  - RECV: post a buffer for incoming data
  - SEND: send a buffer to a remote peer (who posted a RECV buffer for it in advance)
- REG_MR: memory registration for RDMA operations
- One-sided:
  - RDMA_WRITE: copy a local buffer (described by MR-L) to a remote buffer (MR-R)
  - RDMA_READ: copy a remote buffer (described by MR-R) to a local buffer (MR-L)
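The difference between messaging and one-sided operations can be illustrated with a toy in-memory model. This is only a sketch of the semantics, not the verbs API: `Peer`, `send`, `rdma_write`, and `rdma_read` are hypothetical names, and a `bytearray` stands in for a registered memory region.

```python
class Peer:
    """Toy endpoint: registered memory regions plus a receive queue."""

    def __init__(self):
        self.mrs = {}         # key -> bytearray, stands in for a registered MR
        self.recv_queue = []  # buffers posted with RECV, consumed in FIFO order
        self.next_key = 1

    def reg_mr(self, size):
        """REG_MR: make a buffer reachable by key. Returns (key, buffer)."""
        key = self.next_key
        self.next_key += 1
        self.mrs[key] = bytearray(size)
        return key, self.mrs[key]

    def post_recv(self, buf):
        """RECV: post a buffer for incoming data."""
        self.recv_queue.append(buf)


def send(remote, data):
    """SEND: two-sided, needs a RECV buffer posted by the remote peer in advance."""
    if not remote.recv_queue:
        raise RuntimeError("receiver not ready: no RECV buffer posted")
    buf = remote.recv_queue.pop(0)
    if len(data) > len(buf):
        raise ValueError("message larger than the posted RECV buffer")
    buf[:len(data)] = data
    return len(data)


def rdma_write(local_buf, remote, rkey, offset=0):
    """RDMA_WRITE: one-sided copy of a local buffer into the remote MR."""
    target = remote.mrs[rkey]
    if offset + len(local_buf) > len(target):
        raise ValueError("write outside the remote MR")
    target[offset:offset + len(local_buf)] = local_buf


def rdma_read(local_buf, remote, rkey, offset=0):
    """RDMA_READ: one-sided copy from the remote MR into a local buffer."""
    source = remote.mrs[rkey]
    if offset + len(local_buf) > len(source):
        raise ValueError("read outside the remote MR")
    local_buf[:] = source[offset:offset + len(local_buf)]
```

Note that `send` fails unless the remote side has posted a buffer, while `rdma_write` and `rdma_read` only need the key of a registered region and involve no action by the remote peer.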
Memory registration
- So why do we need to register memory?
  - Avoid data corruption
  - Protect from unauthorized access
  - Map the addresses to DMA language (PCI space)
Use Fast Memory Registration
- Memory registration is a heavy operation (allocations, pinning, translation, FW commands).
- In the kernel (iSER/SRP/NVMe-oF) we always receive the buffer from the user:
  - The user allocates a buffer
  - The user opens a file (block device or file system)
  - The user calls the read/write(buffer) syscall -> the ULP sees this as a bio or as an SG list
  - Pinning the buffer was already done by the block layer (no need to take care of data corruption)
- One should use a special work request (WR) to make it fast:
  - Use a pre-allocated MR
  - Only DMA-map the SG list and update the HW memory management tables
  - Use an ib_sge object, which represents a virtually contiguous buffer as a (key, address, length) tuple
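The "only map the SG list" step can be sketched as follows. This is a simplified model, not kernel code: it assumes a 4 KiB page size and the usual page-list constraint that every element except the first starts on a page boundary and every element except the last ends on one. `sg_to_pages`, `Sge`, and `fast_register` are hypothetical names; only the (key, address, length) tuple comes from the slide.

```python
from collections import namedtuple

PAGE_SIZE = 4096  # assumed page size

# Mirrors the (key, address, length) tuple that describes the registered buffer.
Sge = namedtuple("Sge", ["key", "addr", "length"])


def sg_to_pages(sg_list, page_size=PAGE_SIZE):
    """Turn [(dma_addr, length), ...] into the page list of a pre-allocated MR.

    Returns (pages, iova, total_length). Raises ValueError if the list has
    gaps that a plain page-list MR cannot describe.
    """
    if not sg_list:
        raise ValueError("empty SG list")
    pages, total = [], 0
    last = len(sg_list) - 1
    for i, (addr, length) in enumerate(sg_list):
        if length <= 0:
            raise ValueError("SG element %d has no data" % i)
        end = addr + length
        if i > 0 and addr % page_size:
            raise ValueError("gap: element %d does not start on a page boundary" % i)
        if i < last and end % page_size:
            raise ValueError("gap: element %d does not end on a page boundary" % i)
        page = addr - addr % page_size
        while page < end:
            pages.append(page)
            page += page_size
        total += length
    return pages, sg_list[0][0], total


def fast_register(key, sg_list):
    """Program the page list (modelled as a return value) and describe the result."""
    pages, iova, total = sg_to_pages(sg_list)
    return pages, Sge(key=key, addr=iova, length=total)
```

For example, three scattered physical pages become one virtually contiguous buffer: `fast_register(0x10, [(0x1200, 3584), (0x8000, 4096), (0x3000, 100)])` yields the page list `[0x1000, 0x8000, 0x3000]` and a single descriptor of length 7780.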
Why Should We Care About RDMA?
- Because Faster Storage Needs a Faster Network (not only in HPC)!!!
Variety of RDMA Storage Protocols
Protocol Deep Dive: NVMe/NVMe-oF
- Share NVMe SSDs with multiple servers
  - Better utilization, capacity, rack space, power
  - Scalability, management
- NVMe over Fabrics standard
  - Version 1.0 completed in June 2016
  - High-performance access to remote SSD (not only SSD)
  - RDMA protocol is part of the standard (e.g. keyed SGLs)
  - Also FC and TCP (in progress)
NVMe-oF Exchange Model
NVMe and NVMe-oF/RDMA Fit Together Well
Example: NVMe-oF Protocol (Write)
- Host
  - Register memory (get MR)
  - Post SEND carrying a Command Capsule (CC) that contains the SQE (Submission Queue Entry) and a keyed SGL
- Subsystem
  - Upon RCV: allocate memory for data, post RDMA READ to fetch data
  - Upon READ completion: post command to backing store
  - Upon SSD completion: send NVMe-oF Response Capsule (RC), free memory
  - Upon SEND completion: free CC and completion resources
- Sequence diagram labels:
  - NVMe Initiator: Post Send (CC), Free send buffer, Free data buffer
  - RNIC to RNIC: Send Command Capsule, Ack, RDMA Read, Read response first, Read response last, Send Response Capsule, Ack
  - NVMe Target: Post Send (Read data), Post Send (RC), Allocate memory for data, Register to the RNIC, Post NVMe command, Wait for completion, Free allocated memory, Free receive buffer, Free send buffer
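The write exchange can be walked through with a small simulation. This is a sketch of the ordering of steps only: `Host`, `Target`, the capsule dictionary layout, and the trace labels are hypothetical, and a Python dictionary stands in for the backing SSD.

```python
class Target:
    """Toy NVMe-oF subsystem: records each step it takes for a write."""

    def __init__(self):
        self.store = {}  # lba -> bytes, stands in for the backing store
        self.trace = []

    def handle_write(self, capsule, rdma_read):
        # Upon RCV: allocate memory for the data.
        staging = bytearray(capsule["length"])
        self.trace.append("alloc")
        # Post RDMA READ to fetch the data described by the keyed SGL.
        staging[:] = rdma_read(capsule["key"], capsule["addr"], capsule["length"])
        self.trace.append("rdma_read")
        # Upon READ completion: post the command to the backing store.
        self.store[capsule["lba"]] = bytes(staging)
        self.trace.append("ssd_write")
        # Upon SSD completion: send the response capsule, then free memory.
        self.trace.append("send_rc")
        del staging
        self.trace.append("free")
        return {"cid": capsule["cid"], "status": 0}


class Host:
    """Toy NVMe-oF initiator."""

    def __init__(self):
        self.mrs = {}  # key -> registered buffer
        self.next_key = 0x100

    def register(self, buf):
        key = self.next_key
        self.next_key += 1
        self.mrs[key] = buf
        return key

    def rdma_read(self, key, addr, length):
        """Models the host RNIC serving the target's RDMA READ (no host CPU work)."""
        return bytes(self.mrs[key][addr:addr + length])

    def write(self, target, lba, data):
        key = self.register(data)                    # register memory (get MR)
        capsule = {"cid": 1, "opcode": "write", "lba": lba,
                   "key": key, "addr": 0, "length": len(data)}
        response = target.handle_write(capsule, self.rdma_read)  # SEND the CC
        del self.mrs[key]                            # release the MR after the RC
        return response
```

Calling `Host().write(Target(), 7, b"payload")` leaves the payload in the target's store and produces the trace `alloc, rdma_read, ssd_write, send_rc, free`, matching the subsystem steps on the slide.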
Example: NVMe-oF Protocol (Read)
- Host
  - Register memory (get MR)
  - Post SEND carrying a Command Capsule (CC) that contains the SQE (Submission Queue Entry) and a keyed SGL
- Subsystem
  - Upon RCV: allocate memory for data, post command to backing store
  - Upon SSD completion: post RDMA WRITE to write data back to host, send NVMe-oF Response Capsule (RC)
  - Upon SEND completion: free memory, free CC and completion resources
- Sequence diagram labels:
  - NVMe Initiator: Post Send (CC), Free send buffer
  - RNIC to RNIC: Send Command Capsule, Ack, Write first, Write last, Ack, Send Response Capsule, Ack
  - NVMe Target: Post Send (Write data), Post Send (RC), Post NVMe command, Wait for completion, Free receive buffer, Free allocated buffer, Free send buffer
Example: NVMe-oF Protocol (Write In-Capsule)
- Host
  - Post SEND carrying a Command Capsule (CC) that contains the SQE (Submission Queue Entry) and the data
  - Useful for small IO (currently up to 4k)
- Subsystem
  - Upon RCV: allocate memory for data
  - Upon SSD completion: send NVMe-oF Response Capsule (RC), free memory
  - Upon SEND completion: free RC and completion resources
- Sequence diagram labels:
  - NVMe Initiator: Post Send (CC), Free send buffer
  - RNIC to RNIC: Send Command Capsule, Ack, Send Response Capsule, Ack
  - NVMe Target: Post Send (RC), Post NVMe command, Wait for completion, Free receive buffer, Free send buffer
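The choice between in-capsule data and a keyed SGL can be sketched as a size check on the host side. The 4096-byte threshold follows the slide's "up to 4k"; `build_write_capsule`, the dictionary layout, and the `register_mr` callback are hypothetical names used for illustration.

```python
IN_CAPSULE_MAX = 4096  # the slide's "currently up to 4k"


def build_write_capsule(cid, lba, data, register_mr, in_capsule_max=IN_CAPSULE_MAX):
    """Build a write Command Capsule.

    Small payloads travel inside the capsule, so no memory registration and
    no RDMA READ round trip are needed. Larger payloads are registered and
    described by a keyed SGL that the target fetches with RDMA READ.
    """
    sqe = {"cid": cid, "opcode": "write", "lba": lba, "length": len(data)}
    if len(data) <= in_capsule_max:
        return {"sqe": sqe, "data": bytes(data)}
    key = register_mr(data)
    return {"sqe": sqe, "sgl": {"key": key, "addr": 0, "length": len(data)}}
```

A 512-byte write therefore costs one SEND in each direction, while an 8 KiB write additionally pays for registration and the RDMA READ.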
Challenges?!
- Performance
  - Same as DAS
- Reduce memory footprint
  - Share resources
- Scale
  - Data is growing
  - We must have an ultra-fast network
- Save $$$
  - Build systems with cheaper CPU/HW
- Save CPU cycles
  - Offload data path by HW
- High availability
  - Multipathing
NVMe-oF/RDMA has Great Performance!
Can we do better? Yes we can!!
- Currently WIP in Linux
- Interrupt/completion moderation (AKA coalescing):
  - A technique in which events that would normally trigger a HW interrupt are held back, either until a certain amount of work is pending or until a timeout timer triggers
- Register non-contiguous buffers using an indirect MR
  - The user can provide an iovec where each entry has its own length
  - We can't assume user buffers consist of full pages
  - We don't want the block layer to use bounce buffers: save CPU cycles
  - Use HW that supports indirection in the MM table
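The completion moderation rule described above (hold back until enough work is pending or a timeout expires) can be sketched as a small state machine. `CompletionModerator` and its `max_count` and `max_usecs` parameters are hypothetical names; real devices and drivers expose their own knobs.

```python
class CompletionModerator:
    """Fire one interrupt per batch of completions, or when the oldest one times out."""

    def __init__(self, max_count, max_usecs):
        self.max_count = max_count  # fire once this many completions are pending
        self.max_usecs = max_usecs  # or once the oldest has waited this long
        self.pending = 0
        self.first_ts = None
        self.interrupts = 0

    def on_completion(self, now_usecs):
        """Record one completion. Returns True if an interrupt fires now."""
        if self.pending == 0:
            self.first_ts = now_usecs  # start the timeout window
        self.pending += 1
        if self.pending >= self.max_count:
            return self._fire()
        return False

    def on_timer(self, now_usecs):
        """Periodic check. Fires if the oldest pending completion waited too long."""
        if self.pending and now_usecs - self.first_ts >= self.max_usecs:
            return self._fire()
        return False

    def _fire(self):
        self.interrupts += 1
        self.pending = 0
        self.first_ts = None
        return True
```

With `max_count=4`, four back-to-back completions cost one interrupt instead of four; the timer bounds the latency of a lone completion that never fills a batch.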
- ConnectX-4 (and above) devices support indirection
- Implemented in iSER
- SRP/NVMe-oF patches submitted
- Use IB_MR_TYPE_SG_GAPS
- Please try it!!
Reducing Memory Footprint by Using SRQs
- SRQ stands for Shared Receive Queue
- QPs/connections are cheap, receive buffers are not!
- Solution: share receive buffering resources between QPs
  - According to the parallelism required by the application
  - Locality of completions, scalability
- The NVMe-oF implementation today uses 1 SRQ per HCA
  - Lock contention in the data path
  - No parallelism
  - Better to use an SRQ per core or per completion vector (MSI-X)
- We have submitted patches to fix performance in Linux, please try!
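A back-of-the-envelope calculation shows why receive buffers dominate and how per-core SRQs bound them. The figures below (1000 connections, queue depth 128, 4 KiB buffers, 16 cores, SRQ depth 1024) are illustrative assumptions, not measurements, and both helper functions are hypothetical.

```python
def recv_buffer_bytes(num_queues, depth, buf_size):
    """Memory pinned for receive buffers across num_queues receive queues."""
    return num_queues * depth * buf_size


def srq_for_vector(comp_vector, num_srqs):
    """Pick the SRQ for a QP from its completion vector, keeping completions local."""
    return comp_vector % num_srqs


# One private receive queue per connection: grows with the number of QPs.
per_qp = recv_buffer_bytes(num_queues=1000, depth=128, buf_size=4096)

# One SRQ per core: grows with the number of cores, not with connections.
per_core_srq = recv_buffer_bytes(num_queues=16, depth=1024, buf_size=4096)

print(per_qp // 2**20, "MiB vs", per_core_srq // 2**20, "MiB")
```

Under these assumptions the private queues pin 500 MiB while 16 per-core SRQs pin 64 MiB, and each core still takes its own SRQ lock rather than contending on a single per-HCA queue.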
Save CPU by Using NVMe-oF Target Offload
- NVMe-oF is built on top of RDMA
  - Transport communication in hardware
- NVMe-oF target offload enables the NVMe hosts to access the remote NVMe devices w/o any CPU processing
  - By offloading the entire NVMe-oF data path
  - Encap/decap NVMe-oF <-> NVMe is done by the adapter with 0% CPU
  - CPU is available for other applications
  - Easy configuration: echo 1 >.../subsystems/<subsys>/attr_offload
  - Admin operations are maintained in software
- IOPs with 0% CPU (512B IO read)
  - ConnectX-5: 1.0-1.2 MIOPs
  - BlueField SoC: 7.5 MIOPs
- Upstream submission TBD
  - Currently available in the MLNX_OFED package
  - Linux fork is available: https://github.com/mellanox/nvmeof-p2p/
- Save $$$: NVMe-oF target systems can use cheaper CPUs
- Diagram labels: Host root complex and memory subsystem, NVMe, IO, NVMe over Fabrics target offload, RDMA transport, RNIC, Network, Admin
NVMe-oF Target non-offload data path
NVMe-oF Target offload data path
RDMA block-based storage protocols in Linux
Feature (columns: NVMe-oF / iSER / SRP). Rows with fewer than three values had empty cells on the slide; their values are listed in slide order.
- Fast memory registration: V / V / V
- Indirect memory registration: WIP / V / WIP
- SRQ: V, V
- SRQ per core: WIP
- Remote Mkey invalidation: V, V
- Block MQ: V, V
- RoCE support: V / V / WIP
- User space tools: nvme-cli/nvmetcli / iscsiadm/targetcli / srp_daemon/targetcli
- High availability: dm-multipath/nvme-multipath / dm-multipath / dm-multipath
- T10-PI: V
- User space open source target: SPDK, TGT
Thanks!
maxg@mellanox.com