FMS18 Invited Session 101-B1 Hardware Acceleration Techniques for NVMe-over-Fabric


Transcription:

Flash Memory Summit 2018 Santa Clara, CA. FMS18 Invited Session 101-B1: Hardware Acceleration Techniques for NVMe-over-Fabric. Paper Abstract: The move from direct-attach to Composable Infrastructure is being driven by large datacenters seeking increased business agility combined with lower TCO. This requires new remote-attached storage solutions which can deliver extremely high data rates with minimal latency overhead. Unfortunately, the industry-standard embedded processors used in controllers aren't fast enough to manage complex protocols at the required speed. For example, they cannot keep up with the work required to access NVMe SSDs efficiently over an NVMe-oF networked infrastructure. The solution is to add accelerators, typically built using an ASIC, FPGA, or other high-speed hardware. These accelerators offload the processing of protocols such as RDMA, TCP, and NVMe. The result is essentially the same performance for remote storage accessed over a network as for direct-attached storage. The combination provides an optimal blend of high performance, low power, and low cost to yield tremendous CAPEX and OPEX savings in the next-generation datacenter. The technology enables virtually limitless scalability, and will drive dramatically lower TCO for hyperscale and as-a-service datacenter applications.

Speaker Biography: Bryan Cowger, with over 25 years of storage industry experience, is VP of Sales/Marketing at Kazan Networks, a startup developing ASICs that target new ways of attaching and accessing flash storage in enterprise and hyperscale datacenters. Kazan Networks' products utilize emerging technologies such as NVMe and NVMe-oF. Bryan has spent his career defining and bringing to market successful high-performance storage networking ASICs for protocols such as Fibre Channel, SAS, SATA, Ethernet, PCIe, and NVMe. He has been awarded four patents in the area of storage controller architecture. Before joining Kazan Networks, he was VP of Sales/Marketing and Co-Founder at Sierra Logic, a developer of SATA-to-Fibre Channel controllers. He also spent over 10 years as a design engineer at Hewlett-Packard and Agilent Technologies. He holds a BS in Electrical Engineering from UC San Diego.

Hardware Acceleration of Storage for Composable Infrastructure Bryan Cowger VP, Kazan Networks

Agenda:
- What is Composable Infrastructure?
- Challenges for CI using existing technologies
- NVMe over Fabrics overview
- Sidebar example: metal working machinery (?!)
- Implementation example: network TCP engine
- Architectural comparisons of available solutions
- Practical examples and results/benefits of HW acceleration

Today's Shared Nothing Model, a.k.a. DAS: each server bundles CPU, memory, etc. with its own dedicated storage (HDDs -> SSDs). Challenges:
- Forces the up-front decision of how much storage to devote to each server.
- Locks in the compute:storage ratio.

Shared Nothing Model, Option A: One Model Serves All Apps. App A needs 1 SSD, App B needs 2 SSDs, App C needs 3 SSDs, yet every server gets the same configuration; the unused drives sit idle as "dark flash." Net utilization: 6 SSDs used out of 12 = 50%.

Shared Nothing Model, Option B: Specialized Server Configurations. Each server is provisioned for its app (App A: 1 SSD, App B: 2 SSDs, App C: 3 SSDs). Dark flash is eliminated, but this limits flexibility for future app deployments.

Disaggregated Datacenter: a pool of CPUs connected to a pool of storage.

The Composable Datacenter: each app draws from a pool of CPUs and a pool of storage (App A: 1 SSD, App B: 2 SSDs, App C: 3 SSDs), with spare SSDs held in a spares/expansion pool. Minimize dark flash: buy SSDs only as needed, power them only as needed.

The Composable Datacenter: Real Savings? Example: 100k servers and 1M SSDs, assuming a 40% increase in storage utilization (a back-of-envelope check appears in the sketch below).
- CapEx: 1M SSDs * 60% = 600k SSDs, a savings of 400k SSDs; at $300 per SSD, that is $120M in savings.
- OpEx: not powering 400k SSDs, assuming ~10W per SSD and $0.10 per kWh, saves ~$10k per day, or ~$3.5M per year.
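The slide's numbers check out; the following is a minimal, purely illustrative C sketch that reproduces the CapEx and OpEx arithmetic using only the assumptions stated on the slide (1M SSDs, 40% utilization gain, $300/SSD, 10 W/SSD, $0.10/kWh).

```c
/* Back-of-envelope model of the slide's savings example.
   All constants come from the slide; the code is only a sketch. */
#include <stdio.h>

int main(void) {
    const double total_ssds      = 1e6;    /* 1M SSDs across 100k servers   */
    const double fraction_needed = 0.60;   /* 40% utilization improvement   */
    const double cost_per_ssd    = 300.0;  /* USD                           */
    const double watts_per_ssd   = 10.0;
    const double usd_per_kwh     = 0.10;

    double ssds_saved   = total_ssds * (1.0 - fraction_needed);        /* 400k   */
    double capex_saved  = ssds_saved * cost_per_ssd;                   /* $120M  */
    double kwh_per_day  = ssds_saved * watts_per_ssd * 24.0 / 1000.0;  /* 96 MWh */
    double opex_per_day = kwh_per_day * usd_per_kwh;                   /* ~$9.6k */
    double opex_per_yr  = opex_per_day * 365.0;                        /* ~$3.5M */

    printf("SSDs saved:      %.0f\n", ssds_saved);
    printf("CapEx saved:     $%.0fM\n", capex_saved / 1e6);
    printf("OpEx saved/day:  $%.0fk\n", opex_per_day / 1e3);
    printf("OpEx saved/year: $%.1fM\n", opex_per_yr / 1e6);
    return 0;
}
```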

The Composable Datacenter, Based on NVMe-oF: the software-defined datacenter ("infrastructure as code") is the primary application for NVMe-oF. The fabric could be Ethernet, Fibre Channel, or InfiniBand; next-gen hyperscalers are focusing on Ethernet.

Ethernet Roadmap: chart of Ethernet link rate (Mb/s, roughly 1 to 1,000,000 on a log scale) and 4kB packets per second (up to ~14 million) versus year of initial standard completion, 1980 through 2020, with both curves climbing steeply.

The Processor Challenge

Summary of Challenges: Ethernet speeds are going up; embedded processing capabilities are plateauing; compute and storage are disaggregating while storage media latencies are decreasing. One solution: change how networking controllers are built by basing them on more specialized, dedicated hardware.

Why Dedicated Hardware? A metal-stamping machine analogy: a programmable machine versus progressive stamping.

Two Machines: Side-by-Side

               Programmable            Progressive stamping
Versatility:   Programmable            Hard-coded
Bandwidth:     12 parts / 8 minutes    1 part / 2 seconds
Latency:       8 minutes               20 seconds
Size:          18 x 17                 6 x 3
Weight:        14 tons                 2 tons
Power:         9 kW                    2 kW

Typical NVMe-oF Deployment: server(s) connect across an Ethernet fabric (NVMe over Fabrics) to a JBOF/EBOF/FBOF, where a fabric-to-PCIe bridge fans out to the SSDs, which are accessed as NVMe (over PCIe).

Networking Controller: Typical Architecture of the Inbound Path. Inbound packets are checked: no errors? expected? in-order? If yes, they take fast-path processing, which can be SW/FW or dedicated HW; if not, they fall to slow-path processing, which is typically SW/FW-based.

Inbound TCP Control Engine Example: a TCP network control block data structure is used to track the status of each TCP connection. Each inbound frame header must be compared against 320 bits of control information, the fast-path / slow-path decision must be made, and various fields must be updated and written back.

Inbound TCP Parsing Example.
Software-based algorithm:
- Loop 5 times: read in 64 bits from the header (1 clock); read in 64 bits from the data structure (1 clock if in TCM, 10+ clocks if in DRAM); make a decision on multiple fields (2 or more clocks).
- Once the entire data structure / header is processed, write back the necessary information (5+ clocks).
- Take action to move the header information to the next part of protocol processing, e.g. NVMe (5+ clocks).
- Total: 30-50 clock cycles (2x higher if a 32-bit processor).
Hardware-based algorithm:
- Load 320 bits twice (1 clock).
- Make a decision on multiple fields (1 clock).
- Update the data structure (1 clock).
- Take action to move the header information to the next part of protocol processing, e.g. NVMe (1 clock).
- Total: 4 clock cycles.
(A software-side sketch of the fast-path check appears below.)
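To make the software side of this comparison concrete, here is a rough, hypothetical C sketch of the kind of per-field fast-path check involved; the field names and layout are invented for illustration and are not the actual control-block format. The point is that a processor evaluates these comparisons a few fields at a time, while the dedicated hardware described on the next slide evaluates the equivalent predicate in a single wide compare.

```c
/* Illustrative sketch (not actual firmware) of the per-field checks a
   software fast-path decision makes against a ~320-bit TCP control block. */
#include <stdbool.h>
#include <stdint.h>

struct tcp_ctrl_block {            /* hypothetical per-connection state */
    uint32_t expected_seq;         /* next in-order sequence number     */
    uint32_t expected_ack;
    uint16_t local_port, remote_port;
    uint32_t local_ip, remote_ip;
    uint16_t window;
    /* ... remaining tracked fields ... */
};

struct tcp_hdr_fields {            /* fields parsed from the 40-byte header */
    uint32_t seq, ack;
    uint16_t src_port, dst_port;
    uint32_t src_ip, dst_ip;
    uint8_t  flags;
};

/* Returns true if the frame may take the fast path.  Each comparison is
   one of the decisions the slide counts toward the 30-50 clocks; the
   hardware's wide comparator performs the whole predicate in one cycle. */
static bool fast_path_ok(const struct tcp_ctrl_block *cb,
                         const struct tcp_hdr_fields *h)
{
    return h->src_port == cb->remote_port  &&
           h->dst_port == cb->local_port   &&
           h->src_ip   == cb->remote_ip    &&
           h->dst_ip   == cb->local_ip     &&
           h->seq      == cb->expected_seq &&
           (h->flags & 0x3F) == 0x10;       /* plain ACK: no SYN/FIN/RST/URG */
}

int main(void)
{
    struct tcp_ctrl_block cb = { .expected_seq = 1000, .expected_ack = 0,
        .local_port = 4420, .remote_port = 50000,
        .local_ip = 0x0A000001, .remote_ip = 0x0A000002, .window = 65535 };
    struct tcp_hdr_fields h = { .seq = 1000, .ack = 0,
        .src_port = 50000, .dst_port = 4420,
        .src_ip = 0x0A000002, .dst_ip = 0x0A000001, .flags = 0x10 };
    return fast_path_ok(&cb, &h) ? 0 : 1;   /* 0 = frame takes the fast path */
}
```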

TCP Header Processing Hardware: the 40 bytes of header and 40 bytes of the data structure feed a 320-bit comparator ("are these equal?"), which drives the fast-path vs. slow-path decision.

More Complex Decisions: the same structure generalizes. X bytes of header and X bytes of data structure feed an n-bit comparator and an m-bit comparator, with additional p-bit and q-bit fields and enable logic combining into the final decision.

HW-based Finite State Machines (FSMs): essentially bespoke very-long-instruction-word processors. From Wikipedia: "Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units (CPU, processor) mostly allow programs to specify instructions to execute in sequence only, a VLIW processor allows programs to explicitly specify instructions to execute in parallel. This design is intended to allow higher performance without the complexity inherent in some other designs." The key difference: hard-coded state transitions instead of instruction-set-based execution (see the sketch below).
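Since the real logic is RTL, the following is only a conceptual sketch in C of what "hard-coded state transitions" means; the state and event names are invented. The point is that every transition is fixed in the structure itself rather than fetched and decoded from an instruction stream, which is why each step of the hardware FSM can complete in a single clock.

```c
#include <stdio.h>

/* Toy receive-side FSM with invented states/events.  In the ASIC, the
   equivalent transitions are fixed in logic, so every "step" completes
   in one clock cycle with no instruction fetch or decode. */
enum rx_state { RX_IDLE, RX_HEADER, RX_PAYLOAD, RX_ERROR };
enum rx_event { EV_SOF, EV_HDR_OK, EV_HDR_BAD, EV_EOF };

static enum rx_state step(enum rx_state s, enum rx_event e)
{
    switch (s) {
    case RX_IDLE:    return (e == EV_SOF)     ? RX_HEADER  : RX_IDLE;
    case RX_HEADER:  return (e == EV_HDR_OK)  ? RX_PAYLOAD :
                            (e == EV_HDR_BAD) ? RX_ERROR   : RX_HEADER;
    case RX_PAYLOAD: return (e == EV_EOF)     ? RX_IDLE    : RX_PAYLOAD;
    case RX_ERROR:   return RX_IDLE;           /* drop frame, resync */
    }
    return RX_ERROR;
}

int main(void)
{
    /* One well-formed frame: start, good header, end. */
    enum rx_event frame[] = { EV_SOF, EV_HDR_OK, EV_EOF };
    enum rx_state s = RX_IDLE;
    for (unsigned i = 0; i < sizeof frame / sizeof frame[0]; i++)
        s = step(s, frame[i]);
    printf("final state: %d (0 = RX_IDLE)\n", s);
    return 0;
}
```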

Typical NVMe-oF Deployment (slide repeated): server(s) -> Ethernet fabric (NVMe over Fabrics) -> JBOF/EBOF/FBOF, where the fabric-to-PCIe bridge fans out to the SSDs over NVMe (over PCIe).

NVMe-oF Bridge Architecture: on the receive side the path runs from the Ethernet MAC through an Rx FIFO, Rx packet buffer, and Rx parser/router into the Transport Engine (TE), RDMA Engine (RE), and NVMe Engine (NE); DMA engines move capsules, SQEs, and CQEs through the TFC Main Memory (TMM), which holds data pages, SQs, and CQs behind a doorbell interface, the PCIe Abstraction Layer (AL), and the PCIe Root Complex (PRC); a Tx framer and Tx FIFO feed the MAC on the transmit side. Approximately 150 FSMs run in parallel in hardware, some simple (e.g. buffer management), some quite complex (RoCE v1/v2, iWARP/TCP, TCP, NVMe, NVMe-oF). The slow path is implemented in FW on small embedded µPs; the slide's second pass over this diagram adds a callout noting "3.5% of die."

Purely SW Architecture: a PCIe Root Complex (PRC) and PCIe access layer, external DRAM with DRAM I/O control, banks of Arm cores, and Rx/Tx FIFOs in front of the Ethernet MAC; all protocol processing runs in software on the Arm cores.

NVMe-oF Options:
- High-end x86 server-based: up to 3.3 GHz, 200W+ per 100Gb.
- Mid-range network processors (e.g. banks of Arm A72 cores): up to 3.0 GHz, 15-20W per 100Gb.
- Low-power, low-cost dedicated hardware: ~0.5 GHz, 7W per 100Gb.
(The slide shows the Arm-core-based and FSM-based bridge block diagrams from the previous slides side by side.)

A Spectrum of Options: myriad choices for NVMe-oF / Composable Infrastructure target deployments.

               Storage Servers   Network Processors   Dedicated Bridge
Versatility    High              Medium               Low
Cost           High              Medium               Low
Power          High              Medium               Low
Footprint      High              Medium               Low

Composable Infrastructure at Hyperscale: units of compute (processor, memory, I/O via RDMA or TCP/IP) and units of storage (SSDs, fanout, I/O via RDMA or TCP/IP), each independently manageable.

Composable Infrastructure at Hyperscale: "We need to move the standard unit of compute from the server to the rack; we need to go from local optimization to global optimization. We can deliver higher performance and utilization through the pooling of resources, so by disaggregating... independent server and storage systems, the... capacity can be more finely allocated to the application." - Diane Bryant, Intel, IDF16. The slide connects cost-optimized compute ("infrastructure as code") to cost-optimized storage over low-latency Ethernet, with storage utilization > 80%.

Industry Examples

Industry Example: Tachyon. A Fibre Channel controller family first introduced in the 1990s. Entirely HW FSM-based, with no embedded processors: the high-level protocol (SCSI) and complex algorithms (e.g. FC-AL initialization) were implemented in hardware. Generated ~$1B in revenue during the family's ~20-year lifecycle.

Industry Example: Fuji + Optane - Latency. Test setup: a Linux host running FIO. In DAS mode, FIO drives the Optane SSD directly through the native NVMe driver. In disaggregated mode, the path is FIO -> NVMe driver -> NVMe-oF host driver -> RNIC driver -> 100G RNIC -> 100G Ethernet switch -> 1x100G into the Fuji ASIC -> PCIe x16 -> PCIe switch (0.5 usec) -> Optane SSD, with NVMe over Fabrics on the wire and NVMe (over PCIe) to the drive.

Read latency (usec):
I/O size                512B     4kB
Native                  8.03     8.98
NVMe-oF                 12.20    13.83
NVMe-oF incremental     4.18     4.85

Industry Example: Fuji + Optane - IOPS / Bandwidth. Same DAS vs. disaggregated setup, with multiple Optane SSDs behind the Fuji ASIC and PCIe switch.

                      Native         NVMe-oF
4kB random reads      571k IOPS      571k IOPS
4kB random writes     546k IOPS      543k IOPS
128kB seq reads       2.58 GB/s      2.28 GB/s
128kB seq writes      2.16 GB/s      2.18 GB/s

Takeaways:
- Multiple solutions are being delivered to the market this year; the decision to make is versatility vs. optimized hardware.
- Composable Infrastructure is nearing reality, with no need to sacrifice performance while reaping the upsides.
- 2019 will be the year of initial NVMe-oF deployments.

Q & A

Thank You!