Accelerating Data Centers Using NVMe and CUDA

Stephen Bates, PhD, Technical Director, CSTO, PMC-Sierra

Project Donard @ PMC-Sierra
- Donard is a PMC CTO project that leverages NVM Express (NVMe) to accelerate data center applications.
- It uses NVMe and (Remote) Direct Memory Access to enable the Trifecta of compute, network and storage PCIe devices.
- It builds on work done by Nvidia, Mellanox and others.

The Problem: x86 Status Quo
- Low parallelism leads to CPU, memory and fabric saturation: <10K IOPS per CPU(1) against ~1M IOPS per flash device and ~2M IOPS per 40G link.
- Fast Data workloads requiring deep parallelism are not efficient on x86, and the fabric becomes overloaded.
(1) Assuming computation per IOP is high (e.g. image search, encryption, audio processing).

The Solution: Donard
- Pre-process algorithms in the data path on an FPGA or GPU and re-balance the system: a >99% filter rate drops the IOPS seen by the host to 10K(1).
- The host is left with simple control and management, and the fabric load is offloaded.
(1) Assuming computation per IOP is high (e.g. image search, encryption, audio processing).

The PCIe Trifecta
We want to enable the Trifecta on a PCIe fabric. This work leverages NVMe to enable two edges of the Trifecta; others have worked on the other arms (see next slide).
[Diagram: a PCIe fabric connecting the host CPU (e.g. Intel, ARM, Power 8) and host DRAM with storage (e.g. NVMe SSD), compute (e.g. GPU card, FPGA card or CPU card) and networking (e.g. RDMA-capable NIC) attached to the network.]

The PCIe Trifecta
[Diagram: the Trifecta of Storage (e.g. NVMe SSD), Compute (e.g. GPU card, FPGA card or CPU card) and Networking (e.g. RDMA-capable NIC), with its edges labeled as this work, work done by others (e.g. GPUDirect), and future work.]

Why NVM Express?
- NVM Express (NVMe) provides a consistent, open-source interface to PCIe storage devices.
- An NVMe driver has been part of both Linux and Windows Server for some time now.
- NVMe is extensible and is a focus of optimization within kernel.org(1).
- Using NVMe makes Donard much more scalable and amenable to community development; the work in this paper should apply to any NVMe-compliant drive.
(1) http://kernelnewbies.org/linux_3.13#head-3e5f0c2bcebc98efd197e3036dd814eadd62839c
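To make the "consistent interface" point concrete, here is a minimal user-space sketch that issues a standard Identify Controller command through the stock Linux NVMe driver's admin passthrough ioctl (no Donard modifications involved). It assumes a controller at /dev/nvme0 and a kernel that ships <linux/nvme_ioctl.h>; error handling is kept to the bare minimum.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    int fd = open("/dev/nvme0", O_RDWR);      /* NVMe controller character device */
    if (fd < 0) { perror("open"); return 1; }

    unsigned char id[4096];                   /* Identify data is one 4 KiB page */
    struct nvme_admin_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x06;                      /* Identify */
    cmd.addr     = (unsigned long long)(uintptr_t)id;
    cmd.data_len = sizeof(id);
    cmd.cdw10    = 1;                         /* CNS = 1: Identify Controller */

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
        perror("NVME_IOCTL_ADMIN_CMD");
        close(fd);
        return 1;
    }
    printf("Model: %.40s\n", id + 24);        /* MN field, bytes 24-63 */
    close(fd);
    return 0;
}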

Storage->Compute
- Built a Linux server running kernel 3.13 with an NVMe SSD and an Nvidia Tesla K20c installed.
- Modified the NVMe module in the kernel to add a new IOCTL that DMAs data between the SSD and the GPU card, using the CUDA 6.0 Peer-to-Peer (P2P) APIs to enable the DMA (a user-space sketch of this path follows below).
- Measured the impact of the new IOCTL on bandwidth and host DRAM utilization:

Technique      Bandwidth(1) (GB/s)   DRAM Utilization(2)
Classical      1.9                   5230.0
Donard (DMA)   2.5                   1.0

(1) Bandwidth was measured on our server, which had a very standard PCIe fabric, using a total transfer size of GB. Tests were run 10 times. Results may vary depending on your PCIe architecture.
(2) DRAM utilization estimated using the page fault counters in the x86 CPU, normalized to Donard performance.
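The new IOCTL above lives only in PMC's modified NVMe module, so the following user-space sketch is illustrative rather than authoritative: the ioctl name NVME_IOCTL_DONARD_READ, its argument struct and the magic numbers are assumptions standing in for that interface. The CUDA side uses the ordinary runtime API, with the modified driver assumed to pin and translate the GPU buffer via the CUDA 6.0 P2P / GPUDirect machinery.

#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

/* Assumed request layout for the modified driver (hypothetical). */
struct donard_gpu_read {
    unsigned long long slba;      /* starting LBA on the NVMe namespace */
    unsigned int       nblocks;   /* number of 512 B logical blocks     */
    void              *gpu_addr;  /* CUDA device pointer (DMA target)   */
};
#define NVME_IOCTL_DONARD_READ _IOW('N', 0x60, struct donard_gpu_read)

int main(void)
{
    const size_t len = 1 << 20;               /* 1 MiB transfer */
    void *dptr = NULL;
    cudaMalloc(&dptr, len);                   /* GPU buffer; the modified driver is
                                                 assumed to pin it for P2P DMA */

    int fd = open("/dev/nvme0n1", O_RDWR);
    struct donard_gpu_read req;
    req.slba     = 0;
    req.nblocks  = (unsigned int)(len / 512);
    req.gpu_addr = dptr;

    if (ioctl(fd, NVME_IOCTL_DONARD_READ, &req) < 0)
        perror("donard read");                /* on success, data lands in GPU DRAM
                                                 without touching host DRAM */
    close(fd);
    cudaFree(dptr);
    return 0;
}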

Compute->Storage
- Similar IOCTL as on the previous slide, with the direction reversed (see the sketch after this slide).
- Had to resolve the file extents issue. Several options:
  1. Overwrite an existing file.
  2. Create a file using existing OS constructs (slow).
  3. Modify the file's stat properties to allow uninitialized extents (a possible security issue)(1).
- Measured the impact of the new IOCTL on bandwidth and host DRAM utilization:

Technique      Bandwidth(2) (GB/s)   DRAM Utilization(3)
Classical      1.51                  6012.0
Donard (DMA)   0.65                  1.0

- We are still trying to determine why the DMA method is slower; we suspect an issue with the PCIe architecture of the server.

(1) https://lwn.net/Articles/492959/
(2) Bandwidth was measured on our server, which had a very standard PCIe fabric, using a total transfer size of GB. Tests were run 10 times. Results may vary depending on your PCIe architecture.
(3) DRAM utilization estimated using the page fault counters in the x86 CPU, normalized to Donard performance.
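For the reverse direction, the sketch below mirrors the read-path example with a hypothetical NVME_IOCTL_DONARD_WRITE; again the names and numbers are assumptions, not the actual Donard interface. It follows option 1 from the list above: the destination is an existing file whose extents are already allocated and written, and whose on-disk block addresses would be resolved (for example with the FIEMAP ioctl) before being handed to the driver.

#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

/* Assumed request layout, mirroring the read path (hypothetical). */
struct donard_gpu_write {
    unsigned long long slba;      /* destination LBA, resolved from the target
                                     file's extents (e.g. via FIEMAP)          */
    unsigned int       nblocks;   /* number of 512 B logical blocks            */
    void              *gpu_addr;  /* CUDA device pointer (DMA source)          */
};
#define NVME_IOCTL_DONARD_WRITE _IOW('N', 0x61, struct donard_gpu_write)

int main(void)
{
    const size_t len = 1 << 20;
    void *dptr = NULL;
    cudaMalloc(&dptr, len);
    /* ... a GPU kernel fills dptr with results to be persisted ... */

    int fd = open("/dev/nvme0n1", O_RDWR);
    struct donard_gpu_write req;
    req.slba     = 4096;                      /* placeholder: LBA of the existing
                                                 file's first extent            */
    req.nblocks  = (unsigned int)(len / 512);
    req.gpu_addr = dptr;

    if (ioctl(fd, NVME_IOCTL_DONARD_WRITE, &req) < 0)
        perror("donard write");
    close(fd);
    cudaFree(dptr);
    return 0;
}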

Donard in the Data Center
- Data centers (DCs) are deploying flash in a direct-attach model to aid application acceleration.
- Although SATA is prevalent today, many DCs see a shift to PCIe-attached flash and like the idea of using NVMe.
- Some DC customers are offloading applications to heterogeneous compute platforms such as GPU cards and FPGA cards(1). Donard assists in this offload and also reduces the burden on the host processor and host DRAM.
(1) A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, 41st Annual International Symposium on Computer Architecture (ISCA), June 2014.

Donard DC Application - Haystack
- We wrote a program to search for the PMC logo in a large (10,000+) image database.
- Performance improved as we migrated to DMA on an SSD+GPU compared to a traditional solution. Note that it also moves the bottleneck from the host DRAM interface to the GPU.
- Other applications might include sorting and write caching.

          HDD (Mpix/s)   SSD (Mpix/s)   Bottleneck
CPU       77.0           122.8          CPU
CUDA      95.1           312.5          DRAM
DONARD    N/A            534.2          GPU
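The slides do not detail the logo-search kernel itself, so the CUDA sketch below is only a simplified stand-in for the GPU filter stage: each thread scores one pre-tiled candidate window against a small template by sum of absolute differences, and the host would then keep only the rare strong matches (the >99% filter rate from earlier slides). All names and sizes are illustrative; in the Donard path the window buffer would be filled directly by the SSD-to-GPU DMA.

#include <cuda_runtime.h>
#include <stdio.h>

#define TPL 16                                 /* 16x16 grayscale template */

__constant__ unsigned char d_tpl[TPL * TPL];   /* logo template in constant memory */

/* One thread scores one pre-tiled candidate window by sum of absolute
 * differences against the template; a low score suggests a logo match. */
__global__ void score_windows(const unsigned char *wins, int n, float *scores)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= n) return;
    const unsigned char *win = wins + (size_t)w * TPL * TPL;
    float sad = 0.0f;
    for (int i = 0; i < TPL * TPL; i++)
        sad += fabsf((float)win[i] - (float)d_tpl[i]);
    scores[w] = sad;
}

int main(void)
{
    const int n = 1 << 16;                     /* candidate windows per batch */
    unsigned char tpl[TPL * TPL] = {0};        /* placeholder template */
    cudaMemcpyToSymbol(d_tpl, tpl, sizeof(tpl));

    unsigned char *wins; float *scores;
    cudaMalloc((void **)&wins, (size_t)n * TPL * TPL);  /* with Donard, filled by SSD->GPU DMA */
    cudaMalloc((void **)&scores, n * sizeof(float));

    score_windows<<<(n + 255) / 256, 256>>>(wins, n, scores);
    cudaDeviceSynchronize();
    printf("scored %d windows: %s\n", n, cudaGetErrorString(cudaGetLastError()));

    cudaFree(wins);
    cudaFree(scores);
    return 0;
}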

Next Steps
- Determine the reason for the slow write performance from the GPU into the SSD.
- Add RDMA-capable NICs to the Donard platform and complete the Trifecta; this is work in progress.
- Enable Donard within the community so others can benefit and contribute (GitHub?); we are already working with Steve Swanson's team at UCSD.
- Work with the NVM Express standards body to incorporate some of the Donard ideas into NVMe.

Thank You!
See us in booth #416
www.pmcs.com / blog.pmcs.com / @pmcsierra