DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture

Dongup Kwon (1), Jaehyung Ahn (2), Dongju Chae (2), Mohammadamin Ajdari (2), Jaewon Lee (1), Suheon Bae (1), Youngsok Kim (1), and Jangwoo Kim (1)
(1) Dept. of Electrical and Computer Engineering, Seoul National University
(2) Dept. of Computer Science and Engineering, POSTECH

Conventional Server Architecture
- Relies primarily on the CPU and memory
  - CPU-centric computing and in-memory storage
  - Slow and low-bandwidth peripheral devices
[Figure: host- and CPU-centric server with CPU, storage, compute, and network]

Device-centric Server Architecture
- Exploits fast, high-bandwidth devices
  - Data processing accelerators (e.g., GPU, FPGA)
  - Storage (e.g., SSD), network (e.g., 100GbE), PCIe Gen3
[Figure: host- and CPU-centric server (CPU, storage, compute, network) next to a device-centric server with NVM storage, NICs, and GPU/FPGA accelerators attached over PCIe]

Index
- Existing approaches
- DCS-ctrl: HW-based device-control mechanism
- Experimental results
- Conclusion

Existing Approaches
- Software optimization
  - Memory management optimization, user-level device interfaces
  - Does not address multi-device tasks
- P2P communication
  - Transfers data directly through PCI Express → D2D comm.
- Device integration
  - Integrates heterogeneous devices → D2D comm.

Limitations of Existing D2D Comm.: P2P Communication
- Direct data transfers through PCI Express → D2D comm.
- Slow and high-overhead control path
[Figure: data path and control path among devices A, B, C and the CPU. Bar charts compare SW opt and P2P: software latency (us), broken down into control, data copy, kernel, and others, and CPU utilization (%)]

Limitations of Existing D2D Comm.: Integrated Devices
- Integrating heterogeneous devices → D2D comm.
- Fast data and control transfers
- Fixed and inflexible aggregate implementation
[Figure: devices A, B, and C with their controllers integrated into one new device, marked "$$$"]

Limited Performance Potential

    while (true) {
        rc_recv = recv(fd_sock, buffer, recv_size, 0);
        if (rc_recv <= 0)
            break;
        /* recv() may return fewer than recv_size bytes, so use rc_recv below */
        processing(&md_ctx, buffer, rc_recv);
        rc_write = write(fd_file, buffer, rc_recv);
    }

- Intermediate processing between device operations (device A → CPU → device B)
  - Prevents applications from using direct D2D comm.
  - Causes host-side resource contention (CPU and memory)

Design Goals
- Performance & scalability
  - Faster inter-device data and control communication
  - More scalable with CPU-efficient device operations
- Flexibility
  - Support any type of off-the-shelf device
- Applicability
  - Increase the opportunities for applying D2D comm.

Index
- Existing approaches
- DCS-ctrl: HW-based device-control mechanism
  - Key ideas and benefits
  - Architecture
- Experimental results
- Conclusion

DCS-ctrl: Key Ideas & Benefits
- DCS-ctrl: PCIe P2P + HDC
  - Hardware-based device-control (HDC) mechanism
- HDC Engine: FPGA-based device orchestrator + near-device processing unit
  - Performance & scalability → HDC, device orchestrator
  - Flexibility → FPGA-based, low-cost device controller
  - Applicability → near-device processing unit

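HDC Engine: Overview
[Figure: two stacks side by side. SW-controlled P2P: the application drives devices A, B, and C through device drivers A, B, and C. DCS-ctrl (HW): the application hands work to the HDC Engine (FPGA), which contains device controllers A, B, and C plus NDP and drives devices A, B, and C]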

DCS-ctrl: Key Ideas & Benefits

    void ssd_to_nic() {
        get_from_ssd(&data);
        process_in_hdc(&data);
        write_to_nic(&data);
    }

- Optimized device control → faster and scalable communication
- Generic device interfaces → higher flexibility
- Near-device processing → higher applicability
[Figure: data path and control path among devices A, B, C, the CPU, and the HDC Engine, including a new device controller]

Key Idea #1: Device Orchestrator
- Performs multi-device tasks without CPU involvement
  - A multi-device task is offloaded to the HDC Engine
  - The engine manages all device operations and their dependencies
- Fast hardware-level device control

Scoreboard for the multi-device task A → NDP → B:

    Dev  R/W    Src          Dst          Aux   State
    A    Read   Addr(A)      Addr(NDP-A)  -     Done
    -    -      Addr(NDP-A)  Addr(NDP-B)  Hash  Issue
    B    Write  Addr(NDP-B)  Addr(B)      -     Ready
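A minimal software sketch of how such a scoreboard might order the operations of one task. The slide only gives the column headings and three example rows, so every identifier below (sb_entry, sb_issue_next, sb_complete) and the strictly in-order dependency policy are assumptions made for illustration, not the authors' hardware design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; the slide only shows the column headings. */
enum sb_op    { SB_OP_READ, SB_OP_NDP, SB_OP_WRITE };
enum sb_state { SB_READY, SB_ISSUE, SB_DONE };

struct sb_entry {
    enum sb_op    op;     /* R/W column; SB_OP_NDP marks a near-device processing step */
    uint64_t      src;    /* Src column: source address */
    uint64_t      dst;    /* Dst column: destination address */
    int           aux;    /* Aux column: e.g. an NDP function id, 0 if unused */
    enum sb_state state;  /* State column */
};

/* Assumed policy: entries form a chain, so entry i may be issued only after
 * entries 0..i-1 are done. Returns the index issued, or -1 if nothing can be
 * issued right now. */
int sb_issue_next(struct sb_entry *sb, size_t n)
{
    assert(sb != NULL);
    for (size_t i = 0; i < n; i++) {
        if (sb[i].state == SB_DONE)
            continue;               /* dependency satisfied, look at the next entry */
        if (sb[i].state == SB_ISSUE)
            return -1;              /* predecessor still in flight */
        sb[i].state = SB_ISSUE;     /* ready, and every earlier entry is done */
        return (int)i;
    }
    return -1;                      /* every entry is already done */
}

/* Record the completion of entry i. Returns 1 once the whole task is done. */
int sb_complete(struct sb_entry *sb, size_t n, size_t i)
{
    assert(sb != NULL);
    if (i < n && sb[i].state == SB_ISSUE)
        sb[i].state = SB_DONE;
    for (size_t k = 0; k < n; k++)
        if (sb[k].state != SB_DONE)
            return 0;
    return 1;
}
```

With three entries (read from A, hash in NDP, write to B), calling sb_issue_next and sb_complete alternately walks the task through the Ready → Issue → Done states shown on the slide without the host touching any intermediate step.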

Key Idea #2: Device Controller
- Provides interfaces between the HDC Engine and devices
  - Includes submission and completion queues
  - Builds standard and vendor-specific device commands
- Flexible and low-cost device control
[Figure: device controller with a submission queue and a completion queue, connected through the PCIe switch to the device and its doorbell registers]
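The queue-plus-doorbell pattern named on this slide can be sketched in software as follows. This is only an illustration under stated assumptions: the ring depth, the dev_cmd fields, and the functions dc_submit, dev_step, and dc_poll are hypothetical, the doorbell is a plain variable rather than a memory-mapped register, and dev_step stands in for the device side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_DEPTH 8u   /* assumed depth; one slot stays empty to tell full from empty */

/* Hypothetical command layout; real devices define their own formats. */
struct dev_cmd {
    uint16_t id;        /* identifier echoed back in the completion */
    uint8_t  opcode;
    uint64_t addr;
    uint32_t len;
};

struct ring {
    uint32_t head;      /* consumer index */
    uint32_t tail;      /* producer index */
    struct dev_cmd slot[RING_DEPTH];
};

static int ring_push(struct ring *r, struct dev_cmd c)
{
    uint32_t next = (r->tail + 1u) % RING_DEPTH;
    if (next == r->head)
        return -1;                      /* full */
    r->slot[r->tail] = c;
    r->tail = next;
    return 0;
}

static int ring_pop(struct ring *r, struct dev_cmd *out)
{
    if (r->head == r->tail)
        return -1;                      /* empty */
    *out = r->slot[r->head];
    r->head = (r->head + 1u) % RING_DEPTH;
    return 0;
}

struct dev_ctrl {
    struct ring sq;       /* submission queue */
    struct ring cq;       /* completion queue (reuses dev_cmd for brevity) */
    uint32_t    doorbell; /* stands in for the device's doorbell register */
};

/* Controller side: enqueue a command, then publish the new tail. */
int dc_submit(struct dev_ctrl *dc, struct dev_cmd c)
{
    assert(dc != NULL);
    if (ring_push(&dc->sq, c) != 0)
        return -1;
    dc->doorbell = dc->sq.tail;
    return 0;
}

/* Simulated device: consume one announced command and post its completion.
 * Returns 1 if a command was processed, 0 otherwise. */
int dev_step(struct dev_ctrl *dc)
{
    struct dev_cmd c;
    uint32_t cq_next = (dc->cq.tail + 1u) % RING_DEPTH;
    if (dc->sq.head == dc->doorbell)
        return 0;                       /* nothing new was announced */
    if (cq_next == dc->cq.head)
        return 0;                       /* completion queue full, retry later */
    if (ring_pop(&dc->sq, &c) != 0)
        return 0;
    (void)ring_push(&dc->cq, c);        /* cannot fail: space was checked above */
    return 1;
}

/* Controller side: fetch one completion. Returns 1 and sets *id if found. */
int dc_poll(struct dev_ctrl *dc, uint16_t *id)
{
    struct dev_cmd c;
    if (ring_pop(&dc->cq, &c) != 0)
        return 0;
    *id = c.id;
    return 1;
}
```

In the design the slides describe, the producer of submissions is the HDC Engine rather than a host driver, which is what removes the CPU from the control path; the sketch keeps both sides in one process purely so it can be run.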

Key Idea #3: Near-device Processing
- Near-device processing units
  - Execute intermediate processing between device operations
  - Scale-out storage applications → hash, encryption, compression
- Highly applicable to existing applications
- Easy to extend to support other devices and applications

FPGA resource usage per processing unit:

    Processing unit  LUTs   Registers  Applications
    MD5              3.0%   0.69%      Swift
    AES256           3.52%  0.99%      HDFS, Swift
    GZIP             5.36%  2.09%      HDFS
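One way to picture a near-device processing step is a table of processing units selected per operation, each reading from one intermediate buffer and writing to another. The sketch below assumes exactly that; the names (ndp_kind, ndp_run) are hypothetical, and the two units are trivial placeholders (a copy and a byte-wise XOR), not MD5, AES256, or GZIP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unit ids. The units are placeholders, not real hash or cipher code. */
enum ndp_kind { NDP_COPY, NDP_XOR, NDP_COUNT };

/* A unit reads n bytes from the source-side buffer and writes n bytes to the
 * destination-side buffer. */
typedef void (*ndp_unit_fn)(const uint8_t *src, uint8_t *dst, size_t n);

static void ndp_copy(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

static void ndp_xor(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] ^ 0x5Au);   /* stand-in transform only */
}

static const ndp_unit_fn ndp_units[NDP_COUNT] = { ndp_copy, ndp_xor };

/* Run the selected unit between two intermediate buffers.
 * Returns 0 on success, -1 for an unknown unit or a null buffer. */
int ndp_run(enum ndp_kind kind, const uint8_t *src, uint8_t *dst, size_t n)
{
    if ((unsigned)kind >= NDP_COUNT || src == NULL || dst == NULL)
        return -1;
    ndp_units[kind](src, dst, n);
    return 0;
}
```

Adding a unit in this sketch means adding one function and one table entry, which mirrors the slide's point that the processing stage can be extended for other applications.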

Index
- Existing approaches
- DCS-ctrl: HW-based device-control mechanism
  - Key ideas and benefits
  - Architecture
- Experimental results
- Conclusion

Baseline Architecture
- Software-controlled P2P
  - P2P comm. + indirect device-control path
[Figure: in software, the application controls devices A, B, and C through their device drivers; in hardware, the devices are connected through the PCIe switch]

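DCS-ctrl: HW-based Device Control (1/3)
- Offloads the device-control path to the HDC Engine
  - Scoreboard: schedules device operations in a multi-device task
[Figure: the application (SW) passes a multi-device task over A, B, and C to the FPGA-based HDC Engine (HW), whose scoreboard (R/W, Src, Dst) tracks the operations; devices A, B, and C sit behind the PCIe switch]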

DCS-ctrl: Low-cost Integration (2/3)
- Implements an FPGA-based device controller
  - Device controller: directly controls devices using P2P
[Figure: the FPGA-based HDC Engine with scoreboard (R/W, Src, Dst) and device controller; devices A, B, and a new device C sit behind the PCIe switch]

DCS-ctrl: Near-device Processing (3/3)
- Provides units for intermediate processing
  - NDP unit: performs data processing on the data path
[Figure: the FPGA-based HDC Engine with scoreboard (R/W, Src, Dst), device controller, near-device processing units, and intermediate buffers; devices A, B, and a new device C sit behind the PCIe switch]

DCS-ctrl Prototype
- HDC Engine implemented on a Xilinx Virtex-7 VC707
- Supports off-the-shelf devices
  - Intel 750 SSDs, Broadcom 10GbE NICs, NVIDIA GPUs

Index
- Existing approaches
- DCS-ctrl: HW-based device-control mechanism
- Experimental results
- Conclusion

Reducing Device Control Latency
- encrypted_sendfile(): SSD → hash → NIC
  - SW opt (+P2P): frequent boundary crossings, complex software
  - DCS-ctrl: fewer crossings, hardware-based device control
[Figure: two latency (us) bar charts broken down into HW, kernel, and SW control, with data copy added in the second. Without processing: SW opt vs. DCS-ctrl, annotated 42%. With processing (AES256): SW opt, SW opt + P2P, and DCS-ctrl, annotated 72%]

Reducing CPU Utilization
- Swift & HDFS workloads
  - Offloads device control and data transfers to hardware
[Figure: normalized CPU utilization for SW opt, SW opt + P2P, and DCS-ctrl. Swift panel: Kernel (GET), Kernel (PUT), GPU control, others. HDFS panel (send and recv): Kernel (Sender), Kernel (Receiver), GPU control, others. Values annotated on the charts: 52%, 49%, 50%]

Scalability: More Devices
- Swift & HDFS workloads
  - More CPU-efficient → supports more high-performance devices
[Figure: CPU utilization (# cores, 0 to 6) versus throughput (Gbps, 0 to 40) for SW opt, SW opt + P2P, and DCS-ctrl; one panel each for Swift and HDFS]

Conclusion
- Fast & flexible device-control mechanism
  - Hardware-based device-control (HDC) mechanism
  - FPGA-based standard device controllers
  - Near-device data processing (NDP) units
- Real hardware prototype evaluation
  - 72% faster inter-device communication
  - 50% lower CPU utilization for Swift & HDFS

Thank you! We will release our IP & tools soon!