DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture Dongup Kwon 1, Jaehyung Ahn 2, Dongju Chae 2, Mohammadamin Ajdari 2, Jaewon Lee 1, Suheon Bae 1, Youngsok Kim 1, and Jangwoo Kim 1 1 Dept. of Electrical and Computer Engineering, Seoul National University 2 Dept. of Computer Science and Engineering, POSTECH
Conventional Server Architecture Primarily relies on the CPU and memory: CPU-centric computing & in-memory storage; peripheral devices are slow and low-bandwidth. (Figure: host- & CPU-centric layout with CPU, storage, compute, and network) 2/28
Device-Centric Server Architecture Exploits fast & high-bandwidth devices: data-processing accelerators (e.g., GPU, FPGA), storage (e.g., NVM SSD), network (e.g., 100GbE NIC), PCIe Gen3. (Figure: host- & CPU-centric vs. device-centric layouts) 3/28
Index Existing approaches DCS-ctrl: HW-based device-control mechanism Experimental results Conclusion 4/28
Existing Approaches Software optimization: memory mgmt. optimization, user-level device interfaces; does not address multi-device tasks. P2P communication: transfers data directly through PCI Express → D2D comm. Device integration: integrates heterogeneous devices → D2D comm. 5/28
Limitations of Existing D2D Comm. P2P communication: direct data transfers through PCI Express → D2D comm., but a slow and high-overhead control path through the host kernel. (Figure: data path vs. CPU-mediated control path; plots compare SW opt vs. P2P on latency (us) and CPU utilization) 6/28
Limitations of Existing D2D Comm. Integrated devices: integrating heterogeneous devices → D2D comm. with fast data & control transfers, but a fixed, inflexible, and costly aggregate implementation. (Figure: devices A/B/C behind custom integrated controllers) 7/28
Limited Performance Potential

while (true) {
    /* receive the next block from the network */
    rc_recv = recv(fd_sock, buffer, recv_size, 0);
    if (rc_recv <= 0) break;
    /* intermediate processing (e.g., hash update) on the bytes actually received */
    processing(&md_ctx, buffer, rc_recv);
    /* persist the same bytes to storage */
    rc_write = write(fd_file, buffer, rc_recv);
}

Intermediate processing between device ops prevents applications from using direct D2D comm. and causes host-side resource contention (CPU and memory). 8/28
Design Goals Performance & scalability: faster inter-device data & control communication; more scalable through CPU-efficient device operations. Flexibility: support any type of off-the-shelf device. Applicability: increase the opportunities for applying D2D comm. 9/28
Index Existing approaches DCS-ctrl: HW-based device-control mechanism Key ideas and benefits Architecture Experimental results Conclusion 10/28
DCS-ctrl: Key Ideas & Benefits DCS-ctrl = PCIe P2P + HDC, a hardware-based device-control (HDC) mechanism. HDC Engine: FPGA-based device orchestrator + near-device processing unit. Performance & scalability → HDC, device orchestrator. Flexibility → FPGA-based, low-cost device controller. Applicability → near-device processing unit. 11/28
HDC Engine: Overview (Figure: SW-controlled P2P — the application drives devices A/B/C through per-device drivers — vs. DCS-ctrl — the application hands the task to the FPGA HDC Engine, whose device controllers and NDP units drive A/B/C in hardware) 12/28
DCS-ctrl: Key Ideas & Benefits

void ssd_to_nic() {
    get_from_ssd(&data);    /* read via the SSD's device controller */
    process_in_hdc(&data);  /* near-device processing inside the HDC Engine */
    write_to_nic(&data);    /* send via the NIC's device controller */
}

New device controllers + HDC: optimized device control → faster & more scalable communication; generic device interfaces → higher flexibility; near-device processing → higher applicability. 13/28
Key Idea #1: Device Orchestrator Performs multi-device tasks w/o CPU involvement: offload a multi-device task to the HDC Engine, which manages all device operations and their dependencies. Example scoreboard for a task A → NDP → B:

Dev  R/W    Src          Dst          Aux   State
A    Read   Addr(A)      Addr(NDP-A)  -     Done
-    -      Addr(NDP-A)  Addr(NDP-B)  Hash  Issue
B    Write  Addr(NDP-B)  Addr(B)      -     Ready

Fast hardware-level device control. 14/28
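The scoreboard-driven orchestration above can be sketched in software as a minimal C model: one row per device operation, with an issue step that fires an operation only once its predecessor completes. All names here (`sb_entry`, `sb_step`, the state enum) are illustrative stand-ins, not taken from the DCS-ctrl hardware.

```c
/* Minimal sketch of a scoreboard row, mirroring the slide's columns. */
typedef enum { PENDING, ISSUED, DONE } op_state;

typedef struct {
    const char *dev;        /* target device, or "-" for an NDP step  */
    const char *rw;         /* Read / Write / "-"                      */
    const char *src, *dst;  /* source and destination addresses        */
    const char *aux;        /* auxiliary op, e.g. "Hash" on an NDP row */
    op_state    state;
} sb_entry;

/* One scheduling step: retire an in-flight op, or issue the first op
 * whose predecessor is DONE. Returns the row acted on, or -1 when the
 * whole task has finished. In hardware this loop would be a parallel
 * dependency check; a linear chain keeps the sketch simple. */
static int sb_step(sb_entry *sb, int n) {
    for (int i = 0; i < n; i++) {
        if (sb[i].state == ISSUED) { sb[i].state = DONE; return i; }
        if (sb[i].state == PENDING && (i == 0 || sb[i - 1].state == DONE)) {
            sb[i].state = ISSUED;
            return i;
        }
    }
    return -1; /* all rows DONE */
}
```

Driving `sb_step` in a loop replays the slide's task: the NDP hash row issues only after the read from A is done, and the write to B issues only after the hash completes, all without host CPU involvement in the real design.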
Key Idea #2: Device Controller Provides interfaces between the HDC Engine & devices: includes submission & completion queues and doorbell registers; builds standard & vendor-specific device commands. (Figure: device controller attached to the device through the PCIe switch) Flexible & low-cost device control. 15/28
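The queue-pair interface on this slide can be sketched as a small C model: a submission queue, a completion queue, and a doorbell write that tells the device a new command is ready. The layout and names (`dev_ctrl`, `submit`, the command fields) are illustrative, NVMe-style assumptions, not the HDC Engine's actual RTL interface.

```c
#include <stdint.h>

#define QDEPTH 16

/* A generic command: real controllers add standard or vendor-specific fields. */
typedef struct { uint8_t opcode; uint64_t src, dst, len; } sq_cmd;
typedef struct { uint16_t cid; uint16_t status; } cq_entry;

typedef struct {
    sq_cmd   sq[QDEPTH];           /* submission queue (producer: controller) */
    cq_entry cq[QDEPTH];           /* completion queue (producer: device)     */
    uint32_t sq_tail, cq_head;
    volatile uint32_t *doorbell;   /* MMIO register on the real device        */
} dev_ctrl;

/* Enqueue a command and ring the doorbell so the device fetches it. */
static int submit(dev_ctrl *c, sq_cmd cmd) {
    uint32_t slot = c->sq_tail;
    c->sq[slot] = cmd;
    c->sq_tail = (slot + 1) % QDEPTH;
    *c->doorbell = c->sq_tail;     /* device observes the new tail */
    return (int)slot;
}
```

In DCS-ctrl the producer side of this exchange runs inside the FPGA rather than in a host driver, which is what removes the kernel from the control path.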
Key Idea #3: Near-device Processing Near-device processing units execute intermediate processing between device ops; scale-out storage apps need hash, encryption, and compression.

Processing unit  LUTs   Registers  Applications
MD5              3.0%   0.69%      Swift
AES256           3.52%  0.99%      HDFS, Swift
GZIP             5.36%  2.09%      HDFS

Highly applicable to existing applications; easy to extend to support other devices & applications. 16/28
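An NDP unit can be modeled as a transform inserted on the data path between two device operations. This is a minimal C sketch under assumptions: the stage signature (`ndp_stage`) is illustrative, and the toy XOR-fold transform merely stands in for the real MD5/AES256/GZIP units listed in the table.

```c
#include <stddef.h>
#include <stdint.h>

/* An NDP stage: consume n input bytes, produce some output bytes. */
typedef size_t (*ndp_stage)(const uint8_t *in, size_t n, uint8_t *out);

/* Toy stand-in for a hash unit: XOR-fold the input into one byte. */
static size_t xor_fold(const uint8_t *in, size_t n, uint8_t *out) {
    uint8_t acc = 0;
    for (size_t i = 0; i < n; i++) acc ^= in[i];
    out[0] = acc;
    return 1;
}

/* Apply a stage to data read from device A before writing it to device B.
 * Buffers stand in for the intermediate DMA buffers in the HDC Engine. */
static size_t ndp_apply(ndp_stage stage,
                        const uint8_t *from_dev_a, size_t n,
                        uint8_t *to_dev_b) {
    return stage(from_dev_a, n, to_dev_b);
}
```

Because the stage is just a pluggable transform on the path, adding a new unit (another hash, a different cipher) does not change the surrounding orchestration, which is the flexibility the slide claims.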
Index Existing approaches DCS-ctrl: HW-based device-control mechanism - Key ideas and benefits Architecture Experimental results Conclusion 17/28
Baseline Architecture Software-controlled P2P: P2P comm. + an indirect device-control path. (Figure: the application and device drivers A/B/C in software issue controls, while data moves P2P through the PCIe switch) 18/28
DCS-ctrl: HW-based Device Control (1/3) Offloads the device-control path to the HDC Engine; a scoreboard schedules the device operations in a multi-device task. (Figure: the application submits a task, and the FPGA-based HDC Engine's scoreboard drives devices A/B/C through the PCIe switch) 19/28
DCS-ctrl: Low-cost Integration (2/3) Implements an FPGA-based device controller; the device controller directly controls devices using P2P. (Figure: new device controllers added to the HDC Engine) 20/28
DCS-ctrl: Near-device Processing (3/3) Provides units for intermediate processing; an NDP unit performs data processing on the data path through intermediate buffers. (Figure: NDP units and intermediate buffers added to the HDC Engine) 21/28
DCS-ctrl Prototype HDC Engine implemented on a Xilinx Virtex-7 VC707; supports off-the-shelf devices: Intel 750 SSDs, Broadcom 10GbE NICs, NVIDIA GPUs. 22/28
Index Existing approaches DCS-ctrl: HW-based device-control mechanism Experimental results Conclusion 23/28
Reducing Device-Control Latency encrypted_sendfile(): SSD → hash → NIC. SW opt (+P2P): frequent boundary crossings and complex software; DCS-ctrl: fewer crossings and hardware-based device control. (Plots: latency breakdown by SW/kernel/HW — DCS-ctrl cuts latency by 42% without processing and by 72% with AES256 processing vs. the SW-optimized baselines) 24/28
Reducing CPU Utilization Swift & HDFS workloads: offload device control & data transfers to hardware. (Plots: normalized CPU utilization for Swift GET/PUT and HDFS send/recv — DCS-ctrl reduces CPU utilization to roughly half (49-52%) of the SW-optimized baselines) 25/28
Scalability: More Devices Swift & HDFS workloads: more CPU-efficient → supports more high-performance devices. (Plots: CPU utilization in cores vs. throughput in Gbps for Swift and HDFS — at the same throughput, DCS-ctrl uses fewer cores than SW opt and SW opt + P2P) 26/28
Conclusion Fast & flexible device-control mechanism: hardware-based device-control (HDC) mechanism, FPGA-based standard device controllers, near-device data processing (NDP) units. Real hardware prototype evaluation: 72% faster inter-device communication; 50% lower CPU utilization for Swift & HDFS. 27/28
Thank you! We will release our IP & tools soon! 28/28