LITE: Kernel RDMA Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang


Transcription:

LITE: Kernel RDMA Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang

[Timeline figure spanning userspace, kernel, and hardware: Berkeley Sockets (1983), U-Net (1995), TCP offload engines (2000s), RDMA in HPC, Arrakis / mTCP / IX (2014), RDMA in datacenters (2017).]

RDMA (Remote Direct Memory Access): directly read and write remote memory, bypassing the kernel, with zero memory copies. Benefits: low latency, high throughput, low CPU utilization.

Things have worked well in HPC: special hardware, few applications, cheaper developer time.

RDMA-Based Datacenter Applications: Pilaf [ATC '13], HERD [SIGCOMM '14], FaRM [NSDI '14], DrTM [SOSP '15], FaRM+Xact [SOSP '15], Mojim [ASPLOS '15], HERD-RPC [ATC '16], Cell [ATC '16], Wukong [OSDI '16], FaSST [OSDI '16], RSI [VLDB '16], DrTM+R [EuroSys '16], Hotpot [SoCC '17], APUS [SoCC '17], NAM-DB [VLDB '17], Octopus [ATC '17].

Things have worked well in HPC: special hardware, few applications, cheaper developer time. What about datacenters? Commodity, cheaper hardware; many (changing) applications; resource sharing and isolation.

[Figure, native userspace RDMA: the user-level RDMA application and its library handle connection management (connections, queues, keys: node, lkey, rkey, addr) and memory management (memory space) entirely in user space, bypassing the kernel/OS; the RNIC in hardware performs permission checks and address mapping using cached PTEs and per-region lkeys and rkeys.]
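To make "low-level and difficult to use" concrete, here is a minimal sketch of a single one-sided write with native userspace verbs, assuming the protection domain, queue pair, and completion queue already exist and the peer's buffer address and rkey were exchanged out of band (the connection setup itself takes considerably more code):

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Sketch only: assumes pd, qp, and cq were created during connection
     * setup and that remote_addr/rkey came from an out-of-band exchange. */
    static int rdma_write_once(struct ibv_pd *pd, struct ibv_qp *qp,
                               struct ibv_cq *cq, void *buf, uint32_t len,
                               uint64_t remote_addr, uint32_t rkey)
    {
        /* Every buffer must be registered; the RNIC caches the resulting
         * lkey/rkey and the page table entries backing the region. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr;  /* peer's virtual address */
        wr.wr.rdma.rkey        = rkey;         /* peer's per-region key   */

        struct ibv_send_wr *bad = NULL;
        int err = ibv_post_send(qp, &wr, &bad);

        /* Busy-poll the completion queue for the write completion. */
        struct ibv_wc wc;
        while (!err && ibv_poll_cq(cq, 1, &wc) == 0)
            ;
        if (!err && wc.status != IBV_WC_SUCCESS)
            err = -1;

        ibv_dereg_mr(mr);
        return err;
    }

Every buffer needs its own registration and keys, which is exactly the per-region state the following slides show overwhelming the NIC.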

Abstraction mismatch: what userspace-plus-hardware RDMA offers is low-level, difficult to use, and difficult to share, while developers want what sockets give them: a high-level, easy-to-use interface with resource sharing and isolation. The result is fat applications and no resource sharing.

The HPC assumptions (special hardware, few applications, cheaper developer time) do not hold in datacenters, which have commodity, cheaper hardware, many changing applications, and a need for resource sharing and isolation.


The RNIC's on-NIC SRAM (1) fetches and caches page table entries and (2) stores keys for every registered memory region. [Figure: native RDMA write throughput (requests/µs) for 64 B and 1 KB writes drops sharply as the total registered memory grows from 1 MB to 1024 MB.] The hardware is expensive and unscalable.


To summarize the problems: fat applications, no resource sharing, and expensive, unscalable hardware. Are we removing too much from the kernel?

Outline: introduction and motivation; overall design and abstraction; LITE internals; LITE applications; conclusion.

Without the kernel we lose high-level abstraction, resource sharing, protection, and performance isolation. LITE, the Local Indirection TiEr, brings all of these back.

"All problems in computer science can be solved by another level of indirection." (Butler Lampson)


[Figure, LITE architecture: the user-level RDMA app calls LITE APIs (memory APIs, RPC/messaging APIs, synchronization APIs); LITE in kernel space takes over connection management, queues, keys, memory space, permission checks, and address mapping, and talks to the RNIC through RDMA verbs; the RNIC keeps only a global lkey and global rkey. Result: simpler applications, cheaper hardware, scalable performance.]

Implementing remote memset: native RDMA versus LITE (code comparison slide).
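As a rough illustration of the contrast this slide draws, the sketch below zeroes a range of remote memory through a LITE-style call. The handle-plus-offset interface and the LITE_write name follow the talk's slides, but the exact C signature here is an assumption, not LITE's confirmed API; the native-RDMA version would additionally carry its own registration, key exchange, work-request posting, and completion polling, as in the earlier verbs sketch.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical userspace prototype, modeled on the slides' use of
     * LITE_read(lh, offset, size); the real LITE signature may differ. */
    int LITE_write(uint64_t lh, uint64_t offset, const void *buf, uint64_t size);

    /* Remote memset sketch: write `value` into `size` bytes starting at
     * `offset` inside the LITE memory region named by the handle `lh`. */
    int remote_memset(uint64_t lh, uint64_t offset, int value, uint64_t size)
    {
        char chunk[4096];
        memset(chunk, value, sizeof(chunk));

        while (size > 0) {
            uint64_t n = size < sizeof(chunk) ? size : sizeof(chunk);
            /* LITE resolves lh + offset to (node, physical address) inside
             * the kernel and issues the one-sided RDMA write for us. */
            if (LITE_write(lh, offset, chunk, n) != 0)
                return -1;
            offset += n;
            size   -= n;
        }
        return 0;
    }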

"All problems in computer science can be solved by another level of indirection" (Butler Lampson, crediting David Wheeler) "... except for the problem of too many layers of indirection" (David Wheeler).

Main challenge: how do we preserve the performance benefits of RDMA?

Design principle 1: add indirection only at the local side for one-sided RDMA. [Figure: the Berkeley socket, native RDMA, and LITE data paths through CPU, user memory, and kernel.]

Design principle 2: avoid hardware indirection. Address mapping and permission checks move from the RNIC into LITE in the kernel, leaving no redundant indirection and giving scalable performance.

Design principle 3: hide the kernel's cost ("... except for the problem of too many layers of indirection", David Wheeler). Together these principles deliver great performance and scalability.

Outline: introduction and motivation; overall design and abstraction; LITE internals; LITE applications; conclusion.

[Figure, LITE architecture in detail: user-level apps, management functions, and user-level RPC functions sit on the LITE abstraction in the OS kernel; LITE exposes APIs for management, memory, synchronization, messaging, and RPC; LITE one-sided RDMA keeps LITE handles (lh1, lh2), permission checks, and address mappings (addr1, addr2) over the global lkey/rkey; LITE RPC manages RDMA buffers, connections, and queues with send, receive, and poll paths between RPC clients and servers; underneath sit the verbs abstraction, the RNIC driver, and the RNIC holding only the global lkey and rkey.]
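The API groups named on this slide (management, memory, synchronization, messaging, RPC) could look roughly like the hypothetical header below. Apart from LITE_read and LITE_write, which echo calls shown elsewhere in the talk, every name and signature here is an illustrative guess, not the actual LITE header:

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t lite_handle_t;   /* opaque handle (lh) for an LMR */

    /* Memory APIs: allocate remote memory and access it by handle + offset. */
    lite_handle_t LITE_malloc(int node, size_t size);
    int LITE_read(lite_handle_t lh, uint64_t offset, void *buf, size_t size);
    int LITE_write(lite_handle_t lh, uint64_t offset, const void *buf, size_t size);

    /* RPC / messaging APIs: issue a request and get the reply in one call;
     * the server side polls for incoming requests. */
    int LITE_rpc(int dst_node, const void *req, size_t req_len,
                 void *reply, size_t max_reply_len);
    int LITE_recv(int *src_node, void *buf, size_t max_len);

    /* Synchronization APIs: distributed locks built on LITE's indirection. */
    int LITE_lock(lite_handle_t lock);
    int LITE_unlock(lite_handle_t lock);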

Onload costly operations: LITE takes over connections, queues, keys, and memory space from the RNIC and performs address mapping and protection in the kernel.

Avoid hardware indirection. Challenge: how do we eliminate hardware indirection without changing the hardware? Register memory with physical addresses, so the RNIC needs no cached page table entries; register the whole memory at once, so a single global lkey/rkey replaces per-region keys.
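One way to realize "register with physical addresses, one global key" from kernel space is sketched below. It uses ib_get_dma_mr, the in-kernel verbs call that older Linux kernels provide to create a single MR spanning all of physical memory (newer kernels express the same idea with IB_PD_UNSAFE_GLOBAL_RKEY); whether LITE uses this exact call is my assumption, not something the talk states.

    #include <rdma/ib_verbs.h>

    /* Sketch, kernel-module context: one DMA MR covering all physical memory
     * yields a single global lkey/rkey, so the RNIC needs neither cached
     * page table entries nor per-region keys. ib_get_dma_mr() is the older
     * in-kernel call for this; using it here is an assumption about LITE. */
    static u32 lite_global_lkey, lite_global_rkey;

    static int lite_setup_global_mr(struct ib_pd *pd)
    {
        struct ib_mr *mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE |
                                             IB_ACCESS_REMOTE_READ |
                                             IB_ACCESS_REMOTE_WRITE);
        if (IS_ERR(mr))
            return PTR_ERR(mr);

        lite_global_lkey = mr->lkey;   /* used for every local buffer  */
        lite_global_rkey = mr->rkey;   /* shared with every peer node  */
        return 0;
    }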

LITE LMR and RDMA: a userspace application holds only an opaque handle, lh, for an LMR (LITE memory region). Inside the kernel, LITE maps the LMR to (node, physical address) pairs, for example node 1 at 0x45 and node 4 at 0x27. On LITE_read(lh, offset, size), LITE performs permission checks and QoS, translates the offset into the right node and physical address, and issues the RDMA over the network to the remote node.
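The kernel-side translation this slide walks through might look something like the sketch below: the handle indexes an LMR table, LITE checks permissions, and the offset selects the (node, physical address) pair to target. The structure layout and helper names are invented for illustration; only the lh / offset / size flow and the permission and QoS checks come from the slide.

    #include <stdint.h>
    #include <errno.h>

    /* Illustrative structures, not LITE's real ones. */
    struct lmr_chunk { int node; uint64_t phys_addr; uint64_t len; };
    struct lmr {
        uint32_t owner_pid;           /* permission check: owning process */
        int nchunks;
        struct lmr_chunk chunk[4];    /* e.g. node 1 @ 0x45.., node 4 @ 0x27.. */
    };

    extern struct lmr *lmr_table[];   /* indexed by the opaque handle lh */
    int rdma_read_phys(int node, uint64_t phys, void *buf, uint64_t size);

    int lite_read(uint32_t caller_pid, uint64_t lh, uint64_t offset,
                  void *buf, uint64_t size)
    {
        struct lmr *m = lmr_table[lh];
        if (!m || m->owner_pid != caller_pid)   /* permission check in kernel */
            return -EPERM;

        /* Walk the chunks until the offset falls inside one of them. */
        for (int i = 0; i < m->nchunks; i++) {
            if (offset < m->chunk[i].len) {
                if (offset + size > m->chunk[i].len)
                    return -EINVAL;             /* no chunk-crossing in sketch */
                /* One-sided RDMA read against the remote node's physical
                 * address, using the single global rkey. */
                return rdma_read_phys(m->chunk[i].node,
                                      m->chunk[i].phys_addr + offset,
                                      buf, size);
            }
            offset -= m->chunk[i].len;
        }
        return -EINVAL;
    }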

LITE RDMA, scalability with MR size: [Figure: requests/µs versus total registered size (1 MB to 1024 MB) for native Write-64B and Write-1K against LITE_write-64B and LITE_write-1K.] LITE scales much better than native RDMA with respect to MR size and the number of MRs.

LITE RDMA latency: [Figure: latency (µs) versus request size (8 B to 32 KB) for LITE called from user space and from kernel space.] LITE adds only a very slight overhead even when native RDMA does not have scalability issues.

LITE RPC: RPC communication uses two RDMA write-with-immediate operations; one global busy-polling thread; separate LMRs at the server for different RPC clients; the syscall cost is hidden behind the performance-critical path. Benefits: low latency, low memory utilization, low CPU utilization.
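The two-write-with-immediate pattern on this slide can be sketched with userspace verbs as below (LITE itself issues these from the kernel). Only the overall pattern of a write-imm request followed by a write-imm reply comes from the slide; the buffer layout, the use of the immediate value as a client/slot tag, and the helper's signature are simplifying assumptions.

    #include <infiniband/verbs.h>
    #include <arpa/inet.h>
    #include <stdint.h>

    /* Post one RDMA write-with-immediate that deposits `len` bytes into the
     * peer's pre-agreed buffer at (raddr, rkey). The immediate value rides
     * along in the completion and can tag which client/slot the message is
     * for. The receiver must have posted receive work requests so the
     * write-imm shows up as an IBV_WC_RECV_RDMA_WITH_IMM completion. */
    static int post_write_imm(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                              uint32_t len, uint64_t raddr, uint32_t rkey,
                              uint32_t tag)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                               .lkey = mr->lkey };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE_WITH_IMM,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,
            .imm_data   = htonl(tag),
        };
        wr.wr.rdma.remote_addr = raddr;
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad = NULL;
        return ibv_post_send(qp, &wr, &bad);
    }

    /* RPC flow sketch: the client write-imms its request into a per-client
     * LMR at the server; one global busy-polling thread at the server sees
     * the completion, runs the handler, and write-imms the reply back into
     * the client's buffer. Two write-imms per RPC, no intermediate copies. */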

Outline: introduction and motivation; overall design and abstraction; LITE internals; LITE applications; conclusion.

LITE Application Effort

    Application       LOC    LOC using LITE   Student Days
    LITE-Log          330    36               1
    LITE-MapReduce    600*   49               4
    LITE-Graph        1400   20               7
    LITE-Kernel-DSM   3000   45               26
    LITE-Graph-DSM    1300   0                5

Simple to use, needs no expert knowledge; a flexible, powerful abstraction; easy to achieve optimized performance. (* LITE-MapReduce ports the 3000-LOC Phoenix with 600 lines of change or addition.)

MapReduce results: LITE-MapReduce is adapted from Phoenix [1]. [Figure: runtime (sec) of Hadoop, Phoenix, and LITE-MapReduce on 2, 4, and 8 nodes.] LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x. [1] Ranger et al., Evaluating MapReduce for Multi-core and Multiprocessor Systems (HPCA '07).

Graph engine results: LITE-Graph is built directly on LITE using the PowerGraph design and compared against Grappa and PowerGraph. [Figure: runtime (sec) on 4 nodes x 4 threads and 7 nodes x 4 threads.] LITE-Graph outperforms PowerGraph by 3.5x to 5.6x.

Conclusion: LITE virtualizes RDMA into a flexible abstraction while preserving RDMA's performance benefits. Indirection does not always degrade performance; the key is the right division of work across user space, the kernel, and the hardware.

Thank you! Questions? Get LITE at https://github.com/wuklab/lite (wuklab.io).