Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing

Talk at Storage Developer Conference (SDC), SNIA 2018

Xiaoyi Lu, The Ohio State University
E-mail: luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Big Data Management and Processing on Modern Clusters

Substantial impact on designing and utilizing data management and processing systems in multiple tiers:
- Front-end data accessing and serving (online): Memcached + DB (e.g., MySQL), HBase
- Back-end data analytics (offline): HDFS, MapReduce, Spark

[Figure: two-tier architecture. The front-end tier serves Internet data accessing and serving through web servers backed by Memcached + DB (MySQL) and NoSQL DB (HBase); the back-end tier runs data analytics apps/jobs (MapReduce, Spark) on HDFS.]

SDC SNIA 18 2
Big Data Processing with Apache Big Data Analytics Stacks

Major components:
- MapReduce (batch)
- Spark (iterative and interactive)
- HBase (query)
- HDFS (storage)
- RPC (inter-process communication)

The underlying Hadoop Distributed File System (HDFS) is used by MapReduce, Spark, HBase, and many others. The model scales, but the large amount of communication and I/O can be further optimized!

[Figure: stack diagram. User applications run on MapReduce, Spark, and HBase over HDFS and Hadoop Common (RPC).]
Drivers of Modern HPC Cluster and Data Center Architecture

- Multi-core/many-core processors
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE); InfiniBand with SR-IOV delivers <1 usec latency and 200 Gbps bandwidth
- Single Root I/O Virtualization (SR-IOV)
- Accelerators/coprocessors (NVIDIA GPGPUs and FPGAs): high compute density, high performance/watt, >1 TFlop DP on a chip
- NVM and NVMe-SSD (SSD, NVMe-SSD, NVRAM)

[Figure: example systems include SDSC Comet, TACC Stampede, and cloud deployments.]
The High-Performance Big Data (HiBD) Project

- RDMA for Apache Spark
- RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x); plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
- RDMA for Apache HBase
- RDMA for Memcached (RDMA-Memcached)
- RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
- OSU HiBD-Benchmarks (OHB): HDFS, Memcached, HBase, and Spark micro-benchmarks

http://hibd.cse.ohio-state.edu

User base: 290 organizations from 34 countries; more than 27,800 downloads from the project site. Available for InfiniBand and RoCE, on x86 and OpenPOWER.

Significant performance improvement with RDMA+DRAM compared to default sockets-based designs; how about RDMA+NVRAM?
Non-Volatile Memory (NVM) and NVMe-SSD

- Non-Volatile Memory (NVM) provides byte-addressability with persistence
- The huge explosion of data in diverse fields requires fast analysis and storage
- NVMs provide the opportunity to build high-throughput storage systems for data-intensive applications
- Storage technology is moving rapidly towards NVM

[Figure: 3D XPoint from Intel & Micron; Samsung NVMe SSD; performance of PMC Flashtec NVRAM [*]]

[*] http://www.enterprisetech.com/2014/08/06/flashtec-nvram-15-million-iops-sub-microsecond-latency/
NVRAM Emulation based on DRAM

Popular methods employed by recent works emulate an NVRAM performance model over DRAM. Two ways:
- Emulate byte-addressable NVRAM over DRAM: the application issues load/store (or calls a persistent memory library's pmem_memcpy_persist), which over DRAM becomes a copy plus clflush plus an injected delay
- Emulate a block-based NVM device over DRAM: the application goes through open/read/write/close and mmap/memcpy/msync (DAX) on the virtual file system, backed by a PCMDisk block device (RAM-disk + delay)
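The first (byte-addressable) emulation path can be sketched as a copy followed by an injected per-cache-line stall. This is a minimal illustrative model, not the emulator used in the talk: the function name, the 64-byte line size, and the delay constant are assumptions.

```python
import time

CACHE_LINE = 64  # bytes covered by one emulated clflush

def emulated_persist(region: bytearray, data: bytes,
                     slowdown_ns_per_line: int = 100) -> int:
    """Hypothetical stand-in for pmem_memcpy_persist over DRAM:
    copy the data, then spin for a per-cache-line delay that models
    clflush + sfence on real NVRAM. Returns the injected delay (ns)."""
    region[:len(data)] = data
    n_lines = -(-len(data) // CACHE_LINE)          # ceil(len / 64)
    delay_ns = n_lines * slowdown_ns_per_line
    deadline = time.perf_counter_ns() + delay_ns
    while time.perf_counter_ns() < deadline:       # emulate the write stall
        pass
    return delay_ns
```

Scaling the per-line delay is how the {NxDRAM} slowdown modes later in the talk are realized over DRAM.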
Presentation Outline

- NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
- NRCIO for Big Data Analytics
- NVMe-SSD based Big Data Analytics
- Conclusion and Q&A
Design Scope (NVM for RDMA)

- D-to-D over RDMA: communication buffers for client and server are allocated in DRAM (common case)
- D-to-N over RDMA: communication buffers for the client are allocated in DRAM; the server uses NVM
- N-to-D over RDMA: communication buffers for the client are allocated in NVM; the server uses DRAM
- N-to-N over RDMA: communication buffers for client and server are allocated in NVM

[Figure: in each mode, the HDFS-RDMA client (RDMADFSClient) and server (RDMADFSServer) move data CPU -> NIC over PCIe on each side, with buffers placed in DRAM or NVM as listed.]
NVRAM-aware RDMA-based Communication in NRCIO

- NRCIO RDMA Write over NVRAM
- NRCIO RDMA Read over NVRAM
DRAM-to-NVRAM RDMA-aware Communication with NRCIO

[Figure: communication latency of NRCIO-RW and NRCIO-RR for small messages (256 B to 16 KB, in us) and large messages (256 KB to 4 MB, in ms) under 1xDRAM, 2xDRAM, and 5xDRAM emulation modes.]

- Comparison of communication latency using NRCIO RDMA read (RR) and write (RW) communication protocols over an InfiniBand EDR HCA, with DRAM as source and NVRAM as destination
- {NxDRAM} NVRAM emulation mode = Nx NVRAM write slowdown vs. DRAM, emulated with clflushopt + sfence
- The time-for-persistence has a smaller impact on end-to-end latencies for small messages than for large messages, since larger messages mean a larger number of cache lines to flush
NVRAM-to-NVRAM RDMA-aware Communication with NRCIO

[Figure: communication latency of NRCIO-RW and NRCIO-RR for small messages (64 B to 16 KB, in us) and large messages (256 KB to 4 MB, in ms) under No Persist (D2D), {1x,2x}, and {2x,5x} emulation modes.]

- Comparison of communication latency using NRCIO RDMA read and write communication protocols over an InfiniBand EDR HCA vs. DRAM
- {Ax,By} NVRAM emulation mode = Ax NVRAM read slowdown and By NVRAM write slowdown vs. DRAM
- High end-to-end latencies are due to slower writes to non-volatile persistent memory, e.g., 3.9x for {1x,2x} and 8x for {2x,5x}
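The observation that persistence cost grows with the number of cache lines to flush can be captured by a toy latency model. All constants below are invented for illustration; they are not measured NRCIO numbers.

```python
CACHE_LINE = 64  # bytes per flushed cache line

def nvram_rdma_write_latency_us(msg_bytes: int, network_us: float,
                                write_slowdown: float,
                                flush_us_per_line: float = 0.01) -> float:
    """Toy end-to-end model of an RDMA write landing in NVRAM:
    network transfer time plus a per-cache-line flush cost scaled
    by the emulated NVRAM write slowdown (illustrative only)."""
    n_lines = -(-msg_bytes // CACHE_LINE)          # ceil division
    return network_us + n_lines * flush_us_per_line * write_slowdown

def persistence_share(msg_bytes: int, network_us: float,
                      write_slowdown: float) -> float:
    """Fraction of end-to-end latency spent persisting the payload."""
    total = nvram_rdma_write_latency_us(msg_bytes, network_us, write_slowdown)
    return (total - network_us) / total
```

Under this model, a 256 B message flushes only 4 cache lines, so persistence is a small fraction of the total; a 4 MB message flushes 65536 lines and persistence dominates, matching the small-vs-large-message trend on the previous slides.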
Presentation Outline

- NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
- NRCIO for Big Data Analytics
- NVMe-SSD based Big Data Analytics
- Conclusion and Q&A
Opportunities of Using NVRAM+RDMA in HDFS

- Files are divided into fixed-size blocks; blocks are divided into packets
- NameNode: stores the file system namespace
- DataNode: stores data blocks in local storage devices
- Uses block replication for fault tolerance; replication enhances data-locality and read throughput
- Communication and I/O intensive, with Java sockets based communication
- Data needs to be persistent, typically on SSD/HDD

[Figure: client interacting with the NameNode and DataNodes.]
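The block/packet decomposition above can be sketched as follows. The 128 MB block and 64 KB packet sizes are typical HDFS defaults, used here purely for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # common HDFS default block size
PACKET_SIZE = 64 * 1024          # typical DFSClient packet size

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Divide a file into fixed-size blocks: (offset, length) pairs."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def split_block_into_packets(block_len: int, packet_size: int = PACKET_SIZE):
    """Each block is streamed to the DataNode pipeline as packets."""
    return [(off, min(packet_size, block_len - off))
            for off in range(0, block_len, packet_size)]
```

For example, a 300 MB file becomes two full 128 MB blocks plus a 44 MB tail block, and each full block is streamed as 2048 packets of 64 KB.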
Design Overview of NVM and RDMA-aware HDFS (NVFS)

[Figure: architecture. Applications and benchmarks (Hadoop MapReduce, Spark, HBase) use the DFSClient (RDMA sender) to reach the NVFS DataNode (RDMA receiver, RDMA writer/reader, replicator), which performs I/O through NVFS-BlkIO and NVFS-MemIO over NVM, with SSDs as additional storage.]

Design features:
- RDMA over NVM
- HDFS I/O with NVM: block access (NVFS-BlkIO) and memory access (NVFS-MemIO)
- Hybrid design: NVM with SSD as hybrid storage for HDFS I/O
- Co-design with Spark and HBase, driven by cost-effectiveness and use cases

N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, 30th International Conference on Supercomputing (ICS), June 2016
Evaluation with Hadoop MapReduce

[Figure: TestDFSIO average throughput (MBps) for HDFS, NVFS-BlkIO, and NVFS-MemIO, all at 56 Gbps, for write and read on SDSC Comet (32 nodes) and OSU Nowlab (4 nodes).]

- TestDFSIO on SDSC Comet (32 nodes): NVFS-MemIO gains 4x over HDFS on write and 1.2x on read
- TestDFSIO on OSU Nowlab (4 nodes): NVFS-MemIO gains 4x over HDFS on write and 2x on read
Evaluation with HBase

[Figure: YCSB throughput (ops/s) for HDFS (56 Gbps) and NVFS (56 Gbps) at cluster size : number of records of 8:800K, 16:1600K, and 32:3200K, for 100% insert and 50% read / 50% update workloads.]

- YCSB 100% insert on SDSC Comet (32 nodes): NVFS-BlkIO gains 21% by storing only WALs to NVM
- YCSB 50% read, 50% update on SDSC Comet (32 nodes): NVFS-BlkIO gains 20% by storing only WALs to NVM
Opportunities to Use NVRAM+RDMA in MapReduce

- Map and reduce tasks carry out the total job execution
- Map tasks read from HDFS, operate on the data, and write the intermediate data to local disk (persistent); this involves bulk data transfer and disk operations
- Reduce tasks get these data via shuffle from NodeManagers, operate on them, and write to HDFS (persistent)
- Communication and I/O intensive: the shuffle phase uses HTTP over Java sockets, and I/O operations typically take place on SSD/HDD
Opportunities to Use NVRAM in MapReduce-RDMA Design

[Figure: map tasks read input files, map, spill, and merge intermediate data; reduce tasks shuffle it over RDMA, perform in-memory merge, reduce, and write output files.]

Opportunities exist to improve the performance with NVRAM, so that all operations are in-memory.
NVRAM-Assisted Map Spilling in MapReduce-RDMA

[Figure: same dataflow as before, but map tasks spill and merge intermediate data in NVRAM; reduce tasks shuffle it over RDMA, perform in-memory merge, reduce, and write output files.]

- Minimizes the disk operations in the spill phase

M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?, PDSW-DISCS, held with SC 2016.
M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, NVMD: Non-Volatile Memory Assisted Design for Accelerating MapReduce and DAG Execution Frameworks on HPC Systems, IEEE BigData 2017.
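The idea can be sketched as a spill buffer whose backing store is a byte-addressable NVRAM region rather than a local-disk file. The class below is an illustrative model only (a plain bytearray stands in for NVRAM, and the class name is invented); it is not the NVMD implementation.

```python
class NVRAMSpillBuffer:
    """Sketch: map-output spills are appended to a byte-addressable
    NVRAM region instead of local-disk spill files, so the shuffle
    phase can later read them directly (e.g., over RDMA)."""

    def __init__(self, capacity: int):
        self.region = bytearray(capacity)   # stands in for mapped NVRAM
        self.tail = 0
        self.index = {}                     # partition -> [(offset, length)]

    def spill(self, partition: int, records: bytes) -> None:
        """Append one spill for a reduce partition and index it."""
        if self.tail + len(records) > len(self.region):
            raise MemoryError("NVRAM region full")
        off = self.tail
        self.region[off:off + len(records)] = records
        # on real NVRAM, a clflushopt + sfence (or pmem_persist) goes here
        self.index.setdefault(partition, []).append((off, len(records)))
        self.tail += len(records)

    def read_partition(self, partition: int) -> bytes:
        """Serve a shuffle request by concatenating that partition's spills."""
        return b"".join(bytes(self.region[o:o + l])
                        for o, l in self.index.get(partition, []))
```

Because the spills are already persistent in place, the merge and shuffle phases avoid the write-to-disk / read-from-disk round trip of the default spill path.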
Comparison with Sort and TeraSort

[Figure: execution-time breakdowns showing 51% and 55% overall improvements and 2.48x and 2.37x Map-phase improvements.]

- RMR-NVM achieves 2.37x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit 55% compared to MR-IPoIB, 28% compared to RMR
- RMR-NVM achieves 2.48x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit 51% compared to MR-IPoIB, 31% compared to RMR
Evaluation of Intel HiBench Workloads

We evaluate different HiBench workloads with Huge data sets on 8 nodes.

Performance benefits for shuffle-intensive workloads compared to MR-IPoIB:
- Sort: 42% (25 GB)
- TeraSort: 39% (32 GB)
- PageRank: 21% (5 million pages)

Other workloads:
- WordCount: 18% (25 GB)
- KMeans: 11% (100 million samples)
Evaluation of PUMA Workloads

We evaluate different PUMA workloads on 8 nodes with 30 GB data size.

Performance benefits for shuffle-intensive workloads compared to MR-IPoIB:
- AdjList: 39%
- SelfJoin: 58%
- RankedInvIndex: 39%

Other workloads:
- SeqCount: 32%
- InvIndex: 18%
Presentation Outline

- NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
- NRCIO for Big Data Analytics
- NVMe-SSD based Big Data Analytics
- Conclusion and Q&A
Overview of NVMe Standard

- NVMe is the standardized interface for PCIe SSDs
- Built on RDMA principles: submission and completion I/O queues with similar semantics to RDMA send/recv queues, and asynchronous command processing
- Up to 64K I/O queues, with up to 64K commands per queue
- Efficient small random I/O operations
- MSI/MSI-X and interrupt aggregation

[Figure: NVMe command processing. Source: NVMExpress.org]
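The submission/completion pairing can be modeled with two queues per I/O queue pair. This is a simplified illustration of the command flow, not a real driver; the command and completion fields are invented.

```python
from collections import deque

class NVMeQueuePair:
    """Simplified model of one NVMe I/O queue pair: the host posts
    commands to the submission queue (SQ) and the device asynchronously
    posts completion entries to the completion queue (CQ)."""

    def __init__(self, depth: int = 65536):   # up to 64K commands per queue
        self.depth = depth
        self.sq = deque()
        self.cq = deque()

    def submit(self, cid: int, opcode: str, lba: int) -> None:
        """Host side: enqueue a command (a real driver then rings a doorbell)."""
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append({"cid": cid, "opcode": opcode, "lba": lba})

    def device_process(self) -> None:
        """Device side: consume commands and post completion entries."""
        while self.sq:
            cmd = self.sq.popleft()
            self.cq.append({"cid": cmd["cid"], "status": 0})

    def reap_completions(self):
        """Host side: drain the completion queue."""
        done = list(self.cq)
        self.cq.clear()
        return done
```

Because submission and completion are decoupled, the host can keep many commands in flight per queue, which is what makes NVMe efficient for small random I/O.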
Overview of NVMe-over-Fabrics (NVMf)

- Remote access to flash with NVMe over the network
- The RDMA fabric is the most important transport: its low latency makes remote access feasible
- 1-to-1 mapping of NVMe I/O queues to RDMA send/recv queues
- Low latency overhead compared to local I/O

[Figure: NVMf architecture. The host's I/O submission queue (SQ) and completion queue (CQ) map across the RDMA fabric to the remote NVMe device.]
Design Challenges with NVMe-SSD

- QoS: hardware-assisted QoS
- Persistence: flushing buffered data
- Performance: consider flash-related design aspects such as read/write performance skew and garbage collection
- Virtualization: SR-IOV hardware support and namespace isolation
- New software systems: disaggregated storage with NVMf, persistent caches, co-design
Evaluation with RocksDB

[Figure: latency (us) with POSIX vs. SPDK backends for Insert, Overwrite, Random Read, Write Sync, and Read Write.]

- 20%, 33%, and 61% latency improvement for Insert, Write Sync, and Read Write
- Overwrite: compaction and flushing happen in the background, so there is low potential for improvement
- Read: performance is much worse; additional tuning/optimization required
Evaluation with RocksDB

[Figure: throughput (ops/sec) with POSIX vs. SPDK backends for Insert, Overwrite, Random Read, Write Sync, and Read Write.]

- 25%, 50%, and 160% throughput improvement for Insert, Write Sync, and Read Write
- Overwrite: compaction and flushing happen in the background, so there is low potential for improvement
- Read: performance is much worse; additional tuning/optimization required
QoS-aware SPDK Design

[Figure: bandwidth (MB/s) over time for high- and medium-priority jobs under SPDK WRR vs. OSU-Design (Scenario 1), and job bandwidth ratios across synthetic application scenarios 2-5 for SPDK-WRR, OSU-Design, and the desired ratio.]

- Synthetic application scenarios with different QoS requirements
- Comparison using SPDK with Weighted Round Robin NVMe arbitration
- Near-desired job bandwidth ratios; stable and consistent bandwidth

S. Gugnani, X. Lu, and D. K. Panda, Analyzing, Modeling, and Provisioning QoS for NVMe SSDs, (under review)
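The weighted round-robin arbitration the design is compared against can be sketched as follows. The weights, queue contents, and budget are illustrative; this is not the SPDK or OSU implementation.

```python
from collections import deque

def weighted_round_robin(queues, weights, budget):
    """Sketch of weighted round-robin arbitration across NVMe submission
    queues: on each pass, queue i may issue up to weights[i] commands.
    Returns the command issue order."""
    qs = [deque(q) for q in queues]
    order = []
    while any(qs) and len(order) < budget:
        for q, w in zip(qs, weights):
            for _ in range(min(w, len(q))):
                order.append(q.popleft())   # issue one command from queue q
    return order
```

With weights 4:1, a high-priority job receives about four times the issue slots of a medium-priority one, which is the mechanism behind the desired job bandwidth ratios in the figure.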
Conclusion and Future Work

- Big Data analytics needs high-performance NVM-aware RDMA-based communication and I/O schemes
- Proposed a new library, NRCIO (work in progress)
- Re-designed the HDFS storage architecture with NVRAM
- Re-designed RDMA-MapReduce with NVRAM
- Designed Big Data analytics stacks with NVMe and NVMf protocols
- Results are promising
- Future work: further optimizations in NRCIO; co-design with more Big Data analytics frameworks (TensorFlow, object storage, databases, etc.)
Thank You!

luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/