EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures


EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures. Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda, {shi.876, lu.932, panda.2}@osu.edu, The Ohio State University. Bench 2018

Bench 2018 2 Outline: Introduction, EC-Bench, Evaluation, Conclusion and Future Work

Bench 2018 3 Fault Tolerance & Replication. Fault tolerance is a critical requirement for distributed storage systems. Availability: data is still accessible under failures. Durability: no data corruption under failures. The traditional storage scheme to achieve fault tolerance is replication; a common replication factor is 3. (Figure: one original data block D plus two replica blocks used to maintain fault tolerance, replication factor = 3.)

Bench 2018 4 Booming of Data. (Chart: data stored in exabytes, 2016-2021: 286, 397, 547, 721, 985, 1,327; 36% CAGR 2016-2021. Source: Cisco Global Cloud Index, 2016-2021.) The booming of data makes the storage overhead of replication considerable.

Bench 2018 5 Erasure Coding. Erasure coding is a promising redundant storage scheme: it minimizes storage overhead by applying erasure encoding to the data and delivers higher reliability than replication at the same storage overhead. It is employed by Google, Microsoft, Facebook, Baidu, etc. Microsoft Azure reduces storage overhead from 3x (3-way replication) to 1.33x with erasure coding (k = 4, m = 2). (Figure: data chunks D1-D4 are the original data; parity chunks P1-P2 are the data used to maintain fault tolerance.)

Bench 2018 6 Replication vs. Erasure Coding. Replication with factor 3: storage overhead = 2/1 = 2. Erasure coding with k = 4 data chunks (D1-D4) and m = 2 parity chunks (P1-P2): storage overhead = 2/4 = 0.5. (Figure: original data vs. data used to maintain fault tolerance under both schemes.)
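To make the slide's arithmetic concrete, here is a minimal sketch (illustrative only, not part of EC-Bench) that computes the two storage-overhead figures above:

```c
/* Storage overhead = redundant bytes / original bytes. */
#include <stdio.h>

int main(void)
{
    int replicas = 3;                 /* 3-way replication       */
    int k = 4, m = 2;                 /* RS(4, 2) erasure coding */

    double rep_overhead = (double)(replicas - 1);   /* 2 extra copies per original: 2.0 */
    double ec_overhead  = (double)m / k;            /* 2 parity per 4 data chunks: 0.5  */

    printf("replication: %.1fx overhead, erasure coding: %.1fx overhead\n",
           rep_overhead, ec_overhead);
    return 0;
}
```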

Bench 2018 7 Erasure Coding: Encoding. Encoding takes in k data chunks and generates m parity chunks, then distributes the (k + m) chunks to (k + m) independent nodes. (Figure: k = 6 data chunks D1-D6 and m = 3 parity chunks P1-P3.)

Bench 2018 8 Erasure Coding: Decoding. Any k of the (k + m) chunks are sufficient to recover the original k data chunks. (Figure: with k = 6, m = 3, reading any six of D1-D6 and P1-P3 and decoding reproduces D1-D6.)
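As an illustration of why surviving chunks suffice to rebuild lost ones, here is a deliberately simplified sketch in C: a single XOR parity chunk (m = 1) over k = 4 data chunks, the degenerate case of the Reed-Solomon codes the coders below implement. The chunk contents and sizes are made up for the example.

```c
#include <stdio.h>

#define K 4
#define CHUNK 8

static void xor_into(unsigned char *dst, const unsigned char *src)
{
    for (int i = 0; i < CHUNK; i++)
        dst[i] ^= src[i];
}

int main(void)
{
    unsigned char data[K][CHUNK] = { "chunk-1", "chunk-2", "chunk-3", "chunk-4" };
    unsigned char parity[CHUNK] = { 0 };

    for (int d = 0; d < K; d++)              /* encode: P = D1 ^ D2 ^ D3 ^ D4  */
        xor_into(parity, data[d]);

    unsigned char lost[CHUNK] = { 0 };       /* pretend data[2] was lost       */
    xor_into(lost, parity);                  /* decode: D3 = P ^ D1 ^ D2 ^ D4  */
    for (int d = 0; d < K; d++)
        if (d != 2)
            xor_into(lost, data[d]);

    printf("recovered: %s\n", lost);         /* prints "chunk-3"               */
    return 0;
}
```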

Bench 2018 9 Erasure Coding. Pros: low storage overhead with high fault tolerance. Cons: high computation overhead introduced by encoding and decoding. High-performance, hardware-optimized erasure coding libraries alleviate the computation overhead: Intel CPUs -> Intel ISA-L; Nvidia GPUs -> Gibraltar; Mellanox InfiniBand -> Mellanox-EC.

Bench 2018 10 Onload and Offload Erasure Coders. Onload erasure coders: erasure coding operations are performed by the host processors (CPU), e.g., Jerasure and Intel ISA-L. Offload erasure coders: erasure coding operations are conducted by accelerators such as GPUs and Mellanox InfiniBand HCAs, e.g., Gibraltar (GPU) and Mellanox-EC (IB HCA).
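For a feel of what an onload coder's API looks like, here is a hedged sketch of encoding RS(6, 3) with Intel ISA-L. The calls (gf_gen_rs_matrix, ec_init_tables, ec_encode_data) follow ISA-L's erasure-code interface as distributed in its headers and examples, but the header path and exact signatures should be verified against the installed isa-l/erasure_code.h.

```c
#include <stdlib.h>
#include <isa-l/erasure_code.h>        /* header path may differ per install */

enum { K = 6, M = 3, CHUNK = 1 << 20 };   /* RS(6, 3), 1 MB chunks */

int main(void)
{
    unsigned char *chunks[K + M];
    for (int i = 0; i < K + M; i++)
        chunks[i] = calloc(CHUNK, 1);      /* data chunks would be filled from a file */

    /* Generator matrix (first K rows identity, last M rows parity) and
       expanded GF(2^8) multiplication tables. */
    unsigned char matrix[(K + M) * K];
    unsigned char gftbls[32 * K * M];
    gf_gen_rs_matrix(matrix, K + M, K);
    ec_init_tables(K, M, &matrix[K * K], gftbls);

    /* Onload encode: parity chunks are computed on the host CPU, using the
       SSE/AVX paths ISA-L selects at run time. */
    ec_encode_data(CHUNK, K, M, gftbls, chunks, &chunks[K]);

    for (int i = 0; i < K + M; i++)
        free(chunks[i]);
    return 0;
}
```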

Bench 2018 11 Our Contributions. EC-Bench: a unified benchmark suite to benchmark, measure, and characterize onload and offload erasure coders across performance (throughput, normalized throughput), resource utilization (CPU, cache), and hardware heterogeneity. Evaluations of four popular open-source erasure coders with EC-Bench: onload (Jerasure, Intel ISA-L) and offload (Gibraltar, Mellanox-EC).

Bench 2018 12 Outline: Introduction, EC-Bench, Evaluation, Conclusion and Future Work

Bench 2018 13 EC-Bench: Encoding Benchmark. (Figure, k = 6, m = 3: 1. Read chunks D1-D6 from a big file into the data buffer; 2. Encode them to produce parity chunks P1-P3 in the coding buffer.)

Bench 2018 14 EC-Bench: Decoding Benchmark. (Figure, k = 6, m = 3: 1. Read data chunks D1-D6 from a big file into the data buffer; 2. Encode to produce parity chunks P1-P3 in the coding buffer; 3. Decode to recover the data chunks.)

Bench 2018 15 EC-Bench: Parameter Space. (Figure: data chunks D1...Dk and parity chunks P1...Pm, each of a given chunk size.) Parameters: k, the number of data chunks; m, the number of parity chunks; c, the size of each chunk. A sketch of how these parameters drive the benchmark loop is shown below.
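The following minimal sketch shows how the parameter space could feed a timing loop; the struct and function names are illustrative assumptions, not the actual EC-Bench sources, and the encode callback is a stand-in for whichever coder (Jerasure, ISA-L, Gibraltar, Mellanox-EC) is under test.

```c
#include <stdio.h>
#include <time.h>

struct ec_params {
    int    k;          /* number of data chunks   */
    int    m;          /* number of parity chunks */
    size_t c;          /* size of each chunk      */
};

/* Placeholder for whichever coder is under test. */
typedef void (*encode_fn)(const struct ec_params *p,
                          unsigned char **data, unsigned char **parity);

static double time_encode(const struct ec_params *p, encode_fn encode,
                          unsigned char **data, unsigned char **parity, int trials)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < trials; i++)
        encode(p, data, parity);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

static void dummy_encode(const struct ec_params *p,
                         unsigned char **data, unsigned char **parity)
{
    (void)p; (void)data; (void)parity;   /* stand-in for a real coder call */
}

int main(void)
{
    struct ec_params sweep[] = { {3, 2, 2048}, {6, 3, 2048}, {10, 4, 524288} };
    for (size_t i = 0; i < sizeof(sweep) / sizeof(sweep[0]); i++) {
        double t = time_encode(&sweep[i], dummy_encode, NULL, NULL, 1000);
        printf("RS(%d,%d) c=%zu: %.6f sec for 1000 trials\n",
               sweep[i].k, sweep[i].m, sweep[i].c, t);
    }
    return 0;
}
```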

Bench 2018 16 EC-Bench: Metrics. Throughput: Thr = (k + m) · c / t, where t is the measured coding time. Normalized throughput enables comparison across different configurations; previous studies have demonstrated that optimal erasure codes take k − 1 XOR operations to generate one byte, so Thr_norm = (k − 1) · m · c / t = [(k − 1) · m / (k + m)] · Thr.
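For instance, with illustrative numbers not taken from the evaluation: if RS(6, 3) encodes c = 1 MB chunks in t = 1 ms, then Thr = (6 + 3) · 1 MB / 1 ms = 9 GB/sec, while Thr_norm = (6 − 1) · 3 · 1 MB / 1 ms = 15 GB/sec. The normalization credits each configuration for the (k − 1) · m XOR-equivalents of work behind its output, so coders run with different (k, m) become directly comparable.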

Bench 2018 17 EC-Bench: Metrics. CPU utilization: CPU_Utilization = CPU_cycles / t. Cache pressure: Cache_Pressure = L1_cache_misses / t.
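One way these counters could be collected on Linux is perf_event_open(2); the sketch below is an assumption about the measurement plumbing, not EC-Bench's own instrumentation. It samples CPU cycles and L1 data-cache read misses for the calling thread and reports them per second.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_counter(unsigned int type, unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    /* CPU cycles and L1 data-cache read misses for the calling thread. */
    int cyc = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    int l1m = open_counter(PERF_TYPE_HW_CACHE,
                           PERF_COUNT_HW_CACHE_L1D |
                           (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                           (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
    if (cyc < 0 || l1m < 0) { perror("perf_event_open"); return 1; }

    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(l1m, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the encode/decode loop being measured here ... */
    double elapsed_sec = 1.0;   /* measure with clock_gettime() in practice */

    long long cycles = 0, misses = 0;
    ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(l1m, PERF_EVENT_IOC_DISABLE, 0);
    read(cyc, &cycles, sizeof(cycles));
    read(l1m, &misses, sizeof(misses));

    printf("CPU utilization: %.2f Mcycles/sec, cache pressure: %.2f Mmisses/sec\n",
           cycles / elapsed_sec / 1e6, misses / elapsed_sec / 1e6);
    close(cyc);
    close(l1m);
    return 0;
}
```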

Bench 2018 18 Outline: Introduction, EC-Bench, Evaluation, Conclusion and Future Work

Bench 2018 19 Evaluation: Open Source Libraries
Erasure Coder | Specific Hardware Support | Description
Jerasure | CPU | A CPU-based library released in 2007 that supports a wide variety of erasure codes; compiled without SSE support here.
Intel ISA-L | CPU with SSE/AVX | The Intel Intelligent Storage Acceleration Library (ISA-L) is a collection of optimized low-level functions, including erasure coding; the erasure coding functions are optimized for Intel instructions such as SSE, vector, and encryption instructions.
Gibraltar | GPU | A GPU-based library for Reed-Solomon coding.
Mellanox-EC | IB NIC with EC offload | An HCA-based library for Reed-Solomon coding, proposed by Mellanox.

Bench 2018 20 Evaluation: Experimental Setup. 2.4 GHz Intel(R) Xeon(R) CPU E5-2680 v4 (28 cores, 32KB L1 cache, 256KB L2 cache, and 35MB L3 cache), 128GB DRAM, Nvidia K80 GPU, Mellanox ConnectX-5 IB-EDR (100 Gbps) NIC. Explored configurations (RS(k, m)): RS(3,2), RS(6,3), RS(10,4), and RS(17,3).

Bench 2018 21 Evaluation: Experimental Results. (Chart: throughput and normalized throughput, in MB/sec, of Jerasure, ISA-L, Mellanox-EC, and Gibraltar with chunk sizes varied from 1 B to 32 MB; caption: Throughput Performance with Varied Chunk Sizes for RS(3, 2).)

Bench 2018 22 Evaluation: Experimental Results. (Same chart: Throughput Performance with Varied Chunk Sizes for RS(3, 2).) For small chunk sizes (< 32 B), Jerasure performs better than Intel ISA-L.

Bench 2018 23 Evaluation: Experimental Results. (Same chart, annotated at chunk size = 2 KB.) For both onload coders, the best-performing chunk size is 2 KB.

Bench 2018 24 Evaluation: Experimental Results. (Same chart, annotated at chunk size = 512 KB.) For both offload coders, the best-performing chunk size is 512 KB, because of the data transformation overhead.

Bench 2018 25 Evaluation: Experimental Results. (Chart: Normalized Throughput Performance of Onload and Offload Coders, in MB/sec, with onload coders at a 2 KB chunk size and offload coders at a 512 KB chunk size, across RS(3, 2), RS(6, 3), RS(10, 4), and RS(17, 3).) RS(10, 4) is the best configuration for Intel ISA-L and Gibraltar; Mellanox-EC performs best with RS(17, 3); Jerasure with RS(3, 2) slightly outperforms its other configurations.

Bench 2018 26 Evaluation: Experimental Results. (Chart: CPU utilization, in million cycles/sec, of Jerasure, ISA-L, Mellanox-EC, and Gibraltar with chunk sizes varied from 1 B to 64 MB; caption: CPU Utilization with Varied Chunk Sizes for RS(6, 3).)

Bench 2018 27 Evaluation: Experimental Results. (Same chart, annotated at chunk size = 32 B.) Intel ISA-L uses different approaches for chunks smaller than 32 bytes and for chunks of 32 bytes or larger (details in the function ec_encode_data_avx2).

Bench 2018 28 Evaluation: Experimental Results. (Same chart: CPU Utilization with Varied Chunk Sizes for RS(6, 3).) Offload coders make better use of CPU cycles and thus have less impact on applications' computation: with a chunk size of 64 MB, Mellanox-EC consumes 0.41 million cycles vs. 2932.23 million cycles for ISA-L.

Bench 2018 29 Evaluation: Experimental Results. (Chart: L1 cache misses, in millions/sec, of Jerasure, ISA-L, Mellanox-EC, and Gibraltar with chunk sizes varied from 1 B to 64 MB; caption: Cache Pressure with Varied Chunk Sizes for RS(10, 4).)

Bench 2018 30 Evaluation: Experimental Results. (Same chart: Cache Pressure with Varied Chunk Sizes for RS(10, 4).) Among the onload coders, Intel ISA-L makes better use of the L1 cache than Jerasure.

Bench 2018 31 Evaluation: Experimental Results. (Same chart: Cache Pressure with Varied Chunk Sizes for RS(10, 4).) In general, offload coders put less pressure on the L1 cache and thus have less impact on applications' cache usage.

Bench 2018 32 Outline: Introduction, EC-Bench, Evaluation, Conclusion and Future Work

Bench 2018 33 Conclusion and Future Work. EC-Bench: a unified benchmark suite to benchmark, measure, and characterize onload and offload erasure coders. Evaluations of four popular open-source erasure coders with EC-Bench show that advanced onload coders outperform offload coders, while offload coders make better use of CPU cycles and cache. Future work: support more hardware-optimized erasure coders, e.g., FPGA-optimized erasure coders.

Bench 2018 34 Thank You! {shi.876, lu.932, panda.2}@osu.edu. Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/. The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/