EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures

1 EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures
Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda
{shi.876, lu.932,
The Ohio State University
Bench 2018

2 Outline: Introduction, EC-Bench, Evaluation, Conclusion and Future Work

3 Fault Tolerance & Replication
Ø Fault tolerance is a critical requirement for distributed storage systems
  o Availability: data is still accessible under failures
  o Durability: no data corruption under failures
Ø The traditional storage scheme for achieving fault tolerance is replication
  o Common replication factors: 3-1
[Figure: replication factor = 3; legend: original data (D), data used to maintain fault tolerance]

4 Booming of Data
[Figure: Data Stored in Exabytes, by year, 2016 to 2021; 36% CAGR. Source: Cisco Global Cloud Index, 2016-2021]
Ø The booming of data makes the storage overhead of replication considerable

5 Erasure Coding
Ø Erasure coding is a promising redundancy storage scheme
  o Minimizes storage overhead by applying erasure encoding to data
  o Delivers higher reliability than replication at the same storage overhead
  o Employed by Google, Microsoft, Facebook, Baidu, etc.
Ø Microsoft Azure reduces storage overhead from 3x (3-way replication) to 1.33x (erasure coding)
[Figure: k = 4, m = 2; data chunks D1 D2 D3 D4 and parity chunks P1 P2; legend: original data, data used to maintain fault tolerance]

6 Replication vs. Erasure Coding
Ø Replication (replication factor = 3): storage overhead = 2/1 = 2
Ø Erasure coding (k = 4, m = 2): storage overhead = 2/4 = 0.5
[Figure: replication of original data D; erasure coding with data chunks D1 D2 D3 D4 and parity chunks P1 P2; legend: original data, data used to maintain fault tolerance]
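
The overhead arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and overhead is taken, as on the slide, to be the redundant data divided by the original data.

```python
from fractions import Fraction

def replication_overhead(factor: int) -> Fraction:
    """Redundant copies per unit of original data (factor 3 means 2 extra copies)."""
    return Fraction(factor - 1, 1)

def erasure_coding_overhead(k: int, m: int) -> Fraction:
    """Parity chunks per data chunk, for k data chunks and m parity chunks."""
    return Fraction(m, k)

print(replication_overhead(3))                # 2
print(float(erasure_coding_overhead(4, 2)))   # 0.5
```

Using exact fractions keeps ratios such as 2/4 free of rounding until they are printed.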

7 Erasure Coding: Encoding
Ø Takes in k data chunks and generates m parity chunks
Ø Distributes the (k + m) chunks to (k + m) independent nodes
[Figure: k = 6, m = 3; data chunks D1 D2 D3 D4 D5 D6 and parity chunks P1 P2 P3]

8 Erasure Coding: Decoding
Ø Any k of the (k + m) chunks are sufficient to recover the original k data chunks
[Figure: k = 6, m = 3; k of the chunks D1 to D6 and P1 to P3 are read and decoded back into D1 D2 D3 D4 D5 D6]
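
The "any k of (k + m)" property can be illustrated with the simplest possible code: a single XOR parity chunk (m = 1). This is a deliberately swapped-in simplification, not the Reed-Solomon coding that the libraries in this talk implement, and all function names are hypothetical.

```python
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def encode(data_chunks):
    """Return the single parity chunk (m = 1) for k data chunks."""
    return xor_chunks(data_chunks)

def decode(chunks):
    """Rebuild the one missing chunk (marked None) from the k survivors."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity tolerates only one lost chunk")
    chunks = list(chunks)
    if missing:
        survivors = [c for c in chunks if c is not None]
        chunks[missing[0]] = xor_chunks(survivors)
    return chunks[:-1]  # drop the parity chunk, keep the k data chunks

data = [b"abcd", b"efgh", b"ijkl"]   # k = 3 data chunks
stored = data + [encode(data)]       # k + m = 4 chunks on 4 nodes
stored[1] = None                     # one node fails
recovered = decode(stored)           # the remaining k chunks suffice
```

With m > 1, XOR alone is no longer enough, which is where Reed-Solomon arithmetic over a Galois field comes in.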

9 Erasure Coding
Ø Pros: low storage overhead with high fault tolerance
Ø Cons: high computation overhead introduced by encoding and decoding
Ø High-performance, hardware-optimized erasure coding libraries alleviate the computation overhead
  o Intel CPUs -> Intel ISA-L
  o Nvidia GPUs -> Gibraltar
  o Mellanox InfiniBand -> Mellanox-EC

10 Onload and Offload Erasure Coders
Ø Onload erasure coder
  o Erasure coding operations are performed by host processors
  o Examples: Jerasure and Intel ISA-L
Ø Offload erasure coder
  o Erasure coding operations are conducted by accelerators such as GPUs and Mellanox InfiniBand HCAs
  o Examples: Gibraltar and Mellanox-EC
[Figure: onload erasure coders on the CPU (Jerasure, ISA-L); offload erasure coders on the GPU (Gibraltar) and the IB HCA (Mellanox-EC)]

11 Our Contributions
Ø EC-Bench: a unified benchmark suite to benchmark, measure, and characterize onload and offload erasure coders
Ø Evaluations of four popular open-source erasure coders with EC-Bench
[Figure: Performance (throughput, normalized throughput); Resource Utilization (CPU, cache); Heterogeneity: onload (Jerasure, Intel ISA-L) and offload (Gibraltar, Mellanox-EC)]

12 Outline: Introduction, EC-Bench, Evaluation, Conclusion and Future Work

13 EC-Bench: Encoding Benchmark
1. Read: a big file is read into the data buffer as chunks D1 to D6
2. Encode: the data chunks are encoded into chunks P1 to P3 in the coding buffer
[Figure: k = 6, m = 3]

14 EC-Bench: Decoding Benchmark
1. Read: a big file is read into the data buffer
2. Encode: the coding buffer is generated from the data buffer
3. Decode
[Figure: k = 6, m = 3; chunks D1 D2 D3 D4 D5 D6 and P1 P2 P3]

15 EC-Bench: Parameter Space
Ø k: the number of data chunks
Ø m: the number of parity chunks
Ø c: the size of each chunk
[Figure: data chunks D1 D2 D3 D4 ... Dk and parity chunks P1 ... Pm, annotated with # data chunks, # parity chunks, and chunk size]
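
The read-then-encode flow and the (k, m, c) parameter space can be sketched as a small timing harness. This is a sketch under stated assumptions, not EC-Bench itself: the function names are hypothetical, random bytes stand in for the big file, and the placeholder encoder is a plain XOR rather than any of the benchmarked libraries.

```python
import os
import time

def placeholder_encode(data_chunks, m):
    """Stand-in encoder (NOT Reed-Solomon): each parity chunk is the XOR of all data chunks."""
    parity = int.from_bytes(data_chunks[0], "little")
    for chunk in data_chunks[1:]:
        parity ^= int.from_bytes(chunk, "little")
    return [parity.to_bytes(len(data_chunks[0]), "little")] * m

def run_encode_benchmark(k, m, c, encode=placeholder_encode):
    """Fill a data buffer with k chunks of c bytes, encode it, and return (parity chunks, seconds)."""
    data_buffer = [os.urandom(c) for _ in range(k)]   # step 1: "read" (random bytes here)
    start = time.perf_counter()
    coding_buffer = encode(data_buffer, m)            # step 2: encode
    elapsed = time.perf_counter() - start
    return coding_buffer, elapsed

# Sweep the chunk size c for a fixed RS(6, 3)-shaped layout.
for c in (512, 8 * 1024, 128 * 1024):
    parity, seconds = run_encode_benchmark(k=6, m=3, c=c)
    print(f"c={c:>7} bytes: {len(parity)} parity chunks in {seconds:.6f} s")
```

Passing the encoder as a parameter mirrors the idea of one harness driving several coders; only the timed call changes between them.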

16 EC-Bench: Metrics
Ø Throughput: $\mathit{Thr} = \frac{(k + m) \times c}{t}$
Ø Normalized throughput: $\mathit{Thr}_{norm} = \frac{(k - 1) \times m \times c}{t} = \frac{(k - 1) \times m}{k + m} \times \mathit{Thr}$
  o Enables performance comparison across different configurations
  o Previous studies have demonstrated that optimal erasure codes take k - 1 XOR operations to generate one byte
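
The two throughput metrics can be written out directly. This sketch assumes the slide's formulas read as Thr = (k + m) * c / t and Thr_norm = (k - 1) * m * c / t, with t taken to be the elapsed time in seconds; the function names and the sample values are hypothetical, not measurements.

```python
import math

def throughput(k, m, c, t):
    """Thr = (k + m) * c / t, in bytes per second."""
    return (k + m) * c / t

def normalized_throughput(k, m, c, t):
    """Thr_norm = (k - 1) * m * c / t, i.e. Thr scaled by (k - 1) * m / (k + m)."""
    return (k - 1) * m * c / t

k, m, c, t = 6, 3, 2048, 0.001   # illustrative values only
thr = throughput(k, m, c, t)
thr_norm = normalized_throughput(k, m, c, t)

# The two forms of the normalized metric agree.
assert math.isclose(thr_norm, (k - 1) * m / (k + m) * thr)
```

The factor (k - 1) * m reflects the k - 1 XOR operations per generated byte, applied to each of the m parity chunks, which is what makes configurations with different k and m comparable.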

17 EC-Bench: Metrics
Ø CPU utilization: $\text{CPU Utilization} = \frac{\text{CPU Cycles}}{t}$
Ø Cache pressure: $\text{Cache Pressure} = \frac{\text{L1 Cache Misses}}{t}$

18 Outline: Introduction, EC-Bench, Evaluation, Conclusion and Future Work

19 Evaluation: Open Source Libraries
Erasure Coder | Specific Hardware Support | Description
Jerasure | CPU | Jerasure is a CPU-based library released in 2007 that supports a wide variety of erasure codes. Compiled without SSE support.
Intel ISA-L | CPU with SSE/AVX | Intel Intelligent Storage Acceleration Library (ISA-L) is a collection of optimized low-level functions, including erasure coding. The erasure coding functions are optimized for Intel instructions, such as Intel SSE, vector, and encryption instructions.
Gibraltar | GPU | Gibraltar is a GPU-based library for Reed-Solomon coding.
Mellanox-EC | IB NIC with EC Offload | Mellanox-EC, proposed by Mellanox, is an HCA-based library for Reed-Solomon coding.

20 Evaluation: Experimental Setup
Ø 2.4 GHz Intel(R) Xeon(R) CPU E5-2680 v4
  o 28 cores, 32 KB L1 cache, 256 KB L2 cache, and 35 MB L3 cache
Ø 128 GB DRAM
Ø Nvidia K80 GPU
Ø Mellanox ConnectX-5 IB-EDR (100 Gbps) NIC
Ø Explored configurations (RS(k, m))
  o RS(3,2), RS(6,3), RS(10,4), and RS(17,3)
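
To get a feel for how the explored configurations differ, the sketch below tabulates the storage overhead (m / k) and the number of tolerated chunk losses (m) for each RS(k, m). The function name is hypothetical, and the third configuration is assumed to be RS(10, 4), reading the slide's "RS(1,4)" as having lost a zero in extraction.

```python
# RS(k, m) configurations; (10, 4) is an assumption about the garbled "RS(1,4)".
configs = [(3, 2), (6, 3), (10, 4), (17, 3)]

def describe(k, m):
    """Return storage overhead (parity / data) and tolerated chunk losses for RS(k, m)."""
    return {"overhead": m / k, "tolerated_failures": m, "total_chunks": k + m}

for k, m in configs:
    info = describe(k, m)
    print(f"RS({k},{m}): {info['total_chunks']} chunks, "
          f"overhead={info['overhead']:.2f}, "
          f"tolerates {info['tolerated_failures']} lost chunks")
```

Wider stripes such as RS(17, 3) lower the storage overhead but require more chunks to be read for every decode.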

21 Evaluation: Experimental Results
[Figure: Throughput Performance with Varied Chunk Sizes for RS(3, 2); x-axis: chunk size; y-axes: throughput (MB/sec) and normalized throughput (MB/sec); series: Jerasure, ISA-L, Mellanox-EC, Gibraltar]

22 Evaluation: Experimental Results
[Figure: Throughput Performance with Varied Chunk Sizes for RS(3, 2)]
Ø For small chunk sizes (< 32 B), Jerasure performs better than Intel ISA-L

23 Evaluation: Experimental Results
[Figure: Throughput Performance with Varied Chunk Sizes for RS(3, 2), with chunk size = 2 KB marked]
Ø For both onload coders, the best-performing chunk size is 2 KB

24 Evaluation: Experimental Results
[Figure: Throughput Performance with Varied Chunk Sizes for RS(3, 2), with chunk size 512 KB marked]
Ø For both offload coders, the best-performing chunk size is 512 KB, because of data transformation overhead

25 Evaluation: Experimental Results
[Figure: Normalized Throughput Performance of Onload and Offload Coders; panels: onload coders (2 KB chunks) and offload coders (512 KB chunks); x-axis: RS(3, 2), RS(6, 3), RS(10, 4), RS(17, 3); y-axis: normalized throughput (MB/sec); series: Jerasure, ISA-L, Mellanox-EC, Gibraltar]
Ø RS(10, 4) is the best configuration for Intel ISA-L and Gibraltar
Ø Mellanox-EC performs best with configuration RS(17, 3)
Ø Jerasure with RS(3, 2) slightly outperforms its other configurations

26 Evaluation: Experimental Results
[Figure: CPU Utilization with Varied Chunk Sizes for RS(6, 3); x-axis: chunk size; y-axis: CPU utilization (million cycles / sec); series: Jerasure, ISA-L, Mellanox-EC, Gibraltar]

27 Evaluation: Experimental Results
[Figure: CPU Utilization with Varied Chunk Sizes for RS(6, 3), with chunk size = 32 B marked]
Ø Intel ISA-L uses different approaches for chunk sizes < 32 bytes and >= 32 bytes (details in the function ec_encode_data_avx2)

28 Evaluation: Experimental Results
[Figure: CPU Utilization with Varied Chunk Sizes for RS(6, 3)]
Ø Offload coders make better use of CPU cycles and thus have less impact on applications' computation
  o With chunk size = 64 MB, .41 million cycles by Mellanox-EC vs million cycles by ISA-L

29 Evaluation: Experimental Results
[Figure: Cache Pressure with Varied Chunk Sizes for RS(10, 4); x-axis: chunk size; y-axis: L1 cache misses (millions / sec); series: Jerasure, ISA-L, Mellanox-EC, Gibraltar]

30 Evaluation: Experimental Results
[Figure: Cache Pressure with Varied Chunk Sizes for RS(10, 4)]
Ø Among the onload coders, Intel ISA-L makes better use of the L1 cache than Jerasure

31 Evaluation: Experimental Results
[Figure: Cache Pressure with Varied Chunk Sizes for RS(10, 4)]
Ø In general, offload coders put less pressure on the L1 cache and thus have less impact on applications' cache usage

32 Outline: Introduction, EC-Bench, Evaluation, Conclusion and Future Work

33 Conclusion and Future Work
Ø EC-Bench: a unified benchmark suite to benchmark, measure, and characterize onload and offload erasure coders
Ø Evaluations of four popular open-source erasure coders with EC-Bench
  o Advanced onload coders outperform offload coders
  o Offload coders make better use of CPU cycles and cache
Ø Future work
  o Support more hardware-optimized erasure coders, e.g., FPGA-optimized erasure coders

34 Thank You!
{shi.876, lu.932,
Network-Based Computing Laboratory
The High-Performance Big Data Project
