EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University Bench 218
Bench 218 2 Outline Introduction EC-Bench Evaluation Conclusion and Future Work
Bench 218 3 Fault Tolerance & Replication Fault tolerance is a critical requirement for distributed storages Availability: data is still accessible under failures Durability: no data corruptions under failures Traditional storage scheme to achieve fault tolerance is replication Common replication factors: 3-1 Replication factor = 3 D Original data Data used to maintain fault tolerance
Bench 218 4 Booming of Data 1,4 Data Stored in Exabytes 1,2 1, 8 985 1,327 36% CAGR 216-221 6 721 4 2 286 397 547 Source: Cisco Global Cloud Index, 216-221 216 217 218 219 22 221 The booming of data makes storage overhead of replication considerable
Bench 218 5 Erasure Coding Erasure Coding is a promising redundancy storage scheme Minimize storage overhead by applying erasure encoding on data Deliver higher reliability with same storage overhead than replication Employed in Google, Microsoft, Facebook, Baidu, etc. v Microsoft Azure reduces storage overhead from 3x (3-way replication) to 1.33x - k = 4 - m = 2 (erasure coding) Data Chunks Parity Chunks D1 D2 D3 D4 P1 P2 Original data Data used to maintain fault tolerance
Bench 218 6 Replication vs. Erasure Coding Replication Storage overhead = 2/1 = 2 D Replication factor = 3 Original data Data used to maintain fault tolerance Data Chunks Parity Chunks D1 D2 D3 D4 P1 P2 - k = 4 - m = 2 Erasure Coding Storage overhead = 2/4 =.5 Original data Data used to maintain fault tolerance
Bench 218 7 Erasure Coding: Encoding Ø Takes in k data chunks and generates m parity chunks Ø Distributes (k + m) chunks to (k + m) independent nodes - k = 6 - m = 3 Data Chunks Parity Chunks D1 D2 D3 D4 D5 D6 P1 P2 P3
Bench 218 8 Erasure Coding: Decoding Ø Any k of (k + m) chunks are sufficient to recover the original k data chunks - k = 6 - m = 3 Data Chunks Parity Chunks D1 D2 D3 D4 D5 D6 P1 P2 P3 Read Decode D1 D2 D3 D4 D5 D6
Bench 218 9 Erasure Coding Pros: Low storage overhead with high fault tolerance Cons: High computation overhead introduced by encode and decode Ø High performance hardware-optimized erasure coding libraries to alleviate computation overhead Ø Intel CPUs -> Intel ISA-L Ø Nvidia GPUs -> Gibraltar Ø Mellanox InfiniBand -> Mellanox-EC
Bench 218 1 Onload and Offload Erasure Coders Onload Erasure Coder Erasure coding operations are performed by host processors Jerasure and Intel ISA-L Onload Erasure Coders Jerasure CPU ISA-L Offload Erasure Coder Erasure coding operations are conducted by accelerators like Offload Erasure Coders GPU Gibraltar IB HCA Mellanox-EC GPUs and Mellanox InfiniBand HCAs Gibraltar and Mellanox-EC
Bench 218 11 Our Contributions Ø EC-Bench: An unified benchmark suite to benchmark, measure, and characterize onload and offload erasure coders Performance Throughput Normalized Throughput Ø Evaluations on four popular open source erasure coders with EC-Bench Onload Offload CPU Cache Jerasure Intel ISA-L Heterogeneity Resource Utilization Gibraltar Mellanox-EC
Bench 218 12 Outline Introduction EC-Bench Evaluation Conclusion and Future Work
Bench 218 13 EC-Bench: Encoding Benchmark Data Buffer Big File 1. Read Chunk D1 Chunk D2 Chunk D3 Chunk D4 2. Encode Coding Buffer Chunk P1 Chunk P2 Chunk P3 - k = 6 - m = 3 Chunk D5 Chunk D6
Bench 218 14 EC-Bench: Decoding Benchmark Big File 1. Read Data Buffer 2. Encode Coding Buffer D1 D2 D3 D4 D5 D6 P1 P2 P3 - k = 6 - m = 3 3. Decode
Bench 218 15 EC-Bench: Parameter Space # Data Chunks # Parity Chunks D1 D2 D3 D4 Dk P1 Pm Chunk Size k the number of data chunks m the number of parity chunks c the size of each chunk
Bench 218 16 EC-Bench: Metrics Ø Throughput!h# = % + ' ( ) * Ø Normalized Throughput o Enable to compare the performance across different configurations o Previous studies have demonstrated that optimal erasure codes take k 1 XOR operations to generate one byte!h# +,-. = % 1 ( ' ( ) * = % 1 ( ' % + ' (!h#
Bench 218 17 EC-Bench: Metrics Ø CPU Utilization!"# #$%&%'($%)* =!"#!,-&./ $ Ø Cache Pressure!(-h. "1.//21. = 31!(-h. 5%//./ $
Bench 218 18 Outline Introduction EC-Bench Evaluation Conclusion and Future Work
Bench 218 19 Evaluation: Open Source Libraries Erasure Coder Specific Hardware Support Description Jerasure Intel ISA-L Gibraltar Mellanox-EC CPU CPU with SSE/AVX GPU IB NIC with EC Offload Jerasure is a CPU-based library released in 27 that supports a wide variety of erasure codes. Compiled without SSE support. Intel Intelligent Storage Acceleration Library (ISA-L) is a collection of optimized low-level functions including erasure coding. The erasure coding functions are optimized for Intel instructions, such as Intel SSE, vector and encryption instructions. Gibraltar is a GPU-based library for Reed- Solomon coding. Mellanox-EC proposed by Mellanox is an HCAbased library for Reed-Solomon coding.
Bench 218 2 Evaluation: Experimental Setup 2.4 GHz Intel(R) Xeon(R) CPU E5-268 v4 o 28 cores, 32KB L1 cache, 256KB L2 cache, and 35MB L3 cache 128GB DRAM Nvidia K8 GPU Mellanox ConnectX-5 IB-EDR (1 Gbps) NIC Explored Configurations (RS(k, m)) o RS(3,2), RS(6,3), RS(1,4) and RS(17,3)
Bench 218 21 Evaluation: Experimental Results Throughput (MB/sec) 4 35 3 25 2 15 1 5 1 Jerasure ISA-L Mellanox-EC Gibraltar 4 16 32 16 28 14 24 12 2 1 16 12 8 4 8 6 4 2 32 Chunk Size 512 8K 128K Throughput Performance with Varied Chunk Sizes for RS(3, 2) 2M 32M 128 112 96 8 64 48 32 16 Normalized Throughput (MB/sec)
Bench 218 22 Evaluation: Experimental Results Throughput (MB/sec) 4 35 3 25 2 15 1 5 1 Jerasure ISA-L Mellanox-EC Gibraltar 4 16 32 16 28 14 24 12 2 1 16 12 8 4 8 6 4 2 32 Chunk Size 512 128 112 For small chunk sizes (< 32B), Jerasure performs better than Intel ISA-L 8K 128K Throughput Performance with Varied Chunk Sizes for RS(3, 2) 2M 32M 96 8 64 48 32 16 Normalized Throughput (MB/sec)
Bench 218 23 Evaluation: Experimental Results Throughput (MB/sec) 4 35 3 25 2 15 1 5 1 Jerasure ISA-L Mellanox-EC Gibraltar 4 16 32 16 28 14 24 12 2 1 16 12 8 4 8 6 4 2 32 Chunk Size 512 Chunk size = 2 KB 8K 128K 2M 32M Throughput Performance with Varied Chunk Sizes for RS(3, 2) 128 112 96 8 64 48 32 16 For both onload coders, the best chunk size to carry out is 2 KB Normalized Throughput (MB/sec)
Evaluation: Experimental Results Throughput (MB/sec) 4 35 3 25 2 15 1 5 1 Jerasure ISA-L Mellanox-EC Gibraltar 4 16 32 16 28 14 24 12 2 1 16 12 8 4 8 6 4 2 32 Chunk Size Chunk size 512 KB 512 8K 128K 2M 32M Throughput Performance with Varied Chunk Sizes for RS(3, 2) 128 112 For both offload coders, the best chunk size to carry out is 512 KB Because of data transformation overhead 96 8 64 48 32 16 Bench 218 24 Normalized Throughput (MB/sec)
Bench 218 25 Evaluation: Experimental Results Normalized Throughput (MB/sec) 25 2 15 1 5 Onload Coders 2KB RS(3, 2) RS(6, 3) RS(1, 4)RS(17, 3) 25 2 15 1 5 RS(1, 4) is the best configuration for Intel ISA-L and Gibraltar Mellanox-EC performs the best with configuration RS(17, 3) Jerasure with RS(3, 2) outperforms other configurations slightly Offload Coders 512KB RS(3, 2) RS(6, 3) RS(1, 4)RS(17, 3) Jerasure ISA-L Mellanox-EC Gibraltar Normalized Throughput Performance of Onload and Offload Coders
Bench 218 26 Evaluation: Experimental Results CPU Utilization (million cycles / sec) 1 875 75 625 5 375 25 125 1 4 16 64 256 1K 4K 16K 64K256K Chunk Size Jerasure ISA-L Mellanox-EC Gibraltar 1M 4M16M CPU Utilization with Varied Chunk Sizes for RS(6, 3) 64M
Bench 218 27 Evaluation: Experimental Results CPU Utilization (million cycles / sec) 1 875 75 625 5 375 25 125 1 Chunk size = 32 B 4 16 64 256 1K 4K 16K 64K256K Chunk Size Jerasure ISA-L Mellanox-EC Gibraltar 1M 4M16M CPU Utilization with Varied Chunk Sizes for RS(6, 3) 64M In Intel ISA-L, different approaches for < 32 bytes and 32 bytes (details in the function ec_encode_data_avx2)
Bench 218 28 Evaluation: Experimental Results CPU Utilization (million cycles / sec) 1 875 75 625 5 375 25 125 1 4 16 64 256 1K 4K 16K 64K256K Chunk Size Jerasure ISA-L Mellanox-EC Gibraltar 1M 4M16M Offload coders make better use of CPU cycles, thus have less impact on applications computation o CPU Utilization with Varied Chunk Sizes for RS(6, 3) With chunk size = 64 MB,.41 million cycles by Mellanox-EC vs. 2932.23 million cycles by ISA-L 64M
Bench 218 29 Evaluation: Experimental Results L1 Cache Miss (millions / sec) 8 7 6 5 4 3 2 1 1 4 Jerasure ISA-L Mellanox-EC Gibraltar 16 64 256 1K 4K 16K Chunk Size 64K 256K Cache Pressure with Varied Chunk Sizes for RS(1, 4) 1M 4M 16M 64M
Bench 218 3 Evaluation: Experimental Results L1 Cache Miss (millions / sec) 8 7 6 5 4 3 2 1 1 4 Jerasure ISA-L Mellanox-EC Gibraltar 16 64 256 1K 4K 16K Chunk Size 64K 256K Cache Pressure with Varied Chunk Sizes for RS(1, 4) 1M 4M 16M 64M Within onload coders, Intel ISA-L makes better use of L1 cache compared to Jerasure
Bench 218 31 Evaluation: Experimental Results L1 Cache Miss (millions / sec) 8 7 6 5 4 3 2 1 1 4 Jerasure ISA-L Mellanox-EC Gibraltar 16 64 256 1K 4K 16K Chunk Size 64K 256K Cache Pressure with Varied Chunk Sizes for RS(1, 4) 1M 4M 16M 64M In general, offload coders make less pressure on L1 cache, thus have less impact on applications cache usage
Bench 218 32 Outline Introduction EC-Bench Evaluation Conclusion and Future Work
Bench 218 33 Conclusion and Future Work Ø EC-Bench: An unified benchmark suite to benchmark, measure, and characterize onload and offload erasure coders Ø Evaluations on four popular open source erasure coders with EC-Bench Ø Advanced onload coders outperform offload coders Ø Offload coders make better use of CPU cycles and cache Ø Future work Ø Support more hardware-optimized erasure coders, e.g., FPGA-optimized erasure coders
Bench 218 34 Thank You! {shi.876, lu.932, panda.2}@osu.edu Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/