Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques Tianli Zhou & Chao Tian Texas A&M University
2 Contents Motivation Background and Review Evaluating Individual Techniques Find the Best Strategy under Optimized Bitmatrix Proposed Coding Procedure and Evaluation Conclusion
3 Contents Motivation Background and Review Evaluating Individual Techniques Find the Best Strategy under Optimized Bitmatrix Proposed Coding Procedure and Evaluation Conclusion
Data Reliability 4
4 Data Reliability Disaster happens
4 Data Reliability Disaster happens Disk failure Server malfunction and more...
5 Replication - Erasure Code Erasure Code Computational Overhead Replication Space Overhead
Replication - Erasure Code Minimum space overhead Much more computational load Erasure code relies on Galois Field (finite field) ops Erasure Code Computational Overhead Replication Space Overhead 5
Replication - Erasure Code Minimum space overhead Much more computational load Erasure code relies on Galois Field (finite field) ops How to reduce it Erasure Code Computational Overhead Replication Space Overhead 5
6 Acceleration Techniques Optimize Coding Matrix Schedule Computing Order Covert to XOR operations Apply SIMD Specially Designed Code Optimize for CPU Cache
7 Acceleration Techniques Fast GF Coding Covert to XOR operations Optimize Coding Matrix Schedule Computing Order Coding Theory Specially Designed Code Computation Resource Apply SIMD Optimize for CPU Cache
7 Acceleration Techniques Fast GF Coding Covert to XOR operations Optimize Coding Matrix Schedule Computing Order? Coding Theory Specially Designed Code Computation Resource Apply SIMD Optimize for CPU Cache
8 Question to Answer Most effective technique(s)? Utilize together? Components to be optimized? Problem is still important Reduce energy consumption Virtualized environment Mobile / embedded system Future I/O technology
9 Contents Motivation Background and Review Evaluating Individual Techniques Find the Best Strategy under Optimized Bitmatrix Proposed Coding Procedure and Evaluation Conclusion
10 Erasure Code storage nodes Store coded data to multiple nodes
11 Erasure Code storage nodes Recover even nodes failures happen
12 Erasure Code storage nodes Encode k=3 segments n=5 coded nodes
12 Erasure Code storage nodes Encode k=3 segments n=5 coded nodes
13 Erasure Code storage nodes k=3 segments MDS code: recover data from any k surviving nodes
14 Erasure Code storage nodes k=3 segments MDS code: recover data from any k surviving nodes
15 Erasure Code storage nodes k=3 segments k=3 systematic nodes m=2 parity nodes
15 Erasure Code storage nodes k=3 segments k=3 systematic nodes m=2 parity nodes
16 Encoding Matrix multiplication on finite field I P X =
16 Encoding Matrix multiplication on finite field I P X = All elements and ops in GF(2 w ) Galois Field (GF) Finite field
17 Decoding I P
17 Decoding I P
17 Decoding Submatrix corresponding to surviving data I P
Decoding Submatrix corresponding to surviving data Invert I P -1 17
Decoding Submatrix corresponding to surviving data Invert Multiplication I P -1 X = 17
18 Cauchy Reed-Solomon Codes Direct assign P to Cauchy matrix I P
Cauchy Reed-Solomon Codes 19
19 Cauchy Reed-Solomon Codes All elements and ops in GF(2 w )
20 Cauchy Reed-Solomon Codes Direct assign P to Cauchy matrix I P
21 Fast GF Ops Binary Representation Binary representation of GF elements P All coefficients and messages are in GF(2 w )
22 Fast GF Ops Binary Representation Message bit-vector representation P
22 Fast GF Ops Binary Representation Message bit-vector representation P
22 Fast GF Ops Binary Representation Message bit-vector representation P 1 by w bit-vector
23 Fast GF Ops Binary Representation Bitmatrix representation P
24 Fast GF Ops Binary Representation Bitmatrix representation P
24 Fast GF Ops Binary Representation Bitmatrix representation P w by w bitmatrix
25 Fast GF Ops Binary Representation Matrix multiplication on GF -> XORs P
25 Fast GF Ops Binary Representation Matrix multiplication on GF -> XORs P
25 Fast GF Ops Binary Representation Matrix multiplication on GF -> XORs P
25 Fast GF Ops Binary Representation Matrix multiplication on GF -> XORs P
26 Contents Motivation Background and Review Evaluating Individual Techniques Find the Best Strategy under Optimized Bitmatrix Proposed Coding Procedure and Evaluation Conclusion
Techniques Tiers 27
Bitmatrix Normalization (BN) Normalize by each element, find less bit 1s James S. Plank, Scott Simmerman, and Catherine D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications-version 1.2. Technical Report CS-08-627, University of Tennessee, 2008.
Bitmatrix Normalization (BN) Normalize by each element, find less bit 1s I P James S. Plank, Scott Simmerman, and Catherine D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications-version 1.2. Technical Report CS-08-627, University of Tennessee, 2008.
Bitmatrix Normalization (BN) Normalize by each element, find less bit 1s I P James S. Plank, Scott Simmerman, and Catherine D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications-version 1.2. Technical Report CS-08-627, University of Tennessee, 2008.
Bitmatrix Normalization (BN) Normalize by each element, find less bit 1s I P James S. Plank, Scott Simmerman, and Catherine D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications-version 1.2. Technical Report CS-08-627, University of Tennessee, 2008.
Bitmatrix Normalization (BN) Normalize by each element, find less bit 1s I P James S. Plank, Scott Simmerman, and Catherine D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications-version 1.2. Technical Report CS-08-627, University of Tennessee, 2008.
Bitmatrix Normalization (BN) Normalize by each element, find less bit 1s I P James S. Plank, Scott Simmerman, and Catherine D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications-version 1.2. Technical Report CS-08-627, University of Tennessee, 2008.
Bitmatrix Normalization (BN) Normalize by each element, find less bit 1s I P James S. Plank, Scott Simmerman, and Catherine D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications-version 1.2. Technical Report CS-08-627, University of Tennessee, 2008.
Bitmatrix Normalization (BN) Normalize by each element, find less bit 1s I P James S. Plank, Scott Simmerman, and Catherine D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications-version 1.2. Technical Report CS-08-627, University of Tennessee, 2008.
Bitmatrix Normalization (BN) Normalize by each element, find less bit 1s I P James S. Plank, Scott Simmerman, and Catherine D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications-version 1.2. Technical Report CS-08-627, University of Tennessee, 2008.
Smart Scheduling (SS) Utilize both message and computed parities James S. Plank. The RAID-6 liberation code. In Proceedings of the 6th Usenix Conference on File and Storage Technologies, pages 97 110, 2008. 29
Smart Scheduling (SS) Utilize both message and computed parities 3 XORs 3 XORs James S. Plank. The RAID-6 liberation code. In Proceedings of the 6th Usenix Conference on File and Storage Technologies, pages 97 110, 2008. 29
Smart Scheduling (SS) Utilize both message and computed parities 3 XORs 3 XORs James S. Plank. The RAID-6 liberation code. In Proceedings of the 6th Usenix Conference on File and Storage Technologies, pages 97 110, 2008. 29
Smart Scheduling (SS) Utilize both message and computed parities 3 XORs 3 XORs James S. Plank. The RAID-6 liberation code. In Proceedings of the 6th Usenix Conference on File and Storage Technologies, pages 97 110, 2008. 29
Smart Scheduling (SS) Utilize both message and computed parities 3 XORs 3 XORs James S. Plank. The RAID-6 liberation code. In Proceedings of the 6th Usenix Conference on File and Storage Technologies, pages 97 110, 2008. 29
Smart Scheduling (SS) Utilize both message and computed parities 3 XORs 3 XORs James S. Plank. The RAID-6 liberation code. In Proceedings of the 6th Usenix Conference on File and Storage Technologies, pages 97 110, 2008. 29
Smart Scheduling (SS) Utilize both message and computed parities 3 XORs 3 XORs 2 XORs James S. Plank. The RAID-6 liberation code. In Proceedings of the 6th Usenix Conference on File and Storage Technologies, pages 97 110, 2008. 29
Matching (UM, WM) Common XORs of pair message bits Unweighted/weighted greedy algorithm P Cheng Huang, Jin Li, and Minghua Chen. On optimizing XOR-based codes for fault-tolerant storage applications. In Proceedings of Information Theory Workshop, pages 218 223, 2007. 30
Matching (UM, WM) Common XORs of pair message bits Unweighted/weighted greedy algorithm P Cheng Huang, Jin Li, and Minghua Chen. On optimizing XOR-based codes for fault-tolerant storage applications. In Proceedings of Information Theory Workshop, pages 218 223, 2007. 31
Matching (UM, WM) Common XORs of pair message bits Unweighted/weighted greedy algorithm P Cheng Huang, Jin Li, and Minghua Chen. On optimizing XOR-based codes for fault-tolerant storage applications. In Proceedings of Information Theory Workshop, pages 218 223, 2007. 32
Matching (UM, WM) Common XORs of pair message bits Unweighted/weighted greedy algorithm P Cheng Huang, Jin Li, and Minghua Chen. On optimizing XOR-based codes for fault-tolerant storage applications. In Proceedings of Information Theory Workshop, pages 218 223, 2007. 33
Scheduling - Cache Optimization (S-CO) Reduce cache miss penalty Cache Memory Jianqiang Luo, Mochan Shrestha, Lihao Xu, and James S. Plank. Efficient encoding schedules for XOR-based erasure codes. IEEE Transactions on Computers, 63(9):2259 2272, 2014. 34
Scheduling - Cache Optimization (S-CO) Reduce cache miss penalty Cache Memory Jianqiang Luo, Mochan Shrestha, Lihao Xu, and James S. Plank. Efficient encoding schedules for XOR-based erasure codes. IEEE Transactions on Computers, 63(9):2259 2272, 2014. 34
Scheduling - Cache Optimization (S-CO) Reduce cache miss penalty Cache Memory Jianqiang Luo, Mochan Shrestha, Lihao Xu, and James S. Plank. Efficient encoding schedules for XOR-based erasure codes. IEEE Transactions on Computers, 63(9):2259 2272, 2014. 34
Scheduling - Cache Optimization (S-CO) Reduce cache miss penalty Cache Memory Jianqiang Luo, Mochan Shrestha, Lihao Xu, and James S. Plank. Efficient encoding schedules for XOR-based erasure codes. IEEE Transactions on Computers, 63(9):2259 2272, 2014. 34
Vectorization Using SIMD ISA perform multiple bits operation SSE: 128 bits AVX2: 256 bits AVX512: 512 bits 35
Vectorization Using SIMD ISA perform multiple bits operation SSE: 128 bits V-XOR 0 1 1 0 1... 0 1 1... 0 1 0... 0 AVX2: 256 bits AVX512: 512 bits 35
Vectorization Using SIMD ISA perform multiple bits operation SSE: 128 bits V-XOR 0 1 1 0 1... 0 1 1... 0 1 0... 0 AVX2: 256 bits AVX512: 512 bits V-GF 5 3 5 2... 1 3 6... 4 6 6 4... 5 2 2 4... 3 X 5 X 5 1... 2 1 1 4... 6 35
Vectorization Using SIMD ISA perform multiple bits operation SSE: 128 bits V-XOR 0 1 1 0 1... 0 1 1... 0 1 0... 0 AVX2: 256 bits AVX512: 512 bits V-GF 5 3 5 2... 1 3 6... 4 6 6 4... 5 2 2 4... 3 X 5 X 5 1... 2 1 Jerasure 2.0 1 4... 6 35
Vectorization Using SIMD ISA perform multiple bits operation SSE: 128 bits AVX2: 256 bits V-XOR 0 1 1 v.s. 0 1... 0 1 1... 0 1 0... 0 AVX512: 512 bits V-GF 5 3 5 2... 1 3 6... 4 6 6 4... 5 2 2 4... 3 X 5 X 5 1... 2 1 Jerasure 2.0 1 4... 6 35
Vectorization Using SIMD ISA perform multiple bits operation SSE: 128 bits AVX2: 256 bits V-XOR 0 1 1 v.s. 0 1... 0 1 1... 0 1 0... 0 AVX512: 512 bits V-GF 5 3 5 2... 1 3 6... 4 6 6 4... 5 2 2 4... 3 X 5 X 5 1... 2 1 Jerasure 2.0 1 4... 6 35
Scheduling - Cache Optimization (S-CO) Reduce cache miss penalty Cache Memory Jianqiang Luo, Mochan Shrestha, Lihao Xu, and James S. Plank. Efficient encoding schedules for XOR-based erasure codes. IEEE Transactions on Computers, 63(9):2259 2272, 2014. 36
Scheduling - Cache Optimization (S-CO) Reduce cache miss penalty u 0 p 0,p 1,p 2... Cache u 1 p 2,p 3,p 5... u 2 p 0,p 1,p 4... u 3 p 2,p 6,p 8... Memory Jianqiang Luo, Mochan Shrestha, Lihao Xu, and James S. Plank. Efficient encoding schedules for XOR-based erasure codes. IEEE Transactions on Computers, 63(9):2259 2272, 2014. 36
Question to Answer Most effective technique(s)? Utilize together? Components to be optimized? 37
Individual Techniques 38
38 Individual Techniques XOR-based Vectorization
38 Individual Techniques XOR-based Vectorization Bitmatrix Normalization Unweighted Matching Weighted Matching
38 Individual Techniques XOR-based Vectorization Bitmatrix Normalization Smart Scheduling Unweighted Matching Weighted Matching
38 Individual Techniques Cache Optimization XOR-based Vectorization Bitmatrix Normalization Smart Scheduling Unweighted Matching Weighted Matching
Optimization Tiers 39
Optimization Tiers 39
40 Optimization Tiers Generate Coding Schedule Run Coding Schedule
41 Optimization Tiers Generate Coding Schedule Run Coding Schedule
41 Optimization Tiers By-pass By-pass Generate Coding Schedule Run Coding Schedule
41 Optimization Tiers Unweighted Matching (UM) Weighted Matching (WM) By-pass By-pass Generate Coding Schedule Run Coding Schedule
Question to Answer Most effective technique(s)? Utilize together? Components to be optimized? 42
43 Combinations (i,j)-strategy By-pass Normalization (BN) By-pass Smart Scheduling (SS) Unweighted Matching (UM) Weighted Matching (WM)
43 Combinations (i,j)-strategy By-pass Normalization (BN) i 0,1 By-pass Smart Scheduling (SS) Unweighted Matching (UM) Weighted Matching (WM) j 0,3
43 Combinations (i,j)-strategy Example: strategy-(1,1) By-pass Normalization (BN) i 0,1 By-pass Smart Scheduling (SS) Unweighted Matching (UM) Weighted Matching (WM) j 0,3
43 Combinations (i,j)-strategy Example: strategy-(1,1) By-pass Normalization (BN) i 0,1 By-pass Smart Scheduling (SS) Unweighted Matching (UM) Weighted Matching (WM) j 0,3
44 Contents Motivation Background and Review Evaluating Individual Techniques Find the Best Strategy under Optimized Bitmatrix Proposed Coding Procedure and Evaluation Conclusion
Question to Answer Most effective technique(s)? Utilize together? Components to be optimized? 45
46 Choice of Cauchy Matrix Different X,Y array yields different Cauchy Matrix
46 Choice of Cauchy Matrix Different X,Y array yields different Cauchy Matrix
46 Choice of Cauchy Matrix Different X,Y array yields different Cauchy Matrix
46 Choice of Cauchy Matrix Different X,Y array yields different Cauchy Matrix
47 Choice of Cauchy Matrix Different Cauchy Matrix affect coding chain By-pass Normalization (BN) X,Y array By-pass Smart Scheduling (SS) Unweighted Matching (UM) Weighted Matching (WM) Coding schedule, (memcpy, XOR)
Choice of Cauchy Matrix Different Cauchy Matrix affect coding chain By-pass Normalization (BN) X,Y array By-pass Smart Scheduling (SS) Unweighted Matching (UM) Weighted Matching (WM) Coding schedule, (memcpy, XOR) Cost of given X,Y array 47
Bitmatrix Optimization Individual bitmatrix optimization for all (i,j)-strategies Simulated Annealing and Genetic Algorithm # of operations as cost 48
[no_opt by SA by GA ], # of ops in schedule 49
[no_opt by SA by GA ], # of ops in schedule 49
[no_opt by SA by GA ], # of ops in schedule 49
Strategies-(i,j) with Optimized Bitmatrices strategy-(1,3) is the best strategy 50
Optimized Bitmatrix Reduced Costs 51
Cost Function Improvement Memcpy is 1.5x faster than XOR Cost functions: Total # of XORs in schedule Total # of ops in schedule Weighted sum of ops in schedule Table 4: Encoding throughput (GB/s) using bitmatrices obtained by the genetic algorithm under different cost functions 52
53 Contents Motivation Background and Review Evaluating Individual Techniques Find the Best Strategy under Optimized Bitmatrix Proposed Coding Procedure and Evaluation Conclusion
54 Proposed Coding Procedure Pre-optimized Bitmatrix WM
54 Proposed Coding Procedure Pre-optimized Bitmatrix Coding Schedule WM
Testing Setup Ryzen 1700X @ 3.4Ghz (8C/16T) 16GB DDR4 Ubuntu 18.04 64-bit, GCC 7.3.0 Jerasure 1.2A/ Jerasure 2.0: XOR-based Cauchy RS code (BN, SS applied) GF-based RS code Raid-6 Other codes: EVENODD RDP STAR 55
56 Encoding v.s. Efficient RS/CRS code Table 5: Encoding throughput (GB/s) for methods that allow general (n, k) parameters and w = 8 Outperform vectorized XOR-based CRS code for m>2
57 Encoding v.s. Efficient RS/CRS code Table 5: Encoding throughput (GB/s) for methods that allow general (n, k) parameters and w = 8 Outperform vectorized XOR-based CRS code for m>2 Outperform vectorized GF-based RS code with big margin (~ 2x faster)
58 Encoding v.s. Three Parities Codes Similar or better performance than specially designed codes
59 Encoding v.s. Two Parities Codes Similar or better performance than specially designed codes
Overall Encoding Improvement 60
Decoding 61
61 Decoding Direct read in most cases
61 Decoding Direct read in most cases Degraded read (decoding) only if systematic nodes fail
61 Decoding Direct read in most cases Degraded read (decoding) only if systematic nodes fail Single disk failure rate : 1% In (10,6) erasure code storage system 1 systematic node failure: ~ 0.055 2 systematic node failure: ~ 0.0014 3 systematic node failure: ~ 1.8x10-5 4 systematic node failure: ~ 1.4x10-7
61 Decoding Direct read in most cases Degraded read (decoding) only if systematic nodes fail Single disk failure rate : 1% In (10,6) erasure code storage system 1 systematic node failure: ~ 0.055 2 systematic node failure: ~ 0.0014 3 systematic node failure: ~ 1.8x10-5 4 systematic node failure: ~ 1.4x10-7 Decoding performance should be viewed as secondary importance
Decoding Throughput 62
63 Contents Motivation Background and Review Individual Techniques and Combinations Bitmatrix Optimization Proposed Coding Procedure and Evaluation Conclusion
Conclusion Comprehensive study of acceleration techniques Combine existing techniques and jointly optimize the bitmatrix Proposed approach outperforms most existing approaches in encoding throughput. 64
Conclusion : Key Finding Vectorization at XOR-level is much better choice than vectorization of finite field operations Higher throughput (~ 2x faster) Easy migration to newer SIMD ISA SSE: 128 bit AVX2: 256 bit AVX-512: 512 bit 65
Thank you! Questions? Tianli Zhou: zhoutl1106@tamu.edu Chao Tian: chao.tian@tamu.edu Source code will be released soon 66