Elzar Triple Modular Redundancy using Intel AVX

Size: px

Start display at page:

Download "Elzar Triple Modular Redundancy using Intel AVX"

Laurel Day
6 years ago
Views:

1 Elzar Triple Modular Redundancy using Intel AVX Dmitrii Kuvaiskii Oleksii Oleksenko Pramod Bhatotia Christof Fetzer Pascal Felber

2 Hardware Errors in the Wild Online services run in huge data centers 1

3 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception 1

4 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions 1

5 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions There were a handful of messages that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect 1

Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions There were a handful

6 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions There were a handful of messages that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect Corruption can occur transiently in CPU or RAM. Guarding against such corruptions is an important goal in Mesa s overall design 1

7 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches 2

8 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance 2

9 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance Tolerates arbitrary faults Pessimistic fault model High resource overheads 2

10 Protecting Against Data Corruptions Principled approaches Byzantine Fault Tolerance Ad-hoc approaches Checksums / Assertions Tolerates arbitrary faults Pessimistic fault model High resource overheads 2

11 Protecting Against Data Corruptions Principled approaches Byzantine Fault Tolerance Ad-hoc approaches Checksums / Assertions Tolerates arbitrary faults Low performance overheads Pessimistic fault model Only anticipated faults High resource overheads Manual and error-prone 2

12 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance Checksums / Assertions Triple Modular Redundancy 2

13 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance better performance Checksums / Assertions Triple Modular Redundancy Practical fault model 2

14 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance better performance Triple Modular Redundancy Practical fault model Disciplined protection Checksums / Assertions better fault coverage 2

15 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) Executable CPU 3

16 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) Executable CPU Native z = add x, y 3

17 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) Executable CPU Native TMR z z = add x, y z2 = add x2, y2 z3 = add x3, y3 = add x, y 3

18 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) with SIMD Executable CPU Native TMR TMR with SIMD z z = add x, y z2 = add x2, y2 z3 = add x3, y3 zwide = add xwide, ywide = add x, y 3

19 Triple Modular Redundancy Triple Modular Redundancy (TMR) with SIMD Source code Executable CPU Native TMR TMR with SIMD z z = add x, y z2 = add x2, y2 z3 = add x3, y3 zwide = add xwide, ywide = add x, y Idea SIMD uses less instructions less perf overhead? 3

20 Research Question Can we use SIMD for efficient fault tolerance? 4

21 Research Question Can we use SIMD for efficient fault tolerance? Implementation using Intel AVX 4

22 Elzar Source code Elzar Executable CPU 5

23 Elzar Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy 5

24 Elzar Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Outcome Mixed results (AVX not general-purpose) 5

25 Elzar Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Outcome Mixed results (AVX not general-purpose) Investigation Two bottlenecks in current AVX 5

26 Motivation Intel AVX Design Evaluation Discussion

27 Intel AVX Background 6

28 Intel AVX Background 256 bit New registers YMM0 YMM1 YMM2 YMM15 64-bit 64-bit 64-bit 64-bit 6

29 Intel AVX Background 256 bit New registers YMM0 YMM1 YMM2 YMM15 New instructions + 64-bit 64-bit 64-bit 64-bit x1 x2 x3 x4 YMM0 y1 y2 y3 y4 YMM1 x1+y1 x2+y2 x3+y3 x4+y4 YMM2 6

30 AVX as free resource? AVX is not actively used? 7

31 AVX as free resource? AVX is not actively used? unused used 7

32 AVX as free resource? AVX is not actively used? unused used 7

33 AVX as free resource? AVX is not actively used? unused used 7

34 AVX as free resource? AVX is not actively used? Reuse for fault tolerance! unused used 7

35 Motivation Intel AVX Design Evaluation Discussion

36 Fault Model Protect against transient faults in CPU corruptions in CPU registers CPU miscomputations in CPU execution units Memory is protected by other means DRAM protected by ECC Memory CPU caches protected by ECC and parity 8

37 Elzar by Example Native x z = load a = add x, 1 9

38 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) 9

39 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) 9

40 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) Common instructions introduce lower overhead 9

41 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) Bottleneck 1: Memory accesses require wrappers 9

42 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) Bottleneck 2: Checks are expensive 9

43 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 10

44 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) No such thing as avx-load! 1 10

45 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a 10

46 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract a 10

47 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract RAM a load x 10

48 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract RAM load a x broadcast x x x x 10

49 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract RAM load a x broadcast x x x x Memory accesses require wrappers 10

50 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 11

51 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z1 z2 z3 z4 11

52 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z1 z2 z3 z4 z2 z1 z4 z3 z2 z1 z1 z2 z4 z3 z3 z4 shuffle xor 11

53 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z1 z2 z3 z4 z2 z1 z4 z3 z2 z1 z1 z2 z4 z3 z3 z4 shuffle xor ptest FLAGS <all zeros, proceed> 11

54 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z shuffle xor ptest FLAGS 11

55 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z z z z z' z z z z z z' z' z shuffle xor ptest FLAGS 11

56 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z z z z z' z z z z z z' z' z shuffle xor ptest FLAGS jne recover 11

57 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z z z z z' z z z z z z' z' z shuffle xor ptest FLAGS jne recover Checks introduce additional 55% overhead 11

58 Motivation Intel AVX Design Evaluation Discussion

59 Performance overheads 12

60 Performance overheads lower better 12

61 Performance overheads lower better 12

62 Performance overheads lower better Average performance overhead is 4 12

63 Performance overheads lower better Amortized by very poor memory access pattern 12

64 Performance overheads lower better (1) Native benefits from AVX vectorization (2) Most of time spent in mem-intensive bzero() 12

65 Comparison with SWIFT-R (state-of-the-art TMR approach) 13

66 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better 13

67 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better 13

68 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better Elzar performs 46% worse than SWIFT-R on average 13

69 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better Dominated by memory accesses, Elzar inserts many wrappers 13

70 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better Elzar produces 3x less instructions than SWIFT-R 13

71 Motivation Intel AVX Design Evaluation Discussion

72 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks 14

73 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks Solution Offload to FPGA accelerator Intel's Xeon-FPGA pairing 14

74 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks Solution Offload to FPGA accelerator Intel's Xeon-FPGA pairing FPGA CPU addr addr addr addr addr majority voting load Memory val replication val val val val 14

75 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks Solution Offload to FPGA accelerator Intel's Xeon-FPGA pairing FPGA CPU addr addr addr addr addr majority voting load Memory val replication val val val val Potentially 71% better than SWIFT-R 14

76 Conclusion 15

77 Conclusion Source code Elzar Executable AVX CPU 15

78 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy 15

79 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Hypothesis Less instructions less perf overhead 15

80 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Hypothesis Less instructions less perf overhead Outcome 46% worse than SWIFT-R 15

81 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Hypothesis Less instructions less perf overhead Outcome 46% worse than SWIFT-R Discussion With FPGA, ~71% better than SWIFT-R 15

82 Thank you! GitHub repo:

83 backup slides

84 Intel Haswell CPU microarchitecture Instruction Scheduler port 0 64-bit ALU port 1 64-bit ALU 256-bit MUL 256-bit FMA port 5 64-bit ALU 256-bit ALU 256-bit Shuffle ports 2-4,7 256-bit FADD Memory access 256-bit ALU port 6 64-bit ALU

85 Intel Haswell CPU microarchitecture Instruction Scheduler port 0 64-bit ALU port 1 64-bit ALU 256-bit MUL 256-bit FMA port 5 64-bit ALU 256-bit ALU 256-bit Shuffle ports 2-4,7 256-bit FADD Memory access 256-bit ALU port 6 64-bit ALU Intel AVX units

86 Elzar: Check on Branch Native cmp x, y jne truebranch

87 Elzar: Check on Branch Native cmp x, y jne truebranch x1 x2 x3 x4 y1 y2 y3 y4 cmpeq x1=y1 x2=y2 x3=y3 x4=y4 ptest jne FLAGS ja je true branch false branch recover

88 Elzar: Check on Branch Native cmp x, y jne truebranch x1 x2 x3 x4 y1 y2 y3 y4 cmpeq x1=y1 x2=y2 x3=y3 x4=y4 ptest jne FLAGS ja je true branch false branch recover Impact Checks on branches introduce only 4% overhead

89 Main Bottlenecks Problem AVX instruction set lacks certain instructions Loads/stores require extracting AVX-replicated address Elzar creates expensive wrappers around loads/stores AVX-512 introduces promising gather/scatter instructions AVX has only one control-flow instruction ptest Elzar has to insert ptest for each control-flow decision AVX could add cmp instructions directly affecting control flow AVX misses integer division, integer truncation, etc Elzar produces very ineffective code in these cases

90 Estimation of Proposed Changes lower better

91 Estimation of Proposed Changes lower better

92 Estimation of Proposed Changes lower better

93 Estimation of Proposed Changes lower better Estimated average overhead would be 48% (71% improvement over SWIFT-R)

HAFT Hardware-Assisted Fault Tolerance

HAFT Hardware-Assisted Fault Tolerance Dmitrii Kuvaiskii Rasha Faqeh Pramod Bhatotia Christof Fetzer Technische Universität Dresden Pascal Felber Université de Neuchâtel Hardware Errors in the Wild Online