Evaluating the Error Resilience of Parallel Programs

1 Evaluating the Error Resilience of Parallel Programs. Bo Fang, Karthik Pattabiraman, Matei Ripeanu (The University of British Columbia); Sudhanva Gurumurthi (AMD Research)

2 Reliability trends: The soft error rate per chip will continue to increase [Constantinescu, Micro 03; Shivakumar, DSN 02]. The ASC Q supercomputer at LANL encountered 27.7 CPU failures per week [Michalak, IEEE Transactions on Device and Materials Reliability 05].

3 Goal: Investigate the error-resilience characteristics of applications in order to build application-specific, software-based fault-tolerance mechanisms, and find correlations between error-resilience characteristics and application properties.

4 Previous Study on GPUs: Understand the error resilience of GPU applications by building a methodology and tool called GPU-Qin [Fang, ISPASS 14]. [Chart: SDC rate (0%-50%) per benchmark for AES, HashGPU-md5, HashGPU-sha1, MRI-Q, MAT, Transpose, SAD-k0, SAD-k1, SAD-k2, Stencil, SCAN-block, MONTE, MergeSort-k0, BFS, LBM.]

5 Operations and Resilience of GPU Applications (Operation: Benchmarks, Measured SDC):
Comparison-based: MergeSort, 6%
Bit-wise Operation: HashGPU, AES, 25%-37%
Average-out Effect: Stencil, MONTE, 1%-5%
Graph Processing: BFS, 10%
Linear Algebra: Transpose, MAT, MRI-Q, SCAN-block, LBM, SAD, 15%-38%

6 This paper: OpenMP programs. Thread model: the master thread starts the program, forks slave threads for the parallel region, and continues alone until the end. Program structure: input processing, pre-algorithm, parallel region, post-algorithm, output processing; for example: read_graphics(image_orig); image = image_setup(image_orig); scale_down(image); #pragma omp parallel for shared(var_list1) ...; scale_up(image); write_graphics(image).
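
To make the thread model and program structure concrete, here is a minimal OpenMP sketch in C of the five segments, following the slide's example function names; the image type and helper bodies are hypothetical stand-ins, not code from an actual benchmark.

    #include <omp.h>
    #include <stdlib.h>

    typedef struct { int n; double *px; } image_t;

    /* Hypothetical stand-ins for the slide's example helpers (not Rodinia code). */
    static image_t *read_graphics(const char *path) {
        (void)path;
        image_t *im = malloc(sizeof *im);
        im->n = 1024;
        im->px = calloc((size_t)im->n, sizeof *im->px);
        return im;
    }
    static image_t *image_setup(image_t *orig) { return orig; }
    static void scale_down(image_t *im) { (void)im; }
    static void scale_up(image_t *im) { (void)im; }
    static void write_graphics(const image_t *im) { (void)im; }

    int main(void) {
        /* Input processing and pre-algorithm: master thread only. */
        image_t *image_orig = read_graphics("input.pgm");
        image_t *image = image_setup(image_orig);
        scale_down(image);

        /* Parallel region: the master thread forks slave threads here. */
        #pragma omp parallel for shared(image)
        for (int i = 0; i < image->n; i++)
            image->px[i] = image->px[i] * 0.5 + 1.0;

        /* Post-algorithm and output processing: master thread again. */
        scale_up(image);
        write_graphics(image);

        free(image->px);
        free(image);
        return 0;
    }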

7 Evaluating the error resilience. Naive approach: randomly choose a thread and an instruction to inject into. This does not work: it biases the fault-tolerance (FT) design. Challenges: C1, exposing the thread model; C2, exposing the program structure.

8 Methodology: Controlled fault injection, i.e., fault injection into master and slave threads separately (C1), integrated with the thread ID. Identify the boundaries of program segments (C2): identify the range of instructions for each segment using a source-code-to-instruction-level mapping at the LLVM IR level (assembly-like, yet captures the program structure).
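
As an illustration of the two selection criteria, the sketch below gates injection on the OpenMP thread ID (C1) and on an instruction-index range for the chosen segment (C2); the names and the convention for choosing the target thread are assumptions for illustration, since the actual tool works at the LLVM IR level rather than in C source.

    #include <omp.h>

    /* Hypothetical globals configured before a run. */
    static int  fi_target_thread = 0;   /* 0 = master; >0 = a chosen slave thread */
    static long fi_segment_begin = 0;   /* first instruction index of the segment */
    static long fi_segment_end   = 0;   /* one past the last instruction index    */

    /* Injection is allowed only for the chosen thread (C1) and only while the
     * dynamic instruction index lies inside the chosen segment's range (C2). */
    static int fi_selected(long dynamic_instr_index) {
        int thread_ok  = (omp_get_thread_num() == fi_target_thread);
        int segment_ok = (dynamic_instr_index >= fi_segment_begin &&
                          dynamic_instr_index <  fi_segment_end);
        return thread_ok && segment_ok;
    }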

9 LLFI and our extension. Compile time: instrument the IR code of the program with function calls via the instruction/register selector, extended with support for structure and thread selection (C1); the user specifies the code region and thread to inject into (C1 & C2), and a custom fault injector produces a fault-injection executable and a profiling executable. Runtime: the profiling executable collects statistics on a per-thread basis (C1); the fault-injection executable decides at each instruction whether to inject or move on to the next instruction.
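
The flowchart amounts to a two-phase scheme: a profiling run counts dynamic instructions, and the injection run corrupts the result of one randomly chosen instruction. The sketch below is a rough reconstruction under that reading; the hook name and globals are hypothetical, and LLFI inserts such hooks at the LLVM IR level, not in C.

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical two-phase hook, called after each instrumented IR instruction. */
    static long fi_dynamic_count = 0;   /* profiling run: instructions observed         */
    static long fi_inject_at     = -1;  /* injection run: index chosen from the profile */
    static long fi_executed      = 0;   /* injection run: instructions executed so far  */

    uint64_t fi_hook(uint64_t result) {
        if (fi_inject_at < 0) {               /* profiling executable: just count  */
            fi_dynamic_count++;
            return result;
        }
        if (fi_executed++ != fi_inject_at)    /* "Inject?" -> No: next instruction */
            return result;
        /* "Inject?" -> Yes: corrupt this one result, then let the run finish.     */
        return result ^ (1ULL << (rand() % 64));
    }

    /* Between the two runs one would pick, e.g.:
     *     fi_inject_at = rand() % fi_dynamic_count;                               */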

10 Fault Model: Transient faults, modeled as single-bit flips (no cost to extend to multiple-bit flips). Faults are injected into the arithmetic and logic unit (ALU), the floating-point unit (FPU), and the load-store unit (LSU, memory-address computation only). We assume memory and caches are ECC-protected, and we do not consider the control logic of the processor (e.g., the instruction-scheduling mechanism).
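
A single-bit flip in a computed result can be modeled as below; this is a generic illustration of the stated fault model (shown here for an FPU result), not code from the tool.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Flip nbits randomly chosen bits in the bit pattern of a floating-point
     * result; nbits = 1 is the single-bit model, larger nbits is the "free"
     * multiple-bit extension (repeated positions may cancel; fine for a sketch). */
    static double fi_flip_fp_result(double x, int nbits) {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);        /* reinterpret the FPU result bits */
        for (int i = 0; i < nbits; i++)
            bits ^= (1ULL << (rand() % 64));
        memcpy(&x, &bits, sizeof bits);
        return x;
    }
    /* Memory and caches are assumed ECC-protected, so stored data is never
     * corrupted directly; only computed results (ALU/FPU) and memory-address
     * computations (LSU) are eligible injection targets. */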

11 Characterization study: Intel Xeon CPU with 24 hardware cores. Outcomes of an experiment: Benign (correct output), Crash (exception raised by the system or application), Silent Data Corruption (SDC, incorrect output), Hang (takes considerably longer than normal to finish). 10,000 fault-injection runs for each benchmark (to achieve a < 1% error bar).
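
For context, the < 1% error bar is consistent with a standard binomial confidence-interval calculation; assuming a 95% confidence level (the slide does not state it), the worst-case margin of error for 10,000 runs per benchmark is:

    \[
      E \;=\; z \sqrt{\frac{p\,(1-p)}{n}}
        \;\le\; 1.96 \sqrt{\frac{0.5 \times 0.5}{10\,000}}
        \;=\; 1.96 \times 0.005
        \;\approx\; 0.98\% .
    \]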

12 Details about benchmarks: 8 applications from the Rodinia benchmark suite: Breadth-first search (bfs), LU decomposition (lud), K-nearest neighbor (nn), Hotspot (hotspot), Needleman-Wunsch (nw), Pathfinder (pathfinder), SRAD (srad), and Kmeans (kmeans).

13 Difference in SDC rates. Compare the SDC rates for (1) injecting into the master thread only and (2) injecting into a random slave thread. [Chart: SDC rate (0%-60%) of the master thread vs. slave threads per benchmark.] This shows the importance of differentiating between the master and slave threads, both for correct SDC characterization and for selective fault-tolerance mechanisms. Master threads have higher SDC rates than slave threads (except for lud); the biggest difference is 16% in pathfinder, and the average is 7.8%.

14 Differentiating between program segments. Faults occur in each of the program segments: input processing, pre-algorithm, parallel region, post-algorithm, output processing. [Chart: SDC rate (0%-60%) per segment for bfs, lud, hotspot, nw, pathfinder, srad, and kmeans.]

15 Mean and standard deviation of SDC rates for each segment:
Input processing: mean 28.6%, standard deviation 11%
Pre-algorithm: mean 20.3%, standard deviation 23%
Parallel segment: mean 16.1%, standard deviation 14%
Post-algorithm: mean 27%, standard deviation 13%
Output processing: mean 42.4%, standard deviation 5%
Output processing has the highest mean SDC rate, and also the lowest standard deviation.

16 Grouping by operations.
OpenMP applications (this work), Operation: Benchmarks, Measured SDC:
Comparison-based: nn, nw, less than 1%
Grid computation: hotspot, srad, 23%
Average-out Effect: kmeans, 4.2%
Graph Processing: bfs, pathfinder, 9%-10%
Linear Algebra: lud, 44%
GPU applications (previous study), Operation: Measured SDC:
Comparison-based: 6%
Bit-wise Operation: 25%-37%
Average-out Effect: 1%-5%
Graph Processing: 10%
Linear Algebra: 15%-38%
Comparison-based and average-out operations show the lowest SDC rates in both cases; linear algebra and grid computation, which are regular computation patterns, give the highest SDC rates.

17 Future work: Understand the variance and investigate more applications. Compare CPUs and GPUs using the same applications and the same level of abstraction. (The slide repeats the operation/SDC comparison tables from the previous slide.)

18 Conclusion: Error-resilience characterization of OpenMP programs needs to take into account the thread model and the program structure. We find preliminary support for our hypothesis that error-resilience properties do correlate with the algorithmic characteristics of parallel applications. Project website:

19 Backup slide: SDC rates converge as the number of fault-injection runs increases. [Chart: SDC rate (0%-50%) vs. number of runs (200-2800) for AES, MAT, MergeSort-k0, HASHGPU-md5, HASHGPU-sha1, LBM, Transpose, BFS, SCAN, MRI-Q, Stencil, SAD-k1, SAD-k2, SAD-k3, MONTE.]
