Characterizing Result Errors in Internet Desktop Grids

1 Characterizing Result Errors in Internet Desktop Grids D. Kondo 1, F. Araujo 2, P. Malecot 1, P. Domingues 2, L. Silva 2, G. Fedak 1, F. Cappello 1 1 INRIA, France 2 University of Coimbra, Portugal

2 Desktop Grids Astronomy Math Biology LIP IBM AFM Total: ~50 applications using ~1.1 PetaFLOPS from ~1 million active resources

3-7 Background
In large-scale desktop grids involving volunteered, anonymous (and thereby potentially untrusted, insecure) resources, errors are inevitable.
Software/Hardware Stack: Potential Source of Error
Application: Modify application results
Middleware: Revise and recompile middleware
OS: Viruses
Hardware (Disk, CPU, Memory, Network): Disk crash, overclocking and overheating of CPU

8 Motivation A number of application-level mechanisms for tolerating errors exist [Sarmenta, Lo]. The effectiveness of these mechanisms depends on when errors occur in real systems. Yet the characterization of errors is poorly understood.

9 Goal Characterize error rates in a real system Frequency Stationarity Correlation Evaluate error tolerance mechanisms in light of this characterization

10 Outline Background Terminology Related Work Method Error Characterization Summary and Future Work

11-14 Background Terminology The server sends workunits to workers (workunit download). Each worker computes its workunit and uploads a result, which may be correct or erroneous, to the server (result upload).

15 Related Work Error Tolerance Mechanisms [Sarmenta01, Zhao01, Taufer05] Majority voting Spot-checking with blacklisting Credibility-based methods

16 Majority Voting [Sarmenta01] Send 2m-1 instances of the same workunit to multiple workers, and then compare the results. The majority vote is complete after receiving m identical results.
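The voting rule on this slide can be sketched in a few lines. This is a minimal illustration, not the BOINC validator: the function name `majority_vote` and the string-valued results are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def majority_vote(results: Iterable[Hashable], m: int) -> Optional[Hashable]:
    """Return the first result value that reaches m identical copies, else None."""
    counts = Counter()
    for value in results:
        counts[value] += 1
        if counts[value] >= m:
            return value  # vote complete: m identical results received
    return None  # not enough agreement yet; more instances are needed


# Three instances (2m - 1 with m = 2) of one workunit; one result differs.
print(majority_vote(["0x1f", "0x00", "0x1f"], m=2))  # prints 0x1f
```

Results are consumed in arrival order, so the vote can finish before all 2m-1 instances have returned.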

17 Majority Voting [Sarmenta01]
ε: fraction of results that will be erroneous
ϕ: probability that a worker (from the set of erroneous and non-erroneous hosts) returns an erroneous result
m: number of identical results before a vote is considered to be complete
$$\varepsilon_{majv}(\phi, m) = \sum_{j=m}^{2m-1} \binom{2m-1}{j} \phi^{j} (1-\phi)^{2m-1-j}$$
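The majority-voting error bound can be evaluated directly as a binomial tail sum. The sketch below assumes independent workers, as the model does; the function name `majority_vote_error` is chosen for illustration.

```python
from math import comb


def majority_vote_error(phi: float, m: int) -> float:
    """Probability that at least m of 2m-1 independent results are erroneous."""
    n = 2 * m - 1  # number of workunit instances sent out
    return sum(
        comb(n, j) * phi**j * (1 - phi) ** (n - j)
        for j in range(m, n + 1)
    )


# With m = 1 there is no redundancy, so the error rate equals phi.
print(majority_vote_error(0.01, 1))
# With m = 2 the error rate drops to roughly 3 * phi**2.
print(majority_vote_error(0.01, 2))
```

For small ϕ the leading term is C(2m-1, m) ϕ^m, which is why the error rate falls off exponentially as m grows.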

18 Issues Model assumes error rates are not correlated among hosts If error rate is high (>1%), much redundancy required to achieve low error bounds

19 Spot-Checking [Sarmenta01] Randomly distribute workunits with a known correct result to workers. Compare each worker's result to the known correct result. If there is a difference, blacklist that worker.

20 Spot-Checking [Sarmenta01]
ε: fraction of results that will be erroneous
q: frequency of spot-checking
n: number of workunits to be computed by each worker
f: fraction of hosts that commit at least one error
s: error rate per erroneous host
$$\varepsilon_{scbl}(q, n, f, s) = \frac{s f (1 - qs)^{n}}{(1 - f) + f (1 - qs)^{n}}$$
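The spot-checking bound is easy to explore numerically. The following is a small sketch of the formula only; the function name `spot_check_error` is an assumption, and the sample parameter values are picked for illustration rather than taken from the measurements.

```python
def spot_check_error(q: float, n: int, f: float, s: float) -> float:
    """Error rate under spot-checking with blacklisting.

    q: spot-check frequency, n: workunits per worker,
    f: fraction of erroneous hosts, s: error rate per erroneous host.
    """
    # Probability that an erroneous host survives n workunits without
    # being caught by a spot-check and blacklisted.
    survive = (1 - q * s) ** n
    return s * f * survive / ((1 - f) + f * survive)


# The error rate starts at s * f for n = 0 and shrinks as n grows.
for n in (0, 1000, 10000):
    print(n, spot_check_error(q=0.1, n=n, f=0.3, s=0.01))
```

The (1 - qs)^n term shows why low per-host error rates force n to be large: when qs is small, an erroneous host survives many workunits before being caught.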

21 Issues Assumes blacklisting is efficient and effective. Assumes consistency of error rates over time. If error rates are low, then the number (n) of workunits to be computed per worker must be very high.

22 Credibility-Based System [Sarmenta01] Define credibility of an entity as the conditional probability of its correctness given its history of past (spot-)checks Workers build (or lose) credibility as they pass or fail (spot-)checks Compute credibility of result based on worker credibility Issue: assumes the error rate per host is consistent over time

23 Methodology XtremLab: BOINC-based project for characterizing Internet desktop grids. The application continuously computes floating-point and integer operations. A validator conducts syntactic and semantic checks of results. Data was gathered from about 600 hosts between April and July 2006.

24 Observations and Assumptions Most errors manifest themselves as scrambled or truncated output Likely due to I/O errors Detected errors would have caused a result error in a real application E.g. I/O error corresponds to a corrupt write of checkpoint file

25-26 Error Rates in Entire Platform
[Figure: x-axis: fraction of workunits with errors; mean: 0.0022]
Errors are widespread: ~35% of hosts are erroneous

27 Implications
Working example: 10 batches, 100 workunits each. To keep ε_overall ≤ 0.01, need ε_result ≤ 1 × 10^-5.
To get ε_result ≤ 1 × 10^-5:
Majority vote: need majority vote (m) of 2
Spot-checking: number of workunits (n) per worker > 5300 (with q = 0.10, f = 0.35, s = 0.003)
Blacklisting all erroneous hosts is most likely not efficient
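The step from an overall bound of 0.01 to a per-result bound of 1 × 10^-5 in the working example (10 batches of 100 workunits) can be checked with a short calculation. The sketch assumes that result errors are independent and that "overall" means the probability that at least one of the 1000 results is wrong; both are assumptions made here for illustration.

```python
def overall_error(eps_result: float, total_results: int) -> float:
    """Probability that at least one of total_results independent results is wrong."""
    return 1 - (1 - eps_result) ** total_results


batches = 10
workunits_per_batch = 100
total = batches * workunits_per_batch  # 1000 results in all

# A per-result error rate of 1e-5 keeps the overall rate just under 0.01.
print(overall_error(1e-5, total))
```

For small error rates the overall rate is approximately total × ε_result, so 0.01 / 1000 gives the 1 × 10^-5 target.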

28-31 Cumulative Error Rates and Effect on Throughput
[Figure: cumulative fraction of errors and cumulative fraction of valid throughput versus fraction of sorted erroneous hosts]
Error rates are skewed: the top 10% of hosts produce 70% of errors.
Blacklisting all erroneous hosts is not efficient: it would reduce throughput by 40%.

32-33 Spot-Checking with Blacklisting Revisited
[Figure: error rate versus number of workunits per worker required (n)]
Spot-checking acts as a low-pass filter, reducing error rates to 2 × 10^-4

34-35 Majority Voting Revisited
[Figure: error rate versus number of identical results required (m)]
The error rate decreases exponentially, quickly falling below 1 × 10^-5

36 Implications
To get ε_result down to 2 × 10^-4: spot-checking is a possibility, with most benefit when n is in [0, 1000].
To get ε_result < 2 × 10^-4: use majority voting, as ε_result decreases exponentially with m.

37 Error Rate Stationarity A process is stationary if its statistical properties do not change over time. Determine how stationary the mean host error rate (s) is over time. Determine the change in mean error rates over 96-hour periods for each host.
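One way to carry out this per-host analysis is to bucket a host's results into 96-hour windows and compare the window error rates with their mean. The sketch below is an assumed procedure, not the authors' analysis code; the input layout (completion time in hours, error flag) and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

WINDOW_HOURS = 96.0


def window_error_rates(results: List[Tuple[float, bool]],
                       window_hours: float = WINDOW_HOURS) -> List[float]:
    """Error rate per window for one host, from (time in hours, is_erroneous) pairs."""
    buckets: Dict[int, Tuple[int, int]] = {}
    for t, bad in results:
        idx = int(t // window_hours)
        total, errors = buckets.get(idx, (0, 0))
        buckets[idx] = (total + 1, errors + int(bad))
    return [errors / total for _, (total, errors) in sorted(buckets.items())]


def fraction_within(rates: List[float], tolerance: float = 0.25) -> float:
    """Fraction of window rates lying within tolerance * mean of the mean."""
    mu = mean(rates)
    if mu == 0:
        return 1.0
    return sum(abs(r - mu) <= tolerance * mu for r in rates) / len(rates)


# Hypothetical host history covering three 96-hour windows.
history = [(0, False), (10, True), (100, False), (110, False), (200, True), (210, True)]
rates = window_error_rates(history)
print(rates, pstdev(rates) / mean(rates), fraction_within(rates))
```

The ratio σ/µ printed here is the coefficient of variation; a large value, or a small fraction of windows near the mean, indicates a non-stationary error rate.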

38 Statistics for Host Error Rates over 96-hour periods
[Table: statistics µ, σ, and σ/µ for each host group: all erroneous, top 10% erroneous, bottom 90% erroneous]
Only about 10% of the error rates were within 25% of the mean

39 Implications Spot-checking and credibility-based systems may have limited effectiveness Both depend on the consistency of error rates over time Host with low error rate could build high credibility, and then triple its error rates

40 Correlation of Error Rates
Determine the independence of errors on one host from those on another.
Independence: P(A and B) = P(A)*P(B)
Determine the empirical joint probability, P(A and B), that any two hosts have an error simultaneously.
Compute the theoretical probability, P(A)*P(B), of two hosts having an error simultaneously.
If error rates are not positively correlated: P(A)*P(B) - P(A and B) ≥ 0, that is, theoretical - empirical ≥ 0.
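The pairwise test can be sketched as follows. The example assumes each host's history has been discretized into aligned time slots, with True marking a slot in which the host returned an erroneous result; that discretization and the function name `independence_gap` are assumptions, since the slides do not say how "simultaneously" was defined.

```python
from typing import Sequence


def independence_gap(errors_a: Sequence[bool], errors_b: Sequence[bool]) -> float:
    """Return theoretical minus empirical joint error probability for two hosts."""
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("need two non-empty, equally long sequences")
    n = len(errors_a)
    p_a = sum(errors_a) / n  # P(A)
    p_b = sum(errors_b) / n  # P(B)
    p_ab = sum(a and b for a, b in zip(errors_a, errors_b)) / n  # P(A and B)
    return p_a * p_b - p_ab


# Hosts that never fail together: gap is positive (not positively correlated).
print(independence_gap([True, False, True, False], [False, True, False, True]))
# Hosts that always fail together: gap is negative (positively correlated).
print(independence_gap([True, False, False, False], [True, False, False, False]))
```

A gap of zero or more for a pair of hosts means their errors are not positively correlated, which is the condition majority voting relies on.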

41-42 Pairwise Host Error Rates
[Figure: cumulative fraction versus difference of theoretical and empirical pairwise error rates]
Most host errors are not positively correlated. Implication: majority voting is likely effective in real systems.

43 Summary of Characterization Results A significant fraction of hosts (about 35%) will commit at least one error over time. The mean error rate over all hosts (0.0022) is quite low. A large fraction of errors (0.70) results from a small fraction of hosts (0.10). Error rates vary greatly over time (by as much as 3.48 times). Error rates between two hosts often seem uncorrelated (most hosts do not have positively correlated errors).

44 Summary of Implications If one can afford redundancy or needs an error rate of less than 2 × 10^-4, then majority voting should be considered. If one can afford an error rate greater than 2 × 10^-4 and can make batches relatively long, spot-checking with blacklisting should be considered. Fluctuations in error rates over time may limit the effectiveness of spot-checking and credibility-based systems.

45 Future Work Use of a synthetic application: important to have application regularity (I/O, computation); not that different from real desktop grid applications (cannot be obtrusive); compute-intensive, small memory footprint, light periodic I/O for application-level checkpoints. Characterize and run real desktop grid applications: profile applications; execute representative workunits from each profile.

46 Thank you
