A Large-Scale Study of Soft-Errors on GPUs in the Field

1 A Large-Scale Study of Soft-Errors on GPUs in the Field
Bin Nie*, Devesh Tiwari+, Saurabh Gupta+, Evgenia Smirni*, and James H. Rogers+
* College of William and Mary
+ Oak Ridge National Laboratory

2 Wide GPU Deployment in HPC
State-of-the-art:
- Oak Ridge National Lab: 18,688 NVIDIA K20X GPUs
- University of Illinois: 4,224 NVIDIA K20X GPUs
Next generation:
- Summit (2018), Oak Ridge National Lab: NVIDIA Volta GPUs
- Sierra (2017), Lawrence Livermore National Lab: NVIDIA Volta GPUs

3 Reliability is Important
Long-running scientific applications, e.g., climate modeling, astrophysics (application labels: S3D, MAESTRO)
Severe resilience challenge at EXASCALE!*
Our focus: GPU soft-errors
- Single Bit Error (SBE)
- Double Bit Error (DBE)
- Dynamic Page Retirement (DPR) Error
GPU is protected by Error-Correcting Code (ECC)
* Top Ten Exascale Research Challenges, DOE ASCAC Subcommittee Report, Feb

4 Trade-off: Performance vs Reliability
Peak memory bandwidth: ECC off 250 GB/s; ECC on ~212.5 GB/s
GPU memory size: ECC off 6 GB; ECC on ~5.25 GB
18,688 NVIDIA K20X GPUs: 14,016 GB in total for ECC!
Is it worthwhile to pay the ECC penalty?
First step: a better understanding of GPU soft-errors
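The ECC penalty on this slide can be checked with a few lines of arithmetic. The sketch below uses only the per-GPU figures quoted on the slide; the function name `ecc_overhead` and the dictionary keys are illustrative choices, not something from the talk.

```python
# Per-GPU figures quoted on the slide for the NVIDIA K20X.
BW_ECC_OFF_GBPS = 250.0
BW_ECC_ON_GBPS = 212.5
MEM_ECC_OFF_GB = 6.0
MEM_ECC_ON_GB = 5.25
NUM_GPUS = 18_688


def ecc_overhead(num_gpus: int = NUM_GPUS) -> dict:
    """Return the relative and fleet-wide cost of enabling ECC."""
    mem_lost_per_gpu = MEM_ECC_OFF_GB - MEM_ECC_ON_GB  # 0.75 GB per GPU
    return {
        # Fraction of peak bandwidth given up when ECC is on.
        "bandwidth_loss_fraction": 1 - BW_ECC_ON_GBPS / BW_ECC_OFF_GBPS,
        # Fraction of device memory given up when ECC is on.
        "memory_loss_fraction": mem_lost_per_gpu / MEM_ECC_OFF_GB,
        # Memory set aside for ECC across all GPUs, in GB.
        "fleet_memory_for_ecc_gb": mem_lost_per_gpu * num_gpus,
    }


if __name__ == "__main__":
    for name, value in ecc_overhead().items():
        print(f"{name}: {value:.3f}")
```

With the slide's numbers this works out to roughly 15% of peak bandwidth and 12.5% of memory per GPU, and the fleet-wide memory figure matches the 14,016 GB quoted above.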

5 Goals and Challenges
Goals:
- Study characteristics of SBEs and DPRs
- What factors lead to GPU soft-errors: GPU utilization? Applications/users? Temperature?
Challenges:
- Post-hoc analysis
- A large-scale system is dynamic in nature
- Large amount of data with many dimensions

6 Data Collection
18,688 GPUs in Titan
System hierarchy: node (GPU + CPU) → blade (four nodes) → cage (eight blades) → cabinet (three cages) → Titan supercomputer (200 cabinets, 8 × 25)
Sampling period: Feb to mid-June 2015 (> 60 million node hours)
For each node, there are:
- Application utilization (core-hour, memory)
- Temperature (every minute)
- Soft-errors
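For spatial or per-cage analysis it helps to map a node to its position in the hierarchy listed on this slide. The sketch below is a minimal illustration: the flat slot numbering (node varies fastest, then blade, cage, cabinet) is an assumption made for this example and is not Titan's actual node-naming scheme. The helper `min_days_for_node_hours` is likewise hypothetical and only checks the ">60 million node hours" figure against the GPU count.

```python
from typing import Tuple

# Hierarchy sizes taken from the slide.
NODES_PER_BLADE = 4
BLADES_PER_CAGE = 8
CAGES_PER_CABINET = 3
CABINETS = 200
GPUS = 18_688


def node_slots() -> int:
    """Total node slots implied by the hierarchy."""
    return CABINETS * CAGES_PER_CABINET * BLADES_PER_CAGE * NODES_PER_BLADE


def locate(slot: int) -> Tuple[int, int, int, int]:
    """Map a flat slot index to (cabinet, cage, blade, node).

    The flat numbering is an assumption for illustration only.
    """
    if not 0 <= slot < node_slots():
        raise ValueError(f"slot {slot} is outside 0..{node_slots() - 1}")
    rest, node = divmod(slot, NODES_PER_BLADE)
    rest, blade = divmod(rest, BLADES_PER_CAGE)
    cabinet, cage = divmod(rest, CAGES_PER_CABINET)
    return cabinet, cage, blade, node


def min_days_for_node_hours(node_hours: float, nodes: int = GPUS) -> float:
    """Days of continuous operation needed to accumulate node_hours."""
    return node_hours / (nodes * 24)
```

The hierarchy yields 19,200 node slots, slightly more than the 18,688 GPUs reported; the sketch does not model which slots hold GPUs. Reaching 60 million node hours with 18,688 nodes takes just under 134 days, which is consistent with the "more than 130 days" of monitoring mentioned in the conclusions.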

7 Open Questions for SBEs
Factors examined for SBEs: temporal locality, GPU utilization, applications, users, spatial locality*
* Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation, Tiwari et al., HPCA

8 SBE Temporal Locality
Conjecture: SBEs are evenly distributed over days
Top 2 days → 97% of SBEs
Exclude top 2 SBE days: [plot not preserved]
Turning on ECC may not pay off equally on all days
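The "top 2 days" statistic is a simple concentration measure over per-day SBE counts. The sketch below shows one way to compute it, assuming the error log has been reduced to one day label per SBE event; the function name `top_k_day_share` and the input layout are assumptions for illustration, not the authors' tooling.

```python
from collections import Counter
from typing import Hashable, Iterable


def top_k_day_share(sbe_days: Iterable[Hashable], k: int = 2) -> float:
    """Fraction of all SBEs that fall on the k days with the most SBEs.

    sbe_days holds one day label (e.g. a date string) per SBE event.
    """
    counts = Counter(sbe_days)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(k))
    return top / total


if __name__ == "__main__":
    # Synthetic labels, chosen only to exercise the function.
    events = ["day-a"] * 60 + ["day-b"] * 37 + ["day-c"] * 2 + ["day-d"]
    print(top_k_day_share(events, k=2))
```

A value near 1.0 for small k contradicts the even-distribution conjecture; an even spread over N days would give a share close to k/N.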

9 Effect of GPU Utilization on SBEs
Conjecture: high utilization → more SBEs
[Figure labels: node with fewest SBEs, node with most SBEs]
High variance in utilization does not lead to more SBEs.
Fault injection tools may not necessarily increase the probability of SBE occurrence by increasing GPU utilization.

10 Effect of Applications on SBEs
Conjecture: certain applications see more SBEs
[Figure labels: app with fewest SBEs, app with most SBEs]
Similarly, certain users see more SBEs.
Application-centric GPU error resilience techniques are likely to result in higher benefits.

11 SBEs in Summary
Factors examined for SBEs: temporal locality, GPU utilization, applications, users, spatial locality*
* Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation, Tiwari et al., HPCA

12 Open Questions for DPRs
Factors examined for DPRs: applications, users, relation to SBEs, temperature

13 Relationship Between DPRs and SBEs
Conjecture: more SBEs before or after a DPR occurrence
Compare the SBE count before and after DPR occurrence: DPR nodes avg = 160; non-DPR nodes avg = 0
This temporary burstiness of SBEs stops within 24 hours.
More SBEs do not mean a higher likelihood of DPRs.
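The before/after comparison on this slide amounts to counting SBE timestamps in a window on either side of a DPR event. The sketch below is one possible formulation: the function name `sbes_around_dpr` and the 24-hour default window are assumptions for illustration (the slide only says the burstiness stops within 24 hours, not which window the authors compared).

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def sbes_around_dpr(
    sbe_times: Iterable[datetime],
    dpr_time: datetime,
    window: timedelta = timedelta(hours=24),  # assumed window length
) -> Tuple[int, int]:
    """Count SBEs in the window before and the window after one DPR event.

    An SBE stamped exactly at dpr_time is counted in neither window.
    """
    times = list(sbe_times)  # allow generators; we iterate twice
    before = sum(1 for t in times if dpr_time - window <= t < dpr_time)
    after = sum(1 for t in times if dpr_time < t <= dpr_time + window)
    return before, after


if __name__ == "__main__":
    # Synthetic timestamps, only to show the call shape.
    dpr = datetime(2015, 3, 1, 12, 0)
    sbes = [dpr - timedelta(hours=1), dpr + timedelta(hours=2)]
    print(sbes_around_dpr(sbes, dpr))
```

Running the same count for nodes without a DPR, using a reference timestamp in place of `dpr_time`, gives the baseline for the DPR versus non-DPR comparison.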

14 Effect of Temperature on DPRs
Temperature is recorded every minute on every node
Three time windows before DPR occurrence: 60 min, 15 min, 5 min
Two classes:
- DPR offenders
- Non-DPR offenders (not in same cage)
[Figure: system hierarchy, as on slide 6]
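The three look-back windows can be sketched as a per-node average over the minute-level temperature samples that precede a DPR timestamp. The code below assumes samples arrive as `(timestamp, temperature)` pairs; that layout and the name `window_means` are illustrative assumptions, not the authors' pipeline.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Look-back windows from the slide, in minutes.
WINDOWS_MIN = (60, 15, 5)


def window_means(
    samples: List[Tuple[datetime, float]],
    dpr_time: datetime,
) -> Dict[int, Optional[float]]:
    """Mean temperature in each look-back window ending at dpr_time.

    Returns None for a window that contains no samples.
    """
    result: Dict[int, Optional[float]] = {}
    for minutes in WINDOWS_MIN:
        start = dpr_time - timedelta(minutes=minutes)
        temps = [temp for ts, temp in samples if start <= ts < dpr_time]
        result[minutes] = mean(temps) if temps else None
    return result
```

For the non-DPR class, the same function can be applied to a node outside the offender's cage, with the offender's DPR timestamp as the reference time.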

15 Temperature Averages
Conjecture: high temperature → more DPRs
[Figure: average temperature (°C) 60 min, 15 min, and 5 min before DPR occurrence; series: DPR, Non-DPR (not in same cage)]

16 Temperatures in Detail
Conjecture: high temperature → more DPRs
[Figure: CDFs of temperature (°C) for DPR and non-DPR nodes; panels: 5 min and 60 min before DPR occurrence; annotation: 8 °C]
Keeping temperature high may increase the probability of DPR occurrences
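The CDF panels compare, for each class, the fraction of temperature readings at or below a given value. A minimal empirical CDF is sketched below; the function name `empirical_cdf` is hypothetical and the sample values in the usage lines are synthetic, not measurements from the study.

```python
from bisect import bisect_right
from typing import Sequence


def empirical_cdf(values: Sequence[float], threshold: float) -> float:
    """Fraction of values that are less than or equal to threshold."""
    if not values:
        raise ValueError("empirical_cdf needs at least one value")
    ordered = sorted(values)
    # bisect_right returns how many sorted values are <= threshold.
    return bisect_right(ordered, threshold) / len(ordered)


if __name__ == "__main__":
    # Synthetic readings in degrees C, only to show the call shape.
    dpr_temps = [38.0, 41.5, 44.0, 47.0]
    non_dpr_temps = [30.0, 33.0, 36.5, 40.0]
    for t in (35.0, 40.0, 45.0):
        print(t, empirical_cdf(dpr_temps, t), empirical_cdf(non_dpr_temps, t))
```

If the DPR curve sits to the right of the non-DPR curve, DPR nodes were warmer over that window, which is the pattern the slide's closing statement describes.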

17 DPRs in Summary
Factors examined for DPRs: applications, users, relation to SBEs (counter-intuitive), temperature

18 Conclusions and Future Work
Conclusions:
- Monitor Titan supercomputer for more than 130 days
- Study characteristics of SBEs and DPRs
- Investigate factors associated with soft-errors
More in paper!
- Nodes with more SBEs do not necessarily have more DPRs
- Nodes with soft-errors do not lead to degraded performance
What is missing?
- Predict GPU soft-errors

19 Thank you!
