Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun Feb 15, 2012
Energy Efficiency and Temperature

The energy problem:
- High cost: a 10 MW datacenter spends millions of dollars per year on operational and cooling costs
- Adverse effects on the environment

Temperature-induced challenges: cooling cost, leakage, performance, reliability. Thermal challenges accelerate in high-performance systems!
Is Energy Management Sufficient?

[Figure: % of time spent at various temperature ranges]

- Energy- or performance-aware methods are not always effective for managing temperature
- Dynamic techniques are needed that specifically address temperature-induced problems
- An efficient framework is needed for evaluating such dynamic techniques
Outline

- Modeling: integrated simulation of performance, power, temperature, and reliability
- Analysis: importance of modeling thermal variations; effect of thread migration policies
- Novel policies: 2X increase in processor lifetime at a performance cost of less than 4%
- Proactive management: learning workload characteristics for better runtime adaptation
Modeling Framework [Sigmetrics 09]

Offline: a phase profile (SimPoint) drives phase-based performance and power modeling (M5 / Wattch), and the results are stored in a database.
Runtime: the scheduling manager uses a query tool to retrieve performance/power values from the database; these feed instruction-level thermal modeling (HotSpot) and reliability computation.
Long-Term Performance Modeling

- SimPoint [Sherwood, ASPLOS 02] captures representative phases
- Complete phase profile of each application, similar to the Co-Phase Matrix for multi-threaded simulation [Biesbrouck, ISPASS 04]
- All available voltage/frequency settings are stored in the database
Phase Modeling

[Figure: power (Watts) over time (ms) for bzip, comparing M5/Wattch to phase-based modeling]

- Complete phase profile: every 100 M instructions
- Profile is recorded in the database: phase-ID trace, power & performance values
- Queried by the scheduler during simulation
Power Modeling and Management

- Dynamic power (ALU operations, cache accesses, branch predictions): M5 [Binkert, CAECW 03] and Wattch [Brooks, ISCA 00]
- Leakage power (component area, temperature, voltage setting): leakage model of [Su, ISLPED 03]
- L2 caches: dynamic & leakage power via CACTI [Tarjan, HP Labs]
- Dynamic power management: fixed timeout, i.e., put a core into sleep mode after it has been idle for t_timeout
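The fixed-timeout DPM rule can be sketched as follows. This is an illustrative Python sketch, not the framework's implementation; the function name, 1 ms trace granularity, and interface are assumptions.

```python
# Hypothetical sketch of fixed-timeout dynamic power management (DPM):
# a core is put into sleep mode once it has been idle longer than
# t_timeout. The trace is a list of per-millisecond idle flags.

def dpm_fixed_timeout(idle_trace_ms, t_timeout_ms):
    """Return how many 1 ms intervals the core spends asleep."""
    sleep_intervals = 0
    idle_run = 0
    for idle in idle_trace_ms:
        if idle:
            idle_run += 1
            if idle_run > t_timeout_ms:   # timeout expired: core sleeps
                sleep_intervals += 1
        else:
            idle_run = 0                  # activity resets the idle timer
    return sleep_intervals
```

A longer timeout trades fewer sleep/wake transitions (and less thermal cycling) against less time in the low-power state.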
Thread Management

The scheduling manager uses performance and/or temperature information to drive DVFS, DPM, migration, clock gating, and job scheduling.

Delay model:
- Sampling interval: 50 ms
- Core sleep/wake-up: 25 ms wake-up delay
- V/f change (DVFS): syscall + 20 us
- Migration: syscall + cold start
- Application startup: syscall + cold start

syscall: measured in Linux-M5 (<3 us). Cold start: average delay 204 us (range: 2 to 740 us), with a distinct penalty for each benchmark.
Thermal Modeling

The power trace produced by the scheduling manager, together with die and package properties (65 nm) from the database, drives the HotSpot thermal model [Skadron, ISCA 03]. [Figure: thermal map for bzip]
Reliability Modeling

Thermal hot spots [Failure Mechanisms for Semiconductor Devices, JEDEC]:
- Electromigration and time-dependent dielectric breakdown follow an Arrhenius-type model: λ ∝ e^(−Ea/(kT)), where λ is the failure rate, T the temperature, Ea the activation energy, and k Boltzmann's constant
- A 10-15°C increase in temperature causes a ~2X increase in failure rate

Thermal cycling [JEDEC]:
- Fatigue failures: λ ∝ (ΔT)^q · f, where ΔT is the magnitude of the thermal variation, q the fatigue exponent, and f the frequency of cycles
- A 10°C increase in ΔT makes failures happen 16 times more frequently
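The Arrhenius relation above can be checked numerically. This is a minimal sketch; the 0.7 eV activation energy is an illustrative assumption (a typical value for electromigration), not a number from the slides.

```python
import math

# Arrhenius-type failure-rate model: lambda ∝ exp(-Ea / (k*T)).
# The acceleration factor between two temperatures is the ratio
# of the two failure rates.

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def acceleration_factor(t_low_c, t_high_c, ea_ev=0.7):
    """Ratio of failure rates at two temperatures given in Celsius.
    ea_ev = 0.7 eV is an assumed, illustrative activation energy."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_low_k - 1.0 / t_high_k))
```

With these assumed parameters, going from 70°C to 80°C roughly doubles the failure rate, consistent with the ~2X-per-10-15°C rule of thumb on the slide.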
Migration and Clock Gating

- Stop-Go: when T > T_threshold, stop the clock
- Migration: when T > T_threshold, migrate the job to the coolest core
- Balance: assign the highest-IPC (highest-power) job to the coolest core
- Balance_Location: assign the highest-IPC job to the expected coolest location (IPC_1 > IPC_2 > ... > IPC_16)
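The threshold-triggered migration policy can be sketched in a few lines. This is an assumed illustration: the function name, data layout, and the 85°C default threshold are not from the slides.

```python
# Sketch of the Migration policy: when a core's temperature exceeds
# T_threshold, its job is moved to the coolest core.

def migrate_on_threshold(temps, jobs, t_threshold=85.0):
    """temps: per-core temperatures; jobs: per-core job id (None = idle).
    Returns a list of (job, src_core, dst_core) migrations."""
    migrations = []
    coolest = min(range(len(temps)), key=lambda c: temps[c])
    for core, temp in enumerate(temps):
        if temp > t_threshold and jobs[core] is not None and core != coolest:
            migrations.append((jobs[core], core, coolest))
    return migrations
```

The Balance variants differ only in the sort key: instead of reacting to a threshold, they proactively pair the highest-IPC jobs with the coolest cores (or coolest expected locations).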
Voltage/Frequency Scaling

- DVFS-Threshold: when T > T_threshold, reduce V/f one step
- DVFS-Location: set V/f by location (e.g., 100% vs. 95%)
- DVFS-Performance: memory-bound phases get low V/f, CPU-bound phases get high V/f, using a CPI-based metric µ [Dhiman, ISLPED 07]; low µ: 85%, medium µ: 95%, high µ: 100%. Worst-case performance cost: 5-6%
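The DVFS-Performance mapping from µ to a V/f setting can be sketched as a simple threshold table. The µ cut-points below are assumptions for illustration; only the 85%/95%/100% frequency levels come from the slide.

```python
# Sketch of CPI-based V/f selection (after Dhiman et al., ISLPED'07):
# low mu (memory-bound) tolerates a lower V/f setting with little
# performance loss; high mu (CPU-bound) keeps full V/f.

def select_vf(mu, thresholds=((0.3, 0.85), (0.7, 0.95))):
    """Map a compute-intensity metric mu in [0, 1] to a relative
    frequency setting. Threshold values 0.3/0.7 are assumed."""
    for mu_limit, f_setting in thresholds:
        if mu < mu_limit:
            return f_setting   # memory-bound enough for this level
    return 1.0                 # CPU-bound: run at full V/f
```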
Systems with Full Utilization

[Figure: MTTF, performance, and energy for policy combinations: balance_loc & dvfs_t, dvfs_t, balance_loc & dvfs_perf_t, dvfs_perf_t, balance_loc & loc_dvfs, location_dvfs]
Partial Utilization

[Figure: MTTF, performance, and energy for a system at 87.5% utilization, comparing balance, balance_loc, balance_loc & dvfs_t, balance_loc & dvfs_perf_t, balance_loc & loc_dvfs, dvfs_perf_t, dvfs_perf, dvfs_t, migration, location_dvfs, and stopgo]
Temporal Thermal Profiles

[Figure: temperature (°C) over time (s) for core5 and core15, under Migration vs. Balance_Location & Location_DVFS]

Balance_Location & Location_DVFS yields a low and stable profile for all the cores.
Breakdown of Failures

Dynamic power management (sleep states) accelerates thermal cycling.
Guidelines for Runtime Management

- Modeling thermal cycling is critical, especially for partially utilized systems.
- Policies that minimize the number of migrations help with both performance and reliability.
- Thermal asymmetries should be considered for effective thermal management.
- Proactive techniques can raise the performance of the entire system.
Reactive vs. Proactive Management

[Figure: temperature (°C) over time under reactive and proactive management]

- Reactive: act once temperature exceeds a threshold (e.g., DVFS, fetch-gating, workload migration)
- Proactive: forecast temperature, then adjust the workload, V/f setting, etc. ahead of time to reduce and balance temperature
Proactive Management Flow [Transactions on CAD 09]

Temperature data from thermal sensors feeds an ARMA predictor, with periodic model validation and model updates. The predicted temperature at time (t_current + t_n) for all cores drives the scheduler's temperature-aware allocation on cores.
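A minimal forecasting step in this spirit can be sketched with an autoregressive fit. Note this simplification drops the moving-average part of a true ARMA model; the AR(2) order, window interface, and function name are assumptions for illustration.

```python
import numpy as np

# AR(2) sketch of the temperature predictor: fit
#   T[t] = a1*T[t-1] + a2*T[t-2] + c
# to a window of sensor samples by least squares, then forecast
# one step ahead. A real ARMA predictor also models the residual
# (moving-average) term and validates the model periodically.

def ar2_forecast(samples):
    """Fit an AR(2) model to `samples` and predict the next value."""
    s = np.asarray(samples, dtype=float)
    X = np.column_stack([s[1:-1], s[:-2], np.ones(len(s) - 2)])
    y = s[2:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    a1, a2, c = coef
    return a1 * s[-1] + a2 * s[-2] + c
```

In the flow above, the forecast for (t_current + t_n) would be obtained by iterating this one-step prediction, and the model would be refit when validation detects drift.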
Temperature Prediction
What else can we predict?

[Figure: prediction example for bzip]

How about parallel workloads?
System Model

Dynamic Load Balancing (DLB): threads in the dispatching queues are allocated to cores as follows:
- A recently run thread is allocated to the core it ran on previously
- Otherwise, it is allocated to the core that has the lowest-priority thread

This leaves significant thermal imbalance at runtime, which motivates balancing.
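The DLB allocation rule above can be sketched directly. The function name and data layout are illustrative assumptions, not the Solaris implementation.

```python
# Sketch of default dynamic load balancing (DLB): a recently run
# thread returns to its previous core (locality); otherwise it goes
# to the core currently running the lowest-priority thread.

def dlb_allocate(core_priorities, last_core=None):
    """core_priorities: priority of the lowest-priority thread on each
    core. Returns the core index chosen for the incoming thread."""
    if last_core is not None:
        return last_core                  # locality: reuse previous core
    return min(range(len(core_priorities)),
               key=lambda c: core_priorities[c])
```

Because this rule is temperature-oblivious, threads can repeatedly land on already-hot cores, producing the runtime imbalance noted above.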
Proactive Temperature Balancing

- Uses the principle of locality, as in the default load balancing policy, at initial assignment
- Utilizes the ARMA predictor & thermal forecast: triggers when a core is projected to have a hot spot OR ΔT_spatial is projected to be large
- Moves waiting threads first to balance temperature
- Migrates running threads only as a last resort
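One balancing step can be sketched as follows. This is an assumed illustration of the policy's priorities (waiting threads before migrations); the data structures, function name, and 85°C threshold are not from the slides.

```python
# Sketch of one Proactive Temperature Balancing (PTB) step: using
# *forecast* temperatures, first move waiting threads off cores
# projected to be hot; migrate a running thread only if no waiting
# thread is available to move.

def ptb_step(forecast, waiting, running, t_hot=85.0):
    """forecast: predicted per-core temperature; waiting/running:
    per-core thread counts. Returns ('waiting'|'running', src, dst)."""
    moves = []
    order = sorted(range(len(forecast)), key=lambda c: forecast[c])
    for src in reversed(order):           # visit hottest cores first
        if forecast[src] <= t_hot:
            break                         # remaining cores are cool
        dst = order[0]                    # forecast-coolest core
        if src == dst:
            continue
        if waiting[src] > 0:              # prefer cheap queue moves
            moves.append(('waiting', src, dst))
        elif running[src] > 0:            # migration as a last resort
            moves.append(('running', src, dst))
    return moves
```

Acting on the forecast rather than the current reading is what lets the policy move work before the hot spot forms.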
Experimental Setup

Workload and power:
- Workload characterization (core utilization, cache misses, number of instructions, etc.) measured on Sun's UltraSPARC T1 (Niagara-1)
- Power values: average power for each unit; on Niagara-1, peak power is close to average power (Figure: Leon et al., ISSCC 06)

Simulation framework: scheduler, power manager, thermal simulator
Simulation Framework

- Scheduler (a. simulator, b. OS scheduler); inputs: workload information, floorplan and package, temperature (for dynamic policies)
- Power manager (DPM, DVFS); inputs: workload information, activity of cores
- Thermal simulator (HotSpot [Skadron, ISCA 03]); inputs: power trace for each unit, floorplan, package and die properties. Output: transient temperature response for each unit
Hot Spots and Performance

[Figure (a), simulator: % of hot spots above 85°C (left axis) and average performance (right axis) for Load Balancing, Reactive Migration, Reactive DVFS, Proactive DVFS, and Proactive Balancing, across Web-med, Web-high, Web & Database, Mplayer & Web, and their average]
Hot Spots

Proactive Balancing (PTB) reduces hot spots by 60% on average with respect to Reactive Migration.

[Figure (b), implementation in the Solaris scheduler: % of hot spots above 85°C for DLB, R-Mig, and PTB across Web-med, Database, Web&DB, Mplayer, and the average across all 8 benchmarks]
Thermal Gradients

Proactive Balancing bounds gradients to <3%.

[Figure (b), implementation in the Solaris scheduler: % of gradients above 15°C for DLB, R-Mig, and PTB, with no PM and with DPM]

Spatially balanced temperature improves cooling efficiency, reliability, and performance.
Thermal Cycles

The frequency of cycles is reduced to below 5% for the worst case.

[Figure (b), implementation in the Solaris scheduler: % of cycles above 20°C, average and maximum (Web-med), for DLB, R-Mig, and PTB]

Benefits of reducing cycling:
- Chip level: higher reliability
- Datacenter level: higher cooling efficiency, since fan speed or liquid flow rate does not need to vary frequently
Performance

Proactive Balancing achieves a significant reduction in performance cost in comparison to migration.

[Figure (b), implementation in the Solaris scheduler: performance of R-Mig and PTB on Web-med, Database, Web&DB, and Mplayer]

*Performance is relative to Dynamic Load Balancing; the performance metric is load average.
Summary & On-going Research

- We need joint analysis & management of power, performance, and temperature to achieve true energy efficiency.
- Intelligent management provides significant lifetime improvement at minimal performance cost.
- Proactive strategies learn system and workload dynamics and leverage this information for better decision making.
- Energy-aware software tuning for high-performance computing (HPC) applications
- Power capping of multicore systems running multithreaded workloads

[TEMM 11] [HPEC 11] [ICCAD 11] [MICRO 11]
Performance and Energy Aware Computing Laboratory

For more information: http://www.bu.edu/peaclab, acoskun@bu.edu