Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems


Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems
Ayse K. Coskun
Electrical and Computer Engineering Department, Boston University
http://people.bu.edu/acoskun
Feb 15, 2012

Energy Efficiency and Temperature
Temperature-induced challenges: cooling cost, leakage, performance, reliability.
Thermal challenges accelerate in high-performance systems!
Energy problem:
High cost: a 10MW datacenter spends millions of dollars per year for operational and cooling costs.
Adverse effects on the environment.

Is Energy Management Sufficient?
[Figure: % time spent at various temperature ranges.]
Energy or performance-aware methods are not always effective for managing temperature.
What is needed:
Dynamic techniques specifically addressing temperature-induced problems.
An efficient framework for evaluating dynamic techniques.

Outline
Modeling: integrated simulation of performance, power, temperature and reliability.
Analysis: importance of modeling thermal variations; effect of thread migration policies.
Novel policies: 2X increase in processor lifetime with a performance cost of less than 4%.
Proactive management: learning workload characteristics for better runtime adaptation.

Modeling Framework [Sigmetrics 09]
[Diagram split into Offline and Runtime stages. Component labels: Performance Simulator; Power Modeling; Instruction-Level; Thermal Modeling; Phase Profile (SimPoint); Phase-Based Performance & Power Modeling (M5 / Wattch); Database; Performance / Power Query Tool; Scheduling Manager; Thermal Modeling (HotSpot); Reliability Computation.]

Long-Term Performance Modeling
SimPoint [Sherwood, ASPLOS 02] captures representative phases.
A complete phase profile of each application is collected, similar to the Co-Phase Matrix for multi-threaded simulation [Biesbrouck, ISPASS 04].
Profiles cover all available voltage/frequency settings and are stored in the database.

Phase Modeling
[Figure: power (Watts) vs. time (ms) for bzip, comparing M5/Wattch with the phase-based model.]
Complete phase profile: every 100 M instructions.
The profile is recorded in the database: phase-ID trace, power & performance values.
The database is queried by the scheduler during simulation.
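A minimal Python sketch of how a scheduler might query a recorded phase profile during simulation. Only the 100 M-instruction phase granularity comes from the slide; the dictionary layout, the phase IDs, the wrap-around of the trace, and every power/IPC number below are illustrative assumptions.

```python
from dataclasses import dataclass

PHASE_LENGTH_INSTR = 100_000_000  # one phase per 100 M instructions (from the slide)


@dataclass(frozen=True)
class PhaseRecord:
    power_watts: float  # average power of the phase (made-up value)
    ipc: float          # average instructions per cycle (made-up value)


# Hypothetical database: (benchmark, phase_id, V/f setting in %) -> record.
PHASE_DB = {
    ("bzip", 0, 100): PhaseRecord(11.2, 1.4),
    ("bzip", 1, 100): PhaseRecord(8.9, 0.9),
    ("bzip", 0, 85): PhaseRecord(8.1, 1.4),
    ("bzip", 1, 85): PhaseRecord(6.5, 0.9),
}

# Hypothetical phase-ID trace: one phase ID per 100 M-instruction interval.
PHASE_TRACE = {"bzip": [0, 0, 1, 0, 1, 1]}


def query(benchmark, instructions_retired, vf_percent):
    """Return the recorded power/performance for the phase the thread is in."""
    interval = instructions_retired // PHASE_LENGTH_INSTR
    trace = PHASE_TRACE[benchmark]
    # Assumption: the trace wraps around when the application restarts.
    phase_id = trace[interval % len(trace)]
    return PHASE_DB[(benchmark, phase_id, vf_percent)]
```

For example, `query("bzip", 250_000_000, 100)` falls in the third interval of the trace and returns the record stored for phase 1 at the 100% setting.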

Power Modeling and Management
Dynamic power: activity (ALU operations, cache accesses, branch predictions) from M5 [Binkert, CAECW 03] feeds Wattch [Brooks, ISCA 00].
Leakage power: component area, temperature, and voltage setting feed the leakage model [Su, ISLPED 03].
L2 caches: dynamic & leakage power from CACTI [Tarjan, HP Labs].
Together these produce the POWER TRACE.
Dynamic Power Management: fixed timeout. Put a core into sleep mode after it has been idle for t_timeout.
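The fixed-timeout policy can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the simulator's code: the function name, the interval representation, and the choice to ignore wake-up delay are all hypothetical.

```python
def sleep_intervals(busy_periods, t_timeout, end_time):
    """Return (sleep_start, sleep_end) intervals under a fixed-timeout policy.

    busy_periods: sorted, non-overlapping (start, end) times when the core runs.
    A core goes to sleep once it has been idle for t_timeout and wakes when
    the next busy period starts (wake-up delay is ignored in this sketch).
    """
    sleeps = []
    idle_start = 0.0
    for start, end in busy_periods:
        if start - idle_start > t_timeout:
            sleeps.append((idle_start + t_timeout, start))
        idle_start = end
    # Trailing idle time after the last busy period.
    if end_time - idle_start > t_timeout:
        sleeps.append((idle_start + t_timeout, end_time))
    return sleeps
```

With busy periods `[(0, 10), (50, 60)]`, a timeout of 20, and an end time of 100, the core sleeps during `(30, 50)` and `(80, 100)`.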

Thread Management
The Scheduling Manager uses performance and/or temperature info to control DVFS, DPM, migration, clock-gating, and job scheduling.
Parameters:
  Sampling interval: 50ms
  Wake-up delay: 25ms
Delay model (V/f change, core sleep/wake-up, migration):
  DVFS: syscall + 20us
  Migration: syscall + cold start
  Application startup: syscall + cold start
  syscall: measured in Linux-M5 (<3us)
  Cold start: average delay 204us (range: 2 to 740us); distinct penalty for each benchmark
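As a rough illustration of the delay model, the sketch below adds up the per-action overheads quoted on the slide. Using the 3 us upper bound for the syscall and the 204 us average for the cold start are simplifying assumptions (the slide notes a distinct cold-start penalty per benchmark), and the function and constant names are hypothetical.

```python
SYSCALL_US = 3.0            # slide: measured in Linux-M5, < 3 us (upper bound used here)
DVFS_EXTRA_US = 20.0        # slide: DVFS costs syscall + 20 us
AVG_COLD_START_US = 204.0   # slide: average; per-benchmark values range from 2 to 740 us


def transition_delay_us(action, cold_start_us=AVG_COLD_START_US):
    """Return the modeled delay in microseconds for one management action."""
    if action == "dvfs":
        return SYSCALL_US + DVFS_EXTRA_US
    if action in ("migration", "startup"):
        return SYSCALL_US + cold_start_us
    raise ValueError(f"unknown action: {action}")
```

Passing a benchmark-specific `cold_start_us` models the per-benchmark penalty; the default falls back to the reported average.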

Thermal Modeling
[Diagram labels: Database; Scheduling Manager; POWER TRACE; Thermal Model (HotSpot [Skadron, ISCA 03]); Die and Package Properties (65nm). Example workload: bzip.]

Reliability Modeling
Thermal hot spots [Failure Mechanisms for Semiconductor Devices, JEDEC]: electromigration and time-dependent dielectric breakdown, with $\lambda \propto e^{-E_a/(kT)}$, where $\lambda$ is the failure rate, $T$ the temperature, $E_a$ the activation energy, and $k$ Boltzmann's constant. A 10-15 C increase in temperature causes a ~2X increase in failure rate.
Thermal cycling [JEDEC]: fatigue failures $\propto \Delta T^{q} f$, where $\Delta T$ is the magnitude of variation and $f$ the frequency of cycles. With a 10 C increase in $\Delta T$, failures happen 16 times more frequently.
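To make the temperature dependence concrete, the Python sketch below evaluates the ratio of two failure rates under the Arrhenius-type relation on this slide (failure rate proportional to exp(-E_a/(kT))). The activation energy of 0.7 eV in the usage note is an assumed, illustrative value; the slides do not state one.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def failure_rate_ratio(t_low_c, t_high_c, activation_energy_ev):
    """Ratio lambda(t_high) / lambda(t_low) for lambda ~ exp(-Ea / (k * T)).

    Temperatures are given in degrees Celsius and converted to kelvin.
    """
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_low_k - 1.0 / t_high_k)
    return math.exp(exponent)
```

With the assumed 0.7 eV, `failure_rate_ratio(75, 85, 0.7)` comes out a little under 2, in line with the slide's rule of thumb that a 10-15 C increase roughly doubles the failure rate.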

Migration and Clock Gating
Stop-Go: if T > T_threshold, stop the clock.
Migration: if T > T_threshold, migrate the job to the coolest core.
Balance: assign the highest-IPC (high power) job to the coolest core.
Balance_Location: assign the highest-IPC job to the expected coolest location.
Jobs are ordered by IPC: IPC_1 > IPC_2 > ... > IPC_16.
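The Balance policy pairs jobs and cores by rank. A minimal Python sketch, assuming one job per core and hypothetical dictionary inputs:

```python
def balance(job_ipc, core_temp_c):
    """Pair the highest-IPC job with the coolest core, and so on down the ranks.

    job_ipc: dict mapping job name -> measured IPC.
    core_temp_c: dict mapping core id -> current temperature in Celsius.
    Returns a dict mapping job name -> core id.
    """
    jobs = sorted(job_ipc, key=job_ipc.get, reverse=True)  # highest IPC first
    cores = sorted(core_temp_c, key=core_temp_c.get)       # coolest core first
    return dict(zip(jobs, cores))
```

Balance_Location would use the same pairing but rank cores by their expected thermal behavior at each floorplan location rather than by current temperature; how that expectation is computed is not given on the slide.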

Voltage/Frequency Scaling
DVFS-Threshold: if T > T_threshold, reduce V/f one step.
DVFS-Location: [figure labels: 100%, 95%, 85%].
DVFS-Performance: memory-bound jobs run at low V/f, CPU-bound jobs at high V/f, selected by µ, a CPI-based metric [Dhiman, ISLPED 07]. Low µ: 85%; medium µ: 95%; high µ: 100%. 5-6% worst-case performance cost.
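A small Python sketch of the two rule-based DVFS policies. The three V/f levels come from the slide; the 85 C default threshold and the µ cut-offs of 0.4 and 0.7 are placeholder assumptions, since the slide gives no numeric values for them. The sketch also only steps down, because the slide does not say when V/f is raised again.

```python
VF_STEPS = [1.00, 0.95, 0.85]  # available V/f settings from the slide, highest first


def dvfs_threshold(step_index, temp_c, threshold_c=85.0):
    """DVFS-Threshold: move one step down the V/f list when the core is too hot."""
    if temp_c > threshold_c and step_index < len(VF_STEPS) - 1:
        return step_index + 1
    return step_index


def dvfs_performance(mu, low_cut=0.4, high_cut=0.7):
    """DVFS-Performance: choose V/f from the CPI-based metric mu.

    Low mu (memory-bound) -> 85%, medium mu -> 95%, high mu (CPU-bound) -> 100%.
    The cut-off values are hypothetical.
    """
    if mu < low_cut:
        return 0.85
    if mu < high_cut:
        return 0.95
    return 1.00
```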

Systems with Full Utilization
[Figure: MTTF, performance, and energy for the policies balance_loc & dvfs_t, dvfs_t, balance_loc & dvfs_perf_t, dvfs_perf_t, balance_loc & loc_dvfs, and location_dvfs.]

Partial Utilization
[Figure: MTTF, performance, and energy with the system 87.5% utilized, for the policies balance, balance_loc, balance_loc & dvfs_t, balance_loc & dvfs_perf_t, balance_loc & loc_dvfs, dvfs_perf_t, dvfs_perf, dvfs_t, migration, location_dvfs, and stopgo.]

Temporal Thermal Profiles
[Figure: temperature (C) vs. time (s) for core5 and core15, under Migration and under Balance_Location & Location_DVFS.]
Balance_Location & Location_DVFS: low & stable profile for all the cores.

Breakdown of Failures
Dynamic power management: sleep state, accelerated thermal cycling.

Guidelines for Runtime Management
Modeling thermal cycling is critical, especially for partially utilized systems.
Policies that minimize the number of migrations help with both performance and reliability.
Thermal asymmetries should be considered for effective thermal management.
Proactive techniques can raise the performance of the entire system.

Reactive vs. Proactive Management
Reactive: e.g., DVFS, fetch-gating, workload migration.
Proactive: forecast the temperature, then adjust workload, V/f setting, etc. to reduce and balance temperature.
[Figure: temperature (C) vs. time for reactive and proactive management, showing the forecast and T after proactive management.]

Proactive Management Flow [Transactions on CAD 09]
Input: temperature data from thermal sensors.
Predictor (ARMA), with periodic ARMA model validation & model update.
Output: temperature at time (t_current + t_n) for all cores.
SCHEDULER: temperature-aware allocation on cores.
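A minimal Python sketch of multi-step ARMA forecasting for one core's temperature trace. It assumes the model coefficients have already been fitted elsewhere (fitting, validation, and model update are omitted), and it uses the convention y_t = sum(ar[i] * y_{t-1-i}) + e_t + sum(ma[j] * e_{t-1-j}) on the mean-removed series. The function name and coefficient values are hypothetical.

```python
def arma_forecast(history, ar, ma, steps):
    """Forecast `steps` future samples from past samples and fixed coefficients."""
    mean = sum(history) / len(history)
    y = [v - mean for v in history]  # work on the mean-removed series

    def predict(t, e):
        # One-step prediction of y[t] from earlier samples and residuals.
        ar_part = sum(a * y[t - 1 - i] for i, a in enumerate(ar) if t - 1 - i >= 0)
        ma_part = sum(c * e[t - 1 - j] for j, c in enumerate(ma) if t - 1 - j >= 0)
        return ar_part + ma_part

    # Reconstruct one-step-ahead residuals over the observed history.
    e = []
    for t in range(len(y)):
        e.append(y[t] - predict(t, e))

    forecasts = []
    for _ in range(steps):
        pred = predict(len(y), e)
        y.append(pred)
        e.append(0.0)  # the expected value of future noise is zero
        forecasts.append(pred + mean)
    return forecasts
```

A scheduler would call this once per core at each sampling interval and pass the predicted temperatures to the allocation policy.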

Temperature Prediction

What else can we predict?
[Figure: bzip.]
How about parallel workloads?

System Model
[Diagram: threads enter per-core dispatching queues (Core-1, Core-2, Core-3, ...).]
Allocation policy, Dynamic Load Balancing (DLB):
Recently run thread: allocate to the core it ran on previously.
Otherwise: allocate to the core that has the lowest priority thread.
Significant imbalance at runtime: balance.
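The DLB allocation rule can be sketched in Python as below. Several details are assumptions rather than facts from the slide: larger numbers mean higher priority, "the core that has the lowest priority thread" is read as the core whose queue contains the single lowest-priority thread, and an empty queue is treated as the best target.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    name: str
    priority: int                    # assumption: larger value = higher priority
    last_core: Optional[int] = None  # None if the thread has not run recently


def allocate(thread, queues):
    """Append `thread` to a dispatching queue and return the chosen core id.

    queues: dict mapping core id -> list of Thread objects waiting on that core.
    """
    if thread.last_core is not None and thread.last_core in queues:
        # Recently run thread: keep it on its previous core (locality).
        core = thread.last_core
    else:
        # Otherwise: pick the core holding the lowest-priority thread.
        core = min(
            queues,
            key=lambda c: min((t.priority for t in queues[c]), default=float("-inf")),
        )
    queues[core].append(thread)
    return core
```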

Proactive Temperature Balancing
Uses the principle of locality at initial assignment, as in the default load balancing policy.
Utilizes the ARMA predictor & thermal forecast. If a core is projected to have a hot spot OR the spatial ΔT is projected to be large:
Move waiting threads first to balance temperature.
Migrate running threads as a last resort.
[Diagram: waiting and running threads on Core-1 and Core-2.]
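A minimal Python sketch of the decision step in proactive temperature balancing: given forecast temperatures, prefer moving a waiting thread and migrate a running thread only when nothing is waiting. The 85 C hot-spot limit and 15 C spatial-gradient limit are assumed defaults (the same values are used elsewhere in the slides as reporting thresholds), and the data layout is hypothetical.

```python
def plan_move(forecast_c, waiting, running, hot_c=85.0, max_spread_c=15.0):
    """Return (kind, thread, source_core, target_core), or None if no action is needed.

    forecast_c: dict core id -> predicted temperature in Celsius.
    waiting: dict core id -> list of waiting thread names.
    running: dict core id -> running thread name, or None if the core is idle.
    """
    hottest = max(forecast_c, key=forecast_c.get)
    coolest = min(forecast_c, key=forecast_c.get)
    spread = forecast_c[hottest] - forecast_c[coolest]
    if forecast_c[hottest] <= hot_c and spread <= max_spread_c:
        return None  # no projected hot spot and no large spatial gradient
    if waiting.get(hottest):
        # First choice: move a waiting thread away from the hot core.
        return ("move_waiting", waiting[hottest][-1], hottest, coolest)
    if running.get(hottest) is not None:
        # Last resort: migrate the running thread.
        return ("migrate_running", running[hottest], hottest, coolest)
    return None
```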

Experimental Setup: Workload and Power
Workload characterization: measured on Sun's UltraSPARC T1 (Niagara-1); core utilization, cache misses, number of instructions, etc.
Power values: average power for each unit. On Niagara-1, peak power is close to average power.
[Figure: Leon et al., ISSCC 06.]
Simulation framework: scheduler, power manager, thermal simulator.

Simulation Framework
Scheduler (a. simulator, b. OS scheduler). Inputs: workload information; floorplan, package; temperature (for dynamic policies).
Power Manager (DPM, DVFS). Inputs: workload information; activity of cores.
Thermal Simulator (HotSpot [Skadron, ISCA 03]). Inputs: power trace for each unit; floorplan, package and die properties.
Output: transient temperature response for each unit.

Hot Spots and Performance
[Figure (a), simulator: % hot spots > 85 C and average performance (right axis) for Load Balancing, Reactive Migration, Reactive DVFS, Proactive DVFS, and Proactive Balancing, on the workloads Web-med, Web-high, Web & Database, Mplayer & Web, and AVG.]

Hot Spots
Proactive Balancing (PTB) reduces hot spots by 60% on average w.r.t. Reactive Migration.
[Figure (b), implementation in Solaris scheduler: % hot spots > 85 C for DLB, R-Mig, and PTB on Web-med, Database, Web&DB, Mplayer, and AVG across all 8 benchmarks.]

Thermal Gradients
Proactive Balancing bounds gradients to <3%.
[Figure (b), implementation in Solaris scheduler: % of gradients > 15 C for DLB, R-Mig, and PTB, with No PM and with DPM.]
Spatially balanced temperature improves cooling efficiency, reliability, and performance.

Thermal Cycles
The frequency of cycles is reduced to below 5% for the worst case.
[Figure (b), implementation in Solaris scheduler: % of cycles > 20 C, AVG and MAX (Web-med), for DLB, R-Mig, and PTB.]
Benefits of reducing cycling:
Chip level: higher reliability.
Datacenter level: higher cooling efficiency; fan speed or liquid flow rate does not need to vary frequently.

Performance
Proactive Balancing achieves a significant reduction in performance cost in comparison to migration.
[Figure (b), implementation in Solaris scheduler: performance of R-Mig and PTB on Web-med, Database, Web&DB, and Mplayer.]
*Performance relative to Dynamic Load Balancing. Performance metric is load average.

Summary & On-going Research
We need joint analysis & management of power, performance, and temperature for achieving true energy efficiency.
Intelligent management provides significant lifetime improvement at minimal performance cost.
Proactive strategies learn system and workload dynamics and leverage this information for better decision making.
On-going research:
Energy-aware software tuning for high performance computing (HPC) applications.
Power capping of multicore systems running multithreaded workloads.
[TEMM 11] [HPEC 11] [ICCAD 11] [MICRO 11]

Performance and Energy Aware Computing Laboratory
For more information: http://www.bu.edu/peaclab
acoskun@bu.edu