Powernightmares: The Challenge of Efficiently Using Sleep States on Multi-Core Systems

Similar documents
Powernightmares: The Challenge of Efficiently Using Sleep States on Multi-Core Systems

Potentials and Limitations for Energy Efficiency Auto-Tuning

POWER MANAGEMENT AND ENERGY EFFICIENCY

Quantifying power consumption variations of HPC systems using SPEC MPI benchmarks

Tracing and Visualization of Energy Related Metrics

I/O Profiling Towards the Exascale

Analyzing I/O Performance on a NEXTGenIO Class System

POWER-AWARE SOFTWARE ON ARM. Paul Fox

By Arjan Van De Ven, Senior Staff Software Engineer at Intel.

Adaptive Power Profiling for Many-Core HPC Architectures

Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology

Power Management for Embedded Systems

Measuring the impacts of the Preempt-RT patch

LOWERING POWER CONSUMPTION OF HEVC DECODING. Chi Ching Chi Techinische Universität Berlin - AES PEGPUM 2014

NEXTGenIO Performance Tools for In-Memory I/O

Evaluating the Energy Consumption of OpenMP Applications on Haswell Processors

VAMPIR & VAMPIRTRACE INTRODUCTION AND OVERVIEW

Interactive Performance Analysis with Vampir UCAR Software Engineering Assembly in Boulder/CO,

Tips and Tricks: Designing low power Native and WebApps. Harita Chilukuri and Abhishek Dhanotia

READEX Runtime Exploitation of Application Dynamism for Energyefficient

Analyzing Cache Bandwidth on the Intel Core 2 Architecture

Tweaking Linux for a Green Datacenter

Feng Chen and Xiaodong Zhang Dept. of Computer Science and Engineering The Ohio State University

Performance Profiler. Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava,

HAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms

Energy Efficiency in Operating Systems

Outline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work

Performance Analysis with Vampir

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Energy Efficiency Tuning: READEX. Madhura Kumaraswamy Technische Universität München

READEX: A Tool Suite for Dynamic Energy Tuning. Michael Gerndt Technische Universität München

Introduction to Energy-Efficient Software 2 nd life talk

User s Guide. Alexandra Yates Kristen C. Accardi

How to get realistic C-states latency and residency? Vincent Guittot

Software and Tools for HPE s The Machine Project

Analysis of RAPL Energy Prediction Accuracy in a Matrix Multiplication Application on Shared Memory

Towards More Power Friendly Xen

Performance analysis basics

Performance Analysis with Vampir

Applying Polling Techniques to QEMU

TCPivo A High-Performance Packet Replay Engine. Wu-chang Feng Ashvin Goel Abdelmajid Bezzaz Wu-chi Feng Jonathan Walpole

User s Guide. Alexandra Yates Kristen C. Accardi

Intel VTune Amplifier XE. Dr. Michael Klemm Software and Services Group Developer Relations Division

Introducing OTF / Vampir / VampirTrace

Introduction to Performance Tuning & Optimization Tools

Performance Profiling

Hybrid Implementation of 3D Kirchhoff Migration

AutoTune Workshop. Michael Gerndt Technische Universität München

<Insert Picture Here> Boost Linux Performance with Enhancements from Oracle

LEoNIDS: a Low-latency and Energyefficient Intrusion Detection System

MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces

Implementing Scheduling Algorithms. Real-Time and Embedded Systems (M) Lecture 9

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Energy Models for DVFS Processors

PyTimechart practical. Pierre Tardy Software Engineer - UMG October 2011

Fakultät Informatik, Institut für Technische Informatik, Professur Rechnerarchitektur. BenchIT. Project Overview

Datura The new HPC-Plant at Albert Einstein Institute

Performance Analysis with Vampir. Joseph Schuchart ZIH, Technische Universität Dresden

AUTOMATIC SMT THREADING

Tuning Alya with READEX for Energy-Efficiency

CHAPTER 7 IMPLEMENTATION OF DYNAMIC VOLTAGE SCALING IN LINUX SCHEDULER

rcuda: an approach to provide remote access to GPU computational power

Embedded Systems Architecture

COL862 Programming Assignment-1

Real Time Linux patches: history and usage

Fine Grained Power Modeling For Smartphones Using System Call Tracing. Y. Charlie Hu Paramvir Bahl Yi-Min Wang

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

Munara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.

Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

Advanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED

Managing Hardware Power Saving Modes for High Performance Computing

Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes

Profiling: Understand Your Application

ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION

Load Balancing. Minsoo Ryu. Department of Computer Science and Engineering. Hanyang University. Real-Time Computing and Communications Lab.

Simultaneous Multithreading on Pentium 4

Realtime Tuning 101. Tuning Applications on Red Hat MRG Realtime Clark Williams

Meet the Increased Demands on Your Infrastructure with Dell and Intel. ServerWatchTM Executive Brief

Performance Tuning VTune Performance Analyzer

Revealing the performance aspects in your code

Parallel Computing. Hwansoo Han (SKKU)

Sustainable Computing: Informatics and Systems

AMD Opteron Processors In the Cloud

Philippe Thierry Sr Staff Engineer Intel Corp.

Linux Plumber Conference Scheduler micro-conf

A Simple Framework for Energy Efficiency Evaluation and Hardware Parameter. Tuning with Modular Support for Different HPC Platforms

Automatic Tuning of HPC Applications with Periscope. Michael Gerndt, Michael Firbach, Isaias Compres Technische Universität München

Embedded System Architecture

How Linux RT_PREEMPT Works

Tizen Power Management

Bill Nesheim Sun Microsystems, Inc. Bob Kasten Intel Corporation

Accurate Characterization of the Variability in Power Consumption in Modern Mobile Processors

Design Principles for End-to-End Multicore Schedulers

Performance Tuning Guide for Cisco UCS M5 Servers

Scalable Critical Path Analysis for Hybrid MPI-CUDA Applications

Performance Analysis with Vampir

e-ale-rt-apps Building Real-Time Applications for Linux Version c CC-BY SA4

Optimal BIOS settings for HPC with Dell PowerEdge 12 th generation servers

Transcription:

Powernightmares: The Challenge of Efficiently Using Sleep States on Multi-Core Systems Thomas Ilsche, Marcus Hähnel, Robert Schöne, Mario Bielert, and Daniel Hackenberg Technische Universität Dresden

Observation 2 Systems with continuous energy measurement Tuned for low idle power consumption Prolonged phases of excessive power consumption during idle phases Powernightmare

Background Processor 3 Each processor is a package A package comprises multiple cores Each core has two hardware threads A hardware thread is called CPU Image source: http://download.intel.com/pressroom/kits/45nm/penryn_dualcore_txt.jpg

Background C-states 4 Idle power conservation Increasing latency Controllable per CPU, but applied per core Package C-state determined by lowest core C-state Effective use is essential for low idle power 160 140 120 100 80 60 40 20 0 135 W Package C-state TDP of Intel Xeon E5-2690 v3 38 W Shallower Lower C-state Sleep Deeper Higher 30 W 13 W C0 C1E C3 C6 Package TDP in Watts

Background Linux idle governor 5 Selects C-state for CPU ladder_governor gradually changes C-state menu_governor is based on a heuristic Heuristic used to predict idle time Next timer event with correction factor Repeatable interval detector (up to 8 data points) Latency requirement

Investigation lo2s 6 Uses Linux perf infrastructure Create a trace combining Active processes using the trace point sched_switch Selected C-state using the cpu_idle trace point External power measurements C-state residency using x86_adapt Available at https://github.com/tud-zih-energy/lo2s

Investigation lo2s 7 Scheduled processes Power measurement C-state of the cores Vampir showing a lo2s trace of a parallel build using make

Investigation Powernightmare 8 Up to 3 wakeups needed for correction after a misprediction by the heuristic Zoomed Full begin duration of Powernightmare: of a Scheduled C-states, tasks, system C-states and and socket socket power power

Triggering the issue 9 Code to reliably trigger a Powernightmare int main() { #pragma omp parallel { #pragma omp barrier while (1) { for (int i = 0; i < 8; i++) { #pragma omp barrier usleep (10); } sleep (10); } } }

Approaching the problem 10 Changing task behavior Improving the idle time prediction Biasing the prediction error C-state selection by hardware Mitigating the impact

Impact mitigation approach 11 Set a wakeup timer if huge difference between next known timer and predicted idle time Prediction correct Prediction incorrect Wakeup event in predicted time interval Cancel timer Timer triggers wakeup Ignore recent residency Enter high C-state Misprediction corrected

Powernightmare with timer 12 Fallback timer corrects wrong C-state selection Only 10 ms of shallow sleep Reduced impact of Powernightmare with active fallback timer

Verification 13 Measurements taken over 20 minutes Trigger workload every 10 seconds Power consumption during idle and trigger workload

Production servers? 14 Found on node of production HPC system taurus Lustre related pattern every 25 seconds Triggers one second Powernightmare After Scheduling Lustre ping of Lustre several related cores kernel remain tasks in C1

Summary 15 Analyzed pattern of inefficient use of sleep states Developed a methodology and tools to observe Investigation shows misprediction in idle governor Proposed solution to mitigate effect Discussion with Linux community initiated Increasing probability with rising number of cores Effect not limited to HPC Systems

Any questions?