Decelerating Suspend and Resume in Operating Systems. XSEL, Purdue ECE Shuang Zhai, Liwei Guo, Xiangyu Li, and Felix Xiaozhu Lin 02/21/2017

Similar documents
INFLUENTIAL OS RESEARCH

Message Passing. Advanced Operating Systems Tutorial 5

Modern systems: multicore issues

Implications of Multicore Systems. Advanced Operating Systems Lecture 10

the Barrelfish multikernel: with Timothy Roscoe

The Multikernel A new OS architecture for scalable multicore systems

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH

VIRTUALIZATION: IBM VM/370 AND XEN

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Design Principles for End-to-End Multicore Schedulers

Harnessing Energy Efficiency of. OS, Heterogeneous Platform

The Barrelfish operating system for heterogeneous multicore systems

New directions in OS design: the Barrelfish research operating system. Timothy Roscoe (Mothy) ETH Zurich Switzerland

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

Drowsy Power Management. Matthew Lentz James Litton Bobby Bhattacharjee University of Maryland

Energy Discounted Computing On Multicore Smartphones Meng Zhu & Kai Shen. Atul Bhargav

Breaking the Boundaries in Heterogeneous-ISA Datacenters

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Decoupling Cores, Kernels and Operating Systems

Embedded Systems: Architecture

Understanding the Characteristics of Android Wear OS. Renju Liu and Felix Xiaozhu Lin Purdue ECE

Chapter 5 B. Large and Fast: Exploiting Memory Hierarchy

Bachelor s Thesis Nr. 98b

CPU-GPU Heterogeneous Computing

Integrating CPU and GPU, The ARM Methodology. Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM

Building blocks for 64-bit Systems Development of System IP in ARM

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

UEFI ARM Update. UEFI PlugFest March 18-22, 2013 Andrew N. Sloss (ARM, Inc.) presented by

Building supercomputers from embedded technologies

ARM big.little Technology Unleashed An Improved User Experience Delivered

Dongjun Shin Samsung Electronics

SATA in Mobile Computing

The Mont-Blanc approach towards Exascale

Low-power Architecture. By: Jonathan Herbst Scott Duntley

Single System Image in a Linux-based Replicated Operating System Kernel

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics

ARM Powered SoCs OpenEmbedded: a framework for toolcha. generation and rootfs management

Common Kernel Development for Heterogeneous Linux Platforms

System Simulator for x86

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency

Chapter 5 C. Virtual machines

SoC Idling & CPU Cluster PM

Xen and the Art of Virtualization

Master s Thesis Nr. 10

EECS 750: Advanced Operating Systems. 01/29 /2014 Heechul Yun

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University

I/O and virtualization

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Techniques and tools for measuring energy efficiency of scientific software applications

WearDrive: Fast and Energy Efficient Storage for Wearables

Dynamic Translator-Based Virtualization

Bachelor s Thesis Nr. 138b

Module 6: INPUT - OUTPUT (I/O)

Position Paper: OpenMP scheduling on ARM big.little architecture

Enabling Arm DynamIQ support. Dan Handley (Arm) Ionela Voinescu (Arm) Vincent Guittot (Linaro)

Simulation Based Analysis and Debug of Heterogeneous Platforms

Big.LITTLE Processing with ARM Cortex -A15 & Cortex-A7

Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems

Popcorn: Bridging the Programmability Gap in Heterogeneous-ISA Platforms

Utilization-based Power Modeling of Modern Mobile Application Processor

Reducing CPU usage of a Toro Appliance

Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing

Heterogeneous Architecture. Luca Benini

Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs

Cycle Approximate Simulation of RISC-V Processors

Xen and the Art of Virtualization

Virtualization. ...or how adding another layer of abstraction is changing the world. CIS 399: Unix Skills University of Pennsylvania.

Each Milliwatt Matters

Unblinding the OS to Optimize User-Perceived Flash SSD Latency

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd

KeyStone II. CorePac Overview

KEYNOTE FOR

Runtime Integrity Checking for Exploit Mitigation on Embedded Devices

Virtualization. Dr. Yingwu Zhu

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs

Analyzing and Debugging Performance Issues with Advanced ARM CoreLink System IP Components

The Trouble with Hardware. Timothy Roscoe Systems Group, ETH Zurich November 30 th 2017

Nested Virtualization and Server Consolidation

Power Management as I knew it. Jim Kardach

StreamBox: Modern Stream Processing on a Multicore Machine

Suspend-aware Segment Cleaning in Log-structured File System

Optimizing TCP Receive Performance

Linköping University Post Print. epuma: a novel embedded parallel DSP platform for predictable computing

Part I Overview Chapter 1: Introduction

Hardware-Assisted On-Demand Hypervisor Activation for Efficient Security Critical Code Execution on Mobile Devices

Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores

Modern Processor Architectures. L25: Modern Compiler Design

Virtual Leverage: Server Consolidation in Open Source Environments. Margaret Lewis Commercial Software Strategist AMD

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018

MultiLanes: Providing Virtualized Storage for OS-level Virtualization on Many Cores

Efficient inter-kernel communication in a heterogeneous multikernel operating system. Ajithchandra Saya

Low Power System Design

AUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel. Alexander Züpke, Marc Bommert, Daniel Lohmann

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms

European energy efficient supercomputer project

CIS Operating Systems Memory Management Cache. Professor Qiang Zeng Fall 2017

A Disseminated Distributed OS for Hardware Resource Disaggregation Yizhou Shan

Transcription:

Decelerating Suspend and Resume in Operating Systems XSEL, Purdue ECE Shuang Zhai, Liwei Guo, Xiangyu Li, and Felix Xiaozhu Lin 02/21/2017 1

Mobile/IoT devices see many short-lived tasks Sleeping for a long time; Waking up frequently Smartwatch: 72 times per day Each task is short-lived Smartwatch: 10 secs Background task: < 1 sec 2

Suspend/Resume OS Workflow CPU ON CPU OFF Sync Filesystem Call IO Drivers Suspend Freeze Tasks Resume Call IO Drivers Thaw Tasks CPU OFF CPU ON 3

Suspend/Resume Is Expensive Slow suspend/resume is long known for desktop/server Suspend/resume mostly slowed down by SATA and USB devices These machines suspend/resume only occasionally Much worse on mobile/iot due to short-lived tasks Suspend/resume takes ~500 ms on Samsung Note4 Smartphone E.g. consume 43% of total energy on sensing benchmark[1] Need to understand suspend/resume on mobile/iot devices 1. M. Lentz, J. Litton, and B. Bhattacharjee. Drowsy power management. SOSP, 2015 4

Profiling Suspend/Resume Nexus 5 Samsung Gear Samsung Note 4 Panda Board 5

Suspend/Resume on Mobile SoC Is Slow Nexus 5 Gear Note 4 Panda Suspend 119 ms 191 ms 231 ms 262 ms Resume 88 ms 159 ms 316 ms 492 ms 6

Main Reason: IO Power Transitions Are Slow Others IO thaw tasks freeze tasks sync fs 100% 100% 80% 80% 60% 60% 40% 40% 20% 20% 0% Nexus 5 Gear Note 4 Panda Suspend 0% Nexus5 Gear Note4 Panda Resume 7

Slow IOs Are Various and Diverse Nexus 5 Gear Note 4 serial_hsl mdss pcieh mmc_host mmc_host dwmmc2 Top IO devices Nexus 5 Gear Note 4 187 120 116 Number of IO drivers for each platform 8

Alternative Solution: Async IO Power Transitions Asynchronous PM overlaps power transitions of multiple IO devices Key difficulty: dependencies among hundreds of IO devices Subtle and implicit OS may not know them Linux kernel community has a long debate Still very conservative about async PM 9

Our Key idea Objective Offloading suspend/resume to a very weak core Hardware support A weak core (common on mobile SoCs) Suspend Resume Kernel CPU Virtual Execution Weak core Software support A baremetal virtual executor on the weak core DRAM IO Offloading suspend/resume via virtualization 10

Weak Core on Modern SoCs Low power cores already exist on modern SoCs E.g. Apple motion coprocessor (Cortex-M3) Shared memory and IO bus; incoherent cache domain Heterogeneous but similar ISA (ARMv7/8 vs ARMv7m) 11

Weak Core on Modern SoCs Low power cores already exist on modern SoCs E.g. Apple motion coprocessor (Cortex-M3) Shared memory and IO bus; incoherent cache domain Heterogeneous but similar ISA (ARMv7/8 vs ARMv7m) Weak cores are ideal for executing OS suspend/resume Idle power 3.8 mw vs 30 mw Kernel execution favors weak cores [1] Small code working set Less predictable control flow 1. J. Mogul, J. Mudigonda, N. Binkert, P. Ranganathan, and V. Talwar. Using asymmetric singleisa cmps to save energy on operating systems. Micro, IEEE, 2008 12

Software Challenges Objective: Execute the kernel suspend/resume path on a weak core, without cache coherence and without a unified ISA Manually partitioning mature kernels is infeasible Modern kernels are beasts Windows: 45M SLoC 1 Linux 4.4: 16M SLoC 2 Suspend resume code is complicate (30k SLoC in Linux) Commodity kernels are rapidly evolving 1. https://www.facebook.com/windows/posts/155741344475532 2. https://www.linuxcounter.net/statistics/kernel 13

Our Solution Launching a virtual machine on the weak core to execute unmodified kernel binary for the main CPU Suspend Resume Kernel CPU DRAM Virtual Execution Weak core IO This contrasts with traditional virtualization Host is much more powerful than guest Offloading suspend/resume via virtualization 14

System Overview Main Core Main Core Weak Core CPU ON CPU ON Suspend CPU OFF Weak ON Resume CPU OFF Suspended Binary translation of unmodified kernel CPU ON CPU ON Weak OFF Commodity Our System 15

Does This Really Work? No one would believe binary translation works for us We need aggressive optimizations ~20x slow down from initial implementation Reason: commodity binary translators are generic and conservative Status register is emulated Frequent Interrupt Check 16

Our Key Optimizations Exploit ISA similarity (ARMv7 vs ARMv7M) Baremetal stacks Relaxed handling of interrupts and exceptions Kernel virtual memory 17

Current Implementation Linux 4.4 Dual ARM Cortex-A9 Cache & MMU System-level Interconnect Main Memory QEMU based virtual executor Dual ARM Cortex-M3 Cache IO Platform: TI OMAP4 SoC Trimming down QEMU from 2.6M SLoC to 50.5K SLoc 4.5K SLoC new code A first-of-its-kind virtualization environment on an embedded core TI OMAP4 18

callback kfifo glob Microbenchmarks Native Translated (unoptimized) Translated (optimized) Performance metric: # of CPU cycles Baseline: native compilation & execution on the main CPU (Cortex-A9) Native: native compilation & execution on weak core (Cortex-M3) Translated (unoptimized/optimized): translated execution on weak core 0X 5X 10X 15X 20X Overhead in terms of cycle count 19

callback kfifo glob Microbenchmarks Native Translated (unoptimized) Translated (optimized) 0X 5X 10X 15X 20X Overhead in terms of cycle count 1. M. Lentz, J. Litton, and B. Bhattacharjee. Drowsy power management. SOSP, 2015 Performance metric: # of CPU cycles Baseline: native compilation & execution on the main CPU (Cortex-A9) Native: native compilation & execution on weak core (Cortex-M3) Translated (unoptimized/optimized): translated execution on weak core Optimization Result: 5x overhead reduction 2x within native execution Estimated Energy Saving: 70% energy reduced in suspend/resume 30% overall battery life extended Benchmark: Mobile sensing scenario [1] 20

Summary Observation: Busy/idle waits for IOs bottleneck OS suspend/resume path Goal: Offloading suspend/resume to a weak core with incoherent cache and heterogenous ISAs Key idea: Binary translate and execute unmodified kernel on weak core Highlight: For the first time we run a virtual environment on an embedded core for offloading specific kernel paths Suspend Resume Kernel CPU DRAM Main Core CPU ON CPU OFF Binary translation of unmodified kernel Virtual Execution Weak core IO Weak Core Weak ON Suspended CPU ON Weak OFF 21

Q/A 22

ARM big.little SoC Little Core Power Big Core Power Ratio Exynos 5430 85 mw (Cortex A7) 750 mw (Cortex A15) 8.8 Exynos 5433 189 mw (Cortex A53) 1480 mw (Cortex A57) OMAP 4460 21.1 mw (Cortex M3) 672 mw (Cortex A9) 31.8 Power Consumption Comparison between ARM big.little and OMAP4 Performance Energy Performance/Energy A15 (Exynox 5430) 99.69MB/s 19.75mWh ~5.04 A7 (Exynos 5430) 77.93MB/s 10.56mWh ~7.38 A57 (Exynos 5433) 155.29MB/s 27.72mWh ~5.60 A53 (Exynos 5433) 109.36MB/s 17.11mWh ~6.39 BaseMark OS II - XML Parsing Energy Efficiency 7.8 23

How do we estimate our energy saving Without offloading: E cpu = (T busy_exec + T busy_wait ) * P busy + T idle * P idle With offloading: E pm = X * F * T busy_exec * P busy + T busy_wait * P busy + T idle * P idle 24

Prior Art: Multikernel OSes Single System Image One kernel for each type of cores Helios [1] Barrelfish [2] K2 [3] Popcorn Linux [4] Kernels often pass messages to communicate They give up compatibility with commodity kernels Kernel A Kernel B Core A Core B DRAM IO A multikernel OS 1. E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. Helios: heterogeneous multiprocessing with satellite kernels. SOSP, 2009. 2. Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The Multikernel: A new OS architecture for scalable multicore systems. SOSP, 2009. 3. F. X. Lin, Z. Wang, and L. Zhong. K2: A mobile operating system for heterogeneous coherence domains. ASPLOS, 2014 4. Antonio Barbalace, Marina Sadini, Saif B.M. Ansary, Christopher Jelesnianski, Akshay Ravichandran, Cagil Kendir, Alastair Murray and Binoy Ravindran,"Popcorn: Bridging the Programmability Gap in Heterogeneous-ISA Platforms. EuroSys, 2015 25