Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications


1 Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications
Robert Pyka*, Christoph Faßbach*, Manish Verma+, Heiko Falk*, Peter Marwedel*
* University of Dortmund
+ ETC, Altera Europe

2 Outline
Motivation and Introduction
Compile-time Transformations
Runtime Components
Optimal Allocation Strategy
Results and Conclusion

3 Motivation - Energy Consumption
More than 200 experiments with real-life benchmarks: the memory subsystem accounts for about 65% of the energy of a uniprocessor ARM system
[Chart: energy of an ARM processor without caches: memory 65.2%, processor 34.8%]
The memory subsystem is the performance and energy bottleneck (memory wall problem)
Increasing number of portable devices
Growing complexity of embedded software
Reducing energy consumption is unavoidable
[M. Verma, Advanced Memory Opt., PhD, 2006]

4 Motivation - Predictability
[Figure: performance (min. / max.) over time t across a context switch, for a cache-based system and a scratchpad-based system]
For SPMs: predictable performance degradation
For caches: lightweight context switches, but unpredictable degradation

5 Introduction
Scratchpads
  Fast and predictable accesses
  Energy efficient
  On-chip
Management
  Centralized
  Runtime decisions, but compile-time supported
[Figure: CPU with on-chip SPM; address space containing MEM and SPM; an SPM manager assigning the SPM to App. 1 ... App. n over time t]

6 Approach overview
Two-fold approach: compile-time analysis and runtime decisions
Resembles HW-based caching, but with SW flexibility
Applications provide hints about their memory access patterns
No need to know the complete set of applications at compile-time
Capable of managing runtime allocated memory objects
Integrates into an embedded operating system
[Figure: App. 1 ... App. n pass through the compile-time transformations and a standard compiler (GCC); profit values / allocation hints are handed to the allocation manager inside the operating system]

7 Compile-time Transformations
[Figure: tool flow from Appl. C via LockGen / etc. to Prepared C, via the Profit Annotator to Adapted C, and via GCC to the Sys. Image; also shown: Profiler, Opt, RTEMS.o, SPMM.o; tools marked * are built with ICD-C]
Preparation step
  Mark MOs in use / insert locks
  Replace malloc() with spmalloc()
Adaptation step
  Insert MO property structures
  Add dereferencing layer
  Annotate profit values
  Split into one function per file
Compile & Link
  GCC compiler-in-the-loop
  Reuse of platform dependent tools
*Built with the tool design suite ICD-C (see
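A minimal sketch of what the adapted C code could look like once malloc() has been replaced with spmalloc() and a dereferencing layer has been added. The slides name spmalloc() but do not show its signature, so the handle-returning signature below is an assumption, and mo_t, MO_PTR, mo_move_to_spm() and fake_spm are hypothetical names that only stand in for the real SPMM interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical property structure for one memory object (MO). */
typedef struct {
    void  *location;  /* current address: heap or (simulated) scratchpad */
    size_t size;      /* size in bytes */
    int    locked;    /* nonzero while the application is using the MO */
    int    in_spm;    /* nonzero once the MO lives in the scratchpad */
} mo_t;

/* Stand-in for the on-chip scratchpad; a real system would map SPM here. */
static unsigned char fake_spm[256];

/* Assumed signature: returns a handle instead of a raw pointer. */
static mo_t *spmalloc(size_t size)
{
    mo_t *mo = malloc(sizeof *mo);
    if (mo == NULL)
        return NULL;
    mo->location = malloc(size);      /* every MO starts in main memory */
    if (mo->location == NULL) {
        free(mo);
        return NULL;
    }
    mo->size = size;
    mo->locked = 0;
    mo->in_spm = 0;
    return mo;
}

/* Dereferencing layer: all accesses go through the handle. */
#define MO_PTR(mo, type) ((type *)(mo)->location)

/* Hypothetical manager action: copy an unlocked MO into the scratchpad. */
static int mo_move_to_spm(mo_t *mo, size_t offset)
{
    if (mo->locked || mo->in_spm)
        return -1;                    /* locked MOs must not be moved */
    if (offset > sizeof fake_spm || mo->size > sizeof fake_spm - offset)
        return -1;                    /* does not fit at this offset */
    memcpy(fake_spm + offset, mo->location, mo->size);
    free(mo->location);
    mo->location = fake_spm + offset;
    mo->in_spm = 1;
    return 0;
}

/* Usage: the data stays reachable through the handle after the move. */
int demo_deref(void)
{
    mo_t *buf = spmalloc(4);
    if (buf == NULL)
        return -1;
    for (int i = 0; i < 4; i++)
        MO_PTR(buf, unsigned char)[i] = (unsigned char)(i * 10);

    buf->locked = 1;
    int refused = mo_move_to_spm(buf, 0);   /* rejected while locked */
    buf->locked = 0;
    int moved = mo_move_to_spm(buf, 0);     /* succeeds once unlocked */

    int value = MO_PTR(buf, unsigned char)[3];
    assert(buf->in_spm == (moved == 0));
    free(buf);                              /* payload is in fake_spm */
    return (refused == -1 && moved == 0) ? value : -1;
}
```

Because the application only touches the object through MO_PTR, a manager is free to change the location field whenever the object is not locked.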

8 Scratchpad Memory Manager
Structure
  Part of the RTEMS operating system
  Generic management module
  Replaceable allocation strategies
  Convenient application interface
[Figure: RTEMS with scheduler (Sched.), DMM and SPMM; the SPMM uses a replaceable allocation strategy and serves App. 1 ... App. n]
Memory object representation
  Created at compile-time
  Types: functions, global static data, dynamic data
  Additional data structures encapsulating the object's properties (size, profit value, current location, locking state)
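The memory object representation and the idea of a replaceable allocation strategy can be sketched as follows. The field list follows the properties named on the slide (size, profit value, current location, locking state) and the three object types; every identifier, the function-pointer interface and the example numbers are assumptions made for illustration, not the SPMM's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MO_FUNCTION, MO_GLOBAL_DATA, MO_DYNAMIC_DATA } mo_type_t;
typedef enum { LOC_MAIN_MEMORY, LOC_SCRATCHPAD } mo_location_t;

/* Hypothetical per-object property structure. */
typedef struct {
    mo_type_t     type;
    size_t        size;      /* bytes */
    unsigned long profit;    /* profit value annotated at compile-time */
    mo_location_t location;  /* current location */
    int           locked;    /* locking state */
} mem_object_t;

/* A strategy decides the placement and returns the SPM bytes in use. */
typedef size_t (*alloc_strategy_fn)(mem_object_t *mos, size_t n,
                                    size_t spm_size);

/* Generic manager: the strategy is just a replaceable pointer. */
typedef struct {
    size_t            spm_size;
    alloc_strategy_fn strategy;
} spmm_t;

size_t spmm_reallocate(const spmm_t *m, mem_object_t *mos, size_t n)
{
    return m->strategy(mos, n, m->spm_size);
}

/* Example strategy: keep locked MOs where they are, then place the
 * remaining MOs in list order as long as they still fit. */
size_t strategy_in_order(mem_object_t *mos, size_t n, size_t spm_size)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++)
        if (mos[i].locked && mos[i].location == LOC_SCRATCHPAD)
            used += mos[i].size;

    for (size_t i = 0; i < n; i++) {
        if (mos[i].locked)
            continue;                 /* locked MOs are never moved */
        if (used <= spm_size && mos[i].size <= spm_size - used) {
            mos[i].location = LOC_SCRATCHPAD;
            used += mos[i].size;
        } else {
            mos[i].location = LOC_MAIN_MEMORY;
        }
    }
    return used;
}

/* Usage with made-up sizes and profits on a 512-byte scratchpad. */
size_t demo_strategy(void)
{
    mem_object_t mos[] = {
        { MO_FUNCTION,     300, 900, LOC_MAIN_MEMORY, 0 },
        { MO_GLOBAL_DATA,  200, 400, LOC_SCRATCHPAD,  1 }, /* locked */
        { MO_DYNAMIC_DATA, 600, 100, LOC_MAIN_MEMORY, 0 }, /* too big */
    };
    spmm_t manager = { 512, strategy_in_order };
    size_t used = spmm_reallocate(&manager, mos, 3);
    assert(mos[2].location == LOC_MAIN_MEMORY);
    return used;
}
```

Swapping the strategy only means assigning a different function to the strategy field; the manager code itself stays unchanged.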

9 Runtime Allocation Strategies
Non-saving allocation strategies: MOs marked as locked are not removed from the scratchpad
  Statically Locking Allocation Strategy
  First Fit / Best Fit Strategy
  Single Pass / Triple Pass Strategy
Restoring allocation strategies: MOs marked as locked are restored to the same location
  Dynamically Overlaying Allocation Strategy
  Chunk Allocation Strategy
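To make the First Fit / Best Fit distinction concrete, here is a sketch of the two textbook hole-selection rules applied to the free regions that remain between locked MOs. This is the generic form of the technique, not the authors' implementation; hole_t, the function names and the example layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A free region of the scratchpad (hypothetical representation). */
typedef struct {
    size_t offset;
    size_t size;
} hole_t;

/* First fit: take the first hole that is large enough.
 * Returns the hole index, or -1 if nothing fits. */
int first_fit(const hole_t *holes, size_t n, size_t request)
{
    for (size_t i = 0; i < n; i++)
        if (holes[i].size >= request)
            return (int)i;
    return -1;
}

/* Best fit: take the smallest hole that is still large enough.
 * Returns the hole index, or -1 if nothing fits. */
int best_fit(const hole_t *holes, size_t n, size_t request)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (holes[i].size >= request &&
            (best < 0 || holes[i].size < holes[best].size))
            best = (int)i;
    return best;
}

/* Made-up layout: holes left between locked MOs on a 1 KiB scratchpad. */
const hole_t demo_holes[] = { { 0, 256 }, { 384, 64 }, { 640, 128 } };
```

For a 100-byte request on this layout, first fit picks the 256-byte hole at offset 0, while best fit picks the 128-byte hole at offset 640 and leaves the large hole intact for later requests.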

10 Optimal Allocation Strategy
ILP-based optimal offline allocation strategy
Needs a global view over the entire runtime
Context switching is fixed to precomputed points of control
Uses profiling to determine the sequential order of control points
Objective function to be maximized: sum of the profits of objects on the SPM,
  decreased by the copy cost to the SPM,
  decreased by the write-back cost (for data),
  decreased by the initial copy costs (for static data)
Constraints: locks, sizes, placement, movement, ...
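One possible way to write down the objective described above, as a sketch only; the slides do not give the actual ILP, so all symbols here are assumptions. Let x_{o,c} = 1 if object o is on the SPM during the interval that starts at control point c (with x_{o,0} = 0), y_{o,c} = 1 if o is copied to the SPM at c, w_{o,c} = 1 if o is written back at c, and z_o = 1 if the initial copy of static object o is incurred. O is the set of all objects, D the data objects, S the static data objects, p the profit, k the respective cost, s_o the object size and S_SPM the scratchpad capacity.

```latex
\[
\begin{aligned}
\max \quad & \sum_{c=1}^{C} \sum_{o \in O} p_{o,c}\, x_{o,c}
  \;-\; \sum_{c=1}^{C} \sum_{o \in O} k^{\mathrm{in}}_{o}\, y_{o,c}
  \;-\; \sum_{c=1}^{C} \sum_{o \in D} k^{\mathrm{wb}}_{o}\, w_{o,c}
  \;-\; \sum_{o \in S} k^{\mathrm{init}}_{o}\, z_{o} \\
\text{s.t.} \quad
  & \sum_{o \in O} s_{o}\, x_{o,c} \le S_{\mathrm{SPM}} && \forall c \\
  & y_{o,c} \ge x_{o,c} - x_{o,c-1} && \forall o,\ \forall c \\
  & w_{o,c} \ge x_{o,c-1} - x_{o,c} && \forall o \in D,\ \forall c \\
  & x_{o,c},\ y_{o,c},\ w_{o,c},\ z_{o} \in \{0,1\}
\end{aligned}
\]
```

Only the size and movement constraints are sketched here; the lock and placement constraints listed on the slide would add further terms that the slides do not spell out.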

11 Target platform - MPARM
ARM7 SoC simulator
Bus: AMBA AHB
Multi-processor capable SoC
Configurable local scratchpad memory
Configurable local caches
RTEMS operating system
[Figure: ARM core with cache and SPM on an AMBA bus, with IRQ, private MEM and semaphores]
Energy model
  Transaction-level models from ST Microelectronics

12 Results / Evaluation
Benchmarks
  AUTO is derived from the MiBench test suite: BASICMATH, BITCOUNT, QSORT and SUSAN
  TELECOM consists of typical encoding tasks: CRC32, FFT, IFFT, ADPCM and GSM
  MEDIA- consists of AV processing applications: ADPCM, G723 and EDGE-DETECTION
  MEDIA+ is the same as MEDIA- plus an MPEG2 decoder
  SORT is a collection of sorting algorithms: BUBBLESORT, HEAPSORT, INSERTIONSORT, QUICKSORT, ...

13 Results
MEDIA+ Energy
  Baseline: main memory only
  Best: Static for 16k: 58%
  Overall best: Chunk: 49%
MEDIA+ Cycles
  Baseline: main memory only
  Best: Static for 16k: 65%
  Overall best: Chunk: 61%
Target architecture couples energy and runtime reduction

14 Results
Energy consumption compared to the optimal solution
Baseline: ILP-based solution
Average values over SPM sizes of 256 to 4k bytes
Overall average values included
Restoring strategies achieve good results: 6% deviation
First-fit fails
Prefer copying over computation
First-fit is not well suited for limited-size SPMs
Best approach: Chunk
[Chart axis: deviation from opt. solution]

15 Results
Comparison of SPMM to caches for SORT
Baseline: main memory only
SPMM peak energy reduction: 83% at 4k bytes scratchpad
Cache peak: 75% at 2k 2-way cache
SPMM capable of outperforming caches
OS and libraries are not considered yet
Chunk allocation results [table: SPM Size / way]: 74.81%, 65.35%, 64.39%, 65.64%, 63.73%

16 Conclusion / Further Work
SPM support integrated into the RTEMS operating system
Exploits possibility to annotate and transform input code
For the ARM architecture: copying is not so bad
Outperforms caches, while offering more control and better system-scope predictability
Develop further allocation strategies
Consider stack and local variables
SPMM-ize libraries and the OS kernel
