Potentials and Limitations for Energy Efficiency Auto-Tuning

Size: px

Start display at page:

Download "Potentials and Limitations for Energy Efficiency Auto-Tuning"

Julianna Washington
5 years ago
Views:

Autotuning for HPC (Architectures) Robert Schöne (robert.schoene@tu-dresden.

1 Center for Information Services and High Performance Computing (ZIH) Potentials and Limitations for Energy Efficiency Auto-Tuning Parco Symposium Application Autotuning for HPC (Architectures) Robert Schöne Andreas Knüpfer Daniel Molka

2 Outline Motivation Dynamic Voltage and Frequency Scaling Dynamic Concurrency Throttling x86_adapt Kernel Module and Prefetchers Adaption Library Evaluation Conclusion and Future Work Robert Schöne 2/24

3 Motivation Power consumption of HPC systems increases Models for energy efficient operation points available How do we use these models? Measure Performance Counters Sampling Instrumentation Performance analysis tools! Create profiles or traces Lack of the ability to change the hardware and software environment Energy efficiency optimization techniques have side effects Power optimization is worthless when runtime increases Robert Schöne 3/24

4 Performance Analysis Optimization Cycle Robert Schöne 4/24

5 Dynamic Voltage and Frequency Scaling DVFS Most common energy efficiency optimization P ~ V² * f Often applied when main memory bandwidth is bottleneck Assumptions: Reducing CPU voltage and frequency does not influence main memory performance Reducing CPU voltage and frequency reduces power consumption Frequency decisions can be written to sysfs, operating system changes hardware configuration libcpufreq provides interface to sysfs files (but libenopt is much nicer!) Robert Schöne 5/24

6 processor frequency Problems Frequency domains on processors Basic assumption is wrong Memory bandwidth depends on processor frequency Transition delay time between request and execution of frequency change In the order of microseconds time Robert Schöne 6/24

change In the order of microseconds Westmere EP Processor Sandy Bridge EP Processor Robert

an analysis of current x86_64 processors, in: Proceedings of the 2012 USENIX conference on

7 Problems Frequency domains on processors Basic assumption is wrong Memory bandwidth depends on processor frequency Transition delay time between request and execution of frequency change In the order of microseconds Westmere EP Processor Sandy Bridge EP Processor Robert Schöne, Daniel Hackenberg and Daniel Molka, Memory performance at reduced CPU clock speeds: an analysis of current x86_64 processors, in: Proceedings of the 2012 USENIX conference on Power-Aware Computing and Systems, Hollywood, CA, pages 9--9, USENIX Association, 2012 Robert Schöne 7/24

8 processor frequency Problems Frequency domains on processors Basic assumption is wrong Memory bandwidth depends on processor frequency Transition delay time between request and execution of frequency change In the order of tens of microseconds time Robert Schöne 8/24

9 processor frequency Problems Frequency domains on processors Basic assumption is wrong Memory bandwidth depends on processor frequency Transition delay time between request and execution of frequency change In the order of tens of microseconds time Robert Schöne 9/24

10 Dynamic Concurrency Throttling DCT Can be applied in thread parallel applications (OpenMP) If parallel regions do not scale with the number of threads, use less threads Software indicators: e.g. number of time spent in critical sections Hardware indicators: performance counters (false/true sharing, LLC-misses) Allows processor cores to go into idle states, which allows to clock gate or power gate the cores, saving energy Can also decrease runtime Robert Schöne 10/24

11 DCT Problems Caches are flushed Processor wakeup time can be high (depending on depth of idle state) OpenMP runtime defines behavior of idling threads Working sets are changed Example Thread active Parallel region ends: Threads go idle Caches are flushed Thread busy-wait New work: Wakeup cores/threads Reestablish cache set time Robert Schöne 11/24

12 DCT Problems Caches are flushed Processor wakeup time can be high (depending on depth of idle state) OpenMP runtime defines behavior of idling threads Working sets are changed Example Thread active Block time exceeded: Threads go idle Caches flushed Thread busy-wait New work: Wakeup threads Reestablish cache set time Robert Schöne 12/24

13 DCT Problems Caches are flushed Processor wakeup time can be high (depending on depth of idle state) OpenMP runtime defines behavior of idling threads Working sets are changed Example Thread active Thread busy-wait Change number of threads: Other working set for active threads Are unused threads idle? time Robert Schöne 13/24

14 Block Time Influence Block time prevents idle state Average power consumption 270 W Minimal power consumption 240 W Robert Schöne 14/24

15 Block Time Influence Reduced block time to 0 Average power consumption 225 W Minimal power consumption 170 W Robert Schöne 15/24

16 x86_adapt Kernel Module Provides interface for MSR and PCI based hardware configurations Establishes files per CPU/northbridge at /dev to change these configurations allows Linux file access permissions unlike msr kernel module it defines which parts of the MSRs can be accessed Provides additional files at /dev with the definition of available performance knobs E.g. prefetchers, loop predictors, Known knobs compiled to module, runtime checks which ones are available for current system Robert Schöne 16/24

Prefetcher Performance Evaluation Toggling overhead: 1115 cycles on SandyBridge EP Performance gain due to prefetchers depends on access stride /* on all cores do */ char * c =

17 Prefetcher Performance Evaluation Toggling overhead: 1115 cycles on SandyBridge EP Performance gain due to prefetchers depends on access stride /* on all cores do */ char * c = initialize(); for (i=0;i<n;i+=stride/16) /* load 16 bytes, starting at x[i] to XMM register */ movdqu(&(x[i])); Sandy Bridge EP DRAM-Bandwidth depending on Prefetcher Settings Robert Schöne 17/24

18 libadapt Uses configuration files which define adaptions Possible adaptions: Dynamic Voltage and Frequency Scaling Dynamic Concurrency Throttling x86_adapt Kernel Module Definitions for enter and exit of functions Can be linked to VampirTrace binary_0 : { name = "/home/user/my_app" ; function_0 : { # dct settings on the main thread # outside of a parallel region name = "my_function" ; # change number of threads before # entering the region dct_threads_before = 1; # reset it afterwards dct_threads_after = 0; }; function_1 : { # change non dct settings on all # threads name = "!$omp parallel@my_app.c:42" ; # change frequency before and after # entering the region dvfs_freq_before = ; dvfs_freq_after = ; # disabling prefetcher for this region adapt_amd_region_prefetch_disable_before =1; adapt_amd_region_prefetch_disable_after =0; }; }; Robert Schöne 18/24

19 Real World Evaluation Ivy Bridge HE system NAS Parallel Benchmarks (OpenMP) Except dc Sizes A, B, C OpenMP Wrapper library Run each benchmark at size B with instrumentation (wrapper+vt) Create performance profile, which contains runtime and performance metrics Based on the profile, create a configuration file with optimized DVFS settings for each parallel region Simple metric: Cache misses per second Run the benchmarks with different sizes and configurations Evaluate energy efficiency Robert Schöne 19/24

Evaluation Ivy Bridge HE DVFS Results Average energy saving: 6% Maximal energy saving: 23% With increasing problem size, overhead becomes less Sometimes significant increase of runtime Robert

20 Evaluation Ivy Bridge HE DVFS Results Average energy saving: 6% Maximal energy saving: 23% With increasing problem size, overhead becomes less Sometimes significant increase of runtime Robert Schöne and Daniel Molka, Integrating performance analysis and energy efficiency optimizations in a unified environment (2013), in: Computer Science - Research and Development(1-9) Robert Schöne 20/24

21 Integrated Optimization Cycle Robert Schöne 21/24

22 Conclusion and Future Work Conclusion: Each optimization knob is difficult in its own way libadapt: allows DCT, DVFS and changing of hardware settings Extended performance measurement tool to include libadapt Future work: Establish optimization cycle MPI parallel optimization Sampling integration Integration in Score-P Robert Schöne 22/24

23 Acknowledgement: Cool Silicon Spitzencluster Funded via BMBF 13N10186 Robert Schöne 23/24

24 Thank you Questions? Robert Schöne 24/24

Detecting Memory-Boundedness with Hardware Performance Counters

Detecting Memory-Boundedness with Hardware Performance Counters Center for Information Services and High Performance Computing (ZIH) Detecting ory-boundedness with Hardware Performance Counters ICPE, Apr 24th 2017 (daniel.molka@tu-dresden.de) Robert Schöne (robert.schoene@tu-dresden.de)