Static and Dynamic Frequency Scaling on Multicore CPUs

1 Static and Dynamic Frequency Scaling on Multicore CPUs
Wenlei Bao (1), Changwan Hong (1), Sudheer Chunduri (2), Sriram Krishnamoorthy (3), Louis-Noël Pouchet (4), Fabrice Rastello (5), P. Sadayappan (1)
(1) The Ohio State University, (2) Argonne National Laboratory, (3) Pacific Northwest National Laboratory, (4) Colorado State University, (5) INRIA
High Performance and Embedded Architecture and Compilation

2 Outline
1. Motivation & Overview
2. Hybrid Static/Dynamic Frequency Scaling
3. Experimental Evaluation

4 Motivation & Overview: Energy Efficiency
Energy efficiency is of great importance in settings ranging from battery-operated devices to data centers.
Dynamic Voltage and Frequency Scaling (DVFS):
- Well studied and widely adopted in both personal computing devices and large-scale clusters.
- A fundamental control mechanism that trades off performance against energy to improve energy efficiency.
- Energy savings are achieved through voltage and frequency scaling.

5 Motivation & Overview: Typical DVFS approaches
Limitations observed among these DVFS approaches:
- The frequency that optimizes CPU energy varies significantly across processors. Different processors have different optimal frequencies even for the same compute-bound application.
- Optimizing energy for a parallel application depends heavily on its parallel scaling, which in turn depends on the operating frequency. This is mostly ignored in previous related work.
- Dynamic schemes (e.g. the Linux on-demand governor) remain constrained to runtime inspection at specific time intervals. For short-running program phases they bring no benefit compared with using the best frequency from the start.

6 Motivation & Overview: Our Proposed Approach
A compile-time approach to select the best frequency for affine program regions:
1. Static analysis to approximate the operational intensity and parallel scaling.
2. Categorize each program region, e.g. as compute-bound or bandwidth-bound.
3. Select a frequency/core pair for each program category from one-time energy (E) / energy-delay product (EDP) profiling through microbenchmarking.
A lightweight runtime approach that throttles frequency based on application energy efficiency:
- Used as a baseline to compare and validate against our compile-time approach.
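Step 3 of the compile-time approach can be pictured as a lookup over a one-time profile. The following is a minimal sketch, assuming the profile is available as a mapping from (frequency, cores) to measured (energy, time); the `Profile` type and the `best_config` name are hypothetical and not taken from the slides.

```python
from typing import Dict, Tuple

# Hypothetical profile layout:
# (frequency in GHz, number of cores) -> (energy in joules, time in seconds)
Profile = Dict[Tuple[float, int], Tuple[float, float]]


def best_config(profile: Profile, metric: str = "E") -> Tuple[float, int]:
    """Return the (frequency, cores) pair that minimizes the chosen metric.

    metric "E" minimizes energy; metric "EDP" minimizes energy * time.
    """
    if metric not in ("E", "EDP"):
        raise ValueError("metric must be 'E' or 'EDP'")

    def cost(item):
        _, (energy, time) = item
        return energy if metric == "E" else energy * time

    config, _ = min(profile.items(), key=cost)
    return config
```

Running this once per program category on microbenchmark measurements would yield one frequency/core pair per category, which is all the compile-time selection needs at that point.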

7 Motivation & Overview: Overview of the flow
Diagram: Input program (source code) -> Analysis (extract features) -> Categorization (sequential, poor scalability, compute-bound, bandwidth-bound) -> Decision tree -> Frequency/core

9 Hybrid Static/Dynamic Frequency Scaling: Illustrative Example
CPU-only energy on 4 Intel CPUs for DGEMM/MKL
Figure: Energy consumption of DGEMM/MKL on 4 CPUs (one panel per CPU; energy in J versus frequency in GHz, one curve per core count).
4 Intel x86 CPUs: SandyBridge, IvyBridge, Haswell.
Representative benchmarks:
- DGEMM/MKL: compute-bound out-of-core computation.
- Jacobi-2D: bandwidth-bound computation.
Table: Processor characteristics
Intel CPU       Microarch.   Node  Cores  L1   L2    L3      Freq. Range  Voltage Range
Core i7-2600K   SandyBridge  32nm  4      32K  256K  8192K   GHz          V
Xeon E          IvyBridge    22nm  4      32K  256K  8192K   GHz          V
Xeon E5-2650    IvyBridge    22nm  8      32K  256K  20480K  GHz          V
Core i7-4770K   Haswell      22nm  4      32K  256K  8192K   GHz          V

10 Hybrid Static/Dynamic Frequency Scaling: Illustrative Example
CPU-only energy on 4 Intel CPUs for Jacobi-2D
Figure: Energy consumption of Jacobi-2D on 4 CPUs (one panel per CPU; energy in J versus frequency in GHz, one curve per core count).
1. Effect of Execution Time
- The optimal frequency shifts towards lower frequencies compared to the compute-bound case.
- Execution time decreases more slowly than frequency increases: the expected acceleration is not achieved.
2. Effect of Weak Scaling
- Weak scaling: adding cores does not decrease execution time linearly (bandwidth saturation).
- This leads to a higher energy cost for 4/8 cores than for a single core when increasing frequency.

11 Hybrid Static/Dynamic Frequency Scaling: Illustrative Example
Summary of Motivations
- Optimal CPU energy is at neither the minimum nor the maximum frequency in most cases.
- Optimal CPU energy is achieved at different frequencies depending on the number of cores used.
- Effect of execution time: lowers the best frequency compared to the compute-bound case.
- Effect of weak scaling: higher energy cost for multiple cores than for a single core.
Objectives
- Goal #1: Characterize programs into different categories.
- Goal #2: Analysis fast enough for large programs, e.g. 1s for 1 lines. Trade off result accuracy for analysis time.
- Goal #3: Select the best frequency/core pair to optimize CPU energy for affine kernels.
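Why the energy optimum can sit between the minimum and maximum frequency, and why it moves lower for bandwidth-bound code, can be illustrated with a toy model. This is only a sketch under strong assumptions that are not from the slides: power is modeled as a static term plus a dynamic term growing with the cube of frequency, and execution time as a frequency-dependent compute part plus a frequency-independent memory part. All names and parameters (`energy`, `best_frequency`, `p_static_w`, `k_dyn`) are hypothetical.

```python
def energy(freq_ghz, compute_gcycles, mem_seconds, p_static_w, k_dyn):
    """Toy CPU energy model (joules) at a given frequency.

    Assumptions (illustrative only):
    - time  = compute_gcycles / freq_ghz + mem_seconds
    - power = p_static_w + k_dyn * freq_ghz ** 3
    """
    time_s = compute_gcycles / freq_ghz + mem_seconds
    power_w = p_static_w + k_dyn * freq_ghz ** 3
    return power_w * time_s


def best_frequency(freqs, **model):
    """Return the frequency in freqs with the lowest modeled energy."""
    return min(freqs, key=lambda f: energy(f, **model))
```

In this model, running slower lowers power but stretches the time during which static power is paid, so a compute-bound region tends to have an interior optimum. Adding a memory term that does not shrink with frequency makes higher frequencies less rewarding and pushes the optimum down.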

13 Hybrid Static/Dynamic Frequency Scaling: Extracting Features
Overview of the process:
1. Extract the polyhedral representation of the program (including OpenMP doall).
2. Inline parameter values (many of our analyses are not parametric).
3. Count the number of operations in each loop / region of interest.
4. Compute the data space (in cache lines) accessed by each loop / region.
5. Run various algorithms to estimate cache misses, thread workload, etc.
Core features currently extracted:
- EXACT: number of FLOPs
- EXACT: data footprint (read/written)
- APPROX: data cache misses (at each level, incl. shared/private)
- APPROX: OpenMP thread workload
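The "data space in cache lines" feature can be illustrated on a small rectangular loop nest. The slides describe a polyhedral computation; the sketch below swaps in plain enumeration of the iteration points, which gives the same count only for small, fully known bounds. It assumes row-major two-dimensional arrays that all share one row length and start on a cache-line boundary. The function name and argument layout are hypothetical.

```python
from itertools import product


def footprint_in_lines(bounds, accesses, row_len, elems_per_line):
    """Count distinct cache lines touched by array references in a loop nest.

    bounds:   list of (lo, hi) per loop, hi exclusive
    accesses: list of (array_name, index_fn); index_fn maps an iteration
              point (a tuple of loop counters) to a (row, col) pair
    row_len:  number of elements per array row (row-major layout assumed)
    elems_per_line: number of array elements per cache line
    """
    lines = set()
    for point in product(*(range(lo, hi) for lo, hi in bounds)):
        for name, index_fn in accesses:
            row, col = index_fn(point)
            linear = row * row_len + col  # element offset within the array
            lines.add((name, linear // elems_per_line))
    return len(lines)
```

For example, a reference `A[i][j]` over a 4 x 16 nest with 8 elements per line touches 64 contiguous elements, while a column walk `A[i][0]` touches one line per row.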

14 Hybrid Static/Dynamic Frequency Scaling: Approximating Cache Misses
Metric used to estimate Operational Intensity (OI)
- OI: ratio of operations executed per byte of data transferred to/from memory.
- OI is sufficient to categorize a program region as compute-bound or bandwidth-bound.
Counting cache misses
1. Slice the region of interest (e.g. one OpenMP thread), inline parameters.
2. Specify cache properties: shared/private, size in kB, cache line size in B.
3. Compute the data space accessed by each loop, in terms of distinct cache lines (DL).
4. Ad-hoc algorithm to estimate misses at each loop level, based on DL.
Some observations:
- Massive problem simplification: associativity is not considered, only capacity/compulsory misses.
- Data prefetching (next-line), inter-iteration data reuse, and SIMD are all taken into account.

15 Hybrid Static/Dynamic Frequency Scaling: Approximating Cache Misses
Algorithm 1: EstimateCacheMisses
Input: number of array elements per cache line: ls; cache size, in lines: C; polyhedral program: P
Output: estimate of misses
  for all loops l do
    Misses[l] = DLbase[l] = DLnext[l] = Processed[l] = 0
  end for
  for all arrays A do                          <= Compute DL for each loop
    for all loops l do
      DLbase[l] += #(DL_A) PS_l,0
      DLnext[l] += #(DL_A \ PS_l,0 DL_A) PS_l,1
    end for
  end for
  for all loops l in postfix AST order and Processed[l] == 0 do    <= Recursively traverse the program, bottom-up
    if DLbase[l] < C then
      Misses[l] = DLbase[l]
    else
      if l is inner-most loop then
        Misses[l] = DLbase[l]
      else
        Miss = 0
        for all loops ll immediately surrounded by l do
          Miss += Misses[ll]
          Processed[ll] = 1
        end for
        Misses[l] = Miss * TripCount(l)
        if DLnext[l] < C then                  <= Handle inter-iteration reuse (experimental)
          Misses[l] -= DLnext[l]
        end if
      end if
    end if
    Processed[l] = 1
  end for
  return Misses[fk]
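The bottom-up traversal of Algorithm 1 can be written as a short recursion over a loop tree. This is a minimal sketch, assuming the per-loop quantities DLbase and DLnext have already been computed and are simply stored on each node; the `Loop` class and its field names are hypothetical, and the recursion replaces the explicit postfix ordering and Processed flags of the pseudocode.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Loop:
    dl_base: int      # DLbase[l]: distinct cache lines attributed to the loop
    dl_next: int      # DLnext[l]: lines the pseudocode credits to inter-iteration reuse
    trip_count: int = 1
    children: List["Loop"] = field(default_factory=list)  # immediately nested loops


def estimate_misses(loop: Loop, cache_lines: int) -> int:
    """Estimate misses for a loop, visiting inner loops first (bottom-up)."""
    # Footprint fits in cache, or nothing to recurse into: count each line once.
    if loop.dl_base < cache_lines or not loop.children:
        return loop.dl_base
    # Otherwise the inner loops' misses are paid on every iteration of this loop.
    inner = sum(estimate_misses(child, cache_lines) for child in loop.children)
    misses = inner * loop.trip_count
    # Inter-iteration reuse: subtract DLnext when it fits in cache.
    if loop.dl_next < cache_lines:
        misses -= loop.dl_next
    return misses
```

As in the slides, only capacity and compulsory effects are modeled here; associativity is ignored.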

16 Hybrid Static/Dynamic Frequency Scaling: Approximating FLOPs and Computing OI
Counting FLOPs
Collect the number of operations that are not part of an array subscript, and multiply it by the number of points in the iteration domain of the statement.
Require: cache line size, in elements: ls; cache size, in lines: C; polyhedral program: P; parameter values: n
Ensure: estimate of OI
  P = attachContextInformation(n, P)
  Misses = EstimateCacheMisses(ls, C, P)
  Flops = countFlops(P)
  return Flops / (Misses * ls)
Parallelism Features
- Sequential/parallel: check whether any OpenMP doall pragma exists.
- Poor scaling: performance improvement does not scale with the number of cores. Detected by counting the number of active/inactive threads.
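The FLOP count and the final ratio are simple enough to spell out. The sketch below assumes each statement is summarized by two numbers, its arithmetic operations per instance (excluding subscript arithmetic) and the number of points in its iteration domain, and that a miss estimate is already available. The function names are hypothetical stand-ins for the routines named on the slide.

```python
def count_flops(statements):
    """Total FLOPs for a region.

    statements: iterable of (flops_per_instance, iteration_points) pairs.
    """
    return sum(ops * points for ops, points in statements)


def operational_intensity(flops, misses, elems_per_line):
    """Operations per array element transferred: Flops / (Misses * ls)."""
    if misses <= 0 or elems_per_line <= 0:
        raise ValueError("misses and elems_per_line must be positive")
    return flops / (misses * elems_per_line)
```

For a matrix-multiply statement with one multiply and one add per instance and N = 100, the iteration domain has 100 ** 3 points, so `count_flops([(2, 100 ** 3)])` counts two operations per point; the miss count passed to `operational_intensity` would come from the cache-miss estimate.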

17 Hybrid Static/Dynamic Frequency Scaling: Decision Tree to Select Frequency/Core Pair
The frequency/core pair assigned to each category is obtained through program profiling.
Decision Tree
1. If the code is sequential, choose config_sequential;
2. else if the code is bandwidth-bound (OI < threshold), choose config_bwbound;
3. else if the code has poor scaling, choose config_poorscale;
4. else choose config_computebound.
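The decision tree maps directly onto a chain of conditionals. The sketch below assumes the extracted features are bundled in a small record and that the four profiled configurations are stored in a dictionary; the `RegionFeatures` class, the dictionary keys, and the threshold argument are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RegionFeatures:
    is_parallel: bool             # any OpenMP doall loop in the region
    operational_intensity: float  # estimated operational intensity
    poor_scaling: bool            # speedup does not follow the core count


def select_config(features, configs, oi_threshold):
    """Pick the profiled frequency/core configuration for a region.

    configs: dict with keys "sequential", "bwbound", "poorscale", "computebound".
    The order of the tests matches the decision tree on the slide.
    """
    if not features.is_parallel:
        return configs["sequential"]
    if features.operational_intensity < oi_threshold:
        return configs["bwbound"]
    if features.poor_scaling:
        return configs["poorscale"]
    return configs["computebound"]
```

The order matters: a parallel region that is both bandwidth-bound and poorly scaling is treated as bandwidth-bound, because that test comes first.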

19 Experimental Evaluation: Experimental Protocol
Evaluation of the compile-time approach
- PolyBench/C benchmarks optimized by PoCC (60 in total):
  - PolyBench-Parallel: optimized with auto-parallelization.
  - PolyBench-Poly: optimized for data locality (tiling, coarse/fine-grain parallelism).
- Compile-time approach implemented as PolyFeat in PoCC.
- Collecting energy/time: PCM for Intel CPUs, IBM AMESTER for POWER8.
Table: Number of benchmarks in each category
Benchmarks          seq/par  bw-bound  poor scale  comp-bound
polybench-parallel  12/
polybench-poly      5/

20 Experimental Evaluation: Experimental Results
Summary of CPU-only energy and EDP improvements
Figure: Savings in % relative to Powersave for On-Demand, CPUmiser, OurStatic, and Best, reported for energy (E) and EDP on SandyBridge, IvyBridge4, IvyBridge8, Haswell, and POWER8.
Results:
- On-demand: energy efficiency improved for IvyBridge and POWER8, where the maximum frequency leads to minimum energy for compute-bound codes.
- CPUMiser: a runtime DVFS approach based on the CPI of the workload; it optimizes energy while keeping the performance loss under a certain threshold.
- Our static approach consistently outperforms the other schemes and is close to the best achievable.

21 Experimental Evaluation: Experimental Results
Comparison with Runtime DVFS
Figure: Energy savings in % on SandyBridge, IvyBridge4, IvyBridge8, and Haswell for Our Static, Our Dynamic, CPUmiser1%, CPUmiser5%, CPUmiser1%, Ondemand95%, Ondemand85%, Ondemand75%, and Ondemand65%.
Our dynamic approach:
- Inspects changes in energy efficiency (energy normalized by instructions).
- Decides to adapt the frequency when:
  - energy efficiency decreased compared to the previous interval (processor-specific);
  - OI changed, indicating that the computation performed has varied (application-specific).
- For intervals with similar OI, applies the energy convexity rule (gradient descent).
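One way to read the dynamic scheme is as a hill-climbing controller over the discrete frequency steps. The sketch below is only an interpretation under assumptions not stated on the slides: it is called once per sampling interval with the measured energy per instruction and an operational-intensity estimate, it keeps moving in one direction while energy per instruction improves, reverses when it gets worse, and forgets its history when the operational intensity changes by more than a relative tolerance. The class name, the tolerance, and the step rule are hypothetical.

```python
class FrequencyController:
    """Hill-climbing over a sorted list of available frequencies (illustrative)."""

    def __init__(self, freqs, start_index=0, oi_tolerance=0.2):
        self.freqs = list(freqs)
        self.index = start_index
        self.direction = 1          # +1: try higher frequency, -1: try lower
        self.prev_epi = None        # energy per instruction of the last interval
        self.prev_oi = None
        self.oi_tolerance = oi_tolerance

    @property
    def frequency(self):
        return self.freqs[self.index]

    def step(self, epi, oi):
        """Record one interval measured at the current frequency; return the next one."""
        oi_changed = (
            self.prev_oi is not None
            and abs(oi - self.prev_oi) > self.oi_tolerance * max(abs(self.prev_oi), 1e-12)
        )
        if oi_changed:
            pass  # new phase: the previous measurement is not comparable
        elif self.prev_epi is not None and epi > self.prev_epi:
            self.direction = -self.direction  # efficiency got worse: turn around
        self.prev_epi, self.prev_oi = epi, oi

        nxt = self.index + self.direction
        if nxt < 0 or nxt >= len(self.freqs):
            self.direction = -self.direction  # bounce off the frequency range limits
            nxt = self.index + self.direction
        self.index = max(0, min(nxt, len(self.freqs) - 1))
        return self.frequency
```

With a convex energy-per-instruction curve, a controller of this kind settles into stepping back and forth around the minimum rather than converging to a single frequency.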

22 Experimental Evaluation: Phase Analysis
Power traces for single-phase programs
Figure: Power (Watt) and frequency (GHz) traces for jacobi-2d-imper (par) and mvt (poly).
- Power consumption is stable, while frequency oscillates between two points, indicating that the optimal frequency lies between these points.
- 46 out of 60 benchmarks have a single phase.

23 Experimental Evaluation: Phase Analysis
Power traces for multiple-phase programs
Figure: Power (Watt) and frequency (GHz) traces for covariance (par) and covariance (poly).
- Phases show up as a stable change of frequency for part of the execution.
- Phases are not identical, because of data-locality optimization.
- 14 out of 60 benchmarks show 2 or more (up to 5) phases.

24 Experimental Evaluation: Conclusion
Take-home message: compile-time frequency and core count selection for affine programs
- New static analysis for program region categorization
- Fully implemented in PolyFeat / PoCC
- One-time machine profiling with representative benchmarks
- Significant energy savings over the powersave Linux governor
Key aspect: analysis speed versus accuracy
- Trade off precision when approximating operational intensity
- Analysis completes in < 1 second for regions of 1+ lines
Future work: automatic region detection
- Regions can be automatically detected with our algorithm
- Allows more precise frequency selection
- Granularity driven by the cost of a frequency change

26 Thank you
