A Probabilistic Graphical Model-based Approach for Minimizing Energy under Performance Constraints

1 A Probabilistic Graphical Model-based Approach for Minimizing Energy under Performance Constraints. Nikita Mishra, Huazhe Zhang, John Lafferty and Hank Hoffmann, University of Chicago

2 Figure: average CPU utilization of more than 5,000 servers during a 6-month period [1] (axes: fraction of time vs. CPU utilization). [1] Barroso, Luiz André, and Urs Hölzle. "The case for energy-proportional computing." IEEE Computer (2007).

3 Example of a configuration space. Diagram labels: clock speed (2.26 GHz), cores, Memory Controller 1, Memory Controller 2, Memory Controller 3.

5 Adaptive systems: automatically tune configurations for different utilizations to achieve the most energy-efficient state. This requires the power and performance profile of the application.

9 Why is it a difficult problem? The configuration space can be quite large, so a brute-force search may take a long time. Each application behaves differently on different machines. Application behavior can even vary with the input (e.g. the video streaming application x264).

12 Example: streamcluster. Figure: contour plot of performance rate (in iter/s) for the streamcluster benchmark at different configurations (axes: clock speed vs. cores). The plot shows multiple local solutions.

13 Example: kmeans. Figure: Pareto frontier of performance rate (in iter/s) vs. system power (in Watts) at different configurations, with the optimal configuration frontier marked.
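
A frontier of this kind can be computed from a list of measured configurations. The sketch below is one simple way to do it, assuming each configuration is a (performance, power) pair; the function name `pareto_frontier` and the sample values are hypothetical and are not taken from the kmeans experiment.

```python
from typing import List, Tuple

# Each configuration is (performance in iter/s, system power in Watts).
Config = Tuple[float, float]


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates.

    A configuration is dominated if another one delivers at least as much
    performance for no more power.
    """
    # Sort by power ascending; break ties by performance descending.
    ordered = sorted(configs, key=lambda c: (c[1], -c[0]))
    frontier: List[Config] = []
    best_perf = float("-inf")
    for perf, power in ordered:
        # Keep a point only if it outperforms every cheaper point seen so far.
        if perf > best_perf:
            frontier.append((perf, power))
            best_perf = perf
    return frontier


# Hypothetical measurements for illustration only.
samples = [(10.0, 100.0), (12.0, 120.0), (11.0, 130.0), (20.0, 150.0), (20.0, 160.0)]
frontier = pareto_frontier(samples)
```

Sorting by power first means a single pass is enough: any point that does not raise the best performance seen so far is dominated by a cheaper one.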

15 LEO (Learning for Energy Optimization). Inputs: historical data and the target application. LEO incorporates the performance profiles of previously seen applications.

16 Example: kmeans. Figure: performance rate (in iter/s) vs. configuration index, comparing the estimated Pareto-optimal frontiers with the true frontier found by exhaustive search.

22 Outline: Motivation/Overview; Statistical modelling (Graphical models, Hierarchical Bayesian model, Expectation-maximization algorithm); Evaluation; Summary.

25 Graphical Models. Diagram: nodes z_1, z_2, ..., z_{m-1}, z_m, each connected to a node y_1, y_2, ..., y_{m-1}, y_m. y_i: vector of performance rates of the i-th application for different configurations.

31 Hierarchical Bayesian Model. Diagram: hidden nodes z_1, z_2, ..., z_{m-1}, z_m above the nodes y_1, y_2, ..., y_{m-1}, y_m. Annotations on the hidden layer: couples each of the applications; penalizes large variations in the application; true value of the target application. Annotations on the y layer: all applications (observed data); target application (partially observed data). y_i: vector of performance rates of the i-th application for different configurations.
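
One common way to write a hierarchical model with this structure is sketched below. This is an assumption for illustration, not a transcription of the slide: the Gaussian forms and the symbols $\mu$, $\Sigma$ and $\sigma^2$ do not appear in the text above.

```latex
\begin{align}
  y_i \mid z_i &\sim \mathcal{N}\!\left(z_i,\ \sigma^2 I\right), & i &= 1, \dots, m \\
  z_i \mid \mu, \Sigma &\sim \mathcal{N}\!\left(\mu,\ \Sigma\right), & i &= 1, \dots, m
\end{align}
```

Under this reading, the shared $\mu$ and $\Sigma$ would be what couples the applications, and a prior on $\Sigma$ could play the role of penalizing large variations.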

36 Expectation Maximization Algorithm. Initialize the model parameters θ. E-step: using the observed data and the current parameters, estimate the latent variables and create the expected log-likelihood function. M-step: maximize the expected log-likelihood function to obtain θnew. Set θ = θnew and repeat.
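
The loop above can be sketched for partially observed profiles as follows. This is a minimal illustration, assuming a Gaussian model in which each latent profile z_i is drawn from N(mu, Sigma) and each measurement is z_i plus noise of known variance `sigma2`. The model form, the function name `em_estimate`, and all values are assumptions, not details taken from the slides.

```python
import numpy as np


def em_estimate(Y, mask, sigma2=0.01, n_iter=30):
    """Fill in unobserved entries of application profiles with EM.

    Y      : (m, n) array; row i is application i, column j is configuration j.
    mask   : (m, n) boolean array; True where Y[i, j] was actually measured.
             Every row is assumed to have at least one measured entry.
    sigma2 : assumed (known) measurement-noise variance.
    Returns the (m, n) posterior means of the latent profiles z_i.
    """
    m, n = Y.shape
    # Initialize the parameters: mu from the measured entries, Sigma = identity.
    counts = np.maximum(mask.sum(axis=0), 1)
    mu = np.where(mask, Y, 0.0).sum(axis=0) / counts
    Sigma = np.eye(n)
    Z = np.tile(mu, (m, 1))

    for _ in range(n_iter):
        # E-step: posterior mean and covariance of each latent z_i.
        second_moment = np.zeros((n, n))
        for i in range(m):
            obs = np.flatnonzero(mask[i])
            A = Sigma[:, obs]
            S = Sigma[np.ix_(obs, obs)] + sigma2 * np.eye(len(obs))
            gain = np.linalg.solve(S, A.T).T  # equals A @ inv(S); S is symmetric
            Z[i] = mu + gain @ (Y[i, obs] - mu[obs])
            cov_i = Sigma - gain @ A.T
            second_moment += cov_i + np.outer(Z[i], Z[i])
        # M-step: parameters that maximize the expected log-likelihood.
        mu = Z.mean(axis=0)
        Sigma = second_moment / m - np.outer(mu, mu)
    return Z


# Hypothetical data: four fully measured applications and a target (last row)
# measured at only two of six configurations.
base = np.arange(1.0, 7.0)
Y_true = np.outer([1.0, 2.0, 3.0, 4.0, 3.5], base)
mask = np.ones(Y_true.shape, dtype=bool)
mask[-1, 1:5] = False
Z = em_estimate(np.where(mask, Y_true, 0.0), mask)
```

The last row of `Z` is the estimate for the target application. The full covariance `Sigma` is what lets two measurements of the target inform its unmeasured configurations.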

43 Example: kmeans. Figure sequence: different iterations of the EM algorithm for estimating performance rate (in iter/s) vs. cores, showing the initialization with the observed samples, followed by EM iterations 1 through 4.

46 LEO (Learning for Energy Optimization). Power: set y_m = observed power, run LEO, get p = estimated power. Performance: set y_m = observed performance, run LEO, get r = estimated performance. The controller uses p and r to select the configuration. Feedback!
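
A minimal sketch of the controller's selection step is shown below, assuming it picks a single configuration that finishes a fixed amount of work before a deadline and then idles. The function `select_configuration`, the deadline formulation, and the numbers are hypothetical simplifications; the slides do not specify how the controller chooses.

```python
from typing import Optional, Sequence


def select_configuration(
    est_perf: Sequence[float],   # r: estimated rate per configuration (iter/s)
    est_power: Sequence[float],  # p: estimated power per configuration (W)
    work: float,                 # iterations that must complete
    deadline: float,             # seconds available
    idle_power: float,           # power drawn (W) once the work is done
) -> Optional[int]:
    """Return the index of the lowest-energy configuration meeting the deadline."""
    best_index, best_energy = None, float("inf")
    for i, (r, p) in enumerate(zip(est_perf, est_power)):
        if r <= 0 or work / r > deadline:
            continue  # this configuration is too slow
        busy = work / r
        # Energy while running plus energy while idling until the deadline.
        energy = p * busy + idle_power * (deadline - busy)
        if energy < best_energy:
            best_index, best_energy = i, energy
    return best_index


# Hypothetical estimates for three configurations.
r = [5.0, 10.0, 20.0]      # estimated performance (iter/s)
p = [80.0, 110.0, 200.0]   # estimated power (W)
choice = select_configuration(r, p, work=100.0, deadline=15.0, idle_power=60.0)
```

In a feedback loop, the measured power and performance of the chosen configuration would be written back into y_m before the next estimate.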

49 Outline: Motivation/Overview; Statistical modelling; Evaluation (Experimental setup; Power and performance estimation; Energy savings/Phase transition); Summary. Experimental setup: dual-socket Linux system with a SuperMICRO X9DRL-iF motherboard and two Intel Xeon E processors.

56 Experimental Setup
Configurations (1024 configurations):
Clock speed: set using the cpufrequtils package. 15 DVFS settings (from 1.2 to 2.9 GHz) + TurboBoost = 16 settings.
Memory controller: numactl library to control the access. 2 memory controllers = 2 settings.
Cores: two 8-core processors with hyper-threading = 32 settings.
Measurements:
Power: WattsUp meter provides total system power at 1 s intervals.
Performance: applications report the heart rate, which is application specific.
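
The configuration count follows from multiplying the three knobs: 16 x 2 x 32 = 1024. A short sketch that enumerates the space, using placeholder indices because the individual frequency values are not listed here:

```python
from itertools import product

# 15 DVFS settings plus TurboBoost = 16 clock-speed settings.
# Indices stand in for the actual frequencies, which are not listed above.
clock_settings = range(16)
memory_controller_settings = range(1, 3)  # 1 or 2 memory controllers
core_settings = range(1, 33)              # 2 sockets x 8 cores x 2 hyper-threads

configurations = list(
    product(clock_settings, memory_controller_settings, core_settings)
)
print(len(configurations))  # 1024
```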

62 Experimental Setup
Benchmarks: we use 25 benchmarks from 3 different suites (PARSEC, Minebench, Rodinia) and some others.
Baseline heuristics:
Online algorithm: polynomial multivariate regression over configuration values on the observed dataset.
Offline algorithm: averages over the rest of the applications to estimate the power and performance of the given application.
Race-to-idle: allocates all resources to the application; once it finishes, the system goes idle.
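
Two of these baselines are simple enough to sketch directly. The code below is an illustrative reading of the descriptions above, not the authors' implementation; the function names and the example numbers are hypothetical.

```python
from typing import List


def offline_estimate(other_profiles: List[List[float]]) -> List[float]:
    """Offline baseline: per-configuration mean over the other applications."""
    n_apps = len(other_profiles)
    return [sum(column) / n_apps for column in zip(*other_profiles)]


def race_to_idle_energy(max_power: float, max_rate: float, work: float,
                        deadline: float, idle_power: float) -> float:
    """Race-to-idle: run with all resources, then idle until the deadline."""
    busy = work / max_rate
    return max_power * busy + idle_power * (deadline - busy)


# Hypothetical profiles of two other applications over three configurations.
estimate = offline_estimate([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])

# Hypothetical machine: 200 W at 20 iter/s flat out, 60 W when idle.
energy = race_to_idle_energy(200.0, 20.0, work=100.0, deadline=20.0, idle_power=60.0)
```

The offline estimate ignores the target application entirely, which is why it cannot adapt to an application that behaves unlike the training set.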

64 Power and performance estimation. Figure panels: performance rate (in iter/s) vs. configuration index; system power (in Watts) vs. configuration index.

65 Power and performance estimation. Figure panels: Swish (search web server) and x264 (video encoder).

69 Accuracy summary. Figures: performance-estimation accuracy of LEO, Online and Offline for Kmeans, for Jacobi, and overall; system-power estimation accuracy of LEO, Online and Offline overall.

71 Summary: Energy savings. Average energy relative to the optimal (over different utilizations and all the benchmarks): LEO +6%, Online +24%, Offline +29%, Race-to-idle +90%.

73 Phase transitions. Figure: performance and power for fluidanimate across phases with different computational demands.

74 Multiple Applications. Comparison of performance estimation (in iter/s) and system-power estimation (in Watts) for different algorithms over a set of application mixtures. Table layout: rows LEO, Online, Offline; columns Mixture 1, Mixture 2, Overall for performance (in iter/s) and again for system power (in Watts).

75 Summary

76 Sensitivity analysis of LEO vs Online. LEO quickly reaches near optimality, whereas our baseline method (online regression) cannot run with fewer than 15 samples because the design matrix of the regression model would be rank deficient.
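
The rank argument can be checked numerically. The sketch below assumes a regression with 15 coefficients, built here as all monomials up to degree 2 in four variables. That feature set is chosen only because it yields 15 columns; the actual features of the online baseline are not given above.

```python
from itertools import combinations_with_replacement

import numpy as np


def polynomial_design_matrix(X: np.ndarray, degree: int) -> np.ndarray:
    """Columns are all monomials of the input variables up to `degree`."""
    n_samples, n_vars = X.shape
    columns = [np.ones(n_samples)]  # intercept term
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n_vars), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


rng = np.random.default_rng(0)
ranks = {}
for n_samples in (10, 30):
    # Hypothetical sampled configurations with four numeric knobs.
    X = rng.uniform(1.0, 2.0, size=(n_samples, 4))
    D = polynomial_design_matrix(X, degree=2)
    ranks[n_samples] = (D.shape[1], int(np.linalg.matrix_rank(D)))
```

With 10 samples the matrix has 15 columns but rank at most 10, so the least-squares coefficients are not uniquely determined. With enough distinct samples the rank reaches the number of columns.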

77 Related Work. Offline optimization techniques (e.g., [59, 35, 33, 10, 2]) are limited by their reliance on a robust training phase. Online optimization techniques [44]: for example, Flicker is a configurable architecture and optimization framework that uses only online models to maximize performance under a power limitation, and ParallelismDial uses online adaptation to tailor parallelism to the application workload.
