Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer

1 Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer Nikolas Ioannou, Michael Kauschke, Matthias Gries, and Marcelo Cintra University of Edinburgh Intel Labs Braunschweig

2 Introduction: Power Management
Power and energy are now first-order constraints, and recent CPUs allow voltage/frequency scaling to minimize energy or power within some performance constraint.
Application-driven DVFS: performance during memory-bound periods is largely unaffected by frequency, so reducing frequency and voltage saves energy/power. These periods are often recurrent, and can therefore be learned and predicted.

3 Introduction: DVFS on Many-Cores
Voltage and frequency control is not independent for each core; it is exercised at the granularity of domains. Application-level monitoring and control of the settings is possible.
The challenge: power management on many-cores that exploits application behavior to minimize energy consumption within a performance window.
Target platform: the SCC (an experimental concept vehicle from Intel Labs) running MPI applications.

4-9 Proposed Scheme Overview
A modular, hierarchical, transparent, dynamic software power management scheme for a many-core system.
Front-end (Phase Predictor): captures application behavior without user intervention. A phase search identifies a macro phase, which is partitioned into subphases and phases.
Back-end (Power Manager): tracks performance vs. energy behavior and adapts. It sets up a frequency/time (f/t) table per phase, drives DVFS control, and updates the table. Local managers issue requests to a domain manager, which drives the control system.
Feedback: phase mispredictions, phase state, and application performance flow from the back-end to the front-end.
(Block diagram, built up over several slides.)

10 Contributions
Novel power management scheme for many-core systems: hierarchical, capable of operating on domain-based systems.
Novel phase prediction scheme (SMRP, SuperMaximal Repeat phase Predictor), based on supermaximal repeat strings; better accuracy than previous approaches.
Transparent instrumentation of MPI applications.
Schemes implemented and evaluated on a real experimental many-core system.
Significant energy savings with little performance degradation: 15% energy savings on average with only 7% performance degradation, well within the 3:1 ratio of power savings to performance degradation.

11 Outline: Introduction, Power Manager, Phase Predictor, Results, Conclusions

12 Power Manager: Local Manager
Input: phases of repeatable behavior. For each repetitive phase a frequency/time table is built.
Iterative approach:
1. Start at the highest frequency.
2. Measure the execution time of the current instance and record it in the table.
3. Reduce the frequency by one step.
4. Stop if the performance impact is higher than a threshold δ (e.g., 10%).
5. Otherwise, repeat until the lowest frequency is reached.
Output: frequency requested per core. (A sketch of this loop follows below.)
(Figure: tiles and cores of a domain, each core's local manager holding per-subphase f/t tables and issuing frequency requests fr0..fr7 after local frequency analysis.)
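
A minimal Python sketch of this exploration loop, under assumptions: the frequency ladder, the value of δ, and the helper structure are illustrative and do not reflect the actual implementation or the SCC interface.

```python
# Illustrative sketch of the local manager's per-subphase frequency exploration.
FREQ_STEPS_MHZ = [800, 533, 400, 320, 266]  # assumed ladder, highest to lowest
DELTA = 0.10                                # allowed performance degradation


class SubphaseTable:
    """Frequency/time (f/t) table for one repetitive sub-phase."""

    def __init__(self):
        self.times = {}         # frequency (MHz) -> measured execution time (s)
        self.chosen = None      # steady-state frequency once exploration stops

    def next_frequency(self):
        """Frequency to request for the next instance of this sub-phase."""
        if self.chosen is not None:
            return self.chosen                       # exploration finished
        tried = [f for f in FREQ_STEPS_MHZ if f in self.times]
        if not tried:
            return FREQ_STEPS_MHZ[0]                 # step 1: start at the highest frequency
        last = tried[-1]
        prev = tried[-2] if len(tried) > 1 else None
        if prev is not None and self.times[last] > (1 + DELTA) * self.times[prev]:
            self.chosen = prev                       # step 4: last step hurt too much, back off
            return self.chosen
        idx = FREQ_STEPS_MHZ.index(last)
        if idx + 1 == len(FREQ_STEPS_MHZ):
            self.chosen = last                       # step 5: lowest frequency reached
            return self.chosen
        return FREQ_STEPS_MHZ[idx + 1]               # step 3: reduce frequency by one step

    def record(self, freq_mhz, exec_time):
        """Step 2: record the measured execution time of the current instance."""
        self.times[freq_mhz] = exec_time
```

In the full scheme the frequency returned here is only a request; the domain manager described on slide 19 decides what is actually applied.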

13-18 Local Management Example (Learning)
Sub-phase i detected: set f to the highest value (e.g., f_i1 = 800 MHz); measure execution time t_i1.
Sub-phase i detected again: set f to the next value (e.g., f_i2 = 533 MHz); measure t_i2. Since t_i2 < (1+δ)t_i1, continue exploration.
Sub-phase i detected again: set f to the next value (e.g., f_i3 = 400 MHz); measure t_i3. Since t_i3 > (1+δ)t_i2, stop exploration.
Sub-phase i detected from then on: use f_i2 = 533 MHz, with no further exploration.
The steady-state frequency will change if there is a phase misprediction. In reality, the testing of new frequencies also depends on domain management decisions.
(Example built up over several animation slides; a small numerical check follows below.)
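
A quick numerical check of the stopping condition in this example. The execution times are purely assumed for illustration; the slides do not give them.

```python
# Assumed (illustrative) execution times for sub-phase i at each frequency.
delta = 0.10
t_i1 = 1.00   # seconds at 800 MHz
t_i2 = 1.06   # seconds at 533 MHz
t_i3 = 1.25   # seconds at 400 MHz

assert t_i2 < (1 + delta) * t_i1   # 1.06 < 1.10: within threshold, continue exploration
assert t_i3 > (1 + delta) * t_i2   # 1.25 > 1.166: threshold exceeded, stop and keep 533 MHz
```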

19 Power Manager: Domain Manager
One per voltage domain; it decides and controls voltage and frequency.
Input: frequency requests from the local managers of each core. Output: frequencies for the entire domain.
Policies investigated:
Simple: service requests in order.
Mean: select the mean of the requests.
All_low / All_high: assign the lowest / highest frequency requested.
The voltage is set for the entire domain based on the highest frequency in use (max), via a voltage lookup table. (A sketch of the policies and the lookup follows below.)
(Figure: domain manager collecting requests fr0..fr7, taking max(f0...f3), and looking up Vdom in the voltage LUT.)
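
A minimal Python sketch of the four policies and the max-frequency voltage lookup. The frequency ladder and all LUT values other than the 800 MHz / 1.1 V point shown later are placeholder assumptions, not the real SCC table.

```python
# Illustrative domain-manager sketch; LUT values are assumptions except 800 MHz -> 1.1 V.
FREQ_STEPS_MHZ = [800, 533, 400, 320, 266]
VOLTAGE_LUT = {800: 1.1, 533: 0.9, 400: 0.8, 320: 0.7, 266: 0.7}   # MHz -> V (assumed)

def snap(freq):
    """Snap an arbitrary frequency to the nearest available step."""
    return min(FREQ_STEPS_MHZ, key=lambda f: abs(f - freq))

# The four policies investigated, each mapping per-core requests to one frequency.
def policy_simple(requests):
    return requests[0]                           # service requests in arrival order

def policy_mean(requests):
    return snap(sum(requests) / len(requests))   # arithmetic mean (Amean)

def policy_all_low(requests):
    return min(requests)

def policy_all_high(requests):
    return max(requests)

def domain_decision(tile_requests, policy):
    """One frequency per frequency domain (tile), one voltage for the whole
    voltage domain, chosen for the highest frequency selected within it."""
    f_doms = [policy(reqs) for reqs in tile_requests]   # per-tile frequencies
    v_dom = VOLTAGE_LUT[max(f_doms)]                    # voltage follows the max
    return f_doms, v_dom
```

For example, with core requests [[800, 400], [533, 533]] the Amean policy yields tile frequencies [533, 533] and, with these placeholder LUT values, V_dom = 0.9 V.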

20-26 Domain Management Example
Each local manager 0..7 sends its frequency request fr0..fr7 to the domain manager (e.g., local manager 0 requests fr0 = 800 MHz).
The domain manager applies the policy f() to the requests to obtain the domain frequencies f_doms, then looks up the domain voltage V_dom in the SCC V/F table.
Here max(f_doms) = 800 MHz, so V_dom = 1.1 V. The process then repeats as new requests arrive.
(Example and SCC frequency/voltage table built up over several slides; most of the numeric values were shown only in the figure.)

27 Outline: Introduction, Power Manager, Phase Predictor, Results, Conclusions

28 Phase Predictor
MPI applications exhibit recurring communication and execution patterns at MPI-event granularity, and these patterns are highly repeatable.
Pattern detection: uses a supermaximal repeat string algorithm.
Predictor: predicts the next call and program phase with a projected execution time, and places DVFS scheduling points around repeatable regions within the pattern.
It acts as the front-end to the local power controllers and is implemented as a wrapper library for MPI calls. (A simplified sketch of repeat detection follows below.)
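
A simplified Python sketch of the idea, not the paper's supermaximal-repeat-string algorithm: a brute-force search for the longest pattern that repeats back-to-back in a trace of MPI call identifiers. Finding such a pattern is what allows DVFS scheduling points to be placed around the repeatable region.

```python
# Simplified stand-in for the SMRP detector: find the longest adjacently repeated
# pattern in a trace of MPI call IDs. The real scheme uses a supermaximal repeat
# string algorithm; this brute-force version only illustrates the idea.

def find_repeating_pattern(trace, min_repeats=2):
    """Return (start, pattern) for the longest adjacently repeated pattern, or None."""
    n = len(trace)
    for length in range(n // min_repeats, 0, -1):       # prefer long patterns
        for start in range(n - min_repeats * length + 1):
            pattern = trace[start:start + length]
            if pattern == trace[start + length:start + 2 * length]:
                return start, pattern                   # first (longest) hit wins
    return None

# Example trace of call identifiers as recorded by the MPI wrapper library.
trace = ["isend", "irecv", "waitall", "allreduce"] * 3 + ["barrier"]
print(find_repeating_pattern(trace))   # (0, ['isend', 'irecv', 'waitall', 'allreduce'])
```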

29-47 Phase Predictor Example
(Animated walkthrough of pattern detection across MPI processes P0-P7; figure only, no slide text.)

48 Outline: Introduction, Power Manager, Phase Predictor, Results, Conclusions

49 Experimental Setup
Platform: Single-chip Cloud Computer, 48 cores each running a Linux kernel; V and F levels obtained empirically (frequency/voltage table shown as a figure). Compiler: GCC.
Benchmarks: NAS MPI Parallel Benchmarks; 2 SPEC MPI 2007 benchmarks.
Evaluation methodology: lab setup for accurate power measurement.
Schemes evaluated:

Scheme                    Predictor  Domain policy
SMRP + Amean              SMRP       Mean
SMRP + Simple             SMRP       Simple
Chipwide + SMRP + Amean   SMRP       Mean, but chip-wide
GHTP + Amean              GHTP       Mean
GHTP + Simple             GHTP       Simple

50 EDP Results: Bottom Line
(Chart: EDP of GHTP+Simple, GHTP+Amean, Chipwide+SMRP+Amean, SMRP+Simple, and SMRP+Amean for is, ft, mg, cg, lu, bt, sp, lammps, tachyon, and their average.)
11% average EDP reduction with a 7% increase in execution time.

51 Results: Domain Management Policies
(Chart: EDP and execution time for the Simple, Amean, Alllow, Allhigh, and ChipW policies across is, ft, mg, cg, lu, bt, sp, lammps, tachyon, and their average.)
The arithmetic mean policy performs best.

52 Results: Phase Predictor Performance
(Chart: accuracy and coverage of SMRP vs. GHTP for is, ft, mg, cg, lu, bt, sp, lammps, tachyon, and their average.)
The SMRP predictor is on average 17% more accurate than the state-of-the-art GHTP.

53 Conclusions
Many-cores offer new challenges and opportunities for DVFS: possible division into domains and possible application-level control.
We presented a novel power management scheme applicable to many-cores.
Modular: allows separation of concerns between phase detection/prediction and control.
Hierarchical: can accommodate control at the domain level.
Transparent: does not require user or OS intervention.
Demonstrated significant energy improvements of 15% on average on a real system; the benefits come from both better prediction and better management.

54 Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer Nikolas Ioannou, Michael Kauschke, Matthias Gries, and Marcelo Cintra University of Edinburgh Intel Labs Braunschweig

55 Experimental Setup: Power Measurement
Lab environment to accurately and directly measure system power: input voltage and current are measured directly with digital multimeters, with current measured through a shunt resistor. (A sketch of the resulting power computation follows below.)
(Figure: voltmeters and ammeters feeding a measurement PC, alongside the management/console PC and a Windows lab PC.)
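
A small Python sketch of how power and energy could be derived from such readings; the shunt resistance and the sample values are placeholders, not the actual lab setup.

```python
# Derive input power from the supply voltage and the voltage drop across the
# current-sense shunt resistor. Resistor value and readings are assumptions.
R_SHUNT_OHMS = 0.01                  # placeholder shunt resistance

def sample_power(v_supply, v_shunt):
    """Instantaneous power: P = V_supply * I, with I = V_shunt / R_shunt."""
    return v_supply * (v_shunt / R_SHUNT_OHMS)     # watts

def energy(samples, dt):
    """Integrate equally spaced power samples over a fixed time step dt (seconds)."""
    return sum(sample_power(v, vs) for v, vs in samples) * dt   # joules

readings = [(3.3, 0.15), (3.3, 0.16), (3.3, 0.14)]   # (V_supply, V_shunt) pairs, assumed
print(energy(readings, dt=0.5))
```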

56 SCC MPI Frequency Scaling
(Chart: computation and communication speedup ratios for different frequencies on the SCC, with speedup normalized to 533 MHz.)
The SCC is less communication bound than traditional clusters.

57 Performance Threshold Sensitivity
(Chart: EDP and execution time as a function of the performance threshold δ, up to 30%.)
A threshold of 10% seems to be the sweet spot.

58 Related Work
Hardware schemes (Isci et al. MICRO 2006, Huang et al. ISCA 2003): require additional hardware for monitoring and control; results obtained through simulation.
DVFS management of MPI applications (Freeh et al. PPoPP 2005, Lim et al. SC 2006, Rountree et al. ICS 2009): assume per-core power management and require profile data.
PowerNap (Meisner et al. ASPLOS 2009): an idle-time policy; our scheme is complementary to such idle-time schemes.

59 Bibliography
C. Isci, A. Buyuktosunoglu, C.-Y. Cher, and M. Martonosi. An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. MICRO 2006.
M. C. Huang, J. Renau, and J. Torrellas. Positional Adaptation of Processors: Application to Energy Reduction. ISCA 2003.
V. W. Freeh and D. K. Lowenthal. Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster. PPoPP 2005.
B. Rountree et al. Adagio: Making DVS Practical for Complex HPC Applications. ICS 2009.

60 Bibliography (cont.)
M. Y. Lim, V. W. Freeh, and D. K. Lowenthal. Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs. SC 2006.
D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: Eliminating Server Idle Power. ASPLOS 2009.

61 Background: DVFS
Dynamic power consumption: P_dyn ∝ V_dd^2 · f.
The maximum frequency is also a function of V_dd (lower V_dd means lower f), so lowering both V_dd and f can bring significant power savings. If the power savings come with little impact on performance, energy savings are achieved as well.
DVFS is usually applied to cores but not to memories, so a higher core frequency means a higher memory latency in core cycles.
A commonly accepted rule of thumb is a 3:1 ratio of power savings to performance degradation.
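
Writing the relation out for reference; the roughly cubic power scaling assumes the common first-order approximation f ∝ V_dd, which the slide does not state explicitly.

```latex
% Generic CMOS dynamic-power relations; the linear f-vs-V_dd step is an illustrative assumption.
\[
  P_{\mathrm{dyn}} \;\propto\; C_{\mathrm{eff}}\, V_{dd}^{2}\, f
\]
\[
  \text{If } f \propto V_{dd}: \quad P_{\mathrm{dyn}} \propto V_{dd}^{3},
  \qquad
  \frac{E'}{E} \;=\; \frac{P'}{P}\cdot\frac{t'}{t}
  \;=\; \left(\frac{V'_{dd}}{V_{dd}}\right)^{2}\frac{f'}{f}\cdot\frac{t'}{t}.
\]
```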

62 Background: MPI Applications
MPI is a library supporting the message-passing programming model: a user API for exchanging messages across abstract processes (common messages are send, receive, and collective types) and a system interface to hardware communication mechanisms (e.g., TCP/IP, InfiniBand, vendor-proprietary fabrics).
In most systems the library runs on the same core as the user code, so DVFS can be applied to both user code and MPI library code.
Common programming styles lead to much regularity in the patterns of message exchanges, and the well-defined standard makes it easy to add wrappers to common MPI calls. (A minimal sketch of the wrapper idea follows below.)
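
A minimal Python sketch of the wrapper idea, purely illustrative: the actual library intercepts the C MPI calls, whereas here a decorator records a call identifier and the elapsed time of stand-in communication functions and appends them to a trace for the phase predictor. All names are invented for the example.

```python
import functools
import time

call_trace = []   # sequence of (call_id, duration in seconds) fed to the phase predictor

def mpi_wrapper(call_id):
    """Wrap a communication call so every invocation is logged for phase detection."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)                     # forward to the real call
            call_trace.append((call_id, time.perf_counter() - start))
            return result
        return wrapper
    return decorate

# Stand-in communication functions; a real wrapper would forward to the MPI library.
@mpi_wrapper("send")
def send(dest, payload):
    pass

@mpi_wrapper("allreduce")
def allreduce(value):
    return value
```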

63 Background: SCC Many-Core
An experimental concept vehicle developed by Intel Labs to serve as a platform for software research: 48 Intel Pentium IA cores in a tiled organization, with 2 cores per tile and a mesh interconnect.
There is one frequency domain per tile and one voltage domain per 4 tiles.
The current frequency and voltage levels can be read and set by user software through registers. Voltage changes take about 1 ms; frequency changes take only a few cycles.
