Power-Aware Compile Technology. Xiaoming Li


1 Power-Aware Compile Technology Xiaoming Li

2 Frying Eggs

3 Future CPU? [Figure: power density (Watts/cm^2) of Intel processors (i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III) plotted against reference points: hot plate, nuclear reactor, rocket nozzle, Sun's surface. Source: Fred Pollack, Intel, MICRO 1999 keynote.]

4 Maximizing power efficiency is within reach. Hardware support: Enhanced SpeedStep offers low-overhead frequency/voltage scaling at 10 us per transition. Opportunities: CPU frequency and front-side bus frequency are decoupled. Programs have memory-bound segments and CPU-bound segments.

5 Is any energy wasted when executing programs? Current status: Programs run at a single frequency from start to end, neglecting the segmental behavior of execution. The CPU idles in memory-bound segments. Our plan: Find out how a program executes. Remove the fat in energy consumption.
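To make the idle-CPU argument concrete, here is a minimal sketch of a two-part timing model. The model itself, the `Segment` type, and the function names are assumptions for illustration and do not come from the slides: core-bound work scales with CPU frequency, while memory stall time does not, because the bus frequency is decoupled.

```c
#include <assert.h>

/* Hypothetical two-part timing model (an assumption, not from the slides):
 * time = CPU work that scales with core frequency
 *      + memory stall time that does not. */
typedef struct {
    double cpu_cycles;   /* cycles of core-bound work */
    double mem_seconds;  /* stall time set by the memory system */
} Segment;

/* Estimated run time of a segment at core frequency freq_hz. */
double segment_time(Segment s, double freq_hz)
{
    assert(freq_hz > 0.0);
    return s.cpu_cycles / freq_hz + s.mem_seconds;
}

/* Slowdown factor when moving from f_high to f_low. */
double slowdown(Segment s, double f_high, double f_low)
{
    return segment_time(s, f_low) / segment_time(s, f_high);
}
```

Under this model, halving the frequency doubles the run time of a purely CPU-bound segment, while a segment that spends 90% of its time stalled on memory slows down by only about 10%. That gap is the opportunity the plan targets.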

6 Searching for the most power efficient FFT Select the proper frequency/voltage for different regions in the FFT code. Challenges include: Where to switch? Which frequency? Schedule the code to reveal more opportunities for frequency scaling. Challenges include: How to schedule for power?

7 What does previous research do? Hardware/OS [Berkeley, MIT, UMD, UMich, UVa, ...]. Interactive applications: Predict the memory access pattern. Fixed window size -> wrong predictions. Batch applications: Predict the execution time of every task. Distribute unused time to the remaining tasks. Low granularity, no use for DSP programs.

8 Previous Compiler-based DVS algorithms. "The Design, Implementation, and Evaluation of a Compiler Algorithm for CPU Energy Reduction", Chung-Hsing Hsu and Ulrich Kremer, PLDI '03. Select basic blocks from program structure. Entry/exit is unique. Region types: loop, call site, if statement, sequence of regions, entire procedure. [Figure: example control-flow graph with nodes ENTRY, C1, C2, L3, L4, C5, EXIT.]

9 Chung-Hsing Hsu and Ulrich Kremer's Approach (Cont.) Measure the execution time and the power consumption of every region at every possible frequency. Change the frequency of only one region. Exhaustive search for the best region and the optimal frequency.
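The search described on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the published algorithm: the table sizes, the `Choice` type, the function name, and the `max_slowdown` bound are hypothetical, the measurements are taken as energy per region rather than power, and frequency-switch overhead is ignored.

```c
#include <assert.h>

#define NREG  3   /* number of candidate regions (illustrative) */
#define NFREQ 3   /* number of frequency settings; NFREQ-1 is the fastest */

typedef struct {
    int region;     /* region to slow down, or -1 to keep everything fast */
    int freq;       /* frequency index chosen for that region */
    double energy;  /* total program energy under this choice */
} Choice;

/* t_meas[r][f] and e_meas[r][f] hold the measured time and energy of
 * region r at frequency index f. Only one region may leave the fastest
 * setting, and total time may grow by at most max_slowdown (a fraction). */
Choice best_single_region(double t_meas[NREG][NFREQ],
                          double e_meas[NREG][NFREQ],
                          double max_slowdown)
{
    const int top = NFREQ - 1;
    double base_t = 0.0, base_e = 0.0;
    assert(max_slowdown >= 0.0);
    for (int r = 0; r < NREG; r++) {
        base_t += t_meas[r][top];
        base_e += e_meas[r][top];
    }

    Choice best = { -1, top, base_e };
    for (int r = 0; r < NREG; r++) {
        for (int f = 0; f < top; f++) {
            /* Swap region r's contribution from the fastest setting to f. */
            double t = base_t - t_meas[r][top] + t_meas[r][f];
            double e = base_e - e_meas[r][top] + e_meas[r][f];
            if (t <= base_t * (1.0 + max_slowdown) && e < best.energy) {
                best.region = r;
                best.freq = f;
                best.energy = e;
            }
        }
    }
    return best;
}
```

The cost is one table lookup per (region, frequency) pair, but filling the table requires a measurement run for every pair, which is what makes the approach expensive in practice.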

10 Run programs on the simulator. "Compile-time Dynamic Voltage Scaling Settings: Opportunities and Limits", Fen Xie, Margaret Martonosi, and Sharad Malik, PLDI '03. Divide the execution into memory accesses and CPU operations. Assume the processor has a continuous frequency spectrum. Model power consumption in memory-access phases and CPU-active phases. Use existing optimization software to find the best single region for scheduling.

11 Re-examine our goal. What do we really want to optimize? Power ~ O(v): Trivial solution if the goal is just to reduce power. Energy ~ O(v^2): Minimal energy consumption at the lowest frequency. Energy x Delay: SPEC / Joules. Energy x Delay^2: Test if we really make an improvement.
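The metrics on this slide can disagree about which configuration is better. A minimal sketch, with the `Run` type and function names chosen here for illustration only:

```c
#include <assert.h>

typedef struct {
    double energy_j;  /* energy in joules */
    double delay_s;   /* execution time in seconds */
} Run;

/* Energy-delay product: weights energy and run time equally. */
double energy_delay(Run r)
{
    assert(r.delay_s >= 0.0);
    return r.energy_j * r.delay_s;
}

/* Energy-delay-squared product: penalizes slowdowns more heavily. */
double energy_delay2(Run r)
{
    assert(r.delay_s >= 0.0);
    return r.energy_j * r.delay_s * r.delay_s;
}
```

For two made-up runs, one fast (10 J, 2 s) and one slow (6 J, 3 s), energy alone and energy x delay both favor the slow run (18 versus 20), while energy x delay^2 favors the fast run (40 versus 54). The choice of metric therefore decides whether a frequency reduction counts as an improvement.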

12 Energy vs. Delay Landscape. [Figure: points A, B, and C in the energy-delay plane.] How to affect tradeoffs? How to compare tradeoffs?

13 Optimization Space. [Figure: energy vs. delay, with the Pareto-optimal points marked.]
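As a reminder of what Pareto optimal means here: a configuration is on the front when no other configuration is at least as good in both energy and delay and strictly better in one. A small sketch, where the `Point` type and function names are illustrative and not taken from the slides:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double delay;   /* execution time */
    double energy;  /* energy consumed */
} Point;

/* True if a is no worse than b in both dimensions and better in one. */
static bool dominates(Point a, Point b)
{
    return a.delay <= b.delay && a.energy <= b.energy &&
           (a.delay < b.delay || a.energy < b.energy);
}

/* Marks on_front[i] for every non-dominated point; returns their count.
 * Simple O(n^2) scan, which is fine for a handful of configurations. */
size_t pareto_front(const Point *pts, size_t n, bool *on_front)
{
    size_t count = 0;
    assert(pts != NULL && on_front != NULL);
    for (size_t i = 0; i < n; i++) {
        on_front[i] = true;
        for (size_t j = 0; j < n; j++) {
            if (j != i && dominates(pts[j], pts[i])) {
                on_front[i] = false;
                break;
            }
        }
        if (on_front[i]) {
            count++;
        }
    }
    return count;
}
```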

14 Projection of Compile Optimizations. [Figure: energy vs. runtime. Labels: F, parallelizing, scaling, energy-aware compilation, F = (F, s, p, -O).]

15 Our Goal: a new, higher-quality Pareto front for any metric. [Figure: energy vs. runtime.]

16 Simulator vs. Real Machine? Simulator: Watt, SimPower. The power model should be verified. Not the best environment for compiler research. Real machine: How to identify phases in the program? How to measure power consumption? How to search the front line of energy-delay?

17 Identify Program Phases. Use hardware counters: low overhead, limited number. Find the correct events. Memory access: L2_Cache_Load_In, L2_Prefetch_Load_In. Instruction number: Instruction_Retired. Execution time: Cycle.
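One way the counter events listed above could be turned into a phase label is sketched below. The `CounterSample` type, the loads-per-thousand-instructions measure, and the threshold are assumptions made for illustration. The sketch takes counter values as plain inputs and does not show how they are read from the hardware.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned long long l2_load_in;     /* L2_Cache_Load_In + L2_Prefetch_Load_In */
    unsigned long long instr_retired;  /* Instruction_Retired */
    unsigned long long cycles;         /* Cycle */
} CounterSample;

/* L2 line loads per 1000 retired instructions between two readings. */
double l2_loads_per_kilo_instr(CounterSample prev, CounterSample cur)
{
    assert(cur.instr_retired >= prev.instr_retired);
    assert(cur.l2_load_in >= prev.l2_load_in);
    unsigned long long instr = cur.instr_retired - prev.instr_retired;
    unsigned long long loads = cur.l2_load_in - prev.l2_load_in;
    if (instr == 0) {
        return 0.0;  /* nothing retired in this interval */
    }
    return 1000.0 * (double)loads / (double)instr;
}

/* Hypothetical classifier: threshold must be tuned per machine. */
bool is_memory_bound(CounterSample prev, CounterSample cur, double threshold)
{
    return l2_loads_per_kilo_instr(prev, cur) > threshold;
}
```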

18 Insert Reading Points. Control the overhead of reading. Read evenly during the execution. Use a simplified model of memory accesses and working cycles. Understand how the compiler translates instructions: constant loading, array access.

19 Are there really program phases? [Figure: L2 cache misses per 10 us vs. cycle (label: PM).]

20 [Figure: L2 cache misses per 10 us vs. cycle (label: PM).]

21 Iterations have different patterns

22 Frequency Scaling. Select the program region with the highest cache miss ratio. Lower the processor frequency before entering the region. Restore the frequency after exiting the region.
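The selection step can be sketched as a simple maximum over per-region profiles. The `RegionProfile` type and the function name are hypothetical, and the miss and access counts are assumed to come from a prior profiling run.

```c
#include <assert.h>

typedef struct {
    const char *name;             /* label for the region */
    unsigned long long misses;    /* cache misses observed in the region */
    unsigned long long accesses;  /* cache accesses observed in the region */
} RegionProfile;

/* Returns the index of the region with the highest miss ratio,
 * or -1 if no region recorded any accesses. */
int highest_miss_ratio_region(const RegionProfile *regions, int n)
{
    int best = -1;
    double best_ratio = -1.0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (regions[i].accesses == 0) {
            continue;  /* ratio undefined, skip */
        }
        double ratio = (double)regions[i].misses / (double)regions[i].accesses;
        if (ratio > best_ratio) {
            best_ratio = ratio;
            best = i;
        }
    }
    return best;
}
```

The selected region is the one that would be bracketed by the frequency-lowering and frequency-restoring calls.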

23 WHT-2^19 (out-of-cache). [Figure: cache miss ratio vs. time, with the low-frequency and high-frequency regions marked. Each point shows the cache miss ratio every 100 microseconds.]

24 Example: code with voltage/frequency scaling instructions
    setfreq(2);                 /* decrease frequency */
    i14 = 0;
    while (i14 <= 32767) {
        s277 = T2[i14];
        s278 = T2[ i14];
        t459 = s277 - s278;
        i14++;
    }
    setfreq(3);                 /* increase frequency */

25 Frequency Scaling. Select the program region with the highest cache miss ratio. Lower the processor frequency before entering the region. Restore the frequency after exiting the region. Transform the program to reveal more opportunities for frequency scaling.

27 Measure Energy Consumption. Energy = Volt * Amp * Time. Volt: constant. Amp: measured with an oscilloscope. Time: Cycle / Frequency.
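The formula on this slide can be applied to a trace of current samples as follows. This is a sketch under assumptions: the function names, the fixed sampling period, and the rectangle-rule summation are illustrative choices, not details given in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Time = Cycle / Frequency. */
double elapsed_seconds(unsigned long long cycles, double freq_hz)
{
    assert(freq_hz > 0.0);
    return (double)cycles / freq_hz;
}

/* Energy = Volt * Amp * Time, summed over evenly spaced current samples.
 * volts is treated as constant, as on the slide. */
double energy_joules(double volts, const double *amps, size_t n,
                     double sample_period_s)
{
    double amp_sum = 0.0;
    assert(sample_period_s > 0.0);
    for (size_t i = 0; i < n; i++) {
        amp_sum += amps[i];
    }
    return volts * amp_sum * sample_period_s;
}
```

For example, four samples of 1, 2, 3, and 2 A taken every 0.5 s at a constant 12 V give 12 * 8 * 0.5 = 48 J.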

28 Pentium-M 2.13 GHz. Six frequency settings. 2.13 GHz at volt (max performance). 800 MHz at volt (min performance/energy). The change in performance/energy tradeoff is dramatic.

31 WHT-2^20 Experiment Results. [Figure: energy (Joules) vs. execution time (seconds) Pareto curves, two panels: Fixed and Dynamic. Annotation: % energy reduction within 5% of the execution time of the fastest version.]

32 DCT-2^20. [Figure: energy (Joules) vs. execution time (seconds) Pareto curves, two panels: Fixed and Dynamic.]

33 Real DFT-2^20. [Figure: energy (Joules) vs. execution time (seconds) Pareto curves, two panels: Fixed and Dynamic.]

34 DFT-2^20. [Figure: energy (Joules) vs. execution time (seconds) Pareto curves, two panels: Fixed and Dynamic.]

35 Future Directions: loop transformation, global optimization, strength reduction, parallelization for power, ...
