CSCI 402: Computer Architectures
Computer Abstractions and Technology (4)
Fengguang Song
Department of Computer & Information Science, IUPUI

Contents
- Section 1.7 through the end of Chapter 1
- Power wall
- The multicore era
- SPEC benchmark
- Several fallacies and pitfalls
- First homework is assigned

What is the Power Wall?
- For the past 30 years, life was good
  - Moore's Law: the number of transistors doubles every 18 months
  - More transistors/mm^2 → more activity per unit area
  - Vendors also simply increased the CPU clock rate to speed up the CPU
  - The speedup was automatic!
- However, the free lunch is over! But why is there a problem now?
  - More transistors/mm^2 → higher power density (watts/cm^2) → higher temperature
- Fact: total power consumption of the world's PCs
  - 1992: 10 MW (87M CPUs)
  - 2001: 9000 MW (500M CPUs). That's 4 Hoover Dams!
- Q: Guess how much money 1 MW will cost for one year?
  - 365 × 24 × 1000 kWh × ($0.12 per kWh) → $1,051,200

Power Density (watts/cm^2)
[Figure: power density in watts/cm^2 (log scale, 1 to 1000) versus process generation (1.5µ down to 0.07µ) for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, with hot plate, nuclear reactor, and rocket nozzle marked as reference levels.]
- Power was doubling every 4 years
- Surpassed hot-plate power density at 0.5µ
- Not too long until it reaches nuclear-reactor levels
Courtesy: "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.
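
The electricity-cost question above is a one-line unit conversion. A minimal sketch, assuming a 365-day year and the $0.12 per kWh rate quoted on the slide (the function and constant names are illustrative, not from the course material):

```python
HOURS_PER_YEAR = 365 * 24   # 8,760 hours, ignoring leap years
PRICE_PER_KWH = 0.12        # dollars per kWh, the rate quoted on the slide


def annual_cost_dollars(power_mw: float, price_per_kwh: float = PRICE_PER_KWH) -> float:
    """Cost of drawing `power_mw` megawatts continuously for one year."""
    power_kw = power_mw * 1000              # 1 MW = 1000 kW
    energy_kwh = power_kw * HOURS_PER_YEAR  # energy = power x time
    return energy_kwh * price_per_kwh


cost_1mw = annual_cost_dollars(1)
print(f"${cost_1mw:,.0f}")  # $1,051,200
```

Passing a different `power_mw` scales the result linearly, since the cost is simply energy times price.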

The Power Wall
FIGURE. Clock rate and power for Intel x86 microprocessors over 10 generations and 30 years. The Pentium 4 made a dramatic jump in clock rate and power but less so in performance. The Prescott thermal problems led to the abandonment of the Pentium 4 line. The Core 2 line reverts to a simpler pipeline with lower clock rates. The Core i5 pipelines follow in its footsteps.

Tips to Understand Electricity, Voltage, and Leakage
- Electricity runs through circuits in the same way as water runs through plumbing
  - Voltage is like water pressure
  - Electric current is like the amount of water flowing
  - A capacitor is like a tank or container
- If you increase the water pressure, you can fill a tank more quickly
  - If you increase the voltage on a CPU chip, it can perform its operations more quickly
  - However, at a higher voltage the CPU uses more power and generates more heat
- If all you need is the difference between on and off, it doesn't matter whether you are talking about a 100-watt light bulb or an LED that uses a fraction of a watt
- As chips become smaller and the distance between wires shrinks, some current leaks across to nearby wires
Source: http://pclt.sites.yale.edu/circuit-size-voltage-and-heat

The Power Wall (Cont.)
- Metrics: power vs. energy (watts vs. joules)
- For CMOS, the primary source of energy consumption is dynamic energy, i.e., the energy needed to switch a transistor from 0 to 1 or from 1 to 0
- The other source is leakage (current that flows even when a transistor is off), which usually accounts for 40% of energy consumption on servers!
- The energy required per transistor to go 1 → 0 → 1:
  $\text{Energy} \propto \text{Capacitive load} \times \text{Voltage}^2$
- The power required per transistor:
  $\text{Power} \propto \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$
- Annotations on the power equation: power ×30, voltage 5 V → 1.5 V, frequency ×300

Power Example 1
- Suppose we build a new CPU with a 15% reduction in capacitive load, a 15% reduction in voltage, and a 15% reduction in frequency:
  $\frac{P_{new}}{P_{old}} = \frac{C_{old} \times 0.85 \times (V_{old} \times 0.85)^2 \times F_{old} \times 0.85}{C_{old} \times V_{old}^2 \times F_{old}} = 0.85^4 = 0.52$
- Today's situation at the power wall
  - We can't reduce voltage further: it would make transistors leakier
  - We can't remove more heat with cooling fans: 100 watts is already too much!
- How can we improve performance?

Entering the Multicore Era
- Multicore processors: more than one core per chip
- In the past, programmers relied on hardware innovation to double program performance
  - No code change at all
- Today, programmers need to rewrite programs to take advantage of multicore processors
  - Moreover, as the number of cores grows, programmers must keep improving program performance ("scalability")
- This requires explicitly parallel programming
  - Different from instruction-level parallelism, where the hardware executes multiple instructions at once, hidden from the programmer
- Hard to do
  - Programming for performance (not only for correctness)!
  - Load balancing
  - Optimizing communication and synchronization
  - More and more cores per chip are coming → thus, even harder!
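
Returning to the power example earlier on this page: because power is proportional to capacitive load × voltage² × frequency, the ratio of new to old power depends only on the three scaling factors. A minimal sketch of that calculation (the function name `relative_power` is an illustrative assumption):

```python
def relative_power(c_scale: float, v_scale: float, f_scale: float) -> float:
    """Return P_new / P_old, assuming Power is proportional to C * V^2 * F."""
    return c_scale * v_scale ** 2 * f_scale


# 15% reduction in capacitive load, voltage, and frequency
ratio = relative_power(0.85, 0.85, 0.85)
print(f"P_new / P_old = {ratio:.2f}")  # P_new / P_old = 0.52
```

Voltage enters squared, which is why lowering voltage was the most effective lever until leakage made further reductions impractical.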

History of Processor Performance
- Constrained by power, instruction-level parallelism, and memory latency

Benchmarking
- SPEC CPU benchmark
  - The programs used to measure performance
  - Intended to be representative of actual workloads
  - The latest version is SPEC CPU2017
- CINT2006 (integer) and CFP2006 (floating-point)
  - Measure the elapsed time to execute a number of programs
  - No I/O involved; focused only on CPU performance
  - Then normalize the times relative to a reference machine
  - Summarize the result as the geometric mean of the performance ratios:
    $\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$

Performance Ratio
$\text{SPECratio} = \frac{\text{time on SPARCstation 10/40}}{\text{time on target machine}}$
- Larger SPECratio → better performance

CINT2006 for Intel Core i7 920
- 12 integer compute programs

Why Not Use the Arithmetic Mean (or Average)?
Each comparison below adds up the performance ratios (reference time / machine time) of the two tests. Only the reference machine changes between scenarios.

Scenario I
             Test T1    Test T2
Machine A:   10 sec     100 sec
Machine B:   1 sec      1000 sec
Reference:   1 sec      100 sec
A: 0.1 + 1 = 1.1    B: 1 + 0.1 = 1.1    → A and B are the same

Scenario II
             Test T1    Test T2
Machine A:   10 sec     100 sec
Machine B:   1 sec      1000 sec
Reference:   1 sec      10 sec
A: 0.1 + 0.1 = 0.2    B: 1 + 0.01 = 1.01    → B is faster

Scenario III
             Test T1    Test T2
Machine A:   10 sec     100 sec
Machine B:   1 sec      1000 sec
Reference:   1 sec      1000 sec
A: 0.1 + 10 = 10.1    B: 1 + 1 = 2    → A is faster

There is a New SPEC Power Benchmark (#ops/watt)
FIGURE 1.19 SPECpower_ssj2008 running on a dual-socket 2.66 GHz Intel Xeon X5650 with 16 GB of DRAM and one 100 GB SSD disk.
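
The three arithmetic-mean scenarios above can be replayed in a few lines. This is a minimal sketch using the timings from the slide; the helper names are illustrative, and the slide adds the ratios where this code averages them, which ranks the machines the same way:

```python
from math import prod

MACHINE_A = (10, 100)    # seconds on tests T1 and T2
MACHINE_B = (1, 1000)
REFERENCES = {"I": (1, 100), "II": (1, 10), "III": (1, 1000)}


def performance_ratios(reference, machine):
    """Reference time divided by machine time, one ratio per test."""
    return [ref / t for ref, t in zip(reference, machine)]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return prod(values) ** (1 / len(values))


for name, reference in REFERENCES.items():
    a = performance_ratios(reference, MACHINE_A)
    b = performance_ratios(reference, MACHINE_B)
    print(
        f"Scenario {name}: "
        f"arithmetic A={arithmetic_mean(a):.3f} B={arithmetic_mean(b):.3f} | "
        f"geometric A={geometric_mean(a):.3f} B={geometric_mean(b):.3f}"
    )
```

With the arithmetic mean, the winner changes with the choice of reference machine. With the geometric mean, A and B tie in all three scenarios, because the reference times cancel out when two geometric means are compared.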

Fallacies and Pitfalls
- Appear in every chapter
- Fallacies are commonly held misconceptions that you will encounter
- Pitfalls are easily made mistakes
  - Often they generalize a principle that is only true in a very limited context
- They help you avoid making the same mistakes

1st Fallacy: Low Power at Idle
- Fallacy: computers at low utilization use little power
- Look back at the previous Intel Core i7 power benchmark
  - At 100% load: 258 W
  - At 50% load: 170 W (66%)
  - At 10% load: 121 W (47%!) → Best performance ever: still 33%
- E.g., Google data centers
  - Mostly operate at 10% to 50% load
  - At 100% load less than 1% of the time
- Current research: designing processors that use power proportional to workload
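
The percentages on the idle-power slide are each measurement divided by the 258 W peak. A small sketch, using only the wattages from the slide, that also shows what a perfectly power-proportional machine would draw at the same load (the names `share_of_peak` and `proportional_watts` are illustrative assumptions):

```python
PEAK_WATTS = 258
MEASURED_WATTS = {100: 258, 50: 170, 10: 121}  # load (%) -> measured watts


def share_of_peak(watts: float) -> float:
    """Fraction of peak power actually drawn."""
    return watts / PEAK_WATTS


def proportional_watts(load_percent: float) -> float:
    """Power an ideal, perfectly power-proportional machine would draw."""
    return PEAK_WATTS * load_percent / 100


for load, watts in MEASURED_WATTS.items():
    print(
        f"{load:>3}% load: {watts} W measured "
        f"({share_of_peak(watts):.0%} of peak), "
        f"{proportional_watts(load):.1f} W if power were proportional"
    )
```

At 10% load the measured 121 W is several times the 25.8 W that proportional scaling would give, which is the gap the current research mentioned on the slide aims to close.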

1st Pitfall: Improving one aspect of a computer and expecting a proportional improvement in overall performance
- Amdahl's Law:
  $T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}$
- E.g., multiply accounts for 80 s of a total time of 100 s
  - How much must multiply improve to get a 5× overall speedup?
    $20 = \frac{80}{n} + 20$
  - Is it possible?

2nd Fallacy: Designing for performance and designing for energy efficiency are unrelated

2nd Pitfall: Using a subset of the performance equation as a performance metric
- Even using 2 of the 3 components is not correct
  $\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$
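
The Amdahl's Law question above can be checked numerically. A minimal sketch, with illustrative function names, that solves for the improvement factor and reports when no finite factor can reach the target:

```python
from typing import Optional


def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Amdahl's Law: execution time after speeding up only the affected part."""
    return t_affected / factor + t_unaffected


def required_factor(
    t_affected: float, t_unaffected: float, target_speedup: float
) -> Optional[float]:
    """Improvement factor needed for the target overall speedup, or None if unreachable."""
    t_target = (t_affected + t_unaffected) / target_speedup
    budget = t_target - t_unaffected  # time left for the affected part
    if budget <= 0:
        return None  # the unaffected part alone already uses up the target time
    return t_affected / budget


print(required_factor(80, 20, 5))  # None: a 5x overall speedup is impossible
print(required_factor(80, 20, 4))  # 16.0: multiply must become 16x faster
```

For the slide's numbers, a 5× speedup means finishing in 20 s, but the 20 s that multiply does not touch already fill that budget, so no improvement to multiply is enough.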

3rd Pitfall: Using MIPS as a Performance Metric
- MIPS: Millions of Instructions Per Second
- Why is it a pitfall? It does not account for differences in complexity between instructions
  $\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$

An Example of MIPS
- Two different compilers are used on a 100 MHz machine with three classes of instructions, Class A, B, and C, which require 1, 2, and 3 cycles, respectively. Both compilers are used to compile the same program.
- The 1st compiler's code: 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions
- The 2nd compiler's code: 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions
- Q1: Which code is faster according to execution time?
- Q2: Which code is faster according to MIPS?
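
Answer to Q1
Q1: Which code is faster according to execution time?
$\text{Exec time} = \frac{\text{clock cycles}}{\text{clock rate}} = \frac{IC \times CPI}{\text{clock rate}}$
$\text{time}_1 = \frac{(5\text{ mil} \times 1) + (1\text{ mil} \times 2) + (1\text{ mil} \times 3)\ \text{cycles}}{100 \times 10^6\ \text{cycles/sec}} = \frac{(5 + 2 + 3) \times 10^6}{100 \times 10^6}\ \text{sec} = \frac{10}{100}\ \text{sec} = 0.1\ \text{sec}$
$\text{time}_2 = \frac{(10\text{ mil} \times 1) + (1\text{ mil} \times 2) + (1\text{ mil} \times 3)\ \text{cycles}}{100 \times 10^6\ \text{cycles/sec}} = \frac{(10 + 2 + 3) \times 10^6}{100 \times 10^6}\ \text{sec} = \frac{15}{100}\ \text{sec} = 0.15\ \text{sec}$
- Code 1 is faster than Code 2 with respect to execution time

Answer to Q2
Q2: Which code is faster according to MIPS?
$\text{MIPS} = \frac{\text{\# of instructions}}{\text{execution time} \times 10^6}$
$\text{MIPS}_1 = \frac{7 \times 10^6\ \text{instr}}{0.1\ \text{sec} \times 10^6} = 70\ \text{MIPS}$
$\text{MIPS}_2 = \frac{12 \times 10^6\ \text{instr}}{0.15\ \text{sec} \times 10^6} = 80\ \text{MIPS}$
- Code 2 is faster with respect to MIPS!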
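
Both answers can be reproduced directly from the instruction mixes. A minimal sketch, using the numbers from the example; the dictionary layout and function names are illustrative assumptions:

```python
CLOCK_RATE_HZ = 100_000_000                 # 100 MHz machine
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

CODE_1 = {"A": 5_000_000, "B": 1_000_000, "C": 1_000_000}   # 1st compiler
CODE_2 = {"A": 10_000_000, "B": 1_000_000, "C": 1_000_000}  # 2nd compiler


def execution_time(counts: dict) -> float:
    """Seconds = total clock cycles / clock rate."""
    cycles = sum(n * CYCLES_PER_CLASS[cls] for cls, n in counts.items())
    return cycles / CLOCK_RATE_HZ


def mips(counts: dict) -> float:
    """Millions of instructions per second = instruction count / (time * 10^6)."""
    return sum(counts.values()) / (execution_time(counts) * 1e6)


for name, code in (("Code 1", CODE_1), ("Code 2", CODE_2)):
    print(f"{name}: {execution_time(code):.2f} sec, {mips(code):.0f} MIPS")
```

Code 2 earns the higher MIPS rating only because it executes more cheap Class A instructions. It still takes 50% longer to finish, which is why execution time is the metric to trust.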

Remarks from Chapter 1
- The performance/cost ratio keeps improving
  - Due to underlying technology development
- Hierarchical layers of abstraction exist in both hardware and software
- Instruction set architecture (ISA)
  - The hardware/software interface
- Execution time: the best performance measure!
- Power is a limiting factor
  - Hence we follow a different path now: use parallelism to improve performance

Your First Homework
1) Book exercises: 1.5, 1.6, 1.7, 1.9, 1.14.1 (typo: 10 → 10 )
2) An additional question: Look at the most recent Top One supercomputer. Suppose each person can calculate one floating-point operation per second. How many days would it take for all the people on Earth to calculate what the Top One supercomputer can compute in only 1 second? The population can be rounded to billions.
- Due: 11:59 PM, Wednesday, January 31
- No late homework will be accepted
- Submit it through Canvas; the TA has already posted it there