1 TDT 4260 lecture 2 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU
2 Lecture overview Chapter 1: Fundamentals of Quantitative Design and Analysis, continued Technology trends Power & energy, costs Performance, metrics Speedup, Amdahl's law
3 Trends in Technology Integrated circuit technology (Moore's Law) Transistor density: 35%/year Die size: 10-20%/year Combined effect: 40-55%/year DRAM capacity: 25-40%/year (slowing) Flash capacity: 50-60%/year 15-20X cheaper/bit than DRAM, but slower Magnetic disk capacity: 40%/year 15-25X cheaper/bit than Flash 300-500X cheaper/bit than DRAM
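To get a feel for what these annual rates mean, the following minimal sketch converts a compound growth rate into a doubling time and a ten-year improvement factor. The function names are illustrative, and the assumption that the rates compound steadily year over year is a simplification.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a compound annual rate (0.40 means 40%/year)."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


def growth_factor(annual_growth, years):
    """Total improvement factor after compounding for the given number of years."""
    return (1.0 + annual_growth) ** years


if __name__ == "__main__":
    # Rates taken from the slide; steady compounding is an assumption.
    for name, rate in [("Transistor density", 0.35), ("Magnetic disk capacity", 0.40)]:
        print(f"{name}: doubles every {doubling_time_years(rate):.1f} years, "
              f"x{growth_factor(rate, 10):.0f} in 10 years")
```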
4 Bandwidth and Latency Bandwidth or throughput Total work done in a given time 10,000-25,000X improvement for processors 300-1200X improvement for memory and disks Latency or response time Time between start and completion of an event 30-80X improvement for processors 6-8X improvement for memory and disks Numbers are computed based on current systems vs. systems in the early 80s Details in Fig 1.10 (1982-2010)
5 Latency Lags Bandwidth Log-log plot of bandwidth and latency milestones
6 Current Trends in Architecture Cannot continue to exploit more Instruction-Level parallelism (ILP) Limited single processor performance improvement since 2003 New models for performance: Data-level parallelism (DLP) Thread-level parallelism (TLP) Request-level parallelism (RLP) These forms of parallelism require explicit restructuring of the application Increasing performance with technology generations is now to a larger extent up to the programmer
7 Transistors and Wires Feature size is the minimum size of transistor or wire in the x- or y-dimension 10 microns in 1971 to 0.032 microns (32 nm) in 2011 miniaturization Integration density scales quadratically Transistor performance scales linearly Wire delay does not improve with feature size since it is proportional to the length of the wire On-chip communication becomes an increasing problem On-chip data locality becomes increasingly important
8 The Memory Wall [Figure: The Processor-Memory Gap. Relative performance (log scale, 1 to 100000) versus year (1978-2010) for processor performance and main memory latency.] Consequence: deeper memory hierarchies (P, registers, L1 cache, L2 cache, L3 cache, memory, ...) Complicates understanding of performance: cache usage has an increasing influence on performance
9 I/O Pin Problem Number of I/O signaling pins: the pin count drives chip cost, and high-frequency operation on the circuit board is a challenge Projections from ITRS (International Technology Roadmap for Semiconductors) From PACT paper by Huh, Burger and Keckler, 2001
10 Power and Energy Remember: Energy (Joule) = Power (Watt) * time (second) 1 Watt = 1 Joule/second Problem: Get power in, get power out Thermal Design Power (TDP) Characterizes sustained power consumption Used as target for power supply and cooling system Lower than peak power, higher than average power consumption Clock rate can be reduced dynamically to limit power consumption
11 TDP, a recent example (Parallelization of a PARSEC application [in Workshop at SC 12]) Red dashed line is the TDP of each processor Low-power Sandy Bridge core i5 (laptop) close to TDP Application is not «challenging enough» for the server node Sandy Bridge EP Vectorization with SSE and AVX is very energy efficient! (much better performance, but almost free!)
12 More details in (Learn more --- not part of the course) Performance and energy impact of parallelization and vectorization techniques in modern microprocessors Juan M. Cebrian, Lasse Natvig, and Jan Christian Meyer Journal of Computing, November 2013, pages 1-15. NTNU-Video from our PP4EE-seminar http://video.adm.ntnu.no/openvideo/pres/52526d7a44e9c
13 Dynamic Energy and Power Dynamic energy: a transistor switches from 0 → 1 or 1 → 0 The energy of a single transition is $\frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$ Dynamic power: $\frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$ Reducing clock rate reduces power, not energy
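The last point on this slide can be checked numerically. The sketch below is a minimal illustration, assuming a task that needs a fixed number of transitions and hypothetical values for capacitive load, voltage and frequency. Halving the clock rate halves the power but leaves the energy for the task unchanged, while lowering the voltage reduces both.

```python
def dynamic_energy(cap_load_farads, voltage):
    """Energy of a single 0->1 or 1->0 transition, in joules."""
    return 0.5 * cap_load_farads * voltage ** 2


def dynamic_power(cap_load_farads, voltage, freq_hz):
    """Dynamic power in watts: energy per transition times switching frequency."""
    return dynamic_energy(cap_load_farads, voltage) * freq_hz


def task_energy(cap_load_farads, voltage, freq_hz, transitions):
    """Energy for a task with a fixed number of transitions (power * time)."""
    time_s = transitions / freq_hz  # a slower clock means a longer run time
    return dynamic_power(cap_load_farads, voltage, freq_hz) * time_s


if __name__ == "__main__":
    C, V, N = 1e-9, 1.0, 1e9  # hypothetical load, voltage, transition count
    for f in (2e9, 1e9):
        print(f"f = {f:.0e} Hz: power = {dynamic_power(C, V, f):.2f} W, "
              f"energy = {task_energy(C, V, f, N):.2f} J")
    # Lowering the voltage by 15% is what actually saves energy.
    print(f"V = 0.85: energy = {task_energy(C, 0.85, 1e9, N):.4f} J")
```

This is why DVFS lowers voltage together with frequency: frequency scaling alone only stretches the same energy over a longer time.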
14 Clock rates stopped increasing Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W Heat must be dissipated from a 1.5 x 1.5 cm chip This is the limit of what can be cooled by air
15 Reducing Power Do nothing well Alternative formulation: Only power on the units that are currently needed Dynamic Voltage-Frequency Scaling (DVFS) Common in modern microprocessors Low power state for DRAM, disks Overclocking Intel Turbo Boost Technology (TBT) AMD PowerNow
16 Static Power Static power consumption $\text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage}$ Scales with number of transistors Even if they are idle (but powered) Power gating Turn off the power supply to units that are not in use Race to halt Typical embedded systems (e.g. Nordic Semiconductor)
17 Trends in Cost Cost driven down by learning curve Yield DRAM: price closely tracks cost Microprocessors: price depends on volume 10% less for each doubling of volume
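The volume rule for microprocessors can be written as a simple formula: each doubling of volume multiplies the price by 0.9. The sketch below assumes this applies smoothly to fractional doublings as well; the base price and volumes are hypothetical.

```python
import math


def price_at_volume(base_price, base_volume, volume, drop_per_doubling=0.10):
    """Price after scaling volume, losing `drop_per_doubling` per doubling."""
    doublings = math.log2(volume / base_volume)
    return base_price * (1.0 - drop_per_doubling) ** doublings


if __name__ == "__main__":
    # Hypothetical part priced at 100 when 1000 units are sold.
    for volume in (1000, 2000, 4000, 8000):
        print(f"{volume:5d} units: price {price_at_volume(100.0, 1000, volume):.2f}")
```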
18 Cost and COTS Cost to produce one unit includes (development cost / # sold units) benefit of large volume COTS commodity off the shelf much better performance/price per component strong influence on the selection of components for building supercomputers for more than 20 years Recent example: Mont Blanc project and Exynos 5 (next two slides)
19 Mont Blanc project Philippo Mantovani Mont Blanc 1 Mont Blanc 2 (Mont Blanc 3?) put Europe on the map of supercomputer vendors ARM is a partner Coordinated by UPC/BSC Alex Ramirez presented the Mont Blanc project at NTNU, November 2012 http://video.adm.ntnu.no/openvideo/pres/50c5c63af0f02
20 They consider using ARM Mali GPUs Competition between Nvidia GPU and Mali GPU still going on
21 back to Manufacturing ICs Yield: proportion of working dies per wafer
22 NEW TOPIC PERFORMANCE
23 Defining Performance Which airplane has the best performance? [Figure: four bar charts comparing Boeing 777, Boeing 747, BAC/Sud Concorde and Douglas DC-8-50 by passenger capacity, cruising range (miles), cruising speed (mph), and passengers x mph.]
24 Response Time Book definition: Time from issuing a command to its completion This is often referred to as the turn-around time Alternative response time definition: Time from issue to first response Execution time is the time the processor is busy executing the program Turn-around time includes the time the process waits to be executed, execution time does not
25 Response Time and Throughput Throughput Total work done per unit time How are response time and throughput affected by Replacing the processor with a faster version? Adding more processors?
26 Measuring Execution Time Elapsed time/wall clock time Total turn-around time, including all aspects Processing, I/O, OS overhead, idle time Determines system performance CPU time Time spent processing a given job Discounts I/O time Comprised of user CPU time and system CPU time Different programs are affected differently by CPU and system performance Time is the only unambiguous performance measure
27 Speedup Speedup = Performance of system / Performance Baseline Remember that Performance = 1/ Execution time Speedup = Execution Time Baseline / Execution time of system Parallel systems: Speedup = Parallel Performance / Sequential Performance Note: Use the best sequential algorithm, not the parallel algorithm with p = 1
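The note about the sequential baseline matters in practice. The sketch below uses hypothetical timings to show how measuring against the parallel algorithm run with p = 1 inflates the reported speedup compared with measuring against the best sequential algorithm.

```python
def speedup(baseline_time, system_time):
    """Speedup = execution time of baseline / execution time of system."""
    return baseline_time / system_time


if __name__ == "__main__":
    # Hypothetical measurements in seconds.
    best_sequential = 10.0   # best known sequential algorithm
    parallel_p1 = 12.0       # parallel algorithm run on one processor
    parallel_p4 = 4.0        # parallel algorithm run on four processors

    print(f"vs. best sequential:      {speedup(best_sequential, parallel_p4):.2f}")
    print(f"vs. parallel with p = 1:  {speedup(parallel_p1, parallel_p4):.2f}")
```

Since Performance = 1 / Execution time, the same function also gives the ratio of performances.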
28 Superlinear speedup
29 Benchmarks Benchmark types Kernels, toy programs and synthetic benchmarks Disadvantage: Too easy for compiler writers and computer architects to cheat BUT, can be very useful to help understanding of the interplay between architecture and software Embedded benchmarks: EEMBC, MiBench, etc. Desktop benchmarks: SPEC 2006, PARSEC, etc. New: ParVec from CARD-NTNU Server benchmarks: TPC-C, TPC-H, etc. Computers should work well for a collection of programs Average performance is the key metric Benchmarks are often assembled into suites
30 Summarizing and Reporting Results Averages can be computed in different ways Arithmetic mean Harmonic mean Geometric mean Complete and precise description of what you have measured and reported is mandatory! Reproducibility of experiments is very important You should include enough information for an independent researcher to repeat your experiment
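The three averages give different answers for the same data, which is one reason a precise description of the reported measure is needed. The sketch below computes all three with Python's standard `statistics` module (Python 3.8 or later); the three speedup values are hypothetical.

```python
import statistics


def summarize(values):
    """Return the arithmetic, harmonic and geometric mean of positive values."""
    return {
        "arithmetic": statistics.mean(values),
        "harmonic": statistics.harmonic_mean(values),
        "geometric": statistics.geometric_mean(values),
    }


if __name__ == "__main__":
    # Hypothetical per-benchmark speedups for a three-program suite.
    speedups = [2.0, 4.0, 8.0]
    for name, value in summarize(speedups).items():
        print(f"{name:10s} mean: {value:.3f}")
```

For positive values the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean, so the choice of average alone can change how a result looks.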
31 Smith, CACM oct. 1988
32 Principles of Computer Design Take Advantage of Parallelism e.g. multiple processors, overlap computation with communication/data retrieval, memory banks, pipelining, multiple functional units Principle of Locality Reuse of data and instructions Focus on the Common Case Amdahl's Law (Demonstrates how much a serial part of an application limits its parallelization) Sometimes called pessimistic wrt. parallelization
33 Amdahl's Law (1967) (fixed problem size) If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s Total work in computation: serial fraction s, parallel fraction p, s + p = 1 (100%) $S(n) = \frac{\text{Time}(1)}{\text{Time}(n)} = \frac{s + p}{s + p/n} = \frac{1}{s + (1-s)/n} = \frac{n}{1 + (n-1)s}$ Pessimistic and famous
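A minimal sketch of the formula on this slide, evaluated for a hypothetical serial fraction of 10%, shows how quickly the speedup saturates toward the 1/s bound as processors are added. The function name is illustrative.

```python
def amdahl_speedup(s, n):
    """Speedup on n processors when a fraction s of the work is serial."""
    return 1.0 / (s + (1.0 - s) / n)


if __name__ == "__main__":
    s = 0.10  # hypothetical serial fraction
    for n in (1, 2, 10, 100, 1000):
        print(f"n = {n:4d}: speedup = {amdahl_speedup(s, n):.2f}")
    print(f"upper bound 1/s = {1.0 / s:.2f}")
```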
34 Gustafson's law (1987) (scaled problem size, fixed execution time) Total execution time on the parallel computer with n processors is fixed: serial fraction s, parallel fraction p, s + p = 1 (100%) $S(n) = \frac{\text{Time}(1)}{\text{Time}(n)} = \frac{s + pn}{s + p} = s + pn = s + (1-s)n = n + (1-n)s$ Reevaluating Amdahl's law, John L. Gustafson, CACM May 1988, pp 532-533. Not a new law, but Amdahl's law with changed assumptions
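A minimal sketch of the scaled speedup formula on this slide, for a hypothetical serial fraction of 10%. Note that s here is the serial fraction of the execution time measured on the parallel computer, so the speedup grows almost linearly with n instead of saturating. The function name is illustrative.

```python
def gustafson_speedup(s, n):
    """Scaled speedup on n processors; s is the serial fraction of parallel run time."""
    return n + (1 - n) * s


if __name__ == "__main__":
    s = 0.10  # hypothetical serial fraction of the parallel execution time
    for n in (1, 2, 10, 100, 1000):
        print(f"n = {n:4d}: scaled speedup = {gustafson_speedup(s, n):.1f}")
```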