45-year CPU Evolution: 1 Law - 2 Equations


1 45-year CPU Evolution: 1 Law - 2 Equations
Daniel Etiemble, LRI, Université Paris Sud
[Figure labels: PowerPC 601, Pentium 4 Prescott, 2004, Xeon X7560, Power9, Nvidia Pascal]

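2 Are there some fundamental rules?
- Moore's law: exponential evolution
- Program execution time (Hennessy-Patterson): $T_{exec} = IC \times (CPI_{Computing} + CPI_{Mem}) \times T_c = IC \times CPI \times T_c$, where $IC$ is the instruction count, $CPI$ is the average number of cycles per instruction (computing plus waiting for data) and $T_c$ is the cycle time
- CMOS power dissipation: $P = P_{static} + P_{dynamic}$
WP Daniel Etiemble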
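As a rough illustration of the execution-time equation, the sketch below evaluates it for a made-up workload. The function name, the instruction count, the CPI values and the 2 GHz clock are all hypothetical numbers chosen for the example, not figures from the slides.

```python
def execution_time(ic, cpi_computing, cpi_mem, freq_hz):
    """Execution time T = IC * (CPI_computing + CPI_mem) * Tc, with Tc = 1 / F."""
    cycle_time = 1.0 / freq_hz
    return ic * (cpi_computing + cpi_mem) * cycle_time


# Hypothetical workload: 1e9 instructions on a 2 GHz core.
base = execution_time(1e9, cpi_computing=1.0, cpi_mem=1.0, freq_hz=2e9)

# Same workload if the cycles spent waiting for data are halved
# (for example through a better cache hierarchy).
better = execution_time(1e9, cpi_computing=1.0, cpi_mem=0.5, freq_hz=2e9)

print(f"baseline: {base:.3f} s, fewer memory stalls: {better:.3f} s")
```

The three levers of the equation (IC, CPI and the cycle time) appear as separate arguments, so each can be varied on its own.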

3 Moore's law
- The number of transistors per chip doubles every N months (12/18/24)
- N increases
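To make the effect of the doubling period concrete, here is a minimal sketch that computes the growth factor over a fixed time span for the three values of N quoted on the slide. The function name and the 10-year horizon are assumptions made for the example.

```python
def transistor_growth(years, doubling_months):
    """Growth factor of the transistor count after `years`,
    assuming one doubling every `doubling_months` months."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Growth over 10 years for the three doubling periods quoted on the slide.
for n_months in (12, 18, 24):
    factor = transistor_growth(10, n_months)
    print(f"N = {n_months} months: x{factor:.1f} in 10 years")
```

Going from N = 12 to N = 24 turns a factor of 1024 per decade into a factor of 32, which is why the value of N matters so much.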

4 Technological nodes
[Figure: CMOS technological nodes (nm): 10000, 3000, 1500, 1000, 800, 600, 350, 250, 180, 130, 90, 65, 45, 32, 22, 14, 10]
From one node to the next one (rough approximation)
- Gate delay /1.4 => higher clock frequencies
- Increase of transistor count per area unit
Consequences
- From multi-chips to one chip
- Larger on-chip memories
- More functionalities per chip
But MISMATCHES
[Figure: chart on a 0% to 60% axis comparing CPU clock frequency, CPU performance (before 1987), CPU performance (after 1986), DRAM bandwidth, 1/DRAM latency and DRAM capacity]
Cache hierarchies
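The node values listed on the slide can be used to check the "rough approximation" numerically. The sketch below takes the nodes from 350 nm down to 10 nm and prints the shrink ratio between successive nodes. The density column relies on an assumption that is not stated on the slide: that transistor area scales with the square of the feature size.

```python
# Node values (nm) as listed on the slide, from 350 nm downwards.
nodes_nm = [350, 250, 180, 130, 90, 65, 45, 32, 22, 14, 10]

# Linear shrink ratio between each node and the next one.
ratios = [older / newer for older, newer in zip(nodes_nm, nodes_nm[1:])]

for (older, newer), ratio in zip(zip(nodes_nm, nodes_nm[1:]), ratios):
    # Assumption: area per transistor scales with the square of the feature size.
    print(f"{older} -> {newer} nm: linear x{ratio:.2f}, density x{ratio * ratio:.2f}")
```

The ratios stay close to 1.4, so under the stated assumption the transistor count per area unit roughly doubles at each node.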

5 Evolution of hierarchies
[Figure: memory hierarchies of successive CPUs. Panel labels: 386 (CPU, cache, memory); Pentium (L1-D, L1-I, L2 cache, memory); Pentium 4 (L1-D, L1-I, unified L2, memory); Power6 (2 cores, L1-ID, L2, L3 cache, memory); Power 8 (8 cores, L1-ID, L2, L3, L4, memory, NUCA cache). Annotation: latency & bandwidth]

6 Improving performance: increase F (technological nodes)
- Gate delay /1.4 => higher clock frequencies
- Increasing F increases performance
- Increasing F also increases power and power density: in $P = P_{static} + P_{dynamic}$, the dynamic term grows with frequency
- HEAT WALL
- 2017: F in the 3-4 GHz range, except the water-cooled IBM z14 CPU (5.2 GHz)

7 Improving performance: ILP in mono-processor CPUs
- Scalar CPU (IPC < 1): pipeline and superpipeline
- Superscalar CPU (IPC > 1): in-order CPUs, out-of-order CPUs
- VLIW CPU (IPC > 1)
[Figure: out-of-order pipeline with fetch, decode, register renaming, instruction window (ROB, reservation stations), functional units (FU) and retirement]
[Figure: reorder buffer size, integer physical registers and µops/cycle for Pentium Pro, Pentium 4, Sandy Bridge and Haswell]
- Intrinsic limits of ILP in a sequential program and diminishing hardware returns
- Larger buffers to extract µops to launch, but the launch width remains constant: 3 to 4 µops/clock


8 Improving performance: data parallelism in mono-processor CPUs
- CPU SIMD: 1 instruction with several data (SSE2/3/4, Neon, Altivec, AVX, AVX2, AVX-512)
[Figure: SIMD operation. Source 1/dest. holds X3 X2 X1 X0 and source 2 holds Y3 Y2 Y1 Y0; the result written to source 1/dest. is X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0]
- GPU SIMT: 1 instruction for several threads
[Figure: instruction decoder and warp scheduler driving several threads, each with its own registers]
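The lane-wise behaviour in the SIMD figure can be modelled in a few lines. This is only a semantic sketch in plain Python: it shows what one SIMD instruction computes on four lanes and does not use real SIMD hardware. The helper name `simd_op` and the lane values are hypothetical.

```python
import operator


def simd_op(op, source1, source2):
    """Apply the same operation to every pair of lanes, as one SIMD
    instruction would. Plain Python model, no hardware SIMD involved."""
    if len(source1) != len(source2):
        raise ValueError("both sources must have the same number of lanes")
    return [op(x, y) for x, y in zip(source1, source2)]


# Four lanes, listed as X0..X3 and Y0..Y3.
x = [10, 20, 30, 40]
y = [1, 2, 3, 4]

# One "instruction" produces X0 op Y0 ... X3 op Y3.
result = simd_op(operator.add, x, y)
print(result)  # [11, 22, 33, 44]
```

A scalar CPU would need four add instructions for the same work, which is how SIMD reduces the instruction count.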

9 CPU + GPU
- Two different programming models
- Two chips or one chip?
[Figure: CPU with CPU memory and GPU with GPU memory connected through PCI Express (CPU + GPU), or APU; label "2014"]
CNAM seminar 15/12/2015 Daniel Etiemble


10 Improving performance: data and/or thread parallelism in parallel architectures
Dispatch IC among several processors (or cores)
- Except for simple cases or embarrassingly parallel applications, the dispatch depends on the architecture, the programming model, Amdahl's law, etc.
- In some architectures, communication times must be included.
OpenMP, Pthreads, MPI; multiprocessors, multi-computers
Switching from sequential to parallel programming
- Limited to servers and supercomputers during the "free lunch" period
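Amdahl's law, mentioned above, bounds the benefit of dispatching work among cores. The sketch below uses the usual formulation, speedup = 1 / ((1 - p) + p / n). The function name, the parallel fraction p = 0.9 and the core counts are hypothetical values chosen for the example, and communication times are ignored.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law when a fraction of the work
    is spread evenly over `cores` cores and the rest stays sequential."""
    if not 0.0 <= parallel_fraction <= 1.0 or cores < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and cores >= 1")
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / cores)


# Hypothetical program with 90% of its work parallelizable.
for cores in (2, 8, 64, 1024):
    print(f"{cores:5d} cores: speedup x{amdahl_speedup(0.9, cores):.2f}")
```

With 10% of sequential work the speedup can never reach 10, however many cores are added.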

11 Free lunch (by Intel): F, IPC, IC (SIMD)

12 From mono-processors to multi-cores
- Performance-power trade-off
- Intrinsic limitations of mono-processors, even multithreaded
- Multi-processors to multi-cores
- Clusters of multicores


13 Multithreaded CPUs
- Sequential programs
- Several programs (multiprogramming): $IC = \sum_i IC_i$
- Several threads (TLP): $IC = \sum_i IC_i$
Fine-grain multithreading
- Switching in one clock from one thread to the next one on pipeline hazards
- Sun Niagara, Oracle servers
Simultaneous multithreading
- Issuing instructions from different threads at each clock
- Intel Hyperthreading (2 threads per core), IBM Power (2 to 8 threads per core)
Multithreading reduces $CPI_{Mem}$
Multi-cores with multithreaded cores

14 Reducing static power dissipation
- TECHNOLOGY. Ex: Intel Tri-gate
- CIRCUITRY. Ex: virtual Vss, 65-nm Xeon CPU L3

15 Reducing dynamic power dissipation
- Several operating modes per block. Ex: 5th generation Intel Core
- Power domains: $V_{dd}$
- Clock domains: F
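The reason why both the supply voltage and the clock frequency are managed per domain can be seen with the usual first-order model of CMOS dynamic power, $P_{dynamic} = \alpha C V_{dd}^2 F$. This formula is the textbook model and is an assumption here, since the slide's own equation did not survive extraction. The function name, the activity factor, the capacitance and the operating points below are hypothetical.

```python
def dynamic_power(activity, capacitance_f, vdd, freq_hz):
    """First-order CMOS dynamic power: alpha * C * Vdd^2 * F (textbook model)."""
    return activity * capacitance_f * vdd ** 2 * freq_hz


# Hypothetical block: activity factor 0.2, 1 nF switched capacitance.
nominal = dynamic_power(0.2, 1e-9, vdd=1.0, freq_hz=3.0e9)

# Low-power operating mode: Vdd and F both reduced by 20%.
low_power = dynamic_power(0.2, 1e-9, vdd=0.8, freq_hz=2.4e9)

ratio = low_power / nominal
print(f"dynamic power in low-power mode: {ratio:.3f} of nominal")
```

Because the voltage enters squared, lowering $V_{dd}$ and F together by 20% cuts dynamic power roughly in half in this model.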

16 What about the future?
- Moore's law: towards fundamental limits
- Execution time:
  - F: significant changes are doubtful
  - IPC: limits of ILP in cores. New PIM architectures? (Data access issues: "Memory Wall")
  - IC continues to decrease: more work per instruction
    - SIMD width ( ) and data size (16-bit FP)
    - New 2D instructions: Tensor cores (Nvidia), Matrix Multiplication Unit (Google TPU)
  - More cores: from multi-cores to many-cores. Exponential increase of core number???
- Power dissipation: as long as CMOS is used

17 Concluding remarks
- Main trends (not details) of CPU evolution can be explained by
  - Moore's law
  - $T_{exec} = IC \times (CPI_{Computing} + CPI_{Mem}) \times T_c = IC \times CPI \times T_c$
  - $P = P_{static} + P_{dynamic}$
- Valid for software-programmable processors as long as CMOS technology is used.
- Mixed HW-SW architectures (FPGA) are more complex to model.
- Evolution of CPU architectures is driven by new specific applications (AI, IoT, ...).
- Business (as usual). Ex: proprietary versus open-source ISAs.

45-year CPU evolution: one law and two equations

Daniel Etiemble, LRI-CNRS, University Paris Sud, Orsay, France, de@lri.fr

Abstract: Moore's law and two equations make it possible to explain the main trends of CPU evolution