Advanced Processor Architecture. Computer Systems Laboratory, Sungkyunkwan University

1 Advanced Processor Architecture
Jin-Soo Kim (jinsookim@skku.edu)
Computer Systems Laboratory, Sungkyunkwan University
http://csl.skku.edu

2 Modern Microprocessors
More than just GHz
[Table: CPU, clock speed, SPECint2000, SPECfp2000. Rows: Athlon 64 FX; Pentium 4 Extreme Edition, 3.46 GHz; Pentium 4 Prescott, 3.8 GHz; Opteron; Itanium2 9MB, 1.6 GHz; Pentium M; POWER5, 1.9 GHz; SPARC64 V, 1.89 GHz; Athlon; Alpha 21264C, 1.25 GHz]

3 Pipelining
Sequential execution [Figure: instructions vs. clock cycles]
Pipelining (RISC) [Figure: instructions vs. clock cycles]
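The difference between the two diagrams on this slide can be expressed as a cycle count. The following is a minimal sketch, assuming an ideal pipeline in which every stage takes one clock cycle and nothing stalls; the function names and the 5-stage, 100-instruction example are illustrative and do not come from the slides.

```python
def sequential_cycles(num_instructions: int, num_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Ideal pipeline (assumed: no stalls): the first instruction needs
    num_stages cycles, then one instruction completes per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5  # hypothetical instruction count and pipeline depth
    seq = sequential_cycles(n, k)
    pipe = pipelined_cycles(n, k)
    print(f"sequential: {seq} cycles")
    print(f"pipelined:  {pipe} cycles")
    print(f"speedup:    {seq / pipe:.2f}x")
```

Under these assumptions the speedup approaches the number of stages as the instruction count grows; real pipelines fall short of that because of dependencies and mispredicted branches.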

4 Superpipelining
Superpipelining
» Subdivide each pipeline stage
» Higher clock speed
» in Athlon, 12+ in Pentium Pro/II/III, 14 in UltraSparc-III, in PowerPC G5, 20+ in Pentium 4
[Figure: instructions vs. clock cycles]

5 Superscalar
Superscalar
» The execution stage has a bunch of different functional units.
» Execute multiple instructions in parallel
» Pentium: 2-way superscalar
[Figure: instructions vs. clock cycles. Stage labels: fetch; decode & dispatch; int, wb; float 1, float 2, float 3, wb; test, branch; address, mem 1, mem 2, wb]
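To make the idea of dispatching to several functional units concrete, here is a small sketch that counts how many cycles it takes to issue a stream of instructions. It is a toy model under stated assumptions, not a description of the Pentium: issue is in order, data dependencies are ignored, each unit accepts one new instruction per cycle, and the unit names (`int`, `float`, `branch`, `mem`) are borrowed from the figure labels.

```python
def issue_cycles(instr_kinds, width=2, units=None):
    """Count the cycles needed to issue instr_kinds in program order.

    instr_kinds: sequence of unit names, one per instruction.
    width: maximum instructions issued per cycle (2 = 2-way superscalar).
    units: how many units of each kind exist (assumed: one of each).
    """
    if units is None:
        units = {"int": 1, "float": 1, "branch": 1, "mem": 1}
    cycles = 0
    i = 0
    while i < len(instr_kinds):
        free = dict(units)  # every unit can accept a new instruction
        issued = 0
        while i < len(instr_kinds) and issued < width:
            kind = instr_kinds[i]
            if free.get(kind, 0) == 0:
                break  # unit busy: in-order issue stalls until next cycle
            free[kind] -= 1
            issued += 1
            i += 1
        if issued == 0:
            raise ValueError(f"no functional unit for {instr_kinds[i]!r}")
        cycles += 1
    return cycles


if __name__ == "__main__":
    stream = ["int", "float", "int", "mem", "mem", "mem"]
    print("scalar (1-way):", issue_cycles(stream, width=1), "cycles")
    print("2-way superscalar:", issue_cycles(stream, width=2), "cycles")
```

The gain depends on the instruction mix: two consecutive instructions that need the same unit cannot issue together in this model.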

6 Superpipelined Superscalar (1)
Superpipelining + Superscalar
» 2-way: MIPS R
» 3-way: PowerPC G3/G4, Pentium Pro/II/III/M/4, Athlon
» 4-way: UltraSparc, MIPS R10000, PowerPC G4e, Alpha & 21264, Core 2 Duo
» 5-issue: PowerPC G5
[Figure: instructions vs. clock cycles]

7 Superpipelined Superscalar (2)
Tackling instruction dependencies
Branch prediction + speculative execution
» Mispredict penalty: cycles in Pentium Pro/II/III
Instruction scheduling
In-order execution + compiler optimization
» Rearrange the instructions at compile time.
» Compiler can see further down the program than the hardware.
» SuperSparc, HyperSparc, UltraSparc, Alpha &
Out-of-order execution
» Reorder the instruction execution sequence in hardware at runtime.
» Register renaming reduces the dependency further.
» MIPS R10000, Alpha 21264, POWER/PowerPC, Pentium Pro, Pentium 4, Core 2 Duo
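The point about register renaming can be illustrated with a short sketch. It assumes a simplified instruction format of `(destination, sources)` and an unlimited pool of physical registers named `p0`, `p1`, and so on; these names, the helper functions, and the example program are hypothetical and are not taken from any of the processors listed above. Every write receives a fresh physical register, which removes write-after-write and write-after-read conflicts while leaving true read-after-write dependencies in place.

```python
def rename(instructions):
    """Map architectural registers to fresh physical registers.

    instructions: list of (dest, (src, ...)) tuples in program order.
    """
    mapping = {}      # architectural register -> current physical register
    next_phys = 0
    renamed = []

    def fresh():
        nonlocal next_phys
        name = f"p{next_phys}"
        next_phys += 1
        return name

    for dest, srcs in instructions:
        for s in srcs:
            if s not in mapping:      # value produced before this snippet
                mapping[s] = fresh()
        new_srcs = tuple(mapping[s] for s in srcs)  # read the latest version
        mapping[dest] = fresh()                     # each write gets a new name
        renamed.append((mapping[dest], new_srcs))
    return renamed


def false_dependencies(instructions):
    """Count WAW and WAR pairs, i.e. conflicts caused only by register names."""
    total = 0
    for i, (dest_i, srcs_i) in enumerate(instructions):
        for dest_j, _ in instructions[i + 1:]:
            if dest_j == dest_i:      # write after write
                total += 1
            if dest_j in srcs_i:      # write after read
                total += 1
    return total


# Hypothetical four-instruction program.
PROGRAM = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),   # truly depends on the first instruction
    ("r1", ("r6", "r7")),   # reuses r1: WAW with 1st, WAR with 2nd
    ("r5", ("r1", "r4")),   # overwrites r5, which the 2nd still reads
]

if __name__ == "__main__":
    print("before renaming:", false_dependencies(PROGRAM))
    print("after renaming: ", false_dependencies(rename(PROGRAM)))
```

With the false dependencies gone, the third instruction in this example no longer has to wait for the first two, which gives out-of-order hardware more freedom to reorder.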

8 Superpipelined Superscalar (3)
ILP vs. Clock speed
[Figure: SPECint_peak vs. Frequency (MHz) for Alpha, Athlon, Opteron, PA-RISC, Pentium III/4, POWER, MIPS, SPARC64, SPARC (Sun), Xeon, Itanium]

9 Superpipelined Superscalar (4)

10 VLIW
Very Long Instruction Word
» Compilers explicitly group instructions to be executed in parallel
» Long instructions: often 128 bits or more
» Easier to design CPUs: simpler decode/dispatch stage
» Intel IA-64: Itanium I/II
[Figure: instructions vs. clock cycles]
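The compiler-side grouping described on this slide can be sketched as a greedy packer. This is only an illustration under simplifying assumptions: instructions are `(destination, sources)` tuples, the bundle width of 3 is a parameter chosen for the example, program order is preserved, and empty slots are padded with `None` standing in for a no-op. A production compiler would also reorder instructions to fill slots, which this sketch does not attempt.

```python
def bundle(instructions, width=3):
    """Greedily pack instructions, in order, into fixed-width bundles.

    Two instructions may share a bundle only if neither reads or writes
    a register that the other writes.
    """
    bundles = []
    current = []
    for dest, srcs in instructions:
        conflict = any(
            dest == d          # both write the same register
            or dest in s       # this one overwrites a register the other reads
            or d in srcs       # this one reads a register the other writes
            for d, s in current
        )
        if conflict or len(current) == width:
            bundles.append(current)
            current = []
        current.append((dest, srcs))
    if current:
        bundles.append(current)
    # Pad short bundles with None (a no-op slot).
    return [b + [None] * (width - len(b)) for b in bundles]


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),
        ("r4", ("r5", "r6")),
        ("r7", ("r1", "r4")),   # needs r1 and r4, so it starts a new bundle
        ("r8", ("r9", "r10")),
    ]
    for number, b in enumerate(bundle(program)):
        print(number, b)
```

Because the dependency check happens here rather than in hardware, the processor can issue every slot of a bundle without inspecting it, which is the simpler decode/dispatch stage the slide refers to.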

11 SMT (1)
Limitations in instruction-level parallelism
» Instruction dependencies
» Branch mispredictions
» Cache misses
» Interrupts and exceptions
Simultaneous multithreading (SMT)
[Figure: instructions from Thread 1 and Thread 2 vs. clock cycles]

12 SMT (2)
Intel Hyper-Threading Technology
» Utilizes thread-level parallelism
» Fill pipelines with the instructions from multiple threads running at the same time.
» An SMT processor appears as if it were multiple independent processors.
» Uses processor resources more effectively
[Figure: logical processor 0 and logical processor 1, each with its own architectural state (registers), sharing the execution units, cache(s), and system bus]
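A rough way to see why filling the pipeline from several threads uses resources more effectively is to count occupied issue slots. The sketch below is a toy model under stated assumptions: each thread is described only by how many instructions it could issue in each cycle (zeros stand for stalls such as cache misses), unissued instructions are not carried over to later cycles, and threads are served in a fixed priority order. The numbers are made up for illustration.

```python
def slot_utilization(ready_by_thread, width):
    """Fraction of issue slots that are filled.

    ready_by_thread: one list per thread; entry c is the number of
        instructions that thread could issue in cycle c.
    width: issue slots available per cycle.
    """
    cycles = len(ready_by_thread[0])
    used = 0
    for c in range(cycles):
        free = width
        for thread in ready_by_thread:
            take = min(free, thread[c])  # fill remaining slots from this thread
            free -= take
            used += take
    return used / (width * cycles)


if __name__ == "__main__":
    thread1 = [2, 0, 1, 0, 3, 1]   # hypothetical per-cycle ready counts
    thread2 = [1, 2, 0, 2, 1, 0]
    print(f"thread 1 alone: {slot_utilization([thread1], 4):.0%}")
    print(f"SMT, 2 threads: {slot_utilization([thread1, thread2], 4):.0%}")
```

In this model the second thread fills slots that the first leaves empty, so utilization rises without adding execution units; only the per-thread architectural state is duplicated.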

13 Multi-core (1)
Multi-core
» Put two or more processor cores onto a single chip
» Previously called CMP (Chip Multiprocessor)
Examples
» AMD Opteron: dual-core (Apr. 2005)
» AMD Athlon 64 X2: dual-core (May 2005)
» Intel Core Duo, Core 2 Duo, Xeon: dual-core
» Sun UltraSparc T1: eight-core, 32 threads (Nov. 2005)
» Intel Core 2 Extreme Quad-core processor (Sep. 2006)
» Intel Xeon X7460 Six-core processor (Sep. 2008)
» Microsoft's Xbox 360: triple-core PowerPC microprocessor

14 Multi-core (2)

15 Multi-core (3)
Memory wall
» CPU 55%/year, Memory 10%/year (1986 ~ 2000)
» Caches show diminishing returns.
ILP wall
» Control dependency
» Data dependency
Power wall
» Dynamic power ∝ Frequency³
» Static power ∝ Frequency
» Total power ∝ The number of cores

16 Multi-core (4)
[Figure, source: Intel. Relative performance and power]
» Single core: 1.00x performance, 1.00x power
» Raise clock (20%): 1.13x performance, 1.73x power
» Lower clock (20%): 0.87x performance, 0.51x power
» Dual core: 1.73x performance, 1.02x power (more MIPS/watt)
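The power figures on this slide are consistent with the power-wall relations from the previous slide. The sketch below assumes that dynamic power scales with the cube of frequency, that total power scales with the number of cores, and that static power can be ignored; it also assumes the dual-core configuration runs at the lowered clock, which the arithmetic suggests but the extracted slide text does not state outright. The performance values are copied from the slide, not computed.

```python
def relative_power(freq_scale: float, cores: int = 1) -> float:
    """Power relative to one core at the base clock.

    Assumes dynamic power is proportional to frequency cubed and total
    power is proportional to the core count (static power ignored).
    """
    return cores * freq_scale ** 3


# (label, frequency scale, cores, relative performance quoted on the slide)
CONFIGS = [
    ("single core",                1.0, 1, 1.00),
    ("raise clock 20%",            1.2, 1, 1.13),
    ("lower clock 20%",            0.8, 1, 0.87),
    ("dual core, lower clock 20%", 0.8, 2, 1.73),
]

if __name__ == "__main__":
    for label, scale, cores, perf in CONFIGS:
        power = relative_power(scale, cores)
        print(f"{label:28s} power {power:.2f}x  "
              f"performance {perf:.2f}x  perf/power {perf / power:.2f}")
```

Under these assumptions the model gives 1.73x, 0.51x, and 1.02x power for the three modified configurations, matching the slide: two slower cores deliver about 1.7 times the performance of one core at nearly the same power.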

17 Microprocessor Design
CISC, RISC, Superscalar
» Instructions take variable times to complete
» Simple instructions, optimized for speed
» Same individual instruction latency
» Greater throughput through instruction overlap
» Multiple instructions executing simultaneously
SMT, VLIW, Multi-core
» Additional hardware resources (regs, PC, SP)
» Each context gets processor for x cycles
» Superinstructions grouped together
» Decreased hardware control complexity
» Need a good compiler
» Duplicate entire processors
