Intel Released a New Technology Called P6


Allan Williams

1 P6 and IA
- The 8086 was released in 1978.
- The Pentium was upgraded through pipelining, superscalar execution, higher clock frequency, cache and so on.
- But the 8086 design has limits, and it is hard to improve its efficiency further.
- Intel released a new technology called P6.

P6
- The Pentium's L2 cache problem: the Pentium reaches both the cache and main memory through the external bus.
- Pipeline stall: (1) the prefetcher reads the cache and (2) the execution unit reads data from main memory, and both use the same bus.

2 P6
- P6 moves the L2 cache into the same package as the CPU.
- Pentium Pro and Pentium II: 512 KB cache on a die separate from the CPU (including the Slot 1 Pentium III).
- Celeron: 128 KB cache integrated on die.
- Pentium III (Coppermine): 256 KB cache integrated on die (28 million transistors on the Pentium III die).
- Xeon: 2 MB cache integrated on die.

DIB
- DIB (Dual Independent Bus): FSB and BSB.
- Cache bus (Back Side Bus) = 64 bit; 256 bit in the Pentium III (Coppermine).
- BSB speed is higher than the mainboard's bus speed.
- Pentium II (including the Slot 1 Pentium III): BSB speed = ½ CPU speed.
- Pentium Pro, Celeron and Pentium III (Coppermine): BSB speed = CPU speed.
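The benefit of the Dual Independent Bus over the Pentium's single shared bus can be pictured with a toy cycle count. This is only a sketch: the function names and the per-access costs are assumptions made for illustration, not figures from the slides.

```python
def shared_bus_cycles(cache_accesses, memory_accesses, cache_cost=1, memory_cost=3):
    """One external bus: cache traffic and memory traffic are serialized."""
    return cache_accesses * cache_cost + memory_accesses * memory_cost


def dual_bus_cycles(cache_accesses, memory_accesses, cache_cost=1, memory_cost=3):
    """BSB + FSB: the two streams overlap, so the slower stream bounds the total."""
    return max(cache_accesses * cache_cost, memory_accesses * memory_cost)


# Hypothetical workload: 100 cache reads by the prefetcher, 40 memory reads by the execution unit.
print(shared_bus_cycles(100, 40), dual_bus_cycles(100, 40))
```

The model ignores everything except bus contention, which is the one effect the DIB slide is about.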

3 BSB and FSB
- A 128 KB cache has a hit rate > 90%.
- The BSB gives the CPU a dedicated cache bus, so the CPU clock is independent of the mainboard clock.
- (September 2000) CPU speed is 1.13 GHz, but the mainboard clock = 133 MHz.
- FSB (Front Side Bus) = the bus on the mainboard that connects the CPU with I/O and main memory.
- FSB speed = 66, 100 and 133 MHz.

P6 Architecture
- Separate caches: L1 to L2 via the BSB, L1 to memory via the FSB.
- L1 cache = 32 KB: instruction cache 16 KB, data cache 16 KB.
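Two numbers on this slide lend themselves to a quick calculation: the ratio between the 1.13 GHz core and the 133 MHz mainboard clock, and what a 90% hit rate means for the average access time. The sketch below assumes the standard weighted-average formula; the latencies of 5 ns and 60 ns are hypothetical values chosen only to make the arithmetic concrete.

```python
def clock_multiplier(cpu_mhz, board_mhz):
    """Ratio between the CPU core clock and the mainboard (FSB) clock."""
    return cpu_mhz / board_mhz


def average_access_time(hit_rate, cache_ns, memory_ns):
    """Average time per access when a fraction hit_rate is served by the cache."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


multiplier = clock_multiplier(1130, 133)

# Hypothetical latencies: 5 ns for a cache hit over the BSB,
# 60 ns for a main-memory access over the FSB.
with_cache = average_access_time(0.90, 5.0, 60.0)
without_cache = average_access_time(0.0, 5.0, 60.0)
print(round(multiplier, 1), round(with_cache, 1), round(without_cache, 1))
```

With these assumed latencies, the cache cuts the average access time to a small fraction of the uncached figure, which is why the core can run many times faster than the mainboard clock.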

4 P6 Architecture
Dynamic Execution Microarchitecture
- Fetch / Decode Unit
- Dispatch / Execute Unit
- Retire Unit
- Instruction Pool
Dynamic Execution
- Multiple Branch Prediction
- Dynamic Data Flow Analysis
- Speculative Execution

P6 Architecture
Multiple Branch Prediction
- Concept taken from mainframes.
- Uses multiple pipelines for call or return instructions.
- The Fetch/Decode unit is used to find branch instructions.
Dynamic Data Flow Analysis
- Analyzes and searches for instructions that can execute out of order.
- The Dispatch/Execute unit scans and sorts instructions to maximize usage of the execution units.
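The slide does not describe how the P6 predictor works internally. As a generic illustration of branch prediction, here is the textbook 2-bit saturating counter scheme; it is an assumption chosen for clarity, not the actual P6 design, and the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Textbook 2-bit saturating counter per branch address (not the P6 design)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, address):
        # Counter values 2 and 3 mean "predict taken"; new branches start at 1.
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        counter = self.counters.get(address, 1)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.counters[address] = counter


def accuracy(outcomes, address=0x40):
    """Fraction of outcomes predicted correctly for a single branch."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(outcomes)


# A loop branch that is taken 9 times and then falls through once.
loop_branch = [True] * 9 + [False]
print(accuracy(loop_branch))
```

The predictor misses only while warming up and at the loop exit, which is the behaviour that lets the Fetch/Decode unit keep fetching past branches.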

5 P6 Architecture
Speculative Execution
- The Dispatch/Execute unit is used to analyze instructions.
- It executes instructions ahead of time and sends them to the instruction pool.
- Results are kept in temporary registers.
- The Retire unit finds instructions that have been executed out of order (no branch), commits and confirms their results in the registers, then deletes them from the pool.
- These 3 techniques make P6 a non-sequential CPU.

Pentium Pro
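The combination of out-of-order execution and in-order retirement described on slides 4 and 5 can be sketched with a small dataflow model. This is a simplification under stated assumptions: unlimited execution units, no branches, and hypothetical instruction latencies; the function name `schedule` is made up for the example.

```python
def schedule(program):
    """program: list of (name, dest_register, source_registers, latency).

    Returns (finish, retire): the cycle each instruction finishes executing
    and the cycle its result is committed in program order.
    """
    ready_at = {}  # register -> cycle at which its latest value is available
    finish = []
    for name, dest, sources, latency in program:
        # Dataflow rule: start as soon as all source operands are ready.
        start = max((ready_at.get(src, 0) for src in sources), default=0)
        end = start + latency
        ready_at[dest] = end  # result waits in a temporary register
        finish.append(end)

    retire = []
    last = 0
    for end in finish:
        # Retirement is in program order: never before an older instruction.
        last = max(last, end)
        retire.append(last)
    return finish, retire


program = [
    ("load", "r1", [], 4),      # slow memory read
    ("add", "r2", ["r1"], 1),   # depends on the load
    ("mul", "r3", ["r4"], 1),   # independent, can run ahead
]
finish, retire = schedule(program)
print(finish, retire)
```

The independent `mul` finishes first but is committed last, which mirrors the role of the Retire unit: results become architecturally visible only in the original program order.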


6 P6 Architecture
- P6 is the next evolution of Intel's CPUs.
- No more 80x86 core: the P6 core is RISC.
- All instructions are redesigned on the RISC core.
- Backward compatible by mapping 80x86 instructions to RISC commands.
- Improved branch prediction.

Pentium Pro
- First P6 architecture.
- Short life cycle, only a few Pentium Pro series.
- Speed 150, 166, 180 and 200 MHz.
- L1 cache ( )
- L2 cache 256 and 512 KB in the same package.
- L2 cache 1 MB at 200 MHz.
- Pentium Pro with 1 MB L2 cache.
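The mapping of 80x86 instructions to RISC commands mentioned on this slide can be pictured as a decoder that splits a memory-operand instruction into simple load, compute and store steps. The sketch below is illustrative only: the micro-op names, the temporary register names and the tuple format are assumptions, not Intel's actual micro-op encoding.

```python
def is_memory(operand):
    """Treat operands written in brackets, such as [counter], as memory references."""
    return operand.startswith("[") and operand.endswith("]")


def decode(op, dest, src):
    """Split one two-operand x86-style instruction into RISC-like micro-ops."""
    uops = []
    if is_memory(src):
        # Memory source: load it into a temporary register first.
        uops.append(("load", "tmp_src", src))
        src = "tmp_src"
    if is_memory(dest):
        # Memory destination: load, operate on a temporary, then store back.
        uops.append(("load", "tmp_dst", dest))
        uops.append((op, "tmp_dst", src))
        uops.append(("store", dest, "tmp_dst"))
    else:
        uops.append((op, dest, src))
    return uops


print(decode("add", "eax", "ebx"))
print(decode("add", "[counter]", "eax"))
```

A register-to-register instruction maps to a single micro-op, while a read-modify-write on memory becomes three, each simple enough for a RISC-style core to execute.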

7 Pentium II
- Pentium II = Pentium Pro + MMX.
- Speed 233, 266, 300 and 333 MHz; package S.E.C.C. (Slot 1); FSB 66 MHz.
- L1 cache; L2 cache 512 KB.
- FSB 100 MHz: speed 350, 400 and 450 MHz.
- L2 cache 2 MB: named Pentium II Xeon (cache speed = CPU speed); package S.E.C.C. 2 (Slot 2).

Celeron
- Celeron = Pentium II but with lower throughput (same core).
- Speed 266 and 300 MHz: no L2 cache; L1 cache; FSB 66 MHz.
- L2 cache 128 KB (cache speed = CPU speed): speed 300A, 333, 366, 400, 433, 466 and 500 MHz; FSB 66 MHz.

8 Celeron
- Package PPGA (Plastic Pin Grid Array), 370 pins.
- Package FC-PGA (+ SSE): change to 0.18 micron, core 1.5 VDC.

Pentium III
- SSE: 3D speed is upgraded by adding new instructions, the Streaming SIMD Extensions (SSE).
- Can bypass the L2 cache.
- Processor Serial Number.
- L1 cache
- L2 cache 512 KB (Coppermine: cache 256 KB).

9 Pentium 4
In the P6 architecture
- Speed upgraded from 150 MHz to 1.13 GHz.
- Process technology changed from 0.5 to 0.25 and 0.13 micron.
- VCC lowered from 3.3 to 2.2 and 1.5 V.

Pentium 4: same core as the Pentium III, but many things have changed
- Bus raised to 200 MHz and 400 MHz DDR (Double Data Rate).
- Double clock speed in the integer ALU (< 1 clock / instruction).
- Added Execution Trace Cache (keeps translated micro-ops).
- Upgraded pipeline and branch prediction from P6.
- SSE Extension 2 (144 new instructions).
- Floating point 128 bit.
- Dynamic Execution: instruction pool enlarged from 40 micro-ops to 100 micro-ops.
- Execution Trace Cache + Dynamic Execution: all loops work in the instruction pool.
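The idea behind the Execution Trace Cache, keeping already-translated micro-ops so that a loop is decoded only once, can be sketched as a simple lookup table in front of a decoder. This is a conceptual model only: the class name, the address-keyed dictionary and the stand-in decoder are assumptions, and a real trace cache stores whole traces with limited capacity.

```python
def decode_stub(raw_instruction):
    """Stand-in decoder: split the text into pieces to mimic micro-op translation."""
    return raw_instruction.split()


class TraceCache:
    """Keeps decoded micro-ops by address so repeated code skips the decoder."""

    def __init__(self, decoder):
        self.decoder = decoder
        self.traces = {}       # instruction address -> decoded micro-ops
        self.decode_count = 0  # how often the slow decoder was actually used

    def fetch(self, address, raw_instruction):
        if address not in self.traces:
            self.decode_count += 1
            self.traces[address] = self.decoder(raw_instruction)
        return self.traces[address]


# Hypothetical 3-instruction loop body executed 100 times.
loop_body = {0x100: "add eax 1", 0x103: "cmp eax 100", 0x106: "jl 0x100"}
cache = TraceCache(decode_stub)
for _ in range(100):
    for address, text in loop_body.items():
        cache.fetch(address, text)
print(cache.decode_count)
```

After the first iteration every fetch is served from the cache, which is the sense in which loops can run entirely out of already-decoded micro-ops.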

10 AMD K5
- 5-stage pipeline
- Superscalar technique
- Branch prediction
- Dynamic execution
- The architecture is the same as the Pentium, but the Pentium pipeline is better.

11 K6-III
- P6 architectures are better than the K6-III.

K7 (Athlon)

12 Crusoe
- Intel and AMD structure: RISC core + 80x86 shell, mapping 80x86 instructions to RISC core instructions.
- Crusoe: CPU from Transmeta.
- Uses software to help the hardware work.
- Instructions are translated by Code Morphing software rather than by hardware.

13 Crusoe
- Code Morphing software.
- 128-bit VLIW (Very Long Instruction Word): 4 instructions.
- 4 execution units: integer, floating point, load/store and branch.
- Crusoe TM register
- Instruction cache 64 KB, data cache 64 KB, L2 cache 256 KB.
- Speed MHz
- Low power consumption: 1/3 of a Pentium III.
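How a 128-bit VLIW word carries up to 4 instructions, one per execution unit, can be illustrated with a greedy packer. This is a sketch under simplifying assumptions: it keeps program order and ignores data dependencies between operations, and the unit names and the function `pack` are hypothetical, not Transmeta's actual scheduling algorithm.

```python
UNITS = ("integer", "float", "load_store", "branch")


def pack(operations):
    """Greedily pack (name, unit) pairs into VLIW words with one slot per unit.

    A new word is started whenever the unit needed by the next operation
    is already occupied in the current word.
    """
    words = []
    current = {}
    for name, unit in operations:
        if unit not in UNITS:
            raise ValueError(f"unknown execution unit: {unit}")
        if unit in current:
            words.append(current)
            current = {}
        current[unit] = name
    if current:
        words.append(current)
    return words


operations = [
    ("add", "integer"),
    ("ld", "load_store"),
    ("fmul", "float"),
    ("sub", "integer"),   # integer slot already used: starts a second word
    ("jmp", "branch"),
]
for word in pack(operations):
    print(word)
```

In a VLIW design this packing is done ahead of execution by software, which is the part of the work that Code Morphing takes over from the hardware.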

14 CPU Compare
- IA-64: Intel + HP.
- CISC and RISC processors: RISC core + CISC.
- RISC processors: PowerPC, Alpha, SPARC, MIPS.

CPU problems
- Jumps: branch prediction.
- Memory reads: cache and prefetch queue.
- > 1 instruction/clock: superscalar.

15 Merced (Itanium)
- EPIC (Explicitly Parallel Instruction Computing)
- 128 general registers
- 128 floating-point registers
- Parallel processing units
- VLIW (Very Long Instruction Word), 128 bit ( 41 X )
- Compiler optimization
- Branch removal
- Predicate registers (64)
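Branch removal with predicate registers can be illustrated by rewriting an if/else so that both paths are computed and a predicate selects the result. The Python below only mimics the idea; the function names are hypothetical, and real IA-64 code would attach the predicates to individual instructions emitted by the compiler.

```python
def with_branch(a, b):
    """Conventional version: the CPU must predict which path is taken."""
    if a > b:
        return a - b
    else:
        return b - a


def predicated(a, b):
    """Branch-free version: both paths execute, predicates pick the result."""
    p = a > b          # plays the role of a predicate register
    not_p = not p      # complementary predicate
    r_then = a - b     # would be guarded by p
    r_else = b - a     # would be guarded by not_p
    # Only the result whose predicate is true contributes to the sum.
    return r_then * p + r_else * not_p


for a, b in [(7, 3), (3, 7), (5, 5)]:
    print(with_branch(a, b), predicated(a, b))
```

Both versions return the same value, but the second has no jump to mispredict, at the cost of doing the work for both paths.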


16 IA-64 Technique
- 64-bit processor, improved from the P6 architecture.
- VLIW
- Compiler optimization
- Speculation feature (reduces memory timing)
- 6 GFLOPS FPU
- 128 + 128 registers
- Supported by many software providers.
- IA-32 compatible (Virtual 8086 mode): IA-32 to IA-64 by a hardware translation mechanism.

Itanium
- March MHz
- 20 instructions / clock
- 3-level cache, 4 MB
- 320 million transistors: 25 million for the CPU, 295 million for the L3 cache.

17 Itanium
