IA-32 Architecture: Pentium III Processor New Features and Pentium 4 Processor New Features. Sunil Saxena, Principal Engineer, Intel Corporation


IA-32 Architecture
Sunil Saxena, Principal Engineer, Intel Corporation
September 11, 2000
Copyright 2000 Intel Corporation. Linux Supercluster Users Conference

Agenda
  - Pentium III Processor New Features
  - Pentium 4 Processor New Features
  - Pentium 4 Processor Micro-architecture

IA Processor Roadmap
  [Figure: performance roadmap across process generations (.18µ, .13µ). Processors shown: Pentium III Xeon processor, Itanium(TM) processor, Cascades, Foster, McKinley, Future IA, Madison (IA-64 Perf), Deerfield (IA-64 Price/Perf), ...]
  - Extends IA headroom, scalability and availability for the most demanding environments
  - Outstanding performance for 32-bit volume apps
  - Strong execution on Itanium processor, continued focus on the long term

Pentium III Processor

Pentium III Processor New Features
  - 36-bit Physical Addressing
    - Physical Address Extension - PAE-36
    - Page Size Extensions - PSE-36
  - Page Attribute Table
  - Fast Floating-point save/restore
  - New Instructions
  - New Exceptions

36-bit Addressing: PSE-36 and PAE-36
PSE-36
  - 4GB mapped through 4K of page directories and 4MB page tables
  - Memory above 4 GB is only accessible as 4 MB pages
  - Operating system can freely use both 4KB and 4MB pages without PDE structure change
  - All 4KB pages and page tables MUST reside below 4GB boundary
  - Reduces effort needed to develop & support changes in virtual memory subsystem
PAE-36
  - 4GB mapped through 16K of page directories and 16MB page tables
  - All memory accessible as 4KB or 2MB pages
  - OS needs to load PDEPTRs for mapping changes on writes to CR3
  - CONFIG_HIMEM to enable more than 4 GB physical memory

4KB Page Translation
  [Figure: the linear address selects a PDE in the Page Directory (1024 entries, located by CR3), then a PTE in the Page Table (1024 entries), then a byte within the 4-KB page]

4KB PAE Translation
  [Figure: the linear address selects a PDPE in the Page Directory Pointer Table (located by CR3), then a PDE in the Page Directory (512 entries), then a PTE in the Page Table (512 entries), then a byte within the 4-KB page]

4MB Translation
  [Figure: the linear address selects a PDE in the Page Directory (1024 entries, located by CR3), which maps a 4-MB page directly]
  - Page Directory Entry provides bits of physical address of 4MB page
  - (Bits are currently RESERVED)
  - Page Directory Entry provides bits of physical address of 4MB page (new)
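The three translation figures can be summarized as a field split of the 32-bit linear address. The following is a minimal sketch, assuming the standard IA-32 field widths (10/10/12 bits for 4KB pages, 2/9/9/12 bits under PAE, 10/22 bits for 4MB pages); the function name and the dictionary keys are hypothetical and chosen only for illustration.

```python
def split_linear_address(addr, pae=False, large_page=False):
    """Split a 32-bit linear address into paging indices (illustrative only)."""
    if not 0 <= addr < 2**32:
        raise ValueError("linear address must fit in 32 bits")
    if large_page and not pae:
        # 4MB page: 10-bit directory index, 22-bit offset into the page
        return {"pde": addr >> 22, "offset": addr & 0x3FFFFF}
    if large_page and pae:
        # 2MB page under PAE: 2-bit PDPT index, 9-bit directory index, 21-bit offset
        return {"pdpe": addr >> 30,
                "pde": (addr >> 21) & 0x1FF,
                "offset": addr & 0x1FFFFF}
    if pae:
        # 4KB page under PAE: 512-entry directories and tables (9-bit indices)
        return {"pdpe": addr >> 30,
                "pde": (addr >> 21) & 0x1FF,
                "pte": (addr >> 12) & 0x1FF,
                "offset": addr & 0xFFF}
    # Classic 4KB page: 1024-entry directory and table (10-bit indices)
    return {"pde": addr >> 22,
            "pte": (addr >> 12) & 0x3FF,
            "offset": addr & 0xFFF}


if __name__ == "__main__":
    print(split_linear_address(0xC0123456))
    print(split_linear_address(0xC0123456, pae=True))
```

The entry counts in the figures follow from the index widths: a 10-bit index addresses 1024 entries and a 9-bit index addresses 512.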

Example Using PSE-36
  [Figure: physical memory map from 0 to 8GB. CR3 locates a Page Directory of 4-byte entries; its entries either map 4MB pages directly or point to Page Tables of 4-byte entries that map 4K pages]
  - ONLY 4MB pages above 4GB
  - 4K & 4MB pages below 4GB

Page Attribute Table (PAT)
Physical Memory Attributes Described through the Page-Tables
  - Builds upon enhanced memory type capability provided via MTRRs in Pentium Pro processor
    - Relaxes MTRR alignment/length requirements
  - Builds upon PCD/PWT bits on IA-32 Architecture
    - These interact with effective memory type determination
PAT Architecture
  - PAT is an 8-entry table indexed via PCD, PWT, and Resv. bits
  - Allows up to 8 memory attributes defined by the page tables
  - PAT is always enabled when paging is used
  - Default table entries fully compatible with PCD/PWT/Resv settings
  - PAT entries R/W programmable via RDMSR/WRMSR (0x277)
    - 8 bits per entry; 3 bits for attribute with other bits reserved
    - Memory attributes as specified by Pentium Pro processor
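The PAT layout described above (8 entries of 8 bits in MSR 0x277, 3 attribute bits per entry, indexed by three page-table bits) can be sketched as a decoder. This is an illustration under stated assumptions: the reset value and the memory type encodings below are the commonly documented IA-32 ones rather than values given on the slide, and `decode_pat` and `pat_index` are hypothetical helper names.

```python
# Memory type encodings shared by the MTRR and PAT mechanisms (assumed, not from the slide)
MEMORY_TYPES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB", 7: "UC-"}

# Assumed power-on default of the PAT MSR (0x277)
PAT_RESET_VALUE = 0x0007040600070406


def decode_pat(msr_value):
    """Return the memory type name of each of the 8 PAT entries."""
    entries = []
    for i in range(8):
        attr = (msr_value >> (8 * i)) & 0x7  # low 3 bits of each byte; rest reserved
        entries.append(MEMORY_TYPES.get(attr, "reserved"))
    return entries


def pat_index(pat_bit, pcd, pwt):
    """Combine the formerly reserved (PAT) bit with PCD and PWT into a table index."""
    return (pat_bit << 2) | (pcd << 1) | pwt


if __name__ == "__main__":
    table = decode_pat(PAT_RESET_VALUE)
    for pcd in (0, 1):
        for pwt in (0, 1):
            print(f"PCD={pcd} PWT={pwt} -> {table[pat_index(0, pcd, pwt)]}")
```

With the assumed reset value, the four entries reachable with the third bit clear reproduce the older PCD/PWT behaviour, which is what "fully compatible" refers to.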

Page Attribute Table (PAT)
PAT Architecture (continued)
  - PAT memory types interact with MTRRs
    - As architecturally specified by Pentium Pro processor
    - Implementation specific combinations remain undefined
    - Should not be depended upon by system software
Precautions
  - OS uses Page Directory as a Page Table:
    - Restricted to using 4 lowest PAT entries
    - PAT bit 7 in 4K PTE is PS bit when used as a PDE
  - Memory type changes for pages require TLB invalidation
    - Follow procedure as when changing MTRRs: cache flush, TLB invalidation
  - PAT entries on multiple processors must be maintained in consistent manner by OS
    - All processors have same values in PAT

Page Attribute Table (PAT)
Precautions (continued)
  - Page aliasing
    - PAT maintains memory types according to linear addresses
    - Architecture allows OS to map single physical page with 2 linear addresses containing differing types
    - This may lead to undefined results and must be avoided
PAT Uses
  - Essentially unlimited MTRRs
    - Provide support for more devices (frame buffers, RAID cards, etc.) to map memory as WC
  - Allows mapping system memory for specific optimizations
    - Memory shared with 3D accelerator/CPU for textures
    - Reduce eviction, read-for-ownership bus transactions and cache thrashing for common operations such as memory fill

Fast Floating Point Save/Restore
  - These instructions minimize cost of saving/restoring Floating Point/MMX Technology state
    - Does NOT re-initialize the FPU state after saving
    - Performance improvements come from more natural format and alignment of the CPU state
  - State area is larger: 512 bytes
    - MUST be aligned on 16 byte boundary, else GP(0) fault
    - Use of reserved fields risks incompatibility with future Architecture processors
  - FXSAVE does not check for unmasked exceptions (i.e. like FNSAVE)
  - FXRSTOR does not fault when loading an image that contains pending exceptions

Pentium III New Instructions
  [Figure: Pentium III processors by architecture area. Core Architecture: Dynamic Execution. Floating Point Arch.: FP plus SIMD FP Instr. Multimedia Architecture: MMX Technology plus New Media Instr. Memory Architecture: P6 bus plus WC I/O and Streaming Mem Instr.]
  - 52 New SIMD Single Precision Floating Point Instructions
    - up to 4 FP results per cycle
    - Eight 128 bit registers
    - 4 x single precision FP numbers
  - 12 New Media Instructions
  - 8 New Cacheability Instructions
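The size and alignment rule for the FXSAVE area (512 bytes on a 16-byte boundary) is the kind of constraint a save-area allocator has to enforce. A minimal sketch of the address arithmetic follows; the helper names and the idea of carving the area out of a larger buffer are assumptions made for illustration, not something the slide prescribes.

```python
FXSAVE_AREA_SIZE = 512   # bytes, from the slide
FXSAVE_ALIGNMENT = 16    # bytes, from the slide


def align_up(addr, alignment=FXSAVE_ALIGNMENT):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def fxsave_area_ok(addr):
    """True if addr satisfies the 16-byte alignment rule."""
    return addr % FXSAVE_ALIGNMENT == 0


def carve_fxsave_area(buffer_start, buffer_len):
    """Return the first aligned address in the buffer that can hold a full save area."""
    start = align_up(buffer_start)
    if start + FXSAVE_AREA_SIZE > buffer_start + buffer_len:
        raise ValueError("buffer too small for an aligned 512-byte save area")
    return start


if __name__ == "__main__":
    # Over-allocating by alignment - 1 bytes always leaves room for an aligned area.
    print(hex(carve_fxsave_area(0x1001, FXSAVE_AREA_SIZE + FXSAVE_ALIGNMENT - 1)))
```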

Prefetching Instruction
  - Prefetch gets a cacheline at a time
  - Prefetch Hint (Load) Instructions
    - Instructions do not fault
    - Retires quickly to free up machine resources
  - Hints to cache at different levels
    - Store in different levels of cache hierarchy
    - Don't store in the cache hierarchy (stream)
  - Potential OS tuning uses
    - e.g. TCP/IP checksum gets ~2x speedup

Streaming Store Instruction
  - Store data to memory minimizing cache pollution
  - Potential OS tuning benefits
    - 128 bit registers used with streaming store to zero pages: ~4x faster
    - mem copy: ~2x faster using prefetch/stream together
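For readers unfamiliar with the workload named in the prefetch slide, the TCP/IP checksum is a streaming pass over a buffer, which is why fetching data ahead of use helps. The sketch below shows only the computation itself (the ones' complement sum of RFC 1071); prefetch hints cannot be expressed in Python, and the function name is hypothetical.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # one sequential pass over the buffer
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into the low 16 bits
    return ~total & 0xFFFF


if __name__ == "__main__":
    print(hex(internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))
```

Every byte is touched exactly once and in order, so the loop's speed is bounded largely by how quickly the data arrives from memory.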

New Exceptions
  - Interrupt vector 19 used to invoke unmasked exception handlers
    - Provide larger (512 bytes of state) context record to handler
  - Handlers need to account for SIMD nature of Pentium III SSE numeric exceptions
    - One instruction can generate multiple exceptions
    - Exceptions are precise (reported when detected)
  - Pentium III SSE instructions architecturally separate from x87-FP
    - Pentium III SSE instructions do not report x87-FP/MMX Technology exceptions
  - New handlers must include IEEE filter to decode and emulate exception raising SIMD instructions

Pentium 4 Processor

Pentium 4 Processor New Features
  - SSE2 Instructions
  - Enhanced Prefetch Instructions
  - System Bus and Cache Enhancements
  - OS Recommendation
  - New Instruction support

Pentium 4 Architecture Overview
  - Willamette is the next generation IA-32 processor microarchitecture
  - New micro-architecture
    - ~1.4x average performance of Pentium III processor family on same process
    - Enables faster processor speeds (1 GHz+)
    - Trace Cache for instruction decode
  - Willamette New Instructions
  - New platform (chipsets, AGP4X)

Pentium 4 New Instructions
  - New 128 bit arithmetic instructions
    - Extend MMX technology instructions from 64 bit to 128 bit data type
    - Operates on XMM registers instead of MMX/x87-FP registers
  - New 128-bit integer and SIMD-integer instructions
    - Memory operands MUST be 128-bit aligned! Will cause an exception during execution if not aligned.
    - Packed 32 * 32 bit multiply
    - Packed 64 bit add/subtract
    - Shift, Shuffle, Unpack, Move, Conversion
  - New SIMD double precision FP instructions
    - Full complement of FP arithmetic operations
    - Packed/Scalar DP/SP conversions
  - New cache / memory management instructions
    - Cache line flush instruction
    - Fences (LFence / MFence)
    - New streaming store instructions

Streaming SIMD Extensions 2
  [Figure: register sets]
  - Streaming SIMD Extensions 2 floating point registers (scalar/packed SIMD-SP-FP, SIMD-DP-FP, 128-bit integer): XMM registers, up to XMM7
  - Integer / x87 registers (64-bit integer, x87 data): FP0 or MM0 up to FP7 or MM7

Example SIMD Add (ADDPD)
  - Effectively performs two double precision ops in one cycle
  - a1+b1=c1 in parallel with a0+b0=c0
  - Useful for matrix operations
  [Figure: 128-bit registers holding (a1, a0) and (b1, b0) are added lane by lane to give (c1, c0)]
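The lane-wise behaviour of the packed add in the example can be modelled in a few lines. This is a software model for illustration only, assuming each 128-bit register is represented as 16 bytes holding two little-endian doubles with lane 0 first; the function name `addpd` simply mirrors the instruction mnemonic.

```python
import struct


def addpd(a: bytes, b: bytes) -> bytes:
    """Model of a packed double add: two independent 64-bit lanes in a 128-bit value."""
    a0, a1 = struct.unpack("<2d", a)
    b0, b1 = struct.unpack("<2d", b)
    # The two sums do not depend on each other, which is what lets hardware
    # compute them in parallel.
    return struct.pack("<2d", a0 + b0, a1 + b1)


if __name__ == "__main__":
    xmm_a = struct.pack("<2d", 1.5, 2.0)    # (a0, a1)
    xmm_b = struct.pack("<2d", 0.25, 4.0)   # (b0, b1)
    print(struct.unpack("<2d", addpd(xmm_a, xmm_b)))
```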

Prefetches
  - The Pentium 4 processor has automatic prefetches which
    - Work on large buffers
    - Have sequential access
  - Even fewer prefetches necessary
    - Use sequential access to buffers and get prefetches for free

Prefetches
  - But prefetch instructions may still be the best solution in some cases
    - PrefetchNTA reduces cache evictions of useful data (... 1.15x gain)
    - Benefits unusual (i.e., non-contiguous) data access patterns
    - Can maximize read bandwidth to system memory
  - Increase fetch-ahead distance since memory-latency/computation delta increases

System Bus & Cache Enhancements
  - The Pentium 4 system bus is an evolutionary extension of the P6 bus
    - 3.2 GByte/sec data transfer rate
    - 100MHz quad pumped data bus, similar to AGP-4X
    - Source synchronous 64 bit data bus
  - Caches
    - Trace cache for decoded instructions
    - 128 byte cache lines with 64 byte sectors
    - 256K on-die, 2nd level write-back, unified data and instruction cache
  - APIC
    - Messages now sent over front side bus
    - Physical destination mode expanded to 8 bits
    - ISR, IRR, TMR implementation increased to 256 bits
    - Remote read is no longer supported

OS Recommendations
  - All spin-loops should include the PAUSE instruction
    - Backwards compatible with prior IA-32 processors
    - Significant performance benefit in future IA-32 processors
    - Already done in 2.4-test* kernels
  - Cache line size is 128 bytes with 64 byte sectors
    - Impact to hot locks: hot locks should be on separate 64 byte sectors
    - Impact to data structure alignment: 128 byte line allocation in cache
  - Use non-execution based timing loops!
    - Already done in 2.4-test* kernels
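The 3.2 GByte/sec figure follows directly from the bus parameters on the slide, and the hot-lock advice reduces to checking which 64-byte sector an address falls in. A small sketch of both calculations, with hypothetical helper names and example offsets chosen only for illustration:

```python
SECTOR_BYTES = 64   # sector size from the slide
LINE_BYTES = 128    # cache line size from the slide


def bus_bandwidth_bytes_per_sec(clock_hz, transfers_per_clock, bus_width_bits):
    """Peak data rate of a bus that moves bus_width_bits on every transfer."""
    return clock_hz * transfers_per_clock * bus_width_bits // 8


def same_sector(addr_a, addr_b, sector=SECTOR_BYTES):
    """True if two byte addresses fall in the same sector."""
    return addr_a // sector == addr_b // sector


if __name__ == "__main__":
    # 100 MHz clock, quad pumped (4 transfers per clock), 64-bit data bus
    print(bus_bandwidth_bytes_per_sec(100_000_000, 4, 64))  # 3200000000
    # Two locks 32 bytes apart share a sector; 64 bytes apart they do not.
    print(same_sector(0, 32), same_sector(0, 64))
```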

Pentium 4 New Instruction Support
  - FXSAVE/FXRSTOR support for Pentium 4 state
    - Already done if enabled for Pentium III processor (Internet Streaming SIMD Extensions)
    - No new state!
    - Already done in 2.4-test* kernels
  - New exception handlers
    - Double precision SIMD capable
    - IEEE compliant
  - Prefetch and streaming store optimizations
    - Integer state streaming store instruction MOVNTI
    - For zeroing, memcpy, etc.
    - Does not use FP state so DNA is avoided

Pentium 4 Processor Micro-architecture
Next Generation IA-32 Micro-architecture

Agenda
  - IA-32 Processor Roadmap
  - Design Goals
  - Frequency
  - Instructions Per Cycle
  - Summary

Pentium 4 Processor
  [Figure: performance over time by micro-architecture generation: 486 Micro-architecture, P5 Micro-Architecture, P6 Micro-Architecture, NetBurst Micro-Architecture (NOW)]

Pentium 4 Processor Design Goals
  - Deliver world class performance across both existing and emerging applications
  - Deliver performance headroom and scalability for the future
  - Micro-architecture that will drive performance leadership for the next several years

CPU Architecture 101
  - Delivered Performance = Frequency * Instructions Per Cycle
  - This section: Frequency

Frequency
  - What limits frequency?
    - Process technology
    - Microarchitecture
  - On a given process technology
    - Fewer gates per pipeline stage will deliver higher frequency
  - Frequency is driven by microarchitecture

Netburst(TM) Micro-architecture Pipeline vs P6
  [Figure: pipeline stage comparison]
  - Basic P6 pipeline (intro at 733MHz, .18µ): Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Exec
  - Basic Pentium 4 processor pipeline (intro at 1.4GHz, .18µ): TC Nxt IP, TC Fetch, Drive, Alloc, Rename, Que, Sch, Sch, Sch, Disp, Disp, RF, RF, Ex, Flgs, Br Ck, Drive
  - Hyper pipelined technology enables industry leading performance and clock rate

Hyper Pipelined Technology
  [Figure: frequency vs. introduction time for the P5, P6 and Netburst micro-architectures. Legible labels: 60MHz, 166MHz, 1.13GHz (P6 Micro-Architecture), 1.4GHz (Netburst Micro-Architecture, today)]

CPU Architecture 101
  - Delivered Performance = Frequency * Instructions Per Cycle
  - This section: Instructions Per Cycle

Improving Instructions Per Cycle
  - Improve efficiency
    - Branch prediction
  - Do more things in a clock
  - Reduce time it takes to do something
    - Reducing latency

Branch Prediction
  - Accurate branch prediction is key to enabling longer pipelines
  - Dramatic improvement over P6 branch predictor:
    - 8x the size (4K)
    - Eliminated 1/3 of the mispredictions
  - Proven to be better than all other publicly disclosed predictors (g-share, hybrid, etc.)

Execution Trace Cache
  - Advanced L1 instruction cache
    - Caches decoded IA-32 instructions (uops)
  - Removes decoder pipeline latency
  - Capacity is ~12K uops
  - Integrates branches into single line
    - Follows predicted path of program execution
  - Execution Trace Cache feeds fast engine

Execution Trace Cache
  Program as laid out in memory:
    1  cmp
    2  br -> T1
       ... (unused code)
    T1: 3  sub
    4  br -> T2
       ... (unused code)
    T2: 5  mov
    6  sub
    7  br -> T3
       ... (unused code)
    T3: 8  add
    9  sub
    10 mul
    11 cmp
    12 br -> T4
  Trace cache delivery:
    1  cmp
    2  br T1
    3  T1: sub
    4  br T2
    5  mov
    6  sub
    7  br T3
    8  T3: add
    9  sub
    10 mul
    11 cmp
    12 br T4

Advanced Dynamic Execution
  - Extends basic features found in P6 core
  - Very deep speculative execution
    - 126 instructions in flight (3x P6)
    - 48 loads (3x P6) and 24 stores (2x P6)
  - Provides larger window of visibility
    - Better use of execution resources
  - Deep speculation improves parallelism

Improving Instructions Per Cycle
  - Improve efficiency
    - Branch prediction
  - Do more things in a clock
  - Reduce time it takes to do something
    - Reducing latency

Rapid Execution Engine
  - Dramatically lower ALU latency
    - P6: 1 clock at 1GHz = 1ns
    - P4P: ½ clock at >1.4GHz = <0.36ns

Example with Higher IPC and Faster Clock!
  - Code sequence: Ld, Add, Add, Ld, Add, Add
  - At 1 GHz: 10 clocks, 10ns, IPC = 0.6
  - Pentium 4: 6 clocks, 4.3ns, IPC = 1.0

Recap
                          Pentium III       Pentium 4           Relative
                          Processor         Processor           Improvement
  Frequency               1 GHz             > 1.4 GHz           > 1.4
  Adder Speed             1 ns              < .36 ns            > 2.8
  L1 Cache Speed          3 ns              < 1.42 ns           >
  L1 Cache Size           16 KB             8 KB
  L1 Cache Bandwidth      16 GB/sec         > 44.8 GB/sec       > 2.8
  L2 Cache Bandwidth      16 GB/sec         > 44.8 GB/sec       > 2.8
  Uop Fetch Bandwidth     3 billion/sec     > 4.2 billion/sec   > 1.4
  Adder Bandwidth         2 billion/sec     > 5.6 billion/sec   > 2.8
  Branch targets
  Instructions in flight
  Loads in flight
  Stores in flight
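The numbers in the example and in the recap table can be cross-checked against the deck's own formula, Delivered Performance = Frequency * Instructions Per Cycle. The sketch below redoes that arithmetic; the helper names are hypothetical, and the inputs are the values quoted on the slides (the 1 GHz figure for the first timing is derived from 10 clocks taking 10ns).

```python
def run_time_ns(clocks, freq_ghz):
    """Time to execute a given number of clocks at freq_ghz (1 GHz = 1 clock per ns)."""
    return clocks / freq_ghz


def ipc(instructions, clocks):
    """Instructions per cycle."""
    return instructions / clocks


def ratio(new, old):
    """Relative improvement of new over old."""
    return new / old


if __name__ == "__main__":
    instructions = 6  # Ld, Add, Add, Ld, Add, Add
    print(run_time_ns(10, 1.0), ipc(instructions, 10))            # 1 GHz case
    print(round(run_time_ns(6, 1.4), 1), ipc(instructions, 6))    # Pentium 4 case
    # Recap table rows where both columns are legible
    print(round(ratio(44.8, 16), 2))   # cache bandwidth
    print(round(ratio(4.2, 3), 2))     # uop fetch bandwidth
    print(round(ratio(5.6, 2), 2))     # adder bandwidth
```

Because the Pentium 4 columns are stated as bounds (greater than or less than), the computed ratios are lower bounds on the improvement, which matches the ">" in the last column.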

Example - Security and e-commerce
  - Secure transactions enable e-commerce
  - SSL is the standard for secure Web transactions
    - Protocol for secure communication
    - Built upon a core set of algorithms
      - Public-key encryption: RSA, DSA, Diffie-Hellman, etc.
      - Message digest: SHA-1, MD5, etc.
      - Digital signature
      - Bulk encryption: RC4, DES, 3DES, AES

Security Impacts Performance
SSL - The basics
  [Figure: message exchange between Browser and Web Server]
  1. Client Hello (settings, etc.)
  2. Server Hello (certificate, suite, etc.)
  3. Pre-master secret key
  4. Session ID, Ready
  5. Session ID, Ready
  6. Data exchanges (bulk encryption)

Security Impacts Performance
The high cost of SSL
  [Figure: transaction time vs. Kbytes transmitted. Source - Lab research]
  - Secure transactions are orders of magnitude slower than non-secure

Identify Key Algorithms
Computation in SSL
  - Goal: increase the number of secure transactions
  - Identify server performance issues in SSL
    - One server may deal with hundreds of clients
  [Figure: compute time consumed by one short SSL transaction, by phase (setup, authentication, bulk data encryption, close). Labels: Montgomery Product, 70%. Source - Lab research]

Breakthrough Performance on Pentium 4 Processor
Architectural Features
  - New instructions in SSE2
    - PMULUDQ (32x32=>64)
    - PADDQ (64+64=>64)
    - PSHUFD (re-arrange DWORDs)
  - All pipelined
  - SIMD
  - Increase size and reduce number of individual multiplications

Breakthrough Performance on Pentium 4 Processor
Timings
  Algorithm                       Bits    Lang    Clocks    Ratio
  Naïve 32-bit                    1x16    C
  Optimized ASM using MUL         1x32    asm
  Using Pentium 4 New Instruct    2x32    asm
  - Almost 20x performance gain versus naïve implementation
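The reason 32x32=>64 multiplies and 64-bit adds matter for public-key code is that big-number multiplication is built from exactly those two primitives. The following is a plain schoolbook multiply over 32-bit limbs, given as an illustration of the operation mix only; it is not the implementation measured on the slide, it is not Montgomery multiplication, and all names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_limbs(n, count):
    """Split a non-negative integer into count 32-bit limbs, least significant first."""
    return [(n >> (32 * i)) & MASK32 for i in range(count)]


def from_limbs(limbs):
    """Reassemble an integer from 32-bit limbs, least significant first."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


def mul_limbs(a, b):
    """Schoolbook product of two limb arrays using only 32x32->64 multiplies and adds."""
    result = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            # ai * bj is a 32x32->64 product (the PMULUDQ-style step); adding the
            # running limb and the carry still fits in 64 bits.
            t = ai * bj + result[i + j] + carry
            result[i + j] = t & MASK32
            carry = t >> 32
        result[i + len(b)] = carry
    return result


if __name__ == "__main__":
    x = 0xFFFFFFFFFFFFFFFF123456789ABCDEF0
    y = 0x0FEDCBA987654321FFFFFFFF00000001
    product = from_limbs(mul_limbs(to_limbs(x, 4), to_limbs(y, 4)))
    print(product == x * y)
```

An n-limb by n-limb product needs n*n limb multiplications, so doubling the limb width or doing two multiplies per instruction directly cuts the instruction count, which is the point of "increase size and reduce number of individual multiplications".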

Breakthrough Performance on Pentium 4 Processor
Summary
  - Expect ... bit RSA Decrypts/second
  - Breakthrough performance on public key algorithms for Pentium 4 processor
    - The right architecture
    - The right instruction set
  - Pentium 4 processor delivers more secure transactions to more users
