Intel Million-Transistor 64-Bit Microprocessor


1 Intel Million-Transistor 64-Bit Microprocessor

Design Charter
- Target New Market Segments with 1 Million Transistor Budget
- No Compatibility Constraints
- Maximize Performance
Achieved by
- 64-Bit Architecture
- Supercomputer Concepts

2 Performance Beyond MIPS
- Operating System: Fast Integer, Memory Management
- Computation: Balanced Integer and Floating-Point Speed
- Visualization: 3-D Graphics
- Bandwidth: On-Chip Caches, 64- and 128-Bit Buses

Microprocessor Block Diagram
[Figure: block diagram; surviving labels: Ext. Addr, Data Cache, Ext. Data, Dest, SRC1, SRC2, Adder Unit, Multiplier Unit]

3 Fine Grain Parallel Computing
[Figure: Operations Per Clock]

Fine Grained Parallelism Goals
- Overlap Load / Store / Loop Control and Floating-Point Operations
- Peak FP Performance in Inner Loops
- Vector Processing in Software
- Support Parallel Non-Vectorizable Code

4 i860™ CPU Register Model
[Figure: register model; surviving labels: 3 Port, Load / Store; 5 Port, Load / Store; register widths lost in extraction]
- Support Parallel Execution

RISC Core
- Load / Store Oriented Architecture - 1 Clock Instructions
- Good Integer Performance - 85K Dhrystones at 40 MHz
- 32-Bit Linear Addressing

5 RISC Core Features
- Auto Increment Addressing
- Arbitrary Increment
- Vector Accesses
- Conditional Branches - Taken and Not Taken
- Test / Branch and Loop
- Pixel Store
- General Lock and Unlock - Test and Set - Locked Increment - Compare and Swap

RISC Core Features
- Dual Instruction Mode
- 128-Bit Load / Stores - Fast Data Movement
- Pipelined Load - Optimized for Cache Miss - No Cache Allocate - Supports External Bus Pipeline - Avoids Cache Thrashing

6 Memory Management
- Paging Unit Compatible with 386™ CPU
- 2 Level Paging
- 4K Page Size
- Supports Heterogeneous MP OS
- User / Supervisor Protection
- 64 Entry TLB, 4-Way Set Associative
- Miss Processing in Hardware

Floating-Point Units Block Diagram
[Figure: block diagram; surviving labels: Instruction, Data, Registers and Controls, Dest, SRC1, SRC2]
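The paging bullets above can be illustrated with a small sketch. It assumes the 386-style split of a 32-bit linear address into a 10-bit directory index, a 10-bit table index, and a 12-bit offset (the slide states only "2 Level Paging" and "4K Page Size"), and it derives the number of TLB sets from the 64-entry, 4-way figure. The function and constant names are hypothetical.

```python
# Sketch only: field widths follow the 386 paging layout, which the slide
# says the paging unit is compatible with. Names are illustrative.
PAGE_SHIFT = 12      # 4K page size -> 12 offset bits
INDEX_BITS = 10      # assumed 386-style 10-bit directory and table indexes
TLB_ENTRIES = 64
TLB_WAYS = 4


def split_linear_address(addr):
    """Return (directory index, table index, page offset) for a 32-bit address."""
    addr &= 0xFFFFFFFF
    index_mask = (1 << INDEX_BITS) - 1
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    table_index = (addr >> PAGE_SHIFT) & index_mask
    dir_index = (addr >> (PAGE_SHIFT + INDEX_BITS)) & index_mask
    return dir_index, table_index, offset


def tlb_sets():
    """Number of sets in a 64-entry, 4-way set-associative TLB."""
    return TLB_ENTRIES // TLB_WAYS
```

With these assumptions a 64-entry, 4-way TLB has 16 sets, and each entry maps one 4K page.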

7 Floating-Point Unit Instruction Execution
- Data Dependent Code: Scalar Mode
- Vector Code: Pipeline Mode

Example loop:
    DO 10 I = 1, [upper bound lost in extraction]
    X[I] = A[I] * B[I] + C
Scalar mode: 6 Clocks
    FMUL A, B, TEMP
    FADD TEMP, C, X
Pipeline mode: 1 Clock
    m12tpm A[I], B[I], X[I]

Dual Operation Example
[Figure: Source 1, Source 2, Destination, OP; Multiplier Unit (SP) and Adder Unit computing KR x Source 2 + Source 1]
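A rough sketch of what the example loop computes and how the two quoted clock counts compare. The per-element costs (6 clocks scalar, 1 clock pipelined) are taken from the slide; the estimate ignores pipeline fill and drain and any loop overhead, and the function names are hypothetical.

```python
# Sketch only: clock counts per element come from the slide; everything
# else (names, the simple n * cost model) is an illustrative assumption.
SCALAR_CLOCKS_PER_ELEMENT = 6      # FMUL followed by a dependent FADD
PIPELINED_CLOCKS_PER_ELEMENT = 1   # one dual-operation instruction per element


def multiply_add(a, b, c):
    """Compute X[i] = A[i] * B[i] + C for every element."""
    return [ai * bi + c for ai, bi in zip(a, b)]


def estimated_clocks(n, pipelined):
    """Estimate clocks for n elements, ignoring pipeline fill and drain."""
    per_element = (PIPELINED_CLOCKS_PER_ELEMENT if pipelined
                   else SCALAR_CLOCKS_PER_ELEMENT)
    return n * per_element
```

Under this simple model, a 100-element loop takes about 600 clocks in scalar mode and about 100 in pipeline mode.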

8 Addition / Subtraction Block Diagram
[Figure: block diagram; surviving labels: EXP1, EXP2, MANT1, MANT2, PRENORMALIZATION, SCANNING, EXP, MANT]

Multiplication Block Diagram
[Figure: block diagram; labels not recoverable]

9 Floating-Point Unit Performance
[Figure: performance chart; values not recoverable]

3-D Graphics Unit
- Supports Backend Rendering - Intensity Interpolation - Distance Interpolation - Hidden Surface Elimination
- 32- and 64-Bit Integer Add / Subtract
- 3% Die Area
- 10x Faster Than RISC Core
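The three backend rendering steps named above (intensity interpolation, distance interpolation, hidden surface elimination) can be sketched for a single span of pixels. This is a generic Z-buffer loop, not the chip's instruction sequence: it assumes integer (fixed-point) values, assumes a smaller Z means nearer to the viewer, and uses hypothetical names throughout.

```python
# Sketch only: a generic incremental-interpolation Z-buffer span.
# Assumptions: integer values, smaller z is nearer, names are illustrative.
def shade_span(zbuffer, framebuffer, start, count, z, dz, intensity, di):
    """Interpolate z and intensity across a span, keeping only nearer pixels."""
    for x in range(start, start + count):
        if z < zbuffer[x]:          # hidden surface elimination
            zbuffer[x] = z
            framebuffer[x] = intensity
        z += dz                     # distance interpolation
        intensity += di             # intensity interpolation
    return zbuffer, framebuffer
```

Each step is one add and one compare per pixel, which is why packing several pixels into one wide integer operation pays off.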

10 3-D Graphics Unit
- Operates in Parallel with RISC Core
- Up to 8 Pixels or 4 Z-Values Per Clock
- Pixel and Z-Value Packing
- Pixel Masking Store
[Figure: pixel formats; surviving labels: 16-Bit Pixel, 32-Bit Pixel, R, G, B; one pixel width lost in extraction]
C = Color, I = Intensity, R = Red Intensity, G = Green Intensity, B = Blue Intensity

Graphics Loop Example
16-Bit Gouraud Shaded Pixels, 16-Bit Z-Buffer

blaloop:
Graphics-unit instructions (dual-instruction mode):
    d.faddz  z, deltaz, z
    d.fzchks f0, f0, f0
    d.faddz  z, deltaz, z
    d.form   f0, newz
    d.faddp  blue, deltablue, blue
    d.faddp  green, deltagreen, green
    d.faddp  red, deltared, red
    d.form   f0, newi
    d.fzchks oldz, newz, newz
Core instructions:
    nop
    fst.d    newz, zbuffer
    pst.d    newi, framebuffer
    nop
    nop
    nop
    nop
    bla      r12, r13, blaloop
    fld.d    framebuffer, oldz

4 Pixels / 9 Clocks
19 Million Pixels / Sec at 40 MHz
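The "8 Pixels or 4 Z-Values Per Clock" figures are consistent with packing 8-bit pixels or 16-bit Z-values into one 64-bit word. The sketch below shows that packing arithmetic. The 64-bit word width and the low-lane-first ordering are assumptions made for illustration, and the function names are hypothetical.

```python
# Sketch only: packs equal-width fields into one 64-bit word.
# Assumptions: 64-bit word, lane 0 in the least significant bits.
WORD_BITS = 64


def lanes_per_word(field_bits):
    """How many fields of the given width fit in one word."""
    if field_bits <= 0 or WORD_BITS % field_bits:
        raise ValueError("field width must divide the word width")
    return WORD_BITS // field_bits


def pack(values, field_bits):
    """Pack exactly one word's worth of values, lane 0 in the low bits."""
    if len(values) != lanes_per_word(field_bits):
        raise ValueError("wrong number of values for this field width")
    mask = (1 << field_bits) - 1
    word = 0
    for lane, value in enumerate(values):
        word |= (value & mask) << (lane * field_bits)
    return word


def unpack(word, field_bits):
    """Inverse of pack: split a word back into its fields."""
    mask = (1 << field_bits) - 1
    return [(word >> (lane * field_bits)) & mask
            for lane in range(lanes_per_word(field_bits))]
```

Under these assumptions one word holds eight 8-bit pixels, four 16-bit pixels or Z-values, or two 32-bit pixels.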

11 i860™ CPU Instruction and Data Caches
- 8 Kbytes Data and 4 Kbytes Instruction Cache
- 32-Byte Line Size
- 2 Way Associative Architecture - Write Back
- 64 / 128-Bit Internal Cache Paths
- Internal Cache Bandwidths (values lost in extraction): MByte / Sec: Instruction Cache; MByte / Sec: Data Cache
- User Controllable

Quick Cache Miss Processing
- Wrap Around Cache Fills
- Write Buffers: 128-Bit Wide, Two Deep
- Delayed Write Back - Write Miss Processing
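The cache parameters above determine the number of sets and how an address divides into tag, set index, and line offset. The sketch below works that out, and adds a peak-bandwidth helper that assumes one full-width transfer per clock and 1 MByte = 10^6 bytes; the slide's own bandwidth figures did not survive, so the helper is an estimate under those assumptions, not a restatement of them. All names are hypothetical.

```python
# Sketch only: standard set-associative cache arithmetic applied to the
# sizes on the slide. The bandwidth helper assumes one transfer per clock.
def cache_sets(size_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache."""
    return size_bytes // (line_bytes * ways)


def split_address(addr, size_bytes, line_bytes=32, ways=2):
    """Return (tag, set index, byte offset within the line)."""
    sets = cache_sets(size_bytes, line_bytes, ways)
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


def peak_bandwidth_mbytes(path_bits, clock_mhz):
    """Peak MByte/s if the path moves path_bits every clock (assumption)."""
    return path_bits // 8 * clock_mhz
```

For example, `cache_sets(8192, 32, 2)` gives 128 sets for the data cache and `cache_sets(4096, 32, 2)` gives 64 sets for the instruction cache.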

12 Cache Circuit Block Diagram
[Figure: cache circuit block diagram; labels not recoverable]

Cache as "Vector Registers"
- Intermediate Results, Reused Vectors in Cache
- Use Once Operands Don't Disturb Cache
- "Vector Registers" Use Normal Loads
- "Memory Operands" Use Pipeline Loads
- Analyze Loop Nest to Optimize Vector Reuse
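A toy model of why use-once operands should bypass the cache. It streams rows of a matrix past a small reused vector and counts misses on the vector, once with every load allocating a line and once with the matrix loads treated as non-allocating (standing in for pipelined loads). The cache here is a fully associative LRU model with made-up sizes, which is a simplification of the 2-way cache described earlier; all names are hypothetical.

```python
from collections import OrderedDict

# Sketch only: fully associative LRU line cache with illustrative sizes.
class LineCache:
    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()

    def load(self, line, allocate=True):
        """Return True on a hit; on a miss, allocate only if asked to."""
        if line in self.lines:
            self.lines.move_to_end(line)        # mark as most recently used
            return True
        if allocate:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[line] = True
        return False


def vector_misses(rows, vec_lines, capacity_lines, pipelined_matrix_loads):
    """Count misses on the reused vector while streaming matrix rows."""
    cache = LineCache(capacity_lines)
    misses = 0
    for i in range(rows):
        for j in range(vec_lines):
            # Matrix element: used once; optionally bypasses allocation.
            cache.load(("A", i, j), allocate=not pipelined_matrix_loads)
            # Reused vector element: always a normal, allocating load.
            if not cache.load(("x", j)):
                misses += 1
    return misses
```

In this model, with 10 rows, a 4-line vector, and a 6-line cache, the vector misses only on its first pass when the matrix loads do not allocate, and misses on every access when they do.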

13 Bus Interface
- 64-Bit External Data Bus
- Optimizations for DRAM Interface - Two Levels of Pipelining - Next Near Indication - Two Clock Transfer Rate
- Allows for Second Level Cache - Low Latency Accesses

General
- 1X Clock with 33.3 and 40 MHz Speeds - Internal Speed = External Clock Speed
- 168 Pin Package
- Power Consumption - Peak: 3W - Normal: <2W
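The bus figures above imply a peak external transfer rate. The sketch below does that arithmetic, assuming one 64-bit transfer every two clocks with no idle cycles and 1 MByte = 10^6 bytes. The function name and default arguments are hypothetical.

```python
# Sketch only: peak rate assuming back-to-back two-clock transfers.
def peak_bus_mbytes_per_sec(bus_bits=64, clocks_per_transfer=2, clock_mhz=40.0):
    """Peak external bus rate in MByte/s under the stated assumptions."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz / clocks_per_transfer
```

Under these assumptions the peak is about 160 MByte/s at 40 MHz and about 133 MByte/s at 33.3 MHz; sustained rates depend on the memory system.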

14 Benchmarks at 40 MHz
Peak Performance               120 MOPS
Dhrystone 1.1                  85 K
Whetstone (DP)                 24 M
Stanford Suite                 33 VAX MIPS
Linpack (DP, Coded DAXPY)      13 MFlops
3-D Triangles / Sec.           50 K

i860™ CPU: A Performance Perspective
[Figure: performance comparison; values not recoverable]
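Two of the benchmark rows can be unpacked with simple arithmetic. DAXPY, the kernel named in the Linpack row, is the standard operation y = a*x + y and costs two floating-point operations per element. The peak figure can be restated as operations per clock by dividing MOPS by the clock rate in MHz. The function names are hypothetical.

```python
# Sketch only: the DAXPY kernel and two small helper calculations.
def daxpy(a, x, y):
    """Return a*x + y element by element (double-precision AXPY)."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def daxpy_flops(n):
    """Floating-point operations in one DAXPY of length n (multiply + add)."""
    return 2 * n


def ops_per_clock(mops, clock_mhz):
    """Convert millions of operations per second to operations per clock."""
    return mops / clock_mhz
```

For the peak row, `ops_per_clock(120, 40)` works out to 3 operations per clock.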

16 Summary
- 64-Bit Architecture
- 3-D Graphics
- Supercomputer Concepts Brought to Single Chip
- 1/2 Cray 1 Performance at 40 MHz

i860 and 386 Are Trademarks of Intel Corp. Cray Is a Trademark of Cray Research, Inc.
