Advanced CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.



Computer Architectures: Advanced CPU Design
Tien-Fu Chen, National Chung Cheng Univ.

MMX technology
- Basic concepts
  - Small native data types
  - Compute-intensive operations
  - A lot of inherent parallelism => single-instruction, multiple-data (SIMD)
- Features
  - Packed data types
  - A rich set of MMX instructions that perform parallel operations
  - Saturation arithmetic: unlike regular arithmetic, results do not truncate or wrap around on overflow; they clamp to the largest or smallest representable number
  - Parallel compare
  - Overlapped operations
  - Pack/unpack of data types
  - Compatible extension of the architecture
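The difference between wraparound and saturation arithmetic on packed data can be sketched in portable C. This is a scalar emulation under stated assumptions, not MMX code: a 64-bit word stands in for an MMX register holding eight unsigned bytes, and the function names are hypothetical. The two functions roughly mirror what the MMX instructions PADDB (wraparound) and PADDUSB (unsigned saturation) do per lane.

```c
#include <assert.h>
#include <stdint.h>

/* Extract one unsigned 8-bit lane (0..7) from a packed 64-bit word. */
static uint8_t get_lane(uint64_t word, int lane)
{
    assert(lane >= 0 && lane < 8);
    return (uint8_t)(word >> (8 * lane));
}

/* Lane-wise add with wraparound: each byte is added modulo 256. */
uint64_t packed_add_wrap_u8(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t sum = (uint8_t)(get_lane(a, lane) + get_lane(b, lane));
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}

/* Lane-wise add with unsigned saturation: each byte clamps at 255. */
uint64_t packed_add_sat_u8(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned sum = (unsigned)get_lane(a, lane) + get_lane(b, lane);
        if (sum > 255u) {
            sum = 255u;   /* clamp to the largest representable value */
        }
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}
```

For a pixel value of 200 brightened by 100, the wraparound version yields 44 (a dark pixel), while the saturating version yields 255 (white), which is usually the wanted result in image processing.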


- Packed data types (small data types packed into one register)
- Dual usage of the floating-point registers
- Enhanced instruction set operating in parallel fashion
  - In total, 57 MMX instructions are added to IA.

Fast DSP computation

Performance of Matrix Multiplication
Performance comparison between IA and MMX: working example on matrix and vector multiplication

Vector-vector multiplication (instruction counts)

                           Traditional IA    MMX
  No. of loads                   32            8
  No. of multiplies              16            4
  No. of adds                    15            3
  Loop control*                  12            0
  Other overhead                  0            3
  Final result save               1            1

Matrix-vector multiplication

                           Traditional IA       MMX
  Total instructions       4(4x76+3) = 1228     4(4x19+3) = 316
  Cycle count**            1200 cycles          207 cycles

(76 and 19 are the column totals of the vector-vector table. Both cycle counts are under optimized mode.)

Result: speedup of 5.8 times

* Assume we perform 4 MACs (out of 16) per loop iteration of our code:
      for (K = 1; K < 5; K++) { Mac(K); }
  So each loop iteration costs 3 loop-control instructions: increment, compare, and branch.
** 1) The cycle count is dominated by the nonpipelined, 11-cycle integer multiply operation.
   2) 4 mispredictions in total when exiting the loops.
   3) All data are in on-chip caches.

More Parallelisms
- Streaming SIMD Extension (SSE), since Pentium III
  - Physically adds eight new 128-bit XMM registers and 70 new instructions. New machine state introduced.
  - Supports four 32-bit single-precision floating-point operations in parallel. Recall that the MMX SIMD instructions are all integer-only.
- Streaming SIMD Extension 2 (SSE2), since Pentium 4
  - Uses the XMM registers. No new machine state.
  - 144 new instructions added.
  - Supports double-precision floating-point parallel operations.
- IA-64 Itanium(TM) Architecture
  - Enable, enhance, express, and exploit parallelism: at the process/thread level for programmers, at the instruction level for compilers. All explicitly.
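The arithmetic behind the instruction-count comparison above can be checked with a short C sketch. The struct, constant, and function names are hypothetical; the numbers are taken from the slide's table, and the formula 4(4 x total + 3) is the one the slide uses for the matrix-vector case.

```c
#include <assert.h>

/* Instruction mix for one vector-vector product, as listed on the slide. */
typedef struct {
    int loads;
    int multiplies;
    int adds;
    int loop_control;
    int other_overhead;
    int result_save;
} InstrMix;

const InstrMix IA_MIX  = { 32, 16, 15, 12, 0, 1 };  /* traditional IA column */
const InstrMix MMX_MIX = {  8,  4,  3,  0, 3, 1 };  /* MMX column */

/* Sum one column of the table. */
int vector_product_instrs(InstrMix m)
{
    return m.loads + m.multiplies + m.adds +
           m.loop_control + m.other_overhead + m.result_save;
}

/* The slide's matrix-vector total: 4 * (4 * per_product + 3). */
int matrix_vector_instrs(int per_product)
{
    assert(per_product > 0);
    return 4 * (4 * per_product + 3);
}

/* Speedup as a ratio of cycle counts. */
double speedup(double baseline_cycles, double improved_cycles)
{
    assert(improved_cycles > 0.0);
    return baseline_cycles / improved_cycles;
}
```

With the slide's cycle counts, `speedup(1200.0, 207.0)` is about 5.8, matching the quoted result. Note that the instruction-count ratio (1228 to 316, about 3.9) is smaller than the cycle-count ratio, which is consistent with the note that the nonpipelined integer multiply dominates the traditional IA cycle count.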

Objectives of the IA-64 Instruction Set Architecture (ISA)
- Intel and HP technology alliance
- Enable industry-leading system performance
  - Breakthrough performance
  - Headroom
- Enable compatibility with today's IA-32 software and PA-RISC software
- Allow scalability over a wide range of implementations
- Full 64-bit computing

Next Generation Terminology
- EPIC (Explicitly Parallel Instruction Computing): the next-generation processor technology
  - Compare: RISC, CISC
- IA-64 (Intel Architecture, 64-bit): the architecture that incorporates EPIC technology
  - Compare: IA-32, PA-RISC
- Merced processor: the project name for Intel's first IA-64-based implementation
  - Compare: Pentium II, PA-8500

Features of IA-64 Architecture
- Explicit parallelism
  - ILP is explicit in the machine code
  - The compiler analyzes and identifies parallelism at compile time
- Predication enhances parallelism
- Speculation minimizes the effect of memory latency
- IA-64 processors are massively resourced
  - Many registers
  - Many functional units
  - Inherently scalable
- Performance, headroom, binary compatibility

Predication: Features and Benefits
- Compiler is given a larger scheduling scope
  - Nearly all instructions can be predicated
  - State is updated if an instruction's predicate is true; otherwise the instruction acts as a NOP
  - The compiler assigns predicates; compare instructions set them
  - The architecture provides 64 1-bit predicate registers (PR)
- Predicated execution removes branches
  - Converts a control dependence into a data dependence
  - Reduces mispredict penalties
- Parallel execution through larger basic blocks
  - Effective use of parallel hardware
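The predication rules above (a compare sets predicates, a predicated instruction updates state only when its predicate is true and otherwise acts as a NOP) can be illustrated with a toy model in C. This is a sketch under assumptions, not IA-64 code: `ToyMachine`, `cmp_gt`, and `pred_sub` are hypothetical names, and the register file is cut down to four registers. The compare writes a complementary pair of predicates, which is how IA-64 compares behave.

```c
#include <assert.h>
#include <stdint.h>

/* Toy machine state: 64 one-bit predicate registers and a few integer registers. */
typedef struct {
    uint64_t pr;     /* bit i holds predicate register i */
    int64_t  r[4];   /* hypothetical small register file */
} ToyMachine;

static int get_pred(const ToyMachine *m, int idx)
{
    assert(idx >= 0 && idx < 64);
    return (int)((m->pr >> idx) & 1u);
}

static void set_pred(ToyMachine *m, int idx, int value)
{
    assert(idx >= 0 && idx < 64);
    m->pr = (m->pr & ~(UINT64_C(1) << idx)) | ((uint64_t)(value != 0) << idx);
}

/* Compare: sets p_true to (a > b) and p_false to its complement. */
void cmp_gt(ToyMachine *m, int p_true, int p_false, int64_t a, int64_t b)
{
    set_pred(m, p_true, a > b);
    set_pred(m, p_false, !(a > b));
}

/* Predicated subtract: (qp) r[dst] = a - b; acts as a NOP when qp is false. */
void pred_sub(ToyMachine *m, int qp, int dst, int64_t a, int64_t b)
{
    assert(dst >= 0 && dst < 4);
    if (get_pred(m, qp)) {
        m->r[dst] = a - b;
    }
}

/* Branching version: control dependence on the comparison. */
int64_t abs_diff_branchy(int64_t a, int64_t b)
{
    if (a > b) {
        return a - b;
    }
    return b - a;
}

/* If-converted version: both subtracts are issued, predicates pick the survivor. */
int64_t abs_diff_predicated(int64_t a, int64_t b)
{
    ToyMachine m = { 0, { 0, 0, 0, 0 } };
    cmp_gt(&m, 1, 2, a, b);      /* p1 = a > b, p2 = !(a > b) */
    pred_sub(&m, 1, 0, a, b);    /* (p1) r0 = a - b */
    pred_sub(&m, 2, 0, b, a);    /* (p2) r0 = b - a */
    return m.r[0];
}
```

The `if` inside `pred_sub` belongs to the simulator, not to the modeled program: the modeled instruction sequence in `abs_diff_predicated` is straight-line, so the control dependence of the branchy version has become a data dependence on p1 and p2.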

Intel/HP IA-64 Explicitly Parallel Instruction Computer (EPIC)
- IA-64: instruction set architecture; EPIC is the type
  - EPIC = 2nd-generation VLIW?
- Itanium: the first implementation (2001)
  - Highly parallel and deeply pipelined hardware at 800 MHz
  - 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
- [...]-bit integer registers, [...]-bit floating-point registers
  - Not separate register files per functional unit as in old VLIW
- Hardware checks dependencies (interlocks => binary compatibility over time)
- Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

Binary Compatibility
[Figure: paths to native IA-64 code. Labels: design criteria; systems architecture; transparent to user (default); PA-RISC object code and IA-32 object code via dynamic translator; application source (compatible C, C++, and FTN) in a high-level language (C, C++, Fortran, COBOL) via native compiler and optimizer; HP-UX and NT.]

IA-64 Play: Next generation ISA

VLIW Processor Architectures for DSP
- Why VLIW architecture?
  - VLIW is especially suitable for DSP applications
  - DSP algorithms are dominated by data-parallel computation and consist of core tight loops executed repeatedly
    - Convolution, FFT
  - Single-chip high-performance VLIW processors with multiple FUs are commercially available

VLIW Architecture
- Instruction-Level Parallelism (ILP)
  - Multiple different FUs operate in parallel
  - Each instruction contains an operation code for each FU
- Data-Level Parallelism (DLP)
  - A single FU is divided to perform the same operation on multiple smaller-precision data
- Instruction Set Architecture
  - Each processor has its own special instructions to further enhance performance
  - Complex_multiply for FFT and autocorrelation algorithms
- Memory I/O
  - Via DMA controller
  - Predictable access time
  - Hide the data transfer time behind the processing time by doing independent work
  - Real-time requirement
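The DLP idea above, one functional unit split to operate on several smaller-precision values, can be sketched in C as a partitioned add. This is an illustration under assumptions, not the instruction of any particular DSP: a 32-bit word is treated as two independent 16-bit lanes, and the function names are hypothetical. The trick is to stop the carry out of the low lane from reaching the high lane.

```c
#include <assert.h>
#include <stdint.h>

/* Extract one 16-bit lane (0 = low half, 1 = high half) from a 32-bit word. */
static uint16_t lane16(uint32_t word, int lane)
{
    assert(lane == 0 || lane == 1);
    return (uint16_t)(word >> (16 * lane));
}

/* Reference version: add the two lanes one at a time. */
uint32_t partitioned_add16_ref(uint32_t a, uint32_t b)
{
    uint16_t lo = (uint16_t)(lane16(a, 0) + lane16(b, 0));
    uint16_t hi = (uint16_t)(lane16(a, 1) + lane16(b, 1));
    return ((uint32_t)hi << 16) | lo;
}

/* Partitioned version: one 32-bit add handles both lanes at once. */
uint32_t partitioned_add16(uint32_t a, uint32_t b)
{
    const uint32_t top = 0x80008000u;               /* top bit of each lane */
    uint32_t partial = (a & ~top) + (b & ~top);     /* carries stop at bit 15 of each lane */
    return partial ^ ((a ^ b) & top);               /* add the top bits without carry-out */
}
```

For example, adding 0x0001FFFF and 0x00010001 with an ordinary 32-bit add gives 0x00030000 because the low lane's carry leaks into the high lane, while `partitioned_add16` returns 0x00020000: the low lane wraps to 0 and the high lane is 1 + 1 = 2.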

TI TMS320C62
- 256 bits per instr. (8 x 32-bit)
- 2 clusters
  - Each with 4 FUs
  - Each with 16 32-bit registers
  - One cross-cluster read port each way
- Two integer ALUs support partitioned instr.
- Programmable DMA controller with two 32-KB memories

TI TMS320C80
- ILP, DLP, multiple processors on a single chip
- 4 ADSP (DSP+VLIW)
  - A 16-bit MUL, a 3-input 32-bit ALU, a branch unit, 2 load/store units
  - 3 zero-overhead loop controllers
  - One 2-KB I-cache, four 2-KB D-caches
- RISC processor
  - FPU: FPMAC
  - A 4-KB I-cache, a 4-KB D-cache
- DMA (Transfer Controller)
  - Supports various types of data transfers with complex address calculation
- No support for some powerful instrs.
  - SAD, inner product

Philips Trimedia TM1000
- 27 FUs, coprocessor for MPEG-2 decoding
- No DMA controller, 16-KB D-cache, 32-KB I-cache
- One PCI port, MM I/O
- Issues 5 simultaneous instr. per cycle
- DSPALU: partitioned instr.
- DSPMUL: partitioned instr., inner product

Transmeta's Crusoe Processor, TM5400
- General-purpose microprocessor based on VLIW
  - Difficult: binary code compatibility, very complicated compiler
- Supports x86 (MS Windows, Linux)
  - x86 Code Morphing software using dynamic binary code translation
- 2 integer units, 1 FPU, 1 load/store, 1 branch
  - 64-KB 16-way L1 D-cache
  - 64-KB 8-way I-cache
  - 256-KB L2 cache
  - [...]-bit GPR
  - VLIW instr. size: 64 or 128 bits, 4 instr. per cycle. Supports partitioned instr.


Crusoe: A low-power x86 processor
- Crusoe processor = software + hardware
- Code Morphing software
  - Dynamically translates x86 instructions into VLIW instructions
  - Provides x86 compatibility
  - Optimization and scheduling by software
- VLIW hardware
  - 128-bit Very Long Instruction Word processor
  - Simple and fast
  - Fewer transistors
- => Low power, x86 compatibility, PC performance

Crusoe VLIW

Code Morphing Software
A dynamic translation system that resides in a ROM; it is the first program to start executing when booting.
- Drawing the H/W and S/W line
  - Software: decodes x86 instructions and generates parallel molecules
  - Hardware: executes them using a simple, high-speed VLIW engine
- Decoding and scheduling
  - Translation cache: CMS translates instructions once, saving the resulting translation for re-use
    => skips the translation the next time

Play: Transmeta Crusoe
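The translate-once, reuse-later behavior of the translation cache can be sketched in C. This is a minimal illustration under assumptions and says nothing about how the real Code Morphing Software is organized: the direct-mapped layout, the slot count, and every name here (`TranslationCache`, `fetch_translation`, `translate_block`) are hypothetical, and the "translation" is a placeholder value rather than real VLIW molecules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 64u   /* hypothetical size of the direct-mapped cache */

typedef struct {
    int      valid;       /* slot holds a translation */
    uint32_t x86_addr;    /* start address of the translated x86 block */
    uint32_t molecules;   /* stand-in handle for the translated VLIW code */
} CacheEntry;

typedef struct {
    CacheEntry slot[CACHE_SLOTS];
    size_t     translations;   /* how many times the translator ran */
    size_t     hits;           /* how many times a saved translation was reused */
} TranslationCache;

/* Placeholder for the expensive decode-and-schedule step. */
static uint32_t translate_block(uint32_t x86_addr)
{
    return x86_addr ^ 0xC0DEu;
}

/* Return the translation for a block, translating only on a miss. */
uint32_t fetch_translation(TranslationCache *tc, uint32_t x86_addr)
{
    assert(tc != NULL);
    CacheEntry *e = &tc->slot[x86_addr % CACHE_SLOTS];
    if (e->valid && e->x86_addr == x86_addr) {
        tc->hits++;                       /* skip translation this time */
        return e->molecules;
    }
    e->valid = 1;
    e->x86_addr = x86_addr;
    e->molecules = translate_block(x86_addr);
    tc->translations++;
    return e->molecules;
}

/* Run a trace of block addresses through a fresh cache; count translator runs. */
size_t count_translations(const uint32_t *trace, size_t n)
{
    TranslationCache tc = { 0 };
    for (size_t i = 0; i < n; i++) {
        (void)fetch_translation(&tc, trace[i]);
    }
    return tc.translations;
}
```

A loop body that executes thousands of times appears thousands of times in the trace but costs only one translation, which is why dynamic translation pays off mainly for hot, repeatedly executed code.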


Several Common Compiler Strategies
- Instruction scheduling
- Loop unrolling
- Static branch prediction
- Software pipelining

Basic Instruction Scheduling
Reschedule the order of the instructions to reduce the