6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU


1. ARCHITECTURE OVERVIEW

The Cyrix 6x86 CPU is a leader in the sixth generation of high-performance, x86-compatible processors. Increased performance is accomplished through the use of superscalar and superpipelined design techniques. The 6x86 CPU is superscalar in that it contains two separate pipelines that allow multiple instructions to be processed at the same time. The use of advanced processing technology and the increased number of pipeline stages (superpipelining) allow the 6x86 CPU to achieve clock rates of 100 MHz and above.

Through the use of unique architectural features, the 6x86 processor eliminates many data dependencies and resource conflicts, resulting in optimal performance for both 16-bit and 32-bit x86 software.

The 6x86 CPU contains two caches: a 16-KByte dual-ported unified cache and a 256-byte instruction line cache. Since the unified cache can store instructions and data in any ratio, it offers a higher hit rate than separate data and instruction caches of equal size. An increase in overall cache-to-integer-unit bandwidth is achieved by supplementing the unified cache with a small, high-speed, fully associative instruction line cache. The inclusion of the instruction line cache avoids excessive conflicts between code and data accesses in the unified cache.

The on-chip FPU allows floating point instructions to execute in parallel with integer instructions and features a 64-bit data interface. The FPU incorporates a four-deep instruction queue and a four-deep store queue to facilitate parallel execution.

The 6x86 CPU operates from a 3.3-volt power supply, resulting in reasonable power consumption at all frequencies. In addition, the 6x86 CPU incorporates a low-power suspend mode, stop clock capability, and System Management Mode (SMM) for power-sensitive applications.

1.1 Major Functional Blocks

The 6x86 processor consists of five major functional blocks, as shown in the overall block diagram on the first page of this manual:

- Integer Unit
- Cache Unit
- Memory Management Unit
- Floating Point Unit
- Bus Interface Unit

Instructions are executed in the X and Y pipelines within the Integer Unit and also in the Floating Point Unit (FPU). The Cache Unit stores the most recently used data and instructions to allow fast access to the information by the Integer Unit and FPU.

Physical addresses are calculated by the Memory Management Unit and passed to the Cache Unit and the Bus Interface Unit (BIU). The BIU provides the interface between the external system board and the processor's internal execution units.

1.2 Integer Unit

The Integer Unit (Figure 1-1) provides parallel instruction execution using two seven-stage integer pipelines. Each of the two pipelines, X and Y, can process several instructions simultaneously.

[Figure 1-1. Integer Unit: the X and Y pipelines share the Fetch and Decode 1 stages (in-order processing); each pipe then has its own Decode 2, Address Calculation 1, Address Calculation 2, Execution, and Write-Back stages (out-of-order completion).]

The Integer Unit consists of the following pipeline stages:

- Instruction Fetch (IF)
- Instruction Decode 1 (ID1)
- Instruction Decode 2 (ID2)
- Address Calculation 1 (AC1)
- Address Calculation 2 (AC2)
- Execute (EX)
- Write-Back (WB)

The instruction decode and address calculation functions are both divided into superpipelined stages.

1.2.1 Pipeline Stages

The Instruction Fetch (IF) stage, shared by both the X and Y pipelines, fetches 16 bytes of code from the cache unit in a single clock cycle. Within this section, the code stream is checked for any branch instructions that could affect normal program sequencing. If an unconditional or conditional branch is detected, branch prediction logic within the IF stage generates a predicted target address for the instruction. The IF stage then begins fetching instructions at the predicted address.

The superpipelined Instruction Decode function contains the ID1 and ID2 stages. ID1, shared by both pipelines, evaluates the code stream provided by the IF stage and determines the number of bytes in each instruction. Up to two instructions per clock are delivered to the ID2 stages, one in each pipeline. The ID2 stages decode instructions and send the decoded instructions to either the X or Y pipeline for execution. The particular pipeline is chosen based on which instructions are already in each pipeline and how fast they are expected to flow through the remaining pipeline stages.

The Address Calculation function contains two stages, AC1 and AC2. If the instruction refers to a memory operand, the AC1 stage calculates a linear memory address for the instruction. The AC2 stage performs any required memory management functions, cache accesses, and register file accesses. If a floating point instruction is detected by AC2, the instruction is sent to the FPU for processing.

The Execute (EX) stage executes instructions using the operands provided by the address calculation stage.

The Write-Back (WB) stage is the last stage of the Integer Unit (IU). The WB stage stores execution results either to a register file within the IU or to a write buffer in the cache control unit.

1.2.2 Out-of-Order Processing

If an instruction executes faster than the previous instruction in the other pipeline, the instructions may complete out of order. All instructions are processed in order up to the EX stage. While in the EX and WB stages, instructions may be completed out of order. If there is a data dependency between two instructions, the necessary hardware interlocks are enforced to ensure correct program execution. Even though instructions may complete out of order, exceptions and writes resulting from the instructions are always issued in program order.
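As a rough illustration of the rules above, the following C sketch (a model assumed for illustration, not the 6x86 hardware) encodes the seven stages, marks the two stages shared by the X and Y pipes, and flags the point from which out-of-order completion is permitted:

    /* Illustrative model of the 6x86 integer pipeline stage rules:
     * IF and ID1 are shared; instructions remain in program order up
     * to EX and may complete out of order in EX/WB, while exceptions
     * and writes are still issued in program order. */
    #include <stdio.h>

    enum stage { IF, ID1, ID2, AC1, AC2, EX, WB };
    static const char *stage_name[] =
        { "IF", "ID1", "ID2", "AC1", "AC2", "EX", "WB" };

    /* IF and ID1 are shared by the X and Y pipes; ID2 onward is per pipe. */
    static int shared_stage(int s)              { return s <= ID1; }

    /* Out-of-order completion is permitted only from EX onward. */
    static int may_complete_out_of_order(int s) { return s >= EX; }

    int main(void) {
        for (int s = IF; s <= WB; s++)
            printf("%-3s  shared=%d  ooo-completion=%d\n",
                   stage_name[s], shared_stage(s),
                   may_complete_out_of_order(s));
        return 0;
    }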

1.2.3 Pipeline Selection

In most cases, instructions are processed in either pipeline and without pairing constraints on the instructions. However, certain instructions are processed only in the X pipeline:

- Branch instructions
- Floating point instructions
- Exclusive instructions

Branch and floating point instructions may be paired with a second instruction in the Y pipeline. Exclusive instructions cannot be paired with instructions in the Y pipeline. These instructions typically require multiple memory accesses. Although exclusive instructions may not be paired, hardware from both pipelines is used to accelerate instruction completion. Listed below are the 6x86 CPU exclusive instruction types:

- Protected mode segment loads
- Special register accesses (Control, Debug, and Test Registers)
- String instructions
- Multiply and divide
- I/O port accesses
- Push all (PUSHA) and pop all (POPA)
- Intersegment jumps, calls, and returns

1.2.4 Data Dependency Solutions

When two instructions that are executing in parallel require access to the same data or register, one of the following types of data dependencies may occur:

- Read-After-Write (RAW)
- Write-After-Read (WAR)
- Write-After-Write (WAW)

Data dependencies typically force serialized execution of instructions. However, the 6x86 CPU implements three mechanisms that allow parallel execution of instructions containing data dependencies:

- Register Renaming
- Data Forwarding
- Data Bypassing

The following sections provide detailed examples of these mechanisms.

1.2.4.1 Register Renaming

The 6x86 CPU contains 32 physical general purpose registers. Each of the 32 registers in the register file can be temporarily assigned as one of the general purpose registers defined by the x86 architecture (EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP). For each register write operation, a new physical register is selected to allow previous data to be retained temporarily. Register renaming effectively removes all WAW and WAR dependencies. The programmer does not have to consider register renaming; it is completely transparent to both the operating system and application software.
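Before the worked examples that follow, a minimal C sketch may help fix the idea: a rename table maps the eight x86 logical registers onto the 32 physical registers, and every register write claims a fresh physical register, so WAR and WAW conflicts disappear. The table layout and the naive allocator are illustrative assumptions, not the 6x86's actual rename hardware:

    #include <stdio.h>

    #define NUM_LOGICAL  8    /* EAX..ESP */
    #define NUM_PHYSICAL 32

    static int rename_table[NUM_LOGICAL];  /* logical -> current physical */
    static int next_free = NUM_LOGICAL;    /* naive allocator; real hardware
                                              recycles freed registers */

    /* A read uses the logical register's current physical mapping. */
    static int read_reg(int logical) { return rename_table[logical]; }

    /* A write claims a new physical register, so a later write (WAW)
     * or an earlier read (WAR) never conflicts with it. */
    static int write_reg(int logical) {
        rename_table[logical] = next_free++ % NUM_PHYSICAL;
        return rename_table[logical];
    }

    int main(void) {
        enum { AX, BX };  /* two of the eight logical registers */
        for (int r = 0; r < NUM_LOGICAL; r++) rename_table[r] = r;

        /* Example #1 below: MOV BX,AX (X pipe) ; ADD AX,CX (Y pipe) */
        int src  = read_reg(AX);   /* MOV reads AX's current physical reg */
        int dst1 = write_reg(BX);  /* ...and writes BX into a fresh one   */
        int dst2 = write_reg(AX);  /* ADD writes AX into another fresh
                                      register, leaving the MOV's source
                                      value intact                        */
        printf("MOV: Reg%d <- Reg%d; ADD result in Reg%d\n", dst1, src, dst2);
        return 0;
    }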

Example #1 - Register Renaming Eliminates Write-After-Read (WAR) Dependency

A WAR dependency exists when the first in a pair of instructions reads a logical register, and the second instruction writes to the same logical register. This type of dependency is illustrated by the pair of instructions shown below:

  X PIPE                  Y PIPE
  (1) MOV BX, AX          (2) ADD AX, CX
      BX ← AX                 AX ← AX + CX

Note: In this and the following examples the original instruction order is shown in parentheses.

In the absence of register renaming, the ADD instruction in the Y pipe would have to be stalled to allow the MOV instruction in the X pipe to read the AX register. The 6x86 CPU, however, avoids the Y pipe stall (Table 1-1). As each instruction executes, the results are placed in new physical registers to avoid the possibility of overwriting a logical register value and to allow the two instructions to complete in parallel (or out of order) rather than in sequence.

Table 1-1. Register Renaming with WAR Dependency

  Instruction   Physical Register Contents         Pipe  Action
                Reg0   Reg1   Reg2   Reg3   Reg4
  (Initial)     AX     BX     CX
  MOV BX, AX    AX            CX     BX           X     Reg3 ← Reg0
  ADD AX, CX                  CX     BX     AX    Y     Reg4 ← Reg0 + Reg2

Note: The representations of the MOV and ADD instructions in the final column of Table 1-1 are completely independent.

Example #2 - Register Renaming Eliminates Write-After-Write (WAW) Dependency

A WAW dependency occurs when two consecutive instructions perform writes to the same logical register. This type of dependency is illustrated by the pair of instructions shown below:

  X PIPE                  Y PIPE
  (1) ADD AX, BX          (2) MOV AX, [mem]
      AX ← AX + BX            AX ← [mem]

Without register renaming, the MOV instruction in the Y pipe would have to be stalled to guarantee that the ADD instruction in the X pipe would write its results to the AX register first. The 6x86 CPU uses register renaming and avoids the Y pipe stall. The contents of the AX and BX registers are placed in physical registers (Table 1-2). As each instruction executes, the results are placed in new physical registers to avoid the possibility of overwriting a logical register value and to allow the two instructions to complete in parallel (or out of order) rather than in sequence.

Table 1-2. Register Renaming with WAW Dependency

  Instruction     Physical Register Contents   Pipe  Action
                  Reg0   Reg1   Reg2   Reg3
  (Initial)       AX     BX
  ADD AX, BX             BX     AX            X     Reg2 ← Reg0 + Reg1
  MOV AX, [mem]          BX            AX     Y     Reg3 ← [mem]

Note: All subsequent reads of the logical register AX will refer to Reg3, the result of the MOV instruction.

1.2.4.2 Data Forwarding

Register renaming alone cannot remove RAW dependencies. The 6x86 CPU uses two types of data forwarding in conjunction with register renaming to eliminate RAW dependencies:

- Operand Forwarding
- Result Forwarding

Operand forwarding takes place when the first in a pair of instructions performs a move from register or memory, and the data that is read by the first instruction is required by the second instruction. The 6x86 CPU performs the read operation and makes the data read available to both instructions simultaneously.

Result forwarding takes place when the first in a pair of instructions performs an operation (such as an ADD) and the result is required by the second instruction to perform a move to a register or memory. The 6x86 CPU performs the required operation and stores the result of the operation to the destinations of both instructions simultaneously.

Example #3 - Operand Forwarding Eliminates Read-After-Write (RAW) Dependency

A RAW dependency occurs when the first in a pair of instructions performs a write, and the second instruction reads the same register. This type of dependency is illustrated by the pair of instructions shown below in the X and Y pipelines:

  X PIPE                  Y PIPE
  (1) MOV AX, [mem]       (2) ADD BX, AX
      AX ← [mem]              BX ← AX + BX

The 6x86 CPU uses operand forwarding and avoids a Y pipe stall (Table 1-3). Operand forwarding allows simultaneous execution of both instructions by first reading memory and then making the results available to both pipelines in parallel.

Table 1-3. Example of Operand Forwarding

  Instruction     Physical Register Contents   Pipe  Action
                  Reg0   Reg1   Reg2   Reg3
  (Initial)       AX     BX
  MOV AX, [mem]          BX     AX            X     Reg2 ← [mem]
  ADD BX, AX                    AX     BX     Y     Reg3 ← [mem] + Reg1

Operand forwarding can only occur if the first instruction does not modify its source data. In other words, the instruction must be a move-type instruction (for example, MOV, POP, LEA). Operand forwarding occurs for both register and memory operands. The size of the first instruction's destination and the second instruction's source must match.

Example #4 - Result Forwarding Eliminates Read-After-Write (RAW) Dependency

In this example, a RAW dependency occurs when the first in a pair of instructions performs a write, and the second instruction reads the same register. This dependency is illustrated by the pair of instructions in the X and Y pipelines, as shown below:

  X PIPE                  Y PIPE
  (1) ADD AX, BX          (2) MOV [mem], AX
      AX ← AX + BX            [mem] ← AX

The 6x86 CPU uses result forwarding and avoids a Y pipe stall (Table 1-4). Instead of transferring the contents of the AX register to memory, the result of the previous ADD instruction (Reg0 + Reg1) is written directly to memory, thereby saving a clock cycle.

Table 1-4. Result Forwarding Example

  Instruction     Physical Register Contents   Pipe  Action
                  Reg0   Reg1   Reg2
  (Initial)       AX     BX
  ADD AX, BX             BX     AX            X     Reg2 ← Reg0 + Reg1
  MOV [mem], AX          BX     AX            Y     [mem] ← Reg0 + Reg1

The second instruction must be a move instruction, and the destination of the second instruction may be either a register or memory.

1.2.4.3 Data Bypassing

In addition to register renaming and data forwarding, the 6x86 CPU implements a third data dependency-resolution technique called data bypassing. Data bypassing reduces the performance penalty of those memory-data RAW dependencies that cannot be eliminated by data forwarding. Data bypassing is implemented when the first in a pair of instructions writes to memory and the second instruction reads the same data from memory. The 6x86 CPU retains the data from the first instruction and passes it to the second instruction, thereby eliminating a memory read cycle. Data bypassing only occurs for cacheable memory locations.

Example #1 - Data Bypassing with Read-After-Write (RAW) Dependency

In this example, a RAW dependency occurs when the first in a pair of instructions performs a write to memory and the second instruction reads the same memory location. This dependency is illustrated by the pair of instructions in the X and Y pipelines, as shown below:

  X PIPE                  Y PIPE
  (1) ADD [mem], AX       (2) SUB BX, [mem]
      [mem] ← [mem] + AX      BX ← BX - [mem]

The 6x86 CPU uses data bypassing and stalls the Y pipe for only one clock by eliminating the Y pipe's memory read cycle (Table 1-5). Instead of reading memory in the Y pipe, the result of the previous instruction ([mem] + Reg0) is used to subtract from Reg1, thereby saving a memory access cycle.

Table 1-5. Example of Data Bypassing

  Instruction     Physical Register Contents   Pipe  Action
                  Reg0   Reg1   Reg2
  (Initial)       AX     BX
  ADD [mem], AX   AX     BX                   X     [mem] ← [mem] + Reg0
  SUB BX, [mem]   AX     BX                   Y     Reg2 ← Reg1 - {[mem] + Reg0}
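The retained-store idea behind data bypassing can be sketched in C as follows. The structure and names are simplifying assumptions for illustration; the actual mechanism is internal to the 6x86 cache and pipelines:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t memory[16];             /* toy backing store        */
    static unsigned cache_read_cycles = 0;  /* counts actual read cycles */

    static struct { uint32_t addr, data; int valid; } last_write;

    /* X-pipe write: update memory and retain the value just written.  */
    static void write_mem(uint32_t addr, uint32_t data) {
        memory[addr % 16] = data;
        last_write.addr = addr; last_write.data = data; last_write.valid = 1;
    }

    /* Y-pipe read: if it targets the address just written, bypass the
     * cache read cycle and use the retained value directly.           */
    static uint32_t read_mem(uint32_t addr) {
        if (last_write.valid && last_write.addr == addr)
            return last_write.data;         /* bypassed: no read cycle  */
        cache_read_cycles++;
        return memory[addr % 16];
    }

    int main(void) {
        /* ADD [mem],AX ; SUB BX,[mem]: the SUB reads the bypassed value */
        uint32_t ax = 5, bx = 40;
        write_mem(8, read_mem(8) + ax);     /* ADD [mem],AX             */
        bx = bx - read_mem(8);              /* SUB BX,[mem] (bypassed)  */
        printf("BX=%u, cache read cycles=%u\n", bx, cache_read_cycles);
        return 0;
    }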

1.2.5 Branch Control

Branch instructions occur on average every four to six instructions in x86-compatible programs. When the normal sequential flow of a program changes due to a branch instruction, the pipeline stages may stall while waiting for the CPU to calculate, retrieve, and decode the new instruction stream. The 6x86 CPU minimizes the performance degradation and latency of branch instructions through the use of branch prediction and speculative execution.

1.2.5.1 Branch Prediction

The 6x86 CPU uses a 256-entry, 4-way set-associative Branch Target Buffer (BTB) to store branch target addresses and branch prediction information. During the fetch stage, the instruction stream is checked for the presence of branch instructions. If an unconditional branch instruction is encountered, the 6x86 CPU accesses the BTB to check for the branch instruction's target address. If the branch instruction's target address is found in the BTB, the 6x86 CPU begins fetching at the target address specified by the BTB.

In the case of conditional branches, the BTB also provides history information to indicate whether the branch is more likely to be taken or not taken. If the conditional branch instruction is found in the BTB, the 6x86 CPU begins fetching instructions at the predicted target address. If the conditional branch misses in the BTB, the 6x86 CPU predicts that the branch will not be taken, and instruction fetching continues with the next sequential instruction. The decision to fetch the taken or not-taken target address is based on a four-state branch prediction algorithm.

Once fetched, a conditional branch instruction is first decoded and then dispatched to the X pipeline only. The conditional branch instruction proceeds through the X pipeline and is then resolved in either the EX stage or the WB stage. The conditional branch is resolved in the EX stage if the instruction responsible for setting the condition codes is completed prior to the execution of the branch. If the instruction that sets the condition codes is executed in parallel with the branch, the conditional branch instruction is resolved in the WB stage.

Correctly predicted branch instructions execute in a single core clock. If resolution of a branch indicates that a misprediction has occurred, the 6x86 CPU flushes the pipeline and starts fetching from the correct target address. The 6x86 CPU prefetches both the predicted and the non-predicted path for each conditional branch, thereby eliminating the cache access cycle on a misprediction. If the branch is resolved in the EX stage, the resulting misprediction latency is four cycles. If the branch is resolved in the WB stage, the latency is five cycles.

Since the target address of return (RET) instructions is dynamic rather than static, the 6x86 CPU caches target addresses for RET instructions in an eight-entry return stack rather than in the BTB. The return address is pushed on the return stack during a CALL instruction and popped during the corresponding RET instruction.
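The databook does not spell out the four states of its prediction algorithm. One common four-state scheme is a two-bit saturating counter, sketched below in C as an assumption rather than Cyrix's confirmed design; its key property is that a single misprediction does not reverse a strongly held prediction:

    #include <stdio.h>

    /* Two-bit saturating counter: one common four-state predictor. */
    enum state { STRONG_NT, WEAK_NT, WEAK_T, STRONG_T };

    static enum state update(enum state s, int taken) {
        if (taken) return s == STRONG_T  ? STRONG_T  : s + 1;
        else       return s == STRONG_NT ? STRONG_NT : s - 1;
    }

    static int predict_taken(enum state s) { return s >= WEAK_T; }

    int main(void) {
        enum state s = WEAK_NT;            /* initial history */
        int outcomes[] = { 1, 1, 0, 1 };   /* actual branch behavior */
        for (int i = 0; i < 4; i++) {
            printf("predict %-9s actual %s\n",
                   predict_taken(s) ? "taken," : "not-taken,",
                   outcomes[i] ? "taken" : "not-taken");
            s = update(s, outcomes[i]);    /* train on the outcome */
        }
        return 0;
    }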

1.2.5.2 Speculative Execution

The 6x86 CPU is capable of speculative execution following a floating point instruction or predicted branch. Speculative execution allows the pipelines to continuously execute instructions following a branch without stalling while waiting for branch resolution. The same mechanism is used to execute floating point instructions (see Section 1.5) in parallel with integer instructions.

The 6x86 CPU is capable of up to four levels of speculation (i.e., combinations of four conditional branches and floating point operations). After generating the fetch address using branch prediction, the CPU checkpoints the machine state (registers, flags, and processor environment), increments the speculation level counter, and begins operating on the predicted instruction stream. Once the branch instruction is resolved, the CPU decreases the speculation level. For a correctly predicted branch, the status of the checkpointed resources is cleared. For a branch misprediction, the 6x86 processor generates the correct fetch address and uses the checkpointed values to restore the machine state in a single clock.

In order to maintain compatibility, writes that result from speculatively executed instructions are not permitted to update the cache or external memory until the appropriate branch is resolved. Speculative execution continues until one of the following conditions occurs:

  1) A branch or floating point operation is decoded and the speculation level is already at four.
  2) An exception or a fault occurs.
  3) The write buffers are full.
  4) An attempt is made to modify a non-checkpointed resource (i.e., segment registers, system flags).

1.3 Cache Units

The 6x86 CPU employs two caches, the Unified Cache and the Instruction Line Cache (Figure 1-2).

1.3.1 Unified Cache

The 16-KByte unified write-back cache functions as the primary data cache and as the secondary instruction cache. Configured as a four-way set-associative cache, it stores up to 16 KBytes of code and data in 512 lines. The cache is dual-ported and allows any two of the following operations to occur in parallel:

- Code fetch
- Data read (X pipe, Y pipe, or FPU)
- Data write (X pipe, Y pipe, or FPU)

The unified cache uses a pseudo-LRU replacement algorithm and can be configured to allocate new lines on read misses only or on read and write misses. More information concerning the unified cache can be found in Section 2.7.1 (Page 2-52).
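The stated geometry implies a 32-byte line (16384 bytes / 512 lines) and 128 sets (512 lines / 4 ways), i.e., a 5-bit byte offset and a 7-bit set index. The C sketch below performs this address decomposition; the split follows arithmetically from the figures above, while the function names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 32u    /* 16384 bytes / 512 lines */
    #define NUM_SETS   128u   /* 512 lines / 4 ways      */

    /* Byte offset within a line: low 5 address bits. */
    static uint32_t cache_offset(uint32_t addr) { return addr & (LINE_BYTES - 1); }

    /* Set index: next 7 address bits. */
    static uint32_t cache_set(uint32_t addr) { return (addr / LINE_BYTES) % NUM_SETS; }

    /* Tag: remaining high bits, compared against all 4 ways of the set. */
    static uint32_t cache_tag(uint32_t addr) { return addr / (LINE_BYTES * NUM_SETS); }

    int main(void) {
        uint32_t a = 0x12345678;
        printf("tag=0x%x set=%u offset=%u\n",
               cache_tag(a), cache_set(a), cache_offset(a));
        return 0;
    }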

1.3.2 Instruction Line Cache

The fully associative 256-byte instruction line cache serves as the primary instruction cache. The instruction line cache is filled from the unified cache through the data bus. Fetches from the integer unit that hit in the instruction line cache do not access the unified cache. If an instruction line cache miss occurs, the instruction line data from the unified cache is transferred to the instruction line cache and the integer unit simultaneously. The instruction line cache uses a pseudo-LRU replacement algorithm.

To ensure proper operation in the case of self-modifying code, any writes to the unified cache are checked against the contents of the instruction line cache. If a hit occurs in the instruction line cache, the appropriate line is invalidated.

[Figure 1-2. Cache Unit Operations: block diagram connecting the Integer Unit (IF, X pipe, Y pipe) and FPU to the 256-byte fully associative instruction line cache (8 lines), the 16-KByte 4-way set-associative unified cache (512 lines) with its cache tags, the Memory Management Unit (TLB), and the Bus Interface Unit.]

1.4 Memory Management Unit

The Memory Management Unit (MMU), shown in Figure 1-3, translates the linear address supplied by the IU into a physical address to be used by the unified cache and the bus interface. Memory management procedures are x86 compatible, adhering to standard paging mechanisms. The 6x86 MMU includes two paging mechanisms (Figure 1-3): a traditional paging mechanism and a 6x86 variable-size paging mechanism.

1.4.1 Variable-Size Paging Mechanism

The Cyrix variable-size paging mechanism allows software to map pages between 4 KBytes and 4 GBytes in size. The large contiguous memories provided by this mechanism help avoid TLB (Translation Lookaside Buffer) thrashing [see Section 2.6.4 (Page 2-45)] associated with some operating systems and applications. For example, use of a single large page instead of a series of small 4-KByte pages can greatly improve performance in an application using a large video memory buffer.

[Figure 1-3. Paging Mechanism within the Memory Management Unit: the linear address feeds both the variable-size paging mechanism and the traditional mechanism, which comprises a 4-entry DTE cache, a 128-entry main TLB, and an 8-entry victim TLB, backed by the CR3 control register, page directory table, page table, and page frame.]
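A quick calculation illustrates the point. A 4-MByte video buffer (an assumed example size, not a figure from the databook) requires 1024 traditional 4-KByte page mappings, far more than a TLB can hold at once, but only one variable-size page:

    #include <stdio.h>

    int main(void) {
        unsigned buffer_kbytes = 4u * 1024u;          /* assumed 4-MByte buffer */
        unsigned small_pages   = buffer_kbytes / 4u;  /* 4-KByte pages needed   */
        printf("4-KByte pages needed: %u\n", small_pages);   /* prints 1024 */
        printf("variable-size pages needed: 1\n");
        return 0;
    }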

1.4.2 Traditional Paging Mechanism

The traditional paging mechanism has been enhanced on the 6x86 CPU with the addition of the Directory Table Entry (DTE) cache and the Victim TLB. The main TLB (Translation Lookaside Buffer) is a direct-mapped, 128-entry cache for page table entries. The four-entry, fully associative DTE cache stores the most recent DTE accesses. If a Page Table Entry (PTE) miss occurs followed by a DTE hit, only a single memory access to the PTE table is required.

The Victim TLB stores PTEs which have been displaced from the main TLB due to a TLB miss. If a PTE access occurs while the PTE is stored in the victim TLB, the PTE in the victim TLB is swapped with a PTE in the main TLB. This has the effect of selectively increasing TLB associativity. The 6x86 CPU updates the eight-entry, fully associative victim TLB on an oldest-entry replacement basis.
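The main-TLB/victim-TLB interaction described above can be modeled with a short C sketch. The direct-mapped main TLB, the eight-entry fully associative victim TLB, the oldest-entry replacement, and the swap on a victim hit follow the text; the entry layout and names are simplified assumptions:

    #include <stdint.h>
    #include <stdio.h>

    #define MAIN_ENTRIES   128
    #define VICTIM_ENTRIES 8

    struct tlbe { uint32_t vpage; uint32_t pframe; int valid; };
    static struct tlbe main_tlb[MAIN_ENTRIES];
    static struct tlbe victim[VICTIM_ENTRIES];
    static int victim_next = 0;              /* oldest-entry replacement */

    /* Look up a virtual page: direct-mapped main TLB first, then the
     * fully associative victim TLB; on a victim hit, swap the entry
     * back into the main TLB (selectively increasing associativity). */
    static int lookup(uint32_t vpage, uint32_t *pframe) {
        struct tlbe *m = &main_tlb[vpage % MAIN_ENTRIES];
        if (m->valid && m->vpage == vpage) { *pframe = m->pframe; return 1; }
        for (int i = 0; i < VICTIM_ENTRIES; i++)
            if (victim[i].valid && victim[i].vpage == vpage) {
                struct tlbe t = *m; *m = victim[i]; victim[i] = t;
                *pframe = m->pframe;
                return 1;
            }
        return 0;  /* miss: page tables (and the DTE cache) are consulted */
    }

    /* Fill after a miss: a displaced main-TLB entry moves to the victim TLB. */
    static void fill(uint32_t vpage, uint32_t pframe) {
        struct tlbe *m = &main_tlb[vpage % MAIN_ENTRIES];
        if (m->valid) victim[victim_next++ % VICTIM_ENTRIES] = *m;
        m->vpage = vpage; m->pframe = pframe; m->valid = 1;
    }

    int main(void) {
        uint32_t pf;
        fill(1, 100);            /* vpage 1 -> main-TLB slot 1              */
        fill(129, 200);          /* also slot 1: vpage 1 moves to victim    */
        int hit = lookup(129, &pf);
        printf("vpage 129: hit=%d pframe=%u\n", hit, pf);
        hit = lookup(1, &pf);    /* victim hit; swapped back into main TLB  */
        printf("vpage 1:   hit=%d pframe=%u\n", hit, pf);
        return 0;
    }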

1.5 Floating Point Unit

The 6x86 Floating Point Unit (FPU) interfaces to the integer unit and the cache unit through a 64-bit bus. The 6x86 FPU is x87 instruction set compatible and adheres to the IEEE-754 standard. Since most applications contain FPU instructions mixed with integer instructions, the 6x86 FPU achieves high performance by completing integer and FPU operations in parallel.

FPU Parallel Execution

The 6x86 CPU executes integer instructions in parallel with FPU instructions. Integer instructions may complete out of order with respect to the FPU instructions. The 6x86 CPU maintains x86 compatibility by signaling exceptions and issuing write cycles in program order.

As previously discussed, FPU instructions are always dispatched to the integer unit's X pipeline. The address calculation stage of the X pipeline checks for memory management exceptions and accesses memory operands used by the FPU. If no exceptions are detected, the 6x86 CPU checkpoints the state of the CPU and, during AC2, dispatches the floating point instruction to the FPU instruction queue. The 6x86 CPU can then complete any subsequent integer instructions speculatively and out of order relative to the FPU instruction and relative to any potential FPU exceptions which may occur.

As additional FPU instructions enter the pipeline, the 6x86 CPU dispatches up to four FPU instructions to the FPU instruction queue. The 6x86 CPU continues executing speculatively and out of order, relative to the FPU queue, until the 6x86 CPU encounters one of the conditions that causes speculative execution to halt. As the FPU completes instructions, the speculation level decreases and the checkpointed resources become available for reuse in subsequent operations. The 6x86 FPU also uses a set of four write buffers to prevent stalls due to speculative writes.

1.6 Bus Interface Unit

The Bus Interface Unit (BIU) provides the signals and timing required by external circuitry. The signal descriptions and bus interface timing information are provided in Chapters 3 and 4 of this manual.