Semester paper for CSE 3322, Fall 1999


Memory Hierarchies vs.
By:
Login:
Date: Nov 8th, 1999
Director: Professor Al-Khaiyat
TA: Mr. Byung Sung

Introduction:

As a semester paper for a computer architecture course, this paper describes an important concept in computer architecture: memory hierarchies. To use a computer system better and more efficiently, computer memories are built as hierarchies, with a series of different kinds of memory ranging from very fast, expensive, and therefore small memory at the top of the hierarchy down to slow, cheap, and very large memory at the bottom. Two processors are used as examples to further illustrate the concept: the PowerPC, the most successful RISC architecture, and the Pentium II, Intel's high-performance desktop processor, which integrates the best attributes of the P6 microarchitecture. Their memory hierarchy characteristics are described separately and the differences are shown.

This paper is organized as follows: Introduction. Chapter one: Memory hierarchies. Chapter two: PowerPC. Chapter three: Pentium Pro and Pentium II processors. Chapter four: Comparison.

Chapter one: Memory hierarchies.

1.1 Principle of Locality:

Computer processors tend to access memory in a patterned way. For example, in the absence of branches, the program counter is incremented by one after each instruction. Thus, if memory location x is accessed at time t, there is a high probability that the processor will request an instruction from memory location x+1 in the near future. This clustering of memory references into groups is termed locality of reference. More specifically, it can be divided into:

Temporal locality: If a memory location is referenced, it will tend to be referenced again soon.
Spatial locality: If a memory location is referenced, locations with nearby addresses will tend to be referenced soon.

1.2 Definitions of memory hierarchies:

Computer memories are built as hierarchies, with a series of different kinds of memory ranging from very fast, expensive, and therefore small memory at the top of the hierarchy down to slow, cheap, and very large memory at the bottom. For example, registers typically form the fastest memory, followed by cache, main memory, disks, and finally tape as the slowest, largest, and cheapest.

Characteristics:

1: The processor sends its request to the fastest, smallest partition of memory (cache). If what it wants is there, it can be loaded quickly. If it isn't, the request is forwarded to the next lower level of the hierarchy, and so on. The key idea is that when the lower (slower, larger, and cheaper) members of the hierarchy answer a request from a higher level for the content of location x, they also send the content of x+1, x+2, ... at the same time. Because of locality of reference, it is likely that these will be needed shortly, and if they are, they can be loaded quickly from faster memory.

2: Since an entire large matrix cannot fit in the registers, it must be moved up and down through the hierarchy: transferred up to the registers when work needs to be done, and transferred back down to main memory (or disk or tape) when it is no longer needed.

3: Useful floating-point operations can only be done on data at the top of the hierarchy, in the registers.

4: It takes time to move data between levels of the memory hierarchy, and moving is slower the farther down the hierarchy one goes. Indeed, one such data movement takes far longer than performing a floating-point operation.

The following pages contain two figures that give the reader a clearer picture of the memory hierarchy.
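The effect of characteristic 1 can be illustrated with a small simulation. This is only a sketch: the two levels, their capacities, the block size of 8 addresses, and the latencies (1, 10, and 100 cycles) are invented for illustration and do not describe any real machine. Each miss brings in the whole block containing x, so x+1, x+2, ... are already present when they are requested.

```python
from collections import OrderedDict


class Level:
    """One level of a toy memory hierarchy (capacity and latency are made-up values)."""

    def __init__(self, name, capacity_blocks, latency):
        self.name = name
        self.capacity = capacity_blocks
        self.latency = latency
        self.blocks = OrderedDict()  # block number -> True, least recently used first

    def lookup(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return True
        return False

    def fill(self, block):
        self.blocks[block] = True
        self.blocks.move_to_end(block)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block


def access(levels, address, block_size=8, backing_latency=100):
    """Return the cycles spent on one access and fill every level that missed."""
    block = address // block_size
    cycles = 0
    missed = []
    for level in levels:
        cycles += level.latency
        if level.lookup(block):
            break
        missed.append(level)
    else:
        cycles += backing_latency  # no level had the block: go to the slowest store
    for level in missed:
        level.fill(block)
    return cycles


def run(addresses):
    """Total cycles for a sequence of addresses on a fresh two-level hierarchy."""
    levels = [Level("cache", 4, 1), Level("main memory", 64, 10)]
    return sum(access(levels, a) for a in addresses)
```

With these assumed numbers, 64 sequential accesses (run(range(64))) cost far fewer cycles than 64 accesses that each touch a different block (run(range(0, 512, 8))), which is the benefit of spatial locality described above.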

Figure: memory hierarchy:

This represents a typical memory hierarchy for a computer system. The fastest components are at the top of the hierarchy, but they are the most expensive and therefore always have the smallest capacity. Hence the memory hierarchy is naturally represented as a triangle. As we move down the hierarchy, the components get larger in capacity, but they also get slower. The differences in speed between registers, cache, and memory are small compared to the difference between memory and disk. This clearly indicates that having to go to the disk for data (because of paging, I/O, etc.) is much less desirable than having the data in any other part of the memory hierarchy.

Figure: typical memory hierarchy:

Chapter Two: PowerPC

The PowerPC architecture is the most successful RISC architecture that has yet appeared ... train control in its car and truck products. The PowerPC architecture is the culmination of several previous IBM processor designs. The IBM 801 led to the appearance of two further architectures: the short-lived ROMP and the successful RS/6000 platform, which is also known as the POWER architecture.

Figure: PowerPC lineage: IBM 801, IBM ROMP, IBM POWER (RS/6000), PowerPC.

Nowadays, built upon the scalable PowerPC architecture, IBM PowerPC microprocessors, embedded controllers, and cores offer solutions for a broad spectrum of applications, from high-end workstations, servers, and desktop computers to consumer electronics and hand-held communications devices. Stand-alone PowerPC 750 and 604e microprocessors offer the performance and low power dissipation needed for emerging desktop and portable computers.

In the rest of this paper we use the PowerPC 604e as an example to illustrate the memory hierarchy of the PowerPC microprocessor family.

2.2 PowerPC 604e

Figure: PowerPC 604e high-speed, high-performance RISC microprocessor (includes 250, 300, 333).

The PowerPC 604e* microprocessor is a 32-bit implementation of the PowerPC family of Reduced Instruction Set Computer (RISC) microprocessors. It is a microarchitecture derivative of the PowerPC 604e microprocessor using split voltages of 1.9 VDC for core logic and 3.3 VDC for I/O. The PowerPC 604e microprocessor is targeted at the workstation and PC power user desktop segments. The suite of operating environments available to systems designed in accordance with the PowerPC microprocessor Common Hardware Reference Platform ...

The 604e implements the PowerPC architecture as it is specified for 32-bit addressing, which provides 32-bit effective (logical) addresses, integer data types of 8, 16, and 32 bits, and floating-point data types (single-precision and double-precision). For 64-bit PowerPC implementations, the PowerPC architecture provides additional 64-bit addressing and related features.

The 604e is a superscalar processor capable of issuing four instructions simultaneously. As many as seven instructions can finish execution in parallel. The 604e has seven execution units that can operate in parallel:

Floating-point unit (FPU)
Branch processing unit (BPU)
Condition register unit (CRU)
Load/store unit (LSU)
Three integer units (IUs): two single-cycle integer units (SCIUs) and one multiple-cycle integer unit (MCIU)

Figure: PowerPC 604e block diagram:

2.3 Cache Introduction

The 604e has separate 32-Kbyte data and instruction caches. This is double the size of the 604 caches. The 604e caches are logically organized as four-way set-associative caches with 256 sets, compared to the 604's 128 sets. The physical address bits that determine the set are 19 through 26, with bit 19 being the most significant bit of the index. If bit 19 is zero, the block of data is in an even 4-Kbyte page that resides in sets 0-127; otherwise, bit 19 is one and the block of data is in an odd 4-Kbyte page that resides in sets 128-255. Because the caches are still four-way set-associative, the cache set element (CSE[0-1]) signals remain unchanged from the 604. The cache is designed to adhere to a write-back policy, but the 604e allows control of cacheability, write policy, and memory coherency at the page and block level, as defined by the PowerPC architecture. The caches use a least recently used (LRU) replacement policy.

Figure: The organization of the caches.

The 604e cache implementation has the following characteristics:

The 604e has separate 32-Kbyte data and instruction caches. This is double the size of the 604 caches.
Instruction and data caches are four-way set associative. The 604e has 256 sets, twice as many as the 604's 128 sets.
Caches implement an LRU replacement algorithm within each set.
The cache directories are physically addressed. The physical (real) address tag is stored in the cache directory.
Both the instruction and data caches have 32-byte cache blocks. A cache block is the block of memory that a coherency state describes, also referred to as a cache line.
The coherency state bits for each block of the data cache allow encoding for all four possible MESI states: Modified (M), Exclusive (E), Shared (S), and Invalid (I).

The coherency state bit for each cache block of the instruction cache allows encoding for two possible states: Invalid (INV) and Valid (VAL).
Each cache can be invalidated or locked by setting the appropriate bits in the hardware implementation dependent register 0 (HID0), a special-purpose register (SPR) specific to the 604e.

The 604e uses eight-word burst transactions to transfer cache blocks to and from memory. When requesting burst reads, the 604e presents a double-word-aligned address. Memory controllers are expected to transfer this double word of data first, followed by double words from increasing addresses, wrapping back to the beginning of the eight-word block as required. Burst misses can be buffered into two eight-word line-fill buffers before being loaded into the cache.

Writes of cache blocks by the 604e (for a copy-back operation) always present the first address of the block and transfer data beginning at the start of the block. However, this does not preclude other masters from transferring critical double words first on the bus for writes.

Note that in this chapter the terms multiprocessor and multiple-processor are used in the context of maintaining cache coherency. The devices involved could be processors or other devices that can access system memory, maintain their own caches, and function as bus masters requiring cache coherency.

The instruction cache is connected to the bus interface unit (BIU) with a 64-bit bus; likewise, the data cache is connected both to the BIU and to the load/store unit (LSU) with a 64-bit bus. The 64-bit bus allows two instructions to be loaded into the instruction cache, or a double word (for example, a double-precision floating-point operand) to be loaded into the data cache, in a single clock cycle. The instruction cache provides a 128-bit interface to the instruction fetcher, so four instructions can be made available to the instruction unit in a single clock cycle.
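The burst read order described above (requested double word first, then increasing addresses, wrapping within the eight-word block) can be sketched as follows. The function name and constants are illustrative only; the sketch models the ordering of the four 64-bit transfers, not the bus protocol itself.

```python
BLOCK_BYTES = 32        # eight 4-byte words per cache block
DOUBLE_WORD_BYTES = 8   # one 64-bit transfer


def burst_read_order(address):
    """Return the double-word addresses of a burst read, critical double word first."""
    block_base = address & ~(BLOCK_BYTES - 1)       # start of the eight-word block
    first = address & ~(DOUBLE_WORD_BYTES - 1)      # double-word-aligned request address
    beats = BLOCK_BYTES // DOUBLE_WORD_BYTES        # four transfers per block
    start = (first - block_base) // DOUBLE_WORD_BYTES
    return [block_base + ((start + i) % beats) * DOUBLE_WORD_BYTES
            for i in range(beats)]
```

For a miss at address 0x1014, the sketch yields the double words at 0x1010 and 0x1018 first and then wraps to 0x1000 and 0x1008.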

2.3.2 Data Cache Organization

As shown, the physically addressed data cache lies between the load/store unit (LSU) and the bus interface unit (BIU). It provides the ability to read and write data in memory while reducing the number of system bus transactions required to execute load/store instructions. The LSU transfers data between the data cache and the result bus, which routes data to the other execution units. The LSU supports address generation and all data alignment to and from the data cache. The LSU also handles other types of instructions that access memory, such as cache control instructions, and supports out-of-order loads and stores while ensuring the integrity of data.

The 604e's data cache is a 32-Kbyte, four-way set-associative cache. It is a physically indexed, nonblocking, write-back cache with hardware support for reloading on cache misses. Each cache block contains eight contiguous words from memory that are loaded from an eight-word boundary (that is, bits A27-A31 of the EA are zero); as a result, cache blocks are aligned with page boundaries. Within a single cycle, the data cache provides a double-word access to the LSU.

The 604e implements three copy-back write buffers (the 604 has one). The additional copy-back buffers allow certain instructions to take further advantage of the pipelined system bus to provide highly efficient handling of cache copy-back operations, block invalidate operations caused by the Data Cache Block Flush (dcbf) instruction, and cache block clean operations resulting from the Data Cache Block Store (dcbst) instruction.

The data cache supports a coherent memory system using the four-state MESI (modified/exclusive/shared/invalid) coherency protocol. As in the 604, the data cache tags are dual-ported, so snooping does not affect the internal operation of other transactions on the system interface. If a snoop hit occurs in a modified block, the LSU is blocked internally for one cycle to allow the eight-word block of data to be copied to the write-back buffer, if necessary.

The data cache can be invalidated all at once or on a per-cache-block basis. The data cache can be disabled and invalidated by setting the HID0[17] and HID0[21] bits, respectively. It can be locked by setting HID0[19].

The 604e provides additional support for data cache line-fill buffer forwarding. In the 604, only the critical double word of a burst operation was made available to the requesting unit at the time it was burst into the line-fill buffer. Subsequent data was unavailable until the cache block was filled. On the 604e, subsequent data is also made available as it arrives in the line-fill buffer.
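The cache geometry given in section 2.3 (32 Kbytes, four ways, 32-byte blocks, set index in physical address bits 19 through 26) implies how a physical address is split into tag, set, and byte offset. The sketch below is an illustration with hypothetical helper names; it uses PowerPC bit numbering, in which bit 0 is the most significant bit of the 32-bit address.

```python
CACHE_BYTES = 32 * 1024
WAYS = 4
BLOCK_BYTES = 32
SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)   # number of sets implied by the geometry


def field(value, first_bit, last_bit, width=32):
    """Extract bits first_bit..last_bit, numbering bit 0 as the most significant bit."""
    length = last_bit - first_bit + 1
    shift = width - 1 - last_bit
    return (value >> shift) & ((1 << length) - 1)


def split_physical_address(pa):
    """Split a 32-bit physical address into cache tag, set index, and byte offset."""
    return {
        "tag": field(pa, 0, 18),      # stored in the cache directory
        "set": field(pa, 19, 26),     # selects one of the sets
        "offset": field(pa, 27, 31),  # byte within the 32-byte block
    }
```

The set count computed from the geometry agrees with the eight index bits, and an address whose bit 19 is one lands in the upper half of the sets.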

2.3.3 Instruction Cache Organization

The 604e's 32-Kbyte, four-way set-associative instruction cache is physically indexed. The organization of the instruction cache, shown in Figure 3-1, is identical to that of the data cache. Each cache block contains eight contiguous words from memory that are loaded from an eight-word boundary (that is, bits A27-A31 of the effective address are zero); as a result, cache blocks are aligned with page boundaries. Within a single cycle, the instruction cache provides as many as four instructions to the instruction fetch unit.

The 604e provides coherency checking for instruction fetches. Instruction fetching coherency is controlled by HID0[23]. In the default mode, HID0[23] is 0 and the GBL signal is not asserted for instruction accesses on the bus, as is the case with the 604. If the bit is set and instruction translation is enabled (MSR[IR] = 1), the GBL signal is set to reflect the M bit for this page or block. If HID0[23] is set and instruction translation is disabled (MSR[IR] = 0), the GBL signal is asserted and coherency is maintained in the instruction cache.

The PowerPC architecture defines a special set of instructions for managing the instruction cache. The instruction cache can be invalidated entirely or on a cache-block basis. In addition, the instruction cache can be disabled and invalidated by setting the HID0[16] and HID0[20] bits, respectively. The instruction cache can be locked by setting HID0[18].

The instruction cache differs from the data cache in that it does not implement the MESI cache coherency protocol; a single state bit indicates only whether a cache block is valid or invalid. If a processor modifies a memory location that may be contained in the instruction cache, software must ensure that memory updates are visible to the instruction fetching mechanism. This can be achieved by the following instruction sequence:

    dcbst    # update memory
    sync     # wait for update
    icbi     # remove (invalidate) copy in instruction cache
    sync     # wait for ICBI operation to be globally performed
    isync    # remove copy in own instruction buffer

These operations are necessary because the data cache is a write-back cache. Because instruction fetching bypasses the data cache, changes made to items in the data cache may not be reflected in memory until after a fetch operation completes.

2.4 Memory management.

2.4.1 Main Function.

The primary function of the MMU in a PowerPC processor is the translation of logical (effective) addresses to physical addresses (referred to as real addresses in the architecture specification) for memory accesses, I/O accesses (most I/O accesses are assumed to be memory-mapped), and direct-store interface accesses. In addition, the MMU provides access protection on a segment, block, or page basis.

Two general types of accesses generated by PowerPC processors require address translation: instruction accesses, and data accesses to memory generated by load and store instructions. Generally, the address translation mechanism is defined in terms of the segment descriptors and page tables used by PowerPC processors to locate the effective-to-physical address mapping for instruction and data accesses. The segment information translates the effective address to an interim virtual address, and the page table information translates the interim virtual address to a physical address.

The segment descriptors, used to generate the interim virtual addresses, are stored as on-chip segment registers on 32-bit implementations (such as the 604e). In addition, two translation lookaside buffers (TLBs) are implemented on the 604e to keep recently used page address translations on-chip. Although the PowerPC OEA describes one MMU (conceptually), the 604e hardware maintains separate TLBs and table search resources for instruction and data accesses, which can be performed independently (and simultaneously). Therefore, the 604e is described as having two MMUs, one for instruction accesses (IMMU) and one for data accesses (DMMU). Block diagrams are shown on the following pages.

The block address translation (BAT) mechanism is a software-controlled array that stores the available block address translations on-chip. BAT array entries are implemented as pairs of BAT registers that are accessible as supervisor special-purpose registers (SPRs). There are separate instruction and data BAT mechanisms, and in the 604e they reside in the instruction and data MMUs, respectively.

2.4.2 Feature Summary:

2.4.3 Organization of MMU.

Figure: the conceptual organization of a PowerPC MMU in a 32-bit implementation.

The figure is conceptual and does not describe the hardware that implements the memory management function for a particular processor. Processors may optionally implement on-chip TLBs and may optionally support the automatic search of the page tables for PTEs. In addition, other hardware features (invisible to the system software) not depicted in the figure may be implemented.

The 604e maintains two on-chip TLBs with the following characteristics:

128 entries, two-way set associative (64 x 2), LRU replacement
Data TLB supports the DMMU; instruction TLB supports the IMMU
Hardware TLB update
Hardware update of memory access recording bits in the translation table

In the event of a TLB miss, the hardware attempts to load the TLB based on the results of a translation table search operation.
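The TLB organization listed above (64 sets of two entries, LRU replacement, hardware reload on a miss) can be modeled in a few lines. This is a toy sketch, not the 604e logic: the class name is hypothetical, a plain dictionary stands in for the page table search, and indexing by the low six bits of the page number is an assumption made here for illustration.

```python
class TwoWayTLB:
    """Toy 128-entry, two-way set-associative TLB (64 sets x 2 ways) with LRU."""

    SETS = 64
    WAYS = 2

    def __init__(self, page_table):
        self.page_table = page_table                 # stand-in for the table search
        self.sets = [[] for _ in range(self.SETS)]   # entries (vpn, ppn), most recent last
        self.hits = 0
        self.misses = 0

    def translate(self, vpn):
        ways = self.sets[vpn % self.SETS]            # assumed index: low 6 bits of the page number
        for i, (tag, ppn) in enumerate(ways):
            if tag == vpn:
                ways.append(ways.pop(i))             # move to most-recently-used position
                self.hits += 1
                return ppn
        self.misses += 1
        ppn = self.page_table[vpn]                   # reload the entry on a miss
        if len(ways) == self.WAYS:
            ways.pop(0)                              # evict the least recently used entry
        ways.append((vpn, ppn))
        return ppn
```

Three pages that map to the same set (for example pages 0, 64, and 128) are enough to force LRU evictions in this model, since each set holds only two translations.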

Figure: PowerPC 604e Instruction MMU block diagram:

The instruction addresses shown in the figure are generated by the processor for sequential instruction fetches and for addresses that correspond to a change of program flow. As shown in the figures, after an address is generated, the higher-order bits of the effective address, EA0-EA19 (or a smaller set of address bits, EA0-EAn, in the case of blocks), are translated into physical address bits PA0-PA19. The lower-order address bits, A20-A31, are untranslated and therefore identical for both effective and physical addresses. After translating the address, the MMUs pass the resulting 32-bit physical address to the memory subsystem.
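The split between translated and untranslated bits can be shown with a short sketch. It collapses the real two-step translation (effective to virtual to physical) into a single lookup in a hypothetical page_map dictionary, so it only illustrates which bits change and which pass through.

```python
PAGE_OFFSET_BITS = 12   # A20-A31: the 12 untranslated low-order bits of a 4-Kbyte page


def translate(ea, page_map):
    """Translate a 32-bit effective address using a hypothetical page_map dictionary.

    page_map maps the upper 20 bits of the effective address (EA0-EA19)
    to the upper 20 bits of the physical address (PA0-PA19).
    """
    upper = ea >> PAGE_OFFSET_BITS
    offset = ea & ((1 << PAGE_OFFSET_BITS) - 1)
    return (page_map[upper] << PAGE_OFFSET_BITS) | offset
```

For example, with page_map = {0x12: 0x7FFFF}, the effective address 0x00012ABC keeps its low 12 bits (0xABC) and only the upper 20 bits are replaced.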

Figure: PowerPC 604e Data MMU block diagram:

The data addresses shown in the figure are generated by load and store instructions (both for the memory and the direct-store interfaces) and by cache instructions. In addition to the higher-order address bits, the MMUs automatically keep an indicator of whether each access was generated as an instruction or data access, and a supervisor/user indicator that reflects the state of the PR bit of the MSR when the effective address was generated. In addition, for data accesses, there is an indicator of whether the access is for a load or a store operation. This information is then used by the MMUs to direct the address translation appropriately and to enforce the protection hierarchy programmed by the operating system.

2.5 Virtual memory and memory addressing

A program references memory using the effective (logical) address computed by the processor when it executes a memory access or branch instruction or when it fetches the next sequential instruction. Bytes in memory are numbered consecutively starting with zero. Each number is the address of the corresponding byte.

Memory operands may be bytes, half words, words, or double words, or, for the load/store multiple and load/store string instructions, a sequence of bytes or words. The address of a memory operand is the address of its first byte (that is, of its lowest-numbered byte). Operand length is implicit for each instruction. The PowerPC architecture supports both big-endian and little-endian byte ordering. The default byte and bit ordering is big-endian.

The operand of a single-register memory access instruction has a natural alignment boundary equal to the operand length. In other words, the natural address of an operand is an integral multiple of the operand length. A memory operand is said to be aligned if it is aligned at its natural boundary; otherwise it is misaligned.

An effective address (EA) is the 32-bit sum computed by the processor when executing a memory access or branch instruction or when fetching the next sequential instruction. For a memory access instruction, if the sum of the effective address and the operand length exceeds the maximum effective address, the memory operand is considered to wrap around from the maximum effective address through effective address 0. Effective address computations for both data and instruction accesses use 32-bit unsigned binary arithmetic. A carry from bit 0 is ignored.

Load and store operations have three categories of effective address generation:

Register indirect with immediate index mode
Register indirect with index mode
Register indirect mode

Branch instructions have three categories of effective address generation:

Immediate
Link register indirect
Count register indirect
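The 32-bit arithmetic, natural alignment, and wrap-around rules above can be expressed directly. The following is a minimal sketch with hypothetical function names; it models only the address arithmetic, not instruction decoding.

```python
MASK32 = 0xFFFFFFFF


def effective_address(base, displacement):
    """Register indirect with immediate index: 32-bit sum, carry out of bit 0 ignored."""
    return (base + displacement) & MASK32


def is_aligned(ea, length):
    """An operand is aligned when its address is an integral multiple of its length."""
    return ea % length == 0


def operand_bytes(ea, length):
    """Byte addresses covered by an operand, wrapping from the maximum EA to address 0."""
    return [(ea + i) & MASK32 for i in range(length)]
```

A word at 0x1004 is aligned, while a double word at the same address is not; a four-byte operand at 0xFFFFFFFE wraps around and occupies addresses 0xFFFFFFFE, 0xFFFFFFFF, 0, and 1.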

2.5.1 Addressing.

PowerPC processors support the following four types of address translation:

Page address translation: translates the page frame address for a 4-Kbyte page size.
Block address translation: translates the block number for blocks that range in size from 128 Kbytes to 256 Mbytes.
Direct-store interface address translation: used to generate direct-store interface accesses on the external bus; not optimized for performance and present for compatibility only.
Real addressing mode address translation: when address translation is disabled, the physical address is identical to the effective address.

The figure shows the four address translation mechanisms provided by the MMUs. The segment descriptors shown in the figure control both the page and direct-store interface address translation mechanisms. When an access uses page or direct-store interface address translation, the appropriate segment descriptor is required. In 32-bit implementations, one of the 16 on-chip segment registers (which contain segment descriptors) is selected by the four highest-order effective address bits.

A control bit in the corresponding segment descriptor then determines whether the access is to memory (memory-mapped) or to the direct-store interface space. Note that the direct-store interface is present only for compatibility with existing I/O devices that used this interface. When an access is determined to be to the direct-store interface space, the implementation invokes an elaborate hardware protocol for communication with these devices. The direct-store interface protocol is not optimized for performance, and therefore its use is discouraged. The most efficient method for accessing I/O devices is by memory-mapping the I/O areas.

For memory accesses translated by a segment descriptor, the interim virtual address is generated using the information in the segment descriptor. Page address translation corresponds to the conversion of this virtual address into the 32-bit physical address used by the memory subsystem. In most cases, the physical address for the page resides in an on-chip TLB and is available for quick access. However, if the page address translation misses in an on-chip TLB, the MMU causes a search of the page tables in memory (using the virtual address information and a hashing function) to locate the required physical address.

Block address translation occurs in parallel with page and direct-store segment address translation and is similar to page address translation; however, fewer higher-order effective address bits are translated into physical address bits, and more lower-order address bits (at least 17) are left untranslated to form the offset into a block. Also, instead of segment descriptors and a TLB, block address translations use the on-chip BAT registers as a BAT array. If an effective address matches the corresponding field of a BAT register, the information in the BAT register is used to generate the physical address; in this case, the results of the page translation and the direct-store translation (occurring in parallel) are ignored.
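Two details of this section lend themselves to a quick check: the selection of a segment register by the four highest-order effective address bits, and the "at least 17" untranslated bits for the smallest block. The sketch below uses hypothetical function names, and it assumes block sizes are powers of two, which the text does not state explicitly.

```python
def segment_register_index(ea):
    """The four highest-order bits of a 32-bit EA select one of the 16 segment registers."""
    return (ea >> 28) & 0xF


def block_offset_bits(block_bytes):
    """Number of untranslated low-order address bits for a block of the given size.

    Assumes power-of-two block sizes between 128 Kbytes and 256 Mbytes.
    """
    in_range = 128 * 1024 <= block_bytes <= 256 * 1024 * 1024
    if block_bytes & (block_bytes - 1) or not in_range:
        raise ValueError("block size must be a power of two from 128 Kbytes to 256 Mbytes")
    return block_bytes.bit_length() - 1
```

A 128-Kbyte block leaves 17 offset bits untranslated, which matches the minimum quoted above, and a 256-Mbyte block leaves 28.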

Figure: virtual memory and addressing.

Chapter Three: Pentium II

3.1 Introduction to Pentium II:

The Pentium II and Pentium Pro processors are members of the P6 family of processors, which includes all of the Intel Architecture processors that implement Intel's dynamic execution microarchitecture. The Pentium II processor is the next in the Intel386, Intel486, Pentium, and Pentium Pro line of Intel processors. The Pentium II processor at 450 MHz, Intel's high-performance desktop processor, integrates the best attributes of the P6 microarchitecture processors: Dynamic Execution performance, a multi-transaction system bus, and Intel's MMX media enhancement technology.

Pentium II processors are targeted at professionals, avid PC users, and PC gamers (the Enthusiast/Professional desktop users). In addition, they are targeted at mainstream home and business users (the Performance desktop PC market). The Pentium II processor also meets the needs of entry-level servers and workstations.

The Intel Pentium II processors deliver excellent performance for all PC software and are fully compatible with existing Intel Architecture-based software. The latest Pentium II processor, at 450 MHz, extends processing power further by offering performance headroom for business media, communication, and Internet capabilities. Software designed for Intel's MMX technology unleashes the full multimedia capabilities of these processors, including full-screen, full-motion video, enhanced color, and realistic graphics. The Pentium II processor brings excitement to your PC experience. Systems based on Pentium II processors also include the latest features to simplify system management and lower the total cost of ownership for large and small business environments. The Pentium II processor offers great performance for today's and tomorrow's applications.

3.2 Features Summary:

Feature | Content | Remark
1: Dynamic execution microarchitecture | It incorporates a unique combination of multiple branch prediction, data flow analysis, and speculative execution, which enables the Pentium II processor to deliver higher performance than the Pentium family of processors, while maintaining binary compatibility with all previous Intel Architecture processors. | Memory related
2: Intel's MMX technology | The Pentium II processor also incorporates Intel's MMX technology, for enhanced media and communication performance. |
3: Energy | To aid in the design of energy-efficient computer systems, the Pentium II processor offers multiple low-power states such as AutoHALT, Stop-Grant, Sleep, and Deep Sleep, to conserve power during idle times. |
4: Multiple processing | The Pentium II processor utilizes the same multi-processing system bus technology as the Pentium Pro processor. This allows for a higher level of performance for both uni-processor and two-way multi-processor (2-way MP) systems. |
5: Cache | Memory is cacheable for up to 512 MB of addressable memory space, allowing significant headroom for business desktop systems. | Memory related
6: Bus | High-performance Dual Independent Bus (DIB) architecture (system bus and cache bus) for high bandwidth, performance, and capability with future systems technologies. |
7: L2 cache | The Pentium II processor deviates from the Pentium Pro processor by using commercially available die for the L2 cache. The L2 cache (the Tag RAM and pipelined burst synchronous static RAM (BSRAM) memories) is now multiple die. Transfer rates between the Pentium II processor core and the L2 cache are one-half the processor core clock frequency and scale with the processor core frequency. Both the Tag RAM and BSRAM receive clocked data directly from the Pentium II processor core. As with the Pentium Pro processor, the L2 cache does not connect to the Pentium II processor system bus. | Memory related
8: Cache bus | As with the Pentium Pro processor, the Pentium II processor has a dedicated cache bus, thus maintaining the dual independent bus architecture to deliver high bus bandwidth and high performance. | Memory related

9: Single Edge Contact (S.E.C.) | The S.E.C. cartridge allows the L2 cache to remain tightly coupled to the processor, while enabling use of high-volume commercial SRAM components. The L2 cache is performance optimized and tested at the package level. The S.E.C. cartridge utilizes surface mount technology and a substrate with an edge finger connection. The S.E.C. cartridge introduced on the Pentium II processor will also be used in future Slot 1 processors. | Memory related
10: ECC | Available with ECC (Error Correction Code) functionality on the level-two cache bus for applications where data integrity and reliability are essential. | Memory related
11: Protection | Parity-protected address/request and response system bus signals with a retry mechanism for high data integrity and reliability. | Memory related
12: Address | 450, 400, and 350 MHz versions support memory cacheability for up to 4 GB of addressable memory space. | Memory related

24 3.3 Pentium Pro and Pentium II Semester paper for CSE 3322, Fall 1999 The Intel Pentium Pro processor introduced Dynamic Execution. It has a three-way superscalar architecture, which means that it can execute three instructions per CPU clock. Pentium II does this by incorporating even more parallelism than the Pentium processor. The Pentium Pro processor provides Dynamic Execution (micro-data flow analysis, out-oforder execution, superior branch prediction, and speculative execution) in a super scalar implementation. Three instructions decode units work in parallel to decode object code into smaller operations called micro-ops. These go into an instruction pool, and (when interdependencies don t prevent) can be executed out of order by the five parallel execution units (two integer, two FPU and one memory interface unit). The Retirement Unit retires completed micro-ops in their original program order, taking account of any branches. The power of the Pentium Pro processor is further enhanced by its caches: it has the same two on-chip 8-KByte L1 caches as does the Pentium processor, and also has a 256-KByte L2 cache that is in the same package as, and closely coupled to, the CPU, using a dedicated 64-bit ( backside ) full clock speed bus. The L1 cache is dual-ported, the L2 cache supports up to 4 concurrent accesses, and the 64-bit external data bus is transaction-oriented, meaning that each access is handled as a separate request and response, with numerous requests allowed while awaiting a response. These parallel features for data access work with the parallel execution capabilities to provide a nonblocking architecture in which the processor is more fully utilized and performance is enhanced. The Pentium Pro processor also has an expanded 36-bit address bus, giving a maximum physical address space of 64 GBytes. 
The Pentium II processor added MMX instructions to the Pentium Pro processor architecture and introduced the new slot 1 and slot 2 packaging techniques. These new packaging techniques moved the L2 cache off-chip (off-die). The slot 1 and slot 2 packages use a single-edge connector instead of a socket. The Pentium II processor expanded the L1 data cache and L1 instruction cache to 16 Kbytes each. The Pentium II processor has L2 cache sizes of 256 Kbytes, 512 Kbytes, and 1 Mbyte or 2 Mbytes (slot 2 only). The slot 1 processor uses a half clock speed backside bus while the slot 2 processor uses a full clock speed backside bus.

Figure: processing units and their interface with memory subsystems

3.3 Cache: introduction
The memory subsystem for the P6 Family processor consists of main system memory, the primary cache (L1), and the secondary cache (L2). The bus interface unit accesses system memory through the external system bus. This 64-bit bus is a transaction-oriented bus, meaning that each bus access is handled as separate request and response operations. While the bus interface unit is waiting for a response to one bus request, it can issue numerous additional requests. The bus interface unit accesses the close-coupled L2 cache through a 64-bit cache bus. This bus is also transaction-oriented, supporting up to four concurrent cache accesses, and operates at the full clock speed of the processor. Access to the L1 caches is through internal buses, also at full clock speed. The 8-KByte L1 instruction cache is four-way set associative; the 8-KByte L1 data cache is dual-ported and two-way set associative, supporting one load and one store operation per cycle.
Coherency between the caches and system memory is maintained using the MESI (modified, exclusive, shared, invalid) cache protocol. This protocol maintains cache coherency in single- and multiple-processor systems. It is also able to detect coherency problems created by self-modifying code.
Memory requests from the processor's execution units go through the memory interface unit and the memory order buffer. These units have been designed to support a smooth flow of memory access requests through the cache and system memory hierarchy to prevent memory access blocking. The L1 data cache automatically forwards a cache miss on to the L2 cache, and then, if necessary, the bus interface unit forwards an L2 cache miss to system memory. Memory requests to the L2 cache or system memory go through the memory reorder buffer, which functions as a scheduling and dispatch station.
This unit keeps track of all memory requests and is able to reorder some requests to prevent blocks and improve throughput. For example, the memory reorder buffer allows loads to pass stores. It also issues speculative loads. (Stores are always dispatched in order, and speculative stores are never issued.)
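The miss-forwarding behavior described above (L1 to L2 to system memory) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Intel's design: the class name `Level`, the dictionary used as storage, and the fill-on-return policy are all hypothetical, and reordering, bus transactions, and replacement are left out.

```python
class Level:
    """One level of a toy memory hierarchy (hypothetical model)."""

    def __init__(self, name, next_level=None, contents=None):
        self.name = name
        self.next_level = next_level          # next slower level, or None
        self.lines = dict(contents or {})     # address -> data held here

    def read(self, address):
        """Return (data, name of the level that supplied it)."""
        if address in self.lines:             # hit at this level
            return self.lines[address], self.name
        if self.next_level is None:
            raise KeyError(f"address {address:#x} not backed by any level")
        # Miss: forward the request to the next level, then keep a copy.
        data, source = self.next_level.read(address)
        self.lines[address] = data
        return data, source


memory = Level("memory", contents={0x1000: "payload"})
l2 = Level("L2", next_level=memory)
l1 = Level("L1", next_level=l2)

first = l1.read(0x1000)    # misses in L1 and L2, served by memory
second = l1.read(0x1000)   # now a hit in L1
```

The first read is supplied by system memory and leaves a copy in both caches; the second read of the same address is satisfied by L1 without going further down the hierarchy.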

The slides above provide detailed information regarding cache memory within the P6 microarchitecture. They show that the P6 microarchitecture CPU core includes a Level 1 (L1) instruction cache and an L1 data cache. The L1 instruction cache is single-ported while the L1 data cache is dual-ported. The Bus Interface Unit (BIU) is also integrated into the processor core. Circuits that interface the processor to the System Bus are included in the core as well. A unified data and instruction Level 2 (L2) cache is integrated in the same package as the CPU core. The L2 cache is connected to the CPU core through a separate bus, the L2 Cache Bus (or Backside Bus). Most P6 microarchitecture processors have an L2 Cache Bus that runs at the same frequency as the CPU core.

3.3.2 L1 Cache
The sizes and configuration of the L1 caches on different P6 microarchitecture processors vary.
1: However, each processor is configured so that the L1 instruction cache is separate from the L1 data cache.
2: The Pentium Pro processor has an L1 instruction cache that is a 4-way set associative 8KB cache. The L1 data cache is also 8KB in size.
3: However, unlike the L1 instruction cache, the data cache is only 2-way set associative. Both caches support non-blocking accesses and can have up to 4 outstanding misses without stalling the processor.
4: The Pentium II processor has an L1 instruction cache and L1 data cache that are both 4-way set associative and 16KB in size. Both caches support non-blocking accesses and can have up to 4 outstanding misses without stalling the processor.
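To make the size and associativity figures above concrete, the following sketch computes the number of sets in each L1 cache and splits an address into tag, set index, and byte offset. The 32-byte line size is an assumption of this sketch (the text above does not state it), and the function names are hypothetical.

```python
def num_sets(cache_bytes, ways, line_bytes=32):
    """Number of sets = cache size / (associativity * line size)."""
    return cache_bytes // (ways * line_bytes)


def split_address(address, cache_bytes, ways, line_bytes=32):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets = num_sets(cache_bytes, ways, line_bytes)
    offset = address % line_bytes
    set_index = (address // line_bytes) % sets
    tag = address // (line_bytes * sets)
    return tag, set_index, offset


KB = 1024
pentium_pro_icache_sets = num_sets(8 * KB, ways=4)    # 4-way, 8KB
pentium_pro_dcache_sets = num_sets(8 * KB, ways=2)    # 2-way, 8KB
pentium_ii_l1_sets = num_sets(16 * KB, ways=4)        # 4-way, 16KB
```

Under the assumed line size, halving the associativity at a fixed cache size doubles the number of sets, which is why the 2-way 8KB data cache has twice as many sets as the 4-way 8KB instruction cache.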

3.3.3 L2 Cache
1: Processors based on the P6 microarchitecture all have a unified data and instruction L2 cache in the same package as the CPU.
2: The L2 caches are all 4-way set associative caches.
3: However, the L2 Cache Bus speed and the cache sizes supported by each processor vary. The Pentium Pro processor has an L2 Cache Bus running at the CPU core frequency. It supports 256KB, 512KB, or 1024KB L2 cache size configurations. The Pentium II processor L2 Cache Bus runs at half the CPU core frequency. The Pentium II processor supports only 256KB and 512KB cache size configurations.

3.4 Memory and addressing modes
Introduction. The memory that the processor addresses on its bus is called physical memory. Physical memory is organized as a sequence of 8-bit bytes. Each byte is assigned a unique address, called a physical address. The physical address space ranges from zero to a maximum of 2^36 - 1 (64 gigabytes). Virtually any operating system or executive designed to work with an IA processor will use the processor's memory management facilities to access memory. These facilities provide features such as segmentation and paging, which allow memory to be managed efficiently and reliably. The following paragraphs describe the basic methods of addressing memory when memory management is used.
When employing the processor's memory management facilities, programs do not directly address physical memory. Instead, they access memory using any of three memory models: flat, segmented, or real-address mode. With the flat memory model (refer to Figure), memory appears to a program as a single, continuous address space, called a linear address space. Code (a program's instructions), data, and the procedure stack are all contained in this address space. The linear address space is byte addressable, with addresses running contiguously from 0 to 2^32 - 1. An address for any byte in the linear address space is called a linear address.
With the segmented memory model, memory appears to a program as a group of independent address spaces called segments. When using this model, code, data, and stacks are typically contained in separate segments. To address a byte in a segment, a program must issue a logical address, which consists of a segment selector and an offset. (A logical address is often referred to as a far pointer.) The segment selector identifies the segment to be accessed and the offset identifies a byte in the address space of the segment.
The programs running on an IA processor can address up to 16,383 segments of different sizes and types, and each segment can be as large as 2^32 bytes. Internally, all the segments that are defined for a system are mapped into the processor's linear address space. The processor translates each logical address into a linear address to access a memory location. This translation is transparent to the application program. The primary reason for using segmented memory is to increase the reliability of programs and systems. For example, placing a program's stack in a separate segment prevents the stack from growing into the code or data space and overwriting instructions or data, respectively. Placing the operating system's or executive's code, data, and stack in separate segments also protects them from the application program and vice versa.
With either the flat or segmented model, the IA provides facilities for dividing the linear address space into pages and mapping the pages into virtual memory. If an operating system/executive uses the IA's paging mechanism, the existence of the pages is transparent to an application program.

The real-address mode model uses the memory model for the Intel 8086 processor, the first IA processor. It was provided in all the subsequent IA processors for compatibility with existing programs written to run on the Intel 8086 processor. The real-address mode uses a specific implementation of segmented memory in which the linear address space for the program and the operating system/executive consists of an array of segments of up to 64 Kbytes in size each. The maximum size of the linear address space in real-address mode is 2^20 bytes.
Figure: Addressing mode
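As a small worked example of real-address mode, the sketch below forms a 20-bit linear address from a 16-bit segment value and a 16-bit offset by shifting the segment left four bits and adding the offset. The shift-by-four rule is the standard 8086 scheme and is not spelled out in the text above; wrapping the result to 20 bits models the 8086's 20 address lines and is an assumption of this sketch, as is the function name.

```python
ADDRESS_BITS = 20                      # 2^20 bytes of linear address space
SEGMENT_SIZE = 1 << 16                 # each segment spans up to 64 Kbytes


def real_mode_linear(segment, offset):
    """Linear address = segment * 16 + offset, wrapped to 20 bits."""
    if not (0 <= segment < SEGMENT_SIZE and 0 <= offset < SEGMENT_SIZE):
        raise ValueError("segment and offset must be 16-bit values")
    return ((segment << 4) + offset) & ((1 << ADDRESS_BITS) - 1)
```

Because segments start every 16 bytes but span up to 64 Kbytes, many different segment and offset pairs name the same byte; for instance, 0x1234:0x0010 and 0x1235:0x0000 both give linear address 0x12350.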

3.4.2 Memory management, control and paging
The memory management facilities of the Intel Architecture are divided into two parts: segmentation and paging. Segmentation provides a mechanism for isolating individual code, data, and stack modules so that multiple programs (or tasks) can run on the same processor without interfering with one another. Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a program's execution environment are mapped into physical memory as needed. Paging can also be used to provide isolation between multiple tasks. When operating in protected mode, some form of segmentation must be used. There is no mode bit to disable segmentation. The use of paging, however, is optional. These two mechanisms (segmentation and paging) can be configured to support simple single-program (or single-task) systems, multitasking systems, or multiple-processor systems that use shared memory.
As shown in Figure, segmentation provides a mechanism for dividing the processor's addressable memory space (called the linear address space) into smaller protected address spaces called segments. Segments can be used to hold the code, data, and stack for a program or to hold system data structures (such as a TSS or LDT). If more than one program (or task) is running on a processor, each program can be assigned its own set of segments. The processor then enforces the boundaries between these segments and ensures that one program does not interfere with the execution of another program by writing into the other program's segments. The segmentation mechanism also allows typing of segments so that the operations that may be performed on a particular type of segment can be restricted. All of the segments within a system are contained in the processor's linear address space.
To locate a byte in a particular segment, a logical address (sometimes called a far pointer) must be provided. A logical address consists of a segment selector and an offset. The segment selector is a unique identifier for a segment. Among other things it provides an offset into a descriptor table (such as the global descriptor table, GDT) to a data structure called a segment descriptor. Each segment has a segment descriptor, which specifies the size of the segment, the access rights and privilege level for the segment, the segment type, and the location of the first byte of the segment in the linear address space (called the base address of the segment). The offset part of the logical address is added to the base address for the segment to locate a byte within the segment. The base address plus the offset thus forms a linear address in the processor's linear address space.
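The logical-to-linear translation just described can be illustrated with a minimal sketch. It is a simplification under stated assumptions: the descriptor holds only a base and a limit (access rights, privilege level, and type are omitted), the descriptor table is a plain Python dictionary, and the table index is taken as the selector shifted right by three bits, ignoring the low selector bits. The names `SegmentDescriptor` and `logical_to_linear` are hypothetical.

```python
from typing import NamedTuple


class SegmentDescriptor(NamedTuple):
    base: int    # linear address of the first byte of the segment
    limit: int   # highest valid offset within the segment


def logical_to_linear(descriptor_table, selector, offset):
    """Translate (segment selector, offset) into a linear address."""
    index = selector >> 3                  # assumed: index is in the upper bits
    descriptor = descriptor_table[index]   # look up the segment descriptor
    if offset > descriptor.limit:          # enforce the segment boundary
        raise ValueError("offset beyond segment limit")
    return descriptor.base + offset        # base + offset = linear address


# Hypothetical table with one 64-Kbyte segment starting at 0x10000.
gdt = {1: SegmentDescriptor(base=0x10000, limit=0xFFFF)}
linear = logical_to_linear(gdt, selector=0x08, offset=0x20)
```

The limit check is what lets the processor keep one program from writing into another program's segments: an offset past the end of the segment is rejected before any address is formed.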

Figure: logical addressing to physical addressing.
If paging is not used, the linear address space of the processor is mapped directly into the physical address space of the processor. The physical address space is defined as the range of addresses that the processor can generate on its address bus. Because multitasking computing systems commonly define a linear address space much larger than it is economically feasible to contain all at once in physical memory, some method of virtualizing the linear address space is needed. This virtualization of the linear address space is handled through the processor's paging mechanism.
Paging supports a virtual memory environment where a large linear address space is simulated with a small amount of physical memory (RAM and ROM) and some disk storage. When using paging, each segment is divided into pages (ordinarily 4 Kbytes each in size), which are stored either in physical memory or on the disk. The operating system or executive maintains a page directory and a set of page tables to keep track of the pages. When a program (or task) attempts to access an address location in the linear address space, the processor uses the page directory and page tables to translate the linear address into a physical address and then performs the requested operation (read or write) on the memory location. If the page being accessed is not currently in physical memory, the processor interrupts execution of the program (by generating a page-fault exception). The operating system or executive then reads the page into physical memory from the disk and continues executing the program. If paging is implemented properly in the operating system or executive, the swapping of pages between physical memory and the disk is transparent to the correct execution of a program. Even programs written for 16-bit Intel Architecture processors can be paged (transparently) when they are run in virtual-8086 mode.
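The translation and page-fault sequence above can be sketched as follows. This is an illustrative model only: it assumes 4-Kbyte pages with 1024-entry page directories and page tables, so a 32-bit linear address splits into 10 + 10 + 12 bits; the directory and tables are Python dictionaries, a missing entry stands for "not present", and `load_page` is a hypothetical stand-in for the operating system reading a page from disk and returning the base of the physical frame it chose.

```python
class PageFault(Exception):
    def __init__(self, linear_address):
        super().__init__(f"page fault at {linear_address:#010x}")
        self.linear_address = linear_address


def split_linear(linear_address):
    """Return (directory index, table index, offset) for a 4-Kbyte page."""
    directory_index = (linear_address >> 22) & 0x3FF
    table_index = (linear_address >> 12) & 0x3FF
    offset = linear_address & 0xFFF
    return directory_index, table_index, offset


def translate(page_directory, linear_address):
    """Walk directory and table; raise PageFault if the page is absent."""
    d, t, offset = split_linear(linear_address)
    page_table = page_directory.get(d)
    frame_base = page_table.get(t) if page_table is not None else None
    if frame_base is None:
        raise PageFault(linear_address)
    return frame_base + offset


def access(page_directory, linear_address, load_page):
    """Translate; on a fault, let the 'OS' map the page and retry."""
    try:
        return translate(page_directory, linear_address)
    except PageFault as fault:
        d, t, _ = split_linear(fault.linear_address)
        page_directory.setdefault(d, {})[t] = load_page(fault.linear_address)
        return translate(page_directory, linear_address)
```

The retry in `access` mirrors the transparency described above: the program only ever sees the final physical access, not the fault or the page load in between.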

3.5 More about paging and virtual memory.
When operating in protected mode, the Intel Architecture permits the linear address space to be mapped directly into a large physical memory (for example, 4 GBytes of RAM) or indirectly (using paging) into a smaller physical memory and disk storage. This latter method of mapping the linear address space is commonly referred to as virtual memory or demand-paged virtual memory. When paging is used, the processor divides the linear address space into fixed-size pages (generally 4 Kbytes in length) that can be mapped into physical memory and/or disk storage. When a program (or task) references a logical address in memory, the processor translates the address into a linear address and then uses its paging mechanism to translate the linear address into a corresponding physical address. If the page containing the linear address is not currently in physical memory, the processor generates a page-fault exception (#PF). The exception handler for the page-fault exception typically directs the operating system or executive to load the page from disk storage into physical memory (perhaps writing a different page from physical memory out to disk in the process). When the page has been loaded in physical memory, a return from the exception handler causes the instruction that generated the exception to be restarted. The information that the processor uses to map linear addresses into the physical address space and to generate page-fault exceptions (when necessary) is contained in page directories and page tables stored in memory.
Paging is different from segmentation through its use of fixed-size pages. Unlike segments, which usually are the same size as the code or data structures they hold, pages have a fixed size. If segmentation is the only form of address translation used, a data structure present in physical memory will have all of its parts in memory.
If paging is used, a data structure can be partly in memory and partly in disk storage. To minimize the number of bus cycles required for address translation, the most recently accessed page-directory and page-table entries are cached in the processor in devices called translation lookaside buffers (TLBs). The TLBs satisfy most requests for reading the current page directory and page tables without requiring a bus cycle. Extra bus cycles occur only when the TLBs do not contain a page-table entry, which typically happens when a page has not been accessed for a long time.
Three flags in the processor's control registers control paging:
PG (paging) flag, bit 31 of CR0 (available in all Intel Architecture processors beginning with the Intel386 processor).
PSE (page size extensions) flag, bit 4 of CR4 (introduced in the Pentium and Pentium Pro processors).

PAE (physical address extension) flag, bit 5 of CR4 (introduced in the Pentium Pro processors).
The PG flag enables the page-translation mechanism. The operating system or executive usually sets this flag during processor initialization. The PG flag must be set if the processor's page-translation mechanism is to be used to implement a demand-paged virtual memory system or if the operating system is designed to run more than one program (or task) in virtual-8086 mode. The PSE flag enables large page sizes: 4-MByte pages or 2-MByte pages (when the PAE flag is set). When the PSE flag is clear, the more common page length of 4 Kbytes is used. The PAE flag enables 36-bit physical addresses. This physical address extension can only be used when paging is enabled. It relies on page directories and page tables to reference physical addresses above FFFFFFFFH.
The information that the processor uses to translate linear addresses into physical addresses (when paging is enabled) is contained in four data structures:
Page directory: An array of 32-bit page-directory entries (PDEs) contained in a 4-KByte page. Up to 1024 page-directory entries can be held in a page directory.
Page table: An array of 32-bit page-table entries (PTEs) contained in a 4-KByte page. Up to 1024 page-table entries can be held in a page table. (Page tables are not used for 2-MByte or 4-MByte pages. These page sizes are mapped directly from one or more page-directory entries.)
Page: A 4-KByte, 2-MByte, or 4-MByte flat address space.
Page-Directory-Pointer Table: An array of four 64-bit entries, each of which points to a page directory.
These tables provide access to either 4-KByte or 4-MByte pages when normal 32-bit physical addressing is being used and to 4-KByte, 2-MByte, or 4-MByte pages when extended (36-bit) physical addressing is being used. The page size and physical address size depend on the settings of the paging control flags.
Each page-directory entry contains a PS (page size) flag that specifies whether the entry points to a page table whose entries in turn point to 4-KByte pages (PS set to 0) or whether the page-directory entry points directly to a 4-MByte or 2-MByte page (PSE or PAE set to 1 and PS set to 1).
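The way the PG, PSE, and PAE control flags and the per-entry PS flag combine can be summarized in a short sketch. It encodes only what the paragraphs above state; treating the PS flag as having no effect when neither PSE nor PAE is set is an assumption of this sketch, and the function name `paging_mode` is hypothetical.

```python
KBYTE = 1024
MBYTE = 1024 * KBYTE


def paging_mode(pg, pse, pae, ps):
    """Return (page size in bytes, physical address bits), or None if paging is off.

    pg, pse, pae are the control-register flags; ps is the page-size flag
    of the page-directory entry being used.
    """
    if not pg:
        return None                          # paging disabled
    physical_bits = 36 if pae else 32        # PAE enables 36-bit addresses
    if ps and pae:
        return 2 * MBYTE, physical_bits      # large page with PAE set
    if ps and pse:
        return 4 * MBYTE, physical_bits      # large page with PSE only
    return 4 * KBYTE, physical_bits          # default 4-KByte page
```

For example, `paging_mode(pg=True, pse=True, pae=False, ps=True)` describes a 4-MByte page reached directly from a page-directory entry, with no page table involved.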
