Semester paper for CSE 3322, Fall 1999


Memory Hierarchies vs.
By :
Login :
Date : Nov 8th, 1999
Director: Professor Al-Khaiyat
TA : Mr. Byung Sung

Introduction:

As a semester paper for a computer architecture course, this paper describes an important concept in computer architecture: memory hierarchies. To use the computer system better and more efficiently, computer memories are built as hierarchies, with a series of different kinds of memory ranging from very fast, expensive, and therefore small memory at the top of the hierarchy down to slow, cheap, and very large memory at the bottom.

Two processors are used as examples to further illustrate the memory hierarchy concept: the PowerPC, the most successful RISC architecture, and the Pentium II, Intel's high-performance desktop processor, which integrates the best attributes of the P6 microarchitecture. Their memory hierarchy characteristics are described separately, and the differences are then shown.

This paper is organized as follows:
Introduction.
Chapter one: Memory hierarchies.
Chapter two: PowerPC.
Chapter three: Pentium Pro and Pentium II Processor.
Chapter four: Comparison.

Chapter one: Memory hierarchies.

1.1 Principle of Locality:

Computer processors tend to access memory in a patterned way. For example, in the absence of branches, the program counter is incremented by one after each instruction. Thus, if memory location x is accessed at time t, there is a high probability that the processor will request an instruction from memory location x+1 in the near future. This clustering of memory references into groups is termed locality of reference. More specifically, it can be divided into:

Temporal locality: If a memory location is referenced, it will tend to be referenced again soon.
Spatial locality: If a memory location is referenced, locations whose addresses are close to it will tend to be referenced soon.

1.2 Definitions of memory hierarchies:

Computer memories are built as hierarchies, with a series of different kinds of memory ranging from very fast, expensive, and therefore small memory at the top of the hierarchy down to slow, cheap, and very large memory at the bottom. For example, registers typically form the fastest memory, followed by cache, main memory, disks, and finally tape as the slowest, largest, and cheapest.

Characteristics:

1: The processor sends its request to the fastest, smallest partition of memory (the cache). If what it wants is there, it can be loaded quickly. If it is not, the request is forwarded to the next lower level of the hierarchy, and so on. The key idea is that when the lower (slower, larger, and cheaper) members of the hierarchy answer a request from a higher level for the content of location x, they also send the content of x+1, x+2, ... at the same time. Because of locality of reference, it is likely that these will be needed shortly, and if they are, they can be loaded quickly from the faster memory.

2: Since a large data set, such as an entire large matrix, cannot fit in the registers, it must be moved up and down through the hierarchy: transferred up to the registers when work needs to be done, and transferred back down to main memory (or disk or tape) when it is no longer needed.

3: Useful floating-point operations can be done only on data at the top of the hierarchy, in the registers.

4: It takes time to move data between levels of the memory hierarchy, and moving is slower the farther down the hierarchy one goes. Indeed, one such data movement can take far longer than performing a floating-point operation.

The following pages contain two figures to give the reader a clearer picture of the memory hierarchy.
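The effect of spatial locality described in characteristic 1 can be illustrated with a small simulation. This is only a sketch under assumptions that are not from any real processor: a tiny fully associative cache with LRU replacement, a block of 8 locations, and a hypothetical function name `hit_rate`.

```python
def hit_rate(addresses, block_size=8, num_blocks=4):
    """Fraction of accesses served by a small LRU cache that loads whole blocks."""
    cache = []  # block numbers currently cached, most recently used last
    hits = 0
    for addr in addresses:
        block = addr // block_size      # a miss on x also brings in x+1, x+2, ...
        if block in cache:
            hits += 1
            cache.remove(block)         # re-appended below as most recently used
        elif len(cache) == num_blocks:
            cache.pop(0)                # evict the least recently used block
        cache.append(block)
    return hits / len(addresses)


sequential = list(range(64))            # x, x+1, x+2, ... (good spatial locality)
strided = [i * 8 for i in range(64)]    # every access falls in a new block

print(hit_rate(sequential))
print(hit_rate(strided))
```

With sequential access, only the first reference to each block misses; with a stride equal to the block size, every reference misses, so the block transfer brings no benefit.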

Figure: memory hierarchy.

This figure represents a typical memory hierarchy for a computer system. The fastest components are at the top of the hierarchy, but they are the most expensive and therefore always the smallest in capacity. Hence the memory hierarchy is naturally represented as a triangle. As we move down the hierarchy, the components get larger in capacity, but they also get slower. The differences in speed among registers, cache, and main memory are much smaller than the difference between main memory and disk. This clearly indicates that having to keep data on disk (because of paging, I/O, etc.) is much less desirable than having it in any other part of the memory hierarchy.

Figure: typical memory hierarchy.

Chapter Two: PowerPC

The PowerPC architecture is the most successful RISC architecture that has yet appeared [...] train control in its car and truck products. The PowerPC architecture is the culmination of several previous IBM processor designs. The IBM 801 led to the appearance of two further architectures: the short[...] successful RS/6000 platform, which is also known as the POWER architecture, as shown in the figure.

Figure: PowerPC lineage (IBM ROMP; IBM POWER (RS/6000); PowerPC).

Nowadays, built upon the scalable PowerPC architecture, IBM PowerPC microprocessors, embedded controllers, and cores offer solutions for a broad spectrum of applications, from high-end workstations, servers, and desktop computers to consumer electronics and hand-held communications devices. Stand-alone PowerPC 750 and 604e microprocessors offer the performance and low power dissipation needed for emerging desktop and portable computers.

In the rest of this paper we use the PowerPC 604e as an example to illustrate the memory hierarchy of the PowerPC microprocessor family.

2.2 PowerPC 604e

Figure: PowerPC™ 604e High Speed and Performance RISC Microprocessor (includes 250, 300, 333 [...]).

The PowerPC 604e microprocessor is a 32-bit implementation of the PowerPC family of Reduced Instruction Set Computer (RISC) microprocessors. [...] microarchitecture derivative of the PowerPC 604e microprocessor using split voltages of 1.9 VDC for core logic and 3.3 VDC for I/O. The PowerPC 604e microprocessor is targeted at the workstation, PC [...] power user desktop segments. The suite of operating environments available to systems designed in accordance with the PowerPC microprocessor Common Hardware Reference Platform [...].

The 604e implements the PowerPC architecture as it is specified for 32-bit addressing, which provides 32-bit effective (logical) addresses, integer data types of 8, 16, and 32 bits, and floating-point data types of 32 and 64 bits (single-precision and double-precision). For 64-bit PowerPC implementations, the PowerPC architecture provides additional 64-bit integer data types, 64-bit addressing, and related features.

The 604e is a superscalar processor capable of issuing four instructions simultaneously. As many as seven instructions can finish execution in parallel. The 604e has seven execution units that can operate in parallel:
- Floating-point unit (FPU)
- Branch processing unit (BPU)
- Condition register unit (CRU)
- Load/store unit (LSU)
- Three integer units (IUs):
  - Two single-cycle integer units (SCIUs)
  - One multiple-cycle integer unit (MCIU)

Figure: PowerPC™ 604e block diagram.

2.3 Cache

2.3.1 Introduction

The 604e has separate 32-Kbyte data and instruction caches. This is double the size of the 604 caches. The 604e caches are logically organized as four-way set-associative with 256 sets, compared to the 604's 128 sets. The physical address bits that determine the set are 19 through 26, with 19 being the most significant bit of the index. If bit 19 is zero, the block of data belongs to an even 4-Kbyte page and resides in sets 0-127; otherwise, bit 19 is one and the block of data belongs to an odd 4-Kbyte page and resides in sets 128-255. Because the caches are four-way set-associative, the cache set element (CSE[0-1]) signals remain unchanged from the 604. The cache is designed to adhere to a write-back policy, but the 604e allows control of cacheability, write policy, and memory coherency at the page and block level, as defined by the PowerPC architecture. The caches use a least recently used (LRU) replacement policy.

Figure: The organization of the caches.

The 604e cache implementation has the following characteristics:
- The 604e has separate 32-Kbyte data and instruction caches. This is double the size of the 604 caches.
- Instruction and data caches are four-way set associative. The 604e has 256 sets, twice as many as the 604's 128 sets.
- Caches implement an LRU replacement algorithm within each set.
- The cache directories are physically addressed. The physical (real) address tag is stored in the cache directory.
- Both the instruction and data caches have 32-byte cache blocks. A cache block is the block of memory that a coherency state describes, also referred to as a cache line.
- The coherency state bits for each block of the data cache allow encoding for all four possible MESI states:
  - Modified (Exclusive) (M)
  - Exclusive (Unmodified) (E)
  - Shared (S)
  - Invalid (I)

- The coherency state bit for each cache block of the instruction cache allows encoding for two possible states:
  - Invalid (INV)
  - Valid (VAL)
- Each cache can be invalidated or locked by setting the appropriate bits in the hardware implementation-dependent register 0 (HID0), a special-purpose register (SPR) specific to the 604e.

The 604e uses eight-word burst transactions to transfer cache blocks to and from memory. When requesting burst reads, the 604e presents a double-word-aligned address. Memory controllers are expected to transfer this double word of data first, followed by double words from increasing addresses, wrapping back to the beginning of the eight-word block as required. Burst misses can be buffered into two eight-word line-fill buffers before being loaded into the cache. Writes of cache blocks by the 604e (for a copy-back operation) always present the first address of the block and transfer data beginning at the start of the block. However, this does not preclude other masters from transferring critical double words first on the bus for writes.

Note that in this chapter the terms multiprocessor and multiple-processor are used in the context of maintaining cache coherency. The devices involved could be processors or other devices that can access system memory, maintain their own caches, and function as bus masters requiring cache coherency.

The instruction cache is connected to the bus interface unit (BIU) with a 64-bit bus; likewise, the data cache is connected both to the BIU and to the load/store unit (LSU) with a 64-bit bus. The 64-bit bus allows two instructions to be loaded into the instruction cache, or a double word (for example, a double-precision floating-point operand) to be loaded into the data cache, in a single clock cycle. The instruction cache provides a 128-bit interface to the instruction fetcher, so four instructions can be made available to the instruction unit in a single clock cycle.
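The cache geometry given in section 2.3 (32 Kbytes, four-way set associative, 32-byte blocks, set index in physical address bits 19 through 26) can be checked with a short calculation. The sketch below uses PowerPC bit numbering, in which bit 0 is the most significant bit of the 32-bit address. The function name `split_address` is hypothetical, and treating all bits above the index as the tag is a simplifying assumption, not a statement about the 604e hardware.

```python
CACHE_BYTES = 32 * 1024
WAYS = 4
BLOCK_BYTES = 32
NUM_SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)   # 256 sets


def split_address(pa):
    """Split a 32-bit physical address into (tag, set_index, offset)."""
    offset = pa & (BLOCK_BYTES - 1)          # bits 27-31: byte within the block
    set_index = (pa >> 5) & (NUM_SETS - 1)   # bits 19-26: selects one of 256 sets
    tag = pa >> 13                           # bits 0-18: assumed to form the tag
    return tag, set_index, offset


# Bit 19 (PowerPC numbering) has weight 2**12, i.e. it distinguishes
# even and odd 4-Kbyte pages.
print(NUM_SETS)
print(split_address(0x00000000))   # even page: set in 0-127
print(split_address(0x00001000))   # odd page: set in 128-255
```

The second call shows why an odd 4-Kbyte page maps to sets 128 and above: bit 19 is the most significant bit of the eight-bit index.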

2.3.2 Data Cache Organization

The physically addressed data cache lies between the load/store unit (LSU) and the bus interface unit (BIU). It serves the data reads and writes of load/store instructions while reducing the number of system bus transactions they require. The LSU transfers data between the data cache and the result bus, which routes data to the other execution units. The LSU supports address generation and all data alignment to and from the data cache. The LSU also handles other types of instructions that access memory, such as cache control instructions, and supports out-of-order loads and stores while ensuring the integrity of data.

The 604e's data cache is a 32-Kbyte, four-way set-associative cache. It is a physically indexed, nonblocking, write-back cache with hardware support for reloading on cache misses. Each cache block contains eight contiguous words from memory that are loaded from an eight-word boundary (that is, bits A27-A31 of the EA are zero); as a result, cache blocks are aligned with page boundaries. Within a single cycle, the data cache provides a double-word access to the LSU.

The 604e implements three copy-back write buffers (the 604 has one). The additional copy-back buffers allow certain instructions to take further advantage of the pipelined system bus to provide highly efficient handling of cache copy-back operations, block invalidate operations caused by the Data Cache Block Flush (dcbf) instruction, and cache block clean operations resulting from the Data Cache Block Store (dcbst) instruction.

The data cache supports a coherent memory system using the four-state MESI (modified/exclusive/shared/invalid) coherency protocol. As in the 604, the data cache tags are dual-ported, so snooping does not affect the internal operation of other transactions on the system interface. If a snoop hit occurs in a modified block, the LSU is blocked internally for one cycle to allow the eight-word block of data to be copied to the write-back buffer, if necessary.

The data cache can be invalidated all at once or on a per-cache-block basis. The data cache can be disabled and invalidated by setting the HID0[17] and HID0[21] bits, respectively. It can be locked by setting HID0[19].

The 604e provides additional support for data cache line-fill buffer forwarding. In the 604, only the critical double word of a burst operation was made available to the requesting unit at the time it was burst into the line-fill buffer. Subsequent data was unavailable until the cache block was filled. On the 604e, subsequent data is also made available as it arrives in the line-fill buffer.
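The burst read ordering described for the 604e (critical double word first, then increasing addresses, wrapping within the eight-word block) can be sketched as follows. The function name `burst_order` and its parameters are hypothetical; the sketch models only the address sequence, not the bus protocol.

```python
def burst_order(address, block_bytes=32, beat_bytes=8):
    """Addresses of the double words in a wrapping, critical-double-word-first burst."""
    base = address & ~(block_bytes - 1)                   # start of the eight-word block
    first = (address & (block_bytes - 1)) // beat_bytes   # index of the critical double word
    beats = block_bytes // beat_bytes                     # four double words per block
    return [base + ((first + i) % beats) * beat_bytes for i in range(beats)]


# A miss on address 0x1010 fetches the third double word of the block first.
print([hex(a) for a in burst_order(0x1010)])
```

With line-fill buffer forwarding as described for the 604e, each double word in this sequence becomes usable as it arrives rather than only after the last beat.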

2.3.3 Instruction Cache Organization

The 604e's 32-Kbyte, four-way set-associative instruction cache is physically indexed. The organization of the instruction cache, shown in Figure 3-1, is identical to that of the data cache. Each cache block contains eight contiguous words from memory that are loaded from an eight-word boundary (that is, bits A27-A31 of the effective address are zero); as a result, cache blocks are aligned with page boundaries. Within a single cycle, the instruction cache provides as many as four instructions to the instruction fetch unit.

The 604e provides coherency checking for instruction fetches. Instruction fetching coherency is controlled by HID0[23]. In the default mode, HID0[23] is 0 and the GBL signal is not asserted for instruction accesses on the bus, as is the case with the 604. If the bit is set and instruction translation is enabled (MSR[IR] = 1), the GBL signal is set to reflect the M bit for this page or block. If HID0[23] is set and instruction translation is disabled (MSR[IR] = 0), the GBL signal is asserted and coherency is maintained in the instruction cache.

The PowerPC architecture defines a special set of instructions for managing the instruction cache. The instruction cache can be invalidated entirely or on a cache-block basis. In addition, the instruction cache can be disabled and invalidated by setting the HID0[16] and HID0[20] bits, respectively. The instruction cache can be locked by setting HID0[18].

The instruction cache differs from the data cache in that it does not implement the MESI cache coherency protocol; a single state bit indicates only whether a cache block is valid or invalid. If a processor modifies a memory location that may be contained in the instruction cache, software must ensure that memory updates are visible to the instruction fetching mechanism. This can be achieved by the following instruction sequence:

    dcbst   # update memory
    sync    # wait for update
    icbi    # remove (invalidate) copy in instruction cache
    sync    # wait for ICBI operation to be globally performed
    isync   # remove copy in own instruction buffer

These operations are necessary because the data cache is a write-back cache. Because instruction fetching bypasses the data cache, changes made to items in the data cache may not be reflected in memory until after a fetch operation completes.

2.4 Memory management.

2.4.1 Main Function.

The primary function of the MMU in a PowerPC processor is the translation of logical (effective) addresses to physical addresses (referred to as real addresses in the architecture specification) for memory accesses, I/O accesses (most I/O accesses are assumed to be memory-mapped), and direct-store interface accesses. In addition, the MMU provides access protection on a segment, block, or page basis.

Two general types of accesses generated by PowerPC processors require address translation: instruction accesses, and data accesses to memory generated by load and store instructions. Generally, the address translation mechanism is defined in terms of segment descriptors and page tables used by PowerPC processors to locate the effective-to-physical address mapping for instruction and data accesses. The segment information translates the effective address to an interim virtual address, and the page table information translates the interim virtual address to a physical address.

The segment descriptors, used to generate the interim virtual addresses, are stored as on-chip segment registers on 32-bit implementations (such as the 604e). In addition, two translation lookaside buffers (TLBs) are implemented on the 604e to keep recently used page address translations on-chip. Although the PowerPC OEA describes one MMU (conceptually), the 604e hardware maintains separate TLBs and table search resources for instruction and data accesses, which can be performed independently (and simultaneously). Therefore, the 604e is described as having two MMUs, one for instruction accesses (IMMU) and one for data accesses (DMMU). Figures are shown on the next few pages.

The block address translation (BAT) mechanism is a software-controlled array that stores the available block address translations on-chip. BAT array entries are implemented as pairs of BAT registers that are accessible as supervisor special-purpose registers (SPRs). There are separate instruction and data BAT mechanisms, and in the 604e they reside in the instruction and data MMUs, respectively.

2.4.2 Feature Summary:

2.4.3 Organization of the MMU.

Figure: the conceptual organization of a PowerPC MMU in a 32-bit implementation.

The figure is conceptual; it does not describe the specific hardware used to implement the memory management function for a particular processor. Processors may optionally implement on-chip TLBs and may optionally support the automatic search of the page tables for PTEs. In addition, other hardware features (invisible to the system software) not depicted in the figure may be implemented.

The 604e maintains two on-chip TLBs with the following characteristics:
- 128 entries, two-way set associative (64 x 2), LRU replacement
- Data TLB supports the DMMU; instruction TLB supports the IMMU
- Hardware TLB update
- Hardware update of memory access recording bits in the translation table

In the event of a TLB miss, the hardware attempts to load the TLB based on the results of a translation table search operation.
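The TLB organization listed above (128 entries arranged as 64 sets of two ways, with LRU replacement) can be modeled in a few lines. This is a behavioral sketch only: the class name `TwoWayTLB`, the use of the low bits of the virtual page number to select the set, and the stored values are all assumptions for illustration, not the 604e's actual indexing.

```python
class TwoWayTLB:
    """Toy model of a 64-set, two-way set-associative TLB with LRU replacement."""

    def __init__(self, sets=64):
        # Each set holds up to two (virtual_page, physical_page) pairs,
        # ordered from least to most recently used.
        self.sets = [[] for _ in range(sets)]

    def lookup(self, vpn):
        entries = self.sets[vpn % len(self.sets)]   # assumed set selection
        for i, (tag, ppn) in enumerate(entries):
            if tag == vpn:
                entries.append(entries.pop(i))      # mark as most recently used
                return ppn
        return None                                 # miss: a table search would follow

    def insert(self, vpn, ppn):
        entries = self.sets[vpn % len(self.sets)]
        entries[:] = [e for e in entries if e[0] != vpn]   # drop a stale copy
        if len(entries) == 2:
            entries.pop(0)                          # evict the least recently used way
        entries.append((vpn, ppn))


tlb = TwoWayTLB()
tlb.insert(0, 100)
tlb.insert(64, 200)        # same set as page 0
print(tlb.lookup(0))       # hit; page 64 is now the LRU entry of that set
tlb.insert(128, 300)       # third page in the set evicts page 64
print(tlb.lookup(64))      # miss
```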

Figure: PowerPC™ 604e Instruction MMU block diagram.

The instruction addresses shown in the figure are generated by the processor for sequential instruction fetches and for fetches that correspond to a change of program flow. As shown in the figures, after an address is generated, the higher-order bits of the effective address, EA0-EA19 (or a smaller set of address bits, EA0-EAn, in the case of blocks), are translated into physical address bits PA0-PA19. The lower-order address bits, A20-A31, are untranslated and therefore identical for both effective and physical addresses. After translating the address, the MMUs pass the resulting 32-bit physical address to the memory subsystem.
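The split between translated and untranslated bits for 4-Kbyte pages can be made concrete with a short sketch. It assumes PowerPC bit numbering (bit 0 is the most significant bit) and uses hypothetical function names. The top four bits are separated out because, as described in section 2.5.1, they select one of the 16 segment registers; the physical page number passed in is an arbitrary example value, not the result of a real table search.

```python
def split_effective_address(ea):
    """Split a 32-bit effective address into its page-translation fields."""
    segment_register = (ea >> 28) & 0xF     # EA0-EA3: selects 1 of 16 segment registers
    page_index = (ea >> 12) & 0xFFFF        # EA4-EA19: page within the segment
    byte_offset = ea & 0xFFF                # A20-A31: untranslated 4-Kbyte page offset
    return segment_register, page_index, byte_offset


def physical_address(physical_page_number, ea):
    """Combine translated bits PA0-PA19 with the untranslated offset bits A20-A31."""
    return (physical_page_number << 12) | (ea & 0xFFF)


print([hex(f) for f in split_effective_address(0xA0012345)])
print(hex(physical_address(0x54321, 0xA0012345)))
```

Whatever physical page the MMU selects, the low 12 bits of the result always equal the low 12 bits of the effective address.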

Figure: PowerPC™ 604e Data MMU block diagram.

The data addresses shown in the figure are generated by load and store instructions (both for the memory and the direct-store interfaces) and by cache instructions. In addition to the higher-order address bits, the MMUs automatically keep an indicator of whether each access was generated as an instruction or data access and a supervisor/user indicator that reflects the state of the PR bit of the MSR when the effective address was generated. In addition, for data accesses, there is an indicator of whether the access is for a load or a store operation. This information is then used by the MMUs to direct the address translation appropriately and to enforce the protection hierarchy programmed by the operating system.

2.5 Virtual memory and memory addressing

A program references memory using the effective (logical) address computed by the processor when it executes a memory access or branch instruction or when it fetches the next sequential instruction. Bytes in memory are numbered consecutively starting with zero. Each number is the address of the corresponding byte.

Memory operands may be bytes, half words, words, or double words, or, for the load/store multiple and load/store string instructions, a sequence of bytes or words. The address of a memory operand is the address of its first byte (that is, of its lowest-numbered byte). Operand length is implicit for each instruction. The PowerPC architecture supports both big-endian and little-endian byte ordering. The default byte and bit ordering is big-endian.

The operand of a single-register memory access instruction has a natural alignment boundary equal to the operand length. In other words, the natural address of an operand is an integral multiple of the operand length. A memory operand is said to be aligned if it is aligned at its natural boundary; otherwise it is misaligned.

An effective address (EA) is the 32-bit sum computed by the processor when executing a memory access or branch instruction or when fetching the next sequential instruction. For a memory access instruction, if the sum of the effective address and the operand length exceeds the maximum effective address, the memory operand is considered to wrap around from the maximum effective address through effective address 0. Effective address computations for both data and instruction accesses use 32-bit unsigned binary arithmetic. A carry from bit 0 is ignored.

Load and store operations have three categories of effective address generation:
- Register indirect with immediate index mode
- Register indirect with index mode
- Register indirect mode

Branch instructions have three categories of effective address generation:
- Immediate
- Link register indirect
- Count register indirect
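The 32-bit wrap-around arithmetic and the natural alignment rule described in section 2.5 can be written out directly. This is a minimal sketch; the function names are hypothetical, and the base and displacement values are arbitrary examples.

```python
MASK32 = 0xFFFFFFFF


def effective_address(base, displacement):
    """32-bit unsigned sum; a carry out of the most significant bit (bit 0) is ignored."""
    return (base + displacement) & MASK32


def is_aligned(ea, operand_length):
    """An operand is aligned when its address is a multiple of its length in bytes."""
    return ea % operand_length == 0


print(hex(effective_address(0xFFFFFFFC, 8)))   # wraps around through address 0
print(hex(effective_address(0x1000, -4)))      # negative displacement
print(is_aligned(0x1004, 4), is_aligned(0x1002, 4))
```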

2.5.1 Addressing.

PowerPC processors support the following four types of address translation:
- Page address translation: translates the page frame address for a 4-Kbyte page size.
- Block address translation: translates the block number for blocks that range in size from 128 Kbytes to 256 Mbytes.
- Direct-store interface address translation: used to generate direct-store interface accesses on the external bus; not optimized for performance and present for compatibility only.
- Real addressing mode address translation: when address translation is disabled, the physical address is identical to the effective address.

The figure shows the four address translation mechanisms provided by the MMUs. The segment descriptors shown in the figure control both the page and direct-store interface address translation mechanisms. When an access uses page or direct-store interface address translation, the appropriate segment descriptor is required. In 32-bit implementations, one of the 16 on-chip segment registers (which contain segment descriptors) is selected by the four highest-order effective address bits.

A control bit in the corresponding segment descriptor then determines whether the access is to memory (memory-mapped) or to the direct-store interface space. Note that the direct-store interface is present only for compatibility with existing I/O devices that used this interface. When an access is determined to be to the direct-store interface space, the implementation invokes an elaborate hardware protocol for communication with these devices. The direct-store interface protocol is not optimized for performance, and therefore its use is discouraged. The most efficient method for accessing I/O devices is by memory-mapping the I/O areas.

For memory accesses translated by a segment descriptor, the interim virtual address is generated using the information in the segment descriptor. Page address translation corresponds to the conversion of this virtual address into the 32-bit physical address used by the memory subsystem. In most cases, the physical address for the page resides in an on-chip TLB and is available for quick access. However, if the page address translation misses in an on-chip TLB, the MMU causes a search of the page tables in memory (using the virtual address information and a hashing function) to locate the required physical address.

Block address translation occurs in parallel with page and direct-store segment address translation and is similar to page address translation; however, fewer higher-order effective address bits are translated into physical address bits, because more lower-order address bits (at least 17) are left untranslated to form the offset into a block. Also, instead of segment descriptors and a TLB, block address translations use the on-chip BAT registers as a BAT array. If an effective address matches the corresponding field of a BAT register, the information in the BAT register is used to generate the physical address; in this case, the results of the page translation and the direct-store translation (occurring in parallel) are ignored.
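The priority of block address translation over page translation can be sketched as follows. The BAT entries here are simplified to hypothetical (effective base, physical base, block size) tuples, which is not the real BAT register layout, and `page_lookup` is a stand-in for the segment, TLB, and page table path. Hardware performs both translations in parallel; the sketch models only which result is used.

```python
def bat_translate(ea, bat_entries):
    """Return the physical address if ea falls in a BAT block, else None."""
    for eff_base, phys_base, size in bat_entries:
        mask = size - 1                       # low-order bits left untranslated
        if (ea & ~mask) == eff_base:          # high-order bits match this block
            return phys_base | (ea & mask)
    return None


def translate(ea, bat_entries, page_lookup):
    """A BAT hit wins; the page translation result is ignored in that case."""
    pa = bat_translate(ea, bat_entries)
    return pa if pa is not None else page_lookup(ea)


# One 128-Kbyte block (the smallest size, 17 untranslated bits) mapped to address 0.
bats = [(0x80000000, 0x00000000, 128 * 1024)]
fake_page_lookup = lambda ea: 0x0DEAD000 | (ea & 0xFFF)   # placeholder page mapping

print(hex(translate(0x80012345, bats, fake_page_lookup)))   # inside the block
print(hex(translate(0x80020010, bats, fake_page_lookup)))   # outside: page path used
```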

Figure: virtual memory and addressing.

Chapter Three: Pentium II

3.1 Introduction to Pentium II:

The Pentium II and Pentium Pro processors are members of the P6 family of processors, which includes all of the Intel Architecture processors that implement Intel's dynamic execution microarchitecture. The Pentium II processor is the next in the Intel386, Intel486, Pentium, and Pentium Pro line of Intel processors. The Pentium II processor at 450 MHz, Intel's high-performance desktop processor, integrates the best attributes of the P6 microarchitecture processors: Dynamic Execution performance, a multi-transaction system bus, and Intel's MMX media enhancement technology.

Pentium II processors are targeted at professionals, avid PC users, and PC gamers (the Enthusiast/Professional desktop users). In addition, they are targeted at mainstream home and business users (the Performance desktop PC market). The Pentium II processor also meets the needs of entry-level servers and workstations.

The Intel Pentium II processors deliver excellent performance for all PC software and are fully compatible with existing Intel Architecture-based software. The latest Pentium II processor, at 450 MHz, extends processing power further by offering performance headroom for business media, communication, and Internet capabilities. Software designed for Intel's MMX technology unleashes the full multimedia capabilities of these processors, including full-screen, full-motion video, enhanced color, and realistic graphics. The Pentium II processor brings excitement to the PC experience. Systems based on Pentium II processors also include the latest features to simplify system management and lower the total cost of ownership for large and small business environments. The Pentium II processor offers great performance for today's and tomorrow's applications.

3.2 Features Summary:

Each entry below gives the feature, its content, and a remark where the original table had one.

1: Dynamic execution microarchitecture. It incorporates a unique combination of multiple branch prediction, data flow analysis, and speculative execution, which enables the Pentium II processor to deliver higher performance than the Pentium family of processors, while maintaining binary compatibility with all previous Intel Architecture processors. Remark: Memory related.

2: Intel's MMX technology. The Pentium II processor also incorporates Intel's MMX technology, for enhanced media and communication performance.

3: Energy. To aid in the design of energy-efficient computer systems, the Pentium II processor offers multiple low-power states, such as AutoHALT, Stop-Grant, Sleep, and Deep Sleep, to conserve power during idle times.

4: Multiple processors. The Pentium II processor uses the same multiple-processor system bus technology as the Pentium Pro processor. This allows for a higher level of performance for both uni-processor and two-way multi-processor (2-way MP) systems.

5: Cache. Memory is cacheable for up to 512 MB of addressable memory space, allowing significant headroom for business desktop systems. Remark: Memory related.

6: Bus. High-performance Dual Independent Bus (DIB) architecture (system bus and cache bus) for high bandwidth, performance, and capability with future system technologies.

7: L2 cache. The Pentium II processor deviates from the Pentium Pro processor by using commercially available die for the L2 cache. The L2 cache (the Tag RAM and pipelined burst synchronous static RAM (BSRAM) memories) is now multiple die. Transfer rates between the Pentium II processor core and the L2 cache are one-half the processor core clock frequency and scale with the processor core frequency. Both the Tag RAM and BSRAM receive clocked data directly from the Pentium II processor core. As with the Pentium Pro processor, the L2 cache does not connect to the Pentium II processor system bus. Remark: Memory related.

8: Cache bus. As with the Pentium Pro processor, the Pentium II processor has a dedicated cache bus, thus maintaining the dual independent bus architecture to deliver high bus bandwidth and high performance. Remark: Memory related.

9: Single Edge Contact (S.E.C.) cartridge. The S.E.C. cartridge allows the L2 cache to remain tightly coupled to the processor, while enabling use of high-volume commercial SRAM components. The L2 cache is performance optimized and tested at the package level. The S.E.C. cartridge uses surface mount technology and a substrate with an edge finger connection. The S.E.C. cartridge introduced on the Pentium II processor will also be used in future Slot 1 processors. Remark: Memory related.

10: ECC. Available with ECC (Error Correction Code) functionality on the level-two cache bus for applications where data intensity and reliability are essential. Remark: Memory related.

11: Protection. Parity-protected address/request and response system bus signals with a retry mechanism for high data integrity and reliability. Remark: Memory related.

12: Address. The 450, 400, and 350 MHz versions support memory cacheability for up to 4 GB of addressable memory space. Remark: Memory related.

24 3.3 Pentium Pro and Pentium II Semester paper for CSE 3322, Fall 1999 The Intel Pentium Pro processor introduced Dynamic Execution. It has a three-way superscalar architecture, which means that it can execute three instructions per CPU clock. Pentium II does this by incorporating even more parallelism than the Pentium processor. The Pentium Pro processor provides Dynamic Execution (micro-data flow analysis, out-oforder execution, superior branch prediction, and speculative execution) in a super scalar implementation. Three instructions decode units work in parallel to decode object code into smaller operations called micro-ops. These go into an instruction pool, and (when interdependencies don t prevent) can be executed out of order by the five parallel execution units (two integer, two FPU and one memory interface unit). The Retirement Unit retires completed micro-ops in their original program order, taking account of any branches. The power of the Pentium Pro processor is further enhanced by its caches: it has the same two on-chip 8-KByte L1 caches as does the Pentium processor, and also has a 256-KByte L2 cache that is in the same package as, and closely coupled to, the CPU, using a dedicated 64-bit ( backside ) full clock speed bus. The L1 cache is dual-ported, the L2 cache supports up to 4 concurrent accesses, and the 64-bit external data bus is transaction-oriented, meaning that each access is handled as a separate request and response, with numerous requests allowed while awaiting a response. These parallel features for data access work with the parallel execution capabilities to provide a nonblocking architecture in which the processor is more fully utilized and performance is enhanced. The Pentium Pro processor also has an expanded 36-bit address bus, giving a maximum physical address space of 64 GBytes. 
The Pentium II processor added MMX instructions to the Pentium Pro processor architecture and incorporated the new slot 1 and slot 2 packaging techniques. These new packaging techniques moved the L2 cache off-chip (off-die). The slot 1 and slot 2 packages use a single-edge connector instead of a socket. The Pentium II processor expanded the L1 data cache and L1 instruction cache to 16 Kbytes each. The Pentium II processor has L2 cache sizes of 256 Kbytes, 512 Kbytes, and 1 Mbyte or 2 Mbytes (slot 2 only). The slot 1 processor uses a half clock speed backside bus, while the slot 2 processor uses a full clock speed backside bus.

Figure: processing units and their interface with memory subsystems

3.3 Cache: introduction

The memory subsystem for the P6 Family processor consists of main system memory, the primary cache (L1), and the secondary cache (L2). The bus interface unit accesses system memory through the external system bus. This 64-bit bus is a transaction-oriented bus, meaning that each bus access is handled as separate request and response operations. While the bus interface unit is waiting for a response to one bus request, it can issue numerous additional requests. The bus interface unit accesses the close-coupled L2 cache through a 64-bit cache bus. This bus is also transaction-oriented, supporting up to four concurrent cache accesses, and operates at the full clock speed of the processor. Access to the L1 caches is through internal buses, also at full clock speed. The 8-KByte L1 instruction cache is four-way set associative; the 8-KByte L1 data cache is dual-ported and two-way set associative, supporting one load and one store operation per cycle.

Coherency between the caches and system memory is maintained using the MESI (modified, exclusive, shared, invalid) cache protocol. This protocol fosters cache coherency in single- and multiple-processor systems. It is also able to detect coherency problems created by self-modifying code.

Memory requests from the processor's execution units go through the memory interface unit and the memory order buffer. These units have been designed to support a smooth flow of memory access requests through the cache and system memory hierarchy to prevent memory access blocking. The L1 data cache automatically forwards a cache miss on to the L2 cache, and then, if necessary, the bus interface unit forwards an L2 cache miss to system memory. Memory requests to the L2 cache or system memory go through the memory reorder buffer, which functions as a scheduling and dispatch station.
This unit keeps track of all memory requests and is able to reorder some requests to prevent blocks and improve throughput. For example, the memory reorder buffer allows loads to pass stores. It also issues speculative loads. (Stores are always dispatched in order, and speculative stores are never issued.)

The slides above provide detailed information regarding cache memory within the P6 micro architecture. They show that the P6 micro architecture CPU core includes a Level 1 (L1) instruction cache and an L1 data cache. The L1 instruction cache is single-ported, while the L1 data cache is dual-ported. The Bus Interface Unit (BIU) is also integrated into the processor core, as are the circuits that interface the processor to the System Bus. A unified data and instruction Level 2 (L2) cache is integrated in the same package as the CPU core. The L2 cache is connected to the CPU core through a separate bus, the L2 Cache Bus (or Backside Bus). Most P6 micro architecture processors have an L2 Cache Bus that runs at the same frequency as the CPU core.

3.3.2 L1 Cache

The sizes and configuration of the L1 caches vary among P6 micro architecture processors.
1: However, each processor is configured so that the L1 instruction cache is separate from the L1 data cache.
2: The Pentium Pro processor has an L1 instruction cache that is a 4-way set associative 8KB cache. The L1 data cache is also 8KB in size.
3: However, unlike the L1 instruction cache, the Pentium Pro data cache is only 2-way set associative. Both caches support non-blocking accesses and can have up to 4 outstanding misses without stalling the processor.
4: The Pentium II processor has an L1 instruction cache and an L1 data cache that are both 4-way set associative and 16KB in size. Both caches support non-blocking accesses and can have up to 4 outstanding misses without stalling the processor.
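To make the size and associativity figures above concrete, the following is a minimal sketch of how a set-associative cache divides an address into tag, set index, and byte offset. The 32-byte line size is an assumption that the text does not state, and the class and variable names are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Geometry of a set-associative cache (illustrative model only)."""
    size_bytes: int
    ways: int
    line_bytes: int = 32  # ASSUMPTION: line size is not given in the text

    @property
    def num_sets(self) -> int:
        # Each set holds `ways` lines, so sets = size / (ways * line size).
        return self.size_bytes // (self.ways * self.line_bytes)

    def split(self, address: int):
        """Return (tag, set_index, byte_offset) for an address."""
        offset = address % self.line_bytes
        set_index = (address // self.line_bytes) % self.num_sets
        tag = address // (self.line_bytes * self.num_sets)
        return tag, set_index, offset


# Configurations taken from the list above.
ppro_l1_instruction = CacheGeometry(8 * 1024, ways=4)
ppro_l1_data = CacheGeometry(8 * 1024, ways=2)
pentium2_l1 = CacheGeometry(16 * 1024, ways=4)

if __name__ == "__main__":
    for name, cache in [("Pentium Pro L1 I", ppro_l1_instruction),
                        ("Pentium Pro L1 D", ppro_l1_data),
                        ("Pentium II L1", pentium2_l1)]:
        print(name, "sets:", cache.num_sets)
```

Under the assumed line size, halving the associativity at the same capacity doubles the number of sets, which is why the two 8KB Pentium Pro caches index differently even though they are the same size.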

3.3.3 L2 Cache

1: Processors based on the P6 micro architecture all have a unified data and instruction L2 cache in the same package as the CPU.
2: The L2 caches are all 4-way set associative caches.
3: However, the L2 Cache Bus speed and the cache sizes supported by each processor vary. The Pentium Pro processor has an L2 Cache Bus running at the CPU core frequency. It supports 256KB, 512KB, or 1024KB L2 cache size configurations. The Pentium II processor L2 Cache Bus runs at half the CPU core frequency. The Pentium II processor supports only 256KB and 512KB cache size configurations.
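The effect of the full-speed versus half-speed L2 Cache Bus can be estimated with simple arithmetic. The sketch below computes a theoretical peak only, assuming one 64-bit transfer per bus clock (an idealization, not a measured figure). The 200 MHz Pentium Pro core clock is an assumed example value, and the function name is hypothetical.

```python
def peak_l2_bandwidth_mb_per_s(core_mhz: float, bus_ratio: float,
                               bus_width_bits: int = 64) -> float:
    """Theoretical peak L2 bus bandwidth in MB/s (1 MB = 10**6 bytes).

    ASSUMPTION: one full-width transfer per bus clock, no protocol overhead.
    """
    bus_mhz = core_mhz * bus_ratio          # backside bus clock
    bytes_per_transfer = bus_width_bits // 8
    return bus_mhz * bytes_per_transfer


# Pentium Pro: L2 bus at the core frequency (200 MHz core is an assumed example).
pentium_pro = peak_l2_bandwidth_mb_per_s(200, bus_ratio=1.0)
# Pentium II: L2 bus at half the core frequency (400 MHz version).
pentium_ii = peak_l2_bandwidth_mb_per_s(400, bus_ratio=0.5)

if __name__ == "__main__":
    print(pentium_pro, pentium_ii)
```

Under these assumptions, a half-speed bus on a core running twice as fast yields the same peak figure, which suggests why the packaging change need not reduce absolute L2 bandwidth.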

3.4 Memory and addressing modes

Introduction.

The memory that the processor addresses on its bus is called physical memory. Physical memory is organized as a sequence of 8-bit bytes. Each byte is assigned a unique address, called a physical address. The physical address space ranges from zero to a maximum of $2^{36} - 1$ (64 gigabytes). Virtually any operating system or executive designed to work with an IA processor will use the processor's memory management facilities to access memory. These facilities provide features such as segmentation and paging, which allow memory to be managed efficiently and reliably. Memory management is described in detail later in this chapter; the following paragraphs describe the basic methods of addressing memory when memory management is used.

When employing the processor's memory management facilities, programs do not directly address physical memory. Instead, they access memory using any of three memory models: flat, segmented, or real-address mode.

With the flat memory model (refer to Figure), memory appears to a program as a single, continuous address space, called a linear address space. Code (a program's instructions), data, and the procedure stack are all contained in this address space. The linear address space is byte addressable, with addresses running contiguously from 0 to $2^{32} - 1$. An address for any byte in the linear address space is called a linear address.

With the segmented memory model, memory appears to a program as a group of independent address spaces called segments. When using this model, code, data, and stacks are typically contained in separate segments. To address a byte in a segment, a program must issue a logical address, which consists of a segment selector and an offset. (A logical address is often referred to as a far pointer.) The segment selector identifies the segment to be accessed and the offset identifies a byte in the address space of the segment.
Programs running on an IA processor can address up to 16,383 segments of different sizes and types, and each segment can be as large as $2^{32}$ bytes. Internally, all the segments that are defined for a system are mapped into the processor's linear address space. The processor translates each logical address into a linear address to access a memory location. This translation is transparent to the application program.

The primary reason for using segmented memory is to increase the reliability of programs and systems. For example, placing a program's stack in a separate segment prevents the stack from growing into the code or data space and overwriting instructions or data, respectively. Placing the operating system's or executive's code, data, and stack in separate segments also protects them from the application program and vice versa.

With either the flat or segmented model, the IA provides facilities for dividing the linear address space into pages and mapping the pages into virtual memory. If an operating system or executive uses the IA's paging mechanism, the existence of the pages is transparent to an application program.

The real-address mode model uses the memory model of the Intel 8086 processor, the first IA processor. It is provided in all subsequent IA processors for compatibility with existing programs written to run on the Intel 8086 processor. Real-address mode uses a specific implementation of segmented memory in which the linear address space for the program and the operating system or executive consists of an array of segments of up to 64 Kbytes each. The maximum size of the linear address space in real-address mode is $2^{20}$ bytes.

Figure: Addressing mode
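The relationship between 64-Kbyte segments and the $2^{20}$-byte address space in real-address mode can be illustrated with a short sketch. It relies on the standard 8086 rule, not spelled out above, that the 16-bit segment value is shifted left by four bits and added to the 16-bit offset. The wraparound at $2^{20}$ models the original 8086, and the function name is illustrative.

```python
def real_mode_linear(segment: int, offset: int) -> int:
    """Form a 20-bit linear address from a 16-bit segment and 16-bit offset.

    Models the original 8086, where addresses wrap around at 2**20.
    """
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Segment value * 16 gives the segment base; the offset selects a byte
    # within the (at most) 64-Kbyte segment.
    return ((segment << 4) + offset) & 0xFFFFF


if __name__ == "__main__":
    print(hex(real_mode_linear(0x1234, 0x0010)))  # segment base 0x12340 + 0x10
```

Because segment bases are only 16 bytes apart, many different segment:offset pairs refer to the same byte, which is one reason this model was later superseded by descriptor-based segmentation.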

3.4.2 Memory management, control and paging

The memory management facilities of the Intel Architecture are divided into two parts: segmentation and paging. Segmentation provides a mechanism for isolating individual code, data, and stack modules so that multiple programs (or tasks) can run on the same processor without interfering with one another. Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system in which sections of a program's execution environment are mapped into physical memory as needed. Paging can also be used to provide isolation between multiple tasks. When operating in protected mode, some form of segmentation must be used. There is no mode bit to disable segmentation. The use of paging, however, is optional. These two mechanisms (segmentation and paging) can be configured to support simple single-program (or single-task) systems, multitasking systems, or multiple-processor systems that use shared memory.

As shown in Figure, segmentation provides a mechanism for dividing the processor's addressable memory space (called the linear address space) into smaller protected address spaces called segments. Segments can be used to hold the code, data, and stack for a program or to hold system data structures (such as a TSS or LDT). If more than one program (or task) is running on a processor, each program can be assigned its own set of segments. The processor then enforces the boundaries between these segments and ensures that one program does not interfere with the execution of another program by writing into the other program's segments. The segmentation mechanism also allows typing of segments so that the operations that may be performed on a particular type of segment can be restricted. All of the segments within a system are contained in the processor's linear address space.
To locate a byte in a particular segment, a logical address (sometimes called a far pointer) must be provided. A logical address consists of a segment selector and an offset. The segment selector is a unique identifier for a segment. Among other things, it provides an offset into a descriptor table (such as the global descriptor table, GDT) to a data structure called a segment descriptor. Each segment has a segment descriptor, which specifies the size of the segment, the access rights and privilege level for the segment, the segment type, and the location of the first byte of the segment in the linear address space (called the base address of the segment). The offset part of the logical address is added to the base address for the segment to locate a byte within the segment. The base address plus the offset thus forms a linear address in the processor's linear address space.
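The translation from a logical address to a linear address described above can be sketched as follows. This is a simplified model, not the processor's actual descriptor format: the selector is treated directly as an index into the table, the access rights are reduced to a single writable flag, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentDescriptor:
    """Simplified segment descriptor (illustrative fields only)."""
    base: int              # linear address of the first byte of the segment
    limit: int             # highest valid offset within the segment
    writable: bool = True  # stand-in for the real access-rights fields


class SegmentViolation(Exception):
    """Raised when an access falls outside what the descriptor allows."""


def logical_to_linear(descriptor_table, selector_index: int, offset: int,
                      write: bool = False) -> int:
    """Translate (selector, offset) to a linear address: base + offset."""
    descriptor = descriptor_table[selector_index]
    if offset > descriptor.limit:
        raise SegmentViolation("offset beyond segment limit")
    if write and not descriptor.writable:
        raise SegmentViolation("write to a read-only segment")
    return (descriptor.base + offset) & 0xFFFFFFFF  # 32-bit linear address


# ASSUMPTION: a two-entry table with arbitrary example bases and limits.
example_gdt = [
    SegmentDescriptor(base=0x00400000, limit=0xFFFF),
    SegmentDescriptor(base=0x00800000, limit=0x0FFF, writable=False),
]

if __name__ == "__main__":
    print(hex(logical_to_linear(example_gdt, 0, 0x1234)))
```

The limit and access checks are what give segmentation its protective role: an offset that would reach past the end of a segment is rejected before any linear address is formed.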

Figure: logical addressing to physical addressing.

If paging is not used, the linear address space of the processor is mapped directly into the physical address space of the processor. The physical address space is defined as the range of addresses that the processor can generate on its address bus. Because multitasking computing systems commonly define a linear address space much larger than it is economically feasible to contain all at once in physical memory, some method of virtualizing the linear address space is needed. This virtualization of the linear address space is handled through the processor's paging mechanism.

Paging supports a virtual memory environment where a large linear address space is simulated with a small amount of physical memory (RAM and ROM) and some disk storage. When using paging, each segment is divided into pages (ordinarily 4 Kbytes each in size), which are stored either in physical memory or on the disk. The operating system or executive maintains a page directory and a set of page tables to keep track of the pages. When a program (or task) attempts to access an address location in the linear address space, the processor uses the page directory and page tables to translate the linear address into a physical address and then performs the requested operation (read or write) on the memory location. If the page being accessed is not currently in physical memory, the processor interrupts execution of the program (by generating a page-fault exception). The operating system or executive then reads the page into physical memory from the disk and continues executing the program. If paging is implemented properly in the operating system or executive, the swapping of pages between physical memory and the disk is transparent to the correct execution of a program. Even programs written for 16-bit Intel Architecture processors can be paged (transparently) when they are run in virtual-8086 mode.

3.5 More about paging and virtual memory.

When operating in protected mode, the Intel Architecture permits the linear address space to be mapped directly into a large physical memory (for example, 4 GBytes of RAM) or indirectly (using paging) into a smaller physical memory and disk storage. This latter method of mapping the linear address space is commonly referred to as virtual memory or demand-paged virtual memory. When paging is used, the processor divides the linear address space into fixed-size pages (generally 4 Kbytes in length) that can be mapped into physical memory and/or disk storage. When a program (or task) references a logical address in memory, the processor translates the address into a linear address and then uses its paging mechanism to translate the linear address into a corresponding physical address.

If the page containing the linear address is not currently in physical memory, the processor generates a page-fault exception (#PF). The exception handler for the page-fault exception typically directs the operating system or executive to load the page from disk storage into physical memory (perhaps writing a different page from physical memory out to disk in the process). When the page has been loaded into physical memory, a return from the exception handler causes the instruction that generated the exception to be restarted. The information that the processor uses to map linear addresses into the physical address space and to generate page-fault exceptions (when necessary) is contained in page directories and page tables stored in memory.

Paging differs from segmentation in its use of fixed-size pages. Unlike segments, which usually are the same size as the code or data structures they hold, pages have a fixed size. If segmentation is the only form of address translation used, a data structure present in physical memory will have all of its parts in memory.
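The fault-and-restart cycle described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the page tables are reduced to a single dictionary, the eviction policy is first-in first-out (the text does not name a policy), and the class and method names are hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4-Kbyte pages, as in the text


class PageFault(Exception):
    """Stands in for the #PF exception; carries the missing page number."""

    def __init__(self, page: int):
        super().__init__(f"page {page} not present")
        self.page = page


class DemandPagedMemory:
    def __init__(self, num_frames: int):
        self.num_frames = num_frames
        self.resident = OrderedDict()  # page number -> frame, oldest first
        self.faults = 0

    def translate(self, linear: int) -> int:
        """Translate a linear address, raising PageFault if not resident."""
        page, offset = divmod(linear, PAGE_SIZE)
        if page not in self.resident:
            raise PageFault(page)
        return self.resident[page] * PAGE_SIZE + offset

    def handle_fault(self, page: int) -> None:
        """Exception handler: bring the page in, evicting one if needed."""
        self.faults += 1
        if len(self.resident) >= self.num_frames:
            # ASSUMPTION: FIFO replacement; the victim's frame is reused.
            _, frame = self.resident.popitem(last=False)
        else:
            frame = len(self.resident)
        self.resident[page] = frame

    def access(self, linear: int) -> int:
        """Retry the access after each fault, like a restarted instruction."""
        while True:
            try:
                return self.translate(linear)
            except PageFault as fault:
                self.handle_fault(fault.page)


if __name__ == "__main__":
    memory = DemandPagedMemory(num_frames=2)
    for address in (0x0000, 0x1004, 0x2008, 0x0000):
        print(hex(address), "->", hex(memory.access(address)))
    print("page faults:", memory.faults)
```

The retry loop in `access` mirrors the restart of the faulting instruction: the program never observes the fault, only a slower access.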
If paging is used, a data structure can be partly in memory and partly in disk storage.

To minimize the number of bus cycles required for address translation, the most recently accessed page-directory and page-table entries are cached in the processor in devices called translation lookaside buffers (TLBs). The TLBs satisfy most requests for reading the current page directory and page tables without requiring a bus cycle. Extra bus cycles occur only when the TLBs do not contain a page-table entry, which typically happens when a page has not been accessed for a long time.

Three flags in the processor's control registers control paging:
PG (paging) flag, bit 31 of CR0 (available in all Intel Architecture processors beginning with the Intel386 processor).
PSE (page size extensions) flag, bit 4 of CR4 (introduced in the Pentium and Pentium Pro processors).

PAE (physical address extension) flag, bit 5 of CR4 (introduced in the Pentium Pro processors).

The PG flag enables the page-translation mechanism. The operating system or executive usually sets this flag during processor initialization. The PG flag must be set if the processor's page-translation mechanism is to be used to implement a demand-paged virtual memory system or if the operating system is designed to run more than one program (or task) in virtual-8086 mode. The PSE flag enables large page sizes: 4-MByte pages or 2-MByte pages (when the PAE flag is set). When the PSE flag is clear, the more common page length of 4 Kbytes is used. The PAE flag enables 36-bit physical addresses. This physical address extension can only be used when paging is enabled. It relies on page directories and page tables to reference physical addresses above FFFFFFFFH.

The information that the processor uses to translate linear addresses into physical addresses (when paging is enabled) is contained in four data structures:
Page directory: An array of 32-bit page-directory entries (PDEs) contained in a 4-KByte page. Up to 1024 page-directory entries can be held in a page directory.
Page table: An array of 32-bit page-table entries (PTEs) contained in a 4-KByte page. Up to 1024 page-table entries can be held in a page table. (Page tables are not used for 2-MByte or 4-MByte pages. These page sizes are mapped directly from one or more page-directory entries.)
Page: A 4-KByte, 2-MByte, or 4-MByte flat address space.
Page-directory-pointer table: An array of four 64-bit entries, each of which points to a page directory.

These tables provide access to either 4-KByte or 4-MByte pages when normal 32-bit physical addressing is being used and to 4-KByte, 2-MByte, or 4-MByte pages when extended (36-bit) physical addressing is being used. The page size and physical address size are determined by the settings of the paging control flags.
Each page-directory entry contains a PS (page size) flag that specifies whether the entry points to a page table whose entries in turn point to 4-KByte pages (PS set to 0) or whether the page-directory entry points directly to a 4-MByte or 2-MByte page (PSE or PAE set to 1 and PS set to 1).
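The interaction of the PG, PSE, PAE, and PS flags can be summarized in a few lines of code, together with the way a 32-bit linear address divides when 4-KByte pages are used without PAE. The 10/10/12 bit split follows from the 1024-entry directory and tables and the 4096-byte page described above. This is an illustrative sketch: the function names are hypothetical, and the 4-MByte pages with 36-bit addressing mentioned above are not modeled.

```python
CR0_PG = 1 << 31   # paging enable, bit 31 of CR0
CR4_PSE = 1 << 4   # page size extensions, bit 4 of CR4
CR4_PAE = 1 << 5   # physical address extension, bit 5 of CR4

KBYTE = 1024
MBYTE = 1024 * KBYTE


def page_size(cr0: int, cr4: int, pde_ps: int):
    """Return the page size in bytes selected by the flags, or None if
    paging is disabled. `pde_ps` is the PS flag of the directory entry."""
    if not cr0 & CR0_PG:
        return None                      # linear address used directly
    if cr4 & CR4_PAE:
        return 2 * MBYTE if pde_ps else 4 * KBYTE
    if (cr4 & CR4_PSE) and pde_ps:
        return 4 * MBYTE
    return 4 * KBYTE


def split_linear_4k(linear: int):
    """Split a 32-bit linear address for 4-KByte pages without PAE:
    10-bit directory index, 10-bit table index, 12-bit byte offset."""
    directory_index = (linear >> 22) & 0x3FF
    table_index = (linear >> 12) & 0x3FF
    offset = linear & 0xFFF
    return directory_index, table_index, offset


if __name__ == "__main__":
    print(page_size(CR0_PG, CR4_PSE, pde_ps=1))
    print(split_linear_4k(0x00403025))
```

Each index selects one of the 1024 entries in its 4-KByte structure, so one page directory covers 1024 x 1024 x 4 KBytes, which is the full 4-GByte linear address space.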


More information

Virtual Memory. CS 351: Systems Programming Michael Saelee

Virtual Memory. CS 351: Systems Programming Michael Saelee Virtual Memory CS 351: Systems Programming Michael Saelee registers cache (SRAM) main memory (DRAM) local hard disk drive (HDD/SSD) remote storage (networked drive / cloud) previously: SRAM

More information

Computer Architecture. Memory Hierarchy. Lynn Choi Korea University

Computer Architecture. Memory Hierarchy. Lynn Choi Korea University Computer Architecture Memory Hierarchy Lynn Choi Korea University Memory Hierarchy Motivated by Principles of Locality Speed vs. Size vs. Cost tradeoff Locality principle Temporal Locality: reference to

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Assembly Language. Lecture 2 x86 Processor Architecture

Assembly Language. Lecture 2 x86 Processor Architecture Assembly Language Lecture 2 x86 Processor Architecture Ahmed Sallam Slides based on original lecture slides by Dr. Mahmoud Elgayyar Introduction to the course Outcomes of Lecture 1 Always check the course

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

William Stallings Computer Organization and Architecture 8th Edition. Cache Memory

William Stallings Computer Organization and Architecture 8th Edition. Cache Memory William Stallings Computer Organization and Architecture 8th Edition Chapter 4 Cache Memory Characteristics Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Chapter 8: Main Memory. Operating System Concepts 9 th Edition

Chapter 8: Main Memory. Operating System Concepts 9 th Edition Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:

More information

Characteristics. Microprocessor Design & Organisation HCA2102. Unit of Transfer. Location. Memory Hierarchy Diagram

Characteristics. Microprocessor Design & Organisation HCA2102. Unit of Transfer. Location. Memory Hierarchy Diagram Microprocessor Design & Organisation HCA2102 Cache Memory Characteristics Location Unit of transfer Access method Performance Physical type Physical Characteristics UTM-RHH Slide Set 5 2 Location Internal

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

Unit 2. Chapter 4 Cache Memory

Unit 2. Chapter 4 Cache Memory Unit 2 Chapter 4 Cache Memory Characteristics Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics Organisation Location CPU Internal External Capacity Word

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Memory Hierarchy. Mehran Rezaei

Memory Hierarchy. Mehran Rezaei Memory Hierarchy Mehran Rezaei What types of memory do we have? Registers Cache (Static RAM) Main Memory (Dynamic RAM) Disk (Magnetic Disk) Option : Build It Out of Fast SRAM About 5- ns access Decoders

More information

Operating System Support

Operating System Support Operating System Support Objectives and Functions Convenience Making the computer easier to use Efficiency Allowing better use of computer resources Layers and Views of a Computer System Operating System

More information

Handout 4 Memory Hierarchy

Handout 4 Memory Hierarchy Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced

More information

Computer & Microprocessor Architecture HCA103

Computer & Microprocessor Architecture HCA103 Computer & Microprocessor Architecture HCA103 Cache Memory UTM-RHH Slide Set 4 1 Characteristics Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics Organisation

More information

Eastern Mediterranean University School of Computing and Technology CACHE MEMORY. Computer memory is organized into a hierarchy.

Eastern Mediterranean University School of Computing and Technology CACHE MEMORY. Computer memory is organized into a hierarchy. Eastern Mediterranean University School of Computing and Technology ITEC255 Computer Organization & Architecture CACHE MEMORY Introduction Computer memory is organized into a hierarchy. At the highest

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management CSE 120 July 18, 2006 Day 5 Memory Instructor: Neil Rhodes Translation Lookaside Buffer (TLB) Implemented in Hardware Cache to map virtual page numbers to page frame Associative memory: HW looks up in

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

CS399 New Beginnings. Jonathan Walpole

CS399 New Beginnings. Jonathan Walpole CS399 New Beginnings Jonathan Walpole Memory Management Memory Management Memory a linear array of bytes - Holds O.S. and programs (processes) - Each cell (byte) is named by a unique memory address Recall,

More information

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture. Chapter Overview.

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture. Chapter Overview. Assembly Language for Intel-Based Computers, 4 th Edition Kip R. Irvine Chapter 2: IA-32 Processor Architecture Slides prepared by Kip R. Irvine Revision date: 09/25/2002 Chapter corrections (Web) Printing

More information

CS162 - Operating Systems and Systems Programming. Address Translation => Paging"

CS162 - Operating Systems and Systems Programming. Address Translation => Paging CS162 - Operating Systems and Systems Programming Address Translation => Paging" David E. Culler! http://cs162.eecs.berkeley.edu/! Lecture #15! Oct 3, 2014!! Reading: A&D 8.1-2, 8.3.1. 9.7 HW 3 out (due

More information

A brief History of INTEL and Motorola Microprocessors Part 1

A brief History of INTEL and Motorola Microprocessors Part 1 Eng. Guerino Mangiamele ( Member of EMA) Hobson University Microprocessors Architecture A brief History of INTEL and Motorola Microprocessors Part 1 The Early Intel Microprocessors The first microprocessor

More information

Assembly Language for Intel-Based Computers, 4 th Edition. Kip R. Irvine. Chapter 2: IA-32 Processor Architecture

Assembly Language for Intel-Based Computers, 4 th Edition. Kip R. Irvine. Chapter 2: IA-32 Processor Architecture Assembly Language for Intel-Based Computers, 4 th Edition Kip R. Irvine Chapter 2: IA-32 Processor Architecture Chapter Overview General Concepts IA-32 Processor Architecture IA-32 Memory Management Components

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2.

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2. BASIC ELEMENTS Simplified view: Processor Slide 1 Computer System Overview Operating Systems Slide 3 Main Memory referred to as real memory or primary memory volatile modules 2004/S2 secondary memory devices

More information

Outlook. Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium

Outlook. Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium Main Memory Outlook Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium 2 Backgound Background So far we considered how to share

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

Address Translation. Tore Larsen Material developed by: Kai Li, Princeton University

Address Translation. Tore Larsen Material developed by: Kai Li, Princeton University Address Translation Tore Larsen Material developed by: Kai Li, Princeton University Topics Virtual memory Virtualization Protection Address translation Base and bound Segmentation Paging Translation look-ahead

More information

Freescale Semiconductor, I

Freescale Semiconductor, I Copyright (c) Institute of Electrical Freescale and Electronics Semiconductor, Engineers. Reprinted Inc. with permission. This material is posted here with permission of the IEEE. Such permission of the

More information

Section 6 Blackfin ADSP-BF533 Memory

Section 6 Blackfin ADSP-BF533 Memory Section 6 Blackfin ADSP-BF533 Memory 6-1 a ADSP-BF533 Block Diagram Core Timer 64 L1 Instruction Memory Performance Monitor JTAG/ Debug Core Processor LD0 32 LD1 32 L1 Data Memory SD32 DMA Mastered 32

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part I: Operating system overview: Memory Management 1 Hardware background The role of primary memory Program

More information

Alternate definition: Instruction Set Architecture (ISA) What is Computer Architecture? Computer Organization. Computer structure: Von Neumann model

Alternate definition: Instruction Set Architecture (ISA) What is Computer Architecture? Computer Organization. Computer structure: Von Neumann model What is Computer Architecture? Structure: static arrangement of the parts Organization: dynamic interaction of the parts and their control Implementation: design of specific building blocks Performance:

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Operating Systems CSE 410, Spring Virtual Memory. Stephen Wagner Michigan State University

Operating Systems CSE 410, Spring Virtual Memory. Stephen Wagner Michigan State University Operating Systems CSE 410, Spring 2004 Virtual Memory Stephen Wagner Michigan State University Virtual Memory Provide User an address space that is larger than main memory Secondary storage is used to

More information

COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory

COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory 1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Universität Dortmund. ARM Architecture

Universität Dortmund. ARM Architecture ARM Architecture The RISC Philosophy Original RISC design (e.g. MIPS) aims for high performance through o reduced number of instruction classes o large general-purpose register set o load-store architecture

More information

CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 27, SPRING 2013

CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 27, SPRING 2013 CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 27, SPRING 2013 CACHING Why: bridge speed difference between CPU and RAM Modern RAM allows blocks of memory to be read quickly Principle

More information