TDT 4260: Multithreading and Memory Hierarchy


Review on ILP (TDT 4260, Chap 5: TLP & Memory Hierarchy)
- What is ILP?
- Let the compiler find the ILP: advantages? disadvantages?
- Let the HW find the ILP: advantages? disadvantages?

Contents
- Multi-threading (Chap 3.5)
- Memory hierarchy (Chap 5.2): 6 basic cache optimizations, 11 advanced cache optimizations

Multi-threaded execution
- Multi-threading: multiple threads share the functional units of one processor via overlapping
- Must duplicate the independent state of each thread, e.g. a separate copy of the register file and PC; the page table can be shared through the virtual memory mechanisms
- HW support for fast thread switching; much faster than a full process switch (100s to 1000s of clocks)
- When to switch?
  - Alternate instructions per thread (fine grain)
  - When a thread is stalled, perhaps on a cache miss, another thread can be executed (coarse grain)

Fine-Grained Multithreading
- Switches between threads on each instruction; multiple threads are interleaved
- Usually in round-robin fashion, skipping stalled threads
- The CPU must be able to switch threads every clock
- Hides both short and long stalls: other threads execute when one thread stalls
- But slows down execution of the individual threads: a thread ready to execute without stalls is still delayed by instructions from other threads
- Used on Sun's Niagara

Coarse-Grained Multithreading
- Switch threads only on costly stalls (e.g. an L2 cache miss)
- Advantages:
  - No need for very fast thread switching
  - Doesn't slow down a thread, since it switches only when the thread encounters a costly stall
- Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  - Since the CPU issues instructions from one thread, the pipeline must be emptied or frozen when a stall occurs
  - The new thread must fill the pipeline before its instructions can complete
- => Better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
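To make the two switching policies concrete, here is a minimal C sketch (not from the slides) of the thread-selection logic: fine-grained multithreading picks the next ready thread every cycle in round-robin order, skipping stalled threads. Under coarse-grained multithreading, the same selection would instead run only when the current thread hits a costly stall.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Fine-grained policy: each cycle, pick the next ready thread in
 * round-robin order, skipping stalled threads. Returns the chosen
 * thread id, or -1 if every thread is stalled (pipeline idles). */
int select_thread(const bool stalled[], int last) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int cand = (last + i) % NUM_THREADS;
        if (!stalled[cand])
            return cand;
    }
    return -1;
}

int main(void) {
    /* Hypothetical state: thread 1 is stalled, e.g. on a cache miss. */
    bool stalled[NUM_THREADS] = {false, true, false, false};
    int last = NUM_THREADS - 1;
    for (int cycle = 0; cycle < 8; cycle++) {
        last = select_thread(stalled, last);
        printf("cycle %d: issue from thread %d\n", cycle, last);
    }
    return 0;
}
```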

Do both ILP and TLP?
- TLP and ILP exploit two different kinds of parallel structure in a system
- Can a high-ILP processor also exploit TLP?
  - Functional units are often idle because of stalls or dependences in the code
  - Can TLP be a source of independent instructions that might reduce processor stalls?
  - Can TLP be used to employ functional units that would otherwise lie idle when ILP is insufficient?
- => Simultaneous Multi-threading (SMT); Intel: Hyper-Threading

Simultaneous Multi-threading
[Figure: cycle-by-cycle issue-slot diagrams for one thread vs. two threads on 8 units. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]

Simultaneous Multi-threading (SMT)
- A dynamically scheduled processor already has many HW mechanisms to support multi-threading:
  - A large set of virtual registers (virtual = not all visible at the ISA level)
  - Register renaming
  - Dynamic scheduling
- Just add a per-thread renaming table and keep separate PCs
- Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Multi-threaded categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, simultaneous multithreading and multiprocessing; threads 1-5 shown in different shades, idle slots blank]

Design Challenges in SMT
- SMT makes sense only with a fine-grained implementation
- How to reduce the impact on single-thread performance? Give priority to one or a few preferred threads
- A large register file is needed to hold multiple contexts
- Must avoid affecting the clock cycle time, especially in:
  - Instruction issue: more candidate instructions need to be considered
  - Instruction completion: choosing which instructions to commit may be challenging
- Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
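The issue-slot idea can be approximated numerically. The following C sketch uses invented per-cycle ILP figures to show how a second thread's independent instructions fill slots that a single thread leaves idle; it is an illustration, not a processor model.

```c
#include <stdio.h>

#define SLOTS 8
#define CYCLES 6

int main(void) {
    /* Hypothetical instructions each thread could issue per cycle,
     * limited by its own ILP (stalls, dependences). */
    int ilp_t0[CYCLES] = {3, 1, 0, 4, 2, 1};
    int ilp_t1[CYCLES] = {2, 3, 4, 1, 2, 3};

    int used_single = 0, used_smt = 0;
    for (int c = 0; c < CYCLES; c++) {
        /* Superscalar: only thread 0 may fill the 8 issue slots. */
        int s = ilp_t0[c] < SLOTS ? ilp_t0[c] : SLOTS;
        used_single += s;
        /* SMT: thread 1's independent instructions fill leftover slots. */
        int free_slots = SLOTS - s;
        int t = ilp_t1[c] < free_slots ? ilp_t1[c] : free_slots;
        used_smt += s + t;
    }
    printf("slot utilization, one thread:      %d/%d\n", used_single, SLOTS * CYCLES);
    printf("slot utilization, SMT (2 threads): %d/%d\n", used_smt, SLOTS * CYCLES);
    return 0;
}
```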

Why memory hierarchy? (fig 5.2)
[Figure 5.2: processor vs. memory performance over the years (log scale up to 100,000); the processor-memory performance gap keeps growing]

Why memory hierarchy?
- Principle of Locality:
  - Spatial locality: addresses near each other are likely to be referenced close together in time
  - Temporal locality: the same address is likely to be reused in the near future
- Idea: store recently used elements in fast memories close to the processor
- Managed by software or hardware?

Memory hierarchy
- We want large, fast and cheap at the same time
[Figure: hierarchy from the processor (control, datapath) down through the memory levels; speed from fastest to slowest, capacity from smallest to largest, cost from most expensive to cheapest]

Cache block placement
- Example: block 12 placed in a cache with 8 cache lines
- Fully associative: block 12 can go anywhere
- Direct mapped: block 12 can go only into line 4 (12 mod 8)
- Set associative: block 12 can go anywhere in set 0 (12 mod 4, with 4 sets)

Cache performance
- Average access time = Hit time + Miss rate * Miss penalty
- Miss rate alone is not an accurate measure
- Cache performance is important for CPU performance, and more important with higher clock rates
- Cache design can also affect instructions that don't access memory! Example: a set-associative L1 cache on the critical path requires extra logic, which will increase the clock cycle time
- Trade-off: additional hits vs. cycle time reduction

6 Basic Cache Optimizations
Reducing hit time:
1. Giving reads priority over writes: writes in the write buffer can be handled after a newer read, if this causes no dependency problems
2. Avoiding address translation during cache indexing, e.g. using the virtual page offset to index the cache
Reducing miss penalty:
3. Multilevel caches: both small and fast (L1) and large (and slower) (L2)
Reducing miss rate:
4. Larger block size (compulsory misses)
5. Larger cache size (capacity misses)
6. Higher associativity (conflict misses)
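A small C sketch connecting the placement rules and the AMAT formula above to concrete numbers; the hit time, miss rate and miss penalty below are invented for illustration.

```c
#include <stdio.h>

/* Average access time = Hit time + Miss rate * Miss penalty */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Placement of block 12 in a cache with 8 lines (as on the slide). */
    int block = 12, lines = 8, sets = 4;
    printf("direct mapped:   line %d\n", block % lines); /* 12 mod 8 = 4 */
    printf("set associative: set %d\n", block % sets);   /* 12 mod 4 = 0 */

    /* Hypothetical numbers: 1-cycle hit, 5% miss rate, 100-cycle penalty. */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));
    return 0;
}
```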

1: Giving Reads Priority over Writes
- Caches typically use a write buffer:
  - The CPU writes to the cache and the write buffer
  - The cache controller transfers data from the buffer to RAM
  - The write buffer is usually a FIFO with N elements
  - Works well as long as the buffer does not fill faster than it can be emptied
- Optimization: handle read misses before the write buffer's writes, but check for conflicts with the write buffer first

Virtual memory
- Processes use a large virtual memory
- Virtual addresses are dynamically mapped to physical addresses using HW and SW
- Key concepts: page, page frame, page fault, translation lookaside buffer (TLB), etc.
[Figure: per-process virtual address spaces (0 to 2^n - 1, virtual pages) mapped via address translation onto the physical address space (0 to 2^m - 1, physical pages)]

2: Avoiding Address Translation during Cache Indexing
- Virtual cache: use virtual addresses in the cache; saves the VA -> PA translation time
- Disadvantages:
  - Must flush the cache on a process switch (can be avoided by including the PID in the tag)
  - Alias problem: the OS and a process can have two VAs pointing to the same PA
- Compromise: virtually indexed, physically tagged
  - Use the page offset to index the cache; it is the same for VA and PA
  - While data is read from the cache, the VA -> PA translation is done for the tag; the tag comparison uses the PA
  - But: the page size restricts the cache size

3: Multilevel Caches (1/2)
- Make the cache faster to keep up with the CPU, or larger to reduce misses? Why not both?
- Multilevel caches: a small and fast L1, a large (and cheaper) L2

3: Multilevel Caches (2/2)
- Average access time = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * L2 miss penalty)
- Local miss rate: #cache misses / #cache accesses (at that level)
- Global miss rate: #cache misses / #CPU memory accesses
- L1 cache speed affects the CPU clock rate; L2 cache speed affects only the L1 miss penalty
- L2 can use a more complex mapping, and L2 can be large
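The two-level formula and the local/global miss-rate definitions translate directly into code; in this hedged sketch all latencies and miss rates are invented example values.

```c
#include <stdio.h>

/* Two-level AMAT, as on the slide:
 * AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * L2 miss penalty)
 * Note: the L2 miss rate here is the *local* one (L2 misses / L2 accesses). */
double amat2(double l1_hit, double l1_mr,
             double l2_hit, double l2_mr_local, double l2_penalty) {
    return l1_hit + l1_mr * (l2_hit + l2_mr_local * l2_penalty);
}

int main(void) {
    /* Hypothetical numbers: 1-cycle L1, 4% L1 miss rate,
     * 10-cycle L2, 50% local L2 miss rate, 200-cycle memory. */
    double l1_mr = 0.04, l2_mr_local = 0.50;
    printf("AMAT = %.2f cycles\n", amat2(1.0, l1_mr, 10.0, l2_mr_local, 200.0));
    /* Global L2 miss rate = L1 miss rate * local L2 miss rate. */
    printf("global L2 miss rate = %.3f\n", l1_mr * l2_mr_local);
    return 0;
}
```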

4: Larger Block size
[Figure: miss rate vs. block size (16-256 bytes) for cache sizes 1K-256K, broken down into compulsory, conflict and capacity misses]
- A trade-off; 32- and 64-byte blocks are common

5: Larger Cache size
- Simple method
- Square-root rule: quadrupling the size of the cache will halve the miss rate
- Disadvantages: longer hit time, higher cost
- Mostly used for L2/L3 caches

6: Higher Associativity
- Lower miss rate
- Disadvantages: can increase hit time, higher cost
- 8-way has similar performance to fully associative

11 Advanced Cache Optimizations
Reducing hit time:
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth:
4. Pipelined caches
5. Non-blocking caches
6. Multibanked caches
Reducing miss penalty:
7. Critical word first
8. Merging write buffers
Reducing miss rate:
9. Compiler optimizations
Reducing miss penalty or miss rate via parallelism:
10. Hardware prefetching
11. Compiler prefetching

1: Small and simple caches
- Comparing the address to the tag memory takes time
- A small cache helps hit time
  - E.g. the L1 caches stayed the same size for three generations of AMD microprocessors: K6, Athlon and Opteron
  - An L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
- Simple direct mapping: the tag check can overlap with the data transmission, since there is no choice of way
- Access time estimates for 90 nm using the CACTI 4.0 model: the median access times relative to direct-mapped caches are 1.32, 1.39 and 1.43 for 2-way, 4-way and 8-way caches
[Figure: access time (ns) vs. cache size (16 KB to 1 MB) for 1-way, 2-way, 4-way and 8-way caches]

2: Way prediction
- Extra bits are kept in the cache to predict which way (block) in a set the next access will hit
- The tag can be retrieved early for comparison
- Achieves a fast hit with just one comparator
- Several extra cycles are needed to check the other blocks on a misprediction
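A brief sketch of the square-root rule: if the miss rate scales roughly as one over the square root of the cache size, quadrupling the cache halves it. The 8% baseline at 16 KB is an invented example.

```c
#include <math.h>
#include <stdio.h>

/* Square-root rule of thumb from the slide: miss rate scales roughly
 * as 1/sqrt(cache size), so 4x the size halves the miss rate. */
double scaled_miss_rate(double base_rate, double base_kb, double new_kb) {
    return base_rate * sqrt(base_kb / new_kb);
}

int main(void) {
    /* Hypothetical baseline: 8% miss rate with a 16 KB cache. */
    for (double kb = 16; kb <= 256; kb *= 4)
        printf("%3.0f KB -> miss rate %.1f%%\n",
               kb, 100.0 * scaled_miss_rate(0.08, 16, kb));
    return 0;
}
```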

3: Trace caches
- It is increasingly hard to feed modern superscalar processors with enough instructions
- A trace cache stores dynamic instruction sequences rather than static bytes of data
  - An instruction sequence may include branches; branch prediction is integrated with the cache
- Complex and relatively little used
- Used in the Pentium 4: its trace cache stores up to 12K micro-ops decoded from x86 instructions (which also saves decode time)

4: Pipelined caches
- Pipeline technology applied to cache lookups: several lookups in flight at once
- Results in a faster cycle time
- Examples: Pentium (1 cycle), Pentium III (2 cycles), Pentium 4 (4 cycles)
- L1: increases the number of pipeline stages needed to execute an instruction
- L2/L3: increases throughput, nearly for free, since the hit latency is on the order of processor cycles and caches are easy to pipeline

5: Non-blocking caches (1/2)
- A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Requires that the lower-level memory can service multiple concurrent misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
- The Pentium Pro allows 4 outstanding memory misses

5: Non-Blocking Cache Implementation
- MHA = Miss Handling Architecture; MSHR = Miss information/Status Holding Register; DMHA = Dynamic Miss Handling Architecture
- The cache can handle as many concurrent misses as there are MSHRs; MSHR-based designs are very common
- The cache must block when all valid bits (V) are set

5: Non-blocking Cache Performance
[Figure: performance of non-blocking caches]

6: Multibanked caches
- Divide the cache into independent banks that can support simultaneous accesses
  - E.g. the T1 ("Niagara") L2 has 4 banks
- Banking works best when the accesses naturally spread themselves across the banks; the mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving: spread block addresses sequentially across the banks
  - E.g. with 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on (see the sketch below)
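The sequential-interleaving mapping is a one-liner; this small C sketch just prints which bank each block address maps to.

```c
#include <stdio.h>

#define NUM_BANKS 4 /* e.g., the T1 ("Niagara") L2 has 4 banks */

/* Sequential interleaving: block address modulo the number of banks. */
unsigned bank_of(unsigned block_addr) {
    return block_addr % NUM_BANKS;
}

int main(void) {
    /* Consecutive blocks spread across banks 0,1,2,3,0,1,... so a
     * streaming access pattern keeps all banks busy simultaneously. */
    for (unsigned block = 0; block < 8; block++)
        printf("block %u -> bank %u\n", block, bank_of(block));
    return 0;
}
```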

7: Critical word first
- Don't wait for the full block before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; the CPU continues execution while the rest of the words in the block are filled in
- Long blocks are more popular today, and critical word first is widely used

8: Merging write buffers
- The write buffer allows the processor to continue while waiting for writes to reach memory
- If the buffer contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry; if so, the new data are combined with that entry
- Multiword writes are more efficient to memory
- The Sun T1 (Niagara) processor, among many others, uses write merging

9: Compiler optimizations
- Instruction order can often be changed without affecting correctness; this may reduce conflict misses (profiling may help the compiler)
- The compiler generates instructions grouped in basic blocks; if the start of a basic block is aligned to a cache block, misses are reduced (important for larger cache block sizes)
- Data is even easier to move
- Lots of different compiler optimizations exist

10: Hardware prefetching
- Prefetching relies on having extra memory bandwidth that can be used without penalty
- Instruction prefetching: typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block. The requested block is placed in the instruction cache when it returns; the prefetched block is placed in the instruction stream buffer
- Data prefetching: the Pentium 4 can prefetch data into the L2 cache from up to 8 streams; prefetching is invoked after 2 successive L2 cache misses to a page
[Figure: performance improvement from hardware prefetching on SPECint2000 (gap, mcf) and SPECfp2000 (fam3d, wupwise, galgel, facerec, swim, applu, lucas, mgrid, equake) benchmarks; improvements include 1.16, 1.40 and 1.49]

11: Compiler prefetching
- Data prefetch: load data into a register (HP PA-RISC loads)
- Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
- Special prefetching instructions cannot cause faults: a form of speculative execution
- Issuing prefetch instructions takes time: is the cost of issuing prefetches < the savings in reduced misses? (See the sketch below.)

Cache Coherency
- Consider the following case: two processors share address X
  - Both cores read address X; address X is brought from memory into the caches of both processors
  - Now one of the processors writes to address X and changes the value
- What happens? How does the other processor get notified that address X has changed?
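As a concrete example of a cache-prefetch instruction, the sketch below uses the GCC/Clang __builtin_prefetch intrinsic in a streaming loop. The prefetch distance DIST is a hypothetical tuning parameter; whether the issued prefetches pay for themselves is exactly the cost/savings question the slide raises (for a simple sequential loop like this, hardware prefetchers often make them redundant).

```c
#include <stdio.h>

#define N 1024
#define DIST 16 /* hypothetical prefetch distance, in elements */

/* Sum an array, prefetching data ahead of the current index.
 * __builtin_prefetch is the GCC/Clang cache-prefetch intrinsic; it
 * cannot fault, matching the slide's "special prefetching
 * instructions cannot cause faults". */
long sum_with_prefetch(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
        sum += a[i];
    }
    return sum;
}

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++) a[i] = i;
    printf("sum = %ld\n", sum_with_prefetch(a, N));
    return 0;
}
```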

Two types of cache coherence schemes
- Snooping: broadcast writes, so all copies in all caches will be properly invalidated or updated
- Directory: in a structure, keep track of which cores are caching each address; when a write occurs, query the directory and properly handle any other cached copies (see the sketch below)
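A minimal sketch of the directory idea, assuming an invalidation protocol and at most 8 cores; the data structure and handlers are illustrative, not from the slides.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 8

/* Hypothetical directory entry for one block: a bitmask records
 * which cores are currently caching the block. */
typedef struct {
    uint8_t sharers; /* bit i set => core i has a copy */
} dir_entry_t;

/* On a read, record the reader as a sharer. */
void dir_read(dir_entry_t *e, int core) {
    e->sharers |= (uint8_t)(1u << core);
}

/* On a write, "query the directory and properly handle any other
 * cached copies": here, invalidate every other sharer. */
void dir_write(dir_entry_t *e, int core) {
    for (int i = 0; i < NUM_CORES; i++)
        if (i != core && (e->sharers & (1u << i)))
            printf("invalidate copy in core %d\n", i);
    e->sharers = (uint8_t)(1u << core); /* the writer is now the only holder */
}

int main(void) {
    dir_entry_t x = {0};
    dir_read(&x, 0);  /* both cores read address X */
    dir_read(&x, 1);
    dir_write(&x, 0); /* core 0 writes X: core 1's copy is invalidated */
    return 0;
}
```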
