Review of ILP
TDT 4260, Chap 5: TLP & Memory Hierarchy
- What is ILP?
- Let the compiler find the ILP: advantages? Disadvantages?
- Let the HW find the ILP: advantages? Disadvantages?

Contents
- Multi-threading (Chap 3.5)
- Memory hierarchy (Chap 5.1)
  - 6 basic cache optimizations
- 11 advanced cache optimizations (Chap 5.2)

Multi-threaded execution
- Multi-threading: multiple threads share the functional units of one processor via overlapping
- Must duplicate the independent state of each thread, e.g. a separate copy of the register file and PC; the page table is shared through the virtual memory mechanisms
- HW support for fast thread switching; much faster than a full process switch, which takes 100s to 1000s of clock cycles
- When to switch?
  - Alternate instructions per thread (fine grain)
  - When a thread is stalled, e.g. on a cache miss, another thread can be executed (coarse grain)

Fine-Grained Multithreading
- Switches between threads on each instruction
- Multiple threads are interleaved, usually in round-robin fashion, skipping stalled threads (see the sketch below)
- CPU must be able to switch threads every clock cycle
- Hides both short and long stalls: other threads execute while one thread stalls
- But slows down the execution of individual threads: a thread ready to execute without stalls is delayed by instructions from other threads
- Used on Sun's Niagara

Coarse-Grained Multithreading
- Switch threads only on costly stalls (e.g. L2 cache miss)
- Advantages
  - No need for very fast thread switching
  - Doesn't slow down a thread, since switches happen only when the thread encounters a costly stall
- Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  - Since the CPU issues instructions from one thread, the pipeline must be emptied or frozen when a stall occurs
  - The new thread must fill the pipeline before its instructions can complete
  - => Better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
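To make the fine-grained switching policy concrete, here is a minimal sketch (not from the lecture) of a round-robin thread selector that skips stalled threads; the struct fields, thread count, and stall flags are illustrative assumptions only.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Per-thread fetch state: program counter and a "stalled" flag
 * (e.g. set while an outstanding cache miss is being serviced). */
struct thread_state {
    unsigned pc;
    bool     stalled;
};

/* Pick the next thread to fetch from, round-robin, skipping stalled
 * threads. Returns -1 if every thread is stalled (the pipeline would
 * simply issue a bubble that cycle). */
int select_next_thread(struct thread_state t[], int last)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int cand = (last + i) % NUM_THREADS;
        if (!t[cand].stalled)
            return cand;
    }
    return -1;
}

int main(void)
{
    struct thread_state t[NUM_THREADS] = {
        {0x1000, false}, {0x2000, true}, {0x3000, false}, {0x4000, false}
    };
    int current = 0;
    for (int cycle = 0; cycle < 6; cycle++) {
        int next = select_next_thread(t, current);
        if (next >= 0) {
            printf("cycle %d: fetch from thread %d (pc=0x%x)\n",
                   cycle, next, t[next].pc);
            t[next].pc += 4;     /* pretend one instruction is fetched */
            current = next;
        } else {
            printf("cycle %d: all threads stalled, bubble\n", cycle);
        }
    }
    return 0;
}
```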
Do both ILP and TLP?
- TLP and ILP exploit two different kinds of parallel structure in a program
- Can a high-ILP processor also exploit TLP?
  - Functional units are often idle because of stalls or dependences in the code
  - Can TLP be a source of independent instructions that might reduce processor stalls?
  - Can TLP be used to employ functional units that would otherwise lie idle when there is insufficient ILP?
- => Simultaneous Multi-threading (SMT); Intel: Hyper-Threading

Simultaneous Multi-threading
[Figure: issue-slot utilization per cycle for one thread vs. two threads on a machine with 8 units; M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]

Simultaneous Multi-threading (SMT)
- A dynamically scheduled processor already has many HW mechanisms to support multi-threading
  - Large set of virtual registers (virtual = not all visible at the ISA level)
  - Register renaming
  - Dynamic scheduling
- Just add a per-thread renaming table and keep separate PCs (sketched below)
  - Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Multi-threaded categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing and simultaneous multithreading; shading marks Threads 1-5 and idle slots]

Design Challenges in SMT
- SMT makes sense only with a fine-grained implementation
  - How to reduce the impact on single-thread performance?
  - Give priority to one or a few preferred threads
- Large register file needed to hold multiple contexts
- Must not affect clock cycle time, especially in
  - Instruction issue: more candidate instructions need to be considered
  - Instruction completion: choosing which instructions to commit may be challenging
- Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
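As a rough illustration of which state is duplicated per hardware thread and which stays shared in an SMT core, here is a hedged sketch; the sizes and field names (PHYS_REGS, rename_table, etc.) are assumptions for illustration, not details of any real processor.

```c
#include <stdint.h>
#include <stdio.h>

#define SMT_THREADS 2     /* hardware thread contexts                */
#define ARCH_REGS   32    /* architectural registers per thread      */
#define PHYS_REGS   128   /* physical registers, shared by all threads */

/* State that must be duplicated per hardware thread. */
struct smt_thread_ctx {
    uint64_t pc;                        /* separate program counter        */
    uint16_t rename_table[ARCH_REGS];   /* per-thread renaming table:
                                           architectural -> physical reg   */
    uint32_t rob_head, rob_tail;        /* logically separate reorder
                                           buffer for independent commit   */
};

/* Resources that remain shared between threads. */
struct smt_core {
    uint64_t phys_regs[PHYS_REGS];      /* one large physical register file */
    struct smt_thread_ctx ctx[SMT_THREADS];
    /* caches, functional units and issue queue are also shared */
};

int main(void)
{
    struct smt_core core = {0};
    core.ctx[0].pc = 0x1000;   /* each thread keeps its own PC */
    core.ctx[1].pc = 0x8000;
    printf("thread 0 pc = 0x%llx, thread 1 pc = 0x%llx\n",
           (unsigned long long)core.ctx[0].pc,
           (unsigned long long)core.ctx[1].pc);
    return 0;
}
```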
Why memory hierarchy? (fig 5.2)
[Figure 5.2: processor vs. memory performance, 1980-2010 (log scale); the processor-memory performance gap keeps growing]

Why memory hierarchy? Principle of Locality
- Spatial locality: addresses near each other are likely to be referenced close together in time
- Temporal locality: the same address is likely to be reused in the near future
- Idea: store recently used elements in fast memories close to the processor
- Managed by software or hardware?

Memory hierarchy
- We want large, fast and cheap at the same time
[Figure: processor (control + datapath) backed by a hierarchy of memories; speed from fastest to slowest, capacity from smallest to largest, cost from most expensive to cheapest]

Cache block placement
- Memory block 12 placed in a cache with 8 cache lines (a small sketch of these mappings follows this overview):
  - Fully associative: block 12 can go anywhere
  - Direct mapped: block 12 can go only into line 4 (12 mod 8)
  - Set associative: block 12 can go anywhere in set 0 (12 mod 4)
[Figure: cache lines 0-7, sets 0-3, and memory block addresses 0-31]

Cache performance
- Average access time = Hit time + Miss rate * Miss penalty
- Miss rate alone is not an accurate measure
- Cache performance is important for CPU performance, and more important with higher clock rates
- Cache design can also affect instructions that don't access memory!
  - Example: a set-associative L1 cache on the critical path requires extra logic, which can increase the clock cycle time
  - Trade-off: additional hits vs. cycle time reduction

6 Basic Cache Optimizations
Reducing hit time
1. Giving reads priority over writes
   - Writes in the write buffer can be handled after a newer read, if this does not cause dependency problems
2. Avoiding address translation during cache indexing
   - E.g. use the virtual page offset to index the cache
Reducing miss penalty
3. Multilevel caches
   - Both small and fast (L1) and large and slower (L2)
Reducing miss rate
4. Larger block size (compulsory misses)
5. Larger cache size (capacity misses)
6. Higher associativity (conflict misses)
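The placement rules above are just modulo arithmetic; the sketch below reproduces the slide's example (block 12, 8 cache lines, 4 sets) and is only illustrative.

```c
#include <stdio.h>

/* Block placement for a cache with 8 lines, matching the slide example:
 * direct mapped -> line = block mod 8; 2-way set associative with 4 sets
 * -> set = block mod 4; fully associative -> any line. */
#define CACHE_LINES 8
#define NUM_SETS    4    /* 8 lines / 2 ways */

int main(void)
{
    unsigned block = 12;    /* memory block address from the slide */

    printf("Direct mapped:     block %u -> line %u\n", block, block % CACHE_LINES);
    printf("Set associative:   block %u -> set  %u (either way within the set)\n",
           block, block % NUM_SETS);
    printf("Fully associative: block %u -> any of the %u lines\n",
           block, CACHE_LINES);
    return 0;
}
```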
1: Giving Reads Priority over Writes
- Caches typically use a write buffer
  - CPU writes to the cache and the write buffer
  - The cache controller transfers data from the buffer to RAM
  - The write buffer is usually a FIFO with N elements
  - Works well as long as the buffer does not fill faster than it can be emptied
[Figure: Processor -> Cache and Write Buffer -> DRAM]
- Optimization: handle read misses before the write buffer writes
  - Must check for conflicts with the write buffer first

Virtual memory
- Processes use a large virtual memory
- Virtual addresses are dynamically mapped to physical addresses using HW & SW
- Page, page frame, page fault, translation lookaside buffer (TLB), etc.
[Figure: each process has its own virtual address space (0 to 2^n - 1), translated to a shared physical address space (0 to 2^m - 1)]

2: Avoiding Address Translation during Cache Indexing
- Virtual cache: use virtual addresses in caches
  - Saves time on the VA -> PA translation
- Disadvantages
  - Must flush the cache on a process switch (can be avoided by including the PID in the tag)
  - Alias problem: the OS and a process can have two VAs pointing to the same PA
- Compromise: virtually indexed, physically tagged
  - Use the page offset to index the cache (it is the same for VA and PA)
  - While data is read from the cache, the VA -> PA translation is done for the tag
  - Tag comparison uses the PA
  - But: the page size restricts the cache size

3: Multilevel Caches (1/2)
- Make the cache faster to keep up with the CPU, or larger to reduce misses?
- Why not both? Multilevel caches
  - Small and fast L1
  - Large (and cheaper) L2

3: Multilevel Caches (2/2)
- Average access time = L1 Hit time + L1 Miss rate * (L2 Hit time + L2 Miss rate * L2 Miss penalty)
- Local miss rate: # cache misses / # cache accesses
- Global miss rate: # cache misses / # CPU memory accesses
- L1 cache speed affects the CPU clock rate; L2 cache speed affects only the L1 miss penalty
- Can use more complex mapping for L2
- L2 can be large
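A small worked example of the two-level access time formula above; the latencies and miss rates are assumed numbers, not values from the lecture.

```c
#include <stdio.h>

/* Two-level average memory access time, as in the slide formula:
 *   AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * L2_penalty)
 * The latencies and miss rates below are illustrative assumptions. */
int main(void)
{
    double l1_hit  = 1.0;    /* cycles                */
    double l1_miss = 0.05;   /* local L1 miss rate    */
    double l2_hit  = 10.0;   /* cycles                */
    double l2_miss = 0.20;   /* local L2 miss rate    */
    double l2_pen  = 200.0;  /* cycles to main memory */

    double amat = l1_hit + l1_miss * (l2_hit + l2_miss * l2_pen);
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05*(10 + 0.2*200) = 3.5 */

    /* Global L2 miss rate = L1 miss rate * local L2 miss rate */
    printf("Global L2 miss rate = %.3f\n", l1_miss * l2_miss);
    return 0;
}
```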
4: Larger Block size
[Figure: miss rate vs. block size (16-256 bytes) for cache sizes 1K-256K, broken into compulsory, conflict and capacity misses]
- Trade-off; 32 and 64 byte blocks are common

5: Larger Cache size
- Simple method
- Square-root rule: quadrupling the size of the cache will halve the miss rate (illustrated in the sketch below)
- Disadvantages
  - Longer hit time
  - Higher cost
- Mostly used for L2/L3 caches

6: Higher Associativity
- Lower miss rate
- Disadvantages
  - Can increase hit time
  - Higher cost
- 8-way has similar performance to fully associative

11 Advanced Cache Optimizations
Reducing hit time
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth
4. Pipelined caches
5. Non-blocking caches
6. Multibanked caches
Reducing miss penalty
7. Critical word first
8. Merging write buffers
Reducing miss rate
9. Compiler optimizations
Reducing miss penalty or miss rate via parallelism
10. Hardware prefetching
11. Compiler prefetching

1: Small and simple caches
- Comparing the address against the tag memory takes time
- A small cache helps hit time
  - E.g., L1 caches kept the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
  - Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
- Simple: direct mapping
  - Can overlap the tag check with data transmission, since there is no choice of block
- Access time estimate for 90 nm using the CACTI 4.0 model
  - Median ratios of access time relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
[Figure: access time (ns) vs. cache size (16 KB - 1 MB) for 1-way, 2-way, 4-way and 8-way caches]

2: Way prediction
- Extra bits are kept in the cache to predict which way (block) in a set the next access will hit
  - The tag can be retrieved early for comparison
- Achieves a fast hit with just one comparator
- On a misprediction, several cycles are needed to check the other blocks
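The square-root rule can be checked with a few lines of code; the base cache size and miss rate below are assumptions chosen only to show that quadrupling the size halves the estimated miss rate.

```c
#include <math.h>
#include <stdio.h>

/* Square-root rule: miss rate is roughly proportional to
 * 1 / sqrt(cache size), so quadrupling the cache halves the miss rate.
 * The base size and miss rate are made-up illustration values. */
int main(void)
{
    double base_size = 32.0;   /* KB */
    double base_miss = 0.04;   /* 4% miss rate at 32 KB (assumed) */

    for (double size = 32.0; size <= 512.0; size *= 2.0) {
        double miss = base_miss * sqrt(base_size / size);
        printf("%6.0f KB cache -> estimated miss rate %.2f%%\n",
               size, 100.0 * miss);
    }
    return 0;
}
```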
3: Trace caches
- It is increasingly hard to feed modern superscalar processors with enough instructions
- Trace cache
  - Stores dynamic instruction sequences rather than bytes of data
  - An instruction sequence may include branches
  - Branch prediction is integrated with the cache
- Complex and relatively little used
- Used in the Pentium 4: the trace cache stores up to 12K micro-ops decoded from x86 instructions (also saves decode time)

4: Pipelined caches
- Pipeline technology applied to cache lookups
  - Several lookups in flight at once
  - Results in a faster cycle time
  - Examples: Pentium (1 cycle), Pentium III (2 cycles), Pentium 4 (4 cycles)
- L1: increases the number of pipeline stages needed to execute an instruction
- L2/L3: increases throughput
  - Nearly for free, since the hit latency is on the order of 10-20 processor cycles and caches are easy to pipeline

5: Non-blocking caches (1/2)
- A non-blocking (lockup-free) cache allows the data cache to continue to supply hits during a miss
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Requires that the lower-level memory can service multiple concurrent misses
  - Significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses
  - The Pentium Pro allows 4 outstanding memory misses

5: Non-Blocking Cache Implementation
- The cache can handle as many concurrent misses as there are MSHRs
- The cache must block when all valid bits (V) are set
- MHA = Miss Handling Architecture (very common)
- MSHR = Miss information/Status Holding Register
- DMHA = Dynamic Miss Handling Architecture

5: Non-blocking Cache Performance
[Figure: performance of non-blocking caches]

6: Multibanked caches
- Divide the cache into independent banks that can support simultaneous accesses
  - E.g., the T1 ("Niagara") L2 has 4 banks
- Banking works best when the accesses naturally spread themselves across the banks
  - The mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving
  - Spread block addresses sequentially across banks
  - E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; etc. (see the sketch below)
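A minimal sketch of sequential interleaving, assuming 4 banks and a 64-byte block size (the block size is not stated in the slides).

```c
#include <stdio.h>

/* Sequential interleaving across cache banks: with 4 banks, a block
 * goes to bank (block address mod 4). A 64-byte block size is assumed
 * for the address -> block-address step. */
#define NUM_BANKS  4
#define BLOCK_SIZE 64

unsigned bank_of(unsigned long addr)
{
    unsigned long block_addr = addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}

int main(void)
{
    /* Sequential blocks map to consecutive banks, so a streaming
     * access pattern spreads naturally across all banks. */
    for (unsigned long addr = 0; addr < 8 * BLOCK_SIZE; addr += BLOCK_SIZE)
        printf("address 0x%04lx (block %lu) -> bank %u\n",
               addr, addr / BLOCK_SIZE, bank_of(addr));
    return 0;
}
```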
7: Critical word first
- Don't wait for the full block before restarting the CPU
- Early restart
  - As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first
  - Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in
- Long blocks are more popular today, so critical word first is widely used

8: Merging write buffers
- The write buffer allows the processor to continue while waiting to write to memory
- If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry
  - If so, the new data is combined with that entry
- Multiword writes are more efficient to memory
- The Sun T1 (Niagara) processor, among many others, uses write merging

9: Compiler optimizations
- Instruction order can often be changed without affecting correctness
  - May reduce conflict misses
  - Profiling may help the compiler
- The compiler generates instructions grouped in basic blocks
  - If the start of a basic block is aligned to a cache block, misses are reduced
  - Important for larger cache block sizes
- Data is even easier to move
- Lots of different compiler optimizations

10: Hardware prefetching
- Prefetching relies on having extra memory bandwidth that can be used without penalty
- Instruction prefetching
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  - The requested block is placed in the instruction cache when it returns; the prefetched block is placed in the instruction stream buffer
- Data prefetching
  - The Pentium 4 can prefetch data into the L2 cache from up to 8 streams
  - Prefetching is invoked after 2 successive L2 cache misses to a page
[Figure: performance improvement from hardware prefetching on selected SPECint2000 and SPECfp2000 benchmarks, ranging from 1.16 (gap) to 1.97 (equake)]

11: Compiler prefetching
- Data prefetch: load data into a register (HP PA-RISC loads)
- Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
- Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing prefetch instructions takes time (see the sketch below)
  - Is the cost of issuing prefetches < the savings from reduced misses?
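As a sketch of compiler-inserted cache prefetching, the loop below issues a prefetch a fixed distance ahead of the current element, using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance and array size are arbitrary illustration values.

```c
#include <stdio.h>

#define N 1024
#define PREFETCH_DISTANCE 16   /* how far ahead to prefetch (tuning knob) */

/* Sum an array, issuing a cache prefetch PREFETCH_DISTANCE iterations
 * ahead. This mimics what a compiler inserting prefetch instructions
 * would do; __builtin_prefetch is a non-faulting cache prefetch. */
long sum_with_prefetch(const long *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /* read */, 3);
        sum += a[i];
    }
    return sum;
}

int main(void)
{
    static long a[N];
    for (int i = 0; i < N; i++)
        a[i] = i;
    printf("sum = %ld\n", sum_with_prefetch(a, N));
    return 0;
}
```

Whether this pays off depends on the trade-off named above: the extra prefetch instructions cost issue slots, so the distance has to be tuned so the data arrives just in time.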
Cache Coherency
- Consider the following case: two processors share address X
  - Both cores read address X
  - Address X is brought from memory into the caches of both processors
  - Now one of the processors writes to address X and changes its value
- What happens? How does the other processor get notified that address X has changed?

Two types of cache coherence schemes
- Snooping
  - Broadcast writes, so all copies in all caches will be properly invalidated or updated
- Directory (see the sketch below)
  - In a directory structure, keep track of which cores are caching each address
  - When a write occurs, query the directory and properly handle any other cached copies
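A minimal sketch of the directory idea, assuming an invalidation-based protocol, a toy 8-block memory and a sharer bitmask per block; real directories also track ownership and dirty state.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES  4
#define NUM_BLOCKS 8   /* toy memory of 8 blocks */

/* Minimal directory: for each block, a bitmask of which cores hold a
 * copy. On a write, all other sharers are invalidated. */
struct directory_entry {
    uint8_t sharers;   /* bit i set => core i caches this block */
};

static struct directory_entry dir[NUM_BLOCKS];

void core_read(int core, int block)
{
    dir[block].sharers |= (uint8_t)(1u << core);
    printf("core %d reads block %d, sharers = 0x%02x\n",
           core, block, dir[block].sharers);
}

void core_write(int core, int block)
{
    /* Invalidate every other cached copy, then record the writer as
     * the only sharer (an update protocol would forward the new value
     * instead of invalidating). */
    uint8_t others = dir[block].sharers & (uint8_t)~(1u << core);
    for (int c = 0; c < NUM_CORES; c++)
        if (others & (1u << c))
            printf("  invalidate block %d in core %d\n", block, c);
    dir[block].sharers = (uint8_t)(1u << core);
    printf("core %d writes block %d, sharers = 0x%02x\n",
           core, block, dir[block].sharers);
}

int main(void)
{
    core_read(0, 3);    /* both cores read address X (block 3) */
    core_read(1, 3);
    core_write(0, 3);   /* core 0 writes: core 1's copy is invalidated */
    return 0;
}
```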