TDT 4260 lecture 3 spring semester 2015


Slide 1: TDT 4260 lecture 3, spring semester 2015
Lasse Natvig, The CARD group, Dept. of Computer & Information Science, NTNU

Slide 2: Lecture overview
- Repetition
  - Chap. 1: Performance, CPI
- Appendix B: Review of Memory Hierarchy
  - Cache basic principles, performance
- Kahoot-quiz, test 1
- Course info & adm
  - Updated reading list
  - Student feedback group, volunteers
- App. B: continued
  - Virtual Memory
- Chapter 2: Memory Hierarchy Design
  - Cache optimization

Slide 3: Repetition (1 of 2)
- Moore's law
  - Flattening out, focus on multicores since
- Main classes of computers
  - Mobile, desktop, server, super & WSC, embedded
  - Some additions/news:
    - Internet of Things (IoT) (sensor networks)
      - The limit is our imagination
      - Challenge: How to use it for a better world (Green ICT)
    - Servers: specialized servers (Paolo, HP, HiPEAC-Keynote 14)
- Parallelism: ILP, DLP, TLP
- Tech. trends
  - Latency lags bandwidth (BW)
  - The Processor-Memory Gap
  - The IO pin problem: off-chip BW is a bottleneck

Slide 4: Repetition (2 of 2)
- Flynn's taxonomy: SISD, SIMD, MISD, MIMD
- Amdahl's law (1967)
  - SpeedUp(n) = Time(1) / Time(n) = (s + p) / [s + (p/n)]
  - "Pessimistic and famous"

Slide 5: How the serial fraction limits speedup
- Amdahl's law, with s = serial fraction
- Work hard to reduce the serial part of the application
  - Remember IO
  - Think differently (than traditionally or sequentially)
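
The effect of the serial fraction can be tried out numerically. The following is a minimal sketch of Amdahl's law as stated on the previous slide, assuming the fractions are normalized so that s + p = 1; the function name `amdahl_speedup` and the example value s = 0.1 are illustrative choices, not taken from the lecture.

```python
def amdahl_speedup(serial_fraction, n):
    """Speedup on n processors when a fraction s of the work is serial (p = 1 - s)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # SpeedUp(n) = (s + p) / (s + p/n), with s + p = 1
    return 1.0 / (serial_fraction + parallel_fraction / n)


# Assumed example: 10% of the work is serial.
for n in (1, 2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.1, n), 2))
```

With s = 0.1 the speedup can never exceed 1/s = 10, no matter how many processors are added, which is why reducing the serial part matters so much.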

Slide 6: CPU Time
CPU Time = CPU Clock Cycles * Clock Cycle Time = CPU Clock Cycles / Clock Rate
- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
- These two can be conflicting goals, how?

Slide 7: Instruction Count and CPI
Clock Cycles = Instruction Count * Cycles per Instruction
CPU Time = Instruction Count * CPI * Clock Cycle Time = (Instruction Count * CPI) / Clock Rate
- Instruction Count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI, the average CPI is affected by the instruction mix

Slide 8: CPU Performance Equation
- See textbook page 49; here IC = instructions executed, or Instruction Count
- The equation below can be read as: CPU time = IC per program (average if there are many programs) * (average number of clock cycles per instruction) * seconds per clock cycle
CPU Time = (IC / Program) * (Clock cycles / IC) * (Seconds / Clock cycle)
- Performance depends on
  - Algorithm: affects Instruction Count (IC), possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, Clock cycle time
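
A small numeric sketch may help connect instruction mix, average CPI and CPU time. The instruction mix, instruction count and clock rate below are made-up values for illustration, and the function names are hypothetical.

```python
def average_cpi(mix):
    """Average CPI for a list of (fraction_of_instructions, cpi) pairs.

    The fractions are assumed to sum to 1.
    """
    return sum(fraction * cpi for fraction, cpi in mix)


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: IC * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical mix: 50% ALU ops (1 cycle), 30% loads/stores (2 cycles), 20% branches (4 cycles).
mix = [(0.5, 1.0), (0.3, 2.0), (0.2, 4.0)]
cpi = average_cpi(mix)
print("average CPI:", cpi)
print("CPU time (s):", cpu_time(1_000_000_000, cpi, 2_000_000_000))
```

Changing only the mix (for example through a different compiler) changes the average CPI, and therefore the CPU time, even when IC and clock rate stay the same.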

Slide 9: APPENDIX B: REVIEW OF THE MEMORY HIERARCHY, FOLLOWED BY CHAPTER 2

Slide 10: Why do we need memory hierarchy?
[Figure: processor and memory performance plotted against year; the processor-memory performance gap is growing]

Slide 11: Why is memory hierarchy efficient?
- Principle of Locality
  - Spatial Locality: addresses near each other are likely to be referenced close together in time
  - Temporal Locality: the same address is likely to be reused in the near future
- Idea: Store recently used elements in fast memories close to the processor
- The term cache is now applied at many levels, whenever buffering is employed to reuse commonly occurring items
  - Databases, file systems: caches handled by software (SW)
  - Processor caches: normally handled automatically by hardware (HW)
    - Manual operations possible (by programmer/SW), e.g. flush, mark as shared, protected etc.

Slide 12: Memory hierarchy
- We want large, fast and cheap at the same time
[Figure: processor (control and datapath) followed by a chain of memory levels]
From the level closest to the processor to the level farthest away:
  Speed:     fastest to slowest
  Capacity:  smallest to largest
  Cost:      most expensive to cheapest

Slide 13: Two examples (Fig 2.1)

Slide 14: 4 Memory Hierarchy Questions
- Q1: Where can a block be placed in the upper level? (block placement)
- Q2: How is a block found in the upper level? (block identification)
- Q3: Which block should be replaced on a miss? (block replacement)
- Q4: What happens on a write? (write strategy)

Slide 15: Q1: Block Placement
- Block 12 placed in cache with 8 locations
  - Fully associative: block 12 can go anywhere
  - Direct mapped: block 12 can go only into block 4 (12 mod 8)
  - Set associative: block 12 can go anywhere in set 0 (12 mod 4)
[Figure: the three caches drawn side by side with their block numbers; the set-associative cache is divided into Set 0 to Set 3]
- 2 blocks per set is called 2-way set-associative
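
The three placement schemes differ only in how many sets the cache is divided into. The sketch below reproduces the block-12 example; the function name `candidate_frames` and the assumption that the frames of a set are numbered consecutively are illustrative choices, not something the slide specifies.

```python
def candidate_frames(block_no, num_frames, ways):
    """Return the cache frames where a memory block may be placed.

    ways == 1           -> direct mapped
    ways == num_frames  -> fully associative
    anything in between -> set associative
    Assumes num_frames is a multiple of ways and that the frames of
    set i are numbered i*ways .. i*ways + ways - 1.
    """
    num_sets = num_frames // ways
    set_index = block_no % num_sets
    first = set_index * ways
    return list(range(first, first + ways))


print(candidate_frames(12, 8, ways=1))  # direct mapped: 12 mod 8
print(candidate_frames(12, 8, ways=2))  # 2-way set associative: set 12 mod 4
print(candidate_frames(12, 8, ways=8))  # fully associative: anywhere
```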

Slide 16: Q1/Q2: Placement and Identification
- Idea: Use a subset of the address bits as an index to limit the number of places we need to search, and search all possible places in parallel
- Memory address = | Tag | Index | Block offset |, where Tag and Index together form the block address
  - Block offset: Addresses within a block map to the same cache block
  - Index: Points to a row in the cache (i.e. one or more possible placements)
  - Tag: Remainder of the address, used to identify content
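
The tag/index/offset split can be sketched with a few shifts and masks. This assumes the block size and the number of sets are powers of two; the 64-byte blocks and 128 sets in the example are arbitrary illustrative values, and `split_address` is a hypothetical helper name.

```python
def split_address(address, block_size, num_sets):
    """Split a memory address into (tag, index, block_offset).

    Assumes block_size and num_sets are powers of two.
    """
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    block_offset = address & (block_size - 1)
    index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, block_offset


tag, index, offset = split_address(0x12345, block_size=64, num_sets=128)
print(tag, index, offset)
```

Two addresses that differ only in the block offset give the same tag and index, so they map to the same cache block.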

Slide 17: Q3: Block Replacement
- If all possible locations are full, which block should we throw out?
  - Called the Replacement Policy
- Possible Replacement Policies
  - Random
  - Least Recently Used (LRU)
  - First in, first out
- Do we need a replacement policy for a direct mapped cache?
(Read more at page B-9)
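
As an illustration of one of the policies listed, here is a minimal sketch of LRU replacement within a single cache set. It models only the tags, not the data, and uses Python's `OrderedDict` to keep recency order; real hardware usually approximates LRU with a few status bits instead. The class name `LRUSet` is hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with `ways` frames and LRU replacement (tags only)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[tag] = None
        return False


two_way = LRUSet(ways=2)
print([two_way.access(tag) for tag in "ABACB"])
```

With `ways=1` there is only one candidate frame, so there is nothing to choose between: a direct mapped cache does not need a replacement policy.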

Slide 18: Q4: Write Strategy (1/2)
- Write-through
  - Write the data to the cache and to lower levels of the memory hierarchy
  - Simple to implement
  - Cache is always clean (i.e. consistent)
  - Trade-off: increases traffic & energy
- Write-back
  - Writes proceed at the speed of the cache
  - Multiple writes possible without involving lower levels, saving traffic & energy
  - Trade-off: cache contains dirty data, which complicates the coherence protocol
(Read more at page B-11)

Slide 19: Q4: Write Strategy (2/2)
- We don't need the previous data on writes
- Write allocate
  - Store the block in the cache on a write miss
  - Most common with write-back
- No-write allocate
  - Do not store the block on a write miss
  - Most common with write-through
(Read more at page B-11)

Slide 20: A 4-way Set Associative Cache

Slide 21: Cache performance
Average access time = Hit time + Miss rate * Miss penalty
- Miss rate alone is not an accurate measure
- Cache performance is important for CPU performance
  - More important with higher clock rate
- Cache design can also affect instructions that don't access memory!
  - Example: A set associative L1 cache on the critical path requires extra logic, which may increase the clock cycle time
  - Trade-off: Additional hits vs. faster clock frequency
(Read more at page B-16)
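
A short calculation shows why miss rate alone can mislead. The two cache configurations below are hypothetical, with all times in clock cycles.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical cache A: fast hits, slightly higher miss rate.
cache_a = average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=100)
# Hypothetical cache B: lower miss rate, but slower hits.
cache_b = average_access_time(hit_time=3, miss_rate=0.04, miss_penalty=100)
print(cache_a, cache_b)
```

Cache B misses less often, yet its average access time is higher because every access pays the longer hit time.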

Slide 22: Classifying Misses: The Three Cs
- Compulsory
  - The first access to a block
  - Will occur even with an infinite cache size
  - Also known as cold start misses or first reference misses
- Capacity
  - Would not happen in an infinite cache
  - Blocks being evicted and later retrieved
- Conflict
  - Misses that occur because of restrictions on where a block can be placed
  - Do not occur in a fully associative cache
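
One common way to apply these definitions in a simulator is sketched below for a direct mapped cache: a miss is compulsory if the block has never been referenced, a capacity miss if a fully associative cache of the same size would also have missed, and a conflict miss otherwise. The choice of LRU for the fully associative reference cache is an assumption of this sketch, and `classify_misses` is a hypothetical name.

```python
from collections import OrderedDict


def classify_misses(block_refs, num_frames):
    """Count hits and the three kinds of misses for a direct mapped cache."""
    seen = set()                # blocks referenced at least once
    direct = {}                 # frame index -> block currently stored there
    fully_assoc = OrderedDict() # same-size fully associative LRU reference cache
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_refs:
        # Look up and update the fully associative reference cache.
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            if len(fully_assoc) == num_frames:
                fully_assoc.popitem(last=False)
            fully_assoc[block] = None

        # Look up and update the direct mapped cache being studied.
        frame = block % num_frames
        dm_hit = direct.get(frame) == block
        direct[frame] = block

        if dm_hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif not fa_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        seen.add(block)
    return counts


# Blocks 0 and 4 compete for frame 0 of a 4-frame cache although the cache is mostly empty.
print(classify_misses([0, 4, 0, 4], num_frames=4))
```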

Slide 23: 6 Basic Cache Optimizations
- Reducing Miss Rate
  1. Larger Block size (Compulsory misses)
  2. Larger Cache size (Capacity misses)
  3. Higher Associativity (Conflict misses)
- Reducing Miss Penalty
  4. Multilevel Caches
  5. Giving Reads Priority over Writes
- Reducing Hit Time
  6. Avoiding Address Translation during Cache Indexing
(Read more at page B-22)

Slide 24: 1: Larger Block size
[Figure: miss rate (0% to 25%) versus block size in bytes, one curve per cache size (1K, 4K, 16K, 64K, 256K); annotations: compulsory misses, conflict misses, capacity misses]
- Cache block or cache line
- Typical trade-off
- 32 and 64 byte blocks are common

Slide 25: 2: Larger Cache size
- Simple method
- Disadvantages
  - Longer hit time
  - Higher cost
- Most used for L2/L3 caches
  - L1 is often on the critical path of the processor
  - Higher cycle time affects all instructions

Slide 26: 3: Higher Associativity
- Disadvantages
  - Can increase hit time
  - Higher cost
- 8-way has similar performance to fully associative
- 2:1 cache rule
  - A direct mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2

Slide 27: KAHOOT-QUIZ NO. 1

Slide 28: TDT4260 reading list, research papers
- Motivation, very short intro
  - Examples of relevant research, new and old
  - Rapid changes in technology make old results reappear in new products
    - E.g. dataflow in Maxeler (PACT Keynote, Edinburgh, Sept. 2013)
- 4 Papers
  - Chip Multiprocessor (CMP) Design Space Exploration (DSE) (method, classical research paper structure), PACT
  - Manchester Dataflow Machine, 1983 (classical, think differently, method)
  - The Future of Microprocessors, Borkar et al., Communications of the ACM, 2011, overview paper (more on tech trends)
  - Vilje - The New Supercomputer at NTNU, by Jørn Amundsen, Meta, Issue 4, 2011, short paper ("The best computer in Norway")

Slide 29: COURSE ADM
- Reference group
  - Any volunteers?
- Guest lecture on Monday 9/2, at 1215 in F-6: Kenneth Østby, from ARM-Trondheim, on the ARM Mali GPU architecture and its memory hierarchy (caches etc.)

Slide 30: 4: Multilevel Caches (1/3)
- Make the cache faster to keep up with the CPU, or larger to reduce misses?
- Why not both? Multilevel caches
  - Small and fast L1
  - Large (and cheaper) L2
- Example: Intel Core i quad core processor with 3 levels

Slide 31: 4: Multilevel Caches (2/3)
Example: Intel Haswell
  L1 cache   64 KB per core
  L2 cache   256 KB per core
  L3 cache   2 MB to 8 MB shared
  L4 cache   n/a or 128 MB (Iris Pro models)

Slide 32: 4: Multilevel Caches (3/3)
Average access time = L1 Hit time + L1 Miss rate * (L2 Hit time + L2 Miss rate * L2 Miss penalty)
- Local miss rate = #cache misses / #cache accesses
- Global miss rate = #cache misses / #CPU memory accesses
- L1 cache speed affects CPU clock rate
- L2 cache speed affects only L1 miss penalty
  - Can use more complex mapping for L2
  - L2 can be large
(Read more at page B-31)
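
The difference between local and global miss rate, and the two-level access time formula, can be checked with a few lines of arithmetic. The access counts and latencies below are illustrative assumptions (times in clock cycles), and the function names are hypothetical. Note that the L2 miss rate in the formula is the local one.

```python
def miss_rates(cpu_accesses, l1_misses, l2_misses):
    """Return (L1 miss rate, L2 local miss rate, L2 global miss rate)."""
    l1_rate = l1_misses / cpu_accesses     # L1 sees every CPU access
    l2_local = l2_misses / l1_misses       # L2 is accessed only on L1 misses
    l2_global = l2_misses / cpu_accesses
    return l1_rate, l2_local, l2_global


def two_level_access_time(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, l2_miss_penalty):
    """L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * L2 miss penalty)."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * l2_miss_penalty)


# Assumed example: 1000 memory accesses, 40 miss in L1, 20 of those also miss in L2.
l1_rate, l2_local, l2_global = miss_rates(1000, 40, 20)
print(l1_rate, l2_local, l2_global)
print(two_level_access_time(1, l1_rate, 10, l2_local, 200))
```

In this example L2 looks poor by its local miss rate (half of its accesses miss), but only 2% of all CPU accesses go all the way to memory.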

Slide 33: 5: Giving Reads Priority over Writes
- Caches typically use a write buffer
  - Assumes write-through policy
  - CPU writes to cache and write buffer
  - Cache controller transfers from buffer to RAM
  - Write buffer usually FIFO with n elements
  - Works well as long as the buffer does not fill faster than it can be emptied
[Figure: processor, cache, write buffer and DRAM]
- Optimization
  - Handle read misses before write buffer writes
  - Must check for conflicts with write buffer first
(Read more at pages B-35 to B-36)
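
The conflict check can be illustrated with a toy model. In the sketch below memory is a plain dictionary, the write buffer is a bounded FIFO, and a read miss that matches a pending write is served by forwarding the buffered value. Forwarding is one possible choice assumed here; waiting for the buffer to drain is another. All class and function names are hypothetical.

```python
from collections import deque


class WriteBuffer:
    """Bounded FIFO of pending (address, value) writes, oldest first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def push(self, address, value):
        """Queue a write; return False if the buffer is full (the CPU would stall)."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((address, value))
        return True

    def drain_one(self, memory):
        """Retire the oldest pending write to memory, if there is one."""
        if self.entries:
            address, value = self.entries.popleft()
            memory[address] = value

    def find(self, address):
        """Return (True, value) for the newest pending write to address, else (False, None)."""
        for pending_address, value in reversed(self.entries):
            if pending_address == address:
                return True, value
        return False, None


def read_miss(address, write_buffer, memory):
    """Serve a read miss ahead of the queued writes, checking for conflicts first."""
    found, value = write_buffer.find(address)
    if found:
        return value        # conflict: memory is stale, use the pending value
    return memory[address]  # no conflict: safe to read before the buffer drains


memory = {0x10: 1, 0x20: 7}
buf = WriteBuffer(capacity=2)
buf.push(0x10, 2)
print(read_miss(0x10, buf, memory), read_miss(0x20, buf, memory))
```

Without the check, the read of address 0x10 would return the old value still held in memory.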

Slide 34: B.4: VIRTUAL MEMORY
Repetition from courses like computer fundamentals (TDT4160) and operating systems (TDT4186)

Slide 35: The Operating System (OS)
- Adds functionality on top of the ISA that can be reused across applications
  - System calls, e.g. read files, allocate memory, etc.
- The OS offers useful services to the user program
  - Time sharing, file systems, memory management, etc.
- Memory Management
  - Virtual Memory (one of the most important inventions in computer history!)
  - Cooperation between OS and architecture
    - OS manages memory
    - Architecture provides efficient operations

Slide 36: Virtual Memory
- Provides the illusion of a very large, private memory available to each process
- Advantages:
  - Simplifies process loading
  - Enforces isolation between processes
  - Makes the size of the address space independent of the amount of physical memory
    - Available address space still depends on the number of bits in the virtual address

Slide 37: Paging
- Idea: partition memory into fixed size blocks called pages
- Each process has its own virtual address space
  - This is mapped to one or more pages that reside in physical memory or on disk
- Translation from virtual to physical address is necessary
  - Translation is done by the Memory Management Unit (MMU)
  - Transparent for software
[Figure: memory divided into pages 1 to 8]

Slide 38: How is Paging Implemented?
- Idea: Use a table to switch the most significant address bits

Slide 39: Memory vs. disk
- Some pages are in memory and some are on disk
- We use one bit in the page table to differentiate between pages on disk and pages in memory
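
The table lookup and the in-memory bit from the last two slides can be sketched as follows. The 4 KB page size, the dictionary-based page table and the names `translate`, `PageFault`, `present` and `frame` are assumptions made for illustration; a real MMU does this in hardware with multi-level tables and a TLB.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (a power of two)


class PageFault(Exception):
    """Raised when the referenced page is not in physical memory."""


def translate(virtual_address, page_table, page_size=PAGE_SIZE):
    """Translate a virtual address to a physical address.

    The most significant bits (the virtual page number) are swapped for a
    physical frame number via the table; the page offset is kept unchanged.
    """
    virtual_page, offset = divmod(virtual_address, page_size)
    entry = page_table.get(virtual_page)
    if entry is None or not entry["present"]:
        # Page is on disk (or unmapped): the OS would have to handle this.
        raise PageFault(virtual_page)
    return entry["frame"] * page_size + offset


# Hypothetical page table: page 0 is in frame 5, page 1 is on disk.
page_table = {
    0: {"present": True, "frame": 5},
    1: {"present": False, "frame": None},
}
print(hex(translate(0x123, page_table)))
```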

Slide 40: Page Replacement Policy
- Page references also follow the principle of locality
- If the memory is full, we need to evict a page
  - LRU (Least Recently Used)
  - FIFO (First In First Out)
  - Similar trade-offs as with caches
- Servicing page misses means accessing disk
  - Miss penalty is higher than with caches
  - Page fault, context switch
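
To compare the two policies, the sketch below counts page faults for a reference string under FIFO and LRU. The reference string and frame counts are illustrative, and the function names are hypothetical.

```python
from collections import OrderedDict, deque


def count_faults_fifo(refs, frames):
    """Page faults with FIFO replacement and `frames` physical frames."""
    resident = set()
    arrival_order = deque()  # oldest loaded page first
    faults = 0
    for page in refs:
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            resident.discard(arrival_order.popleft())  # evict the oldest page
        resident.add(page)
        arrival_order.append(page)
    return faults


def count_faults_lru(refs, frames):
    """Page faults with LRU replacement and `frames` physical frames."""
    resident = OrderedDict()  # least recently used page first
    faults = 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)
            continue
        faults += 1
        if len(resident) == frames:
            resident.popitem(last=False)  # evict the least recently used page
        resident[page] = None
    return faults


refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
for frames in (3, 4):
    print(frames, count_faults_fifo(refs, frames), count_faults_lru(refs, frames))
```

On this particular string FIFO takes more faults with 4 frames than with 3, while LRU improves as memory grows, so the policy choice interacts with memory size in ways that are not always intuitive.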

Slide 41: Virtual memory and protection
- Chapter 2.4 (and appendix B.4) (and OS basics)
- Background
  - Time sharing, parallelism on one processor (core)
  - Multiprogramming, processes
  - Kernel (supervisor) processes are part of the OS
  - User processes are part of a user application
- A part of a user process state can be used/read by the process, but can only be written to by the OS
  - Example: Only the OS can update the page table
