PCOPP Uni-Processor Optimization: Features of Memory Hierarchy
1 PCOPP-2002 Day 1 Classroom Lecture: Uni-Processor Optimization and Features of the Memory Hierarchy
2 The Hierarchical Memory: Features and Performance Issues
Lecture outline; the following topics will be discussed:
- Basic definitions
- Execution of instructions in serial computing
- Hierarchical features of the memory sub-system
- Managing memory overheads
- How to calculate memory access time
- Reducing memory overheads for performance
3 Definition of Terms
What is the clock rate? Today's computers perform operations at very high speed. The clock rate is the rate at which new operations can begin. PCs: 900 MHz (= 900,000,000 cycles/sec). A clock period is the smallest unit of measure of time on a processor: 1 clock period = 1/(clock rate), so 1 clock = 1/(900*10^6) = 1.11*10^-9 seconds = 1.11 nanoseconds. How many clocks does it take to perform an operation? This depends on the processor, but exemplary times might be: multiply: 4 clock periods; divide: 20 clock periods; square root: 140 clock periods.
4 Definition of Terms
What are MFLOPS? MFLOPS = megaflops = millions of floating point operations per second (e.g. adding or multiplying real numbers). MFLOPS and MFLOPS/sec are often used synonymously. Example: a 500 MHz chip has a clock period of 2.0 ns. If it can perform 1 add and 1 multiply in every clock, i.e. 2 operations/clock, then 2 ops / (2.0*10^-9 sec) = 1*10^9 ops/sec = 1000 MFLOPS. What are MIPS? MIPS = millions of instructions per second (e.g. integer and logical operations).
5 History: CPU Performance
CPU time is determined by the application, the compiler, the instruction set architecture, and the implementation technology:
CPU Time = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle), i.e. CPU Time = N_inst * CPI / (clock rate)
What single-processor efficiency can you expect? Assumption: memory performance will be the key. Response: determine the percentage of time the processor is stalled due to memory; this gives an upper bound on performance improvements.
6 Hitting the Wall: Clock Speeds
Two basic techniques to improve processor performance:
- Exploit higher instruction-level parallelism
- Increase the clock rate by sub-dividing the instruction pipeline into simpler stages
Mechanisms for supporting multiple-instruction execution for cost-effective performance:
- Pipelined execution
- Super scalar execution
- VLIW processors
- Multi scalar processors
- Super scalar versus super-pipelined processors
7 Pipelined Execution of Instructions
Instructions are executed in five stages: instruction fetch (IF), decode (ID), operand fetch (OF), execute (E), write back (WB).
(a) Stages in instruction execution: IF ID OF E WB.
(b) Serial execution of instructions: each instruction runs IF ID OF E WB to completion before the next begins.
(c) Pipelined execution of instructions: successive instructions overlap, each offset by one stage.
If each stage is executed in a single clock cycle, an entire instruction execution can be accomplished in five clock cycles. It is easy to see that in a sequential execution of the instructions, most of the processor hardware idles for a majority of the time. The maximum speedup achievable through this five-stage pipelining is therefore five.
8 Super Scalar Execution
The ability of a processor to issue multiple instructions in the same cycle is referred to as super scalar execution.
1. First, instructions in a program are related to each other. The results of an instruction may be required by subsequent instructions; this is referred to as true data dependency. Dependencies of this type must be resolved before simultaneous issue of instructions.
2. Another source of dependence between instructions results from the finite resources of the machine.
3. The flow of control through a program enforces a third form of dependency between instructions.
9 Super Scalar Execution (Contd.)
3. The flow of control: the branch destination is known only at the point of execution, so scheduling instructions across branches (or subroutine calls) may lead to an incorrect flow of control. Accurate branch prediction is highly desirable for efficient super scalar execution; branch instructions occur very frequently (about one in six instructions). The ability of a processor to detect and schedule independent instructions is critical to super scalar performance. The order in which instructions are issued and completed has implications for the required look-ahead and consequently for performance.
10 Multi Scalar Execution of a simple program segment
The multi scalar execution model takes a coarser view of the control graph: sets of nodes and edges are aggregated together to form tasks, and multiple tasks are executed in parallel by different functional units. Multi scalar execution can be compared to implicit parallel execution of programs on multi-computers. Consider a program segment with blocks A, B, and C, where overlapping execution of multiple instructions plays a major role for performance. Execution trace: ((A, B) (A, B) C) (A, B, C). Possible multiscalar schedules (assuming two functional units):
Entire program segment as a single task:
Functional unit 0: ((A, B) (A, B) C)
Functional unit 1: (A, B, C)
Inner loop (A, B) as a task:
Functional unit 0: (A, B)
Functional unit 1: (A, B)
Functional unit 0: (C, A, B, C)
11 Instruction Level Parallelism: Floating Point
Computer 1: 2 FP units, each capable of 1 fused multiply-add (1;2), or 1 add (1;1), or 1 multiply (1;2); 1 quad load/store (1;1); leading to (up to) 4 FP ops per CP and 4 memory-access ops per CP.
Computer 2: 1 FP unit with 1 floating point add pipeline (1;4), 1 floating point multiply pipeline (1;4), 1 load/store (1;3); leading to (up to) 2 FP ops per CP and 1 memory-access op per CP.
Computer 3: 1 FP unit with 1 floating point add pipeline (1;2), 1 floating point multiply pipeline (1;2), 1 load/store (1;3); leading to (up to) 2 FP ops per CP and 1 memory-access op per CP.
12 Figure: evolution of instruction-set styles over time, toward explicitly parallel performance. CISC: complex variable-length instructions, sequencing done in hardware. RISC with out-of-order (OOO) super scalar execution (e.g. UltraSPARC-III, Alpha): hardware detects implicit parallelism, with hardware OOO scheduling and speculation. EPIC, explicitly parallel instruction computing (e.g. Itanium, McKinley): simple fixed-length instructions, sequencing done by the compiler.
13 Memory Management
Memory reference optimization and managing memory overheads play a key role for performance. Getting memory references right is one of the most important challenges of application performance. Topics:
- Hierarchical features of the memory sub-system
- Memory access patterns for performance
- Cache performance and cache misses
- Cache memories for reducing memory overheads
- Role of data reuse on memory system performance
- Techniques for hiding memory latency (multi-threading)
14 Remarks: Managing Memory Overheads
Current microprocessors have clock cycle times of 2-3 ns, which imposes stringent requirements on effective memory latency and bandwidth. The fastest DRAMs operate at tens of nanoseconds latency. This reflects the major mismatch between the rate at which data is required by the processor and the rate at which DRAMs can supply the data. Memory (DRAM, data cache, instruction cache, registers) sizes and access times have improved dramatically. Trend: speed and memory increase by ~2x every ~1.5 years (also known as Moore's law). That's a factor of 1000 in 15 years and a factor of 1,000,000 in 30 years! Remember: at 100 MHz you get 10^-8 secs/cycle, in which light can travel only about 9 feet! Memory banking to reduce effective latency and increase bandwidth is important for performance.
15 The Memory Sub-system: Hierarchy Features
Cache size is critical for controlling cache misses. A cache line is usually 64, 128 or 256 bytes long; the cache line size has an impact on miss ratios and the time to access memory. Cache memories work in the range of 4-25 ns latency.
Capacity (C), latency (L) and bandwidth (B) by level:
- Registers: C < 2 KB, L = 0 cycles, B = 1-32 GB/s
- Level-1 cache: tens of KB, 0-2 cycles, 1-16 GB/s
- Level-2 cache: 64 KB-4 MB, 2-10 cycles, 1-4 GB/s
- Main memory: 16 MB-16 GB, tens to hundreds of cycles
- Remote memory: GB range, thousands of cycles, MB/s bandwidth
- Disk: GB range, 100K-1M cycles, 1-16 MB/s
The first level cache (L1 cache) is typically on the same chip as the processor; access to the L1 cache typically takes a single clock cycle, and its size is in the range of 32K words. The second level cache is larger (256 KB-16 MB) and may be on or off chip.
16 The Memory Sub-system: Access Time
A lot of time is spent accessing/storing data from/to memory, so it is important to keep in mind the relative times for each memory type (CPU registers, D-cache, I-cache, L2, RAM, disk). Approximate access times:
- CPU registers: 0 cycles (that's where the work is done!)
- L1 cache (data and instruction): 1 cycle; repeated access to the cache takes only 1 cycle
- L2 cache (static RAM): 3-5 cycles
- Memory (DRAM): tens of cycles on a cache miss, plus additional cycles for a Translation Lookaside Buffer (TLB) update
- Disk: about 100,000 cycles!
- Connecting to other nodes: depends on network latency
17 The Memory Sub-system: Hierarchy Features
Registers: few (about 32) very fast 32-bit (sometimes 64-bit) registers. Compilers usually take care of using them.
Data cache: from about 4 KB to 256 KB, and increasing. Caches are an order of magnitude more expensive than DRAM but 10 to 100 times faster. (L2 caches are static RAM, cheaper than L1 but slower, although faster than DRAM.)
Instruction cache: similar in speed to the data cache, but used only to store instructions. Useful with loops, especially when the whole loop fits within the instruction cache.
18 Reducing Memory Overheads for Performance
- Hide memory latency (cache, non-blocking loads, ILP)
- Reduce memory latency (faster DRAM, faster interconnects)
Hardware counters to examine, per cycle: issued instructions, issued loads, issued stores, mispredicted branches, primary data cache misses, secondary data cache misses.
19 Reducing Memory Overheads for Performance (Contd.)
What do these counts mean to the execution time? How far from optimal are they? Which architectural feature pays off for your code? How effective are data caches?
Hit rate = number of hits / number of memory references. This is only an indirect measure of performance: we need to know how hits and misses contribute to CPU time, i.e. the average memory access time for hits and misses.
20 Characteristics of Virtual Memory Machines
Characteristic 1: virtual memory machines translate the logical memory addresses your program generates into physical addresses in the memory system. This gives a degree of flexibility by allowing all processes to believe that they each have all of the memory system to themselves.
Characteristic 2: virtual memory systems divide your program's memory up into chunks called pages. Page sizes vary from 512 bytes to 16 KB, depending on the machine. By being separated into pages, programs are easier to fit together in memory, or to move out to disk in portions.
21 Virtual to Physical Address Mapping
Figure: a virtual address (e.g. location 1000) is translated through the process region table and page table into the physical address at which the data resides.
22 Characteristics of Virtual Memory Machines: Example
Say that your program asks for a variable stored at location 1000. To find where the variable is actually stored, the location has to be translated from a virtual address to a physical address. The map containing such translations is called a page table. Each process has several page tables associated with it, corresponding to different regions such as the program text and data segments. For instance, if the page size is 512 bytes, then location 1000 falls within the second page.
23 Virtual Memory Machines: Translation Lookaside Buffer (TLB)
All modern virtual memory machines have a special cache called a Translation Lookaside Buffer, or TLB, for virtual-to-physical memory address translation. The two inputs to the TLB are an integer that identifies the program making the memory request and the virtual page requested; from the output pops a pointer to the physical page number. Virtual address in, physical address out. TLB lookups occur in parallel with instruction execution, so if the address data is in the TLB, memory references proceed quickly. The TLB is limited in size; if your program asks for a virtual-to-physical address translation and the entry doesn't exist in the TLB, you suffer a TLB miss.
24 Translation Lookaside Buffer: Example
Construct an example in which every memory reference of your program causes a TLB miss. Assume that the memory page size for your computer is less than 40 KB. Every time through the inner loop in the following example code, the program asks for data that is 4 bytes * 10,000 = 40,000 bytes away from the last reference.
Before:
      REAL X(10000000)
      COMMON X,Y
      DO 10 I=0,1000
        DO 20 J=1,10000000,10000
          Y = X(J+I) + Y
   20   CONTINUE
   10 CONTINUE
Modified:
      REAL X(10000000)
      COMMON X,Y
      DO 10 I=1,10000000
        Y = X(I) + Y
   10 CONTINUE
25 Translation Lookaside Buffer: Remarks about the Example
In the original code, each reference falls on a different memory page (see the code fragment on the left). This causes 1000 TLB misses in the inner loop, taken roughly 1000 times, for a total of at least one million TLB misses. To add insult to injury, each reference is guaranteed to cause a data cache miss as well. The reversed loop has good locality of reference, and TLB misses occur only every so often.
Page faults: references to pages that are not marked valid are called page faults. Although they take a lot of time, page faults are not errors.
26 Memory Location: Taking the Worst-case Scenario
Your program asks for a variable from a particular memory location. The processor goes to look for it in the cache and finds it isn't there (a cache miss), which means the data must be loaded from memory. Next it goes to the TLB to find the physical location of the data in memory and finds there is no TLB entry either (a TLB miss). Then it consults the page table (refilling the TLB), but finds that either there is no entry for your particular page, or that the memory page has been shipped out to disk (both are page faults). Each step of the memory hierarchy has shrugged off your request. A new page will have to be created in memory and possibly, depending on the circumstances, refilled from disk.
27 The Memory Sub-system: Access Time
Typical operations such as adding two real/float numbers used to take from 3 to 12 clock cycles; now 32-bit operations are done in only one clock cycle. If the data and instruction flow is well arranged to take advantage of the hardware architecture, two or even four simultaneous operations can be executed in one clock cycle (e.g. the Power2 chip on the SP2). If the information has to be fetched from memory (DRAM) and the physical address was already stored in the Translation Lookaside Buffer (TLB), the operation takes 8 to 12 cycles. If a new address has to be translated and fetched, it takes 30 to 60 additional cycles (using the Page Frame Table (PFT)). The TLB can store 64 addresses of 4 KB pages, thus mapping up to 256 KB of memory.
28 Memory Management for Performance
Crossing a 4 KB boundary costs about 10 cycles. Switching between read and write costs an extra 6 cycles. A page fault happens when the data is not found in main memory; control is returned to the operating system to perform I/O and locate the page somewhere on disk. This can cost a delay of about one millisecond (~100,000 cycles).
Before:
      do i=1,n
        do j=i+1,n
          A(i)=A(i)*B(j)+A(j)*B(i)
          B(i)=A(i)*B(i)+A(j)*B(j)
        enddo
      enddo
A and B require 320 Mbytes!
After:
      do i=1,n
        tempa = A(i); tempb = B(i)
        do j=i+1,n
          tempa=tempa*B(j)+A(j)*tempb
          tempb=tempa*tempb+A(j)*B(j)
        enddo
        A(i)= tempa; B(i)= tempb
      enddo
29 Cache Memories for Reducing Memory Overheads
Cache memories work on the principles of spatial and temporal locality of data reference exhibited by typical programs.
Spatial locality: in most programs, if a data word is accessed in memory, it is likely that neighboring words will also be accessed.
Temporal locality: in some programs, words are also repeatedly accessed within windows of time. This is sometimes called the reuse locality of the program.
Question: why are we bothered about memory system design when we want to learn only about parallel programming? The key to the question lies in the phrases spatial locality and temporal locality of data reference: they are critical to single-processor performance, and also critical to parallel processing efficiency.
30 Cache Memories for Reducing Memory Overheads
Example memory system: a two-level memory system consisting of a processor with a single-level cache and DRAM.
Cache line: if the data is not available in the cache (a cache miss occurs), the data is fetched from the DRAM into the cache. However, instead of fetching the single word corresponding to the requested data item, a whole block of contiguous data is transferred from the DRAM to the cache. This block of data is referred to as a cache line. Due to spatial locality, subsequently accessed data locations are likely to have been fetched into the cache already, so the effective access latency to these words is reduced.
Direct mapped caches versus set associative caches: direct mapped caches have a unique cache location for each line in memory. Set associative caches have multiple locations in which a cache line can reside, which helps performance.
31 How to Calculate Effective Memory Access Time
Consider the same two-level memory system: a processor with a single-level cache and DRAM. If the average cache miss ratio over the execution is m, and the time to service a single memory request from DRAM is t_DRAM, then the effective time to service the missing requests is m x t_DRAM. The remaining (1-m) requests are serviced by the cache, for which the time is (1-m) x t_cache, where t_cache is the time to service a request from the cache. The effective memory access time (t_eff) of the single-level cache system is therefore
t_eff = m x t_DRAM + (1-m) x t_cache
In many applications it is possible to achieve cache miss ratios lower than 5%, and hence excellent cache performance.
32 How to Calculate Effective Memory Access Time: Example
Consider a simple loop adding two vectors, and find the memory performance of this loop for data accesses only (accesses to the arrays a, b and c):
for(i = 0; i < Vector_length; i++) c[i] = a[i] + b[i];
It is assumed that a, b and c are non-interfering with respect to their cache residence; this is determined by the mapping of memory locations to words in the cache. Assume a two-level memory system with a DRAM latency of 60 ns, a cache latency of 10 ns, and a cache line 4 words long. Since every fourth access to each array results in a cache miss, the miss ratio is 25% (0.25). The effective memory access time (t_eff) is 0.25 x 60 + 0.75 x 10 = 22.5 ns.
33 How to Calculate Effective Memory Access Time: Observations
Clearly, using a cache has enabled us to reduce the effective access time by almost a factor of three. Furthermore, the example also illustrates the role of an effective memory system: with an effective access time of 22.5 ns per access, the memory overhead per FLOP works out to 77.5 ns, which corresponds to a peak rating of 1/(77.5 ns), i.e. about 13 MFLOPS. Therefore, irrespective of processor speed, the loop under consideration will not exceed about 13 MFLOPS on the assumed memory subsystem.
34 Role of Data Reuse on Memory System Performance
Remark: besides spatial locality, temporal locality and data reuse play an important role for performance. Many real-life applications have significant data reuse, and this can be exploited for enhanced memory performance.
Example: role of temporal locality. Multiplying two n x n matrices requires 2n^3 FLOPs on 2n^2 data items. This corresponds to a reuse factor of O(n): each data item is used about n times.
for(i=0; i<n; i++)
  for(j=0; j<n; j++)
    for(k=0; k<n; k++)
      c[i][j] += a[i][k] * b[k][j];
The performance depends on the cache size and the data locality of the arrays a, b and c (re-use of the data).
35 Techniques for Hiding Memory Latency: Prefetching
If it is possible to anticipate the need for a data item well in advance, a request can be issued for the data in the hope that it will arrive by the time it is actually needed; the processor can meanwhile work on other tasks that can be performed in parallel. Prefetching is an effective tool for hiding memory latency in serial computers and communication latency in parallel computers.
Limitations: prefetching requires large amounts of register or cache storage, and there might not be enough concurrency in the code to hide all the latency.
36 Techniques for Hiding Memory Latency: Multi-threading
Given multiple threads of control, when a thread makes a memory access it can be swapped out and another, parallel thread executed. If the memory access completes before control returns to the original thread, the latency of the access has been effectively hidden.
Disadvantages: the programmer is now burdened with programming concurrency explicitly, even into uni-processor programs. Consequently, although multi-threading may yield better performance than software prefetching, it has considerable overheads in terms of hardware cost and programmer effort.
37 Memory Management: FORTRAN versus C
Example: array allocation in C and Fortran (a 4x4 matrix with data values 1 through 16).
C uses row-major ordering: the row elements of the matrix are stored contiguously, so the matrix is laid out in memory as [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16], row by row.
FORTRAN uses column-major ordering: the column elements of the matrix are stored contiguously, so the same matrix is laid out as [1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16], column by column.
38 Memory Access Patterns for Performance
This loop has unit stride:
      DO 10 J=1, N
        DO 10 I=1, N
          A(I,J)=B(I,J)+C(I,J)*D
   10 CONTINUE
Unit stride gives you the best performance because it makes full use of every cache entry, and this loop runs much faster (say 10 times). In contrast, the next loop is slower because its stride is N:
      DO 10 J=1, N
        DO 10 I=1, N
          A(J,I)=B(J,I)+C(J,I)*D
   10 CONTINUE
In FORTRAN the fastest-varying subscript should be the leftmost one; in C, where the subscripts appear in reverse order, it is the rightmost. For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride or step size. The larger the value of N, the more significant the performance difference.
39 Loop Interchange to Ease Memory Access Patterns
Before interchange (stride N):
      DO 10 J=1, N
        DO 10 I=1, N
          A(J,I)=B(J,I)+C(J,I)*D
   10 CONTINUE
After interchange (unit stride):
      DO 10 I=1, N
        DO 10 J=1, N
          A(J,I)=B(J,I)+C(J,I)*D
   10 CONTINUE
After the interchange, A, B and C are referenced with the leftmost subscript varying most quickly. This can make an important difference in performance: we traded three N-strided memory references for unit stride. Minimize the stride to make better use of the data cache!
Remarks: the worst-case patterns are those that jump through memory, especially through a large amount of memory. In large jobs, you not only pay a penalty for cache misses but for TLB (Translation Lookaside Buffer) misses too.
40 Loop Interchange to Ease Memory Access Patterns (Contd.)
Original:
      DO 10 I=1, N
        DO 20 J=1, N
          A(J,I)=B(I,J)
   20   CONTINUE
   10 CONTINUE
Modified:
      DO 20 J=1, N
        DO 10 I=1, N
          A(J,I)=B(I,J)
   10   CONTINUE
   20 CONTINUE
These loops represent a dilemma: whichever way you interchange them, you break the access pattern for either A or B. The choice is between strided loads and strided stores: which will it be? The difference lies in the way the processor handles updates of main memory from cache. In loop interchange, the challenge is to retrieve as much data as possible with as few cache misses as possible. Blocking is another kind of memory reference optimization.
41 Blocking to Ease Memory Access Patterns
Example 1: two-dimensional vector sum:
      DO 10 I=1, N
        DO 20 J=1, N
          A(J,I)=A(J,I)+B(I,J)
   20   CONTINUE
   10 CONTINUE
This loop involves two arrays. How can we improve memory access for both A and B? One is referenced with unit stride, the other with a stride of N. We can interchange the loops, but one way or another we will still have an N-strided array reference on either A or B, either of which is undesirable.
42 Trick: Blocking to Ease Memory Access Patterns
The trick is to block the references so that you grab a few elements of A, then a few of B, then a few of A, and so on, in neighborhoods, by combining inner and outer loop unrolling:
      DO 10 I=1,N,2
        DO 20 J=1,N,2
          A(J,I)     = A(J,I)     + B(I,J)
          A(J+1,I)   = A(J+1,I)   + B(I,J+1)
          A(J,I+1)   = A(J,I+1)   + B(I+1,J)
          A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
   20   CONTINUE
   10 CONTINUE
For better performance, cut the original loop into two parts: DO 21 J=1,N/2,2 and DO 20 J=N/2+1,N,2. (Contd.)
43 Blocking to Ease Memory Access Patterns (Contd.)
Remarks: memory is sequential storage. In Fortran, a two-dimensional array is constructed in memory by logically lining memory strips up against each other; in C, rows are stacked on top of one another. Suppose memory storage is cut into pieces the size of individual cache entries. Array storage starts at the upper left, proceeds down to the bottom, and then starts over at the top of the next column.
44 Blocking to Ease Memory Access Patterns (Contd.)
Figure 2: how array elements are stored (columns 1, 2, ... laid out one after another; marks denote cache line boundaries).
Because of the index expressions, references to A go from top to bottom, consuming every bit of each cache line. References to B dash off to the right, using one piece of each cache entry and discarding the rest.
45 Blocking to Ease Memory Access Patterns (Contd.)
Re-arrange the loop so that it consumes the arrays in small rectangles rather than in stripes. This can be achieved by unrolling both the inner and outer loops: array A is then referenced in several strips side by side from top to bottom, while B is referenced in several strips side by side from left to right. This improves cache performance and lowers the runtime.
46 Blocking to Ease Memory Access Patterns (Contd.)
Figure 3: 2x2 squares, showing the 2x2 tiles in which arrays A and B are visited.
47 Blocking to Ease Memory Access Patterns (Contd.)
Example 1, split into two halves:
      DO 11 I=1,N,2
        DO 21 J=1,N/2,2
          A(J,I)=A(J,I)+B(I,J)
          A(J+1,I)=A(J+1,I)+B(I,J+1)
          A(J,I+1)=A(J,I+1)+B(I+1,J)
          A(J+1,I+1)=A(J+1,I+1)+B(I+1,J+1)
   21   CONTINUE
   11 CONTINUE
      DO 10 I=1,N,2
        DO 20 J=N/2+1,N,2
          A(J,I)=A(J,I)+B(I,J)
          A(J+1,I)=A(J+1,I)+B(I,J+1)
          A(J,I+1)=A(J,I+1)+B(I+1,J)
          A(J+1,I+1)=A(J+1,I+1)+B(I+1,J+1)
   20   CONTINUE
   10 CONTINUE
Case study: there is a significant increase in performance for N = 256, because the arrays A and B are each 64K elements x 8 bytes = 512 KB, larger than the TLBs and caches of most workstations can handle.
48 Blocking to Ease Memory Access Patterns (Contd.)
Superimpose the first few references to A and B upon one another in the blocked and un-blocked cases. Un-blocked references to B zing off through memory, eating through cache and TLB entries, whereas blocked references are more sparing with the memory system. Further blocking is possible for larger problems.
49 Blocking to Ease Memory Access Patterns (Contd..) Blocked * Unblocked * *Arrays A & B are superimposed Strided Memory References Strided Memory References Picture of unblocked versus blocked references Note : Blocked Memory references is crucial for performance 49
50 Blocking to Ease Memory Access Patterns (Contd.)
Dividing and conquering a large memory address space by cutting it into little pieces helps the optimization process. Variations of the vector sum to compare:
- Original unmodified code
- Inner loop unrolling of 4
- Inner loop unrolling of 2, outer loop unrolling of 2
- Inner loop unrolling of 2, outer loop unrolling of 2, 2 inner loops
- Inner loop unrolling of 2, outer loop unrolling of 2, 4 inner loops
The LINPACK benchmark is an example of a program with a small amount of memory that is visited repeatedly.
51 Operating System Optimization and I/O
UMA (uniform memory architecture): all memory appears identical to all processors in terms of size, speed, access rights, access methods, etc.
NUMA (non-uniform memory architecture): memory looks different to different processors, usually in terms of access speed.
Since I/O is so much slower than the CPU, some people in the high performance computing community have defined a supercomputer as a computer which turns a CPU-bound program into an I/O-bound program. I/O is usually the last thing considered when optimizing a program, but if a lot of I/O is done, it can be the performance bottleneck.
52 Performance of a Selective Application: CFD
Optimization of unsteady-state 3D compressible Navier-Stokes equations solved by a finite difference method. Computing system used: Sun UltraSPARC workstation (each node is a quad-CPU Ultra Enterprise 450 server operating at 300 MHz). For a 192*16*N grid with a fixed iteration count, three runs were compared: no compiler options; with compiler optimization; and code restructuring plus compiler optimization, the last taking 680 seconds.
Conclusions: restructuring the code and use of proper compiler optimizations reduces the execution time by a large factor.
53 Summary of Optimization Techniques
- Understand the hierarchical memory features of the processor.
- Minimization of memory traffic is the single most important goal.
- Use different compiler optimizations to get performance.
- Use performance visualization tools to learn more about performance bottlenecks (example: Sun Workshop).
- Use BLAS, LINPACK, LAPACK, ScaLAPACK, etc., and tuned math libraries on the system. Calls to these math libraries can often simplify coding; they are portable across different platforms; and they are usually fine-tuned to the specific hardware as well as to the sizes of the array variables that are sent to them.
54 Conclusions
- Understand the hierarchical features of the memory sub-system.
- Reducing memory overheads is important for the performance of sequential and parallel programs.
- Minimization of memory traffic is the single most important goal.
- For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride or step size.
- Exploiting data reuse in the memory sub-system will increase performance.
- Estimating the memory access time of various loops gives a good clue about performance.
- Techniques for hiding memory latency are useful for performance.
- A rich set of examples is provided in the hands-on session.
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationLecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program
More informationLecture 16. Today: Start looking into memory hierarchy Cache$! Yay!
Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationCS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck
Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More informationLECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY
LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More informationCS Computer Architecture
CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic
More informationECE468 Computer Organization and Architecture. Virtual Memory
ECE468 Computer Organization and Architecture Virtual Memory ECE468 vm.1 Review: The Principle of Locality Probability of reference 0 Address Space 2 The Principle of Locality: Program access a relatively
More informationKeywords and Review Questions
Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain
More informationECE4680 Computer Organization and Architecture. Virtual Memory
ECE468 Computer Organization and Architecture Virtual Memory If I can see it and I can touch it, it s real. If I can t see it but I can touch it, it s invisible. If I can see it but I can t touch it, it
More informationPrinceton University. Computer Science 217: Introduction to Programming Systems. The Memory/Storage Hierarchy and Virtual Memory
Princeton University Computer Science 27: Introduction to Programming Systems The Memory/Storage Hierarchy and Virtual Memory Goals of this Lecture Help you learn about: Locality and caching The memory
More informationCOSC 6385 Computer Architecture. - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available
More informationCHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN
CHAPTER 4 TYPICAL MEMORY HIERARCHY MEMORY HIERARCHIES MEMORY HIERARCHIES CACHE DESIGN TECHNIQUES TO IMPROVE CACHE PERFORMANCE VIRTUAL MEMORY SUPPORT PRINCIPLE OF LOCALITY: A PROGRAM ACCESSES A RELATIVELY
More informationUniprocessors. HPC Fall 2012 Prof. Robert van Engelen
Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures
More information1 Motivation for Improving Matrix Multiplication
CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationStorage Management 1
Storage Management Goals of this Lecture Help you learn about: Locality and caching Typical storage hierarchy Virtual memory How the hardware and OS give applications the illusion of a large, contiguous,
More informationLecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University
Lecture 12 Memory Design & Caches, part 2 Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements HW3 is due today PA2 is available on-line today Part 1 is due on 2/27
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationOptimisation p.1/22. Optimisation
Performance Tuning Optimisation p.1/22 Optimisation Optimisation p.2/22 Constant Elimination do i=1,n a(i) = 2*b*c(i) enddo What is wrong with this loop? Compilers can move simple instances of constant
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationCS 433 Homework 4. Assigned on 10/17/2017 Due in class on 11/7/ Please write your name and NetID clearly on the first page.
CS 433 Homework 4 Assigned on 10/17/2017 Due in class on 11/7/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationCS161 Design and Architecture of Computer Systems. Cache $$$$$
CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks
More informationCache introduction. April 16, Howard Huang 1
Cache introduction We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? The rest of CS232 focuses on memory and input/output issues, which are frequently
More informationAMath 483/583 Lecture 11
AMath 483/583 Lecture 11 Outline: Computer architecture Cache considerations Fortran optimization Reading: S. Goedecker and A. Hoisie, Performance Optimization of Numerically Intensive Codes, SIAM, 2001.
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationCache Performance (H&P 5.3; 5.5; 5.6)
Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st
More informationChapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction
Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.
More informationVirtual Memory: From Address Translation to Demand Paging
Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 9, 2015
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationHPC VT Machine-dependent Optimization
HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationLecture 4: RISC Computers
Lecture 4: RISC Computers Introduction Program execution features RISC characteristics RISC vs. CICS Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) represents an important
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationThe course that gives CMU its Zip! Memory System Performance. March 22, 2001
15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache
More informationThe levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms
The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested
More informationMemory Hierarchy. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Memory Hierarchy Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Time (ns) The CPU-Memory Gap The gap widens between DRAM, disk, and CPU speeds
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationMultiple Issue ILP Processors. Summary of discussions
Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware
More informationMemory Hierarchies 2009 DAT105
Memory Hierarchies Cache performance issues (5.1) Virtual memory (C.4) Cache performance improvement techniques (5.2) Hit-time improvement techniques Miss-rate improvement techniques Miss-penalty improvement
More informationModern Computer Architecture
Modern Computer Architecture Lecture3 Review of Memory Hierarchy Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Performance 1000 Recap: Who Cares About the Memory Hierarchy? Processor-DRAM Memory Gap
More informationCS5460: Operating Systems Lecture 14: Memory Management (Chapter 8)
CS5460: Operating Systems Lecture 14: Memory Management (Chapter 8) Important from last time We re trying to build efficient virtual address spaces Why?? Virtual / physical translation is done by HW and
More information211: Computer Architecture Summer 2016
211: Computer Architecture Summer 2016 Liu Liu Topic: Assembly Programming Storage - Assembly Programming: Recap - Call-chain - Factorial - Storage: - RAM - Caching - Direct - Mapping Rutgers University
More informationLecture notes for CS Chapter 2, part 1 10/23/18
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationGetting CPI under 1: Outline
CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationEECS 322 Computer Architecture Superpipline and the Cache
EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance
More informationAMath 483/583 Lecture 11. Notes: Notes: Comments on Homework. Notes: AMath 483/583 Lecture 11
AMath 483/583 Lecture 11 Outline: Computer architecture Cache considerations Fortran optimization Reading: S. Goedecker and A. Hoisie, Performance Optimization of Numerically Intensive Codes, SIAM, 2001.
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationLRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.
LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More informationAdvanced Computer Architecture
ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory
More informationPlot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;
How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory
More informationOutline. 1 Reiteration. 2 Cache performance optimization. 3 Bandwidth increase. 4 Reduce hit time. 5 Reduce miss penalty. 6 Reduce miss rate
Outline Lecture 7: EITF20 Computer Architecture Anders Ardö EIT Electrical and Information Technology, Lund University November 21, 2012 A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21,
More informationLECTURE 11. Memory Hierarchy
LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed
More informationChapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.
Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationPipelining and Vector Processing
Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC
More informationLecture 2: Memory Systems
Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer
More informationLecture 2: Single processor architecture and memory
Lecture 2: Single processor architecture and memory David Bindel 30 Aug 2011 Teaser What will this plot look like? for n = 100:10:1000 tic; A = []; for i = 1:n A(i,i) = 1; end times(n) = toc; end ns =
More informationMemory. Objectives. Introduction. 6.2 Types of Memory
Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationMemory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology
Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast
More information10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache
Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationVirtual Memory: From Address Translation to Demand Paging
Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 12, 2014
More informationWelcome to Part 3: Memory Systems and I/O
Welcome to Part 3: Memory Systems and I/O We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently
More informationUniversity of Toronto Faculty of Applied Science and Engineering
Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science
More informationUNIT I (Two Marks Questions & Answers)
UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More information