PCOPP Uni-Processor Optimization: Features of Memory Hierarchy

1 PCOPP-2002 Day 1 Classroom Lecture: Uni-Processor Optimization - Features of Memory Hierarchy

2 The Hierarchical Memory: Features and Performance Issues - Lecture Outline
The following topics will be discussed:
- Basic definitions
- Execution of instructions in serial computing
- Hierarchical memory features of the memory sub-system
- Managing memory overheads
- How to calculate memory access time
- Reducing memory overheads for performance

3 Definition of Terms - PCOPP 2002
What is the clock rate? Today's computers perform operations at very high speed; the clock rate is the rate at which new operations can begin. A typical PC runs at 900 MHz (900,000,000 cycles/sec).
A clock period is the smallest unit of measure of time on a processor: 1 clock period = 1/(clock rate). At 900 MHz, 1 clock period = 1/(900 × 10^6) = 1.11 × 10^-9 seconds = 1.11 nanoseconds.
How many clocks does it take to perform an operation? This depends on the processor, but exemplary times might be: multiply, 4 clock periods; divide, 20 clock periods; square root, 140 clock periods.
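As a quick sanity check of this arithmetic, here is a minimal C sketch (the cycle counts are the exemplary figures quoted above, not measurements):

    #include <stdio.h>

    /* A 900 MHz processor has a clock period of 1/(900e6) s, and an
       operation that needs k clock periods takes k/(900e6) s. */
    int main(void) {
        double clock_rate = 900e6;            /* 900 MHz */
        double period_ns = 1e9 / clock_rate;  /* ~1.11 ns */
        printf("clock period = %.2f ns\n", period_ns);
        printf("multiply (4 cycles)   = %.2f ns\n", 4 * period_ns);
        printf("divide   (20 cycles)  = %.2f ns\n", 20 * period_ns);
        printf("sqrt     (140 cycles) = %.2f ns\n", 140 * period_ns);
        return 0;
    }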

4 Definition of Terms - PCOPP 2002
What are MFLOPS? MFLOPS = megaflops = millions of floating-point operations per second (e.g. adding or multiplying real numbers); MFLOPS and MFLOPS/sec are often used synonymously.
Example: a 500 MHz chip has a clock period of 2.0 ns. If it can perform 1 add and 1 multiply in every clock, i.e. 2 operations/clock, it delivers (2 ops/clock) / (2.0 × 10^-9 sec/clock) = 1 × 10^9 ops/sec, i.e. 1,000 MFLOPS.
What are MIPS? MIPS = millions of instructions per second (e.g. integer and logical operations).

5 History: CPU Performance - PCOPP 2002
Application, compiler, architecture, instruction set, and technology all determine CPU time:
CPU Time = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle), i.e. CPU Time = N_inst × CPI / (clock rate)
What single-processor efficiency can you expect? Assumption: memory performance will be the key. Response: determine the percentage of time the processor is stalled due to memory; this gives an upper bound on possible performance improvements.
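A minimal C sketch of this equation, with assumed, illustrative inputs:

    #include <stdio.h>

    /* CPU time = N_inst * CPI / clock_rate. All inputs below are
       made-up numbers for illustration only. */
    int main(void) {
        double n_inst = 1e9;        /* instructions executed (assumed) */
        double cpi = 1.5;           /* average cycles per instruction (assumed) */
        double clock_rate = 900e6;  /* Hz */
        printf("CPU time = %.3f s\n", n_inst * cpi / clock_rate);
        return 0;
    }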

6 Hitting the Wall: Clock Speeds - PCOPP 2002
Two basic techniques improve processor performance:
- Exploit higher instruction-level parallelism
- Increase the clock rate by sub-dividing the instruction pipeline into simpler stages
Mechanisms for supporting multiple-instruction execution for cost-effective performance:
- Pipelined execution
- Super scalar execution
- VLIW processors
- Multi scalar processors
- Super scalar versus super-pipelined processors

7 Pipelined Execution of Instructions - PCOPP 2002
Instructions are executed in five stages: instruction fetch (IF), decode (ID), operand fetch (OF), execute (E), and write back (WB).
[Figure: (a) stages in instruction execution; (b) serial execution of instructions; (c) pipelined execution of instructions.]
Each stage is executed in a single clock cycle, so an entire instruction execution can be accomplished in five clock cycles. It is easy to see that in sequential execution of instructions, most of the processor hardware idles for the majority of the time; with all five stages kept busy, the maximum speedup achievable through pipelining is five.
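A small C sketch of the resulting speedup model, assuming one stage per cycle and no stalls (an idealization):

    #include <stdio.h>

    /* With k pipeline stages (k = 5 here), n instructions take n*k
       cycles serially but only k + (n-1) cycles pipelined; the
       speedup approaches k for large n. */
    int main(void) {
        long k = 5;
        for (long n = 10; n <= 100000; n *= 10) {
            long serial = n * k;
            long pipelined = k + (n - 1);
            printf("n=%6ld  speedup = %.3f\n", n, (double)serial / pipelined);
        }
        return 0;
    }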

8 Super Scalar Execution
The ability of a processor to issue multiple instructions in the same cycle is referred to as super scalar execution. Three kinds of dependency constrain it:
1. Instructions in a program are related to each other: the result of one instruction may be required by subsequent instructions, which is referred to as true data dependency. Dependencies of this type must be resolved before instructions can be issued simultaneously.
2. Another source of dependency between instructions results from the finite resources of the machine.
3. The flow of control through a program enforces a third form of dependency between instructions.

9 Super Scalar Execution (contd.)
3. The flow of control: the branch destination is known only at the point of execution, so scheduling instructions across branches (or subroutine calls) may lead to an incorrect flow of control. Accurate branch prediction is therefore highly desirable for efficient super scalar execution; branch instructions occur very frequently (about one in six instructions).
The ability of a processor to detect and schedule concurrent instructions is critical to super scalar performance. The order in which instructions are issued and completed has implications for the required look-ahead, and consequently for performance.

10 Multi Scalar Execution of a Simple Program Segment
The multi scalar execution model takes a coarser view of the control graph: sets of nodes and edges are aggregated together to form tasks, and multiple tasks are executed in parallel by different functional units. Multi scalar execution can be compared to implicit parallel execution of programs on multi-computers. The execution of multiple instructions plays a major role in performance.
[Figure: program segment with blocks A, B, and C.]
Execution trace: (((A, B) (A, B) C) (A, B, C))
Possible multi scalar schedules, assuming two functional units:
- Entire program segment as a single task: functional unit 0 executes ((A, B) (A, B) C); functional unit 1 executes (A, B, C).
- Inner loop (A, B) as a task: functional unit 0 executes (A, B); functional unit 1 executes (A, B); functional unit 0 then executes (C, A, B, C).

11 Instruction Level Parallelism: Floating Point - PCOPP 2002
Computer 1: 2 FP units, each capable of 1 fused multiply-add (1;2), 1 add (1;1), or 1 multiply (1;2); 1 quad load/store unit; leading to up to 4 FP ops per CP and 4 memory-access ops per CP.
Computer 2: 1 FP unit with 1 floating-point add pipeline (1;4), 1 floating-point multiply pipeline (1;4), and 1 load/store (1;3); leading to up to 2 FP ops per CP and 1 memory-access op per CP.
Computer 3: 1 FP unit with 1 floating-point add pipeline (1;2), 1 floating-point multiply pipeline (1;2), and 1 load/store (1;3); leading to up to 2 FP ops per CP and 1 memory-access op per CP.

12 [Figure: processor performance over time. CISC (complex, variable-length instructions, sequencing in hardware) gives way to RISC (simple, fixed-length instructions, sequencing done by the compiler; e.g. UltraSPARC-III, Alpha), then to out-of-order superscalar designs (hardware detects implicit parallelism; hardware out-of-order scheduling and speculation), and finally to EPIC, Explicitly Parallel Instruction Computing (Itanium, McKinley).]

13 Memory Management - PCOPP 2002
Memory-reference optimization and managing memory overheads play a key role in performance; getting memory references right is one of the most important challenges of application performance. Topics:
- Hierarchical memory features of the memory sub-system
- Memory access patterns for performance
- Cache performance and cache misses
- Cache memories for reducing memory overheads
- Role of data reuse on memory system performance
- Techniques for hiding memory latency (multi-threading)

14 Remarks: Managing Memory Overheads - PCOPP 2002
Current microprocessors are clocked with cycle times of 2-3 ns, which imposes stringent requirements on effective memory latency and bandwidth. The fastest DRAMs operate at latencies of tens of nanoseconds; this reflects the major mismatch between the rate at which the processor requires data and the rate at which DRAM can supply it.
Memory (DRAM, data cache, instruction cache, registers) sizes, and access to them, have improved dramatically. Trend: speed and memory increase by a factor of ~2 every ~1.5 years (also known as Moore's law). That's a factor of 1,000 in 15 years and a factor of 1,000,000 in 30 years! Remember: at 100 MHz you get 10^-8 seconds per cycle, in which time light can travel only about 9 feet!
Memory banking, to reduce effective latency and increase bandwidth, is important for performance.

15 The Memory Sub-system: Hierarchy Features
Cache size is critical for controlling cache misses. A cache line is usually 64, 128 or 256 bytes long; the line size affects both miss ratios and the time to access memory. Cache memories work in the range of 4-25 ns latency.
Approximate capacity (C), latency (L), and bandwidth (B) per level:
  Registers:       C < 2 KB,       L = 0 cycles,           B = 1-32 GB/s
  Level-1 cache:   tens of KB,     0-2 cycles,             1-16 GB/s
  Level-2 cache:   64 KB-4 MB,     2-10 cycles,            1-4 GB/s
  Main memory:     16 MB-16 GB,    tens of cycles,         GB/s-class
  Remote memory:   GB-scale,       thousands of cycles,    MB/s-class
  Disk:            GB-scale,       100K-1M cycles,         1-16 MB/s
The first-level cache (L1) is typically on the same chip as the processor; access to the L1 cache typically takes a single clock cycle, and its size is in the range of 32K words. The second-level cache is larger (256 KB-16 MB) and may be on or off chip.

16 The Memory Sub-system: Access Time
A lot of time is spent accessing/storing data from/to memory, so it is important to keep in mind the relative times for each memory type.
[Figure: CPU with registers, D-cache and I-cache, backed by L2, RAM, and disk.]
Approximate access times:
- CPU registers: 0 cycles (that's where the work is done!)
- L1 cache: 1 cycle (data and instruction cache); repeated access to a cached item takes only 1 cycle
- L2 cache (static RAM): 3-5 cycles
- Memory (DRAM): 10 cycles on a cache miss; 30-60 further cycles if a Translation Lookaside Buffer (TLB) update is needed
- Disk: about 100,000 cycles!
- Connecting to other nodes: depends on network latency

17 The Memory Sub-system: Hierarchy Features - PCOPP 2002
Registers: a few (about 32) very fast 32-bit (sometimes 64-bit) registers. Compilers usually take care of using them.
Data cache: from about 4 KB to 256 KB, and increasing. Caches are an order of magnitude more expensive than DRAM but 10 to 100 times faster. (L2 caches are static RAM, cheaper than L1 but slower, although still faster than main RAM.)
Instruction cache: similar in speed to the data cache, but used only to store instructions. Useful with loops, especially when the whole loop fits within the instruction cache.

18 Reducing Memory Overheads for Performance
- Hide memory latency (caches, non-blocking loads, ILP)
- Reduce memory latency (faster DRAM, faster interconnects)
Hardware event counts worth examining: cycles, issued instructions, issued loads, issued stores, mispredicted branches, primary data cache misses, secondary data cache misses.

19 Reducing Memory Overheads for Performance (contd.)
What do these counts mean for execution time? How far from optimal are they? Which architectural feature pays off for your code? How effective are the data caches?
Hit rate = number of hits / number of memory references. This is only an indirect measure of performance: you also need to know how hits and misses contribute to CPU time, i.e. the average memory access time for hits and misses.
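The standard way to connect hit rate to CPU time is the average memory access time, AMAT = hit time + miss rate × miss penalty. A minimal C sketch with assumed latencies:

    #include <stdio.h>

    /* AMAT = hit_time + miss_rate * miss_penalty.
       The latencies below are assumptions for illustration. */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* e.g. 1-cycle hits, 5% misses, 50-cycle miss penalty */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));
        return 0;
    }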

20 Characteristics of Virtual Memory Machines - PCOPP 2002
Characteristic 1: virtual memory machines translate the logical memory addresses your program generates into physical addresses in the memory system. This gives a degree of flexibility by allowing all processes to believe that they each have all of the memory system to themselves.
Characteristic 2: virtual memory systems divide your program's memory into chunks called pages. Page sizes vary from 512 bytes to 16 KB, depending on the machine. By being separated into pages, programs are easier to fit together in memory, or to move out to disk in portions.

21 Virtual to Physical Address Mapping - PCOPP 2002
[Figure: a virtual address (e.g. location 1000) is translated through the process region table and the page table to the physical address where the data resides.]

22 Characteristics of Virtual Memory Machines - PCOPP 2002
Example: say that your program asks for a variable stored at location 1000. To find where the variable is actually stored, the location has to be translated from a virtual address to a physical address. The map containing such translations is called a page table. Each process has several page tables associated with it, corresponding to different regions such as the program text and data segments. For instance, if the page size is 512 bytes, then location 1000 falls within the second page.
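A minimal C sketch of the translation arithmetic, assuming a 512-byte page size as in the example:

    #include <stdio.h>

    /* Virtual address -> (page number, offset). With 512-byte pages,
       location 1000 falls in page 1, i.e. the second page, as the
       example above states. */
    int main(void) {
        unsigned long addr = 1000, page_size = 512;
        unsigned long page = addr / page_size;   /* 1 -> second page */
        unsigned long off  = addr % page_size;   /* 488 */
        printf("page %lu, offset %lu\n", page, off);
        return 0;
    }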

23 Virtual Memory Machines: the Translation Lookaside Buffer (TLB)
All modern virtual memory machines have a special cache called a Translation Lookaside Buffer, or TLB, for virtual-to-physical memory address translation. The two inputs to the TLB are an integer identifying the program making the memory request and the virtual page requested; out pops a pointer to the physical page number. Virtual address in, physical address out. TLB lookups occur in parallel with instruction execution, so if the address data is in the TLB, memory references proceed quickly.
The TLB is limited in size. If your program asks for a virtual-to-physical address translation and the entry does not exist in the TLB, you suffer a TLB miss.

24 Translation Lookaside Buffer - PCOPP 2002
Example: construct an example in which every memory reference of your program causes a TLB miss. Assume that the memory page size of your computer is less than 40 KB. Every time through the inner loop in the code below, the program asks for data that is 4 bytes × 10,000 = 40,000 bytes away from the last reference.
Before:
      REAL X(10000000)
      COMMON X,Y
      DO 10 I=0,1000
      DO 20 J=1,10000000,10000
      Y = X(J+I)
20    CONTINUE
10    CONTINUE
Modified:
      REAL X(10000000)
      COMMON X,Y
      DO 10 I=1,10000000
      Y = X(I)
10    CONTINUE

25 Translation Lookaside Buffer - PCOPP 2002
Remarks about the example: each reference falls on a different memory page (see the first fragment of code above). This causes 1,000 TLB misses in the inner loop, taken 1,000 times, for a total of at least one million TLB misses. To add insult to injury, each reference is guaranteed to cause a data cache miss as well. The rewritten loop has good locality of reference, and TLB misses occur only every so often.
Page faults: references to pages that are not marked valid are called page faults. Although they take a lot of time, page faults are not errors.

26 Memory Location: the Worst-Case Scenario
Your program asks for a variable from a particular memory location. The processor looks for it in the cache and finds it isn't there (a cache miss), which means the data must be loaded from memory. Next it goes to the TLB to find the physical location of the data in memory and finds there is no TLB entry either (a TLB miss). Then it consults the page table (refilling the TLB), but finds that either there is no entry for your particular page, or the memory page has been shipped out to disk (both are page faults). Each step of the memory hierarchy has shrugged off your request; a new page will have to be created in memory and possibly, depending on the circumstances, refilled from disk.
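A toy C cost model of this cascade; all cycle counts are assumptions chosen to be roughly consistent with the figures elsewhere in these slides, not measurements:

    #include <stdio.h>

    /* Assumed cycle costs for each step of the walk described above. */
    enum { CACHE_HIT = 1, DRAM_ACCESS = 10, TLB_REFILL = 40,
           PAGE_FAULT = 100000 };

    long access_cost(int cache_hit, int tlb_hit, int page_resident) {
        long cycles = CACHE_HIT;               /* we always probe the cache */
        if (cache_hit) return cycles;
        if (!tlb_hit) {
            cycles += TLB_REFILL;              /* walk the page table */
            if (!page_resident)
                cycles += PAGE_FAULT;          /* OS fetches page from disk */
        }
        return cycles + DRAM_ACCESS;           /* finally load from memory */
    }

    int main(void) {
        printf("best case : %ld cycles\n", access_cost(1, 1, 1));
        printf("worst case: %ld cycles\n", access_cost(0, 0, 0));
        return 0;
    }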

27 The Memory Sub-system: Access Time - PCOPP 2002
Typical operations such as adding two real/float numbers used to take from 3 to 12 clock cycles; now 32-bit operations are done in only one clock cycle. If the data and instruction flow is well arranged to take advantage of the hardware architecture, two or even four simultaneous operations can be executed in one clock cycle (e.g. the Power2 chip on the SP2).
If the information has to be fetched from memory (DRAM) and the physical address is already stored in the Translation Lookaside Buffer (TLB), the operation takes 8 to 12 cycles.
If a new address has to be translated and fetched, it takes 30 to 60 additional cycles (using the Page Frame Table, PFT).
The TLB can store 64 addresses of 4 KB pages, thus covering up to 256 KB of memory.

28 Memory Management for Performance - PCOPP 2002
Crossing a 4 KB page boundary costs about 10 cycles. Switching between read and write costs an extra 6 cycles. A page fault happens when the data is not found in main memory; control is returned to the operating system to perform I/O and locate the page somewhere on disk. This can cost a delay of about one millisecond (~100,000 cycles).
Before:
      do i=1,n
        do j=i+1,n
          A(i)=A(i)*B(j)+A(j)*B(i)
          B(i)=A(i)*B(i)+A(j)*B(j)
        enddo
      enddo
(A and B require 320 Mbytes!)
After (scalar replacement of A(i) and B(i)):
      do i=1,n
        tempa = A(i); tempb = B(i)
        do j=i+1,n
          tempa=tempa*B(j)+A(j)*tempb
          tempb=tempa*tempb+A(j)*B(j)
        enddo
        A(i) = tempa; B(i) = tempb
      enddo

29 Cache Memories for Reducing Memory Overheads
Cache memories work on the principles of spatial and temporal locality of data reference exhibited by typical programs.
Spatial locality: in most programs, if a data word is accessed in memory, it is likely that neighboring words will also be accessed.
Temporal locality: in some programs, words are also repeatedly accessed within windows of time. This is sometimes called reuse locality.
Question: why are we bothering about memory system design when we want to learn only about parallel programming? The key lies in the phrases spatial locality and temporal locality of data reference: these are critical to single-processor performance, and also critical to parallel processing efficiency.

30 Cache Memories for Reducing Memory Overheads
Example memory system: a two-level memory system consisting of a processor with a single-level cache and DRAM.
Cache line: if the data is not available in the cache (a cache miss occurs), the data is fetched from DRAM into the cache. However, instead of fetching the single word corresponding to the requested data item, a whole block of contiguous data is transferred from DRAM to the cache. This block of data is referred to as a cache line. Due to spatial locality, subsequently accessed data locations are likely to have been fetched into the cache already, so the effective access latency to these words is reduced.
Direct-mapped versus set-associative caches: direct-mapped caches have a unique cache location for each line in memory, while set-associative caches have multiple locations in which a cache line can reside, which helps performance.

31 How to Calculate Effective Memory Access Time?
Example: the two-level memory system above, a processor with a single-level cache and DRAM.
If the average cache miss ratio over the execution is m, and the time to service a single memory request from DRAM is t_DRAM, then the effective time to service the misses is m × t_DRAM. The remaining fraction (1-m) of requests is serviced by the cache, taking (1-m) × t_cache, where t_cache is the time to service a request from the cache. The effective memory access time (t_eff) of the single-level cache system is therefore:
t_eff = m × t_DRAM + (1-m) × t_cache
In many applications it is possible to achieve cache miss ratios lower than 5%, and thus excellent cache performance.

32 How to Calculate Effective Memory Access Time?
Example: consider a simple loop adding two vectors, and find the memory performance of this loop for data accesses only (accesses to arrays a, b and c):
for(i = 0; i < Vector_length; i++)
    c[i] = a[i] + b[i];
It is assumed that a, b and c are non-interfering with respect to their cache residence; this is determined by the mapping of memory locations to words in the cache. Assume a two-level memory system with 60 ns DRAM latency, 10 ns cache latency, and a cache line 4 words long. Since every fourth access to each array results in a cache miss, the miss ratio is 25% (m = 0.25). The effective memory access time is t_eff = 0.25 × 60 + 0.75 × 10 = 22.5 ns.
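A one-line check of this computation in C:

    #include <stdio.h>

    /* t_eff = m * t_DRAM + (1 - m) * t_cache, with the vector-add
       numbers from the example: m = 0.25, t_DRAM = 60 ns,
       t_cache = 10 ns, giving 22.5 ns. */
    int main(void) {
        double m = 0.25, t_dram = 60.0, t_cache = 10.0;
        printf("t_eff = %.1f ns\n", m * t_dram + (1.0 - m) * t_cache);
        return 0;
    }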

33 How to Calculate Effective Memory Access Time?
Observations on the example: clearly, using a cache has reduced the effective access time by almost a factor of three (from 60 ns to 22.5 ns). Furthermore, the example illustrates the role of an effective memory system: with an effective access time of 22.5 ns per access, the memory overhead per FLOP is 77.5 ns. This corresponds to a peak rating of 1/(77.5 ns), about 13 MFLOPS. Therefore, irrespective of processor speed, the loop under consideration will not exceed about 13 MFLOPS on the assumed memory subsystem.

34 Role of Data Reuse on Memory System Performance
Remark: besides spatial locality, temporal locality and data reuse play an important role in performance. Many real-life applications have significant data reuse, and this can be exploited for enhanced memory performance.
Example, the role of temporal locality: multiplying two n×n matrices requires 2n^3 FLOPs on 2n^2 input data items. This corresponds to a reuse factor of O(n): each data item is used about n times.
for(i=0; i<n; i++)
    for(j=0; j<n; j++)
        for(k=0; k<n; k++)
            c[i][j] += a[i][k] * b[k][j];
The performance depends on the cache size and the data locality (re-use) of the arrays a, b, and c.
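One standard way to exploit this O(n) reuse is to block (tile) the loops, just as the later slides do for the vector sum. A minimal C sketch, where the block size B is an assumed tuning parameter and N is assumed to be a multiple of B:

    /* Blocked (tiled) matrix multiply: each B-by-B tile of a, b, and c
       is revisited while it is still cache-resident. Caller must
       zero-initialize c. */
    #define N 512
    #define B 64

    void matmul_blocked(double a[N][N], double b[N][N], double c[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int kk = 0; kk < N; kk += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++)
                            for (int j = jj; j < jj + B; j++)
                                c[i][j] += a[i][k] * b[k][j];
    }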

35 Techniques for Hiding Memory Latency: Prefetching
If the need for a data item can be anticipated well in advance, a request can be issued for the data in the hope that it will arrive by the time it is actually needed; meanwhile the processor can work on other tasks in parallel. Prefetching is an effective tool for hiding memory latency in serial computers and communication latency in parallel computers.
Limitations:
- Prefetching requires large amounts of register or cache storage.
- There might not be enough concurrency in the code to hide all the latency.
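On compilers that support it, prefetching can also be requested explicitly. A sketch using the GCC/Clang builtin __builtin_prefetch, where the prefetch distance is an assumed tuning parameter (real distances depend on memory latency):

    /* Vector add with software prefetching of upcoming elements. */
    #define PREFETCH_DIST 64   /* elements ahead; assumed value */

    void vadd_prefetch(const double *a, const double *b, double *c, long n) {
        for (long i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n) {
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 0); /* for read */
                __builtin_prefetch(&b[i + PREFETCH_DIST], 0, 0);
            }
            c[i] = a[i] + b[i];
        }
    }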

36 Techniques for Hiding Memory Latency: Multi-threading
Given multiple threads of control, when one thread makes a memory access it can be swapped out and another parallel thread executed. If the memory access completes before control returns to the original thread, the latency of the access has been effectively hidden.
Disadvantages: the programmer is now burdened with programming concurrency explicitly, even into uni-processor programs. Consequently, although multi-threading may yield better performance than software prefetching, it carries considerable overheads in terms of hardware cost and programmer effort.

37 Memory Management: FORTRAN versus C - PCOPP 2002
Example: array allocation in C and Fortran (a 4×4 matrix with data values).
C uses row-major ordering: the row elements of the matrix are stored contiguously.
FORTRAN uses column-major ordering: the column elements of the matrix are stored contiguously.
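A minimal C sketch demonstrating row-major ordering: filling a 4×4 matrix with the values 1..16 (an illustrative choice) and walking the underlying storage prints the values in row order:

    #include <stdio.h>

    /* In C, a[i][j] lives at offset i*NCOLS + j from &a[0][0], so the
       row elements are adjacent in memory. Fortran's column-major
       layout is the transpose of this. */
    int main(void) {
        int a[4][4], k = 1;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                a[i][j] = k++;
        int *flat = &a[0][0];
        for (int p = 0; p < 16; p++)   /* prints 1..16 in order */
            printf("%d ", flat[p]);
        printf("\n");
        return 0;
    }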

38 Memory Access Patterns for Performance
This loop has unit stride (and runs much faster, say 10 times):
      DO 10 J=1, N
      DO 10 I=1, N
      A(I,J)=B(I,J)+C(I,J)*D
10    CONTINUE
This loop has stride N:
      DO 10 J=1, N
      DO 10 I=1, N
      A(J,I)=B(J,I)+C(J,I)*D
10    CONTINUE
Unit stride gives you the best performance because it conserves cache entries; the second loop is slower because its stride is N. For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride or step size: in FORTRAN programs this is the leftmost subscript, and in C programs it is the rightmost (in C the subscripts appear in reverse order). The larger the value of N, the more significant the performance difference is.

39 Loop Interchange to Ease Memory Access Patterns
Before interchange (stride N):
      DO 10 J=1, N
      DO 10 I=1, N
      A(J,I)=B(J,I)+C(J,I)*D
10    CONTINUE
After interchange (unit stride):
      DO 10 I=1, N
      DO 10 J=1, N
      A(J,I)=B(J,I)+C(J,I)*D
10    CONTINUE
After the interchange, A, B and C are referenced with the leftmost subscript varying most quickly. This can make an important difference in performance: we traded three N-strided memory references for unit-stride ones. Minimize the stride to make better use of the data cache!
Remarks: the worst-case patterns are those that jump through memory, especially a large amount of memory. In large jobs you pay a penalty not only for cache misses but for TLB (Translation Lookaside Buffer) misses too.

40 Loop Interchange to Ease Memory Access Patterns (contd.)
Original:
      DO 10 I=1, N
      DO 20 J=1, N
      A(J,I)=B(I,J)
20    CONTINUE
10    CONTINUE
Interchanged:
      DO 20 J=1, N
      DO 10 I=1, N
      A(J,I)=B(I,J)
10    CONTINUE
20    CONTINUE
These loops represent a dilemma: whichever way you interchange them, you break the access pattern for either A or B. The choice is between strided loads and strided stores; which is better depends on the way the processor handles updates of main memory from cache. In loop interchange, the challenge is to retrieve as much data as possible with as few cache misses as possible. This leads to another kind of memory reference optimization: blocking.

41 Blocking to Ease Memory Access Patterns - PCOPP 2002
Example 1, a two-dimensional vector sum:
      DO 10 I=1, N
      DO 20 J=1, N
      A(J,I)=A(J,I)+B(I,J)
20    CONTINUE
10    CONTINUE
This loop involves two arrays. How can we improve memory access for both A and B? One is referenced with unit stride, the other with a stride of N. We can interchange the loops, but one way or another we will still have an N-strided reference on either A or B, either of which is undesirable.

42 Trick: Blocking to Ease Memory Access Patterns - PCOPP 2002
The trick is to block the references so that you grab a few elements of A, then a few of B, then a few of A, and so on, in neighborhoods. Combining inner and outer loop unrolling:
      DO 10 I=1,N,2
      DO 20 J=1,N,2
      A(J,I)     = A(J,I)     + B(I,J)
      A(J+1,I)   = A(J+1,I)   + B(I,J+1)
      A(J,I+1)   = A(J,I+1)   + B(I+1,J)
      A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
20    CONTINUE
10    CONTINUE
For better performance, cut the original loop into two parts: DO 21 J=1,N/2,2 and DO 20 J=N/2+1,N,2 (contd.).

43 Blocking to Ease Memory Access Patterns (contd.)
Remarks: memory is sequential storage. In Fortran, a two-dimensional array is constructed in memory by logically lining strips of memory up against each other; in C, rows are stacked on top of one another. Suppose memory storage is cut into pieces the size of individual cache entries: array storage starts at the upper left, proceeds down to the bottom, and then starts over at the top of the next column.

44 Blocking to Ease Memory Access Patterns - PCOPP 2002 (contd.)
[Figure 2: how array elements are stored, column by column; marks denote cache line boundaries.]
Because of the index expressions, references to A go from top to bottom, consuming every bit of each cache line, while references to B dash off to the right, using one piece of each cache entry and discarding the rest.

45 Blocking to Ease Memory Access Patterns (contd.)
Re-arrange the loop so that it consumes the arrays in small rectangles rather than stripes. This can be achieved by unrolling both the inner and outer loops: array A is then referenced in several strips side by side, from top to bottom, while B is referenced in several strips side by side, from left to right. This improves cache performance and lowers the runtime.

46 Blocking to Ease Memory Access Patterns - PCOPP 2002 (contd.)
[Figure 3: arrays A and B referenced in 2x2 squares.]

47 Blocking to Ease Memory Access Patterns (contd.)
Example 1, with the blocked loop cut into two halves:
      DO 11 I=1,N,2
      DO 21 J=1,N/2,2
      A(J,I)=A(J,I)+B(I,J)
      A(J+1,I)=A(J+1,I)+B(I,J+1)
      A(J,I+1)=A(J,I+1)+B(I+1,J)
      A(J+1,I+1)=A(J+1,I+1)+B(I+1,J+1)
21    CONTINUE
11    CONTINUE

      DO 10 I=1,N,2
      DO 20 J=N/2+1,N,2
      A(J,I)=A(J,I)+B(I,J)
      A(J+1,I)=A(J+1,I)+B(I,J+1)
      A(J,I+1)=A(J,I+1)+B(I+1,J)
      A(J+1,I+1)=A(J+1,I+1)+B(I+1,J+1)
20    CONTINUE
10    CONTINUE
Case study - PCOPP 2002: there is a significant increase in performance for N = 256, because the two arrays A and B are each 64K elements × 8 bytes = 1/2 MB when N equals 256, about 1 MB in total, larger than can be handled by the TLBs and caches of most workstations.

48 Blocking to Ease Memory Access Patterns - PCOPP 2002 (contd.)
Consider the first few references to A and B superimposed upon one another, in the blocked and unblocked cases: unblocked references to B zing off through memory, eating through cache and TLB entries, while blocked references are more sparing with the memory system. Further blocking is possible for larger problems.

49 Blocking to Ease Memory Access Patterns (contd.)
[Figure: strided memory references in the unblocked versus blocked cases, with arrays A and B superimposed.]
Note: blocking memory references is crucial for performance.

50 Blocking to Ease Memory Access Patterns (contd.)
Dividing and conquering a large memory address space by cutting it into little pieces can make the optimization process effective. Variations of the vector sum:
- Original unmodified code
- Inner loop unrolling of 4 (sketched in C below)
- Inner loop unrolling of 2, outer loop unrolling of 2
- Inner loop unrolling of 2, outer loop unrolling of 2, 2 inner loops
- Inner loop unrolling of 2, outer loop unrolling of 2, 4 inner loops
The LINPACK benchmark is an example of a program with a small amount of memory that is visited repeatedly.
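A C sketch of one of these variations, inner loop unrolling of 4, assuming n is a multiple of 4 (C99, mirroring the Fortran A(J,I)=A(J,I)+B(I,J) vector sum):

    /* Two-dimensional vector sum with the inner loop unrolled by 4. */
    void vsum_unroll4(int n, double a[n][n], double b[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j += 4) {
                a[j][i]     += b[i][j];
                a[j + 1][i] += b[i][j + 1];
                a[j + 2][i] += b[i][j + 2];
                a[j + 3][i] += b[i][j + 3];
            }
    }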

51 Operating System Optimization and I/O - PCOPP 2002
UMA (Uniform Memory Architecture): all memory appears identical to all processors in terms of size, speed, access rights, access methods, etc.
NUMA (Non-Uniform Memory Architecture): memory looks different to different processors, usually in terms of access speed.
Since I/O is so much slower than the CPU, some people in the High Performance Computing community have defined a supercomputer as "a computer which turns a CPU-bound program into an I/O-bound program". I/O is usually the last thing considered when optimizing a program, but if a lot of I/O is done, it can be the performance bottleneck.

52 Performance of a Selected Application: CFD
Optimization of the unsteady-state 3D compressible Navier-Stokes equations, solved by a finite difference method. Computing system used: Sun UltraSPARC workstation (each node is a quad-CPU Ultra Enterprise 450 server operating at 300 MHz).
Three runs on the same 192*16* grid with the same iteration count were timed: with no compiler options, with compiler optimization, and with code restructuring plus compiler optimization, the last finishing in 680 seconds.
Conclusions: restructuring the code and using proper compiler optimizations reduces the execution time severalfold.

53 Summary of Optimization Techniques - PCOPP 2002
- Understand the hierarchical memory features of the processor.
- Minimization of memory traffic is the single most important goal.
- Use the compiler's optimizations to get performance.
- Use performance visualization tools to learn more about the performance bottlenecks.
- Use BLAS, LINPACK, LAPACK, ScaLAPACK, and other tuned math libraries on the system. Calls to these math libraries can often simplify coding; they are portable across different platforms, and they are usually fine-tuned to the specific hardware as well as to the sizes of the array variables that are sent to them. Example: Sun Workshop.

54 Conclusions
- Understand the hierarchical memory features of the memory sub-system.
- Reducing memory overheads is important for the performance of sequential and parallel programs.
- Minimization of memory traffic is the single most important goal.
- For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride or step size.
- Exploiting data reuse in the memory sub-system will increase performance.
- Estimating the memory access time of various loops gives a good clue about performance.
- Techniques for hiding memory latency are useful for performance.
- A rich set of examples is provided in the hands-on session.

55 PCOPP


More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

Lecture 2: Single processor architecture and memory

Lecture 2: Single processor architecture and memory Lecture 2: Single processor architecture and memory David Bindel 30 Aug 2011 Teaser What will this plot look like? for n = 100:10:1000 tic; A = []; for i = 1:n A(i,i) = 1; end times(n) = toc; end ns =

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Virtual Memory: From Address Translation to Demand Paging

Virtual Memory: From Address Translation to Demand Paging Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 12, 2014

More information

Welcome to Part 3: Memory Systems and I/O

Welcome to Part 3: Memory Systems and I/O Welcome to Part 3: Memory Systems and I/O We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information