PCOPP Uni-Processor Optimization: Features of Memory Hierarchy

1 PCOPP-2002 Day 1 Classroom Lecture: Uni-Processor Optimization - Features of Memory Hierarchy

2 The Hierarchical Memory: Features and Performance Issues - Lecture Outline
The following topics will be discussed:
- Basic definitions
- Execution of instructions in serial computing
- Hierarchical memory features of the memory sub-system
- Managing memory overheads
- How to calculate memory access time
- Reducing memory overheads for performance

3 Definition of Terms - PCOPP 2002
What is the clock rate? Today's computers perform operations at very high speed; the clock rate is the rate at which new operations can begin. A typical PC runs at 900 MHz (900,000,000 cycles/sec).
A clock period is the smallest unit of measure of time on a processor: 1 clock period = 1/(clock rate). At 900 MHz, 1 clock period = 1/(900 × 10^6) = 1.11 × 10^-9 seconds = 1.11 nanoseconds.
How many clocks does it take to perform an operation? This depends on the processor, but exemplary times might be: multiply, 4 clock periods; divide, 20 clock periods; square root, 140 clock periods.
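As a quick sanity check of this arithmetic, here is a minimal C sketch (the cycle counts are the exemplary figures quoted above, not measurements):

    #include <stdio.h>

    /* A 900 MHz processor has a clock period of 1/(900e6) s, and an
       operation that needs k clock periods takes k/(900e6) s. */
    int main(void) {
        double clock_rate = 900e6;            /* 900 MHz */
        double period_ns = 1e9 / clock_rate;  /* ~1.11 ns */
        printf("clock period = %.2f ns\n", period_ns);
        printf("multiply (4 cycles)   = %.2f ns\n", 4 * period_ns);
        printf("divide   (20 cycles)  = %.2f ns\n", 20 * period_ns);
        printf("sqrt     (140 cycles) = %.2f ns\n", 140 * period_ns);
        return 0;
    }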

4 Definition of Terms - PCOPP 2002
What are MFLOPS? MFLOPS = megaflops = millions of floating-point operations per second (e.g. adding or multiplying real numbers); MFLOPS and MFLOPS/sec are often used synonymously.
Example: a 500 MHz chip has a clock period of 2.0 ns. If it can perform 1 add and 1 multiply in every clock, i.e. 2 operations/clock, it delivers (2 ops/clock) / (2.0 × 10^-9 sec/clock) = 1 × 10^9 ops/sec, i.e. 1,000 MFLOPS.
What are MIPS? MIPS = millions of instructions per second (e.g. integer and logical operations).

5 History: CPU Performance - PCOPP 2002
Application, compiler, architecture, instruction set, and technology all determine CPU time:
CPU Time = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle), i.e. CPU Time = N_inst × CPI / (clock rate)
What single-processor efficiency can you expect? Assumption: memory performance will be the key. Response: determine the percentage of time the processor is stalled due to memory; this gives an upper bound on possible performance improvements.
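A minimal C sketch of this equation, with assumed, illustrative inputs:

    #include <stdio.h>

    /* CPU time = N_inst * CPI / clock_rate. All inputs below are
       made-up numbers for illustration only. */
    int main(void) {
        double n_inst = 1e9;        /* instructions executed (assumed) */
        double cpi = 1.5;           /* average cycles per instruction (assumed) */
        double clock_rate = 900e6;  /* Hz */
        printf("CPU time = %.3f s\n", n_inst * cpi / clock_rate);
        return 0;
    }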

6 Hitting the Wall: Clock Speeds - PCOPP 2002
Two basic techniques improve processor performance:
- Exploit higher instruction-level parallelism
- Increase the clock rate by sub-dividing the instruction pipeline into simpler stages
Mechanisms for supporting multiple-instruction execution for cost-effective performance:
- Pipelined execution
- Super scalar execution
- VLIW processors
- Multi scalar processors
- Super scalar versus super-pipelined processors

7 Pipelined Execution of Instructions - PCOPP 2002
Instructions are executed in five stages: instruction fetch (IF), decode (ID), operand fetch (OF), execute (E), and write back (WB).
[Figure: (a) stages in instruction execution; (b) serial execution of instructions; (c) pipelined execution of instructions.]
Each stage is executed in a single clock cycle, so an entire instruction execution can be accomplished in five clock cycles. It is easy to see that in sequential execution of instructions, most of the processor hardware idles for the majority of the time; with all five stages kept busy, the maximum speedup achievable through pipelining is five.
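A small C sketch of the resulting speedup model, assuming one stage per cycle and no stalls (an idealization):

    #include <stdio.h>

    /* With k pipeline stages (k = 5 here), n instructions take n*k
       cycles serially but only k + (n-1) cycles pipelined; the
       speedup approaches k for large n. */
    int main(void) {
        long k = 5;
        for (long n = 10; n <= 100000; n *= 10) {
            long serial = n * k;
            long pipelined = k + (n - 1);
            printf("n=%6ld  speedup = %.3f\n", n, (double)serial / pipelined);
        }
        return 0;
    }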

8 Super Scalar Execution
The ability of a processor to issue multiple instructions in the same cycle is referred to as super scalar execution. Three kinds of dependency constrain it:
1. Instructions in a program are related to each other: the result of one instruction may be required by subsequent instructions, which is referred to as true data dependency. Dependencies of this type must be resolved before instructions can be issued simultaneously.
2. Another source of dependency between instructions results from the finite resources of the machine.
3. The flow of control through a program enforces a third form of dependency between instructions.

9 Super Scalar Execution (contd.)
3. The flow of control: the branch destination is known only at the point of execution, so scheduling instructions across branches (or subroutine calls) may lead to an incorrect flow of control. Accurate branch prediction is therefore highly desirable for efficient super scalar execution; branch instructions occur very frequently (about one in six instructions).
The ability of a processor to detect and schedule concurrent instructions is critical to super scalar performance. The order in which instructions are issued and completed has implications for the required look-ahead, and consequently for performance.

10 Multi Scalar Execution of a Simple Program Segment
The multi scalar execution model takes a coarser view of the control graph: sets of nodes and edges are aggregated together to form tasks, and multiple tasks are executed in parallel by different functional units. Multi scalar execution can be compared to implicit parallel execution of programs on multi-computers. The execution of multiple instructions plays a major role in performance.
[Figure: program segment with blocks A, B, and C.]
Execution trace: (((A, B) (A, B) C) (A, B, C))
Possible multi scalar schedules, assuming two functional units:
- Entire program segment as a single task: functional unit 0 executes ((A, B) (A, B) C); functional unit 1 executes (A, B, C).
- Inner loop (A, B) as a task: functional unit 0 executes (A, B); functional unit 1 executes (A, B); functional unit 0 then executes (C, A, B, C).

11 Instruction Level Parallelism: Floating Point - PCOPP 2002
Computer 1: 2 FP units, each capable of 1 fused multiply-add (1;2), 1 add (1;1), or 1 multiply (1;2); 1 quad load/store unit; leading to up to 4 FP ops per CP and 4 memory-access ops per CP.
Computer 2: 1 FP unit with 1 floating-point add pipeline (1;4), 1 floating-point multiply pipeline (1;4), and 1 load/store (1;3); leading to up to 2 FP ops per CP and 1 memory-access op per CP.
Computer 3: 1 FP unit with 1 floating-point add pipeline (1;2), 1 floating-point multiply pipeline (1;2), and 1 load/store (1;3); leading to up to 2 FP ops per CP and 1 memory-access op per CP.

12 [Figure: processor performance over time. CISC (complex, variable-length instructions, sequencing in hardware) gives way to RISC (simple, fixed-length instructions, sequencing done by the compiler; e.g. UltraSPARC-III, Alpha), then to out-of-order superscalar designs (hardware detects implicit parallelism; hardware out-of-order scheduling and speculation), and finally to EPIC, Explicitly Parallel Instruction Computing (Itanium, McKinley).]

13 Memory Management - PCOPP 2002
Memory-reference optimization and managing memory overheads play a key role in performance; getting memory references right is one of the most important challenges of application performance. Topics:
- Hierarchical memory features of the memory sub-system
- Memory access patterns for performance
- Cache performance and cache misses
- Cache memories for reducing memory overheads
- Role of data reuse on memory system performance
- Techniques for hiding memory latency (multi-threading)

14 Remarks: Managing Memory Overheads - PCOPP 2002
Current microprocessors are clocked with cycle times of 2-3 ns, which imposes stringent requirements on effective memory latency and bandwidth. The fastest DRAMs operate at latencies of tens of nanoseconds; this reflects the major mismatch between the rate at which the processor requires data and the rate at which DRAM can supply it.
Memory (DRAM, data cache, instruction cache, registers) sizes, and access to them, have improved dramatically. Trend: speed and memory increase by a factor of ~2 every ~1.5 years (also known as Moore's law). That's a factor of 1,000 in 15 years and a factor of 1,000,000 in 30 years! Remember: at 100 MHz you get 10^-8 seconds per cycle, in which time light can travel only about 9 feet!
Memory banking, to reduce effective latency and increase bandwidth, is important for performance.

15 The Memory Sub-system: Hierarchy Features
Cache size is critical for controlling cache misses. A cache line is usually 64, 128 or 256 bytes long; the line size affects both miss ratios and the time to access memory. Cache memories work in the range of 4-25 ns latency.
Approximate capacity (C), latency (L), and bandwidth (B) per level:
  Registers:       C < 2 KB,       L = 0 cycles,           B = 1-32 GB/s
  Level-1 cache:   tens of KB,     0-2 cycles,             1-16 GB/s
  Level-2 cache:   64 KB-4 MB,     2-10 cycles,            1-4 GB/s
  Main memory:     16 MB-16 GB,    tens of cycles,         GB/s-class
  Remote memory:   GB-scale,       thousands of cycles,    MB/s-class
  Disk:            GB-scale,       100K-1M cycles,         1-16 MB/s
The first-level cache (L1) is typically on the same chip as the processor; access to the L1 cache typically takes a single clock cycle, and its size is in the range of 32K words. The second-level cache is larger (256 KB-16 MB) and may be on or off chip.

16 The Memory Sub-system: Access Time
A lot of time is spent accessing/storing data from/to memory, so it is important to keep in mind the relative times for each memory type.
[Figure: CPU with registers, D-cache and I-cache, backed by L2, RAM, and disk.]
Approximate access times:
- CPU registers: 0 cycles (that's where the work is done!)
- L1 cache: 1 cycle (data and instruction cache); repeated access to a cached item takes only 1 cycle
- L2 cache (static RAM): 3-5 cycles
- Memory (DRAM): 10 cycles on a cache miss; 30-60 further cycles if a Translation Lookaside Buffer (TLB) update is needed
- Disk: about 100,000 cycles!
- Connecting to other nodes: depends on network latency

17 The Memory Sub-system: Hierarchy Features - PCOPP 2002
Registers: a few (about 32) very fast 32-bit (sometimes 64-bit) registers. Compilers usually take care of using them.
Data cache: from about 4 KB to 256 KB, and increasing. Caches are an order of magnitude more expensive than DRAM but 10 to 100 times faster. (L2 caches are static RAM, cheaper than L1 but slower, although still faster than main RAM.)
Instruction cache: similar in speed to the data cache, but used only to store instructions. Useful with loops, especially when the whole loop fits within the instruction cache.

18 Reducing Memory Overheads for Performance
- Hide memory latency (caches, non-blocking loads, ILP)
- Reduce memory latency (faster DRAM, faster interconnects)
Hardware event counts worth examining: cycles, issued instructions, issued loads, issued stores, mispredicted branches, primary data cache misses, secondary data cache misses.

19 Reducing Memory Overheads for Performance (contd.)
What do these counts mean for execution time? How far from optimal are they? Which architectural feature pays off for your code? How effective are the data caches?
Hit rate = number of hits / number of memory references. This is only an indirect measure of performance: you also need to know how hits and misses contribute to CPU time, i.e. the average memory access time for hits and misses.
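The standard way to connect hit rate to CPU time is the average memory access time, AMAT = hit time + miss rate × miss penalty. A minimal C sketch with assumed latencies:

    #include <stdio.h>

    /* AMAT = hit_time + miss_rate * miss_penalty.
       The latencies below are assumptions for illustration. */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* e.g. 1-cycle hits, 5% misses, 50-cycle miss penalty */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));
        return 0;
    }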

20 Characteristics of Virtual Memory Machines - PCOPP 2002
Characteristic 1: virtual memory machines translate the logical memory addresses your program generates into physical addresses in the memory system. This gives a degree of flexibility by allowing all processes to believe that they each have all of the memory system to themselves.
Characteristic 2: virtual memory systems divide your program's memory into chunks called pages. Page sizes vary from 512 bytes to 16 KB, depending on the machine. By being separated into pages, programs are easier to fit together in memory, or to move out to disk in portions.

21 Virtual to Physical Address Mapping - PCOPP 2002
[Figure: a virtual address (e.g. location 1000) is translated through the process region table and the page table to the physical address where the data resides.]

22 Characteristics of Virtual Memory Machines - PCOPP 2002
Example: say that your program asks for a variable stored at location 1000. To find where the variable is actually stored, the location has to be translated from a virtual address to a physical address. The map containing such translations is called a page table. Each process has several page tables associated with it, corresponding to different regions such as the program text and data segments. For instance, if the page size is 512 bytes, then location 1000 falls within the second page.
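A minimal C sketch of the translation arithmetic, assuming a 512-byte page size as in the example:

    #include <stdio.h>

    /* Virtual address -> (page number, offset). With 512-byte pages,
       location 1000 falls in page 1, i.e. the second page, as the
       example above states. */
    int main(void) {
        unsigned long addr = 1000, page_size = 512;
        unsigned long page = addr / page_size;   /* 1 -> second page */
        unsigned long off  = addr % page_size;   /* 488 */
        printf("page %lu, offset %lu\n", page, off);
        return 0;
    }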

23 Virtual Memory Machines: the Translation Lookaside Buffer (TLB)
All modern virtual memory machines have a special cache called a Translation Lookaside Buffer, or TLB, for virtual-to-physical memory address translation. The two inputs to the TLB are an integer identifying the program making the memory request and the virtual page requested; out pops a pointer to the physical page number. Virtual address in, physical address out. TLB lookups occur in parallel with instruction execution, so if the address data is in the TLB, memory references proceed quickly.
The TLB is limited in size. If your program asks for a virtual-to-physical address translation and the entry does not exist in the TLB, you suffer a TLB miss.

24 Translation Lookaside Buffer - PCOPP 2002
Example: construct an example in which every memory reference of your program causes a TLB miss. Assume that the memory page size of your computer is less than 40 KB. Every time through the inner loop in the code below, the program asks for data that is 4 bytes × 10,000 = 40,000 bytes away from the last reference.
Before:
      REAL X(10000000)
      COMMON X,Y
      DO 10 I=0,1000
      DO 20 J=1,10000000,10000
      Y = X(J+I)
20    CONTINUE
10    CONTINUE
Modified:
      REAL X(10000000)
      COMMON X,Y
      DO 10 I=1,10000000
      Y = X(I)
10    CONTINUE

25 Translation Lookaside Buffer - PCOPP 2002
Remarks about the example: each reference falls on a different memory page (see the first fragment of code above). This causes 1,000 TLB misses in the inner loop, taken 1,000 times, for a total of at least one million TLB misses. To add insult to injury, each reference is guaranteed to cause a data cache miss as well. The rewritten loop has good locality of reference, and TLB misses occur only every so often.
Page faults: references to pages that are not marked valid are called page faults. Although they take a lot of time, page faults are not errors.

26 Memory Location: the Worst-Case Scenario
Your program asks for a variable from a particular memory location. The processor looks for it in the cache and finds it isn't there (a cache miss), which means the data must be loaded from memory. Next it goes to the TLB to find the physical location of the data in memory and finds there is no TLB entry either (a TLB miss). Then it consults the page table (refilling the TLB), but finds that either there is no entry for your particular page, or the memory page has been shipped out to disk (both are page faults). Each step of the memory hierarchy has shrugged off your request; a new page will have to be created in memory and possibly, depending on the circumstances, refilled from disk.
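A toy C cost model of this cascade; all cycle counts are assumptions chosen to be roughly consistent with the figures elsewhere in these slides, not measurements:

    #include <stdio.h>

    /* Assumed cycle costs for each step of the walk described above. */
    enum { CACHE_HIT = 1, DRAM_ACCESS = 10, TLB_REFILL = 40,
           PAGE_FAULT = 100000 };

    long access_cost(int cache_hit, int tlb_hit, int page_resident) {
        long cycles = CACHE_HIT;               /* we always probe the cache */
        if (cache_hit) return cycles;
        if (!tlb_hit) {
            cycles += TLB_REFILL;              /* walk the page table */
            if (!page_resident)
                cycles += PAGE_FAULT;          /* OS fetches page from disk */
        }
        return cycles + DRAM_ACCESS;           /* finally load from memory */
    }

    int main(void) {
        printf("best case : %ld cycles\n", access_cost(1, 1, 1));
        printf("worst case: %ld cycles\n", access_cost(0, 0, 0));
        return 0;
    }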

27 The Memory Sub-system: Access Time - PCOPP 2002
Typical operations such as adding two real/float numbers used to take from 3 to 12 clock cycles; now 32-bit operations are done in only one clock cycle. If the data and instruction flow is well arranged to take advantage of the hardware architecture, two or even four simultaneous operations can be executed in one clock cycle (e.g. the Power2 chip on the SP2).
If the information has to be fetched from memory (DRAM) and the physical address is already stored in the Translation Lookaside Buffer (TLB), the operation takes 8 to 12 cycles.
If a new address has to be translated and fetched, it takes 30 to 60 additional cycles (using the Page Frame Table, PFT).
The TLB can store 64 addresses of 4 KB pages, thus covering up to 256 KB of memory.

28 Memory Management for Performance - PCOPP 2002
Crossing a 4 KB page boundary costs about 10 cycles. Switching between read and write costs an extra 6 cycles. A page fault happens when the data is not found in main memory; control is returned to the operating system to perform I/O and locate the page somewhere on disk. This can cost a delay of about one millisecond (~100,000 cycles).
Before:
      do i=1,n
        do j=i+1,n
          A(i)=A(i)*B(j)+A(j)*B(i)
          B(i)=A(i)*B(i)+A(j)*B(j)
        enddo
      enddo
(A and B require 320 Mbytes!)
After (scalar replacement of A(i) and B(i)):
      do i=1,n
        tempa = A(i); tempb = B(i)
        do j=i+1,n
          tempa=tempa*B(j)+A(j)*tempb
          tempb=tempa*tempb+A(j)*B(j)
        enddo
        A(i) = tempa; B(i) = tempb
      enddo

29 Cache Memories for Reducing Memory Overheads
Cache memories work on the principles of spatial and temporal locality of data reference exhibited by typical programs.
Spatial locality: in most programs, if a data word is accessed in memory, it is likely that neighboring words will also be accessed.
Temporal locality: in some programs, words are also repeatedly accessed within windows of time. This is sometimes called reuse locality.
Question: why are we bothering about memory system design when we want to learn only about parallel programming? The key lies in the phrases spatial locality and temporal locality of data reference: these are critical to single-processor performance, and also critical to parallel processing efficiency.

30 Cache Memories for Reducing Memory Overheads
Example memory system: a two-level memory system consisting of a processor with a single-level cache and DRAM.
Cache line: if the data is not available in the cache (a cache miss occurs), the data is fetched from DRAM into the cache. However, instead of fetching the single word corresponding to the requested data item, a whole block of contiguous data is transferred from DRAM to the cache. This block of data is referred to as a cache line. Due to spatial locality, subsequently accessed data locations are likely to have been fetched into the cache already, so the effective access latency to these words is reduced.
Direct-mapped versus set-associative caches: direct-mapped caches have a unique cache location for each line in memory, while set-associative caches have multiple locations in which a cache line can reside, which helps performance.

31 How to Calculate Effective Memory Access Time?
Example: the two-level memory system above, a processor with a single-level cache and DRAM.
If the average cache miss ratio over the execution is m, and the time to service a single memory request from DRAM is t_DRAM, then the effective time to service the misses is m × t_DRAM. The remaining fraction (1-m) of requests is serviced by the cache, taking (1-m) × t_cache, where t_cache is the time to service a request from the cache. The effective memory access time (t_eff) of the single-level cache system is therefore:
t_eff = m × t_DRAM + (1-m) × t_cache
In many applications it is possible to achieve cache miss ratios lower than 5%, and thus excellent cache performance.

32 How to Calculate Effective Memory Access Time?
Example: consider a simple loop adding two vectors, and find the memory performance of this loop for data accesses only (accesses to arrays a, b and c):
for(i = 0; i < Vector_length; i++)
    c[i] = a[i] + b[i];
It is assumed that a, b and c are non-interfering with respect to their cache residence; this is determined by the mapping of memory locations to words in the cache. Assume a two-level memory system with 60 ns DRAM latency, 10 ns cache latency, and a cache line 4 words long. Since every fourth access to each array results in a cache miss, the miss ratio is 25% (m = 0.25). The effective memory access time is t_eff = 0.25 × 60 + 0.75 × 10 = 22.5 ns.
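A one-line check of this computation in C:

    #include <stdio.h>

    /* t_eff = m * t_DRAM + (1 - m) * t_cache, with the vector-add
       numbers from the example: m = 0.25, t_DRAM = 60 ns,
       t_cache = 10 ns, giving 22.5 ns. */
    int main(void) {
        double m = 0.25, t_dram = 60.0, t_cache = 10.0;
        printf("t_eff = %.1f ns\n", m * t_dram + (1.0 - m) * t_cache);
        return 0;
    }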

33 How to Calculate Effective Memory Access Time?
Observations on the example: clearly, using a cache has reduced the effective access time by almost a factor of three (from 60 ns to 22.5 ns). Furthermore, the example illustrates the role of an effective memory system: with an effective access time of 22.5 ns per access, the memory overhead per FLOP is 77.5 ns. This corresponds to a peak rating of 1/(77.5 ns), about 13 MFLOPS. Therefore, irrespective of processor speed, the loop under consideration will not exceed about 13 MFLOPS on the assumed memory subsystem.

34 Role of Data Reuse on Memory System Performance
Remark: besides spatial locality, temporal locality and data reuse play an important role in performance. Many real-life applications have significant data reuse, and this can be exploited for enhanced memory performance.
Example, the role of temporal locality: multiplying two n×n matrices requires 2n^3 FLOPs on 2n^2 input data items. This corresponds to a reuse factor of O(n): each data item is used about n times.
for(i=0; i<n; i++)
    for(j=0; j<n; j++)
        for(k=0; k<n; k++)
            c[i][j] += a[i][k] * b[k][j];
The performance depends on the cache size and the data locality (re-use) of the arrays a, b, and c.
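One standard way to exploit this O(n) reuse is to block (tile) the loops, just as the later slides do for the vector sum. A minimal C sketch, where the block size B is an assumed tuning parameter and N is assumed to be a multiple of B:

    /* Blocked (tiled) matrix multiply: each B-by-B tile of a, b, and c
       is revisited while it is still cache-resident. Caller must
       zero-initialize c. */
    #define N 512
    #define B 64

    void matmul_blocked(double a[N][N], double b[N][N], double c[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int kk = 0; kk < N; kk += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++)
                            for (int j = jj; j < jj + B; j++)
                                c[i][j] += a[i][k] * b[k][j];
    }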

35 Techniques for Hiding Memory Latency: Prefetching
If the need for a data item can be anticipated well in advance, a request can be issued for the data in the hope that it will arrive by the time it is actually needed; meanwhile the processor can work on other tasks in parallel. Prefetching is an effective tool for hiding memory latency in serial computers and communication latency in parallel computers.
Limitations:
- Prefetching requires large amounts of register or cache storage.
- There might not be enough concurrency in the code to hide all the latency.
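On compilers that support it, prefetching can also be requested explicitly. A sketch using the GCC/Clang builtin __builtin_prefetch, where the prefetch distance is an assumed tuning parameter (real distances depend on memory latency):

    /* Vector add with software prefetching of upcoming elements. */
    #define PREFETCH_DIST 64   /* elements ahead; assumed value */

    void vadd_prefetch(const double *a, const double *b, double *c, long n) {
        for (long i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n) {
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 0); /* for read */
                __builtin_prefetch(&b[i + PREFETCH_DIST], 0, 0);
            }
            c[i] = a[i] + b[i];
        }
    }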

36 Techniques for Hiding Memory Latency: Multi-threading
Given multiple threads of control, when one thread makes a memory access it can be swapped out and another parallel thread executed. If the memory access completes before control returns to the original thread, the latency of the access has been effectively hidden.
Disadvantages: the programmer is now burdened with programming concurrency explicitly, even into uni-processor programs. Consequently, although multi-threading may yield better performance than software prefetching, it carries considerable overheads in terms of hardware cost and programmer effort.

37 Memory Management: FORTRAN versus C - PCOPP 2002
Example: array allocation in C and Fortran (a 4×4 matrix with data values).
C uses row-major ordering: the row elements of the matrix are stored contiguously.
FORTRAN uses column-major ordering: the column elements of the matrix are stored contiguously.
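A minimal C sketch demonstrating row-major ordering: filling a 4×4 matrix with the values 1..16 (an illustrative choice) and walking the underlying storage prints the values in row order:

    #include <stdio.h>

    /* In C, a[i][j] lives at offset i*NCOLS + j from &a[0][0], so the
       row elements are adjacent in memory. Fortran's column-major
       layout is the transpose of this. */
    int main(void) {
        int a[4][4], k = 1;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                a[i][j] = k++;
        int *flat = &a[0][0];
        for (int p = 0; p < 16; p++)   /* prints 1..16 in order */
            printf("%d ", flat[p]);
        printf("\n");
        return 0;
    }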

38 Memory Access Patterns for Performance
This loop has unit stride (and runs much faster, say 10 times):
      DO 10 J=1, N
      DO 10 I=1, N
      A(I,J)=B(I,J)+C(I,J)*D
10    CONTINUE
This loop has stride N:
      DO 10 J=1, N
      DO 10 I=1, N
      A(J,I)=B(J,I)+C(J,I)*D
10    CONTINUE
Unit stride gives you the best performance because it conserves cache entries; the second loop is slower because its stride is N. For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride or step size: in FORTRAN programs this is the leftmost subscript, and in C programs it is the rightmost (in C the subscripts appear in reverse order). The larger the value of N, the more significant the performance difference is.

39 Loop Interchange to Ease Memory Access Patterns
Before interchange (stride N):
      DO 10 J=1, N
      DO 10 I=1, N
      A(J,I)=B(J,I)+C(J,I)*D
10    CONTINUE
After interchange (unit stride):
      DO 10 I=1, N
      DO 10 J=1, N
      A(J,I)=B(J,I)+C(J,I)*D
10    CONTINUE
After the interchange, A, B and C are referenced with the leftmost subscript varying most quickly. This can make an important difference in performance: we traded three N-strided memory references for unit-stride ones. Minimize the stride to make better use of the data cache!
Remarks: the worst-case patterns are those that jump through memory, especially a large amount of memory. In large jobs you pay a penalty not only for cache misses but for TLB (Translation Lookaside Buffer) misses too.

40 Loop Interchange to Ease Memory Access Patterns (contd.)
Original:
      DO 10 I=1, N
      DO 20 J=1, N
      A(J,I)=B(I,J)
20    CONTINUE
10    CONTINUE
Interchanged:
      DO 20 J=1, N
      DO 10 I=1, N
      A(J,I)=B(I,J)
10    CONTINUE
20    CONTINUE
These loops represent a dilemma: whichever way you interchange them, you break the access pattern for either A or B. The choice is between strided loads and strided stores; which is better depends on the way the processor handles updates of main memory from cache. In loop interchange, the challenge is to retrieve as much data as possible with as few cache misses as possible. This leads to another kind of memory reference optimization: blocking.

41 Blocking to Ease Memory Access Patterns - PCOPP 2002
Example 1, a two-dimensional vector sum:
      DO 10 I=1, N
      DO 20 J=1, N
      A(J,I)=A(J,I)+B(I,J)
20    CONTINUE
10    CONTINUE
This loop involves two arrays. How can we improve memory access for both A and B? One is referenced with unit stride, the other with a stride of N. We can interchange the loops, but one way or another we will still have an N-strided reference on either A or B, either of which is undesirable.

42 Trick: Blocking to Ease Memory Access Patterns - PCOPP 2002
The trick is to block the references so that you grab a few elements of A, then a few of B, then a few of A, and so on, in neighborhoods. Combining inner and outer loop unrolling:
      DO 10 I=1,N,2
      DO 20 J=1,N,2
      A(J,I)     = A(J,I)     + B(I,J)
      A(J+1,I)   = A(J+1,I)   + B(I,J+1)
      A(J,I+1)   = A(J,I+1)   + B(I+1,J)
      A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
20    CONTINUE
10    CONTINUE
For better performance, cut the original loop into two parts: DO 21 J=1,N/2,2 and DO 20 J=N/2+1,N,2 (contd.).

43 Blocking to Ease Memory Access Patterns (contd.)
Remarks: memory is sequential storage. In Fortran, a two-dimensional array is constructed in memory by logically lining strips of memory up against each other; in C, rows are stacked on top of one another. Suppose memory storage is cut into pieces the size of individual cache entries: array storage starts at the upper left, proceeds down to the bottom, and then starts over at the top of the next column.

44 Blocking to Ease Memory Access Patterns - PCOPP 2002 (contd.)
[Figure 2: how array elements are stored, column by column; marks denote cache line boundaries.]
Because of the index expressions, references to A go from top to bottom, consuming every bit of each cache line, while references to B dash off to the right, using one piece of each cache entry and discarding the rest.

45 Blocking to Ease Memory Access Patterns (contd.)
Re-arrange the loop so that it consumes the arrays in small rectangles rather than stripes. This can be achieved by unrolling both the inner and outer loops: array A is then referenced in several strips side by side, from top to bottom, while B is referenced in several strips side by side, from left to right. This improves cache performance and lowers the runtime.

46 Blocking to Ease Memory Access Patterns - PCOPP 2002 (contd.)
[Figure 3: arrays A and B referenced in 2x2 squares.]

47 Blocking to Ease Memory Access Patterns (contd.)
Example 1, with the blocked loop cut into two halves:
      DO 11 I=1,N,2
      DO 21 J=1,N/2,2
      A(J,I)=A(J,I)+B(I,J)
      A(J+1,I)=A(J+1,I)+B(I,J+1)
      A(J,I+1)=A(J,I+1)+B(I+1,J)
      A(J+1,I+1)=A(J+1,I+1)+B(I+1,J+1)
21    CONTINUE
11    CONTINUE

      DO 10 I=1,N,2
      DO 20 J=N/2+1,N,2
      A(J,I)=A(J,I)+B(I,J)
      A(J+1,I)=A(J+1,I)+B(I,J+1)
      A(J,I+1)=A(J,I+1)+B(I+1,J)
      A(J+1,I+1)=A(J+1,I+1)+B(I+1,J+1)
20    CONTINUE
10    CONTINUE
Case study - PCOPP 2002: there is a significant increase in performance for N = 256, because the two arrays A and B are each 64K elements × 8 bytes = 1/2 MB when N equals 256, about 1 MB in total, larger than can be handled by the TLBs and caches of most workstations.

48 Blocking to Ease Memory Access Patterns - PCOPP 2002 (contd.)
Consider the first few references to A and B superimposed upon one another, in the blocked and unblocked cases: unblocked references to B zing off through memory, eating through cache and TLB entries, while blocked references are more sparing with the memory system. Further blocking is possible for larger problems.

49 Blocking to Ease Memory Access Patterns (contd.)
[Figure: strided memory references in the unblocked versus blocked cases, with arrays A and B superimposed.]
Note: blocking memory references is crucial for performance.

50 Blocking to Ease Memory Access Patterns (contd.)
Dividing and conquering a large memory address space by cutting it into little pieces can make the optimization process effective. Variations of the vector sum:
- Original unmodified code
- Inner loop unrolling of 4 (sketched in C below)
- Inner loop unrolling of 2, outer loop unrolling of 2
- Inner loop unrolling of 2, outer loop unrolling of 2, 2 inner loops
- Inner loop unrolling of 2, outer loop unrolling of 2, 4 inner loops
The LINPACK benchmark is an example of a program with a small amount of memory that is visited repeatedly.
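A C sketch of one of these variations, inner loop unrolling of 4, assuming n is a multiple of 4 (C99, mirroring the Fortran A(J,I)=A(J,I)+B(I,J) vector sum):

    /* Two-dimensional vector sum with the inner loop unrolled by 4. */
    void vsum_unroll4(int n, double a[n][n], double b[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j += 4) {
                a[j][i]     += b[i][j];
                a[j + 1][i] += b[i][j + 1];
                a[j + 2][i] += b[i][j + 2];
                a[j + 3][i] += b[i][j + 3];
            }
    }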

51 Operating System Optimization and I/O - PCOPP 2002
UMA (Uniform Memory Architecture): all memory appears identical to all processors in terms of size, speed, access rights, access methods, etc.
NUMA (Non-Uniform Memory Architecture): memory looks different to different processors, usually in terms of access speed.
Since I/O is so much slower than the CPU, some people in the High Performance Computing community have defined a supercomputer as "a computer which turns a CPU-bound program into an I/O-bound program". I/O is usually the last thing considered when optimizing a program, but if a lot of I/O is done, it can be the performance bottleneck.

52 Performance of a Selected Application: CFD
Optimization of the unsteady-state 3D compressible Navier-Stokes equations, solved by a finite difference method. Computing system used: Sun UltraSPARC workstation (each node is a quad-CPU Ultra Enterprise 450 server operating at 300 MHz).
Three runs on the same 192*16* grid with the same iteration count were timed: with no compiler options, with compiler optimization, and with code restructuring plus compiler optimization, the last finishing in 680 seconds.
Conclusions: restructuring the code and using proper compiler optimizations reduces the execution time severalfold.

53 Summary of Optimization Techniques - PCOPP 2002
- Understand the hierarchical memory features of the processor.
- Minimization of memory traffic is the single most important goal.
- Use the compiler's optimizations to get performance.
- Use performance visualization tools to learn more about the performance bottlenecks.
- Use BLAS, LINPACK, LAPACK, ScaLAPACK, and other tuned math libraries on the system. Calls to these math libraries can often simplify coding; they are portable across different platforms, and they are usually fine-tuned to the specific hardware as well as to the sizes of the array variables that are sent to them. Example: Sun Workshop.

54 Conclusions
- Understand the hierarchical memory features of the memory sub-system.
- Reducing memory overheads is important for the performance of sequential and parallel programs.
- Minimization of memory traffic is the single most important goal.
- For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride or step size.
- Exploiting data reuse in the memory sub-system will increase performance.
- Estimating the memory access time of various loops gives a good clue about performance.
- Techniques for hiding memory latency are useful for performance.
- A rich set of examples is provided in the hands-on session.

55 PCOPP


More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

Lecture 2: Single processor architecture and memory

Lecture 2: Single processor architecture and memory Lecture 2: Single processor architecture and memory David Bindel 30 Aug 2011 Teaser What will this plot look like? for n = 100:10:1000 tic; A = []; for i = 1:n A(i,i) = 1; end times(n) = toc; end ns =

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Virtual Memory: From Address Translation to Demand Paging

Virtual Memory: From Address Translation to Demand Paging Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 12, 2014

More information

Welcome to Part 3: Memory Systems and I/O

Welcome to Part 3: Memory Systems and I/O Welcome to Part 3: Memory Systems and I/O We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information